Separate Model and Environment with GI Interface #220
Comments
This looks like an interesting use case. I am looking forward to hearing more about your thesis!

I am not sure I understand your main question. By "model", do you mean the neural network being trained? The GI interface has nothing to do with the model in this sense: it just defines state and action spaces along with a transition function. In your case, it looks like you just want to define different environments for your top-level game and your lower-level game (the lower-level game being used to compute a policy for the top-level game, if I understand correctly).

More generally, am I correct that your plan is to first train one AlphaZero agent to implement a policy for the low-level game, and then to use this policy to implement a top-level game that you are then going to solve using AlphaZero or some other technique? If so, this sounds really interesting, but using an AlphaZero policy to simulate your top-level environment will be pretty expensive, so you have to plan accordingly (e.g. using a smaller network to trade longer learning on the low-level game against faster environment simulation on the top-level game).
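For readers following along, here is a minimal, hedged skeleton of what a low-level environment could look like under the GI interface. Everything below is illustrative: `initial_state` and `transition` stand in for your own code, the type and function names reflect my reading of the GameInterface and may differ in your version, and the full interface requires more methods (actions, rewards, termination, state vectorization) than shown here.

```julia
import AlphaZero.GI

# Hypothetical skeleton of a low-level (sub-problem) game.
struct LowLevelSpec <: GI.AbstractGameSpec
  # static problem data (machines, orders, ...) would go here
end

mutable struct LowLevelEnv <: GI.AbstractGameEnv
  spec  :: LowLevelSpec
  state :: Any          # whatever structure represents a scheduling state
end

GI.init(spec::LowLevelSpec) = LowLevelEnv(spec, initial_state(spec))   # `initial_state` is yours
GI.spec(env::LowLevelEnv) = env.spec
GI.current_state(env::LowLevelEnv) = deepcopy(env.state)
GI.set_state!(env::LowLevelEnv, s) = (env.state = deepcopy(s); nothing)

function GI.play!(env::LowLevelEnv, action)
  # Deterministic transition of the sub-problem; `transition` is yours.
  env.state = transition(env.state, action)
  return
end
```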
Thanks for your fast reply! I don't know if this is the right place for such extensive discourse. I could contact you via email if you prefer that. Otherwise, we'll keep it here. Thanks for your support :)
It seems to me that what you want here are two different environments for the top-level and low-level games. You can use […]. When training an AZ agent for the low-level game, the training is not aware of the existence of a top-level game anyway.
Exactly, I effectively have two environments. If […]
Why are you initialising both the low-level and top-level envs at the start? You should only initialise the top-level env at the start. Then, every time you want to compute a top-level action, you initialise a fresh low-level env and set its state (computed from the current state of the top-level env).
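A minimal sketch of that pattern, assuming hypothetical `top_spec`/`low_spec` game specs and placeholder helpers `derive_low_level_state` and `solve_subproblem`; the GI calls reflect my reading of the interface and may differ in your version.

```julia
# Sketch: compute each top-level action by spinning up a fresh low-level env.
top = GI.init(top_spec)                  # top_spec: your top-level game spec (placeholder)

while !GI.game_terminated(top)
  low = GI.init(low_spec)                                    # fresh low-level env every time
  GI.set_state!(low, derive_low_level_state(GI.current_state(top)))
  action = solve_subproblem(low)   # placeholder: e.g. run a trained AZ agent / MCTS on it
  GI.play!(top, action)            # advance the top-level env with the chosen action
end
```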
If I understand correctly, you propose to initialize the top-level env at the beginning of each training episode and then initialize a fresh low-level environment for every top-level action. A training episode is started if […]
I don't understand what you are saying here. There should not be any complex interactions between the low-level and high-level games. The low-level game is unaware of the high-level game and a separate policy is trained for it. Then, a high-level game env can be defined that takes such a policy as an argument. You can then again solve this new game, but the learning algorithm is never aware that the transition function relies on a lower-level environment along with a previously trained policy for this environment.
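To make the separation concrete, here is a hedged sketch of such a wrapper, assuming the trained low-level agent has been wrapped into a plain callable, and with `initial_top_state` and `simulate_transition` as placeholders for your own simulation code:

```julia
# Sketch: a top-level game whose transition uses a previously trained low-level policy.
struct TopLevelSpec{P} <: GI.AbstractGameSpec
  low_policy :: P                  # trained low-level policy, wrapped as a callable
end

mutable struct TopLevelEnv{P} <: GI.AbstractGameEnv
  spec  :: TopLevelSpec{P}
  state :: Any
end

GI.init(spec::TopLevelSpec) = TopLevelEnv(spec, initial_top_state())   # placeholder

function GI.play!(env::TopLevelEnv, action)
  # The transition may internally roll out the low-level policy to simulate
  # the facility; the learning algorithm never needs to know this.
  env.state = simulate_transition(env.state, action, env.spec.low_policy)
  return
end
```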
Essentially, I just need the top-level environment to act as a controlled, dynamic spec that is changed by the result of the low-level environment at each learning episode. Basically, you could view the high-level environment as a controlled spec generator that generates a problem configuration with the same action space in every learning episode. The low-level environment needs to be initialized with the current state of the high-level environment at each episode. The two environments don't really interact with each other; it's just a way to set up a new instance of the problem while learning, so that you don't have to start learning over and over with each sub-problem, as it can be generated automatically. When initializing the low-level one, the progress of the high-level environment therefore must not be lost.

It's hard to do this with the GI interface, as `GI.init` depends solely on the spec. The only possibility I see is to have a state object of the high-level environment in the spec, which is also referenced in the environment struct (roughly sketched below), in order to a) preserve the state of the high-level environment in the spec and b) use the outcome of the low-level environment to progress its state at the end of each episode. This is hacky in my opinion and could clash with the initialization of simulated search experiences if no deep copies are made there. Basically, the algorithm for training would need to generate a new sub-problem from the current high-level state at the start of each episode, solve it, and use its outcome to progress the high-level state at the end of the episode.
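A rough illustration of that workaround, with hypothetical names (`HighLevelState`, `SubProblemEnv` and `derive_low_level_state` are placeholders, not AlphaZero.jl API):

```julia
# Illustrative only: the spec carries a mutable reference to the high-level state,
# so every call to GI.init picks up the *current* problem configuration.
mutable struct HighLevelState
  # current configuration of the full problem
end

struct SubProblemSpec <: GI.AbstractGameSpec
  high_state :: Base.RefValue{HighLevelState}   # shared and mutable: the hacky part
end

function GI.init(spec::SubProblemSpec)
  # A fresh low-level env built from whatever the high-level state currently is.
  SubProblemEnv(spec, derive_low_level_state(spec.high_state[]))
end
```

The concern raised above then applies directly: if environments are cloned or re-initialized during search simulations without deep-copying this shared reference, several simulations may read or mutate the same high-level state.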
So the main challenge is how to dynamically create low-level instances in each episode automatically, while not interfering with the simulation logic of the framework. What I'm trying to say is that the world model (often also called the dynamics model) isn't always the same as the environment an agent acts in, but can be a separate instance. Whether this architecture makes sense depends on the use case. If you think about board games or puzzles, it definitely doesn't. However, in real-world applications you sometimes don't have access to the full underlying MDP. You might not be able to model the full problem with its dynamics and uncertainty, but in many cases you can model discrete sub-problems and solve the issue step by step. The low-level environment is essentially this sub-problem and the high-level one is the full problem. This is what I'm trying to do, and it has proven to work in the literature. I hope this is a bit more insightful. I can understand if you think this is not a use case for this framework, or not its goal. However, I can see this working with minimal adjustments, opening the gates to other model-based RL applications like robotics, autonomous vehicles, scheduling and much more, coming from an engineer's point of view.
The way I understand it, what you are trying to do is just apply AlphaZero to the low-level problem, with one caveat: instead of starting each episode in the same initial state, you determine initial states via an outer loop that simulates the higher-level game. Such an outer loop you would have to write yourself, but this should not be much work since you can reuse all of AlphaZero.jl's lower-level components.

The way this package is designed is via what the FastAI authors call a tiered API. For the simplest cases such as board games, I try to provide a very high-level API so that people only have to provide an environment and some configuration. But for more advanced research applications or custom pipelines, people can just write code and rely on lower-level abstractions. To some extent, this is unavoidable, since anticipating every legitimate customisation is impossible and an API with dozens of hooks and flags would be difficult to learn and error-prone.

Now, this is the theory, and the current low-level abstractions might still be imperfect and/or unnecessarily limited. In this case, you should feel free to suggest changes and propose pull requests. Also, it is true that giving up on the top-level API right now may require you to put in more work than ideally needed, in particular in terms of rewriting some logging or session-management boilerplate. I am also interested in PRs that would fix that. But in any case, it should still be a small amount of work compared to the challenges of getting AlphaZero to work on an open research problem.

If you are successful with this case study, there could be an interesting debate to be had on whether or not we could enable your workflow in the top-level API, or even provide an alternative top-level API that supports it. But doing so always comes with tradeoffs. On a related note, I suspect in hindsight that my current top-level API tries too hard to provide a single, unified, config-only API that is maximally simple to use. This somehow sends the wrong message that AlphaZero can be treated as a black box, when it is in fact a powerful but very subtle algorithm that typically requires deep understanding and domain-specific optimisations to be leveraged in use cases that go beyond the simplest board games.

Anyway, you seem to have a really exciting thesis topic and I am looking forward to hearing more from you!

EDIT: as an alternative to reusing lower-level abstractions from AlphaZero.jl, you can also probably achieve what you want by just forking the repo and directly making the changes you require. Still, I would then be interested in a discussion about what we can learn from your experience and how AlphaZero.jl could be improved from it.
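For concreteness, a hedged sketch of such an outer loop; every non-GI name below is a placeholder for whatever AlphaZero.jl components (or your own code) end up being reused, not actual API:

```julia
# Outer loop: the high-level game only determines the initial states of
# low-level training episodes.
top   = GI.init(top_spec)               # your high-level simulation (placeholder spec)
agent = make_low_level_agent()          # placeholder: network, MCTS params, memory, ...

while !GI.game_terminated(top)
  s0 = derive_low_level_state(GI.current_state(top))   # placeholder mapping
  run_training_episodes!(agent, low_spec, s0)          # placeholder: self-play / learning from s0
  action = best_action(agent, s0)                      # placeholder: act with the trained agent
  GI.play!(top, action)                                # advance the high-level simulation
end
```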
I'll try my best to make this work. I'll look into both options, using either the lower-level abstractions or forking. If I make it work, I'll send you an update and we could discuss this further. For now, I think we managed to get on the same page and this issue can be closed imo :)
Good luck! By the way, if your environment can be programmed on GPU, you could get unbeatable speed by using the full-GPU implementation in the […]
Hi all,
I'm playing around with this package, exploring possible options to train an AZ model for my bachelor's thesis. My use case is far beyond games: I've built an environment that simulates a production facility with multiple machines and orders that need to be scheduled so as to maximize a certain reward. This scheduling problem is far too large to model exhaustively. State and action spaces are practically continuous and the scheduling process is stochastic. Nonetheless, research approaches this by building a deterministic sub-problem based on the current environment's state. This sub-problem is the one that needs to be solved by AlphaZero. After fully solving this problem, an action can be derived for the simulation environment and, from the resulting state, a new sub-problem can be built, and so on.
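If it helps to pin the idea down, this cycle could look roughly as follows, with `SimulationEnv`, `build_subproblem`, `solve_with_az`, `snapshot`, `finished` and `step!` all being hypothetical placeholders:

```julia
# Hypothetical outline of the cycle described above; all names are placeholders.
sim = SimulationEnv()                       # stochastic production-facility simulation

while !finished(sim)
  spec   = build_subproblem(snapshot(sim))  # deterministic sub-problem from the current state
  policy = solve_with_az(spec)              # solve the sub-problem with AlphaZero
  action = policy(snapshot(sim))            # derive an action for the simulation
  step!(sim, action)                        # stochastic transition; then repeat
end
```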
The current implementation uses the Environment directly as a model. This means that, at the moment, I am only able to solve one sub-problem while training, by defining the GI interface for the Model with a sub-problem built from the initial state of the simulation environment. Is there a possibility to split the Model and the Environment for my use case?
Thanks in advance.