Separate Model and Environment with GI Interface #220

Open
lukasgppl opened this issue Dec 15, 2024 · 11 comments

Comments

@lukasgppl

Hi all,
I'm playing around with this package, exploring possible options to train an AZ model for my bachelor's thesis. My use-case is far beyond games: I've built an environment that simulates a production facility with multiple machines and orders that need to be scheduled in a way that maximizes a certain reward. This scheduling problem is far too large to model exhaustively. State and action spaces are practically continuous and the scheduling process is stochastic. Nonetheless, research approaches this by building a deterministic sub-problem based on the environment's current state. This sub-problem is the one that needs to be solved by AlphaZero. After fully solving it, an action can be derived for the simulation environment, and from the resulting state a new sub-problem can be built, and so on.
The current implementation uses the environment directly as a model. This means that, at the moment, I can only solve a single sub-problem during training, by defining the GI interface around a sub-problem built from the initial state of the simulation environment. Is there a way to separate the model from the environment for my use-case?
Thanks in advance.

@jonathan-laurent
Owner

This looks like an interesting use-case. I am looking forward to hearing more about your thesis!

I am not sure I understand your main question. By "model", do you mean the neural network being trained? The GI interface has nothing to do with the model in this sense: it just defines state and action spaces along with a transition function. In your case, it looks like you just want to define different environments for your top-level game and your lower-level game (the lower-level game being used to compute a policy for the top-level game if I understand correctly).

More generally, am I correct that your plan is to first train one AlphaZero agent to implement a policy for the low-level game, and then to use this policy to implement a top-level game that you are then going to solve using AlphaZero or some other technique? If so, this sounds really interesting, but using an AlphaZero policy to simulate your top-level environment will be pretty expensive, so you have to plan accordingly (e.g. using a smaller network to trade longer learning on the low-level game against faster environment simulation on the top-level game).

@lukasgppl
Author

Thanks for your fast reply!
When I was referring to the "model" earlier, I meant the world/dynamics model of the environment.
In AlphaZero, this model is specified manually, while in MuZero it is learned. In chess, the model would include the game rules, such as determining which state an action leads to and whether a terminal state in tree search is a win/loss/draw. In the framework, this abstraction doesn't need to be made since the game environment is also the model. We can just create copies of the environment, use them as a dynamics model in tree search to simulate outcomes, and at the end perform only the action chosen by MCTS in the main environment object.
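As a small illustration of that last point (not AlphaZero.jl internals; `env`, `candidate_action` and `chosen_action` are placeholders for any GI-compatible environment and actions):

```julia
import AlphaZero.GI

# Illustration only: tree search acts on clones of the environment, and only
# the action finally selected by MCTS is applied to the real environment.
function illustrate_clone_based_search(env, candidate_action, chosen_action)
    sim = GI.clone(env)              # simulated rollouts mutate the clone only
    GI.play!(sim, candidate_action)  # explore an outcome without touching `env`
    GI.play!(env, chosen_action)     # commit the selected action for real
    return env
end
```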
However, consider a real-world manufacturing company where you can assign different orders to be processed on a subset of compatible machines. Multiple dynamic interruptions can happen in this process - machines can break, the priority of orders can change, etc. You can't really build a deterministic model due to the inherent uncertainty and stochasticity of state transitions. However, some of the sparse research papers aiming to tackle this problem (called the Dynamic Flexible Job Shop Scheduling Problem, if you're interested) simply view it as a non-dynamic problem at every timestep. While we don't know how states transition when a machine breakdown happens, we do know how they transition when no breakdown happens. Basically, we naively view the problem as if no dynamic events were going to happen and solve this low-level problem. If no dynamic events occur in the real environment, that solution will perform great. If a dynamic event does occur, we just solve a new low-level problem in order to react to the changed circumstances. If you're interested, here's literature on how this was done with pure MCTS: https://doi.org/10.1016/j.cie.2021.107211.
So at its core, my "model" is a mechanism that takes the state of the high-level problem, generates a low-level problem from it, and provides state transitions, rewards, action masks, etc.
My plan is to learn a policy for these low-level problems with AZ. Optimizing the high-level problem will then just involve creating low-level problems and using the AZ policy to solve them. From a solved low-level problem we can then derive an action; there is no need to train multiple AZ agents. The action space of all low-level problems will be exactly the same. The only thing that differs will be the action masks, which depend on the state of the high-level problem.
My issue right now is that I can't fully grasp how the AZ framework interacts with the environment during tree search, especially while training. I at least think I understand that GI.play! universally performs an action in an environment, without differentiating whether it is a "real" or a simulated one. While I could theoretically use the dynamics model as the environment and simply perform an action in the high-level environment at every terminal state of the model, I don't think this would pair well with the copies made during simulated experiences. I'd need some way to distinguish between real and simulated actions, or do some hacky stuff where the nested high-level problem object is also deep-copied in simulated experiences during tree search.
Also, I can't fully grasp the mechanics of the framework during training, e.g. how the environment gets reset once it is fully solved. Since fully solving my low-level problem is just one step in the high-level problem, instead of resetting my dynamics model I'd need to generate a new one based on the new state of the high-level problem. This would also require a mechanism to set the model to an initial state again after one high-level problem is fully solved.

I don't know if this is the right place for such extensive discourse. I could contact you via email if you prefer that. Otherwise, we'll keep it here.

Thanks for your support :)

@jonathan-laurent
Owner

It seems to me that what you want here are two different environments for the top-level and low-level games. You can use GI.init to create a new environment and GI.set_state! to set its internal state. When playing the top-level game, you have to create a fresh instance of the low-level environment every time you want to compute an action.

When training an AZ agent for the low-level game, the training is not aware of the existence of a top-level game anyway.

@lukasgppl
Author

lukasgppl commented Dec 16, 2024

Exactly, I effectively have two environments. If GI.init is called, both environments will be initialized. If I wanted to use GI.set_state! to set the state of the high-level environment after initialization, I'd need to preserve its previous state somehow. I thought about storing this previous state after performing an action through GI.play!, before GI.init is called, but GI.play! can't modify the spec instance.

@jonathan-laurent
Owner

Why are you initialising both the low-level and top-level envs at the start? You should only initialise the top-level env at the start. Then, every time you want to compute a top-level action, you initialise a fresh low-level env and set its state (computed based on GI.current_state(toplevel_env)). You can do all your planning on this low-level env, which is never going to affect the top-level env. What am I missing here?
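A minimal sketch of what this could look like. Only the GI calls are actual AlphaZero.jl API; `sub_problem_state`, `derive_toplevel_action` and the trained low-level `policy` callable are hypothetical names for this discussion:

```julia
import AlphaZero.GI

# Compute one top-level action by solving a fresh low-level problem.
function compute_toplevel_action(low_spec, policy, toplevel_env)
    # Build the low-level initial state from the current top-level state.
    s0 = sub_problem_state(GI.current_state(toplevel_env))
    # Fresh low-level environment: all planning happens here and never
    # affects `toplevel_env`.
    low_env = GI.init(low_spec)
    GI.set_state!(low_env, s0)
    # Roll the sub-problem to termination with the trained low-level policy.
    while !GI.game_terminated(low_env)
        GI.play!(low_env, policy(low_env))
    end
    # Derive the action to apply in the top-level environment.
    return derive_toplevel_action(GI.current_state(low_env))
end
```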

@lukasgppl
Author

If I understand correctly, you propose to initialize the top-level env at the beginning of each training episode and then to initialize a fresh low-level environment for every top-level action. A new training episode is started when GI.game_terminated returns true. With the proposed approach, I'd have GI.game_terminated check whether the high-level env is terminated. However, MCTS also relies on GI.game_terminated, so this would cause MCTS to simulate actions in the low-level environment endlessly. Therefore, a new training episode needs to be started each time a low-level environment terminates.

I could initialize the low-level env only at the start of every training episode according to the current high-level env (maybe that's also what you meant) and, in GI.play!, perform an action in the low-level environment. Afterwards, I could check whether the low-level env is done and, if so, derive the high-level actions from it and perform them. If the high-level environment is terminated after this, I could simply reset it using an auxiliary function (initializing a new one here is not possible, as the GameSpec is not provided in GI.play!).

However, when GI.init(::GameSpec) is called, I'd need to know the current state of the high-level environment in order to correctly initialize it for the new training episode. I only have the GameSpec though, so I'm unsure how I'd be able to "initialise a fresh low-level env and set its state (computed based on GI.current_state(toplevel_env))". Could you elaborate on how this would work?

@jonathan-laurent
Owner

jonathan-laurent commented Dec 16, 2024

I don't understand what you are saying here. There should not be any complex interactions between the low-level and high-level games. The low-level game is unaware of the high-level game and a separate policy is trained for it. Then, a high-level game env can be defined that takes such a policy as an argument. You can then again solve this new game, but the learning algorithm is never aware that the transition function relies on a lower-level environment along with a previously trained policy for this environment.
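To make this concrete, here is a hedged sketch of such a high-level environment. Only the GI function names are real AlphaZero.jl API; `TopSpec`, `TopEnv`, `initial_highlevel_state`, `highlevel_done`, `make_subproblem` and `apply_solution` are made-up names, and only the transition-related methods are shown:

```julia
import AlphaZero.GI

# Hypothetical top-level game whose transition function internally rolls out
# a previously trained low-level policy.
struct TopSpec <: GI.AbstractGameSpec
    low_spec      # spec of the low-level game
    low_policy    # previously trained policy for the low-level game
end

mutable struct TopEnv <: GI.AbstractGameEnv
    spec::TopSpec
    state         # high-level (shop-floor) state
end

GI.init(spec::TopSpec) = TopEnv(spec, initial_highlevel_state())
GI.spec(env::TopEnv) = env.spec
GI.current_state(env::TopEnv) = env.state
GI.set_state!(env::TopEnv, s) = (env.state = s)
GI.game_terminated(env::TopEnv) = highlevel_done(env.state)

# From the learning algorithm's point of view, this is just an ordinary
# (possibly stochastic) transition function.
function GI.play!(env::TopEnv, action)
    low = GI.init(env.spec.low_spec)
    GI.set_state!(low, make_subproblem(env.state, action))
    while !GI.game_terminated(low)
        GI.play!(low, env.spec.low_policy(low))
    end
    env.state = apply_solution(env.state, GI.current_state(low))
end
# (Remaining GI methods such as actions, actions_mask, white_reward, clone
# and vectorize_state are omitted for brevity.)
```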

@lukasgppl
Author

lukasgppl commented Dec 17, 2024

Essentially, I just need the top-level environment to act as a controlled, dynamic spec that is changed by the result of the low-level environment after each learning episode. Basically, you could view the high-level environment as a controlled spec generator that produces a problem configuration with the same action space for every learning episode. The low-level environment needs to be initialized with the current state of the high-level environment at each episode. Both environments don't really interact with each other; it's just a way to set up a new instance of the problem while learning, so that you don't have to start learning over and over for each sub-problem, as it can be generated automatically. When initializing the low-level environment, the progress of the high-level one therefore must not be lost. This is hard to do with the GI interface, as GI.init depends solely on the spec. The only possibility I see is to keep a state object of the high-level environment in the spec, which would also be referenced in the environment struct, in order to a) preserve the state of the high-level environment in the spec and b) use the outcome of the low-level environment to advance its state at the end of each episode. This is hacky in my opinion and could clash with initialization for simulated search experiences if no deep copies are made there.

Basically the algorithm for training would need to be:

  1. Initialize a low-level environment instance at the start of each training episode, based on the state of the high-level environment
  2. Perform tree search until the low-level environment is fully solved.
  3. Provide the terminal state of the low-level environment to the high-level one. If the high-level environment is terminal afterwards, reset it and start a new learning episode
  4. Initialize a new low-level instance based on the updated state of the high-level one

So the main challenge is how to automatically create low-level instances in each episode without interfering with the simulation logic of the framework.

What I'm trying to say is that the world model (often also called the dynamics model) isn't always the same as the environment an agent acts in, but can be a separate instance. Whether this architecture makes sense depends on the use-case. For board games or puzzles, it definitely doesn't. However, in real-world applications you sometimes don't have access to the full underlying MDP. You might not be able to model the full problem with its dynamics and uncertainty, but in many cases you can model discrete sub-problems and solve the issue step by step. The low-level environment is essentially such a sub-problem and the high-level one is the full problem. This is what I'm trying to do and what has proven to work in the literature.

I hope this is a bit more insightful. I can understand if you think this is not a use-case for this framework or not its goal. However, I can see this working with minimal adjustments, opening the gates to other model-based RL applications like robotics, autonomous vehicles, scheduling and much more, from an engineer's point of view.

@jonathan-laurent
Owner

jonathan-laurent commented Dec 17, 2024

The way I understand it, what you are trying to do is just apply AlphaZero to the low-level problem, with one caveat: instead of starting each episode from the same initial state, you determine initial states via an outer loop that simulates the higher-level game. You would have to write such an outer loop yourself, but this should not be much work since you can reuse all of AlphaZero.jl's abstractions.
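A rough sketch of such an outer loop (only the GI calls are AlphaZero.jl API; `sub_problem_state`, `run_lowlevel_episodes` and `derive_toplevel_action` are hypothetical helpers, and how the low-level episodes are run depends on how much of the training pipeline is reused):

```julia
import AlphaZero.GI

# Outer loop: the high-level simulation only supplies the initial states of
# successive low-level training episodes.
function outer_loop!(toplevel_env, low_spec)
    while !GI.game_terminated(toplevel_env)
        # Build the current sub-problem from the high-level state.
        s0 = sub_problem_state(GI.current_state(toplevel_env))
        # Run ordinary AlphaZero self-play / training episodes on the
        # low-level game, all starting from `s0` (hypothetical helper).
        solution = run_lowlevel_episodes(low_spec, s0)
        # Advance the high-level simulation with the derived action.
        GI.play!(toplevel_env, derive_toplevel_action(solution))
    end
end
```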

The way this package is designed is via what the FastAI authors call a tiered API. For the simplest cases such as board games, I try to provide a very high-level API so that people only have to provide an environment and some configuration. But for more advanced research applications or custom pipelines, people can just write code and rely on lower-level abstractions. To some extent, this is unavoidable since anticipating every legitimate customisation is impossible and an API with dozens of hooks and flags would be difficult to learn and error-prone.

Now, this is the theory and the current low-level abstractions might still be imperfect and/or unnecessarily limited. In this case, you should feel free to suggest changes and propose pull requests. Also, it is true that giving up on the top-level API right now may require you to put in more work than ideally needed, in particular in terms of rewriting some logging or session-management boilerplate. I am also interested in PRs that would fix that. But in any case, it should still be a small amount of work compared to the challenges of getting AlphaZero to work on an open research problem.

If you are successful with this case study, there could be an interesting debate to be had on whether or not we could enable your workflow in the top-level API, or even provide an alternative top-level API that supports it. But doing so always comes with tradeoffs. On a related note, I suspect in hindsight that my current top-level API tries too hard to provide a single, unified, config-only API that is maximally simple to use. This somehow sends the wrong message that AlphaZero can be treated as a black box, when it is in fact a powerful but very subtle algorithm that typically requires deep understanding and domain-specific optimisations to be leveraged in use cases that go beyond the simplest board games.

Anyway, you seem to have a really exciting thesis topic and I am looking forward to hearing more from you!

EDIT: as an alternative to reusing lower-level abstractions from AlphaZero.jl, you can also probably achieve what you want by just forking the repo and directly making the changes you require. Still, I would then be interested in a discussion about what we can learn from your experience and how AlphaZero.jl could be improved from it.

@lukasgppl
Author

I'll try my best to make this work. I'll look into both options using either the lower-level abstractions or forking. If I make it work, I'll send you an update and we could discuss this further. For now I think we managed to get on the same page and this issue can be closed imo :)

@jonathan-laurent
Owner

Good luck! By the way, if your environment can be programmed on GPU, you could get unbeatable speed by using the full-GPU implementation in the redesign folder. Although it is part of a more global redesign, it is already working and decently documented. Fabrice Rosay got some really impressive results with it recently (see #217).
