Feat: sebulba rec ippo #1142
base: develop
Conversation
Overall the system looks correct and reasonable. Well done, Simon! I just left a few minor requests :)
- arch: sebulba
- system: ppo/rec_ippo
- network: rnn # [mlp, continuous_mlp, cnn]
- env: lbf_gym # [rware_gym, lbf_gym]
It would be great if you could add `smaclite_gym` to the list.
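With that request applied, the env comment in the defaults list would read as below. This is a sketch only: the surrounding `defaults:` key is assumed (Hydra-style config), and `smaclite_gym` is taken from the reviewer's comment, not verified against the registered env names.

```yaml
defaults:
  - arch: sebulba
  - system: ppo/rec_ippo
  - network: rnn  # [mlp, continuous_mlp, cnn]
  - env: lbf_gym  # [rware_gym, lbf_gym, smaclite_gym]
```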
observation: Observation,
dones,
hstates,
key: chex.PRNGKey,
key: chex.PRNGKey,
dones: chex.Array,
hstates: HiddenStates,
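The suggestion above annotates every argument of the action-selection function. A minimal, self-contained sketch of such a fully typed signature is shown below; note that `Array`, `PRNGKey`, `HiddenStates`, `Observation`, and `get_action` are stdlib stand-ins invented for illustration, not the real chex/Mava definitions.

```python
from typing import NamedTuple, Tuple

# Stand-ins so the sketch runs without JAX/chex installed; in Mava these
# would be chex.Array, chex.PRNGKey, and the real observation/state types.
Array = Tuple[bool, ...]
PRNGKey = Tuple[int, int]


class HiddenStates(NamedTuple):
    """Placeholder for the recurrent policy hidden states."""
    policy_hidden_state: Tuple[float, ...]


class Observation(NamedTuple):
    """Placeholder observation container."""
    agents_view: Tuple[float, ...]


def get_action(
    observation: Observation,
    dones: Array,
    hstates: HiddenStates,
    key: PRNGKey,
) -> Tuple[Tuple[float, ...], HiddenStates]:
    """Dummy body; a real actor would run the recurrent policy here."""
    del observation, dones, key  # unused in this sketch
    return (0.0,), hstates
```

The point of the suggested change is purely that every argument carries an explicit annotation, which lets static checkers catch mismatched call sites.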
@@ -0,0 +1,910 @@
# Copyright 2022 InstaDeep Ltd. All rights reserved.
If you can, update the typings in the `Pipeline` in mava/utils/sebulba.py
to be `Union[PPOTransition, RNNPPOTransition]`.
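For context, the requested widening could look like the sketch below. `PPOTransition`, `RNNPPOTransition`, and the `Pipeline` body here are toy stand-ins written for illustration, not Mava's real dataclasses or queue implementation.

```python
from typing import List, NamedTuple, Tuple, Union


class PPOTransition(NamedTuple):
    """Placeholder for the feed-forward PPO transition."""
    done: bool
    reward: float


class RNNPPOTransition(NamedTuple):
    """Placeholder for the recurrent transition (carries hidden states)."""
    done: bool
    reward: float
    hstate: Tuple[float, ...]


# The review asks the Pipeline's typings to accept either variant.
Transition = Union[PPOTransition, RNNPPOTransition]


class Pipeline:
    """Toy stand-in for the actor-to-learner queue in mava/utils/sebulba.py."""

    def __init__(self) -> None:
        self._items: List[Transition] = []

    def put(self, traj: Transition) -> None:
        # Accepts both feed-forward and recurrent transitions.
        self._items.append(traj)

    def get(self) -> Transition:
        return self._items.pop(0)
```

With the `Union` in place, the same pipeline type-checks for both the feed-forward and the recurrent system without duplicating the class.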
log_prob = actor_policy.log_prob(action)
# It may be faster to calculate the values in the learner as
# then we won't need to pass critic params to actors.
# value = critic_apply_fn(params.critic_params, observation).squeeze()
If you can, remove this comment.
timestep = env.reset(seed=seeds)
dones = np.repeat(timestep.last(), num_agents).reshape(num_envs, -1)

# simon
Same here: once you are done cleaning up, please remove this comment.
I see this comment appears on multiple lines; please remove them all.
)

params, opt_states, traj_batch, advantages, targets, key = update_state
# learner_state = LearnerState(params, opt_states, key, None, learner_state.timestep)
If you can, remove this comment.
Sebulba implementation of recurrent IPPO.