SB3 v1.1.0: Dictionary observation support, timeout handling and refactored HER buffer

@araffin released this 02 Jul 10:07
· 335 commits to master since this release
5af35fa

Breaking Changes

  • All custom environments (e.g. the BitFlippingEnv or IdentityEnv) were moved to the stable_baselines3.common.envs folder
  • Refactored HER: it is now the HerReplayBuffer class, which can be passed to any off-policy algorithm
  • Handle timeout termination properly for off-policy algorithms (when using TimeLimit)
  • Renamed _last_dones and dones to _last_episode_starts and episode_starts in RolloutBuffer.
  • Removed ObsDictWrapper as Dict observation spaces are now supported
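  # Migrating HER usage (env is assumed to be a goal-conditioned environment
  # with a Dict observation space, created beforehand)
  from stable_baselines3 import SAC, HerReplayBuffer

  her_kwargs = dict(n_sampled_goal=2, goal_selection_strategy="future", online_sampling=True)
  # SB3 < 1.1.0:
  # model = HER("MlpPolicy", env, model_class=SAC, **her_kwargs)
  # SB3 >= 1.1.0:
  model = SAC("MultiInputPolicy", env, replay_buffer_class=HerReplayBuffer, replay_buffer_kwargs=her_kwargs)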
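
The timeout-handling entry above concerns episodes that end only because TimeLimit cut them off: such a transition should still bootstrap from the next state's value. The following is a minimal pure-Python sketch of that idea, not SB3's actual code. It assumes truncation is reported through the "TimeLimit.truncated" key of info (as gym's TimeLimit wrapper does); the helper names bootstrap_done and td_target are hypothetical.

```python
def bootstrap_done(done: bool, info: dict) -> float:
    """Return 1.0 only for a true terminal state, 0.0 otherwise."""
    # A time-limit truncation is not a real termination, so it is masked out.
    timeout = bool(info.get("TimeLimit.truncated", False))
    return float(done and not timeout)


def td_target(reward: float, next_value: float, done: bool, info: dict, gamma: float = 0.99) -> float:
    """One-step TD target that keeps bootstrapping through timeouts."""
    return reward + gamma * (1.0 - bootstrap_done(done, info)) * next_value


# Episode ended because the task really terminated: no bootstrap term.
true_end = td_target(1.0, 10.0, done=True, info={})
# Episode ended because of the time limit: the next value is still used.
time_limit = td_target(1.0, 10.0, done=True, info={"TimeLimit.truncated": True})
print(true_end, time_limit)
```

Without the mask, both calls would return the same target and the value of states near the time limit would be underestimated.
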
  • Updated the KL Divergence estimator in the PPO algorithm to be positive definite and have lower variance (@09tangriro)
  • Updated the KL Divergence check in the PPO algorithm to be before the gradient update step rather than after the end of the epoch (@09tangriro)
  • Removed parameter channels_last from is_image_space as it can be inferred.
  • The logger object is now an attribute model.logger that can be set by the user using model.set_logger()
  • Changed the signature of logger.configure and utils.configure_logger: they now return a Logger object
  • Removed Logger.CURRENT and Logger.DEFAULT
  • Moved warn(), debug(), log(), info(), dump() methods to the Logger class
  • .learn() now throws an import error when the user tries to log to tensorboard but the package is not installed
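
The two PPO KL divergence entries above can be illustrated with a small pure-Python sketch. It assumes the updated estimator is the mean of (r - 1) - log(r), where r is the probability ratio between the new and old policy, and that early stopping triggers when the estimate exceeds 1.5 × target_kl; both are assumptions here, not quoted from the release. The function names approx_kl, naive_kl and should_stop are hypothetical.

```python
import math


def naive_kl(new_log_probs, old_log_probs):
    """Older-style estimate: mean of -log(r). Can be negative on a finite batch."""
    diffs = [old_lp - new_lp for new_lp, old_lp in zip(new_log_probs, old_log_probs)]
    return sum(diffs) / len(diffs)


def approx_kl(new_log_probs, old_log_probs):
    """Estimate using (r - 1) - log(r), which is >= 0 for every sample."""
    terms = []
    for new_lp, old_lp in zip(new_log_probs, old_log_probs):
        log_ratio = new_lp - old_lp
        terms.append((math.exp(log_ratio) - 1.0) - log_ratio)
    return sum(terms) / len(terms)


def should_stop(kl, target_kl, factor=1.5):
    """Check to run before each gradient step (factor is an assumption)."""
    return target_kl is not None and kl > factor * target_kl


old = [-1.0, -1.0]
new = [-0.5, -1.2]
print(naive_kl(new, old))   # negative for this batch
print(approx_kl(new, old))  # non-negative
```

Running the check before the gradient step means an update that would already exceed the KL budget is skipped instead of being applied and detected afterwards.
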

New Features

  • Added support for single-level Dict observation space (@JadenTravnik)
  • Added DictRolloutBuffer and DictReplayBuffer to support dictionary observations (@JadenTravnik)
  • Added StackedObservations and StackedDictObservations that are used within VecFrameStack
  • Added simple 4x4 room Dict test environments
  • HerReplayBuffer now supports VecNormalize when online_sampling=False
  • Added VecMonitor and VecExtractDictObs wrappers to handle gym3-style vectorized environments (@vwxyzjn)
  • Ignored the terminal observation if it is not provided by the environment, as with gym3-style vectorized environments (@vwxyzjn)
  • Added policy_base as input to the OnPolicyAlgorithm for more flexibility (@09tangriro)
  • Added support for image observation when using HER
  • Added replay_buffer_class and replay_buffer_kwargs arguments to off-policy algorithms
  • Added kl_divergence helper for Distribution classes (@09tangriro)
  • Added support for vector environments with num_envs > 1 (@benblack769)
  • Added wrapper_kwargs argument to make_vec_env (@amy12xx)
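
As an illustration of what a kl_divergence helper for distributions computes, here is a sketch of the closed-form KL divergence between two diagonal Gaussians. It is not the SB3 helper, which operates on SB3 Distribution objects; the function name gaussian_kl and its list-based signature are hypothetical.

```python
import math


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        # Per-dimension closed form:
        # log(sq / sp) + (sp^2 + (mp - mq)^2) / (2 * sq^2) - 1/2
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


same = gaussian_kl([0.0], [1.0], [0.0], [1.0])
shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])
print(same, shifted)
```
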

Bug Fixes

  • Fixed potential issue when calling off-policy algorithms with default arguments multiple times (the size of the replay buffer would be the same)
  • Fixed loading of ent_coef for SAC and TQC: it was no longer being optimized (thanks @Atlis)
  • Fixed saving of A2C and PPO policy when using gSDE (thanks @liusida)
  • Fixed a bug where no output would be shown even if verbose>=1 after passing verbose=0 once
  • Fixed observation buffers dtype in DictReplayBuffer (@c-rizz)
  • Fixed EvalCallback tensorboard logs being logged with the incorrect timestep. They are now written with the timestep at which they were recorded. (@skandermoalla)

Others

  • Added flake8-bugbear to tests dependencies to find likely bugs
  • Updated env_checker to reflect support of dict observation spaces
  • Added Code of Conduct
  • Added tests for GAE and lambda return computation
  • Updated distribution entropy test (thanks @09tangriro)
  • Added sanity check batch_size > 1 in PPO to avoid NaN in advantage normalization
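
The batch_size > 1 check above guards advantage normalization: a sample standard deviation with an n - 1 denominator is undefined for a single sample (PyTorch returns NaN in that case). A minimal pure-Python sketch of the same guard follows; normalize_advantages is a hypothetical name and the eps value is an assumption.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Normalize to zero mean and unit sample standard deviation."""
    if len(advantages) < 2:
        # With one sample the n - 1 denominator is zero, so refuse early.
        raise ValueError("batch_size must be > 1 to normalize advantages")
    mean = statistics.mean(advantages)
    std = statistics.stdev(advantages)  # sample std, n - 1 denominator
    return [(a - mean) / (std + eps) for a in advantages]


print(normalize_advantages([1.0, 2.0, 3.0]))
```
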

Documentation

  • Added gym pybullet drones project (@JacopoPan)
  • Added link to SuperSuit in projects (@justinkterry)
  • Fixed DQN example (thanks @ltbd78)
  • Clarified channel-first/channel-last recommendation
  • Updated Sphinx environment installation instructions (@tom-doerr)
  • Clarified pip installation in Zsh (@tom-doerr)
  • Clarified return computation for on-policy algorithms (TD(lambda) estimate was used)
  • Added example for using ProcgenEnv
  • Added note about advanced custom policy example for off-policy algorithms
  • Fixed DQN unicode checkmarks
  • Updated migration guide (@juancroldan)
  • Pinned docutils==0.16 to avoid issue with rtd theme
  • Clarified callback save_freq definition
  • Added doc on how to pass a custom logger
  • Removed recurrent policies from A2C docs (@bstee615)
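
The return computation mentioned above (a TD(lambda) estimate) and the episode_starts naming from this release can be sketched in pure Python as follows. This is a sketch of GAE with returns defined as advantages plus values, not a copy of SB3's RolloutBuffer; the function name compute_gae and its list-based signature are hypothetical.

```python
def compute_gae(rewards, values, episode_starts, last_value, last_done,
                gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns); returns are the TD(lambda) estimate."""
    n = len(rewards)
    advantages = [0.0] * n
    last_gae = 0.0
    for step in reversed(range(n)):
        if step == n - 1:
            next_non_terminal = 1.0 - float(last_done)
            next_value = last_value
        else:
            # episode_starts[step + 1] marks that `step` ended an episode.
            next_non_terminal = 1.0 - float(episode_starts[step + 1])
            next_value = values[step + 1]
        delta = rewards[step] + gamma * next_value * next_non_terminal - values[step]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[step] = last_gae
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


# With gamma = gae_lambda = 1 and zero values, returns are Monte Carlo sums.
adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                       [True, False, False], last_value=0.0, last_done=False,
                       gamma=1.0, gae_lambda=1.0)
print(ret)
```

Setting gae_lambda=1.0 recovers the Monte Carlo return, while gae_lambda=0.0 reduces each advantage to the one-step TD error.
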