Release SB3 v1.1.0: Dictionary observation support, timeout handling and refactored HER buffer · DLR-RM/stable-baselines3

Breaking Changes

All custom environments (e.g. the BitFlippingEnv or IdentityEnv) were moved to the stable_baselines3.common.envs folder
Refactored HER, which is now the HerReplayBuffer class that can be passed to any off-policy algorithm
Handle timeout termination properly for off-policy algorithms (when using TimeLimit)
Renamed _last_dones and dones to _last_episode_starts and episode_starts in RolloutBuffer.
Removed ObsDictWrapper as Dict observation spaces are now supported

  from stable_baselines3 import SAC, HerReplayBuffer

  # `env` is assumed to be a goal-based environment with a Dict observation space
  her_kwargs = dict(n_sampled_goal=2, goal_selection_strategy="future", online_sampling=True)
  # SB3 < 1.1.0:
  # model = HER("MlpPolicy", env, model_class=SAC, **her_kwargs)
  # SB3 >= 1.1.0:
  model = SAC("MultiInputPolicy", env, replay_buffer_class=HerReplayBuffer, replay_buffer_kwargs=her_kwargs)
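
The timeout handling listed above can be illustrated with a minimal, library-free sketch. It assumes the convention of gym's TimeLimit wrapper, which flags a time-limit cutoff with info["TimeLimit.truncated"]; the helper names stored_done and td_target are hypothetical and are not part of SB3. The idea is that a transition cut off by the time limit is not a true terminal state, so the value target should still bootstrap from the next state.

```python
def stored_done(done, info):
    """Return the done flag to store in the replay buffer.

    Assumption: a time-limit cutoff is signalled by
    info["TimeLimit.truncated"] == True (gym TimeLimit convention).
    """
    truncated = info.get("TimeLimit.truncated", False)
    # Only a genuine termination should stop bootstrapping.
    return done and not truncated


def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target that bootstraps unless the episode truly ended."""
    next_non_terminal = 0.0 if terminated else 1.0
    return reward + gamma * next_value * next_non_terminal


# Episode ended because of the time limit: keep bootstrapping.
timeout_flag = stored_done(True, {"TimeLimit.truncated": True})
timeout_target = td_target(1.0, 10.0, timeout_flag, gamma=0.5)

# Episode ended because of a real terminal state: no bootstrapping.
terminal_flag = stored_done(True, {})
terminal_target = td_target(1.0, 10.0, terminal_flag, gamma=0.5)
```

Without this distinction, a timeout would be treated like a failure state and the critic would underestimate the value of states reached late in an episode.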

Updated the KL Divergence estimator in the PPO algorithm to be positive definite and have lower variance (@09tangriro)
Updated the KL Divergence check in the PPO algorithm to be before the gradient update step rather than after the end of the epoch (@09tangriro)
Removed parameter channels_last from is_image_space as it can be inferred.
The logger object is now an attribute model.logger that can be set by the user using model.set_logger()
Changed the signature of logger.configure and utils.configure_logger; they now return a Logger object
Removed Logger.CURRENT and Logger.DEFAULT
Moved warn(), debug(), log(), info(), dump() methods to the Logger class
.learn() now throws an import error when the user tries to log to tensorboard but the package is not installed
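
The release notes do not spell out the new KL estimator. The sketch below assumes the commonly used form mean((r - 1) - log r), where r is the probability ratio between the new and old policy, and contrasts it with the simpler estimator mean(-log r). Both function names are hypothetical and the code is plain Python rather than the PPO implementation itself.

```python
import math


def naive_kl(old_log_probs, new_log_probs):
    """Simple estimator mean(old - new); individual samples can be negative."""
    diffs = [old - new for old, new in zip(old_log_probs, new_log_probs)]
    return sum(diffs) / len(diffs)


def approx_kl(old_log_probs, new_log_probs):
    """Assumed estimator mean((r - 1) - log r), with log r = new - old.

    Each term exp(x) - 1 - x is non-negative for every real x,
    so the estimate can never fall below zero.
    """
    log_ratios = [new - old for old, new in zip(old_log_probs, new_log_probs)]
    terms = [(math.exp(lr) - 1.0) - lr for lr in log_ratios]
    return sum(terms) / len(terms)


old = [-1.0, -2.0]
new = [-0.5, -1.8]
naive_estimate = naive_kl(old, new)    # negative for this sample
robust_estimate = approx_kl(old, new)  # non-negative by construction
```

A non-negative estimate makes an early-stopping threshold on the KL divergence easier to reason about, since a negative value can no longer mask a large policy change.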

New Features

Added support for single-level Dict observation space (@JadenTravnik)
Added DictRolloutBuffer and DictReplayBuffer to support dictionary observations (@JadenTravnik)
Added StackedObservations and StackedDictObservations that are used within VecFrameStack
Added simple 4x4 room Dict test environments
HerReplayBuffer now supports VecNormalize when online_sampling=False
Added VecMonitor and VecExtractDictObs wrappers to handle gym3-style vectorized environments (@vwxyzjn)
Ignored the terminal observation if it is not provided by the environment, as with gym3-style vectorized environments. (@vwxyzjn)
Added policy_base as input to the OnPolicyAlgorithm for more flexibility (@09tangriro)
Added support for image observation when using HER
Added replay_buffer_class and replay_buffer_kwargs arguments to off-policy algorithms
Added kl_divergence helper for Distribution classes (@09tangriro)
Added support for vector environments with num_envs > 1 (@benblack769)
Added wrapper_kwargs argument to make_vec_env (@amy12xx)

Bug Fixes

Fixed potential issue when calling off-policy algorithms with default arguments multiple times (the size of the replay buffer would be the same)
Fixed loading of ent_coef for SAC and TQC: it was no longer optimized after loading (thanks @Atlis)
Fixed saving of A2C and PPO policy when using gSDE (thanks @liusida)
Fixed a bug where no output would be shown even if verbose>=1 after passing verbose=0 once
Fixed observation buffers dtype in DictReplayBuffer (@c-rizz)
Fixed EvalCallback tensorboard logs being logged with the incorrect timestep. They are now written with the timestep at which they were recorded. (@skandermoalla)

Others

Added flake8-bugbear to tests dependencies to find likely bugs
Updated env_checker to reflect support of dict observation spaces
Added Code of Conduct
Added tests for GAE and lambda return computation
Updated distribution entropy test (thanks @09tangriro)
Added sanity check batch_size > 1 in PPO to avoid NaN in advantage normalization
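
The batch_size > 1 sanity check can be motivated with a small sketch. It assumes advantages are normalized with a Bessel-corrected (n - 1) standard deviation, which is what PyTorch's std() computes by default and which is undefined for a single sample. The function name normalize_advantages is hypothetical, and the NaN for n = 1 is set explicitly here to mirror that behavior in plain Python.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Normalize to zero mean and unit (sample) standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    if n > 1:
        variance = sum((a - mean) ** 2 for a in advantages) / (n - 1)
        std = math.sqrt(variance)
    else:
        # Sample variance divides by n - 1 = 0: mirror the NaN result.
        std = math.nan
    return [(a - mean) / (std + eps) for a in advantages]


single = normalize_advantages([0.5])            # one sample: NaN
batch = normalize_advantages([1.0, 2.0, 3.0])   # well defined
```

A NaN advantage propagates into the loss and then into every parameter after the first gradient step, which is why rejecting the configuration up front is preferable.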

Documentation:

Added gym pybullet drones project (@JacopoPan)
Added link to SuperSuit in projects (@justinkterry)
Fixed DQN example (thanks @ltbd78)
Clarified channel-first/channel-last recommendation
Updated Sphinx environment installation instructions (@tom-doerr)
Clarified pip installation in Zsh (@tom-doerr)
Clarified return computation for on-policy algorithms (TD(lambda) estimate was used)
Added example for using ProcgenEnv
Added note about advanced custom policy example for off-policy algorithms
Fixed DQN unicode checkmarks
Updated migration guide (@juancroldan)
Pinned docutils==0.16 to avoid issue with rtd theme
Clarified callback save_freq definition
Added doc on how to pass a custom logger
Removed recurrent policies from A2C docs (@bstee615)
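
The return computation for on-policy algorithms mentioned above (a TD(lambda) estimate) can be sketched in plain Python as generalized advantage estimation followed by returns = advantages + values. This is an illustrative sketch, not the RolloutBuffer code: the function name gae_returns and its argument layout are assumptions, with episode_starts[t] marking that step t begins a new episode.

```python
def gae_returns(rewards, values, episode_starts, last_value, last_done,
                gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages and TD(lambda) returns for one rollout."""
    n = len(rewards)
    advantages = [0.0] * n
    last_gae = 0.0
    for t in reversed(range(n)):
        if t == n - 1:
            # The step after the rollout is described by last_value / last_done.
            next_non_terminal = 0.0 if last_done else 1.0
            next_value = last_value
        else:
            # If step t + 1 starts a new episode, step t was terminal.
            next_non_terminal = 0.0 if episode_starts[t + 1] else 1.0
            next_value = values[t + 1]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    # TD(lambda) return: advantage plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = gae_returns(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    episode_starts=[True, False, False],
    last_value=0.0,
    last_done=True,
    gamma=1.0,
    gae_lambda=1.0,
)
```

With gamma = 1 and gae_lambda = 1 the returns reduce to the plain Monte Carlo sums of future rewards, which is a convenient sanity check for this kind of code.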
