Releases: DLR-RM/stable-baselines3
HER with online and offline sampling, bug fixes for features extraction
Breaking Changes
- Warning: Renamed `common.cmd_util` to `common.env_util` for clarity (affects the `make_vec_env` and `make_atari_env` functions)
New Features
- Allow custom actor/critic network architectures using `net_arch=dict(qf=[400, 300], pi=[64, 64])` for off-policy algorithms (SAC, TD3, DDPG); see the sketch after this list
- Added Hindsight Experience Replay `HER`. (@megan-klaiber)
- `VecNormalize` now supports `gym.spaces.Dict` observation spaces
- Support logging videos to Tensorboard (@SwamyDev)
- Added `share_features_extractor` argument to `SAC` and `TD3` policies
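
A minimal sketch of the new off-policy `net_arch` format and the `share_features_extractor` argument, assuming SAC on a placeholder continuous-control env:

```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v0",  # placeholder env ID
    policy_kwargs=dict(
        # Separate layer sizes for the critic ("qf") and the actor ("pi")
        net_arch=dict(qf=[400, 300], pi=[64, 64]),
        # Give the actor and critic their own features extractors
        share_features_extractor=False,
    ),
)
```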
Bug Fixes
- Fix GAE computation for on-policy algorithms (off-by-one for the last value) (thanks @Wovchena)
- Fixed potential issue when loading a different environment
- Fix ignoring the exclude parameter when recording logs using json, csv or log as logging format (@SwamyDev)
- Make `make_vec_env` support the `env_kwargs` argument when using an env ID str (@ManifoldFR); see the sketch after this list
- Fix model creation initializing CUDA even when `device="cpu"` is provided
- Fix `check_env` not checking if the env has a Dict action space before calling `_check_nan` (@wmmc88)
- Update the check for spaces unsupported by Stable Baselines 3 to include checks on the action space (@wmmc88)
- Fixed feature extractor bug for target network where the same net was shared instead of being separate. This bug affects `SAC`, `DDPG` and `TD3` when using `CnnPolicy` (or a custom feature extractor)
- Fixed a bug when passing an environment while loading a saved model with a `CnnPolicy`: the passed env was not wrapped properly (the bug was introduced when implementing `HER`, so it should not be present in previous versions)
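
A minimal sketch of `make_vec_env` with `env_kwargs`; the env ID and keyword argument are hypothetical placeholders for whatever your env's constructor accepts:

```python
from stable_baselines3.common.env_util import make_vec_env

# `env_kwargs` is now forwarded to the underlying env constructor when an
# env ID string is passed ("MyEnv-v0" and `max_steps` are placeholders).
vec_env = make_vec_env("MyEnv-v0", n_envs=4, env_kwargs=dict(max_steps=500))
```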
Others
- Improved typing coverage
- Improved error messages for unsupported spaces
- Added `.vscode` to the gitignore
Documentation
Bug fixes, get/set parameters and improved docs
Breaking Changes:
- Removed `device` keyword argument of policies; use `policy.to(device)` instead. (@qxcv)
- Renamed `BaseClass.get_torch_variables` -> `BaseClass._get_torch_save_params` and `BaseClass.excluded_save_params` -> `BaseClass._excluded_save_params`
- Renamed saved items `tensors` to `pytorch_variables` for clarity
- `make_atari_env`, `make_vec_env` and `set_random_seed` must be imported with (and not directly from `stable_baselines3.common`):

```python
from stable_baselines3.common.cmd_util import make_atari_env, make_vec_env
from stable_baselines3.common.utils import set_random_seed
```
New Features:
- Added `unwrap_vec_wrapper()` to `common.vec_env` to extract `VecEnvWrapper` if needed
- Added `StopTrainingOnMaxEpisodes` to callback collection (@xicocaio)
- Added `device` keyword argument to `BaseAlgorithm.load()` (@liorcohen5)
- Callbacks have access to rollout collection locals as in SB2. (@partiallytyped)
- Added `get_parameters` and `set_parameters` for accessing/setting parameters of the agent; see the sketch after this list
- Added actor/critic loss logging for TD3. (@mloo3)
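
A minimal sketch of the new parameter accessors; `get_parameters` returns a dict of state dicts and `set_parameters` loads one back (the env ID is a placeholder):

```python
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1")

# Read out all trainable parameters as a dict of PyTorch state dicts ...
params = model.get_parameters()
# ... and load them back into the agent.
model.set_parameters(params)
```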
Bug Fixes:
- Fixed a bug where the environment was reset twice when using `evaluate_policy`
- Fix logging of `clip_fraction` in PPO (@diditforlulz273)
- Fixed a bug where cuda support was wrongly checked when passing the GPU index, e.g., `device="cuda:0"` (@liorcohen5)
- Fixed a bug where the random seed was not properly set on cuda when passing the GPU index
Others:
- Improve typing coverage of the `VecEnv`
- Fix type annotation of `make_vec_env` (@ManifoldFR)
- Removed `AlreadySteppingError` and `NotSteppingError` that were not used
- Fixed typos in SAC and TD3
- Reorganized functions for clarity in `BaseClass` (save/load functions close to each other, private functions at top)
- Clarified docstrings on what is saved and loaded to/from files
- Simplified `save_to_zip_file` function by removing duplicate code
- Store library version along with the saved models
- DQN loss is now logged
Documentation:
- Added `StopTrainingOnMaxEpisodes` details and example (@xicocaio)
- Updated custom policy section (added custom feature extractor example)
- Re-enabled `sphinx_autodoc_typehints`
- Updated doc style for type hints and removed duplicated type hints
Added DQN and DDPG, bug fixes and performance matching for Atari games
Breaking Changes:
- `AtariWrapper` and other Atari wrappers were updated to match SB2 ones
- `save_replay_buffer` now receives as argument the file path instead of the folder path (@tirafesi)
- Refactored `Critic` class for `TD3` and `SAC`; it is now called `ContinuousCritic` and has an additional parameter `n_critics`
- `SAC` and `TD3` now accept an arbitrary number of critics (e.g. `policy_kwargs=dict(n_critics=3)`) instead of only 2 previously; see the sketch after this list
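
A minimal sketch of training with a critic ensemble via the new `n_critics` parameter (the env ID is a placeholder):

```python
from stable_baselines3 import SAC

# Use 3 critics instead of the default 2.
model = SAC("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(n_critics=3))
```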
New Features:
- Added `DQN` algorithm (@Artemis-Skade)
- Buffer dtype is now set according to action and observation spaces for `ReplayBuffer`
- Added warning when allocation of a buffer may exceed the available memory of the system when `psutil` is available
- Saving models now automatically creates the necessary folders and raises appropriate warnings (@partiallytyped)
- Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@partiallytyped)
- Added `DDPG` algorithm as a special case of `TD3`; see the sketch after this list
- Introduced `BaseModel` abstract parent for `BasePolicy`, which critics inherit from.
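
A minimal usage sketch for the two new algorithms, with placeholder env IDs:

```python
from stable_baselines3 import DDPG, DQN

# DQN for discrete action spaces, DDPG for continuous ones.
dqn_model = DQN("MlpPolicy", "CartPole-v1").learn(total_timesteps=10_000)
ddpg_model = DDPG("MlpPolicy", "Pendulum-v0").learn(total_timesteps=10_000)
```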
Bug Fixes:
- Fixed a bug in the `close()` method of `SubprocVecEnv`, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
- Fix target for updating q values in SAC: the entropy term was not conditioned by terminal states
- Use `cloudpickle.load` instead of `pickle.load` in `CloudpickleWrapper`. (@shwang)
- Fixed a bug with orthogonal initialization when `bias=False` in custom policy (@rk37)
- Fixed approximate entropy calculation in PPO and A2C. (@AndyShih12)
- Fixed DQN target network sharing feature extractor with the main network.
- Fixed storing correct `dones` in on-policy algorithm rollout collection. (@AndyShih12)
- Fixed number of filters in final convolutional layer in NatureCNN to match original implementation.
Others:
- Refactored off-policy algorithms to share the same `.learn()` method
- Split the `collect_rollout()` method for off-policy algorithms
- Added `_on_step()` for off-policy base class
- Optimized replay buffer size by removing the need of `next_observations` numpy array
- Optimized polyak updates (1.5-1.95 speedup) through inplace operations (@partiallytyped)
- Switched to `black` codestyle and added `make format`, `make check-codestyle` and `commit-checks`
- Ignored errors from newer pytype version
- Added a check when using `gSDE`
- Removed codacy dependency from Dockerfile
- Added `common.sb2_compat.RMSpropTFLike` optimizer, which corresponds more closely to the Tensorflow implementation of RMSprop; see the sketch after this list
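
A minimal sketch of swapping in the TF-style optimizer via `policy_kwargs` (the env ID and `eps` value are illustrative):

```python
from stable_baselines3 import A2C
from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike

model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    policy_kwargs=dict(
        optimizer_class=RMSpropTFLike,    # TF-like epsilon placement
        optimizer_kwargs=dict(eps=1e-5),  # illustrative epsilon value
    ),
)
```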
Documentation:
- Updated notebook links
- Fixed a typo in the Enjoy a Trained Agent section of the RL Baselines3 Zoo README. (@blurLake)
- Added Unity reacher to the projects page (@koulakis)
- Added PyBullet colab notebook
- Fixed typo in PPO example code (@joeljosephjin)
- Fixed typo in custom policy doc (@RaphaelWag)
Hotfix for PPO/A2C + gSDE, internal refactoring and bug fixes
Breaking Changes:
- `render()` method of `VecEnvs` now only accepts one argument: `mode`
- Created new file common/torch_layers.py, similar to SB refactoring
  - Contains all PyTorch network layer definitions and feature extractors: `MlpExtractor`, `create_mlp`, `NatureCNN`
- Renamed `BaseRLModel` to `BaseAlgorithm` (along with offpolicy and onpolicy variants)
- Moved on-policy and off-policy base algorithms to `common/on_policy_algorithm.py` and `common/off_policy_algorithm.py`, respectively.
- Moved `PPOPolicy` to `ActorCriticPolicy` in common/policies.py
- Moved `PPO` (algorithm class) into `OnPolicyAlgorithm` (`common/on_policy_algorithm.py`), to be shared with A2C
- Moved the following functions from `BaseAlgorithm`:
  - `_load_from_file` to `load_from_zip_file` (save_util.py)
  - `_save_to_file_zip` to `save_to_zip_file` (save_util.py)
  - `safe_mean` to `safe_mean` (utils.py)
  - `check_env` to `check_for_correct_spaces` (utils.py; renamed to avoid confusion with environment checker tools)
- Moved static function `_is_vectorized_observation` from common/policies.py to common/utils.py under the name `is_vectorized_observation`.
- Removed `{save,load}_running_average` functions of `VecNormalize` in favor of `load/save`.
- Removed `use_gae` parameter from `RolloutBuffer.compute_returns_and_advantage`.
Bug Fixes:
- Fixed `render()` method for `VecEnvs`
- Fixed `seed()` method for `SubprocVecEnv`
- Fixed loading on GPU for testing when using gSDE and `deterministic=False`
- Fixed `register_policy` to allow re-registering the same policy for the same sub-class (i.e. assign the same value to the same key).
- Fixed a bug where the gradient was passed when using `gSDE` with `PPO`/`A2C`; this does not affect `SAC`
Others:
- Re-enabled unsafe `fork` start method in the tests (was causing a deadlock with tensorflow)
- Added a test for seeding `SubprocVecEnv` and rendering
- Fixed reference in NatureCNN (pointed to older version with different network architecture)
- Fixed comments saying "CxWxH" instead of "CxHxW" (same style as in torch docs / commonly used)
- Added further comments on registering/getting policies ("MlpPolicy", "CnnPolicy").
- Renamed `progress` (value from 1 at the start of training to 0 at the end) to `progress_remaining`.
- Added `policies.py` files for A2C/PPO, which define MlpPolicy/CnnPolicy (renamed ActorCriticPolicies).
- Added some missing tests for `VecNormalize`, `VecCheckNan` and `PPO`.
Documentation:
- Added a paragraph on "MlpPolicy"/"CnnPolicy" and policy naming scheme under "Developer Guide"
- Fixed second-level listing in changelog
Tensorboard support, refactored logger
Breaking Changes:
- Remove State-Dependent Exploration (SDE) support for `TD3`
- Methods were renamed in the logger (see the sketch after this list): `logkv` -> `record`, `writekvs` -> `write`, `writeseq` -> `write_sequence`, `logkvs` -> `record_dict`, `dumpkvs` -> `dump`, `getkvs` -> `get_log_dict`, `logkv_mean` -> `record_mean`
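
A minimal sketch of the renamed logger API, assuming the module-level logger functions synced from Stable Baselines:

```python
from stable_baselines3.common import logger

# Record a key/value pair (formerly `logkv`), average a noisy value across
# calls (formerly `logkv_mean`), then flush to the outputs (formerly `dumpkvs`).
logger.record("train/reward", 42.0)
logger.record_mean("train/episode_length", 200)
logger.dump(step=1000)
```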
New Features:
- Added env checker (Sync with Stable Baselines)
- Added `VecCheckNan` and `VecVideoRecorder` (Sync with Stable Baselines)
- Added determinism tests
- Added `cmd_util` and `atari_wrappers`
- Added support for `MultiDiscrete` and `MultiBinary` observation spaces (@rolandgvc)
- Added `MultiCategorical` and `Bernoulli` distributions for PPO/A2C (@rolandgvc)
- Added support for logging to tensorboard (@rolandgvc)
- Added `VectorizedActionNoise` for continuous vectorized environments (@partiallytyped)
- Log evaluation in the `EvalCallback` using the logger
Bug Fixes:
- Fixed a bug that prevented model trained on cpu to be loaded on gpu
- Fixed version number that had a new line included
- Fixed weird seg fault in docker image due to FakeImageEnv by reducing screen size
- Fixed `sde_sample_freq` that was not taken into account for SAC
- Pass the logger module to `BaseCallback`; otherwise callbacks cannot write to the logger used by the algorithms
Others:
- Renamed to Stable-Baselines3
- Added Dockerfile
- Sync `VecEnvs` with Stable-Baselines
- Update requirement: `gym>=0.17`
- Added `.readthedoc.yml` file
- Added `flake8` and `make lint` command
- Added Github workflow
- Added warning when passing both `train_freq` and `n_episodes_rollout` to Off-Policy Algorithms
Documentation:
- Added most documentation (adapted from Stable-Baselines)
- Added link to CONTRIBUTING.md in the README (@kinalmehta)
- Added gSDE project and updated docstrings accordingly
- Fixed `TD3` example code block