diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 0063cc1fb..80366b2d2 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -8,7 +8,7 @@ into two categories: 2. You want to implement a feature or bug-fix for an outstanding issue - Look at the outstanding issues here: https://github.com/DLR-RM/stable-baselines3/issues - Pick an issue or feature and comment on the task that you want to work on this feature. - - If you need more context on a particular issue, please ask and we shall provide. + - If you need more context on a particular issue, please ask, and we shall provide. Once you finish implementing a feature or bug-fix, please send a Pull Request to https://github.com/DLR-RM/stable-baselines3 @@ -61,7 +61,7 @@ def my_function(arg1: type1, arg2: type2) -> returntype: ## Pull Request (PR) -Before proposing a PR, please open an issue, where the feature will be discussed. This prevent from duplicated PR to be proposed and also ease the code review process. +Before proposing a PR, please open an issue, where the feature will be discussed. This prevents duplicated PRs from being proposed and also eases the code review process. Each PR need to be reviewed and accepted by at least one of the maintainers (@hill-a, @araffin, @ernestum, @AdamGleave, @Miffyli or @qgallouedec). A PR must pass the Continuous Integration tests to be merged with the master branch. diff --git a/README.md b/README.md index 81f4c6e34..4f427087b 100644 --- a/README.md +++ b/README.md @@ -109,7 +109,7 @@ pip install stable-baselines3[extra] ``` **Note:** Some shells such as Zsh require quotation marks around brackets, i.e. `pip install 'stable-baselines3[extra]'` ([More Info](https://stackoverflow.com/a/30539963)). -This includes an optional dependencies like Tensorboard, OpenCV or `atari-py` to train on atari games. If you do not need those, you can use: +This includes optional dependencies like Tensorboard, OpenCV or `ale-py` to train on Atari games. If you do not need those, you can use: ```sh pip install stable-baselines3 ``` diff --git a/docs/common/distributions.rst b/docs/common/distributions.rst index d5c3a077d..a716f8be6 100644 --- a/docs/common/distributions.rst +++ b/docs/common/distributions.rst @@ -16,7 +16,7 @@ The policy networks output parameters for the distributions (named ``flat`` in t Actions are then sampled from those distributions. For instance, in the case of discrete actions. The policy network outputs probability -of taking each action. The ``CategoricalDistribution`` allows to sample from it, +of taking each action. The ``CategoricalDistribution`` allows sampling from it, computes the entropy, the log probability (``log_prob``) and backpropagate the gradient. In the case of continuous actions, a Gaussian distribution is used.
The policy network outputs diff --git a/docs/guide/callbacks.rst b/docs/guide/callbacks.rst index 5b2cfaee5..239966a6f 100644 --- a/docs/guide/callbacks.rst +++ b/docs/guide/callbacks.rst @@ -30,7 +30,7 @@ You can find two examples of custom callbacks in the documentation: one for savi :param verbose: Verbosity level: 0 for no output, 1 for info messages, 2 for debug messages """ def __init__(self, verbose=0): - super(CustomCallback, self).__init__(verbose) + super().__init__(verbose) # Those variables will be accessible in the callback # (they are defined in the base class) # The RL model @@ -70,7 +70,7 @@ You can find two examples of custom callbacks in the documentation: one for savi For child callback (of an `EventCallback`), this will be called when the event is triggered. - :return: (bool) If the callback returns False, training is aborted early. + :return: If the callback returns False, training is aborted early. """ return True @@ -110,7 +110,7 @@ A child callback is for instance :ref:`StopTrainingOnRewardThreshold ` to have a better overview of what can be achieved with this kind of callbacks. + We recommend taking a look at the source code of :ref:`EvalCallback` and :ref:`StopTrainingOnRewardThreshold ` to have a better overview of what can be achieved with this kind of callback. .. code-block:: python @@ -159,8 +159,8 @@ corresponding statistics using ``save_vecnormalize`` (``False`` by default). .. warning:: - When using multiple environments, each call to ``env.step()`` will effectively correspond to ``n_envs`` steps. - If you want the ``save_freq`` to be similar when using different number of environments, + When using multiple environments, each call to ``env.step()`` will effectively correspond to ``n_envs`` steps. + If you want the ``save_freq`` to be similar when using a different number of environments, you need to account for it using ``save_freq = max(save_freq // n_envs, 1)``. The same goes for the other callbacks. @@ -189,7 +189,7 @@ EvalCallback ^^^^^^^^^^^^ Evaluate periodically the performance of an agent, using a separate test environment. -It will save the best model if ``best_model_save_path`` folder is specified and save the evaluations results in a numpy archive (``evaluations.npz``) if ``log_path`` folder is specified. +It will save the best model if the ``best_model_save_path`` folder is specified and save the evaluation results in a NumPy archive (``evaluations.npz``) if the ``log_path`` folder is specified. .. note:: @@ -230,7 +230,7 @@ This callback is integrated inside SB3 via the ``progress_bar`` argument of the .. note:: - This callback requires ``tqdm`` and ``rich`` packages to be installed. This is done automatically when using ``pip install stable-baselines3[extra]`` + The ``ProgressBarCallback`` requires the ``tqdm`` and ``rich`` packages to be installed. This is done automatically when using ``pip install stable-baselines3[extra]`` .. code-block:: python @@ -367,7 +367,7 @@ StopTrainingOnNoModelImprovement ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Stop the training if there is no new best model (no new best mean reward) after more than a specific number of consecutive evaluations. -The idea is to save time in experiments when you know that the learning curves are somehow well behaved and, therefore, +The idea is to save time in experiments when you know that the learning curves are somehow well-behaved and, therefore, after many evaluations without improvement the learning has probably stabilized.
It must be used with the :ref:`EvalCallback` and use the event triggered after every evaluation. diff --git a/docs/guide/custom_env.rst b/docs/guide/custom_env.rst index a4499634b..e07562794 100644 --- a/docs/guide/custom_env.rst +++ b/docs/guide/custom_env.rst @@ -3,7 +3,7 @@ Using Custom Environments ========================== -To use the RL baselines with custom environments, they just need to follow the *gymnasium* `interface `_. +To use the RL baselines with custom environments, they just need to follow the *gymnasium* `interface `_. That is to say, your environment must implement the following methods (and inherits from Gym Class): diff --git a/docs/guide/custom_policy.rst b/docs/guide/custom_policy.rst index 0807498e4..1662d2dac 100644 --- a/docs/guide/custom_policy.rst +++ b/docs/guide/custom_policy.rst @@ -262,7 +262,7 @@ Custom Networks If you need a network architecture that is different for the actor and the critic when using ``PPO``, ``A2C`` or ``TRPO``, you can pass a dictionary of the following structure: ``dict(pi=[], vf=[])``. -For example, if you want a different architecture for the actor (aka ``pi``) and the critic ( value-function aka ``vf``) networks, +For example, if you want a different architecture for the actor (aka ``pi``) and the critic (value-function aka ``vf``) networks, then you can specify ``net_arch=dict(pi=[32, 32], vf=[64, 64])``. Otherwise, to have actor and critic that share the same network architecture, diff --git a/docs/guide/examples.rst b/docs/guide/examples.rst index 7a0586c33..0d097483f 100644 --- a/docs/guide/examples.rst +++ b/docs/guide/examples.rst @@ -5,7 +5,7 @@ Examples .. note:: - These examples are only to demonstrate the use of the library and its functions, and the trained agents may not solve the environments. Optimized hyperparameters can be found in the RL Zoo `repository `_. + These examples are only to demonstrate the use of the library and its functions, and the trained agents may not solve the environments. Optimized hyperparameters can be found in the RL Zoo `repository `_. Try it online with Colab Notebooks! @@ -191,8 +191,8 @@ Dict Observations You can use environments with dictionary observation spaces. This is useful in the case where one can't directly concatenate observations such as an image from a camera combined with a vector of servo sensor data (e.g., rotation angles). -Stable Baselines3 provides ``SimpleMultiObsEnv`` as an example of this kind of of setting. -The environment is a simple grid world but the observations for each cell come in the form of dictionaries. +Stable Baselines3 provides ``SimpleMultiObsEnv`` as an example of this kind of setting. +The environment is a simple grid world, but the observations for each cell come in the form of dictionaries. These dictionaries are randomly initialized on the creation of the environment and contain a vector observation and an image observation. .. code-block:: python @@ -217,7 +217,7 @@ Callbacks: Monitoring Training You can define a custom callback function that will be called inside the agent. This could be useful when you want to monitor training, for instance display live -learning curves in Tensorboard (or in Visdom) or save the best agent. +learning curves in Tensorboard or save the best agent. If your callback returns False, training is aborted early. .. image:: ../_static/img/colab-badge.svg @@ -251,7 +251,7 @@ If your callback returns False, training is aborted early. 
:param verbose: Verbosity level: 0 for no output, 1 for info messages, 2 for debug messages """ def __init__(self, check_freq: int, log_dir: str, verbose: int = 1): - super(SaveOnBestTrainingRewardCallback, self).__init__(verbose) + super().__init__(verbose) self.check_freq = check_freq self.log_dir = log_dir self.save_path = os.path.join(log_dir, "best_model") diff --git a/docs/guide/export.rst b/docs/guide/export.rst index b50c484be..cccf30014 100644 --- a/docs/guide/export.rst +++ b/docs/guide/export.rst @@ -194,14 +194,14 @@ Full example code: https://github.com/chunky/sb3_to_coral Google created a chip called the "Coral" for deploying AI to the edge. It's available in a variety of form factors, including USB (using -the Coral on a Rasbperry pi, with a SB3-developed model, was the original +the Coral on a Raspberry Pi, with an SB3-developed model, was the original motivation for the code example above). The Coral chip is fast, with very low power consumption, but only has limited on-device training abilities. More information is on the webpage here: https://coral.ai. -To deploy to a Coral, one must work via TFLite, and quantise the +To deploy to a Coral, one must work via TFLite, and quantize the network to reflect the Coral's capabilities. The full chain to go from SB3 to Coral is: SB3 (Torch) => ONNX => TensorFlow => TFLite => Coral. diff --git a/docs/guide/install.rst b/docs/guide/install.rst index a464496e6..587234b00 100644 --- a/docs/guide/install.rst +++ b/docs/guide/install.rst @@ -9,10 +9,10 @@ Prerequisites Stable-Baselines3 requires python 3.8+ and PyTorch >= 1.13 -Windows 10 -~~~~~~~~~~ +Windows +~~~~~~~ -We recommend using `Anaconda `_ for Windows users for easier installation of Python packages and required libraries. You need an environment with Python version 3.6 or above. +We recommend using `Anaconda `_ for Windows users for easier installation of Python packages and required libraries. You need an environment with Python version 3.8 or above. For a quick start you can move straight to installing Stable-Baselines3 in the next step. @@ -34,7 +34,7 @@ To install Stable Baselines3 with pip, execute: Some shells such as Zsh require quotation marks around brackets, i.e. ``pip install 'stable-baselines3[extra]'`` `More information `_. -This includes an optional dependencies like Tensorboard, OpenCV or ``ale-py`` to train on atari games. If you do not need those, you can use: +This includes optional dependencies like Tensorboard, OpenCV or ``ale-py`` to train on Atari games. If you do not need those, you can use: .. code-block:: bash diff --git a/docs/guide/migration.rst b/docs/guide/migration.rst index 967a6ac0d..5f19ab278 100644 --- a/docs/guide/migration.rst +++ b/docs/guide/migration.rst @@ -15,7 +15,7 @@ Overview Overall Stable-Baselines3 (SB3) keeps the high-level API of Stable-Baselines (SB2). Most of the changes are to ensure more consistency and are internal ones. -Because of the backend change, from Tensorflow to PyTorch, the internal code is much much readable and easy to debug +Because of the backend change, from Tensorflow to PyTorch, the internal code is much more readable and easier to debug at the cost of some speed (dynamic graph vs static graph., see `Issue #90 `_) However, the algorithms were extensively benchmarked on Atari games and continuous control PyBullet envs (see `Issue #48 `_ and `Issue #49 `_) @@ -203,8 +203,8 @@ New Features (SB3 vs SB2) - Much cleaner and consistent base code (and no more warnings =D!)
and static type checks - Independent saving/loading/predict for policies - A2C now supports Generalized Advantage Estimation (GAE) and advantage normalization (both are deactivated by default) -- Generalized State-Dependent Exploration (gSDE) exploration is available for A2C/PPO/SAC. It allows to use RL directly on real robots (cf https://arxiv.org/abs/2005.05719) -- Better saving/loading: optimizers are now included in the saved parameters and there is two new methods ``save_replay_buffer`` and ``load_replay_buffer`` for the replay buffer when using off-policy algorithms (DQN/DDPG/SAC/TD3) +- Generalized State-Dependent Exploration (gSDE) is available for A2C/PPO/SAC. It allows using RL directly on real robots (cf https://arxiv.org/abs/2005.05719) +- Better saving/loading: optimizers are now included in the saved parameters and there are two new methods ``save_replay_buffer`` and ``load_replay_buffer`` for the replay buffer when using off-policy algorithms (DQN/DDPG/SAC/TD3) - You can pass ``optimizer_class`` and ``optimizer_kwargs`` to ``policy_kwargs`` in order to easily customize optimizers - Seeding now works properly to have deterministic results diff --git a/docs/guide/rl.rst b/docs/guide/rl.rst index eca9ce635..637d8a239 100644 --- a/docs/guide/rl.rst +++ b/docs/guide/rl.rst @@ -15,4 +15,5 @@ However, if you want to learn about RL, there are several good resources to get - `Lilian Weng's blog `_ - `Berkeley's Deep RL Bootcamp `_ - `Berkeley's Deep Reinforcement Learning course `_ +- `DQN tutorial `_ - `More resources `_ diff --git a/docs/guide/rl_tips.rst b/docs/guide/rl_tips.rst index f82c163ac..ce6f43e55 100644 --- a/docs/guide/rl_tips.rst +++ b/docs/guide/rl_tips.rst @@ -4,7 +4,7 @@ Reinforcement Learning Tips and Tricks ====================================== -The aim of this section is to help you doing reinforcement learning experiments. +The aim of this section is to help you do reinforcement learning experiments. It covers general advice about RL (where to start, which algorithm to choose, how to evaluate an algorithm, ...), as well as tips and tricks when using a custom environment or implementing an RL algorithm. @@ -27,7 +27,7 @@ TL;DR Like any other subject, if you want to work with RL, you should first read about it (we have a dedicated `resource page `_ to get you started) -to understand what you are using. We also recommend you read Stable Baselines3 (SB3) documentation and do the `tutorial `_. +to understand what you are using. We also recommend you read the Stable Baselines3 (SB3) documentation and do the `tutorial `_. It covers basic usage and guide you towards more advanced concepts of the library (e.g. callbacks and wrappers). Reinforcement Learning differs from other machine learning methods in several ways. The data used to train the agent is collected @@ -38,13 +38,13 @@ bad trajectories. This factor, among others, explains that results in RL may vary from one run to another (i.e., when only the seed of the pseudo-random generator changes). For this reason, you should always do several runs to have quantitative results. -Good results in RL are generally dependent on finding appropriate hyperparameters. Recent algorithms (PPO, SAC, TD3) normally require little hyperparameter tuning, +Good results in RL are generally dependent on finding appropriate hyperparameters. Recent algorithms (PPO, SAC, TD3, DroQ) normally require little hyperparameter tuning, however, *don't expect the default ones to work* on any environment.
Therefore, we *highly recommend you* to take a look at the `RL zoo `_ (or the original papers) for tuned hyperparameters. A best practice when you apply RL to a new problem is to do automatic hyperparameter optimization. Again, this is included in the `RL zoo `_. -When applying RL to a custom problem, you should always normalize the input to the agent (e.g. using VecNormalize for PPO/A2C) +When applying RL to a custom problem, you should always normalize the input to the agent (e.g. using ``VecNormalize`` for PPO/A2C) and look at common preprocessing done on other environments (e.g. for `Atari `_, frame-stack, ...). Please refer to *Tips and Tricks when creating a custom environment* paragraph below for more advice related to custom environments. @@ -86,13 +86,15 @@ and average the reward per episode to have a good estimate. We provide an ``EvalCallback`` for doing such evaluation. You can read more about it in the :ref:`Callbacks ` section. -As some policy are stochastic by default (e.g. A2C or PPO), you should also try to set `deterministic=True` when calling the `.predict()` method, +As some policies are stochastic by default (e.g. A2C or PPO), you should also try to set `deterministic=True` when calling the `.predict()` method, this frequently leads to better performance. Looking at the training curve (episode reward function of the timesteps) is a good proxy but underestimates the agent true performance. +We highly recommend reading `Empirical Design in Reinforcement Learning `_, as it provides valuable insights for best practices when running RL experiments. -We suggest you reading `Deep Reinforcement Learning that Matters `_ for a good discussion about RL evaluation. +We also suggest reading `Deep Reinforcement Learning that Matters `_ for a good discussion about RL evaluation, +and `Rliable: Better Evaluation for Reinforcement Learning `_ for comparing results. You can also take a look at this `blog post `_ and this `issue `_ by Cédric Colas. @@ -111,6 +113,10 @@ The second difference that will help you choose is whether you can parallelize y If what matters is the wall clock training time, then you should lean towards ``A2C`` and its derivatives (PPO, ...). Take a look at the `Vectorized Environments `_ to learn more about training with multiple workers. +To accelerate training, you can also take a look at `SBX`_, which is SB3 + Jax; it has fewer features than SB3 but can be up to 20x faster than SB3 PyTorch thanks to JIT compilation of the gradient update. + +In sparse reward settings, we recommend either using dedicated methods like HER (see below) or population-based algorithms like ARS (available in our :ref:`contrib repo `). + To sum it up: Discrete Actions @@ -143,6 +149,8 @@ Continuous Actions - Single Process Current State Of The Art (SOTA) algorithms are ``SAC``, ``TD3`` and ``TQC`` (available in our :ref:`contrib repo `). Please use the hyperparameters in the `RL zoo `_ for best results. +If you want an extremely sample-efficient algorithm, we recommend using the `DroQ configuration `_ in `SBX`_ (it does many gradient steps per step in the environment). + Continuous Actions - Multiprocessed ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -188,7 +196,8 @@ and properly handle termination due to a timeout (maximum number of steps in an For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations as input.
-Termination due to timeout (max number of steps per episode) needs to be handled separately. You should fill the key in the info dict: ``info["TimeLimit.truncated"] = True``. +Termination due to timeout (max number of steps per episode) needs to be handled separately. +You should return ``truncated = True``. If you are using the gym ``TimeLimit`` wrapper, this will be done automatically. You can read `Time Limit in RL `_ or take a look at the `RL Tips and Tricks video `_ for more details. @@ -230,7 +239,7 @@ this can harm learning and be difficult to debug (cf attached image and `issue # .. figure:: ../_static/img/mistake.png -Another consequence of using a Gaussian is that the action range is not bounded. +Another consequence of using a Gaussian distribution is that the action range is not bounded. That's why clipping is usually used as a bandage to stay in a valid interval. A better solution would be to use a squashing function (cf ``SAC``) or a Beta distribution (cf `issue #112 `_). @@ -272,3 +281,5 @@ in RL with discrete actions: 2. LunarLander 3. Pong (one of the easiest Atari game) 4. other Atari games (e.g. Breakout) + +.. _SBX: https://github.com/araffin/sbx \ No newline at end of file diff --git a/docs/guide/sbx.rst b/docs/guide/sbx.rst index eee1d1ba4..52b4348bc 100644 --- a/docs/guide/sbx.rst +++ b/docs/guide/sbx.rst @@ -15,6 +15,8 @@ Implemented algorithms: - Dropout Q-Functions for Doubly Efficient Reinforcement Learning (DroQ) - Proximal Policy Optimization (PPO) - Deep Q Network (DQN) +- Twin Delayed DDPG (TD3) +- Deep Deterministic Policy Gradient (DDPG) As SBX follows SB3 API, it is also compatible with the `RL Zoo `_. @@ -27,17 +29,20 @@ For that you will need to create two files: import rl_zoo3 import rl_zoo3.train from rl_zoo3.train import train - from sbx import DQN, PPO, SAC, TQC, DroQ + from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, DroQ - rl_zoo3.ALGOS["tqc"] = TQC + rl_zoo3.ALGOS["ddpg"] = DDPG + rl_zoo3.ALGOS["dqn"] = DQN rl_zoo3.ALGOS["droq"] = DroQ rl_zoo3.ALGOS["sac"] = SAC rl_zoo3.ALGOS["ppo"] = PPO - rl_zoo3.ALGOS["dqn"] = DQN + rl_zoo3.ALGOS["td3"] = TD3 + rl_zoo3.ALGOS["tqc"] = TQC rl_zoo3.train.ALGOS = rl_zoo3.ALGOS rl_zoo3.exp_manager.ALGOS = rl_zoo3.ALGOS + if __name__ == "__main__": train() @@ -51,16 +56,19 @@ Then you can call ``python train_sbx.py --algo sac --env Pendulum-v1`` and use t import rl_zoo3 import rl_zoo3.enjoy from rl_zoo3.enjoy import enjoy - from sbx import DQN, PPO, SAC, TQC, DroQ + from sbx import DDPG, DQN, PPO, SAC, TD3, TQC, DroQ - rl_zoo3.ALGOS["tqc"] = TQC + rl_zoo3.ALGOS["ddpg"] = DDPG + rl_zoo3.ALGOS["dqn"] = DQN rl_zoo3.ALGOS["droq"] = DroQ rl_zoo3.ALGOS["sac"] = SAC rl_zoo3.ALGOS["ppo"] = PPO - rl_zoo3.ALGOS["dqn"] = DQN + rl_zoo3.ALGOS["td3"] = TD3 + rl_zoo3.ALGOS["tqc"] = TQC rl_zoo3.enjoy.ALGOS = rl_zoo3.ALGOS rl_zoo3.exp_manager.ALGOS = rl_zoo3.ALGOS + if __name__ == "__main__": enjoy() diff --git a/docs/guide/tensorboard.rst b/docs/guide/tensorboard.rst index 720c3ded2..4ef1b496a 100644 --- a/docs/guide/tensorboard.rst +++ b/docs/guide/tensorboard.rst @@ -141,7 +141,7 @@ Here is an example of how to render an image to TensorBoard at regular intervals Logging Figures/Plots --------------------- -TensorBoard supports periodic logging of figures/plots created with matplotlib, which helps evaluating agents at various stages during training. +TensorBoard supports periodic logging of figures/plots created with matplotlib, which helps evaluate agents at various stages during training. .. 
warning:: To support figure logging `matplotlib `_ must be installed otherwise, TensorBoard ignores the figure and logs a warning. @@ -179,7 +179,7 @@ Here is an example of how to store a plot in TensorBoard at regular intervals: Logging Videos -------------- -TensorBoard supports periodic logging of video data, which helps evaluating agents at various stages during training. +TensorBoard supports periodic logging of video data, which helps evaluate agents at various stages during training. .. warning:: To support video logging `moviepy `_ must be installed otherwise, TensorBoard ignores the video and logs a warning. @@ -252,7 +252,7 @@ Here is an example of how to render an episode and log the resulting video to Te Logging Hyperparameters ----------------------- -TensorBoard supports logging of hyperparameters in its HPARAMS tab, which helps comparing agents trainings. +TensorBoard supports logging of hyperparameters in its HPARAMS tab, which helps to compare agent training runs. .. warning:: To display hyperparameters in the HPARAMS section, a ``metric_dict`` must be given (as well as a ``hparam_dict``). diff --git a/docs/guide/vec_envs.rst b/docs/guide/vec_envs.rst index 10bba850c..792fedecb 100644 --- a/docs/guide/vec_envs.rst +++ b/docs/guide/vec_envs.rst @@ -58,7 +58,7 @@ SB3 VecEnv API is actually close to Gym 0.21 API but differs to Gym 0.26+ API: - the ``vec_env.step(actions)`` method expects an array as input (with a batch size corresponding to the number of environments) and returns a 4-tuple (and not a 5-tuple): ``obs, rewards, dones, infos`` instead of ``obs, reward, terminated, truncated, info`` where ``dones = terminated or truncated`` (for each env). - ``obs, rewards, dones`` are numpy arrays with shape ``(n_envs, shape_for_single_env)`` (so with a batch dimension). + ``obs, rewards, dones`` are NumPy arrays with shape ``(n_envs, shape_for_single_env)`` (so with a batch dimension). Additional information is passed via the ``infos`` value which is a list of dictionaries. - at the end of an episode, ``infos[env_idx]["TimeLimit.truncated"] = truncated and not terminated`` diff --git a/docs/misc/changelog.rst b/docs/misc/changelog.rst index 19aeeec44..5999aa9c7 100644 --- a/docs/misc/changelog.rst +++ b/docs/misc/changelog.rst @@ -62,7 +62,8 @@ Others: Documentation: ^^^^^^^^^^^^^^ - +- Updated RL Tips and Tricks (added recommendations for evaluation, added links to DroQ, ARS and SBX). +- Fixed various typos and grammar mistakes Release 2.1.0 (2023-08-17) -------------------------- diff --git a/docs/misc/projects.rst b/docs/misc/projects.rst index 1c3ba95fc..39803018e 100644 --- a/docs/misc/projects.rst +++ b/docs/misc/projects.rst @@ -13,13 +13,13 @@ An open-source Gym-compatible environment specifically tailored for developing R | Authors: Parth Kothari, Christian Perone, Luca Bergamini, Alexandre Alahi, Peter Ondruska | Github: https://github.com/lyft/l5kit -| Paper: https://arxiv.org/abs/2111.06889 +| Paper: https://arxiv.org/abs/2111.06889 RL Reach -------- -A platform for running reproducible reinforcement learning experiments for customisable robotic reaching tasks. This self-contained and straightforward toolbox allows its users to quickly investigate and identify optimal training configurations. +A platform for running reproducible reinforcement learning experiments for customizable robotic reaching tasks. This self-contained and straightforward toolbox allows its users to quickly investigate and identify optimal training configurations.
| Authors: Pierre Aumjaud, David McAuliffe, Francisco Javier Rodríguez Lera, Philip Cardiff | Github: https://github.com/PierreExeter/rl_reach @@ -40,7 +40,7 @@ It was the starting point of Stable-Baselines3. Furuta Pendulum Robot --------------------- -Everything you need to build and train a rotary inverted pendulum, also know as a furuta pendulum! This makes use of gSDE listed above. +Everything you need to build and train a rotary inverted pendulum, also known as a Furuta pendulum! This makes use of gSDE listed above. The Github repository contains code, CAD files and a bill of materials for you to build the robot. You can watch `a video overview of the project here `_. | Authors: Armand du Parc Locmaria, Pierre Fabre @@ -65,7 +65,7 @@ A simple interface to instantiate RL environments with SUMO for Traffic Signal C - Supports Multiagent RL - Compatibility with gym.Env and popular RL libraries such as stable-baselines3 and RLlib -- Easy customisation: state and reward definitions are easily modifiable +- Easy customization: state and reward definitions are easily modifiable | Author: Lucas Alegre | Github: https://github.com/LucasAlegre/sumo-rl @@ -178,11 +178,11 @@ RLeXplore is a set of implementations of intrinsic reward driven-exploration app UAV_Navigation_DRL_AirSim ------------------------- -A platform for training UAV navigation policies in complex unknown environments. +A platform for training UAV navigation policies in complex unknown environments. -- Based on AirSim and SB3. -- An Open AI Gym env is created including kinematic models for both multirotor and fixed-wing UAVs. -- Some UE4 environments are provided to train and test the navigation policy. +- Based on AirSim and SB3. +- An OpenAI Gym env is created including kinematic models for both multirotor and fixed-wing UAVs. +- Some UE4 environments are provided to train and test the navigation policy. Try to train your own autonomous flight policy and even transfer it to real UAVs! Have fun ^_^!