
Deep intrinsically motivated exploration in continuous control

Published in Machine Learning

Abstract

In continuous control, exploration is often performed through undirected strategies in which the parameters of the networks or the selected actions are perturbed by random noise. Although undirected exploration in the deep setting has been shown to improve the performance of on-policy methods, it introduces excessive computational complexity and is known to fail in the off-policy setting. Intrinsically motivated exploration is an effective alternative to undirected strategies, but it is usually studied in discrete action domains. In this paper, we investigate how intrinsic motivation can be effectively combined with deep reinforcement learning in the control of continuous systems to obtain a directed exploratory behavior. We adapt existing theories on animal motivational systems to the reinforcement learning paradigm and introduce a novel and scalable directed exploration strategy. The introduced approach, motivated by the maximization of the value function's error, can benefit from a collected set of experiences by extracting useful information and unifies the intrinsic exploration motivations in the literature under a single exploration objective. An extensive set of empirical studies demonstrates that our framework extends to larger and more diverse state spaces, dramatically improves the baselines, and significantly outperforms the undirected strategies.



Availability of data and materials

Simulators used in the experiments are freely and publicly available online: https://gym.openai.com/.

Code availability

The source code for the research leading to these results is made publicly available at the corresponding author’s GitHub repository: https://github.com/baturaysaglam/DISCOVER.

Change history

  • 04 November 2023

    Red type in figure captions changed to black.

Notes

  1. https://github.com/baturaysaglam/DISCOVER.

  2. https://github.com/Kaixhin/NoisyNet-A3C.

  3. https://github.com/openai/baselines.

  4. https://github.com/sfujim/TD3.

  5. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.

  6. https://github.com/DLR-RM/rl-baselines3-zoo.

  7. https://opentsne.readthedocs.io/en/latest/.


Funding

The authors did not receive funds, grants, or other support from any organization for the submitted work.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception and design. Formal analysis, investigation, and literature search were performed by Baturay Saglam. The first draft of the manuscript was written by Baturay Saglam, and editing was done by Baturay Saglam and Suleyman Serdar Kozat. Revision and supervision were performed by Suleyman Serdar Kozat. All authors commented on previous versions of the manuscript.

Corresponding author

Correspondence to Baturay Saglam.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor: Sicco Verwer

Appendices

Appendix 1: detailed experimental setup

1.1 Software and environment

All networks are trained with PyTorch (version 1.8.1) (Paszke et al., 2019), using default values for all unmentioned hyper-parameters. The performance of all methods is evaluated in the MuJoCo (mujoco-py version 1.50) and Box2D (version 2.3.10) physics engines interfaced by OpenAI Gym (version 0.17.3), using the v3 environment for BipedalWalker and v2 for the rest of the environments. The environment dynamics, state and action spaces, and reward functions are not pre-processed or modified, ensuring easy reproducibility and a fair evaluation against the baseline and competing algorithms. Each environment episode runs for a maximum of 1000 steps or until a terminal condition is encountered. The multi-dimensional action space of all environments is within the range \((-1, 1)\), except for Humanoid, which uses the range \((-0.4, 0.4)\).
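
The sketch below illustrates this setup; the helper name make_env, the seed handling, and the printed checks are illustrative rather than taken from the released code.

```python
# Minimal sketch of the environment setup, assuming gym 0.17.3 with mujoco-py 1.50.
import gym

def make_env(name: str, seed: int) -> gym.Env:
    # BipedalWalker uses the v3 interface; the MuJoCo tasks use v2.
    env = gym.make(name)            # e.g. "BipedalWalker-v3", "Hopper-v2"
    env.seed(seed)                  # gym 0.17.3 still exposes env.seed()
    env.action_space.seed(seed)
    return env

env = make_env("Hopper-v2", seed=0)
print(env.action_space.low, env.action_space.high)  # (-1, 1); Humanoid uses (-0.4, 0.4)
print(env._max_episode_steps)                       # 1000-step episode limit
```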

1.2 Evaluation

All experiments are run for 1 million time steps with evaluations every 1000 time steps, where an evaluation of an agent records the average reward over 10 episodes without exploration noise and network updates. We use a new environment with a fixed seed (the training seed + a constant) for each evaluation to decrease the variation caused by different seeds. Therefore, each evaluation uses the same set of initial start states.

We report the average evaluation return over 10 random seeds for each environment, where the seeds cover the initialization of the behavioral policies, simulators, network parameters, and dependencies. Unless stated otherwise, each agent undergoes one training iteration after each time step. Agents are trained on batches of transitions uniformly sampled from the experience replay. Learning curves are used to show performance and are given as an average over the 10 trials, with a shaded region reflecting half a standard deviation across the trials. The curves are smoothed uniformly over a sliding window of 5 evaluations for visual clarity.
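
As a concrete illustration of this protocol, the following sketch shows how such an evaluation loop might be written; the method agent.select_action and the seed offset constant are placeholders rather than the released implementation.

```python
# Sketch of the evaluation protocol: 10 deterministic episodes in a fresh,
# fixed-seed environment, with no exploration noise and no network updates.
import gym
import numpy as np

def evaluate(agent, env_name: str, training_seed: int,
             eval_episodes: int = 10, seed_offset: int = 100) -> float:
    eval_env = gym.make(env_name)
    eval_env.seed(training_seed + seed_offset)  # same initial start states every evaluation

    returns = []
    for _ in range(eval_episodes):
        state, done, episode_return = eval_env.reset(), False, 0.0
        while not done:
            action = agent.select_action(state)            # deterministic, noise-free action
            state, reward, done, _ = eval_env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return float(np.mean(returns))

def smooth(curve, window: int = 5):
    # Uniform sliding-window smoothing used for the reported learning curves.
    return np.convolve(np.asarray(curve), np.ones(window) / window, mode="valid")
```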

1.3 Visualization of the state visitations

We visualize the states within the collected transitions for greedy action selection and DISCOVER under the TD3 algorithm while learning in the Swimmer environment over 1 million time steps. The results are reported for a single seed. We consider the last 975,000 transitions in the replay buffer, as the first 25,000 are sampled from the environment’s action space. We first separately project the two datasets onto a 4D space through PCA to reduce visual artifacts.

Later, we jointly embed the resulting datasets into a 2D space through the t-SNE implementation of the openTSNE library (Poličar et al., 2019) (see footnote 7). We use a perplexity value of 1396 and the Euclidean metric for measuring distances between states. The t-SNE is run for 1500 iterations. Default values in openTSNE are used for all unmentioned parameters. We split the datasets into three portions of 325,000 samples each and visualize them separately. The PCA projections explain 0.984 and 0.966 of the variance for the greedy and DISCOVER datasets, respectively.
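
The following sketch illustrates this projection pipeline with scikit-learn's PCA and openTSNE; the array names and the concatenation used for the joint embedding are assumptions rather than the exact released procedure.

```python
# Sketch of the 4D PCA + joint 2D t-SNE projection described above.
import numpy as np
from sklearn.decomposition import PCA
from openTSNE import TSNE

def project(states_greedy: np.ndarray, states_discover: np.ndarray):
    # Separate 4D PCA projections to reduce visual artifacts.
    pca_g, pca_d = PCA(n_components=4), PCA(n_components=4)
    low_g = pca_g.fit_transform(states_greedy)
    low_d = pca_d.fit_transform(states_discover)
    print(pca_g.explained_variance_ratio_.sum(),   # 0.984 reported for the greedy dataset
          pca_d.explained_variance_ratio_.sum())   # 0.966 reported for the DISCOVER dataset

    # Joint 2D t-SNE embedding over the concatenated datasets; other
    # openTSNE parameters keep their defaults.
    tsne = TSNE(n_components=2, perplexity=1396, metric="euclidean", n_iter=1500)
    joint = tsne.fit(np.vstack([low_g, low_d]))
    emb_g, emb_d = joint[: len(low_g)], joint[len(low_g):]

    # Split each embedding into three portions of 325,000 samples for plotting.
    return np.array_split(emb_g, 3), np.array_split(emb_d, 3)
```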

1.4 Implementation

Our implementations of A2C and PPO are based on the code from the well-known repository (see footnote 5), following the tuned hyper-parameters for the considered continuous control tasks. For the implementation of TD3, we use the author’s GitHub repository (see footnote 4), which provides the fine-tuned version of the algorithm as well as the DDPG implementation. For SAC, we follow the structure outlined in the original paper.

We implement NoisyNet by adapting the code from the authors’ GitHub repository (see footnote 2) to the baseline actor-critic algorithms. The authors’ OpenAI Baselines implementation (see footnote 3) is used to implement PSNE. Similar to SAC, we refer to the original papers when implementing Deep Coherent Exploration and Meta-Policy Gradient, as the authors did not provide a valid code repository.

1.5 Architecture and hyper-parameter setting

The on-policy methods, A2C and PPO, follow the tuned hyper-parameters for the MuJoCo and Box2D tasks provided by the repository (see footnote 5). Our implementations of the off-policy actor-critic algorithms, DDPG, SAC, and TD3, closely follow the sets of hyper-parameters given in the respective papers. For DDPG and SAC, we use the fine-tuned environment-specific hyper-parameters provided by the RL Baselines3 Zoo (see footnote 6). TD3 uses the fine-tuned parameters provided in the author’s GitHub repository (see footnote 4). Shared, environment-specific, and algorithm-specific hyper-parameters for the off-policy methods are given in Tables 2, 3, and 4. Additionally, Tables 5 and 6 report the shared and algorithm-specific tuned hyper-parameters for the on-policy baselines, respectively. Note that the entropy coefficient used for A2C and PPO is 0, meaning that there is no entropy maximization and hence no inherent exploration; this tuned value corresponds to greedy action selection, which is found to perform best.

For the parameter-space noise algorithms, we initialize the parameter noise at 0.017 and 0.034, which give the best results in practice for the on- and off-policy algorithms, respectively, as suggested by Zhang and Van Hoof (2021). Furthermore, we use \(\beta = 0.01\) for all environments in the Deep Coherent Exploration algorithm and set the mean-squared error threshold in PSNE to 0.1.
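
For reference, the exploration-related settings above can be collected into a small configuration dictionary; the key names below are illustrative and not taken from the released code.

```python
# Exploration hyper-parameters used for the competing methods (illustrative key names).
EXPLORATION_HPARAMS = {
    "param_noise_init": {"on_policy": 0.017, "off_policy": 0.034},  # parameter-space noise (Zhang & Van Hoof, 2021)
    "coherent_exploration_beta": 0.01,  # Deep Coherent Exploration, all environments
    "psne_mse_threshold": 0.1,          # parameter-space noise exploration (PSNE)
    "entropy_coefficient": 0.0,         # A2C / PPO: tuned value, i.e., no entropy bonus
}
```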

The exploration framework of DISCOVER strictly follows the actor framework of the corresponding baseline algorithms. This includes the depth and size of the networks, learning rate, optimizer, nonlinearity, target and behavioral policy update frequencies, target network learning rate, and the number of gradient steps in the updates. Moreover, we still use the exploration policy during the exploration time steps at the beginning of each training run.

Table 2 Shared hyper-parameters of the baseline off-policy actor-critic algorithms
Table 3 Algorithm specific hyper-parameters used for the implementation of the baseline off-policy actor-critic algorithms
Table 4 SAC specific hyper-parameters
Table 5 Shared hyper-parameters of the baseline on-policy actor-critic algorithms
Table 6 Algorithm specific hyper-parameters used for the implementation of the baseline on-policy actor-critic algorithms

Appendix 2: complete experimental results

1.1 Additional evaluation results

Additional evaluation results in the InvertedDoublePendulum, InvertedPendulum, and Reacher environments for A2C, PPO, DDPG, SAC, and TD3 are reported in Figs. 10, 11, 12, 13, and 14, respectively.

Fig. 10

Additional evaluation curves for the set of MuJoCo continuous control tasks under the A2C algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 11

Additional evaluation curves for the set of MuJoCo continuous control tasks under the PPO algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 12

Additional evaluation curves for the set of MuJoCo continuous control tasks under the DDPG algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 13

Additional evaluation curves for the set of MuJoCo continuous control tasks under the SAC algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 14

Additional evaluation curves for the set of MuJoCo continuous control tasks under the TD3 algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

1.2 Learning curves for the ablation studies

1.2.1 Exploration direction regularization

From Figs. 15 and 16, we observe that an increasing degree of exploration detrimentally degrades the performance of the baseline algorithms. This is an expected result, as highly perturbed actions and state distributions prevent agents from effectively learning from their mistakes. Conversely, insufficient exploration obtains notable but suboptimal performance. In fact, greedy action selection, i.e., \(\lambda = 0.0\), performs second best after DISCOVER, yet converges to a suboptimal policy.

The optimal exploration regularization value is found to be 0.1 and 0.3 for the on- and off-policy settings, respectively. These values enable a faster convergence to the optimal policy (e.g., Humanoid), higher evaluation results (e.g., Swimmer), or both (e.g., HalfCheetah, Hopper). In addition, from Table 1, we find that the last 10 returns for \(\lambda = 0.1\) are higher than those for \(\lambda = 0.3\) in the Hopper-v2 environment under the off-policy setting. However, this is not visible in Fig. 16 due to the sliding window applied to the evaluation results. As the \(\lambda = 0.3\) setting converges faster than \(\lambda = 0.1\), we observe a better overall performance in the plots.

As explained, we perform our ablation studies in environments with different characteristics. Interestingly, we find that \(\lambda = 0.3\) generalizes across all the tested tasks. Hence, we infer that one can tune DISCOVER on a single set of physics dynamics and use the resulting value in different environments.
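
The exact DISCOVER update is specified in the main text and is not reproduced here; purely as an illustration of the role of the regularization coefficient, the sketch below shows one way such a coefficient could blend the greedy action with a directed exploration signal, with exploration_policy as a hypothetical placeholder.

```python
# Hedged sketch: lambda = 0.0 recovers greedy action selection, lambda = 1.0 follows
# the exploration direction only. This is an illustration, not the exact DISCOVER rule.
import numpy as np

def regularized_action(actor, exploration_policy, state, lam: float,
                       low: float = -1.0, high: float = 1.0) -> np.ndarray:
    greedy = actor(state)                  # baseline's deterministic action
    direction = exploration_policy(state)  # hypothetical directed perturbation
    return np.clip((1.0 - lam) * greedy + lam * direction, low, high)
```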

Fig. 15

Evaluation curves for the set of MuJoCo and Box2D continuous control tasks under PPO + DISCOVER when \(\lambda = \{0.0, 0.1, 0.3, 0.6, 0.9, 1.0\}\). The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 16

Evaluation curves for the set of MuJoCo and Box2D continuous control tasks under TD3 + DISCOVER when \(\lambda = \{0.0, 0.1, 0.3, 0.6, 0.9, 1.0\}\). The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

1.2.2 Ablation study of DISCOVER

Overall, Fig. 17 demonstrates that all settings exhibit similar performance except in HalfCheetah, where the complete algorithm converges faster to significantly higher evaluation results. This is because the HalfCheetah environment heavily relies on off-policy samples to be solved, as highlighted by Henderson et al. (2018). Therefore, the complete algorithm can benefit further from off-policy learning, as demonstrated in the HalfCheetah environment.

In the rest of the environments, the complete algorithm attains slightly faster convergence to higher returns, from which we conclude that the combination of all components is the most effective setting for improving the baseline’s policy. Thus, the exploration policy of DISCOVER should mimic the baseline’s policy framework to obtain the optimal performance.

Fig. 17

Evaluation curves for the set of MuJoCo continuous control tasks under TD3 + DISCOVER when each of the DISCOVER components is removed. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Saglam, B., Kozat, S.S. Deep intrinsically motivated exploration in continuous control. Mach Learn 112, 4959–4993 (2023). https://doi.org/10.1007/s10994-023-06363-4

