
Deep intrinsically motivated exploration in continuous control

Published in Machine Learning

Abstract

In continuous control, exploration is often performed through undirected strategies in which the parameters of the networks or the selected actions are perturbed by random noise. Although undirected exploration in the deep setting has been shown to improve the performance of on-policy methods, it introduces excessive computational complexity and is known to fail in the off-policy setting. Intrinsically motivated exploration is an effective alternative to undirected strategies, but it is usually studied in discrete action domains. In this paper, we investigate how intrinsic motivation can be effectively combined with deep reinforcement learning in the control of continuous systems to obtain a directed exploratory behavior. We adapt existing theories on animal motivational systems to the reinforcement learning paradigm and introduce a novel and scalable directed exploration strategy. The introduced approach, motivated by the maximization of the value function's error, can benefit from a collected set of experiences by extracting useful information and unifies the intrinsic exploration motivations in the literature under a single exploration objective. An extensive set of empirical studies demonstrates that our framework extends to larger and more diverse state spaces, dramatically improves the baselines, and significantly outperforms the undirected strategies.



Availability of data and materials

Simulators used in the experiments are freely and publicly available online: https://gym.openai.com/.

Code availability

The source code for the research leading to these results is made publicly available at the corresponding author’s GitHub repository: https://github.com/baturaysaglam/DISCOVER.

Change history

  • 04 November 2023

    Red type in figure captions changed to black.

Notes

  1. https://github.com/baturaysaglam/DISCOVER.

  2. https://github.com/Kaixhin/NoisyNet-A3C.

  3. https://github.com/openai/baselines.

  4. https://github.com/sfujim/TD3.

  5. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.

  6. https://github.com/DLR-RM/rl-baselines3-zoo.

  7. https://opentsne.readthedocs.io/en/latest/.


Funding

The authors did not receive funds, grants, or other support from any organization for the submitted work.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception and design. Formal analysis, investigation, and literature search were performed by Baturay Saglam. The first draft of the manuscript was written by Baturay Saglam, and editing was done by Baturay Saglam and Suleyman Serdar Kozat. Revision and supervision were performed by Suleyman Serdar Kozat. All authors commented on previous versions of the manuscript.

Corresponding author

Correspondence to Baturay Saglam.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor: Sicco Verwer

Appendices

Appendix 1: detailed experimental setup

1.1 Software and environment

All networks are trained with PyTorch (version 1.8.1) (Paszke et al., 2019), using default values for all unmentioned hyper-parameters. The performance of all methods is evaluated in the MuJoCo (mujoco-py version 1.50) and Box2D (version 2.3.10) physics engines interfaced by OpenAI Gym (version 0.17.3), using the v3 environment for BipedalWalker and v2 for the rest of the environments. The environment dynamics, state and action spaces, and reward functions are not pre-processed or modified, ensuring easy reproducibility and a fair evaluation against the baseline and competing algorithms. Each environment episode runs for a maximum of 1000 steps or until a terminal condition is encountered. The multi-dimensional action space of all environments is within the range \((-1, 1)\), except for Humanoid, which uses the range \((-0.4, 0.4)\).
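
The sketch below illustrates this setup; the helper name make_env, the seed handling, and the printed checks are illustrative rather than taken from the released code.

```python
# Minimal sketch of the environment setup, assuming gym 0.17.3 with mujoco-py 1.50.
import gym

def make_env(name: str, seed: int) -> gym.Env:
    # BipedalWalker uses the v3 interface; the MuJoCo tasks use v2.
    env = gym.make(name)            # e.g. "BipedalWalker-v3", "Hopper-v2"
    env.seed(seed)                  # gym 0.17.3 still exposes env.seed()
    env.action_space.seed(seed)
    return env

env = make_env("Hopper-v2", seed=0)
print(env.action_space.low, env.action_space.high)  # (-1, 1); Humanoid uses (-0.4, 0.4)
print(env._max_episode_steps)                       # 1000-step episode limit
```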

1.2 Evaluation

All experiments are run for 1 million time steps with evaluations every 1000 time steps, where an evaluation of an agent records the average reward over 10 episodes without exploration noise and network updates. We use a new environment with a fixed seed (the training seed + a constant) for each evaluation to decrease the variation caused by different seeds. Therefore, each evaluation uses the same set of initial start states.

We report the average evaluation return over 10 random seeds for each environment, where the seeds cover the initialization of the behavioral policies, simulators, network parameters, and dependencies. Unless stated otherwise, each agent undergoes one training iteration after each time step. Agents are trained on batches of transitions uniformly sampled from the experience replay. Learning curves are used to show performance and are given as an average over the 10 trials, with a shaded region reflecting half a standard deviation across the trials. The curves are smoothed uniformly over a sliding window of 5 evaluations for visual clarity.
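
As a concrete illustration of this protocol, the following sketch shows how such an evaluation loop might be written; the method agent.select_action and the seed offset constant are placeholders rather than the released implementation.

```python
# Sketch of the evaluation protocol: 10 deterministic episodes in a fresh,
# fixed-seed environment, with no exploration noise and no network updates.
import gym
import numpy as np

def evaluate(agent, env_name: str, training_seed: int,
             eval_episodes: int = 10, seed_offset: int = 100) -> float:
    eval_env = gym.make(env_name)
    eval_env.seed(training_seed + seed_offset)  # same initial start states every evaluation

    returns = []
    for _ in range(eval_episodes):
        state, done, episode_return = eval_env.reset(), False, 0.0
        while not done:
            action = agent.select_action(state)            # deterministic, noise-free action
            state, reward, done, _ = eval_env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return float(np.mean(returns))

def smooth(curve, window: int = 5):
    # Uniform sliding-window smoothing used for the reported learning curves.
    return np.convolve(np.asarray(curve), np.ones(window) / window, mode="valid")
```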

1.3 Visualization of the state visitations

We visualize the states within the collected transitions for greedy action selection and DISCOVER under the TD3 algorithm while learning in the Swimmer environment over 1 million time steps. The results are reported for a single seed. We consider the last 975,000 transitions in the replay buffer, as the first 25,000 are sampled from the environment’s action space. We first separately project the two datasets onto a 4D space through PCA to reduce visual artifacts.

Later, we jointly embed the resulting datasets into a 2D space through the t-SNE implementation of the openTSNE library (Poličar et al., 2019) (see footnote 7). We use a perplexity value of 1396 and the Euclidean metric for measuring distances between states. The t-SNE is run for 1500 iterations. Default values in openTSNE are used for all unmentioned parameters. We split the datasets into three portions of 325,000 samples each and visualize them separately. The PCA projections explain 0.984 and 0.966 of the variance for the greedy and DISCOVER datasets, respectively.
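
The following sketch illustrates this projection pipeline with scikit-learn's PCA and openTSNE; the array names and the concatenation used for the joint embedding are assumptions rather than the exact released procedure.

```python
# Sketch of the 4D PCA + joint 2D t-SNE projection described above.
import numpy as np
from sklearn.decomposition import PCA
from openTSNE import TSNE

def project(states_greedy: np.ndarray, states_discover: np.ndarray):
    # Separate 4D PCA projections to reduce visual artifacts.
    pca_g, pca_d = PCA(n_components=4), PCA(n_components=4)
    low_g = pca_g.fit_transform(states_greedy)
    low_d = pca_d.fit_transform(states_discover)
    print(pca_g.explained_variance_ratio_.sum(),   # 0.984 reported for the greedy dataset
          pca_d.explained_variance_ratio_.sum())   # 0.966 reported for the DISCOVER dataset

    # Joint 2D t-SNE embedding over the concatenated datasets; other
    # openTSNE parameters keep their defaults.
    tsne = TSNE(n_components=2, perplexity=1396, metric="euclidean", n_iter=1500)
    joint = tsne.fit(np.vstack([low_g, low_d]))
    emb_g, emb_d = joint[: len(low_g)], joint[len(low_g):]

    # Split each embedding into three portions of 325,000 samples for plotting.
    return np.array_split(emb_g, 3), np.array_split(emb_d, 3)
```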

1.4 Implementation

Our implementations of A2C and PPO are based on the code from the well-known repository (see footnote 5), following the tuned hyper-parameters for the considered continuous control tasks. For the implementation of TD3, we use the author’s GitHub repository (see footnote 4), which provides the fine-tuned version of the algorithm as well as the DDPG implementation. For SAC, we follow the structure outlined in the original paper.

We implement NoisyNet by adapting the code from the authors’ GitHub repository (see footnote 2) to the baseline actor-critic algorithms. The authors’ OpenAI Baselines implementation (see footnote 3) is used to implement PSNE. Similar to SAC, we refer to the original papers when implementing Deep Coherent Exploration and Meta-Policy Gradient, as the authors did not provide a valid code repository.

1.5 Architecture and hyper-parameter setting

The on-policy methods, A2C and PPO, follow the tuned hyper-parameters for the MuJoCo and Box2D tasks provided by the repository (see footnote 5). Our implementations of the off-policy actor-critic algorithms, DDPG, SAC, and TD3, closely follow the sets of hyper-parameters given in the respective papers. For DDPG and SAC, we use the fine-tuned environment-specific hyper-parameters provided by the RL Baselines3 Zoo (see footnote 6). TD3 uses the fine-tuned parameters provided in the author’s GitHub repository (see footnote 4). Shared, environment-specific, and algorithm-specific hyper-parameters for the off-policy methods are given in Tables 2, 3, and 4. Additionally, Tables 5 and 6 report the shared and algorithm-specific tuned hyper-parameters for the on-policy baselines, respectively. Note that the entropy coefficient used for A2C and PPO is 0, meaning that there is no entropy maximization and hence no inherent exploration; this tuned value corresponds to greedy action selection, which is found to perform best.

For the parameter-space noise algorithms, we initialize the parameter noise at 0.017 and 0.034, which give the best results in practice for the on- and off-policy algorithms, respectively, as suggested by Zhang and Van Hoof (2021). Furthermore, we use \(\beta = 0.01\) for all environments in the Deep Coherent Exploration algorithm and set the mean-squared error threshold in PSNE to 0.1.
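
For reference, the exploration-related settings above can be collected into a small configuration dictionary; the key names below are illustrative and not taken from the released code.

```python
# Exploration hyper-parameters used for the competing methods (illustrative key names).
EXPLORATION_HPARAMS = {
    "param_noise_init": {"on_policy": 0.017, "off_policy": 0.034},  # parameter-space noise (Zhang & Van Hoof, 2021)
    "coherent_exploration_beta": 0.01,  # Deep Coherent Exploration, all environments
    "psne_mse_threshold": 0.1,          # parameter-space noise exploration (PSNE)
    "entropy_coefficient": 0.0,         # A2C / PPO: tuned value, i.e., no entropy bonus
}
```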

The exploration framework of DISCOVER strictly follows the actor framework of the corresponding baseline algorithms. This includes the depth and size of the networks, learning rate, optimizer, nonlinearity, target and behavioral policy update frequencies, target network learning rate, and the number of gradient steps in the updates. Moreover, we still use the exploration policy during the exploration time steps at the beginning of each training run.

Table 2 Shared hyper-parameters of the baseline off-policy actor-critic algorithms
Table 3 Algorithm specific hyper-parameters used for the implementation of the baseline off-policy actor-critic algorithms
Table 4 SAC specific hyper-parameters
Table 5 Shared hyper-parameters of the baseline on-policy actor-critic algorithms
Table 6 Algorithm specific hyper-parameters used for the implementation of the baseline on-policy actor-critic algorithms

Appendix 2: complete experimental results

1.1 Additional evaluation results

Additional evaluation results in the InvertedDoublePendulum, InvertedPendulum, and Reacher environments for A2C, PPO, DDPG, SAC, and TD3 are reported in Figs. 10, 11, 12, 13, and 14, respectively.

Fig. 10

Additional evaluation curves for the set of MuJoCo continuous control tasks under the A2C algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 11

Additional evaluation curves for the set of MuJoCo continuous control tasks under the PPO algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 12

Additional evaluation curves for the set of MuJoCo continuous control tasks under the DDPG algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 13

Additional evaluation curves for the set of MuJoCo continuous control tasks under the SAC algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 14

Additional evaluation curves for the set of MuJoCo continuous control tasks under the TD3 algorithm. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

1.2 Learning curves for the ablation studies

1.2.1 Exploration direction regularization

From Figs. 15 and 16, we observe that an increasing degree of exploration detrimentally degrades the performance of the baseline algorithms. This is an expected result, as highly perturbed actions and state distributions prevent agents from effectively learning from their mistakes. Conversely, insufficient exploration obtains notable but suboptimal performance. In fact, greedy action selection, i.e., \(\lambda = 0.0\), performs second best after DISCOVER, yet converges to a suboptimal policy.

The optimal exploration regularization value is found to be 0.1 and 0.3 for the on- and off-policy settings, respectively. These values enable a faster convergence to the optimal policy (e.g., Humanoid), higher evaluation results (e.g., Swimmer), or both (e.g., HalfCheetah, Hopper). In addition, from Table 1, we find that the last 10 returns for \(\lambda = 0.1\) are higher than those for \(\lambda = 0.3\) in the Hopper-v2 environment under the off-policy setting. However, this is not visible in Fig. 16 due to the sliding window applied to the evaluation results. As the \(\lambda = 0.3\) setting converges faster than \(\lambda = 0.1\), we observe a better overall performance in the plots.

As explained, we perform our ablation studies in environments with different characteristics. Interestingly, we find that \(\lambda = 0.3\) generalizes across all the tested tasks. Hence, we infer that one can tune DISCOVER on a single set of physics dynamics and use the resulting value in different environments.
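
The exact DISCOVER update is specified in the main text and is not reproduced here; purely as an illustration of the role of the regularization coefficient, the sketch below shows one way such a coefficient could blend the greedy action with a directed exploration signal, with exploration_policy as a hypothetical placeholder.

```python
# Hedged sketch: lambda = 0.0 recovers greedy action selection, lambda = 1.0 follows
# the exploration direction only. This is an illustration, not the exact DISCOVER rule.
import numpy as np

def regularized_action(actor, exploration_policy, state, lam: float,
                       low: float = -1.0, high: float = 1.0) -> np.ndarray:
    greedy = actor(state)                  # baseline's deterministic action
    direction = exploration_policy(state)  # hypothetical directed perturbation
    return np.clip((1.0 - lam) * greedy + lam * direction, low, high)
```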

Fig. 15

Evaluation curves for the set of MuJoCo and Box2D continuous control tasks under PPO + DISCOVER when \(\lambda = \{0.0, 0.1, 0.3, 0.6, 0.9, 1.0\}\). The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Fig. 16

Evaluation curves for the set of MuJoCo and Box2D continuous control tasks under TD3 + DISCOVER when \(\lambda = \{0.0, 0.1, 0.3, 0.6, 0.9, 1.0\}\). The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

1.2.2 Ablation study of DISCOVER

Overall, Fig. 17 demonstrates that all settings exhibit similar performance except in HalfCheetah, where the complete algorithm converges faster to significantly higher evaluation results. This is because the HalfCheetah environment heavily relies on off-policy samples to be solved, as highlighted by Henderson et al. (2018). Therefore, the complete algorithm can benefit further from off-policy learning, as demonstrated in the HalfCheetah environment.

In the rest of the environments, the complete algorithm attains slightly faster convergence to higher returns, from which we conclude that the combination of all components is the most effective setting for improving the baseline’s policy. Thus, the exploration policy of DISCOVER should mimic the baseline’s policy framework to obtain the optimal performance.

Fig. 17

Evaluation curves for the set of MuJoCo continuous control tasks under TD3 + DISCOVER when each of the DISCOVER components is removed. The shaded region represents half a standard deviation of the average evaluation return over 10 random seeds. A sliding window of size 5 smoothes curves for visual clarity

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Saglam, B., Kozat, S.S. Deep intrinsically motivated exploration in continuous control. Mach Learn 112, 4959–4993 (2023). https://doi.org/10.1007/s10994-023-06363-4

