Abstract
In continuous control, exploration is often performed through undirected strategies in which the parameters of the networks or the selected actions are perturbed by random noise. Although undirected exploration in the deep setting has been shown to improve the performance of on-policy methods, such strategies introduce excessive computational complexity and are known to fail in the off-policy setting. Intrinsically motivated exploration is an effective alternative to undirected strategies, but it is usually studied in discrete action domains. In this paper, we investigate how intrinsic motivation can effectively be combined with deep reinforcement learning in the control of continuous systems to obtain a directed exploratory behavior. We adapt the existing theories on animal motivational systems into the reinforcement learning paradigm and introduce a novel and scalable directed exploration strategy. The introduced approach, motivated by the maximization of the value function’s error, can benefit from a collected set of experiences by extracting useful information and unifies the intrinsic exploration motivations in the literature under a single exploration objective. An extensive set of empirical studies demonstrates that our framework extends to larger and more diverse state spaces, dramatically improves the baselines, and significantly outperforms the undirected strategies.
Availability of data and materials
Simulators used in the experiments are freely and publicly available online: https://gym.openai.com/.
Code availability
The source code for the research leading to these results is made publicly available at the corresponding author’s GitHub repository: https://github.com/baturaysaglam/DISCOVER.
Change history
04 November 2023
Red type in figure captions changed to black.
References
Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In G. Baldassarre, & M. Mirolli (Eds.), Intrinsically motivated learning in natural and artificial systems (pp. 17–47). Springer. https://doi.org/10.1007/978-3-642-32375-1_2
Barto, A. G., & Simsek, O. (2005). Intrinsic motivation for reinforcement learning systems. In The thirteenth yale workshop on adaptive and learning systems (pp. 113–118).
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279. https://doi.org/10.1613/jair.3912
Bellman, R. (1957). Dynamic programming. Dover Publications.
Berns, G. S., McClure, S. M., Pagnoni, G., & Montague, P. R. (2001). Predictability modulates human brain response to reward. Journal of Neuroscience, 21(8), 2793–2798.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). Openai gym. CoRR. arXiv:1606.01540
Dayan, P. (2002). Motivated reinforcement learning. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (Vol. 14). MIT Press. https://proceedings.neurips.cc/paper/2001/file/051928341be67dcba03f0e04104d9047-Paper.pdf
Dayan, P., & Sejnowski, T. J. (1996). Exploration bonuses and dual control. Machine Learning, 25(1), 5–22. https://doi.org/10.1007/BF00115298
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., & Zhokhov, P. (2017). Openai baselines. https://github.com/openai/baselines. GitHub.
Fortunato, M., Azar, M. G., Piot, B., Menick, J., Hessel, M., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., & Legg, S. (2018). Noisy networks for exploration. In International conference on learning representations. https://openreview.net/forum?id=rywHCPkAW
Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. In J. Dy, & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 1587–1596). PMLR. https://proceedings.mlr.press/v80/fujimoto18a.html
Garris, P. A., Kilpatrick, M., Bunin, M. A., Michael, D., Walker, Q. D., & Wightman, R. M. (1999). Dissociation of dopamine release in the nucleus accumbens from intracranial self-stimulation. Nature, 398(6722), 67–69. https://doi.org/10.1038/18019
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In J. Dy, & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 1861–1870). PMLR. https://proceedings.mlr.press/v80/haarnoja18b.html
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1). https://ojs.aaai.org/index.php/AAAI/article/view/11694
Kearns, M., Mansour, Y., & Ng, A. Y. (2002). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2), 193–208. https://doi.org/10.1023/A:1017932429737
Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2), 209–232. https://doi.org/10.1023/A:1017984413808
Kilpatrick, M. R., Rooney, M. B., Michael, D. J., & Wightman, R. M. (2000). Extracellular dopamine dynamics in rat caudate–putamen during experimenter-delivered and intracranial self-stimulation. Neuroscience, 96(4), 697–706.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR (poster). arXiv:1412.6980
Koenig, S., & Simmons, R. G. (1996). The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22(1), 227–250. https://doi.org/10.1007/BF00114729
Lee, K., Kim, G.-H., Ortega, P., Lee, D. D., & Kim, K.-E. (2019). Bayesian optimistic Kullback–Leibler exploration. Machine Learning, 108(5), 765–783. https://doi.org/10.1007/s10994-018-5767-4
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. ICLR (poster). arXiv:1509.02971
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3), 293–321. https://doi.org/10.1023/A:1022628806385
McClure, S. M., Berns, G. S., & Montague, P. R. (2003). Temporal prediction errors in a passive learning task activate human striatum. Neuron, 38(2), 339–346.
McClure, S. M., Daw, N. D., & Read Montague, P. (2003). A computational substrate for incentive salience. Trends in Neurosciences, 26(8), 423–428. https://doi.org/10.1016/S0166-2236(03)00177-2
Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35(2), 117–154. https://doi.org/10.1023/A:1007541107674
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In M. F. Balcan, & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning (Vol. 48, pp. 1928–1937). PMLR. https://proceedings.mlr.press/v48/mniha16.html
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), 103–130. https://doi.org/10.1007/BF00993104
Nouri, A., & Littman, M. L. (2010). Dimension reduction and its application to model-based exploration in continuous spaces. Machine Learning, 81(1), 85–98. https://doi.org/10.1007/s10994-010-5202-y
O’Doherty, J. P., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference models and reward-related learning in the human brain. Neuron, 38(2), 329–337.
Pagnoni, G., Zink, C. F., Montague, P. R., & Berns, G. S. (2002). Activity in human ventral striatum locked to errors of reward prediction. Nature Neuroscience, 5(2), 97–98.
Parberry, I. (2013). Introduction to game physics with Box2D (1st ed.). CRC Press, Inc.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., & Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems (Vol. 32, pp. 8024–8035). Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., & Andrychowicz, M. (2018). Parameter space noise for exploration. In International conference on learning representations. https://openreview.net/forum?id=ByBAl2eAZ
Poličar, P. G., Stražar, M., & Zupan, B. (2019). openTSNE: A modular Python library for t-SNE dimensionality reduction and embedding. bioRxiv. Retrieved from https://www.biorxiv.org/content/early/2019/08/13/731877
Precup, D., Sutton, R., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th international conference on machine learning.
Raffin, A. (2020). Rl baselines3 zoo. GitHub. https://github.com/DLR-RM/rl-baselines3-zoo
Ryan, R. M., & Deci, E. L. (2000). Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary Educational Psychology, 25(1), 54–67. https://www.sciencedirect.com/science/article/pii/S0361476X99910202.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In Proceedings of the international conference on learning representations (iclr).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347
Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1), 1–27.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599. https://doi.org/10.1126/science.275.5306.1593
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In E. P. Xing, & T. Jebara (Eds.), Proceedings of the 31st international conference on machine learning (Vol. 32, pp. 387–395). PMLR. https://proceedings.mlr.press/v32/silver14.html
Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3), 287–308. https://doi.org/10.1023/A:1007678930559
Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3), 323–339. https://doi.org/10.1023/A:1022680823223
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44. https://doi.org/10.1007/BF00115009
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Thrun, S. B. (1992). Efficient exploration in reinforcement learning (Technical Report No. CMU-CS-92-102). Carnegie Mellon University.
Todorov, E., Erez, T., & Tassa, Y. (2012). Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems (pp. 5026–5033). https://doi.org/10.1109/IROS.2012.6386109
Uhlenbeck, G. E., & Ornstein, L. S. (1930). On the theory of the Brownian motion. Physical Review, 36, 823–841. https://doi.org/10.1103/PhysRev.36.823
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 279–292. https://doi.org/10.1007/BF00992698
Whitehead, S. D. (1991). A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the ninth national conference on artificial intelligence (Vol. 2, pp. 607–613). AAAI Press.
Whitehead, S. D., & Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7(1), 45–83. https://doi.org/10.1023/A:1022619109594
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. https://doi.org/10.1007/BF00992696
Xu, T., Liu, Q., Zhao, L., & Peng, J. (2018). Learning to explore via meta-policy gradient. In J. Dy, & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 5463–5472). https://proceedings.mlr.press/v80/xu18d.html
Zhang, Y., & Van Hoof, H. (2021). Deep coherent exploration for continuous control. In M. Meila, & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning, PMLR (Vol. 139, pp. 12567–12577). https://proceedings.mlr.press/v139/zhang21t.html
Zheng, Z., Oh, J., & Singh, S. (2018). On learning intrinsic rewards for policy gradient methods. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/51de85ddd068f0bc787691d356176df9-Paper.pdf
Funding
The authors did not receive funds, grants, or other support from any organization for the submitted work.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Formal analysis, investigation and literature search was performed by Baturay Saglam. The first draft of the manuscript was written by Baturay Saglam, and editing was done by Baturay Saglam and Suleyman Serdar Kozat. Revision and supervision were performed by Suleyman Serdar Kozat. All authors commented on previous versions of the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editor: Sicco Verwer
Appendices
Appendix 1: detailed experimental setup
1.1 Software and environment
All networks are trained with PyTorch (version 1.8.1) (Paszke et al., 2019), using default values for all unmentioned hyper-parameters. The performance of all methods is evaluated in the MuJoCo (mujoco-py version 1.50) and Box2D (version 2.3.10) physics engines interfaced by OpenAI Gym (version 0.17.3), using the v3 environment for BipedalWalker and v2 for the rest of the environments. The environment dynamics, state and action spaces, and reward functions are not pre-processed or modified, for easy reproducibility and a fair evaluation against the baseline and competing algorithms. Each environment episode runs for a maximum of 1000 steps, unless a terminal condition is encountered earlier. The multi-dimensional action space for all environments is within the range \((-1, 1)\), except for Humanoid, which uses the range \((-0.4, 0.4)\).
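The per-environment action ranges above can be handled with a simple squashing helper. The sketch below is illustrative only; the `scale_action` name and the `ACTION_BOUND` dictionary are our own placeholders, not part of the paper's released code.

```python
import numpy as np

# Hypothetical helper reflecting the action ranges described above:
# every task uses (-1, 1) except Humanoid, which uses (-0.4, 0.4).
ACTION_BOUND = {"Humanoid": 0.4}  # all other environments default to 1.0


def scale_action(raw_action, env_name):
    """Squash an unbounded policy output into the environment's action range.

    tanh maps the raw output into (-1, 1); multiplying by the bound
    maps it into (-bound, bound).
    """
    bound = ACTION_BOUND.get(env_name, 1.0)
    return bound * np.tanh(np.asarray(raw_action, dtype=np.float64))
```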
1.2 Evaluation
All experiments are run for 1 million time steps with evaluations every 1000 time steps, where an evaluation of an agent records the average reward over 10 episodes without exploration noise and network updates. We use a new environment with a fixed seed (the training seed + a constant) for each evaluation to decrease the variation caused by different seeds. Therefore, each evaluation uses the same set of initial start states.
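The seeding protocol above can be sketched as follows. This is a minimal illustration with a toy stand-in environment; the `ToyEnv` class, the `SEED_OFFSET` constant, and the 5-step episode length are hypothetical placeholders, not the paper's actual code.

```python
import random

SEED_OFFSET = 100  # hypothetical constant added to the training seed


class ToyEnv:
    """Stand-in environment: a fixed-length episode with seed-determined rewards."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def reset(self):
        self.t = 0
        return 0.0  # dummy initial state

    def step(self, action):
        self.t += 1
        reward = self.rng.random()
        return 0.0, reward, self.t >= 5  # next state, reward, done


def evaluate(policy, train_seed, episodes=10):
    """Average undiscounted return over `episodes`, without exploration noise.

    A fresh environment is seeded with `train_seed + SEED_OFFSET`, so every
    evaluation of the same training run starts from the same set of initial
    states, reducing the variation caused by differing seeds.
    """
    env = ToyEnv(seed=train_seed + SEED_OFFSET)
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(policy(state))
            total += reward
        returns.append(total)
    return sum(returns) / episodes
```

Because the evaluation environment is re-created with the same derived seed, two evaluations of the same deterministic policy yield identical returns.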
We report the average evaluation return over 10 random seeds for each environment, where the seeds cover the initialization of behavioral policies, simulators, network parameters, and dependencies. Unless stated otherwise, each agent is trained by one training iteration after each time step. Agents are trained on batches of transitions uniformly sampled from the experience replay. Learning curves are used to show performance and are given as the average of 10 trials, with a shaded zone reflecting half a standard deviation across the trials. The curves are smoothed uniformly over a sliding window of 5 evaluations for visual clarity.
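The smoothing and shading described above can be sketched with NumPy; the function names below are our own, and the sliding window follows the stated width of 5 evaluations.

```python
import numpy as np


def smooth(curve, window=5):
    """Uniform sliding-window average, as applied to the learning curves."""
    kernel = np.ones(window) / window
    # 'valid' keeps only fully covered positions, avoiding edge artifacts
    return np.convolve(curve, kernel, mode="valid")


def shaded_band(trials):
    """Mean curve with a half-standard-deviation band across trials."""
    trials = np.asarray(trials)  # shape: (n_trials, n_evaluations)
    mean = trials.mean(axis=0)
    half_std = 0.5 * trials.std(axis=0)
    return mean, mean - half_std, mean + half_std
```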
1.3 Visualization of the state visitations
We visualize the states within the collected transitions for greedy action selection and DISCOVER under the TD3 algorithm while learning in the Swimmer environment over 1 million time steps. The results are reported over a single seed. We consider the last 975,000 transitions in the replay buffer, as the first 25,000 actions are sampled from the environment’s action space. We first separately project the datasets onto a 4D state space through PCA to reduce the visual artifacts.
Then, we jointly embed the resulting datasets into a 2D space through the t-SNE implementation of the openTSNE library (Poličar et al., 2019).Footnote 7 We use a perplexity value of 1396 and the Euclidean metric for measuring the distances between states. The t-SNE is run for 1500 iterations. Default values in openTSNE are used for all unmentioned parameters. We split the datasets into three portions of 325,000 samples each and visualize them separately. The PCA operation yields a proportion of variance explained of 0.984 and 0.966 for the greedy and DISCOVER datasets, respectively.
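The PCA step of this pipeline can be sketched with NumPy as below; `pca_project` is our own illustrative helper, and the subsequent joint 2D embedding would then be produced by openTSNE (with perplexity 1396, the Euclidean metric, and 1500 iterations), which is not shown here.

```python
import numpy as np


def pca_project(states, n_components=4):
    """Project states onto `n_components` principal components.

    Also returns the proportion of variance explained by the retained
    components, the quantity reported for the greedy/DISCOVER datasets.
    """
    centered = states - states.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ Vt[:n_components].T
    explained = (S[:n_components] ** 2).sum() / (S**2).sum()
    return projected, explained
```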
1.4 Implementation
Our implementation of A2C and PPO is based on the code from the well-known repository (see footnote 5), following the tuned hyper-parameters for the considered continuous control tasks. For the implementation of TD3, we use the author’s GitHub repository (see footnote 4) for the fine-tuned version of the algorithm and the DDPG implementation. For SAC, we follow the structure outlined in the original paper.
We implement NoisyNet by adapting the code from the authors’ GitHub repository (see footnote 2) to the baseline actor-critic algorithms. The authors’ OpenAI Baselines implementation (see footnote 3) is used to implement PSNE. Similar to SAC, we refer to the original papers when implementing Deep Coherent Exploration and Meta-Policy Gradient, as the authors did not provide a valid code repository.
1.5 Architecture and hyper-parameter setting
The on-policy methods, A2C and PPO, follow the tuned hyper-parameters for the MuJoCo and Box2D tasks provided by the repository (see footnote 5). Our implementation of the off-policy actor-critic algorithms, DDPG, SAC, and TD3, closely follows the set of hyper-parameters given in the respective papers. For DDPG and SAC, we use the fine-tuned environment-specific hyper-parameters provided by the OpenAI Baselines3 Zoo (see footnote 6). TD3 uses the fine-tuned parameters provided in the author’s GitHub repository (see footnote 4). Shared, environment-specific, and algorithm-specific hyper-parameters for the off-policy methods are given in Tables 2, 3, and 4. Additionally, Tables 5 and 6 report the shared and algorithm-specific tuned hyper-parameters for the on-policy baselines, respectively. Note that the entropy coefficient used for A2C and PPO is 0, meaning that there is no entropy maximization and, hence, no inherent exploration. This tuned value of 0 corresponds to greedy action selection, which is found to perform best.
For the parameter-space noise algorithms, we initialize the parameter noise at 0.017 and 0.034, which give the best results in practice for the on- and off-policy algorithms, respectively, as suggested by Zhang and Van Hoof (2021). Furthermore, we use \(\beta = 0.01\) for all environments in the Deep Coherent Exploration algorithm, and set the mean-squared error threshold in PSNE to 0.1.
The exploration framework of DISCOVER strictly follows the actor framework in the corresponding baseline algorithms. This includes the depth and size of the networks, learning rate, optimizer, nonlinearity, target and behavioral policy update frequencies, target network learning rate, and the number of gradient steps in the updates. Moreover, we still use the exploration policy during the initial exploration time steps of each training run.
Appendix 2: complete experimental results
1.1 Additional evaluation results
Additional evaluation results in the InvertedDoublePendulum, InvertedPendulum, and Reacher environments for A2C, PPO, DDPG, SAC, and TD3 are reported in Figs. 10, 11, 12, 13, and 14, respectively.
1.2 Learning curves for the ablation studies
1.2.1 Exploration direction regularization
From Figs. 15 and 16, we observe that an increasing degree of exploration detrimentally degrades the performance of the baseline algorithms. This is an expected result, as highly perturbed actions and state distributions prevent agents from effectively learning from their mistakes. Conversely, insufficient exploration obtains a notable but suboptimal performance. In fact, greedy action selection, i.e., \(\lambda = 0.0\), performs second best after DISCOVER, yet converges to a suboptimal policy.
The optimal exploration regularization value is found to be 0.1 and 0.3 for the on- and off-policy settings, respectively. These values enable either a faster convergence to the optimal policy, e.g., Humanoid, or higher evaluation results, e.g., Swimmer, or both, e.g., HalfCheetah and Hopper. In addition, from Table 1, we find that the last 10 returns for \(\lambda = 0.1\) are higher than those for \(\lambda = 0.3\) in the Hopper-v2 environment under the off-policy setting. However, this is not visible in Fig. 16 due to the sliding window applied to the evaluation results. As the \(\lambda = 0.3\) setting converges faster than \(\lambda = 0.1\), it exhibits a better overall performance in the plots.
As explained, we perform our ablation studies in environments with different characteristics. Interestingly, we find that \(\lambda = 0.3\) generalizes across all the tested tasks. Hence, we infer that one can tune DISCOVER under a single physics dynamics and reuse the resulting value in different environments.
1.2.2 Ablation study of DISCOVER
Overall, Fig. 17 demonstrates that all settings exhibit a similar performance except in HalfCheetah, where the complete algorithm converges faster to significantly higher evaluation results. This is because the HalfCheetah environment heavily relies on off-policy samples to be solved, as highlighted by Henderson et al. (2018). Therefore, the complete algorithm can further benefit from off-policy learning, as demonstrated in the HalfCheetah environment.
In the rest of the environments, the complete algorithm attains a slightly faster convergence to higher returns, from which we conclude that the combination of all components is the most effective setting for improving the baseline’s policy. Thus, the exploration policy of DISCOVER should mimic the baseline’s policy framework to obtain the optimal performance.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Saglam, B., Kozat, S.S. Deep intrinsically motivated exploration in continuous control. Mach Learn 112, 4959–4993 (2023). https://doi.org/10.1007/s10994-023-06363-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-023-06363-4