Abstract
Reinforcement learning (RL) algorithms with deterministic actors (policies) commonly apply noise in the action space for exploration. Such exploration methods are either undirected or require extra knowledge of the environment. To address these fundamental limitations, this paper introduces a parameterized stochastic action-noise policy (a probability distribution) that aligns with the objective of the RL algorithm. This policy is optimized based on the state-action values of predicted future states; consequently, the optimization does not rely on an explicit definition of the reward function, which improves the adaptability of this exploration strategy across environments and algorithms. Moreover, this paper presents a predictive model of the system dynamics (the transition probability) that captures the uncertainty of the environment with a compact design and fewer parameters, significantly reducing model complexity while maintaining the same level of accuracy as current methods. Evaluation and analysis of the proposed method and models demonstrate significant gains in performance and reliability over current methods across various locomotion and control tasks.
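To make the core idea concrete, the following is a minimal sketch, not the paper's implementation, of optimizing a learnable action-noise distribution against the critic's value of a model-predicted future state-action pair. All names (`actor`, `critic`, `dynamics`, `noise_mean`, `noise_log_std`), the toy linear networks, and the single-step update are illustrative assumptions made for this sketch.

```python
import torch

state_dim, action_dim = 8, 2

# Toy linear layers stand in for the paper's actor, critic, and learned
# dynamics networks; names and shapes here are illustrative assumptions.
actor = torch.nn.Linear(state_dim, action_dim)                  # deterministic policy mu(s)
critic = torch.nn.Linear(state_dim + action_dim, 1)             # state-action value Q(s, a)
dynamics = torch.nn.Linear(state_dim + action_dim, state_dim)   # predicted next state s'

# Learnable parameters of the stochastic action-noise distribution.
noise_mean = torch.zeros(action_dim, requires_grad=True)
noise_log_std = torch.zeros(action_dim, requires_grad=True)
optimizer = torch.optim.Adam([noise_mean, noise_log_std], lr=1e-3)

def noisy_action(state):
    # Reparameterized Gaussian sample so gradients reach the noise parameters.
    eps = noise_mean + noise_log_std.exp() * torch.randn(action_dim)
    return actor(state) + eps

state = torch.randn(state_dim)
action = noisy_action(state)

# Score the noise on the critic's value of the *predicted* next state-action
# pair; no explicit reward function appears anywhere in this objective.
next_state = dynamics(torch.cat([state, action]))
next_action = noisy_action(next_state)
loss = -critic(torch.cat([next_state, next_action])).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()  # updates only the noise-distribution parameters
```

Because the objective is expressed entirely through the critic and the learned dynamics model, the noise distribution can be tuned without touching the reward definition, which is what makes the exploration strategy portable across tasks.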
Code Availability
The custom code developed or implemented during the current study is available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the China National Science Foundation under grant 61473155, by the Jiangsu Technology Department under Modern Agriculture grant BE2017301, and by the Six Talent Peaks Project in Jiangsu Province (China) under grant GDZB-039. We thank Aida Nobakht for her insightful comments.
Funding
This work was supported in part by the China National Science Foundation under grant 61473155, by the Jiangsu Technology Department under Modern Agriculture grant BE2017301, and by the Six Talent Peaks Project in Jiangsu Province under grant GDZB-039.
Ethics declarations
Ethics approval
The research meets all standards with regard to the ethics of experimentation and research integrity. The paper has been submitted with full responsibility, and there is no duplicate publication, fraud, or plagiarism.
Consent for Publication
All authors whose names appear on the submission approved the version to be published.
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Availability of Data and Material
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Nobakht, H., Liu, Y. Action space noise optimization as exploration in deterministic policy gradient for locomotion tasks. Appl Intell 52, 14218–14232 (2022). https://doi.org/10.1007/s10489-021-02995-y