Abstract
Reinforcement learning (RL) algorithms with deterministic actors (policies) commonly apply noise in the action space for exploration. Such exploration methods are either undirected or require extra knowledge of the environment. To address these fundamental limitations, this paper introduces a parameterized stochastic action-noise policy (a probability distribution) that aligns with the objective of the RL algorithm. This policy is optimized based on the state-action values of predicted future states; consequently, the optimization does not rely on an explicit definition of the reward function, which improves the adaptability of this exploration strategy across environments and algorithms. Moreover, this paper presents a predictive model of the system dynamics (the transition probability) that captures the uncertainty of the environment with a compact design and fewer parameters, significantly reducing model complexity while maintaining the same level of accuracy as current methods. Evaluation and analysis of the proposed method and models demonstrate significant gains in performance and reliability over current methods across various locomotion and control tasks.
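To make the core idea concrete, the following is a minimal sketch, not the paper's implementation, of optimizing a learnable action-noise distribution against the critic's value of a model-predicted future state-action pair. All names (`actor`, `critic`, `dynamics`, `noise_mean`, `noise_log_std`), the toy linear networks, and the single-step update are illustrative assumptions made for this sketch.

```python
import torch

state_dim, action_dim = 8, 2

# Toy linear layers stand in for the paper's actor, critic, and learned
# dynamics networks; names and shapes here are illustrative assumptions.
actor = torch.nn.Linear(state_dim, action_dim)                  # deterministic policy mu(s)
critic = torch.nn.Linear(state_dim + action_dim, 1)             # state-action value Q(s, a)
dynamics = torch.nn.Linear(state_dim + action_dim, state_dim)   # predicted next state s'

# Learnable parameters of the stochastic action-noise distribution.
noise_mean = torch.zeros(action_dim, requires_grad=True)
noise_log_std = torch.zeros(action_dim, requires_grad=True)
optimizer = torch.optim.Adam([noise_mean, noise_log_std], lr=1e-3)

def noisy_action(state):
    # Reparameterized Gaussian sample so gradients reach the noise parameters.
    eps = noise_mean + noise_log_std.exp() * torch.randn(action_dim)
    return actor(state) + eps

state = torch.randn(state_dim)
action = noisy_action(state)

# Score the noise on the critic's value of the *predicted* next state-action
# pair; no explicit reward function appears anywhere in this objective.
next_state = dynamics(torch.cat([state, action]))
next_action = noisy_action(next_state)
loss = -critic(torch.cat([next_state, next_action])).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()  # updates only the noise-distribution parameters
```

Because the objective is expressed entirely through the critic and the learned dynamics model, the noise distribution can be tuned without touching the reward definition, which is what makes the exploration strategy portable across tasks.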
Code Availability
The custom code developed or implemented during the current study is available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the China National Science Foundation under grant 61473155, by the Jiangsu Technology Department under Modern Agriculture grant BE2017301, and by the Six Talent Peaks Project in Jiangsu Province (China) under grant GDZB-039. We thank Aida Nobakht for her insightful comments.
Funding
This work was supported in part by the China National Science Foundation under grant 61473155, by the Jiangsu Technology Department under Modern Agriculture grant BE2017301, and by the Six Talent Peaks Project in Jiangsu Province under grant GDZB-039.
Ethics declarations
Ethics approval
The research meets all standards with regard to the ethics of experimentation and research integrity. The paper has been submitted with full responsibility, and there is no duplicate publication, fraud, or plagiarism.
Consent for Publication
All authors whose names appear on the submission approved the version to be published.
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Availability of Data and Material
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Nobakht, H., Liu, Y. Action space noise optimization as exploration in deterministic policy gradient for locomotion tasks. Appl Intell 52, 14218–14232 (2022). https://doi.org/10.1007/s10489-021-02995-y