
Action space noise optimization as exploration in deterministic policy gradient for locomotion tasks


Abstract

Reinforcement learning (RL) algorithms with deterministic actors (policies) commonly apply noise to the action space for exploration. These exploration methods are either undirected or require extra knowledge of the environment. To address these fundamental limitations, this paper introduces a parameterized stochastic action-noise policy (as a probability distribution) that aligns with the objective of the RL algorithm. This policy is optimized based on the state-action values of predicted future states. Consequently, the optimization does not rely on an explicit definition of the reward function, which improves the adaptability of this exploration strategy across different environments and algorithms. Moreover, this paper presents a predictive model of the system dynamics (transition probability) that captures the uncertainty of the environment with an optimized design and fewer parameters, significantly reducing model complexity while maintaining the same level of accuracy as current methods. This research evaluates and analyzes the proposed method and models, demonstrating a significant increase in performance and reliability across various locomotion and control tasks compared with current methods.
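To make the exploration scheme concrete, the sketch below illustrates one plausible reading of the abstract: a deterministic actor perturbed by a learned, state-conditioned Gaussian noise distribution whose parameters are tuned to maximize the critic's value at states predicted by a learned dynamics model, so no explicit reward definition is required. This is a minimal, hypothetical sketch under those assumptions, not the authors' implementation; the actor, critic, and dynamics modules, the NoisePolicy class, and all shapes are placeholders.

import torch
import torch.nn as nn

class NoisePolicy(nn.Module):
    """Learnable, state-conditioned Gaussian distribution over action noise."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),
        )

    def forward(self, state):
        mu, log_std = self.net(state).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5.0, 2.0).exp())

def noisy_action(actor, noise_policy, state):
    """Exploration: deterministic action plus a sample of the learned noise."""
    with torch.no_grad():
        return actor(state) + noise_policy(state).sample()

def noise_policy_loss(actor, critic, dynamics, noise_policy, state):
    """Reward-free objective: prefer noise whose predicted next state scores
    highly under the critic. `dynamics(state, action)` is an assumed learned
    transition model; `critic(s, a)` estimates the state-action value."""
    noise = noise_policy(state).rsample()    # reparameterized, differentiable
    action = actor(state) + noise
    next_state = dynamics(state, action)     # predicted future state
    q_next = critic(next_state, actor(next_state))
    return -q_next.mean()                    # maximize predicted value

The reparameterized sample keeps the objective differentiable with respect to the noise-policy parameters, which is one common way to optimize a noise distribution through a value estimate.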


Code Availability

The custom code developed or implemented during the current study is available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the China National Science Foundation under grant 61473155, by the Jiangsu Technology Department under Modern Agriculture grant BE2017301, and by the Six Talent Peaks project in Jiangsu Province (China) under grant GDZB-039. We are grateful to Aida Nobakht for her insightful comments.

Funding

This work was supported in part by the China National Science Foundation under grant 61473155, by the Jiangsu Technology Department under Modern Agriculture grant BE2017301, and by the Six Talent Peaks project in Jiangsu Province under grant GDZB-039.

Author information

Corresponding author

Correspondence to Yong Liu.

Ethics declarations

Ethics approval

The research meets all standards with regard to the ethics of experimentation and research integrity. The paper has been submitted with full responsibility, and there is no duplicate publication, fraud, or plagiarism.

Consent for Publication

All authors whose names appear on the submission approved the version to be published.

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Availability of Data and Material

Data sharing not applicable to this article as no datasets were stored during the current study.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(PDF 128 KB)


About this article


Cite this article

Nobakht, H., Liu, Y. Action space noise optimization as exploration in deterministic policy gradient for locomotion tasks. Appl Intell 52, 14218–14232 (2022). https://doi.org/10.1007/s10489-021-02995-y

