Encyclopedia of Machine Learning and Data Mining

2017 Edition
Editors: Claude Sammut, Geoffrey I. Webb

Policy Gradient Methods

  • Jan Peters
  • J. Andrew Bagnell
Reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7687-1_646


Richard Bellman already suggested that searching in policy space is fundamentally different from value function-based reinforcement learning and frequently advantageous, especially in robotics and other systems with continuous actions. Policy gradient methods optimize in policy space by maximizing the expected reward through direct gradient ascent. We discuss their basics and the most prominent approaches to policy gradient estimation.
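As an illustration of gradient ascent on expected reward, the following is a minimal sketch of a likelihood-ratio policy gradient update in the style of Williams (1992). It is not taken from this entry. The two-armed bandit task, the softmax policy, the running-mean baseline, and all names (`softmax`, `run_policy_gradient`, `mean_rewards`, `step_size`) are assumptions chosen for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_policy_gradient(mean_rewards=(0.2, 0.8), steps=5000,
                        step_size=0.1, seed=0):
    """Likelihood-ratio gradient ascent on a hypothetical bandit task.

    The policy parameters theta are softmax preferences, one per action.
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rng.gauss(mean_rewards[action], 0.1)
        advantage = reward - baseline
        for k in range(len(theta)):
            # For a softmax policy: d/dtheta_k log pi(action) = 1[k == action] - pi(k)
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += step_size * advantage * grad_log
        baseline += (reward - baseline) / t
    return softmax(theta)


if __name__ == "__main__":
    print(run_policy_gradient())
```

Under these assumptions, each step moves the parameters along a sampled estimate of the gradient of expected reward, so the probability of the higher-reward action should grow over time. Subtracting the baseline leaves the expected update unchanged and is a common way to reduce its variance.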


Recommended Reading

  1. Bagnell JA (2004) Learning decisions: robustness, uncertainty, and approximation. Doctoral dissertation, Robotics institute, Carnegie Mellon University, Pittsburgh
  2. Fu MC (2006) Stochastic gradient estimation. In: Henderson SG, Nelson BL (eds) Handbook on operations research and management science: simulation, vol 19. Elsevier, Burlington, pp 575–616
  3. Glynn P (1990) Likelihood ratio gradient estimation for stochastic systems. Commun ACM 33(10):75–84
  4. Lawrence G, Cowan N, Russell S (2003) Efficient gradient estimation for motor control learning. In: Proceedings of the international conference on uncertainty in artificial intelligence (UAI), Acapulco
  5. Peters J, Schaal S (2008) Reinforcement learning of motor skills with policy gradients. Neural Netw 21(4):682–697
  6. Pontryagin LS, Boltyanskii VG, Gamkrelidze RV, Mishchenko E (1962) The mathematical theory of optimal processes. International series of monographs in pure and applied mathematics. Interscience publishers, New York
  7. Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M (2014) Deterministic policy gradient algorithms. In: Proceedings of the 31st international conference on Machine learning (ICML), Beijing
  8. Spall JC (2003) Introduction to stochastic search and optimization: estimation, simulation, and control. Wiley, Hoboken
  9. Sutton RS, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: Solla SA, Leen TK, Mueller KR (eds) Advances in neural information processing systems (NIPS). MIT, Denver
  10. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8:229–256

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Empirical Inference, Max-Planck Institute for Intelligent Systems, Tübingen, Germany
  2. Carnegie Mellon University, Pittsburgh, USA
  3. Intelligent Autonomous Systems, Computer Science Department, Technische Universität Darmstadt, Darmstadt, Germany
  4. Max Planck Institute for Biological Cybernetics, Tübingen, Germany