Encyclopedia of Machine Learning

2010 Edition
| Editors: Claude Sammut, Geoffrey I. Webb

Policy Gradient Methods

  • Jan Peters
  • J. Andrew Bagnell
Reference work entry
DOI: https://doi.org/10.1007/978-0-387-30164-8_640



A policy gradient method is a reinforcement learning approach that directly optimizes a parametrized control policy by a variant of gradient descent. These methods belong to the class of policy search techniques, which maximize the expected return of a policy within a fixed policy class, in contrast with traditional value function approximation approaches, which derive policies from a value function. Policy gradient approaches have several advantages: they allow domain knowledge to be incorporated straightforwardly into the policy parametrization, and an optimal policy is often represented more compactly than the corresponding value function; many such methods are guaranteed to converge to at least a locally optimal policy; and they naturally handle continuous states and actions, and often even imperfect state information. The countervailing drawbacks include difficulties in off-policy settings, the potential for very slow convergence, and high sample complexity.
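As a concrete illustration of direct gradient ascent on a parametrized policy, the following is a minimal sketch of the likelihood-ratio (REINFORCE-style, cf. Williams, 1992) estimator on a hypothetical two-armed bandit. The softmax parametrization, learning rate, and reward structure are illustrative assumptions, not part of the entry:

```python
import math
import random

def softmax(theta):
    """Softmax policy: probability of each action given parameters theta."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """Likelihood-ratio policy gradient on an assumed 2-armed bandit.

    Arm 1 pays reward 1, arm 0 pays 0 (illustrative problem).
    Stochastic gradient ascent on J(theta) = E[r] using the
    estimator  r * grad log pi(a | theta).
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1
        r = 1.0 if a == 1 else 0.0
        # For a softmax policy, d/d theta_k log pi(a) = 1[k == a] - pi(k).
        for k in range(2):
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_log
    return softmax(theta)
```

Under these assumptions the learned policy concentrates on the rewarding arm; in practice a baseline is subtracted from `r` to reduce the variance of the gradient estimate, as discussed in several of the readings below.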


Recommended Reading

  1. Bagnell, J. A. (2004). Learning decisions: Robustness, uncertainty, and approximation. Doctoral dissertation, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.
  2. Fu, M. C. (2006). Stochastic gradient estimation. In Handbook in operations research and management science: Simulation (Vol. 13, Chapter 19, pp. 575–616). Elsevier.
  3. Glynn, P. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10), 75–84.
  4. Hasdorff, L. (1976). Gradient optimization and nonlinear control. New York: John Wiley & Sons.
  5. Jacobson, D. H., & Mayne, D. Q. (1970). Differential dynamic programming. New York: American Elsevier.
  6. Lawrence, G., Cowan, N., & Russell, S. (2003). Efficient gradient estimation for motor control learning. In Proceedings of the international conference on uncertainty in artificial intelligence (UAI), Acapulco, Mexico.
  7. Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.
  8. Spall, J. C. (2003). Introduction to stochastic search and optimization: Estimation, simulation, and control. Hoboken: Wiley.
  9. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, & K.-R. Mueller (Eds.), Advances in neural information processing systems (NIPS), Denver, CO. Cambridge: MIT Press.
  10. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Jan Peters
  • J. Andrew Bagnell
  1. Max Planck Institute for Biological Cybernetics, Tübingen, Germany