Policy Gradients with Parameter-Based Exploration for Control

  • Frank Sehnke
  • Christian Osendorfer
  • Thomas Rückstieß
  • Alex Graves
  • Jan Peters
  • Jürgen Schmidhuber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5163)


We present a model-free reinforcement learning method for partially observable Markov decision problems. Our method estimates a likelihood gradient by sampling directly in parameter space, which leads to lower-variance gradient estimates than those obtained by policy gradient methods such as REINFORCE. For several complex control tasks, including robust standing with a humanoid robot, we show that our method outperforms well-known algorithms from the fields of policy gradients, finite difference methods, and population-based heuristics. We also provide a detailed analysis of the differences between our method and the other algorithms.
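The idea of sampling in parameter space rather than in action space can be illustrated with a minimal sketch. It is not the authors' implementation: the names `pgpe_sketch` and `toy_reward`, the toy reward itself, the step sizes, and the moving-average baseline are all assumptions chosen for illustration. Each episode draws one complete parameter vector from a per-parameter Gaussian, and the observed return then shifts the mean and width of that Gaussian along a likelihood-gradient estimate.

```python
import random


def toy_reward(theta, target=(1.0, -1.0)):
    """Hypothetical bounded episode return in (0, 1], peaking at `target`."""
    dist_sq = sum((t - g) ** 2 for t, g in zip(theta, target))
    return 1.0 / (1.0 + dist_sq)


def pgpe_sketch(reward_fn, dim, iterations=3000,
                alpha_mu=0.2, alpha_sigma=0.1, seed=0):
    """Adapt a Gaussian search distribution over policy parameters.

    All hyperparameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu = [0.0] * dim       # mean of each parameter
    sigma = [1.0] * dim    # per-parameter exploration width
    baseline = None        # moving average of returns, reduces variance

    for _ in range(iterations):
        # Exploration happens here: one full parameter vector per episode,
        # so the controller itself can stay deterministic.
        theta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        r = reward_fn(theta)
        if baseline is None:
            baseline = r
        advantage = r - baseline

        for i in range(dim):
            diff = theta[i] - mu[i]
            s = sigma[i]
            # Likelihood-gradient style updates for a Gaussian's mean and width.
            mu[i] += alpha_mu * advantage * diff
            new_s = s + alpha_sigma * advantage * (diff * diff - s * s) / s
            sigma[i] = max(1e-3, new_s)  # keep the width strictly positive

        baseline = 0.9 * baseline + 0.1 * r

    return mu, sigma


mu, sigma = pgpe_sketch(toy_reward, dim=2)
print("final mean:", mu)
print("final sigma:", sigma)
```

In a control task, `toy_reward` would be replaced by a function that runs one episode with a controller using the sampled parameters and returns the total reward of that episode.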


Keywords (added by machine, not by the authors): Reinforcement Learning, Humanoid Robot, Biped Robot, Evolution Strategy, Gaussian Sampling





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Frank Sehnke (1)
  • Christian Osendorfer (1)
  • Thomas Rückstieß (1)
  • Alex Graves (1)
  • Jan Peters (3)
  • Jürgen Schmidhuber (1, 2)

  1. Faculty of Computer Science, Technische Universität München, Germany
  2. IDSIA, Manno-Lugano, Switzerland
  3. Max-Planck Institute for Biological Cybernetics, Tübingen, Germany
