Policy Learning – A Unified Perspective with Applications in Robotics

  • Jan Peters
  • Jens Kober
  • Duy Nguyen-Tuong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5323)


Policy learning approaches are among the methods best suited to high-dimensional, continuous control systems such as anthropomorphic robot arms and humanoid robots. In this paper, we make two contributions: first, we present a unified perspective that allows us to derive several policy learning algorithms from a common point of view, i.e., policy gradient algorithms, natural-gradient algorithms, and EM-like policy learning. Second, we present several applications, both to robot motor primitive learning and to robot control in task space. We show results from simulation as well as from several different real robots.
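The policy gradient family named in the abstract can be illustrated with a minimal likelihood-ratio (REINFORCE-style) sketch. The one-step toy task, the Gaussian policy, and all constants below are illustrative assumptions for exposition only, not the paper's robot setup:

```python
import numpy as np

# Hypothetical toy task (not the paper's experiments): a Gaussian policy
# a ~ N(mu, sigma^2) is adjusted so sampled actions maximize r(a) = -(a - 2)^2.
rng = np.random.default_rng(0)
mu, sigma, alpha, target = 0.0, 1.0, 0.05, 2.0

for step in range(200):
    a = rng.normal(mu, sigma, size=100)     # batch of actions sampled from the policy
    r = -(a - target) ** 2                  # reward for each sampled action
    grad_log_pi = (a - mu) / sigma**2       # score function: d/d_mu log N(a; mu, sigma)
    mu += alpha * np.mean(r * grad_log_pi)  # likelihood-ratio ("vanilla") gradient ascent

print(f"learned mean: {mu:.2f}")            # settles near the reward optimum at 2.0
```

Natural-gradient variants such as the natural actor-critic replace this update direction with the gradient preconditioned by the inverse Fisher information of the policy, which typically converges far faster on the high-dimensional robot problems discussed here.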







Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Jan Peters
    • 1
    • 2
  • Jens Kober
    • 1
  • Duy Nguyen-Tuong
    • 1
  1. Max-Planck Institute for Biological Cybernetics, Tübingen
  2. University of Southern California, Los Angeles, USA
