Machine Learning, Volume 84, Issue 1–2, pp 171–203

Policy search for motor primitives in robotics

  • Jens Kober
  • Jan Peters

Abstract

Many motor skills in humanoid robotics can be learned using parametrized motor primitives. While successful applications to date have been achieved with imitation learning, most of the interesting motor learning problems are high-dimensional reinforcement learning problems. These problems are often beyond the reach of current reinforcement learning methods. In this paper, we study parametrized policy search methods and apply these to benchmark problems of motor primitive learning in robotics. We show that many well-known parametrized policy search methods can be derived from a general, common framework. This framework yields both policy gradient methods and expectation-maximization (EM) inspired algorithms. We introduce a novel EM-inspired algorithm for policy learning that is particularly well-suited for dynamical system motor primitives. We compare this algorithm, both in simulation and on a real robot, to several well-known parametrized policy search methods such as episodic REINFORCE, ‘Vanilla’ Policy Gradients with optimal baselines, episodic Natural Actor Critic, and episodic Reward-Weighted Regression. We show that the proposed method outperforms them on an empirical benchmark of learning dynamical system motor primitives both in simulation and on a real robot. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task on a real Barrett WAM™ robot arm.
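
The following minimal Python sketch illustrates the reward-weighted, EM-inspired update idea summarized above; it is not the paper's exact algorithm. The toy quadratic return, the exponential transformation of returns into weights, and all parameter names are illustrative assumptions.

    import numpy as np

    # Sketch of an EM-inspired, reward-weighted policy update for an episodic
    # task. The toy "rollout" objective, the exponential reward transformation,
    # and all names below are illustrative assumptions, not the paper's method.

    rng = np.random.default_rng(0)

    def rollout_return(theta):
        """Toy episodic return: higher is better, maximized at theta_opt."""
        theta_opt = np.array([1.0, -0.5, 2.0])
        return -np.sum((theta - theta_opt) ** 2)

    theta = np.zeros(3)      # policy (motor-primitive) parameters
    sigma = 0.5              # exploration std. dev. in parameter space
    n_rollouts = 20          # rollouts per update

    for iteration in range(100):
        # Sample exploration in parameter space and evaluate each perturbed policy.
        eps = sigma * rng.standard_normal((n_rollouts, theta.size))
        returns = np.array([rollout_return(theta + e) for e in eps])
        # Turn returns into non-negative weights (improper probabilities);
        # the exponential/temperature choice here is an assumption.
        w = np.exp(returns - returns.max())
        # Reward-weighted update: move towards the exploration that paid off.
        theta = theta + (w @ eps) / (w.sum() + 1e-12)

    print("learned parameters:", theta)

Because each update is a weighted average of the sampled exploration rather than a gradient step, no hand-tuned learning rate is needed, which is one practical reason such EM-inspired updates suit episodic motor-primitive learning.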

Keywords

Motor primitives · Episodic reinforcement learning · Motor control · Policy learning

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. Dept. Empirical Inference, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
