Autonomous Robots

Volume 27, Issue 1, pp. 25–53

Learning to search: Functional gradient techniques for imitation learning

  • Nathan D. Ratliff
  • David Silver
  • J. Andrew Bagnell

Abstract

Programming robot behavior remains a challenging task. While it is often easy to abstractly define or even demonstrate a desired behavior, designing a controller that embodies the same behavior is difficult, time-consuming, and ultimately expensive. The machine learning paradigm offers the promise of enabling “programming by demonstration” for developing high-performance robotic systems. Unfortunately, many “behavioral cloning” (Bain and Sammut in Machine intelligence agents. London: Oxford University Press, 1995; Pomerleau in Advances in neural information processing systems 1, 1989; LeCun et al. in Advances in neural information processing systems 18, 2006) approaches that utilize classical tools of supervised learning (e.g. decision trees, neural networks, or support vector machines) do not fit the needs of modern robotic systems. These systems are often built atop sophisticated planning algorithms that efficiently reason far into the future; consequently, ignoring these planning algorithms in favor of a supervised learning approach often leads to myopic and poor-quality robot performance.

While planning algorithms have shown success in many real-world applications ranging from legged locomotion (Chestnutt et al. in Proceedings of the IEEE-RAS international conference on humanoid robots, 2003) to outdoor unstructured navigation (Kelly et al. in Proceedings of the international symposium on experimental robotics (ISER), 2004; Stentz et al. in AUVSI’s unmanned systems, 2007), such algorithms rely on fully specified cost functions that map sensor readings and environment models to quantifiable costs. Such cost functions are usually manually designed and programmed. Recently, a set of techniques has been developed for learning these functions from expert human demonstration. These algorithms apply an inverse optimal control approach to find a cost function for which planned behavior mimics an expert’s demonstration.
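
To make the setting concrete, here is a minimal sketch of the forward problem that inverse optimal control inverts: a linear function maps per-cell perceptual features to costs, and a planner returns the minimum-cost path. The toy grid world, feature layout, and helper names (`linear_cost`, `dijkstra`) are illustrative assumptions, not the systems described in the paper.

```python
import heapq
import numpy as np

def linear_cost(weights, feature_map):
    """Map per-cell features (H x W x K) to an H x W cost map.

    Each cell's cost is a weighted sum of its perceptual features
    (e.g. slope, vegetation density); inverse optimal control searches
    for weights under which the planner reproduces expert demonstrations.
    """
    return np.tensordot(feature_map, weights, axes=([2], [0]))

def dijkstra(costmap, start, goal):
    """Return a minimum-cost path on a 4-connected grid (goal assumed reachable)."""
    h, w = costmap.shape
    dist, parent = {start: 0.0}, {}
    frontier = [(0.0, start)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u == goal:
            break
        if d > dist[u]:
            continue  # stale queue entry
        r, c = u
        for v in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= v[0] < h and 0 <= v[1] < w:
                nd = d + costmap[v]
                if nd < dist.get(v, float("inf")):
                    dist[v], parent[v] = nd, u
                    heapq.heappush(frontier, (nd, v))
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]
```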

The work we present extends the Maximum Margin Planning (MMP) (Ratliff et al. in Twenty-second international conference on machine learning (ICML06), 2006a) framework to admit learning of more powerful, non-linear cost functions. These algorithms, known collectively as LEARCH (LEArning to seaRCH), are simpler to implement than most existing methods, more efficient than previous attempts at non-linearization (Ratliff et al. in NIPS, 2006b), more naturally satisfy common constraints on the cost function, and better represent our prior beliefs about the function’s form. We derive and discuss the framework both mathematically and intuitively, and demonstrate practical real-world performance with three applied case studies: legged locomotion, grasp planning, and autonomous outdoor unstructured navigation. The latter study includes hundreds of kilometers of autonomous traversal through complex natural environments. These case studies address key challenges in applying the algorithm in practical settings that utilize state-of-the-art planners and that may be constrained by efficiency requirements and imperfect expert demonstration.
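
At a high level, each LEARCH iteration converts the functional gradient of the MMP objective into a supervised learning problem: states visited by the current planned path receive positive targets (their cost should rise) and states on the expert's demonstration receive negative targets (their cost should fall); the fitted regressor is then folded into the cost through an exponentiated update, which keeps costs positive as most planners require. The sketch below is a simplified rendering under those assumptions; `plan` and `fit_regressor` are hypothetical stand-ins (e.g. the Dijkstra planner above and any regression learner), and it omits the loss augmentation and other refinements developed in the paper.

```python
import numpy as np

def learch(features, expert_path, plan, fit_regressor, n_iters=10, eta=0.5):
    """Simplified exponentiated functional gradient loop in the spirit of LEARCH.

    features:      dict mapping each state to its feature vector
    expert_path:   states along the expert's demonstrated path
    plan:          callable taking a per-state cost function, returning a path
    fit_regressor: callable fitting (X, y), returning a feature -> score function
    """
    hypotheses = []  # pairs (step size eta_i, regressor h_i)

    def cost(state):
        # c(f) = exp(sum_i eta_i * h_i(f)) is strictly positive by construction.
        score = sum(eta_i * h(features[state]) for eta_i, h in hypotheses)
        return float(np.exp(score))

    for _ in range(n_iters):
        planned = plan(cost)
        # The functional (sub)gradient, expressed as a regression dataset:
        # +1 targets raise cost along the planned path, -1 targets lower it
        # along the expert path; states on both paths roughly cancel.
        X = [features[s] for s in planned] + [features[s] for s in expert_path]
        y = [1.0] * len(planned) + [-1.0] * len(expert_path)
        hypotheses.append((eta, fit_regressor(np.array(X), np.array(y))))
    return cost
```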

Keywords

Imitation learning · Structured prediction · Subgradient methods · Nonparametric optimization · Functional gradient techniques · Robotics · Planning · Autonomous navigation · Quadrupedal locomotion · Grasping


References

  1. Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML ’04: Proceedings of the twenty-first international conference on machine learning.
  2. Anderson, B. D. O., & Moore, J. B. (1990). Optimal control: Linear quadratic methods. Englewood Cliffs: Prentice Hall.
  3. Argall, B., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems.
  4. Atkeson, C., Schaal, S., & Moore, A. (1995). Locally weighted learning. AI Review.
  5. Bain, M., & Sammut, C. (1995). A framework for behavioral cloning. In Machine intelligence agents. London: Oxford University Press.
  6. Boyd, S., Ghaoui, L. E., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in system and control theory. Society for Industrial and Applied Mathematics (SIAM).
  7. Calinon, S., Guenter, F., & Billard, A. (2007). On learning, representing and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Special issue on robot learning by observation, demonstration and imitation), 37, 286–298.
  8. Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. New York: Cambridge University Press.
  9. Chestnutt, J., Kuffner, J., Nishiwaki, K., & Kagami, S. (2003). Planning biped navigation strategies in complex environments. In Proceedings of the IEEE-RAS international conference on humanoid robots. Karlsruhe, Germany.
  10. Chestnutt, J., Lau, M., Cheng, G., Kuffner, J., Hodgins, J., & Kanade, T. (2005). Footstep planning for the Honda ASIMO humanoid. In Proceedings of the IEEE international conference on robotics and automation.
  11. Donoho, D. L., & Elad, M. (2003). Maximal sparsity representation via l1 minimization. Proceedings of the National Academy of Sciences, 100, 2197–2202.
  12. Ferguson, D., & Stentz, A. (2006). Using interpolation to improve path planning: The field D* algorithm. Journal of Field Robotics, 23, 79–101.
  13. Friedman, J. H. (1999a). Greedy function approximation: A gradient boosting machine. Annals of Statistics.
  14. Gordon, G. (1999). Approximate solutions to Markov decision processes. Doctoral dissertation, Robotics Institute, Carnegie Mellon University.
  15. Hersch, M., Guenter, F., Calinon, S., & Billard, A. (2008). Dynamical system modulation for robot learning via kinesthetic demonstrations. IEEE Transactions on Robotics, 24, 1463–1467.
  16. Jaynes, E. (2003). Probability: The logic of science. Cambridge: Cambridge University Press.
  17. Kalman, R. (1964). When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering, 86, 51–60.
  18. Kelly, A., Amidi, O., Happold, M., Herman, H., Pilarski, T., Rander, P., Stentz, A., Vallidis, N., & Warner, R. (2004). Toward reliable autonomous vehicles operating in challenging environments. In Proceedings of the international symposium on experimental robotics (ISER). Singapore.
  19. Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132.
  20. Kolter, J. Z., Abbeel, P., & Ng, A. Y. (2008). Hierarchical apprenticeship learning with application to quadruped locomotion. In Advances in neural information processing systems (Vol. 20).
  21. Kulesza, A., & Pereira, F. (2008). Structured learning with approximate inference. In Advances in neural information processing systems. Cambridge: MIT Press.
  22. LeCun, Y., Muller, U., Ben, J., Cosatto, E., & Flepp, B. (2006). Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems (Vol. 18). Cambridge: MIT Press.
  23. Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Functional gradient techniques for combining hypotheses. In Advances in large margin classifiers. Cambridge: MIT Press.
  24. Miller, A. T., Knoop, S., Allen, P. K., & Christensen, H. I. (2003). Automatic grasp planning using shape primitives. In Proceedings of the IEEE international conference on robotics and automation.
  25. Munoz, D., Bagnell, J. A. D., Vandapel, N., & Hebert, M. (2009). Contextual classification with functional max-margin Markov networks. In IEEE computer society conference on computer vision and pattern recognition (CVPR).
  26. Munoz, D., Vandapel, N., & Hebert, M. (2008). Directional associative Markov network for 3-D point cloud classification. In Fourth international symposium on 3D data processing, visualization and transmission.
  27. Neu, G., & Szepesvari, C. (2007). Apprenticeship learning using inverse reinforcement learning and gradient methods. In Uncertainty in artificial intelligence (UAI).
  28. Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th international conference on machine learning.
  29. Pomerleau, D. (1989). ALVINN: An autonomous land vehicle in a neural network. In Advances in neural information processing systems (Vol. 1).
  30. Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
  31. Ratliff, N., & Bagnell, J. A. (2009). Functional bundle methods. In The learning workshop. Clearwater Beach, Florida.
  32. Ratliff, N., Bagnell, J. A., & Zinkevich, M. (2006a). Maximum margin planning. In Twenty-second international conference on machine learning (ICML06).
  33. Ratliff, N., Bagnell, J. A., & Zinkevich, M. (2007a). (Online) subgradient methods for structured prediction. In Artificial intelligence and statistics. San Juan, Puerto Rico.
  34. Ratliff, N., Bradley, D., Bagnell, J. A., & Chestnutt, J. (2006b). Boosting structured prediction for imitation learning. In NIPS. Vancouver, B.C.
  35. Ratliff, N., Srinivasa, S., & Bagnell, J. A. (2007b). Imitation learning for locomotion and manipulation. In IEEE-RAS international conference on humanoid robots.
  36. Rifkin, R., Yeo, G., & Poggio, T. (2003). Regularized least squares classification. In Advances in learning theory: Methods, models and applications. Amsterdam: IOS Press.
  37. Rosset, S., Zhu, J., & Hastie, T. (2004). Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5, 941–973.
  38. Schaal, S., & Atkeson, C. (1994). Robot juggling: An implementation of memory-based learning. IEEE Control Systems Magazine, 14.
  39. Shor, N. Z. (1985). Minimization methods for non-differentiable functions. Berlin: Springer.
  40. Silver, D., Bagnell, J. A., & Stentz, A. (2008). High performance outdoor navigation from overhead data using imitation learning. In Proceedings of Robotics: Science and Systems.
  41. Silver, D., Sofman, B., Vandapel, N., Bagnell, J. A., & Stentz, A. (2006). Experimental analysis of overhead data processing to support long range navigation. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems.
  42. Stentz, A., Bares, J., Pilarski, T., & Stager, D. (2007). The Crusher system for autonomous navigation. In AUVSI’s unmanned systems.
  43. Taskar, B., Chatalbashev, V., Guestrin, C., & Koller, D. (2005). Learning structured prediction models: A large margin approach. In Twenty-second international conference on machine learning (ICML05).
  44. Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. In Advances in neural information processing systems (NIPS-14).
  45. Taskar, B., Lacoste-Julien, S., & Jordan, M. (2006). Structured prediction via the extragradient method. In Advances in neural information processing systems (Vol. 18). Cambridge: MIT Press.
  46. Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50, 2231–2242.
  47. Vandapel, N., Donamukkala, R. R., & Hebert, M. (2003). Quality assessment of traversability maps from aerial lidar data for an unmanned ground vehicle. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems.
  48. Ziebart, B., Bagnell, J. A., Maas, A., & Dey, A. (2008). Maximum entropy inverse reinforcement learning. In Twenty-third AAAI conference.
  49. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the twentieth international conference on machine learning.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Nathan D. Ratliff (1)
  • David Silver (1)
  • J. Andrew Bagnell (2)

  1. Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
  2. Robotics Institute and Machine Learning, Carnegie Mellon University, Pittsburgh, USA
