Machine Learning, Volume 90, Issue 3, pp. 385–429

TEXPLORE: real-time sample-efficient reinforcement learning for robots

Abstract

The use of robots in society could be expanded by using reinforcement learning (RL) to allow robots to learn and adapt to new situations online. RL is a paradigm for learning sequential decision making tasks, usually formulated as a Markov Decision Process (MDP). For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-time. In addition, the algorithm must learn efficiently in the face of noise, sensor/actuator delays, and continuous state features. In this article, we present TEXPLORE, the first algorithm to address all of these challenges together. TEXPLORE is a model-based RL method that learns a random forest model of the domain, which generalizes dynamics to unseen states. The agent explores states that are promising for the final policy, while ignoring states that do not appear promising. With sample-based planning and a novel parallel architecture, TEXPLORE can select actions continually in real-time whenever necessary. We empirically evaluate the importance of each component of TEXPLORE in isolation and then demonstrate the complete algorithm learning to control the velocity of an autonomous vehicle in real-time.
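As a rough illustration of the model-based loop summarized above, the sketch below (in Python, assuming scikit-learn's RandomForestRegressor) fits a random-forest model of the transition dynamics and reward from observed transitions, then selects actions by sampling rollouts through the learned model. It is a simplified, hypothetical sketch only: the names ForestModel and plan_by_rollouts are illustrative, and the targeted exploration and parallel real-time architecture that distinguish TEXPLORE are omitted.

```python
# Hypothetical sketch of a model-based RL loop with a random-forest dynamics
# model and sample-based planning. NOT the authors' TEXPLORE implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class ForestModel:
    """Predicts next-state change and reward from (state, action) pairs."""

    def __init__(self, n_trees=10):
        self.dyn = RandomForestRegressor(n_estimators=n_trees)
        self.rew = RandomForestRegressor(n_estimators=n_trees)
        self.X, self.dS, self.R = [], [], []

    def add(self, s, a, s_next, r):
        # Store the transition; predict state *deltas*, which generalize better.
        self.X.append(np.append(s, a))
        self.dS.append(np.asarray(s_next, dtype=float) - np.asarray(s, dtype=float))
        self.R.append(r)

    def fit(self):
        X = np.asarray(self.X)
        self.dyn.fit(X, np.asarray(self.dS))
        self.rew.fit(X, np.asarray(self.R))

    def step(self, s, a):
        # Simulate one step: predicted next state and reward.
        x = np.append(s, a).reshape(1, -1)
        return s + self.dyn.predict(x)[0], float(self.rew.predict(x)[0])

def plan_by_rollouts(model, s, actions, depth=10, n_rollouts=30, gamma=0.95):
    """Pick the action with the best average sampled return under the model."""
    best_a, best_v = None, -np.inf
    for a in actions:
        total = 0.0
        for _ in range(n_rollouts):
            state = np.asarray(s, dtype=float)
            ret, discount, act = 0.0, 1.0, a
            for _ in range(depth):
                state, r = model.step(state, act)
                ret += discount * r
                discount *= gamma
                act = np.random.choice(actions)  # random continuation policy
            total += ret
        if total / n_rollouts > best_v:
            best_a, best_v = a, total / n_rollouts
    return best_a
```

In the full algorithm described in the article, planning is instead performed by a sample-based (UCT-style) planner running in a parallel architecture, so that model learning and planning proceed in background threads while the agent continues to select actions at a fixed control frequency.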

Keywords

Reinforcement learning · Robotics · MDP · Real-time


Copyright information

© The Author(s) 2012

Authors and Affiliations

Todd Hester and Peter Stone, Department of Computer Science, The University of Texas at Austin, Austin, USA
