# TEXPLORE: real-time sample-efficient reinforcement learning for robots

## Abstract

The use of robots in society could be expanded by using reinforcement learning (RL) to allow robots to learn and adapt to new situations online. RL is a paradigm for learning sequential decision making tasks, usually formulated as a Markov Decision Process (MDP). For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-time. In addition, the algorithm must learn efficiently in the face of noise, sensor/actuator delays, and continuous state features. In this article, we present TEXPLORE, the first algorithm to address all of these challenges together. TEXPLORE is a model-based RL method that learns a random forest model of the domain which generalizes dynamics to unseen states. The agent explores states that are promising for the final policy, while ignoring states that do not appear promising. With sample-based planning and a novel parallel architecture, TEXPLORE can select actions continually in real-time whenever necessary. We empirically evaluate the importance of each component of TEXPLORE in isolation and then demonstrate the complete algorithm learning to control the velocity of an autonomous vehicle in real-time.
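The abstract's central architectural idea is to decouple action selection from model learning and planning, so that an action is always available in real time even while learning is in progress. The following is a minimal, hypothetical sketch of that decoupling, not the paper's implementation: the class and method names are invented for illustration, the "model" is a toy value table rather than the random forest described above, and the planner is a trivial greedy update rather than sample-based planning.

```python
import threading
import queue
import random


class ParallelRLAgent:
    """Hypothetical sketch of a real-time model-based RL architecture:
    action selection never blocks, while model updates and planning
    run in a background thread fed by a thread-safe experience queue."""

    def __init__(self, actions):
        self.actions = actions
        self.experience = queue.Queue()   # (s, a, r, s') tuples awaiting learning
        self.policy = {}                  # state -> best known action
        self.policy_lock = threading.Lock()
        self.stop = threading.Event()

    def model_and_plan_loop(self):
        # Background thread: fold new experience into a (toy) model and
        # refresh the policy. A real agent would update a random-forest
        # dynamics model and run sample-based planning here instead.
        values = {}
        while not self.stop.is_set():
            try:
                s, a, r, s2 = self.experience.get(timeout=0.01)
            except queue.Empty:
                continue
            old = values.get((s, a), 0.0)
            values[(s, a)] = old + 0.1 * (r - old)
            best = max(self.actions, key=lambda act: values.get((s, act), 0.0))
            with self.policy_lock:
                self.policy[s] = best

    def act(self, state):
        # Foreground call: returns immediately, falling back to a random
        # action for states the planner has not yet covered.
        with self.policy_lock:
            return self.policy.get(state, random.choice(self.actions))

    def observe(self, s, a, r, s2):
        # Called by the control loop after each step; never blocks on learning.
        self.experience.put((s, a, r, s2))
```

The design choice this illustrates is the one the abstract claims: because `act` only reads the latest policy under a lock, the control loop can run at a fixed real-time frequency regardless of how long model learning or planning takes.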

### Keywords

Reinforcement learning · Robotics · MDP · Real-time