Reducing reinforcement learning to KWIK online regression

Abstract

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large Markov decision processes (MDPs), where compact function approximation must be used. This paper introduces REKWIRE, a provably efficient, model-free algorithm for finite-horizon RL problems with value function approximation (VFA) that addresses the exploration-exploitation tradeoff in a principled way. The crucial element of the algorithm is a reduction of RL to online regression in the recently proposed KWIK learning model. We show that, if the KWIK online regression problem can be solved efficiently, then the sample complexity of exploration of REKWIRE is polynomial. The reduction therefore suggests a new and sound direction for tackling general RL problems. The efficiency of the algorithm is verified in a set of proof-of-concept experiments where popular, ad hoc exploration approaches fail.
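
To make the abstract's central contract concrete, the sketch below illustrates the KWIK ("Knows What It Knows") online regression protocol that the reduction targets: on each input, the learner must either commit to a prediction that is accurate to within epsilon or answer "I don't know," after which the true label is revealed. The names used here (KWIKRegressor, predict, update, optimistic_q), the nearest-match admission rule, and the optimistic default are hypothetical illustrations for exposition, not REKWIRE's actual interface.

from typing import List, Optional, Sequence, Tuple


class KWIKRegressor:
    """A toy KWIK online regressor (hypothetical interface, not the paper's)."""

    def __init__(self, epsilon: float):
        self.epsilon = epsilon  # required accuracy of any committed prediction
        self.data: List[Tuple[List[float], float]] = []  # observed (input, label) pairs

    def predict(self, x: Sequence[float]) -> Optional[float]:
        """Return a prediction accurate to within epsilon, or None for "I don't know"."""
        # Toy admission rule: commit only if a nearly identical input was seen before.
        for xi, yi in self.data:
            if max(abs(a - b) for a, b in zip(x, xi)) < 1e-6:
                return yi
        return None  # admit ignorance; the true label will then be revealed

    def update(self, x: Sequence[float], y: float) -> None:
        """Store the revealed label for an input on which None was returned."""
        self.data.append((list(x), y))


def optimistic_q(learner: KWIKRegressor, features: Sequence[float], v_max: float) -> float:
    """Replace "I don't know" with an optimistic upper bound on the value."""
    q = learner.predict(features)
    return v_max if q is None else q

An agent using value function approximation can query such a learner for Q-value estimates and substitute an optimistic upper bound whenever the learner admits ignorance; roughly speaking, this is how accurate KWIK online regression can be turned into efficient, directed exploration.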

Keywords

Reinforcement learning · Exploration · PAC-MDP · Knows What It Knows (KWIK) · Online regression · Value function approximation

Mathematics Subject Classification (2010)

68T05 

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Yahoo! Research, Santa Clara, USA
  2. Rutgers Laboratory for Real-Life Reinforcement Learning (RL3), Department of Computer Science, Rutgers University, Piscataway, USA