Bayesian Reinforcement Learning with Exploration

  • Tor Lattimore
  • Marcus Hutter
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8776)


We consider a general reinforcement learning problem and show that carefully combining the Bayesian optimal policy and an exploring policy leads to minimax sample-complexity bounds in a very general class of (history-based) environments. We also prove lower bounds and show that the new algorithm displays adaptive behaviour when the environment is easier than worst-case.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. Auer, P., Jaksch, T., Ortner, R.: Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 99, 1532–4435 (2010) ISSN 1532-4435Google Scholar
  2. Azar, M.G., Lazaric, A., Brunskill, E.: Regret bounds for reinforcement learning with policy advice. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds.) ECML PKDD 2013, Part I. LNCS, vol. 8188, pp. 97–112. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  3. Bubeck, S., Cesa-Bianchi, N.: Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated (2012) ISBN 9781601986269Google Scholar
  4. Chakraborty, D., Stone, P.: Structure learning in ergodic factored mdps without knowledge of the transition function’s in-degree. In: Proceedings of the Twenty Eighth International Conference on Machine Learning (2011)Google Scholar
  5. Diuk, C., Li, L., Leffler, B.: The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In: Danyluk, A.P., Bottou, L., Littman, M.L. (eds.) Proceedings of the 26th Annual International Conference on Machine Learning, pp. 249–256. ACM (2009)Google Scholar
  6. Dyagilev, K., Mannor, S., Shimkin, N.: Efficient reinforcement learning in parameterized Models: Discrete parameter case. In: Girgin, S., Loth, M., Munos, R., Preux, P., Ryabko, D. (eds.) EWRL 2008. LNCS (LNAI), vol. 5323, pp. 41–54. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  7. Even-Dar, E., Mannor, S., Mansour, Y.: PAC Bounds for Multi-armed Bandit and Markov Decision Processes. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 255–270. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  8. Even-Dar, E., Kakade, S., Mansour, Y.: Reinforcement learning in POMDPs without resets. In: International Joint Conference on Artificial Intelligence, pp. 690–695 (2005)Google Scholar
  9. Hutter, M.: Self-optimizing and Pareto-optimal policies in general environments based on Bayes-mixtures. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 364–379. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  10. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin (2005)Google Scholar
  11. Hutter, M., Muchnik, A.: On semimeasures predicting Martin-Löf random sequences. Theoretical Computer Science 382(3), 247–261 (2007)MathSciNetCrossRefMATHGoogle Scholar
  12. Kearns, M., Singh, S.: Near-optimal reinforcement learning in polynomial time. Machine Learning 49(2-3), 209–232 (2002)CrossRefMATHGoogle Scholar
  13. Lattimore, T., Hutter, M.: PAC bounds for discounted MDPs. In: Bshouty, N.H., Stoltz, G., Vayatis, N., Zeugmann, T. (eds.) ALT 2012. LNCS, vol. 7568, pp. 320–334. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  14. Lattimore, T., Hutter, M.: Bayesian reinforcement learning with exploration. arxiv (2014)Google Scholar
  15. Lattimore, T., Hutter, M., Sunehag, P.: The sample-complexity of general reinforcement learning. In: Proceedings of the 30th International Conference on Machine Learning (2013a)Google Scholar
  16. Lattimore, T., Hutter, M., Sunehag, P.: Concentration and confidence for discrete bayesian sequence predictors. In: Jain, S., Munos, R., Stephan, F., Zeugmann, T. (eds.) ALT 2013. LNCS, vol. 8139, pp. 324–338. Springer, Heidelberg (2013)CrossRefGoogle Scholar
  17. Mannor, S., Tsitsiklis, J.: The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5, 623–648 (2004)MathSciNetMATHGoogle Scholar
  18. Odalric-Ambrym, M., Nguyen, P., Ortner, R., Ryabko, D.: Optimal regret bounds for selecting the state representation in reinforcement learning. In: Proceedings of the Thirtieth International Conference on Machine Learning (2013)Google Scholar
  19. Orseau, L.: Optimality issues of universal greedy agents with static priors. In: Hutter, M., Stephan, F., Vovk, V., Zeugmann, T. (eds.) ALT 2010. LNCS (LNAI), vol. 6331, pp. 345–359. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  20. Osband, I., Russo, D., Van Roy, B.: (More) efficient reinforcement learning via posterior sampling. In: Advances in Neural Information Processing Systems, pp. 3003–3011 (2013)Google Scholar
  21. Sunehag, P., Hutter, M.: Optimistic agents are asymptotically optimal. In: Thielscher, M., Zhang, D. (eds.) AI 2012. LNCS, vol. 7691, pp. 15–26. Springer, Heidelberg (2012)Google Scholar
  22. Szita, I., Szepesvári, C.: Model-based reinforcement learning with nearly tight exploration complexity bounds. In: Proceedings of the 27th International Conference on Machine Learning, pp. 1031–1038. ACM, New York (2010)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Tor Lattimore
    • 1
  • Marcus Hutter
    • 2
  1. 1.University of AlbertaCanada
  2. 2.Australian National UniversityAustralia

Personalised recommendations