
Bayesian optimistic Kullback–Leibler exploration

  • Kanghoon Lee
  • Geon-Hyeong Kim
  • Pedro Ortega
  • Daniel D. Lee
  • Kee-Eung Kim
Part of the following topical collections:
  1. Special Issue of the ACML 2018 Journal Track

Abstract

We consider a Bayesian approach to model-based reinforcement learning, where the agent uses a distribution over environment models to find the action that optimally trades off exploration and exploitation. Unfortunately, computing the Bayes-optimal solution is intractable except in restricted cases. In this paper, we present BOKLE, a simple algorithm that uses the Kullback–Leibler divergence to constrain the set of plausible models guiding exploration. We provide a formal analysis showing that the algorithm is near Bayes-optimal with high probability. We also show an asymptotic relation between the solution pursued by BOKLE and the well-known Bayesian exploration bonus algorithm. Finally, we present experimental results that clearly demonstrate the exploration efficiency of the algorithm.
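As a rough illustration of the idea described above (not the paper's own algorithm), the sketch below shows the generic inner step used by KL-optimistic exploration methods: given a posterior-mean transition vector (e.g., the mean of a Dirichlet posterior) and a value vector, it finds the most optimistic transition distribution inside a KL ball around the posterior mean by bisecting on the dual variable. The function name, the direction of the KL term, and the choice of radius `eps` are assumptions made for illustration; BOKLE's actual constraint may differ in these details.

```python
import numpy as np

def kl_optimistic_backup(p_hat, v, eps, tol=1e-9):
    """Maximize q @ v over distributions q with KL(p_hat || q) <= eps.

    Assumes every entry of p_hat is strictly positive (e.g., the mean of a
    Dirichlet posterior with a positive prior).  The maximizer has the dual
    form q_i proportional to p_hat_i / (nu - v_i) with nu > max(v); the KL
    value decreases as nu grows, so we bisect on nu until the constraint
    is (approximately) tight.
    """
    p_hat = np.asarray(p_hat, dtype=float)
    v = np.asarray(v, dtype=float)
    if eps <= 0 or np.allclose(v, v[0]):
        return p_hat.copy()          # no room (or no reason) for optimism

    def q_of(nu):
        w = p_hat / (nu - v)         # unnormalized optimistic weights
        return w / w.sum()

    def kl_to(q):
        return float(np.sum(p_hat * np.log(p_hat / q)))

    v_max = float(v.max())
    lo, hi = v_max + 1e-12, v_max + 1.0
    while kl_to(q_of(hi)) > eps:     # grow the bracket until feasible
        hi = v_max + 2.0 * (hi - v_max)
    while hi - lo > tol:             # KL(q(nu)) is monotone in nu
        mid = 0.5 * (lo + hi)
        if kl_to(q_of(mid)) > eps:
            lo = mid
        else:
            hi = mid
    return q_of(hi)

# Example: optimistic one-step backup for a single state-action pair,
# using a Dirichlet posterior over next states (counts are hypothetical).
counts = np.array([3.0, 1.0, 1.0])   # Dirichlet posterior parameters
p_hat = counts / counts.sum()        # posterior-mean transition vector
v = np.array([0.0, 0.5, 1.0])        # value estimates of the next states
q_opt = kl_optimistic_backup(p_hat, v, eps=0.1)
optimistic_value = q_opt @ v         # >= p_hat @ v by construction
```

In a full KL-optimistic algorithm, an inner maximization of this kind would typically replace the plain expectation inside each Bellman backup, with the radius shrinking as visitation counts grow, so that optimism fades as the model becomes well estimated.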

Keywords

Model-based Bayesian reinforcement learning · Bayes-adaptive Markov decision process · PAC-BAMDP

Notes

Acknowledgements

This work was supported by the ICT R&D program of MSIT/IITP (No. 2017-0-01778, Development of Explainable Human-level Deep Machine Learning Inference Framework) and was conducted at the High-Speed Vehicle Research Center of KAIST with the support of the Defense Acquisition Program Administration and the Agency for Defense Development under Contract UD170018CD.


Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. School of Computing, KAIST, Daejeon, Republic of Korea
  2. Google UK, London, UK
  3. Cornell Tech, New York, USA
