Applied Intelligence, Volume 39, Issue 2, pp 345–353

Monte-Carlo tree search for Bayesian reinforcement learning

  • Ngo Anh Vien
  • Wolfgang Ertel
  • Viet-Hung Dang
  • TaeChoong Chung


Bayesian model-based reinforcement learning can be formulated as a partially observable Markov decision process (POMDP), which provides a principled framework for optimally balancing exploitation and exploration. A POMDP solver can then be used to solve the problem. If the prior distribution over the environment’s dynamics is a product of Dirichlet distributions, the POMDP’s optimal value function can be represented using a set of multivariate polynomials. Unfortunately, the size of the polynomials grows exponentially with the problem horizon. In this paper, we examine the use of an online Monte-Carlo tree search (MCTS) algorithm designed for large POMDPs to solve the Bayesian reinforcement learning problem online. We show that such an algorithm successfully searches for a near-optimal policy. In addition, we examine the use of a parameter tying method to keep the model search space small, and propose the use of a nested mixture of tied models to increase the robustness of the method when prior information does not allow us to specify the structure of tied models exactly. Experiments show that the proposed methods substantially improve the scalability of current Bayesian reinforcement learning methods.
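
To make the Dirichlet-prior setting concrete, the following is a minimal sketch of the two ingredients a tree search over beliefs would typically need: a count-based Dirichlet posterior over transitions that can be updated and sampled, and a UCT-style action score. It is not the authors' implementation. The class name `DirichletBelief`, the function `ucb1`, the symmetric prior count, and the exploration constant `c` are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


class DirichletBelief:
    """Hypothetical helper: independent Dirichlet posterior per (state, action)."""

    def __init__(self, n_states, prior_count=1.0):
        self.n_states = n_states
        self.prior_count = prior_count  # assumed symmetric prior pseudo-count
        # counts[(s, a)][s_next] stores observed transition counts only.
        self.counts = defaultdict(lambda: [0.0] * n_states)

    def update(self, s, a, s_next):
        # Conjugacy: an observed transition adds one to the matching count.
        self.counts[(s, a)][s_next] += 1.0

    def mean(self, s, a):
        # Posterior mean of the transition probabilities for (s, a).
        alphas = [n + self.prior_count for n in self.counts[(s, a)]]
        total = sum(alphas)
        return [x / total for x in alphas]

    def sample(self, s, a, rng):
        # A Dirichlet draw is a vector of normalised Gamma(alpha_i, 1) draws.
        gammas = [rng.gammavariate(n + self.prior_count, 1.0)
                  for n in self.counts[(s, a)]]
        total = sum(gammas)
        return [g / total for g in gammas]


def ucb1(value, n_parent, n_child, c=1.0):
    """UCT-style score: mean value plus an exploration bonus (c is assumed)."""
    if n_child == 0:
        return math.inf  # unvisited actions are tried first
    return value + c * math.sqrt(math.log(n_parent) / n_child)


if __name__ == "__main__":
    rng = random.Random(0)
    belief = DirichletBelief(n_states=3)
    belief.update(0, 0, 1)
    belief.update(0, 0, 1)
    print(belief.mean(0, 0))         # posterior mean after two observations
    print(belief.sample(0, 0, rng))  # one sampled transition distribution
```

In a search of this kind, each simulation would draw transition probabilities from the current belief, and `ucb1` would pick the action to expand at each tree node. How the paper combines these steps, and how tied parameters share counts, is not specified in the abstract.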


Bayesian reinforcement learning; Model-based reinforcement learning; Monte-Carlo tree search; POMDP



This work was supported by the Collaborative Center of Applied Research on Service Robotics (ZAFH Servicerobotik) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (2010-0012609).



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Ngo Anh Vien (1)
  • Wolfgang Ertel (1)
  • Viet-Hung Dang (2)
  • TaeChoong Chung (3)

  1. Institute of Artificial Intelligence, Ravensburg-Weingarten University of Applied Sciences, Weingarten, Germany
  2. Research and Development Center for Science and Technology, DuyTan University, Da Nang, Vietnam
  3. Department of Computer Engineering, Kyung Hee University, Seoul, South Korea
