Bandit Based Monte-Carlo Planning

  • Levente Kocsis
  • Csaba Szepesvári
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)

Abstract

For Markovian Decision Problems (MDPs) with large state spaces, Monte-Carlo planning is one of the few viable approaches to finding near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. For finite-horizon or discounted MDPs, the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains UCT is significantly more efficient than its alternatives.
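
The abstract does not spell out which bandit rule guides the sampling. As an illustration only, the sketch below shows UCB1 (Auer, Cesa-Bianchi and Fischer, 2002), a standard bandit index of the kind the abstract alludes to, applied to a flat multi-armed bandit rather than to a search tree. The function names `ucb1_select` and `run_bandit`, the Bernoulli reward model, and the default exploration constant are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def ucb1_select(counts, value_sums, c=math.sqrt(2)):
    """Pick the arm with the largest UCB1 index.

    counts[i] is how often arm i was pulled; value_sums[i] is the sum of
    rewards it returned. The constant c is an assumption: c = sqrt(2)
    reproduces the bonus sqrt(2 ln n / n_i) of UCB1.
    """
    # Pull every arm once before the index is defined for it.
    for arm, pulls in enumerate(counts):
        if pulls == 0:
            return arm

    total = sum(counts)

    def index(arm):
        mean = value_sums[arm] / counts[arm]
        bonus = c * math.sqrt(math.log(total) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=index)


def run_bandit(probs, steps, seed=0):
    """Simulate UCB1 on hypothetical Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    value_sums = [0.0] * len(probs)
    for _ in range(steps):
        arm = ucb1_select(counts, value_sums)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        value_sums[arm] += reward
    return counts
```

Calling `run_bandit([0.2, 0.8], 2000)` returns the number of pulls per arm; the better arm should receive most of them, while the exploration bonus keeps the weaker arm from being abandoned entirely. A tree-based planner would apply a rule of this kind at each decision node, which this flat sketch does not attempt.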

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Levente Kocsis (1)
  • Csaba Szepesvári (1)
  1. Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary
