
Machine Learning: ECML 2006

Lecture Notes in Computer Science, vol. 4212, pp. 282-293

Bandit Based Monte-Carlo Planning

  • Levente Kocsis, Carnegie Mellon University; Computer and Automation Research Institute of the Hungarian Academy of Sciences
  • Csaba Szepesvári, Carnegie Mellon University; Computer and Automation Research Institute of the Hungarian Academy of Sciences


Abstract

For large state-space Markovian Decision Problems (MDPs), Monte-Carlo planning is one of the few viable approaches to finding near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains UCT is significantly more efficient than its alternatives.
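The core idea the abstract describes, treating action selection at each internal node of a Monte-Carlo search tree as a bandit problem and applying the UCB1 rule, can be illustrated with a minimal sketch. This is not the authors' code: the `mdp` interface (`actions`, `step`, `is_terminal`), the exploration constant `c` (UCB1's √2 here, standing in for the paper's tuned bias term), and the undiscounted random rollout are all illustrative assumptions.

```python
import math
import random

class Node:
    """One state node in the UCT search tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action       # action that led here from the parent
        self.children = []
        self.untried = []          # actions not yet expanded from this node
        self.visits = 0
        self.total_reward = 0.0

def uct_search(mdp, root_state, n_iterations=10000, c=math.sqrt(2)):
    """Run UCT from a non-terminal root_state; return the most-visited root action."""
    root = Node(root_state)
    root.untried = list(mdp.actions(root_state))
    for _ in range(n_iterations):
        node = root
        # 1. Selection: while fully expanded, descend to the child that
        #    maximizes the UCB1 score (mean reward + exploration bonus).
        while not node.untried and node.children:
            node = max(node.children,
                       key=lambda ch: ch.total_reward / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: grow the tree by one node, if possible.
        if node.untried:
            action = node.untried.pop()
            next_state, _ = mdp.step(node.state, action)
            child = Node(next_state, parent=node, action=action)
            if not mdp.is_terminal(next_state):
                child.untried = list(mdp.actions(next_state))
            node.children.append(child)
            node = child
        # 3. Rollout: simulate a uniformly random trajectory to the end,
        #    accumulating the (undiscounted) return.
        state, ret = node.state, 0.0
        while not mdp.is_terminal(state):
            state, reward = mdp.step(state, random.choice(mdp.actions(state)))
            ret += reward
        # 4. Backup: propagate the sampled return along the selected path.
        while node is not None:
            node.visits += 1
            node.total_reward += ret
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).action
```

Returning the most-visited root action rather than the one with the highest empirical mean is a common robustness choice in UCT implementations, since visit counts are less sensitive to a few lucky rollouts.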