Pure Exploration in Multi-armed Bandits Problems

  • Sébastien Bubeck
  • Rémi Munos
  • Gilles Stoltz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5809)

Abstract

We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case of the cumulative regret, where exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations in which the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration–exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret expressed in terms of the cumulative regret.
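
For concreteness, the two regret notions contrasted in the abstract can be written out as follows; the notation below is a standard convention and an assumption of this note, not taken verbatim from the paper. Consider \(K\) arms with unknown mean rewards \(\mu_1, \dots, \mu_K\) and let \(\mu^* = \max_i \mu_i\). At each round \(t\) the strategy pulls an arm \(I_t\), and after \(n\) rounds it recommends an arm \(J_n\). Then

\[ R_n \;=\; n\,\mu^* \;-\; \mathbb{E}\Bigl[\sum_{t=1}^{n} \mu_{I_t}\Bigr] \qquad \text{(cumulative regret)} \]

\[ r_n \;=\; \mu^* \;-\; \mathbb{E}\bigl[\mu_{J_n}\bigr] \qquad \text{(simple regret).} \]

The cumulative regret penalizes every pull of a suboptimal arm, whereas the simple regret only measures the quality of the final recommendation. The lower bound mentioned above then says, roughly, that a strategy with small cumulative regret (e.g., logarithmic in \(n\)) cannot simultaneously have a simple regret that decreases very fast (e.g., exponentially in \(n\)): the two objectives call for qualitatively different exploration–exploitation trade-offs.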

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Sébastien Bubeck (1)
  • Rémi Munos (1)
  • Gilles Stoltz (2, 3)
  1. INRIA Lille, SequeL Project, France
  2. Ecole normale supérieure, CNRS, Paris, France
  3. HEC Paris, CNRS, Jouy-en-Josas, France
