Finite-time Analysis of the Multiarmed Bandit Problem

  • Published: May 2002
  • Volume 47, pages 235–256 (2002)
  • Peter Auer (1)
  • Nicolò Cesa-Bianchi (2)
  • Paul Fischer (3)

Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and exploiting the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
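For concreteness, here is a minimal sketch of UCB1, the simplest of the index policies the paper proposes: after playing each arm once, always play the arm whose empirical mean reward plus confidence radius sqrt(2 ln n / n_j) is largest, where n is the total number of plays so far and n_j is the number of plays of arm j. The function and variable names, and the Bernoulli example arms, are illustrative rather than taken from the paper.

import math
import random

def ucb1(arms, horizon):
    # arms: zero-argument callables returning rewards in [0, 1]
    # (the paper's bounded-support assumption); horizon: total plays.
    k = len(arms)
    counts = [0] * k    # n_j: number of plays of arm j
    means = [0.0] * k   # empirical mean reward of arm j
    total = 0.0

    # Initialization: play each arm once.
    for j in range(k):
        r = arms[j]()
        counts[j], means[j] = 1, r
        total += r

    # Afterwards, always play the arm with the highest upper confidence index.
    for n in range(k + 1, horizon + 1):
        j = max(range(k),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(n) / counts[i]))
        r = arms[j]()
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]  # incremental mean update
        total += r
    return total

# Illustrative use: two Bernoulli arms with means 0.5 and 0.6.
arms = [lambda p=p: 1.0 if random.random() < p else 0.0 for p in (0.5, 0.6)]
print(ucb1(arms, horizon=10000))

Because the confidence radius shrinks only logarithmically in n, every arm keeps being sampled occasionally, yet suboptimal arms are played only O(ln n) times in expectation; this is the uniform-over-time logarithmic regret claimed in the abstract.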


References

  • Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.

  • Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.

  • Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:2, 122–142.

  • Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209–217).

  • Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.

  • Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.

  • Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83:1, 113–154.

  • Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

  • Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.

  • Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press/Bradford Books.

  • Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.

  • Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.


Author information

Authors and Affiliations

  1. University of Technology Graz, A-8010, Graz, Austria

    Peter Auer

  2. DTI, University of Milan, via Bramante 65, I-26013, Crema, Italy

    Nicolò Cesa-Bianchi

  3. Lehrstuhl Informatik II, Universität Dortmund, D-44221, Dortmund, Germany

    Paul Fischer



About this article

Cite this article

Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47, 235–256 (2002). https://doi.org/10.1023/A:1013689704352



Keywords

  • bandit problems
  • adaptive allocation rules
  • finite horizon regret