
Multi-armed Bandit Problem


Years and Authors of Summarized Original Work

  • 2002; Auer, Cesa-Bianchi, Freund, Schapire

  • 2002; Auer, Cesa-Bianchi, Fischer

Problem Definition

A multi-armed bandit is a sequential decision problem defined on a set of actions. At each time step, the decision maker selects an action from the set and obtains an observable payoff. The goal is to maximize the total payoff obtained in a sequence of decisions. The name bandit refers to the colloquial term for a slot machine ("one-armed bandit" in American slang) and to the decision problem, faced by a casino gambler, of choosing which slot machine to play next. Bandit problems naturally capture the fundamental trade-off between exploration and exploitation in sequential experiments: the decision maker must use a strategy (called an allocation policy) that balances the exploitation of actions that did well in the past with the exploration of actions that might give higher payoffs in the future. Although the original motivation came from...
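
Of the two summarized works, Auer, Cesa-Bianchi, and Fischer (2002) [2] analyze the stochastic setting, in which each arm's payoffs are i.i.d. draws from a fixed unknown distribution, and prove finite-time regret bounds for the UCB1 policy: after playing each arm once, UCB1 always plays the arm maximizing the empirical mean payoff plus a confidence term that shrinks the more often the arm is played. The following is a minimal Python sketch of UCB1; the pull callback and the Bernoulli simulation below are illustrative choices, not part of the original paper.

    import math
    import random

    def ucb1(pull, n_arms, horizon):
        """UCB1 (Auer, Cesa-Bianchi, Fischer 2002): play each arm once,
        then always play the arm with the highest upper confidence bound.
        pull(i) is assumed to return a payoff in [0, 1] for arm i."""
        counts = [0] * n_arms        # number of times each arm was played
        means = [0.0] * n_arms       # empirical mean payoff of each arm
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= n_arms:
                i = t - 1            # initialization: play each arm once
            else:
                # balance exploitation (high mean) with exploration
                # (rarely played arms get a large confidence bonus)
                i = max(range(n_arms),
                        key=lambda a: means[a]
                        + math.sqrt(2 * math.log(t) / counts[a]))
            x = pull(i)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]   # incremental mean update
            total += x
        return total

    # Illustrative run: three Bernoulli arms with unknown success probabilities.
    probs = [0.2, 0.5, 0.7]
    rng = random.Random(0)
    payoff = ucb1(lambda i: float(rng.random() < probs[i]), n_arms=3, horizon=10000)
    print(payoff / 10000)   # average payoff approaches 0.7, the best arm's mean

The second summarized work (Auer, Cesa-Bianchi, Freund, and Schapire 2002) [3] drops the stochastic assumption and analyzes the Exp3 policy, which retains sublinear regret even when payoffs are chosen adversarially.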


Recommended Reading

  1. Arora R, Dekel O, Tewari A (2012) Online bandit learning against an adaptive adversary: from regret to policy regret. In: Proceedings of the 29th international conference on machine learning, Edinburgh

  2. Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Mach Learn 47(2–3):235–256

  3. Auer P, Cesa-Bianchi N, Freund Y, Schapire R (2002) The nonstochastic multiarmed bandit problem. SIAM J Comput 32(1):48–77

  4. Awerbuch B, Kleinberg R (2004) Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In: Proceedings of the 36th annual ACM symposium on theory of computing, Chicago. ACM, pp 45–53

  5. Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pac J Math 6:1–8

  6. Bubeck S, Munos R, Stoltz G (2009) Pure exploration in multi-armed bandits problems. In: Proceedings of the 20th international conference on algorithmic learning theory, Porto

  7. Flaxman AD, Kalai AT, McMahan HB (2005) Online convex optimization in the bandit setting: gradient descent without a gradient. In: Proceedings of the 16th annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 385–394

  8. Gittins J, Glazebrook K, Weber R (2011) Multi-armed bandit allocation indices, 2nd edn. Wiley, Hoboken

  9. Hannan J (1957) Approximation to Bayes risk in repeated play. Contrib Theory Games 3:97–139

  10. Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: Proceedings of the 17th European conference on machine learning, Berlin, pp 282–293

  11. Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv Appl Math 6:4–22

  12. Li L, Chu W, Langford J, Schapire R (2010) A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on world wide web, Raleigh

  13. Robbins H (1952) Some aspects of the sequential design of experiments. Bull Am Math Soc 58:527–535

  14. Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25:285–294

  15. Wang CC, Kulkarni S, Poor H (2005) Bandit problems with side observations. IEEE Trans Autom Control 50(3):338–355


Author information

Correspondence to Nicolò Cesa-Bianchi.


Copyright information

© 2016 Springer Science+Business Media New York

About this entry

Cite this entry

Cesa-Bianchi, N. (2016). Multi-armed Bandit Problem. In: Kao, M.-Y. (ed.) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_768
