Encyclopedia of Algorithms

2016 Edition
Editors: Ming-Yang Kao

Multi-armed Bandit Problem

  • Nicolò Cesa-Bianchi
Reference work entry
DOI: https://doi.org/10.1007/978-1-4939-2864-4_768

Years and Authors of Summarized Original Work

  • 2002; Auer, Cesa-Bianchi, Freund, Schapire

  • 2002; Auer, Cesa-Bianchi, Fischer

Problem Definition

A multi-armed bandit is a sequential decision problem defined on a set of actions. At each time step, the decision maker selects an action from the set and obtains an observable payoff. The goal is to maximize the total payoff obtained over a sequence of decisions. The name "bandit" refers to the colloquial term for a slot machine (“one-armed bandit” in American slang) and to the decision problem, faced by a casino gambler, of choosing which slot machine to play next. Bandit problems naturally address the fundamental trade-off between exploration and exploitation in sequential experiments. Indeed, the decision maker must use a strategy (called an allocation policy) that balances the exploitation of actions that did well in the past with the exploration of actions that might give higher payoffs in the future. Although the original motivation came from...
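
The interaction described above (select an action, observe its payoff, repeat) can be illustrated with a minimal Python sketch. It uses a UCB1-style index policy of the kind analyzed in the Auer, Cesa-Bianchi, Fischer (2002) paper listed under Recommended Reading; it is an illustration under stated assumptions, not the entry's own presentation. The names `ucb1`, `pull`, and `make_bernoulli_bandit`, the restriction to payoffs in [0, 1], and the example arm means are all assumptions introduced here.

```python
import math
import random


def ucb1(pull, num_arms, horizon):
    """Run a UCB1-style allocation policy for `horizon` time steps.

    `pull(arm)` is a hypothetical callback returning a payoff in [0, 1].
    Returns (total_payoff, counts), where counts[i] is the number of
    times arm i was played.
    """
    counts = [0] * num_arms   # how often each arm has been played
    sums = [0.0] * num_arms   # cumulative payoff observed for each arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= num_arms:
            # Initialization: play every arm once.
            arm = t - 1
        else:
            # Exploitation term (empirical mean) plus an exploration
            # bonus that shrinks as an arm is played more often.
            arm = max(
                range(num_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        payoff = pull(arm)
        counts[arm] += 1
        sums[arm] += payoff
        total += payoff
    return total, counts


def make_bernoulli_bandit(means, seed=0):
    """Build a simulated bandit (an assumption for this demo): arm i
    pays 1.0 with probability means[i] and 0.0 otherwise."""
    rng = random.Random(seed)

    def pull(arm):
        return 1.0 if rng.random() < means[arm] else 0.0

    return pull


if __name__ == "__main__":
    # Illustrative arm means; not taken from the entry.
    pull = make_bernoulli_bandit([0.2, 0.5, 0.8])
    total, counts = ucb1(pull, num_arms=3, horizon=5000)
    print("total payoff:", total)
    print("plays per arm:", counts)
```

In this sketch the empirical mean drives exploitation, while the square-root bonus drives exploration: an arm that has been played only a few times keeps a large bonus and is therefore retried occasionally, even if its past payoffs were low.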


Keywords

Adaptive allocation; Regret minimization; Repeated games; Sequential experiment design

Recommended Reading

  1. Arora R, Dekel O, Tewari A (2009) Online bandit learning against an adaptive adversary: from regret to policy regret. In: Proceedings of the 29th international conference on machine learning, Montreal
  2. Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Mach Learn J 47(2–3):235–256
  3. Auer P, Cesa-Bianchi N, Freund Y, Schapire R (2002) The nonstochastic multiarmed bandit problem. SIAM J Comput 32(1):48–77
  4. Awerbuch B, Kleinberg R (2004) Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In: Proceedings of the 36th annual ACM symposium on theory of computing, Chicago. ACM, pp 45–53
  5. Blackwell D (1956) An analog of the minimax theorem for vector payoffs. Pac J Math 6:1–8
  6. Bubeck S, Munos R, Stoltz G (2009) Pure exploration in multi-armed bandits problems. In: Proceedings of the 20th international conference on algorithmic learning theory, Porto
  7. Flaxman AD, Kalai AT, McMahan HB (2005) Online convex optimization in the bandit setting: gradient descent without a gradient. In: Proceedings of the 16th annual ACM-SIAM symposium on discrete algorithms, Philadelphia. Society for Industrial and Applied Mathematics, pp 385–394
  8. Gittins J, Glazebrook K, Weber R (2011) Multi-armed bandit allocation indices, 2nd edn. Wiley, Hoboken
  9. Hannan J (1957) Approximation to Bayes risk in repeated play. Contrib Theory Games 3:97–139
  10. Kocsis L, Szepesvari C (2006) Bandit based Monte-Carlo planning. In: Proceedings of the 15th European conference on machine learning, Vienna, pp 282–293
  11. Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv Appl Math 6:4–22
  12. Li L, Chu W, Langford J, Schapire R (2010) A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on world wide web, Raleigh
  13. Robbins H (1952) Some aspects of the sequential design of experiments. Bull Am Math Soc 58:527–535
  14. Thompson W (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Bull Am Math Soc 25:285–294
  15. Wang CC, Kulkarni S, Poor H (2005) Bandit problems with side observations. IEEE Trans Autom Control 50(3):338–355

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy