Periodica Mathematica Hungarica, Volume 61, Issue 1–2, pp 55–65

UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem

  • Peter Auer
  • Ronald Ortner


Abstract

In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by \( \mathrm{const} \cdot \frac{K \log T}{\Delta} \), where Δ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of \( \mathrm{const} \cdot \frac{K \log(T\Delta^2)}{\Delta} \).
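For context, the index rule underlying the original UCB1 algorithm of Auer et al. [4] can be sketched as below: each arm's index is its empirical mean plus an exploration bonus, and the arm with the largest index is pulled. This is a minimal illustrative sketch, not the paper's modified algorithm; the Bernoulli reward model, function names, and simulation setup are assumptions for the example.

```python
import math
import random

def ucb1(means, horizon, rng):
    """Sketch of UCB1 [4] on Bernoulli arms with the given success
    probabilities, run for `horizon` rounds. Returns per-arm pull counts."""
    k = len(means)
    counts = [0] * k          # number of times each arm was pulled
    sums = [0.0] * k          # cumulative reward of each arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # initialization: pull every arm once
        else:
            # index = empirical mean + sqrt(2 ln t / n_i) exploration bonus
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Example run: with a clear gap Δ = 0.4 the optimal arm dominates the pulls.
counts = ucb1([0.9, 0.5], 2000, random.Random(0))
```

With a large gap Δ the bonus term shrinks quickly for the suboptimal arm, which is why the regret bounds in the abstract scale as 1/Δ times a logarithmic factor in T.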

Key words and phrases

multi-armed bandit problem, regret

Mathematics subject classification numbers

68T05, 62M05, 91A60




References

  [1] Rajeev Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Adv. in Appl. Probab., 27 (1995), 1054–1078.
  [2] Jean-Yves Audibert and Sébastien Bubeck, Minimax policies for adversarial and stochastic bandits, Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), 2009, 217–226.
  [3] Jean-Yves Audibert, Rémi Munos and Csaba Szepesvári, Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Theor. Comput. Sci., 410 (2009), 1876–1902.
  [4] Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer, Finite-time analysis of the multi-armed bandit problem, Mach. Learn., 47 (2002), 235–256.
  [5] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund and Robert E. Schapire, The nonstochastic multiarmed bandit problem, SIAM J. Comput., 32 (2002), 48–77.
  [6] Eyal Even-Dar, Shie Mannor and Yishay Mansour, Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems, J. Mach. Learn. Res., 7 (2006), 1079–1105.
  [7] Wassily Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc., 58 (1963), 13–30.
  [8] Robert D. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, Advances in Neural Information Processing Systems 17, MIT Press, 2005, 697–704.
  [9] Tze Leung Lai and Herbert Robbins, Asymptotically efficient adaptive allocation rules, Adv. in Appl. Math., 6 (1985), 4–22.
  [10] Shie Mannor and John N. Tsitsiklis, The sample complexity of exploration in the multi-armed bandit problem, J. Mach. Learn. Res., 5 (2004), 623–648.
  [11] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2010

Authors and Affiliations

  1. Lehrstuhl für Informationstechnologie, Montanuniversität Leoben, Leoben, Austria
