
UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem

Periodica Mathematica Hungarica

Abstract

In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in \(K\)-armed bandits after \(T\) trials is bounded by const · \( \frac{K \log T}{\Delta} \), where \(\Delta\) is the gap between the expected reward of a suboptimal arm and that of the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · \( \frac{K \log(T\Delta^2)}{\Delta} \).
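
For concreteness, here is a minimal sketch of the original UCB1 index policy of Auer et al. [4], the algorithm the paper modifies: after one initial pull of each arm, it always plays the arm maximizing the empirical mean plus the confidence width \( \sqrt{2 \ln t / n_i} \). The three Bernoulli arms and their means below are illustrative inventions, not taken from the paper, and rewards are assumed to lie in [0, 1].

    import math
    import random

    def ucb1(pull, K, T):
        # UCB1 (Auer et al. [4]): pull each arm once, then always play the
        # arm maximizing empirical mean + sqrt(2 ln t / n_i).
        counts = [0] * K      # number of pulls per arm
        sums = [0.0] * K      # cumulative reward per arm
        total = 0.0
        for t in range(1, T + 1):
            if t <= K:
                i = t - 1     # initialization: try every arm once
            else:
                i = max(range(K),
                        key=lambda a: sums[a] / counts[a]
                        + math.sqrt(2.0 * math.log(t) / counts[a]))
            r = pull(i)       # observe a reward in [0, 1]
            counts[i] += 1
            sums[i] += r
            total += r
        return total

    # Illustrative 3-armed Bernoulli bandit; the means are hypothetical.
    means = [0.5, 0.4, 0.3]
    random.seed(0)
    T = 10000
    reward = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0,
                  K=len(means), T=T)
    print("average reward over", T, "pulls:", round(reward / T, 3))

The regret of this index policy scales as \( K \log T / \Delta \); the modification analyzed in the paper tightens the dependence on \(T\) to \( \log(T\Delta^2) \), a genuine improvement when the gap \(\Delta\) is small.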


References

  1. Rajeev Agrawal, Sample mean based index policies with O(log n) regret for the multi-armed bandit problem, Adv. in Appl. Probab., 27 (1995), 1054–1078.

  2. Jean-Yves Audibert and Sébastien Bubeck, Minimax policies for adversarial and stochastic bandits, Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009), 2009, 217–226.

  3. Jean-Yves Audibert, Rémi Munos and Csaba Szepesvári, Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Theor. Comput. Sci., 410 (2009), 1876–1902.

  4. Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer, Finite-Time Analysis of the Multi-Armed Bandit Problem, Mach. Learn., 47 (2002), 235–256.

  5. Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund and Robert E. Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM J. Comput., 32 (2002), 48–77.

  6. Eyal Even-Dar, Shie Mannor and Yishay Mansour, Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, J. Mach. Learn. Res., 7 (2006), 1079–1105.

  7. Wassily Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc., 58 (1963), 13–30.

  8. Robert D. Kleinberg, Nearly Tight Bounds for the Continuum-Armed Bandit Problem, Advances in Neural Information Processing Systems 17, MIT Press, 2005, 697–704.

  9. Tze Leung Lai and Herbert Robbins, Asymptotically Efficient Adaptive Allocation Rules, Adv. in Appl. Math., 6 (1985), 4–22.

  10. Shie Mannor and John N. Tsitsiklis, The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, J. Mach. Learn. Res., 5 (2004), 623–648.

  11. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

Author information

Correspondence to Peter Auer.

Additional information

Dedicated to Endre Csáki and Pál Révész on the occasion of their 75th birthdays

Cite this article

Auer, P., Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Period Math Hung 61, 55–65 (2010). https://doi.org/10.1007/s10998-010-3055-6
