
Tuning Bandit Algorithms in Stochastic Environments

Conference paper, Algorithmic Learning Theory (ALT 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4754)

Abstract

Algorithms based on upper-confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. In this paper we consider a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental work, such algorithms were found to outperform competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and to give guidelines for tuning the parameters of these algorithms. To this end we analyze the expected regret and, for the first time, the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that except for some very special bandit problems, the regret of upper-confidence-bound algorithms with standard bias sequences concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret, they appear less attractive once the risk of suffering much worse than logarithmic regret is taken into account.
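The abstract describes a UCB-style index that adds the empirical variance of each arm to the confidence bonus. The following is a minimal, illustrative Python sketch of such a variance-aware (Bernstein-type) index; the exploration constants (zeta, c), the reward range b, and the toy Bernoulli arms are assumptions made for illustration, not the paper's exact algorithm or tuning.

```python
import math
import random


def variance_aware_index(mean, var, pulls, t, b=1.0, zeta=1.2, c=1.0):
    """Bernstein-style upper-confidence index using the empirical variance.

    The constants zeta and c are illustrative; the paper studies how such
    parameters should be tuned, and its exact exploration sequence may differ.
    """
    log_term = zeta * math.log(t)
    return (mean
            + math.sqrt(2.0 * var * log_term / pulls)
            + c * 3.0 * b * log_term / pulls)


def run_bandit(arms, horizon, seed=0):
    """Pull each arm once, then always pull the arm with the largest index.

    `arms` is a list of callables taking a random generator and returning a
    reward in [0, 1] (a hypothetical interface chosen for this sketch).
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= k:  # initialisation: one pull per arm
            i = t - 1
        else:
            def index(j):
                m = sums[j] / counts[j]
                v = max(sq_sums[j] / counts[j] - m * m, 0.0)  # empirical variance
                return variance_aware_index(m, v, counts[j], t)
            i = max(range(k), key=index)
        r = arms[i](rng)
        counts[i] += 1
        sums[i] += r
        sq_sums[i] += r * r
        total += r
    return total


if __name__ == "__main__":
    # Two Bernoulli arms with means 0.5 and 0.6 (illustrative only).
    arms = [lambda rng, p=p: 1.0 if rng.random() < p else 0.0 for p in (0.5, 0.6)]
    print(run_bandit(arms, horizon=10_000))
```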





Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Audibert, J.-Y., Munos, R., Szepesvári, C. (2007). Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds) Algorithmic Learning Theory. ALT 2007. Lecture Notes in Computer Science (LNAI), vol 4754. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75225-7_15

  • DOI: https://doi.org/10.1007/978-3-540-75225-7_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75224-0

  • Online ISBN: 978-3-540-75225-7

  • eBook Packages: Computer Science, Computer Science (R0)
