Machine Learning

Volume 47, Issue 2–3, pp 235–256

Finite-time Analysis of the Multiarmed Bandit Problem

  • Peter Auer
  • Nicolò Cesa-Bianchi
  • Paul Fischer

Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

Keywords: bandit problems, adaptive allocation rules, finite horizon regret
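The simple policies the abstract refers to are index rules that add an exploration bonus to each arm's empirical mean reward; the best known of them, UCB1, plays the arm maximizing the empirical mean plus sqrt(2 ln n / n_j) and achieves expected regret of order log n uniformly over time. The following is a minimal sketch of such an index policy; the Bernoulli reward setup, the horizon, and the function names are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def ucb1(reward_fns, horizon):
    """Sketch of a UCB1-style index policy: play the arm maximizing
    empirical mean + sqrt(2 ln n / n_j). Rewards are assumed to lie in [0, 1]."""
    k = len(reward_fns)
    counts = [0] * k      # n_j: number of times arm j has been played
    means = [0.0] * k     # empirical mean reward of arm j
    total_reward = 0.0

    for n in range(1, horizon + 1):
        if n <= k:
            arm = n - 1   # play each arm once to initialize its estimate
        else:
            arm = max(range(k),
                      key=lambda j: means[j] + math.sqrt(2 * math.log(n) / counts[j]))
        r = reward_fns[arm]()
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total_reward += r
    return total_reward, counts

# Illustrative use: three Bernoulli arms with unknown success probabilities.
if __name__ == "__main__":
    arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
    reward, plays = ucb1(arms, horizon=10_000)
    print(reward, plays)   # suboptimal arms should be played only O(log n) times
```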

References

  1. Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
  2. Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.
  3. Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:2, 122–142.
  4. Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209–217).
  5. Gittins, J. (1989). Multi-armed bandit allocation indices, Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.
  6. Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.
  7. Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83:1, 113–154.
  8. Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
  9. Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
  10. Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press/Bradford Books.
  11. Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.
  12. Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Peter Auer (1)
  • Nicolò Cesa-Bianchi (2)
  • Paul Fischer (3)

  1. University of Technology Graz, Graz, Austria
  2. DTI, University of Milan, Crema, Italy
  3. Lehrstuhl Informatik II, Universität Dortmund, Dortmund, Germany
