
Multi-armed Bandit Algorithms and Empirical Evaluation

  • Joannès Vermorel
  • Mehryar Mohri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)

Abstract

The multi-armed bandit problem for a gambler is to decide which arm of a K-slot machine to pull to maximize his total reward in a series of trials. Many real-world learning and optimization problems can be modeled in this way. Several strategies or algorithms have been proposed as a solution to this problem in the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms.
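To make the setting concrete, the following is a minimal sketch of such a K-armed bandit in Python. The Bernoulli reward distributions and the specific arm probabilities are illustrative assumptions only, not the reward models used in the paper's experiments.

```python
# Minimal sketch of a K-armed bandit environment (illustrative only; the paper
# evaluates several reward distributions, not just the Bernoulli case assumed here).
import random


class BernoulliBandit:
    """K arms, each paying reward 1 with a fixed probability unknown to the gambler."""

    def __init__(self, probs):
        self.probs = list(probs)   # true success probability of each arm
        self.k = len(self.probs)

    def pull(self, arm):
        """Return a stochastic reward for the chosen arm."""
        return 1.0 if random.random() < self.probs[arm] else 0.0


# Example: a 5-armed bandit; the gambler must learn which arm is best by trial.
bandit = BernoulliBandit([0.1, 0.3, 0.5, 0.7, 0.2])
print(bandit.pull(3))
```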

This paper provides a preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm, Poker (Price Of Knowledge and Estimated Reward), whose performance compares favorably to that of existing algorithms in several experiments. One remarkable outcome of our experiments is that the most naive approach, the ε-greedy strategy, often proves hard to beat.
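As an illustration of the ε-greedy baseline mentioned above, the sketch below plays a bandit for a fixed horizon, exploring a uniformly random arm with probability ε and otherwise pulling the arm with the highest empirical mean reward. The value of ε, the horizon, and the arm probabilities are arbitrary illustrative choices; the Poker algorithm itself is described in the body of the paper and is not reproduced here.

```python
# Illustrative epsilon-greedy strategy: with probability epsilon explore a random
# arm, otherwise exploit the arm with the highest empirical mean reward.
# Parameter values are illustrative, not the paper's experimental settings.
import random


def epsilon_greedy(probs, epsilon=0.1, horizon=10_000):
    k = len(probs)
    counts = [0] * k      # number of pulls per arm
    means = [0.0] * k     # empirical mean reward per arm
    total = 0.0

    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(k)                    # explore
        else:
            arm = max(range(k), key=lambda a: means[a])  # exploit
        reward = 1.0 if random.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total += reward
    return total


# Total reward collected on a 5-armed Bernoulli bandit (best arm pays 0.7 on average).
print(epsilon_greedy([0.1, 0.3, 0.5, 0.7, 0.2]))
```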

Keywords

Empirical Evaluation · Greedy Strategy · Bandit Problem · Content Distribution Network · Reward Distribution

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Joannès Vermorel (1)
  • Mehryar Mohri (2)
  1. École normale supérieure, Paris, France
  2. Courant Institute of Mathematical Sciences, New York, USA
