Abstract
In the multi-armed bandit problem, a gambler must decide which arm of a K-slot machine to pull to maximize the total reward over a series of trials. Many real-world learning and optimization problems can be modeled this way. Several strategies, or algorithms, have been proposed as solutions to this problem over the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms.
This paper provides a preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm, Poker (Price Of Knowledge and Estimated Reward), whose performance compares favorably to that of existing algorithms in several experiments. One remarkable outcome of our experiments is that the most naive approach, the ε-greedy strategy, often proves hard to beat.
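The ε-greedy strategy mentioned above can be sketched in a few lines: with probability ε the player pulls a uniformly random arm (exploration), and otherwise pulls the arm with the highest empirical mean reward so far (exploitation). The sketch below is illustrative only and is not the paper's implementation; the function and arm names are hypothetical, and the Bernoulli arms at the end are an assumed toy example.

```python
import random

def epsilon_greedy(arms, pulls, epsilon=0.1, seed=0):
    """Play a K-armed bandit for `pulls` rounds with the epsilon-greedy rule.

    `arms` is a list of callables; arms[i](rng) returns one stochastic reward
    for pulling arm i. Returns the total reward and the empirical means.
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k      # number of times each arm was pulled
    means = [0.0] * k     # empirical mean reward of each arm
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            i = rng.randrange(k)                        # explore: random arm
        else:
            i = max(range(k), key=lambda j: means[j])   # exploit: best arm so far
        r = arms[i](rng)                                # observe the reward
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]          # incremental mean update
        total += r
    return total, means

# Toy instance: two Bernoulli arms with success probabilities 0.3 and 0.7.
arms = [lambda rng: 1.0 if rng.random() < 0.3 else 0.0,
        lambda rng: 1.0 if rng.random() < 0.7 else 0.0]
total, means = epsilon_greedy(arms, pulls=10_000)
```

With enough pulls, the empirical mean of the better arm dominates and the strategy concentrates its pulls there, which is why this naive baseline is often hard to beat in practice.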
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Vermorel, J., Mohri, M. (2005). Multi-armed Bandit Algorithms and Empirical Evaluation. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science(), vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_42
DOI: https://doi.org/10.1007/11564096_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3
eBook Packages: Computer Science (R0)