Machine Learning, Volume 85, Issue 3, pp 361–391

An asymptotically optimal policy for finite support models in the multiarmed bandit problem

  • Junya Honda
  • Akimichi Takemura


In the multiarmed bandit problem, the dilemma between exploration and exploitation in reinforcement learning is expressed as a model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret that meets the asymptotic bound for the case of finite support models. Burnetas and Katehakis have already proposed an asymptotically optimal policy in a setting similar to ours; however, we do not assume any knowledge of the support except for its upper and lower bounds. Furthermore, the criterion for choosing an arm, the minimum empirical divergence, can be computed easily by a convex optimization technique. We confirm by simulations that the MED policy achieves good finite-time performance in comparison with other currently popular policies.
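To illustrate the idea behind the MED criterion, consider the special case of Bernoulli ({0, 1}-valued) rewards, where the minimum empirical divergence of an arm reduces to the closed-form binary KL divergence between its empirical mean and the best empirical mean. The sketch below is an illustrative simulation under that assumption, not the paper's implementation; the function names and the simulation harness are our own.

```python
import math
import random

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def med_bernoulli(true_means, horizon, seed=0):
    """Simulate an MED-style policy on Bernoulli arms.

    Each round, arm i with T_i pulls and divergence D_i from the best
    empirical mean is selected with probability proportional to
    exp(-T_i * D_i), so empirically suboptimal arms are pulled with
    exponentially decaying frequency.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    # Pull each arm once to initialize the empirical means.
    for i in range(k):
        sums[i] += 1.0 if rng.random() < true_means[i] else 0.0
        counts[i] += 1
    for _ in range(horizon - k):
        means = [sums[i] / counts[i] for i in range(k)]
        best = max(means)
        # D_i = 0 for the empirically best arm(s), binary KL otherwise.
        div = [bernoulli_kl(m, best) if m < best else 0.0 for m in means]
        weights = [math.exp(-counts[i] * div[i]) for i in range(k)]
        arm = rng.choices(range(k), weights=weights)[0]
        sums[arm] += 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
    return counts

counts = med_bernoulli([0.3, 0.5, 0.7], horizon=2000)
```

With a meaningful gap between arms, the pull counts concentrate rapidly on the arm with the highest mean, reflecting the logarithmic regret growth discussed in the abstract.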


Keywords: Bandit problems, Finite-time regret, MED policy, Convex optimization


References

  1. Agrawal, R. (1995a). The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33, 1926–1951.
  2. Agrawal, R. (1995b). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
  3. Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of COLT 2009. Montreal: Omnipress.
  4. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.
  5. Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.
  6. Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
  7. Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17, 122–142.
  8. Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd edn.). New York: Wiley-Interscience.
  9. Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of COLT 2002 (pp. 255–270). London: Springer.
  10. Fiacco, A. V. (1983). Introduction to sensitivity and stability analysis in nonlinear programming. New York: Academic Press.
  11. Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. Chichester: Wiley.
  12. Honda, J., & Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of COLT 2010, Haifa, Israel (pp. 67–79).
  13. Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83, 113–154.
  14. Katehakis, M. N., & Veinott, A. F. Jr. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268.
  15. Kleinberg, R. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Proceedings of NIPS 2005 (pp. 697–704). New York: MIT Press.
  16. Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
  17. Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: local measures and back-propagation of uncertainty. Machine Learning, 35, 117–154.
  18. Pollard, D. (1984). Convergence of stochastic processes. Springer Series in Statistics. New York: Springer.
  19. Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 527–535.
  20. Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of ICML 2000 (pp. 943–950). San Francisco: Kaufmann.
  21. Vermorel, J., & Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. In Proceedings of ECML 2005, Porto, Portugal (pp. 437–448). Berlin: Springer.
  22. Wyatt, J. (1997). Exploration and inference in learning from reinforcement. Doctoral dissertation, Department of Artificial Intelligence, University of Edinburgh.
  23. Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.

Copyright information

© The Author(s) 2011

Authors and Affiliations

  1. Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa-shi, Chiba, Japan
  2. Graduate School of Information Science and Technology, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
