An asymptotically optimal policy for finite support models in the multiarmed bandit problem

Junya Honda, Graduate School of Frontier Sciences, The University of Tokyo
Akimichi Takemura, Graduate School of Information Science and Technology, The University of Tokyo

Received: 04 June 2009
Accepted: 05 June 2011
First Online: 02 July 2011
DOI: 10.1007/s10994-011-5257-4

Cite this article as: Honda, J. & Takemura, A. Mach Learn (2011) 85: 361. doi:10.1007/s10994-011-5257-4
Abstract
In the multiarmed bandit problem, the dilemma between exploration and exploitation in reinforcement learning is expressed as a model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret which matches the asymptotic lower bound for the case of finite support models. In a setting similar to ours, Burnetas and Katehakis have already proposed an asymptotically optimal policy. However, we do not assume any knowledge of the support except for its upper and lower bounds. Furthermore, the criterion for choosing an arm, the minimum empirical divergence, can be computed easily by a convex optimization technique. We confirm by simulations that the MED policy demonstrates good finite-time performance in comparison to other currently popular policies.
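The minimum empirical divergence of an arm is the smallest KL divergence from its empirical reward distribution to any distribution (on the known support bounds) whose mean reaches the current best empirical mean. For rewards bounded in [0, 1], this quantity admits a one-dimensional concave dual maximization, which is what makes it cheap to compute. The sketch below is an illustration under that [0, 1] assumption, using a plain grid search in place of a proper convex solver; it is not the authors' implementation.

```python
import math

def min_empirical_divergence(support, probs, mu, grid=10000):
    """Minimum KL divergence from the empirical distribution
    (atoms `support` with weights `probs`, all in [0, 1]) to any
    distribution on [0, 1] with mean at least `mu`.

    Computed via the scalar dual form
        max_{0 <= nu <= 1/(1 - mu)}  E[log(1 - nu * (X - mu))],
    a concave 1-D problem, here approximated by grid search.
    """
    if mu >= 1.0:
        return float("inf")
    hi = 1.0 / (1.0 - mu)  # upper end of the dual variable's range
    best = 0.0             # nu = 0 always gives objective value 0
    for k in range(grid + 1):
        nu = hi * k / grid
        val = 0.0
        feasible = True
        for x, p in zip(support, probs):
            arg = 1.0 - nu * (x - mu)
            if arg <= 0.0:   # log undefined; skip this nu
                feasible = False
                break
            val += p * math.log(arg)
        if feasible and val > best:
            best = val
    return best

# Sanity check against a closed form: for a point mass at x < mu,
# the divergence is log((1 - x) / (1 - mu)).
d = min_empirical_divergence([0.2], [1.0], 0.5)
```

Per the paper, the MED policy uses these divergences, scaled by each arm's sample count, to randomize its arm choice, so arms whose empirical distributions are far from plausibly optimal are pulled exponentially rarely.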

Keywords
Bandit problems
Finite-time regret
MED policy
Convex optimization
Editor: Nicolò Cesa-Bianchi.

References
Agrawal, R. (1995a). The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33, 1926–1951.

Agrawal, R. (1995b). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.

Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of COLT 2009. Montreal: Omnipress.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17, 122–142.

Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd edn.). New York: Wiley-Interscience.

Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of COLT 2002 (pp. 255–270). London: Springer.

Fiacco, A. V. (1983). Introduction to sensitivity and stability analysis in nonlinear programming. New York: Academic Press.

Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. Chichester: Wiley.

Honda, J., & Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of COLT 2010, Haifa, Israel (pp. 67–79).

Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83, 113–154.

Katehakis, M. N., & Veinott, A. F. Jr. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268.

Kleinberg, R. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Proceedings of NIPS 2005 (pp. 697–704). Cambridge, MA: MIT Press.

Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: local measures and back-propagation of uncertainty. Machine Learning, 35, 117–154.

Pollard, D. (1984). Convergence of stochastic processes. Springer Series in Statistics. New York: Springer.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 527–535.

Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of ICML 2000 (pp. 943–950). San Francisco: Morgan Kaufmann.

Vermorel, J., & Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. In Proceedings of ECML 2005, Porto, Portugal (pp. 437–448). Berlin: Springer.

Wyatt, J. (1997). Exploration and inference in learning from reinforcement. Doctoral dissertation, Department of Artificial Intelligence, University of Edinburgh.

Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.