Abstract
In the multiarmed bandit problem, the dilemma between exploration and exploitation in reinforcement learning is expressed as the model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret that meets the asymptotic bound for the case of finite support models. In a setting similar to ours, Burnetas and Katehakis (1996) have already proposed an asymptotically optimal policy; however, we do not assume any knowledge of the support except for its upper and lower bounds. Furthermore, the criterion for choosing an arm, the minimum empirical divergence, can be computed easily by a convex optimization technique. We confirm by simulations that the MED policy demonstrates good finite-time performance in comparison to other currently popular policies.
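As a concrete illustration of the abstract's claim that the minimum empirical divergence is easy to compute, the sketch below evaluates K_inf(F, μ) = inf { D(F‖G) : E_G[X] ≥ μ } for a finite-support empirical distribution via a one-dimensional concave dual. The function name, the dual-plus-ternary-search approach, and all parameter names are our own illustrative choices, not the paper's algorithm:

```python
import math

def min_empirical_divergence(support, probs, mu, upper):
    """K_inf(F, mu) = inf { D(F || G) : E_G[X] >= mu } for an empirical
    distribution F with atoms `support` and weights `probs`, given a known
    upper bound `upper` on the reward support (mu < upper assumed).

    Uses the one-dimensional concave dual
        K_inf = max_{0 <= lam <= 1/(upper - mu)} E_F[log(1 - lam * (X - mu))],
    maximized here by ternary search.
    """
    def dual(lam):
        return sum(p * math.log(1.0 - lam * (x - mu))
                   for x, p in zip(support, probs))

    lo, hi = 0.0, (1.0 - 1e-12) / (upper - mu)  # shrink endpoint to keep logs finite
    for _ in range(200):  # interval shrinks geometrically; 200 steps are ample
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):  # concave objective: discard the worse third
            lo = m1
        else:
            hi = m2
    return max(0.0, dual(0.5 * (lo + hi)))

# Sanity check: for F = Bernoulli(p) on {0, 1} and mu > p, K_inf(F, mu)
# reduces to the binary KL divergence kl(p, mu).
p, mu = 0.3, 0.5
kl = p * math.log(p / mu) + (1 - p) * math.log((1 - p) / (1 - mu))
print(abs(min_empirical_divergence([0.0, 1.0], [1 - p, p], mu, 1.0) - kl))
```

Because the dual objective is concave in a single scalar, any bounded scalar maximizer works here; ternary search is used only to keep the sketch dependency-free.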
References
Agrawal, R. (1995a). The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33, 1926–1951.
Agrawal, R. (1995b). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of COLT 2009. Montreal: Omnipress.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17, 122–142.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd edn.). New York: Wiley-Interscience.
Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of COLT 2002 (pp. 255–270). London: Springer.
Fiacco, A. V. (1983). Introduction to sensitivity and stability analysis in nonlinear programming. New York: Academic Press.
Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. Chichester: Wiley.
Honda, J., & Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of COLT 2010, Haifa, Israel (pp. 67–79).
Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83, 113–154.
Katehakis, M. N., & Veinott, A. F. Jr. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268.
Kleinberg, R. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Proceedings of NIPS 2005 (pp. 697–704). New York: MIT Press.
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35, 117–154.
Pollard, D. (1984). Convergence of stochastic processes. Springer Series in Statistics. New York: Springer.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 527–535.
Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of ICML 2000 (pp. 943–950). San Francisco: Kaufmann.
Vermorel, J., & Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. In Proceedings of ECML 2005, Porto, Portugal (pp. 437–448). Berlin: Springer.
Wyatt, J. (1997). Exploration and inference in learning from reinforcement. Doctoral dissertation, Department of Artificial Intelligence, University of Edinburgh.
Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.
Editor: Nicolo Cesa-Bianchi.
Honda, J., Takemura, A. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Mach Learn 85, 361–391 (2011). https://doi.org/10.1007/s10994-011-5257-4