Abstract
In the multiarmed bandit problem, the dilemma between exploration and exploitation in reinforcement learning is expressed as the model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret that meets the asymptotic bound for the case of finite support models. In a setting similar to ours, Burnetas and Katehakis (1996) have already proposed an asymptotically optimal policy; however, we do not assume any knowledge of the support except for its upper and lower bounds. Furthermore, the criterion for choosing an arm, the minimum empirical divergence, can be computed easily by a convex optimization technique. We confirm by simulations that the MED policy demonstrates good finite-time performance in comparison to other currently popular policies.
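As a concrete illustration of the abstract's claim that the minimum empirical divergence is easy to compute, the sketch below evaluates K_inf(F, μ) = inf { D(F‖G) : E_G[X] ≥ μ } for a finite-support empirical distribution via a one-dimensional concave dual. The function name, the dual-plus-ternary-search approach, and all parameter names are our own illustrative choices, not the paper's algorithm:

```python
import math

def min_empirical_divergence(support, probs, mu, upper):
    """K_inf(F, mu) = inf { D(F || G) : E_G[X] >= mu } for an empirical
    distribution F with atoms `support` and weights `probs`, given a known
    upper bound `upper` on the reward support (mu < upper assumed).

    Uses the one-dimensional concave dual
        K_inf = max_{0 <= lam <= 1/(upper - mu)} E_F[log(1 - lam * (X - mu))],
    maximized here by ternary search.
    """
    def dual(lam):
        return sum(p * math.log(1.0 - lam * (x - mu))
                   for x, p in zip(support, probs))

    lo, hi = 0.0, (1.0 - 1e-12) / (upper - mu)  # shrink endpoint to keep logs finite
    for _ in range(200):  # interval shrinks geometrically; 200 steps are ample
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):  # concave objective: discard the worse third
            lo = m1
        else:
            hi = m2
    return max(0.0, dual(0.5 * (lo + hi)))

# Sanity check: for F = Bernoulli(p) on {0, 1} and mu > p, K_inf(F, mu)
# reduces to the binary KL divergence kl(p, mu).
p, mu = 0.3, 0.5
kl = p * math.log(p / mu) + (1 - p) * math.log((1 - p) / (1 - mu))
print(abs(min_empirical_divergence([0.0, 1.0], [1 - p, p], mu, 1.0) - kl))
```

Because the dual objective is concave in a single scalar, any bounded scalar maximizer works here; ternary search is used only to keep the sketch dependency-free.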
References
Agrawal, R. (1995a). The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33, 1926–1951.
Agrawal, R. (1995b). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
Audibert, J.-Y., & Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of COLT 2009. Montreal: Omnipress.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Burnetas, A. N., & Katehakis, M. N. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17, 122–142.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd edn.). New York: Wiley-Interscience.
Even-Dar, E., Mannor, S., & Mansour, Y. (2002). PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of COLT 2002 (pp. 255–270). London: Springer.
Fiacco, A. V. (1983). Introduction to sensitivity and stability analysis in nonlinear programming. New York: Academic Press.
Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. Chichester: Wiley.
Honda, J., & Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of COLT 2010, Haifa, Israel (pp. 67–79).
Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83, 113–154.
Katehakis, M. N., & Veinott, A. F. Jr. (1987). The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 262–268.
Kleinberg, R. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Proceedings of NIPS 2005 (pp. 697–704). New York: MIT Press.
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35, 117–154.
Pollard, D. (1984). Convergence of stochastic processes. Springer Series in Statistics. New York: Springer.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 527–535.
Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of ICML 2000 (pp. 943–950). San Francisco: Kaufmann.
Vermorel, J., & Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. In Proceedings of ECML 2005, Porto, Portugal (pp. 437–448). Berlin: Springer.
Wyatt, J. (1997). Exploration and inference in learning from reinforcement. Doctoral dissertation, Department of Artificial Intelligence, University of Edinburgh.
Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.
Editor: Nicolo Cesa-Bianchi.
Honda, J., Takemura, A. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Mach Learn 85, 361–391 (2011). https://doi.org/10.1007/s10994-011-5257-4