An asymptotically optimal strategy for constrained multi-armed bandit problems

Chang, Hyeong Soo

doi:10.1007/s00186-019-00697-3

Original Article
Published: 02 January 2020

Volume 91, pages 545–557, (2020)
Mathematical Methods of Operations Research

Hyeong Soo Chang (ORCID: orcid.org/0000-0003-3298-0018)

Abstract

This note considers the “constrained multi-armed bandit” (CMAB) model, which generalizes the classical stochastic MAB model by adding a feasibility constraint for each action. Feasibility is in effect a second (conflicting) objective that must be satisfied for a playing strategy to achieve optimality of the main objective. While the stochastic MAB model is a special case of the Markov decision process (MDP) model, the CMAB model is a special case of the constrained MDP model. For asymptotic optimality, measured by the probability of choosing an optimal feasible arm over an infinite horizon, we show that optimality is achievable by a simple strategy extended from the \(\epsilon _t\)-greedy strategy used for unconstrained MAB problems. We provide a finite-time lower bound on the probability of correct selection of an optimal near-feasible arm that holds for all time steps. Under some conditions, the bound approaches one as time \(t\) goes to infinity. We also discuss a particular example sequence \(\{\epsilon _t\}\) whose asymptotic convergence rate is of order \((1-\frac{1}{t})^4\) for all sufficiently large \(t\).
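
To make the idea of extending an \(\epsilon _t\)-greedy rule with a feasibility check more concrete, the following is a minimal Python sketch. It is not the strategy analyzed in the article, whose details are not given in this abstract. The function name `constrained_eps_greedy`, the threshold `cost_bound`, the fallback used when no arm looks feasible, the default schedule \(\epsilon_t = \min(1, K/t)\), and the Bernoulli test arms are all assumptions made for illustration.

```python
import random


def constrained_eps_greedy(pull, n_arms, horizon, cost_bound, eps=None, seed=0):
    """Illustrative constrained eps_t-greedy loop (hypothetical, not the paper's rule).

    pull(arm, rng) -> (reward, cost) returns one sample from the chosen arm.
    An arm is treated as feasible when its sample-mean cost is <= cost_bound.
    Returns the final greedy arm and the per-arm play counts.
    """
    if horizon < n_arms:
        raise ValueError("horizon must allow one initial play of every arm")
    rng = random.Random(seed)
    if eps is None:
        # Assumed exploration schedule: eps_t = min(1, n_arms / t).
        eps = lambda t: min(1.0, n_arms / t)
    counts = [0] * n_arms
    reward_sum = [0.0] * n_arms
    cost_sum = [0.0] * n_arms

    def greedy_arm():
        # Sample means are well defined because every arm is played once first.
        mean_r = [reward_sum[a] / counts[a] for a in range(n_arms)]
        mean_c = [cost_sum[a] / counts[a] for a in range(n_arms)]
        feasible = [a for a in range(n_arms) if mean_c[a] <= cost_bound]
        if feasible:
            return max(feasible, key=lambda a: mean_r[a])
        # Assumed fallback: no arm looks feasible, so take the lowest mean cost.
        return min(range(n_arms), key=lambda a: mean_c[a])

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # initial round: play each arm once
        elif rng.random() < eps(t):
            arm = rng.randrange(n_arms)      # explore uniformly at random
        else:
            arm = greedy_arm()               # exploit the best feasible-looking arm
        reward, cost = pull(arm, rng)
        counts[arm] += 1
        reward_sum[arm] += reward
        cost_sum[arm] += cost
    return greedy_arm(), counts


def bernoulli_pull(arm, rng):
    # Hypothetical test problem: arm 0 has the highest mean reward but violates
    # the cost bound of 0.5, so arm 1 is the optimal feasible arm.
    reward_means = [0.9, 0.6, 0.3]
    cost_means = [0.8, 0.2, 0.1]
    reward = 1.0 if rng.random() < reward_means[arm] else 0.0
    cost = 1.0 if rng.random() < cost_means[arm] else 0.0
    return reward, cost


best, counts = constrained_eps_greedy(bernoulli_pull, 3, 5000, cost_bound=0.5)
print("selected arm:", best, "play counts:", counts)
```

In this sketch the exploration probability decays with \(t\), so most plays eventually go to the arm that currently looks best among those estimated to be feasible, while occasional random plays keep correcting the reward and cost estimates of the other arms.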

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Sogang University, Seoul, 04107, Korea
Hyeong Soo Chang

Corresponding author

Correspondence to Hyeong Soo Chang.

About this article

Cite this article

Chang, H.S. An asymptotically optimal strategy for constrained multi-armed bandit problems. Math Meth Oper Res 91, 545–557 (2020). https://doi.org/10.1007/s00186-019-00697-3

Received: 03 September 2018
Revised: 29 November 2019
Published: 02 January 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s00186-019-00697-3
