An asymptotically optimal strategy for constrained multi-armed bandit problems

  • Hyeong Soo Chang
Original Article


This note considers the “constrained multi-armed bandit” (CMAB) model, which generalizes the classical stochastic MAB model by adding a feasibility constraint for each action. Feasibility is in effect a second, conflicting objective that a playing strategy must respect in order to be optimal for the main objective. Just as the stochastic MAB model is a special case of the Markov decision process (MDP) model, the CMAB model is a special case of the constrained MDP model. Measuring asymptotic optimality by the probability of choosing an optimal feasible arm over an infinite horizon, we show that optimality is achievable by a simple strategy extended from the \(\epsilon _t\)-greedy strategy used for unconstrained MAB problems. We provide a finite-time lower bound, valid at every time step, on the probability of correctly selecting an optimal near-feasible arm. Under some conditions, the bound approaches one as time t goes to infinity. We also discuss a particular example sequence \(\{\epsilon _t\}\) whose asymptotic convergence rate is of the order of \((1-\frac{1}{t})^4\), holding from a sufficiently large t onward.
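As a rough illustration of the kind of strategy described above, the following Python sketch extends \(\epsilon _t\)-greedy with an empirical feasibility check. It is not the strategy analyzed in the note; several details are assumptions made only for this example: each pull returns a Bernoulli reward and a Bernoulli cost, an arm is feasible when its mean cost is at most a given `threshold`, the exploration schedule is \(\epsilon _t = \min (1, c/t)\), and when no arm looks feasible the sketch falls back to the arm with the smallest estimated cost. All function and parameter names are hypothetical.

```python
import random


def greedy_feasible_arm(reward_means, cost_means, threshold):
    """Pick the arm with the highest estimated reward among the arms whose
    estimated cost is at most `threshold`.

    Assumption for this sketch: if no arm looks feasible, fall back to the
    arm with the smallest estimated cost.
    """
    feasible = [i for i, c in enumerate(cost_means) if c <= threshold]
    if not feasible:
        return min(range(len(cost_means)), key=lambda i: cost_means[i])
    return max(feasible, key=lambda i: reward_means[i])


def run_constrained_eps_greedy(reward_probs, cost_probs, threshold,
                               horizon, c=50.0, seed=0):
    """Simulate the sketch on Bernoulli arms (an assumed test environment).

    Returns the arm recommended after `horizon` pulls and the pull counts.
    """
    rng = random.Random(seed)
    k = len(reward_probs)
    counts = [0] * k
    reward_means = [0.0] * k
    cost_means = [0.0] * k

    for t in range(1, horizon + 1):
        eps_t = min(1.0, c / t)  # assumed schedule, not taken from the note
        if rng.random() < eps_t:
            arm = rng.randrange(k)  # explore: uniformly random arm
        else:
            # exploit: best arm among those that currently look feasible
            arm = greedy_feasible_arm(reward_means, cost_means, threshold)

        # Draw one Bernoulli reward and one Bernoulli cost for the pulled arm.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        cost = 1.0 if rng.random() < cost_probs[arm] else 0.0

        # Incremental update of both sample means.
        counts[arm] += 1
        reward_means[arm] += (reward - reward_means[arm]) / counts[arm]
        cost_means[arm] += (cost - cost_means[arm]) / counts[arm]

    best = greedy_feasible_arm(reward_means, cost_means, threshold)
    return best, counts


if __name__ == "__main__":
    # Arm 0 has the highest reward but violates the assumed cost constraint.
    best, counts = run_constrained_eps_greedy(
        reward_probs=[0.9, 0.7, 0.1],
        cost_probs=[0.9, 0.1, 0.1],
        threshold=0.5,
        horizon=5000,
    )
    print("recommended arm:", best, "pull counts:", counts)
```

In this toy instance arm 0 has the largest mean reward but is infeasible, so the optimal feasible arm is arm 1. Because the exploration probability decays with t, every arm keeps being sampled while most pulls eventually go to the arm that currently looks best among the empirically feasible ones.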


Multi-armed bandit; Constrained stochastic optimization; Simulation optimization; Constrained Markov decision process




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Authors and Affiliations

  1. Department of Computer Science and Engineering, Sogang University, Seoul, Korea
