Abstract
We study a robust model of the multi-armed bandit (MAB) problem in which the transition probabilities are ambiguous and belong to subsets of the probability simplex. We first show that, for each arm, there exists a robust counterpart of the Gittins index that solves a robust optimal stopping-time problem and can be computed efficiently via an equivalent restart problem. We then characterize the optimal policy of the robust MAB as a project-by-project retirement policy, but we show that the arms become dependent, so the policy based on the robust Gittins index is not optimal. For a project selection problem, we show that the robust Gittins index policy is near optimal, although its implementation requires more computational effort than solving a non-robust MAB problem. Hence, we propose a Lagrangian index policy that requires the same computational effort as evaluating the indices of a non-robust MAB and is within 1% of the optimum in the robust project selection problem.
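To make the restart-problem computation concrete, below is a minimal sketch of how a robust Gittins index could be computed by value iteration on the restart formulation of Katehakis and Veinott (1987), extended with a worst-case inner minimization. The function name `robust_gittins_index`, the modeling of the ambiguity set as a finite list of candidate transition matrices, and the rectangularity assumption (nature re-picks the worst matrix at every state and step) are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def robust_gittins_index(start, rewards, P_set, beta=0.9, tol=1e-10):
    """Robust Gittins index of state `start` via the restart-problem
    formulation, solved by value iteration.

    The ambiguity set is modeled here (as an assumption) by a finite
    list of candidate transition matrices `P_set`; at each state the
    adversary picks the matrix minimizing the continuation value.
    """
    n = len(rewards)
    V = np.zeros(n)
    while True:
        # Worst-case continuation value from every state.
        cont = rewards + beta * np.min([P @ V for P in P_set], axis=0)
        # At each state the player may either continue or restart in
        # `start`, which is equivalent to acting as if in `start`.
        V_new = np.maximum(cont, cont[start])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Katehakis-Veinott: the Gittins index of `start` is (1 - beta)
    # times the value of the restart problem evaluated at `start`.
    return (1 - beta) * V[start]
```

With a single transition matrix in `P_set`, the minimization is vacuous and the routine reduces to the classical (non-robust) restart computation of the Gittins index.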
Caro, F., Das Gupta, A. Robust control of the multi-armed bandit problem. Ann Oper Res 317, 461–480 (2022). https://doi.org/10.1007/s10479-015-1965-7