A dynamic programming strategy to balance exploration and exploitation in the bandit problem

  • Olivier Caelen
  • Gianluca Bontempi


The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in the literature to deal with the K-armed bandit problem. This paper first introduces the concept of “expected reward of greedy actions”, which is based on the notion of probability of correct selection (PCS), well known in the simulation literature. This concept is then used in an original semi-uniform algorithm, which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques.
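
To make the setting concrete, the following is a minimal Python sketch of a K-armed bandit played with an ε-greedy rule, one common semi-uniform strategy, together with a Monte Carlo estimate of the probability of correct selection. It is not the DP-greedy algorithm of the paper. The function names `epsilon_greedy` and `estimate_pcs`, the Gaussian rewards with unit variance, and the fixed value of ε are all assumptions made for illustration only.

```python
import random


def epsilon_greedy(means, horizon, epsilon, seed=0):
    """Play a K-armed bandit with an epsilon-greedy (semi-uniform) rule.

    Assumption: rewards are Gaussian with the given means and unit variance.
    Returns the cumulated reward, the play counts and the estimated means.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for _ in range(horizon):
        if 0 in counts:
            # Initialization: play every arm once before comparing estimates.
            arm = counts.index(0)
        elif rng.random() < epsilon:
            # Exploration: pick an arm uniformly at random.
            arm = rng.randrange(k)
        else:
            # Exploitation: greedy action on the current sample means.
            arm = max(range(k), key=lambda a: estimates[a])
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean of the played arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total, counts, estimates


def estimate_pcs(means, samples_per_arm, trials=2000, seed=0):
    """Monte Carlo estimate of the probability of correct selection (PCS).

    PCS is taken here as the probability that the arm with the highest
    sample mean is also the arm with the highest true mean.
    """
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda a: means[a])
    hits = 0
    for _ in range(trials):
        sample_means = [
            sum(rng.gauss(m, 1.0) for _ in range(samples_per_arm)) / samples_per_arm
            for m in means
        ]
        if max(range(k), key=lambda a: sample_means[a]) == best:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    true_means = [0.1, 0.5, 0.9]  # hypothetical arm means
    total, counts, estimates = epsilon_greedy(true_means, horizon=1000, epsilon=0.1)
    print("cumulated reward:", round(total, 1), "plays per arm:", counts)
    print("estimated PCS with 20 samples per arm:", estimate_pcs(true_means, 20))
```

In this sketch the exploration rate is a fixed constant. The abstract describes an approach that instead uses dynamic programming and estimation techniques to decide how to balance exploration and exploitation.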


Multi-armed bandit problem, Greedy, Estimation

Mathematics Subject Classifications (2010)

68T05 68T20 62H12 49L20 





Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Computer Science Department, Université Libre de Bruxelles, Bruxelles, Belgium
