Improving the Exploration Strategy in Bandit Algorithms

  • Olivier Caelen
  • Gianluca Bontempi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5313)


The K-armed bandit problem is a formalization of the exploration versus exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player faces a gambling machine with K arms, each associated with an unknown gain distribution, and the goal is to maximize the sum of rewards (or, equivalently, to minimize the sum of losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploratory phase. This paper focuses on improving the exploration step by means of the notion of probability of correct selection (PCS), a notion well known in the simulation literature yet overlooked in the optimization domain. The rationale of our approach is to sample, at each exploration step, the arm that maximizes the probability of selecting the optimal arm (i.e., the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy scheme. A set of numerical experiments on artificial and real datasets shows that a more effective exploration may improve the performance of the entire bandit strategy.
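The ε-PCSgreedy idea described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes Gaussian rewards, approximates the PCS with a pairwise-normal (Slepian-type) product bound, and evaluates each candidate exploratory pull by how much one hypothetical extra observation would shrink that arm's standard error. All class and method names are illustrative.

```python
import math
import random

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

class EpsilonPCSGreedy:
    """Sketch of an epsilon-greedy bandit whose exploration step
    picks the arm maximizing an approximate PCS at the next step."""

    def __init__(self, k, epsilon=0.1, prior_var=1.0):
        self.k = k
        self.epsilon = epsilon
        self.n = [0] * k          # pull counts
        self.mean = [0.0] * k     # empirical means
        self.m2 = [0.0] * k       # sums of squared deviations (Welford)
        self.prior_var = prior_var

    def _var(self, i):
        # Sample variance; fall back to a prior before 2 observations.
        if self.n[i] < 2:
            return self.prior_var
        return self.m2[i] / (self.n[i] - 1)

    def _pcs_if_pulled(self, j):
        # Anticipated PCS if arm j received one extra (hypothetical) pull:
        # the extra pull shrinks arm j's standard error, means are kept fixed.
        n = [self.n[i] + (1 if i == j else 0) for i in range(self.k)]
        se2 = [self._var(i) / max(n[i], 1) + 1e-12 for i in range(self.k)]
        b = max(range(self.k), key=lambda i: self.mean[i])  # empirical best
        pcs = 1.0
        for i in range(self.k):
            if i == b:
                continue
            gap = self.mean[b] - self.mean[i]
            # Pairwise-independent normal approximation of P(best beats arm i).
            pcs *= normal_cdf(gap / math.sqrt(se2[b] + se2[i]))
        return pcs

    def select(self):
        if min(self.n) == 0:
            return self.n.index(0)        # pull each arm once first
        if random.random() < self.epsilon:
            # Exploration: arm whose extra pull maximizes anticipated PCS.
            return max(range(self.k), key=self._pcs_if_pulled)
        return max(range(self.k), key=lambda i: self.mean[i])  # exploitation

    def update(self, arm, reward):
        # Welford's online update of mean and squared deviations.
        self.n[arm] += 1
        d = reward - self.mean[arm]
        self.mean[arm] += d / self.n[arm]
        self.m2[arm] += d * (reward - self.mean[arm])
```

With probability 1 − ε the agent exploits the empirical best arm; with probability ε it spends the pull on the arm whose hypothetical extra observation yields the highest anticipated probability of correctly identifying the best arm at the following step.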


Keywords: Greedy Algorithm · Multivariate Normal Distribution · Exploration Strategy · Total Reward · Bandit Problem





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Olivier Caelen
  • Gianluca Bontempi
  Machine Learning Group, Département d’Informatique, Faculté des Sciences, Université Libre de Bruxelles, Brussels, Belgium
