A dynamic programming strategy to balance exploration and exploitation in the bandit problem
The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in the literature to deal with the K-armed bandit problem. This paper first introduces the concept of “expected reward of greedy actions”, which is based on the notion of probability of correct selection (PCS), well known in the simulation literature. This concept is then used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques.
Keywords: Multi-armed bandit problem, Greedy, Estimation
Mathematics Subject Classifications (2010): 68T05, 68T20, 62H12, 49L20
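To make the setting of the abstract concrete, the following minimal sketch simulates a K-armed bandit and plays it with a plain epsilon-greedy rule, one common member of the semi-uniform family. It is not the DP-greedy algorithm of the paper, which is not specified here. The function names `run_epsilon_greedy` and `estimate_pcs`, the unit-variance Gaussian rewards, and all parameter values are assumptions chosen for illustration. The second function estimates, by Monte Carlo, the probability of correct selection, understood as the probability that the arm with the highest sample mean is the truly best arm.

```python
import random


def run_epsilon_greedy(arm_means, n_rounds, epsilon, seed=0):
    """Play a simulated K-armed bandit with an epsilon-greedy rule.

    Assumption: each arm returns a Gaussian reward with unit variance.
    Returns the total reward, the pull counts and the sample means.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sample_means = [0.0] * k
    total = 0.0
    for t in range(n_rounds):
        if t < k:
            arm = t  # pull every arm once to initialize the estimates
        elif rng.random() < epsilon:
            arm = rng.randrange(k)  # exploration: uniform random arm
        else:
            # exploitation: greedy action on the current sample means
            arm = max(range(k), key=lambda i: sample_means[i])
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        # incremental update of the sample mean of the pulled arm
        sample_means[arm] += (reward - sample_means[arm]) / counts[arm]
        total += reward
    return total, counts, sample_means


def estimate_pcs(arm_means, samples_per_arm, n_trials, seed=0):
    """Monte Carlo estimate of the probability of correct selection.

    A trial counts as correct when the arm with the highest sample
    mean is also the arm with the highest true mean.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    best = max(range(k), key=lambda i: arm_means[i])
    hits = 0
    for _ in range(n_trials):
        means = [
            sum(rng.gauss(m, 1.0) for _ in range(samples_per_arm)) / samples_per_arm
            for m in arm_means
        ]
        if max(range(k), key=lambda i: means[i]) == best:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    total, counts, means = run_epsilon_greedy([0.2, 0.5, 0.8], 1000, 0.1)
    print("pull counts:", counts)
    print("estimated PCS:", estimate_pcs([0.2, 0.5, 0.8], 10, 2000))
```

In this sketch epsilon is fixed by hand. The abstract describes an algorithm that instead uses the expected reward of greedy actions, within a dynamic programming framework, to decide when exploring is worthwhile.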