Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits
In this paper, we study the problem of estimating the mean values of all the arms uniformly well in the multi-armed bandit setting. If the variances of the arms were known, one could design an optimal sampling strategy by pulling the arms proportionally to their variances. However, since the distributions are not known in advance, we need to design adaptive sampling strategies that select an arm at each round based on the previously observed samples. We describe two strategies based on pulling the arms proportionally to an upper bound on their variances and derive regret bounds for these strategies. We show that the performance of these allocation strategies depends not only on the variances of the arms but also on the full shape of their distributions.
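The allocation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the bonus term `c * sqrt(log(1/delta) / n)`, the parameters `c` and `delta`, the two initial pulls per arm, and the function names are all assumptions chosen for illustration. The first function computes the static allocation that would be used if the variances were known; the second pulls, at each round, the arm whose upper bound on the variance, divided by its current number of pulls, is largest.

```python
import math
import random


def optimal_allocation(variances, budget):
    """Static allocation with known variances: budget shares proportional to variance."""
    total = sum(variances)
    return [budget * v / total for v in variances]


def ucb_variance_sampling(arms, budget, delta=0.05, c=1.0, seed=0):
    """Adaptive sampling sketch (hypothetical constants, not the paper's exact bound).

    arms   : list of callables; arms[i](rng) returns one sample from arm i
    budget : total number of pulls (must be at least 2 * len(arms))
    """
    rng = random.Random(seed)
    k = len(arms)
    # Pull every arm twice so that an empirical variance is defined.
    samples = [[arm(rng), arm(rng)] for arm in arms]
    for _ in range(budget - 2 * k):
        scores = []
        for xs in samples:
            n = len(xs)
            mean = sum(xs) / n
            var = sum((x - mean) ** 2 for x in xs) / (n - 1)
            # Assumed confidence bonus; it shrinks as the arm is pulled more often.
            bonus = c * math.sqrt(math.log(1.0 / delta) / n)
            # Upper bound on the variance, divided by the number of pulls.
            scores.append((var + bonus) / n)
        best = max(range(k), key=scores.__getitem__)
        samples[best].append(arms[best](rng))
    counts = [len(xs) for xs in samples]
    means = [sum(xs) / len(xs) for xs in samples]
    return counts, means


if __name__ == "__main__":
    # Two bounded arms with the same mean but very different variances.
    arms = [lambda r: r.uniform(0.45, 0.55), lambda r: r.uniform(0.0, 1.0)]
    counts, means = ucb_variance_sampling(arms, budget=500)
    print("pulls per arm:", counts)
    print("estimated means:", means)
```

In this sketch the high-variance arm ends up with more pulls than the low-variance one, but the bonus term keeps the split less extreme than the static allocation would be, since it forces some pulls of every arm while the variance estimates are still uncertain.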