Sub-sampling for Multi-armed Bandits

  • Akram Baransi
  • Odalric-Ambrym Maillard
  • Shie Mannor
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8724)


The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible: it does not need to know a set of reward distributions in advance, nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, give the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-arm bandit setting.
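The abstract does not spell out the procedure, so the following is only a minimal sketch of what a sub-sampling rule for two arms could look like, not the authors' exact algorithm. It assumes that the arm with more observations is sub-sampled without replacement down to the size of the other arm's history, that the arm with the higher sub-sample mean is pulled, and that ties go to the less-pulled arm. The names `subsample_choose` and `run_two_arms`, and the Bernoulli rewards in the simulation, are hypothetical choices made for illustration.

```python
import random


def subsample_choose(history_a, history_b, rng):
    """Return 0 or 1: the arm to pull next (illustrative rule, an assumption)."""
    n = min(len(history_a), len(history_b))
    if n == 0:
        # An arm with no observations yet is pulled first.
        return 0 if len(history_a) == 0 else 1
    # Sub-sample without replacement so both arms are compared on n rewards.
    sub_a = rng.sample(history_a, n)
    sub_b = rng.sample(history_b, n)
    mean_a = sum(sub_a) / n
    mean_b = sum(sub_b) / n
    if mean_a != mean_b:
        return 0 if mean_a > mean_b else 1
    # Tie on the means: favour the arm that has been pulled less often.
    return 0 if len(history_a) <= len(history_b) else 1


def run_two_arms(means, horizon, seed=0):
    """Simulate the rule on two Bernoulli arms; return the pull counts."""
    rng = random.Random(seed)
    histories = ([], [])
    for _ in range(horizon):
        arm = subsample_choose(histories[0], histories[1], rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        histories[arm].append(reward)
    return len(histories[0]), len(histories[1])


if __name__ == "__main__":
    print(run_two_arms((0.8, 0.3), horizon=500))
```

Under these assumptions the rule only compares empirical means, so it needs neither the reward range nor a distribution family, and multiplying every reward by the same positive constant leaves each decision unchanged, which matches the flexibility and rescaling invariance claimed in the abstract.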


Multi-armed Bandits, Sub-sampling, Reinforcement Learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Akram Baransi (1)
  • Odalric-Ambrym Maillard (1)
  • Shie Mannor (1)
  1. Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel
