Deviations of Stochastic Bandit Regret

  • Antoine Salomon
  • Jean-Yves Audibert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6925)

Abstract

This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that, with probability at least 1-1/n, the regret of the policy is of order log n. They have also shown that such a property is not shared by the popular UCB1 policy of Auer et al. (2002). This work first answers an open question: it extends this negative result to any anytime policy. The second contribution of this paper is to design anytime robust policies for specific multi-armed bandit problems in which some restrictions are put on the set of possible distributions of the different arms.
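
For context on the abstract's claim, the following is a minimal simulation sketch (not from the paper) of the UCB1 index policy of Auer et al. (2002) on Bernoulli arms; the arm means, horizon, and number of runs are illustrative assumptions. Running it over many seeds gives an empirical view of the regret deviations the paper studies.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 (Auer et al., 2002) on Bernoulli arms; return the pseudo-regret."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of pulls per arm
    sums = [0.0] * k          # cumulative reward per arm
    best = max(means)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # play each arm once to initialise
        else:
            # UCB1 index: empirical mean + exploration bonus sqrt(2 ln t / n_i)
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    # pseudo-regret: shortfall relative to always playing the best arm
    return sum((best - means[i]) * counts[i] for i in range(k))

# Illustrative experiment (hypothetical arm means 0.5 and 0.6): the regret is
# of order log n on average, but its upper tail across runs is what the paper
# shows cannot be uniformly controlled for anytime policies.
regrets = [ucb1([0.5, 0.6], horizon=10000, seed=s) for s in range(100)]
print(sorted(regrets)[-5:])  # the few largest-deviation runs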

Keywords

Neural Information Processing System, Impossibility Result, Bandit Problem, Cumulative Reward, Robust Policy

References

  1. [ACBF02] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256 (2002)
  2. [Agr95] Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054–1078 (1995)
  3. [AMS09] Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19), 1876–1902 (2009)
  4. [BMSS09] Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C.: Online optimization in X-armed bandits. In: Advances in Neural Information Processing Systems, vol. 21, pp. 201–208 (2009)
  5. [BSS09] Babaioff, M., Sharma, Y., Slivkins, A.: Characterizing truthful multi-armed bandit mechanisms: extended abstract. In: Proceedings of the Tenth ACM Conference on Electronic Commerce, pp. 79–88. ACM, New York (2009)
  6. [BV08] Bergemann, D., Välimäki, J.: Bandit problems. In: The New Palgrave Dictionary of Economics, 2nd edn. Macmillan Press, Basingstoke (2008)
  7. [CM07] Coquelin, P.-A., Munos, R.: Bandit algorithms for tree search. In: Uncertainty in Artificial Intelligence (2007)
  8. [DK09] Devanur, N.R., Kakade, S.M.: The price of truthfulness for pay-per-click auctions. In: Proceedings of the Tenth ACM Conference on Electronic Commerce, pp. 99–106. ACM, New York (2009)
  9. [GW06] Gelly, S., Wang, Y.: Exploration exploitation in Go: UCT for Monte-Carlo Go. In: Online Trading between Exploration and Exploitation Workshop, Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006) (2006)
  10. [Hol92] Holland, J.H.: Adaptation in Natural and Artificial Systems. MIT Press, Cambridge (1992)
  11. [Kle05] Kleinberg, R.D.: Nearly tight bounds for the continuum-armed bandit problem. In: Advances in Neural Information Processing Systems, vol. 17, pp. 697–704 (2005)
  12. [KS06] Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer, Heidelberg (2006)
  13. [KSU08] Kleinberg, R., Slivkins, A., Upfal, E.: Multi-armed bandits in metric spaces. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 681–690 (2008)
  14. [LPT04] Lamberton, D., Pagès, G., Tarrès, P.: When can the two-armed bandit algorithm be trusted? Annals of Applied Probability 14(3), 1424–1454 (2004)
  15. [LR85] Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985)
  16. [Mas90] Massart, P.: The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability 18(3), 1269–1283 (1990)
  17. [Rob52] Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58, 527–535 (1952)
  18. [Rud86] Rudin, W.: Real and Complex Analysis, 3rd edn. McGraw-Hill, New York (1986)
  19. [SB98] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Antoine Salomon (1)
  • Jean-Yves Audibert (1, 2)
  1. Imagine, LIGM, École des Ponts ParisTech, Université Paris-Est, France
  2. Sierra, CNRS/ENS/INRIA, Paris, France
