
On Upper-Confidence Bound Policies for Switching Bandit Problems

  • Aurélien Garivier
  • Eric Moulines
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6925)

Abstract

Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope, or internet advertisement, can be modelled as non-stationary bandit problems in which the distributions of rewards change abruptly at unknown time instants. In this paper, we analyze two algorithms designed for this setting: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upper bound on the expected regret by upper-bounding the expected number of times suboptimal arms are played; the proof relies on a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We also establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions, and show that both D-UCB and SW-UCB match this lower bound up to a logarithmic factor. Numerical simulations show that D-UCB and SW-UCB perform significantly better than existing soft-max methods such as EXP3.S.
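
The sketch below illustrates the sliding-window policy the abstract describes: at each round, SW-UCB plays the arm maximizing an empirical mean computed over the last tau plays plus an exploration bonus that shrinks with the number of times the arm appears in that window. This is a minimal sketch reconstructed from the description above, not the paper's reference implementation; the bonus form B*sqrt(xi*log(min(t, tau))/n), the default constants, and the helper name sw_ucb are illustrative assumptions. D-UCB would differ mainly in replacing the hard window with geometrically discounted counts and means.

```python
import math
import random

def sw_ucb(arms, horizon, tau=500, xi=0.6, B=1.0):
    """Sliding-window UCB sketch (illustrative, not the paper's code).

    arms: list of callables, each returning a reward in [0, B].
    tau: window length; xi, B: bonus scale (assumed defaults,
    not the paper's tuned recommendations).
    """
    history = []  # (arm index, reward), most recent play last
    for t in range(horizon):
        window = history[-tau:]  # only the last tau plays matter
        chosen, best_index = None, -math.inf
        for i in range(len(arms)):
            rewards = [r for (a, r) in window if a == i]
            n = len(rewards)
            if n == 0:
                chosen = i  # play any arm absent from the window
                break
            mean = sum(rewards) / n
            bonus = B * math.sqrt(xi * math.log(min(t + 1, tau)) / n)
            if mean + bonus > best_index:
                chosen, best_index = i, mean + bonus
        history.append((chosen, arms[chosen]()))
    return sum(r for (_, r) in history)

# Toy abruptly-changing environment: two Bernoulli arms whose means
# swap at round 5000 (an instant unknown to the policy).
clock = {"t": 0}
def bernoulli_arm(p_before, p_after, change_at=5000):
    def pull():
        clock["t"] += 1  # one pull per round, so this is global time
        p = p_before if clock["t"] <= change_at else p_after
        return float(random.random() < p)
    return pull

total = sw_ucb([bernoulli_arm(0.6, 0.2), bernoulli_arm(0.3, 0.7)],
               horizon=10_000)
```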

Keywords

Cognitive Radio · Discount Factor · Total Reward · Bandit Problem · Exploration Bonus


References

  1. Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. in Appl. Probab. 27(4), 1054–1078 (1995)
  2. Audibert, J.Y., Munos, R., Szepesvári, A.: Tuning bandit algorithms in stochastic environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 150–165. Springer, Heidelberg (2007)
  3. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)
  4. Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3 (Special Issue on Computational Learning Theory), 397–422 (2002)
  5. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2/3), 235–256 (2002)
  6. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, New York (2006)
  7. Cesa-Bianchi, N., Lugosi, G.: On prediction of individual sequences. Ann. Statist. 27(6), 1865–1895 (1999)
  8. Cesa-Bianchi, N., Lugosi, G., Stoltz, G.: Regret minimization under partial monitoring. Math. Oper. Res. 31(3), 562–580 (2006)
  9. Cesa-Bianchi, N., Lugosi, G., Stoltz, G.: Competing with typical compound actions (2008)
  10. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics, vol. 31. Springer, New York (1996)
  11. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55(1, part 2), 119–139 (1997); preliminary version in: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904. Springer, Heidelberg (1995)
  12. Fuh, C.D.: Asymptotic operating characteristics of an optimal change point detection in hidden Markov models. Ann. Statist. 32(5), 2305–2339 (2004)
  13. Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Proceedings of the 24th Annual Conference on Learning Theory, COLT 2011 (2011)
  14. Hartland, C., Gelly, S., Baskiotis, N., Teytaud, O., Sebag, M.: Multi-armed bandit, dynamic environments and meta-bandits. In: NIPS 2006 Workshop on Online Trading Between Exploration and Exploitation, Whistler, Canada (2006)
  15. Herbster, M., Warmuth, M.: Tracking the best expert. Machine Learning 32(2), 151–178 (1998)
  16. Honda, J., Takemura, A.: An asymptotically optimal bandit algorithm for bounded support models. In: Proceedings of the 23rd Annual Conference on Learning Theory, COLT 2010 (2010)
  17. Kocsis, L., Szepesvári, C.: Discounted UCB. In: 2nd PASCAL Challenges Workshop, Venice, Italy (April 2006)
  18. Koulouriotis, D.E., Xanthopoulos, A.: Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems. Applied Mathematics and Computation 196(2), 913–922 (2008)
  19. Lai, L., El Gamal, H., Jiang, H., Poor, H.V.: Cognitive medium access: exploration, exploitation and competition (2007)
  20. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math. 6(1), 4–22 (1985)
  21. Mei, Y.: Sequential change-point detection when unknown parameters are present in the pre-change distribution. Ann. Statist. 34(1), 92–122 (2006)
  22. Slivkins, A., Upfal, E.: Adapting to a changing environment: the Brownian restless bandits. In: Proceedings of the 21st Annual Conference on Learning Theory, pp. 343–354 (2008)
  23. Whittle, P.: Restless bandits: activity allocation in a changing world. J. Appl. Probab. 25A (special volume: A Celebration of Applied Probability), 287–298 (1988)
  24. Yu, J.Y., Mannor, S.: Piecewise-stationary bandit problems with side observations. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pp. 1177–1184. ACM, New York (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Aurélien Garivier (1)
  • Eric Moulines (1)

  1. Laboratoire LTCI, CNRS UMR 5141, Institut Telecom / Telecom ParisTech, Paris Cedex 13, France
