A Time and Space Efficient Algorithm for Contextual Linear Bandits

  • José Bento
  • Stratis Ioannidis
  • S. Muthukrishnan
  • Jinyun Yan

Part of the Lecture Notes in Computer Science book series (LNCS, volume 8188)

Abstract

We consider a multi-armed bandit problem where payoffs are a linear function of an observed stochastic contextual variable. In the scenario where there exists a gap between optimal and suboptimal rewards, several algorithms have been proposed that achieve \(O(\log T)\) regret after T time steps. However, proposed methods either have a computational complexity per iteration that scales linearly with T or achieve regrets that grow linearly with the number of contexts \(|\chi|\). We propose an ε-greedy type of algorithm that overcomes both limitations. In particular, when contexts are variables in \(\mathbb{R}^d\), we prove that our algorithm has a constant computational complexity per iteration of \(O(\mathrm{poly}(d))\) and can achieve a regret of \(O(\mathrm{poly}(d)\log T)\) even when \(|\chi| = \Omega(2^d)\). In addition, unlike previous algorithms, its space complexity scales like \(O(Kd^2)\) and does not grow with T.
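
The complexity claims above are easy to see in a generic ε-greedy linear bandit: each arm keeps only fixed-size least-squares statistics, which occupy \(O(Kd^2)\) memory, and each step touches only those statistics, never the length-T history. The sketch below illustrates this bookkeeping under stated assumptions; it is a minimal, generic implementation of the ε-greedy recipe (decaying exploration rate, ridge-regression payoff estimates), not the authors' exact algorithm, and the class name, the exploration schedule \(\epsilon_t = \min(1, c/t)\), and the regularizer λ are all illustrative choices.

    import numpy as np

    class EpsGreedyLinearBandit:
        """Sketch of a generic epsilon-greedy contextual linear bandit.

        For each of K arms it stores the d x d ridge Gram matrix
        A_a = lam * I + sum_t x_t x_t^T and the d-vector b_a = sum_t r_t x_t,
        so memory is O(K d^2) and per-step cost depends only on d and K,
        not on the horizon T.
        """

        def __init__(self, n_arms, dim, lam=1.0, explore_scale=5.0, seed=0):
            self.K, self.d = n_arms, dim
            self.A = np.stack([lam * np.eye(dim)] * n_arms)   # (K, d, d)
            self.b = np.zeros((n_arms, dim))                  # (K, d)
            self.t = 0
            self.c = explore_scale                            # eps_t = min(1, c/t)
            self.rng = np.random.default_rng(seed)

        def select(self, x):
            """Choose an arm for context x of shape (d,)."""
            self.t += 1
            if self.rng.random() < min(1.0, self.c / self.t):
                return int(self.rng.integers(self.K))         # explore uniformly
            # Exploit: ridge estimate theta_a = A_a^{-1} b_a for every arm,
            # then pick the arm with the largest predicted payoff x^T theta_a.
            theta = np.linalg.solve(self.A, self.b[..., None])[..., 0]
            return int(np.argmax(theta @ x))

        def update(self, arm, x, reward):
            """Rank-one update of the pulled arm's statistics in O(d^2) time."""
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x

A typical loop would call `a = bandit.select(x)` and then `bandit.update(a, x, r)` once the payoff r is observed. Caching each inverse \(A_a^{-1}\) and maintaining it with the Sherman–Morrison rank-one update would avoid re-solving K linear systems at selection time; either way, per-step time and space stay independent of T, which is the point of the abstract's claim.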

Keywords

Contextual Linear Bandits · Space and Time Efficiency


Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • José Bento (1)
  • Stratis Ioannidis (2)
  • S. Muthukrishnan (3)
  • Jinyun Yan (3)
  1. Stanford University, USA
  2. Technicolor, USA
  3. Rutgers University, USA