Linear Bandits in Unknown Environments

  • Thibault Gisselbrecht
  • Sylvain Lamprier
  • Patrick Gallinari
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9852)


In contextual bandit problems, an agent has to choose an action from a larger set of available ones at each decision step, according to features observed on them. The goal is to define a decision strategy that maximizes the cumulative reward of the chosen actions over time. We focus on the specific case where the features of each action correspond to a constant profile, which can be used to determine the action's intrinsic utility for the task at hand. If there exists an unknown linear mapping from profiles to rewards, it can be leveraged to greatly improve the exploration-exploitation trade-off of stationary stochastic methods such as UCB. In this paper, we consider the case where action profiles are not known beforehand. Instead, at each decision step the agent only observes sample vectors, whose mean equals the true profile, for a subset of actions. We propose a new algorithm, called SampLinUCB, and derive a finite-time, high-probability upper bound on its regret. We also provide numerical experiments on a task of focused data capture from online social networks.
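
The setting described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the SampLinUCB algorithm of the paper: the function name `run_sketch`, the parameters `alpha` and `n_observed`, the Gaussian noise model, and the heuristic `1/sqrt(n)` bonus for profile uncertainty are all hypothetical choices made for illustration. The agent keeps a running mean of the noisy profile samples it has seen for each action, fits a ridge-regression estimate of the linear mapping, and picks the action with the largest optimistic score.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def run_sketch(n_actions=5, dim=3, n_observed=2, horizon=2000,
               noise=0.1, alpha=1.0, seed=0):
    """Simulate a linear bandit whose action profiles are only seen through noisy samples."""
    rng = random.Random(seed)
    # Hypothetical ground truth, hidden from the agent.
    beta = [rng.uniform(0, 1) for _ in range(dim)]
    profiles = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_actions)]
    true_means = [dot(p, beta) for p in profiles]
    best = max(true_means)

    counts = [0] * n_actions                      # profile samples seen per action
    est = [[0.0] * dim for _ in range(n_actions)]  # running mean of profile samples
    A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    regret = 0.0

    for _ in range(horizon):
        # Only a random subset of actions delivers a profile sample at this step.
        for k in rng.sample(range(n_actions), n_observed):
            sample = [p + rng.gauss(0, noise) for p in profiles[k]]
            counts[k] += 1
            est[k] = [m + (s - m) / counts[k] for m, s in zip(est[k], sample)]

        theta = mat_vec(A_inv, b)  # ridge estimate of the linear mapping

        def score(k):
            if counts[k] == 0:
                return float("inf")  # never observed: explore first
            x = est[k]
            width = math.sqrt(max(0.0, dot(x, mat_vec(A_inv, x))))
            # Assumed heuristic: extra bonus shrinking with the number of profile samples.
            return dot(x, theta) + alpha * width + alpha / math.sqrt(counts[k])

        chosen = max(range(n_actions), key=score)
        reward = true_means[chosen] + rng.gauss(0, noise)

        # Sherman-Morrison rank-one update of the inverse design matrix.
        x = est[chosen]
        Ax = mat_vec(A_inv, x)
        denom = 1.0 + dot(x, Ax)
        A_inv = [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(dim)]
                 for i in range(dim)]
        b = [bi + reward * xi for bi, xi in zip(b, x)]
        regret += best - true_means[chosen]

    return regret, est, profiles


if __name__ == "__main__":
    total_regret, _, _ = run_sketch()
    print("cumulative pseudo-regret:", total_regret)
```

The sketch separates the two sources of uncertainty the abstract points to: uncertainty about the linear mapping (the `width` term) and uncertainty about the profiles themselves (the count-based bonus). How the paper actually combines them in its confidence bound is not stated here, so the score above should be read as a placeholder.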


Contextual Bandit Problem · Context Vector · Profile Vector · Thompson Sampling Algorithm · Reward Distribution
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This research was carried out within the framework of the Technological Research Institute SystemX and was therefore supported by public funds within the scope of the French program “Investissements d’Avenir”. Part of the work was supported by the Luxid’x project, financed by the DGA under the Rapid program.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Thibault Gisselbrecht (1, 2)
  • Sylvain Lamprier (2)
  • Patrick Gallinari (2)

  1. Technological Research Institute SystemX, Palaiseau, France
  2. Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, Paris, France
