Variational Thompson Sampling for Relational Recurrent Bandits

  • Sylvain Lamprier
  • Thibault Gisselbrecht
  • Patrick Gallinari
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10535)


In this paper, we introduce a novel non-stationary bandit setting, called the relational recurrent bandit, in which the rewards of arms at successive time steps are interdependent. The aim is to discover temporal and structural dependencies between arms in order to maximize the cumulative collected reward. Two algorithms are proposed: the first directly models temporal dependencies between arms, whereas the second assumes the existence of hidden states of the system behind the observed rewards. For both approaches, we develop a Variational Thompson Sampling method, which approximates the required distributions via variational inference and uses the estimated distributions to sample reward expectations at each iteration of the process. Experiments conducted on both synthetic and real data demonstrate the effectiveness of our approaches.
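To make the sampling step concrete, the following is a minimal sketch of generic Gaussian Thompson Sampling with multiple plays. It is not the relational recurrent model of the paper: arms are assumed independent and stationary, and an exact conjugate Gaussian posterior stands in for the variational approximation. The function names (thompson_step, gaussian_update, run), the N(0, 1) prior, and the unit noise variance are all illustrative assumptions.

```python
import numpy as np


def thompson_step(mu, var, k, rng):
    """Sample one expected reward per arm from N(mu, var); return the k best arms."""
    samples = rng.normal(mu, np.sqrt(var))
    return np.argsort(samples)[-k:][::-1]


def gaussian_update(mu, var, arm, reward, noise_var=1.0):
    """Conjugate Gaussian update of one arm's posterior (in place).

    Assumption: rewards are Gaussian with known variance noise_var.
    """
    precision = 1.0 / var[arm] + 1.0 / noise_var
    mu[arm] = (mu[arm] / var[arm] + reward / noise_var) / precision
    var[arm] = 1.0 / precision


def run(true_means, horizon=2000, k=1, seed=0):
    """Simulate the sample-select-update loop on a stationary toy problem."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    mu, var = np.zeros(n), np.ones(n)  # assumed N(0, 1) prior on every arm
    pulls = np.zeros(n, dtype=int)
    for _ in range(horizon):
        for arm in thompson_step(mu, var, k, rng):
            # Only the selected arms reveal a reward (bandit feedback).
            reward = rng.normal(true_means[arm], 1.0)
            gaussian_update(mu, var, arm, reward)
            pulls[arm] += 1
    return mu, var, pulls
```

Calling run([0.0, 0.2, 3.0]) returns the posterior means, posterior variances, and pull counts per arm. In the setting of the paper, the per-arm posteriors would instead be replaced by variational distributions over parameters that couple the rewards of different arms across time steps.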



The work was supported by the IRT SystemX and the ANR project LOCUST (2015–2019, ANR-15-CE23-0027).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Sylvain Lamprier (1)
  • Thibault Gisselbrecht (1, 2, 3)
  • Patrick Gallinari (1)

  1. Sorbonne Universités, UPMC Univ Paris 06, CNRS, LIP6 UMR 7606, Paris, France
  2. IRT SystemX, Palaiseau, France
  3. SNIPS, Paris, France
