A Neural Networks Committee for the Contextual Bandit Problem

  • Robin Allesiardo
  • Raphaël Féraud
  • Djallel Bouneffouf
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8834)


This paper presents a new contextual bandit algorithm, NeuralBandit, which requires no stationarity assumption on contexts or rewards. Several neural networks are trained to model the value of rewards given the context. Two variants, based on a multi-expert approach, are proposed to choose the parameters of the multi-layer perceptrons online. The proposed algorithms are tested successfully on a large dataset, with and without stationarity of rewards.
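The idea of training neural networks to model the reward given the context can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: one small multi-layer perceptron per arm, squared-error updates by plain stochastic gradient descent, and epsilon-greedy exploration (a swapped-in, generic exploration rule). The class names `TinyMLP` and `PerArmNeuralBandit` and all hyperparameter values are hypothetical, and the committee step that selects network parameters online is left out.

```python
import numpy as np


class TinyMLP:
    """One-hidden-layer perceptron predicting a scalar reward from a context."""

    def __init__(self, n_inputs, n_hidden, lr, rng):
        self.w1 = rng.normal(0.0, 0.5, size=(n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.5, size=n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, x):
        h = np.tanh(self.w1 @ x + self.b1)
        return float(self.w2 @ h + self.b2)

    def update(self, x, reward):
        """One SGD step on the squared error (prediction - reward)^2 / 2."""
        h = np.tanh(self.w1 @ x + self.b1)
        err = float(self.w2 @ h + self.b2) - reward
        # Backpropagate through tanh before touching the output weights.
        grad_h = err * self.w2 * (1.0 - h ** 2)
        self.w2 -= self.lr * err * h
        self.b2 -= self.lr * err
        self.w1 -= self.lr * np.outer(grad_h, x)
        self.b1 -= self.lr * grad_h


class PerArmNeuralBandit:
    """Hypothetical sketch: one reward model per arm, epsilon-greedy play."""

    def __init__(self, n_arms, n_inputs, n_hidden=8, lr=0.05, epsilon=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon  # assumed exploration rate, not from the paper
        self.nets = [TinyMLP(n_inputs, n_hidden, lr, self.rng) for _ in range(n_arms)]

    def choose(self, x):
        # Explore uniformly with probability epsilon, otherwise play the
        # arm whose network predicts the highest reward for this context.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.nets)))
        return int(np.argmax([net.predict(x) for net in self.nets]))

    def learn(self, x, arm, reward):
        # Only the played arm's reward is observed, so only its network is updated.
        self.nets[arm].update(x, reward)
```

In use, each round would call `choose(x)` on the observed context, play the returned arm, and pass the observed reward to `learn(x, arm, reward)`. Because the networks keep updating online, such a learner can in principle follow slow changes in the reward function, although this sketch makes no claim about matching the behaviour reported in the paper.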


Keywords: Concept Drift, Bandit Problem, Forest Cover Type, Multiarmed Bandit, Neural Network Committee





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Robin Allesiardo (1, 2)
  • Raphaël Féraud (2)
  • Djallel Bouneffouf (2)

  1. TAO - INRIA, CNRS, University of Paris-Sud, Orsay, France
  2. Orange Labs, Lannion, France
