Abstract
A multi-armed bandit episode consists of n trials, each allowing selection of one of K arms, resulting in a payoff drawn from a distribution over [0,1] associated with that arm. We assume contextual side information is available at the start of the episode. This context enables an arm predictor to identify possibly favorable arms, but predictions may be imperfect, so they must be combined with further exploration during the episode. Our setting is an alternative to classical multi-armed bandits, which provide no contextual side information, and to contextual bandits, which provide new context on each individual trial. Multi-armed bandits with episode context arise naturally, for example in computer Go, where context is used to bias move decisions made by a multi-armed bandit algorithm. The UCB1 algorithm for multi-armed bandits achieves worst-case regret bounded by \(O\left(\sqrt{Kn\log(n)}\right)\). We seek to improve on this using episode context, particularly when K is large. Using a predictor that places weight \(M_i > 0\) on arm i, with weights summing to 1, we present the PUCB algorithm, which achieves regret \(O\left(\frac{1}{M_{\ast}}\sqrt{n\log(n)}\right)\), where \(M_{\ast}\) is the weight on the optimal arm. We illustrate the behavior of PUCB with small simulation experiments, present extensions that provide additional capabilities for PUCB, and describe methods for obtaining suitable predictors for use with PUCB.
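The idea of biasing exploration with predictor weights can be illustrated with a short simulation. The sketch below is illustrative only, assuming Bernoulli arms: it compares plain UCB1 against a simplified predictor-guided selection rule in which an arm's exploration bonus is scaled by its weight \(M_i\), so that low-weight arms are explored less. This scaled-bonus rule is a stand-in for the flavor of the approach, not the paper's exact PUCB formula.

```python
import math
import random

def ucb1_select(counts, means, t):
    # UCB1: pick the arm maximizing empirical mean plus exploration bonus.
    # Each arm is tried once before the bonus rule applies.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))

def weighted_select(counts, means, weights, t, c=1.0):
    # Illustrative predictor-guided rule (NOT the paper's exact PUCB formula):
    # the exploration bonus of arm i is scaled by its predictor weight M_i,
    # so arms the predictor favors are explored more aggressively.
    for i, n in enumerate(counts):
        if n == 0 and weights[i] > 0:
            return i
    return max(range(len(counts)),
               key=lambda i: means[i]
               + c * weights[i] * math.sqrt(math.log(t) / max(counts[i], 1)))

def run_episode(select, payoffs, n):
    # One episode of n trials over K Bernoulli arms with the given success
    # probabilities; returns total payoff and per-arm pull counts.
    K = len(payoffs)
    counts, means, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, n + 1):
        i = select(counts, means, t)
        r = 1.0 if random.random() < payoffs[i] else 0.0  # payoff in [0,1]
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total += r
    return total, counts

random.seed(0)
payoffs = [0.4, 0.5, 0.9, 0.3]    # arm 2 is optimal
weights = [0.05, 0.05, 0.8, 0.1]  # predictor places most weight on arm 2

ucb_total, ucb_counts = run_episode(ucb1_select, payoffs, 2000)
w_total, w_counts = run_episode(
    lambda c, m, t: weighted_select(c, m, weights, t), payoffs, 2000)
print(ucb_total, w_total)
```

With an accurate predictor, the weighted rule typically concentrates pulls on the optimal arm sooner than UCB1, which must pay an exploration cost on all K arms; this is the effect that the \(\frac{1}{M_{\ast}}\sqrt{n\log(n)}\) bound captures.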
Cite this article
Rosin, C.D. Multi-armed bandits with episode context. Ann Math Artif Intell 61, 203–230 (2011). https://doi.org/10.1007/s10472-011-9258-6