Abstract
The exploration/exploitation (E/E) dilemma arises naturally in many subfields of science. Multi-armed bandit problems formalize this dilemma in its canonical form. Most current research in this field focuses on generic solutions that can be applied to a wide range of problems. In practice, however, a form of prior information about the specific class of target problems is often available. Such prior knowledge is rarely used in current solutions, due to the lack of a systematic approach for incorporating it into the E/E strategy.
To address a specific class of E/E problems, we propose to proceed in three steps: (i) model prior knowledge in the form of a probability distribution over the target class of E/E problems; (ii) choose a large hypothesis space of candidate E/E strategies; and (iii), solve an optimization problem to find a candidate E/E strategy of maximal average performance over a sample of problems drawn from the prior distribution.
We illustrate this meta-learning approach with two different hypothesis spaces: one where E/E strategies are numerically parameterized and another where E/E strategies are represented as small symbolic formulas. We propose appropriate optimization algorithms for both cases. Our experiments, with two-armed Bernoulli bandit problems and various playing budgets, show that the meta-learnt E/E strategies outperform generic strategies from the literature (UCB1, UCB1-Tuned, UCB-V, KL-UCB and ε_n-Greedy); they also evaluate the robustness of the learnt E/E strategies through tests on arms whose rewards follow a truncated Gaussian distribution.
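The three-step procedure outlined in the abstract can be illustrated with a minimal sketch. The code below is an illustrative toy, not the paper's method: it assumes a uniform prior over two-armed Bernoulli problems, uses a one-parameter UCB-style index family as the hypothesis space (far simpler than the paper's numeric and symbolic spaces), and replaces the paper's optimization algorithms with exhaustive search over a small candidate set. All function names and parameter values are invented for this sketch.

```python
import random

def sample_problem(rng):
    # Step (i): prior over the target class of E/E problems.
    # Assumed here: each arm's success probability is uniform on [0, 1].
    return [rng.random(), rng.random()]

def play_index_strategy(c, probs, budget, rng):
    # Step (ii): a toy one-parameter hypothesis space of E/E strategies.
    # Pull the arm maximising  empirical_mean + c / sqrt(pulls),
    # a simplified UCB-style index (not one of the paper's spaces).
    counts = [0, 0]
    sums = [0.0, 0.0]
    total = 0.0
    for t in range(budget):
        if t < 2:
            arm = t  # initialise: play each arm once
        else:
            arm = max((0, 1),
                      key=lambda a: sums[a] / counts[a] + c / counts[a] ** 0.5)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

def meta_learn(candidates, n_problems=200, budget=100, seed=0):
    # Step (iii): select the candidate strategy with maximal average
    # performance over problems drawn from the prior distribution.
    rng = random.Random(seed)
    problems = [sample_problem(rng) for _ in range(n_problems)]
    def avg_reward(c):
        return sum(play_index_strategy(c, p, budget, rng)
                   for p in problems) / n_problems
    return max(candidates, key=avg_reward)

best_c = meta_learn([0.0, 0.5, 1.0, 2.0])
```

In the paper itself, step (iii) is carried out with dedicated optimizers (e.g. estimation-of-distribution or cross-entropy-style search for the numeric space, and search over small symbolic formulas); the grid search above merely shows where such an optimizer plugs in.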
References
Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of The American Mathematical Society 58, 527–536 (1952)
Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985)
Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054–1078 (1995)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Machine Learning 47, 235–256 (2002)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 150–165. Springer, Heidelberg (2007)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science (2008)
Maes, F., Wehenkel, L., Ernst, D.: Learning to play K-armed bandit problems. In: Proc. of the 4th International Conference on Agents and Artificial Intelligence (2012)
Maes, F., Wehenkel, L., Ernst, D.: Automatic Discovery of Ranking Formulas for Playing with Multi-armed Bandits. In: Sanner, S., Hutter, M. (eds.) EWRL 2011. LNCS, vol. 7188, pp. 5–17. Springer, Heidelberg (2012)
Gonzalez, C., Lozano, J., Larrañaga, P.: Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2002)
Pelikan, M., Mühlenbein, H.: Marginal distributions in evolutionary algorithms. In: Proceedings of the 4th International Conference on Genetic Algorithms (1998)
Bubeck, S., Munos, R., Stoltz, G.: Pure Exploration in Multi-armed Bandits Problems. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 23–37. Springer, Heidelberg (2009)
Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C.: X-armed bandits. Journal of Machine Learning Research 12, 1655–1695 (2011)
Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. CoRR abs/1102.2490 (2011)
Rubinstein, R., Kroese, D.: The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning. Springer, New York (2004)
Castronovo, M., Maes, F., Fonteneau, R., Ernst, D.: Learning exploration/exploitation strategies for single trajectory reinforcement learning. In: Proc. of 10th European Workshop on Reinforcement Learning (2012)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Maes, F., Wehenkel, L., Ernst, D. (2013). Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case. In: Filipe, J., Fred, A. (eds) Agents and Artificial Intelligence. ICAART 2012. Communications in Computer and Information Science, vol 358. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36907-0_7
Print ISBN: 978-3-642-36906-3
Online ISBN: 978-3-642-36907-0