Abstract
The exploration/exploitation (E/E) dilemma arises naturally in many subfields of science. Multi-armed bandit problems formalize this dilemma in its canonical form. Most current research in this field focuses on generic solutions that can be applied to a wide range of problems. In practice, however, a form of prior information about the specific class of target problems is often available. Such prior knowledge is rarely used in current solutions, due to the lack of a systematic approach for incorporating it into the E/E strategy.
To address a specific class of E/E problems, we propose to proceed in three steps: (i) model prior knowledge in the form of a probability distribution over the target class of E/E problems; (ii) choose a large hypothesis space of candidate E/E strategies; and (iii), solve an optimization problem to find a candidate E/E strategy of maximal average performance over a sample of problems drawn from the prior distribution.
We illustrate this meta-learning approach with two different hypothesis spaces: one where E/E strategies are numerically parameterized and another where E/E strategies are represented as small symbolic formulas. We propose appropriate optimization algorithms for both cases. Our experiments, with two-armed Bernoulli bandit problems and various playing budgets, show that the meta-learnt E/E strategies outperform generic strategies from the literature (UCB1, UCB1-Tuned, UCB-V, KL-UCB and ε_n-Greedy); they also evaluate the robustness of the learnt E/E strategies through tests on arms whose rewards follow a truncated Gaussian distribution.
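The three-step procedure outlined in the abstract can be illustrated with a minimal sketch. The code below is an illustrative toy, not the paper's method: it assumes a uniform prior over two-armed Bernoulli problems, uses a one-parameter UCB-style index family as the hypothesis space (far simpler than the paper's numeric and symbolic spaces), and replaces the paper's optimization algorithms with exhaustive search over a small candidate set. All function names and parameter values are invented for this sketch.

```python
import random

def sample_problem(rng):
    # Step (i): prior over the target class of E/E problems.
    # Assumed here: each arm's success probability is uniform on [0, 1].
    return [rng.random(), rng.random()]

def play_index_strategy(c, probs, budget, rng):
    # Step (ii): a toy one-parameter hypothesis space of E/E strategies.
    # Pull the arm maximising  empirical_mean + c / sqrt(pulls),
    # a simplified UCB-style index (not one of the paper's spaces).
    counts = [0, 0]
    sums = [0.0, 0.0]
    total = 0.0
    for t in range(budget):
        if t < 2:
            arm = t  # initialise: play each arm once
        else:
            arm = max((0, 1),
                      key=lambda a: sums[a] / counts[a] + c / counts[a] ** 0.5)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

def meta_learn(candidates, n_problems=200, budget=100, seed=0):
    # Step (iii): select the candidate strategy with maximal average
    # performance over problems drawn from the prior distribution.
    rng = random.Random(seed)
    problems = [sample_problem(rng) for _ in range(n_problems)]
    def avg_reward(c):
        return sum(play_index_strategy(c, p, budget, rng)
                   for p in problems) / n_problems
    return max(candidates, key=avg_reward)

best_c = meta_learn([0.0, 0.5, 1.0, 2.0])
```

In the paper itself, step (iii) is carried out with dedicated optimizers (e.g. estimation-of-distribution or cross-entropy-style search for the numeric space, and search over small symbolic formulas); the grid search above merely shows where such an optimizer plugs in.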
References
Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of The American Mathematical Society 58, 527–536 (1952)
Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 4–22 (1985)
Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27, 1054–1078 (1995)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multi-armed bandit problem. Machine Learning 47, 235–256 (2002)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Tuning Bandit Algorithms in Stochastic Environments. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007. LNCS (LNAI), vol. 4754, pp. 150–165. Springer, Heidelberg (2007)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science (2008)
Maes, F., Wehenkel, L., Ernst, D.: Learning to play K-armed bandit problems. In: Proc. of the 4th International Conference on Agents and Artificial Intelligence (2012)
Maes, F., Wehenkel, L., Ernst, D.: Automatic Discovery of Ranking Formulas for Playing with Multi-armed Bandits. In: Sanner, S., Hutter, M. (eds.) EWRL 2011. LNCS, vol. 7188, pp. 5–17. Springer, Heidelberg (2012)
Gonzalez, C., Lozano, J., Larrañaga, P.: Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2002)
Pelikan, M., Mühlenbein, H.: Marginal distributions in evolutionary algorithms. In: Proceedings of the 4th International Conference on Genetic Algorithms (1998)
Bubeck, S., Munos, R., Stoltz, G.: Pure Exploration in Multi-armed Bandits Problems. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 23–37. Springer, Heidelberg (2009)
Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C.: X-armed bandits. Journal of Machine Learning Research 12, 1655–1695 (2011)
Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. CoRR abs/1102.2490 (2011)
Rubinstein, R., Kroese, D.: The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning. Springer, New York (2004)
Castronovo, M., Maes, F., Fonteneau, R., Ernst, D.: Learning exploration/exploitation strategies for single trajectory reinforcement learning. In: Proc. of 10th European Workshop on Reinforcement Learning (2012)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Maes, F., Wehenkel, L., Ernst, D. (2013). Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case. In: Filipe, J., Fred, A. (eds) Agents and Artificial Intelligence. ICAART 2012. Communications in Computer and Information Science, vol 358. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36907-0_7
Print ISBN: 978-3-642-36906-3
Online ISBN: 978-3-642-36907-0