Abstract
In interactive multi-objective reinforcement learning (MORL), an agent must simultaneously learn about the environment and about the preferences of the user, in order to quickly zoom in on those decisions that the user is likely to prefer. In this paper we study interactive MORL in the context of multi-objective multi-armed bandits. In contrast to earlier approaches to interactive MORL, which require the user's utility to be expressed as a weighted sum of the values for each objective, we make no such stringent a priori assumptions. Specifically, we not only allow non-linear preferences, but also obviate the need to specify the exact model class into which the utility function must fall. To achieve this, we propose a new approach called Gaussian-process Utility Thompson Sampling (GUTS). GUTS employs parameterless Bayesian learning to allow for any type of utility function, exploits monotonicity information, and limits the number of queries posed to the user by ensuring that questions are statistically significant. We show empirically that GUTS can learn non-linear preferences, and that both the regret and the number of queries posed to the user are highly sub-linear in the number of arm pulls. (A preliminary version of this work was presented at the ALA workshop in 2018 [20].)
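To make the interaction loop described above concrete, the following is a minimal, hypothetical sketch of a GUTS-style loop in Python. It is not the authors' implementation: it assumes Gaussian posteriors over each arm's mean reward vector, uses plain GP regression as a crude stand-in for a pairwise-preference likelihood, queries the user on a fixed schedule rather than via the statistical-significance test GUTS uses, and omits the monotonicity information; `TRUE_MEANS` and `true_utility` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-objective bandit: the true mean reward vector of each arm.
TRUE_MEANS = np.array([[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]])
N_ARMS, N_OBJ = TRUE_MEANS.shape

def rbf_kernel(A, B, ls=0.5, var=1.0):
    """Squared-exponential kernel between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

def gp_sample(X_train, y_train, X_test, noise=1e-2):
    """Draw one function sample from a GP regression posterior at X_test."""
    if len(X_train) == 0:  # no preference data yet: sample from the prior
        K = rbf_kernel(X_test, X_test) + 1e-8 * np.eye(len(X_test))
        return rng.multivariate_normal(np.zeros(len(X_test)), K)
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_train, X_test)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y_train
    cov = rbf_kernel(X_test, X_test) - Ks.T @ sol + 1e-8 * np.eye(len(X_test))
    return rng.multivariate_normal(mu, cov)

def true_utility(v):
    """Hidden, non-linear user utility; used only to answer queries."""
    return v[0] * v[1]  # illustrative assumption

counts = np.zeros(N_ARMS)
sums = np.zeros((N_ARMS, N_OBJ))
pref_X, pref_y = [], []  # utility training data gathered from the user

for t in range(1, 201):
    # 1. Thompson-sample a mean reward vector for every arm.
    post_mean = sums / np.maximum(counts, 1)[:, None]
    post_std = 1.0 / np.sqrt(counts + 1.0)[:, None]
    sampled = post_mean + post_std * rng.standard_normal((N_ARMS, N_OBJ))

    # 2. Thompson-sample a utility function from the GP posterior and pull
    #    the arm whose sampled reward vector maximises the sampled utility.
    u = gp_sample(np.array(pref_X).reshape(-1, N_OBJ), np.array(pref_y), sampled)
    arm = int(np.argmax(u))

    # 3. Observe a noisy vector-valued reward and update the arm posterior.
    reward = TRUE_MEANS[arm] + 0.1 * rng.standard_normal(N_OBJ)
    counts[arm] += 1
    sums[arm] += reward

    # 4. Occasionally ask the user to compare two estimated reward vectors.
    #    (GUTS only poses statistically significant questions; this fixed
    #    schedule is purely for illustration.)
    if t % 20 == 0:
        a, b = rng.choice(N_ARMS, size=2, replace=False)
        va, vb = post_mean[a], post_mean[b]
        winner, loser = (va, vb) if true_utility(va) >= true_utility(vb) else (vb, va)
        pref_X += [winner.copy(), loser.copy()]
        pref_y += [1.0, 0.0]  # regression stand-in for a preference likelihood

print("posterior mean reward vectors:\n", post_mean)
```

The idea the sketch preserves is the double sampling step: one draw from the reward posterior and one from the utility posterior jointly determine which arm to pull, so exploration covers uncertainty in both the environment and the user's preferences.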
Notes
1. See, e.g., [17] for a reference implementation of \(\mathtt {PPrune}\); a minimal illustrative sketch follows below.
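For readers unfamiliar with \(\mathtt {PPrune}\): it removes all Pareto-dominated vectors from a set of value vectors. The sketch below is a hypothetical quadratic-time version for illustration only; the implementation in [17] is the reference one.

```python
import numpy as np

def p_prune(vectors):
    """Return the Pareto front of a set of value vectors (maximisation).

    A vector u is dominated if some v is at least as good in every
    objective and strictly better in at least one. Hypothetical sketch;
    see [17] for the reference implementation.
    """
    return [u for u in vectors
            if not any(np.all(v >= u) and np.any(v > u) for v in vectors)]

vecs = [np.array(v) for v in [(0.8, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]]
print(p_prune(vecs))  # (0.4, 0.4) is pruned: it is dominated by (0.5, 0.5)
```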
References
Agrawal, S., Goyal, N.: Analysis of Thompson sampling for the multi-armed bandit problem. In: COLT, pp. 39.1–39.26 (2012)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)
Auer, P., Chiang, C.K., Ortner, R., Drugan, M.M.: Pareto front identification from stochastic bandit feedback. In: AISTATS, pp. 939–947 (2016)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599 (2010)
Chu, W., Ghahramani, Z.: Preference learning with Gaussian processes. In: ICML, pp. 137–144 (2005)
Drugan, M.M., Nowé, A.: Designing multi-objective multi-armed bandits algorithms: a study. In: IJCNN, pp. 1–8. IEEE (2013)
Drugan, M.M.: PAC models in stochastic multi-objective multi-armed bandits. In: GECCO, pp. 409–416 (2017)
Forgas, J.P.: Mood and judgment: the affect infusion model (AIM). Psychol. Bull. 117(1), 39–66 (1995)
Hotelling, H.: The generalization of Student’s ratio. Ann. Math. Stat. 2, 360–378 (1931)
Lampinen, J.: Gaussian processes with monotonicity constraint for big data (2014)
Libin, P., Verstraeten, T., Roijers, D.M., Wang, W., Theys, K., Nowé, A.: Bayesian anytime m-top exploration. In: ICTAI, pp. 1422–1428 (2019)
Libin, P.J., et al.: Bayesian best-arm identification for selecting influenza mitigation strategies. In: ECML-PKDD, pp. 456–471 (2018)
Lunn, D., Jackson, C., Best, N., Thomas, A., Spiegelhalter, D.: The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press, Boca Raton (2012)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
Riihimäki, J., Vehtari, A.: Gaussian processes with monotonicity information. In: AISTATS, pp. 645–652 (2010)
Roijers, D.M.: Multi-Objective Decision-Theoretic Planning. Ph.D. thesis, University of Amsterdam (2016)
Roijers, D.M., Vamplew, P., Whiteson, S., Dazeley, R.: A survey of multi-objective sequential decision-making. JAIR 48, 67–113 (2013)
Roijers, D.M., Zintgraf, L.M., Nowé, A.: Interactive Thompson sampling for multi-objective multi-armed bandits. In: Algorithmic Decision Theory, pp. 18–34 (2017)
Roijers, D.M., Zintgraf, L.M., Libin, P., Nowé, A.: Interactive multi-objective reinforcement learning in multi-armed bandits for any utility function. In: ALA workshop at FAIM (2018)
Siegel, S.: Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York (1956)
Sirakaya, E., Petrick, J., Choi, H.S.: The role of mood on tourism product evaluations. Ann. Tourism Res. 31(3), 517–539 (2004)
Soulsby, R.L., Thomas, J.A.: Insect population curves: modelling and application to butterfly transect data. Methods Ecol. Evol. 3(5), 832–841 (2012)
Tesauro, G.: Connectionist learning of expert preferences by comparison training. NeurIPS 1, 99–106 (1988)
Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933)
Ustyuzhaninov, I., Kazlauskaite, I., Ek, C.H., Campbell, N.D.: Monotonic Gaussian process flow. arXiv preprint arXiv:1905.12930 (2019)
Wu, H., Liu, X.: Double Thompson sampling for dueling bandits. In: NeurIPS, pp. 649–657 (2016)
Yahyaa, S.Q., Drugan, M.M., Manderick, B.: Thompson sampling in the adaptive linear scalarized multi-objective multi-armed bandit. In: ICAART, pp. 55–65 (2015)
Zintgraf, L.M., Roijers, D.M., Linders, S., Jonker, C.M., Nowé, A.: Ordered preference elicitation strategies for supporting multi-objective decision making. In: AAMAS, pp. 1477–1485 (2018)
Zoghi, M., Whiteson, S., Munos, R., De Rijke, M.: Relative upper confidence bound for the k-armed dueling bandit problem. In: ICML, pp. 10–18 (2014)
Acknowledgment
This research received funding from the Flemish Government (AI Research Program). Pieter Libin and Eugenio Bargiacchi were supported by a PhD grant of the FWO (Fonds Wetenschappelijk Onderzoek – Vlaanderen).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Roijers, D.M., Zintgraf, L.M., Libin, P., Reymond, M., Bargiacchi, E., Nowé, A. (2021). Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12459. Springer, Cham. https://doi.org/10.1007/978-3-030-67664-3_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67663-6
Online ISBN: 978-3-030-67664-3