Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models

  • Conference paper
  • Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020)

Abstract

In interactive multi-objective reinforcement learning (MORL), an agent has to simultaneously learn about the environment and the preferences of the user, in order to quickly zoom in on those decisions that are likely to be preferred by the user. In this paper we study interactive MORL in the context of multi-objective multi-armed bandits. Contrary to earlier approaches to interactive MORL that force the utility of the user to be expressed as a weighted sum of the values for each objective, we do not make such stringent a priori assumptions. Specifically, we not only allow non-linear preferences, but also obviate the need to specify the exact model class into which the utility function must fall. To achieve this, we propose a new approach called Gaussian-process Utility Thompson Sampling (GUTS). GUTS employs parameterless Bayesian learning to allow any type of utility function, exploits monotonicity information, and limits the number of queries posed to the user by ensuring that questions are statistically significant. We show empirically that GUTS can learn non-linear preferences, and that the regret and number of queries posed to the user are highly sub-linear in the number of arm pulls. (A preliminary version of this work was presented at the ALA workshop in 2018 [20].)
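
The abstract compresses the GUTS loop into one sentence; a minimal sketch may help fix ideas. The Python below (NumPy and scikit-learn) is an assumption-laden illustration, not the authors' algorithm: it replaces the paper's preference-based monotonic GP with plain GP regression on scalar utility feedback, and the hidden utility function, noise levels, and query schedule are invented for the example.

```python
# Minimal sketch of a GUTS-style interaction loop. NOT the authors'
# implementation: plain GP regression on scalar utility feedback stands in
# for the paper's preference-based GP with monotonicity information, and
# `true_utility` plus the query schedule are purely illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n_arms, n_obj, horizon = 5, 2, 200
true_means = rng.uniform(size=(n_arms, n_obj))  # hidden multi-objective arm values

def true_utility(v):
    return v[0] * v[1]                           # hidden non-linear user utility

counts = np.zeros(n_arms)                        # per-arm pull counts
sums = np.zeros((n_arms, n_obj))                 # per-arm reward sums
X_pref, y_pref = [], []                          # (reward vector, utility) pairs
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)

for t in range(horizon):
    # 1. Thompson-sample a mean value vector for each arm from its posterior.
    mean = sums / np.maximum(counts, 1)[:, None]
    std = 1.0 / np.sqrt(np.maximum(counts, 1))
    samples = mean + std[:, None] * rng.standard_normal((n_arms, n_obj))

    # 2. Thompson-sample a utility function from the GP posterior and pull
    #    the arm whose sampled vector maximises the sampled utility.
    if X_pref:
        gp.fit(np.array(X_pref), np.array(y_pref))
        u = gp.sample_y(samples, random_state=int(rng.integers(2**31))).ravel()
    else:
        u = samples.sum(axis=1)                  # prior guess: linear utility
    arm = int(np.argmax(u))

    # 3. Observe a noisy multi-objective reward and update the posterior.
    reward = true_means[arm] + 0.1 * rng.standard_normal(n_obj)
    counts[arm] += 1
    sums[arm] += reward

    # 4. Occasionally query the (simulated) user. The paper instead asks
    #    comparison queries, and only when they are statistically warranted.
    if t % 20 == 0:
        X_pref.append(reward)
        y_pref.append(true_utility(reward))

print("most-pulled arm:", int(np.argmax(counts)))
```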


Notes

  1. See e.g. [17] for a reference implementation of \(\mathtt{PPrune}\).
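
As a rough illustration of what \(\mathtt{PPrune}\) computes, the sketch below keeps only the value vectors that are not Pareto-dominated by another vector in the set. It is a plain quadratic-time formulation, not necessarily the reference implementation from [17].

```python
# Minimal sketch of Pareto pruning: keep only the value vectors that are
# not Pareto-dominated by any other vector in the set. A straightforward
# O(n^2) formulation, not necessarily the implementation in [17].
import numpy as np

def pprune(vectors):
    """Return the Pareto-optimal subset of a set of value vectors.

    u dominates v iff u >= v in every objective and u > v in at least
    one; higher is assumed better in all objectives.
    """
    vectors = np.asarray(vectors, dtype=float)
    keep = []
    for i, v in enumerate(vectors):
        dominated = any(
            np.all(u >= v) and np.any(u > v)
            for j, u in enumerate(vectors) if j != i
        )
        if not dominated:
            keep.append(v)
    return np.array(keep)

# Example: (0.5, 0.6) is dominated by (1.0, 0.8) and gets pruned.
print(pprune([[1.0, 0.8], [0.5, 0.6], [0.2, 1.0]]))
```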

References

  1. Agrawal, S., Goyal, N.: Analysis of Thompson sampling for the multi-armed bandit problem. In: COLT, pp. 39–1 (2012)

  2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)

  3. Auer, P., Chiang, C.K., Ortner, R., Drugan, M.M.: Pareto front identification from stochastic bandit feedback. In: AISTATS, pp. 939–947 (2016)

  4. Bishop, C.M.: Pattern Recognition and Machine Learning (2006)

  5. Brochu, E., Cora, V.M., De Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599 (2010)

  6. Chu, W., Ghahramani, Z.: Preference learning with Gaussian processes. In: ICML, pp. 137–144 (2005)

  7. Drugan, M.M., Nowé, A.: Designing multi-objective multi-armed bandits algorithms: a study. In: IJCNN, pp. 1–8. IEEE (2013)

  8. Drugan, M.M.: PAC models in stochastic multi-objective multi-armed bandits. In: GEC, pp. 409–416 (2017)

  9. Forgas, J.P.: Mood and judgment: the affect infusion model (AIM). Psychol. Bull. 117(1), 39 (1995)

  10. Hotelling, H.: The generalization of Student’s ratio. Ann. Math. Stat. 2(3), 360–378 (1931)

  11. Lampinen, J.: Gaussian processes with monotonicity constraint for big data (2014)

  12. Libin, P., Verstraeten, T., Roijers, D.M., Wang, W., Theys, K., Nowé, A.: Bayesian anytime m-top exploration. In: ICTAI, pp. 1422–1428 (2019)

  13. Libin, P.J., et al.: Bayesian best-arm identification for selecting influenza mitigation strategies. In: ECML-PKDD, pp. 456–471 (2018)

  14. Lunn, D., Jackson, C., Best, N., Thomas, A., Spiegelhalter, D.: The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press, Boca Raton (2012)

  15. Rasmussen, C.E.: Gaussian processes for machine learning (2006)

  16. Riihimäki, J., Vehtari, A.: Gaussian processes with monotonicity information. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 645–652 (2010)

  17. Roijers, D.M.: Multi-Objective Decision-Theoretic Planning. Ph.D. thesis, University of Amsterdam (2016)

  18. Roijers, D.M., Vamplew, P., Whiteson, S., Dazeley, R.: A survey of multi-objective sequential decision-making. JAIR 48, 67–113 (2013)

  19. Roijers, D.M., Zintgraf, L.M., Nowé, A.: Interactive Thompson sampling for multi-objective multi-armed bandits. In: Algorithmic Decision Theory, pp. 18–34 (2017)

  20. Roijers, D.M., Zintgraf, L.M., Libin, P., Nowé, A.: Interactive multi-objective reinforcement learning in multi-armed bandits for any utility function. In: ALA workshop at FAIM (2018)

  21. Siegel, S.: Nonparametric statistics for the behavioral sciences (1956)

  22. Sirakaya, E., Petrick, J., Choi, H.S.: The role of mood on tourism product evaluations. Ann. Tourism Res. 31(3), 517–539 (2004)

  23. Soulsby, R.L., Thomas, J.A.: Insect population curves: modelling and application to butterfly transect data. Methods Ecol. Evol. 3(5), 832–841 (2012)

  24. Tesauro, G.: Connectionist learning of expert preferences by comparison training. NeurIPS 1, 99–106 (1988)

  25. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933)

  26. Ustyuzhaninov, I., Kazlauskaite, I., Ek, C.H., Campbell, N.D.: Monotonic Gaussian process flow. arXiv preprint arXiv:1905.12930 (2019)

  27. Wu, H., Liu, X.: Double Thompson sampling for dueling bandits. In: NeurIPS, pp. 649–657 (2016)

  28. Yahyaa, S.Q., Drugan, M.M., Manderick, B.: Thompson sampling in the adaptive linear scalarized multi objective multi armed bandit. In: ICAART, pp. 55–65 (2015)

  29. Zintgraf, L.M., Roijers, D.M., Linders, S., Jonker, C.M., Nowé, A.: Ordered preference elicitation strategies for supporting multi-objective decision making. In: AAMAS, pp. 1477–1485 (2018)

  30. Zoghi, M., Whiteson, S., Munos, R., De Rijke, M.: Relative upper confidence bound for the k-armed dueling bandit problem. In: ICML, pp. 10–18 (2014)


Acknowledgment

This research received funding from the Flemish Government (AI Research Program). Pieter Libin and Eugenio Bargiacchi were supported by a PhD grant of the FWO (Fonds Wetenschappelijk Onderzoek – Vlaanderen).

Author information

Corresponding author: Diederik M. Roijers.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Roijers, D.M., Zintgraf, L.M., Libin, P., Reymond, M., Bargiacchi, E., Nowé, A. (2021). Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol. 12459. Springer, Cham. https://doi.org/10.1007/978-3-030-67664-3_28

  • DOI: https://doi.org/10.1007/978-3-030-67664-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67663-6

  • Online ISBN: 978-3-030-67664-3

  • eBook Packages: Computer Science (R0)
