Interactive Thompson Sampling for Multi-objective Multi-armed Bandits

  • Conference paper
Algorithmic Decision Theory (ADT 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10576)

Abstract

In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for unknown utility functions of users, based on the stochastic reward vectors only. In online MORL, on the other hand, the agent will often be able to elicit preferences from the user, enabling it to learn about the utility function of its user directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting, perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms, Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS). We show empirically that the regret of these algorithms closely approximates that of UCB and regular Thompson sampling when those are given the ground-truth utility function of the user from the start, and that ITS outperforms umap-UCB.
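
To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of regular Thompson sampling on a MOMAB when the user's utility function is known from the start. It does not attempt to reproduce ITS or umap-UCB. Several things here are assumptions made for illustration and are not taken from the abstract: the utility is a linear scalarisation with weights `w_true`, rewards are Gaussian with a known standard deviation `noise_sd`, the posterior over each arm's mean vector is the flat-prior Gaussian, and the function name and arm values are hypothetical.

```python
import math
import random


def thompson_known_utility(arm_means, w_true, rounds=500, noise_sd=0.1, seed=0):
    """Thompson sampling on a MOMAB with known linear utility weights (assumed)."""
    rng = random.Random(seed)
    n_arms, n_obj = len(arm_means), len(w_true)
    counts = [0] * n_arms                          # pulls per arm
    sums = [[0.0] * n_obj for _ in range(n_arms)]  # summed reward vectors per arm

    def utility(vec):
        # Assumed linear utility: weighted sum over objectives.
        return sum(w * v for w, v in zip(w_true, vec))

    def pull(arm):
        # Draw a stochastic reward vector and update the arm's statistics.
        reward = [rng.gauss(m, noise_sd) for m in arm_means[arm]]
        counts[arm] += 1
        for j, r in enumerate(reward):
            sums[arm][j] += r

    for arm in range(n_arms):  # pull every arm once to initialise
        pull(arm)

    for _ in range(rounds - n_arms):
        sampled = []
        for arm in range(n_arms):
            # Sample a plausible mean vector from the Gaussian posterior,
            # then score it with the known utility function.
            sd = noise_sd / math.sqrt(counts[arm])
            theta = [rng.gauss(s / counts[arm], sd) for s in sums[arm]]
            sampled.append(utility(theta))
        pull(max(range(n_arms), key=sampled.__getitem__))
    return counts


# Hypothetical three-arm, two-objective problem; arm 0 has the highest utility.
counts = thompson_known_utility([(0.9, 0.2), (0.2, 0.9), (0.5, 0.5)], (0.7, 0.3))
```

Under these assumptions the pulls concentrate on the arm with the highest utility. In the interactive setting studied in the paper the weights are not given, so the agent must learn about the utility function from the user's preferences as well.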

Notes

  1. We note that logistic regression based on maximum likelihood (ML) can lead to problems in the early iterations of umap-UCB, when little data is available. We observed this empirically: in early iterations, umap-UCB with ML logistic regression instead of Bayesian logistic regression produces an estimate, \(\bar{\mathbf{w}}\), with a near-infinite weight on one objective, such that no further comparison will be asked of the user. This can be prevented with a reasonable choice of prior in Bayesian logistic regression.

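  2. \(\mathbb{E}[\mathbf{x} \cdot \mathbf{y}] = \mathbb{E}[\mathbf{x}] \cdot \mathbb{E}[\mathbf{y}]\) if \(\mathbf{x}\) and \(\mathbf{y}\) are independent. Independence is sufficient but not necessary: the identity also holds, for example, when the corresponding components of \(\mathbf{x}\) and \(\mathbf{y}\) are merely uncorrelated.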

  3. Please note that to obtain zero regret, the MAP estimate \(\bar{\mathbf{w}}\) need not be identical to the ground truth \(\mathbf{w}^*\), as long as it leads to selecting the same arm.
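
Notes 1 and 3 can be illustrated with a small numerical sketch. It fits logistic regression on a single pairwise comparison, once by maximum likelihood and once with a Gaussian prior (a MAP estimate), and then checks which arm each weight vector selects. Everything concrete here is an assumption for illustration rather than the paper's setup: the linear utility, the plain gradient-ascent fit, the prior mean `(0.5, 0.5)` with unit variance, the arm values, and the helper names `fit_weights` and `best_arm`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weights(diffs, prior_mean=None, prior_var=1.0, lr=0.5, steps=2000):
    """Gradient ascent for logistic regression on preference comparisons.

    Each entry of `diffs` is a - b for a comparison in which the user
    preferred reward vector a over b. With prior_mean=None this is plain
    maximum likelihood; otherwise a Gaussian prior is added (MAP estimate).
    """
    dim = len(diffs[0])
    w = list(prior_mean) if prior_mean is not None else [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for d in diffs:
            z = sum(wi * di for wi, di in zip(w, d))
            for j in range(dim):
                grad[j] += (1.0 - sigmoid(z)) * d[j]  # log-likelihood gradient
        if prior_mean is not None:
            for j in range(dim):
                grad[j] -= (w[j] - prior_mean[j]) / prior_var  # log-prior gradient
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w


def best_arm(arm_means, w):
    # Arm with the highest (assumed linear) utility under weights w.
    utilities = [sum(wi * mi for wi, mi in zip(w, m)) for m in arm_means]
    return max(range(len(arm_means)), key=utilities.__getitem__)


arm_means = [(0.9, 0.2), (0.2, 0.9), (0.5, 0.5)]  # hypothetical expected rewards
w_true = (0.7, 0.3)                               # hypothetical ground truth
diffs = [(0.7, -0.7)]  # one comparison: the user prefers arm 0 over arm 1

w_ml = fit_weights(diffs)                          # no prior: weights keep growing
w_map = fit_weights(diffs, prior_mean=(0.5, 0.5))  # prior keeps weights bounded

print("ML norm:", math.hypot(*w_ml), "MAP norm:", math.hypot(*w_map))
print("arm under MAP:", best_arm(arm_means, w_map),
      "arm under ground truth:", best_arm(arm_means, w_true))
```

With a single comparison the data are perfectly separable, so the maximum-likelihood weights keep growing as more gradient steps are taken, while the prior pulls the MAP estimate towards a bounded solution. In this example the MAP weights differ from the ground truth yet select the same arm, which is the situation in which note 3 says zero regret is still obtained.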

Acknowledgements

The first author is a postdoctoral fellow of the Research Foundation – Flanders (FWO). This research was in part supported by Innoviris – Brussels Institute for Research and Innovation.

Author information

Corresponding author

Correspondence to Diederik M. Roijers.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Roijers, D.M., Zintgraf, L.M., Nowé, A. (2017). Interactive Thompson Sampling for Multi-objective Multi-armed Bandits. In: Rothe, J. (ed.) Algorithmic Decision Theory. ADT 2017. Lecture Notes in Computer Science, vol 10576. Springer, Cham. https://doi.org/10.1007/978-3-319-67504-6_2

  • DOI: https://doi.org/10.1007/978-3-319-67504-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67503-9

  • Online ISBN: 978-3-319-67504-6

  • eBook Packages: Computer Science, Computer Science (R0)
