Abstract
In multi-objective reinforcement learning (MORL), much attention is paid to generating optimal solution sets for the unknown utility functions of users, based only on the stochastic reward vectors. In online MORL, on the other hand, the agent can often elicit preferences from the user, enabling it to learn about its user's utility function directly. In this paper, we study online MORL with user interaction in the multi-objective multi-armed bandit (MOMAB) setting, perhaps the most fundamental MORL setting. We use Bayesian learning algorithms to learn about the environment and the user simultaneously. Specifically, we propose two algorithms, Utility-MAP UCB (umap-UCB) and Interactive Thompson Sampling (ITS), and show empirically that their regret closely approximates the regret of UCB and regular Thompson sampling provided with the ground-truth utility function of the user from the start, and that ITS outperforms umap-UCB.
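To make the algorithmic idea more concrete, the following is a minimal sketch of an interactive Thompson-sampling-style loop for a MOMAB, assuming Gaussian posteriors over the arms' mean reward vectors, a linear utility \(u(\boldsymbol{\mu}) = \mathbf{w} \cdot \boldsymbol{\mu}\), a particle approximation of the posterior over \(\mathbf{w}\), and a simple rule for when to query the user. These modelling choices and all names are illustrative assumptions, not the exact ITS algorithm from the paper.

```python
# Illustrative sketch only: an interactive Thompson-sampling-style loop for a
# multi-objective multi-armed bandit (MOMAB) with a linear utility u(mu) = w . mu.
# The Gaussian reward posteriors, the particle posterior over the utility weights,
# the pairwise-query rule, and the simulated logistic user are all simplifying
# assumptions for illustration, not the exact ITS algorithm from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_obj, horizon = 5, 2, 2000

true_means = rng.uniform(0.0, 1.0, size=(n_arms, n_obj))  # unknown mean reward vectors
w_true = np.array([0.7, 0.3])                              # unknown user utility weights

# Posterior over each arm's mean reward vector: Gaussian, assuming known noise.
post_mean = np.zeros((n_arms, n_obj))
post_count = np.zeros(n_arms)

# Posterior over the utility weights: particles on the weight simplex.
particles = rng.dirichlet(np.ones(n_obj), size=500)
log_wts = np.zeros(len(particles))

def sample_weights():
    """Draw one utility-weight vector from the (approximate) posterior."""
    p = np.exp(log_wts - log_wts.max())
    return particles[rng.choice(len(particles), p=p / p.sum())]

for t in range(horizon):
    # Thompson step: sample reward means and utility weights, pick the best arm.
    scale = 1.0 / np.sqrt(post_count[:, None] + 1.0)
    sampled_means = rng.normal(post_mean, scale)
    w = sample_weights()
    utilities = sampled_means @ w
    arm = int(np.argmax(utilities))

    # Occasionally ask the user to compare the two most promising arms and
    # update the weight posterior with the (noisy) answer.
    if t % 50 == 0:
        best, second = np.argsort(utilities)[::-1][:2]
        diff = sampled_means[best] - sampled_means[second]
        prefers_best = rng.random() < 1.0 / (1.0 + np.exp(-w_true @ diff))
        signed = diff if prefers_best else -diff
        log_wts += -np.log1p(np.exp(-(particles @ signed)))  # logistic log-likelihood

    # Pull the chosen arm and update its posterior mean incrementally.
    reward = true_means[arm] + rng.normal(0.0, 0.1, size=n_obj)
    post_count[arm] += 1
    post_mean[arm] += (reward - post_mean[arm]) / post_count[arm]

print("estimated mean rewards:\n", np.round(post_mean, 2))
posterior_w = np.average(particles, weights=np.exp(log_wts - log_wts.max()), axis=0)
print("posterior-mean utility weights:", np.round(posterior_w, 2))
```

A umap-UCB-style variant would, roughly, replace the sampled weights by the MAP estimate of a Bayesian logistic regression over the comparison outcomes (cf. Notes 1 and 3 below); the sketch above only illustrates the Thompson-sampling side.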
Notes
- 1.
We note that logistic regression based on maximum likelihood can lead to problems in early iterations of umap-UCB, when little data is available; we observed this empirically. Specifically, in early iterations umap-UCB with ML logistic regression instead of Bayesian logistic regression can produce an estimate, \(\bar{\mathbf{w}}\), with a near-infinite weight on one objective, such that no further comparisons will be asked of the user. This can be prevented with a reasonable choice of prior in Bayesian logistic regression (a small illustrative sketch is given after these notes).
- 2.
\(\mathbb {E}[\mathbf{x} \cdot \mathbf{y}]= \mathbb {E}[\mathbf{x}] \cdot \mathbb {E}[\mathbf{y}]\) if \(\mathbf{x}\) and \(\mathbf{y}\) are independent, since independence implies \(\mathbb {E}[x_i y_i] = \mathbb {E}[x_i]\,\mathbb {E}[y_i]\) for each component \(i\).
- 3.
Please note that obtaining 0 regret does not require the MAP estimate \(\bar{\mathbf{w}}\) to be identical to the ground truth \(\mathbf{w}^*\); it suffices that it leads to selecting the same arm.
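To illustrate the point in Note 1, the following is a small sketch contrasting maximum-likelihood logistic regression with a MAP estimate under a Gaussian prior (i.e. L2-penalised logistic regression) on a few made-up pairwise-comparison outcomes. The data, step sizes, and implementation are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch of Note 1: on linearly separable comparison data, maximum-
# likelihood logistic regression drives the weight norm towards infinity, while a
# MAP estimate with a Gaussian prior (L2 penalty) stays bounded. The data below
# are made up for illustration; this is not the paper's implementation.
import numpy as np

# Differences of (estimated) reward vectors, labelled +1 when the user preferred
# the first arm. With few, consistent answers the data are linearly separable.
X = np.array([[0.8, -0.2],
              [0.5, -0.4],
              [0.9, -0.1]])
y = np.array([1.0, 1.0, 1.0])

def fit_logistic(X, y, prior_precision, steps=5000, lr=0.5):
    """Gradient ascent on the (penalised) log-likelihood of a logistic model."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))             # predicted preference prob.
        grad = X.T @ (y - p) - prior_precision * w   # prior_precision = 0 -> plain ML
        w += lr * grad
    return w

w_ml = fit_logistic(X, y, prior_precision=0.0)   # ML: keeps growing with more steps
w_map = fit_logistic(X, y, prior_precision=1.0)  # MAP with N(0, I) prior: bounded

print("ML estimate  ", np.round(w_ml, 2), " norm:", round(float(np.linalg.norm(w_ml)), 2))
print("MAP estimate ", np.round(w_map, 2), " norm:", round(float(np.linalg.norm(w_map)), 2))
```

Running the sketch shows the ML weight norm growing with the number of gradient steps, while the MAP estimate settles at a moderate value, which is the behaviour Note 1 describes.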
Acknowledgements
The first author is a postdoctoral fellow of the Research Foundation – Flanders (FWO). This research was in part supported by Innoviris – Brussels Institute for Research and Innovation.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Roijers, D.M., Zintgraf, L.M., Nowé, A. (2017). Interactive Thompson Sampling for Multi-objective Multi-armed Bandits. In: Rothe, J. (ed.) Algorithmic Decision Theory. ADT 2017. Lecture Notes in Computer Science, vol. 10576. Springer, Cham. https://doi.org/10.1007/978-3-319-67504-6_2
DOI: https://doi.org/10.1007/978-3-319-67504-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67503-9
Online ISBN: 978-3-319-67504-6