
Preference-Based Monte Carlo Tree Search

  • Conference paper
  • In: KI 2018: Advances in Artificial Intelligence (KI 2018)

Abstract

Monte Carlo tree search (MCTS) is a popular choice for solving sequential anytime problems. However, it depends on a numeric feedback signal, which can be difficult to define. Real-time MCTS is a variant that may only rarely encounter states with an explicit, extrinsic reward. To deal with such cases, the experimenter has to supply an additional numeric feedback signal in the form of a heuristic, which intrinsically guides the agent. Recent work has provided evidence that in several domains the underlying structure is ordinal rather than numerical. Hence, erroneous and biased heuristics are inevitable, especially in such domains. In this paper, we propose an MCTS variant that depends only on qualitative feedback and therefore opens up new applications for MCTS. We also find indications that translating absolute into ordinal feedback may be beneficial. Using a puzzle domain, we show that our preference-based MCTS variant, which receives only qualitative feedback, reaches a performance level comparable to a regular MCTS baseline that obtains quantitative feedback.
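To make the idea more concrete, here is a minimal, hypothetical Python sketch of how numeric value backups at an MCTS node can be replaced by purely pairwise preference statistics. This is not the algorithm proposed in the paper; the class name, the selection rule (an optimistic, RUCB-flavoured scheme from the dueling-bandits literature), and the toy comparison oracle are all assumptions made for illustration only.

    import math
    import random

    class PreferenceNode:
        """Toy MCTS node that stores pairwise preference counts over its
        children instead of a numeric mean value (illustrative sketch)."""

        def __init__(self, n_children):
            self.n_children = n_children
            # wins[i][j] = how often a rollout through child i was
            # preferred over a rollout through child j.
            self.wins = [[0] * n_children for _ in range(n_children)]

        def update(self, i, j, i_preferred):
            """Back up one qualitative comparison between children i and j."""
            if i_preferred:
                self.wins[i][j] += 1
            else:
                self.wins[j][i] += 1

        def _optimistic_winrate(self, i, j, t, c=2.0):
            """Upper confidence bound on P(child i beats child j)."""
            n = self.wins[i][j] + self.wins[j][i]
            if n == 0:
                return 1.0  # optimism: untried comparisons look maximally good
            return self.wins[i][j] / n + math.sqrt(c * math.log(t) / n)

        def select_pair(self):
            """Pick two children to compare next: an optimistic champion
            and its strongest optimistic challenger."""
            t = sum(map(sum, self.wins)) + 1
            rest = lambda a: [b for b in range(self.n_children) if b != a]
            champion = max(
                range(self.n_children),
                key=lambda a: min(self._optimistic_winrate(a, b, t)
                                  for b in rest(a)))
            challenger = max(
                rest(champion),
                key=lambda b: self._optimistic_winrate(b, champion, t))
            return champion, challenger

    # Toy usage: the oracle only answers "which of these two rollouts
    # looked better?" -- the agent never observes a numeric reward.
    random.seed(0)
    hidden_quality = [0.2, 0.5, 0.8]  # unknown to the agent
    node = PreferenceNode(len(hidden_quality))
    for _ in range(2000):
        i, j = node.select_pair()
        p = hidden_quality[i] / (hidden_quality[i] + hidden_quality[j])
        node.update(i, j, i_preferred=random.random() < p)
    print(node.wins)  # child 2 should dominate the pairwise win counts

Under these assumptions, the node learns exclusively from ordinal comparisons, which is the property a preference-based MCTS variant exploits when no reliable numeric heuristic is available.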


Notes

  1. Note that this is a fair comparison between PB-MCTS and H-MCTS: the former uses more samples per iteration, the latter uses more iterations (see the budget sketch below).
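To make the budget accounting behind this note concrete, consider the following back-of-the-envelope sketch. The two-rollouts-per-preference cost is an assumption for illustration, not a figure taken from the paper:

    # Equal total rollout budget for both agents (illustrative numbers).
    TOTAL_ROLLOUTS = 10_000
    PB_ROLLOUTS_PER_ITER = 2  # assumed: one pairwise preference = two rollouts
    H_ROLLOUTS_PER_ITER = 1   # one heuristic-valued rollout per iteration

    print("PB-MCTS iterations:", TOTAL_ROLLOUTS // PB_ROLLOUTS_PER_ITER)  # 5000
    print("H-MCTS iterations:", TOTAL_ROLLOUTS // H_ROLLOUTS_PER_ITER)    # 10000

With the same overall number of rollouts, the agent that pays more samples per iteration necessarily completes fewer iterations, which is why the comparison is fair.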


Acknowledgments

This work was supported by the German Research Foundation (DFG project number FU 580/10). We gratefully acknowledge the use of the Lichtenberg high performance computer of the TU Darmstadt for our experiments.

Author information


Corresponding author

Correspondence to Tobias Joppen.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Joppen, T., Wirth, C., Fürnkranz, J. (2018). Preference-Based Monte Carlo Tree Search. In: Trollmann, F., Turhan, AY. (eds) KI 2018: Advances in Artificial Intelligence. KI 2018. Lecture Notes in Computer Science (LNAI), vol. 11117. Springer, Cham. https://doi.org/10.1007/978-3-030-00111-7_28


  • DOI: https://doi.org/10.1007/978-3-030-00111-7_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00110-0

  • Online ISBN: 978-3-030-00111-7

  • eBook Packages: Computer Science, Computer Science (R0)
