Abstract
In this paper, we present a link between preference-based and multiobjective sequential decision-making. While transforming a multiobjective problem into a preference-based one is quite natural, the other direction is less obvious. We show how this transformation (from preference-based to multiobjective) can be carried out under the classic conditions that preferences over histories can be represented by additively decomposable utilities and that the decision criterion used to evaluate policies in a state is based on expectation. This link yields a new source of multiobjective sequential decision-making problems (i.e., problems in which reward values are unknown) and justifies applying solution methods developed in one setting to the other.
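To illustrate the direction emphasized in the abstract, the sketch below shows one standard way, under additively decomposable utilities and the expectation criterion, to recast an MDP whose scalar reward values are unknown as a multiobjective MDP: each unknown reward level becomes one objective, and the expected vector return of a policy determines its expected scalar return for any assignment of the unknown values. This is a minimal sketch with hypothetical names, sizes, and numbers, not the paper's construction verbatim.

```python
import numpy as np

# Minimal sketch (hypothetical data throughout): an MDP whose scalar rewards
# take one of K unknown values theta_1, ..., theta_K is recast as a
# multiobjective MDP with K-dimensional rewards.  Component k of the vector
# reward indicates that a transition earns the unknown value theta_k, so the
# expected discounted vector return of a policy summarizes its expected scalar
# return for any assignment of the unknown values.

n_states, n_actions, K = 2, 2, 3
gamma = 0.95

# reward_level[s, a] = index k of the unknown value theta_k earned in (s, a).
reward_level = np.array([[0, 2],
                         [1, 0]])

# Vectorized reward: R[s, a] is a one-hot vector of length K.
R = np.zeros((n_states, n_actions, K))
for s in range(n_states):
    for a in range(n_actions):
        R[s, a, reward_level[s, a]] = 1.0

# Transition probabilities P[s, a, s'] (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

def vector_value(policy, n_iter=500):
    """Expected discounted vector return V[s] in R^K of a deterministic policy."""
    V = np.zeros((n_states, K))
    for _ in range(n_iter):
        V = np.stack([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                      for s in range(n_states)])
    return V

policy = np.array([0, 1])            # a hypothetical deterministic policy
V_vec = vector_value(policy)

# If the true reward values were revealed, the usual scalar value function
# would be recovered by a dot product with the vector value.
theta = np.array([0.0, 1.0, 5.0])    # hypothetical ground-truth reward values
print(V_vec, V_vec @ theta)
```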
Notes
- 1. The randomization is over policies, not over actions as in randomized policies; a small illustrative sketch of this distinction is given below.
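To make the note concrete, here is a minimal, purely illustrative sketch (names, policies, and probabilities are hypothetical): a mixed policy draws one deterministic policy per episode and then acts deterministically, whereas a randomized policy draws a fresh action in every state it visits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Two hypothetical deterministic policies (state -> action).
pi_1 = np.array([0, 1, 0])
pi_2 = np.array([1, 1, 0])

# Mixed policy: randomize once over whole policies, then act deterministically.
def mixed_policy_episode(states):
    pi = pi_1 if rng.random() < 0.3 else pi_2   # single draw per episode
    return [pi[s] for s in states]

# Randomized policy: randomize over actions independently in every visited state.
action_probs = np.array([[0.3, 0.7],
                         [0.0, 1.0],
                         [1.0, 0.0]])

def randomized_policy_episode(states):
    return [rng.choice(n_actions, p=action_probs[s]) for s in states]

visited = [0, 1, 2, 0]
print(mixed_policy_episode(visited))
print(randomized_policy_episode(visited))
```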
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Weng, P. (2016). From Preference-Based to Multiobjective Sequential Decision-Making. In: Sombattheera, C., Stolzenburg, F., Lin, F., Nayak, A. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2016. Lecture Notes in Computer Science, vol 10053. Springer, Cham. https://doi.org/10.1007/978-3-319-49397-8_20
DOI: https://doi.org/10.1007/978-3-319-49397-8_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49396-1
Online ISBN: 978-3-319-49397-8