From Preference-Based to Multiobjective Sequential Decision-Making

  • Paul Weng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10053)


In this paper, we present a link between preference-based and multiobjective sequential decision-making. While transforming a multiobjective problem to a preference-based one is quite natural, the other direction is a bit less obvious. We present how this transformation (from preference-based to multiobjective) can be done under the classic condition that preferences over histories can be represented by additively decomposable utilities and that the decision criterion to evaluate policies in a state is based on expectation. This link yields a new source of multiobjective sequential decision-making problems (i.e., when reward values are unknown) and justifies the use of solving methods developed in one setting in the other one.


Sequential decision-making Preference-based reinforcement learning Multiobjective markov decision process Multiobjective Reinforcement Learning 


  1. 1.
    Abbeel, P., Coates, A., Ng, A.Y.: Autonomous helicopter aerobatics through apprenticeship learning. Int. J. Rob. Res. 29(13), 1608–1639 (2010)CrossRefGoogle Scholar
  2. 2.
    Akrour, R., Schoenauer, M., Sebag, M.: APRIL: active preference learning-based reinforcement learning. In: Flach, P.A., Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 116–131. Springer, Heidelberg (2012). doi: 10.1007/978-3-642-33486-3_8 CrossRefGoogle Scholar
  3. 3.
    Barrett, L., Narayanan, S.: Learning all optimal policies with multiple criteria. In: ICML (2008)Google Scholar
  4. 4.
    Busa-Fekete, R., Szörenyi, B., Weng, P., Cheng, W., Hüllermeier, E.: Preference-based reinforcement learning. In: European Workshop on Reinforcement Learning, Dagstuhl Seminar (2013)Google Scholar
  5. 5.
    Busa-Fekete, R., Szörenyi, B., Weng, P., Cheng, W., Hüllermeier, E.: Top-k selection based on adaptive sampling of noisy preferences. In: International Conference on Marchine Learning (ICML) (2013)Google Scholar
  6. 6.
    Busa-Fekete, R., Szorenyi, B., Weng, P., Cheng, W., Hüllermeier, E.: Preference-based reinforcement learning: evolutionary direct policy search using a preference-based Racing algorithm. Mach. Learn. 97(3), 327–351 (2014)MathSciNetCrossRefzbMATHGoogle Scholar
  7. 7.
    Chatterjee, K., Majumdar, R., Henzinger, T.A.: Markov decision processes with multiple objectives. In: Durand, B., Thomas, W. (eds.) STACS 2006. LNCS, vol. 3884, pp. 325–336. Springer, Heidelberg (2006). doi: 10.1007/11672142_26 CrossRefGoogle Scholar
  8. 8.
    Dudík, M., Hofmann, K., Schapire, R.E., Slivkins, A., Zoghi, M.: Contextual dueling bandits. In: COLT (2015)Google Scholar
  9. 9.
    Fürnkranz, J., Hüllermeier, E., Cheng, W., Park, S.: Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Mach. Learn. 89(1), 123–156 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  10. 10.
    Gábor, Z., Kalmár, Z., Szepesvári, C.: Multicriteria reinforcement learning. In: Proceedings of International Conference of Machine Learning (1998)Google Scholar
  11. 11.
    Gilbert, H., Spanjaard, O., Viappiani, P., Weng, P.: Reducing the number of queries in interactive value iteration. In: Walsh, T. (ed.) ADT 2015. (LNAI), vol. 9346, pp. 139–152. Springer, Heidelberg (2015). doi: 10.1007/978-3-319-23114-3_9 CrossRefGoogle Scholar
  12. 12.
    Gilbert, H., Spanjaard, O., Viappiani, P., Weng, P.: Solving MDPs with skew symmetric bilinear utility functions. In: IJCAI, pp. 1989–1995 (2015)Google Scholar
  13. 13.
    Gretton, C., Price, D., Thiebaux, S.: Implementation and comparison of solution methods for decision processes with non-Markovian rewards. In: UAI, vol. 19, pp. 289–296 (2003)Google Scholar
  14. 14.
    Lizotte, D.J., Bowling, M., Murphy, S.A.: Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In: ICML (2010)Google Scholar
  15. 15.
    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)CrossRefGoogle Scholar
  16. 16.
    Ng, A., Russell, S.: Algorithms for inverse reinforcement learning. In: ICML. Morgan Kaufmann (2000)Google Scholar
  17. 17.
    Ogryczak, W., Perny, P., Weng, P.: On minimizing ordered weighted regrets in multiobjective Markov decision processes. In: Brafman, R.I., Roberts, F.S., Tsoukiàs, A. (eds.) ADT 2011. LNCS (LNAI), vol. 6992, pp. 190–204. Springer, Heidelberg (2011). doi: 10.1007/978-3-642-24873-3_15 CrossRefGoogle Scholar
  18. 18.
    Ogryczak, W., Perny, P., Weng, P.: A compromise programming approach to multiobjective Markov decision processes. Int. J. Inf. Technol. Decis. Making 12, 1021–1053 (2013)CrossRefGoogle Scholar
  19. 19.
    Perny, P., Weng, P.: On finding compromise solutions in multiobjective Markov decision processes. In: Multidisciplinary Workshop on Advances in Preference Handling (MPREF) @ European Conference on Artificial Intelligence (ECAI) (2010)Google Scholar
  20. 20.
    Perny, P., Weng, P., Goldsmith, J., Hanna, J.: Approximation of Lorenz-optimal solutions in multiobjective Markov decision processes. In: International Conference on Uncertainty in Artificial Intelligence (UAI) (2013)Google Scholar
  21. 21.
    Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (1994)CrossRefzbMATHGoogle Scholar
  22. 22.
    Regan, K., Boutilier, C.: Eliciting additive reward functions for Markov decision processes. In: IJCAI, pp. 2159–2164 (2011)Google Scholar
  23. 23.
    Regan, K., Boutilier, C.: Robust online optimization of reward-uncertain MDPs. In: IJCAI, pp. 2165–2171 (2011)Google Scholar
  24. 24.
    Roijers, D., Vamplew, P., Whiteson, S., Dazeley, R.: A survey of multi-objective sequential decision-making. J. Artif. Intell. Res. 48, 67–113 (2013)MathSciNetzbMATHGoogle Scholar
  25. 25.
    Steuer, R., Choo, E.U.: An interactive weighted Tchebycheff procedure for multiple objective programming. Math. Program. 26, 326–344 (1983)MathSciNetCrossRefzbMATHGoogle Scholar
  26. 26.
    Strehl, A.L., Littman, M.L.: Reinforcement learning in finite MDPs: PAC analysis. J. Mach. Learn. Res. 10, 2413–2444 (2009)MathSciNetzbMATHGoogle Scholar
  27. 27.
    Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)Google Scholar
  28. 28.
    Tesauro, G.: Temporal difference learning and TD-Gammon. Commun. ACM 38(3), 58–68 (1995)CrossRefGoogle Scholar
  29. 29.
    Weng, P.: Markov decision processes with ordinal rewards: Reference point-based preferences. International Conference on Automated Planning and Scheduling (ICAPS), vol. 21, pp. 282–289 (2011)Google Scholar
  30. 30.
    Weng, P.: Ordinal decision models for Markov decision processes. In: European Conference on Artificial Intelligence (ECAI), vol. 20, pp. 828–833 (2012)Google Scholar
  31. 31.
    Weng, P., Zanuttini, B.: Interactive value iteration for Markov decision processes with unknown rewards. In: IJCAI (2013)Google Scholar
  32. 32.
    Weng, P., Busa-Fekete, R., Hüllermeier, E.: Interactive Q-learning with ordinal rewards and unreliable tutor. In: ECML/PKDD Workshop Reinforcement Learning with Generalized Feedback, September 2013Google Scholar
  33. 33.
    White, D.: Multi-objective infinite-horizon discounted Markov decision processes. J. Math. Anal. Appls. 89, 639–647 (1982)MathSciNetCrossRefzbMATHGoogle Scholar
  34. 34.
    Wray, K.H., Zilberstein, S., Mouaddib, A.I.: Multi-objective MDPs with conditional lexicographic reward preferences. In: AAAI (2015)Google Scholar
  35. 35.
    Yue, Y., Broder, J., Kleinberg, R., Joachims, T.: The k-armed dueling bandits problem. J. Comput. Syst. Sci. 78(5), 1538–1556 (2012)MathSciNetCrossRefzbMATHGoogle Scholar

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. 1.School of Electronics and Information TechnologySYSU-CMU Joint Institute of EngineeringGuangzhouPeople’s Republic of China
  2. 2.SYSU-CMU Shunde Joint Research InstituteShundePeople’s Republic of China

Personalised recommendations