
From Preference-Based to Multiobjective Sequential Decision-Making

  • Conference paper
Multi-disciplinary Trends in Artificial Intelligence (MIWAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10053)

Abstract

In this paper, we present a link between preference-based and multiobjective sequential decision-making. While transforming a multiobjective problem into a preference-based one is quite natural, the reverse direction is less obvious. We show how this transformation (from preference-based to multiobjective) can be carried out under the classic assumptions that preferences over histories are represented by additively decomposable utilities and that policies are evaluated in a state by their expected utility. This link yields a new source of multiobjective sequential decision-making problems (i.e., problems where the reward values are unknown) and justifies applying solution methods developed in one setting to the other.
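
The construction summarized above can be made concrete with a small sketch. The following Python snippet is a minimal illustration, not the paper's code; the function names and the "reward slot" encoding are assumptions made for exposition. It shows why a problem with d distinct but unknown reward values can be read as a d-objective problem: replace each scalar reward by an indicator vector over the reward slots, and the scalar return is then recovered, for any assignment of the unknown values, as a dot product (by additive decomposability and linearity of expectation).

    # A minimal sketch (not the paper's code) of the transformation described
    # above: a sequential decision problem whose rewards take d distinct but
    # unknown values becomes a d-objective problem whose vector reward marks
    # which value was received. Names and encoding are illustrative assumptions.
    import numpy as np

    def indicator_reward(slot: int, d: int) -> np.ndarray:
        """Vector reward: a 1 in the slot of the (unknown) reward received."""
        e = np.zeros(d)
        e[slot] = 1.0
        return e

    def scalar_return(vector_return: np.ndarray, reward_values: np.ndarray) -> float:
        """For any guess of the unknown reward values, the scalar return is
        the dot product with the vector return."""
        return float(vector_return @ reward_values)

    # A history that hits reward slots 2, 0, 2 under discount gamma = 0.9.
    d, gamma = 3, 0.9
    slots_hit = [2, 0, 2]
    v = sum(gamma**t * indicator_reward(s, d) for t, s in enumerate(slots_hit))
    true_values = np.array([1.0, -0.5, 2.0])  # unknown in the original problem
    print(scalar_return(v, true_values))      # equals the discounted scalar return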


Notes

  1. The randomization is over policies, not over actions as in randomized policies.
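
As a toy illustration of this distinction (hypothetical code, not from the paper): a mixture over policies draws one deterministic policy up front and commits to it for the whole run, whereas a randomized policy draws a fresh action at every step.

    # Hypothetical toy example contrasting the two notions of randomization.
    import random

    policies = [lambda s: 0, lambda s: 1]  # two deterministic policies

    def run_policy_mixture(horizon: int, p: float = 0.5):
        """Randomization over policies: sample one policy once, then follow it."""
        pi = random.choices(policies, weights=[p, 1 - p])[0]
        return [pi(s) for s in range(horizon)]

    def run_randomized_policy(horizon: int, p: float = 0.5):
        """Randomization over actions: sample a fresh action at every step."""
        return [random.choices([0, 1], weights=[p, 1 - p])[0]
                for _ in range(horizon)]

    print(run_policy_mixture(5))     # constant, e.g. [1, 1, 1, 1, 1]
    print(run_randomized_policy(5))  # varies, e.g. [0, 1, 1, 0, 1]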


Author information

Correspondence to Paul Weng.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Weng, P. (2016). From Preference-Based to Multiobjective Sequential Decision-Making. In: Sombattheera, C., Stolzenburg, F., Lin, F., Nayak, A. (eds.) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2016. Lecture Notes in Computer Science (LNAI), vol. 10053. Springer, Cham. https://doi.org/10.1007/978-3-319-49397-8_20

  • DOI: https://doi.org/10.1007/978-3-319-49397-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49396-1

  • Online ISBN: 978-3-319-49397-8

  • eBook Packages: Computer Science (R0)
