Encyclopedia of Machine Learning and Data Mining

2017 Edition
| Editors: Claude Sammut, Geoffrey I. Webb

Markov Decision Processes

Reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7687-1_512



A Markov Decision Process (MDP) is a discrete, stochastic, and generally finite model of a system to which some external control can be applied. Originally developed in the Operations Research and Statistics communities, MDPs, and their extension to Partially Observable Markov Decision Processes (POMDPs), are now commonly used in the study of reinforcement learning in the Artificial Intelligence and Robotics communities (Bellman 1957; Bertsekas and Tsitsiklis 1996; Howard 1960; Puterman 1994). When used for reinforcement learning, the parameters of an MDP are first learned from data, and the MDP is then solved to choose a behavior.
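The first of these two phases can be sketched with a maximum-likelihood estimator that counts observed transitions. The function name and the sample data below are illustrative, not from the entry:

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Maximum-likelihood estimates of T(s, a, s') and R(s, a)
    from a list of observed (s, a, r, s_next) tuples."""
    sas_count = defaultdict(int)    # (s, a, s') -> visit count
    reward_sum = defaultdict(float) # (s, a) -> summed observed reward
    sa_count = defaultdict(int)     # (s, a) -> visit count
    for s, a, r, s_next in transitions:
        sas_count[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
        sa_count[(s, a)] += 1

    def T(s, a, s_next):
        # Empirical transition probability; 0 if (s, a) never observed.
        n = sa_count[(s, a)]
        return sas_count[(s, a, s_next)] / n if n else 0.0

    def R(s, a):
        # Empirical mean reward; 0 if (s, a) never observed.
        n = sa_count[(s, a)]
        return reward_sum[(s, a)] / n if n else 0.0

    return T, R

# Example: three transitions observed from state 0 under action 0.
T, R = estimate_mdp([(0, 0, 1.0, 1), (0, 0, 0.0, 0), (0, 0, 1.0, 1)])
print(T(0, 0, 1), R(0, 0))  # both 2/3
```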

Formally, an MDP is defined as a tuple \(\langle \mathcal{S},\mathcal{A},T,R \rangle\), where \(\mathcal{S}\) is a set of states, \(\mathcal{A}\) is a set of actions, \(T(s,a,s') = \Pr(s' \mid s,a)\) is the transition function giving the probability of moving to state \(s'\) when action \(a\) is taken in state \(s\), and \(R(s,a)\) is the reward function giving the immediate reward for taking action \(a\) in state \(s\).
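One standard way to process such a tuple to choose a behavior is value iteration (Bellman 1957; Puterman 1994), which repeatedly applies the Bellman optimality backup until the value function converges. The following sketch uses a hypothetical three-state, two-action MDP whose tables (and the discount factor) are invented for illustration:

```python
import numpy as np

# A hypothetical MDP with |S| = 3 states and |A| = 2 actions.
# T[a, s, s'] = Pr(s' | s, a); R[s, a] = immediate reward.
T = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0],   # action 1
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]],
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
])
gamma = 0.9  # discount factor (an assumption; not part of the tuple above)

# Value iteration: repeat the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' T(s, a, s') V(s') ]
# until the largest change falls below a tolerance.
V = np.zeros(3)
for _ in range(1000):
    # Q[s, a]: expected return of taking a in s, then acting optimally.
    Q = R + gamma * np.einsum('ast,t->sa', T, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
print(V, policy)
```

Because the backup is a contraction with modulus \(\gamma\), the loop converges geometrically to the unique optimal value function, and the greedy policy extracted from it is optimal.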


Recommended Reading

  1. Albus JS (1981) Brains, behavior, and robotics. BYTE, Peterborough. ISBN:0070009759
  2. Andre D, Friedman N, Parr R (1997) Generalized prioritized sweeping. In: Neural and information processing systems, Denver, pp 1001–1007
  3. Andre D, Russell SJ (2002) State abstraction for programmable reinforcement learning agents. In: Proceedings of the eighteenth national conference on artificial intelligence (AAAI), Edmonton
  4. Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Prieditis A, Russell S (eds) Machine learning: proceedings of the twelfth international conference (ICML95). Morgan Kaufmann, San Mateo, pp 30–37
  5. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
  6. Bertsekas DP, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont
  7. Dietterich TG (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. J Artif Intell Res 13:227–303
  8. Gordon GJ (1995) Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University
  9. Guestrin C et al (2003) Efficient solution algorithms for factored MDPs. J Artif Intell Res 19:399–468
  10. Hansen EA, Zilberstein S (1998) Heuristic search in cyclic AND/OR graphs. In: Proceedings of the fifteenth national conference on artificial intelligence. http://rbr.cs.umass.edu/shlomo/papers/HZaaai98.html
  11. Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge
  12. Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: European conference on machine learning (ECML), Berlin. Lecture notes in computer science, vol 4212. Springer, pp 282–293
  13. Moore AW, Atkeson CG (1993) Prioritized sweeping: reinforcement learning with less data and less real time. Mach Learn 13:103–130
  14. Moore AW, Baird L, Pack Kaelbling L (1999) Multi-value-functions: efficient automatic action hierarchies for multiple goal MDPs. In: International joint conference on artificial intelligence (IJCAI99), Stockholm
  15. Munos R, Moore AW (2001) Variable resolution discretization in optimal control. Mach Learn 1:1–31
  16. Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Wiley, New York. ISBN:0-471-61977-9
  17. St-Aubin R, Hoey J, Boutilier C (2000) APRICODD: approximate policy construction using decision diagrams. In: NIPS-2000, Denver
  18. Sutton RS, Precup D, Singh S (1998) Intra-option learning about temporally abstract actions. In: Machine learning: proceedings of the fifteenth international conference (ICML98). Morgan Kaufmann, Madison, pp 556–564
  19. Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. NICTA and The University of New South Wales, Sydney, Australia