Markov Decision Processes

Synonyms

Policy search

Definition

A Markov Decision Process (MDP) is a discrete, stochastic, and generally finite model of a system to which some external control can be applied. Originally developed in the Operations Research and Statistics communities, MDPs and their extension, Partially Observable Markov Decision Processes (POMDPs), are now commonly used in the study of reinforcement learning in the Artificial Intelligence and Robotics communities (Bellman 1957; Bertsekas and Tsitsiklis 1996; Howard 1960; Puterman 1994). When used for reinforcement learning, the parameters of an MDP are first learned from data, and the resulting model is then solved to choose a behavior.
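
As a minimal sketch of this first, model-learning step, the following Python fragment estimates a tabular transition and reward model from a hypothetical log of (state, action, reward, next state) tuples by simple counting and averaging; the example data, the variable names, and the dictionary layout are illustrative assumptions rather than a prescribed method.

    # Estimate MDP parameters from logged experience (illustrative data only).
    from collections import defaultdict

    # Hypothetical log of (state, action, reward, next_state) transitions.
    experience = [
        ("s0", "move", 0.0, "s1"),
        ("s1", "stay", 1.0, "s1"),
        ("s0", "move", 0.0, "s0"),
        ("s1", "move", 0.0, "s0"),
    ]

    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> next state -> count
    reward_sums = defaultdict(float)                 # (s, a) -> summed reward
    visits = defaultdict(int)                        # (s, a) -> visit count

    for s, a, r, s2 in experience:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1

    # Maximum-likelihood estimates of the transition and reward functions.
    T_hat = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
             for sa, nexts in counts.items()}
    R_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}

    print(T_hat[("s0", "move")])   # e.g. {'s1': 0.5, 's0': 0.5}
    print(R_hat[("s1", "stay")])   # 1.0

The second phase, choosing a behavior, is then a planning problem over the learned model; a sketch of that step follows the formal definition below.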

Formally, an MDP is defined as a tuple \(\langle \mathcal{S},\mathcal{A},T,R\rangle\), where \(\mathcal{S}\) is a discrete set of states, \(\mathcal{A}\) is a discrete set of actions, \(T : \mathcal{S}\times \mathcal{A}\rightarrow (\mathcal{S}\rightarrow \mathbb{R})\) is a stochastic transition function, and \(R : \mathcal{S}\times\)...
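
To make the tuple concrete, the sketch below encodes a toy two-state, two-action MDP with tabular \(T\) and \(R\) and computes an optimal policy by value iteration, one of the classical dynamic-programming methods (Bellman 1957; Howard 1960); the state and action names, the rewards, the discount factor \(\gamma = 0.9\), and the value_iteration helper are illustrative assumptions, not part of the formal definition.

    # A toy MDP and a value-iteration solver (all numbers are illustrative).
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    # T[s][a] is a probability distribution over next states: P(s' | s, a).
    T = {
        "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
        "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
    }

    # R[s][a] is the expected immediate reward for taking action a in state s.
    R = {
        "s0": {"stay": 0.0, "move": 0.0},
        "s1": {"stay": 1.0, "move": 0.0},
    }

    def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
        """Return the optimal state values and a greedy policy for the MDP."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                     for a in actions}
                best = max(q.values())
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        policy = {s: max(actions, key=lambda a: R[s][a]
                         + gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
                  for s in states}
        return V, policy

    V, policy = value_iteration(states, actions, T, R)
    print(policy)   # e.g. {'s0': 'move', 's1': 'stay'}

For a finite, discounted MDP such as this one, policy iteration or linear programming would recover the same optimal policy; value iteration is shown here only because it is the shortest to state.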

Recommended Reading

  • Albus JS (1981) Brains, behavior, and robotics. BYTE, Peterborough. ISBN:0070009759

  • Andre D, Friedman N, Parr R (1997) Generalized prioritized sweeping. In: Neural information processing systems, Denver, pp 1001–1007

  • Andre D, Russell SJ (2002) State abstraction for programmable reinforcement learning agents. In: Proceedings of the eighteenth national conference on artificial intelligence (AAAI), Edmonton

  • Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Prieditis A, Russell S (eds) Machine learning: proceedings of the twelfth international conference (ICML95). Morgan Kaufmann, San Mateo, pp 30–37

  • Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton

  • Bertsekas DP, Tsitsiklis J (1996) Neuro-dynamic programming. Athena Scientific, Belmont

  • Dietterich TG (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. J Artif Intell Res 13:227–303

  • Gordon GJ (1995) Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University

  • Guestrin C et al (2003) Efficient solution algorithms for factored MDPs. J Artif Intell Res 19:399–468

  • Hansen EA, Zilberstein S (1998) Heuristic search in cyclic AND/OR graphs. In: Proceedings of the fifteenth national conference on artificial intelligence. http://rbr.cs.umass.edu/shlomo/papers/HZaaai98.html

  • Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge

  • Kocsis L, Szepesvári C (2006) Bandit based Monte-Carlo planning. In: European conference on machine learning (ECML), Berlin. Lecture notes in computer science, vol 4212. Springer, pp 282–293

  • Moore AW, Atkeson CG (1993) Prioritized sweeping: reinforcement learning with less data and less real time. Mach Learn 13:103–130

  • Moore AW, Baird L, Pack Kaelbling L (1999) Multi-value-functions: efficient automatic action hierarchies for multiple goal MDPs. In: International joint conference on artificial intelligence (IJCAI99), Stockholm

  • Munos R, Moore AW (2001) Variable resolution discretization in optimal control. Mach Learn 1:1–31

  • Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. Wiley, New York. ISBN:0-471-61977-9

  • St-Aubin R, Hoey J, Boutilier C (2000) APRICODD: approximate policy construction using decision diagrams. In: NIPS-2000, Denver

  • Sutton RS, Precup D, Singh S (1998) Intra-option learning about temporally abstract actions. In: Machine learning: proceedings of the fifteenth international conference (ICML98). Morgan Kaufmann, Madison, pp 556–564

  • Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690

Copyright information

© 2017 Springer Science+Business Media New York

About this entry

Cite this entry

Uther, W. (2017). Markov Decision Processes. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1_512
