Reinforcement Learning and Markov Decision Processes

  • Martijn van Otterlo
  • Marco Wiering
Part of the Adaptation, Learning, and Optimization book series (ALO, volume 12)


Situated between supervised and unsupervised learning, the paradigm of reinforcement learning deals with learning in sequential decision-making problems in which feedback is limited. This text introduces the intuitions and concepts behind Markov decision processes and two classes of algorithms for computing optimal behaviors: reinforcement learning and dynamic programming. First, the formal framework of Markov decision processes is defined, together with value functions and policies. The main part of the text introduces foundational classes of algorithms for learning optimal behaviors, based on various definitions of optimality with respect to the goal of learning sequential decisions. It also surveys efficient extensions of the foundational algorithms, which differ mainly in how they use the feedback given by the environment to speed up learning and in how they concentrate on relevant parts of the problem. In both model-based and model-free settings, these extensions have proven useful in scaling up to larger problems.
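To make the terms in the abstract concrete, the following is a minimal sketch of value iteration, one dynamic programming method, on a tiny Markov decision process. The three-state chain, its rewards, the discount factor, and all names (`TRANSITIONS`, `q_value`, `value_iteration`, `greedy_policy`) are assumptions chosen for illustration and are not taken from the chapter.

```python
GAMMA = 0.9  # discount factor (assumed for this example)

# Hypothetical 3-state chain; state 2 is an absorbing goal state.
# TRANSITIONS[state][action] = [(probability, next_state, reward), ...]
# Action 0 moves left, action 1 moves right.
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def q_value(values, state, action):
    """Expected reward plus discounted value of the successor state."""
    return sum(p * (r + GAMMA * values[s2])
               for p, s2, r in TRANSITIONS[state][action])


def value_iteration(tol=1e-10):
    """Sweep over all states in place until the values stop changing."""
    values = {s: 0.0 for s in TRANSITIONS}
    while True:
        delta = 0.0
        for s in TRANSITIONS:
            best = max(q_value(values, s, a) for a in TRANSITIONS[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Pick, in each state, an action that maximizes the one-step lookahead."""
    return {s: max(TRANSITIONS[s], key=lambda a: q_value(values, s, a))
            for s in TRANSITIONS}


values = value_iteration()
policy = greedy_policy(values)
```

Here `values` plays the role of a value function and `policy` the role of a policy that is greedy with respect to it. This sketch is model-based because it reads the transition table directly; a model-free method would have to estimate comparable quantities from sampled interaction instead.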


Keywords: Monte Carlo, Optimal Policy, Goal State, Markov Decision Process, Reward Function





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Radboud University, Nijmegen, The Netherlands
  2. Department of Artificial Intelligence, University of Groningen, Groningen, The Netherlands
