Solving Markov Decision Processes via Simulation

  • Abhijit Gosavi
Part of the International Series in Operations Research & Management Science book series (ISOR, volume 216)


This chapter presents an overview of simulation-based techniques for solving Markov decision processes (MDPs). MDPs model problems of sequential decision-making under uncertainty, in which the decisions made in each state collectively affect the trajectory of states visited by the system over a time horizon of interest. Traditionally, MDPs have been solved via dynamic programming (DP), which requires a transition probability model that is difficult to derive in many realistic settings. Simulation allows us to bypass the transition probability model and to solve large-scale MDPs considered intractable for traditional DP. The simulation-based methodology for solving MDPs, which like DP is rooted in the Bellman equations, goes by names such as reinforcement learning, neuro-DP, and approximate or adaptive DP. We begin with a description of algorithms for infinite-horizon discounted reward MDPs, followed by algorithms for infinite-horizon average reward MDPs. We then discuss finite-horizon MDPs. For each problem considered, we present a step-by-step description of a selected group of algorithms. In making this selection, we have attempted to blend the classical with more recent developments. Finally, after touching upon extensions and convergence theory, we conclude with a brief summary of some applications and directions for future research.
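To make the idea of bypassing the transition probability model concrete, here is a minimal sketch of tabular Q-learning for an infinite-horizon discounted reward MDP. Q-learning is chosen only as a well-known representative of the Bellman-equation-based methods the abstract refers to; the abstract itself does not name it. The two-state simulator, the reward values, the discount factor, and the step size are all hypothetical. The learner never reads the transition probabilities; it only draws sample transitions from `simulate`.

```python
import random

GAMMA = 0.9   # discount factor (assumed for illustration)
ALPHA = 0.1   # constant step size (assumed; convergence theory typically uses diminishing steps)
N_STATES, N_ACTIONS = 2, 2


def simulate(state, action, rng):
    """Hypothetical simulator: return (next_state, reward) for one transition.

    Action 0 tries to stay in the current state, action 1 tries to switch;
    either attempt succeeds with probability 0.9. Staying in state 1 pays 10.
    """
    intended = state if action == 0 else 1 - state
    next_state = intended if rng.random() < 0.9 else 1 - intended
    reward = 10.0 if (state == 1 and action == 0) else 0.0
    return next_state, reward


def q_learning(n_steps=20000, seed=0):
    """Estimate Q-factors from simulated transitions only."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(n_steps):
        action = rng.randrange(N_ACTIONS)  # purely random exploration
        next_state, reward = simulate(state, action, rng)
        # Sample-based version of the Bellman optimality update.
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Pick, in each state, the action with the largest estimated Q-factor."""
    return [max(range(N_ACTIONS), key=lambda a: q_s[a]) for q_s in q]


if __name__ == "__main__":
    q = q_learning()
    print("Q-factors:", q)
    print("Greedy policy:", greedy_policy(q))
```

For this toy model the Bellman equations can be solved by hand, and the optimal policy is to switch in state 0 and stay in state 1. The learned greedy policy can be compared against that as a sanity check.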


Keywords: Reinforcement Learning; Preventive Maintenance; Bellman Equation; Policy Iteration; Average Reward



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Missouri University of Science and Technology, Rolla, USA
