Markov Decision Processes

  • Hyeong Soo Chang
  • Jiaqiao Hu
  • Michael C. Fu
  • Steven I. Marcus
Part of the Communications and Control Engineering book series (CCE)

Abstract

We provide a formal description of the discounted reward MDP framework in Chap. 1, including both the finite- and the infinite-horizon settings and summarizing the associated optimality equations. We then present the well-known exact solution algorithms, value iteration and policy iteration, and outline a framework of rolling-horizon control (also called receding-horizon control) as an approximate solution methodology for solving MDPs, in conjunction with simulation-based approaches covered later in the book. We conclude with a brief survey of other recently proposed MDP solution techniques designed to break the curse of dimensionality.
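For concreteness, the following is a minimal Python sketch of the value iteration algorithm mentioned above, applied to a toy two-state, two-action discounted MDP. The array layout (P[a, s, s'] for transition probabilities, R[s, a] for one-stage rewards), the function name value_iteration, and the toy numbers are illustrative assumptions, not the chapter's notation or data.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ] to (near) convergence.

    P: array of shape (n_actions, n_states, n_states), P[a, s, s'] = transition probability.
    R: array of shape (n_states, n_actions), R[s, a] = one-stage reward.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[a, s] = R[s, a] + gamma * expected value of the next state under action a
        Q = R.T + gamma * P @ V
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=0)  # greedy policy with respect to the final value estimate
    return V, policy

# Toy two-state, two-action example (numbers are arbitrary).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # transitions under action 0
              [[0.5, 0.5], [0.3, 0.7]]])   # transitions under action 1
R = np.array([[1.0, 0.0],                  # rewards in state 0 for actions 0, 1
              [0.0, 2.0]])                 # rewards in state 1 for actions 0, 1
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```

Because the discounted Bellman operator is a contraction with modulus gamma, the iteration converges geometrically; policy iteration replaces the maximization sweep with alternating policy evaluation and greedy improvement steps.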

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Hyeong Soo Chang, Dept. of Computer Science and Engineering, Sogang University, Seoul, South Korea
  • Jiaqiao Hu, Dept. of Applied Mathematics & Statistics, State University of New York, Stony Brook, USA
  • Michael C. Fu, Smith School of Business, University of Maryland, College Park, USA
  • Steven I. Marcus, Dept. of Electrical & Computer Engineering, University of Maryland, College Park, USA