
Illustrated review of convergence conditions of the value iteration algorithm and the rolling horizon procedure for average-cost MDPs

Annals of Operations Research

Abstract

This paper is concerned with the links between the Value Iteration algorithm and the Rolling Horizon procedure for solving problems of stochastic optimal control under the long-run average criterion, in Markov Decision Processes with finite state and action spaces. We review conditions from the literature which imply the geometric convergence of Value Iteration to the optimal value. Aperiodicity is an essential prerequisite for convergence. We prove that the convergence of Value Iteration generally implies that of Rolling Horizon. We also present a modified Rolling Horizon procedure that can be applied to models without analyzing periodicity, and discuss the impact of this transformation on convergence. We illustrate the different results with numerous examples. Finally, we discuss rules for stopping Value Iteration or finding the length of a Rolling Horizon. We provide an example which demonstrates the difficulty of the question, disproving in particular a conjectured rule proposed by Puterman.
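
For reference, the Value Iteration recursion under the long-run average criterion takes the standard form below; the notation (states S, actions A, gains r, transition probabilities p) is the usual one for finite MDPs and is assumed here rather than reproduced from the paper's body:

$$v_{n+1}(s)=\max_{a\in A}\biggl\{ r(s,a)+\sum_{j\in S}p(j\mid s,a)\,v_{n}(j)\biggr\},\qquad s\in S,\ n\ge 0.$$

The convergence results reviewed in the paper concern the differences \(v_{n+1}(s)-v_{n}(s)\), which under suitable conditions approach the optimal average gain at a geometric rate; the Rolling Horizon procedure, in turn, applies in each state the first decision of an optimal policy for the finite-horizon problem solved by this recursion.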


Notes

  1. Observe the discrepancy with the general notion of convergence of algorithms in Computer Science, which requires that an algorithm stops and returns the correct result.

References

  • Alden, J. M., & Smith, R. L. (1992). Rolling horizon procedures in nonhomogeneous Markov decision processes. Operations Research, 40(2), S183–S194.

  • Bertsekas, D. P. (1987). Dynamic programming: deterministic and stochastic models. Englewood Cliffs: Prentice Hall.

  • Çinlar, E. (1975). Introduction to stochastic processes. New York: Prentice Hall.

  • Derman, C. (1970). Finite state Markovian decision processes. New York: Academic Press.

  • Federgruen, A., Schweitzer, P. J., & Tijms, H. C. (1978). Contraction mappings underlying undiscounted Markov decision problems. Journal of Mathematical Analysis and Applications, 65, 711–730.

  • Guo, X., & Shi, P. (2001). Limiting average criteria for nonstationary Markov decision processes. SIAM Journal on Optimization, 11(4), 1037–1053.

  • Hernández-Lerma, O., & Lasserre, J. B. (1990). Error bounds for rolling horizon policies in discrete-time Markov control processes. IEEE Transactions on Automatic Control, 35(10), 1118–1124.

  • Kallenberg, L. (2002). Finite state and action MDPs. In E. Feinberg & A. Shwartz (Eds.), Handbook of Markov decision processes: methods and applications. Kluwer’s international series.

  • Kallenberg, L. (2009). Markov decision processes. Lecture notes, University of Leiden. www.math.leidenuniv.nl/~kallenberg/Lecture-notes-MDP.pdf.

  • Lanery, E. (1967). Étude asymptotique des systèmes Markoviens à commande. Revue Française d’Informatique et de Recherche Opérationnelle, 1, 3–56.

  • Meyn, S. P., & Tweedie, R. L. (2009). Markov chains and stochastic stability (2nd ed.). Cambridge: Cambridge University Press.

  • Puterman, M. L. (1994). Markov decision processes. New York: Wiley.

  • Ross, S. M. (1970). Applied probability models with optimization applications. Oakland: Holden-Day.

  • Schweitzer, P. J. (1971). Iterative solution of the functional equation of undiscounted Markov renewal programming. Journal of Mathematical Analysis and Applications, 34, 495–501.

  • Schweitzer, P. J., & Federgruen, A. (1977). The asymptotic behavior of undiscounted value iteration in Markov decision problems. Mathematics of Operations Research, 2(4), 360–381.

  • Schweitzer, P. J., & Federgruen, A. (1979). Geometric convergence of the value iteration in multichain Markov decision problems. Advances in Applied Probability, 11, 188–217.

  • Tijms, H. C. (1986). Stochastic modelling and analysis: a computational approach. New York: Wiley.

  • White, D. J. (1993). Markov decision processes. New York: Wiley.


Author information

Correspondence to Silvia Di Marco.

Appendix

Each month, an individual must decide how to allocate his wealth between different consumptions and investments. Each state represents a level of the individual’s wealth at the start of a month. Each wealth level gives access to two investment opportunities, one prudent and one risky. Choosing an investment profile at each level determines the transition probabilities to the next wealth level, as well as an instantaneous gain. The individual’s objective is to maximize the average gain.

There are five levels of wealth, ordered from the smallest to the largest. At the medium level, under the risky behavior, there is a positive probability of falling to the next lower level of wealth. It is also possible to cycle between the two lowest levels, but no action gives access to the three highest levels from the lowest ones. Moreover, from the poorest level, some external help brings the individual to level 2. There is a common action space A={a_1, a_2}, where a_1 represents the prudent investment profile and a_2 the risky one. We show the data below: \(P_{a_{k}}(s,j)\) is the transition probability from state s to state j when action a_k is used, i.e. \(P_{a_{k}}(s,j)=p(j|s,a_{k})\).

The gains are summarized as follows: \(r(s,a_{k})\) in the matrix below is the gain obtained when action a_k is chosen in state s.

$$\left(\begin{array}{c@{\quad}c}
1 & 2\\
1 & 2\\
1 & 1\\
3 & 2\\
6 & 6
\end{array}\right).$$

Through the implementation of the MRH procedure, the optimal average wealth can be computed: \(g^{*}=(2,2,4,4,4)\). It is produced by the stationary policy associated with the decision rule \(d=(a_{2},a_{2},a_{1},a_{2},a_{1})\), whose transition matrix is:

$$\left(\begin{array}{c@{\quad}c@{\quad}c@{\quad}c@{\quad}c}
0 & 1 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0\\
0 & 0 & 0.7 & 0.3 & 0\\
0 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 1 & 0
\end{array}\right).$$
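
As a quick numerical check of the stated optimal gain (a minimal sketch, not taken from the paper; it assumes numpy and reads the gains under d off the gain matrix above, namely r_d = (2, 2, 1, 2, 6)), the long-run average gain of this policy can be computed by Cesàro averaging of the powers of its transition matrix, which is the appropriate limit for periodic chains:

```python
import numpy as np

# Transition matrix of the decision rule d = (a2, a2, a1, a2, a1), as given above.
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

# One-step gains under d, read off the gain matrix: states 1..5 use a2, a2, a1, a2, a1.
r_d = np.array([2.0, 2.0, 1.0, 2.0, 6.0])

# The chain is periodic, so P^n does not converge, but the Cesaro averages
# (1/N) * sum_{n=0}^{N-1} P^n do; the average gain is that limit applied to r_d.
N = 100_000
Pn = np.eye(5)
acc = np.zeros((5, 5))
for _ in range(N):
    acc += Pn
    Pn = Pn @ P
g = (acc / N) @ r_d

print(np.round(g, 3))  # approximately [2. 2. 4. 4. 4.], matching g* = (2, 2, 4, 4, 4)
```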

Clearly, this is a multichain, periodic model. When the RH procedure is applied directly, there is no convergence: the procedure returns, infinitely often and periodically, two policies, \((a_{2},a_{2},a_{1},a_{1},a_{1})\) and \((a_{2},a_{2},a_{1},a_{2},a_{1})\). The first one produces the gain \(g=(2,2,3,3,3)\) and is therefore not optimal.
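
A classical way to obtain convergence without analyzing periodicity is the aperiodicity transformation of Schweitzer (1971), which replaces a transition matrix P by P_tau = tau*I + (1 - tau)*P for some tau in (0,1); this preserves the recurrent classes, stationary distributions and average gains of every policy while making the chains aperiodic. It is offered here only as an illustration of the idea behind running Rolling Horizon without a periodicity analysis, not as the exact modification used in the paper's MRH procedure. The sketch below (assuming numpy) shows its effect on the periodic transition matrix above:

```python
import numpy as np

# Periodic transition matrix of the policy from the appendix example.
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

# Aperiodicity transformation: add a self-loop of probability tau to every state.
tau = 0.5
P_tau = tau * np.eye(5) + (1.0 - tau) * P

# Powers of P oscillate between the two phases of the period-2 classes,
# whereas powers of P_tau converge.
print(np.round(np.linalg.matrix_power(P, 50)[0], 3))      # -> [1. 0. 0. 0. 0.]
print(np.round(np.linalg.matrix_power(P, 51)[0], 3))      # -> [0. 1. 0. 0. 0.]
print(np.round(np.linalg.matrix_power(P_tau, 50)[0], 3))  # -> [0.5 0.5 0. 0. 0.]
```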


Cite this article

Della Vecchia, E., Di Marco, S. & Jean-Marie, A. Illustrated review of convergence conditions of the value iteration algorithm and the rolling horizon procedure for average-cost MDPs. Ann Oper Res 199, 193–214 (2012). https://doi.org/10.1007/s10479-012-1070-0
