Abstract
This paper is concerned with the links between the Value Iteration algorithm and the Rolling Horizon procedure for solving stochastic optimal control problems under the long-run average criterion, in Markov Decision Processes with finite state and action spaces. We review conditions from the literature that imply the geometric convergence of Value Iteration to the optimal value; aperiodicity is an essential prerequisite for convergence. We prove that the convergence of Value Iteration generally implies that of Rolling Horizon. We also present a modified Rolling Horizon procedure that can be applied to models without analyzing periodicity, and discuss the impact of this transformation on convergence. We illustrate the different results with numerous examples. Finally, we discuss rules for stopping Value Iteration or choosing the length of a Rolling Horizon, and provide an example that demonstrates the difficulty of the question, disproving in particular a conjectured rule proposed by Puterman.
Notes
Observe the discrepancy with the general notion of convergence of an algorithm in Computer Science, which requires that the algorithm stop and return the correct result.
References
Alden, J. M., & Smith, R. L. (1992). Rolling horizon procedures in nonhomogeneous Markov decision processes. Operations Research, 40(2), S183–S194.
Bertsekas, D. P. (1987). Dynamic programming: deterministic and stochastic models. Englewood Cliffs: Prentice Hall.
Derman, C. (1970). Finite state Markovian decision processes. New York: Academic Press.
Çinlar, E. (1975). Introduction to stochastic processes. Englewood Cliffs: Prentice Hall.
Federgruen, A., Schweitzer, P. J., & Tijms, H. C. (1978). Contraction mappings underlying undiscounted Markov decision problems. Journal of Mathematical Analysis and Applications, 65, 711–730.
Guo, X., & Shi, P. (2001). Limiting average criteria for nonstationary Markov decision processes. SIAM Journal on Optimization, 11(4), 1037–1053.
Hernández-Lerma, O., & Lasserre, J. B. (1990). Error bounds for rolling horizon policies in discrete-time Markov control processes. IEEE Transactions on Automatic Control, 35(10), 1118–1124.
Kallenberg, L. (2002). Finite state and action MDPs. In E. Feinberg & A. Shwartz (Eds.), Handbook of Markov decision processes: methods and applications. Boston: Kluwer Academic Publishers.
Kallenberg, L. (2009). Markov decision processes. Lecture notes, University of Leiden. www.math.leidenuniv.nl/~kallenberg/Lecture-notes-MDP.pdf.
Lanery, E. (1967). Etude asymptotique des systèmes Markoviens à commande. Revue Française d’Informatique et de Recherche Opérationnelle, 1, 3–56.
Meyn, S. P., & Tweedie, R. L. (2009). Markov chains and stochastic stability (2nd ed.). Cambridge: Cambridge University Press.
Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. New York: Wiley.
Ross, S. M. (1970). Applied probability models with optimization applications. Oakland: Holden-Day.
Schweitzer, P. J. (1971). Iterative solution of the functional equation of undiscounted Markov renewal programming. Journal of Mathematical Analysis and Applications, 34, 495–501.
Schweitzer, P. J., & Federgruen, A. (1977). The asymptotic behavior of undiscounted value iteration in Markov decision problems. Mathematics of Operations Research, 2(4), 360–381.
Schweitzer, P. J., & Federgruen, A. (1979). Geometric convergence of the value iteration in multichain Markov decision problems. Advances in Applied Probability, 11, 188–217.
Tijms, H. C. (1986). Stochastic modelling and analysis, a computational approach. New York: Wiley.
White, D. J. (1993). Markov decision processes. New York: Wiley.
Appendix
Each month, an individual must decide how to allocate his wealth between different consumption and investment options. Each state represents the individual's wealth level at the start of a month. Each wealth level gives access to two investment profiles, prudent or risky. Choosing a profile at each level determines the transition probabilities to the next wealth level, as well as an instantaneous gain. The individual's objective is to maximize the average gain.
There are five wealth levels, ordered from smallest to largest. At the medium level, the risky profile entails a positive probability of falling to the next lower level. It is also possible to cycle between the two lower levels, but no action gives access to the three upper levels from the lower ones. Moreover, from the poorest level, some external help brings the individual back to level 2. There is a common action space A={a_1,a_2}, where a_1 represents the prudent investment profile and a_2 the risky one. The data are shown below: \(P_{a_{k}}(s,j)\) is the transition probability from state s to state j when action a_k is used, i.e. \(P_{a_{k}}(s,j)=p(j|s,a_{k})\).
The gains are summarized as follows: r(s,a_k) in the matrix below is the gain obtained when action a_k is chosen in state s.
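The data tables themselves are not reproduced in this excerpt. As an illustration only, the convention \(P_{a_{k}}(s,j)=p(j|s,a_{k})\) maps naturally to a three-dimensional array layout; the values below are uniform placeholders, not the paper's tables:

```python
import numpy as np

# Placeholder encoding of the convention P_{a_k}(s, j) = p(j | s, a_k)
# for 5 states and 2 actions. The numbers are uniform dummies,
# NOT the tables from the paper.
n_states, n_actions = 5, 2
P = np.full((n_actions, n_states, n_states), 1.0 / n_states)
r = np.zeros((n_states, n_actions))  # r[s, k] = r(s, a_k)

# Sanity check: every row of every transition matrix sums to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```

Any concrete model, including the one of this appendix, can be entered by filling these two arrays with its actual probabilities and gains.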
Through the implementation of the MRH procedure, the optimal average wealth can be computed: g^* = (2,2,4,4,4). It is attained by the stationary policy associated with the decision rule d=(a_2,a_2,a_1,a_2,a_1), whose transition matrix is shown below.
Clearly, this is a multichain, periodic model. When the RH procedure is applied directly, it does not converge: it returns two policies, (a_2,a_2,a_1,a_1,a_1) and (a_2,a_2,a_1,a_2,a_1), infinitely (and periodically) often. The first yields a gain g=(2,2,3,3,3) and is therefore not optimal.
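This oscillation of the first-stage decision rule can be reproduced on a small hypothetical model (not the paper's five-state example): a transient state feeding a deterministic two-cycle. The sketch below assumes that RH with horizon N applies the first-stage maximizer of backward induction, and that the model can be made aperiodic with a Schweitzer-style data transformation P_a → (1−τ)I + τP_a; the specific model, function names, and τ = 0.5 are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical 3-state MDP (NOT the paper's five-state example):
# state 0 chooses between two actions; states 1 and 2 form a
# deterministic cycle of period 2 with rewards 0 and 4.
# Convention: P[k][s, j] = p(j | s, a_k), r[s, k] = r(s, a_k).
P = np.array([
    [[0., 1., 0.],   # a_1: 0 -> 1 (enter the cycle at state 1)
     [0., 0., 1.],   #      1 -> 2
     [0., 1., 0.]],  #      2 -> 1
    [[0., 0., 1.],   # a_2: 0 -> 2 (enter the cycle at state 2)
     [0., 0., 1.],   # states 1 and 2 behave identically under a_2
     [0., 1., 0.]],
])
r = np.array([
    [0.5, 0.0],  # small immediate bonus for the a_1 entry point
    [0.0, 0.0],
    [4.0, 4.0],
])

def rolling_horizon_rule(P, r, N):
    """Backward induction over horizon N; returns the first-stage
    decision rule (one action index per state) and the value v_N."""
    v = np.zeros(P.shape[1])
    rule = np.zeros(P.shape[1], dtype=int)
    for _ in range(N):
        # Q-values: q(s, a) = r(s, a) + sum_j p(j | s, a) v(j)
        q = r + np.einsum('asj,j->sa', P, v)
        rule, v = q.argmax(axis=1), q.max(axis=1)
    return rule, v

def aperiodicity_transform(P, tau=0.5):
    """Replace every P_a by (1 - tau) I + tau P_a, 0 < tau < 1: each
    state acquires a self-loop, removing all periodicities, while
    stationary distributions (hence average gains, the rewards being
    left unchanged) are preserved."""
    return (1.0 - tau) * np.eye(P.shape[-1]) + tau * P

# On the periodic model, the first-stage action chosen in state 0
# oscillates with the horizon length N ...
for N in (1, 2, 3, 4):
    print(N, rolling_horizon_rule(P, r, N)[0][0])
# ... while on the transformed, aperiodic model it stabilizes.
P_t = aperiodicity_transform(P)
for N in (1, 2, 3, 4):
    print(N, rolling_horizon_rule(P_t, r, N)[0][0])
```

The transformation changes neither the chain structure nor the average gains, which is what allows a modified procedure to be applied "without analyzing periodicity"; only the relative values are rescaled.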
Della Vecchia, E., Di Marco, S. & Jean-Marie, A. Illustrated review of convergence conditions of the value iteration algorithm and the rolling horizon procedure for average-cost MDPs. Ann Oper Res 199, 193–214 (2012). https://doi.org/10.1007/s10479-012-1070-0