Abstract
This paper presents a policy iteration algorithm for solving communicating Markov decision processes (MDPs) with the average reward criterion. The algorithm is based on the result that every communicating MDP has an optimal policy which is unichain. The improvement step is modified to select only unichain policies; consequently, the nested optimality equations of Howard's multichain policy iteration algorithm are avoided. Properties and advantages of the algorithm are discussed, and it is incorporated into a decomposition algorithm for solving multichain MDPs. Since it is easier to verify that a problem is communicating than that it is unichain, we recommend using this algorithm in place of unichain policy iteration.
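To illustrate the kind of computation involved, the following is a minimal sketch of average-reward policy iteration on a small hypothetical communicating MDP. The transition matrices `P` and rewards `r` are invented for illustration and are not from the paper; every stationary policy in this toy example is irreducible (hence unichain), so the standard greedy improvement step is used here in place of the paper's modified step, which would additionally restrict the choice to unichain policies.

```python
import numpy as np

# Hypothetical 3-state MDP: P[a][s] is the transition row for action a
# in state s; r[a][s] is the one-step reward.  (Illustrative data only.)
P = {0: np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]]),
     1: np.array([[0.2, 0.4, 0.4],
                  [0.4, 0.2, 0.4],
                  [0.4, 0.4, 0.2]])}
r = {0: np.array([1.0, 2.0, 3.0]),
     1: np.array([2.0, 1.0, 2.5])}
n, actions = 3, (0, 1)

def evaluate(policy):
    """Solve g + h(s) = r(s) + sum_j P(j|s) h(j), normalizing h(0) = 0."""
    Ppi = np.array([P[policy[s]][s] for s in range(n)])
    rpi = np.array([r[policy[s]][s] for s in range(n)])
    A = np.eye(n) - Ppi      # coefficients of the bias vector h
    A[:, 0] = 1.0            # h(0) = 0, so column 0 carries the gain g
    x = np.linalg.solve(A, rpi)
    return x[0], np.concatenate(([0.0], x[1:]))   # gain g, bias h

def policy_iteration():
    policy = [0] * n
    while True:
        g, h = evaluate(policy)
        # Improvement step: switch an action only on a strict improvement,
        # so the iteration cannot cycle among equally good actions.
        new = list(policy)
        for s in range(n):
            best = r[policy[s]][s] + P[policy[s]][s] @ h
            for a in actions:
                val = r[a][s] + P[a][s] @ h
                if val > best + 1e-9:
                    best, new[s] = val, a
        if new == policy:
            return policy, g
        policy = new

pol, gain = policy_iteration()
```

The evaluation step solves the unichain evaluation equations directly as one linear system, which is exactly what the communicating structure makes possible; in a general multichain MDP the gain would be state-dependent and Howard's nested optimality equations would be required instead.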
References
J. Bather, Optimal decision procedures for finite Markov chains, Part II: communicating systems, Adv. Appl. Prob. 5 (1973) 521–540.
D. Blackwell, Discrete dynamic programming, Ann. Math. Statist. 33 (1962) 719–726.
C. Derman, Denumerable state Markov decision processes — average cost criteria, Ann. Math. Statist. 37 (1966) 1545–1553.
J.A. Filar and T.A. Schultz, Communicating MDPs: Equivalence and LP properties, Oper. Res. Lett. 7 (1988) 303–307.
B.L. Fox and D.M. Landi, An algorithm for identifying the ergodic subchains and transient states of a stochastic matrix, Commun. ACM 11 (1968) 619–621.
A. Hordijk and M.L. Puterman, On the convergence of policy iteration in undiscounted finite state Markov decision processes: The unichain case, Math. Oper. Res. 12 (1987) 163–176.
R. Howard, Dynamic Programming and Markov Processes (The MIT Press, Cambridge, MA, 1960).
K. Ohno and K. Ichiki, Computing optimal policies for controlled tandem queueing systems, Oper. Res. 35 (1987) 121–126.
M.L. Puterman, Markov decision processes, in: Handbook of Operations Research, vol. 2, Stochastic Models, D.P. Heyman and M.J. Sobel (eds.) (North-Holland, 1990) pp. 331–434.
K.W. Ross and R. Varadarajan, Multichain Markov decision processes with a sample-path constraint: a decomposition approach, Math. Oper. Res. (1991), to appear.
H. Tijms, Stochastic Modelling and Analysis (Wiley, New York, 1986).
J. van der Wal, Stochastic Dynamic Programs, Tract 139 (The Mathematical Centre, Amsterdam, 1981).
Additional information
This research has been partially supported by NSERC Grant A-5527.
Haviv, M., Puterman, M.L. An improved algorithm for solving communicating average reward Markov decision processes. Ann Oper Res 28, 229–242 (1991). https://doi.org/10.1007/BF02055583