An improved algorithm for solving communicating average reward Markov decision processes

  • Research Contributions
  • Published in: Annals of Operations Research

Abstract

This paper provides a policy iteration algorithm for solving communicating Markov decision processes (MDPs) with the average reward criterion. The algorithm is based on the result that for communicating MDPs there is an optimal policy which is unichain. The improvement step is modified to select only unichain policies; consequently, the nested optimality equations of Howard's multichain policy iteration algorithm are avoided. Properties and advantages of the algorithm are discussed, and it is incorporated into a decomposition algorithm for solving multichain MDPs. Since it is easier to show that a problem is communicating than unichain, we recommend using this algorithm in place of unichain policy iteration.
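For orientation, the unichain structure is what keeps the evaluation step simple: the gain g and bias h of a unichain stationary policy d satisfy the single equation g + h(s) = r(s, d(s)) + Σ_{s'} p(s' | s, d(s)) h(s'), together with one normalization such as h(s₀) = 0, whereas multichain policy iteration must solve a nested pair of optimality equations for gain and bias. The sketch below is a minimal Python illustration of textbook average-reward policy iteration under the assumption that every stationary policy is unichain; it does not implement the paper's modified improvement step, which is what guarantees that only unichain policies arise in a general communicating MDP. All function names and the example data are illustrative.

```python
import numpy as np

def evaluate(policy, P, r):
    """Solve the unichain evaluation equations
         g + h(s) = r(s, d(s)) + sum_{s'} p(s' | s, d(s)) h(s'),
       with the normalization h(0) = 0, for the gain g and bias h."""
    n = len(policy)
    # Unknowns: x = (g, h(1), ..., h(n-1)); h(0) is pinned to 0.
    A = np.zeros((n, n))
    b = np.zeros(n)
    for s in range(n):
        a = policy[s]
        A[s, 0] = 1.0  # coefficient of the gain g
        for t in range(1, n):
            A[s, t] = (1.0 if t == s else 0.0) - P[a, s, t]
        b[s] = r[s, a]
    x = np.linalg.solve(A, b)
    return x[0], np.concatenate(([0.0], x[1:]))  # g, h

def improve(policy, P, r, h, tol=1e-10):
    """Greedy improvement: maximize r(s,a) + sum_{s'} p(s'|s,a) h(s'),
       keeping the incumbent action unless it is strictly improvable."""
    new_policy = policy.copy()
    n, m = r.shape
    for s in range(n):
        q = np.array([r[s, a] + P[a, s] @ h for a in range(m)])
        if q.max() > q[policy[s]] + tol:
            new_policy[s] = int(np.argmax(q))
    return new_policy

def policy_iteration(P, r, max_iter=1000):
    n, m = r.shape
    policy = np.zeros(n, dtype=int)
    g, h = evaluate(policy, P, r)
    for _ in range(max_iter):
        new_policy = improve(policy, P, r, h)
        if np.array_equal(new_policy, policy):
            return policy, g  # no strict improvement anywhere: optimal
        policy = new_policy
        g, h = evaluate(policy, P, r)
    return policy, g

if __name__ == "__main__":
    # Tiny two-state, two-action example (hypothetical numbers);
    # every stationary policy here is irreducible, hence unichain.
    P = np.array([[[0.9, 0.1],
                   [0.2, 0.8]],
                  [[0.5, 0.5],
                   [0.6, 0.4]]])   # P[a, s, s']
    r = np.array([[1.0, 2.0],
                  [0.0, 3.0]])     # r[s, a]
    pol, gain = policy_iteration(P, r)
    print("policy:", pol, "long-run average reward:", gain)
```

Fixing h at a reference state is the standard way to make the otherwise singular evaluation system uniquely solvable for a unichain policy, and the tie-breaking rule in the improvement step (change an action only on strict improvement) is the usual device for preventing cycling.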

References

  1. J. Bather, Optimal decision procedures for finite Markov chains, Part II: communicating systems, Adv. Appl. Prob. 5 (1973) 521–540.

  2. D. Blackwell, Discrete dynamic programming, Ann. Math. Statist. 33 (1962) 719–726.

  3. C. Derman, Denumerable state Markov decision processes — average cost criteria, Ann. Math. Statist. 37 (1966) 1545–1553.

  4. J. Filar and T.A. Schultz, Communicating MDPs: Equivalence and LP properties, Oper. Res. Lett. 7 (1988) 303–307.

  5. B.L. Fox and D.M. Landi, An algorithm for identifying the ergodic subchains and transient states of a stochastic matrix, Commun. ACM 11 (1968) 619–621.

  6. A. Hordijk and M.L. Puterman, On the convergence of policy iteration in undiscounted finite state Markov decision processes: The unichain case, Math. Oper. Res. 12 (1987) 163–176.

  7. R. Howard, Dynamic Programming and Markov Processes (The MIT Press, Cambridge, MA, 1960).

  8. K. Ohno and K. Ichiki, Computing optimal policies for controlled tandem queueing systems, Oper. Res. 35 (1987) 121–126.

  9. M.L. Puterman, Markov decision processes, in: Handbooks in Operations Research and Management Science, vol. 2: Stochastic Models, D.P. Heyman and M.J. Sobel (eds.) (North-Holland, Amsterdam, 1990) pp. 331–434.

  10. K.W. Ross and R. Varadarajan, Multichain Markov decision processes with a sample-path constraint: a decomposition approach, Math. Oper. Res. (1991), to appear.

  11. H. Tijms, Stochastic Modelling and Analysis (Wiley, New York, 1986).

  12. J. van der Wal, Stochastic Dynamic Programming, Tract 139 (The Mathematical Centre, Amsterdam, 1981).



Additional information

This research has been partially supported by NSERC Grant A-5527.



Cite this article

Haviv, M., Puterman, M.L. An improved algorithm for solving communicating average reward Markov decision processes. Ann Oper Res 28, 229–242 (1991). https://doi.org/10.1007/BF02055583

