Computing semi-stationary optimal policies for multichain semi-Markov decision processes
We consider semi-Markov decision processes (SMDPs) with finite state and action spaces and a general multichain structure. A form of limiting ratio average (undiscounted) reward is the criterion for comparing different policies. The main result is that the value vector and a pure optimal semi-stationary policy (i.e., a policy which depends only on the initial state and the current state) for such an SMDP can be computed directly from an optimal solution of a finite set (whose cardinality equals the number of states) of linear programming (LP) problems. More precisely, we prove that the single LP associated with a fixed initial state provides the value and an optimal pure stationary policy of the corresponding SMDP. The relation between the set of feasible solutions of each LP and the set of stationary policies is also analyzed. Examples are worked out to illustrate the algorithm.
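To make the LP connection concrete, the following is a minimal sketch of the classical occupation-measure LP for the limiting ratio average reward of an SMDP, solved for a small hypothetical two-state, two-action instance. This is the standard unichain-type formulation, not the paper's multichain construction (which solves one such LP per initial state); all numerical data below are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action SMDP (numbers invented for illustration):
# r[s, a]    = expected one-step reward,
# tau[s, a]  = expected sojourn time in state s under action a,
# P[s, a, j] = probability of jumping to state j.
S, A = 2, 2
r = np.array([[1.0, 0.0],
              [3.0, 0.0]])
tau = np.array([[1.0, 1.0],
                [2.0, 1.0]])
P = np.zeros((S, A, S))
P[0, 0, 0] = 1.0  # action 0 stays in state 0
P[0, 1, 1] = 1.0  # action 1 moves 0 -> 1
P[1, 0, 1] = 1.0  # action 0 stays in state 1
P[1, 1, 0] = 1.0  # action 1 moves 1 -> 0

# Decision variables x[s, a] >= 0 (long-run occupation measures), flattened.
n = S * A
c = -r.flatten()  # linprog minimizes, so negate the reward

# Balance constraints: sum_a x[j, a] - sum_{s,a} P[s, a, j] x[s, a] = 0
# for each state j, plus time normalization sum_{s,a} tau[s, a] x[s, a] = 1.
A_eq = np.zeros((S + 1, n))
for j in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - P[s, a, j]
A_eq[S, :] = tau.flatten()
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
x = res.x.reshape(S, A)
gain = -res.fun  # optimal limiting ratio average reward

# Read off a pure stationary policy: in each recurrent state, the action
# carrying positive occupation measure (transient states: any action works).
policy = x.argmax(axis=1)
print(gain, policy)
```

In this instance the optimal behavior is to reach state 1 and keep playing action 0 there, earning reward 3 per expected sojourn time 2, i.e., a gain of 1.5. Under a multichain structure the gain generally depends on the initial state, which is why the paper associates a separate LP with each one.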
Keywords: Semi-Markov decision processes · Limiting ratio average reward · Multichain structure · Pure optimal semi-stationary policies · Linear programming
Mathematics Subject Classification: 60K15 · 60K20
I am grateful to Prof. T. Parthasarathy of CMI & ISI Chennai. Some ideas presented in this paper result from a fruitful discussion with him during the International Conference & Workshop on “Game Theory and Optimization”, June 6–10, 2016, at IIT Madras. This paper is dedicated to Prof. T. Parthasarathy, who has made significant contributions to the theory of games and linear complementarity problems, on the occasion of his 75th birthday. I am also thankful to Prof. S. Sinha of Jadavpur University, Kolkata, for many valuable suggestions. I would like to thank the two anonymous referees for their valuable and detailed comments that have helped structure this paper better.