Abstract
The goals of perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are the same: to make decisions that improve system performance, based on information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from the performance sensitivity analysis provided by PA. The performance potential plays an important role in both PA and MDPs; it also offers a clear intuitive interpretation of many results. Reinforcement learning methods, such as TD(λ) and neuro-dynamic programming, are efficient ways of estimating the performance potentials and related quantities from sample paths. The sensitivity point of view of PA, MDPs, and RL brings new insight to the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs, such as systems with correlated actions. Potential-based on-line approaches and their advantages are also discussed.
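To make the role of the performance potential concrete, the following sketch estimates the potentials g of a small ergodic Markov chain from a single sample path, using tabular average-reward TD(0) in the style the abstract alludes to. The potentials satisfy the Poisson equation g = f − ηe + Pg, where η is the average reward. The transition matrix P, reward vector f, and step-size schedules here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ergodic Markov chain (hypothetical example for illustration).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, 3.0, 2.0])   # per-state reward

# Simulate one sample path and run tabular average-reward TD(0):
#   eta  <- eta + a_t * (f(x_t) - eta)
#   g(x) <- g(x) + b_t * (f(x_t) - eta + g(x_{t+1}) - g(x_t))
eta, g = 0.0, np.zeros(3)
x = 0
for t in range(1, 200_000):
    y = rng.choice(3, p=P[x])
    a, b = 1.0 / t, 10.0 / (100 + t)
    eta += a * (f[x] - eta)
    g[x] += b * (f[x] - eta + g[y] - g[x])
    x = y

# The potentials solve the Poisson equation g = f - eta*1 + P g only up
# to an additive constant, so check the residual after centering it.
residual = f - eta + P @ g - g
print(eta, residual - residual.mean())
```

Because potentials are determined only up to an additive constant, any fixed state can be used as a reference (e.g., setting g[0] = 0); performance differences and gradients depend only on potential differences, which is what makes single-sample-path estimation possible.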
Cao, XR. From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning. Discrete Event Dynamic Systems 13, 9–39 (2003). https://doi.org/10.1023/A:1022188803039