This article proposes several two-timescale, simulation-based actor-critic algorithms for solving infinite-horizon Markov decision processes with finite state space under the average-cost criterion. Two of the algorithms address the compact (non-discrete) action setting, while the rest address finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different simultaneous perturbation stochastic approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging step is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. This TD(0) algorithm does not follow an on-line sampling of states but is observed to do well in our setting. Numerical experiments on a rate-based flow control problem are presented using the proposed algorithms; the model considered is a single bottleneck node in the continuous-time queueing framework. We compare the performance of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms perform more than an order of magnitude better than those of Konda and Borkar (1999).
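The slower-timescale gradient search mentioned in the abstract relies on SPSA gradient estimates. As a rough illustration only, the sketch below shows the generic two-measurement SPSA estimate in the form given by Spall (1992), applied to a noise-free toy cost. It is not the article's algorithm: the function names, the quadratic cost (a stand-in for the average cost of a parameterised policy), the random ±1 perturbations, and the step-size choices are all assumptions made for this example.

```python
import numpy as np


def spsa_gradient(f, theta, delta, rng):
    """Two-measurement SPSA estimate of the gradient of f at theta."""
    # Random +/-1 perturbation: all coordinates are perturbed simultaneously,
    # so only two evaluations of f are needed regardless of dimension.
    perturbation = rng.choice([-1.0, 1.0], size=theta.shape)
    f_plus = f(theta + delta * perturbation)
    f_minus = f(theta - delta * perturbation)
    # Element-wise division by the perturbation gives one estimate per coordinate.
    return (f_plus - f_minus) / (2.0 * delta * perturbation)


def quadratic_cost(theta):
    # Hypothetical smooth cost with minimiser at (1, -2); it only stands in
    # for a simulated average cost and is not taken from the article.
    target = np.array([1.0, -2.0])
    return float(np.sum((theta - target) ** 2))


def run_spsa(num_iters=2000, seed=0):
    """Gradient descent driven by SPSA estimates (assumed step sizes)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for n in range(1, num_iters + 1):
        step = 0.5 / n   # assumed diminishing step size
        delta = 0.1      # assumed fixed perturbation magnitude
        theta = theta - step * spsa_gradient(quadratic_cost, theta, delta, rng)
    return theta


if __name__ == "__main__":
    print(run_spsa())
```

In an actor-critic setting the two cost evaluations would come from simulation (the faster-timescale critic) rather than from a closed-form function, which is why the estimate's low measurement count matters.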
Altman E (2001) Applications of Markov decision processes in communication networks: A survey. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht
Bertsekas DP (1976) Dynamic programming and stochastic control. Academic Press, New York
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont, MA
Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation 15(1):74–107
Bhatnagar S, Abdulla MS (2005) Reinforcement Learning Based Algorithms for Finite Horizon Markov Decision Processes (submitted)
Bhatnagar S, Fu MC, Marcus SI, Fard PJ (2001) Optimal structured feedback policies for ABR flow control using two-timescale SPSA. IEEE/ACM Transactions on Networking 9(4):479–491
Bhatnagar S, Fu MC, Marcus SI, Bhatnagar S (2001) Two timescale algorithms for simulation optimization of hidden Markov models. IIE Transactions (Pritsker special issue on simulation) 3:245–258
Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation 13(4):180–209
Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes. IEEE Transactions on Automatic Control 49(4):592–598
Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644
Borkar VS (1998) Asynchronous stochastic approximation. SIAM Journal on Control and Optimization 36:840–851
Borkar VS, Konda VR (1997) Actor-critic algorithm as multi-time scale stochastic approximation. Sadhana 22:525–543
Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447–469
Gerencser L, Hill SD, Vago Z (1999) Optimization over discrete sets via SPSA. In: Proceedings of the 38th IEEE Conference on Decision and Control (CDC99), Phoenix, Arizona, pp. 1791–1794
He Y, Fu MC, Marcus SI (2000) A simulation-based policy iteration algorithm for average cost unichain Markov decision processes. In: Laguna M, Gonzalez-Velarde JL (eds) Computing tools for modeling, optimization and simulation, Kluwer, pp. 161–182
Konda VR, Borkar VS (1999) Actor-critic type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38(1):94–123
Konda VR, Tsitsiklis JN (2003) Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4):1143–1166
Kushner HJ, Clark DS (1978) Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, Berlin Heidelberg New York
Marbach P, Mihatsch O, Tsitsiklis JN (2000) Call admission control and routing in integrated services networks using neuro-dynamic programming. IEEE Journal on Selected Areas in Communications 18(2):197–208
Perko L (1998) Differential Equations and Dynamical Systems, 2nd ed. Texts in Applied Mathematics, vol. 7. Springer, Berlin Heidelberg New York
Puterman ML (1994) Markov decision processes: Discrete stochastic dynamic programming. Wiley, New York
Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37:332–341
Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112
Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674–690
Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35(11):1799–1808
Van Roy B (2001) Neuro-dynamic programming: Overview and recent trends. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht
This work was supported in part by Grant no. SR/S3/EE/43/2002-SERC-Engg from the Department of Science and Technology, Government of India.
About this article
Cite this article
Abdulla, M.S., Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Discrete Event Dyn Syst 17, 23–52 (2007). https://doi.org/10.1007/s10626-006-0003-y
Keywords
- Actor-critic algorithms
- Two timescale stochastic approximation
- Markov decision processes
- Policy iteration
- Simultaneous perturbation stochastic approximation
- Normalized Hadamard matrices
- Reinforcement learning