Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes

Abstract

This article proposes several two-timescale simulation-based actor-critic algorithms for solving infinite-horizon Markov Decision Processes with finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting, while the rest are for finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning on the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to perform well in our setting. Numerical experiments using the proposed algorithms are presented for a problem of rate-based flow control, modeled as a single bottleneck node in a continuous-time queueing framework. We compare the performance of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004); our algorithms perform more than an order of magnitude better than those of Konda and Borkar (1999).
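
To make the two-timescale structure concrete, here is a minimal sketch, not the paper's exact algorithms: it couples an average-cost TD(0)-style critic on the faster timescale with a two-measurement SPSA gradient step over the policy parameters on the slower timescale. The toy MDP simulator `step`, the one-hot features `phi`, the softmax policy parameterization, the step-size choices, and the i.i.d. Bernoulli ±1 perturbations (in place of the Hadamard-matrix-based deterministic perturbations also considered in the paper) are all illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

# Hypothetical finite MDP, used only to make the sketch runnable.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # one-step costs

def step(s, a):
    """Simulate one transition; return (next state, one-step cost)."""
    return rng.choice(n_states, p=P[s, a]), C[s, a]

def phi(s):
    """Feature vector for state s (a one-hot encoding here)."""
    e = np.zeros(n_states)
    e[s] = 1.0
    return e

def policy(theta, s):
    """Softmax (Boltzmann) action probabilities for state s."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def avg_cost(theta, horizon=2000):
    """Crude simulation estimate of the long-run average cost under theta."""
    s, total = 0, 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(theta, s))
        s, c = step(s, a)
        total += c
    return total / horizon

theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
w = np.zeros(n_states)                   # critic weights for the differential cost
rho = 0.0                                # running average-cost estimate
c_spsa = 0.1                             # SPSA perturbation magnitude
s = 0

for n in range(1, 501):
    a_n, b_n = 1.0 / n, 1.0 / n ** 0.6   # slow (actor) and fast (critic) step sizes

    # Faster timescale: average-cost TD(0) critic for the current policy.
    for _ in range(20):
        a = rng.choice(n_actions, p=policy(theta, s))
        s_next, cost = step(s, a)
        delta = cost - rho + w @ phi(s_next) - w @ phi(s)  # temporal difference
        w += b_n * delta * phi(s)
        rho += b_n * (cost - rho)
        s = s_next

    # Slower timescale: two-measurement SPSA gradient step on the policy parameters.
    Delta = rng.choice([-1.0, 1.0], size=theta.shape)      # i.i.d. +/-1 perturbations
    J_plus = avg_cost(theta + c_spsa * Delta, horizon=200)
    J_minus = avg_cost(theta - c_spsa * Delta, horizon=200)
    theta -= a_n * (J_plus - J_minus) / (2.0 * c_spsa * Delta)

print("estimated average cost after training:", avg_cost(theta))
```

The essential design choice is the step-size separation a_n/b_n → 0: the critic then tracks the differential cost and average cost of the current policy on the faster timescale, so the actor's SPSA update effectively sees a converged critic.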

References

  1. Altman E (2001) Applications of Markov decision processes in communication networks: a survey. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht

  2. Bertsekas DP (1976) Dynamic programming and stochastic control. Academic Press, New York

  3. Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont, MA

  4. Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation 15(1):74–107

  5. Bhatnagar S, Abdulla MS (2005) Reinforcement learning based algorithms for finite horizon Markov decision processes (submitted)

  6. Bhatnagar S, Fu MC, Marcus SI, Fard PJ (2001) Optimal structured feedback policies for ABR flow control using two-timescale SPSA. IEEE/ACM Transactions on Networking 9(4):479–491

  7. Bhatnagar S, Fu MC, Marcus SI, Bhatnagar S (2001) Two timescale algorithms for simulation optimization of hidden Markov models. IIE Transactions (Pritsker special issue on simulation) 3:245–258

  8. Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation 13(4):180–209

  9. Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes. IEEE Transactions on Automatic Control 49(4):592–598

  10. Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644

  11. Borkar VS (1998) Asynchronous stochastic approximation. SIAM Journal on Control and Optimization 36:840–851

  12. Borkar VS, Konda VR (1997) Actor-critic algorithm as multi-time scale stochastic approximation. Sadhana 22:525–543

  13. Borkar VS, Meyn SP (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447–469

  14. Gerencser L, Hill SD, Vago Z (1999) Optimization over discrete sets via SPSA. In: Proceedings of the 38th IEEE Conference on Decision and Control (CDC '99), Phoenix, Arizona, pp. 1791–1794

  15. He Y, Fu MC, Marcus SI (2000) A simulation-based policy iteration algorithm for average cost unichain Markov decision processes. In: Laguna M, Gonzalez-Velarde JL (eds) Computing tools for modeling, optimization and simulation, Kluwer, pp. 161–182

  16. Konda VR, Borkar VS (1999) Actor-critic type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization 38(1):94–123

  17. Konda VR, Tsitsiklis JN (2003) Actor-critic algorithms. SIAM Journal on Control and Optimization 42(4):1143–1166

  18. Kushner HJ, Clark DS (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer, Berlin Heidelberg New York

  19. Marbach P, Mihatsch O, Tsitsiklis JN (2000) Call admission control and routing in integrated services networks using neuro-dynamic programming. IEEE J Sel Areas Commun 18(2):197–208

  20. Perko L (1998) Differential equations and dynamical systems, 2nd ed. Texts in Applied Mathematics, vol. 7. Springer, Berlin Heidelberg New York

  21. Puterman ML (1994) Markov decision processes: Discrete stochastic dynamic programming. Wiley, New York

  22. Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5):674–690

  23. Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35(11):1799–1808

  24. Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control 37:332–341

  25. Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112

  26. Van Roy B (2001) Neuro-dynamic programming: Overview and recent trends. In: Feinberg E, Shwartz A (eds) Handbook of Markov Decision Processes: Methods and Applications. Kluwer, Dordrecht

Author information

Corresponding author

Correspondence to Shalabh Bhatnagar.

Additional information

This work was supported in part by Grant no. SR/S3/EE/43/2002-SERC-Engg from the Department of Science and Technology, Government of India.

About this article

Cite this article

Abdulla, M.S., Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Discrete Event Dyn Syst 17, 23–52 (2007). https://doi.org/10.1007/s10626-006-0003-y

Keywords

  • Actor-critic algorithms
  • Two-timescale stochastic approximation
  • Markov decision processes
  • Policy iteration
  • Simultaneous perturbation stochastic approximation
  • Normalized Hadamard matrices
  • Reinforcement learning
  • TD-learning