Solution of MDPs Using Simulation-Based Value Iteration

  • Mohammed Shahid Abdulla
  • Shalabh Bhatnagar
Conference paper
Part of the IFIP — The International Federation for Information Processing book series (IFIPAICT, volume 187)


This article proposes a three-timescale, simulation-based algorithm for the solution of infinite-horizon Markov Decision Processes (MDPs). We assume a finite state space and a discounted cost criterion, and adopt the value iteration approach. An approximation of the Dynamic Programming operator T is applied to the value function iterates. This ‘approximate’ operator is implemented using three timescales, the slowest of which updates the value function iterates. On the middle timescale we perform a gradient search over the feasible action set of each state using Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates, thus finding the minimizing action in T. On the fastest timescale we obtain the ‘critic’ estimates over which the gradient search is performed. A sketch of convergence that explains the dynamics of the algorithm through its associated ODEs is also presented. Numerical experiments on rate-based flow control at a bottleneck node, using a continuous-time queueing model, are performed with the proposed algorithm. The results are verified against classical value iteration in which the feasible set is suitably discretized. Over such a discretized setting, a variant of the algorithm of missing data is compared and the proposed algorithm is found to converge faster.
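The structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the three coupled timescales are collapsed into plain nested loops with a constant step size, the queue model (`p_up`, `cost`), the constants, and all function names are hypothetical, and by default the critic uses an exact expectation in place of simulation (passing `rng` and `n_samples` switches to a simulated sample average). With a scalar action, the SPSA estimate reduces to a randomly signed central difference. A grid-based classical value iteration is included as the kind of baseline the abstract mentions.

```python
import random

GAMMA = 0.9      # discount factor (illustrative)
N_STATES = 4     # queue lengths 0..3 (illustrative)
TARGET = 1       # desired queue length (illustrative)


def p_up(a):
    """Toy model: probability that the queue grows by one under rate a."""
    return 0.1 + 0.8 * a


def cost(i, a):
    """Toy single-stage cost: distance from target plus a rate penalty."""
    return (i - TARGET) ** 2 + (a - 0.5) ** 2


def critic(i, a, V, rng=None, n_samples=0):
    """Estimate cost(i, a) + GAMMA * E[V(next state)].

    rng=None uses the exact expectation; otherwise the expectation is a
    sample average over n_samples simulated transitions.
    """
    up, down = min(i + 1, N_STATES - 1), max(i - 1, 0)
    p = p_up(a)
    if rng is None:
        expected = p * V[up] + (1.0 - p) * V[down]
    else:
        draws = (V[up] if rng.random() < p else V[down] for _ in range(n_samples))
        expected = sum(draws) / n_samples
    return cost(i, a) + GAMMA * expected


def spsa_value_iteration(n_outer=150, n_inner=40, delta=0.05, step=0.25,
                         rng=None, n_samples=0, seed=0):
    """Value iteration whose minimisation step is an SPSA gradient search."""
    signs = random.Random(seed)          # source of the +/-1 perturbations
    V = [0.0] * N_STATES
    actions = [0.5] * N_STATES
    for _ in range(n_outer):
        for i in range(N_STATES):
            a = actions[i]
            for _ in range(n_inner):
                d = signs.choice((-1.0, 1.0))
                # Perturbed rates may leave [0, 1] by delta; p_up stays valid.
                q_plus = critic(i, a + delta * d, V, rng, n_samples)
                q_minus = critic(i, a - delta * d, V, rng, n_samples)
                grad = (q_plus - q_minus) / (2.0 * delta * d)
                a = min(1.0, max(0.0, a - step * grad))   # project onto [0, 1]
            actions[i] = a
        # Slowest level: apply the approximate operator T to the value iterate.
        V = [critic(i, actions[i], V, rng, n_samples) for i in range(N_STATES)]
    return V, actions


def grid_value_iteration(n_grid=101, n_iter=150):
    """Classical value iteration over a uniformly discretised action set."""
    grid = [k / (n_grid - 1) for k in range(n_grid)]
    V = [0.0] * N_STATES
    best = [0.5] * N_STATES
    for _ in range(n_iter):
        new_V = []
        for i in range(N_STATES):
            q, a = min((critic(i, a, V), a) for a in grid)
            new_V.append(q)
            best[i] = a
        V = new_V
    return V, best


if __name__ == "__main__":
    V_spsa, a_spsa = spsa_value_iteration()
    V_grid, a_grid = grid_value_iteration()
    print("SPSA-based:", [round(v, 3) for v in V_spsa], [round(a, 3) for a in a_spsa])
    print("Grid-based:", [round(v, 3) for v in V_grid], [round(a, 3) for a in a_grid])
```

In this toy setting the two routines should agree up to the resolution of the action grid. A version closer to the paper would replace the nested loops with three simultaneously running recursions whose step sizes decay at different rates.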


References

  1. Abdulla, M.S., and Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Submitted.
  2. Bertsekas, D.P. Dynamic Programming and Stochastic Control. New York: Academic Press, 1976.
  3. Bhatnagar, S., and Abdulla, M.S. Reinforcement Learning Based Algorithms for Finite Horizon Markov Decision Processes. Submitted.
  4. Bhatnagar, S., and Kumar, S. A Simultaneous Perturbation Stochastic Approximation-Based Actor-Critic Algorithm for Markov Decision Processes. IEEE Trans. on Automatic Control, 2004, 49(4): 592–598.
  5. Bhatnagar, S., et al. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Trans. on Modeling and Computer Simulation, 2003, 13(4): 180–209.
  6. Choi, D.S., and Van Roy, B. A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning. Submitted to Discrete Event Dynamic Systems.
  7. De Farias, D.P., and Van Roy, B. On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning. Journal of Optimization Theory and Applications, June 2000, 105(3).
  8. Konda, V.R., and Borkar, V.S. Actor-Critic Type Learning Algorithms for Markov Decision Processes. SIAM J. Control Optim., 1999, 38(1): 94–123.
  9. Konda, V.R., and Tsitsiklis, J.N. Actor-Critic Algorithms. SIAM J. Control Optim., 2003, 42(4): 1143–1166.
  10. Singh, S., and Bertsekas, D. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. Advances in Neural Information Processing Systems (NIPS), 1997, 9: 974–980.
  11. Tsitsiklis, J.N., and Van Roy, B. Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives. IEEE Trans. on Automatic Control, 1999, 44(10): 1840–1851.
  12. Van Roy, B., et al. A Neuro-Dynamic Programming Approach to Retailer Inventory Management. Proc. of the IEEE Conf. on Decision and Control, 1997.

Copyright information

© International Federation for Information Processing 2005

Authors and Affiliations

  • Mohammed Shahid Abdulla (1)
  • Shalabh Bhatnagar (1)
  1. Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
