Solution of MDPs Using Simulation-Based Value Iteration

Abdulla, Mohammed Shahid; Bhatnagar, Shalabh

doi:10.1007/0-387-29295-0_83

Conference paper

Part of the book series: IFIP — The International Federation for Information Processing (IFIPAICT, volume 187)

Abstract

This article proposes a three-timescale, simulation-based algorithm for the solution of infinite-horizon Markov Decision Processes (MDPs). We assume a finite state space and a discounted-cost criterion, and adopt the value iteration approach. An approximation of the Dynamic Programming operator T is applied to the value function iterates. This ‘approximate’ operator is implemented using three timescales, the slowest of which updates the value function iterates. On the middle timescale we perform a gradient search over the feasible action set of each state using Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates, thus finding the minimizing action in T. On the fastest timescale, the ‘critic’ estimates, over which the gradient search is performed, are obtained. A sketch of convergence, explaining the dynamics of the algorithm through the associated ODEs, is also presented. Numerical experiments on rate-based flow control at a bottleneck node, using a continuous-time queueing model, are performed with the proposed algorithm. The results obtained are verified against classical value iteration in which the feasible set is suitably discretized. Over such a discretized setting, a variant of the algorithm of missing data is compared and the proposed algorithm is found to converge faster.
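The two ingredients named in the abstract, classical value iteration over a discretized action set and an SPSA-style gradient search for the minimizing action in T, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the five-state toy queue, the cost function, the discount factor, and the names `q_value`, `value_iteration` and `spsa_action_search` are all hypothetical. The sketch is also model-based: it evaluates the one-step cost-to-go exactly, whereas the proposed algorithm obtains such 'critic' estimates by simulation on a faster timescale.

```python
import numpy as np

GAMMA = 0.9           # discount factor (assumed)
N_STATES = 5          # toy queue lengths 0..4 (assumed, not the paper's model)
ACTION_WEIGHT = 20.0  # weight of the hypothetical action cost

def q_value(V, s, a):
    """Single-stage cost plus discounted expected cost-to-go for action a.

    Hypothetical dynamics: the queue grows by one with probability a and
    shrinks by one with probability 1 - a (reflected at both ends).
    """
    up, down = min(s + 1, N_STATES - 1), max(s - 1, 0)
    cost = float(s) + ACTION_WEIGHT * (a - 0.5) ** 2
    return cost + GAMMA * (a * V[up] + (1.0 - a) * V[down])

def value_iteration(actions, tol=1e-8, max_iter=10_000):
    """Classical value iteration; the min in T runs over a finite action grid."""
    V = np.zeros(N_STATES)
    for _ in range(max_iter):
        V_new = np.array([min(q_value(V, s, a) for a in actions)
                          for s in range(N_STATES)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

def spsa_action_search(V, s, a0=0.5, c=0.01, step=0.0125, n_iter=60, seed=0):
    """Projected gradient search over a in [c, 1 - c] with SPSA estimates.

    Each iteration uses two evaluations of q_value at a + c*delta and
    a - c*delta, where delta is a random sign. For a scalar action the sign
    cancels, so the estimate is a two-sided finite difference.
    """
    rng = np.random.default_rng(seed)
    a = a0
    for _ in range(n_iter):
        delta = rng.choice([-1.0, 1.0])
        grad = (q_value(V, s, a + c * delta)
                - q_value(V, s, a - c * delta)) / (2.0 * c * delta)
        a = float(np.clip(a - step * grad, c, 1.0 - c))
    return a

grid = np.linspace(0.0, 1.0, 21)      # discretized feasible action set
V = value_iteration(grid)
a_spsa = spsa_action_search(V, s=2)   # continuous-action search in state 2
```

In this sketch the gradient search replaces the exhaustive minimum over the grid for a single state and a fixed value function. In a multi-timescale scheme of the kind the abstract describes, the action search and the value update would instead run concurrently with different step-size schedules.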

Author information

Authors and Affiliations

Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India
Mohammed Shahid Abdulla & Shalabh Bhatnagar

Editor information

Editors and Affiliations

China Agricultural University, China
Daoliang Li & Baoji Wang

About this paper

Cite this paper

Abdulla, M.S., Bhatnagar, S. (2005). Solution of MDPs Using Simulation-Based Value Iteration. In: Li, D., Wang, B. (eds) Artificial Intelligence Applications and Innovations. AIAI 2005. IFIP — The International Federation for Information Processing, vol 187. Springer, Boston, MA. https://doi.org/10.1007/0-387-29295-0_83

DOI: https://doi.org/10.1007/0-387-29295-0_83
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28318-0
Online ISBN: 978-0-387-29295-3
eBook Packages: Computer Science, Computer Science (R0)
