Solution of Mdps Using Simulation-Based Value Iteration
This article proposes a three-timescale simulation based algorithm for solution of infinite horizon Markov Decision Processes (MDPs). We assume a finite state space and discounted cost criterion and adopt the value iteration approach. An approximation of the Dynamic Programming operator T is applied to the value function iterates. This ‘approximate’ operator is implemented using three timescales, the slowest of which updates the value function iterates. On the middle timescale we perform a gradient search over the feasible action set of each state using Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates, thus finding the minimizing action in T. On the fastest timescale, the ‘critic’ estimates, over which the gradient search is performed, are obtained. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is also presented. Numerical experiments on rate based flow control on a bottleneck node using a continuous-time queueing model are performed using the proposed algorithm. The results obtained are verified against classical value iteration where the feasible set is suitably discretized. Over such a discretized setting, a variant of the algorithm of missing data is compared and the proposed algorithm is found to converge faster.
- Abdulla, M.S., and Bhatnagar, S. Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes. Submitted.
- Bertsekas, D.P. Dynamic Programming and Stochastic Control, 1976 New York: Academic Press.
- Bhatnagar, S., and Abdulla, M.S. Reinforcement Learning Based Algorithms for Finite Horizon Markov Decision Processes. Submitted.
- Bhatnagar, S., and Kumar, S. A Simultaneous Perturbation Stochastic Approximation-Based Actor-Critic Algorithm for Markov Decision Processes. IEEE Trans, on Automatic Control, 2004, 49(4):592–598. CrossRef
- Bhatnagar, S., et al., Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Trans, on Modeling and Computer Simulation, 2003, 13(4): 180–209. CrossRef
- Choi, D. S., and Van Roy, B. A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, Submitted to Discrete Event Dynamic Systems.
- De Farias, D.P., and Van Roy, B. On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning. Journal of Optimization Theory and Applications, June 2000, 105(3).
- Konda, V.R., and Borkar, V.S. Actor-Critic Type Learning Algorithms for Markov Decision Processes. SIAM J. Control Optim., 1999, 38(1): 94–123. CrossRef
- Konda, V.R., and Tsitsiklis, J.N. Actor-Critic Algorithms. SIAM J. Control Optim., 2003, 42(4): 1143–1166. CrossRef
- Singh, S., and Bertsekas, D. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. Advances in Neural Information Processing Systems (NIPS), 1997,9:974–980.
- Tsitsiklis, J. N., and Van Roy, B. Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives. IEEE Trans, on Automatic Control, 1999, 44(10): 1840–1851. CrossRef
- Van Roy, B., et. al., A Neuro-Dynamic Programming Approach to Retailer Inventory Management, 1997, Proc. of the IEEE Conf. on Decision and Control.
- Solution of Mdps Using Simulation-Based Value Iteration
- Book Title
- Artificial Intelligence Applications and Innovations
- Book Subtitle
- IFIP TC12 WG12.5 - Second IFIP Conference on Artificial Intelligence Applications and Innovations (AIAI2005), September 7–9, 2005, Beijing, China
- pp 765-775
- Print ISBN
- Online ISBN
- Series Title
- IFIP — The International Federation for Information Processing
- Series Volume
- Series ISSN
- Springer US
- Copyright Holder
- International Federation for Information Processing
- Additional Links
- Industry Sectors
- eBook Packages
To view the rest of this content please follow the download PDF link above.