Least Squares Policy Evaluation Algorithms with Linear Function Approximation
We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods that use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe (1996). The second method is the LSTD(λ) algorithm recently proposed by Boyan (2002), which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto (1996). To date, the only available convergence result is that of Bradtke and Barto (1996) for the LSTD(0) algorithm. Here, we strengthen this result by showing that LSTD(λ) converges with probability 1 for every λ ∈ [0, 1].
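To make the second method concrete, the following is a minimal sketch of the LSTD(λ) update in the form Boyan (2002) describes, applied to a single simulated trajectory under the fixed policy being evaluated. The function name `lstd_lambda`, the feature map `phi`, and the ridge term `reg` are illustrative choices, not from the paper; the regularization in particular is a numerical safeguard for short trajectories, not part of the algorithm or its convergence analysis.

```python
import numpy as np

def lstd_lambda(trajectory, phi, n_features, gamma=0.95, lam=0.5, reg=1e-6):
    """Estimate weights theta so that phi(s)^T theta approximates the
    discounted cost-to-go of the policy generating the trajectory.

    trajectory: sequence of (state, cost, next_state) transitions from one
        simulated run under the fixed policy being evaluated.
    phi: feature map taking a state to a length-n_features vector.
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    z = np.zeros(n_features)  # eligibility trace

    for s, cost, s_next in trajectory:
        f, f_next = phi(s), phi(s_next)
        z = gamma * lam * z + f                # z_t = (gamma*lam) z_{t-1} + phi(s_t)
        A += np.outer(z, f - gamma * f_next)   # accumulate the sample matrix A
        b += cost * z                          # accumulate the sample vector b

    # reg * I is a ridge term guarding against a singular A on short runs;
    # it is a numerical convenience only (an assumption of this sketch).
    return np.linalg.solve(A + reg * np.eye(n_features), b)
```

For λ = 0 the trace reduces to φ(s_t) alone, and the accumulated system coincides with the linear least-squares temporal-difference method of Bradtke and Barto (1996), consistent with the remark in the abstract.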
- Ash, R. B. 1972. Real Analysis and Probability. New York: Academic Press Inc.
- Bertsekas, D. P. 1995. A counterexample to temporal differences learning. Neural Computation 7: 270–279.
- Bertsekas, D. P., and Ioffe, S. 1996. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Lab. for Info. and Decision Systems Report LIDS-P-2349. Cambridge, MA: MIT.
- Bertsekas, D. P. 1999. Nonlinear Programming, 2nd edition. Belmont, MA: Athena Scientific.
- Bertsekas, D. P. 2001. Dynamic Programming and Optimal Control, 2nd edition. Belmont, MA: Athena Scientific.
- Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.
- Bertsekas, D. P., and Tsitsiklis, J. N. 2000. Gradient convergence in gradient methods with errors. SIAM J. Optim. 10: 627–642.
- Boyan, J. A. 2002. Technical update: least-squares temporal difference learning. Machine Learning 49: 233–246.
- Bradtke, S. J., and Barto, A. G. 1996. Linear least-squares algorithms for temporal difference learning. Machine Learning 22: 33–57.
- Dayan, P., and Sejnowski, T. J. 1994. TD(λ) converges with probability 1. Machine Learning 14: 295–301.
- Gallager, R. G. 1995. Discrete Stochastic Processes. Boston, MA: Kluwer Academic Publishers.
- Gurvits, L., Lin, L., and Hanson, S. J. 1994. Incremental Learning of Evaluation Functions for Absorbing Markov Chains: New Methods and Theorems. Working paper. Princeton, NJ: Siemens Corporate Research.
- Golub, G. H., and Van Loan, C. F. 1996. Matrix Computations, 3rd edition. Baltimore, MD: Johns Hopkins University Press.
- Jaakkola, T., Jordan, M. I., and Singh, S. P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6: 1185–1201.
- Kemeny, J. G., and Snell, J. L. 1967. Finite Markov Chains. New York: Van Nostrand Company.
- Neveu, J. 1975. Discrete Parameter Martingales. Amsterdam: North-Holland.
- Parzen, E. 1962. Modern Probability Theory and Its Applications. New York: John Wiley Inc.
- Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley Inc.
- Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3: 9–44.
- Tadić, V. 2001. On the convergence of temporal-difference learning with linear function approximation. Machine Learning 42: 241–267.
- Tsitsiklis, J. N., and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42: 674–690.
Discrete Event Dynamic Systems, Volume 13, Issue 1–2, pp. 79–110. Kluwer Academic Publishers.