Discrete Event Dynamic Systems, Volume 13, Issue 1–2, pp 79–110

Least Squares Policy Evaluation Algorithms with Linear Function Approximation

  • A. Nedić
  • D. P. Bertsekas
Article

Abstract

We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, which is based on the λ-policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(λ) algorithm recently proposed by Boyan, which, for λ = 0, coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, the only available convergence result is that of Bradtke and Barto for LSTD(0). Here, we strengthen this result by showing the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
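Both methods can be illustrated with a short pure-Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation. Along one simulated trajectory i_0, i_1, ... it accumulates the eligibility vector z_t = αλ z_{t-1} + φ(i_t), the matrix C_t = Σ_m z_m (φ(i_m) − αφ(i_{m+1}))′, the vector b_t = Σ_m z_m g(i_m, i_{m+1}), and B_t = Σ_m φ(i_m)φ(i_m)′. LSTD(λ) solves C r = b once at the end. The gradient-like method is written here, as an assumption, in the closed form r_{t+1} = r_t + γ_t B_t^{-1}(b_t − C_t r_t) for its least-squares subproblem with a diminishing stepsize γ_t. The function names, the regularizer δ, the schedule γ_t = 1/(1 + t/τ), and the example chain are all hypothetical.

```python
import random


def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting (M is not modified)."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(col + 1, n):
            f = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def lstd_and_lspe(states, costs, phi, alpha, lam, delta=1e-3, tau=1000.0):
    """One pass over a trajectory i_0..i_T with costs[t] = g(i_t, i_{t+1}).

    Returns (r_lstd, r_lspe): the LSTD(lambda) solution of C r = b, and the final
    iterate of the least-squares/diminishing-stepsize update run during simulation.
    """
    k = len(phi(states[0]))
    z = [0.0] * k                        # eligibility vector z_t
    C = [[0.0] * k for _ in range(k)]    # sum_m z_m (phi(i_m) - alpha*phi(i_{m+1}))'
    b = [0.0] * k                        # sum_m z_m g(i_m, i_{m+1})
    # B = delta*I + sum_m phi(i_m) phi(i_m)'; delta is a hypothetical regularizer
    # that keeps B invertible before every feature direction has been visited.
    B = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]
    r = [0.0] * k                        # iterate of the gradient-like method, r_0 = 0
    for t in range(len(costs)):
        f, f_next = phi(states[t]), phi(states[t + 1])
        z = [alpha * lam * zi + fi for zi, fi in zip(z, f)]
        for i in range(k):
            b[i] += z[i] * costs[t]
            for j in range(k):
                C[i][j] += z[i] * (f[j] - alpha * f_next[j])
                B[i][j] += f[i] * f[j]
        # Step r <- r + gamma_t * B^{-1} (b - C r) with a diminishing gamma_t.
        resid = [b[i] - sum(C[i][j] * r[j] for j in range(k)) for i in range(k)]
        step = solve(B, resid)
        gamma = 1.0 / (1.0 + t / tau)    # hypothetical stepsize schedule
        r = [ri + gamma * si for ri, si in zip(r, step)]
    return solve(C, b), r                # LSTD(lambda) solves C r = b


def simulate(P, g, start, T, rng):
    """Sample T transitions of a Markov chain with transition matrix P and costs g[i][j]."""
    states, costs = [start], []
    for _ in range(T):
        i = states[-1]
        j = rng.choices(range(len(P)), weights=P[i])[0]
        costs.append(g[i][j])
        states.append(j)
    return states, costs


if __name__ == "__main__":
    # Hypothetical 4-state chain with two features per state: phi(i) = [1, i].
    P = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0],
         [0.0, 0.0, 0.5, 0.5], [0.5, 0.0, 0.0, 0.5]]
    g = [[1.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0],
         [0.0, 0.0, 1.0, 0.5], [2.0, 0.0, 0.0, 1.0]]
    states, costs = simulate(P, g, 0, 20000, random.Random(0))
    r_lstd, r_lspe = lstd_and_lspe(states, costs, lambda i: [1.0, float(i)], 0.9, 0.5)
    print("LSTD(lambda):", r_lstd)
    print("gradient-like iterate:", r_lspe)
```

As a sanity check, on a deterministic cycle with one indicator feature per state, both estimates should reproduce the exact discounted cost J = (I − αP)^{-1}g for any λ, because every temporal difference g(i, i′) + αJ(i′) − J(i) evaluated at J is then zero, so b − C J = 0.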

References

  1. Ash, R. B. 1972. Real Analysis and Probability. New York: Academic Press Inc.
  2. Bertsekas, D. P. 1995. A counterexample to temporal differences learning. Neural Computation 7: 270–279.
  3. Bertsekas, D. P., and Ioffe, S. 1996. Temporal differences-based policy iteration and application in neuro-dynamic programming. Lab. for Info. and Decision Systems Report LIDS-P-2349. Cambridge, MA: MIT.
  4. Bertsekas, D. P. 1999. Nonlinear Programming, 2nd edition. Belmont, MA: Athena Scientific.
  5. Bertsekas, D. P. 2001. Dynamic Programming and Optimal Control, 2nd edition. Belmont, MA: Athena Scientific.
  6. Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.
  7. Bertsekas, D. P., and Tsitsiklis, J. N. 2000. Gradient convergence in gradient methods with errors. SIAM J. Optim. 10: 627–642.
  8. Boyan, J. A. 2002. Technical update: least-squares temporal difference learning. To appear in Machine Learning, 49.
  9. Bradtke, S. J., and Barto, A. G. 1996. Linear least-squares algorithms for temporal difference learning. Machine Learning 22: 33–57.
  10. Dayan, P., and Sejnowski, T. J. 1994. TD(λ) converges with probability 1. Machine Learning 14: 295–301.
  11. Gallager, R. G. 1995. Discrete Stochastic Processes. Boston, MA: Kluwer Academic Publishers.
  12. Gurvits, L., Lin, L., and Hanson, S. J. 1994. Incremental Learning of Evaluation Functions for Absorbing Markov Chains: New Methods and Theorems. Working paper. Princeton, NJ: Siemens Corporate Research.
  13. Golub, G. H., and Van Loan, C. F. 1996. Matrix Computations, 3rd edition. Baltimore, MD: Johns Hopkins University Press.
  14. Jaakkola, T., Jordan, M. I., and Singh, S. P. 1994. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6: 1185–1201.
  15. Kemeny, J. G., and Snell, J. L. 1967. Finite Markov Chains. New York: Van Nostrand Company.
  16. Neveu, J. 1975. Discrete Parameter Martingales. Amsterdam: North-Holland.
  17. Parzen, E. 1962. Modern Probability Theory and Its Applications. New York: John Wiley Inc.
  18. Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley Inc.
  19. Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3: 9–44.
  20. Tadić, V. 2001. On the convergence of temporal-difference learning with linear function approximation. Machine Learning 42: 241–267.
  21. Tsitsiklis, J. N., and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42: 674–690.

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • A. Nedić (1)
  • D. P. Bertsekas (1)
  1. Department of Electrical Engineering and Computer Science, M.I.T., Cambridge, USA