Machine Learning, Volume 22, Issue 1–3, pp 33–57

Linear Least-Squares algorithms for temporal difference learning

  • Steven J. Bradtke
  • Andrew G. Barto

Abstract

We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, ω_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on ω_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
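
The abstract does not reproduce the update rules, so the following is a minimal sketch of the least-squares TD idea for a value function linear in the parameters, V(s) = φ(s)ᵀθ, written in Python. The batch system (A, b), the small ridge term, the discount factor γ, and the recursive inverse C are illustrative assumptions in the standard LS TD / RLS TD style, not the paper's exact notation.

```python
import numpy as np

def lstd(transitions, gamma=0.95, reg=1e-6):
    """Batch Least-Squares TD for a value function linear in the parameters.

    transitions: iterable of (phi, reward, phi_next) feature-vector triples,
    with phi_next the zero vector at an absorbing state. Solves A theta = b,
        A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T,
        b = sum_t r_{t+1} * phi_t.
    """
    transitions = list(transitions)
    k = len(transitions[0][0])
    A = reg * np.eye(k)              # small ridge term keeps A invertible early on
    b = np.zeros(k)
    for phi, r, phi_next in transitions:
        phi = np.asarray(phi, dtype=float)
        phi_next = np.asarray(phi_next, dtype=float)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)


class RLSTD:
    """Recursive LS TD: tracks a running inverse C ~ A^{-1} with the
    Sherman-Morrison identity, so each transition costs O(k^2)."""

    def __init__(self, k, gamma=0.95, init=1000.0):
        self.gamma = gamma
        self.C = init * np.eye(k)    # large initial C acts as a weak prior on A
        self.theta = np.zeros(k)

    def update(self, phi, r, phi_next):
        phi = np.asarray(phi, dtype=float)
        phi_next = np.asarray(phi_next, dtype=float)
        d = phi - self.gamma * phi_next          # feature-difference vector
        Cphi = self.C @ phi
        denom = 1.0 + d @ Cphi                   # Sherman-Morrison denominator
        delta = r - d @ self.theta               # TD error under current theta
        self.theta = self.theta + Cphi * (delta / denom)
        self.C = self.C - np.outer(Cphi, d @ self.C) / denom
        return self.theta
```

Because the parameters come from solving (or recursively tracking) a linear system rather than from stochastic-gradient steps with a step-size schedule, there is no learning-rate parameter to tune, which is the property the abstract emphasizes.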

Keywords

Reinforcement learning, Markov decision problems, temporal difference methods, least-squares

References

  1. Anderson, C. W. (1988). Strategy learning with multilayer connectionist representations. Technical Report 87-509.3, GTE Laboratories Incorporated, Computer and Intelligent Systems Laboratory, 40 Sylvan Road, Waltham, MA 02254.
  2. Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835–846.
  3. Bradtke, S. J. (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts, Computer Science Dept. Technical Report 94-62.
  4. Darken, C., Chang, J. & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2 — Proceedings of the 1992 IEEE Workshop. IEEE Press.
  5. Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.
  6. Dayan, P. & Sejnowski, T. J. (1994). TD(λ): Convergence with probability 1. Machine Learning.
  7. Goodwin, G. C. & Sin, K. S. (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
  8. Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).
  9. Kemeny, J. G. & Snell, J. L. (1976). Finite Markov Chains. Springer-Verlag, New York.
  10. Ljung, L. & Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
  11. Lukes, G., Thompson, B. & Werbos, P. (1990). Expectation driven learning with an associative memory. In Proceedings of the International Joint Conference on Neural Networks, pages 1:521–524.
  12. Robbins, H. & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.
  13. Söderström, T. & Stoica, P. G. (1983). Instrumental Variable Methods for System Identification. Springer-Verlag, Berlin.
  14. Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Department of Computer and Information Science, University of Massachusetts at Amherst, Amherst, MA 01003.
  15. Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
  16. Tesauro, G. J. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4):257–277.
  17. Tsitsiklis, J. N. (1993). Asynchronous stochastic approximation and Q-learning. Technical Report LIDS-P-2172, Laboratory for Information and Decision Systems, MIT, Cambridge, MA.
  18. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
  19. Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning, 8(3/4):279–292.
  20. Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17(1):7–20.
  21. Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356.
  22. Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3(2):179–190.
  23. Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 493–525. Van Nostrand Reinhold, New York.
  24. Young, P. (1984). Recursive Estimation and Time-series Analysis. Springer-Verlag.

Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  • Steven J. Bradtke (1)
  • Andrew G. Barto (2)
  1. GTE Data Services, One E Telecom Pkwy, Temple Terrace
  2. Dept. of Computer Science, University of Massachusetts, Amherst
