Linear Least-Squares Algorithms for Temporal Difference Learning

Bradtke, Steven J.; Barto, Andrew G.

doi:10.1023/A:1018056104778

Published: January 1996

Machine Learning, volume 22, pages 33–57 (1996)

Abstract

We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σTD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σTD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
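
The two estimators named in the abstract can be illustrated with a minimal sketch. This is not taken from the article: it assumes the commonly used formulation in which LS TD solves A θ = b with A = Σ φ_t (φ_t − γ φ_{t+1})ᵀ and b = Σ r_t φ_t, and RLS TD maintains the inverse of A incrementally through a rank-one (Sherman-Morrison) update. The function names `lstd` and `rls_td`, the initialisation constant `delta`, and the three-state deterministic chain are all hypothetical choices made for illustration.

```python
import numpy as np

def lstd(transitions, n_features, gamma):
    """Batch LS TD sketch: accumulate A and b, then solve A theta = b."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

def rls_td(transitions, n_features, gamma, delta=1e-6):
    """Recursive sketch: track C ~ inverse(A + delta * I) without re-solving."""
    theta = np.zeros(n_features)
    C = np.eye(n_features) / delta  # assumed initialisation, not from the article
    for phi, r, phi_next in transitions:
        d = phi - gamma * phi_next
        C_phi = C @ phi
        denom = 1.0 + d @ C_phi
        # Correct theta by the one-step prediction error, then update C
        # with the Sherman-Morrison rank-one formula.
        theta = theta + C_phi * (r - d @ theta) / denom
        C = C - np.outer(C_phi, d @ C) / denom
    return theta

# Hypothetical test problem: deterministic cycle 0 -> 1 -> 2 -> 0,
# one-hot state features, reward received on leaving each state.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
rewards = np.array([1.0, 0.0, 2.0])
features = np.eye(3)

transitions = []
state = 0
for _ in range(30):
    next_state = (state + 1) % 3
    transitions.append((features[state], rewards[state], features[next_state]))
    state = next_state

v_true = np.linalg.solve(np.eye(3) - gamma * P, rewards)  # exact values
theta_ls = lstd(transitions, 3, gamma)
theta_rls = rls_td(transitions, 3, gamma)
```

Neither function takes a step-size argument, which mirrors the abstract's point that these methods have no learning rate to tune. The small `delta` in the recursive sketch only makes the initial matrix invertible; under the assumptions above its effect on the estimate shrinks as more transitions are processed.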

References

Anderson, C. W. (1988). Strategy learning with multilayer connectionist representations. Technical Report 87-509.3, GTE Laboratories Incorporated, Computer and Intelligent Systems Laboratory, 40 Sylvan Road, Waltham, MA 02254.
Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13: 835-846.
Bradtke, S. J. (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts, Computer Science Dept. Technical Report 94-62.
Darken, C., Chang, J. & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2: Proceedings of the 1992 IEEE Workshop. IEEE Press.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8: 341-362.
Dayan, P. & Sejnowski, T. J. (1994). TD(λ): Convergence with probability 1. Machine Learning.
Goodwin, G. C. & Sin, K. S. (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).
Kemeny, J. G. & Snell, J. L. (1976). Finite Markov Chains. Springer-Verlag, New York.
Ljung, L. & Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
Lukes, G., Thompson, B. & Werbos, P. (1990). Expectation driven learning with an associative memory. In Proceedings of the International Joint Conference on Neural Networks, pages 1: 521-524.
Robbins, H. & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22: 400-407.
Söderström, T. & Stoica, P. G. (1983). Instrumental Variable Methods for System Identification. Springer-Verlag, Berlin.
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Department of Computer and Information Science, University of Massachusetts at Amherst, Amherst, MA 01003.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3: 9-44.
Tesauro, G. J. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4): 257-277.
Tsitsiklis, J. N. (1995). Asynchronous stochastic approximation and Q-learning. Technical Report LIDS-P-2172, Laboratory for Information and Decision Systems, MIT, Cambridge, MA.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. Machine Learning, 8(3/4): 279-292.
Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17(1): 7-20.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4): 339-356.
Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3(2): 179-190.
Werbos, P. J. (1992). Approximate dynamic programming for real time control and neural modeling. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 493-525. Van Nostrand Reinhold, New York.
Young, P. (1984). Recursive Estimation and Time-Series Analysis. Springer-Verlag.

About this article

Cite this article

Bradtke, S.J., Barto, A.G. Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning 22, 33–57 (1996). https://doi.org/10.1023/A:1018056104778

Issue Date: January 1996
DOI: https://doi.org/10.1023/A:1018056104778
