Linear Least-Squares Algorithms for Temporal Difference Learning

Bradtke, Steven J.; Barto, Andrew G.

doi:10.1007/978-0-585-33656-5_4

Abstract

We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton’s TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
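As a rough illustration of the recursive variant described above, the following sketch applies a Sherman-Morrison style update to a linear value estimate. The abstract does not state the update equations, so the form used here is the commonly cited one and should be read as an assumption, not as the authors' exact formulation. The names `rls_td_update`, `run_demo`, `theta`, `C`, `gamma`, and `delta` are all hypothetical, as is the two-state toy chain. The initial scale `delta` for the inverse-correlation estimate is likewise an assumption of this sketch.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rls_td_update(theta: Vector, C: Matrix, phi: Vector, reward: float,
                  phi_next: Vector, gamma: float) -> Tuple[Vector, Matrix]:
    """One recursive least-squares TD step (assumed standard form).

    theta: current weight vector of the linear value estimate
    C:     running estimate of the inverse of sum(phi * (phi - gamma*phi_next)^T)
    """
    n = len(theta)
    # d = phi - gamma * phi_next
    d = [p - gamma * q for p, q in zip(phi, phi_next)]
    # c_phi = C @ phi
    c_phi = [dot(row, phi) for row in C]
    # d_c = d^T @ C (dot product of d with each column of C)
    d_c = [dot(d, [row[j] for row in C]) for j in range(n)]
    denom = 1.0 + dot(d, c_phi)
    # TD error: r + gamma * V(next) - V(current) under the current weights
    td_error = reward - dot(d, theta)
    new_theta = [t + g * td_error / denom for t, g in zip(theta, c_phi)]
    # Sherman-Morrison rank-one correction of the inverse
    new_C = [[C[i][j] - c_phi[i] * d_c[j] / denom for j in range(n)]
             for i in range(n)]
    return new_theta, new_C


def run_demo(episodes: int = 50, gamma: float = 0.9,
             delta: float = 1e-3) -> Vector:
    """Toy chain (hypothetical): state 0 -> state 1 -> terminal.

    Rewards are 1 and 2; features are one-hot, terminal features are zero.
    """
    transitions = [
        ([1.0, 0.0], 1.0, [0.0, 1.0]),
        ([0.0, 1.0], 2.0, [0.0, 0.0]),
    ]
    n = 2
    theta = [0.0] * n
    # Start from C = (1/delta) * I, i.e. a lightly regularised inverse.
    C = [[(1.0 / delta) if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for _ in range(episodes):
        for phi, reward, phi_next in transitions:
            theta, C = rls_td_update(theta, C, phi, reward, phi_next, gamma)
    return theta


if __name__ == "__main__":
    print(run_demo())
```

On this toy chain the true discounted values are 2.8 for the first state and 2.0 for the second, and the estimates returned by `run_demo()` should land close to them. No step-size parameter appears in the update, in line with the abstract's remark about the absence of a learning rate.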

Author information

Authors and Affiliations

GTE Data Services, One E Telecom Pkwy, DC B2H, Temple Terrace, FL, 33637
Steven J. Bradtke
Dept. of Computer Science, University of Massachusetts, Amherst, MA, 01003-4610
Andrew G. Barto

Editor information

Editors and Affiliations

Brown University, USA
Leslie Pack Kaelbling

About this chapter

Cite this chapter

Bradtke, S.J., Barto, A.G. (1996). Linear Least-Squares Algorithms for Temporal Difference Learning. In: Kaelbling, L.P. (ed.) Recent Advances in Reinforcement Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-585-33656-5_4

DOI: https://doi.org/10.1007/978-0-585-33656-5_4
Received: 10 November 1994
Accepted: 04 October 1995
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-9705-2
Online ISBN: 978-0-585-33656-5
eBook Packages: Springer Book Archive
