Abstract
We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton’s TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
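As a rough illustration of the two algorithms named in the abstract, the sketch below assumes the commonly stated form of least-squares TD: a matrix A = Σ φ_t (φ_t − γ φ_{t+1})ᵀ and a vector b = Σ φ_t r_t are accumulated, and the parameter vector θ solves Aθ = b. The recursive variant maintains an estimate of A⁻¹ with the Sherman-Morrison identity. The abstract itself gives no update equations, so the function and class names, the `ridge` and `init_scale` regularizers, and the three-state example chain are all assumptions made here for illustration.

```python
import numpy as np


def lstd(phis, rewards, next_phis, gamma=1.0, ridge=1e-6):
    """Batch LS TD sketch: solve A theta = b.

    A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T,  b = sum_t phi_t r_t.
    `ridge` is an assumption added here so A stays invertible on short runs.
    """
    phis = np.asarray(phis, dtype=float)
    next_phis = np.asarray(next_phis, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    A = phis.T @ (phis - gamma * next_phis) + ridge * np.eye(phis.shape[1])
    b = phis.T @ rewards
    return np.linalg.solve(A, b)


class RecursiveLSTD:
    """RLS TD sketch: keep C as a running estimate of A^{-1}."""

    def __init__(self, n_features, gamma=1.0, init_scale=1e6):
        self.gamma = gamma
        self.theta = np.zeros(n_features)
        # Starting from C = init_scale * I corresponds to A_0 = I / init_scale.
        self.C = init_scale * np.eye(n_features)

    def update(self, phi, reward, next_phi):
        phi = np.asarray(phi, dtype=float)
        d = phi - self.gamma * np.asarray(next_phi, dtype=float)
        C_phi = self.C @ phi
        denom = 1.0 + d @ C_phi
        # Residual of the new sample under the current parameters.
        residual = reward - d @ self.theta
        self.theta = self.theta + C_phi * residual / denom
        # Sherman-Morrison rank-one update of the inverse of A + phi d^T.
        self.C = self.C - np.outer(C_phi, d @ self.C) / denom
        return self.theta


# Hypothetical example: states 0 -> 1 -> 2 -> terminal, reward 1 per step,
# one-hot features, terminal state represented by the zero vector.
phis = np.eye(3)
next_phis = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
rewards = np.ones(3)

theta_batch = lstd(phis, rewards, next_phis)

rls = RecursiveLSTD(n_features=3)
for phi, r, nphi in zip(phis, rewards, next_phis):
    theta_rls = rls.update(phi, r, nphi)
# Both estimates should be close to the true values [3, 2, 1].
```

Neither routine takes a step-size parameter, which is consistent with the abstract's remark about the absence of a learning rate. The small regularizers are only there to keep the linear algebra well posed in this sketch.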
© 1996 Kluwer Academic Publishers
About this chapter
Cite this chapter
Bradtke, S.J., Barto, A.G. (1996). Linear Least-Squares Algorithms for Temporal Difference Learning. In: Kaelbling, L.P. (ed.) Recent Advances in Reinforcement Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-585-33656-5_4
DOI: https://doi.org/10.1007/978-0-585-33656-5_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-9705-2
Online ISBN: 978-0-585-33656-5
eBook Packages: Springer Book Archive