# Linear Least-Squares Algorithms for Temporal Difference Learning


## Abstract

We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD have no control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance through an unlucky choice of parameters.
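The two algorithms described above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact formulation: it assumes the standard least-squares TD fixed point θ = A⁻¹b with A = Σ φₜ(φₜ − γφₜ₊₁)ᵀ and b = Σ rₜφₜ, and a Sherman-Morrison rank-one update for the recursive variant. The function names `lstd` and `rls_td`, the toy two-state chain, and the `delta` initialization constant are all illustrative assumptions.

```python
import numpy as np

def lstd(transitions, gamma):
    """Batch least-squares TD sketch: solve A @ theta = b, where
    A = sum phi (phi - gamma * phi')^T and b = sum r * phi."""
    k = len(transitions[0][0])
    A = np.zeros((k, k))
    b = np.zeros(k)
    for phi, r, phi_next in transitions:
        phi = np.asarray(phi, float)
        phi_next = np.asarray(phi_next, float)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

def rls_td(transitions, gamma, delta=0.01):
    """Recursive variant: maintain P ~ A^{-1} via a Sherman-Morrison
    rank-one update, avoiding the explicit solve. delta*I initializes A."""
    k = len(transitions[0][0])
    P = np.eye(k) / delta
    theta = np.zeros(k)
    for phi, r, phi_next in transitions:
        phi = np.asarray(phi, float)
        u = phi - gamma * np.asarray(phi_next, float)
        P_phi = P @ phi
        denom = 1.0 + u @ P_phi
        theta = theta + (P_phi / denom) * (r - u @ theta)
        P = P - np.outer(P_phi, u @ P) / denom
    return theta

# Toy check: a two-state cyclic chain with reward 1 per step and
# gamma = 0.5, so the true value of each state is 1 / (1 - 0.5) = 2.
e0, e1 = [1.0, 0.0], [0.0, 1.0]          # one-hot features
traj = [(e0, 1.0, e1), (e1, 1.0, e0)] * 10
print(lstd(traj, gamma=0.5))    # approximately [2, 2]
print(rls_td(traj, gamma=0.5))  # approximately [2, 2]
```

Note that neither function takes a learning-rate argument, which reflects the abstract's point: the least-squares formulation has no step-size parameter to tune, at the cost of O(k²) work per step (or an O(k³) solve in the batch version) instead of TD(λ)'s O(k).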

## References

- Anderson, C. W. (1988). Strategy learning with multilayer connectionist representations. Technical Report TR87-509.3, GTE Laboratories Incorporated, Computer and Intelligent Systems Laboratory, 40 Sylvan Road, Waltham, MA 02254.
- Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. *IEEE Transactions on Systems, Man, and Cybernetics*, 13: 835-846.
- Bradtke, S. J. (1994). *Incremental Dynamic Programming for On-Line Adaptive Optimal Control*. PhD thesis, University of Massachusetts, Computer Science Dept. Technical Report 94-62.
- Darken, C., Chang, J. & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. In *Neural Networks for Signal Processing 2 — Proceedings of the 1992 IEEE Workshop*, IEEE Press.
- Dayan, P. (1992). The convergence of TD(λ) for general λ. *Machine Learning*, 8: 341-362.
- Dayan, P. & Sejnowski, T. J. (1994). TD(λ): Convergence with probability 1. *Machine Learning*.
- Goodwin, G. C. & Sin, K. S. (1984). *Adaptive Filtering Prediction and Control*. Prentice-Hall, Englewood Cliffs, NJ.
- Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. *Neural Computation*, 6(6).
- Kemeny, J. G. & Snell, J. L. (1976). *Finite Markov Chains*. Springer-Verlag, New York.
- Ljung, L. & Söderström, T. (1983). *Theory and Practice of Recursive Identification*. MIT Press, Cambridge, MA.
- Lukes, G., Thompson, B. & Werbos, P. (1990). Expectation driven learning with an associative memory. In *Proceedings of the International Joint Conference on Neural Networks*, pages I: 521-524.
- Robbins, H. & Monro, S. (1951). A stochastic approximation method. *Annals of Mathematical Statistics*, 22: 400-407.
- Söderström, T. & Stoica, P. G. (1983). *Instrumental Variable Methods for System Identification*. Springer-Verlag, Berlin.
- Sutton, R. S. (1984). *Temporal Credit Assignment in Reinforcement Learning*. PhD thesis, Department of Computer and Information Science, University of Massachusetts at Amherst, Amherst, MA 01003.
- Sutton, R. S. (1988). Learning to predict by the method of temporal differences. *Machine Learning*, 3: 9-44.
- Tesauro, G. J. (1992). Practical issues in temporal difference learning. *Machine Learning*, 8(3/4): 257-277.
- Tsitsiklis, J. N. (1995). Asynchronous stochastic approximation and Q-learning. Technical Report LIDS-P-2172, Laboratory for Information and Decision Systems, MIT, Cambridge, MA.
- Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, Cambridge University, Cambridge, England.
- Watkins, C. J. C. H. & Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3/4): 279-292.
- Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. *IEEE Transactions on Systems, Man, and Cybernetics*, 17(1): 7-20.
- Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. *Neural Networks*, 1(4): 339-356.
- Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. *Neural Networks*, 3(2): 179-190.
- Werbos, P. J. (1992). Approximate dynamic programming for real time control and neural modeling. In D. A. White and D. A. Sofge, editors, *Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches*, pages 493-525. Van Nostrand Reinhold, New York.
- Young, P. (1984). *Recursive Estimation and Time-Series Analysis*. Springer-Verlag.