Linear Least-Squares Algorithms for Temporal Difference Learning
Steven J. Bradtke, Andrew G. Barto
Abstract
We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σTD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σTD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
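The abstract's core idea — replacing stochastic TD(λ) updates with a direct least-squares solve — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-state chain, one-hot features, and discount factor are assumptions chosen so the solution can be checked by hand.

```python
# Sketch of the Least-Squares TD (LS TD) idea: accumulate the linear system
#   A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T,   b = sum_t r_t * phi_t
# over observed transitions, then solve A theta = b for the value-function
# weights theta, instead of taking stochastic gradient-like TD(lambda) steps.

def lstd(transitions, gamma, dim):
    """transitions: list of (phi, reward, phi_next) with plain-list features."""
    A = [[0.0] * dim for _ in range(dim)]
    b = [0.0] * dim
    for phi, r, phi_next in transitions:
        for i in range(dim):
            b[i] += r * phi[i]
            for j in range(dim):
                A[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve(A, b)

def solve(A, b):
    """Gaussian elimination with partial pivoting (fine for small systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):
        theta[i] = (M[i][n] - sum(M[i][j] * theta[j]
                                  for j in range(i + 1, n))) / M[i][i]
    return theta

# Illustrative two-state deterministic chain (an assumption for this sketch):
# s0 -(r=1)-> s1 -(r=0)-> s0, with gamma = 0.5 and one-hot features.
# The Bellman equations give V(s0) = 4/3, V(s1) = 2/3.
phi0, phi1 = [1.0, 0.0], [0.0, 1.0]
data = [(phi0, 1.0, phi1), (phi1, 0.0, phi0)]
theta = lstd(data, gamma=0.5, dim=2)
print(theta)  # approximately [4/3, 2/3]
```

With one-hot features and exact transition data, the least-squares solve recovers the true values in a single pass — the statistical efficiency the abstract contrasts with TD(λ)'s incremental updates. RLS TD, not shown here, maintains the same solution recursively as transitions arrive.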
 Anderson, C. W. (1988). Strategy learning with multilayer connectionist representations. Technical report, GTE Laboratories Incorporated, Computer and Intelligent Systems Laboratory, 40 Sylvan Road, Waltham, MA 02254.
 Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, pp. 835-846.
 Bradtke, S. J. (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts, Computer Science Dept. Technical Report 94-62.
 Darken, C., Chang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2 — Proceedings of the 1992 IEEE Workshop. IEEE Press.
 Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, pp. 341-362.
 Dayan, P., & Sejnowski, T. J. (1994). TD(λ): Convergence with probability 1. Machine Learning.
 Goodwin, G. C., & Sin, K. S. (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
 Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).
 Kemeny, J. G., & Snell, J. L. (1976). Finite Markov Chains. Springer-Verlag, New York.
 Ljung, L., & Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
 Lukes, G., Thompson, B., & Werbos, P. (1990). Expectation driven learning with an associative memory. Proceedings of the International Joint Conference on Neural Networks, 1, pp. 521-524.
 Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, pp. 400-407.
 Söderström, T., & Stoica, P. G. (1983). Instrumental Variable Methods for System Identification. Springer-Verlag, Berlin.
 Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Department of Computer and Information Science, University of Massachusetts at Amherst, Amherst, MA.
 Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, pp. 9-44.
 Tesauro, G. J. (1992). Practical issues in temporal difference learning. Machine Learning, 8, pp. 257-277.
 Tsitsiklis, J. N. (1995). Asynchronous stochastic approximation and Q-learning. Technical report, Laboratory for Information and Decision Systems, MIT, Cambridge, MA.
 Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
 Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, pp. 279-292.
 Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17, pp. 7-20.
 Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, pp. 339-356.
 Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3, pp. 179-190.
 Werbos, P. J. (1992). Approximate dynamic programming for real time control and neural modeling. In White, D. A., & Sofge, D. A. (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York, pp. 493-525.
 Young, P. (1984). Recursive Estimation and Time-series Analysis. Springer-Verlag.
 Title
 Linear Least-Squares Algorithms for Temporal Difference Learning
 Journal
 Machine Learning
 Volume 22, Issue 1-3, pp. 33-57
 Cover Date
 1996-01-01
 DOI
 10.1023/A:1018056104778
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers
 Keywords

 Reinforcement learning
 Markov Decision Problems
 Temporal Difference Methods
 Least-Squares