Linear Least-Squares Algorithms for Temporal Difference Learning
Cite this article as: Bradtke, S.J. & Barto, A.G. Machine Learning (1996) 22: 33. doi:10.1023/A:1018056104778
Abstract
We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD), for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time step than Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σTD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σTD. In addition to converging more rapidly, LS TD and RLS TD have no control parameters, such as a learning rate parameter, which eliminates the possibility of poor performance caused by an unlucky choice of parameters.
Keywords: reinforcement learning, Markov decision problems, temporal difference methods, least-squares
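The abstract does not state the update equations, so the following is only a minimal sketch of the two estimators in their commonly cited form, not the authors' implementation. It assumes a discount factor `gamma`, one-hot (tabular) features, and a deterministic three-state cycle chosen so the result can be checked exactly. The function names `lstd` and `rls_td`, the toy chain, and the constant `eps` used to initialise the inverse matrix in the recursive version are all hypothetical choices made for illustration.

```python
import numpy as np

def lstd(transitions, gamma, n_features):
    """Batch least-squares TD: accumulate A and b, then solve A theta = b."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for phi, reward, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * reward
    return np.linalg.solve(A, b)

def rls_td(transitions, gamma, n_features, eps=1e-6):
    """Recursive variant: maintain C ~ inverse(A) via a rank-one update.

    eps is an assumed initialisation constant (C starts as I / eps),
    not a quantity taken from the abstract.
    """
    theta = np.zeros(n_features)
    C = np.eye(n_features) / eps
    for phi, reward, phi_next in transitions:
        d = phi - gamma * phi_next          # feature difference
        C_phi = C @ phi
        denom = 1.0 + d @ C_phi
        # Correct theta by the current TD-style residual.
        theta = theta + C_phi * (reward - d @ theta) / denom
        # Sherman-Morrison update of the inverse matrix.
        C = C - np.outer(C_phi, d @ C) / denom
    return theta

# Toy Markov chain (hypothetical): deterministic cycle 0 -> 1 -> 2 -> 0.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
rewards = np.array([1.0, 0.0, 2.0])
features = np.eye(3)                        # one-hot feature per state

transitions = []
state = 0
for _ in range(30):
    next_state = (state + 1) % 3
    transitions.append((features[state], rewards[state], features[next_state]))
    state = next_state

# Exact values for comparison: V = (I - gamma * P)^{-1} r.
v_true = np.linalg.solve(np.eye(3) - gamma * P, rewards)
theta_batch = lstd(transitions, gamma, 3)
theta_rls = rls_td(transitions, gamma, 3)
```

On this toy chain both estimates can be compared against `v_true`. Neither function takes a step-size argument, which is the property the abstract contrasts with TD(λ). The recursive form trades the final linear solve for a rank-one update per transition.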
© Kluwer Academic Publishers 1996