Abstract
The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case which only uses information from adjacent time steps to that involving information from arbitrary ones.
It also considers how this version of TD behaves in the face of linearly dependent representations for states, demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally, it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
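As an informal illustration of the kind of predictor discussed here, the following is a minimal sketch of linear TD(λ) with accumulating eligibility traces, applied to a small bounded random walk. It is not taken from the paper: the function names, the five-state task, the one-hot state representation, the online (per-step) weight update, and the values of the step size `alpha` and of `lam` are all assumptions chosen for illustration. With a constant step size the weights only hover near the ideal predictions, so this sketch does not demonstrate the probability-one convergence result.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def one_hot(state, n_states=5):
    """Hypothetical representation: one linearly independent vector per state."""
    return [1.0 if i == state else 0.0 for i in range(n_states)]


def random_walk_episode(rng, n_states=5):
    """Walk left/right with equal probability from the middle state.

    Returns the non-terminal states visited and the terminal outcome
    (1.0 for leaving on the right, 0.0 for leaving on the left).
    """
    state = n_states // 2
    visited = []
    while 0 <= state < n_states:
        visited.append(state)
        state += rng.choice((-1, 1))
    outcome = 1.0 if state >= n_states else 0.0
    return visited, outcome


def td_lambda(episodes, n_features, features, alpha, lam):
    """Undiscounted linear TD(lambda), updating the weights after every step."""
    w = [0.0] * n_features
    for visited, outcome in episodes:
        trace = [0.0] * n_features  # eligibility trace, reset each episode
        for t, state in enumerate(visited):
            x = features(state)
            # The target is the next prediction, or the outcome at the end.
            if t + 1 < len(visited):
                target = dot(w, features(visited[t + 1]))
            else:
                target = outcome
            delta = target - dot(w, x)  # temporal difference
            trace = [lam * e + xi for e, xi in zip(trace, x)]
            w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w


rng = random.Random(0)
episodes = [random_walk_episode(rng) for _ in range(20000)]
w = td_lambda(episodes, n_features=5, features=one_hot, alpha=0.005, lam=0.5)
print([round(wi, 2) for wi in w])
```

For this walk the ideal predictions are the probabilities of exiting on the right, (i + 1)/6 for state i, so the printed weights should lie close to those values. Setting `lam` to 0 uses only adjacent time steps, while larger values mix in information from later steps.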
References
Albus, J.S. (1975). A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC). Transactions of the ASME: Journal of Dynamical Systems, Measurement and Control, 97, 220–227.
Barto, A.G., Sutton, R.S. & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
Barto, A.G., Sutton, R.S. & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks. Cambridge, MA: MIT Press, Bradford Books.
Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation.
Dayan, P. (1991). Reinforcing connectionism: Learning the statistical way. Ph.D. Thesis, University of Edinburgh, Scotland.
Hampson, S.E. (1983). A neural model of adaptive behavior. Ph.D. Thesis, University of California, Irvine, CA.
Hampson, S.E. (1990). Connectionistic problem solving: computational aspects of biological learning. Boston, MA: Birkhäuser Boston.
Holland, J.H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach, 2. Los Altos, CA: Morgan Kaufmann.
Klopf, A.H. (1972). Brain function and adaptive systems–A heterostatic theory. Air Force Research Laboratories Research Report, AFCRL-72-0164. Bedford, MA.
Klopf, A.H. (1982). The hedonistic neuron: A theory of memory, learning, and intelligence. Washington, DC: Hemisphere.
Michie, D. & Chambers, R.A. (1968). BOXES: An experiment in adaptive control. Machine Intelligence, 2, 137–152.
Moore, A.W. (1990). Efficient memory-based learning for robot control. Ph.D. Thesis, University of Cambridge Computer Laboratory, Cambridge, England.
Omohundro, S. (1987). Efficient algorithms with neural network behaviour. Complex Systems, 1, 273–347.
Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. Reprinted in E.A. Feigenbaum & J. Feldman (Eds.) (1963). Computers and thought. McGraw-Hill.
Samuel, A.L. (1967). Some studies in machine learning using the game of checkers II: Recent progress. IBM Journal of Research and Development, 11, 601–617.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Thesis, University of Massachusetts, Amherst, MA.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Varga, R.S. (1962). Matrix iterative analysis. Englewood Cliffs, NJ: Prentice-Hall.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, England.
Werbos, P.J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3, 179–189.
Widrow, B. & Stearns, S.D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice-Hall.
Witten, I.H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34, 286–295.
Cite this article
Dayan, P. The convergence of TD(λ) for general λ. Mach Learn 8, 341–362 (1992). https://doi.org/10.1007/BF00992701
DOI: https://doi.org/10.1007/BF00992701