Machine Learning, Volume 8, Issue 3–4, pp 341–362

The convergence of TD(λ) for general λ

  • Peter Dayan

Abstract

The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case in which predictions draw only on information from adjacent time steps to the case in which they draw on information from arbitrarily distant ones.
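To make the object of the theorem concrete, below is a minimal sketch of tabular TD(λ) prediction with accumulating eligibility traces. It is an illustration rather than the paper's construction: the environment interface (env.reset, env.step), the number of states, and all constants (alpha, gamma, lam) are assumptions made for the example.

```python
import numpy as np

def td_lambda(env, n_states, episodes=500, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) prediction with accumulating eligibility traces.

    `env` is a hypothetical episodic environment: env.reset() returns a
    start-state index, and env.step(s) returns (next_state, reward, done).
    """
    V = np.zeros(n_states)           # value estimate for each state
    for _ in range(episodes):
        e = np.zeros(n_states)       # eligibility traces, cleared per episode
        s = env.reset()
        done = False
        while not done:
            s2, r, done = env.step(s)
            # One-step TD error: r + gamma * V(s') - V(s)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0              # mark the current state as eligible
            V += alpha * delta * e   # credit every recently visited state
            e *= gamma * lam         # decay traces; lambda sets the horizon
            s = s2
    return V
```

With lam=0 the trace vanishes after a single step and only the immediately preceding state is updated, which is the adjacent-time-step case of Sutton's (1988) theorem; larger λ blends information from arbitrarily distant steps, the general case the extended theorem covers.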

It also considers how this version of TD behaves in the face of linearly dependent representations for states, demonstrating that it still converges, but to a different answer from the one found by the least mean squares algorithm. Finally, it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
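For comparison, here is a similarly minimal sketch of one-step tabular Q-learning, the prediction-and-action method of Watkins (1989) whose probability-one convergence argument the paper adapts. The ε-greedy exploration rule and every constant below are assumptions made for the illustration; the convergence theorem itself imposes its own conditions on exploration and learning rates.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """One-step tabular Q-learning with epsilon-greedy exploration.

    `env` is a hypothetical episodic environment: env.reset() returns a
    start-state index, and env.step(s, a) returns (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))   # action-value estimates
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore uniformly with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(s, a)
            # Back up toward r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```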

Keywords

Reinforcement learning, temporal differences, asynchronous dynamic programming

References

  1. Albus, J.S. (1975). A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC). Transactions of the ASME: Journal of Dynamic Systems, Measurement and Control, 97, 220–227.
  2. Barto, A.G., Sutton, R.S. & Anderson, C.W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
  3. Barto, A.G., Sutton, R.S. & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks. Cambridge, MA: MIT Press, Bradford Books.
  4. Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation.
  5. Dayan, P. (1991). Reinforcing connectionism: Learning the statistical way. Ph.D. Thesis, University of Edinburgh, Scotland.
  6. Hampson, S.E. (1983). A neural model of adaptive behavior. Ph.D. Thesis, University of California, Irvine, CA.
  7. Hampson, S.E. (1990). Connectionistic problem solving: Computational aspects of biological learning. Boston, MA: Birkhäuser Boston.
  8. Holland, J.H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach, 2. Los Altos, CA: Morgan Kaufmann.
  9. Klopf, A.H. (1972). Brain function and adaptive systems: A heterostatic theory. Air Force Cambridge Research Laboratories Research Report AFCRL-72-0164. Bedford, MA.
  10. Klopf, A.H. (1982). The hedonistic neuron: A theory of memory, learning, and intelligence. Washington, DC: Hemisphere.
  11. Michie, D. & Chambers, R.A. (1968). BOXES: An experiment in adaptive control. Machine Intelligence, 2, 137–152.
  12. Moore, A.W. (1990). Efficient memory-based learning for robot control. Ph.D. Thesis, University of Cambridge Computer Laboratory, Cambridge, England.
  13. Omohundro, S. (1987). Efficient algorithms with neural network behaviour. Complex Systems, 1, 273–347.
  14. Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. Reprinted in E.A. Feigenbaum & J. Feldman (Eds.) (1963), Computers and thought. McGraw-Hill.
  15. Samuel, A.L. (1967). Some studies in machine learning using the game of checkers II: Recent progress. IBM Journal of Research and Development, 11, 601–617.
  16. Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Thesis, University of Massachusetts, Amherst, MA.
  17. Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
  18. Varga, R.S. (1962). Matrix iterative analysis. Englewood Cliffs, NJ: Prentice-Hall.
  19. Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, England.
  20. Werbos, P.J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3, 179–189.
  21. Widrow, B. & Stearns, S.D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice-Hall.
  22. Witten, I.H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34, 286–295.

Copyright information

© Kluwer Academic Publishers 1992

Authors and Affiliations

  • Peter Dayan
    1. Centre for Cognitive Science & Department of Physics, University of Edinburgh, Scotland
