Machine Learning, Volume 14, Issue 3, pp 295–301

TD(λ) converges with probability 1

  • Peter Dayan
  • Terrence J. Sejnowski


The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future.

Sutton (1988) proved that for a special case of temporal differences the expected values of the predictions converge to their correct values as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
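The kind of prediction learning discussed in the abstract can be sketched concretely. Below is a minimal illustration of tabular TD(λ) with accumulating eligibility traces, run on Sutton's classic five-state random walk (a standard test problem from Sutton, 1988); the task, state count, step size, and trace-decay value here are illustrative assumptions, not details taken from this paper.

```python
import random

def td_lambda(episodes=2000, n=5, alpha=0.05, lam=0.8, gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating eligibility traces on an
    n-state random walk: states 1..n, absorbing boundaries at 0 and n+1,
    reward 1 only on exiting to the right.  Illustrative sketch only."""
    rng = random.Random(seed)
    V = [0.0] * (n + 2)              # V[0] and V[n+1] are terminal (value 0)
    for _ in range(episodes):
        e = [0.0] * (n + 2)          # eligibility traces, reset each episode
        s = (n + 1) // 2             # start in the middle state
        while 0 < s <= n:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s2 == n + 1 else 0.0
            delta = r + gamma * V[s2] - V[s]   # the TD error
            e[s] += 1.0                        # accumulate trace for current state
            for i in range(1, n + 1):
                V[i] += alpha * delta * e[i]   # update all traced states
                e[i] *= gamma * lam            # decay every trace
            s = s2
    return V[1:n + 1]
```

For this walk the true values are i/6 for state i, so the learned estimates should increase from left to right and sit near 0.5 in the middle state; the convergence-with-probability-one result proved in the paper concerns updates of this general stochastic-approximation form, under appropriately decreasing step sizes rather than the fixed `alpha` used in this sketch.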


Keywords: reinforcement learning, temporal differences, Q-learning


  1. Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximation. Berlin: Springer-Verlag.
  2. Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341–362.
  3. Geman, S., Bienenstock, E., & Doursat, R. (1991). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
  4. Kuan, C.M., & White, H. (1990). Recursive m-estimation, non-linear regression and neural network learning with dependent observations (discussion paper). Department of Economics, University of California at San Diego.
  5. Kuan, C.M., & White, H. (1991). Strong convergence of recursive m-estimators for models with dynamic latent variables (discussion paper 91-05). Department of Economics, University of California at San Diego.
  6. Kushner, H.J. (1984). Approximation and weak convergence methods for random processes, with applications to stochastic systems theory. Cambridge, MA: MIT Press.
  7. Kushner, H.J., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer-Verlag.
  8. Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
  9. Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
  10. Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210–229.
  11. Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, Department of Computer Science, University of Massachusetts, Amherst, MA.
  12. Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
  13. Sutton, R.S., & Barto, A.G. (1987). A temporal-difference model of classical conditioning. GTE Laboratories Report TR87-509-2. Waltham, MA.
  14. Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257–278.
  15. Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. thesis, King's College, University of Cambridge, England.
  16. Watkins, C.J.C.H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.

Copyright information

© Kluwer Academic Publishers 1994

Authors and Affiliations

  • Peter Dayan (1)
  • Terrence J. Sejnowski (1)
  1. CNL, The Salk Institute, San Diego
