Temporal Difference Learning

Definition

Temporal Difference Learning, also known as TD-Learning, is a method for computing the long-term utility of a pattern of behavior from a series of intermediate rewards (Sutton, 1984, 1988; Sutton & Barto, 1998). It uses differences between successive utility estimates as a feedback signal for learning. The Temporal Differencing approach to model-free reinforcement learning was introduced by, and is often associated with, R. S. Sutton. It has ties to both the artificial intelligence and psychological theories of reinforcement learning, as well as to dynamic programming and operations research in economics (Bellman, 1957; Samuel, 1959; Watkins, 1989; Puterman, 1994; Bertsekas & Tsitsiklis, 1996).
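
In its simplest, tabular form the method maintains an estimate V(s) of the utility of each state s and, after a transition from s to s′ with reward r, nudges V(s) towards the bootstrapped target r + γV(s′), i.e., V(s) ← V(s) + α[r + γV(s′) − V(s)]. The following is a minimal sketch of this TD(0) prediction rule in Python; the episodic environment interface (reset(), step(action)) and the fixed policy callable are illustrative assumptions, not part of any particular library.

def td0_value_estimates(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate state utilities V(s) under a fixed policy."""
    V = {}  # state -> current utility estimate; unseen states default to 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: immediate reward plus the discounted
            # current estimate of the successor state's utility.
            target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
            # TD error: the difference between successive utility estimates,
            # used as the feedback signal for learning.
            td_error = target - V.get(state, 0.0)
            V[state] = V.get(state, 0.0) + alpha * td_error
            state = next_state
    return V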

While TD learning can be formalised using the theory of Markov Decision Processes, in many cases it has been used more as a heuristic technique and has achieved impressive results even in situations where the formal theory does not strictly apply, e.g., Tesauro’s TD-Gammon (Tesauro, 1995) achieved world-champion-level play at backgammon.
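
In such large-scale applications the utility estimates are usually represented by a function approximator rather than a table. Below is a minimal sketch of TD(λ) prediction with eligibility traces and a linear approximator V(s) = w·φ(s); the trajectory format and the features(state) mapping are illustrative assumptions. Convergence guarantees in this setting are weaker than in the tabular case (Baird, 1995; Tsitsiklis & Van Roy, 1997).

import numpy as np

def td_lambda_linear(trajectories, features, n_features,
                     alpha=0.01, gamma=1.0, lam=0.7):
    """TD(lambda) prediction with a linear value function V(s) = w . phi(s).

    trajectories: list of episodes, each a list of
                  (state, reward, next_state, done) tuples from a fixed policy.
    features:     maps a state to a NumPy feature vector of length n_features.
    """
    w = np.zeros(n_features)                  # weights of the linear value function
    for trajectory in trajectories:
        z = np.zeros(n_features)              # eligibility trace
        for state, reward, next_state, done in trajectory:
            phi = features(state)
            v = float(w @ phi)
            v_next = 0.0 if done else float(w @ features(next_state))
            td_error = reward + gamma * v_next - v
            z = gamma * lam * z + phi         # decay and accumulate eligibility
            w = w + alpha * td_error * z      # credit recently visited states via the trace
    return w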

References

  • Albus, J. S. (1981). Brains, behavior, and robotics. Peterborough: BYTE, ISBN: 0070009759.

  • Auer, P., & Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. Neural Information Processing Systems (NIPS).

  • Baird, L. C. (1995). Residual algorithms: reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (ICML95) (pp. 30–37). San Mateo: Morgan Kaufmann.

  • Baxter, J., Tridgell, A., & Weaver, L. (1998). KnightCap: a chess program that learns by combining TD(λ) with game-tree search. In J. W. Shavlik (Ed.), Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98) (pp. 28–36). San Francisco: Morgan Kaufmann.

  • Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.

  • Bertsekas, D. P., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.

  • Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: safely approximating the value function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7). Cambridge: MIT Press.

  • Di Castro, D., & Meir, R. (2010). A convergent online single time scale actor critic algorithm. Journal of Machine Learning Research, 11, 367–410. http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html

  • Gordon, G. J. (1995). Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University.

  • Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149. http://www.cs.duke.edu/~parr/jmlr03.pdf

  • Maei, H. R. et al. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. Neural Information Processing Systems (NIPS), pp. 1204–1212. http://books.nips.cc/papers/files/nips22/NIPS2009_1121.pdf

  • Mahadevan, S. (1996). Average reward reinforcement learning: foundations, algorithms, and empirical results. Machine Learning, 22, 159–195, doi: 10.1023/A:1018064306595.

  • Papavassiliou, V. A., & Russell, S. (1999). Convergence of reinforcement learning with general function approximators. International Joint Conference on Artificial Intelligence, Stockholm.

  • Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. New York: Wiley.

  • Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, 3(3), 210–229.

  • Schultz, W., Dayan, P., & Read Montague, P. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599, doi: 10.1126/science.275.5306.1593.

  • Sutton, R., & Tanner, B. (2004). Temporal difference networks. Neural Information Processing Systems (NIPS).

  • Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts, Amherst.

  • Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44, doi: 10.1007/BF00115009.

  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.

  • Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: foundations of adaptive networks (pp. 497–537). Cambridge: MIT Press.

  • Tesauro, G. (1995). Temporal difference learning and TD-gammon. Communications of the ACM, 38(3), 58–67.

  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.

  • Veness, J., et al. (2009). Bootstrapping from game tree search. Neural Information Processing Systems (NIPS).

  • Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge University Psychology Department, Cambridge.


Copyright information

© 2011 Springer Science+Business Media, LLC

Cite this entry

Uther, W. (2011). Temporal Difference Learning. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_817
