Definition
Temporal Difference Learning, also known as TD-Learning, is a method for computing the long-term utility of a pattern of behavior from a series of intermediate rewards (Sutton, 1984, 1988; Sutton & Barto, 1998). It uses differences between successive utility estimates as a feedback signal for learning. The temporal-difference approach to model-free reinforcement learning was introduced by, and is often associated with, R. S. Sutton. It has ties both to artificial intelligence and psychological theories of reinforcement learning and to the dynamic programming and operations research traditions of economics (Bellman, 1957; Samuel, 1959; Watkins, 1989; Puterman, 1994; Bertsekas & Tsitsiklis, 1996).
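In its simplest form, TD(0) (Sutton, 1988), this difference signal is the temporal-difference error. With step size \alpha and discount factor \gamma, the utility estimate V of state s_t is updated after observing reward r_{t+1} and successor state s_{t+1} as

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where the bracketed term is the temporal-difference error \delta_t.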
While TD learning can be formalised using the theory of Markov Decision Processes, in many cases it has been used more as a heuristic technique, and it has achieved impressive results even in situations where the formal theory does not strictly apply; e.g., Tesauro's TD-Gammon (Tesauro, 1995) achieved world-champion-level play at backgammon.
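As a concrete illustration, the following is a minimal sketch of tabular TD(0) policy evaluation in Python. The env.reset()/env.step() interface and the policy function are illustrative assumptions, not a fixed API:

from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate state utilities V(s) for a fixed policy using TD(0)."""
    V = defaultdict(float)  # utility estimates, initialised to zero
    for _ in range(episodes):
        state = env.reset()          # assumed interface: returns initial state
        done = False
        while not done:
            action = policy(state)   # assumed: maps a state to an action
            next_state, reward, done = env.step(action)  # assumed 3-tuple
            # TD error: the difference between successive utility estimates
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V

Driving the temporal-difference error toward zero propagates reward information backwards through successive states, without requiring a model of the environment's dynamics.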
References
Albus, J. S. (1981). Brains, behavior, and robotics. Peterborough: BYTE Books.
Auer, P., & Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).
Baird, L. C. (1995). Residual algorithms: reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (ICML95) (pp. 30–37). San Mateo: Morgan Kaufmann.
Baxter, J., Tridgell, A., & Weaver, L. (1998). KnightCap: a chess program that learns by combining TD(lambda) with game-tree search. In J. W. Shavlik (Ed.), Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98) (pp. 28–36). San Francisco: Morgan Kaufmann.
Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: safely approximating the value function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7). Cambridge: MIT Press.
Di Castro, D., & Meir, R. (2010). A convergent online single time scale actor critic algorithm. Journal of Machine Learning Research, 11, 367–410. http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html
Gordon, G. J. (1995). Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149. http://www.cs.duke.edu/~parr/jmlr03.pdf
Maei, H. R., et al. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems (NIPS) (pp. 1204–1212). http://books.nips.cc/papers/files/nips22/NIPS2009_1121.pdf
Mahadevan, S. (1996). Average reward reinforcement learning: foundations, algorithms, and empirical results. Machine Learning, 22, 159–195, doi: 10.1023/A:1018064306595.
Papavassiliou, V. A., & Russell, S. (1999). Convergence of reinforcement learning with general function approximators. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm.
Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. New York: Wiley.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599, doi: 10.1126/science.275.5306.1593.
Sutton, R., & Tanner, B. (2004). Temporal difference networks. In Advances in Neural Information Processing Systems (NIPS).
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts, Amherst.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44, doi: 10.1007/BF00115009.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: foundations of adaptive networks (pp. 497–537). Cambridge: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58–67.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.
Veness, J., et al. (2009). Bootstrapping from game tree search. In Advances in Neural Information Processing Systems (NIPS).
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge University Psychology Department, Cambridge.