Definition
Temporal Difference Learning, also known as TD-Learning, is a method for computing the long-term utility of a pattern of behavior from a series of intermediate rewards (Sutton, 1984, 1988; Sutton & Barto, 1998). It uses differences between successive utility estimates as a feedback signal for learning. The temporal-difference approach to model-free reinforcement learning was introduced by, and is often associated with, R. S. Sutton. It has ties both to artificial intelligence and psychological theories of reinforcement learning and to the dynamic programming and operations research traditions of economics (Bellman, 1957; Samuel, 1959; Watkins, 1989; Puterman, 1994; Bertsekas & Tsitsiklis, 1996).
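In its simplest form, TD(0) (Sutton, 1988), this difference signal is the temporal-difference error. With step size \alpha and discount factor \gamma, the utility estimate V of state s_t is updated after observing reward r_{t+1} and successor state s_{t+1} as

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where the bracketed term is the temporal-difference error \delta_t.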
While TD learning can be formalised using the theory of Markov Decision Processes, in many cases it has been used more as a heuristic technique, and it has achieved impressive results even in situations where the formal theory does not strictly apply; e.g., Tesauro's TD-Gammon (Tesauro, 1995) achieved world-champion-level play at backgammon.
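As a concrete illustration, the following is a minimal sketch of tabular TD(0) policy evaluation in Python. The env.reset()/env.step() interface and the policy function are illustrative assumptions, not a fixed API:

from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate state utilities V(s) for a fixed policy using TD(0)."""
    V = defaultdict(float)  # utility estimates, initialised to zero
    for _ in range(episodes):
        state = env.reset()          # assumed interface: returns initial state
        done = False
        while not done:
            action = policy(state)   # assumed: maps a state to an action
            next_state, reward, done = env.step(action)  # assumed 3-tuple
            # TD error: the difference between successive utility estimates
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V

Driving the temporal-difference error toward zero propagates reward information backwards through successive states, without requiring a model of the environment's dynamics.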
References
Albus, J. S. (1981). Brains, behavior, and robotics. Peterborough: BYTE Books.
Auer, P., & Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems (NIPS).
Baird, L. C. (1995). Residual algorithms: reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine Learning: Proceedings of the Twelfth International Conference (ICML95) (pp. 30–37). San Mateo: Morgan Kaufmann.
Baxter, J., Tridgell, A., & Weaver, L. (1998). KnightCap: a chess program that learns by combining TD(lambda) with game-tree search. In J. W. Shavlik (Ed.), Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98) (pp. 28–36). San Francisco: Morgan Kaufmann.
Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: safely approximating the value function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7). Cambridge: MIT Press.
Di Castro, D., & Meir, R. (2010). A convergent online single time scale actor critic algorithm. Journal of Machine Learning Research, 11, 367–410. http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html
Gordon, G. J. (1995). Stable function approximation in dynamic programming (Technical report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149. http://www.cs.duke.edu/~parr/jmlr03.pdf
Maei, H. R., et al. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems (NIPS) (pp. 1204–1212). http://books.nips.cc/papers/files/nips22/NIPS2009_1121.pdf
Mahadevan, S. (1996). Average reward reinforcement learning: foundations, algorithms, and empirical results. Machine Learning, 22, 159–195, doi: 10.1023/A:1018064306595.
Papavassiliou, V. A., & Russell, S. (1999). Convergence of reinforcement learning with general function approximators. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm.
Puterman, M. L. (1994). Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. New York: Wiley.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599, doi: 10.1126/science.275.5306.1593.
Sutton, R., & Tanner, B. (2004). Temporal difference networks. In Advances in Neural Information Processing Systems (NIPS).
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts, Amherst.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44, doi: 10.1007/BF00115009.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: foundations of adaptive networks (pp. 497–537). Cambridge: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58–67.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.
Veness, J., et al. (2009). Bootstrapping from game tree search. In Advances in Neural Information Processing Systems (NIPS).
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge University Psychology Department, Cambridge.