Machine Learning, Volume 3, Issue 1, pp 9–44

Learning to predict by the methods of temporal differences

  • Richard S. Sutton


This article introduces a class of incremental learning procedures specialized for prediction; that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods, and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
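The abstract's central idea — updating each prediction toward the *next* prediction rather than waiting for the final outcome — can be illustrated with a minimal tabular TD(0) sketch on a bounded random walk. This is an illustrative toy, not code from the paper: the task size, step-size constant, and function name are assumptions chosen for clarity.

```python
import random

def td0_random_walk(n_states=5, episodes=2000, alpha=0.05, seed=0):
    """Tabular TD(0) prediction on a simple bounded random walk.

    States are 0..n_states-1. Stepping left off state 0 terminates with
    outcome 0; stepping right off the last state terminates with outcome 1.
    V[s] learns to predict the probability of the right-side outcome.
    """
    rng = random.Random(seed)
    V = [0.5] * n_states           # initial predictions
    for _ in range(episodes):
        s = n_states // 2          # start each walk in the middle
        while True:
            s2 = s + rng.choice([-1, 1])
            if s2 < 0:             # terminated left: actual outcome is 0
                V[s] += alpha * (0.0 - V[s])
                break
            if s2 >= n_states:     # terminated right: actual outcome is 1
                V[s] += alpha * (1.0 - V[s])
                break
            # The TD step: credit is assigned from the difference between
            # temporally successive predictions, V[s2] - V[s], rather than
            # from the eventual outcome alone.
            V[s] += alpha * (V[s2] - V[s])
            s = s2
    return V
```

For this walk the learned values should increase from left to right, approximating each state's probability of ending on the right. Note how the update inside the loop never touches the final outcome; only the terminal updates do, and that information propagates backward through successive predictions.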


Keywords: incremental learning, prediction, connectionism, credit assignment, evaluation functions


  1. Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
  2. Anderson, C. W. (1986). Learning and problem solving with multilayer connectionist systems. Doctoral dissertation, Department of Computer and Information Science, University of Massachusetts, Amherst.
  3. Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning (pp. 103–114). Irvine, CA: Morgan Kaufmann.
  4. Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4, 229–256.
  5. Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
  6. Booker, L. B. (1982). Intelligent behavior as an adaptation to the task environment. Doctoral dissertation, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor.
  7. Christensen, J. (1986). Learning static evaluation functions by linear regression. In T. M. Mitchell, J. G. Carbonell, & R. S. Michalski (Eds.), Machine learning: A guide to current research. Boston: Kluwer Academic.
  8. Christensen, J., & Korf, R. E. (1986). A unified theory of heuristic evaluation functions and its application to learning. Proceedings of the Fifth National Conference on Artificial Intelligence (pp. 148–152). Philadelphia, PA: Morgan Kaufmann.
  9. Denardo, E. V. (1982). Dynamic programming: Models and applications. Englewood Cliffs, NJ: Prentice-Hall.
  10. Dietterich, T. G., & Michalski, R. S. (1986). Learning to predict sequences. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.
  11. Gelperin, A., Hopfield, J. J., & Tank, D. W. (1985). The logic of Limax learning. In A. Selverston (Ed.), Model neural networks and behavior. New York: Plenum Press.
  12. Hampson, S. E. (1983). A neural model of adaptive behavior. Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine.
  13. Hampson, S. E., & Volper, D. J. (1987). Disjunctive models of Boolean category learning. Biological Cybernetics, 56, 121–137.
  14. Holland, J. H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.
  15. Kehoe, E. J., Schreurs, B. G., & Graham, P. (1987). Temporal primacy overrides prior training in serial compound conditioning of the rabbit's nictitating membrane response. Animal Learning and Behavior, 15, 455–464.
  16. Kemeny, J. G., & Snell, J. L. (1976). Finite Markov chains. New York: Springer-Verlag.
  17. Klopf, A. H. (1987). A neuronal model of classical conditioning (Technical Report 87–1139). Wright-Patterson Air Force Base, OH: Wright Aeronautical Laboratories.
  18. Moore, J. W., Desmond, J. E., Berthier, N. E., Blazis, D. E. J., Sutton, R. S., & Barto, A. G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing and interstimulus intervals. Behavioral Brain Research, 21, 143–154.
  19. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (Technical Report No. 8506). La Jolla: University of California, San Diego, Institute for Cognitive Science. Also in D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.
  20. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210–229. Reprinted in E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill.
  21. Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Doctoral dissertation, Department of Computer and Information Science, University of Massachusetts, Amherst.
  22. Sutton, R. S., & Barto, A. G. (1981a). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–171.
  23. Sutton, R. S., & Barto, A. G. (1981b). An adaptive network that constructs and uses an internal model of its environment. Cognition and Brain Theory, 4, 217–246.
  24. Sutton, R. S., & Barto, A. G. (1987). A temporal-difference model of classical conditioning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society (pp. 355–378). Seattle, WA: Lawrence Erlbaum.
  25. Sutton, R. S., & Pinette, B. (1985). The learning of world models by connectionist networks. Proceedings of the Seventh Annual Conference of the Cognitive Science Society (pp. 54–64). Irvine, CA: Lawrence Erlbaum.
  26. Varga, R. S. (1962). Matrix iterative analysis. Englewood Cliffs, NJ: Prentice-Hall.
  27. Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 WESCON Convention Record, Part IV (pp. 96–104).
  28. Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice-Hall.
  29. Williams, R. J. (1986). Reinforcement learning in connectionist networks: A mathematical analysis (Technical Report No. 8605). La Jolla: University of California, San Diego, Institute for Cognitive Science.
  30. Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34, 286–295.

Copyright information

© Kluwer Academic Publishers 1988

Authors and Affiliations

  • Richard S. Sutton
  1. GTE Laboratories Incorporated, Waltham, U.S.A.
