
Machine Learning, Volume 8, Issue 3–4, pp 257–277

Practical issues in temporal difference learning

  • Gerald Tesauro

Abstract

This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.
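The abstract refers to Sutton's TD(λ) algorithm, in which value estimates are updated from successive predictions using eligibility traces decayed by λ. As a rough illustration only (not Tesauro's backgammon network, which used a multilayer net trained by self-play), the following is a minimal tabular TD(λ) sketch on the kind of random-walk prediction task used in Sutton (1988); all names and parameter values here are illustrative assumptions.

```python
import random

def td_lambda(num_states=5, episodes=2000, alpha=0.1, lam=0.8, seed=0):
    """Tabular TD(lambda) on a simple random-walk chain:
    states 0..num_states-1, start in the middle, move left or right
    with equal probability; terminating on the right yields outcome 1,
    on the left outcome 0. Returns the learned state-value estimates."""
    rng = random.Random(seed)
    V = [0.5] * num_states          # value estimate per state
    for _ in range(episodes):
        e = [0.0] * num_states      # eligibility traces, reset each episode
        s = num_states // 2
        while True:
            e[s] += 1.0             # accumulating trace for the visited state
            s2 = s + (1 if rng.random() < 0.5 else -1)
            if s2 < 0:
                target = 0.0        # terminated on the left
            elif s2 >= num_states:
                target = 1.0        # terminated on the right
            else:
                target = V[s2]      # bootstrap from the next prediction
            delta = target - V[s]   # temporal-difference error (undiscounted)
            for i in range(num_states):
                V[i] += alpha * delta * e[i]
                e[i] *= lam         # decay all traces by lambda (gamma = 1)
            if s2 < 0 or s2 >= num_states:
                break
            s = s2
    return V
```

After training, the estimates should increase roughly monotonically from the left end of the chain to the right, approximating each state's probability of exiting on the right. Tesauro's point in the paper is that the same credit-assignment scheme, applied to a nonlinear network evaluating backgammon positions generated by self-play, scales far beyond such toy prediction tasks.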

Keywords

Temporal difference learning, neural networks, connectionist methods, backgammon, games, feature discovery

References

  1. Anderson, C.W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning (pp. 103–114).
  2. Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 835–846.
  3. Berliner, H. (1977). Experiences in evaluation with BKG—a program that plays backgammon. Proceedings of IJCAI (pp. 428–433).
  4. Berliner, H. (1979). On the construction of evaluation functions for large domains. Proceedings of IJCAI (pp. 53–55).
  5. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. JACM, 36, 929–965.
  6. Christensen, J. & Korf, R. (1986). A unified theory of heuristic evaluation functions and its application to learning. Proceedings of AAAI-86 (pp. 148–152).
  7. Dayan, P. (1992). The convergence of TD(λ). Machine Learning, 8, 341–362.
  8. Frey, P.W. (1986). Algorithmic strategies for improving the performance of game playing programs. In D. Farmer et al. (Eds.), Evolution, games and learning. Amsterdam: North-Holland.
  9. Griffith, A.K. (1974). A comparison and evaluation of three machine learning procedures as applied to the game of checkers. Artificial Intelligence, 5, 137–148.
  10. Holland, J.H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.
  11. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
  12. Lee, K.-F. & Mahajan, S. (1988). A pattern classification approach to evaluation function learning. Artificial Intelligence, 36, 1–25.
  13. Magriel, P. (1976). Backgammon. New York: Times Books.
  14. Minsky, M.L. & Papert, S.A. (1969). Perceptrons. Cambridge, MA: MIT Press. (Republished as an expanded edition in 1988.)
  15. Mitchell, D.H. (1984). Using features to evaluate positions in experts' and novices' Othello games. Master's thesis, Northwestern University, Evanston, IL.
  16. Quinlan, J.R. (1983). Learning efficient classification procedures and their application to chess end games. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine learning. Palo Alto, CA: Tioga.
  17. Robbins, H. & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
  18. Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
  19. Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210–229.
  20. Samuel, A. (1967). Some studies in machine learning using the game of checkers. II—Recent progress. IBM Journal of Research and Development, 11, 601–617.
  21. Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Doctoral dissertation, Dept. of Computer and Information Science, University of Massachusetts, Amherst.
  22. Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
  23. Tesauro, G. & Sejnowski, T.J. (1989). A parallel network that learns to play backgammon. Artificial Intelligence, 39, 357–390.
  24. Tesauro, G. (1989). Connectionist learning of expert preferences by comparison training. In D. Touretzky (Ed.), Advances in neural information processing systems, 1, 99–106.
  25. Tesauro, G. (1990). Neurogammon: A neural network backgammon program. IJCNN Proceedings, III, 33–39.
  26. Utgoff, P.E. & Clouse, J.A. (1991). Two kinds of training information for evaluation function training. To appear in Proceedings of AAAI-91.
  27. Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
  28. Widrow, B., et al. (1976). Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proceedings of the IEEE, 64, 1151–1162.
  29. Zadeh, N. & Kobliska, G. (1977). On optimal doubling in backgammon. Management Science, 23, 853–858.

Copyright information

© Kluwer Academic Publishers 1992

Authors and Affiliations

  • Gerald Tesauro
  1. IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
