
TD-Gammon: A Self-Teaching Backgammon Program

  • Gerald Tesauro
Chapter

Abstract

This chapter describes TD-Gammon, a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results. TD-Gammon uses a recently proposed reinforcement learning algorithm called TD(λ) (Sutton, 1988), and is apparently the first application of this algorithm to a complex, nontrivial task. Despite starting from random initial weights (and hence random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e., given only a “raw” description of the board state), the network learns to play the entire game at a strong intermediate level that surpasses not only conventional commercial programs, but also comparable networks trained via supervised learning on a large corpus of human expert games. The hidden units in the network have apparently discovered useful features, a longstanding goal of computer games research.

Furthermore, when a set of hand-crafted features is added to the network’s input representation, the result is a truly staggering level of performance: TD-Gammon is now estimated to play at a strong master level that is extremely close to the world’s best human players. We discuss possible principles underlying the success of TD-Gammon, and the prospects for successful real-world applications of TD learning in other domains.
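For readers who want to see the shape of the learning rule the abstract refers to, the sketch below shows a generic TD(λ) weight update applied once per self-play game: an eligibility trace accumulates the gradient of the network's output, and each temporal difference between successive position evaluations (or the final game outcome at the last step) nudges the weights along that trace. The value_net interface (weights, predict, gradient), the step size alpha, and the trace decay lam are illustrative assumptions for this sketch, not the author's actual implementation.

    import numpy as np

    def td_lambda_game(value_net, positions, outcome, alpha=0.1, lam=0.7):
        """Apply TD(lambda) updates for one self-play game (illustrative sketch).

        value_net -- assumed interface: .weights (dict of name -> ndarray),
                     .predict(x) -> scalar win estimate,
                     .gradient(x) -> dict of ndarrays matching .weights
        positions -- raw board encodings in the order they occurred
        outcome   -- final reward, e.g. 1.0 if the learning side won, else 0.0
        """
        # One eligibility trace per weight array, starting at zero.
        traces = {name: np.zeros_like(w) for name, w in value_net.weights.items()}

        for t in range(len(positions)):
            v_t = value_net.predict(positions[t])
            grad = value_net.gradient(positions[t])

            # Decay the old traces and fold in the current gradient.
            for name in traces:
                traces[name] = lam * traces[name] + grad[name]

            # Target: next prediction during the game, actual outcome at the end.
            if t + 1 < len(positions):
                target = value_net.predict(positions[t + 1])
            else:
                target = outcome

            td_error = target - v_t

            # Shift every weight along its eligibility trace.
            for name, w in value_net.weights.items():
                w += alpha * td_error * traces[name]

A full training run would repeat this over many self-play games, with the same network choosing moves by evaluating the positions reachable from each dice roll and picking the one it rates best.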

Keywords

Reinforcement Learning · Hidden Unit · Temporal Difference Learning · Training Game · Random Initial Weight


References

  1. H. Berliner, “Computer backgammon.” Scientific American 243:1, 64–72 (1980).
  2. D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs NJ: Prentice Hall (1987).
  3. J. Christensen and R. Korf, “A unified theory of heuristic evaluation functions and its application to learning.” Proc. of AAAI-86, 148–152 (1986).
  4. P. Dayan, “The convergence of TD(λ) for general λ.” Machine Learning 8, 341–362 (1992).
  5. P. W. Frey, “Algorithmic strategies for improving the performance of game playing programs.” In: D. Farmer et al. (Eds.), Evolution, Games and Learning. Amsterdam: North Holland (1986).
  6. A. K. Griffith, “A comparison and evaluation of three machine learning procedures as applied to the game of checkers.” Artificial Intelligence 5, 137–148 (1974).
  7. K. Hornik, M. Stinchcombe and H. White, “Multilayer feedforward networks are universal approximators.” Neural Networks 2, 359–366 (1989).
  8. K.-F. Lee and S. Mahajan, “A pattern classification approach to evaluation function learning.” Artificial Intelligence 36, 1–25 (1988).
  9. P. Magriel, Backgammon. New York: Times Books (1976).
  10. M. L. Minsky and S. A. Papert, Perceptrons. Cambridge MA: MIT Press (1969). (Republished as an expanded edition in 1988.)
  11. D. H. Mitchell, “Using features to evaluate positions in experts’ and novices’ Othello games.” Master’s Thesis, Northwestern Univ., Evanston IL (1984).
  12. J. R. Quinlan, “Learning efficient classification procedures and their application to chess end games.” In: R. S. Michalski, J. G. Carbonell and T. M. Mitchell (Eds.), Machine Learning. Palo Alto CA: Tioga (1983).
  13. B. Robertie, Advanced Backgammon. Arlington MA: Gammon Press (1991).
  14. B. Robertie, “Carbon versus silicon: matching wits with TD-Gammon.” Inside Backgammon 2:2, 14–22 (1992).
  15. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation.” In: D. Rumelhart and J. McClelland (Eds.), Parallel Distributed Processing, Vol. 1. Cambridge MA: MIT Press (1986).
  16. A. Samuel, “Some studies in machine learning using the game of checkers.” IBM J. of Research and Development 3, 210–229 (1959).
  17. A. Samuel, “Some studies in machine learning using the game of checkers, II — recent progress.” IBM J. of Research and Development 11, 601–617 (1967).
  18. R. S. Sutton, “Temporal credit assignment in reinforcement learning.” Ph.D. Thesis, Univ. of Massachusetts, Amherst MA (1984).
  19. R. S. Sutton, “Learning to predict by the methods of temporal differences.” Machine Learning 3, 9–44 (1988).
  20. G. Tesauro and T. J. Sejnowski, “A parallel network that learns to play backgammon.” Artificial Intelligence 39, 357–390 (1989).
  21. G. Tesauro, “Connectionist learning of expert preferences by comparison training.” In: D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1, 99–106. San Mateo CA: Morgan Kaufmann (1989).
  22. G. Tesauro, “Neurogammon: a neural network backgammon program.” IJCNN Proceedings III, 33–39 (1990).
  23. G. Tesauro, “Practical issues in temporal difference learning.” Machine Learning 8, 257–277 (1992).
  24. N. Zadeh and G. Kobliska, “On optimal doubling in backgammon.” Management Science 23, 853–858 (1977).

Copyright information

© Springer Science+Business Media New York 1995

Authors and Affiliations

  • Gerald Tesauro
  1. IBM Thomas J. Watson Research Center, Yorktown Heights, USA
