Machine Learning, Volume 49, Issue 1, pp. 5–37

The Lagging Anchor Algorithm: Reinforcement Learning in Two-Player Zero-Sum Games with Imperfect Information

  • Fredrik A. Dahl


Abstract

The article describes a gradient search based reinforcement learning algorithm for two-player zero-sum games with imperfect information. Simple gradient search may result in oscillation around solution points, a problem similar to the “Crawford puzzle”. To dampen these oscillations, the algorithm uses lagging anchors, drawing the strategy state of the players toward a weighted average of earlier strategy states. The algorithm is applicable to games represented in extensive form. We develop methods for sampling the parameter gradient of a player's performance against an opponent, using temporal-difference learning. The algorithm is applied successfully to a simplified poker game with infinite sets of pure strategies, and to the air combat game Campaign, using neural nets. We prove exponential convergence of the algorithm for a subset of matrix games.

Keywords: two-player zero-sum game, reinforcement learning, neural net, imperfect information
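The core idea of the lagging anchor — each player's strategy is pulled toward a trailing average of its own past strategies, damping the oscillation that plain gradient play exhibits — can be illustrated on matching pennies. This is a minimal sketch, not the paper's algorithm: here the payoff gradient is computed exactly (the paper estimates it by temporal-difference learning with neural nets), and the step sizes `alpha`, `beta`, `eta` are illustrative choices.

```python
# Lagging-anchor sketch on matching pennies, a 2x2 zero-sum matrix game
# whose unique equilibrium is the mixed strategy (1/2, 1/2).
# Payoff to player 1: u(p, q) = (2p - 1)(2q - 1), where p and q are the
# players' probabilities of playing "heads".

def clip01(x):
    """Keep a probability inside [0, 1]."""
    return min(1.0, max(0.0, x))

def run(steps, use_anchor, alpha=0.05, beta=0.02, eta=0.1):
    p, q = 0.8, 0.7          # initial mixed strategies
    pa, qa = p, q            # lagging anchors: averages of past strategies
    for _ in range(steps):
        gp = 2.0 * (2.0 * q - 1.0)   # du/dp: player 1 ascends
        gq = -2.0 * (2.0 * p - 1.0)  # -du/dq: player 2 descends
        dp, dq = alpha * gp, alpha * gq
        if use_anchor:
            dp += eta * (pa - p)     # pull each strategy toward its anchor
            dq += eta * (qa - q)
        p, q = clip01(p + dp), clip01(q + dq)
        pa += beta * (p - pa)        # anchors lag behind the strategies
        qa += beta * (q - qa)
    return abs(p - 0.5) + abs(q - 0.5)  # distance from equilibrium

plain = run(2000, use_anchor=False)
anchored = run(2000, use_anchor=True)
print(f"plain gradient: {plain:.3f}, lagging anchor: {anchored:.6f}")
```

Plain gradient play rotates around the equilibrium with growing amplitude and ends up circling the boundary, while the anchored version spirals into (1/2, 1/2), mirroring the damping effect the abstract describes.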


References

  1. Bakken, B. T., & Dahl, F. A. (1998). Experimental studies of neural net training and human learning in a military air campaign game. Proceedings of the Seventh Conference on Computer Generated Forces and Behavioral Representation, University of Central Florida, Orlando, Florida, Institute for Simulation and Training, pp. 263-274.
  2. Berkovitz, L. D. (1975). The tactical air game: A multimove game with mixed strategy solution. In J. D. Grote (Ed.), The theory and application of differential games (pp. 169-177).
  3. Busemann, H. (1958). Convex surfaces. Interscience tracts in pure and applied mathematics, No. 6. New York: Interscience Publishers.
  4. Conlisk, J. (1993a). Adaptation in games: Two solutions to the Crawford puzzle. Journal of Economic Behavior and Organization, 22, 25-50.
  5. Conlisk, J. (1993b). Adaptive tactics in games: Further solutions to the Crawford puzzle. Journal of Economic Behavior and Organization, 22, 51-68.
  6. Crawford, V. P. (1974). Learning the optimal strategy in a zero-sum game. Econometrica, 42, 885-891.
  7. Dahl, F. A., & Halck, O. M. (1998). Three games designed for the study of human and automated decision making. Definitions and properties of the games Campaign, Operation Lucid and Operation Opaque. FFI/Rapport-98/02799, Norwegian Defence Research Establishment (FFI), Kjeller, Norway.
  8. Dahl, F. A., & Halck, O. M. (2000). Minimax TD-learning with neural nets in a Markov game. In R. Lopez de Mantaras & E. Plaza (Eds.), ECML 2000: Proceedings of the 11th European Conference on Machine Learning, Lecture Notes in Computer Science (Vol. 1810). Berlin: Springer-Verlag.
  9. Dahl, F. A., Halck, O. M., & Braathen, S. (2000). Machine learning in the game of Campaign. FFI/Rapport-2000/04400, Norwegian Defence Research Establishment (FFI), Kjeller, Norway.
  10. Erev, I., & Roth, A. E. (1998). Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review, 88, 848-881.
  11. Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge: MIT Press.
  12. Halck, O. M., & Dahl, F. A. (1999). On classification of games and evaluation of players - with some sweeping generalizations about the literature. In J. Fürnkranz & M. Kubat (Eds.), Proceedings of the ICML-99 Workshop on Machine Learning in Game Playing. Ljubljana, Slovenia: Jozef Stefan Institute.
  13. Harmon, M. E., Baird, L. C., & Klopf, A. H. (1995). Reinforcement learning applied to a differential game. Adaptive Behavior (Vol. 4). Cambridge, MA: MIT Press.
  14. Hassoun, M. H. (1995). Fundamentals of artificial neural networks. Cambridge, MA: MIT Press.
  15. Koller, D., Megiddo, N., & von Stengel, B. (1996). Efficient solutions of extensive two-person games. Games and Economic Behavior, 14, 247-259.
  16. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (pp. 157-163). New Brunswick: Morgan Kaufmann.
  17. Luce, R. D., & Raiffa, H. (1957). Games and decisions. New York: Wiley.
  18. Luenberger, D. G. (1979). Introduction to dynamic systems: Theory, models, & applications. New York: Wiley.
  19. Luenberger, D. G. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
  20. Michie, D. (1966). Game-playing and game-learning automata. In L. Fox (Ed.), Advances in programming and non-numerical computation (pp. 183-200). New York: Pergamon.
  21. Padberg, M. (1995). Linear optimization and extensions. Berlin: Springer-Verlag.
  22. Papadimitriou, C. H. (1994). Computational complexity. Reading, MA: Addison-Wesley.
  23. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210-229.
  24. Schaeffer, J., Billings, D., Peña, L., & Szafron, D. (1999). Learning to play strong poker. In J. Fürnkranz & M. Kubat (Eds.), Proceedings of the ICML-99 Workshop on Machine Learning in Game Playing. Ljubljana, Slovenia: Jozef Stefan Institute.
  25. Selten, R. (1991). Anticipatory learning in two-person games. In R. Selten (Ed.), Game equilibrium models (Vol. I: Evolution and game dynamics). Berlin: Springer-Verlag.
  26. Strang, G. (1980). Linear algebra and its applications. London: Harcourt Brace.
  27. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
  28. Szepesvári, C., & Littman, M. L. (1999). A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation, 11, 2017-2060.
  29. Tesauro, G. J. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257-277.
  30. Tesauro, G. J., & Sejnowski, T. J. (1989). A parallel network that learns to play backgammon. Artificial Intelligence, 39, 357-390.
  31. von Neumann, J., & Morgenstern, O. (1953). Theory of games and economic behavior (3rd ed.). New York: Wiley.
  32. Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, Psychology Department, Cambridge University, Cambridge, UK.
  33. Weibull, J. (1995). Evolutionary game theory. Cambridge: MIT Press.

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Fredrik A. Dahl
  1. Norwegian Defence Research Establishment (FFI), Kjeller, Norway
