## Abstract

Two *minimal* requirements for a satisfactory multiagent learning algorithm are that it (1) learns to play optimally against stationary opponents and (2) converges to a Nash equilibrium in self-play. The previous algorithm that has come closest, WoLF-IGA, has been proven to have these two properties in 2-player 2-action (repeated) games, assuming that the opponent's mixed strategy is observable. Another algorithm, ReDVaLeR (which was introduced after the algorithm described in this paper), achieves the two properties in games with arbitrary numbers of actions and players, but still requires that the opponents' mixed strategies are observable. In this paper we present AWESOME, the first algorithm that is guaranteed to have the two properties in games with arbitrary numbers of actions and players. It is still the only algorithm that does so while only relying on observing the other players' actual actions (not their mixed strategies). It also learns to play optimally against opponents that *eventually become* stationary. The basic idea behind AWESOME (*Adapt When Everybody is Stationary, Otherwise Move to Equilibrium*) is to try to adapt to the others' strategies when they appear stationary, but otherwise to retreat to a precomputed equilibrium strategy. We provide experimental results that suggest that AWESOME converges fast in practice. The techniques used to prove the properties of AWESOME are fundamentally different from those used for previous algorithms, and may help in analyzing future multiagent learning algorithms as well.
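The basic idea stated above can be illustrated with a small sketch. This is not the AWESOME algorithm itself, whose exact tests and guarantees are not given in this abstract. It only shows the "adapt when stationary, otherwise fall back to equilibrium" decision for one player in a two-player repeated game. The function names, the two-epoch comparison, and the threshold `eps` are all assumptions made for illustration.

```python
from typing import List, Sequence


def empirical(counts: Sequence[int]) -> List[float]:
    """Turn observed action counts from one epoch into a distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("an epoch must contain at least one observation")
    return [c / total for c in counts]


def appears_stationary(prev_counts: Sequence[int],
                       curr_counts: Sequence[int],
                       eps: float) -> bool:
    """Hypothetical test: the opponent looks stationary if the empirical
    distributions of two consecutive epochs differ by less than eps
    in every action probability."""
    p, q = empirical(prev_counts), empirical(curr_counts)
    return max(abs(a - b) for a, b in zip(p, q)) < eps


def best_response(payoff: Sequence[Sequence[float]],
                  opp_dist: Sequence[float]) -> List[float]:
    """Pure best response (as a one-hot mixed strategy) to opp_dist.
    payoff[i][j] is our payoff for our action i against opponent action j."""
    expected = [sum(u * p for u, p in zip(row, opp_dist)) for row in payoff]
    best = max(range(len(expected)), key=expected.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(payoff))]


def next_strategy(payoff: Sequence[Sequence[float]],
                  equilibrium: Sequence[float],
                  prev_counts: Sequence[int],
                  curr_counts: Sequence[int],
                  eps: float = 0.1) -> List[float]:
    """Adapt if the opponent appears stationary; otherwise retreat to the
    precomputed equilibrium strategy supplied by the caller."""
    if appears_stationary(prev_counts, curr_counts, eps):
        return best_response(payoff, empirical(curr_counts))
    return list(equilibrium)


# Matching pennies from the row player's perspective (illustrative data).
PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
EQUILIBRIUM = [0.5, 0.5]

adapted = next_strategy(PENNIES, EQUILIBRIUM, [90, 10], [88, 12])
retreated = next_strategy(PENNIES, EQUILIBRIUM, [90, 10], [20, 80])
```

In this example the opponent plays its first action about 90% of the time in both epochs, so the sketch adapts and returns the pure best response. When the opponent's behavior shifts sharply between epochs, it returns the equilibrium strategy instead. A fixed `eps` and a fixed pair of epochs are simplifications chosen here for brevity.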

## Additional information

**Editors:** Amy Greenwald and Michael Littman

## About this article

### Cite this article

Conitzer, V., Sandholm, T. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents.
*Mach Learn* **67**, 23–43 (2007). https://doi.org/10.1007/s10994-006-0143-1

DOI: https://doi.org/10.1007/s10994-006-0143-1

### Keywords

- Game theory
- Learning in games
- Nash equilibrium