# AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents

## Abstract

Two *minimal* requirements for a satisfactory multiagent learning algorithm are that (1) it learns to play optimally against stationary opponents, and (2) it converges to a Nash equilibrium in self-play. The previous algorithm that came closest, WoLF-IGA, has been proven to have these two properties in 2-player 2-action (repeated) games—assuming that the opponent's mixed strategy is observable. Another algorithm, ReDVaLeR (which was introduced after the algorithm described in this paper), achieves the two properties in games with arbitrary numbers of actions and players, but still requires that the opponents' mixed strategies are observable. In this paper we present AWESOME, the first algorithm that is guaranteed to have the two properties in games with arbitrary numbers of actions and players. It is still the only algorithm that does so while relying only on observing the other players' actual actions (not their mixed strategies). It also learns to play optimally against opponents that *eventually become* stationary. The basic idea behind AWESOME (*Adapt When Everybody is Stationary, Otherwise Move to Equilibrium*) is to try to adapt to the others' strategies when they appear stationary, but otherwise to retreat to a precomputed equilibrium strategy. We provide experimental results that suggest that AWESOME converges fast in practice. The techniques used to prove the properties of AWESOME are fundamentally different from those used for previous algorithms, and may help in analyzing future multiagent learning algorithms as well.
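The control idea described above can be sketched in a few lines. The following toy loop is illustrative only: the actual algorithm uses carefully chosen epoch lengths and *decreasing* stationarity thresholds to obtain its guarantees, whereas this sketch fixes both, and the function and parameter names (`awesome_sketch`, `epoch_len`, `eps`) are our own, not the paper's.

```python
import random

def play_mixed(strategy):
    """Sample an action index from a mixed strategy (list of probabilities)."""
    r, acc = random.random(), 0.0
    for a, p in enumerate(strategy):
        acc += p
        if r < acc:
            return a
    return len(strategy) - 1

def awesome_sketch(equilibrium, best_response, opponent, n_actions,
                   epochs=20, epoch_len=200, eps=0.05):
    """Toy loop in the spirit of AWESOME: play the precomputed equilibrium,
    watch the opponent's empirical play epoch by epoch, and switch to a
    best response once consecutive epochs look stationary.

    equilibrium   -- precomputed equilibrium strategy (list of probabilities)
    best_response -- maps an opponent empirical distribution to an action
    opponent      -- callable returning the opponent's action each round
    """
    prev, mode = None, "equilibrium"
    for _ in range(epochs):
        counts = [0] * n_actions
        for _ in range(epoch_len):
            if mode == "equilibrium":
                play_mixed(equilibrium)   # our action; payoffs omitted in this sketch
            else:
                best_response(prev)       # best-respond to last epoch's empirical play
            counts[opponent()] += 1       # observe only the opponent's actual action
        freq = [c / epoch_len for c in counts]
        if prev is not None and max(abs(f - g) for f, g in zip(freq, prev)) < eps:
            mode = "adapt"        # opponent appears stationary: best-respond
        else:
            mode = "equilibrium"  # appears non-stationary: retreat to equilibrium
        prev = freq
    return mode
```

Against an opponent that plays a fixed action, the loop settles into the "adapt" mode after the first stationarity check, matching the abstract's claim of eventually best-responding to (eventually) stationary opponents.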

## Keywords

Game theory · Learning in games · Nash equilibrium

## References

- Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-arm bandit problem. In *Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS)* (pp. 322–331).
- Aumann, R. (1974). Subjectivity and correlation in randomized strategies. *Journal of Mathematical Economics*, *1*, 67–96.
- Banerjee, B., & Peng, J. (2004). Performance bounded reinforcement learning in strategic interactions. In *Proceedings of the National Conference on Artificial Intelligence (AAAI)* (pp. 2–7). San Jose, CA, USA.
- Banerjee, B., Sen, S., & Peng, J. (2001). Fast concurrent reinforcement learners. In *Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI)* (pp. 825–830). Seattle, WA.
- Bowling, M. (2005). Convergence and no-regret in multiagent learning. In *Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS)* (pp. 209–216). Vancouver, Canada.
- Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. *Artificial Intelligence*, *136*, 215–250.
- Brafman, R., & Tennenholtz, M. (2000). A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. *Artificial Intelligence*, *121*, 31–47.
- Brafman, R., & Tennenholtz, M. (2003). R-max—a general polynomial time algorithm for near-optimal reinforcement learning. *Journal of Machine Learning Research*, *3*, 213–231.
- Brafman, R., & Tennenholtz, M. (2004). Efficient learning equilibrium. *Artificial Intelligence*, *159*, 27–47.
- Brafman, R., & Tennenholtz, M. (2005). Optimal efficient learning equilibrium: Imperfect monitoring in symmetric games. In *Proceedings of the National Conference on Artificial Intelligence (AAAI)* (pp. 726–731). Pittsburgh, PA, USA.
- Cahn, A. (2000). *General procedures leading to correlated equilibria*. Discussion paper 216, Center for Rationality, The Hebrew University of Jerusalem, Israel.
- Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In *Proceedings of the National Conference on Artificial Intelligence (AAAI)* (pp. 746–752). Madison, WI.
- Conitzer, V., & Sandholm, T. (2003a). BL-WoLF: A framework for loss-bounded learnability in zero-sum games. In *International Conference on Machine Learning (ICML)* (pp. 91–98). Washington, DC, USA.
- Conitzer, V., & Sandholm, T. (2003b). Complexity results about Nash equilibria. In *Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI)* (pp. 765–771). Acapulco, Mexico.
- Conitzer, V., & Sandholm, T. (2004). Communication complexity as a lower bound for learning in games. In *International Conference on Machine Learning (ICML)* (pp. 185–192). Banff, Alberta, Canada.
- Foster, D., & Vohra, R. (1997). Calibrated learning and correlated equilibrium. *Games and Economic Behavior*, *21*, 40–55.
- Foster, D. P., & Young, H. P. (2001). On the impossibility of predicting the behavior of rational agents. *Proceedings of the National Academy of Sciences*, *98*, 12848–12853.
- Freund, Y., & Schapire, R. (1999). Adaptive game playing using multiplicative weights. *Games and Economic Behavior*, *29*, 79–103.
- Fudenberg, D., & Levine, D. (1998). *The theory of learning in games*. MIT Press.
- Fudenberg, D., & Levine, D. (1999). Conditional universal consistency. *Games and Economic Behavior*, *29*, 104–130.
- Fudenberg, D., & Levine, D. K. (1995). Consistency and cautious fictitious play. *Journal of Economic Dynamics and Control*, *19*, 1065–1089.
- Gilboa, I., & Zemel, E. (1989). Nash and correlated equilibria: Some complexity considerations. *Games and Economic Behavior*, *1*, 80–93.
- Greenwald, A., & Hall, K. (2003). Correlated Q-learning. In *International Conference on Machine Learning (ICML)* (pp. 242–249). Washington, DC, USA.
- Greenwald, A., & Jafari, A. (2003). A general class of no-regret learning algorithms and game-theoretic equilibria. In *Conference on Learning Theory (COLT)*. Washington, DC.
- Hart, S., & Mas-Colell, A. (2000). A simple adaptive procedure leading to correlated equilibrium. *Econometrica*, *68*, 1127–1150.
- Hart, S., & Mas-Colell, A. (2003). Uncoupled dynamics do not lead to Nash equilibrium. *American Economic Review*, *93*, 1830–1836.
- Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In *International Conference on Machine Learning (ICML)* (pp. 242–250).
- Jafari, A., Greenwald, A., Gondek, D., & Ercal, G. (2001). On no-regret learning, fictitious play, and Nash equilibrium. In *International Conference on Machine Learning (ICML)* (pp. 226–233). Williams College, MA, USA.
- Kakade, S., & Foster, D. (2004). Deterministic calibration and Nash equilibrium. In *Conference on Learning Theory (COLT)*. Banff, Alberta, Canada.
- Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. *Econometrica*, *61*, 1019–1045.
- Lemke, C., & Howson, J. (1964). Equilibrium points of bimatrix games. *Journal of the Society of Industrial and Applied Mathematics*, *12*, 413–423.
- Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. *Information and Computation*, *108*, 212–261.
- Littman, M. (1994). Markov games as a framework for multi-agent reinforcement learning. In *International Conference on Machine Learning (ICML)* (pp. 157–163).
- Littman, M. (2001). Friend or foe Q-learning in general-sum Markov games. In *International Conference on Machine Learning (ICML)* (pp. 322–328).
- Littman, M., & Stone, P. (2003). A polynomial-time Nash equilibrium algorithm for repeated games. In *Proceedings of the ACM Conference on Electronic Commerce (ACM-EC)* (pp. 48–54). San Diego, CA.
- Littman, M., & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In *International Conference on Machine Learning (ICML)* (pp. 310–318).
- Miyasawa, K. (1961). *On the convergence of the learning process in a 2 × 2 nonzero sum two-person game*. Research memo 33, Princeton University.
- Nachbar, J. (1990). Evolutionary selection dynamics in games: Convergence and limit properties. *International Journal of Game Theory*, *19*, 59–89.
- Nachbar, J. (1997). Prediction, optimization, and learning in games. *Econometrica*, *65*, 275–309.
- Nachbar, J. (2001). Bayesian learning in repeated games of incomplete information. *Social Choice and Welfare*, *18*, 303–326.
- Nash, J. (1950). Equilibrium points in n-person games. *Proceedings of the National Academy of Sciences*, *36*, 48–49.
- Papadimitriou, C. (2001). Algorithms, games and the Internet. In *Proceedings of the Annual Symposium on Theory of Computing (STOC)* (pp. 749–753).
- Pivazyan, K., & Shoham, Y. (2002). Polynomial-time reinforcement learning of near-optimal policies. In *Proceedings of the National Conference on Artificial Intelligence (AAAI)*. Edmonton, Canada.
- Porter, R., Nudelman, E., & Shoham, Y. (2004). Simple search methods for finding a Nash equilibrium. In *Proceedings of the National Conference on Artificial Intelligence (AAAI)* (pp. 664–669). San Jose, CA, USA.
- Powers, R., & Shoham, Y. (2005a). Learning against opponents with bounded memory. In *Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI)*. Edinburgh, UK.
- Powers, R., & Shoham, Y. (2005b). New criteria and a new algorithm for learning in multi-agent systems. In *Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS)*. Vancouver, Canada.
- Robinson, J. (1951). An iterative method of solving a game. *Annals of Mathematics*, *54*, 296–301.
- Sandholm, T., & Crites, R. (1996). Multiagent reinforcement learning in the iterated prisoner's dilemma. *Biosystems*, *37*, 147–166. Special issue on the Prisoner's Dilemma.
- Sandholm, T., Gilpin, A., & Conitzer, V. (2005). Mixed-integer programming methods for finding Nash equilibria. In *Proceedings of the National Conference on Artificial Intelligence (AAAI)* (pp. 495–501). Pittsburgh, PA, USA.
- Sen, S., & Weiss, G. (1998). Learning in multiagent systems. In G. Weiss (Ed.), *Multiagent systems: A modern introduction to distributed artificial intelligence* (Chapter 6, pp. 259–298). MIT Press.
- Shapley, L. S. (1964). Some topics in two-person games. In M. Drescher, L. S. Shapley, & A. W. Tucker (Eds.), *Advances in game theory*. Princeton University Press.
- Simon, H. A. (1982). *Models of bounded rationality* (Vol. 2). MIT Press.
- Singh, S., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In *Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI)* (pp. 541–548). Stanford, CA.
- Stimpson, J., Goodrich, M., & Walters, L. (2001). Satisficing and learning cooperation in the prisoner's dilemma. In *Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI)* (pp. 535–540). Seattle, WA.
- Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In *International Conference on Machine Learning (ICML)* (pp. 330–337).
- Wang, X., & Sandholm, T. (2002). Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In *Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS)*. Vancouver, Canada.
- Wang, X., & Sandholm, T. (2003). Learning near-Pareto-optimal conventions in polynomial time. In *Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS)*. Vancouver, Canada.
- Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In *International Conference on Machine Learning (ICML)* (pp. 928–936). Washington, DC, USA.