Machine Learning, Volume 67, Issue 1–2, pp. 23–43

AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents

Abstract

Two minimal requirements for a satisfactory multiagent learning algorithm are that it (1) learns to play optimally against stationary opponents and (2) converges to a Nash equilibrium in self-play. The previous algorithm that has come closest, WoLF-IGA, has been proven to have these two properties in 2-player 2-action (repeated) games, assuming that the opponent's mixed strategy is observable. Another algorithm, ReDVaLeR (which was introduced after the algorithm described in this paper), achieves the two properties in games with arbitrary numbers of actions and players, but still requires that the opponents' mixed strategies are observable. In this paper we present AWESOME, the first algorithm that is guaranteed to have the two properties in games with arbitrary numbers of actions and players. It is still the only algorithm that does so while relying only on observing the other players' actual actions (not their mixed strategies). It also learns to play optimally against opponents that eventually become stationary. The basic idea behind AWESOME (Adapt When Everybody is Stationary, Otherwise Move to Equilibrium) is to try to adapt to the other players' strategies when they appear stationary, but otherwise to retreat to a precomputed equilibrium strategy. We provide experimental results suggesting that AWESOME converges quickly in practice. The techniques used to prove the properties of AWESOME are fundamentally different from those used for previous algorithms, and may help in analyzing future multiagent learning algorithms as well.
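To make the high-level idea concrete, below is a minimal, illustrative Python sketch of such a control loop for a two-player repeated game. It is not the procedure analyzed in the paper: AWESOME proper runs in epochs with a carefully chosen schedule of thresholds and sample sizes on which the convergence proof depends, whereas this sketch uses a single fixed window and threshold. All identifiers here (awesome_sketch, best_response, opponent_policy, window, threshold) are illustrative assumptions, not names from the paper.

```python
# Illustrative sketch only: AWESOME proper runs in epochs with a decreasing
# schedule of thresholds and growing sample sizes; here a fixed window and
# threshold stand in for that schedule.

import random


def best_response(payoff, my_actions, opp_dist):
    """Pure action maximizing expected payoff against a distribution over opponent actions."""
    return max(my_actions,
               key=lambda a: sum(p * payoff[(a, b)] for b, p in opp_dist.items()))


def awesome_sketch(payoff, my_actions, opp_actions, my_eq, opp_eq,
                   opponent_policy, rounds=2000, window=100, threshold=0.2):
    """Each round: if the opponent's recent play is consistent with its
    precomputed equilibrium strategy, keep playing our own equilibrium
    strategy; if it instead looks stationary, best-respond to the empirical
    distribution; otherwise retreat to the equilibrium strategy."""
    history = []            # observed opponent actions
    current = dict(my_eq)   # mixed strategy we are currently playing
    total = 0.0             # cumulative payoff (just for the demo printout)
    for t in range(rounds):
        my_action = random.choices(my_actions,
                                   weights=[current[a] for a in my_actions])[0]
        opp_action = opponent_policy(t)
        history.append(opp_action)
        total += payoff[(my_action, opp_action)]

        if len(history) < window:
            continue  # not enough observations yet; keep playing equilibrium

        recent = history[-window:]
        emp = {b: recent.count(b) / window for b in opp_actions}

        # Hypothesis 1: the opponent is playing its equilibrium strategy.
        eq_dev = 0.5 * sum(abs(emp[b] - opp_eq[b]) for b in opp_actions)
        if eq_dev <= threshold:
            current = dict(my_eq)
            continue

        # Hypothesis 2: the opponent is stationary (recent window matches
        # the preceding one in total-variation distance).
        prev = history[-2 * window:-window] or recent
        prev_emp = {b: prev.count(b) / len(prev) for b in opp_actions}
        drift = 0.5 * sum(abs(emp[b] - prev_emp[b]) for b in opp_actions)
        if drift <= threshold:
            br = best_response(payoff, my_actions, emp)
            current = {a: (1.0 if a == br else 0.0) for a in my_actions}
        else:
            current = dict(my_eq)  # neither hypothesis holds: back to equilibrium
    return current, total


# Toy example: matching pennies (we win on a match) against an opponent that
# randomizes for 500 rounds and then fixes on 'H'.
if __name__ == "__main__":
    actions = ['H', 'T']
    payoff = {('H', 'H'): 1, ('H', 'T'): -1, ('T', 'H'): -1, ('T', 'T'): 1}
    eq = {'H': 0.5, 'T': 0.5}   # equilibrium strategy for both players
    opponent = lambda t: 'H' if t > 500 else random.choice(actions)
    print(awesome_sketch(payoff, actions, actions, eq, eq, opponent))
```

In this toy run the sketch plays the mixed equilibrium while the opponent randomizes, and switches to the pure best response once the opponent fixes on a single action, mirroring the "adapt when everybody is stationary, otherwise move to equilibrium" idea described above.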

Keywords

Game theory · Learning in games · Nash equilibrium

References

  1. Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-arm bandit problem. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS) (pp. 322–331).
  2. Aumann, R. (1974). Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1, 67–96.
  3. Banerjee, B., & Peng, J. (2004). Performance bounded reinforcement learning in strategic interactions. In Proceedings of the National Conference on Artificial Intelligence (AAAI) (pp. 2–7). San Jose, CA, USA.
  4. Banerjee, B., Sen, S., & Peng, J. (2001). Fast concurrent reinforcement learners. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI) (pp. 825–830). Seattle, WA.
  5. Bowling, M. (2005). Convergence and no-regret in multiagent learning. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS) (pp. 209–216). Vancouver, Canada.
  6. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, 215–250.
  7. Brafman, R., & Tennenholtz, M. (2000). A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. Artificial Intelligence, 121, 31–47.
  8. Brafman, R., & Tennenholtz, M. (2003). R-max—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.
  9. Brafman, R., & Tennenholtz, M. (2004). Efficient learning equilibrium. Artificial Intelligence, 159, 27–47.
  10. Brafman, R., & Tennenholtz, M. (2005). Optimal efficient learning equilibrium: Imperfect monitoring in symmetric games. In Proceedings of the National Conference on Artificial Intelligence (AAAI) (pp. 726–731). Pittsburgh, PA, USA.
  11. Cahn, A. (2000). General procedures leading to correlated equilibria. Discussion paper 216, Center for Rationality, The Hebrew University of Jerusalem, Israel.
  12. Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the National Conference on Artificial Intelligence (AAAI) (pp. 746–752). Madison, WI.
  13. Conitzer, V., & Sandholm, T. (2003a). BL-WoLF: A framework for loss-bounded learnability in zero-sum games. In International Conference on Machine Learning (ICML) (pp. 91–98). Washington, DC, USA.
  14. Conitzer, V., & Sandholm, T. (2003b). Complexity results about Nash equilibria. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI) (pp. 765–771). Acapulco, Mexico.
  15. Conitzer, V., & Sandholm, T. (2004). Communication complexity as a lower bound for learning in games. In International Conference on Machine Learning (ICML) (pp. 185–192). Banff, Alberta, Canada.
  16. Foster, D., & Vohra, R. (1997). Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21, 40–55.
  17. Foster, D. P., & Young, H. P. (2001). On the impossibility of predicting the behavior of rational agents. Proceedings of the National Academy of Sciences, 98, 12848–12853.
  18. Freund, Y., & Schapire, R. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29, 79–103.
  19. Fudenberg, D., & Levine, D. (1998). The theory of learning in games. MIT Press.
  20. Fudenberg, D., & Levine, D. (1999). Conditional universal consistency. Games and Economic Behavior, 29, 104–130.
  21. Fudenberg, D., & Levine, D. K. (1995). Consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19, 1065–1089.
  22. Gilboa, I., & Zemel, E. (1989). Nash and correlated equilibria: Some complexity considerations. Games and Economic Behavior, 1, 80–93.
  23. Greenwald, A., & Hall, K. (2003). Correlated Q-learning. In International Conference on Machine Learning (ICML) (pp. 242–249). Washington, DC, USA.
  24. Greenwald, A., & Jafari, A. (2003). A general class of no-regret learning algorithms and game-theoretic equilibria. In Conference on Learning Theory (COLT). Washington, DC.
  25. Hart, S., & Mas-Colell, A. (2000). A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68, 1127–1150.
  26. Hart, S., & Mas-Colell, A. (2003). Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93, 1830–1836.
  27. Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In International Conference on Machine Learning (ICML) (pp. 242–250).
  28. Jafari, A., Greenwald, A., Gondek, D., & Ercal, G. (2001). On no-regret learning, fictitious play, and Nash equilibrium. In International Conference on Machine Learning (ICML) (pp. 226–233). Williams College, MA, USA.
  29. Kakade, S., & Foster, D. (2004). Deterministic calibration and Nash equilibrium. In Conference on Learning Theory (COLT). Banff, Alberta, Canada.
  30. Kalai, E., & Lehrer, E. (1993). Rational learning leads to Nash equilibrium. Econometrica, 61, 1019–1045.
  31. Lemke, C., & Howson, J. (1964). Equilibrium points of bimatrix games. Journal of the Society of Industrial and Applied Mathematics, 12, 413–423.
  32. Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
  33. Littman, M. (1994). Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning (ICML) (pp. 157–163).
  34. Littman, M. (2001). Friend or foe Q-learning in general-sum Markov games. In International Conference on Machine Learning (ICML) (pp. 322–328).
  35. Littman, M., & Stone, P. (2003). A polynomial-time Nash equilibrium algorithm for repeated games. In Proceedings of the ACM Conference on Electronic Commerce (ACM-EC) (pp. 48–54). San Diego, CA.
  36. Littman, M., & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning (ICML) (pp. 310–318).
  37. Miyasawa, K. (1961). On the convergence of the learning process in a 2 × 2 nonzero sum two-person game. Research memo 33, Princeton University.
  38. Nachbar, J. (1990). Evolutionary selection dynamics in games: Convergence and limit properties. International Journal of Game Theory, 19, 59–89.
  39. Nachbar, J. (1997). Prediction, optimization, and learning in games. Econometrica, 65, 275–309.
  40. Nachbar, J. (2001). Bayesian learning in repeated games of incomplete information. Social Choice and Welfare, 18, 303–326.
  41. Nash, J. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36, 48–49.
  42. Papadimitriou, C. (2001). Algorithms, games and the Internet. In Proceedings of the Annual Symposium on Theory of Computing (STOC) (pp. 749–753).
  43. Pivazyan, K., & Shoham, Y. (2002). Polynomial-time reinforcement learning of near-optimal policies. In Proceedings of the National Conference on Artificial Intelligence (AAAI). Edmonton, Canada.
  44. Porter, R., Nudelman, E., & Shoham, Y. (2004). Simple search methods for finding a Nash equilibrium. In Proceedings of the National Conference on Artificial Intelligence (AAAI) (pp. 664–669). San Jose, CA, USA.
  45. Powers, R., & Shoham, Y. (2005a). Learning against opponents with bounded memory. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI). Edinburgh, UK.
  46. Powers, R., & Shoham, Y. (2005b). New criteria and a new algorithm for learning in multi-agent systems. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.
  47. Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54, 296–301.
  48. Sandholm, T., & Crites, R. (1996). Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37, 147–166. Special issue on the Prisoner's Dilemma.
  49. Sandholm, T., Gilpin, A., & Conitzer, V. (2005). Mixed-integer programming methods for finding Nash equilibria. In Proceedings of the National Conference on Artificial Intelligence (AAAI) (pp. 495–501). Pittsburgh, PA, USA.
  50. Sen, S., & Weiss, G. (1998). Learning in multiagent systems. In G. Weiss (Ed.), Multiagent systems: A modern introduction to distributed artificial intelligence (Chapter 6, pp. 259–298). MIT Press.
  51. Shapley, L. S. (1964). Some topics in two-person games. In M. Drescher, L. S. Shapley, & A. W. Tucker (Eds.), Advances in game theory. Princeton University Press.
  52. Simon, H. A. (1982). Models of bounded rationality, Vol. 2. MIT Press.
  53. Singh, S., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) (pp. 541–548). Stanford, CA.
  54. Stimpson, J., Goodrich, M., & Walters, L. (2001). Satisficing and learning cooperation in the prisoner's dilemma. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI) (pp. 535–540). Seattle, WA.
  55. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning (ICML) (pp. 330–337).
  56. Wang, X., & Sandholm, T. (2002). Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.
  57. Wang, X., & Sandholm, T. (2003). Learning near-Pareto-optimal conventions in polynomial time. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.
  58. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML) (pp. 928–936). Washington, DC, USA.

Copyright information

© Springer Science + Business Media, LLC 2007

Authors and Affiliations

  1. Computer Science Department, Carnegie Mellon University, Pittsburgh
