# Multiagent learning in the presence of memory-bounded agents

## Abstract

In recent years, great strides have been made towards creating autonomous agents that can learn via interaction with their environment. When considering just an individual agent, it is often appropriate to model the world as stationary, meaning that the same action from the same state always yields the same (possibly stochastic) effects. In the presence of other independent agents, however, the environment is non-stationary: an action’s effects may depend on the actions of the other agents. This non-stationarity is the primary challenge of multiagent learning and the main reason it is best treated separately from single-agent learning. The multiagent learning problem is often studied in the stylized settings provided by repeated matrix games. The goal of this article is to introduce a novel multiagent learning algorithm for such a setting, called Convergence with Model Learning and Safety (or CMLeS), that achieves a combination of objectives no previous algorithm has attained. Specifically, CMLeS is the first multiagent learning algorithm that (1) converges to a Nash equilibrium joint policy in self-play; (2) achieves a return close to that of the best response when interacting with a set of memory-bounded agents whose memory size is upper bounded by a known value; and (3) ensures an individual return very close to its security value when interacting with any other set of agents. Our presentation of CMLeS is backed by a rigorous theoretical analysis, including an analysis of sample complexity wherever applicable.
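To ground the terminology, the following minimal Python sketch (illustrative only; it is not the CMLeS algorithm, and all names in it are our own) plays a repeated 2x2 matrix game against a memory-bounded opponent: it estimates the opponent's strategy as a function of the last K joint actions, best-responds to that model, and falls back to a pure-strategy maximin action (a lower bound on the security value) when the current context has never been observed.

```python
import numpy as np
from collections import defaultdict

# Row player's payoff matrix: R[i, j] = our payoff when we play action i
# and the opponent plays action j.
R = np.array([[3.0, 0.0],
              [1.0, 2.0]])

K = 1  # assumed known upper bound on the opponent's memory, in joint actions

def estimate_opponent_model(history, n_actions=2):
    """Empirical estimate of P(opponent action | last K joint actions),
    with add-one smoothing so unseen actions keep positive probability."""
    counts = defaultdict(lambda: np.ones(n_actions))
    for t in range(K, len(history)):
        context = tuple(history[t - K:t])  # the K joint actions before round t
        counts[context][history[t][1]] += 1
    return {c: v / v.sum() for c, v in counts.items()}

def pure_maximin_value(payoffs):
    """Security value restricted to pure strategies: a lower bound on the
    mixed-strategy security value referred to in the abstract."""
    return payoffs.min(axis=1).max()

def choose_action(model, context):
    """Best-respond to the modeled opponent; in an unseen context, fall
    back to the safest pure action."""
    if context in model:
        return int(np.argmax(R @ model[context]))  # maximize expected payoff
    return int(np.argmax(R.min(axis=1)))           # pure maximin action

# Usage: 500 rounds against a memory-1 opponent that copies our last action.
rng = np.random.default_rng(0)
history = [(int(rng.integers(2)), int(rng.integers(2))) for _ in range(K)]
total = 0.0
for _ in range(500):
    model = estimate_opponent_model(history)
    context = tuple(history[-K:])
    a = choose_action(model, context)
    b = history[-1][0]  # the opponent mirrors our previous action
    total += R[a, b]
    history.append((a, b))

print(f"average payoff {total / 500:.2f} vs. security value {pure_maximin_value(R):.2f}")
```

The sketch captures only the opponent-modeling and safety-fallback ingredients; CMLeS itself additionally coordinates on a Nash equilibrium in self-play and comes with formal sample-complexity guarantees, neither of which this toy loop attempts.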

## Keywords

- Multiagent learning
- Memory-bounded agents
- Sample complexity analysis

## Notes

### Acknowledgments

This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (IIS-0917122), ONR (N00014-09-1-0658), and the Federal Highway Administration (DTFH61-07-H-00030).
