Autonomous Agents and Multi-Agent Systems, Volume 28, Issue 2, pp 182–213

Multiagent learning in the presence of memory-bounded agents

  • Doran Chakraborty
  • Peter Stone


Abstract

In recent years, great strides have been made towards creating autonomous agents that can learn via interaction with their environment. When considering just an individual agent, it is often appropriate to model the world as stationary, meaning that the same action from the same state will always yield the same (possibly stochastic) effects. In the presence of other independent agents, however, the environment is not stationary: an action’s effects may depend on the actions of the other agents. This non-stationarity poses the primary challenge of multiagent learning and is the main reason it is best considered distinctly from single-agent learning. The multiagent learning problem is often studied in the stylized settings provided by repeated matrix games. The goal of this article is to introduce a novel multiagent learning algorithm for such a setting, called Convergence with Model Learning and Safety (CMLeS), that achieves a combination of objectives no previous algorithm has. Specifically, CMLeS is the first multiagent learning algorithm to (1) converge to following a Nash equilibrium joint-policy in self-play; (2) achieve close to the best response when interacting with a set of memory-bounded agents whose memory size is upper bounded by a known value; and (3) ensure an individual return very close to its security value when interacting with any other set of agents. Our presentation of CMLeS is backed by a rigorous theoretical analysis, including an analysis of sample complexity wherever applicable.
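The setting described above can be made concrete with a small sketch. The following is an illustrative toy example, not CMLeS itself: a learner in the iterated Prisoner’s Dilemma (a repeated matrix game) fits a model of a memory-bounded opponent, whose next action depends only on the last joint action, and then plays a one-step best response to that estimated model. The payoff matrix, the Tit-for-Tat opponent, and the explore-then-exploit schedule are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Illustrative iterated Prisoner's Dilemma payoffs for the learner (row player).
ACTIONS = ("C", "D")
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

class MemoryBoundedOpponent:
    """A memory-1 opponent: its action depends only on the last joint action.
    Here it plays Tit-for-Tat, copying the learner's previous move."""
    def act(self, history):
        return "C" if not history else history[-1][0]

def estimate_model(history, k=1):
    """Count the opponent's responses after each length-k window of joint actions."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(k, len(history)):
        context = tuple(history[t - k:t])
        counts[context][history[t][1]] += 1
    return counts

def best_response(counts, context):
    """One-step best response to the estimated conditional action distribution."""
    dist = counts.get(context)
    if not dist:
        return "C"  # default action for a context never observed
    total = sum(dist.values())
    def expected(a):
        return sum(PAYOFF[(a, b)] * n for b, n in dist.items()) / total
    return max(ACTIONS, key=expected)

random.seed(0)
opponent = MemoryBoundedOpponent()
history = []  # list of joint actions: (learner's move, opponent's move)
for t in range(200):
    if t < 50:  # explore: random play visits every memory context
        mine = random.choice(ACTIONS)
    else:       # exploit: best-respond to the model, given the last joint action
        mine = best_response(estimate_model(history, k=1), (history[-1],))
    history.append((mine, opponent.act(history)))
```

Against Tit-for-Tat this myopic exploitation locks into mutual defection (payoff 1 per step), even though mutual cooperation earns 3: the true best response to a memory-bounded opponent must account for how today’s action shapes the opponent’s future memory state, which is why algorithms in this literature plan over the induced Markov decision process rather than maximizing the one-step payoff.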


Keywords: Multiagent learning · Memory-bounded agents · Sample complexity analysis



This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (IIS-0917122), ONR (N00014-09-1-0658), and the Federal Highway Administration (DTFH61-07-H-00030).



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. Microsoft, Sunnyvale, USA
  2. Department of Computer Science, The University of Texas at Austin, Austin, USA
