Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

Chapter in Handbook of Reinforcement Learning and Control

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 325)

Abstract

Recent years have witnessed significant advances in reinforcement learning (RL), which has registered tremendous success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one agent, which naturally falls into the realm of multi-agent RL (MARL), a domain with a relatively long history that has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, MARL still lacks solid theoretical foundations in much of the literature. In this chapter, we provide a selective overview of MARL, with a focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results for MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, and the (non-)convergence of policy-based methods for learning in games. Some of these new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field, to identify fruitful future directions for theoretical studies of MARL. We expect this chapter to serve as a continuing stimulus for researchers interested in working on this exciting yet challenging topic.

Writing of this chapter was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Air Force Office of Scientific Research (AFOSR) Grant FA9550-19-1-0353.

Notes

  1. Hereafter, we will use agent and player interchangeably.

  2. Note that there are several other standard formulations of MDPs, e.g., the time-average-reward setting and the finite-horizon episodic setting. Here, we only present the classical infinite-horizon discounted setting for ease of exposition; the objective is displayed after these notes.

  3. The partially observed MDP (POMDP) model is usually advocated when the agent has no access to the exact system state but only an observation of the state. See [45, 46] for more details on the POMDP model.

  4. Similar to the single-agent setting, here we only introduce the infinite-horizon discounted setting for simplicity, though other settings of MGs, e.g., the time-average-reward setting and the finite-horizon episodic setting, also exist [85].

  5. Here, we focus only on stationary Markov Nash equilibria for the infinite-horizon discounted MGs considered; a formal statement is displayed after these notes.

  6. Partially observed Markov games under the cooperative setting are usually formulated as decentralized POMDP (Dec-POMDP) problems. See Sect. 12.4.1.3 for more discussions on this setting.

  7. The difference between mean-field teams and mean-field games lies mainly in the solution concept, optimum versus equilibrium, mirroring the difference between general dynamic team theory [88, 173, 174] and game theory [82, 85]. Although the former can be viewed as a special case of the latter, related works are usually reviewed separately in the literature; we follow that convention here.

  8. Note that hereafter we use decentralized and distributed interchangeably when describing this paradigm.
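
For concreteness, the infinite-horizon discounted criterion referenced in Notes 2, 4, and 5 can be spelled out as follows. This is the standard textbook formulation; the notation (value function \(V\), discount factor \(\gamma \in (0,1)\), reward \(r\), joint policy \(\pi = (\pi^1, \dots, \pi^N)\)) is chosen here for illustration rather than quoted from the chapter body. In the single-agent MDP setting of Note 2, a policy \(\pi\) is evaluated by

\[ V^{\pi}(s) \;=\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t) \right]. \]

In an \(N\)-agent Markov game (Notes 4 and 5), each agent \(i\) has its own reward \(r^i\) and value function \(V^{i}_{\pi^i, \pi^{-i}}\), where \(\pi^{-i}\) collects the policies of all agents other than \(i\). A stationary Markov Nash equilibrium is then a joint policy \((\pi^{1,*}, \dots, \pi^{N,*})\) of stationary Markov policies from which no agent can gain by unilateral deviation:

\[ V^{i}_{\pi^{i,*},\, \pi^{-i,*}}(s) \;\ge\; V^{i}_{\pi^{i},\, \pi^{-i,*}}(s) \quad \text{for all agents } i,\ \text{all states } s,\ \text{and all policies } \pi^{i}. \]

In the fully cooperative (team) case, all agents share a common reward and the solution concept reduces to a team optimum, \(\max_{\pi} V^{\pi}(s)\); this optimum-versus-equilibrium contrast is precisely the distinction Note 7 draws between mean-field teams and mean-field games.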

References

  1. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  2. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)

  3. OpenAI: OpenAI Five. https://blog.openai.com/openai-five/ (2018)

  4. Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W.M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., Silver, D.: AlphaStar: mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ (2019)

  5. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: a survey. Int. J. Robot. Res. 32(11), 1238–1274 (2013)

  6. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. In: International Conference on Learning Representations (2016)

  7. Brown, N., Sandholm, T.: Libratus: the superhuman AI for no-limit Poker. In: International Joint Conference on Artificial Intelligence, pp. 5226–5228 (2017)

  8. Brown, N., Sandholm, T.: Superhuman AI for multiplayer poker. Science 365, 885–890 (2019)

  9. Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving (2016). arXiv preprint arXiv:1610.03295

  10. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  11. Busoniu, L., Babuska, R., De Schutter, B., et al.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C 38(2), 156–172 (2008)

  12. Adler, J.L., Blue, V.J.: A cooperative multi-agent transportation management and route guidance system. Transp. Res. Part C: Emerg. Technol. 10(5), 433–454 (2002)

  13. Wang, S., Wan, J., Zhang, D., Li, D., Zhang, C.: Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Comput. Netw. 101, 158–168 (2016)

  14. Jangmin, O., Lee, J.W., Zhang, B.T.: Stock trading system using reinforcement learning with cooperative agents. In: International Conference on Machine Learning, pp. 451–458 (2002)

  15. Lee, J.W., Park, J., Jangmin, O., Lee, J., Hong, E.: A multiagent approach to \(Q \)-learning for daily stock trading. IEEE Trans. Syst. Man Cybern.-Part A: Syst. Hum. 37(6), 864–877 (2007)

  16. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004)

  17. Choi, J., Oh, S., Horowitz, R.: Distributed learning and cooperative control for multi-agent systems. Automatica 45(12), 2802–2814 (2009)

  18. Castelfranchi, C.: The theory of social functions: challenges for computational social science and multi-agent learning. Cogn. Syst. Res. 2(1), 5–38 (2001)

  19. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 464–473 (2017)

  20. Hernandez-Leal, P., Kartal, B., Taylor, M.E.: A survey and critique of multiagent deep reinforcement learning (2018). arXiv preprint arXiv:1810.05587

  21. Foerster, J., Assael, Y.M., de Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137–2145 (2016)

  22. Zazo, S., Macua, S.V., Sánchez-Fernández, M., Zazo, J.: Dynamic potential games with constraints: fundamentals and applications in communications. IEEE Trans. Signal Process. 64(14), 3806–3821 (2016)

  23. Zhang, K., Yang, Z., Liu, H., Zhang, T., Başar, T.: Fully decentralized multi-agent reinforcement learning with networked agents. In: International Conference on Machine Learning, pp. 5867–5876 (2018)

  24. Subramanian, J., Mahajan, A.: Reinforcement learning in stationary mean-field games. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 251–259 (2019)

  25. Heinrich, J., Silver, D.: Deep reinforcement learning from self-play in imperfect-information games (2016). arXiv preprint arXiv:1603.01121

  26. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379–6390 (2017)

  27. Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients (2017). arXiv preprint arXiv:1705.08926

  28. Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multi-agent control using deep reinforcement learning. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 66–83 (2017)

  29. Omidshafiei, S., Pazis, J., Amato, C., How, J.P., Vian, J.: Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In: International Conference on Machine Learning, pp. 2681–2690 (2017)

  30. Kawamura, K., Mizukami, N., Tsuruoka, Y.: Neural fictitious self-play in imperfect information games with many players. In: Workshop on Computer Games, pp. 61–74 (2017)

  31. Zhang, L., Wang, W., Li, S., Pan, G.: Monte Carlo neural fictitious self-play: Approach to approximate Nash equilibrium of imperfect-information games (2019). arXiv preprint arXiv:1903.09569

  32. Mazumdar, E., Ratliff, L.J.: On the convergence of gradient-based learning in continuous games (2018). arXiv preprint arXiv:1804.05464

  33. Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: stable limit points of gradient descent ascent are locally optimal (2019). arXiv preprint arXiv:1902.00618

  34. Zhang, K., Yang, Z., Başar, T.: Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In: Advances in Neural Information Processing Systems (2019)

  35. Sidford, A., Wang, M., Yang, L.F., Ye, Y.: Solving discounted stochastic two-player games with near-optimal time and sample complexity (2019). arXiv preprint arXiv:1908.11071

  36. Oliehoek, F.A., Amato, C.: A Concise Introduction to Decentralized POMDPs, vol. 1. Springer, Berlin (2016)

  37. Arslan, G., Yüksel, S.: Decentralized Q-learning for stochastic teams and games. IEEE Trans. Autom. Control 62(4), 1545–1558 (2017)

  38. Yongacoglu, B., Arslan, G., Yüksel, S.: Learning team-optimality for decentralized stochastic control and dynamic games (2019). arXiv preprint arXiv:1903.05812

  39. Zhang, K., Miehling, E., Başar, T.: Online planning for decentralized stochastic control with partial history sharing. In: IEEE American Control Conference, pp. 167–172 (2019)

  40. Hernandez-Leal, P., Kaisers, M., Baarslag, T., de Cote, E.M.: A survey of learning in multiagent environments: dealing with non-stationarity (2017). arXiv preprint arXiv:1707.09183

  41. Nguyen, T.T., Nguyen, N.D., Nahavandi, S.: Deep reinforcement learning for multi-agent systems: a review of challenges, solutions and applications (2018). arXiv preprint arXiv:1812.11794

  42. Oroojlooy Jadid, A., Hajinezhad, D.: A review of cooperative multi-agent deep reinforcement learning (2019). arXiv preprint arXiv:1908.03963

  43. Zhang, K., Yang, Z., Başar, T.: Networked multi-agent reinforcement learning in continuous spaces. In: IEEE Conference on Decision and Control, pp. 2771–2776 (2018)

  44. Zhang, K., Yang, Z., Liu, H., Zhang, T., Başar, T.: Finite-sample analyses for fully decentralized multi-agent reinforcement learning (2018). arXiv preprint arXiv:1812.02783

  45. Monahan, G.E.: State of the art-a survey of partially observable Markov decision processes: theory, models, and algorithms. Manag. Sci. 28(1), 1–16 (1982)

  46. Cassandra, A.R.: Exact and approximate algorithms for partially observable Markov decision processes. Brown University (1998)

  47. Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Belmont (2005)

  48. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)

  49. Szepesvári, C., Littman, M.L.: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Comput. 11(8), 2017–2060 (1999)

  50. Singh, S., Jaakkola, T., Littman, M.L., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)

  51. Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1), 126–139 (2005)

  52. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: European Conference on Machine Learning, pp. 282–293. Springer (2006)

  53. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, pp. 72–83 (2006)

  54. Agrawal, R.: Sample mean based index policies by \(O(\log n)\) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4), 1054–1078 (1995)

  55. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)

  56. Jiang, D., Ekwedike, E., Liu, H.: Feedback-based tree search for reinforcement learning. In: International Conference on Machine Learning, pp. 2284–2293 (2018)

  57. Shah, D., Xie, Q., Xu, Z.: On reinforcement learning using Monte-Carlo tree search with supervised learning: non-asymptotic analysis (2019). arXiv preprint arXiv:1902.05213

  58. Tesauro, G.: Temporal difference learning and TD-Gammon. Commun. ACM 38(3), 58–68 (1995)

  59. Tsitsiklis, J.N., Van Roy, B.: Analysis of temporal-difference learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1075–1081 (1997)

  60. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)

  61. Sutton, R.S., Szepesvári, C., Maei, H.R.: A convergent \(O(n)\) algorithm for off-policy temporal-difference learning with linear function approximation. In: Advances in Neural Information Processing Systems, vol. 21(21), pp. 1609–1616 (2008)

  62. Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora, E.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: International Conference on Machine Learning, pp. 993–1000 (2009)

  63. Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M.: Finite-sample analysis of proximal gradient TD algorithms. In: Conference on Uncertainty in Artificial Intelligence, pp. 504–513 (2015)

  64. Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S., Maei, H.R., Szepesvári, C.: Convergent temporal-difference learning with arbitrary smooth function approximation. In: Advances in Neural Information Processing Systems, pp. 1204–1212 (2009)

  65. Dann, C., Neumann, G., Peters, J., et al.: Policy evaluation with temporal differences: a survey and comparison. J. Mach. Learn. Res. 15, 809–883 (2014)

  66. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)

  67. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)

  68. Baxter, J., Bartlett, P.L.: Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. 15, 319–350 (2001)

  69. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

  70. Bhatnagar, S., Sutton, R., Ghavamzadeh, M., Lee, M.: Natural actor-critic algorithms. Automatica 45(11), 2471–2482 (2009)

  71. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning, pp. 387–395 (2014)

  72. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017). arXiv preprint arXiv:1707.06347

  73. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015)

  74. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor (2018). arXiv preprint arXiv:1801.01290

  75. Yang, Z., Zhang, K., Hong, M., Başar, T.: A finite sample analysis of the actor-critic algorithm. In: IEEE Conference on Decision and Control, pp. 2759–2764 (2018)

  76. Zhang, K., Koppel, A., Zhu, H., Başar, T.: Global convergence of policy gradient methods to (almost) locally optimal policies (2019). arXiv preprint arXiv:1906.08383

  77. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes (2019). arXiv preprint arXiv:1908.00261

  78. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy (2019). arXiv preprint arXiv:1906.10306

  79. Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence (2019). arXiv preprint arXiv:1909.01150

  80. Chen, Y., Wang, M.: Stochastic primal-dual methods and sample complexity of reinforcement learning (2016). arXiv preprint arXiv:1612.02516

  81. Wang, M.: Primal-dual \(\pi \) learning: sample complexity and sublinear run time for ergodic Markov decision problems (2017). arXiv preprint arXiv:1710.06100

  82. Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. 39(10), 1095–1100 (1953)

  83. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 157–163 (1994)

  84. Başar, T., Olsder, G.J.: Dynamic Noncooperative Game Theory, vol. 23. SIAM, Philadelphia (1999)

  85. Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer Science & Business Media, Berlin (2012)

  86. Boutilier, C.: Planning, learning and coordination in multi-agent decision processes. In: Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210 (1996)

  87. Lauer, M., Riedmiller, M.: An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: International Conference on Machine Learning (2000)

  88. Yoshikawa, T.: Decomposition of dynamic team decision problems. IEEE Trans. Autom. Control 23(4), 627–632 (1978)

  89. Ho, Y.C.: Team decision theory and information structures. Proc. IEEE 68(6), 644–654 (1980)

  90. Wang, X., Sandholm, T.: Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In: Advances in Neural Information Processing Systems, pp. 1603–1610 (2003)

  91. Mahajan, A.: Sequential decomposition of sequential dynamic teams: applications to real-time communication and networked control systems. Ph.D. thesis, University of Michigan (2008)

  92. González-Sánchez, D., Hernández-Lerma, O.: Discrete-Time Stochastic Control and Dynamic Potential Games: The Euler-Equation Approach. Springer Science & Business Media, Berlin (2013)

  93. Valcarcel Macua, S., Zazo, J., Zazo, S.: Learning parametric closed-loop policies for Markov potential games. In: International Conference on Learning Representations (2018)

  94. Kar, S., Moura, J.M., Poor, H.V.: QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Trans. Signal Process. 61(7), 1848–1862 (2013)

  95. Doan, T., Maguluri, S., Romberg, J.: Finite-time analysis of distributed TD (0) with linear function approximation on multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 1626–1635 (2019)

  96. Wai, H.T., Yang, Z., Wang, Z., Hong, M.: Multi-agent reinforcement learning via double averaging primal-dual optimization. In: Advances in Neural Information Processing Systems, pp. 9649–9660 (2018)

  97. OpenAI: OpenAI Dota 2 1v1 bot. https://openai.com/the-international/ (2017)

  98. Jacobson, D.: Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Trans. Autom. Control 18(2), 124–131 (1973)

  99. Başar, T., Bernhard, P.: H\(_\infty \) Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Birkhäuser, Boston (1995)

  100. Zhang, K., Hu, B., Başar, T.: Policy optimization for \(\cal{H}_2\) linear control with \(\cal{H}_{\infty }\) robustness guarantee: implicit regularization and global convergence (2019). arXiv preprint arXiv:1910.09496

  101. Hu, J., Wellman, M.P.: Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 4, 1039–1069 (2003)

  102. Littman, M.L.: Friend-or-foe Q-learning in general-sum games. In: International Conference on Machine Learning, pp. 322–328 (2001)

  103. Lagoudakis, M.G., Parr, R.: Learning in zero-sum team Markov games using factored value functions. In: Advances in Neural Information Processing Systems, pp. 1659–1666 (2003)

  104. Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S.: The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 27(4), 819–840 (2002)

  105. Osborne, M.J., Rubinstein, A.: A Course in Game Theory. MIT Press, Cambridge (1994)

  106. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2008)

  107. Koller, D., Megiddo, N.: The complexity of two-person zero-sum games in extensive form. Games Econ. Behav. 4(4), 528–552 (1992)

  108. Kuhn, H.: Extensive games and the problem of information. Contrib. Theory Games 2, 193–216 (1953)

  109. Zinkevich, M., Johanson, M., Bowling, M., Piccione, C.: Regret minimization in games with incomplete information. In: Advances in Neural Information Processing Systems, pp. 1729–1736 (2008)

  110. Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805–813 (2015)

  111. Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., Bowling, M.: Actor-critic policy optimization in partially observable multiagent environments. In: Advances in Neural Information Processing Systems, pp. 3422–3435 (2018)

  112. Omidshafiei, S., Hennes, D., Morrill, D., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.B., Tuyls, K.: Neural replicator dynamics (2019). arXiv preprint arXiv:1906.00190

  113. Rubin, J., Watson, I.: Computer Poker: a review. Artif. Intell. 175(5–6), 958–987 (2011)

  114. Lanctot, M., Lockhart, E., Lespiau, J.B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al.: OpenSpiel: a framework for reinforcement learning in games (2019). arXiv preprint arXiv:1908.09453

  115. Claus, C., Boutilier, C.: The dynamics of reinforcement learning in cooperative multiagent systems. In: AAAI Conference on Artificial Intelligence, pp. 746–752 (1998)

  116. Bowling, M., Veloso, M.: Rational and convergent learning in stochastic games. In: International Joint Conference on Artificial Intelligence, vol. 17, pp. 1021–1026 (2001)

  117. Kapetanakis, S., Kudenko, D.: Reinforcement learning of coordination in cooperative multi-agent systems. In: AAAI Conference on Artificial Intelligence, pp. 326–331 (2002)

  118. Conitzer, V., Sandholm, T.: Awesome: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Mach. Learn. 67(1–2), 23–43 (2007)

  119. Hansen, E.A., Bernstein, D.S., Zilberstein, S.: Dynamic programming for partially observable stochastic games. In: AAAI Conference on Artificial Intelligence, pp. 709–715 (2004)

  120. Amato, C., Chowdhary, G., Geramifard, A., Üre, N.K., Kochenderfer, M.J.: Decentralized control of partially observable Markov decision processes. In: IEEE Conference on Decision and Control, pp. 2398–2405 (2013)

  121. Amato, C., Oliehoek, F.A.: Scalable planning and learning for multiagent POMDPs. In: AAAI Conference on Artificial Intelligence (2015)

  122. Shoham, Y., Powers, R., Grenager, T.: Multi-agent reinforcement learning: a critical survey. Technical Report (2003)

  123. Zinkevich, M., Greenwald, A., Littman, M.L.: Cyclic equilibria in Markov games. In: Advances in Neural Information Processing Systems, pp. 1641–1648 (2006)

  124. Bowling, M., Veloso, M.: Multiagent learning using a variable learning rate. Artif. Intell. 136(2), 215–250 (2002)

  125. Bowling, M.: Convergence and no-regret in multiagent learning. In: Advances in Neural Information Processing Systems, pp. 209–216 (2005)

  126. Blum, A., Mansour, Y.: Learning, regret minimization, and equilibria. In: Algorithmic Game Theory, pp. 79–102 (2007)

  127. Hart, S., Mas-Colell, A.: A reinforcement procedure leading to correlated equilibrium. In: Economics Essays, pp. 181–200. Springer, Berlin (2001)

  128. Kasai, T., Tenmoto, H., Kamiya, A.: Learning of communication codes in multi-agent reinforcement learning problem. In: IEEE Conference on Soft Computing in Industrial Applications, pp. 1–6 (2008)

  129. Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., Yi, Y.: Learning to schedule communication in multi-agent reinforcement learning. In: International Conference on Learning Representations (2019)

  130. Chen, T., Zhang, K., Giannakis, G.B., Başar, T.: Communication-efficient distributed reinforcement learning (2018). arXiv preprint arXiv:1812.03239

  131. Lin, Y., Zhang, K., Yang, Z., Wang, Z., Başar, T., Sandhu, R., Liu, J.: A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning. In: IEEE Conference on Decision and Control (2019)

  132. Ren, J., Haupt, J.: A communication efficient hierarchical distributed optimization algorithm for multi-agent reinforcement learning. In: Real-World Sequential Decision Making Workshop at International Conference on Machine Learning (2019)

  133. Kim, W., Cho, M., Sung, Y.: Message-dropout: an efficient training method for multi-agent deep reinforcement learning. In: AAAI Conference on Artificial Intelligence (2019)

  134. He, H., Boyd-Graber, J., Kwok, K., Daumé III, H.: Opponent modeling in deep reinforcement learning. In: International Conference on Machine Learning, pp. 1804–1813 (2016)

  135. Grover, A., Al-Shedivat, M., Gupta, J., Burda, Y., Edwards, H.: Learning policy representations in multiagent systems. In: International Conference on Machine Learning, pp. 1802–1811 (2018)

  136. Gao, C., Mueller, M., Hayward, R.: Adversarial policy gradient for alternating Markov games. In: Workshop at International Conference on Learning Representations (2018)

  137. Li, S., Wu, Y., Cui, X., Dong, H., Fang, F., Russell, S.: Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In: AAAI Conference on Artificial Intelligence (2019)

  138. Zhang, X., Zhang, K., Miehling, E., Basar, T.: Non-cooperative inverse reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 9482–9493 (2019)

  139. Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents. In: International Conference on Machine Learning, pp. 330–337 (1993)

  140. Matignon, L., Laurent, G.J., Le Fort-Piat, N.: Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowl. Eng. Rev. 27(1), 1–31 (2012)

  141. Foerster, J., Nardelli, N., Farquhar, G., Torr, P., Kohli, P., Whiteson, S., et al.: Stabilising experience replay for deep multi-agent reinforcement learning. In: International Conference of Machine Learning, pp. 1146–1155 (2017)

  142. Tuyls, K., Weiss, G.: Multiagent learning: basics, challenges, and prospects. AI Mag. 33(3), 41 (2012)

  143. Guestrin, C., Lagoudakis, M., Parr, R.: Coordinated reinforcement learning. In: International Conference on Machine Learning, pp. 227–234 (2002)

  144. Guestrin, C., Koller, D., Parr, R.: Multiagent planning with factored MDPs. In: Advances in Neural Information Processing Systems, pp. 1523–1530 (2002)

  145. Kok, J.R., Vlassis, N.: Sparse cooperative Q-learning. In: International Conference on Machine learning, pp. 61–69 (2004)

  146. Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., et al.: Value-decomposition networks for cooperative multi-agent learning based on team reward. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 2085–2087 (2018)

  147. Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., Whiteson, S.: QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 681–689 (2018)

  148. Qu, G., Li, N.: Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure. In: IEEE Conference on Decision and Control (2019)

  149. Mahajan, A.: Optimal decentralized control of coupled subsystems with control sharing. IEEE Trans. Autom. Control 58(9), 2377–2382 (2013)

  150. Oliehoek, F.A., Amato, C.: Dec-POMDPs as non-observable MDPs. IAS Technical Report (IAS-UVA-14-01) (2014)

  151. Foerster, J.N., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. In: AAAI Conference on Artificial Intelligence (2018)

  152. Dibangoye, J., Buffet, O.: Learning to act in decentralized partially observable MDPs. In: International Conference on Machine Learning, pp. 1233–1242 (2018)

  153. Kraemer, L., Banerjee, B.: Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, 82–94 (2016)

  154. Macua, S.V., Chen, J., Zazo, S., Sayed, A.H.: Distributed policy evaluation under multiple behavior strategies. IEEE Trans. Autom. Control 60(5), 1260–1274 (2015)

  155. Macua, S.V., Tukiainen, A., Hernández, D.G.O., Baldazo, D., de Cote, E.M., Zazo, S.: Diff-dac: Distributed actor-critic for average multitask deep reinforcement learning (2017). arXiv preprint arXiv:1710.10363

  156. Lee, D., Yoon, H., Hovakimyan, N.: Primal-dual algorithm for distributed reinforcement learning: distributed GTD. In: IEEE Conference on Decision and Control, pp. 1967–1972 (2018)

  157. Doan, T.T., Maguluri, S.T., Romberg, J.: Finite-time performance of distributed temporal difference learning with linear function approximation (2019). arXiv preprint arXiv:1907.12530

  158. Suttle, W., Yang, Z., Zhang, K., Wang, Z., Başar, T., Liu, J.: A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning (2019). arXiv preprint arXiv:1903.06372

  159. Littman, M.L.: Value-function reinforcement learning in Markov games. Cogn. Syst. Res. 2(1), 55–66 (2001)

  160. Young, H.P.: The evolution of conventions. Econometrica: J. Econ. Soc. 57–84 (1993)

  161. Son, K., Kim, D., Kang, W.J., Hostallero, D.E., Yi, Y.: QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 5887–5896 (2019)

  162. Perolat, J., Piot, B., Pietquin, O.: Actor-critic fictitious play in simultaneous move multistage games. In: International Conference on Artificial Intelligence and Statistics (2018)

  163. Monderer, D., Shapley, L.S.: Potential games. Games Econ. Behav. 14(1), 124–143 (1996)

  164. Başar, T., Zaccour, G.: Handbook of Dynamic Game Theory. Springer, Berlin (2018)

  165. Huang, M., Caines, P.E., Malhamé, R.P.: Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In: IEEE Conference on Decision and Control, pp. 98–103 (2003)

  166. Huang, M., Malhamé, R.P., Caines, P.E., et al.: Large population stochastic dynamic games: closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst. 6(3), 221–252 (2006)

  167. Lasry, J.M., Lions, P.L.: Mean field games. Jpn. J. Math. 2(1), 229–260 (2007)

  168. Bensoussan, A., Frehse, J., Yam, P., et al.: Mean Field Games and Mean Field Type Control Theory, vol. 101. Springer, Berlin (2013)

  169. Tembine, H., Zhu, Q., Başar, T.: Risk-sensitive mean-field games. IEEE Trans. Autom. Control 59(4), 835–850 (2013)

  170. Arabneydi, J., Mahajan, A.: Team optimal control of coupled subsystems with mean-field sharing. In: IEEE Conference on Decision and Control, pp. 1669–1674 (2014)

  171. Arabneydi, J.: New concepts in team theory: Mean field teams and reinforcement learning. Ph.D. thesis, McGill University (2017)

  172. Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., Wang, J.: Mean field multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 5571–5580 (2018)

  173. Witsenhausen, H.S.: Separation of estimation and control for discrete time systems. Proc. IEEE 59(11), 1557–1566 (1971)

  174. Yüksel, S., Başar, T.: Stochastic Networked Control Systems: Stabilization and Optimization Under Information Constraints. Springer Science & Business Media, Berlin (2013)

  175. Subramanian, J., Seraj, R., Mahajan, A.: Reinforcement learning for mean-field teams. In: Workshop on Adaptive and Learning Agents at International Conference on Autonomous Agents and Multi-Agent Systems (2018)

  176. Arabneydi, J., Mahajan, A.: Linear quadratic mean field teams: optimal and approximately optimal decentralized solutions (2016). arXiv preprint arXiv:1609.00056

  177. Carmona, R., Laurière, M., Tan, Z.: Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods (2019). arXiv preprint arXiv:1910.04295

  178. Carmona, R., Laurière, M., Tan, Z.: Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning (2019). arXiv preprint arXiv:1910.12802

  179. Rabbat, M., Nowak, R.: Distributed optimization in sensor networks. In: International Symposium on Information Processing in Sensor Networks, pp. 20–27 (2004)

  180. Dall’Anese, E., Zhu, H., Giannakis, G.B.: Distributed optimal power flow for smart microgrids. IEEE Trans. Smart Grid 4(3), 1464–1475 (2013)

  181. Zhang, K., Shi, W., Zhu, H., Dall’Anese, E., Başar, T.: Dynamic power distribution system management with a locally connected communication network. IEEE J. Sel. Top. Signal Process. 12(4), 673–687 (2018)

  182. Zhang, K., Lu, L., Lei, C., Zhu, H., Ouyang, Y.: Dynamic operations and pricing of electric unmanned aerial vehicle systems and power networks. Transp. Res. Part C: Emerg. Technol. 92, 472–485 (2018)

  183. Corke, P., Peterson, R., Rus, D.: Networked robots: flying robot navigation using a sensor net. Robot. Res. 234–243 (2005)

  184. Zhang, K., Liu, Y., Liu, J., Liu, M., Başar, T.: Distributed learning of average belief over networks using sequential observations. Automatica (2019)

  185. Nedic, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)

  186. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)

  187. Jakovetic, D., Xavier, J., Moura, J.M.: Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication. IEEE Trans. Signal Process. 59(8), 3889–3902 (2011)

  188. Tu, S.Y., Sayed, A.H.: Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks. IEEE Trans. Signal Process. 60(12), 6217–6234 (2012)

  189. Varshavskaya, P., Kaelbling, L.P., Rus, D.: Efficient distributed reinforcement learning through agreement. In: Distributed Autonomous Robotic Systems, pp. 367–378 (2009)

  190. Ciosek, K., Whiteson, S.: Expected policy gradients for reinforcement learning (2018). arXiv preprint arXiv:1801.03326

  191. Sutton, R.S., Mahmood, A.R., White, M.: An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 17(1), 2603–2631 (2016)

  192. Yu, H.: On convergence of emphatic temporal-difference learning. In: Conference on Learning Theory, pp. 1724–1751 (2015)

  193. Zhang, Y., Zavlanos, M.M.: Distributed off-policy actor-critic reinforcement learning with policy consensus (2019). arXiv preprint arXiv:1903.09255

  194. Pennesi, P., Paschalidis, I.C.: A distributed actor-critic algorithm and applications to mobile sensor network coordination problems. IEEE Trans. Autom. Control 55(2), 492–497 (2010)

  195. Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Reinforcement Learning, pp. 45–73. Springer, Berlin (2012)

  196. Riedmiller, M.: Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In: European Conference on Machine Learning, pp. 317–328 (2005)

  197. Antos, A., Szepesvári, C., Munos, R.: Fitted Q-iteration in continuous action-space MDPs. In: Advances in Neural Information Processing Systems, pp. 9–16 (2008)

  198. Hong, M., Chang, T.H.: Stochastic proximal gradient consensus over random networks. IEEE Trans. Signal Process. 65(11), 2933–2948 (2017)

  199. Nedic, A., Olshevsky, A., Shi, W.: Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)

  200. Munos, R.: Performance bounds in \(\ell _p\)-norm for approximate value iteration. SIAM J. Control Optim. 46(2), 541–561 (2007)

  201. Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. J. Mach. Learn. Res. 9(May), 815–857 (2008)

  202. Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach. Learn. 71(1), 89–129 (2008)

  203. Farahmand, A.M., Szepesvári, C., Munos, R.: Error propagation for approximate policy and value iteration. In: Advances in Neural Information Processing Systems, pp. 568–576 (2010)

  204. Cassano, L., Yuan, K., Sayed, A.H.: Multi-agent fully decentralized off-policy learning with linear convergence rates (2018). arXiv preprint arXiv:1810.07792

  205. Qu, G., Li, N.: Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2017)

  206. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  207. Ying, B., Yuan, K., Sayed, A.H.: Convergence of variance-reduced learning under random reshuffling. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2286–2290 (2018)

  208. Singh, S.P., Sutton, R.S.: Reinforcement learning with replacing eligibility traces. Mach. Learn. 22(1–3), 123–158 (1996)

  209. Bhandari, J., Russo, D., Singal, R.: A finite time analysis of temporal difference learning with linear function approximation. In: Conference On Learning Theory, pp. 1691–1692 (2018)

  210. Srikant, R., Ying, L.: Finite-time error bounds for linear stochastic approximation and TD learning. In: Conference on Learning Theory, pp. 2803–2830 (2019)

  211. Stanković, M.S., Stanković, S.S.: Multi-agent temporal-difference learning with linear function approximation: weak convergence under time-varying network topologies. In: IEEE American Control Conference, pp. 167–172 (2016)

  212. Stanković, M.S., Ilić, N., Stanković, S.S.: Distributed stochastic approximation: weak convergence and network design. IEEE Trans. Autom. Control 61(12), 4069–4074 (2016)

  213. Zhang, H., Jiang, H., Luo, Y., Xiao, G.: Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE Trans. Ind. Electron. 64(5), 4091–4100 (2016)

  214. Zhang, Q., Zhao, D., Lewis, F.L.: Model-free reinforcement learning for fully cooperative multi-agent graphical games. In: International Joint Conference on Neural Networks, pp. 1–6 (2018)

  215. Bernstein, D.S., Amato, C., Hansen, E.A., Zilberstein, S.: Policy iteration for decentralized control of Markov decision processes. J. Artif. Intell. Res. 34, 89–132 (2009)

  216. Amato, C., Bernstein, D.S., Zilberstein, S.: Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Auton. Agents Multi-Agent Syst. 21(3), 293–320 (2010)

  217. Liu, M., Amato, C., Liao, X., Carin, L., How, J.P.: Stick-breaking policy learning in Dec-POMDPs. In: International Joint Conference on Artificial Intelligence (2015)

  218. Dibangoye, J.S., Amato, C., Buffet, O., Charpillet, F.: Optimally solving Dec-POMDPs as continuous-state MDPs. J. Artif. Intell. Res. 55, 443–497 (2016)

  219. Wu, F., Zilberstein, S., Chen, X.: Rollout sampling policy iteration for decentralized POMDPs. In: Conference on Uncertainty in Artificial Intelligence (2010)

  220. Wu, F., Zilberstein, S., Jennings, N.R.: Monte-Carlo expectation maximization for decentralized POMDPs. In: International Joint Conference on Artificial Intelligence (2013)

  221. Best, G., Cliff, O.M., Patten, T., Mettu, R.R., Fitch, R.: Dec-MCTS: decentralized planning for multi-robot active perception. Int. J. Robot. Res. 1–22 (2018)

  222. Amato, C., Zilberstein, S.: Achieving goals in decentralized POMDPs. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 593–600 (2009)

  223. Banerjee, B., Lyle, J., Kraemer, L., Yellamraju, R.: Sample bounded distributed reinforcement learning for decentralized POMDPs. In: AAAI Conference on Artificial Intelligence (2012)

  224. Nayyar, A., Mahajan, A., Teneketzis, D.: Decentralized stochastic control with partial history sharing: a common information approach. IEEE Trans. Autom. Control 58(7), 1644–1658 (2013)

  225. Arabneydi, J., Mahajan, A.: Reinforcement learning in decentralized stochastic control systems with partial history sharing. In: IEEE American Control Conference, pp. 5449–5456 (2015)

  226. Papadimitriou, C.H.: On inefficient proofs of existence and complexity classes. In: Annals of Discrete Mathematics, vol. 51, pp. 245–250. Elsevier (1992)

  227. Daskalakis, C., Goldberg, P.W., Papadimitriou, C.H.: The complexity of computing a Nash equilibrium. SIAM J. Comput. 39(1), 195–259 (2009)

  228. Von Neumann, J., Morgenstern, O., Kuhn, H.W.: Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, Princeton (2007)

  229. Vanderbei, R.J., et al.: Linear Programming. Springer, Berlin (2015)

  230. Hoffman, A.J., Karp, R.M.: On nonterminating stochastic games. Manag. Sci. 12(5), 359–370 (1966)

  231. Van Der Wal, J.: Discounted Markov games: generalized policy iteration method. J. Optim. Theory Appl. 25(1), 125–138 (1978)

  232. Rao, S.S., Chandrasekaran, R., Nair, K.: Algorithms for discounted stochastic games. J. Optim. Theory Appl. 11(6), 627–637 (1973)

  233. Patek, S.D.: Stochastic and shortest path games: theory and algorithms. Ph.D. thesis, Massachusetts Institute of Technology (1997)

  234. Hansen, T.D., Miltersen, P.B., Zwick, U.: Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1), 1 (2013)

  235. Lagoudakis, M.G., Parr, R.: Value function approximation in zero-sum Markov games. In: Conference on Uncertainty in Artificial Intelligence, pp. 283–292 (2002)

  236. Zou, S., Xu, T., Liang, Y.: Finite-sample analysis for SARSA with linear function approximation (2019). arXiv preprint arXiv:1902.02234

  237. Sutton, R.S., Barto, A.G.: A temporal-difference model of classical conditioning. In: Proceedings of the Annual Conference of the Cognitive Science Society, pp. 355–378 (1987)

  238. Al-Tamimi, A., Abu-Khalaf, M., Lewis, F.L.: Adaptive critic designs for discrete-time zero-sum games with application to \(\cal{H}_\infty \) control. IEEE Trans. Syst. Man Cybern. Part B 37(1), 240–247 (2007)

  239. Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Model-free Q-learning designs for linear discrete-time zero-sum games with application to \(\cal{H}_\infty \) control. Automatica 43(3), 473–481 (2007)

  240. Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., Mannor, S.: Regularized policy iteration with nonparametric function spaces. J. Mach. Learn. Res. 17(1), 4809–4874 (2016)

  241. Yang, Z., Xie, Y., Wang, Z.: A theoretical analysis of deep Q-learning (2019). arXiv preprint arXiv:1901.00137

  242. Jia, Z., Yang, L.F., Wang, M.: Feature-based Q-learning for two-player stochastic games (2019). arXiv preprint arXiv:1906.00423

  243. Sidford, A., Wang, M., Wu, X., Yang, L., Ye, Y.: Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Advances in Neural Information Processing Systems, pp. 5186–5196 (2018)

  244. Wei, C.Y., Hong, Y.T., Lu, C.J.: Online reinforcement learning in stochastic games. In: Advances in Neural Information Processing Systems, pp. 4987–4997 (2017)

  245. Auer, P., Ortner, R.: Logarithmic online regret bounds for undiscounted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 49–56 (2007)

  246. Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11, 1563–1600 (2010)

  247. Koller, D., Megiddo, N., von Stengel, B.: Fast algorithms for finding randomized strategies in game trees. Computing 750, 759 (1994)

  248. Von Stengel, B.: Efficient computation of behavior strategies. Games Econ. Behav. 14(2), 220–246 (1996)

  249. Koller, D., Megiddo, N., Von Stengel, B.: Efficient computation of equilibria for extensive two-person games. Games Econ. Behav. 14(2), 247–259 (1996)

  250. Von Stengel, B.: Computing equilibria for two-person games. Handbook of Game Theory with Economic Applications 3, 1723–1759 (2002)

  251. Parr, R., Russell, S.: Approximating optimal policies for partially observable stochastic domains. In: International Joint Conference on Artificial Intelligence, pp. 1088–1094 (1995)

  252. Rodriguez, A.C., Parr, R., Koller, D.: Reinforcement learning using approximate belief states. In: Advances in Neural Information Processing Systems, pp. 1036–1042 (2000)

  253. Hauskrecht, M.: Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000)

  254. Buter, B.J.: Dynamic programming for extensive form games with imperfect information. Ph.D. thesis, Universiteit van Amsterdam (2012)

  255. Cowling, P.I., Powley, E.J., Whitehouse, D.: Information set Monte Carlo tree search. IEEE Trans. Comput. Intell. AI Games 4(2), 120–143 (2012)

  256. Teraoka, K., Hatano, K., Takimoto, E.: Efficient sampling method for Monte Carlo tree search problem. IEICE Trans. Inf. Syst. 97(3), 392–398 (2014)

  257. Whitehouse, D.: Monte Carlo tree search for games with hidden information and uncertainty. Ph.D. thesis, University of York (2014)

  258. Kaufmann, E., Koolen, W.M.: Monte-Carlo tree search by best arm identification. In: Advances in Neural Information Processing Systems, pp. 4897–4906 (2017)

  259. Hannan, J.: Approximation to Bayes risk in repeated play. Contrib. Theory Games 3, 97–139 (1957)

  260. Brown, G.W.: Iterative solution of games by fictitious play. Act. Anal. Prod. Allo. 13(1), 374–376 (1951)

  261. Robinson, J.: An iterative method of solving a game. Ann. Math. 296–301 (1951)

  262. Benaïm, M., Hofbauer, J., Sorin, S.: Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1), 328–348 (2005)

  263. Hart, S., Mas-Colell, A.: A general class of adaptive strategies. J. Econ. Theory 98(1), 26–54 (2001)

  264. Monderer, D., Samet, D., Sela, A.: Belief affirming in learning processes. J. Econ. Theory 73(2), 438–452 (1997)

  265. Viossat, Y., Zapechelnyuk, A.: No-regret dynamics and fictitious play. J. Econ. Theory 148(2), 825–842 (2013)

  266. Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York (2003)

  267. Fudenberg, D., Levine, D.K.: Consistency and cautious fictitious play. J. Econ. Dyn. Control 19(5–7), 1065–1089 (1995)

  268. Hofbauer, J., Sandholm, W.H.: On the global convergence of stochastic fictitious play. Econometrica 70(6), 2265–2294 (2002)

  269. Leslie, D.S., Collins, E.J.: Generalised weakened fictitious play. Games Econ. Behav. 56(2), 285–298 (2006)

  270. Benaïm, M., Faure, M.: Consistency of vanishingly smooth fictitious play. Math. Oper. Res. 38(3), 437–450 (2013)

  271. Li, Z., Tewari, A.: Sampled fictitious play is Hannan consistent. Games Econ. Behav. 109, 401–412 (2018)

  272. Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 6(Apr), 503–556 (2005)

  273. Heinrich, J., Silver, D.: Self-play Monte-Carlo tree search in computer Poker. In: Workshops at AAAI Conference on Artificial Intelligence (2014)

  274. Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–43 (2012)

  275. Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge (2008)

  276. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)

  277. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)

  278. Vovk, V.G.: Aggregating strategies. In: Proceedings of Computational Learning Theory (1990)

  279. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Inf. Comput. 108(2), 212–261 (1994)

  280. Freund, Y., Schapire, R.E.: Adaptive game playing using multiplicative weights. Games Econ. Behav. 29(1–2), 79–103 (1999)

  281. Hart, S., Mas-Colell, A.: A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5), 1127–1150 (2000)

  282. Lanctot, M., Waugh, K., Zinkevich, M., Bowling, M.: Monte Carlo sampling for regret minimization in extensive games. In: Advances in Neural Information Processing Systems, pp. 1078–1086 (2009)

  283. Burch, N., Lanctot, M., Szafron, D., Gibson, R.G.: Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In: Advances in Neural Information Processing Systems, pp. 1880–1888 (2012)

  284. Gibson, R., Lanctot, M., Burch, N., Szafron, D., Bowling, M.: Generalized sampling and variance in counterfactual regret minimization. In: AAAI Conference on Artificial Intelligence (2012)

  285. Johanson, M., Bard, N., Lanctot, M., Gibson, R., Bowling, M.: Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 837–846 (2012)

  286. Lisý, V., Lanctot, M., Bowling, M.: Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 27–36 (2015)

  287. Schmid, M., Burch, N., Lanctot, M., Moravčík, M., Kadlec, R., Bowling, M.: Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In: AAAI Conference on Artificial Intelligence, vol. 33, pp. 2157–2164 (2019)

  288. Waugh, K., Morrill, D., Bagnell, J.A., Bowling, M.: Solving games with functional regret estimation. In: AAAI Conference on Artificial Intelligence (2015)

  289. Morrill, D.: Using regret estimation to solve games compactly. Ph.D. thesis, University of Alberta (2016)

  290. Brown, N., Lerer, A., Gross, S., Sandholm, T.: Deep counterfactual regret minimization. In: International Conference on Machine Learning, pp. 793–802 (2019)

  291. Brown, N., Sandholm, T.: Regret-based pruning in extensive-form games. In: Advances in Neural Information Processing Systems, pp. 1972–1980 (2015)

  292. Brown, N., Kroer, C., Sandholm, T.: Dynamic thresholding and pruning for regret minimization. In: AAAI Conference on Artificial Intelligence (2017)

  293. Brown, N., Sandholm, T.: Reduced space and faster convergence in imperfect-information games via pruning. In: International Conference on Machine Learning, pp. 596–604 (2017)

  294. Tammelin, O.: Solving large imperfect information games using CFR+ (2014). arXiv preprint arXiv:1407.5042

  295. Tammelin, O., Burch, N., Johanson, M., Bowling, M.: Solving heads-up limit Texas Hold’em. In: International Joint Conference on Artificial Intelligence (2015)

  296. Burch, N., Moravčík, M., Schmid, M.: Revisiting CFR+ and alternating updates. J. Artif. Intell. Res. 64, 429–443 (2019)

  297. Zhou, Y., Ren, T., Li, J., Yan, D., Zhu, J.: Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information (2018). arXiv preprint arXiv:1810.04433

  298. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: International Conference on Machine Learning, pp. 928–936 (2003)

  299. Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.B., Morrill, D., Timbers, F., Tuyls, K.: Computing approximate equilibria in sequential adversarial games by exploitability descent (2019). arXiv preprint arXiv:1903.05614

  300. Johanson, M., Bard, N., Burch, N., Bowling, M.: Finding optimal abstract strategies in extensive-form games. In: AAAI Conference on Artificial Intelligence, pp. 1371–1379 (2012)

  301. Shafiei, M., Sturtevant, N., Schaeffer, J.: Comparing UCT versus CFR in simultaneous games (2009)

  302. Lanctot, M., Lisý, V., Winands, M.H.: Monte Carlo tree search in simultaneous move games with applications to Goofspiel. In: Workshop on Computer Games, pp. 28–43 (2013)

  303. Lisý, V., Kovařík, V., Lanctot, M., Bošanský, B.: Convergence of Monte Carlo tree search in simultaneous move games. In: Advances in Neural Information Processing Systems, pp. 2112–2120 (2013)

  304. Tak, M.J., Lanctot, M., Winands, M.H.: Monte Carlo tree search variants for simultaneous move games. In: IEEE Conference on Computational Intelligence and Games, pp. 1–8 (2014)

  305. Kovařík, V., Lisý, V.: Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games (2018). arXiv preprint arXiv:1804.09045

  306. Mazumdar, E.V., Jordan, M.I., Sastry, S.S.: On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games (2019). arXiv preprint arXiv:1901.00838

  307. Bu, J., Ratliff, L.J., Mesbahi, M.: Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games (2019). arXiv preprint arXiv:1911.04672

  308. Mescheder, L., Nowozin, S., Geiger, A.: The numerics of GANs. In: Advances in Neural Information Processing Systems, pp. 1825–1835 (2017)

  309. Adolphs, L., Daneshmand, H., Lucchi, A., Hofmann, T.: Local saddle point optimization: a curvature exploitation approach (2018). arXiv preprint arXiv:1805.05751

  310. Daskalakis, C., Panageas, I.: The limit points of (optimistic) gradient descent in min-max optimization. In: Advances in Neural Information Processing Systems, pp. 9236–9246 (2018)

  311. Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.S., Chandrasekhar, V., Piliouras, G.: Optimistic mirror descent in saddle-point problems: going the extra (gradient) mile. In: International Conference on Learning Representations (2019)

  312. Fiez, T., Chasnov, B., Ratliff, L.J.: Convergence of learning dynamics in Stackelberg games (2019). arXiv preprint arXiv:1906.01217

  313. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning, pp. 363–372 (2018)

  314. Sanjabi, M., Razaviyayn, M., Lee, J.D.: Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition (2018). arXiv preprint arXiv:1812.02878

  315. Nouiehed, M., Sanjabi, M., Lee, J.D., Razaviyayn, M.: Solving a class of non-convex min-max games using iterative first order methods (2019). arXiv preprint arXiv:1902.08297

  316. Mazumdar, E., Ratliff, L.J., Jordan, M.I., Sastry, S.S.: Policy-gradient algorithms have no guarantees of convergence in continuous action and state multi-agent settings (2019). arXiv preprint arXiv:1907.03712

  317. Chen, X., Deng, X., Teng, S.H.: Settling the complexity of computing two-player Nash equilibria. J. ACM 56(3), 14 (2009)

  318. Greenwald, A., Hall, K., Serrano, R.: Correlated Q-learning. In: International Conference on Machine Learning, pp. 242–249 (2003)

  319. Aumann, R.J.: Subjectivity and correlation in randomized strategies. J. Math. Econ. 1(1), 67–96 (1974)

  320. Perolat, J., Strub, F., Piot, B., Pietquin, O.: Learning Nash equilibrium for general-sum Markov games from batch data. In: International Conference on Artificial Intelligence and Statistics, pp. 232–241 (2017)

  321. Maillard, O.A., Munos, R., Lazaric, A., Ghavamzadeh, M.: Finite-sample analysis of Bellman residual minimization. In: Asian Conference on Machine Learning, pp. 299–314 (2010)

  322. Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J.N., Tuyls, K., Graepel, T.: Differentiable game mechanics. J. Mach. Learn. Res. 20(84), 1–40 (2019)

  323. Chasnov, B., Ratliff, L.J., Mazumdar, E., Burden, S.A.: Convergence analysis of gradient-based learning with non-uniform learning rates in non-cooperative multi-agent settings (2019). arXiv preprint arXiv:1906.00731

  324. Hart, S., Mas-Colell, A.: Uncoupled dynamics do not lead to Nash equilibrium. Am. Econ. Rev. 93(5), 1830–1836 (2003)

  325. Saldi, N., Başar, T., Raginsky, M.: Markov-Nash equilibria in mean-field games with discounted cost. SIAM J. Control Optim. 56(6), 4256–4287 (2018)

  326. Saldi, N., Başar, T., Raginsky, M.: Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Math. Oper. Res. (2019)

  327. Saldi, N.: Discrete-time average-cost mean-field games on Polish spaces (2019). arXiv preprint arXiv:1908.08793

  328. Saldi, N., Başar, T., Raginsky, M.: Discrete-time risk-sensitive mean-field games (2018). arXiv preprint arXiv:1808.03929

  329. Guo, X., Hu, A., Xu, R., Zhang, J.: Learning mean-field games (2019). arXiv preprint arXiv:1901.09585

  330. Fu, Z., Yang, Z., Chen, Y., Wang, Z.: Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games (2019). arXiv preprint arXiv:1910.07498

  331. Hadikhanloo, S., Silva, F.J.: Finite mean field games: fictitious play and convergence to a first order continuous mean field game. J. Math. Pures Appl. (2019)

  332. Elie, R., Pérolat, J., Laurière, M., Geist, M., Pietquin, O.: Approximate fictitious play for mean field games (2019). arXiv preprint arXiv:1907.02633

  333. Anahtarci, B., Kariksiz, C.D., Saldi, N.: Value iteration algorithm for mean-field games (2019). arXiv preprint arXiv:1909.01758

  334. Zaman, M.A.u., Zhang, K., Miehling, E., Başar, T.: Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. Submitted to IEEE American Control Conference (2020)

  335. Yang, B., Liu, M.: Keeping in touch with collaborative UAVs: a deep reinforcement learning approach. In: International Joint Conference on Artificial Intelligence, pp. 562–568 (2018)

  336. Pham, H.X., La, H.M., Feil-Seifer, D., Nefian, A.: Cooperative and distributed reinforcement learning of drones for field coverage (2018). arXiv preprint arXiv:1803.07250

  337. Tožička, J., Szulyovszky, B., de Chambrier, G., Sarwal, V., Wani, U., Gribulis, M.: Application of deep reinforcement learning to UAV fleet control. In: SAI Intelligent Systems Conference, pp. 1169–1177 (2018)

  338. Shamsoshoara, A., Khaledi, M., Afghah, F., Razi, A., Ashdown, J.: Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning. In: IEEE Annual Consumer Communications & Networking Conference, pp. 1–6 (2019)

  339. Cui, J., Liu, Y., Nallanathan, A.: The application of multi-agent reinforcement learning in UAV networks. In: IEEE International Conference on Communications Workshops, pp. 1–6 (2019)

  340. Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L.: Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access (2019)

  341. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  342. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  343. Hausknecht, M., Stone, P.: Deep recurrent Q-learning for partially observable MDPs. In: AAAI Fall Symposium Series (2015)

  344. Jorge, E., Kågebäck, M., Johansson, F.D., Gustavsson, E.: Learning to play Guess Who? and inventing a grounded language as a consequence (2016). arXiv preprint arXiv:1611.03218

  345. Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, pp. 2244–2252 (2016)

  346. Havrylov, S., Titov, I.: Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In: Advances in Neural Information Processing Systems, pp. 2149–2159 (2017)

  347. Das, A., Kottur, S., Moura, J.M., Lee, S., Batra, D.: Learning cooperative visual dialog agents with deep reinforcement learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2951–2960 (2017)

  348. Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., Wang, J.: Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games (2017). arXiv preprint arXiv:1703.10069

  349. Mordatch, I., Abbeel, P.: Emergence of grounded compositional language in multi-agent populations. In: AAAI Conference on Artificial Intelligence (2018)

  350. Jiang, J., Lu, Z.: Learning attentional communication for multi-agent cooperation. In: Advances in Neural Information Processing Systems, pp. 7254–7264 (2018)

  351. Jiang, J., Dun, C., Lu, Z.: Graph convolutional reinforcement learning for multi-agent cooperation (2018). arXiv preprint arXiv:1810.09202

  352. Celikyilmaz, A., Bosselut, A., He, X., Choi, Y.: Deep communicating agents for abstractive summarization (2018). arXiv preprint arXiv:1803.10357

  353. Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., Pineau, J.: TarMAC: targeted multi-agent communication (2018). arXiv preprint arXiv:1810.11187

  354. Lazaridou, A., Hermann, K.M., Tuyls, K., Clark, S.: Emergence of linguistic communication from referential games with symbolic and pixel input (2018). arXiv preprint arXiv:1804.03984

  355. Cogswell, M., Lu, J., Lee, S., Parikh, D., Batra, D.: Emergence of compositional language with deep generational transmission (2019). arXiv preprint arXiv:1904.09067

  356. Allis, L.: Searching for solutions in games and artificial intelligence. Ph.D. thesis, Maastricht University (1994)

  357. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  358. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., Hassabis, D.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)

  359. Billings, D., Davidson, A., Schaeffer, J., Szafron, D.: The challenge of Poker. Artif. Intell. 134(1–2), 201–240 (2002)

  360. Kuhn, H.W.: A simplified two-person Poker. Contrib. Theory Games 1, 97–103 (1950)

  361. Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., Rayner, C.: Bayes’ bluff: opponent modelling in Poker. In: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pp. 550–558. AUAI Press (2005)

  362. Bowling, M., Burch, N., Johanson, M., Tammelin, O.: Heads-up limit hold’em Poker is solved. Science 347(6218), 145–149 (2015)

  363. Heinrich, J., Silver, D.: Smooth UCT search in computer Poker. In: 24th International Joint Conference on Artificial Intelligence (2015)

  364. Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., Bowling, M.: DeepStack: expert-level artificial intelligence in heads-up no-limit Poker. Science 356(6337), 508–513 (2017)

  365. Brown, N., Sandholm, T.: Superhuman AI for heads-up no-limit Poker: Libratus beats top professionals. Science 359(6374), 418–424 (2018)

  366. Burch, N., Johanson, M., Bowling, M.: Solving imperfect information games using decomposition. In: 28th AAAI Conference on Artificial Intelligence (2014)

  367. Moravčík, M., Schmid, M., Ha, K., Hladík, M., Gaukrodger, S.J.: Refining subgames in large imperfect information games. In: 30th AAAI Conference on Artificial Intelligence (2016)

  368. Brown, N., Sandholm, T.: Safe and nested subgame solving for imperfect-information games. In: Advances in Neural Information Processing Systems, pp. 689–699 (2017)

  369. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al.: StarCraft II: a new challenge for reinforcement learning (2017). arXiv preprint arXiv:1708.04782

  370. Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)

  371. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)

  372. Lerer, A., Peysakhovich, A.: Maintaining cooperation in complex social dilemmas using deep reinforcement learning (2017). arXiv preprint arXiv:1707.01068

  373. Hughes, E., Leibo, J.Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A.G., Dunning, I., Zhu, T., McKee, K., Koster, R., et al.: Inequity aversion improves cooperation in intertemporal social dilemmas. In: Advances in Neural Information Processing Systems, pp. 3326–3336 (2018)

  374. Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference learning converges to global optima (2019). arXiv preprint arXiv:1905.10027

  375. Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: implicit acceleration by overparameterization (2018). arXiv preprint arXiv:1802.06509

  376. Li, Y., Liang, Y.: Learning overparameterized neural networks via stochastic gradient descent on structured data. In: Advances in Neural Information Processing Systems, pp. 8157–8166 (2018)

  377. Brafman, R.I., Tennenholtz, M.: A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. Artif. Intell. 121(1–2), 31–47 (2000)

  378. Brafman, R.I., Tennenholtz, M.: R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 3, 213–231 (2002)

  379. Tu, S., Recht, B.: The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint (2018). arXiv preprint arXiv:1812.03565

  380. Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J.: Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In: Conference on Learning Theory, pp. 2898–2933 (2019)

  381. Lin, Q., Liu, M., Rafique, H., Yang, T.: Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality (2018). arXiv preprint arXiv:1810.10207

  382. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)

  383. Chen, Y., Su, L., Xu, J.: Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1(2), 44 (2017)

  384. Yin, D., Chen, Y., Ramchandran, K., Bartlett, P.: Byzantine-robust distributed learning: towards optimal statistical rates (2018). arXiv preprint arXiv:1803.01498

Author information

Correspondence to Tamer Başar.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this chapter

Zhang, K., Yang, Z., Başar, T. (2021). Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_12
