Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

Chapter in Handbook of Reinforcement Learning and Control

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 325)

Abstract

Recent years have witnessed significant advances in reinforcement learning (RL), which has registered tremendous success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one agent, which naturally falls into the realm of multi-agent RL (MARL), a domain with a relatively long history that has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, MARL still lacks solid theoretical foundations in much of the literature. In this chapter, we provide a selective overview of MARL, with a focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results for MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, and the (non-)convergence of policy-based methods for learning in games. Some of these new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field, to identify fruitful future directions for theoretical studies of MARL. We expect this chapter to serve as a continuing stimulus for researchers interested in working on this exciting yet challenging topic.

Writing of this chapter was supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Air Force Office of Scientific Research (AFOSR) Grant FA9550-19-1-0353.

Notes

  1. Hereafter, we will use agent and player interchangeably.

  2. Note that there are several other standard formulations of MDPs, e.g., the time-average-reward setting and the finite-horizon episodic setting. Here, we only present the classical infinite-horizon discounted setting for ease of exposition; the objective is displayed after these notes.

  3. The partially observed MDP (POMDP) model is usually advocated when the agent has no access to the exact system state but only an observation of the state. See [45, 46] for more details on the POMDP model.

  4. Similar to the single-agent setting, here we only introduce the infinite-horizon discounted setting for simplicity, though other settings of MGs, e.g., the time-average-reward setting and the finite-horizon episodic setting, also exist [85].

  5. Here, we focus only on stationary Markov Nash equilibria for the infinite-horizon discounted MGs considered; a formal statement is displayed after these notes.

  6. Partially observed Markov games under the cooperative setting are usually formulated as decentralized POMDP (Dec-POMDP) problems. See Sect. 12.4.1.3 for more discussions on this setting.

  7. The difference between mean-field teams and mean-field games lies mainly in the solution concept, optimum versus equilibrium, mirroring the difference between general dynamic team theory [88, 173, 174] and game theory [82, 85]. Although the former can be viewed as a special case of the latter, related works are usually reviewed separately in the literature; we follow that convention here.

  8. Note that hereafter we use decentralized and distributed interchangeably when describing this paradigm.
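
For concreteness, the infinite-horizon discounted criterion referenced in Notes 2, 4, and 5 can be spelled out as follows. This is the standard textbook formulation; the notation (value function \(V\), discount factor \(\gamma \in (0,1)\), reward \(r\), joint policy \(\pi = (\pi^1, \dots, \pi^N)\)) is chosen here for illustration rather than quoted from the chapter body. In the single-agent MDP setting of Note 2, a policy \(\pi\) is evaluated by

\[ V^{\pi}(s) \;=\; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t) \right]. \]

In an \(N\)-agent Markov game (Notes 4 and 5), each agent \(i\) has its own reward \(r^i\) and value function \(V^{i}_{\pi^i, \pi^{-i}}\), where \(\pi^{-i}\) collects the policies of all agents other than \(i\). A stationary Markov Nash equilibrium is then a joint policy \((\pi^{1,*}, \dots, \pi^{N,*})\) of stationary Markov policies from which no agent can gain by unilateral deviation:

\[ V^{i}_{\pi^{i,*},\, \pi^{-i,*}}(s) \;\ge\; V^{i}_{\pi^{i},\, \pi^{-i,*}}(s) \quad \text{for all agents } i,\ \text{all states } s,\ \text{and all policies } \pi^{i}. \]

In the fully cooperative (team) case, all agents share a common reward and the solution concept reduces to a team optimum, \(\max_{\pi} V^{\pi}(s)\); this optimum-versus-equilibrium contrast is precisely the distinction Note 7 draws between mean-field teams and mean-field games.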

References

  1. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

  2. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)

  3. OpenAI: OpenAI Five. https://blog.openai.com/openai-five/ (2018)

  4. Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W.M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., Silver, D.: AlphaStar: mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ (2019)

  5. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: a survey. Int. J. Robot. Res. 32(11), 1238–1274 (2013)

  6. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. In: International Conference on Learning Representations (2016)

  7. Brown, N., Sandholm, T.: Libratus: the superhuman AI for no-limit Poker. In: International Joint Conference on Artificial Intelligence, pp. 5226–5228 (2017)

  8. Brown, N., Sandholm, T.: Superhuman AI for multiplayer poker. Science 365, 885–890 (2019)

  9. Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving (2016). arXiv preprint arXiv:1610.03295

  10. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)

  11. Busoniu, L., Babuska, R., De Schutter, B., et al.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C 38(2), 156–172 (2008)

  12. Adler, J.L., Blue, V.J.: A cooperative multi-agent transportation management and route guidance system. Transp. Res. Part C: Emerg. Technol. 10(5), 433–454 (2002)

  13. Wang, S., Wan, J., Zhang, D., Li, D., Zhang, C.: Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Comput. Netw. 101, 158–168 (2016)

  14. Jangmin, O., Lee, J.W., Zhang, B.T.: Stock trading system using reinforcement learning with cooperative agents. In: International Conference on Machine Learning, pp. 451–458 (2002)

  15. Lee, J.W., Park, J., Jangmin, O., Lee, J., Hong, E.: A multiagent approach to \(Q \)-learning for daily stock trading. IEEE Trans. Syst. Man Cybern.-Part A: Syst. Hum. 37(6), 864–877 (2007)

  16. Cortes, J., Martinez, S., Karatas, T., Bullo, F.: Coverage control for mobile sensing networks. IEEE Trans. Robot. Autom. 20(2), 243–255 (2004)

  17. Choi, J., Oh, S., Horowitz, R.: Distributed learning and cooperative control for multi-agent systems. Automatica 45(12), 2802–2814 (2009)

  18. Castelfranchi, C.: The theory of social functions: challenges for computational social science and multi-agent learning. Cogn. Syst. Res. 2(1), 5–38 (2001)

  19. Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 464–473 (2017)

  20. Hernandez-Leal, P., Kartal, B., Taylor, M.E.: A survey and critique of multiagent deep reinforcement learning (2018). arXiv preprint arXiv:1810.05587

  21. Foerster, J., Assael, Y.M., de Freitas, N., Whiteson, S.: Learning to communicate with deep multi-agent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 2137–2145 (2016)

  22. Zazo, S., Macua, S.V., Sánchez-Fernández, M., Zazo, J.: Dynamic potential games with constraints: fundamentals and applications in communications. IEEE Trans. Signal Process. 64(14), 3806–3821 (2016)

  23. Zhang, K., Yang, Z., Liu, H., Zhang, T., Başar, T.: Fully decentralized multi-agent reinforcement learning with networked agents. In: International Conference on Machine Learning, pp. 5867–5876 (2018)

  24. Subramanian, J., Mahajan, A.: Reinforcement learning in stationary mean-field games. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 251–259 (2019)

  25. Heinrich, J., Silver, D.: Deep reinforcement learning from self-play in imperfect-information games (2016). arXiv preprint arXiv:1603.01121

  26. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379–6390 (2017)

  27. Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients (2017). arXiv preprint arXiv:1705.08926

  28. Gupta, J.K., Egorov, M., Kochenderfer, M.: Cooperative multi-agent control using deep reinforcement learning. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 66–83 (2017)

  29. Omidshafiei, S., Pazis, J., Amato, C., How, J.P., Vian, J.: Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In: International Conference on Machine Learning, pp. 2681–2690 (2017)

  30. Kawamura, K., Mizukami, N., Tsuruoka, Y.: Neural fictitious self-play in imperfect information games with many players. In: Workshop on Computer Games, pp. 61–74 (2017)

  31. Zhang, L., Wang, W., Li, S., Pan, G.: Monte Carlo neural fictitious self-play: Approach to approximate Nash equilibrium of imperfect-information games (2019). arXiv preprint arXiv:1903.09569

  32. Mazumdar, E., Ratliff, L.J.: On the convergence of gradient-based learning in continuous games (2018). arXiv preprint arXiv:1804.05464

  33. Jin, C., Netrapalli, P., Jordan, M.I.: Minmax optimization: stable limit points of gradient descent ascent are locally optimal (2019). arXiv preprint arXiv:1902.00618

  34. Zhang, K., Yang, Z., Başar, T.: Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. In: Advances in Neural Information Processing Systems (2019)

  35. Sidford, A., Wang, M., Yang, L.F., Ye, Y.: Solving discounted stochastic two-player games with near-optimal time and sample complexity (2019). arXiv preprint arXiv:1908.11071

  36. Oliehoek, F.A., Amato, C.: A Concise Introduction to Decentralized POMDPs, vol. 1. Springer, Berlin (2016)

  37. Arslan, G., Yüksel, S.: Decentralized Q-learning for stochastic teams and games. IEEE Trans. Autom. Control 62(4), 1545–1558 (2017)

  38. Yongacoglu, B., Arslan, G., Yüksel, S.: Learning team-optimality for decentralized stochastic control and dynamic games (2019). arXiv preprint arXiv:1903.05812

  39. Zhang, K., Miehling, E., Başar, T.: Online planning for decentralized stochastic control with partial history sharing. In: IEEE American Control Conference, pp. 167–172 (2019)

  40. Hernandez-Leal, P., Kaisers, M., Baarslag, T., de Cote, E.M.: A survey of learning in multiagent environments: dealing with non-stationarity (2017). arXiv preprint arXiv:1707.09183

  41. Nguyen, T.T., Nguyen, N.D., Nahavandi, S.: Deep reinforcement learning for multi-agent systems: a review of challenges, solutions and applications (2018). arXiv preprint arXiv:1812.11794

  42. Oroojlooy Jadid, A., Hajinezhad, D.: A review of cooperative multi-agent deep reinforcement learning (2019). arXiv preprint arXiv:1908.03963

  43. Zhang, K., Yang, Z., Başar, T.: Networked multi-agent reinforcement learning in continuous spaces. In: IEEE Conference on Decision and Control, pp. 2771–2776 (2018)

  44. Zhang, K., Yang, Z., Liu, H., Zhang, T., Başar, T.: Finite-sample analyses for fully decentralized multi-agent reinforcement learning (2018). arXiv preprint arXiv:1812.02783

  45. Monahan, G.E.: State of the art-a survey of partially observable Markov decision processes: theory, models, and algorithms. Manag. Sci. 28(1), 1–16 (1982)

  46. Cassandra, A.R.: Exact and approximate algorithms for partially observable Markov decision processes. Brown University (1998)

  47. Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Belmont (2005)

  48. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)

  49. Szepesvári, C., Littman, M.L.: A unified analysis of value-function-based reinforcement-learning algorithms. Neural Comput. 11(8), 2017–2060 (1999)

  50. Singh, S., Jaakkola, T., Littman, M.L., Szepesvári, C.: Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 38(3), 287–308 (2000)

  51. Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I.: An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53(1), 126–139 (2005)

  52. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: European Conference on Machine Learning, pp. 282–293. Springer (2006)

  53. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games, pp. 72–83 (2006)

  54. Agrawal, R.: Sample mean based index policies by \(O(\log n)\) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4), 1054–1078 (1995)

  55. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)

  56. Jiang, D., Ekwedike, E., Liu, H.: Feedback-based tree search for reinforcement learning. In: International Conference on Machine Learning, pp. 2284–2293 (2018)

  57. Shah, D., Xie, Q., Xu, Z.: On reinforcement learning using Monte-Carlo tree search with supervised learning: non-asymptotic analysis (2019). arXiv preprint arXiv:1902.05213

  58. Tesauro, G.: Temporal difference learning and TD-Gammon. Commun. ACM 38(3), 58–68 (1995)

  59. Tsitsiklis, J.N., Van Roy, B.: Analysis of temporal-difference learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1075–1081 (1997)

  60. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)

  61. Sutton, R.S., Szepesvári, C., Maei, H.R.: A convergent \(O(n)\) algorithm for off-policy temporal-difference learning with linear function approximation. In: Advances in Neural Information Processing Systems, vol. 21(21), pp. 1609–1616 (2008)

  62. Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora, E.: Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: International Conference on Machine Learning, pp. 993–1000 (2009)

  63. Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., Petrik, M.: Finite-sample analysis of proximal gradient TD algorithms. In: Conference on Uncertainty in Artificial Intelligence, pp. 504–513 (2015)

  64. Bhatnagar, S., Precup, D., Silver, D., Sutton, R.S., Maei, H.R., Szepesvári, C.: Convergent temporal-difference learning with arbitrary smooth function approximation. In: Advances in Neural Information Processing Systems, pp. 1204–1212 (2009)

  65. Dann, C., Neumann, G., Peters, J., et al.: Policy evaluation with temporal differences: a survey and comparison. J. Mach. Learn. Res. 15, 809–883 (2014)

  66. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)

  67. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)

  68. Baxter, J., Bartlett, P.L.: Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. 15, 319–350 (2001)

  69. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)

  70. Bhatnagar, S., Sutton, R., Ghavamzadeh, M., Lee, M.: Natural actor-critic algorithms. Automatica 45(11), 2471–2482 (2009)

  71. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International Conference on Machine Learning, pp. 387–395 (2014)

  72. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms (2017). arXiv preprint arXiv:1707.06347

  73. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning, pp. 1889–1897 (2015)

  74. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor (2018). arXiv preprint arXiv:1801.01290

  75. Yang, Z., Zhang, K., Hong, M., Başar, T.: A finite sample analysis of the actor-critic algorithm. In: IEEE Conference on Decision and Control, pp. 2759–2764 (2018)

  76. Zhang, K., Koppel, A., Zhu, H., Başar, T.: Global convergence of policy gradient methods to (almost) locally optimal policies (2019). arXiv preprint arXiv:1906.08383

  77. Agarwal, A., Kakade, S.M., Lee, J.D., Mahajan, G.: Optimality and approximation with policy gradient methods in Markov decision processes (2019). arXiv preprint arXiv:1908.00261

  78. Liu, B., Cai, Q., Yang, Z., Wang, Z.: Neural proximal/trust region policy optimization attains globally optimal policy (2019). arXiv preprint arXiv:1906.10306

  79. Wang, L., Cai, Q., Yang, Z., Wang, Z.: Neural policy gradient methods: global optimality and rates of convergence (2019). arXiv preprint arXiv:1909.01150

  80. Chen, Y., Wang, M.: Stochastic primal-dual methods and sample complexity of reinforcement learning (2016). arXiv preprint arXiv:1612.02516

  81. Wang, M.: Primal-dual \(\pi \) learning: sample complexity and sublinear run time for ergodic Markov decision problems (2017). arXiv preprint arXiv:1710.06100

  82. Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. 39(10), 1095–1100 (1953)

  83. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 157–163 (1994)

  84. Başar, T., Olsder, G.J.: Dynamic Noncooperative Game Theory, vol. 23. SIAM, Philadelphia (1999)

  85. Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer Science & Business Media, Berlin (2012)

  86. Boutilier, C.: Planning, learning and coordination in multi-agent decision processes. In: Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210 (1996)

  87. Lauer, M., Riedmiller, M.: An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In: International Conference on Machine Learning (2000)

  88. Yoshikawa, T.: Decomposition of dynamic team decision problems. IEEE Trans. Autom. Control 23(4), 627–632 (1978)

  89. Ho, Y.C.: Team decision theory and information structures. Proc. IEEE 68(6), 644–654 (1980)

  90. Wang, X., Sandholm, T.: Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In: Advances in Neural Information Processing Systems, pp. 1603–1610 (2003)

  91. Mahajan, A.: Sequential decomposition of sequential dynamic teams: applications to real-time communication and networked control systems. Ph.D. thesis, University of Michigan (2008)

  92. González-Sánchez, D., Hernández-Lerma, O.: Discrete-Time Stochastic Control and Dynamic Potential Games: The Euler-Equation Approach. Springer Science & Business Media, Berlin (2013)

  93. Valcarcel Macua, S., Zazo, J., Zazo, S.: Learning parametric closed-loop policies for Markov potential games. In: International Conference on Learning Representations (2018)

  94. Kar, S., Moura, J.M., Poor, H.V.: QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Trans. Signal Process. 61(7), 1848–1862 (2013)

  95. Doan, T., Maguluri, S., Romberg, J.: Finite-time analysis of distributed TD (0) with linear function approximation on multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 1626–1635 (2019)

  96. Wai, H.T., Yang, Z., Wang, Z., Hong, M.: Multi-agent reinforcement learning via double averaging primal-dual optimization. In: Advances in Neural Information Processing Systems, pp. 9649–9660 (2018)

  97. OpenAI: OpenAI Dota 2 1v1 bot. https://openai.com/the-international/ (2017)

  98. Jacobson, D.: Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Trans. Autom. Control 18(2), 124–131 (1973)

  99. Başar, T., Bernhard, P.: H\(_\infty \) Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Birkhäuser, Boston (1995)

  100. Zhang, K., Hu, B., Başar, T.: Policy optimization for \(\cal{H}_2\) linear control with \(\cal{H}_{\infty }\) robustness guarantee: implicit regularization and global convergence (2019). arXiv preprint arXiv:1910.09496

  101. Hu, J., Wellman, M.P.: Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 4, 1039–1069 (2003)

  102. Littman, M.L.: Friend-or-foe Q-learning in general-sum games. In: International Conference on Machine Learning, pp. 322–328 (2001)

  103. Lagoudakis, M.G., Parr, R.: Learning in zero-sum team Markov games using factored value functions. In: Advances in Neural Information Processing Systems, pp. 1659–1666 (2003)

  104. Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S.: The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 27(4), 819–840 (2002)

  105. Osborne, M.J., Rubinstein, A.: A Course in Game Theory. MIT Press, Cambridge (1994)

  106. Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge (2008)

  107. Koller, D., Megiddo, N.: The complexity of two-person zero-sum games in extensive form. Games Econ. Behav. 4(4), 528–552 (1992)

  108. Kuhn, H.: Extensive games and the problem of information. Contrib. Theory Games 2, 193–216 (1953)

  109. Zinkevich, M., Johanson, M., Bowling, M., Piccione, C.: Regret minimization in games with incomplete information. In: Advances in Neural Information Processing Systems, pp. 1729–1736 (2008)

  110. Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805–813 (2015)

  111. Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., Bowling, M.: Actor-critic policy optimization in partially observable multiagent environments. In: Advances in Neural Information Processing Systems, pp. 3422–3435 (2018)

  112. Omidshafiei, S., Hennes, D., Morrill, D., Munos, R., Perolat, J., Lanctot, M., Gruslys, A., Lespiau, J.B., Tuyls, K.: Neural replicator dynamics (2019). arXiv preprint arXiv:1906.00190

  113. Rubin, J., Watson, I.: Computer Poker: a review. Artif. Intell. 175(5–6), 958–987 (2011)

  114. Lanctot, M., Lockhart, E., Lespiau, J.B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al.: OpenSpiel: a framework for reinforcement learning in games (2019). arXiv preprint arXiv:1908.09453

  115. Claus, C., Boutilier, C.: The dynamics of reinforcement learning in cooperative multiagent systems. In: AAAI Conference on Artificial Intelligence, pp. 746–752 (1998)

  116. Bowling, M., Veloso, M.: Rational and convergent learning in stochastic games. In: International Joint Conference on Artificial Intelligence, vol. 17, pp. 1021–1026 (2001)

  117. Kapetanakis, S., Kudenko, D.: Reinforcement learning of coordination in cooperative multi-agent systems. In: AAAI Conference on Artificial Intelligence, pp. 326–331 (2002)

  118. Conitzer, V., Sandholm, T.: Awesome: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Mach. Learn. 67(1–2), 23–43 (2007)

  119. Hansen, E.A., Bernstein, D.S., Zilberstein, S.: Dynamic programming for partially observable stochastic games. In: AAAI Conference on Artificial Intelligence, pp. 709–715 (2004)

  120. Amato, C., Chowdhary, G., Geramifard, A., Üre, N.K., Kochenderfer, M.J.: Decentralized control of partially observable Markov decision processes. In: IEEE Conference on Decision and Control, pp. 2398–2405 (2013)

  121. Amato, C., Oliehoek, F.A.: Scalable planning and learning for multiagent POMDPs. In: AAAI Conference on Artificial Intelligence (2015)

  122. Shoham, Y., Powers, R., Grenager, T.: Multi-agent reinforcement learning: a critical survey. Technical Report (2003)

  123. Zinkevich, M., Greenwald, A., Littman, M.L.: Cyclic equilibria in Markov games. In: Advances in Neural Information Processing Systems, pp. 1641–1648 (2006)

  124. Bowling, M., Veloso, M.: Multiagent learning using a variable learning rate. Artif. Intell. 136(2), 215–250 (2002)

  125. Bowling, M.: Convergence and no-regret in multiagent learning. In: Advances in Neural Information Processing Systems, pp. 209–216 (2005)

  126. Blum, A., Mansour, Y.: Learning, regret minimization, and equilibria. In: Algorithmic Game Theory, pp. 79–102 (2007)

  127. Hart, S., Mas-Colell, A.: A reinforcement procedure leading to correlated equilibrium. In: Economics Essays, pp. 181–200. Springer, Berlin (2001)

  128. Kasai, T., Tenmoto, H., Kamiya, A.: Learning of communication codes in multi-agent reinforcement learning problem. In: IEEE Conference on Soft Computing in Industrial Applications, pp. 1–6 (2008)

  129. Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., Yi, Y.: Learning to schedule communication in multi-agent reinforcement learning. In: International Conference on Learning Representations (2019)

  130. Chen, T., Zhang, K., Giannakis, G.B., Başar, T.: Communication-efficient distributed reinforcement learning (2018). arXiv preprint arXiv:1812.03239

  131. Lin, Y., Zhang, K., Yang, Z., Wang, Z., Başar, T., Sandhu, R., Liu, J.: A communication-efficient multi-agent actor-critic algorithm for distributed reinforcement learning. In: IEEE Conference on Decision and Control (2019)

  132. Ren, J., Haupt, J.: A communication efficient hierarchical distributed optimization algorithm for multi-agent reinforcement learning. In: Real-World Sequential Decision Making Workshop at International Conference on Machine Learning (2019)

  133. Kim, W., Cho, M., Sung, Y.: Message-dropout: an efficient training method for multi-agent deep reinforcement learning. In: AAAI Conference on Artificial Intelligence (2019)

  134. He, H., Boyd-Graber, J., Kwok, K., Daumé III, H.: Opponent modeling in deep reinforcement learning. In: International Conference on Machine Learning, pp. 1804–1813 (2016)

  135. Grover, A., Al-Shedivat, M., Gupta, J., Burda, Y., Edwards, H.: Learning policy representations in multiagent systems. In: International Conference on Machine Learning, pp. 1802–1811 (2018)

  136. Gao, C., Mueller, M., Hayward, R.: Adversarial policy gradient for alternating Markov games. In: Workshop at International Conference on Learning Representations (2018)

  137. Li, S., Wu, Y., Cui, X., Dong, H., Fang, F., Russell, S.: Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In: AAAI Conference on Artificial Intelligence (2019)

  138. Zhang, X., Zhang, K., Miehling, E., Basar, T.: Non-cooperative inverse reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 9482–9493 (2019)

  139. Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents. In: International Conference on Machine Learning, pp. 330–337 (1993)

  140. Matignon, L., Laurent, G.J., Le Fort-Piat, N.: Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. Knowl. Eng. Rev. 27(1), 1–31 (2012)

  141. Foerster, J., Nardelli, N., Farquhar, G., Torr, P., Kohli, P., Whiteson, S., et al.: Stabilising experience replay for deep multi-agent reinforcement learning. In: International Conference of Machine Learning, pp. 1146–1155 (2017)

  142. Tuyls, K., Weiss, G.: Multiagent learning: basics, challenges, and prospects. AI Mag. 33(3), 41 (2012)

  143. Guestrin, C., Lagoudakis, M., Parr, R.: Coordinated reinforcement learning. In: International Conference on Machine Learning, pp. 227–234 (2002)

  144. Guestrin, C., Koller, D., Parr, R.: Multiagent planning with factored MDPs. In: Advances in Neural Information Processing Systems, pp. 1523–1530 (2002)

  145. Kok, J.R., Vlassis, N.: Sparse cooperative Q-learning. In: International Conference on Machine learning, pp. 61–69 (2004)

  146. Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., et al.: Value-decomposition networks for cooperative multi-agent learning based on team reward. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 2085–2087 (2018)

  147. Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., Whiteson, S.: QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 681–689 (2018)

  148. Qu, G., Li, N.: Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure. In: IEEE Conference on Decision and Control (2019)

  149. Mahajan, A.: Optimal decentralized control of coupled subsystems with control sharing. IEEE Trans. Autom. Control 58(9), 2377–2382 (2013)

  150. Oliehoek, F.A., Amato, C.: Dec-POMDPs as non-observable MDPs. IAS Technical Report (IAS-UVA-14-01) (2014)

  151. Foerster, J.N., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. In: AAAI Conference on Artificial Intelligence (2018)

  152. Dibangoye, J., Buffet, O.: Learning to act in decentralized partially observable MDPs. In: International Conference on Machine Learning, pp. 1233–1242 (2018)

  153. Kraemer, L., Banerjee, B.: Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, 82–94 (2016)

  154. Macua, S.V., Chen, J., Zazo, S., Sayed, A.H.: Distributed policy evaluation under multiple behavior strategies. IEEE Trans. Autom. Control 60(5), 1260–1274 (2015)

  155. Macua, S.V., Tukiainen, A., Hernández, D.G.O., Baldazo, D., de Cote, E.M., Zazo, S.: Diff-dac: Distributed actor-critic for average multitask deep reinforcement learning (2017). arXiv preprint arXiv:1710.10363

  156. Lee, D., Yoon, H., Hovakimyan, N.: Primal-dual algorithm for distributed reinforcement learning: distributed GTD. In: IEEE Conference on Decision and Control, pp. 1967–1972 (2018)

  157. Doan, T.T., Maguluri, S.T., Romberg, J.: Finite-time performance of distributed temporal difference learning with linear function approximation (2019). arXiv preprint arXiv:1907.12530

  158. Suttle, W., Yang, Z., Zhang, K., Wang, Z., Başar, T., Liu, J.: A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning (2019). arXiv preprint arXiv:1903.06372

  159. Littman, M.L.: Value-function reinforcement learning in Markov games. Cogn. Syst. Res. 2(1), 55–66 (2001)

  160. Young, H.P.: The evolution of conventions. Econometrica: J. Econ. Soc. 57–84 (1993)

  161. Son, K., Kim, D., Kang, W.J., Hostallero, D.E., Yi, Y.: QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 5887–5896 (2019)

  162. Perolat, J., Piot, B., Pietquin, O.: Actor-critic fictitious play in simultaneous move multistage games. In: International Conference on Artificial Intelligence and Statistics (2018)

  163. Monderer, D., Shapley, L.S.: Potential games. Games Econ. Behav. 14(1), 124–143 (1996)

  164. Başar, T., Zaccour, G.: Handbook of Dynamic Game Theory. Springer, Berlin (2018)

  165. Huang, M., Caines, P.E., Malhamé, R.P.: Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In: IEEE Conference on Decision and Control, pp. 98–103 (2003)

  166. Huang, M., Malhamé, R.P., Caines, P.E., et al.: Large population stochastic dynamic games: closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst. 6(3), 221–252 (2006)

  167. Lasry, J.M., Lions, P.L.: Mean field games. Jpn. J. Math. 2(1), 229–260 (2007)

  168. Bensoussan, A., Frehse, J., Yam, P., et al.: Mean Field Games and Mean Field Type Control Theory, vol. 101. Springer, Berlin (2013)

  169. Tembine, H., Zhu, Q., Başar, T.: Risk-sensitive mean-field games. IEEE Trans. Autom. Control 59(4), 835–850 (2013)

  170. Arabneydi, J., Mahajan, A.: Team optimal control of coupled subsystems with mean-field sharing. In: IEEE Conference on Decision and Control, pp. 1669–1674 (2014)

  171. Arabneydi, J.: New concepts in team theory: Mean field teams and reinforcement learning. Ph.D. thesis, McGill University (2017)

  172. Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., Wang, J.: Mean field multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 5571–5580 (2018)

  173. Witsenhausen, H.S.: Separation of estimation and control for discrete time systems. Proc. IEEE 59(11), 1557–1566 (1971)

  174. Yüksel, S., Başar, T.: Stochastic Networked Control Systems: Stabilization and Optimization Under Information Constraints. Springer Science & Business Media, Berlin (2013)

  175. Subramanian, J., Seraj, R., Mahajan, A.: Reinforcement learning for mean-field teams. In: Workshop on Adaptive and Learning Agents at International Conference on Autonomous Agents and Multi-Agent Systems (2018)

  176. Arabneydi, J., Mahajan, A.: Linear quadratic mean field teams: optimal and approximately optimal decentralized solutions (2016). arXiv preprint arXiv:1609.00056

  177. Carmona, R., Laurière, M., Tan, Z.: Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods (2019). arXiv preprint arXiv:1910.04295

  178. Carmona, R., Laurière, M., Tan, Z.: Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning (2019). arXiv preprint arXiv:1910.12802

  179. Rabbat, M., Nowak, R.: Distributed optimization in sensor networks. In: International Symposium on Information Processing in Sensor Networks, pp. 20–27 (2004)

  180. Dall’Anese, E., Zhu, H., Giannakis, G.B.: Distributed optimal power flow for smart microgrids. IEEE Trans. Smart Grid 4(3), 1464–1475 (2013)

  181. Zhang, K., Shi, W., Zhu, H., Dall’Anese, E., Başar, T.: Dynamic power distribution system management with a locally connected communication network. IEEE J. Sel. Top. Signal Process. 12(4), 673–687 (2018)

  182. Zhang, K., Lu, L., Lei, C., Zhu, H., Ouyang, Y.: Dynamic operations and pricing of electric unmanned aerial vehicle systems and power networks. Transp. Res. Part C: Emerg. Technol. 92, 472–485 (2018)

  183. Corke, P., Peterson, R., Rus, D.: Networked robots: flying robot navigation using a sensor net. Robot. Res. 234–243 (2005)

  184. Zhang, K., Liu, Y., Liu, J., Liu, M., Başar, T.: Distributed learning of average belief over networks using sequential observations. Automatica (2019)

  185. Nedic, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)

  186. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)

  187. Jakovetic, D., Xavier, J., Moura, J.M.: Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication. IEEE Trans. Signal Process. 59(8), 3889–3902 (2011)

  188. Tu, S.Y., Sayed, A.H.: Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks. IEEE Trans. Signal Process. 60(12), 6217–6234 (2012)

  189. Varshavskaya, P., Kaelbling, L.P., Rus, D.: Efficient distributed reinforcement learning through agreement. In: Distributed Autonomous Robotic Systems, pp. 367–378 (2009)

  190. Ciosek, K., Whiteson, S.: Expected policy gradients for reinforcement learning (2018). arXiv preprint arXiv:1801.03326

  191. Sutton, R.S., Mahmood, A.R., White, M.: An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 17(1), 2603–2631 (2016)

  192. Yu, H.: On convergence of emphatic temporal-difference learning. In: Conference on Learning Theory, pp. 1724–1751 (2015)

  193. Zhang, Y., Zavlanos, M.M.: Distributed off-policy actor-critic reinforcement learning with policy consensus (2019). arXiv preprint arXiv:1903.09255

  194. Pennesi, P., Paschalidis, I.C.: A distributed actor-critic algorithm and applications to mobile sensor network coordination problems. IEEE Trans. Autom. Control 55(2), 492–497 (2010)

  195. Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Reinforcement Learning, pp. 45–73. Springer, Berlin (2012)

  196. Riedmiller, M.: Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In: European Conference on Machine Learning, pp. 317–328 (2005)

  197. Antos, A., Szepesvári, C., Munos, R.: Fitted Q-iteration in continuous action-space MDPs. In: Advances in Neural Information Processing Systems, pp. 9–16 (2008)

  198. Hong, M., Chang, T.H.: Stochastic proximal gradient consensus over random networks. IEEE Trans. Signal Process. 65(11), 2933–2948 (2017)

  199. Nedic, A., Olshevsky, A., Shi, W.: Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)

  200. Munos, R.: Performance bounds in \(\ell _p\)-norm for approximate value iteration. SIAM J. Control Optim. 46(2), 541–561 (2007)

  201. Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. J. Mach. Learn. Res. 9(May), 815–857 (2008)

  202. Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Mach. Learn. 71(1), 89–129 (2008)

  203. Farahmand, A.M., Szepesvári, C., Munos, R.: Error propagation for approximate policy and value iteration. In: Advances in Neural Information Processing Systems, pp. 568–576 (2010)

  204. Cassano, L., Yuan, K., Sayed, A.H.: Multi-agent fully decentralized off-policy learning with linear convergence rates (2018). arXiv preprint arXiv:1810.07792

  205. Qu, G., Li, N.: Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2017)

  206. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  207. Ying, B., Yuan, K., Sayed, A.H.: Convergence of variance-reduced learning under random reshuffling. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2286–2290 (2018)

  208. Singh, S.P., Sutton, R.S.: Reinforcement learning with replacing eligibility traces. Mach. Learn. 22(1–3), 123–158 (1996)

  209. Bhandari, J., Russo, D., Singal, R.: A finite time analysis of temporal difference learning with linear function approximation. In: Conference On Learning Theory, pp. 1691–1692 (2018)

  210. Srikant, R., Ying, L.: Finite-time error bounds for linear stochastic approximation and TD learning. In: Conference on Learning Theory, pp. 2803–2830 (2019)

  211. Stanković, M.S., Stanković, S.S.: Multi-agent temporal-difference learning with linear function approximation: weak convergence under time-varying network topologies. In: IEEE American Control Conference, pp. 167–172 (2016)

  212. Stanković, M.S., Ilić, N., Stanković, S.S.: Distributed stochastic approximation: weak convergence and network design. IEEE Trans. Autom. Control 61(12), 4069–4074 (2016)

  213. Zhang, H., Jiang, H., Luo, Y., Xiao, G.: Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE Trans. Ind. Electron. 64(5), 4091–4100 (2016)

  214. Zhang, Q., Zhao, D., Lewis, F.L.: Model-free reinforcement learning for fully cooperative multi-agent graphical games. In: International Joint Conference on Neural Networks, pp. 1–6 (2018)

  215. Bernstein, D.S., Amato, C., Hansen, E.A., Zilberstein, S.: Policy iteration for decentralized control of Markov decision processes. J. Artif. Intell. Res. 34, 89–132 (2009)

  216. Amato, C., Bernstein, D.S., Zilberstein, S.: Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Auton. Agents Multi-Agent Syst. 21(3), 293–320 (2010)

  217. Liu, M., Amato, C., Liao, X., Carin, L., How, J.P.: Stick-breaking policy learning in Dec-POMDPs. In: International Joint Conference on Artificial Intelligence (2015)

  218. Dibangoye, J.S., Amato, C., Buffet, O., Charpillet, F.: Optimally solving Dec-POMDPs as continuous-state MDPs. J. Artif. Intell. Res. 55, 443–497 (2016)

  219. Wu, F., Zilberstein, S., Chen, X.: Rollout sampling policy iteration for decentralized POMDPs. In: Conference on Uncertainty in Artificial Intelligence (2010)

  220. Wu, F., Zilberstein, S., Jennings, N.R.: Monte-Carlo expectation maximization for decentralized POMDPs. In: International Joint Conference on Artificial Intelligence (2013)

  221. Best, G., Cliff, O.M., Patten, T., Mettu, R.R., Fitch, R.: Dec-MCTS: decentralized planning for multi-robot active perception. Int. J. Robot. Res. 1–22 (2018)

  222. Amato, C., Zilberstein, S.: Achieving goals in decentralized POMDPs. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 593–600 (2009)

  223. Banerjee, B., Lyle, J., Kraemer, L., Yellamraju, R.: Sample bounded distributed reinforcement learning for decentralized POMDPs. In: AAAI Conference on Artificial Intelligence (2012)

  224. Nayyar, A., Mahajan, A., Teneketzis, D.: Decentralized stochastic control with partial history sharing: a common information approach. IEEE Trans. Autom. Control 58(7), 1644–1658 (2013)

  225. Arabneydi, J., Mahajan, A.: Reinforcement learning in decentralized stochastic control systems with partial history sharing. In: IEEE American Control Conference, pp. 5449–5456 (2015)

  226. Papadimitriou, C.H.: On inefficient proofs of existence and complexity classes. In: Annals of Discrete Mathematics, vol. 51, pp. 245–250. Elsevier (1992)

  227. Daskalakis, C., Goldberg, P.W., Papadimitriou, C.H.: The complexity of computing a Nash equilibrium. SIAM J. Comput. 39(1), 195–259 (2009)

  228. Von Neumann, J., Morgenstern, O., Kuhn, H.W.: Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, Princeton (2007)

  229. Vanderbei, R.J., et al.: Linear Programming. Springer, Berlin (2015)

  230. Hoffman, A.J., Karp, R.M.: On nonterminating stochastic games. Manag. Sci. 12(5), 359–370 (1966)

  231. Van Der Wal, J.: Discounted Markov games: generalized policy iteration method. J. Optim. Theory Appl. 25(1), 125–138 (1978)

  232. Rao, S.S., Chandrasekaran, R., Nair, K.: Algorithms for discounted stochastic games. J. Optim. Theory Appl. 11(6), 627–637 (1973)

  233. Patek, S.D.: Stochastic and shortest path games: theory and algorithms. Ph.D. thesis, Massachusetts Institute of Technology (1997)

  234. Hansen, T.D., Miltersen, P.B., Zwick, U.: Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. J. ACM 60(1), 1 (2013)

  235. Lagoudakis, M.G., Parr, R.: Value function approximation in zero-sum Markov games. In: Conference on Uncertainty in Artificial Intelligence, pp. 283–292 (2002)

  236. Zou, S., Xu, T., Liang, Y.: Finite-sample analysis for SARSA with linear function approximation (2019). arXiv preprint arXiv:1902.02234

  237. Sutton, R.S., Barto, A.G.: A temporal-difference model of classical conditioning. In: Proceedings of the Annual Conference of the Cognitive Science Society, pp. 355–378 (1987)

  238. Al-Tamimi, A., Abu-Khalaf, M., Lewis, F.L.: Adaptive critic designs for discrete-time zero-sum games with application to \(\cal{H}_\infty \) control. IEEE Trans. Syst. Man Cybern. Part B 37(1), 240–247 (2007)

  239. Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Model-free Q-learning designs for linear discrete-time zero-sum games with application to \(\cal{H}_\infty \) control. Automatica 43(3), 473–481 (2007)

  240. Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., Mannor, S.: Regularized policy iteration with nonparametric function spaces. J. Mach. Learn. Res. 17(1), 4809–4874 (2016)

  241. Yang, Z., Xie, Y., Wang, Z.: A theoretical analysis of deep Q-learning (2019). arXiv preprint arXiv:1901.00137

  242. Jia, Z., Yang, L.F., Wang, M.: Feature-based Q-learning for two-player stochastic games (2019). arXiv preprint arXiv:1906.00423

  243. Sidford, A., Wang, M., Wu, X., Yang, L., Ye, Y.: Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Advances in Neural Information Processing Systems, pp. 5186–5196 (2018)

  244. Wei, C.Y., Hong, Y.T., Lu, C.J.: Online reinforcement learning in stochastic games. In: Advances in Neural Information Processing Systems, pp. 4987–4997 (2017)

  245. Auer, P., Ortner, R.: Logarithmic online regret bounds for undiscounted reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 49–56 (2007)

  246. Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res. 11, 1563–1600 (2010)

  247. Koller, D., Megiddo, N., von Stengel, B.: Fast algorithms for finding randomized strategies in game trees. Computing 750, 759 (1994)

  248. Von Stengel, B.: Efficient computation of behavior strategies. Games Econ. Behav. 14(2), 220–246 (1996)

  249. Koller, D., Megiddo, N., Von Stengel, B.: Efficient computation of equilibria for extensive two-person games. Games Econ. Behav. 14(2), 247–259 (1996)

  250. Von Stengel, B.: Computing equilibria for two-person games. Handbook of Game Theory with Economic Applications 3, 1723–1759 (2002)

  251. Parr, R., Russell, S.: Approximating optimal policies for partially observable stochastic domains. In: International Joint Conference on Artificial Intelligence, pp. 1088–1094 (1995)

  252. Rodriguez, A.C., Parr, R., Koller, D.: Reinforcement learning using approximate belief states. In: Advances in Neural Information Processing Systems, pp. 1036–1042 (2000)

  253. Hauskrecht, M.: Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000)

  254. Buter, B.J.: Dynamic programming for extensive form games with imperfect information. Ph.D. thesis, Universiteit van Amsterdam (2012)

  255. Cowling, P.I., Powley, E.J., Whitehouse, D.: Information set Monte Carlo tree search. IEEE Trans. Comput. Intell. AI Games 4(2), 120–143 (2012)

  256. Teraoka, K., Hatano, K., Takimoto, E.: Efficient sampling method for Monte Carlo tree search problem. IEICE Trans. Inf. Syst. 97(3), 392–398 (2014)

  257. Whitehouse, D.: Monte Carlo tree search for games with hidden information and uncertainty. Ph.D. thesis, University of York (2014)

  258. Kaufmann, E., Koolen, W.M.: Monte-Carlo tree search by best arm identification. In: Advances in Neural Information Processing Systems, pp. 4897–4906 (2017)

  259. Hannan, J.: Approximation to Bayes risk in repeated play. Contrib. Theory Games 3, 97–139 (1957)

  260. Brown, G.W.: Iterative solution of games by fictitious play. Act. Anal. Prod. Allo. 13(1), 374–376 (1951)

  261. Robinson, J.: An iterative method of solving a game. Ann. Math. 296–301 (1951)

  262. Benaïm, M., Hofbauer, J., Sorin, S.: Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1), 328–348 (2005)

  263. Hart, S., Mas-Colell, A.: A general class of adaptive strategies. J. Econ. Theory 98(1), 26–54 (2001)

  264. Monderer, D., Samet, D., Sela, A.: Belief affirming in learning processes. J. Econ. Theory 73(2), 438–452 (1997)

  265. Viossat, Y., Zapechelnyuk, A.: No-regret dynamics and fictitious play. J. Econ. Theory 148(2), 825–842 (2013)

  266. Kushner, H.J., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York (2003)

  267. Fudenberg, D., Levine, D.K.: Consistency and cautious fictitious play. J. Econ. Dyn. Control 19(5–7), 1065–1089 (1995)

  268. Hofbauer, J., Sandholm, W.H.: On the global convergence of stochastic fictitious play. Econometrica 70(6), 2265–2294 (2002)

  269. Leslie, D.S., Collins, E.J.: Generalised weakened fictitious play. Games Econ. Behav. 56(2), 285–298 (2006)

  270. Benaïm, M., Faure, M.: Consistency of vanishingly smooth fictitious play. Math. Oper. Res. 38(3), 437–450 (2013)

  271. Li, Z., Tewari, A.: Sampled fictitious play is Hannan consistent. Games Econ. Behav. 109, 401–412 (2018)

  272. Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. J. Mach. Learn. Res. 6(Apr), 503–556 (2005)

  273. Heinrich, J., Silver, D.: Self-play Monte-Carlo tree search in computer Poker. In: Workshops at AAAI Conference on Artificial Intelligence (2014)

  274. Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–43 (2012)

  275. Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge (2008)

  276. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)

  277. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)

  278. Vovk, V.G.: Aggregating strategies. In: Proceedings of Computational Learning Theory (1990)

  279. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Inf. Comput. 108(2), 212–261 (1994)

  280. Freund, Y., Schapire, R.E.: Adaptive game playing using multiplicative weights. Games Econ. Behav. 29(1–2), 79–103 (1999)

  281. Hart, S., Mas-Colell, A.: A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5), 1127–1150 (2000)

  282. Lanctot, M., Waugh, K., Zinkevich, M., Bowling, M.: Monte Carlo sampling for regret minimization in extensive games. In: Advances in Neural Information Processing Systems, pp. 1078–1086 (2009)

  283. Burch, N., Lanctot, M., Szafron, D., Gibson, R.G.: Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In: Advances in Neural Information Processing Systems, pp. 1880–1888 (2012)

  284. Gibson, R., Lanctot, M., Burch, N., Szafron, D., Bowling, M.: Generalized sampling and variance in counterfactual regret minimization. In: AAAI Conference on Artificial Intelligence (2012)

  285. Johanson, M., Bard, N., Lanctot, M., Gibson, R., Bowling, M.: Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 837–846 (2012)

  286. Lisý, V., Lanctot, M., Bowling, M.: Online Monte Carlo counterfactual regret minimization for search in imperfect information games. In: International Conference on Autonomous Agents and Multi-Agent Systems, pp. 27–36 (2015)

  287. Schmid, M., Burch, N., Lanctot, M., Moravčík, M., Kadlec, R., Bowling, M.: Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In: AAAI Conference on Artificial Intelligence, vol. 33, pp. 2157–2164 (2019)

  288. Waugh, K., Morrill, D., Bagnell, J.A., Bowling, M.: Solving games with functional regret estimation. In: AAAI Conference on Artificial Intelligence (2015)

  289. Morrill, D.: Using regret estimation to solve games compactly. Ph.D. thesis, University of Alberta (2016)

  290. Brown, N., Lerer, A., Gross, S., Sandholm, T.: Deep counterfactual regret minimization. In: International Conference on Machine Learning, pp. 793–802 (2019)

  291. Brown, N., Sandholm, T.: Regret-based pruning in extensive-form games. In: Advances in Neural Information Processing Systems, pp. 1972–1980 (2015)

  292. Brown, N., Kroer, C., Sandholm, T.: Dynamic thresholding and pruning for regret minimization. In: AAAI Conference on Artificial Intelligence (2017)

  293. Brown, N., Sandholm, T.: Reduced space and faster convergence in imperfect-information games via pruning. In: International Conference on Machine Learning, pp. 596–604 (2017)

  294. Tammelin, O.: Solving large imperfect information games using CFR+ (2014). arXiv preprint arXiv:1407.5042

  295. Tammelin, O., Burch, N., Johanson, M., Bowling, M.: Solving heads-up limit Texas Hold’em. In: International Joint Conference on Artificial Intelligence (2015)

  296. Burch, N., Moravčík, M., Schmid, M.: Revisiting CFR+ and alternating updates. J. Artif. Intell. Res. 64, 429–443 (2019)

  297. Zhou, Y., Ren, T., Li, J., Yan, D., Zhu, J.: Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information (2018). arXiv preprint arXiv:1810.04433

  298. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: International Conference on Machine Learning, pp. 928–936 (2003)

  299. Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.B., Morrill, D., Timbers, F., Tuyls, K.: Computing approximate equilibria in sequential adversarial games by exploitability descent (2019). arXiv preprint arXiv:1903.05614

  300. Johanson, M., Bard, N., Burch, N., Bowling, M.: Finding optimal abstract strategies in extensive-form games. In: AAAI Conference on Artificial Intelligence, pp. 1371–1379 (2012)

  301. Shafiei, M., Sturtevant, N., Schaeffer, J.: Comparing UCT versus CFR in simultaneous games (2009)

  302. Lanctot, M., Lisý, V., Winands, M.H.: Monte Carlo tree search in simultaneous move games with applications to Goofspiel. In: Workshop on Computer Games, pp. 28–43 (2013)

  303. Lisý, V., Kovařík, V., Lanctot, M., Bošanský, B.: Convergence of Monte Carlo tree search in simultaneous move games. In: Advances in Neural Information Processing Systems, pp. 2112–2120 (2013)

  304. Tak, M.J., Lanctot, M., Winands, M.H.: Monte Carlo tree search variants for simultaneous move games. In: IEEE Conference on Computational Intelligence and Games, pp. 1–8 (2014)

  305. Kovařík, V., Lisý, V.: Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games (2018). arXiv preprint arXiv:1804.09045

  306. Mazumdar, E.V., Jordan, M.I., Sastry, S.S.: On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games (2019). arXiv preprint arXiv:1901.00838

  307. Bu, J., Ratliff, L.J., Mesbahi, M.: Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games (2019). arXiv preprint arXiv:1911.04672

  308. Mescheder, L., Nowozin, S., Geiger, A.: The numerics of GANs. In: Advances in Neural Information Processing Systems, pp. 1825–1835 (2017)

  309. Adolphs, L., Daneshmand, H., Lucchi, A., Hofmann, T.: Local saddle point optimization: a curvature exploitation approach (2018). arXiv preprint arXiv:1805.05751

  310. Daskalakis, C., Panageas, I.: The limit points of (optimistic) gradient descent in min-max optimization. In: Advances in Neural Information Processing Systems, pp. 9236–9246 (2018)

  311. Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.S., Chandrasekhar, V., Piliouras, G.: Optimistic mirror descent in saddle-point problems: going the extra (gradient) mile. In: International Conference on Learning Representations (2019)

  312. Fiez, T., Chasnov, B., Ratliff, L.J.: Convergence of learning dynamics in Stackelberg games (2019). arXiv preprint arXiv:1906.01217

  313. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning, pp. 363–372 (2018)

  314. Sanjabi, M., Razaviyayn, M., Lee, J.D.: Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition (2018). arXiv preprint arXiv:1812.02878

  315. Nouiehed, M., Sanjabi, M., Lee, J.D., Razaviyayn, M.: Solving a class of non-convex min-max games using iterative first order methods (2019). arXiv preprint arXiv:1902.08297

  316. Mazumdar, E., Ratliff, L.J., Jordan, M.I., Sastry, S.S.: Policy-gradient algorithms have no guarantees of convergence in continuous action and state multi-agent settings (2019). arXiv preprint arXiv:1907.03712

  317. Chen, X., Deng, X., Teng, S.H.: Settling the complexity of computing two-player Nash equilibria. J. ACM 56(3), 14 (2009)

  318. Greenwald, A., Hall, K., Serrano, R.: Correlated Q-learning. In: International Conference on Machine Learning, pp. 242–249 (2003)

  319. Aumann, R.J.: Subjectivity and correlation in randomized strategies. J. Math. Econ. 1(1), 67–96 (1974)

  320. Perolat, J., Strub, F., Piot, B., Pietquin, O.: Learning Nash equilibrium for general-sum Markov games from batch data. In: International Conference on Artificial Intelligence and Statistics, pp. 232–241 (2017)

  321. Maillard, O.A., Munos, R., Lazaric, A., Ghavamzadeh, M.: Finite-sample analysis of Bellman residual minimization. In: Asian Conference on Machine Learning, pp. 299–314 (2010)

  322. Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J.N., Tuyls, K., Graepel, T.: Differentiable game mechanics. J. Mach. Learn. Res. 20(84), 1–40 (2019)

  323. Chasnov, B., Ratliff, L.J., Mazumdar, E., Burden, S.A.: Convergence analysis of gradient-based learning with non-uniform learning rates in non-cooperative multi-agent settings (2019). arXiv preprint arXiv:1906.00731

  324. Hart, S., Mas-Colell, A.: Uncoupled dynamics do not lead to Nash equilibrium. Am. Econ. Rev. 93(5), 1830–1836 (2003)

  325. Saldi, N., Başar, T., Raginsky, M.: Markov-Nash equilibria in mean-field games with discounted cost. SIAM J. Control Optim. 56(6), 4256–4287 (2018)

  326. Saldi, N., Başar, T., Raginsky, M.: Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Math. Oper. Res. (2019)

  327. Saldi, N.: Discrete-time average-cost mean-field games on Polish spaces (2019). arXiv preprint arXiv:1908.08793

  328. Saldi, N., Başar, T., Raginsky, M.: Discrete-time risk-sensitive mean-field games (2018). arXiv preprint arXiv:1808.03929

  329. Guo, X., Hu, A., Xu, R., Zhang, J.: Learning mean-field games (2019). arXiv preprint arXiv:1901.09585

  330. Fu, Z., Yang, Z., Chen, Y., Wang, Z.: Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games (2019). arXiv preprint arXiv:1910.07498

  331. Hadikhanloo, S., Silva, F.J.: Finite mean field games: fictitious play and convergence to a first order continuous mean field game. J. Math. Pures Appl. (2019)

  332. Elie, R., Pérolat, J., Laurière, M., Geist, M., Pietquin, O.: Approximate fictitious play for mean field games (2019). arXiv preprint arXiv:1907.02633

  333. Anahtarci, B., Kariksiz, C.D., Saldi, N.: Value iteration algorithm for mean-field games (2019). arXiv preprint arXiv:1909.01758

  334. Zaman, M.A.u., Zhang, K., Miehling, E., Başar, T.: Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. Submitted to IEEE American Control Conference (2020)

  335. Yang, B., Liu, M.: Keeping in touch with collaborative UAVs: a deep reinforcement learning approach. In: International Joint Conference on Artificial Intelligence, pp. 562–568 (2018)

  336. Pham, H.X., La, H.M., Feil-Seifer, D., Nefian, A.: Cooperative and distributed reinforcement learning of drones for field coverage (2018). arXiv preprint arXiv:1803.07250

  337. Tožička, J., Szulyovszky, B., de Chambrier, G., Sarwal, V., Wani, U., Gribulis, M.: Application of deep reinforcement learning to UAV fleet control. In: SAI Intelligent Systems Conference, pp. 1169–1177 (2018)

  338. Shamsoshoara, A., Khaledi, M., Afghah, F., Razi, A., Ashdown, J.: Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning. In: IEEE Annual Consumer Communications & Networking Conference, pp. 1–6 (2019)

  339. Cui, J., Liu, Y., Nallanathan, A.: The application of multi-agent reinforcement learning in UAV networks. In: IEEE International Conference on Communications Workshops, pp. 1–6 (2019)

  340. Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L.: Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access (2019)

  341. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  342. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  343. Hausknecht, M., Stone, P.: Deep recurrent Q-learning for partially observable MDPs. In: AAAI Fall Symposium Series (2015)

  344. Jorge, E., Kågebäck, M., Johansson, F.D., Gustavsson, E.: Learning to play Guess Who? and inventing a grounded language as a consequence (2016). arXiv preprint arXiv:1611.03218

  345. Sukhbaatar, S., Fergus, R., et al.: Learning multiagent communication with backpropagation. In: Advances in Neural Information Processing Systems, pp. 2244–2252 (2016)

  346. Havrylov, S., Titov, I.: Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In: Advances in Neural Information Processing Systems, pp. 2149–2159 (2017)

  347. Das, A., Kottur, S., Moura, J.M., Lee, S., Batra, D.: Learning cooperative visual dialog agents with deep reinforcement learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2951–2960 (2017)

  348. Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., Wang, J.: Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games (2017). arXiv preprint arXiv:1703.10069

  349. Mordatch, I., Abbeel, P.: Emergence of grounded compositional language in multi-agent populations. In: AAAI Conference on Artificial Intelligence (2018)

  350. Jiang, J., Lu, Z.: Learning attentional communication for multi-agent cooperation. In: Advances in Neural Information Processing Systems, pp. 7254–7264 (2018)

  351. Jiang, J., Dun, C., Lu, Z.: Graph convolutional reinforcement learning for multi-agent cooperation (2018). arXiv preprint arXiv:1810.09202

  352. Celikyilmaz, A., Bosselut, A., He, X., Choi, Y.: Deep communicating agents for abstractive summarization (2018). arXiv preprint arXiv:1803.10357

  353. Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., Pineau, J.: TarMAC: targeted multi-agent communication (2018). arXiv preprint arXiv:1810.11187

  354. Lazaridou, A., Hermann, K.M., Tuyls, K., Clark, S.: Emergence of linguistic communication from referential games with symbolic and pixel input (2018). arXiv preprint arXiv:1804.03984

  355. Cogswell, M., Lu, J., Lee, S., Parikh, D., Batra, D.: Emergence of compositional language with deep generational transmission (2019). arXiv preprint arXiv:1904.09067

  356. Allis, L.: Searching for solutions in games and artificial intelligence. Ph.D. thesis, Maastricht University (1994)

  357. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  358. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., Hassabis, D.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)

  359. Billings, D., Davidson, A., Schaeffer, J., Szafron, D.: The challenge of Poker. Artif. Intell. 134(1–2), 201–240 (2002)

  360. Kuhn, H.W.: A simplified two-person Poker. Contrib. Theory Games 1, 97–103 (1950)

  361. Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., Rayner, C.: Bayes’ bluff: opponent modelling in Poker. In: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pp. 550–558. AUAI Press (2005)

  362. Bowling, M., Burch, N., Johanson, M., Tammelin, O.: Heads-up limit hold’em Poker is solved. Science 347(6218), 145–149 (2015)

  363. Heinrich, J., Silver, D.: Smooth UCT search in computer Poker. In: 24th International Joint Conference on Artificial Intelligence (2015)

  364. Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., Bowling, M.: DeepStack: expert-level artificial intelligence in heads-up no-limit Poker. Science 356(6337), 508–513 (2017)

  365. Brown, N., Sandholm, T.: Superhuman AI for heads-up no-limit Poker: Libratus beats top professionals. Science 359(6374), 418–424 (2018)

  366. Burch, N., Johanson, M., Bowling, M.: Solving imperfect information games using decomposition. In: 28th AAAI Conference on Artificial Intelligence (2014)

  367. Moravčík, M., Schmid, M., Ha, K., Hladík, M., Gaukrodger, S.J.: Refining subgames in large imperfect information games. In: 30th AAAI Conference on Artificial Intelligence (2016)

  368. Brown, N., Sandholm, T.: Safe and nested subgame solving for imperfect-information games. In: Advances in Neural Information Processing Systems, pp. 689–699 (2017)

  369. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al.: StarCraft II: a new challenge for reinforcement learning (2017). arXiv preprint arXiv:1708.04782

  370. Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)

  371. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)

  372. Lerer, A., Peysakhovich, A.: Maintaining cooperation in complex social dilemmas using deep reinforcement learning (2017). arXiv preprint arXiv:1707.01068

  373. Hughes, E., Leibo, J.Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A.G., Dunning, I., Zhu, T., McKee, K., Koster, R., et al.: Inequity aversion improves cooperation in intertemporal social dilemmas. In: Advances in Neural Information Processing Systems, pp. 3326–3336 (2018)

  374. Cai, Q., Yang, Z., Lee, J.D., Wang, Z.: Neural temporal-difference learning converges to global optima (2019). arXiv preprint arXiv:1905.10027

  375. Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: implicit acceleration by overparameterization (2018). arXiv preprint arXiv:1802.06509

  376. Li, Y., Liang, Y.: Learning overparameterized neural networks via stochastic gradient descent on structured data. In: Advances in Neural Information Processing Systems, pp. 8157–8166 (2018)

  377. Brafman, R.I., Tennenholtz, M.: A near-optimal polynomial time algorithm for learning in certain classes of stochastic games. Artif. Intell. 121(1–2), 31–47 (2000)

  378. Brafman, R.I., Tennenholtz, M.: R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 3, 213–231 (2002)

  379. Tu, S., Recht, B.: The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint (2018). arXiv preprint arXiv:1812.03565

  380. Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J.: Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In: Conference on Learning Theory, pp. 2898–2933 (2019)

  381. Lin, Q., Liu, M., Rafique, H., Yang, T.: Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality (2018). arXiv preprint arXiv:1810.10207

  382. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)

  383. Chen, Y., Su, L., Xu, J.: Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1(2), 44 (2017)

  384. Yin, D., Chen, Y., Ramchandran, K., Bartlett, P.: Byzantine-robust distributed learning: towards optimal statistical rates (2018). arXiv preprint arXiv:1803.01498

Author information

Correspondence to Tamer Başar.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this chapter

Zhang, K., Yang, Z., Başar, T. (2021). Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_12
