Abstract
Multi-agent reinforcement learning (MARL) is often modeled using the framework of Markov games (also called stochastic games or dynamic games). Most of the existing literature on MARL concentrates on zero-sum Markov games, and those techniques are not applicable to general-sum Markov games. It is known that the best-response dynamics in general-sum Markov games are not a contraction. Therefore, different equilibria in general-sum Markov games can have different values. Moreover, the Q-function is not sufficient to completely characterize the equilibrium. Given these challenges, model-based learning is an attractive approach for MARL in general-sum Markov games. In this paper, we investigate the fundamental question of sample complexity for model-based MARL algorithms in general-sum Markov games. We show two results. We first use Hoeffding inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-4} \alpha ^{-2})\) samples per state–action pair are sufficient to obtain an \(\alpha \)-approximate Markov perfect equilibrium with high probability, where \(\gamma \) is the discount factor, and the \(\tilde{{\mathcal {O}}}(\cdot )\) notation hides logarithmic terms. We then use Bernstein inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-1} \alpha ^{-2} )\) samples are sufficient. To obtain these results, we study the robustness of Markov perfect equilibrium to model approximations. We show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game and provide explicit bounds on the approximation error. We illustrate the results via a numerical example.
Data Availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Notes
Zinkevich et al. [65] construct a two-player general-sum game with the following properties. The game has two states: in state 1, player 1 has two actions and player 2 has one action; in state 2, player 1 has one action and player 2 has two actions. The transition probabilities are chosen so that there is a unique Markov perfect equilibrium, which is in mixed strategies. This means that in state 1, both actions of player 1 maximize the Q-function; in state 2, both actions of player 2 minimize the Q-function. However, the Q-function by itself is insufficient to determine the randomizing probabilities of the mixed-strategy MPE.
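The following minimal static analogue (a hypothetical \(2\times 2\) bimatrix game, not the construction of [65]) illustrates the phenomenon: in a fully mixed equilibrium, each player randomizes so as to make the opponent indifferent, so the mixing probabilities depend on the opponent's payoffs, which the player's own Q-values do not reveal.

```python
import numpy as np

# Hypothetical 2x2 general-sum game (illustrative payoffs only):
# A[i, j] is player 1's payoff and B[i, j] is player 2's payoff when
# player 1 plays row i and player 2 plays column j.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])

# In a fully mixed equilibrium, player 1's probability p of playing row 0
# is chosen to make PLAYER 2 indifferent between its two columns:
#   p*B[0,0] + (1-p)*B[1,0] = p*B[0,1] + (1-p)*B[1,1]
p = (B[1, 1] - B[1, 0]) / (B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1])
# Symmetrically, player 2's probability q of column 0 makes player 1
# indifferent between its two rows.
q = (A[1, 1] - A[1, 0]) / (A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1])

# Player 1's expected payoff for each of its actions (the static analogue
# of its Q-values) is identical at equilibrium, so these values alone
# cannot reveal p, which is determined entirely by B.
payoff_row0 = q * A[0, 0] + (1 - q) * A[0, 1]
payoff_row1 = q * A[1, 0] + (1 - q) * A[1, 1]
print(p, q, payoff_row0, payoff_row1)  # p = 0.75, q = 1/3, both payoffs 2/3
```

Here player 1's two actions yield the same expected payoff (2/3), yet its equilibrium mixing probability \(p = 3/4\) is computed solely from player 2's payoff matrix.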
The plug-in estimator is also known as a certainty equivalent controller in the stochastic control literature.
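As an illustrative sketch of the plug-in (certainty equivalent) approach in the single-agent special case, assuming access to a generative model that can be sampled at every state–action pair (the model and all parameter values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, n = 5, 3, 0.9, 2000  # states, actions, discount, samples per (s, a)

# Hypothetical true model: transition kernel P[s, a] and rewards r[s, a].
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))

# Plug-in step: estimate the kernel by empirical frequencies from n
# generative-model samples at each state-action pair.
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=n, p=P[s, a])
        P_hat[s, a] = np.bincount(draws, minlength=S) / n

def value_iteration(P, r, gamma, iters=500):
    V = np.zeros(S)
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

# Certainty equivalence: plan in the estimated model as if it were true,
# then compare optimal values of the true and estimated models.
V_true = value_iteration(P, r, gamma)
V_plug = value_iteration(P_hat, r, gamma)
print(np.abs(V_true - V_plug).max())
```

As the sample count \(n\) grows, the value of the certainty equivalent plan approaches the true optimal value, which is the behavior quantified by the sample complexity bounds in the paper.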
Suppose \(\mu \) and \(\nu \) are absolutely continuous with respect to some measure \(\lambda \), and let \(p = d\mu /d\lambda \) and \(q = d\nu /d\lambda \). Then the total variation distance is typically defined as \(\tfrac{1}{2} \int _{\mathcal X}| p(x) - q(x)| \lambda (dx)\). This is consistent with our definition. Let \({{\bar{f}}} = ( \sup f + \inf f)/2\). Then
$$\begin{aligned}&\left| \int _{\mathcal X} f d\mu - \int _{\mathcal X} f d\nu \right| = \left| \int _{\mathcal X} f(x) p(x) \lambda (dx) - \int _{\mathcal X} f(x) q(x) \lambda (dx) \right| \\&\quad = \left| \int _{\mathcal X} \bigl [ f(x) - {{\bar{f}}} \bigr ] \bigl [ p(x) - q(x) \bigr ] \lambda (dx) \right| \le \Vert f - {{\bar{f}}} \Vert _{\infty } \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx)\\&\quad \le \tfrac{1}{2} {{\,\textrm{span}\,}}(f) \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx). \end{aligned}$$
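The inequality above can be checked numerically for discrete distributions (an illustrative sketch with arbitrary \(p\), \(q\), and \(f\)):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two arbitrary discrete distributions p, q on a common finite space,
# and an arbitrary bounded test function f.
p = rng.dirichlet(np.ones(8))
q = rng.dirichlet(np.ones(8))
f = rng.random(8) * 10

lhs = abs(f @ p - f @ q)            # |int f dmu - int f dnu|
tv = 0.5 * np.abs(p - q).sum()      # total variation distance
span_f = f.max() - f.min()          # span(f) = sup f - inf f
print(lhs, span_f * tv)             # lhs <= (1/2) span(f) * sum|p - q| = span_f * tv
```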
References
Acemoglu D, Robinson JA (2001) A theory of political transitions. Am Econ Rev 91(4):938–963
Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83. PMLR
Aguirregabiria V, Mira P (2007) Sequential estimation of dynamic discrete games. Econometrica 75(1):1–53
Akchurina N (2010) Multi-agent reinforcement learning algorithms. PhD thesis, University of Paderborn
Albright SC, Winston W (1979) A birth-death model of advertising and pricing. Adv Appl Probab 11(1):134–152
Altman E (1999) Constrained Markov decision processes: stochastic modeling. CRC Press, Boca Raton
Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach Learn 91(3):325–349
Bajari P, Benkard CL, Levin J (2007) Estimating dynamic models of imperfect competition. Econometrica 75(5):1331–1370
Başar T, Bernhard P (2008) H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, NY
Başar T, Zaccour G (2018) Handbook of dynamic game theory. Springer International Publishing, NY
Bertsekas DP (2017) Dynamic programming and optimal control. Athena Scientific, Belmont, MA
Breton M (1991) Algorithms for stochastic games. Springer, NY
Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst, Man, Cybern, Part C (Appl Rev) 38(2):156–172
Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge
Deng X, Li Y, Mguni DH, Wang J, Yang Y (2021) On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. http://arxiv.org/abs/2109.01795
Doraszelski U, Escobar JF (2010) A theory of regular Markov perfect equilibria in dynamic stochastic games: Genericity, stability, and purification. Theor Econ 5(3):369–402
Ericson R, Pakes A (1995) Markov-perfect industry dynamics: a framework for empirical work. Rev Econ Stud 62(1):53–82
Fershtman C, Pakes A (2000) A dynamic oligopoly with collusion and price wars. RAND J Econ 31(2):207–236
Filar J, Vrieze K (1996) Competitive Markov decision processes. Springer, New York, NY
Filar JA, Schultz TA, Thuijsman F, Vrieze O (1991) Nonlinear programming and stationary equilibria in stochastic games. Math Program 50(1):227–237
Fink AM (1964) Equilibrium in a stochastic n-person game. Hiroshima Math J 28:1
Herings PJ-J, Peeters R (2010) Homotopy methods to compute equilibria in game theory. Econ Theory 42(1):119–156
Herings PJ-J, Peeters RJ et al (2004) Stationary equilibria in stochastic games: structure, selection, and computation. J Econ Theory 118(1):32–60
Hinderer K (2005) Lipschitz continuity of value functions in Markovian decision processes. Math Methods Oper Res 62(1):3–22
Hoffman AJ, Karp RM (1966) On nonterminating stochastic games. Manage Sci 12(5):359–370
Jaśkiewicz A, Nowak AS (2014) Robust Markov perfect equilibria. J Math Anal Appl 419(2):1322–1332
Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College, London
Kearns M, Singh S (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv Neural Inf Process Syst 871:996–1002
Krupnik O, Mordatch I, Tamar A (2019) Multi-agent reinforcement learning with multi-step generative models. http://arxiv.org/abs/1901.10251
Leonardos S, Overman W, Panageas I, Piliouras G (2021) Global convergence of multi-agent policy gradient in Markov potential games. http://arxiv.org/abs/2106.01969
Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv Neural Inf Process Syst 33:12861
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In International Conference on Machine Learning, pages 157–163. Elsevier
Littman ML (2001) Value-function reinforcement learning in Markov games. Cognit Syst Res 2(1):55–66
Mailath GJ, Samuelson L (2006) Repeated games and reputations: long-run relationships. Oxford University Press, Oxford
Maskin E, Tirole J (1988) A theory of dynamic oligopoly, I: Overview and quantity competition with large fixed costs. Econometrica 56(3):549–569
Maskin E, Tirole J (1988) A theory of dynamic oligopoly, II: Price competition, kinked demand curves, and Edgeworth cycles. Econometrica 56(3):571–599
Maskin E, Tirole J (2001) Markov perfect equilibrium: I. Observable actions. J Econ Theory 100(2):191–219
Müller A (1997) How does the value function of a Markov decision process depend on the transition probabilities? Math Op Res 22(4):872–885
Müller A (1997) Integral probability metrics and their generating classes of functions. Adv Appl Probab 29(2):429–443
Pakes A, Ostrovsky M, Berry S (2007) Simple estimators for the parameters of discrete dynamic games (with entry/exit examples). RAND J Econ 38(2):373–399
Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In Artificial Intelligence and Statistics, pages 232–241. PMLR
Pesendorfer M, Schmidt-Dengler P (2008) Asymptotic least squares estimators for dynamic games. Rev Econ Stud 75(3):901–928
Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1371–1379
Rogers PD (1969) Nonzero-sum stochastic games. PhD thesis, University of California, Berkeley
Sengupta S, Chowdhary A, Huang D, Kambhampati S (2019) General-sum Markov games for strategic detection of advanced persistent threats using moving target defense in cloud networks. In International Conference on Decision and Game Theory for Security, pages 492–512. Springer
Shapley LS (1953) Stochastic games. Proc Nat Acad Sci 39(10):1095–1100
Shoham Y, Powers R, Grenager T (2003) Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University
Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5192–5202
Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 2992–3002. PMLR
Solan E (2021) A course in stochastic game theory. Cambridge University Press, Cambridge
Song Z, Mei S, Bai Y (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? http://arxiv.org/abs/2110.04184
Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B (2008) Injective Hilbert space embeddings of probability measures. In Conference on Learning Theory
Subramanian J, Sinha A, Seraj R, Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J Mach Learn Res 23:1–12
Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, pages 216–224. San Francisco (CA)
Takahashi M (1964) Equilibrium points of stochastic non-cooperative \(n\)-person games. Hiroshima Math J 28(1):95
Tidball MM, Altman E (1996) Approximations in dynamic zero-sum games I. SIAM J Control Optim 34(1):311–328
Tidball MM, Pourtallier O, Altman E (1997) Approximations in dynamic zero-sum games II. SIAM J Control Optim 35(6):2101–2117
Vrieze OJ (1987) Stochastic games with finite state and action spaces. CWI, Amsterdam
Wang T, Bao X, Clavera I, Hoang J, Wen Y, Langlois E, Zhang S, Zhang G, Abbeel P, Ba J (2019) Benchmarking Model-Based Reinforcement Learning. http://arxiv.org/abs/1907.02057
Whitt W (1980) Representation and approximation of noncooperative sequential games. SIAM J Control Optim 18(1):33–48
Zhang K, Kakade S, Basar T, Yang L (2020) Model-based multi-agent rl in zero-sum Markov games with near-optimal sample complexity. Adv Neural Inf Process Syst 33:1166
Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384
Zhang R, Ren Z, Li N (2021) Gradient play in multi-agent Markov stochastic games: stationary points and convergence. arXiv e-prints
Zhang W, Wang X, Shen J, Zhou M (2021) Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts. In International Joint Conference on Artificial Intelligence. Montreal, Canada
Zinkevich M, Greenwald A, Littman M (2006) Cyclic equilibria in Markov games. In Neural Information Processing Systems, pages 1641–1648
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
The work of Amit Sinha and Aditya Mahajan was supported in part by the Innovation for Defence Excellence and Security (IDEaS) Program of the Canadian Department of National Defence through grant CFPMN2-30.
A preliminary version of this work appeared in the 2021 Indian Control Conference.
This article is part of the topical collection “Multi-agent Dynamic Decision Making and Learning” edited by Konstantin Avrachenkov, Vivek S. Borkar and U. Jayakrishnan Nair.
Appendices
Appendix
Proof of Theorem 4
Let \(V_*\) denote the optimal value function for MDP \({\mathcal {M}}\) and \(V_{\pi }\) denote the value function for policy \(\pi \) in MDP \({\mathcal {M}}\). Let \({\hat{\pi }}_*\) be the optimal policy for \(\widehat{{\mathcal {M}}}\) and \({\hat{\pi }}\) be an \(\alpha _{\textrm{opt}}\)-optimal policy for \(\widehat{{\mathcal {M}}}\). From the triangle inequality, we have
$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le \Vert V_* - {\hat{V}}_* \Vert _{\infty } + \Vert {\hat{V}}_* - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}} - V_{{\hat{\pi }}} \Vert _{\infty }. \end{aligned}$$
(42)
Now we bound the three terms separately. For the first term, we have
where (a) relies on the fact that \(\max f(x) \le \max | f(x) - g(x) | + \max g(x)\), (b) follows from the triangle inequality, and (c) follows from the definitions of \(\varepsilon \) and \(\varDelta _{{\hat{\pi }}}\). Therefore,
For the second term of (42), we have
since \({\hat{\pi }}\) is an \(\alpha _{\textrm{opt}}\)-optimal policy of \(\widehat{{\mathcal {M}}}\).
For the third term of (42), we have
where (d) follows from the triangle inequality and (e) follows from the definitions of \(\varepsilon \) and \(\varDelta _{{\hat{\pi }}}\). Therefore,
The result of Theorem 4 then follows by substituting (43)–(45) in (42). \(\square \)
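The robustness phenomenon behind Theorem 4 can be illustrated numerically: plan in a perturbed model and evaluate the resulting policy in the true model. The sketch below uses a hypothetical random MDP and checks a loose simulation-lemma-style bound of order \(4\gamma \varepsilon (1-\gamma )^{-2}\) for rewards in \([0,1]\); this is a classical, deliberately conservative bound, not the exact constant of Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, eps = 4, 2, 0.9, 0.05  # states, actions, discount, TV perturbation

# Hypothetical true model M, and a perturbed model M-hat whose transition
# kernel differs from the true one by at most eps in total variation.
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.random((S, A))
P_hat = (1 - eps) * P + eps * rng.dirichlet(np.ones(S), size=(S, A))

def optimal_Q(P, r, iters=1000):
    V = np.zeros(S)
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return r + gamma * P @ V

def policy_value(P, r, pi):
    # Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

pi_hat = optimal_Q(P_hat, r).argmax(axis=1)  # optimal policy of the perturbed model
V_star = optimal_Q(P, r).max(axis=1)         # optimal value of the true model
V_pi_hat = policy_value(P, r, pi_hat)        # value of pi_hat in the true model

gap = (V_star - V_pi_hat).max()              # suboptimality of pi_hat in M
bound = 4 * gamma * eps / (1 - gamma) ** 2   # loose simulation-lemma-style bound
print(gap, bound)
```

The observed gap is far below the conservative bound, in line with the sharper perturbation bounds established in the paper.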
Proof of Theorem 5
We follow the same notation as in Appendix A. In addition, let \(\mathcal {B}_*\) and \(\mathcal {B}_\pi \) denote the Bellman operators for the true model and let \({\hat{\mathcal {B}}}_*\) and \({{\hat{\mathcal {B}}}}_\pi \) denote the Bellman operators for the approximate model. As in Appendix A, the approximation error can be bounded as
Recall that \(V_{{\hat{\pi }}} = \mathcal {B}_{{\hat{\pi }}} V_{{\hat{\pi }}}\) and \({\hat{V}}_{{\hat{\pi }}} = {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{{\hat{\pi }}}\). Moreover, \({\hat{V}}_* = {\hat{V}}_{{\hat{\pi }}_*}\). Therefore, from the triangle inequality, we have
where (a) follows from the contraction property of the Bellman operators. Rearranging terms and using (44), we get that
Now the first term in (48) can be simplified similarly to (43), where in step (b) of (43), we need to add and subtract \(\sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) V_{\pi _*}(s')\). Using this, we obtain
The last term in (48) can be simplified as follows:
The result of Theorem 5 follows by substituting (49) and (50) in (48). \(\square \)
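The contraction property invoked in step (a) can also be checked numerically: for any bounded \(V\) and \(W\), \(\Vert \mathcal {B}_\pi V - \mathcal {B}_\pi W \Vert _\infty \le \gamma \Vert V - W \Vert _\infty \). A sketch with a hypothetical policy-induced transition matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 6, 0.9

# Hypothetical policy-induced transition matrix P_pi and reward r_pi.
P_pi = rng.dirichlet(np.ones(S), size=S)
r_pi = rng.random(S)

def bellman_pi(V):
    # Policy evaluation operator: (B_pi V)(s) = r_pi(s) + gamma * sum_s' P(s'|s) V(s')
    return r_pi + gamma * P_pi @ V

V = rng.random(S) * 5
W = rng.random(S) * 5
lhs = np.abs(bellman_pi(V) - bellman_pi(W)).max()
rhs = gamma * np.abs(V - W).max()
print(lhs, rhs)  # lhs <= rhs: B_pi is a gamma-contraction in the sup norm
```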
About this article
Cite this article
Subramanian, J., Sinha, A. & Mahajan, A. Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games. Dyn Games Appl 13, 56–88 (2023). https://doi.org/10.1007/s13235-023-00490-2