
Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games


Abstract

Multi-agent reinforcement learning (MARL) is often modeled using the framework of Markov games (also called stochastic games or dynamic games). Most of the existing literature on MARL concentrates on zero-sum Markov games but is not applicable to general-sum Markov games. It is known that the best response dynamics in general-sum Markov games are not a contraction. Therefore, different equilibria in general-sum Markov games can have different values. Moreover, the Q-function is not sufficient to completely characterize the equilibrium. Given these challenges, model-based learning is an attractive approach for MARL in general-sum Markov games. In this paper, we investigate the fundamental question of sample complexity for model-based MARL algorithms in general-sum Markov games. We show two results. We first use Hoeffding inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-4} \alpha ^{-2})\) samples per state–action pair are sufficient to obtain an \(\alpha \)-approximate Markov perfect equilibrium with high probability, where \(\gamma \) is the discount factor, and the \(\tilde{{\mathcal {O}}}(\cdot )\) notation hides logarithmic terms. We then use Bernstein inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-1} \alpha ^{-2} )\) samples are sufficient. To obtain these results, we study the robustness of Markov perfect equilibrium to model approximations. We show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game and provide explicit bounds on the approximation error. We illustrate the results via a numerical example.
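To make the model-based (plug-in) approach concrete, the following sketch shows how an empirical model can be built from a generative simulator and how the two sample-complexity expressions above scale for particular values of \(\gamma \) and \(\alpha \). This is a minimal illustration only: the constants and logarithmic factors hidden by the \(\tilde{{\mathcal {O}}}(\cdot )\) notation are ignored, and the names generative_model and estimate_model are placeholders rather than objects defined in the paper.

```python
import numpy as np

def estimate_model(generative_model, num_states, num_joint_actions, n_samples, rng):
    """Plug-in (certainty-equivalent) estimate of the transition kernel.

    For every state and joint action, draw `n_samples` next states from the
    generative model and use empirical frequencies as the estimate P_hat.
    Rewards are assumed known here; they could be estimated in the same way.
    """
    P_hat = np.zeros((num_states, num_joint_actions, num_states))
    for s in range(num_states):
        for a in range(num_joint_actions):
            for _ in range(n_samples):
                s_next = generative_model(s, a, rng)   # one sampled transition
                P_hat[s, a, s_next] += 1.0
    return P_hat / n_samples

# Illustrative sample sizes per state-(joint) action pair; only the scaling in
# (1 - gamma) and alpha is meaningful, since O~ hides constants and log factors.
gamma, alpha = 0.9, 0.1
n_hoeffding = int(np.ceil((1 - gamma) ** -4 * alpha ** -2))  # ~ 1e6
n_bernstein = int(np.ceil((1 - gamma) ** -1 * alpha ** -2))  # ~ 1e3
print(n_hoeffding, n_bernstein)
```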


Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

  1. [65] construct a two-player general-sum game with the following properties. The game has two states: in state 1, player 1 has two actions and player 2 has one action; in state 2, player 1 has one action and player 2 has two actions. The transition probabilities are chosen such that there is a unique Markov perfect equilibrium in mixed strategies. This means that in state 1, both actions of player 1 maximize the Q-function; in state 2, both actions of player 2 minimize the Q-function. However, the Q-function in itself is insufficient to determine the randomizing probabilities for the mixed-strategy MPE.

  2. The plug-in estimator is also known as a certainty equivalent controller in the stochastic control literature.

  3. If \(\mu \) and \(\nu \) are absolutely continuous with respect to some measure \(\lambda \), let \(p = d\mu /d\lambda \) and \(q = d\nu /d\lambda \); then the total variation is typically defined as \(\tfrac{1}{2} \int _{\mathcal X}| p(x) - q(x)| \lambda (dx)\). This is consistent with our definition (a numerical check of the inequality below is sketched after these notes). Let \({{\bar{f}}} = ( \sup f + \inf f)/2\). Then

    $$\begin{aligned}&\left| \int _{\mathcal X} f d\mu - \int _{\mathcal X} f d\nu \right| = \left| \int _{\mathcal X} f(x) p(x) \lambda (dx) - \int _{\mathcal X} f(x) q(x) \lambda (dx) \right| \\&\quad = \left| \int _{\mathcal X} \bigl [ f(x) - {{\bar{f}}} \bigr ] \bigl [ p(x) - q(x) \bigr ] \lambda (dx) \right| \le \Vert f - {{\bar{f}}} \Vert _{\infty } \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx)\\&\quad \le \tfrac{1}{2} {{\,\textrm{span}\,}}(f) \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx). \end{aligned}$$
  4. For consistency with the normalized rewards considered in the game formulation (see Remark 1), we use normalized rewards for MDPs as well. Although most of the literature on MDPs uses unnormalized rewards, normalized rewards are commonly used in the literature on constrained MDPs [6].

  5. Recall that we are working with the normalized total expected reward (see Remark 1), while the results of [7] are derived for the unnormalized total reward. In the discussion above, we have normalized the results of [7].
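As noted in footnote 3, the span inequality used there is easy to verify numerically. The following sketch checks it on randomly generated discrete distributions; the discrete setup and the random test function are our own choices and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
for _ in range(1000):
    p = rng.random(n); p /= p.sum()          # density of mu w.r.t. counting measure
    q = rng.random(n); q /= q.sum()          # density of nu
    f = rng.normal(size=n)                   # bounded test function
    lhs = abs(f @ p - f @ q)                 # |∫ f dmu - ∫ f dnu|
    rhs = 0.5 * (f.max() - f.min()) * np.abs(p - q).sum()  # (1/2) span(f) ∫ |p - q|
    assert lhs <= rhs + 1e-12
```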

References

  1. Acemoglu D, Robinson JA (2001) A theory of political transitions. Am Econ Rev 91(4):938–963

  2. Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on Learning Theory, pp 67–83. PMLR

  3. Aguirregabiria V, Mira P (2007) Sequential estimation of dynamic discrete games. Econometrica 75(1):1–53

  4. Akchurina N (2010) Multi-agent reinforcement learning algorithms. PhD thesis, University of Paderborn

  5. Albright SC, Winston W (1979) A birth-death model of advertising and pricing. Adv Appl Probab 11(1):134–152

  6. Altman E (1999) Constrained Markov decision processes: stochastic modeling. CRC Press, Boca Raton

  7. Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach Learn 91(3):325–349

  8. Bajari P, Benkard CL, Levin J (2007) Estimating dynamic models of imperfect competition. Econometrica 75(5):1331–1370

  9. Başar T, Bernhard P (2008) H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, New York

  10. Başar T, Zaccour G (2018) Handbook of dynamic game theory. Springer International Publishing, New York

  11. Bertsekas DP (2017) Dynamic programming and optimal control. Athena Scientific, Belmont, MA

  12. Breton M (1991) Algorithms for stochastic games. Springer, New York

  13. Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C (Appl Rev) 38(2):156–172

  14. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge

  15. Deng X, Li Y, Mguni DH, Wang J, Yang Y (2021) On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. http://arxiv.org/abs/2109.01795

  16. Doraszelski U, Escobar JF (2010) A theory of regular Markov perfect equilibria in dynamic stochastic games: genericity, stability, and purification. Theor Econ 5(3):369–402

  17. Ericson R, Pakes A (1995) Markov-perfect industry dynamics: a framework for empirical work. Rev Econ Stud 62(1):53–82

  18. Fershtman C, Pakes A (2000) A dynamic oligopoly with collusion and price wars. RAND J Econ 31(2):207–236

  19. Filar J, Vrieze K (1996) Competitive Markov decision processes. Springer, New York

  20. Filar JA, Schultz TA, Thuijsman F, Vrieze O (1991) Nonlinear programming and stationary equilibria in stochastic games. Math Program 50(1):227–237

  21. Fink AM (1964) Equilibrium in a stochastic n-person game. Hiroshima Math J 28:1

  22. Herings PJ-J, Peeters R (2010) Homotopy methods to compute equilibria in game theory. Econ Theory 42(1):119–156

  23. Herings PJ-J, Peeters RJ et al (2004) Stationary equilibria in stochastic games: structure, selection, and computation. J Econ Theory 118(1):32–60

  24. Hinderer K (2005) Lipschitz continuity of value functions in Markovian decision processes. Math Methods Oper Res 62(1):3–22

  25. Hoffman AJ, Karp RM (1966) On nonterminating stochastic games. Manage Sci 12(5):359–370

  26. Jaśkiewicz A, Nowak AS (2014) Robust Markov perfect equilibria. J Math Anal Appl 419(2):1322–1332

  27. Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London

  28. Kearns M, Singh S (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv Neural Inf Process Syst 871:996–1002

  29. Krupnik O, Mordatch I, Tamar A (2019) Multi-agent reinforcement learning with multi-step generative models. http://arxiv.org/abs/1901.10251

  30. Leonardos S, Overman W, Panageas I, Piliouras G (2021) Global convergence of multi-agent policy gradient in Markov potential games. http://arxiv.org/abs/2106.01969

  31. Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv Neural Inf Process Syst 33:12861

  32. Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp 157–163. Elsevier

  33. Littman ML (2001) Value-function reinforcement learning in Markov games. Cognit Syst Res 2(1):55–66

  34. Mailath GJ, Samuelson L (2006) Repeated games and reputations: long-run relationships. Oxford University Press, Oxford

  35. Maskin E, Tirole J (1988) A theory of dynamic oligopoly, I: overview and quantity competition with large fixed costs. Econometrica: J Econ Soc 549–569

  36. Maskin E, Tirole J (1988) A theory of dynamic oligopoly, II: price competition, kinked demand curves, and Edgeworth cycles. Econometrica: J Econ Soc 571–599

  37. Maskin E, Tirole J (2001) Markov perfect equilibrium: I. Observable actions. J Econ Theory 100(2):191–219

  38. Müller A (1997) How does the value function of a Markov decision process depend on the transition probabilities? Math Oper Res 22(4):872–885

  39. Müller A (1997) Integral probability metrics and their generating classes of functions. Adv Appl Probab 29(2):429–443

  40. Pakes A, Ostrovsky M, Berry S (2007) Simple estimators for the parameters of discrete dynamic games (with entry/exit examples). RAND J Econ 38(2):373–399

  41. Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial Intelligence and Statistics, pp 232–241. PMLR

  42. Pesendorfer M, Schmidt-Dengler P (2008) Asymptotic least squares estimators for dynamic games. Rev Econ Stud 75(3):901–928

  43. Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: International Conference on Autonomous Agents and Multiagent Systems, pp 1371–1379

  44. Rogers PD (1969) Nonzero-sum stochastic games. PhD thesis, University of California, Berkeley

  45. Sengupta S, Chowdhary A, Huang D, Kambhampati S (2019) General-sum Markov games for strategic detection of advanced persistent threats using moving target defense in cloud networks. In: International Conference on Decision and Game Theory for Security, pp 492–512. Springer

  46. Shapley LS (1953) Stochastic games. Proc Natl Acad Sci 39(10):1095–1100

  47. Shoham Y, Powers R, Grenager T (2003) Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University

  48. Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Neural Information Processing Systems, pp 5192–5202

  49. Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International Conference on Artificial Intelligence and Statistics, pp 2992–3002. PMLR

  50. Solan E (2021) A course in stochastic game theory. Cambridge University Press, Cambridge

  51. Song Z, Mei S, Bai Y (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? http://arxiv.org/abs/2110.04184

  52. Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B (2008) Injective Hilbert space embeddings of probability measures. In: Conference on Learning Theory

  53. Subramanian J, Sinha A, Seraj R, Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J Mach Learn Res 23:1–12

  54. Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: International Conference on Machine Learning, pp 216–224

  55. Takahashi M (1964) Equilibrium points of stochastic non-cooperative \(n\)-person games. Hiroshima Math J 28(1):95

  56. Tidball MM, Altman E (1996) Approximations in dynamic zero-sum games I. SIAM J Control Optim 34(1):311–328

  57. Tidball MM, Pourtallier O, Altman E (1997) Approximations in dynamic zero-sum games II. SIAM J Control Optim 35(6):2101–2117

  58. Vrieze OJ (1987) Stochastic games with finite state and action spaces. CWI, Amsterdam

  59. Wang T, Bao X, Clavera I, Hoang J, Wen Y, Langlois E, Zhang S, Zhang G, Abbeel P, Ba J (2019) Benchmarking model-based reinforcement learning. http://arxiv.org/abs/1907.02057

  60. Whitt W (1980) Representation and approximation of noncooperative sequential games. SIAM J Control Optim 18(1):33–48

  61. Zhang K, Kakade S, Başar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv Neural Inf Process Syst 33:1166

  62. Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Handbook of Reinforcement Learning and Control, pp 321–384

  63. Zhang R, Ren Z, Li N (2021) Gradient play in multi-agent Markov stochastic games: stationary points and convergence. arXiv e-prints, arXiv:2106

  64. Zhang W, Wang X, Shen J, Zhou M (2021) Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. In: International Joint Conference on Artificial Intelligence, Montreal, Canada

  65. Zinkevich M, Greenwald A, Littman M (2006) Cyclic equilibria in Markov games. In: Neural Information Processing Systems, pp 1641–1648


Author information


Corresponding author

Correspondence to Jayakumar Subramanian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work of Amit Sinha and Aditya Mahajan was supported in part by the Innovation for Defence Excellence and Security (IDEaS) Program of the Canadian Department of National Defence through grant CFPMN2-30.

A preliminary version of this work appeared in the 2021 Indian Control Conference.

This article is part of the topical collection “Multi-agent Dynamic Decision Making and Learning” edited by Konstantin Avrachenkov, Vivek S. Borkar and U. Jayakrishnan Nair.

Appendices

Appendix A: Proof of Theorem 4

Let \(V_*\) denote the optimal value function for MDP \({\mathcal {M}}\) and \(V_{\pi }\) denote the value function for policy \(\pi \) in MDP \({\mathcal {M}}\). Let \({\hat{\pi }}_*\) be the optimal policy for \(\widehat{{\mathcal {M}}}\) and \({\hat{\pi }}\) be an \(\alpha _{\textrm{opt}}\)-optimal policy for \(\widehat{{\mathcal {M}}}\). From the triangle inequality, we have

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } + \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } . \end{aligned}$$
(42)

Now we bound the three terms separately. For the first term, we have

$$\begin{aligned} \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty }&{\mathop {\le }\limits ^{(a)}}\max _{s\in \mathcal {S}} \biggl | \max _{a\in \mathcal {A}} \biggl [ (1-\gamma )r(s, a) + \gamma \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) V_*(s') \\&\qquad \qquad - (1-\gamma ){\hat{r}}(s, a) - \gamma \sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') \biggr ] \biggr | \\&{\mathop {\le }\limits ^{(b)}}(1-\gamma ) \max _{(s,a) \in \mathcal {S}\times \mathcal {A}} \bigl | r(s, a) - {\hat{r}}(s, a) \bigr | \\&\quad + \gamma \max _{(s,a) \in \mathcal {S}\times \mathcal {A}} \biggl | \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) V_*(s') - {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') \biggr | \\&\quad + \gamma \max _{(s,a) \in \mathcal {S}\times \mathcal {A}} \biggl | \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') - \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') \biggr | \\&{\mathop {\le }\limits ^{(c)}}(1 - \gamma ) \varepsilon + \gamma \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*}\Vert _{\infty } + \gamma \varDelta _{{\hat{\pi }}_*}, \end{aligned}$$

where (a) relies on the fact that \(\max f(x) \le \max | f(x) - g(x) | + \max g(x)\), (b) follows from the triangle inequality, and (c) follows from the definitions of \(\varepsilon \) and \(\varDelta _{{\hat{\pi }}_*}\). Therefore,

$$\begin{aligned} \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty } \le \varepsilon + \frac{\gamma \varDelta _{{\hat{\pi }}_*}}{1-\gamma }. \end{aligned}$$
(43)

For the second term of (42), we have

$$\begin{aligned} \Vert {\hat{V}}_{{\hat{\pi }}_{*}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \le \alpha _{\textrm{opt}}, \end{aligned}$$
(44)

since \({\hat{\pi }}\) is an \(\alpha _{\textrm{opt}}\)-optimal policy of \(\widehat{{\mathcal {M}}}\).

For the third term of (42), we have

$$\begin{aligned} \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty }&= \max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ (1-\gamma )r(s, a) + \gamma \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) V_{{\hat{\pi }}}(s') \\&\qquad \qquad - (1-\gamma ) {\hat{r}}(s,a) - \gamma \sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') \biggr ] \biggl | \\&{\mathop {\le }\limits ^{(d)}}(1-\gamma )\max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \bigl [ r(s,a) - {\hat{r}}(s, a) \bigr ] \biggr | \\&\quad + \gamma \max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ \sum _{s' \in \mathcal {S}} \bigl [ {\textsf{P}}(s' | s, a) V_{{\hat{\pi }}}(s') - {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') \bigr ] \biggr ] \biggr | \\&\quad + \gamma \max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ \sum _{s' \in \mathcal {S}} \bigl [ {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') - \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') \bigr ] \biggr ] \biggr | \\&{\mathop {\le }\limits ^{(e)}}(1-\gamma )\varepsilon + \gamma \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } + \gamma \varDelta _{{\hat{\pi }}} \end{aligned}$$

where (d) follows from the triangle inequality and (e) follows from the definitions of \(\varepsilon \) and \(\varDelta _{{\hat{\pi }}}\). Therefore,

$$\begin{aligned} \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \le \varepsilon + \frac{\gamma \varDelta _{{\hat{\pi }}}}{1-\gamma }. \end{aligned}$$
(45)

The result of Theorem 4 then follows by substituting (43)–(45) in (42). \(\square \)
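The three-term decomposition above lends itself to a direct numerical sanity check. The sketch below (our construction, not part of the paper) builds a small random MDP \({\mathcal {M}}\), perturbs it to obtain \(\widehat{{\mathcal {M}}}\), and verifies the combined bound from (43)–(45), namely \(\Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le 2\varepsilon + \alpha _{\textrm{opt}} + \gamma (\varDelta _{{\hat{\pi }}_*} + \varDelta _{{\hat{\pi }}})/(1-\gamma )\). The \(\varDelta \) terms are computed directly as the transition-mismatch maxima appearing in steps (c) and (e); since \({\hat{\pi }}\) is taken to be the greedy policy from value iteration on \(\widehat{{\mathcal {M}}}\), the two \(\varDelta \) terms coincide here.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9

# True model M (random) and a perturbed model M_hat.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
P_hat = P + 0.02 * rng.random((S, A, S)); P_hat /= P_hat.sum(axis=2, keepdims=True)
r_hat = np.clip(r + 0.02 * rng.standard_normal((S, A)), 0.0, 1.0)

def value_iteration(P, r, n_iter=3000):
    """Optimal value function and a greedy policy (normalized rewards)."""
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = (1 - gamma) * r + gamma * P @ V          # Q[s, a]
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

def policy_value(P, r, pi):
    """Exact value of a deterministic policy pi (one action per state)."""
    P_pi = P[np.arange(S), pi]
    r_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, (1 - gamma) * r_pi)

V_star, _ = value_iteration(P, r)                    # optimal value in M
V_hat_star, pi_hat = value_iteration(P_hat, r_hat)   # (near-)optimal policy in M_hat
V_pi_hat = policy_value(P, r, pi_hat)                # its value in the true model M
V_hat_pi_hat = policy_value(P_hat, r_hat, pi_hat)    # its value in M_hat

eps = np.abs(r - r_hat).max()                        # reward mismatch
Delta = np.abs((P - P_hat) @ V_hat_pi_hat).max()     # transition mismatch, as in steps (c), (e)
alpha_opt = np.abs(V_hat_star - V_hat_pi_hat).max()  # as in (44)

lhs = np.abs(V_star - V_pi_hat).max()
rhs = 2 * eps + alpha_opt + 2 * gamma * Delta / (1 - gamma)   # (43) + (44) + (45)
print(lhs, rhs, lhs <= rhs)
```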

Appendix B: Proof of Theorem 5

We follow the same notation as in Appendix A. In addition, let \(\mathcal {B}_*\) and \(\mathcal {B}_\pi \) denote the Bellman operators for the true model and let \({\hat{\mathcal {B}}}_*\) and \({{\hat{\mathcal {B}}}}_\pi \) denote the Bellman operators for the approximate model. As in Appendix A, the approximation error can be bounded as

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}}\Vert _{\infty } + \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty }. \end{aligned}$$
(46)

Recall that \(V_{{\hat{\pi }}} = \mathcal {B}_{{\hat{\pi }}} V_{{\hat{\pi }}}\) and \({\hat{V}}_{{\hat{\pi }}} = {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{{\hat{\pi }}}\). Moreover, \({\hat{V}}_* = {\hat{V}}_{{\hat{\pi }}_*}\). Therefore, from triangle inequality, we have

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty }&\le \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}}\Vert _{\infty } + \Vert \mathcal {B}_{{\hat{\pi }}} V_{{\hat{\pi }}} - \mathcal {B}_{{\hat{\pi }}} V_* \Vert _{\infty } \nonumber \\&\quad + \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty } + \Vert {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{*} \Vert _{\infty } + \Vert {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{*} - {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \nonumber \\&{\mathop {\le }\limits ^{(a)}}\Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}}\Vert _{\infty } + \gamma \Vert V_{{\hat{\pi }}} - V_* \Vert _{\infty } \nonumber \\&\quad + \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty } + \gamma \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + \gamma \Vert {\hat{V}}_{*} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \end{aligned}$$
(47)

where (a) follows from the contraction property of the Bellman operators. Rearranging terms and using (44), we get

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty }&\le \frac{1}{1-\gamma } \bigl [ (1+\gamma ) \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + (1 + \gamma ) \alpha _{\textrm{opt}} + \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty } \bigr ] \end{aligned}$$
(48)

The first term in (48) can be simplified in a manner similar to (43), where in step (b) we now add and subtract \(\sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) V_{\pi _*}(s')\). Using this, we obtain

$$\begin{aligned} \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } \le \varepsilon + \frac{\gamma {{\overline{\varDelta }}}_{\pi _*}}{1-\gamma }. \end{aligned}$$
(49)

The last term in (48) can be simplified as follows:

$$\begin{aligned} \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty }&\le \max _{s\in \mathcal {S}} \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ | r(s,a) - {\hat{r}}(s,a) | \nonumber \\&\qquad \qquad + \gamma \biggl | \sum _{s' \in \mathcal {S}} {\textsf{P}}(s'|s,a) V_*(s') - \sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) V_*(s') \biggr | \biggr ] \nonumber \\&\le \varepsilon + \gamma {{\overline{\varDelta }}}_{\pi _*} \end{aligned}$$
(50)

The result of Theorem 5 follows by substituting (49) and (50) in (48). \(\square \)
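For completeness, here is a minimal sketch of the policy Bellman operator used in this proof, together with a numerical check of the operator-difference bound (50). The model, policy, and test function are randomly generated for illustration, and the transition-mismatch maximum below stands in for \({{\overline{\varDelta }}}_{\pi _*}\), which upper-bounds it.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.85

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
P_hat = P + 0.05 * rng.random((S, A, S)); P_hat /= P_hat.sum(axis=2, keepdims=True)
r_hat = np.clip(r + 0.05 * rng.standard_normal((S, A)), 0.0, 1.0)

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)   # a stochastic policy pi(a|s)

def bellman_pi(P, r, V):
    """Policy Bellman operator with normalized rewards: (B_pi V)(s)."""
    Q = (1 - gamma) * r + gamma * P @ V
    return (pi * Q).sum(axis=1)

V = rng.random(S)                                   # any bounded test function, e.g. V_*
lhs = np.abs(bellman_pi(P, r, V) - bellman_pi(P_hat, r_hat, V)).max()
eps = np.abs(r - r_hat).max()
gap = np.abs((P - P_hat) @ V).max()                 # plays the role of Delta-bar in (50)
print(lhs, eps + gamma * gap, lhs <= eps + gamma * gap)
```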

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Subramanian, J., Sinha, A. & Mahajan, A. Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games. Dyn Games Appl 13, 56–88 (2023). https://doi.org/10.1007/s13235-023-00490-2
