
Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games


Abstract

Multi-agent reinforcement learning (MARL) is often modeled using the framework of Markov games (also called stochastic games or dynamic games). Most of the existing literature on MARL concentrates on zero-sum Markov games but is not applicable to general-sum Markov games. It is known that the best response dynamics in general-sum Markov games are not a contraction. Therefore, different equilibria in general-sum Markov games can have different values. Moreover, the Q-function is not sufficient to completely characterize the equilibrium. Given these challenges, model-based learning is an attractive approach for MARL in general-sum Markov games. In this paper, we investigate the fundamental question of sample complexity for model-based MARL algorithms in general-sum Markov games. We show two results. We first use Hoeffding inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-4} \alpha ^{-2})\) samples per state–action pair are sufficient to obtain an \(\alpha \)-approximate Markov perfect equilibrium with high probability, where \(\gamma \) is the discount factor, and the \(\tilde{{\mathcal {O}}}(\cdot )\) notation hides logarithmic terms. We then use Bernstein inequality-based bounds to show that \(\tilde{{\mathcal {O}}}( (1-\gamma )^{-1} \alpha ^{-2} )\) samples are sufficient. To obtain these results, we study the robustness of Markov perfect equilibrium to model approximations. We show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game and provide explicit bounds on the approximation error. We illustrate the results via a numerical example.
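To make the model-based (plug-in) approach concrete, the following sketch shows how an empirical model can be built from a generative simulator and how the two sample-complexity expressions above scale for particular values of \(\gamma \) and \(\alpha \). This is a minimal illustration only: the constants and logarithmic factors hidden by the \(\tilde{{\mathcal {O}}}(\cdot )\) notation are ignored, and the names generative_model and estimate_model are placeholders rather than objects defined in the paper.

```python
import numpy as np

def estimate_model(generative_model, num_states, num_joint_actions, n_samples, rng):
    """Plug-in (certainty-equivalent) estimate of the transition kernel.

    For every state and joint action, draw `n_samples` next states from the
    generative model and use empirical frequencies as the estimate P_hat.
    Rewards are assumed known here; they could be estimated in the same way.
    """
    P_hat = np.zeros((num_states, num_joint_actions, num_states))
    for s in range(num_states):
        for a in range(num_joint_actions):
            for _ in range(n_samples):
                s_next = generative_model(s, a, rng)   # one sampled transition
                P_hat[s, a, s_next] += 1.0
    return P_hat / n_samples

# Illustrative sample sizes per state-(joint) action pair; only the scaling in
# (1 - gamma) and alpha is meaningful, since O~ hides constants and log factors.
gamma, alpha = 0.9, 0.1
n_hoeffding = int(np.ceil((1 - gamma) ** -4 * alpha ** -2))  # ~ 1e6
n_bernstein = int(np.ceil((1 - gamma) ** -1 * alpha ** -2))  # ~ 1e3
print(n_hoeffding, n_bernstein)
```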


Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

  1. [65] construct a two-player general-sum game with the following properties. The game has two states: in state 1, player 1 has two actions and player 2 has one action; in state 2, player 1 has one action and player 2 has two actions. The transition probabilities are chosen such that there is a unique Markov perfect equilibrium in mixed strategies. This means that in state 1, both actions of player 1 maximize the Q-function; in state 2, both actions of player 2 minimize the Q-function. However, the Q-function in itself is insufficient to determine the randomizing probabilities for the mixed-strategy MPE.

  2. The plug-in estimator is also known as a certainty equivalent controller in the stochastic control literature.

  3. If \(\mu \) and \(\nu \) are absolutely continuous with respect to some measure \(\lambda \), let \(p = d\mu /d\lambda \) and \(q = d\nu /d\lambda \); then the total variation is typically defined as \(\tfrac{1}{2} \int _{\mathcal X}| p(x) - q(x)| \lambda (dx)\). This is consistent with our definition (a numerical check of the inequality below is sketched after these notes). Let \({{\bar{f}}} = ( \sup f + \inf f)/2\). Then

    $$\begin{aligned}&\left| \int _{\mathcal X} f d\mu - \int _{\mathcal X} f d\nu \right| = \left| \int _{\mathcal X} f(x) p(x) \lambda (dx) - \int _{\mathcal X} f(x) q(x) \lambda (dx) \right| \\&\quad = \left| \int _{\mathcal X} \bigl [ f(x) - {{\bar{f}}} \bigr ] \bigl [ p(x) - q(x) \bigr ] \lambda (dx) \right| \le \Vert f - {{\bar{f}}} \Vert _{\infty } \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx)\\&\quad \le \tfrac{1}{2} {{\,\textrm{span}\,}}(f) \int _{\mathcal X} \bigl | p(x) - q(x) \bigr | \lambda (dx). \end{aligned}$$
  4. For consistency with the normalized rewards considered in the game formulation (see Remark 1), we use normalized rewards for MDPs as well. Although most of the literature on MDPs uses unnormalized rewards, normalized rewards are commonly used in the literature on constrained MDPs [6].

  5. Recall that we are working with the normalized total expected reward (see Remark 1), while the results of [7] are derived for the unnormalized total reward. In the discussion above, we have normalized the results of [7].
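As noted in footnote 3, the span inequality used there is easy to verify numerically. The following sketch checks it on randomly generated discrete distributions; the discrete setup and the random test function are our own choices and are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
for _ in range(1000):
    p = rng.random(n); p /= p.sum()          # density of mu w.r.t. counting measure
    q = rng.random(n); q /= q.sum()          # density of nu
    f = rng.normal(size=n)                   # bounded test function
    lhs = abs(f @ p - f @ q)                 # |∫ f dmu - ∫ f dnu|
    rhs = 0.5 * (f.max() - f.min()) * np.abs(p - q).sum()  # (1/2) span(f) ∫ |p - q|
    assert lhs <= rhs + 1e-12
```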

References

  1. Acemoglu D, Robinson JA (2001) A theory of political transitions. Am Econ Rev 91(4):938–963

  2. Agarwal A, Kakade S, Yang LF (2020) Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on Learning Theory, pp 67–83. PMLR

  3. Aguirregabiria V, Mira P (2007) Sequential estimation of dynamic discrete games. Econometrica 75(1):1–53

  4. Akchurina N (2010) Multi-agent reinforcement learning algorithms. PhD thesis, University of Paderborn

  5. Albright SC, Winston W (1979) A birth-death model of advertising and pricing. Adv Appl Probab 11(1):134–152

  6. Altman E (1999) Constrained Markov decision processes: stochastic modeling. CRC Press, Boca Raton

  7. Azar MG, Munos R, Kappen HJ (2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach Learn 91(3):325–349

  8. Bajari P, Benkard CL, Levin J (2007) Estimating dynamic models of imperfect competition. Econometrica 75(5):1331–1370

  9. Başar T, Bernhard P (2008) H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, New York

  10. Başar T, Zaccour G (2018) Handbook of dynamic game theory. Springer International Publishing, New York

  11. Bertsekas DP (2017) Dynamic programming and optimal control. Athena Scientific, Belmont, MA

  12. Breton M (1991) Algorithms for stochastic games. Springer, New York

  13. Busoniu L, Babuska R, De Schutter B (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Trans Syst Man Cybern Part C (Appl Rev) 38(2):156–172

  14. Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge

  15. Deng X, Li Y, Mguni DH, Wang J, Yang Y (2021) On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. http://arxiv.org/abs/2109.01795

  16. Doraszelski U, Escobar JF (2010) A theory of regular Markov perfect equilibria in dynamic stochastic games: genericity, stability, and purification. Theor Econ 5(3):369–402

  17. Ericson R, Pakes A (1995) Markov-perfect industry dynamics: a framework for empirical work. Rev Econ Stud 62(1):53–82

  18. Fershtman C, Pakes A (2000) A dynamic oligopoly with collusion and price wars. RAND J Econ 31(2):207–236

  19. Filar J, Vrieze K (1996) Competitive Markov decision processes. Springer, New York

  20. Filar JA, Schultz TA, Thuijsman F, Vrieze O (1991) Nonlinear programming and stationary equilibria in stochastic games. Math Program 50(1):227–237

  21. Fink AM (1964) Equilibrium in a stochastic n-person game. Hiroshima Math J 28:1

  22. Herings PJ-J, Peeters R (2010) Homotopy methods to compute equilibria in game theory. Econ Theory 42(1):119–156

  23. Herings PJ-J, Peeters RJ et al (2004) Stationary equilibria in stochastic games: structure, selection, and computation. J Econ Theory 118(1):32–60

  24. Hinderer K (2005) Lipschitz continuity of value functions in Markovian decision processes. Math Methods Oper Res 62(1):3–22

  25. Hoffman AJ, Karp RM (1966) On nonterminating stochastic games. Manage Sci 12(5):359–370

  26. Jaśkiewicz A, Nowak AS (2014) Robust Markov perfect equilibria. J Math Anal Appl 419(2):1322–1332

  27. Kakade SM (2003) On the sample complexity of reinforcement learning. PhD thesis, University College London

  28. Kearns M, Singh S (1999) Finite-sample convergence rates for Q-learning and indirect algorithms. Adv Neural Inf Process Syst 871:996–1002

  29. Krupnik O, Mordatch I, Tamar A (2019) Multi-agent reinforcement learning with multi-step generative models. http://arxiv.org/abs/1901.10251

  30. Leonardos S, Overman W, Panageas I, Piliouras G (2021) Global convergence of multi-agent policy gradient in Markov potential games. http://arxiv.org/abs/2106.01969

  31. Li G, Wei Y, Chi Y, Gu Y, Chen Y (2020) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv Neural Inf Process Syst 33:12861

  32. Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: International Conference on Machine Learning, pp 157–163. Elsevier

  33. Littman ML (2001) Value-function reinforcement learning in Markov games. Cognit Syst Res 2(1):55–66

  34. Mailath GJ, Samuelson L (2006) Repeated games and reputations: long-run relationships. Oxford University Press, Oxford

  35. Maskin E, Tirole J (1988) A theory of dynamic oligopoly, I: overview and quantity competition with large fixed costs. Econometrica: J Econ Soc 549–569

  36. Maskin E, Tirole J (1988) A theory of dynamic oligopoly, II: price competition, kinked demand curves, and Edgeworth cycles. Econometrica: J Econ Soc 571–599

  37. Maskin E, Tirole J (2001) Markov perfect equilibrium: I. Observable actions. J Econ Theory 100(2):191–219

  38. Müller A (1997) How does the value function of a Markov decision process depend on the transition probabilities? Math Oper Res 22(4):872–885

  39. Müller A (1997) Integral probability metrics and their generating classes of functions. Adv Appl Probab 29(2):429–443

  40. Pakes A, Ostrovsky M, Berry S (2007) Simple estimators for the parameters of discrete dynamic games (with entry/exit examples). RAND J Econ 38(2):373–399

  41. Pérolat J, Strub F, Piot B, Pietquin O (2017) Learning Nash equilibrium for general-sum Markov games from batch data. In: Artificial Intelligence and Statistics, pp 232–241. PMLR

  42. Pesendorfer M, Schmidt-Dengler P (2008) Asymptotic least squares estimators for dynamic games. Rev Econ Stud 75(3):901–928

  43. Prasad H, LA P, Bhatnagar S (2015) Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: International Conference on Autonomous Agents and Multiagent Systems, pp 1371–1379

  44. Rogers PD (1969) Nonzero-sum stochastic games. PhD thesis, University of California, Berkeley

  45. Sengupta S, Chowdhary A, Huang D, Kambhampati S (2019) General-sum Markov games for strategic detection of advanced persistent threats using moving target defense in cloud networks. In: International Conference on Decision and Game Theory for Security, pp 492–512. Springer

  46. Shapley LS (1953) Stochastic games. Proc Natl Acad Sci 39(10):1095–1100

  47. Shoham Y, Powers R, Grenager T (2003) Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University

  48. Sidford A, Wang M, Wu X, Yang LF, Ye Y (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In: Neural Information Processing Systems, pp 5192–5202

  49. Sidford A, Wang M, Yang L, Ye Y (2020) Solving discounted stochastic two-player games with near-optimal time and sample complexity. In: International Conference on Artificial Intelligence and Statistics, pp 2992–3002. PMLR

  50. Solan E (2021) A course in stochastic game theory. Cambridge University Press, Cambridge

  51. Song Z, Mei S, Bai Y (2021) When can we learn general-sum Markov games with a large number of players sample-efficiently? http://arxiv.org/abs/2110.04184

  52. Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet GRG, Schölkopf B (2008) Injective Hilbert space embeddings of probability measures. In: Conference on Learning Theory

  53. Subramanian J, Sinha A, Seraj R, Mahajan A (2022) Approximate information state for approximate planning and reinforcement learning in partially observed systems. J Mach Learn Res 23:1–12

  54. Sutton RS (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: International Conference on Machine Learning, pp 216–224

  55. Takahashi M (1964) Equilibrium points of stochastic non-cooperative \(n\)-person games. Hiroshima Math J 28(1):95

  56. Tidball MM, Altman E (1996) Approximations in dynamic zero-sum games I. SIAM J Control Optim 34(1):311–328

  57. Tidball MM, Pourtallier O, Altman E (1997) Approximations in dynamic zero-sum games II. SIAM J Control Optim 35(6):2101–2117

  58. Vrieze OJ (1987) Stochastic games with finite state and action spaces. CWI, Amsterdam

  59. Wang T, Bao X, Clavera I, Hoang J, Wen Y, Langlois E, Zhang S, Zhang G, Abbeel P, Ba J (2019) Benchmarking model-based reinforcement learning. http://arxiv.org/abs/1907.02057

  60. Whitt W (1980) Representation and approximation of noncooperative sequential games. SIAM J Control Optim 18(1):33–48

  61. Zhang K, Kakade S, Başar T, Yang L (2020) Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. Adv Neural Inf Process Syst 33:1166

  62. Zhang K, Yang Z, Başar T (2021) Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Handbook of Reinforcement Learning and Control, pp 321–384

  63. Zhang R, Ren Z, Li N (2021) Gradient play in multi-agent Markov stochastic games: stationary points and convergence. arXiv e-prints, arXiv:2106

  64. Zhang W, Wang X, Shen J, Zhou M (2021) Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. In: International Joint Conference on Artificial Intelligence, Montreal, Canada

  65. Zinkevich M, Greenwald A, Littman M (2006) Cyclic equilibria in Markov games. In: Neural Information Processing Systems, pp 1641–1648


Author information


Corresponding author

Correspondence to Jayakumar Subramanian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work of Amit Sinha and Aditya Mahajan was supported in part by the Innovation for Defence Excellence and Security (IDEaS) Program of the Canadian Department of National Defence through grant CFPMN2-30.

A preliminary version of this work appeared in the 2021 Indian Control Conference.

This article is part of the topical collection “Multi-agent Dynamic Decision Making and Learning” edited by Konstantin Avrachenkov, Vivek S. Borkar and U. Jayakrishnan Nair.

Appendices

Appendix A: Proof of Theorem 4

Let \(V_*\) denote the optimal value function for MDP \({\mathcal {M}}\) and \(V_{\pi }\) denote the value function for policy \(\pi \) in MDP \({\mathcal {M}}\). Let \({\hat{\pi }}_*\) be the optimal policy for \(\widehat{{\mathcal {M}}}\) and \({\hat{\pi }}\) be an \(\alpha _{\textrm{opt}}\)-optimal policy for \(\widehat{{\mathcal {M}}}\). From the triangle inequality, we have

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } + \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } . \end{aligned}$$
(42)

Now we bound the three terms separately. For the first term, we have

$$\begin{aligned} \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty }&{\mathop {\le }\limits ^{(a)}}\max _{s\in \mathcal {S}} \biggl | \max _{a\in \mathcal {A}} \biggl [ (1-\gamma )r(s, a) + \gamma \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) V_*(s') \\&\qquad \qquad - (1-\gamma ){\hat{r}}(s, a) - \gamma \sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') \biggr ] \biggr | \\&{\mathop {\le }\limits ^{(b)}}(1-\gamma ) \max _{(s,a) \in \mathcal {S}\times \mathcal {A}} \bigl | r(s, a) - {\hat{r}}(s, a) \bigr | \\&\quad + \gamma \max _{(s,a) \in \mathcal {S}\times \mathcal {A}} \biggl | \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) V_*(s') - {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') \biggr | \\&\quad + \gamma \max _{(s,a) \in \mathcal {S}\times \mathcal {A}} \biggl | \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') - \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}_*}(s') \biggr | \\&{\mathop {\le }\limits ^{(c)}}(1 - \gamma ) \varepsilon + \gamma \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*}\Vert _{\infty } + \gamma \varDelta _{{\hat{\pi }}_*}, \end{aligned}$$

where (a) relies on the fact that \(\max f(x) \le \max | f(x) - g(x) | + \max g(x)\), (b) follows from the triangle inequality, and (c) follows from the definitions of \(\varepsilon \) and \(\varDelta _{{\hat{\pi }}_*}\). Therefore,

$$\begin{aligned} \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty } \le \varepsilon + \frac{\gamma \varDelta _{{\hat{\pi }}_*}}{1-\gamma }. \end{aligned}$$
(43)

For the second term of (42), we have

$$\begin{aligned} \Vert {\hat{V}}_{{\hat{\pi }}_{*}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \le \alpha _{\textrm{opt}}, \end{aligned}$$
(44)

since \({\hat{\pi }}\) is an \(\alpha _{\textrm{opt}}\)-optimal policy of \(\widehat{{\mathcal {M}}}\).

For the third term of (42), we have

$$\begin{aligned} \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty }&= \max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ (1-\gamma )r(s, a) + \gamma \sum _{s' \in \mathcal {S}} {\textsf{P}}(s' | s, a) V_{{\hat{\pi }}}(s') \\&\qquad \qquad - (1-\gamma ) {\hat{r}}(s,a) - \gamma \sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') \biggr ] \biggl | \\&{\mathop {\le }\limits ^{(d)}}(1-\gamma )\max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \bigl [ r(s,a) - {\hat{r}}(s, a) \bigr ] \biggr | \\&\quad + \gamma \max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ \sum _{s' \in \mathcal {S}} \bigl [ {\textsf{P}}(s' | s, a) V_{{\hat{\pi }}}(s') - {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') \bigr ] \biggr ] \biggr | \\&\quad + \gamma \max _{s\in \mathcal {S}} \biggl | \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ \sum _{s' \in \mathcal {S}} \bigl [ {\textsf{P}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') - \widehat{{\textsf{P}}}(s' | s, a) {\hat{V}}_{{\hat{\pi }}}(s') \bigr ] \biggr ] \biggr | \\&{\mathop {\le }\limits ^{(e)}}(1-\gamma )\varepsilon + \gamma \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } + \gamma \varDelta _{{\hat{\pi }}} \end{aligned}$$

where (d) follows from the triangle inequality and (e) follows from the definitions of \(\varepsilon \) and \(\varDelta _{{\hat{\pi }}}\). Therefore,

$$\begin{aligned} \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \le \varepsilon + \frac{\gamma \varDelta _{{\hat{\pi }}}}{1-\gamma }. \end{aligned}$$
(45)

The result of Theorem 4 then follows by substituting (43)–(45) in (42). \(\square \)
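The three-term decomposition above lends itself to a direct numerical sanity check. The sketch below (our construction, not part of the paper) builds a small random MDP \({\mathcal {M}}\), perturbs it to obtain \(\widehat{{\mathcal {M}}}\), and verifies the combined bound from (43)–(45), namely \(\Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le 2\varepsilon + \alpha _{\textrm{opt}} + \gamma (\varDelta _{{\hat{\pi }}_*} + \varDelta _{{\hat{\pi }}})/(1-\gamma )\). The \(\varDelta \) terms are computed directly as the transition-mismatch maxima appearing in steps (c) and (e); since \({\hat{\pi }}\) is taken to be the greedy policy from value iteration on \(\widehat{{\mathcal {M}}}\), the two \(\varDelta \) terms coincide here.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9

# True model M (random) and a perturbed model M_hat.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
P_hat = P + 0.02 * rng.random((S, A, S)); P_hat /= P_hat.sum(axis=2, keepdims=True)
r_hat = np.clip(r + 0.02 * rng.standard_normal((S, A)), 0.0, 1.0)

def value_iteration(P, r, n_iter=3000):
    """Optimal value function and a greedy policy (normalized rewards)."""
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = (1 - gamma) * r + gamma * P @ V          # Q[s, a]
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

def policy_value(P, r, pi):
    """Exact value of a deterministic policy pi (one action per state)."""
    P_pi = P[np.arange(S), pi]
    r_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, (1 - gamma) * r_pi)

V_star, _ = value_iteration(P, r)                    # optimal value in M
V_hat_star, pi_hat = value_iteration(P_hat, r_hat)   # (near-)optimal policy in M_hat
V_pi_hat = policy_value(P, r, pi_hat)                # its value in the true model M
V_hat_pi_hat = policy_value(P_hat, r_hat, pi_hat)    # its value in M_hat

eps = np.abs(r - r_hat).max()                        # reward mismatch
Delta = np.abs((P - P_hat) @ V_hat_pi_hat).max()     # transition mismatch, as in steps (c), (e)
alpha_opt = np.abs(V_hat_star - V_hat_pi_hat).max()  # as in (44)

lhs = np.abs(V_star - V_pi_hat).max()
rhs = 2 * eps + alpha_opt + 2 * gamma * Delta / (1 - gamma)   # (43) + (44) + (45)
print(lhs, rhs, lhs <= rhs)
```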

Appendix B: Proof of Theorem 5

We follow the same notation as in Appendix A. In addition, let \(\mathcal {B}_*\) and \(\mathcal {B}_\pi \) denote the Bellman operators for the true model and let \({\hat{\mathcal {B}}}_*\) and \({{\hat{\mathcal {B}}}}_\pi \) denote the Bellman operators for the approximate model. As in Appendix A, the approximation error can be bounded as

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty } \le \Vert V_* - {\hat{V}}_{{\hat{\pi }}_*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}}\Vert _{\infty } + \Vert V_{{\hat{\pi }}} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty }. \end{aligned}$$
(46)

Recall that \(V_{{\hat{\pi }}} = \mathcal {B}_{{\hat{\pi }}} V_{{\hat{\pi }}}\) and \({\hat{V}}_{{\hat{\pi }}} = {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{{\hat{\pi }}}\). Moreover, \({\hat{V}}_* = {\hat{V}}_{{\hat{\pi }}_*}\). Therefore, from triangle inequality, we have

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty }&\le \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}}\Vert _{\infty } + \Vert \mathcal {B}_{{\hat{\pi }}} V_{{\hat{\pi }}} - \mathcal {B}_{{\hat{\pi }}} V_* \Vert _{\infty } \nonumber \\&\quad + \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty } + \Vert {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{*} \Vert _{\infty } + \Vert {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{*} - {\hat{\mathcal {B}}}_{{\hat{\pi }}} {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \nonumber \\&{\mathop {\le }\limits ^{(a)}}\Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + \Vert {\hat{V}}_{{\hat{\pi }}_*} - {\hat{V}}_{{\hat{\pi }}}\Vert _{\infty } + \gamma \Vert V_{{\hat{\pi }}} - V_* \Vert _{\infty } \nonumber \\&\quad + \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty } + \gamma \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + \gamma \Vert {\hat{V}}_{*} - {\hat{V}}_{{\hat{\pi }}} \Vert _{\infty } \end{aligned}$$
(47)

where (a) follows from the contraction property of the Bellman operators. Rearranging terms and using (44), we get

$$\begin{aligned} \Vert V_* - V_{{\hat{\pi }}} \Vert _{\infty }&\le \frac{1}{1-\gamma } \bigl [ (1+\gamma ) \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } + (1 + \gamma ) \alpha _{\textrm{opt}} + \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty } \bigr ] \end{aligned}$$
(48)

The first term in (48) can be simplified in a manner similar to (43), where in step (b) we now add and subtract \(\sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) V_{\pi _*}(s')\). Using this, we obtain

$$\begin{aligned} \Vert V_* - {\hat{V}}_{*} \Vert _{\infty } \le \varepsilon + \frac{\gamma {{\overline{\varDelta }}}_{\pi _*}}{1-\gamma }. \end{aligned}$$
(49)

The last term in (48) can be simplified as follows:

$$\begin{aligned} \Vert \mathcal {B}_{{\hat{\pi }}} V_* - {\hat{\mathcal {B}}}_{{\hat{\pi }}} V_* \Vert _{\infty }&\le \max _{s\in \mathcal {S}} \sum _{a\in \mathcal {A}} {\hat{\pi }}(a| s) \biggl [ | r(s,a) - {\hat{r}}(s,a) | \nonumber \\&\qquad \qquad + \gamma \biggl | \sum _{s' \in \mathcal {S}} {\textsf{P}}(s'|s,a) V_*(s') - \sum _{s' \in \mathcal {S}} \widehat{{\textsf{P}}}(s' | s, a) V_*(s') \biggr | \biggr ] \nonumber \\&\le \varepsilon + \gamma {{\overline{\varDelta }}}_{\pi _*} \end{aligned}$$
(50)

The result of Theorem 5 follows by substituting (49) and (50) in (48). \(\square \)
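For completeness, here is a minimal sketch of the policy Bellman operator used in this proof, together with a numerical check of the operator-difference bound (50). The model, policy, and test function are randomly generated for illustration, and the transition-mismatch maximum below stands in for \({{\overline{\varDelta }}}_{\pi _*}\), which upper-bounds it.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.85

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
P_hat = P + 0.05 * rng.random((S, A, S)); P_hat /= P_hat.sum(axis=2, keepdims=True)
r_hat = np.clip(r + 0.05 * rng.standard_normal((S, A)), 0.0, 1.0)

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)   # a stochastic policy pi(a|s)

def bellman_pi(P, r, V):
    """Policy Bellman operator with normalized rewards: (B_pi V)(s)."""
    Q = (1 - gamma) * r + gamma * P @ V
    return (pi * Q).sum(axis=1)

V = rng.random(S)                                   # any bounded test function, e.g. V_*
lhs = np.abs(bellman_pi(P, r, V) - bellman_pi(P_hat, r_hat, V)).max()
eps = np.abs(r - r_hat).max()
gap = np.abs((P - P_hat) @ V).max()                 # plays the role of Delta-bar in (50)
print(lhs, eps + gamma * gap, lhs <= eps + gamma * gap)
```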

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Subramanian, J., Sinha, A. & Mahajan, A. Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games. Dyn Games Appl 13, 56–88 (2023). https://doi.org/10.1007/s13235-023-00490-2
