
Online Markov decision processes with non-oblivious strategic adversary


Abstract

We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external-regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well against oblivious adversaries, still applies and achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T \log (L)}+\tau ^2\sqrt{ T \log (\vert A \vert )})\), where L is the size of the adversary’s pure strategy set and \(\vert A \vert\) denotes the size of the agent’s action space. Considering real-world games where the support size of a Nash equilibrium (NE) is small, we further propose a new algorithm, MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T\log (L)}+\tau ^2\sqrt{ T k \log (k)})\), where k depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and thus can solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of a no-external-regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence to an NE. To the best of our knowledge, this is the first work to establish a last-iterate convergence result in OMDPs.


Notes

  1. In the multi-armed bandit setting, it is also impossible to achieve sublinear policy regret against all adaptive adversaries (see Theorem 1 in [24]).

  2. For the completeness of the paper, we provide the lemma in Appendix A.

  3. If the adversary does not follow the optimal bound (i.e., is irrational), then the regret bound of the agent will change accordingly.

  4. W.l.o.g., we consider the payoff (i.e., the negative of the loss) for the agent in our experiments, so that the agent aims to maximize the payoff.

References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Massachusetts: MIT Press.

  2. Laurent, G. J., Matignon, L., Fort-Piat, L., et al. (2011). The world of independent learners is not Markovian. International Journal of Knowledge-based and Intelligent Engineering Systems, 15(1), 55–64.

  3. Even-Dar, E., Kakade, S. M., & Mansour, Y. (2009). Online Markov decision processes. Mathematics of Operations Research, 34(3), 726–736.

  4. Dick, T., György, A., & Szepesvári, C. (2014). Online learning in Markov decision processes with changing cost sequences. In ICML (pp. 512–520).

  5. Neu, G., Antos, A., György, A., & Szepesvári, C. (2010). Online Markov decision processes under bandit feedback. In NeurIPS (pp. 1804–1812).

  6. Neu, G., & Olkhovskaya, J. (2020). Online learning in MDPs with linear function approximation and bandit feedback. arXiv e-prints, 2007.

  7. Yang, Y., & Wang, J. (2020). An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583

  8. Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2), 79–103.


  9. Shalev-Shwartz, S., et al. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107–194.


  10. Mertikopoulos, P., Papadimitriou, C., & Piliouras, G. (2018). Cycles in adversarial regularized learning. In Proceedings of the twenty-ninth annual ACM-SIAM symposium on discrete algorithms (pp. 2703–2717). SIAM.

  11. Bailey, J. P., & Piliouras, G. (2018). Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM conference on economics and computation (pp. 321–338).

  12. Dinh, L. C., Nguyen, T.-D., Zemhoho, A. B., & Tran-Thanh, L. (2021). Last round convergence and no-dynamic regret in asymmetric repeated games. In Algorithmic learning theory (pp. 553–577) PMLR.

  13. Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., & Piliouras, G. (2019). Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR 2019-7th international conference on learning representations (pp. 1–23).

  14. Leslie, D. S., Perkins, S., & Xu, Z. (2020). Best-response dynamics in zero-sum stochastic games. Journal of Economic Theory, 189, 105095.


  15. Guan, P., Raginsky, M., Willett, R., & Zois, D.-S. (2016). Regret minimization algorithms for single-controller zero-sum stochastic games. In 2016 IEEE 55th conference on decision and control (CDC) (pp 7075–7080). IEEE

  16. Neu, G., György, A., Szepesvári, C., & Antos, A. (2013). Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3), 676–691.

  17. Filar, J., & Vrieze, K. (1997). Applications and special classes of stochastic games. In Competitive Markov decision processes (pp. 301–341). New York: Springer.

  18. Puterman, M. L. (1990). Markov decision processes. Handbooks in Operations Research and Management Science, 2, 331–434.


  19. McMahan, H. B., Gordon, G. J., & Blum, A. (2003). Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 536–543).

  20. Dinh, L. C., Yang, Y., Tian, Z., Nieves, N. P., Slumbers, O., Mguni, D. H., & Wang, J. (2021). Online double oracle. arXiv preprint arXiv:2103.07780

  21. Wei, C.-Y., Hong, Y.-T., & Lu, C.-J. (2017). Online reinforcement learning in stochastic games. arXiv preprint arXiv:1712.00579

  22. Cheung, W. C., Simchi-Levi, D., & Zhu, R. (2019). Non-stationary reinforcement learning: The blessing of (more) optimism. Available at SSRN 3397818.

  23. Yu, J. Y., Mannor, S., & Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3), 737–757.


  24. Arora, R., Dekel, O., & Tewari, A. (2012). Online bandit learning against an adaptive adversary: from regret to policy regret. arXiv preprint arXiv:1206.6400

  25. Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge: Cambridge University Press.


  26. Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2007). Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20, 1729–1736.


  27. Daskalakis, C., Ilyas, A., Syrgkanis, V., & Zeng, H. (2017). Training GANs with optimism. arXiv preprint arXiv:1711.00141

  28. Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10), 1095–1100.


  29. Deng, X., Li, Y., Mguni, D. H., Wang, J., & Yang, Y. (2021). On the complexity of computing Markov perfect equilibrium in general-sum stochastic games. arXiv preprint arXiv:2109.01795

  30. Tian, Y., Wang, Y., Yu, T., & Sra, S. (2020). Online learning in unknown Markov games.

  31. von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1), 295–320.

  32. Nash, J. F., et al. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.


  33. Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1), 374–376.


  34. Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., & Jaderberg, M. (2020). Real world games look like spinning tops. arXiv preprint arXiv:2004.09468

  35. Perez-Nieves, N., Yang, Y., Slumbers, O., Mguni, D. H., Wen, Y., & Wang, J. (2021). Modelling behavioural diversity for learning in open-ended games. In International conference on machine learning (pp. 8514–8524). PMLR

  36. Liu, X., Jia, H., Wen, Y., Yang, Y., Hu, Y., Chen, Y., Fan, C., & Hu, Z. (2021). Unifying behavioral and response diversity for open-ended learning in zero-sum games. arXiv preprint arXiv:2106.04958

  37. Yang, Y., Luo, J., Wen, Y., Slumbers, O., Graves, D., Bou Ammar, H., Wang, J., & Taylor, M. E. (2021). Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 51–56).

  38. Bohnenblust, H., Karlin, S., & Shapley, L. (1950). Solutions of discrete, two-person games. Contributions to the Theory of Games, 1, 51–72.


  39. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.

  40. Daskalakis, C., & Panageas, I. (2019). Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th innovations in theoretical computer science.

  41. Conitzer, V., & Sandholm, T. (2007). Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1–2), 23–43.


  42. Chakraborty, D., & Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-agent Systems, 28(2), 182–213.



Author information


Corresponding author

Correspondence to Yaodong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs

We provide the following lemmas and proposition:

Lemma 7

(Lemma 3.3 in [3]) For all loss functions \({\varvec{l}}\) in [0, 1] and all policies \(\pi\), \(Q_{{\varvec{l}},\pi }(s,a) \le 3\tau\).

Lemma 8

(Lemma 1 from [16]) Consider a uniformly ergodic OMDP with mixing time \(\tau\) and losses \({{\varvec{l}}}_t \in [0,1]^{\varvec{d}}\). Then, for any \(T > 1\) and any policy \(\pi\) with stationary distribution \({\varvec{d}}_{\pi }\), it holds that

$$\begin{aligned} \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \le 2 \tau +2 . \end{aligned}$$

This lemma guarantees that, for a fixed policy, the loss evaluated at the policy’s stationary distribution is close to the loss the policy actually incurs.

When the policy is not fixed, the following lemma bounds the gap between the stationary distributions of the policies produced by algorithm \({\mathcal {A}}\) and the state distributions the algorithm actually induces:

Lemma 9

(Lemma 5.2 in [3]) Let \(\pi _1, \pi _2,\dots\) be the policies played by the MDP-E algorithm \({\mathcal {A}}\) and let \({\tilde{{\varvec{d}}}}_{{\mathcal {A}},t},\;{\tilde{{\varvec{d}}}}_{\pi _t} \in [0,1]^{|S|}\) be the corresponding stationary state distributions. Then,

$$\begin{aligned} \Vert {\tilde{{\varvec{d}}}}_{{\mathcal {A}},t}-{\tilde{{\varvec{d}}}}_{\pi _t}\Vert _1\le 2\tau ^2 \sqrt{\frac{\log (\vert A \vert )}{t}}+2e^{-t/\tau }. \end{aligned}$$

Since a policy’s stationary state-action distribution combines the stationary state distribution with the policy’s action probabilities in each state, the above lemma immediately yields:

$$\begin{aligned} \Vert {\varvec{v}}_t-{\varvec{d}}_{\pi _t}\Vert _1 \le \Vert {\tilde{{\varvec{d}}}}_{{\mathcal {A}},t}-{\tilde{{\varvec{d}}}}_{\pi _t}\Vert _1\le 2\tau ^2 \sqrt{\frac{\log (\vert A \vert )}{t}}+2e^{-t/\tau }. \end{aligned}$$

Proposition 8

For the MWU algorithm [8] with appropriate \(\mu _t\), we have:

$$\begin{aligned} R_T(\pi )= {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi _t)\right] - {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi )\right] \le M \sqrt{\frac{T \log (n)}{2}}, \end{aligned}$$

where \(\Vert {\varvec{l}}_t(.)\Vert \le M\). Furthermore, the strategy \({\varvec{\pi }}_t\) does not change quickly: \(\Vert {\varvec{\pi }}_t-{\varvec{\pi }}_{t+1}\Vert \le \sqrt{\frac{\log (n)}{t}}.\)

Proof

For a fixed T, if the loss function satisfies \(\Vert {\varvec{l}}_t(.)\Vert \le 1\), then by setting \(\mu _t=\sqrt{\frac{8 \log (n)}{T}}\) and following Theorem 2.2 in [25] we have:

$$\begin{aligned} R_T(\pi )= {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi _t)\right] - {\mathbb {E}} \left[ \sum _{t=1}^T {\varvec{l}}_t(\pi )\right] \le 1 \sqrt{\frac{T \log (n)}{2}}. \end{aligned}$$
(A1)

Thus, in the case where \(\Vert {\varvec{l}}_t(.)\Vert \le M\), scaling both sides of Eq. (A1) by M gives the first result of Proposition 8. For the second part, following the update rule of MWU we have:

$$\begin{aligned} \pi _{t+1}(i)-\pi _t(i)&=\pi _t(i)\left( \frac{\exp (-\mu _t {\varvec{l}}_t({\varvec{a}}^i))}{\sum _{i=1}^n {\varvec{\pi }}_t(i)\exp (-\mu _t {\varvec{l}}_t({\varvec{a}}^i))}-1\right) \nonumber \\&\approx \pi _t(i) \left( \frac{1-\mu _t{\varvec{l}}_t({\varvec{a}}^i)}{1-\mu _t{\varvec{l}}_t(\pi _t)}-1\right) \nonumber \\&=\mu _t \pi _t(i) \frac{{\varvec{l}}_t(\pi _t)-{\varvec{l}}_t({\varvec{a}}^i)}{1-\mu _t{\varvec{l}}_t(\pi _t)} = {\mathcal {O}}(\mu _t), \end{aligned}$$
(A2a)

where we use the approximation \(e^x\approx 1+x\) for small x in Eq. (A2a). Thus, the difference between two consecutive strategies \(\pi _t\) and \(\pi _{t+1}\) is proportional to the learning rate \(\mu _t\), which is set to \({\mathcal {O}}\big (\sqrt{\frac{\log (n)}{t}}\big )\). A similar result can be found in Proposition 1 in [3]. \(\square\)
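For concreteness, the following is a minimal sketch of the MWU update analysed in Proposition 8, assuming a finite set of n pure strategies and losses bounded by M. The learning-rate schedule \(\mu _t=\sqrt{\log (n)/t}\) and the helper loss_fn are illustrative assumptions, not the paper's implementation.

import numpy as np

def mwu_step(weights, losses, mu):
    # One MWU update: exponentially down-weight strategies by their loss.
    new_weights = weights * np.exp(-mu * losses)
    return new_weights / new_weights.sum()

def run_mwu(loss_fn, n, T, M=1.0):
    # Play T rounds of MWU over n pure strategies; loss_fn(pi, t) is an
    # assumed callback returning the loss vector (values in [0, M]) at round t.
    pi = np.ones(n) / n
    history = [pi]
    for t in range(1, T + 1):
        mu = np.sqrt(np.log(n) / t)               # O(sqrt(log(n)/t)) schedule
        losses = np.clip(loss_fn(pi, t), 0.0, M)
        pi = mwu_step(pi, losses, mu)
        history.append(pi)                        # ||pi_{t+1} - pi_t|| = O(mu_t)
    return history

Because each round multiplies the weights by \(\exp (-\mu _t {\varvec{l}}_t)\), consecutive strategies differ by at most \({\mathcal {O}}(\mu _t)\), matching the second claim of the proposition.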

Theorem

(Theorem 5) Suppose the agent uses Algorithm 2 in our online MDP setting. Then the regret in Eq. (1) can be bounded by:

$$\begin{aligned} R_T(\pi ) ={\mathcal {O}}(\tau ^2\sqrt{ T k \log (k)} +\sqrt{T\log (L)}). \end{aligned}$$

Proof

First, we bound the difference between the true loss and the loss with respect to the policy’s stationary distribution. Following Algorithm 2, at the start of each time interval \(T_i\) (i.e., a time interval in which the effective strategy set does not change), the learning rate is restarted to \({\mathcal {O}}(\sqrt{\log (i)/t_i})\), where i denotes the number of pure strategies in the effective strategy set during \(T_i\) and \(t_i\) is the relative position of the current round within that interval. Thus, following Lemma 5.2 in [3], in each time interval \(T_i\) the difference between the true loss and the loss with respect to the policy’s stationary distribution can be bounded as:

$$\begin{aligned} \begin{aligned} \sum _{t=t_{i-1}+1}^{t_i} \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert&\le \sum _{t=t_{i-1}+1}^{t_i} \Vert {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \Vert _1 \\&\le \sum _{t=1}^{T_i} 2\tau ^2 \sqrt{\frac{\log (i)}{t}}+2e^{-t/\tau } \\&\le 4\tau ^2 \sqrt{T_i\log (i)}+2(1+\tau ). \end{aligned} \end{aligned}$$

From this we have:

$$\begin{aligned} \begin{aligned} \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert&=\sum _{i=1}^k \sum _{t=t_{i-1}+1}^{t_i} \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert \\&\le \sum _{i=1}^k \left( 4\tau ^2 \sqrt{T_i\log (i)}+2(1+\tau )\right) \\&\le 4\tau ^2 \sqrt{Tk \log (k)}+2k(1+\tau ). \end{aligned} \end{aligned}$$

Following Lemma 1 from [16], we also have:

$$\begin{aligned} \sum _{t=1}^T\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \le 2 \tau +2. \end{aligned}$$

Thus the regret in Eq. (1) can be bounded by:

$$\begin{aligned} \begin{aligned} R_T(\pi )&\le \left( \sum _{t=1}^T \langle {\varvec{d}}_{\pi _t},{{\varvec{l}}}_t \rangle + \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert \right) -\left( \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi }, {\varvec{d}}_{\pi } \rangle - \sum _{t=1}^T\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \right) \\&= \left( \sum _{t=1}^T \langle {\varvec{d}}_{\pi _t},{{\varvec{l}}}_t \rangle -\sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi }, {\varvec{d}}_{\pi } \rangle \right) + \sum _{t=1}^T \vert \langle {{\varvec{l}}}_t, {\varvec{v}}_t-{\varvec{d}}_{\pi _t} \rangle \vert + \sum _{t=1}^T\vert \langle {{\varvec{l}}}_t, {\varvec{d}}_{\pi } -{\varvec{v}}_t^{\pi } \rangle \vert \\&\le 3 \tau \left( \sqrt{2 {T k \log (k)}} +\frac{k\log (k)}{8} \right) + \frac{\sqrt{T \log (L)}}{\sqrt{2}}+ 4\tau ^2 \sqrt{Tk \log (k)}+2k(1+\tau )+2\tau +2\\&={\mathcal {O}}(\tau ^2\sqrt{ T k \log (k)} +\sqrt{T\log (L)}). \end{aligned} \end{aligned}$$
(A3)

The proof is complete. \(\square\)
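The proof above relies on restarting the MWU learning rate whenever the effective strategy set grows. The following sketch illustrates only that bookkeeping, under assumed helpers observe_loss (returning the loss of each pure strategy in the current effective set) and best_response (returning an approximate best response); it is not the paper's Algorithm 2.

import numpy as np

def run_with_restarts(observe_loss, best_response, T):
    # Start from a single pure strategy; whenever a new strategy enters the
    # effective set (size i), the interval counter t_i restarts and the
    # learning rate becomes O(sqrt(log(i) / t_i)), as used in the proof.
    effective_set = [best_response(None)]
    weights = np.ones(1)
    t_i = 0
    for t in range(1, T + 1):
        i = len(effective_set)
        t_i += 1
        mu = np.sqrt(np.log(max(i, 2)) / t_i)
        pi = weights / weights.sum()
        losses = observe_loss(effective_set, pi)      # one loss per pure strategy
        weights = weights * np.exp(-mu * losses)
        br = best_response(losses)
        if br not in effective_set:                   # effective set grows:
            effective_set.append(br)                  # restart the interval
            weights = np.append(weights, weights.mean())
            t_i = 0
    return effective_set, weights / weights.sum()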

Theorem

(Theorem 6) Suppose the agent only has access to an \(\epsilon\)-best response in each iteration when following Algorithm 2. If the adversary follows a no-external-regret algorithm, then the average strategies of the agent and the adversary converge to an \(\epsilon\)-Nash equilibrium. Furthermore, the algorithm has \(\epsilon\)-regret.

Proof

Suppose that the player uses the Multiplicative Weights Update in Algorithm 2 with \(\epsilon\)-best response. Let \(T_1, T_2, \dots , T_k\) be the time windows during which the agent does not add a new strategy to its effective set. Since the set of strategies A is finite, k is finite. Furthermore,

$$\begin{aligned} \sum _{i=1}^k T_i=T. \end{aligned}$$

In time window \(T_i\), the regret with respect to the best strategy in the effective strategy set at that time is:

$$\begin{aligned} \begin{aligned} \sum _{t=\bar{T}_i}^{\bar{T}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle -\min _{\pi \in A_{{\bar{T}}_i+1}}\sum _{t=\vert \bar{T}_i\vert }^{\bar{T}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) , \end{aligned} \end{aligned}$$
(A4)

where \(\bar{T}_i=\sum _{j=1}^{i-1}T_j\). Since in the time window \(T_i\) the \(\epsilon\)-best response strategy stays in \(\Pi _{\bar{T}_i +1}\), we have:

$$\begin{aligned} \min _{\pi \in A_{{\bar{T}}_i+1}} \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle -\min _{\pi \in \Pi } \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T_i. \end{aligned}$$

Then, from Eq. (A4) we have:

$$\begin{aligned} \begin{aligned} \sum _{t={\bar{T}}_i}^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - \min _{\pi \in \Pi } \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) + \epsilon T_i. \end{aligned} \end{aligned}$$
(A5)

Summing Eq. (A5) over \(i=1,\dots ,k\) we have:

$$\begin{aligned}&\sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle -\sum _{i=1}^k \min _{\pi \in \Pi } \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \sum _{i=1}^k 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) + \epsilon T_i \nonumber \\&\quad \implies \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle -\min _{\pi \in \Pi } \sum _{i=1}^k \sum _{t=\vert {\bar{T}}_i\vert }^{{\bar{T}}_{i+1}} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T+ \sum _{i=1}^k 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) \end{aligned}$$
(A6a)
$$\begin{aligned}&\quad \implies \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - \min _{\pi \in \Pi } \sum _{t=1}^{T} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T+ \sum _{i=1}^k 3 \tau \left( \sqrt{2 {T_i \log (i)}} +\frac{\log (i)}{8} \right) \nonumber \\&\quad \implies \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - \min _{\pi \in \Pi } \sum _{t=1}^{T} \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi }\rangle \le \epsilon T + 3 \tau \left( \sqrt{2 {T k \log (k)}} +\frac{k\log (k)}{8} \right) . \end{aligned}$$
(A6b)

Inequality (A6a) is due to \(\sum \min \le \min \sum\). Inequality (A6b) comes from the Cauchy–Schwarz inequality and Stirling’s approximation. Using Inequality (A6b), we have:

$$\begin{aligned} \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle \ge \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) -\epsilon . \end{aligned}$$
(A7)

Since the adversary follows a no-regret algorithm, we have:

$$\begin{aligned} \begin{aligned}&\max _{{{\varvec{l}}} \in \Delta _L} \sum _{t=1}^T \langle {{\varvec{l}}}, {\varvec{d}}_{\pi _t} \rangle -\sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle \le \sqrt{\frac{T}{2}} \sqrt{\log (L)}\\&\quad \implies \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle \le \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle +\sqrt{\frac{ \log (L)}{2T}}. \end{aligned} \end{aligned}$$
(A8)

Using the Inequalities (A7) and (A8) we have:

$$\begin{aligned} \begin{aligned} \langle \bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }} \rangle&\ge \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle \ge \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle - 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) -\epsilon \\&\ge \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle - \sqrt{\frac{\log (L)}{2T}}- 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) -\epsilon . \end{aligned} \end{aligned}$$

Similarly, we also have:

$$\begin{aligned} \begin{aligned} \langle \bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }} \rangle&\le \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle \le \frac{1}{T} \sum _{t=1}^T \langle {{\varvec{l}}}_t^{\pi _t}, {\varvec{d}}_{\pi _t}\rangle +\sqrt{\frac{ \log (L)}{2T}}\\&\le \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle + 3\tau \left( \sqrt{\frac{2k \log (k)}{T}} +\frac{k \log (k)}{8T} \right) +\epsilon . \end{aligned} \end{aligned}$$

Taking the limit \(T \rightarrow \infty\), we obtain:

$$\begin{aligned} \max _{{{\varvec{l}}} \in \Delta _L} \langle {{\varvec{l}}}, \bar{{\varvec{d}}_{\pi }} \rangle -\epsilon \le \langle \bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }} \rangle \le \min _{\pi \in \Pi } \langle \bar{{{\varvec{l}}}}, {\varvec{d}}_{\pi } \rangle +\epsilon . \end{aligned}$$

Thus \((\bar{{{\varvec{l}}}}, \bar{{\varvec{d}}_{\pi }})\) is an \(\epsilon\)-Nash equilibrium of the game. \(\square\)
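To illustrate the average-iterate argument above, the following self-contained sketch specialises it to a zero-sum matrix game with no MDP dynamics: both players run MWU, and the duality gap of the time-averaged strategies shrinks towards zero, i.e. the averages approach an (approximate) Nash equilibrium. The game matrix, horizon, and step sizes are arbitrary illustrative choices, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(5, 7))   # agent's loss matrix (rows: agent, cols: adversary)
T = 20000
wx, wy = np.ones(5), np.ones(7)          # MWU weights for agent (min) and adversary (max)
x_avg, y_avg = np.zeros(5), np.zeros(7)
for t in range(1, T + 1):
    mu = np.sqrt(np.log(7) / t)
    x, y = wx / wx.sum(), wy / wy.sum()
    x_avg += x / T
    y_avg += y / T
    wx *= np.exp(-mu * (A @ y))          # agent minimises its expected loss
    wy *= np.exp(+mu * (A.T @ x))        # adversary maximises the agent's loss
    wx /= wx.sum()                       # renormalise to avoid numerical overflow
    wy /= wy.sum()
# Duality gap of the averaged strategies; a small gap means they are close to an NE.
gap = np.max(A.T @ x_avg) - np.min(A @ y_avg)
print(f"duality gap of averaged strategies: {gap:.4f}")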

Appendix B: Experiments

We provide further experimental results to compare the performance of MDP-OOE and MDP-E.

In Fig. 2, by considering a different number of loss vectors (\(L=7\)), we test whether the performance gap between MDP-OOE and MDP-E is consistent with respect to the number of loss vectors. As shown in Fig. 2, MDP-OOE also outperforms MDP-E when the number of loss functions is \(L=7\). This result further validates the advantage of MDP-OOE over MDP-E in the setting of a small NE support size.

In Fig. 3, we consider a larger agent action set in each state (\(\vert A \vert = 500\)). As shown in Fig. 3, the performance gap between MDP-OOE and MDP-E becomes more significant with the larger action set, in both cases \(L=3\) and \(L=7\), as expected from our theoretical results.

Fig. 2

Performance comparisons in average payoff in random games with \(L=7\)

Fig. 3

Performance comparisons in average payoff in random games

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dinh, L.C., Mguni, D.H., Tran-Thanh, L. et al. Online Markov decision processes with non-oblivious strategic adversary. Auton Agent Multi-Agent Syst 37, 15 (2023). https://doi.org/10.1007/s10458-023-09599-5


  • DOI: https://doi.org/10.1007/s10458-023-09599-5
