
Unified reinforcement Q-learning for mean field game and control problems

  • Original Article
  • Published:
Mathematics of Control, Signals, and Systems

Abstract

We present a Reinforcement Learning (RL) algorithm to solve infinite horizon asymptotic Mean Field Game (MFG) and Mean Field Control (MFC) problems. Our approach can be described as a unified two-timescale Mean Field Q-learning: the same algorithm can learn either the MFG or the MFC solution by simply tuning the ratio of two learning parameters. The algorithm is in discrete time and space, and the agent provides to the environment not only an action but also a distribution of the state, in order to take into account the mean field feature of the problem. Importantly, we assume that the agent cannot observe the population's distribution and needs to estimate it in a model-free manner. The asymptotic MFG and MFC problems are also presented in continuous time and space, and compared with classical (non-asymptotic or stationary) MFG and MFC problems. They lead to explicit solutions in the linear-quadratic (LQ) case that are used as benchmarks for the results of our algorithm.



References

  1. Anahtarci B, Kariksiz CD, Saldi N (2020) Q-learning in regularized mean-field games. arXiv preprint arXiv:2003.12151

  2. Angiuli A, Fouque J-P, Laurière M (2021) Reinforcement learning for mean field games, with applications to economics. To appear in the Handbook on Machine Learning in Financial Markets: A guide to contemporary practises, editors: A. Capponi and C.-A. Lehalle, Cambridge University Press.

  3. Bellman RE, Dreyfus SE (2015) Applied dynamic programming, vol 2050. Princeton University Press

  4. Bensoussan A, Frehse J, Yam P (2013) Mean field games and mean field type control theory. Springer Briefs in Mathematics. Springer, New York

  5. Borkar VS (1997) Stochastic approximation with two time scales. Syst Control Lett 29(5):291–294

  6. Borkar VS (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi

  7. Cardaliaguet P, Hadikhanloo S (2017) Learning in mean field games: the fictitious play. ESAIM Control Optim Calc Var 23:569–591

  8. Carmona R, Delarue F (2018) Probabilistic theory of mean field games with applications I–II. Springer

  9. Carmona R, Laurière M (2019) Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: I–the ergodic case. arXiv preprint arXiv:1907.05980

  10. Carmona R, Laurière M (2019) Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II–the finite horizon case. arXiv preprint arXiv:1908.01613

  11. Carmona R, Laurière M, Tan Z (2019) Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. Preprint

  12. Carmona R, Laurière M, Tan Z (2019) Mean-field MDP and mean-field Q-learning: model-free mean-field reinforcement learning. Preprint

  13. Elie R, Perolat J, Laurière M, Geist M, Pietquin O (2020) On the convergence of model free learning in mean field games. In: Proceedings of AAAI

  14. Even-Dar E, Mansour Y (2003) Learning rates for Q-learning. J Mach Learn Res 5(Dec):1–25

  15. Fouque JP, Zhang Z (2020) Deep learning methods for mean field control problems with delay. Front Appl Math Stat 6(11)

  16. Fu Z, Yang Z, Chen Y, Wang Z (2019) Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498

  17. Gu H, Guo X, Wei X, Xu R (2019) Dynamic programming principles for learning MFCS. arXiv preprint arXiv:1911.07314

  18. Gu H, Guo X, Wei X, Xu R (2020) Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. arXiv preprint arXiv:2002.04131

  19. Guo X, Hu A, Xu R, Zhang J (2019) Learning mean-field games. In: Advances in neural information processing systems, pp 4966–4976

  20. Han J, Hu R (2020) Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games. arXiv:1912.01809

  21. Huang M, Caines PE, Malhamé RP (2007) Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized \(\epsilon \)-Nash equilibria. IEEE Trans Autom Control 52(9):1560–1571

  22. Huang M, Malhamé RP, Caines PE (2006) Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Commun Inf Syst 6(3):221–251

  23. Lasry J-M, Lions P-L (2007) Mean field games. Jpn J Math 2(1):229–260

  24. Mguni D, Jennings J, de Cote EM (2018) Decentralised learning in systems with many, many strategic agents. In: Thirty-second AAAI conference on artificial intelligence

  25. Motte M, Pham H (2019) Mean-field Markov decision processes with common noise and open-loop controls. arXiv preprint arXiv:1912.07883

  26. Perrin S, Pérolat J, Laurière M, Geist M, Elie R, Pietquin O (2020) Fictitious play for mean field games: continuous time analysis and applications. In preparation

  27. Subramanian J, Mahajan A (2019) Reinforcement learning in stationary mean-field games. In: Proceedings. 18th international conference on autonomous agents and multiagent systems

  28. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press

  29. Watkins CJCH (1989) Learning from delayed rewards. PhD thesis, King’s College, Cambridge

  30. Xie Q, Yang Z, Wang Z, Minca A (2020) Provable fictitious play for general mean-field games. arXiv preprint arXiv:2010.04211

  31. Yang J, Ye X, Trivedi R, Xu H, Zha H (2018) Deep mean field games for learning optimal behavior policy of large populations. In: International conference on learning representations

  32. Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J (2018) Mean field multi-agent reinforcement learning. In: International conference on machine learning, pp 5567–5576

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mathieu Laurière.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

J. Fouque: Work supported by NSF Grant DMS-1814091. M. Laurière: Work supported by NSF Grant DMS-1716673 and ARO Grant W911NF-17-1-0578.

Published in the topical collection Machine Learning for Control Systems and Optimal Control

Appendices

Theoretical solutions for the benchmark examples

In this appendix, the solutions of the following benchmark problems are presented for the linear-quadratic models given by (15).

  1. A.1 Non-asymptotic Mean Field Game,

  2. A.2 Asymptotic Mean Field Game,

  3. A.3 Stationary Mean Field Game,

  4. A.4 Non-asymptotic Mean Field Control,

  5. A.5 Asymptotic Mean Field Control,

  6. A.6 Stationary Mean Field Control.

In particular, we check that the relations (3) and (4) are satisfied. The explicit formulas for the optimal controls (AMFG and AMFC) are used as benchmarks for our algorithm.

1.1 Solution for non-asymptotic MFG

We present the solution for the following MFG problem

  1. Fix \({\varvec{m}}=(m_t)_{t\ge 0} \subset {\mathbb {R}}\) and solve the stochastic control problem:

    $$\begin{aligned} \min _{\varvec{ \alpha }}J^{{\varvec{m}}}(\varvec{ \alpha })&=\min _{\varvec{ \alpha }} {\mathbb {E}}\left[ \int _0^{\infty } e^{-\beta t}f(X^{\varvec{ \alpha }}_t,\alpha _t,m_t)dt \right] \\&=\min _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _0^{+\infty }e^{-\beta t }\left( \frac{1}{2}\alpha _t^2 + c_1 \left( X_t^{\varvec{ \alpha }}- c_2 m_t \right) ^2 + c_3 \left( X_t^{\varvec{ \alpha }}- c_4 \right) ^2+c_5 m_t^2 \right) \hbox {d}t \right] , \\ \text {subject to }&\\ dX^{\varvec{ \alpha }}_t&=\alpha _t \hbox {d}t +\sigma dW_t, \\ X^{\varvec{ \alpha }}_0&\sim \mu _0. \end{aligned}$$
  2. Find the fixed point, \(\varvec{{{\hat{m}}}}=({{\hat{m}}}_t)_{t\ge 0}\), such that \({\mathbb {E}}\left[ X_t^{\varvec{ {{\hat{\alpha }}}}}\right] ={{\hat{m}}}_t\) for all \( t\ge 0\).

This problem can be solved by two equivalent approaches: PDE and FBSDEs. Both approaches start by solving the problem defined on a finite horizon T. Then, the solution to the infinite horizon problem is obtained by taking the limit as T goes to infinity. Let \(V^{{\varvec{m}}^T,T}(t,x)\) be the optimal value function for the finite horizon problem conditioned on \(X_0=x\), i.e.,

$$\begin{aligned} V^{{\varvec{m}}^T,T}(t,x)= & {} \inf _{\varvec{ \alpha }}J^{{\varvec{m}},x}(\varvec{ \alpha })\\&=\inf _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _t^{T}e^{-\beta s }f(X_s^{\varvec{\alpha }},\alpha _s,m^T_s) ds\Big |X_0^{\varvec{ \alpha }}=x\right] , \quad V^{{\varvec{m}}^T,T}(T,x)=0. \end{aligned}$$

where \({\varvec{m}}^T=\{ m_t^T \}_{0 \le t \le T}\subset {\mathbb {R}}.\) Let us consider the following ansatz with its derivatives

$$\begin{aligned} \begin{aligned} V^{{\varvec{m}}^T,T}(t,x)&= \varGamma _2^T (t) x^2 + \varGamma _1^T (t) x + \varGamma _0^T (t), \\ \partial _t V^{{\varvec{m}}^T,T}(t,x)&= {\dot{\varGamma }}_2^T (t) x^2 + {\dot{\varGamma }}_1^T (t) x + {\dot{\varGamma }}_0^T (t), \\ \partial _x V^{{\varvec{m}}^T,T}(t,x)&= 2 \varGamma _2^T (t) x + \varGamma _1^T (t) , \\ \partial _{xx}V^{{\varvec{m}}^T,T}(t,x)&=2 \varGamma _2^T (t), \end{aligned} \end{aligned}$$
(17)

Then, the HJB equation for the value function reads:

$$\begin{aligned}&\partial _t V^{{\varvec{m}}^T,T} - \beta V^{{\varvec{m}}^T,T} + \inf _{\alpha } \{{\mathcal {A}}^X V^{{\varvec{m}}^T,T} + f(x,\alpha ,m^T)\}\\&\qquad \qquad =\partial _t V^{{\varvec{m}}^T,T} - \beta V^{{\varvec{m}}^T,T} + \inf _{\alpha } \left\{ \alpha \partial _x V^{{\varvec{m}}^T,T}\right. \\&\qquad \qquad \left. \quad +\frac{1}{2}\sigma ^2 \partial _{xx} V^{{\varvec{m}}^T,T} + \frac{1}{2}\alpha ^2 + c_1 (x- c_2 m^T)^2 + c_3 (x - c_4)^2 +c_5 (m^T)^2\right\} \\&\qquad \qquad =\partial _t V^{{\varvec{m}}^T,T} - \beta V^{{\varvec{m}}^T,T} + \left\{ - {\partial _x V^{{\varvec{m}}^T,T}}^2 +\frac{1}{2}\sigma ^2 \partial _{xx}V^{{\varvec{m}}^T,T} \right. \\&\qquad \qquad \left. \quad + \frac{1}{2}{\partial _x V^{{\varvec{m}}^T,T}}^2 + c_1 (x- c_2 m^T)^2+ c_3 (x - c_4 )^2+ c_5 (m^T)^2\right\} \\&\qquad \qquad =\partial _t V^{{\varvec{m}}^T,T} - \beta V^{{\varvec{m}}^T,T} - \frac{1}{2}{\partial _x V^{{\varvec{m}}^T,T}}^2\\&\qquad \qquad \quad +\frac{1}{2}\sigma ^2 \partial _{xx} V^{{\varvec{m}}^T,T} + c_1 (x-c_2 m^T)^2 + c_3 (x - c_4)^2 + c_5 (m^T)^2= 0, \end{aligned}$$

where in the third line we evaluated the infimum at \({{\hat{\alpha }}}^{T}= -\partial _x V^{{\varvec{m}}^T,T}\). The following system of ODEs is obtained by replacing the ansatz and its derivatives in the HJB equation:

$$\begin{aligned} {\left\{ \begin{array}{ll} {{{\dot{\varGamma }}}}_2^T -2({{ \varGamma }^T_2})^2 - \beta { \varGamma }^T_2 + c_1 + c_3 =0, \quad &{}{ \varGamma }_2^T(T) = 0, \\ {{{\dot{\varGamma }}}}^T_1 = (2 { \varGamma }^T_2 + \beta ) { \varGamma }^T_1 + 2 c_1 c_2 m^T + 2 c_3 c_4, \quad &{}{ \varGamma }^T_1(T)=0, \\ {{{\dot{\varGamma }}}}^T_0 = \beta { \varGamma }^T_0 + \frac{1}{2}({{ \varGamma }^T_1})^2\\ - \sigma ^2 { \varGamma }^T_2 -c_3 {c_4}^2 - (c_1 {c_2}^2 +c_5) ({m^T})^2 , \quad &{}{ \varGamma }^T_0(T) = 0,\\ \dot{m}^T = - 2 {\varGamma }^T_2 m^T - { \varGamma }^T_1, \quad &{}m^T(0)= {\mathbb {E}}\left[ \mu _0\right] =m_0,\\ \end{array}\right. } \end{aligned}$$
(18)

where the last equation is obtained by considering the expectation of \(X_t^{\varvec{\alpha }}\) after replacing \({{\hat{\alpha }}}^{T} = -\partial _x V^{{\varvec{m}}^T,T} = - (2\varGamma ^T_2 x + \varGamma ^T_1)\). The first equation is a Riccati equation. In particular, the solution \(\varGamma ^T_2\) converges to \({{\hat{\varGamma }}}_2=\frac{-\beta + \sqrt{\beta ^2+8(c_1 + c_3)}}{4}\) as T goes to infinity. The second and fourth ODEs are coupled, and they can be written in matrix notation as

$$\begin{aligned} \frac{d}{dt}\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} = \underbrace{\begin{bmatrix} - 2 \varGamma _2^T(t) &{} -1 \\ 2 c_1 c_2 &{} 2 \varGamma _2^T(t) + \beta \end{bmatrix}}_{=:K^T_t} \begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} + \begin{pmatrix} 0 \\ 2 c_3 c_4 \end{pmatrix}. \end{aligned}$$
(19)
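As a quick sanity check on the first (Riccati) equation of (18), one can integrate it backward from the terminal condition \(\varGamma _2^T(T)=0\) and compare with the stated limit \({{\hat{\varGamma }}}_2\). A minimal numerical sketch, with illustrative parameter values (not taken from the paper's experiments):

```python
import numpy as np

# Backward integration of the Riccati equation from (18):
#   dGamma2/dt = 2*Gamma2^2 + beta*Gamma2 - (c1 + c3),  Gamma2(T) = 0.
# Parameters are illustrative, not taken from the paper's experiments.
beta, c1, c3 = 1.0, 0.25, 0.5
T, dt = 50.0, 1e-3

gamma2 = 0.0                          # terminal condition Gamma2(T) = 0
for _ in range(int(T / dt)):          # step backward from t = T to t = 0
    gamma2 -= dt * (2.0 * gamma2**2 + beta * gamma2 - (c1 + c3))

gamma2_hat = (-beta + np.sqrt(beta**2 + 8.0 * (c1 + c3))) / 4.0
print(gamma2, gamma2_hat)             # the two values agree
```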

We start by solving the homogeneous equation, i.e.,

$$\begin{aligned} \frac{d}{dt}\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} = K^T_t \begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}. \end{aligned}$$
(20)

We introduce the propagator \(P^T\), i.e.,

$$\begin{aligned} {\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}} = P^T_t \begin{pmatrix} m^T(0) \\ \varGamma _1^T(0) \end{pmatrix}. \end{aligned}$$
(21)

By differentiating \(\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}\) in (21) and expressing the initial conditions in terms of the inverse of \(P^T\) and \(\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}\), we obtain

$$\begin{aligned} \frac{d}{dt}\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} = \dot{P}^T_t ({P^T_t})^{-1} \begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}. \end{aligned}$$
(22)

By comparing the last system with (20), we obtain

$$\begin{aligned} {\left\{ \begin{array}{ll} \dot{P^T_t} &{}= K^T_t P^T_t\\ P^T_0 &{}= \mathbb {I}_2 \end{array}\right. } \end{aligned}$$
(23)

where \(\mathbb {I}_2\) is the identity matrix in dimension 2. The solution is given by \(P^T_t=e^{\int _0^t K^T_s ds } :=e^{L^T_t}.\) In particular, the exponent is equal to

$$\begin{aligned} L^T_t= \int _0^t K^T_sds = \begin{bmatrix} - 2 \int _0^t \varGamma _2^T(s)ds &{} -t \\ 2 c_1 c_2 t &{} 2 \int _0^t \varGamma _2^T(s)ds + \beta t \end{bmatrix}= \begin{bmatrix} g_t^T &{} d_t \\ b_t &{} a_t^T \end{bmatrix}. \end{aligned}$$
(24)

We evaluate the exponential \(P^T(t)= e^{L^T_t}\) by using a Taylor expansion and diagonalizing the matrix \(L^T_t\). The eigenvalues/eigenvectors of \(L^T_t\) are given by

$$\begin{aligned}&\lambda ^T_{1\backslash 2,t} :=\frac{a_t^T+g_t^T \pm \sqrt{(a_t^T-g_t^T)^2 + 4b_t d_t}}{2},\nonumber \\&\quad v^T_{1,t}:=\begin{pmatrix} d_t \\ \lambda ^T_{1,t} - g_t^T \end{pmatrix}, \quad v^T_{2,t}:=\begin{pmatrix} d_t \\ \lambda ^T_{2,t} - g^T_t \end{pmatrix}. \end{aligned}$$
(25)

\(P^T_t\) is then obtained as

$$\begin{aligned} P^T_t= & {} \begin{pmatrix} p^T_t(1,1) &{}\quad p^T_t(1,2) \\ p^T_t(2,1) &{}\quad p^T_t(2,2) \end{pmatrix}\nonumber \\= & {} e^{L^T_t}= \sum _{k=0}^{\infty } \begin{bmatrix} v^T_{1,t}&v^T_{2,t} \end{bmatrix} \frac{\begin{pmatrix} \lambda ^T_{1,t} &{} 0 \\ 0 &{} \lambda ^T_{2,t} \end{pmatrix}^k}{k! }\begin{bmatrix} v^T_{1,t}&v^T_{2,t} \end{bmatrix}^{-1} \nonumber \\:= & {} S^T_t \sum _{k=0}^{\infty } \frac{{D^T_t}^k}{k!} ({S^T_t})^{-1}\nonumber \\= & {} S^T_t \begin{pmatrix} e^{\lambda ^T_{1,t}} &{} 0 \\ 0 &{} e^{\lambda ^T_{2,t}} \end{pmatrix}({S^T_t})^{-1}\nonumber \\= & {} \frac{1}{d_t(\lambda ^T_{2,t} - \lambda ^T_{1,t})}\nonumber \\&\quad \begin{pmatrix} d_t e^{\lambda ^T_{1,t}}(\lambda ^T_{2,t} - g^T_t)+ d_t e^{\lambda ^T_{2,t}}(g^T_t-\lambda ^T_{1,t}) &{} d_t^2(e^{\lambda ^T_{2,t}}-e^{\lambda ^T_{1,t}}) \\ (\lambda ^T_{1,t} - g^T_t) (\lambda ^T_{2,t} - g^T_t) (e^{\lambda ^T_{1,t}}-e^{\lambda ^T_{2,t}}) &{} d_t e^{\lambda ^T_{2,t}}(\lambda ^T_{2,t} - g^T_t) + d_t e^{\lambda ^T_{1,t}}(g^T_t-\lambda ^T_{1,t}) \end{pmatrix}.\nonumber \\ \end{aligned}$$
(26)
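The closed form (26) can be cross-checked against a truncated Taylor series of the matrix exponential. A small sketch, freezing \(\varGamma _2^T\) at its limit \({{\hat{\varGamma }}}_2\) so that the entries of \(L^T_t\) are those of (24), with illustrative constants:

```python
import numpy as np

# Entries of L_t from (24), with Gamma2^T frozen at its limit; constants illustrative.
beta, c1, c2, c3, t = 1.0, 0.25, 1.0, 0.5, 0.7
gamma2 = (-beta + np.sqrt(beta**2 + 8.0 * (c1 + c3))) / 4.0
L = np.array([[-2.0 * gamma2 * t, -t],
              [2.0 * c1 * c2 * t, (2.0 * gamma2 + beta) * t]])

# Matrix exponential by truncated Taylor series: P = sum_k L^k / k!
P_series, term = np.eye(2), np.eye(2)
for k in range(1, 60):
    term = term @ L / k
    P_series = P_series + term

# Matrix exponential by diagonalization, as in (26): P = S e^D S^{-1}
lam, S = np.linalg.eig(L)
P_eig = (S @ np.diag(np.exp(lam)) @ np.linalg.inv(S)).real

print(np.abs(P_series - P_eig).max())   # ~1e-15: both computations agree
```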

In order to solve the non-homogeneous case, we introduce an extra term \(\begin{pmatrix} h_1^T \\ h_2^T \end{pmatrix}\), i.e.,

$$\begin{aligned} {\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}} = P^T_t \begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix}. \end{aligned}$$
(27)

By differentiating \( {\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix}}\) in (27), we obtain

$$\begin{aligned} \frac{d}{dt}\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} = \dot{P}^T_t \begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix} + P^T_t \frac{d}{dt}\begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix} = K^T_t P^T_t \begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix} + P^T_t \frac{d}{dt}\begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix}. \end{aligned}$$
(28)

By comparing (19) with (28), we obtain

$$\begin{aligned} P^T_t \frac{d}{dt}\begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 2 c_3 c_4 \end{pmatrix}, \qquad \text {i.e.,}\qquad \frac{d}{dt}\begin{pmatrix} h^T_1 \\ h^T_2 \end{pmatrix} = \frac{2 c_3 c_4}{|P^T_t|} \begin{pmatrix} -p^T_t(1,2) \\ p^T_t(1,1) \end{pmatrix}. \end{aligned}$$
(29)

By integration, we obtain

$$\begin{aligned} \begin{aligned} h_1^T(t)&=h_1^T(0)-2c_3c_4\int _0^t \frac{p_s^T(1,2)}{|P_s^T|}ds,\\ h_2^T(t)&=h_2^T(0)+2c_3c_4\int _0^t \frac{p_s^T(1,1)}{|P_s^T|}ds, \end{aligned} \end{aligned}$$
(30)

where \(h_1^T(0)=m_0\) and \(h_2^T(0)=\varGamma _1^T(0)\).

We use the terminal condition \(\varGamma _1^T(T)=0\) to obtain an evaluation of \(h_2^T(0)=\varGamma _1^T(0)\) in terms of \(P^T_T\) and \(m_0\), i.e.,

$$\begin{aligned} \begin{aligned} \varGamma _1^T(T)&=p^T_T(2,1)h^T_1(T) + p^T_T(2,2) h^T_2(T)=0, \\ \varGamma _1^T(T)&=p^T_T(2,1)\\&\qquad \left( m_0-2c_3c_4\int _0^T \frac{p_s^T(1,2)}{|P_s^T|}ds \right) \\&\qquad + p^T_T(2,2) \left( \varGamma _1^T(0)+2c_3c_4\int _0^T \frac{p_s^T(1,1)}{|P_s^T|}ds \right) =0, \\ \varGamma _1^T(0)&= -\frac{p^T_T(2,1)}{p^T_T(2,2)} \left( m_0-2c_3c_4\int _0^T \frac{p_s^T(1,2)}{|P_s^T|}ds \right) -2c_3c_4\int _0^T \frac{p_s^T(1,1)}{|P_s^T|}ds . \end{aligned} \end{aligned}$$
(31)

In order to evaluate the limit of \(\varGamma _1^T(0)\) as T goes to infinity, we analyze the different terms separately. First, we evaluate the following limit:

$$\begin{aligned} \lim _{T\rightarrow \infty }\frac{1}{T}\int _0^T \varGamma _2^T(s)ds =\lim _{T\rightarrow \infty } \varGamma _2^T(s_1)= {{\hat{\varGamma }}}_2, \quad s_1 \in [0,T] , \end{aligned}$$
(32)

where we applied the mean value integral theorem and \({{\hat{\varGamma }}}_2=\frac{-\beta +\sqrt{\beta ^2+8(c_1+c_3)}}{4}\) is the limit of the solution of the Riccati equation obtained previously, i.e., \({{\hat{\varGamma }}}_2=\lim _{T\rightarrow \infty } \varGamma _2^T(s).\) We recall that

$$\begin{aligned}\lambda ^T_{2,T}-\lambda ^T_{1,T}=\sqrt{(a^T_T-g^T_T)^2+4b^T_T d_T}= T \sqrt{\left( \frac{4}{T}\int _0^T \varGamma ^T_2(s)ds + \beta \right) ^2 - 8 c_1 c_2 }>0\end{aligned}$$

which goes to infinity as T goes to \(\infty \) provided the term under the square root is positive. We observe that

$$\begin{aligned} \begin{aligned} {\hat{g}}_t&:=\lim _{T\rightarrow \infty }g^T_t = \lim _{T\rightarrow \infty }- 2 \int _0^t \varGamma _2^T(s)ds = - 2 {{\hat{\varGamma }}}_2 t :=g t,\\ b_t&=2c_1c_2 t,\\ {\hat{a}}_t&:=\lim _{T\rightarrow \infty }a^T_t = \lim _{T\rightarrow \infty } 2 \int _0^t \varGamma _2^T(s)ds + \beta t= 2 {{\hat{\varGamma }}}_2 t + \beta t ,\\ d_t&= - t,\\ {{\hat{\lambda }}}_{1\backslash 2,t}&:=\lim _{T\rightarrow \infty }\lambda ^T_{1\backslash 2,t} = \frac{{\hat{a}}_t+{\hat{g}}_t \pm \sqrt{({\hat{a}}_t-{\hat{g}}_t)^2 + 4b_t d_t}}{2} \\&= t \frac{\beta \pm \sqrt{(4 {{\hat{\varGamma }}}_2 + \beta )^2-8c_1c_2}}{2}:=t \lambda _{1\backslash 2} , \\ {\hat{P}}_t&:=\lim _{T\rightarrow \infty }P^T_t \\&=\frac{1}{d_t({{\hat{\lambda }}}_{2,t} - {{\hat{\lambda }}}_{1,t})}\\&\quad \begin{pmatrix} d_t e^{{{\hat{\lambda }}}_{1,t}}({{\hat{\lambda }}}_{2,t} - {\hat{g}}_t)+ d_t e^{{{\hat{\lambda }}}_{2,t}}({\hat{g}}_t-{{\hat{\lambda }}}_{1,t}) &{} d_t^2(e^{{{\hat{\lambda }}}_{2,t}}-e^{{{\hat{\lambda }}}_{1,t}}) \\ ({{\hat{\lambda }}}_{1,t} - {\hat{g}}_t) ({{\hat{\lambda }}}_{2,t} - {\hat{g}}_t) (e^{{{\hat{\lambda }}}_{1,t}}-e^{{{\hat{\lambda }}}_{2,t}}) &{} d_t e^{{{\hat{\lambda }}}_{2,t}}({{\hat{\lambda }}}_{2,t} - {\hat{g}}_t) + d_t e^{{{\hat{\lambda }}}_{1,t}}({\hat{g}}_t-{{\hat{\lambda }}}_{1,t}) \end{pmatrix}. \end{aligned} \end{aligned}$$
(33)

To evaluate \({{\hat{\varGamma }}}_1(0)=\lim _{T\rightarrow \infty } \varGamma ^T_1(0)\), we study the limit of the remaining terms:

$$\begin{aligned} \lim _{T\mapsto \infty }-\frac{p^T_T(2,1)}{p^T_T(2,2)}= & {} \lim _{T\mapsto \infty }\frac{(\lambda ^T_{1,T} - g^T_T) (\lambda ^T_{2,T} - g^T_T) (e^{\lambda ^T_{2,T}}-e^{\lambda ^T_{1,T}})}{ d_T e^{\lambda ^T_{2,T}}(\lambda ^T_{2,T} - g^T_T) + d_T e^{\lambda ^T_{1,T}}(g^T_T-\lambda ^T_{1,T})}\nonumber \\= & {} \lim _{T\mapsto \infty }\frac{1}{\frac{d_T}{(\lambda ^T_{1,T}-g^T_T)(1-e^{\lambda ^T_{1,T}-\lambda ^T_{2,T}})}+\frac{d_T}{(\lambda ^T_{2,T}-g^T_T)(1-e^{\lambda ^T_{2,T}-\lambda ^T_{1,T}})}}\nonumber \\= & {} -(\lambda _1 -g)\nonumber \\= & {} -(\lambda _1 + 2 {{\hat{\varGamma }}}_2),\nonumber \\ \lim _{T\mapsto \infty }\int _0^T \frac{p_s^T(1,2)}{|P_s^T|}ds= & {} \lim _{T\mapsto \infty }\int _0^T \frac{d_s(e^{\lambda ^T_{2,s}}-e^{\lambda ^T_{1,s}})}{(\lambda ^T_{2,s}-\lambda ^T_{1,s})(e^{\lambda ^T_{1,s}+\lambda ^T_{2,s}})}ds \nonumber \\= & {} \frac{1}{\lambda _2-\lambda _1}\left( \frac{1}{\lambda _2}-\frac{1}{\lambda _1} \right) \nonumber \\ \lim _{T\mapsto \infty }\int _0^T \frac{p_s^T(1,1)}{|P_s^T|}ds= & {} \lim _{T\mapsto \infty } \int _0^T\frac{1}{e^{\lambda ^T_{1,s}+\lambda ^T_{2,s}}} \left( e^{\lambda ^T_{1,s}}\frac{\lambda _{2,s}^T-g_s^T}{\lambda ^T_{2,s}-\lambda ^T_{1,s}} +e^{\lambda ^T_{2,s}} \frac{g_s^T-\lambda _{1,s}^T}{\lambda ^T_{2,s}-\lambda ^T_{1,s}} \right) ds\nonumber \\= & {} \frac{\lambda _2-g}{\lambda _2(\lambda _2-\lambda _1)}+\frac{g-\lambda _1}{\lambda _1(\lambda _2-\lambda _1)}. \end{aligned}$$
(34)

Finally, the value of \({{\hat{\varGamma }}}_1(0)\) is given by

$$\begin{aligned} {{\hat{\varGamma }}}_1(0) = - (\lambda _1 - g) m_0 -2\frac{c_3c_4}{\lambda _2}. \end{aligned}$$
(35)

Given \({{\hat{\varGamma }}}_1(0)\), we evaluate the limit as T goes to \(\infty \) of (30), i.e.,

$$\begin{aligned} \begin{aligned} h_1(t):=\lim _{T\mapsto \infty } h_1^T(t)&=m_0- 2c_3c_4 \lim _{T\mapsto \infty } \int _0^t \frac{ p_s^T(1,2)}{|P_s^T|}ds \\&=m_0+ 2 \frac{c_3 c_4}{\lambda _2 - \lambda _1} \left( \frac{1}{\lambda _2}e^{-t \lambda _2}-\frac{1}{\lambda _1}e^{-t \lambda _1}+\frac{1}{\lambda _1}-\frac{1}{\lambda _2} \right) ,\\ h_2(t):=\lim _{T\mapsto \infty } h_2^T(t)&=\lim _{T\mapsto \infty } \left( \varGamma _1^T(0)+2c_3c_4 \int _0^t \frac{p_s^T(1,1)}{|P_s^T|} \hbox {d}s \right) \\&= {{\hat{\varGamma }}}_1(0)+ 2 \frac{c_3 c_4}{\lambda _2 - \lambda _1} \left( \frac{\lambda _2-g}{\lambda _2}(1-e^{-t\lambda _2})+\frac{g-\lambda _1}{\lambda _1}(1-e^{-t\lambda _1}) \right) . \end{aligned} \end{aligned}$$
(36)

We can conclude that

$$\begin{aligned} \begin{aligned} {\hat{m}}_t&=\lim _{T\rightarrow \infty }m^T_t\\&= {\hat{p}}_t(1,1) h_1(t) + {\hat{p}}_t(1,2)h_2(t)\\&= \left( m_0+2 \frac{c_3c_4}{\lambda _2-\lambda _1}\left( \frac{1}{\lambda _1}-\frac{1}{\lambda _2} \right) \right) e^{t\lambda _{1}}+2 \frac{c_3c_4}{\lambda _2-\lambda _1}\left( \frac{1}{\lambda _2}-\frac{1}{\lambda _1} \right) ,\\ {{\hat{\varGamma }}}_1(t)&=\lim _{T\rightarrow \infty }\varGamma _1^T(t)\\&= {\hat{p}}_t(2,1) h_1(t) + {\hat{p}}_t(2,2)h_2(t)\\&=m_0 (g-\lambda _1) e^{t\lambda _{1}}+2\frac{c_3c_4}{\lambda _2-\lambda _1}\left( \frac{\lambda _2-g}{\lambda _2}-\frac{\lambda _1-g}{\lambda _1} \right) .\\ \end{aligned} \end{aligned}$$
(37)

Finally, the third ODE in (18) can be solved by plugging in the solution of the previous ones and integrating. Since our interest is in the evolution of the mean and in the control function, we omit these calculations and only recall that:

$$\begin{aligned} {{\hat{\alpha }}}_t=-(2{{\hat{\varGamma }}}_2 x+{{\hat{\varGamma }}}_1(t)), \quad {{\hat{\varGamma }}}_2=\frac{-\beta + \sqrt{\beta ^2+8(c_1 + c_3)}}{4}, \end{aligned}$$
(38)

and we observe that

$$\begin{aligned} \lim _{t\rightarrow \infty }{{\hat{\alpha }}}_t=-(2{{\hat{\varGamma }}}_2 x+{{\hat{\varGamma }}}_1), \quad {{\hat{\varGamma }}}_1=-\frac{4c_1c_2{{\hat{\varGamma }}}_2}{\lambda _2} =\frac{c_3c_4{{\hat{\varGamma }}}_2}{2(c_1+c_3-c_1c_2)}. \end{aligned}$$
(39)

1.2 Solution for asymptotic MFG

The asymptotic version of the problem presented above is given by:

  1. Fix \(m \in {\mathbb {R}}\) and solve the stochastic control problem:

    $$\begin{aligned} \min _{\varvec{ \alpha }}J^{m}(\varvec{ \alpha })&=\min _{\varvec{ \alpha }} {\mathbb {E}}\left[ \int _0^{\infty } e^{-\beta t}f(X^{\varvec{ \alpha }}_t,\alpha _t,m)\hbox {d}t \right] \\&=\min _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _0^{\infty }e^{-\beta t }\left( \frac{1}{2}\alpha _t^2 + c_1 \left( X_t^{\varvec{ \alpha }}-c_2 m \right) ^2 + c_3 \left( X_t^{\varvec{ \alpha }}-c_4 \right) ^2 + c_5 m^2 \right) \hbox {d}t\right] , \\ \text {subject to: }&\quad dX^{\varvec{ \alpha }}_t=\alpha _t \hbox {d}t +\sigma dW_t, \quad X^{\varvec{ \alpha }}_0\sim \mu _0. \end{aligned}$$
  2. Find the fixed point, \({{\hat{m}}}\), such that \({{\hat{m}}} = \lim _{t \rightarrow +\infty } {\mathbb {E}}\left[ X^{{{\hat{\alpha }}},{{\hat{m}}}}_t\right] \).

Let \(V^m(x)\) be the optimal value function given \(m \in {\mathbb {R}}\) and conditioned on \(X_0=x\), i.e.,

$$\begin{aligned} V^m(x)= & {} \inf _{\varvec{ \alpha }}J^{m,x}(\varvec{ \alpha })\\= & {} \inf _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _0^{+\infty }e^{-\beta t }\left( \frac{1}{2}\alpha _t^2 + c_1 \left( X_t^{\varvec{ \alpha }}-c_2 m \right) ^2 + c_3 \left( X_t^{\varvec{ \alpha }}-c_4 \right) ^2 + c_5 m^2 \right) \Big |X_0^{\varvec{ \alpha }}=x\right] . \end{aligned}$$

We consider the following ansatz with its derivatives with respect to x:

$$\begin{aligned} V^m(x)&=\varGamma _2 x^2 + \varGamma _1 x +\varGamma _0, \\ \dot{V}^m(x)&= 2\varGamma _2 x + \varGamma _1, \\ {\ddot{V}}^m(x)&=2\varGamma _2. \end{aligned}$$

Let us consider the HJB equation

$$\begin{aligned}&\beta V^m(x) - \inf _{\alpha } \{{\mathcal {A}}^X V^m(x) + f(x,\alpha ,m)\}\\&\quad =\beta V^m(x) - \inf _{\alpha } \left\{ \alpha \dot{V}(x) +\frac{1}{2}\sigma ^2 {\ddot{V}}^m(x) + \frac{1}{2}\alpha ^2 + c_1 (x- c_2 m)^2 \right. \\&\left. \qquad + c_3 (x - c_4)^2 +c_5 m^2\right\} \\&\quad =\beta V^m(x) - \left\{ - ({\dot{V}^m})^2(x) +\frac{1}{2}\sigma ^2 {\ddot{V}}^m(x) + \frac{1}{2}({\dot{V}^m})^2(x)\right. \\&\left. + c_1 (x- c_2 m)^2+ c_3 (x - c_4 )^2+ c_5 m^2\right\} \\&\quad =\beta V^m(x) + \frac{1}{2}({\dot{V}^m})^2(x)\\&\quad -\frac{1}{2}\sigma ^2 {\ddot{V}}^m(x) - c_1 (x-c_2 m)^2 - c_3 (x - c_4)^2 - c_5 m^2= 0, \end{aligned}$$

where in the third line we evaluated the infimum at \({{\hat{\alpha }}}(x)= -\dot{V}^m(x)\). Replacing the ansatz and its derivatives in the HJB equation, it follows that

$$\begin{aligned}&\left( \beta \varGamma _2 + 2 \varGamma _2^2 - c_1 - c_3 \right) x^2 +(\beta \varGamma _1 +2\varGamma _2\varGamma _1+2c_1c_2 m +2c_3 c_4 )x +\beta \varGamma _0\\&\qquad +\frac{1}{2}\varGamma _1^2-\sigma ^2 \varGamma _2 -( c_1{c_2}^2+c_5) m^2 - c_3 {c_4}^2=0. \end{aligned}$$

An easy computation gives the values

$$\begin{aligned} \varGamma _2&=\frac{-\beta + \sqrt{\beta ^2 +8 (c_1+c_3)}}{4},\\ \varGamma _1&=- \frac{ 2c_1c_2m+2c_3 c_4}{\beta + 2\varGamma _2},\\ \varGamma _0&=\frac{ c_5 m^2 + c_3 {c_4}^2+ c_1 {c_2}^2 m^2 +\sigma ^2 \varGamma _2 -\frac{1}{2}\varGamma _1^2 }{\beta }. \end{aligned}$$

By plugging the control \({{\hat{\alpha }}}(x)=-(2\varGamma _2x+\varGamma _1)\) into the dynamics of \(X_t\) and taking the expected value, we obtain an ODE for \({ m_t}\)

$$\begin{aligned} \dot{m}_t= -(2\varGamma _2 m_t+\varGamma _1). \end{aligned}$$
(40)

The solution of (40) is used to derive m as follows

$$\begin{aligned} \begin{aligned} m&=\lim _{t\rightarrow \infty } m_t =\lim _{t\rightarrow \infty } \left( -\frac{\varGamma _1}{2\varGamma _2} + \left( m_0 + \frac{\varGamma _1}{2\varGamma _2} \right) e^{-2 \varGamma _2 t} \right) =-\frac{\varGamma _1}{2\varGamma _2} =\frac{2c_1 c_2 m+2 c_3 c_4}{2 \varGamma _2 (\beta + 2\varGamma _2)},\\ m&= \frac{c_3 c_4}{\varGamma _2 (\beta + 2\varGamma _2) -c_1 c_2 }. \end{aligned} \end{aligned}$$
(41)

To summarize, we derived that \({{\hat{\alpha }}}(x)=-(2\varGamma _2x+\varGamma _1)\) with \(\varGamma _2={{\hat{\varGamma }}}_2\) and \(\varGamma _1={{\hat{\varGamma }}}_1\) obtained in (39). In other words, we have checked that

$$\begin{aligned} \lim _{t\rightarrow \infty }{{\hat{\alpha }}}_t^{MFG}(x) = {{\hat{\alpha }}}^{AMFG}(x), \quad \forall x, \end{aligned}$$

that is the first part of (3) for this LQ MFG.
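For reference, the AMFG benchmark quantities can be computed directly from the formulas above. A minimal sketch, with illustrative coefficients chosen so that the denominator in (41) is positive:

```python
import numpy as np

# AMFG benchmark from the formulas of A.2; coefficients are illustrative
# and satisfy Gamma2*(beta + 2*Gamma2) - c1*c2 > 0, so (41) is well defined.
beta, c1, c2, c3, c4 = 1.0, 0.25, 1.5, 0.5, 0.6

gamma2 = (-beta + np.sqrt(beta**2 + 8.0 * (c1 + c3))) / 4.0
m = c3 * c4 / (gamma2 * (beta + 2.0 * gamma2) - c1 * c2)       # fixed point (41)
gamma1 = -(2.0 * c1 * c2 * m + 2.0 * c3 * c4) / (beta + 2.0 * gamma2)

# m is the rest point of m_dot = -(2*Gamma2*m + Gamma1), cf. (40)
assert abs(2.0 * gamma2 * m + gamma1) < 1e-10
alpha_hat = lambda x: -(2.0 * gamma2 * x + gamma1)             # AMFG control
print(gamma2, gamma1, m, alpha_hat(0.0))
```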

1.3 Solution for stationary MFG

The only difference with the derivation above in the case of asymptotic MFG is that \(m_t\) should be a constant which, from (40), should satisfy \(2\varGamma _2 m+\varGamma _1=0\). Therefore, m takes the same value as in (41), and we deduce

$$\begin{aligned} {{\hat{\alpha }}}^{SMFG}(x) = {{\hat{\alpha }}}^{AMFG}(x), \quad \forall x, \end{aligned}$$

that is the second part of (3) for this LQ MFG.

1.4 Solution for non-asymptotic MFC

We present the solution for the following non-asymptotic MFC problem

$$\begin{aligned} \min _{\varvec{ \alpha }}J(\varvec{ \alpha })&=\min _{\varvec{ \alpha }} {\mathbb {E}}\left[ \int _0^{\infty } e^{-\beta t}f(X^{\varvec{ \alpha }}_t,\alpha _t,{\mathbb {E}}\left[ X_t^{\varvec{\alpha }}\right] )\hbox {d}t \right] \\&=\min _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _0^{+\infty }e^{-\beta t }\left( \frac{1}{2}\alpha _t^2 + c_1 \left( X_t^{\varvec{ \alpha }}-c_2 {\mathbb {E}}\left[ X_t^{\varvec{ \alpha }}\right] \right) ^2 + c_3 \left( X_t^{\varvec{ \alpha }}-c_4 \right) ^2 + c_5 {\mathbb {E}}\left[ X_t^{\varvec{ \alpha }}\right] ^2 \right) \hbox {d}t\right] ,\\ \text {subject to: }&\quad dX^{\varvec{ \alpha }}_t=\alpha _t \hbox {d}t +\sigma dW_t ,\quad X^{\varvec{ \alpha }}_0\sim \mu _0. \end{aligned}$$

Note that here the mean \({\mathbb {E}}\left[ X_t^{\varvec{ \alpha }}\right] \) of the population changes instantaneously when \(\varvec{ \alpha }\) changes.

This problem can be solved by two equivalent approaches: PDE and FBSDEs. Both approaches start by solving the problem defined on a finite horizon T. Then, the solution to the infinite horizon problem is obtained by taking the limit as T goes to infinity. Let \(V^T(t,x)\) be the optimal value function for the finite horizon problem conditioned on \(X_0=x\), i.e.,

$$\begin{aligned} V^T(t,x)= & {} \inf _{\varvec{ \alpha }}J^{\varvec{m^\alpha },x}(\varvec{ \alpha })\\= & {} \inf _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _t^{T}e^{-\beta s }f(X_s^{\varvec{\alpha }},\alpha _s,m_s^{\varvec{\alpha }})ds\Big |X_0^{\varvec{ \alpha }}=x\right] , \quad V^T(T,x)=0. \end{aligned}$$

Let us consider the following ansatz with its derivatives

$$\begin{aligned} \begin{aligned} V^T(t,x)&= \varGamma _2^T (t) x^2 + \varGamma _1^T (t) x + \varGamma _0^T (t), \quad V^T(T,x)=0,\\ \partial _t V^T(t,x)&= {\dot{\varGamma }}_2^T (t) x^2 + {\dot{\varGamma }}_1^T (t) x + {\dot{\varGamma }}_0^T (t), \\ \partial _x V^T(t,x)&= 2 \varGamma _2^T (t) x + \varGamma _1^T (t) , \\ \partial _{xx} V^T(t,x)&=2 \varGamma _2^T (t), \end{aligned} \end{aligned}$$
(42)

Starting from the MFC-HJB equation (4.12) given in [4], we extend it to our setting as follows

$$\begin{aligned}&\beta V^T -V_t^T - H\left( t,x,\varvec{\mu },\alpha \right) - \int _{{\mathbb {R}}} \frac{\delta H}{\delta \mu }\left( t,h,\varvec{\mu },-\partial _x V^T \right) (x)\mu _t(h)dh=0, \end{aligned}$$

where \(m_t=\int _{{\mathbb {R}}} y \mu _t(dy)\) and \(\alpha ^*=-\partial _x V^T\). We have:

$$\begin{aligned}&H\left( t,x,\varvec{\mu },\alpha \right) \\&\quad :=\inf _{\alpha }\left\{ {\mathcal {A}}^X V^T + f\left( t, x,\alpha ,\varvec{\mu } \right) \right\} \\&\quad =\inf _{\alpha }\left\{ \alpha \partial _x V^T + \frac{1}{2}\sigma ^2 \partial _{xx}V^T+\frac{1}{2}\alpha ^2 +c_1 (x-c_2 m_t)^2 + c_3 (x-c_4)^2 + c_5 {m_t}^2 \right\} \\&\quad =-\frac{1}{2}(\partial _x V^T)^2+ \frac{1}{2}\sigma ^2 \partial _{xx}V^T+c_1 (x-c_2 m_t)^2 + c_3 (x-c_4)^2 + c_5 {m_t}^2, \\&\frac{\delta H\left( t,h,\varvec{\mu },\alpha \right) }{\delta \mu } (x)\\&\quad =\frac{\delta }{\delta \mu }\left( c_1 (h-c_2 m_t)^2 + c_5 {m_t}^2 \right) (x)\\&\quad =\frac{\delta }{\delta \mu }\left( c_1 \left( h- c_2 \int _{{\mathbb {R}}} y \mu _t(dy) \right) ^2 +c_5 \left( \int _{{\mathbb {R}}} y \mu _t(dy) \right) ^2 \right) (x)\\&\quad =-2c_1 c_2 x\left( h-c_2\int _{{\mathbb {R}}} y \mu _t(dy)) \right) +2c_5x\int _{{\mathbb {R}}} y \mu _t(dy)\\&\quad =- 2c_1 c_2 x(h -c_2 m_t) +2c_5xm_t, \end{aligned}$$
$$\begin{aligned} {\int _{{\mathbb {R}}} \frac{\delta H}{\delta \mu }\left( t,h,\varvec{\mu },-\partial _x V^T \right) (x)\mu _t(h)dh}&= - 2c_1 c_2 x(m_t - c_2 m_t) +2c_5xm_t, \end{aligned}$$

and finally

$$\begin{aligned}&\beta V^T -\partial _t V^T +\frac{1}{2}(\partial _x V^T)^2- \frac{1}{2}\sigma ^2 \partial _{xx} V^T- c_1 (x- c_2 m_t)^2 - c_3 (x- c_4)^2 \\&\qquad - c_5 {m_t}^2 + 2 c_1 c_2 x(m_t - c_2 m_t) - 2c_5xm_t=0 . \end{aligned}$$

The following system of ODEs is obtained by replacing the ansatz and its derivatives in the MFC-HJB:

$$\begin{aligned} {\left\{ \begin{array}{ll} {{{\dot{\varGamma }}}}_2^T -2({{ \varGamma }^T_2})^2 - \beta { \varGamma }^T_2 + c_1 + c_3 =0, \quad &{}{ \varGamma }_2^T(T) = 0, \\ {{{\dot{\varGamma }}}}^T_1 = (2 { \varGamma }^T_2 + \beta ) { \varGamma }^T_1\\ + (2 c_1 c_2 ( 2 - c_2) -2c_5)m_t^T + 2 c_3 c_4, \quad &{}{ \varGamma }^T_1(T)=0, \\ {{{\dot{\varGamma }}}}^T_0 = \beta { \varGamma }^T_0 + \frac{1}{2}({{ \varGamma }^T_1})^2 - \sigma ^2 { \varGamma }^T_2\\ -c_3 {c_4}^2 - (c_1 {c_2}^2 +c_5) ({m^T_t})^2 , \quad &{}{ \varGamma }^T_0(T) = 0,\\ {\dot{m}}_t^T = - 2 {\varGamma }^T_2 m^T - { \varGamma }^T_1, \quad &{}m^T(0)= {\mathbb {E}}\left[ X^{\varvec{\alpha }}_0\right] =m_0,\\ \end{array}\right. } \end{aligned}$$
(43)

where the last equation is obtained by considering the expectation of \(X_t^{\varvec{\alpha }}\) after replacing \(\alpha ^*(x) = -\partial _x V^T(x) = - (2\varGamma ^T_2 x + \varGamma ^T_1)\). The first equation is a Riccati equation. In particular, the solution \(\varGamma ^T_2\) converges to \(\varGamma ^*_2=\frac{-\beta + \sqrt{\beta ^2+8(c_1 + c_3)}}{4}\) as T goes to infinity. The second and fourth ODEs are coupled, and they can be written in matrix notation as

$$\begin{aligned} \frac{d}{dt}\begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} = \begin{bmatrix} - 2 \varGamma _2^T(t) &{} -1 \\ 2( c_1 c_2 (2-c_2) - c_5) &{} 2 \varGamma _2^T(t) + \beta \end{bmatrix} \begin{pmatrix} m^T \\ \varGamma _1^T \end{pmatrix} + \begin{pmatrix} 0 \\ 2 c_3 c_4 \end{pmatrix}. \end{aligned}$$
(44)

By similar calculations to the non-asymptotic MFG case, the following solutions can be obtained

$$\begin{aligned} \begin{aligned} m_t^*&=\lim _{T\rightarrow \infty }m^T_t = p^*_t(1,1) h_1(t) + p^*_t(1,2)h_2(t)\\&= \left( m_0+2 \frac{c_3c_4}{\lambda _2-\lambda _1}\left( \frac{1}{\lambda _1}-\frac{1}{\lambda _2} \right) \right) e^{t\lambda _{1}}+2 \frac{c_3c_4}{\lambda _2-\lambda _1}\left( \frac{1}{\lambda _2}-\frac{1}{\lambda _1} \right) ,\\ \varGamma _1^*(t)&=\lim _{T\rightarrow \infty }\varGamma _1^T(t) = p^*_t(2,1) h_1(t) + p^*_t(2,2)h_2(t)\\&=m_0 (g-\lambda _1) e^{t\lambda _{1}}+2\frac{c_3c_4}{\lambda _2-\lambda _1}\left( \frac{\lambda _2-g}{\lambda _2}-\frac{\lambda _1-g}{\lambda _1} \right) ,\\ \end{aligned} \end{aligned}$$
(45)

where

$$\begin{aligned} \begin{aligned} g&:=- 2 \varGamma _2^* ,\\ b&:=2(c_1c_2 (2-c_2) - c_5) ,\\ a&:=2 \varGamma _2^* + \beta ,\\ d&:=- 1,\\ \lambda _{1\backslash 2}&:=\frac{a+g \pm \sqrt{(a-g)^2 + 4b d}}{2} = \frac{\beta \pm \sqrt{(4 \varGamma _2^* + \beta )^2-8(c_1c_2 (2-c_2) - c_5)}}{2} . \\ \end{aligned} \end{aligned}$$
(46)

As in the MFG case, the third ODE in (43) can be solved by plugging in the solution of the previous ones and integrating. Since our interest is in the evolution of the mean and in the control function, we omit the calculation for this ODE.

1.5 Solution for asymptotic MFC

The asymptotic version of the problem presented above is given by:

$$\begin{aligned} \min _{\varvec{ \alpha }}J(\varvec{ \alpha })&=\inf _{\varvec{ \alpha }} {\mathbb {E}}\left[ \int _0^{\infty } e^{-\beta t}f(X^{\varvec{ \alpha }}_t,\alpha _t,m^{\varvec{\alpha }})\hbox {d}t \right] \\&=\inf _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _0^{+\infty }e^{-\beta t }\left( \frac{1}{2}\alpha _t^2 + c_1 \left( X_t^{\varvec{ \alpha }}-c_2 m^{\varvec{ \alpha }} \right) ^2 + c_3 \left( X_t^{\varvec{ \alpha }}- c_4 \right) ^2 +c_5 (m^{\varvec{ \alpha }})^2 \right) \hbox {d}t\right] ,\\ \text {subject to: }&\quad dX^{\varvec{ \alpha }}_t=\alpha _t \hbox {d}t +\sigma dW_t ,\quad X^{\varvec{ \alpha }}_0\sim \mu _0, \end{aligned}$$

where \(m^{\varvec{ \alpha }} = \lim _{t \rightarrow +\infty } {\mathbb {E}}\left[ X^{\alpha }_t\right] .\)

Let V(x) be the optimal value function conditioned on \(X_0=x\), i.e.,

$$\begin{aligned} V(x)= & {} \inf _{\varvec{ \alpha }}J^{x}(\varvec{ \alpha })\\= & {} \inf _{\varvec{ \alpha }}{\mathbb {E}}\left[ \int _0^{+\infty }e^{-\beta t }\left( \frac{1}{2}\alpha _t^2 + c_1 \left( X_t^{\varvec{ \alpha }}-c_2 m^{\varvec{ \alpha }} \right) ^2 + c_3 \left( X_t^{\varvec{ \alpha }}-c_4 \right) ^2 + c_5 (m^{\varvec{ \alpha }})^2 \right) \hbox {d}t\Big |X_0^{\varvec{ \alpha }}=x\right] . \end{aligned}$$

We consider the following ansatz with its derivative

$$\begin{aligned} V(x)&=\varGamma _2 x^2 + \varGamma _1 x + \varGamma _0, \\ \dot{V}(x)&= 2\varGamma _2 x + \varGamma _1, \\ {\ddot{V}}(x)&=2\varGamma _2. \\ \end{aligned}$$

Starting from the MFC-HJB equation (4.12) given in [4], we extend it to the asymptotic case as follows

$$\begin{aligned}&\beta V(x) - H\left( x,\mu ^{\varvec{ \alpha }},\alpha \right) - \int _{{\mathbb {R}}} \frac{\delta H}{\delta \mu }\left( h,\mu ^{\varvec{ \alpha }},-\dot{V}(h) \right) (x)\mu ^{\varvec{ \alpha }}(h)dh=0, \end{aligned}$$

where \(m^{\varvec{ \alpha }}=\int _{{\mathbb {R}}} y \mu ^{\varvec{ \alpha }}(dy)\). We have:

$$\begin{aligned}&H\left( x,\mu ^{\varvec{ \alpha }},\alpha \right) \\&\quad :=\inf _{\alpha }\left\{ {\mathcal {A}}^X V(x) + f\left( x,\alpha ,\mu ^{\varvec{ \alpha }} \right) \right\} \\&\quad =\inf _{\alpha }\left\{ \alpha \dot{V}(x) + \frac{1}{2}\sigma ^2 \ddot{V}(x)+\frac{1}{2}\alpha ^2 +c_1 (x-c_2 m^{\varvec{ \alpha }})^2 + c_3 (x-c_4)^2 + c_5 (m^{\varvec{ \alpha }})^2 \right\} \\&\quad =-\frac{1}{2}\dot{V}(x)^2+ \frac{1}{2}\sigma ^2 {\ddot{V}}(x)+c_1 (x-c_2 m^{\varvec{ \alpha }})^2 + c_3 (x-c_4)^2 + c_5 (m^{\varvec{ \alpha }})^2, \\&\frac{\delta H\left( h,\mu ^{\varvec{ \alpha }},\alpha \right) }{\delta \mu } (x)\\&\quad =\frac{\delta }{\delta \mu }\left( c_1 (h-c_2 m^{\varvec{ \alpha }})^2 + c_5 (m^{\varvec{ \alpha }})^2 \right) (x)\\&\quad =\frac{\delta }{\delta \mu }\left( c_1 \left( h- c_2 \int _{{\mathbb {R}}} y \mu ^{\varvec{ \alpha }}(dy) \right) ^2 +c_5 \left( \int _{{\mathbb {R}}} y \mu ^{\varvec{ \alpha }}(dy) \right) ^2 \right) (x)\\&\quad =-2c_1 c_2 x\left( h-c_2\int _{{\mathbb {R}}} y \mu ^{\varvec{ \alpha }}(dy)) \right) +2c_5x\int _{{\mathbb {R}}} y \mu ^{\varvec{ \alpha }}(dy)\\&\quad =- 2c_1 c_2 x(h -c_2 m^{\varvec{ \alpha }}) +2c_5xm^{\varvec{ \alpha }}, \end{aligned}$$
$$\begin{aligned}&{ \int _{{\mathbb {R}}} \frac{\delta H}{\delta \mu }\left( h,\mu ^{\varvec{ \alpha }},-\dot{V}(h) \right) (x)\mu ^{\varvec{ \alpha }}(h)dh}= - 2c_1 c_2 x(m^{\varvec{ \alpha }} - c_2 m^{\varvec{ \alpha }}) +2c_5xm^{\varvec{ \alpha }}, \end{aligned}$$

and finally, the HJB equation becomes:

$$\begin{aligned}&\beta V(x) +\frac{1}{2}\dot{V}(x)^2- \frac{1}{2}\sigma ^2 {\ddot{V}}(x)- c_1 (x- c_2 m^{\varvec{ \alpha }})^2 - c_3 (x- c_4)^2 \\&\qquad - c_5 (m^{\varvec{ \alpha }})^2 + 2 c_1 c_2 x(m^{\varvec{ \alpha }} - c_2 m^{\varvec{ \alpha }}) - 2c_5xm^{\varvec{ \alpha }}=0 . \end{aligned}$$

A system of ODEs is obtained by replacing the ansatz and its derivatives in the MFC-HJB equation and cancelling the terms in \(x^2\), in x, and the constant term:

$$\begin{aligned} \begin{aligned} \left( \beta \varGamma _2 + 2 \varGamma _2^2 - c_1 - c_3 \right) x^2&+\left( \beta \varGamma _1 +2\varGamma _2\varGamma _1+2 c_1 c_ 2 m^{\varvec{ \alpha }} (2-c_2) +2c_3 c_4 - 2 c_5 m^{\varvec{ \alpha }} \right) x \\&+\beta \varGamma _0+\frac{1}{2}\varGamma _1^2-\sigma ^2 \varGamma _2 -( c_1 {c_2}^2+c_5) (m^{\varvec{ \alpha }})^2 - c_3 {c_4}^2=0. \end{aligned} \end{aligned}$$

An easy computation gives the values

$$\begin{aligned} \varGamma _2&=\frac{-\beta + \sqrt{\beta ^2 +8 (c_1+c_3)}}{4},\\ \varGamma _1&= \frac{2 c_5 m^{\varvec{ \alpha }} -2c_1 c_2m^{\varvec{ \alpha }}(2-c_2)-2c_3 c_4}{\beta + 2\varGamma _2},\\ \varGamma _0&=\frac{ c_5 (m^{\varvec{ \alpha }})^2 + c_3 {c_4}^2+ c_1 {c_2}^2 (m^{\varvec{ \alpha }})^2 +\sigma ^2 \varGamma _2 -\frac{1}{2}\varGamma _1^2 }{\beta }. \end{aligned}$$

By plugging the control \(\alpha ^*(x)=-(2\varGamma _2x+\varGamma _1)\) into the dynamics of \(X^{\varvec{\alpha }}_t\) and taking the expected value, we obtain an ODE for \(m^{\varvec{ \alpha }}_t\)

$$\begin{aligned} \dot{m}_t^{\varvec{ \alpha }}= -(2\varGamma _2 m_t^{\varvec{ \alpha }}+\varGamma _1). \end{aligned}$$
(47)

The solution of (47) is used to derive m as follows

$$\begin{aligned} \begin{aligned} m^{\varvec{ \alpha }}&=\lim _{t\rightarrow \infty } m_t^{\varvec{ \alpha }} =\lim _{t\rightarrow \infty } \left( -\frac{\varGamma _1}{2\varGamma _2} + \left( m_0 + \frac{\varGamma _1}{2\varGamma _2} \right) e^{-2 \varGamma _2 t}\right) \\&=-\frac{\varGamma _1}{2\varGamma _2} =-\frac{2c_5 m^{\varvec{ \alpha }} -2c_1 c_2m^{\varvec{ \alpha }}(2-c_2)-2c_3 c_4}{2 \varGamma _2 (\beta + 2\varGamma _2)},\\ m^{\varvec{ \alpha }}&= \frac{c_3 c_4}{\varGamma _2 (\beta + 2\varGamma _2)+ c_5 -c_1 c_2 (2-c_2)}. \end{aligned} \end{aligned}$$
(48)

We remark that the values of \(m_t^{\varvec{ \alpha }}\) and \(\varGamma _1(t)\) obtained in the non-asymptotic case converge to \(m^{\alpha }\) and \(\varGamma _1\), respectively, as t goes to \(\infty \). Therefore, we have obtained that

$$\begin{aligned} \lim _{t\rightarrow \infty }\alpha _t^{*MFC}(x) = \alpha ^{*AMFC}(x), \quad \forall x, \end{aligned}$$

that is the first part of (4) for this LQ MFC problem.
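The limiting means (41) (game) and (48) (control) are easy to compare numerically. A small sketch with illustrative coefficients, showing that the MFG and MFC solutions differ in general:

```python
import numpy as np

# Limiting means of the game (41) and of the control problem (48);
# all coefficients are illustrative.
beta, c1, c2, c3, c4, c5 = 1.0, 0.25, 1.5, 0.5, 0.6, 0.4

gamma2 = (-beta + np.sqrt(beta**2 + 8.0 * (c1 + c3))) / 4.0
K = gamma2 * (beta + 2.0 * gamma2)       # equals c1 + c3 by the Riccati equation
m_mfg = c3 * c4 / (K - c1 * c2)                      # MFG fixed point (41)
m_mfc = c3 * c4 / (K + c5 - c1 * c2 * (2.0 - c2))    # MFC fixed point (48)
print(m_mfg, m_mfc)   # the two limiting means differ in general
```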

1.6 Solution for stationary MFC

The only difference with the derivation above in the case of asymptotic MFC is that \(m^\alpha _t\) should be a constant which, from (47), should satisfy \(2\varGamma _2 m^\alpha +\varGamma _1=0\). Therefore, \(m^\alpha \) takes the same value as in (48), and we deduce

$$\begin{aligned} \alpha ^{*SMFC}(x) = \alpha ^{*AMFC}(x), \quad \forall x, \end{aligned}$$

that is the second part of (4) for this LQ MFC problem.

Lipschitz property of the two-timescale operators

1.1 Generic setting

We modify the original operators using the soft-min operator on \({\mathbb {R}}^{|{\mathcal {A}}|}\) defined as:

$$\begin{aligned} {{\,\mathrm{soft-min}\,}}(z) = \left( \frac{e^{-z_i}}{\sum _j e^{-z_j}}\right) _{i=1,\dots ,|{\mathcal {A}}|} \in \varDelta ^{|{\mathcal {A}}|}, \qquad z \in {\mathbb {R}}^{|{\mathcal {A}}|}. \end{aligned}$$

Intuitively, it gives a probability distribution on the indices \(i=1,\dots ,|{\mathcal {A}}|\) which puts higher weight on indices whose corresponding values are closer to the minimum. In particular, the indices i such that \(z_i = \min _{j} z_j\) all receive the same weight, and this weight is the largest among the components of \({{\,\mathrm{soft-min}\,}}(z)\). We recall that the function \({{\,\mathrm{soft-min}\,}}\) is Lipschitz continuous for the 2-norm. Denoting by \(L_s\) its Lipschitz constant, it means that

$$\begin{aligned} \Vert {{\,\mathrm{soft-min}\,}}(z) - {{\,\mathrm{soft-min}\,}}(z')\Vert _2 \le L_s \Vert z - z'\Vert _2, \qquad z, z' \in {\mathbb {R}}^{|{\mathcal {A}}|}. \end{aligned}$$

Moreover, since \(|{\mathcal {A}}|\) is finite, all the norms on \({\mathbb {R}}^{|{\mathcal {A}}|}\) are equivalent, so there exists a positive constant \(c_{2,\infty }\) such that

$$\begin{aligned} \Vert {{\,\mathrm{soft-min}\,}}(z) - {{\,\mathrm{soft-min}\,}}(z')\Vert _\infty \le L_s c_{2,\infty } \Vert z - z'\Vert _\infty , \qquad z, z' \in {\mathbb {R}}^{|{\mathcal {A}}|}. \end{aligned}$$
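For concreteness, here is a minimal numpy implementation of \({{\,\mathrm{soft-min}\,}}\) illustrating the tie-breaking behavior described above; the stabilizing shift by \(\min _j z_j\) leaves the value unchanged:

```python
import numpy as np

def soft_min(z):
    """soft-min(z)_i = exp(-z_i) / sum_j exp(-z_j)."""
    w = np.exp(-(z - z.min()))   # shifting by min(z) leaves the ratio unchanged
    return w / w.sum()

z = np.array([1.0, 3.0, 1.0, 2.0])
print(soft_min(z))   # indices 0 and 2 (the minimizers) share the largest weight
```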

To alleviate the notation, we will write \(Q(x) := (Q(x,a))_{a \in {\mathcal {A}}}\) for any \(Q \in {\mathbb {R}}^{|{\mathcal {X}}| \times |{\mathcal {A}}|}\). We also introduce a more general version \({\underline{p}}\) of the transition kernel p, which can take as an input a probability over actions instead of a single action: for \(x,x' \in {\mathcal {X}}, \nu \in \varDelta ^{|{\mathcal {A}}|}, \mu \in \varDelta ^{|{\mathcal {X}}|}\),

$$\begin{aligned} {\underline{p}}(x'|x,\nu ,\mu ) = \sum _{a} \nu (a) p(x'|x,a,\mu ). \end{aligned}$$

Intuitively, this is the probability for an agent at x to move to \(x'\) when the population distribution is \(\mu \) and the agent picks a random action following the distribution \(\nu \).
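In array form, \({\underline{p}}\) is simply a contraction of the kernel p against \(\nu \). A small sketch with a synthetic kernel, where \(\mu \) is held fixed and absorbed into the array:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, na = 4, 3
p = rng.random((nx, na, nx))            # p[x, a, x'] = p(x' | x, a, mu), mu fixed
p /= p.sum(axis=2, keepdims=True)       # normalize each conditional distribution

nu = np.full(na, 1.0 / na)              # a randomized action (uniform here)
# p_bar(x' | x, nu, mu) = sum_a nu(a) p(x' | x, a, mu)
p_bar = np.einsum('xay,a->xy', p, nu)
assert np.allclose(p_bar.sum(axis=1), 1.0)   # still a transition kernel
```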

We now consider the following iterative procedure, which is a slight modification of (9a)–(9b). Here again, both variables (Q and \(\mu \)) are updated at each iteration but with different rates. Starting from an initial guess \((Q_0, \mu _0) \in {\mathbb {R}}^{|{\mathcal {X}}| \times |{\mathcal {A}}|} \times \varDelta ^{|{\mathcal {X}}|}\), define iteratively for \(k=0,1,\dots \):

$$\begin{aligned} {\left\{ \begin{array}{ll} Q_{k+1} = Q_k + \rho ^Q_k \, {\mathcal {T}}(Q_k, \mu _k), \\ \mu _{k+1} = \mu _k + \rho ^\mu _k \, \underline{{\mathcal {P}}}(Q_k, \mu _k), \end{array}\right. } \end{aligned}$$

for two step-size sequences \((\rho ^Q_k)_k\) and \((\rho ^\mu _k)_k\),

where

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {T}}(Q, \mu )(x,a) = f(x, a, \mu ) + \gamma \sum _{x'} p(x' | x,a,\mu ) \min _{a'}Q(x',a')\\ - Q(x,a), \qquad (x,a) \in {\mathcal {X}} \times {\mathcal {A}}, \\ \underline{{\mathcal {P}}}(Q, \mu )(x) = (\mu {\underline{P}}^{Q,\mu })(x) - \mu (x), \qquad x \in {\mathcal {X}}, \end{array}\right. } \end{aligned}$$

with

$$\begin{aligned}&{\underline{P}}^{Q,\mu }(x, x') = {\underline{p}}(x' | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu ),\\&\qquad \hbox { and } \qquad (\mu {\underline{P}}^{Q,\mu })(x) = \sum _{x_0} \mu (x_0) {\underline{P}}^{Q,\mu }(x_0,x), \end{aligned}$$

is the transition matrix when the population distribution is \(\mu \) and the agent uses an approximately optimal randomized control according to the soft-min of Q.
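A compact numpy sketch of this iterative procedure on a synthetic model. For brevity, f and p are taken independent of \(\mu \) here (in the mean field model they take \(\mu \) as an argument), and the step-size exponents are illustrative; as noted in the abstract, it is the ratio of the two learning rates that selects the MFG or the MFC solution:

```python
import numpy as np

rng = np.random.default_rng(1)
nx, na, gamma = 5, 3, 0.5

# Synthetic model: for brevity, f and p do not depend on mu here.
p = rng.random((nx, na, nx)); p /= p.sum(axis=2, keepdims=True)
f = rng.random((nx, na))

def soft_min(z):
    w = np.exp(-(z - z.min()))
    return w / w.sum()

def T_op(Q, mu):
    # T(Q, mu)(x, a) = f(x, a, mu) + gamma * sum_x' p(x'|x,a,mu) * min_a' Q(x', a') - Q(x, a)
    return f + gamma * p @ Q.min(axis=1) - Q

def P_op(Q, mu):
    # P(Q, mu)(x) = (mu P^{Q,mu})(x) - mu(x), with the soft-min randomized control
    nu = np.apply_along_axis(soft_min, 1, Q)      # nu[x, a] = soft-min(Q(x))_a
    P_mat = np.einsum('xay,xa->xy', p, nu)        # P^{Q,mu}(x, x')
    return mu @ P_mat - mu

Q, mu = np.zeros((nx, na)), np.full(nx, 1.0 / nx)
for k in range(20000):
    rho_Q = 1.0 / (1 + k) ** 0.55   # illustrative step sizes on two timescales
    rho_mu = 1.0 / (1 + k) ** 0.85
    Q = Q + rho_Q * T_op(Q, mu)
    mu = mu + rho_mu * P_op(Q, mu)

print(mu, Q.min(axis=1))   # approximate joint fixed point of (T, P)
```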

Lemma 1

Assume that f is Lipschitz continuous with respect to \(\mu \) and that \({\underline{p}}\) is Lipschitz continuous with respect to \(\nu \) and \(\mu \). Then,

  • the operator \({\mathcal {T}}\) is Lipschitz continuous w.r.t. \(\mu \) (with a Lipschitz constant possibly depending on \(\Vert Q\Vert _\infty )\), and Lipschitz continuous in Q (uniformly in \(\mu \));

  • the operator \(\underline{{\mathcal {P}}}\) is Lipschitz continuous in both variables.

If p is independent of \(\mu \), then both \({\mathcal {T}}\) and \(\underline{{\mathcal {P}}}\) are Lipschitz continuous.

Proof

Let us denote by \(L_p\) and \(L_f\) the Lipschitz constants of p and f, respectively. Let \((Q,\mu ),(Q',\mu ') \in {\mathbb {R}}^{|{\mathcal {X}}| \times |{\mathcal {A}}|} \times \varDelta ^{|{\mathcal {X}}|}\). We first consider \({\mathcal {T}}\). We have

$$\begin{aligned}&\Vert {\mathcal {T}}(Q,\mu ) - {\mathcal {T}}(Q',\mu )\Vert _{\infty } \\&\quad \le \gamma \max _{x,a} \sum _{x'} p(x' | x,a,\mu ) \left| \min _{a'}Q(x',a') - \min _{a'}Q'(x',a')\right| + \left\| Q - Q'\right\| _{\infty } \\&\quad \le (\gamma + 1) \left\| Q - Q'\right\| _{\infty }. \end{aligned}$$

Moreover,

$$\begin{aligned} \Vert {\mathcal {T}}(Q,\mu ) - {\mathcal {T}}(Q,\mu ')\Vert _{\infty }&\le \max _{x,a}|f(x, a, \mu ) - f(x, a, \mu ')| \\&\quad + \gamma \max _{x,a}\sum _{x'} |p(x' | x,a,\mu ) - p(x' | x,a,\mu ')| \, |\min _{a'}Q(x',a')| \\&\le (L_f + \gamma L_p \Vert Q\Vert _\infty )|{\mathcal {X}}| \Vert \mu - \mu '\Vert _{\infty }, \end{aligned}$$

where \(L_f\) and \(L_p\) are, respectively, the Lipschitz constants of f and p with respect to \(\mu \). If p is independent of \(\mu \), we obtain

$$\begin{aligned} \Vert {\mathcal {T}}(Q,\mu ) - {\mathcal {T}}(Q,\mu ')\Vert _{\infty }&\le L_f \Vert \mu - \mu '\Vert _{\infty }. \end{aligned}$$

We then show that the operator \(\underline{{\mathcal {P}}}\) is Lipschitz continuous. We have

$$\begin{aligned}&\Vert \underline{{\mathcal {P}}}(Q, \mu ) - \underline{{\mathcal {P}}}(Q, \mu ')\Vert _{\infty } \\&\quad \le \Vert \mu {\underline{P}}^{Q,\mu } - \mu ' {\underline{P}}^{Q,\mu '}\Vert _{\infty } + \Vert \mu - \mu '\Vert _{\infty } \\&\quad \le \left\| \sum _x \Big ({\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu )\mu (x) - {\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu ')\mu '(x)\Big ) \right\| _{\infty } \\&\qquad + \Vert \mu - \mu '\Vert _{\infty }. \end{aligned}$$

For the first term, we note that, for every \(x \in {\mathcal {X}}\),

$$\begin{aligned}&\left\| \Big ({\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu )\mu (x) - {\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu ')\mu '(x)\Big )\right\| _{\infty } \\&\quad \le \left\| \Big ({\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu ) - {\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu ') \Big )\mu (x)\right\| _{\infty } \\&\qquad + \left\| {\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu ')\Big (\mu (x) - \mu '(x)\Big )\right\| _{\infty } \\&\quad \le (L_p + 1) \left\| \mu - \mu '\right\| _{\infty }, \end{aligned}$$

where we used the fact that discrete probability measures are non-negative and bounded by 1.

Moreover, we have

$$\begin{aligned} \Vert \underline{{\mathcal {P}}}(Q, \mu ) - \underline{{\mathcal {P}}}(Q', \mu )\Vert _{\infty }&\le \Vert \mu ({\underline{P}}^{Q,\mu } - {\underline{P}}^{Q',\mu })\Vert _{\infty } \\&\le \sum _x \Vert {\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q(x), \mu )\\&- {\underline{p}}(\cdot | x, {{\,\mathrm{soft-min}\,}}Q'(x), \mu )\Vert _{\infty } \\&\le \sum _x L_p \Vert {{\,\mathrm{soft-min}\,}}Q(x) - {{\,\mathrm{soft-min}\,}}Q'(x)\Vert _{\infty } \\&\le |{\mathcal {X}}| \, L_p \, L_s \, c_{2,\infty } \, \Vert Q - Q'\Vert _{\infty }, \end{aligned}$$

which concludes the proof. \(\square \)

1.2 Application to a discrete model for the LQ problem

Recall that the continuous linear-quadratic model we consider is defined by (15). Here, we propose a finite space MDP which approximates the dynamics of a typical agent in this continuous LQ model. We consider that the action space is given by \(\mathcal {A} = \{ a_0=-1, a_1 = -1+\varDelta _{.}, \dots , a_{N_{\mathcal {A}}-1}=1-\varDelta _{.}, a_{N_{\mathcal {A}}}=1 \}\) and the state space by \(\mathcal {X} = \{ x_0=x_c-2, x_1=x_c-2+\varDelta _{.}, \dots , x_{N_{\mathcal {X}}-1}=x_c+2-\varDelta _{.}, x_{N_{\mathcal {X}}}=x_c+2 \}\), where \(x_c\) is the center of the state space. The step size for the discretization of the spaces \(\mathcal {X}\) and \(\mathcal {A}\) is given by \(\varDelta _{.} = \sqrt{\varDelta t} = 10^{-1} \).

Consider the transition probability:

$$\begin{aligned} p(x,x',a,\mu )= & {} {\mathbb {P}}(Z^{x+a, \varDelta t} \in [x'-\varDelta _{.}/2, x'+\varDelta _{.}/2]) \\= & {} \varPhi _{x+a, \sigma ^2 \varDelta t}(x'+\varDelta _{.}/2) - \varPhi _{x+a, \sigma ^2 \varDelta t}(x'-\varDelta _{.}/2), \end{aligned}$$

where \(Z \sim {\mathcal {N}}(x+a,\sigma ^2 \varDelta t)\) and \(\varPhi _{x+a,\sigma ^2 \varDelta t}\) is the cumulative distribution function of the \({\mathcal {N}}(x+a,\sigma ^2 \varDelta t)\) distribution. Moreover, consider that the one-step cost function is given by \(f(x,a,\mu ) \varDelta t\) with

$$\begin{aligned} f(x,a,\mu )= & {} \frac{1}{2}a^2 + c_1 \left( x- c_2 \sum _{\xi \in S} \xi \,\mu (\xi )\right) ^2 + c_3 \left( x- c_4 \right) ^2 \\&+ c_5 \left( \sum _{\xi \in S} \xi \,\mu (\xi )\right) ^2, \qquad b(x,a,\mu ) = a. \end{aligned}$$

For simplicity, we write \({{\bar{\mu }}} = \sum _{\xi \in S} \xi \,\mu (\xi )\) for the mean of \(\mu \).
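A sketch of this discretized MDP in numpy; \(\sigma \), \(x_c\), and the cost coefficients are illustrative placeholders, and the kernel rows are renormalized to account for the Gaussian mass that falls outside the truncated grid:

```python
import numpy as np
from math import erfc, sqrt

dt = 1e-2
delta = sqrt(dt)                  # Delta_. = sqrt(dt) = 1e-1
sigma, x_c = 1.0, 0.0             # illustrative volatility and grid center

A = np.arange(-1.0, 1.0 + delta / 2, delta)              # action grid on [-1, 1]
X = np.arange(x_c - 2.0, x_c + 2.0 + delta / 2, delta)   # state grid on [x_c-2, x_c+2]

def Phi(y, mean, sd):             # CDF of N(mean, sd^2); erfc keeps the tails positive
    return 0.5 * erfc((mean - y) / (sd * sqrt(2.0)))

sd = sigma * sqrt(dt)
# p[x, a, x'] = Phi(x' + Delta/2) - Phi(x' - Delta/2), with mean x + a and variance sigma^2 dt
p = np.array([[[Phi(xp + delta / 2, x + a, sd) - Phi(xp - delta / 2, x + a, sd)
                for xp in X] for a in A] for x in X])
p /= p.sum(axis=2, keepdims=True)   # renormalize the mass falling outside the grid

c1, c2, c3, c4, c5 = 0.25, 1.5, 0.5, 0.6, 0.4    # illustrative cost coefficients
def f(x, a, mu):
    mu_bar = float(X @ mu)          # mu_bar = sum_xi xi * mu(xi), the mean of mu
    return 0.5 * a**2 + c1 * (x - c2 * mu_bar)**2 + c3 * (x - c4)**2 + c5 * mu_bar**2

print(p.shape, f(X[0], A[0], np.full(len(X), 1.0 / len(X))))
```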

Lemma 2

In this model, f is Lipschitz continuous with respect to \(\mu \) and \({\underline{p}}\) is Lipschitz continuous with respect to \(\nu \) and \(\mu \).

Proof

We start with f. For the \(\mu \) component, we have:

$$\begin{aligned} |f(x,a,\mu ) - f(x,a,\mu ')|&\le c \left| \left( x- c_2 {{\bar{\mu }}}\right) ^2 - \left( x- c_2 {{\bar{\mu }}}'\right) ^2\right| + c \left| {{\bar{\mu }}}^2 - ({{\bar{\mu }}}')^2\right| \\&\le c \left| {{\bar{\mu }}}' - {{\bar{\mu }}} \right| \left| 2x - c_2({{\bar{\mu }}} + {{\bar{\mu }}}')\right| + c \left| {{\bar{\mu }}} - {{\bar{\mu }}}'\right| \left| {{\bar{\mu }}} + {{\bar{\mu }}}'\right| \\&\le c \max _{x \in S}|x| \, \left| {{\bar{\mu }}}' - {{\bar{\mu }}} \right| \\&\le c \max _{x \in S}|x| \, \sum _{\xi \in S}|\xi |\left| \mu '(\xi ) - \mu (\xi ) \right| \\&\le c \max _{x \in S}|x|^2 \, |S| \, \Vert \mu ' - \mu \Vert _\infty , \end{aligned}$$

where \(c>0\) is a constant depending only on the parameters of the model and whose value may change from line to line.

Then, we consider \({\underline{p}}\). It is independent of \(\mu \) in this model. For the action component, we have:

$$\begin{aligned}&|{\underline{p}}(x,x',\nu ,\mu ) - {\underline{p}}(x,x',\nu ',\mu )| \\&\quad = \left| \sum _{a}\nu (a)\Big (\varPhi _{x+a, \sigma ^2 \varDelta t}(x'+\varDelta _{.}/2) - \varPhi _{x+a, \sigma ^2 \varDelta t}(x'-\varDelta _{.}/2) \Big ) \right. \\&\left. \qquad - \sum _{a}\nu '(a)\Big (\varPhi _{x+a, \sigma ^2 \varDelta t}(x'+\varDelta _{.}/2) - \varPhi _{x+a, \sigma ^2 \varDelta t}(x'-\varDelta _{.}/2)\Big )\right| \\&\quad \le \left| \sum _{a} \left( \nu (a) - \nu '(a)\right) \varPhi _{x+a, \sigma ^2 \varDelta t}(x'+\varDelta _{.}/2) \right| + \left| \sum _{a} \left( \nu (a) - \nu '(a)\right) \varPhi _{x+a, \sigma ^2 \varDelta t}(x'-\varDelta _{.}/2) \right| \\&\quad \le \int _{-\infty }^{x'+\varDelta _{.}/2} \frac{1}{\sigma \sqrt{2\pi \varDelta t }} \left| \sum _{a}(\nu (a) - \nu '(a))e^{-\frac{(y-(x+a))^2}{2\sigma ^2 \varDelta t}} \right| dy \\&\qquad + \int _{-\infty }^{x'-\varDelta _{.}/2} \frac{1}{\sigma \sqrt{2\pi \varDelta t }} \left| \sum _{a}(\nu (a) - \nu '(a)) e^{-\frac{(y-(x+a))^2}{2\sigma ^2 \varDelta t}} \right| dy \\&\quad \le c \Vert \nu - \nu '\Vert _\infty , \end{aligned}$$

where c is a constant depending only on the model (and in particular on the state space, the action space and \(\varDelta t\)). \(\square \)

The Bellman equation for the optimal Q function in the asymptotic MFC framework

In this appendix, we provide the derivation of the Bellman equation (8) for the modified Q-function presented in Sect. 3.3.

Let \(\mathcal {X}\) and \(\mathcal {A}\) be discrete and finite state and action spaces. Let \(V^\alpha : \mathcal {X} \rightarrow {\mathbb {R}} \) and \(Q^\alpha : \mathcal {X} \times \mathcal {A} \rightarrow {\mathbb {R}} \) be the value function relative to the policy \(\alpha \) and the corresponding modified Q-function, defined as follows

$$\begin{aligned} V^{\alpha }(x)&= {\mathbb {E}} \left[ \sum _{n=0}^{\infty } \gamma ^n f(X_{n},\alpha (X_{n}),\mu ^{\alpha })\,\Big \vert \, X_{0} = x \right] , \end{aligned}$$
(50)
$$\begin{aligned} Q^{\alpha }(x,a)&= f(x,a,\mu ^{{\tilde{\alpha }}}) + \gamma \,{\mathbb {E}} \left[ \sum _{n=1}^{\infty } \gamma ^{n-1} f(X_{n},\alpha (X_{n}),\mu ^{\alpha })\,\Big \vert \, X_{0} = x, A_{0} = a \right] , \end{aligned}$$
(51)

where

$$\begin{aligned} \mu ^{\alpha }= \lim _{n\mapsto \infty } \mathcal {L}(X^{\alpha }_{n}) \quad \text {and} \quad {\tilde{\alpha }}(s) = {\left\{ \begin{array}{ll} \alpha (s), &{}\quad \forall s\ne x,\\ a, &{}\quad \text {if } s= x.\\ \end{array}\right. } \end{aligned}$$

Theorem 2

The optimal \(Q^*(x,a)=\min _\alpha Q^\alpha (x,a)\) satisfies the Bellman equation

$$\begin{aligned} Q^*(x,a) = f(x, a, {\tilde{\mu }}^*) + \gamma \sum _{x' \in {\mathcal {X}}} p(x' | x, a, {\tilde{\mu }}^*) \min _{a'} Q^*(x',a'), \qquad (x,a) \in {\mathcal {X}} \times {\mathcal {A}}, \end{aligned}$$
(52)

where the optimal control \(\alpha ^*\) is given by \(\alpha ^*(x)={{\,\mathrm{arg\,min}\,}}_a Q^*(x,a)\), the modification \({{\tilde{\alpha }}}^*\) is based on the pair \((x,a)\), and \({\tilde{\mu }}^*:=\mu ^{{\tilde{\alpha }}^*}\).

Remark 3

The population distribution \({\tilde{\mu }}^*\) based on the modification of \(\alpha ^*\) given the pair \((x,\alpha ^*(x))\) is equal to \({\mu }^*\). Indeed, \({\tilde{\alpha }}^*\) is equal to \(\alpha ^*\) itself, i.e.,

$$\begin{aligned} {\tilde{\alpha }}^*(s) = {\left\{ \begin{array}{ll} \alpha ^*(s), &{}\quad \forall s\ne x,\\ \alpha ^*(s), &{}\quad \text {if } s= x.\\ \end{array}\right. } \end{aligned}$$

Remark 4

The term \(\min _{a'} Q^*(x',a')\) does not depend on \({\tilde{\mu }}^*\), i.e.,

where step \(\square \) is due to Remark 3. It follows that (52) depends on \({\tilde{\mu }}^*\) only through the cost due to the first step.
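Since, by Remark 4, (52) involves \({\tilde{\mu }}^*\) only through the one-step cost, once the distribution is frozen the right-hand side of (52) is a standard \(\gamma \)-contraction in Q and can be solved by value iteration. A minimal sketch on a synthetic frozen model:

```python
import numpy as np

rng = np.random.default_rng(2)
nx, na, gamma = 5, 3, 0.5

# With mu_tilde_star frozen, f and p become plain arrays and (52) is a
# standard gamma-contraction; the arrays here are synthetic placeholders.
p = rng.random((nx, na, nx)); p /= p.sum(axis=2, keepdims=True)
f = rng.random((nx, na))

Q = np.zeros((nx, na))
for _ in range(200):               # iterate the Bellman operator of (52)
    Q = f + gamma * p @ Q.min(axis=1)
alpha_star = Q.argmin(axis=1)      # alpha*(x) = argmin_a Q*(x, a)
print(alpha_star)
```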

In order to prove Theorem 2, the following results are required.

Theorem 3

The Bellman equation for \(Q^{\alpha }\) is given by

$$\begin{aligned} Q^{\alpha }(x,a) = f(x,a,\mu ^{{\tilde{\alpha }}}) +\gamma {\mathbb {E}} \left[ Q^{\alpha }(X_{1},\alpha (X_{1}))\,\Big \vert \, X_{0} = x, A_{0} = a \right] , \end{aligned}$$
(53)

Lemma 3

The value function relative to the policy \(\alpha \) coincides with the corresponding Q-function evaluated at the pair \((x,\alpha (x))\), i.e.,

$$\begin{aligned} V^{\alpha }(x) = Q^{\alpha }(x,\alpha (x)). \end{aligned}$$
(54)

Theorem 4

(Policy improvement) Let \({\tilde{\alpha }}\) be the policy derived from \(\alpha \) by

$$\begin{aligned} \begin{aligned} {\tilde{\alpha }}(s)&= {\left\{ \begin{array}{ll} \alpha (s), \quad &{}\text {for } s\ne x,\\ a, \quad &{}\text {for } s= x.\end{array}\right. } \end{aligned} \end{aligned}$$

such that

$$\begin{aligned} Q^{\alpha }(x,{\tilde{\alpha }}(x)) > V^{\alpha }(x). \end{aligned}$$
(55)

Then,

$$\begin{aligned} V^{{\tilde{\alpha }}}(x') > V^{\alpha }(x') \quad \forall x' \in {\mathcal {X}}. \end{aligned}$$
(56)

Theorem 5

Let \(V^*:\mathcal {X} \rightarrow {\mathbb {R}}\) be defined as \(V^*(x)=\max _{\alpha } V^{\alpha }(x)\). Then,

$$\begin{aligned} V^*(x)= \max _a \max _{\alpha } Q^{\alpha }(x,a), \end{aligned}$$
(57)

Proof (Theorem 3)

$$\begin{aligned} \begin{aligned}&Q^{\alpha }(x,a) \\&\quad = f(x,a,\mu ^{{\tilde{\alpha }}}) \\&\quad \quad +\gamma {\mathbb {E}} \left[ {\mathbb {E}} \left[ \sum _{n=1}^{\infty } \gamma ^{n-1} f(X_{n},\alpha (X_{n}),\mu ^{\alpha })\,\Big \vert \, X_{0} = x, A_{0} = a, X_{1} \right] \,\Big \vert \, X_{0} = x, A_{0} = a \right] \\&\quad = f(x,a,\mu ^{{\tilde{\alpha }}}) + \gamma {\mathbb {E}} \left[ {\mathbb {E}} \left[ \sum _{n=1}^{\infty } \gamma ^{n-1} f(X_{n},\alpha (X_{n}),\mu ^{\alpha })\,\Big \vert \, X_{1} \right] \,\Big \vert \, X_{0} = x, A_{0} = a\right] \\&\quad = f(x,a,\mu ^{{\tilde{\alpha }}}) \\&\quad \quad + \gamma {\mathbb {E}} \left[ f(X_{1},\alpha (X_{1}),\mu ^{\alpha }) + \gamma {\mathbb {E}} \left[ \sum _{n=2}^{\infty } \gamma ^{n-2} f(X_{n},\alpha (X_{n}),\mu ^{\alpha })\,\Big \vert \, X_{1} \right] \,\Big \vert \, X_{0} = x, A_{0} = a \right] \\&\quad = f(x,a,\mu ^{{\tilde{\alpha }}}) +\gamma {\mathbb {E}} \left[ Q^{\alpha }(X_{1},\alpha (X_{1}))\,\Big \vert \, X_{0} = x, A_{0} = a \right] , \end{aligned} \end{aligned}$$

\(\square \)

Proof (Lemma 3)

 

where we used that the modification of \(\alpha \) given the pair \((x,\alpha (x))\) is equal to \(\alpha \) itself and consequently \(\mu ^{\alpha }=\mu ^{{{\tilde{\alpha }}}}\). \(\square \)

Proof (Theorem 4)

Step 1 Show that \(V^{\alpha }(x) < V^{{\tilde{\alpha }}}(x) \).

We observe that

Considering the limit as \(k\rightarrow \infty \), it follows that

$$\begin{aligned} V^{\alpha }(x) <{\mathbb {E}} \left[ \sum _{n=0}^{\infty } \gamma ^n f(X_{n}, {\tilde{\alpha }}(X_{n}),\mu ^{{\tilde{\alpha }}}) \,\Big \vert \, X_{0}=x \right] = V^{{\tilde{\alpha }}}(x). \end{aligned}$$

Step 2 Show that \(V^{\alpha }(x') < V^{{\tilde{\alpha }}}(x') \quad \forall x' \in {\mathcal {X}}\setminus \{x\}\).

Define \(\tau _x = \min \{ n : X_{n} = x \}\). Then,

We start by analyzing the first term, observing that \(X_{n} \ne x\) and \(\alpha (X_{n}) = {\tilde{\alpha }}(X_{n})\) for all \(n \le \tau _x - 1\). Then,

$$\begin{aligned} T_1 = {\mathbb {E}} \left[ \sum _{n=0}^{\tau _x-1} \gamma ^n f(X_{n},{\tilde{\alpha }}(X_{n}),\mu ^{{\tilde{\alpha }}} )\,\Big \vert \, X_{0} = x' \right] \end{aligned}$$

The analysis of the term \(T_2\) is based on the tower property (TP), the Markov property (MP) and Step 1 (S1). It follows that

Combining the analyses of \(T_1\) and \(T_2\), it follows that

$$\begin{aligned} \begin{aligned} V^{\alpha }(x')&= T_1 + T_2 \\&< {\mathbb {E}} \left[ \sum _{n=0}^{\tau _x-1} \gamma ^n f(X_{n},{\tilde{\alpha }}(X_{n}),\mu ^{{\tilde{\alpha }}} )\,\Big \vert \, X_{0} = x' \right] +{\mathbb {E}}\left[ \gamma ^{\tau _x} V^{{\tilde{\alpha }}}(X_{\tau _x}) \,\Big \vert \, X_{0} = x' \right] \\&= {\mathbb {E}} \left[ \sum _{n=0}^{\tau _x-1} \gamma ^n f(X_{n},{\tilde{\alpha }}(X_{n}),\mu ^{{\tilde{\alpha }}} )+\gamma ^{\tau _x}\sum _{n=\tau _x}^{\infty } \gamma ^{n-\tau _x} f(X_{n},{\tilde{\alpha }}(X_{n}),\mu ^{{\tilde{\alpha }}})\,\Big \vert \, X_{0} = x' \right] \\&={\mathbb {E}} \left[ \sum _{n=0}^{\infty } \gamma ^n f(X_{n},{\tilde{\alpha }}(X_{n}),\mu ^{{\tilde{\alpha }}} )\,\Big \vert \, X_{0} = x' \right] \\&=V^{{\tilde{\alpha }}}(x') \end{aligned} \end{aligned}$$

\(\square \)

Proof (Theorem 5)

Let \({\mathcal {X}}=\{ x_1, \dots , x_n\}\) and \({\mathcal {A}}=\{ a_0, \dots , a_m\}\) be the state and action spaces.

Step 1 Let \(\alpha ^0\) be an initial policy and define \(\alpha ^1\) as follows

$$\begin{aligned} \alpha ^1(x) = {\left\{ \begin{array}{ll} \arg \max _a Q^{\alpha ^0}(x,a), \quad &{}\text { if } x = x_1,\\ \alpha ^0(x), \quad &{}\text { o.w. } \end{array}\right. } \end{aligned}$$

Then,

Step 2 Consider \(\alpha ^2\) defined as follows

$$\begin{aligned} \begin{aligned} \alpha ^2(x)&= {\left\{ \begin{array}{ll} \arg \max _a Q^{\alpha ^1}(x,a), \quad &{}\text { if } x = x_2,\\ \alpha ^1(x), \quad &{}\text { o.w. } \end{array}\right. } \\&={\left\{ \begin{array}{ll} \arg \max _a Q^{\alpha ^1}(x,a), \quad &{}\text { if } x = x_2,\\ \arg \max _a Q^{\alpha ^0}(x,a), \quad &{}\text { if } x = x_1,\\ \alpha ^0(x), \quad &{}\text { o.w. } \end{array}\right. } \end{aligned} \end{aligned}$$

Then,

Step \({\varvec{n}}\) Consider \(\alpha ^{n}\) defined as follows

$$\begin{aligned} \begin{aligned} \alpha ^n(x)&= {\left\{ \begin{array}{ll} \arg \max _a Q^{\alpha ^{n-1}}(x,a), \quad &{}\text { if } x = x_n,\\ \alpha ^{n-1}(x), \quad &{}\text { o.w. } \end{array}\right. } \\&=\arg \max _a Q^{\alpha ^{k-1}}(x,a), \quad \quad \quad \text { if } x = x_k, \text { for } k =1,\dots , n . \end{aligned} \end{aligned}$$

Then,

Step \({\varvec{N}}\) Since the state and action spaces are finite, the policy can be improved only a finite number of times. In other words, \(\exists N>0\) such that

$$\begin{aligned} \alpha ^{N}(x) = \arg \max _a Q^{\alpha ^N}(x,a), \quad \forall x \in {\mathcal {X}} \end{aligned}$$

and

$$\begin{aligned} V^{\alpha ^N}(x) = Q^{\alpha ^N}(x,\alpha ^N(x)) = \max _a Q^{\alpha ^N}(x,a), \quad \forall x \in {\mathcal {X}}. \end{aligned}$$

Can \(\alpha ^N\) still be suboptimal? No, by an extension of the optimality theorem of Bellman and Dreyfus (1962), see [3]. \(\square \)
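The finite improvement argument above can be mirrored by a standard policy-iteration sketch. It is written here in the minimization convention of (52) (the proof uses the max convention, which is the same argument up to a sign), with the population term frozen into synthetic arrays:

```python
import numpy as np

rng = np.random.default_rng(3)
nx, na, gamma = 4, 3, 0.5
p = rng.random((nx, na, nx)); p /= p.sum(axis=2, keepdims=True)
f = rng.random((nx, na))           # synthetic costs, population term frozen

def Q_of(alpha):
    # Policy evaluation: V^alpha solves (I - gamma P_alpha) V = f_alpha,
    # then Q^alpha(x, a) = f(x, a) + gamma sum_x' p(x'|x,a) V(x').
    P_a = p[np.arange(nx), alpha]
    V = np.linalg.solve(np.eye(nx) - gamma * P_a, f[np.arange(nx), alpha])
    return f + gamma * p @ V

alpha = np.zeros(nx, dtype=int)    # alpha^0: an arbitrary initial policy
for _ in range(nx * na):           # finitely many improvements suffice (Step N)
    alpha_new = Q_of(alpha).argmin(axis=1)
    if np.array_equal(alpha_new, alpha):
        break                      # alpha^N(x) = argmin_a Q^{alpha^N}(x, a)
    alpha = alpha_new
print(alpha)
```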

Proof (Theorem 2)

where the last step is due to what was shown in the proof of (57), i.e., the same policy \(\alpha ^*\) optimizes \(V^{\alpha }\) and \(Q^{\alpha }\). \(\square \)


About this article


Cite this article

Angiuli, A., Fouque, JP. & Laurière, M. Unified reinforcement Q-learning for mean field game and control problems. Math. Control Signals Syst. 34, 217–271 (2022). https://doi.org/10.1007/s00498-021-00310-1
