1 Introduction

1.1 Motivations

Based on a modification of a model by van der Pol, FitzHugh [17] proposed in 1961 the following system of equations in order to describe the dynamics of a single neuron subject to an external current I:

$$\begin{aligned} \begin{aligned}&\dot{v}= v - \frac{1}{3}v^3 -w +I\\&\dot{w}=c(v +a -bw) \end{aligned} \end{aligned}$$
(2)

for some constants \(a,b,c>0\), where the unknowns \(v,w\) correspond respectively to the so-called voltage and recovery variables (see also Nagumo [19]). In the presence of interactions, one has to enlarge the previous pair by an additional unknown y that represents the fraction of open (synaptic) channels, and which is sometimes referred to as the gating variable.
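For readers who wish to experiment, a minimal forward-Euler sketch of the deterministic dynamics (2) might look as follows; the parameter values \(a=0.7\), \(b=0.8\), \(c=0.08\), \(I=0.5\) are common illustrative choices, not values prescribed in this paper.

```python
import numpy as np

def fitzhugh_nagumo(a=0.7, b=0.8, c=0.08, I=0.5, T=200.0, dt=0.01, v0=-1.0, w0=1.0):
    """Forward-Euler integration of the FitzHugh-Nagumo system (2)."""
    n = int(T / dt)
    v = np.empty(n + 1)
    w = np.empty(n + 1)
    v[0], w[0] = v0, w0
    for k in range(n):
        v[k + 1] = v[k] + dt * (v[k] - v[k]**3 / 3.0 - w[k] + I)
        w[k + 1] = w[k] + dt * c * (v[k] + a - b * w[k])
    return v, w

v, w = fitzhugh_nagumo()  # for these values the trajectory settles on a relaxation oscillation
```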

When it comes to an interacting network of neurons, it is customary to assume that the corresponding graph is fully connected, which is arguably a good approximation at small scales [24]. This implies that all the neurons in the given network add a contribution to the interaction terms in the equation. Precisely, for a population of size \(N\in {\mathbb {N}},\) the state at time t of the i-th neuron is described by the three-dimensional vector

$$\begin{aligned} X^i_t=(v^i_t,w^i_t,y^i_t) ,\quad i=1,\dots N, \end{aligned}$$

and one is led to study the system of 3N stochastic differential equations:

$$\begin{aligned} \left\{ \begin{aligned}&dv^i_t=\Big (v^i_t-\frac{(v^i_t)^3}{3}-w_t^i + I_t\Big )dt +\sigma _{ext} d W^i_t \\&\quad \quad - \frac{1}{N}\sum \nolimits _{j=1}^N J(v^i_t-V_{rev})y^j_tdt - \frac{1}{N}\sum \nolimits _{j=1}^N\sigma ^J(v^i_t-V_{rev})y^j_tdB_t^i \\&dw^i_t=c (v^i_t+a -b w^i_t)dt, \\&dy^i_t=({\overline{a}} S (v_t^i)(1-y_t^i)-{\overline{b}} y_t^i)dt + \sigma ^{y^i}(v^i)d{{\tilde{B}}}^i_t\,. \end{aligned}\right. \end{aligned}$$
(3)

In the above, \(W^i\), \(B^{i}\), \({{\tilde{B}}}^i\) are i.i.d. Brownian motions modelling independent sources of noise with respective intensities \(\sigma _{ext},\sigma ^J,\sigma ^{y^i}(v^i)>0\). The last of these intensities depends on the solution, through the formula

$$\begin{aligned} \sigma ^{y}(v)=\chi (y)\sqrt{{\overline{a}}S(v)(1-y)+{\overline{b}}y} \end{aligned}$$
(4)

with given constants \({\overline{a}},{\overline{b}}>0\) and some smooth cut-off function \(\chi :{\mathbb {R}}\rightarrow {\mathbb {R}}\) supported in (0, 1).
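For concreteness, here is a minimal sketch of one admissible cut-off \(\chi \) (any smooth bump supported in (0, 1) will do) together with the intensity (4); the margin delta, the constants a_bar, b_bar and the sigmoid S are placeholder choices, not values prescribed by the paper.

```python
import numpy as np

def chi(y, delta=0.05):
    """A smooth bump supported in (delta, 1 - delta), a subset of (0, 1)."""
    y = np.asarray(y, dtype=float)
    t = (y - delta) / (1.0 - 2.0 * delta)        # rescale (delta, 1 - delta) to (0, 1)
    inside = (t > 0.0) & (t < 1.0)
    safe = np.where(inside, t * (1.0 - t), 1.0)  # avoid division by zero off-support
    return np.where(inside, np.exp(-1.0 / safe), 0.0)

def sigma_y(v, y, a_bar=1.0, b_bar=1.0, S=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """Noise intensity (4) with placeholder constants and sigmoid."""
    rad = a_bar * S(v) * (1.0 - y) + b_bar * y
    return chi(y) * np.sqrt(np.maximum(rad, 0.0))  # radicand is nonnegative on supp(chi)
```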

In this model, the voltage variable \(v^i\) describes the membrane potential of the \( i \)-th neuron in the network, while the recovery variable \(w^i\) models the dynamics of corresponding ion channels, which can influence the membrane potential of neuron i by opening and closing, depending on \(v^i\).

Without the interaction term (i.e. when \(J=\sigma ^J=0\)), equation (3) reduces to a network of N independent neurons, where the dynamics of each neuron is modelled by the FitzHugh–Nagumo equation (2).

When interaction is present (\(J\not =0\)), the model describes the situation where each neuron of the network affects its adjacent neurons by releasing chemical transmitters, causing particular ion channels of adjacent neurons to open. This induces a current in the adjacent neuron, affecting its membrane potential. In this extended model, the gating variable \(y^i\) models the fraction of open ion channels in the neurons adjacent to neuron i, and thus ought to be a number between 0 and 1 (hence the cut-off \(\chi (y^i)\) in (4)). Loosely speaking, \(y^i\) should be thought of as the output contribution of neuron i.

Depending on the fraction \(y^i\) of open channels, the induced current for an adjacent neuron with membrane potential v is given by \(-J(v-V_{rev})y^i\), where the constant \(V_{rev}\) denotes the membrane potential at which there is no net current flow. The coupling strength J originally refers to the mean of the maximum conductance, which is typically affected by noise coming from the environment. This explains the diffusion part of the interaction term in Eq. (3) when \(\sigma ^J\not =0\). For further details we refer to [2].

The dynamics of the gating variable \(y^i\) in Eq. (3) depends on some physical constants, which we will now briefly introduce:

  • \(S(v^i)\) refers to the concentration of chemical transmitters released by neuron i; explicitly, for \(v\in {\mathbb {R}}\)

    $$\begin{aligned} S(v)=\dfrac{T_{max}}{1+e^{-\lambda (v-V_T)}} \end{aligned}$$
    (5)

    where \(T_{max}\) is a given maximal concentration and \(\lambda ^{-1}>0,V_T>0\) are constants setting respectively the steepness and the value at which S(v) is half-activated (for typical values, see for instance [13]);

  • \({\overline{a}},{\overline{b}}>0\) correspond to some rise and decay rates, respectively.
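Gathering (3)–(5), a minimal Euler–Maruyama sketch of the interacting network might read as follows, reusing chi and sigma_y from the sketch above; all numerical constants are illustrative placeholders rather than calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants, not calibrated to the references
a, b, c = 0.7, 0.8, 0.08
a_bar, b_bar = 1.0, 1.0
J, V_rev = 0.5, 1.0
sigma_ext, sigma_J = 0.1, 0.05
T_max, lam, V_T = 1.0, 1.0, 0.5

def S(v):
    """Transmitter concentration (5)."""
    return T_max / (1.0 + np.exp(-lam * (v - V_T)))

def simulate_network(N=100, T=100.0, dt=0.01, I=0.5):
    """Euler-Maruyama discretisation of system (3) for N fully connected neurons."""
    v = rng.normal(-1.0, 0.1, N)
    w = rng.normal(1.0, 0.1, N)
    y = np.full(N, 0.2)
    for _ in range(int(T / dt)):
        ybar = y.mean()  # (1/N) * sum_j y^j
        dW, dB, dBt = (rng.normal(0.0, np.sqrt(dt), N) for _ in range(3))
        dv = (v - v**3 / 3.0 - w + I - J * (v - V_rev) * ybar) * dt \
             + sigma_ext * dW - sigma_J * (v - V_rev) * ybar * dB
        dw = c * (v + a - b * w) * dt
        dy = (a_bar * S(v) * (1.0 - y) - b_bar * y) * dt + sigma_y(v, y, a_bar, b_bar, S) * dBt
        v, w, y = v + dv, w + dw, y + dy
    return v, w, y
```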

For a better understanding of the interaction, we included a small illustration in Fig. 1.

Fig. 1 Synaptic dynamics

In this representation, a rapid increase of the membrane potential of the neuron i will cause it to release chemical messengers into the (synaptic) cleft between the neuron i and the adjacent neuron j, which in turn will bind to receptors of the neuron j. The receptors will cause ion channels of neuron j to open, thus neuron i induces the opening of a fraction \(y^i\) of ion channels at the dendrites of the neuron j. As already mentioned, the resulting current from i to j affecting the neuron j is then given by \(-J(v^j-V_{rev})y^i\). For a thorough presentation of (3) and its applications in the field of neurosciences, we refer for instance to the monograph of Ermentrout and Terman [16].

When it comes to monitoring neural activity in the brain, one typically does not measure single-neuron activity, but considers more macroscopic measurements like, e.g., local field potentials (LFPs). The LFP refers to the electric potential in the extracellular space around neurons, and it is influenced by all ionic processes around the electrode. It is assumed that action potentials have a limited impact on the LFP, which is instead strongly influenced by synaptic currents. Although we are aware of the fact that the correct interpretation of measurements like the intracerebral local field potential is highly non-trivial, for our type of application it is reasonable to assume that measurements only depend on the distribution of our network, rather than on activities of single neurons from the network. Furthermore, external stimulation acts homogeneously on every neuron of the network, that is, every neuron of the network receives the same external input I. We therefore model the dynamics of a typical neuron as a controlled mean-field type equation; admissible controls need to be independent of the state of the individual neuron, hence we will consider deterministic controls. In the following we formulate our mean-field model; the set of admissible controls will be defined in Sect. 2.2.

1.2 Propagation of Chaos

The system (3) has the generic form

$$\begin{aligned} \left\{ \begin{aligned} dX_t^{N,i}&=b(t,X_t^{N,i},{\overline{\mu }}_{X_{t}^N},\alpha _t)dt+\sigma (t,X_t^{N,i},{\overline{\mu }}_{X_{t}^N},\alpha _t)dW_t^i\,,&t\in [0,T], \\ X_0^{N,i}&\sim u_0, \end{aligned}\right. \end{aligned}$$
(6)

for \(i=1,\dots ,N\), where \(u_0\) is a probability measure on \({\mathbb {R}}^d\), \((\alpha _t)\) is a control and \({{\bar{\mu }}}_{X_t^N}\) denotes the empirical measure

$$\begin{aligned} {\overline{\mu }}_{X_{t}^N}:=\dfrac{1}{N}\sum _{k=1}^{N}\delta _{X_t^{N,k}}. \end{aligned}$$

As \(N\rightarrow \infty \), one is naturally led to investigate the convergence in law of the solutions of (6) towards the probability measure \(\mu ={{\,\mathrm{\mathcal {L}}\,}}(X|{\mathbb {P}})\), where X solves

$$\begin{aligned} \left\{ \begin{aligned}&dX_t =b(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)dt+\sigma (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)dW_t\,,\quad t\in [0,T]\,, \\&X_0\in L^2(\Omega ,\mathcal {F}_0,{\mathbb {P}};{\mathbb {R}}^d)\,, \end{aligned}\right. \end{aligned}$$
(7)

and where \(b,\sigma \) are the coefficients obtained by substituting expectations in (6) in place of empirical means. In the context of (3), a first mathematical investigation of such convergence is due to Baladron, Fasoli, Faugeras and Touboul [2] (see also the clarification notes [6]). In this direction, the authors show that the sequence of symmetric probability measures

$$\begin{aligned} \mu _N:={{\,\mathrm{\mathcal {L}}\,}}((X^{N,1},\dots ,X^{N,N})|{\mathbb {P}})\end{aligned}$$

is \(\mu \)-chaotic. Namely, for each fixed integer \(k\ge 1\) and every finite sequence of bounded and continuous functions \( \phi _i:C([0,T];{\mathbb {R}}^{d})\rightarrow {\mathbb {R}},\) \( i=1,\dots ,k, \) it holds

$$\begin{aligned} \lim \limits _{N\rightarrow \infty }\langle \mu _N,\phi _1\otimes \dots \otimes \phi _k\otimes 1\otimes \dots \otimes 1\rangle =\prod _{i=1}^{k}\langle \mu ,\phi _i\rangle , \end{aligned}$$

where \( \langle \mu ,\phi \rangle :=\int \phi d\mu \) and the symbol “\(\otimes \)” denotes the usual tensor product, i.e. \( \phi _i\otimes \phi _j(f,g):=\phi _i(f)\phi _j(g) \) for any \( f,g\in C([0,T];{\mathbb {R}}^d) .\) This situation is usually referred to as “propagation of chaos”. Although we could not find any literature concerning propagation of chaos results in the general non-Lipschitz case, propagation of chaos for the system (3) was investigated in [6].

1.3 Mean-Field Limit and Control

In this regard, taking \(N\gg 1\) guarantees that a “good enough” approximation of (3) is given by the mean-field limit (7), where the corresponding coefficients \((b,\sigma ):[0,T]\times {\mathbb {R}}^3\times \mathcal {P}({\mathbb {R}}^3)\times {\mathbb {R}}\rightarrow {\mathbb {R}}^3\times {\mathbb {R}}^{3\times 3}\), are given by

$$\begin{aligned} \begin{aligned} b(t,x,\mu ,\alpha )&=\begin{pmatrix} v-\frac{v^3}{3}-w+\alpha \\ c(v+a-bw) \\ {\overline{a}}S(v)(1-y)-{\overline{b}}y \end{pmatrix} + \begin{pmatrix} -J(v-V_{rev})\int _{{\mathbb {R}}^3}z_3\mu (dz) \\ 0 \\ 0 \end{pmatrix}, \end{aligned} \end{aligned}$$
(8)

for \(x=(v,w,y),\) and

$$\begin{aligned} \begin{aligned} \sigma (t,x,\mu ,\alpha )&=\begin{pmatrix} \sigma _{ext} &{} -\sigma ^J(v-V_{rev})\int _{{\mathbb {R}}^3}z_3\mu (dz) &{} 0 \\ 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} \chi (y)\sqrt{{\overline{a}}S(v)(1-y)+{\overline{b}}y} \end{pmatrix}. \end{aligned} \end{aligned}$$
(9)
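In code, the coefficients (8)–(9) depend on the measure \(\mu \) only through the scalar \(\int _{{\mathbb {R}}^3}z_3\mu (dz)\) (the mean fraction of open channels), so a sketch can simply take that number as an argument. The snippet below reuses S, sigma_y and the placeholder constants of the network sketch in Sect. 1.1.

```python
import numpy as np

def drift_b(t, x, beta_mu, alpha):
    """Drift (8); beta_mu stands for the barycenter of the third marginal of mu."""
    v, w, y = x
    return np.array([
        v - v**3 / 3.0 - w + alpha - J * (v - V_rev) * beta_mu,
        c * (v + a - b * w),
        a_bar * S(v) * (1.0 - y) - b_bar * y,
    ])

def diffusion_sigma(t, x, beta_mu, alpha):
    """Diffusion matrix (9); the measure enters only through beta_mu."""
    v, w, y = x
    sig = np.zeros((3, 3))
    sig[0, 0] = sigma_ext
    sig[0, 1] = -sigma_J * (v - V_rev) * beta_mu
    sig[2, 2] = sigma_y(v, y, a_bar, b_bar, S)  # entry (3, 3), cf. (4)
    return sig
```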

In this paper, we concentrate our attention on the optimal control problem associated with a cost functional of the form

$$\begin{aligned} J:{{\mathbb {A}}}\rightarrow {\mathbb {R}},\quad \alpha \mapsto {\mathbb {E}}\left( \int _{0}^{T}f(t,X_t^\alpha ,{{\,\mathrm{\mathcal {L}}\,}}(X_t^\alpha ),\alpha _t)dt+g(X_T^\alpha ,{{\,\mathrm{\mathcal {L}}\,}}(X_T^\alpha ))\right) , \end{aligned}$$
(10)

for suitable functions f and g, and where \(X^\alpha \) is subject to the dynamical constraint (7). The cost functional ought to be minimized over some convex, admissible set of controls \({{\mathbb {A}}}.\)
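Numerically, a cost of the form (10) is typically evaluated by a particle approximation: one simulates a cloud of i.i.d. copies of (7), lets the empirical measure of the cloud stand in for \({{\,\mathrm{\mathcal {L}}\,}}(X_t^\alpha )\), and averages. A hedged sketch, where simulate_particles is a hypothetical routine returning an array of shape (n_paths, n_steps + 1, d):

```python
import numpy as np

def cost_J(alpha, f, g, simulate_particles, dt):
    """Monte Carlo estimate of the cost functional (10).

    alpha is the (deterministic) control evaluated on the time grid."""
    X = simulate_particles(alpha)  # particle approximation of the solutions of (7)
    n_steps = X.shape[1] - 1
    running = 0.0
    for k in range(n_steps):
        cloud = X[:, k, :]  # samples standing in for L(X_{t_k})
        running += dt * np.mean([f(k * dt, x, cloud, alpha[k]) for x in cloud])
    final = X[:, -1, :]
    return running + np.mean([g(x, final) for x in final])
```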

Because of potential applications in neuroscience, the control of the stochastic FHN model has gained a lot of attention in recent years (see, e.g., [3, 11]). The need to introduce random perturbations in the original model is widely justified from a physics perspective (see for instance [12] and the references therein). In [11] the authors investigate a FitzHugh–Nagumo SPDE which results from the continuum limit of a network of coupled FitzHugh–Nagumo equations. We have a similar structure in mind regarding the dependence of the coefficients on the control (namely, the dynamics of the membrane potential depends linearly on the control). Our approach here is however completely different, in that it hinges on the McKean-Vlasov type SDE (7) that originates from the propagation of chaos.

McKean-Vlasov control problems of this type were investigated in the past decade by Bensoussan, Frehse and Yam [4], but also by Carmona and co-authors (see for instance [9]). These developments culminated in the monograph of Carmona and Delarue [8], where a systematic treatment is given (under reasonable assumptions). Other related works include [1, 5, 7, 14]. These results fail however to encompass (7)–(9), due for instance to the lack of a Lipschitz property for the drift coefficient.

From the analytic point of view, the FitzHugh–Nagumo model also suffers from the fact that the diffusion matrix is degenerate, making it difficult to obtain energy estimates for the Kolmogorov equation (see Remark 3.2).

Our objective in this work is twofold. First, our purpose is to extract some of the qualitative features of the FitzHugh–Nagumo system and its mean-field limit, in a broader treatment that encompasses (3) and (7)–(9). In this sense, our intention is not to deal with the previous models “as such”; instead, we aim to take a step further by dealing with a certain class of equations that has the following peculiarities:

  • (Monotonicity) – though the drift coefficient in (7) displays a cubic non-linearity, it satisfies the monotonicity condition \(\langle x-x',b(t,x,\mu ,\alpha )-b(t,x',\mu ,\alpha )\rangle \lesssim |x-x'|^2\).

  • (Structural assumption on dynamics and level set boundedness) – the dynamics of the coupling variable ensures that the boundedness property \(y_t\in [0,1]\) holds for all times.

  • (Interaction with linear growth w.r.t. the unknown) – the drift nonlinearity displays the behaviour \(|b(t,x,\mu ,\alpha )-b(t,x,\nu ,\alpha )|\lesssim (1+|x|)W_2(\mu ,\nu )\), where \(W_2\) denotes the usual 2-Wasserstein distance defined in Sect. 2.1.

Under the above setting, we aim to develop and implement direct variational methods, in the spirit of the stochastic approach of Yong and Zhou [25] for classical control problems (note that some work in this direction has been already done by Pfeiffer [21, 22], in a slightly different setting). Second, we aim to derive a Pontryagin maximum principle for mean-field type control problems of the previous form, with a view towards efficient numerical approximations of optimal controls (e.g. gradient descent).

1.4 Organization of the Paper

In Sect. 2 we introduce our assumptions on the coefficients and give the main results. Section 3 is devoted to the well-posedness of the main optimal control problem (Theorem 2.1). In Sect. 4, we show the corresponding maximum principle (Theorem 2.2). Finally, Sect. 5 will be devoted to numerical examples.

2 Preliminaries

2.1 Notation and Settings

In the whole manuscript, we consider an arbitrary but finite time horizon \(T>0\). We fix a dimension \(d\ge 1\), and denote the scalar product in \({\mathbb {R}}^d\) by \(\langle \cdot ,\cdot \rangle .\) If \(A,B\) are matrices of the same size, we shall also write \(\langle A, B\rangle \) for their scalar product, namely

$$\begin{aligned} \langle A,B\rangle := \mathrm {tr}(A^{\dagger }B) \end{aligned}$$

where \(A^\dagger \) is the transposed matrix, and \(\mathrm {tr}\) the trace operator. For a continuously differentiable function \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\), we adopt the suggestive notation \(f_x\) to denote its Jacobian (seen for each \(x\in {\mathbb {R}}^d\) as an element of the dual of \({\mathbb {R}}^d\)). Given \(h\in {\mathbb {R}}^d\), we let

$$\begin{aligned} f_x(x)\cdot h \end{aligned}$$
(11)

be the evaluation of \(f_x(x)\) at h. A similar convention will be used for vector-valued functions.

Throughout the paper, we fix a complete filtered probability space \((\Omega ,\mathcal {F}, (\mathcal {F}_t)_{t\in [0,T]},{\mathbb {P}})\) carrying an m-dimensional Wiener process \((W_t)_{t\in [0,T]}\). Given \(p\in [1,\infty )\) and a p-integrable random variable X, we denote its usual \(L^p\)-norm by \(\Vert X\Vert _p:={\mathbb {E}}(|X|^p)^{1/p}\). We further introduce the spaces

$$\begin{aligned} \mathcal {H}^{2,d}&:=\bigg \lbrace Z:\Omega \times [0,T]\rightarrow {\mathbb {R}}^d\,\bigg |\, Z\text { prog.\ measurable and }\int _{0}^{T}\Vert Z_t\Vert _2^2dt<\infty \bigg \rbrace \\ \mathcal {S}^{2,d}&:=\bigg \lbrace Z:\Omega \times [0,T]\rightarrow {\mathbb {R}}^d\,\bigg |\, Z\text { prog.\ measurable, continuous and }\Big \Vert \sup _{t\in [0,T]}|Z_t|\Big \Vert ^2_2<\infty \bigg \rbrace . \end{aligned}$$

For \(m\in {\mathbb {N}}\), the notations \(\mathcal {S}^{2,d\times m}\), \(\mathcal {H}^{2,d\times m}\) will also be used to denote the corresponding sets of \(d\times m\) matrix-valued processes. Whenever clear from the context, we will omit to indicate dimensions and write \(\mathcal {S}^2\) or \(\mathcal {H}^2\) instead.

We will denote by \(\mathcal {P}({\mathbb {R}}^d)\) the set of all probability measures on \(({\mathbb {R}}^d,\mathcal {B}({\mathbb {R}}^d))\). For \(p\in [1,\infty )\), \(\mu \in \mathcal {P}({\mathbb {R}}^d)\) we define the moment of order p:

$$\begin{aligned} \mathcal {M}_p(\mu )^p:=\int _{{\mathbb {R}}^d}|x|^p\mu (dx)\in [0,\infty ], \end{aligned}$$

and we let \(\mathcal {P}_p({\mathbb {R}}^d):=\left\{ \mu \in \mathcal {P}({\mathbb {R}}^d)\,\big |\,{\mathcal {M}}_p(\mu )<\infty \right\} .\) By \(W_p,\) \(p\in [1,\infty ),\) we denote the usual p-Wasserstein distance on \({\mathcal {P}}_p\), that is for \(\mu ,\nu \in {\mathcal {P}}_p({\mathbb {R}}^d)\)

$$\begin{aligned} W_p(\mu ,\nu )^p:=\inf _{\pi \in \Pi (\mu ,\nu )}\iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}|x-y|_{{\mathbb {R}}^d}^p\pi (dx\times dy), \end{aligned}$$
(12)

where \(\Pi (\mu ,\nu )\) denotes the set of probability measures on \({\mathbb {R}}^d\times {\mathbb {R}}^d\) with \(\mu \) and \(\nu \) as respective first and second marginals (we refer to [8, Chap. 5] for a thorough introduction to the subject). Moreover, we recall the following elementary but useful consequence of the previous definition. Let \(\mu ,\nu \) be in \({\mathcal {P}}_p,\) and assume that there are random variables \(X,Y\) on \((\Omega ,{\mathcal {F}},{\mathbb {P}})\) such that \(X\sim \mu \) and \(Y\sim \nu .\) Then, it holds

$$\begin{aligned} W_p(\mu ,\nu )\le {\mathbb {E}}\left( |X-Y|^p\right) ^\frac{1}{p}. \end{aligned}$$
(13)
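In dimension one, definition (12) and inequality (13) are easy to illustrate numerically: for two empirical measures with the same number of atoms, the optimal coupling pairs order statistics, while any other pairing of samples only yields the upper bound (13). A small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def wasserstein_p_1d(xs, ys, p=2):
    """W_p between two empirical measures with equally many one-dimensional atoms:
    the optimal coupling pairs order statistics, so W_p^p = (1/n) sum |x_(i) - y_(i)|^p."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

X = rng.normal(0.0, 1.0, 5000)
Y = rng.normal(0.5, 1.2, 5000)
print(wasserstein_p_1d(X, Y))              # optimal transport cost W_2
print(np.mean(np.abs(X - Y) ** 2) ** 0.5)  # E(|X - Y|^2)^{1/2} for this coupling, cf. (13)
```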

Finally, we briefly recall the notion of L-differentiability. A function \(f:\mathcal {P}_2({\mathbb {R}}^d)\rightarrow {\mathbb {R}}\) is called L-differentiable at \(\mu _0\in \mathcal {P}_2({\mathbb {R}}^d)\) if there exists a random variable \(X_0\) with law \(\mu _0\), such that the lifted function

$$\begin{aligned} {\tilde{f}}:L^2(\Omega ,\mathcal {F},{\mathbb {P}};{\mathbb {R}}^d)\rightarrow {\mathbb {R}},\quad X\mapsto f(\mathcal {L}(X)) \end{aligned}$$

is Fréchet differentiable at \(X_0\). The following property of an L-differentiable function f is well-known (see [8, Chap. 5]): for any \(\mu _0\in \mathcal {P}_2({\mathbb {R}}^d)\), there exists a \(\mu _0\)-almost everywhere uniquely defined measurable function \(\xi :{\mathbb {R}}^d\rightarrow {\mathbb {R}}^d\), such that for all \(X_0\in L^2(\Omega ,\mathcal {F},{\mathbb {P}};{\mathbb {R}}^d)\) with \(\mathcal {L}(X_0)=\mu _0\), it holds \(D{\tilde{f}}(X_0)=\xi (X_0)\). We write \(f_\mu (\mu _0)\) to denote the equivalence class of \(\xi \) in \(L^2({\mathbb {R}}^d,\mu _0;{\mathbb {R}}^d)\). In keeping with the notation (11) on differentials, we will let \(f_\mu (\nu )(x)\cdot h\) be its evaluation (as an element of the dual of \({\mathbb {R}}^d\)) at \(h\in {\mathbb {R}}^d\). If \(f_\mu \) is continuous, we call f continuously L-differentiable.
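As a simple illustration (a standard example, not taken from the paper), let \(d=1\) and \(f(\mu ):=\big (\int _{{\mathbb {R}}}z\,\mu (dz)\big )^2\), so that \({\tilde{f}}(X)={\mathbb {E}}(X)^2\). For \(X_0,H\in L^2(\Omega ,\mathcal {F},{\mathbb {P}};{\mathbb {R}})\),

$$\begin{aligned} {\tilde{f}}(X_0+H)={\tilde{f}}(X_0)+{\mathbb {E}}\big (2{\mathbb {E}}(X_0)H\big )+{\mathbb {E}}(H)^2, \end{aligned}$$

and since \({\mathbb {E}}(H)^2\le \Vert H\Vert _2^2\), the lift is Fréchet differentiable with \(D{\tilde{f}}(X_0)=2{\mathbb {E}}(X_0)\) (a constant random variable). Hence \(f_\mu (\mu _0)(x)=2\int _{{\mathbb {R}}}z\,\mu _0(dz)\) for every x, independently of the representative \(X_0\).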

2.2 Controls and Cost Functional

Our controlled dynamics will be given by a McKean-Vlasov type SDE (state equation) of the form (7), where \(X_0\in L^r(\Omega ,\mathcal {F}_0,{\mathbb {P}};{\mathbb {R}}^d)\) for some fixed \(r\ge 6\) and \(\alpha \) is an admissible control; that is, for some bounded convex set \(A\subset {\mathbb {R}}^k\) fixed throughout the paper,

$$\begin{aligned} \alpha \in {{\mathbb {A}}}:= \left\{ \alpha :[0,T]\rightarrow A\ \text {measurable}\right\} . \end{aligned}$$
(14)

In the whole manuscript, we assume that we are given continuous running and terminal cost functions

$$\begin{aligned} f&:[0,T]\times {\mathbb {R}}^d\times \mathcal {P}_2({\mathbb {R}}^d)\times A\rightarrow {\mathbb {R}}\\ g&:{\mathbb {R}}^d\times \mathcal {P}_2({\mathbb {R}}^d)\rightarrow {\mathbb {R}}\end{aligned}$$

which have quadratic growth in the following sense: there exists \(C>0\) such that for all \(t\in [0,T]\), \(x\in {\mathbb {R}}^d\), \(\alpha \in A\) and \(\mu \in \mathcal {P}_2({\mathbb {R}}^d)\)

$$\begin{aligned} |f(t,x,\mu ,\alpha )|&\le C(1+|x|+ \mathcal {M}_2(\mu )+|\alpha |)^2\\ |g(x,\mu )|&\le C(1+|x|+\mathcal {M}_2(\mu ))^2. \end{aligned}$$

We will then consider the cost functional

$$\begin{aligned} J:{{\mathbb {A}}}\rightarrow {\mathbb {R}},\quad \alpha \mapsto {\mathbb {E}}\left( \int _{0}^{T}f(t,X_t^\alpha ,{{\,\mathrm{\mathcal {L}}\,}}(X_t^\alpha ),\alpha _t)dt+g(X_T^\alpha ,{{\,\mathrm{\mathcal {L}}\,}}(X_T^\alpha ))\right) , \end{aligned}$$
(15)

where \(X^\alpha \) is subject to the dynamical constraint (7).

2.3 Level Set Boundedness

A formal application of Itô Formula reveals that the solutions of the state equation associated with a network of FitzHugh–Nagumo neurons take values in the set

$$\begin{aligned}{\mathcal {C}}:= \left\{ x=(v,w,y):0\le y\le 1\right\} \,.\end{aligned}$$

This is of course coherent with the intuition that y is a fraction of open channels. In other words, we have \(\pi (X)\le 0\), where \(\pi :{\mathbb {R}}^3\rightarrow {\mathbb {R}}\) is the map \(x\mapsto y(y-1).\) Motivated by this example, we will assume in the sequel that we are given a convex function \(\pi \in C^2({\mathbb {R}}^d,{\mathbb {R}})\) such that any solution X is supported in \({\mathcal {C}}\subset {\mathbb {R}}^d\) for all times, where \({\mathcal {C}}\) is the set

$$\begin{aligned} {\mathcal {C}}:=\left\{ x\in {\mathbb {R}}^d:\pi (x)\le 0\right\} . \end{aligned}$$
(16)

We suppose moreover that \({\mathcal {C}}\) contains at least one element, which for convenience is assumed to be 0. To ensure that the solutions are indeed \( {\mathcal {C}}\)-valued, we need to assume that \(\pi (X_0)\le 0\), \({\mathbb {P}}\text {-almost surely}\). Furthermore we need to make the following compatibility assumptions on \(\pi :{\mathbb {R}}^d\rightarrow {\mathbb {R}}.\)

Assumption 2.1

(structural assumption on the dynamics) For all \(\mu \in \mathcal {P}({\mathbb {R}}^d),\alpha \in A,\) \(t\in [0,T]\) and \(x\in {\mathbb {R}}^d\setminus {\mathcal {C}}\), we have

$$\begin{aligned}&\pi _x(x)\cdot b(t,x,\mu ,\alpha )\le 0, \end{aligned}$$
(17)

while

$$\begin{aligned}&\mathrm {Im}\left( \sigma (t,x,\mu ,\alpha )\right) \subset \pi _x(x)^{\perp } \quad \text {and}\quad \pi _{xx}(x)\cdot (\sigma \sigma ^\dagger (t,x,\mu ,\alpha )) = 0. \end{aligned}$$
(18)

Example 2.1

(Gating variable boundedness for FitzHugh–Nagumo) Assumption 2.1 is fulfilled for (7)–(9) with \(\pi (v,w,y)=y(y-1)\), as can be seen as follows. We have the identities (recall the notation (4))

$$\begin{aligned} \begin{aligned}&\pi _x (x)= \begin{pmatrix} 0&0&2y-1 \end{pmatrix},\quad \pi _{xx}(x)= \begin{pmatrix} 0 &{}0 &{} 0 \\ 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 2 \end{pmatrix}, \\&\sigma \sigma ^{\dagger }(t,x,\mu ,\alpha )= \begin{pmatrix} \sigma _{ext}^2 + (\sigma ^J)^2(v-V_{rev})^2\left( \int _{{\mathbb {R}}^3}z_3\mu (dz)\right) ^2 &{}0 &{} 0 \\ 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} \sigma ^y(v)^2 \end{pmatrix}. \end{aligned} \end{aligned}$$

Clearly,

$$\begin{aligned} \pi _x(x)\sigma (t,x,\mu ,\alpha ) \in {\mathrm {Linspan}}\left\{ \begin{pmatrix}0&0&(2y-1)\sigma ^y(v)\end{pmatrix}\right\} \,. \end{aligned}$$

But using \({\mathrm {Supp}}\chi \subset (0,1)\), we find indeed that \((2y-1)\sigma ^y(v)=0\) outside \({\mathcal {C}}\). The same argument implies

$$\begin{aligned} \pi _{xx}(x)\cdot (\sigma \sigma ^\dagger (t,x,\mu ,\alpha )) = 2\sigma ^y(v)^2 \end{aligned}$$

and the latter vanishes if \(x\notin {\mathcal {C}}\), hence (18).

Towards (17), letting \(q={\bar{a}} S(v)\) one observes that

$$\begin{aligned} \pi _x(x)\cdot b(t,x,\mu ,\alpha ) = -q +(3q + {\bar{b}})y -2(q+{\bar{b}})y^2 = P(y)\,. \end{aligned}$$

The polynomial P(y) has discriminant \((q-{\bar{b}})^2\), hence the roots

$$\begin{aligned} r_- = \frac{q}{q+{\bar{b}}}\,,\quad r_+=\frac{1}{2}\,, \end{aligned}$$

which both lie in the interval (0, 1). It follows that P(y) is negative outside \({\mathcal {C}}\), implying (17).
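This elementary computation can be double-checked symbolically, e.g. with sympy (a quick sanity check, not part of the argument):

```python
import sympy as sp

q, b_bar, y = sp.symbols('q b_bar y', positive=True)
P = -q + (3*q + b_bar)*y - 2*(q + b_bar)*y**2  # pi_x(x) . b(t, x, mu, alpha)

print(sp.factor(sp.discriminant(P, y)))  # (q - b_bar)**2
print(sp.solve(P, y))                    # the roots 1/2 and q/(q + b_bar)
```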

2.4 Regularity Assumptions and Main Results

Besides Assumption 2.1, one needs to make suitable hypotheses on the regularity of the drift and diffusion coefficients. In the sequel, we denote by \({\mathcal {P}}_2^{{\mathcal {C}}}({\mathbb {R}}^d)\) the subset of all probability measures in \({\mathcal {P}}_2({\mathbb {R}}^d)\) which are supported in \({\mathcal {C}}=\pi ^{-1}((-\infty ,0]).\)

Assumption 2.2

(MKV Regularity) We assume that the coefficients

$$\begin{aligned} (b,\sigma ):[0,T]\times {\mathbb {R}}^d\times \mathcal {P}_2({\mathbb {R}}^d)\times A\rightarrow {\mathbb {R}}^d\times {\mathbb {R}}^{d\times m} \end{aligned}$$

are locally Lipschitz. Moreover, there are constants \(L_1,L_2,L_3>0\) such that the following properties hold.

  1. (L1)

    – (regularity of the diffusion coefficient) – The diffusion coefficient \(\sigma \) satisfies the property \(\sup _{0\le t\le T}|\sigma (t,0,\delta _0,0)|^2<\infty .\) Moreover, for all \(t\in [0,T]\), \(x\in \mathcal {C}\), \(\alpha \in A\) and \(\mu \in \mathcal {P}^{{\mathcal {C}}}_2({\mathbb {R}}^d)\) we have

    $$\begin{aligned} |\sigma (t,x,\mu ,\alpha )|^2\le L_1(1+|\alpha |^2+|x|^2) . \end{aligned}$$
    (19)

    For all \(t\in [0,T]\), \(x,x'\in \mathcal {C}\), \(\alpha ,\alpha '\in A\) and \(\mu \in \mathcal {P}^{{\mathcal {C}}}_2({\mathbb {R}}^d)\), it holds

    $$\begin{aligned} |\sigma (t,x,\mu ,\alpha )-\sigma (t,x',\mu ,\alpha ')|^2\le L_1(|x-x'|^2+|\alpha -\alpha '|^2). \end{aligned}$$
    (20)

    Finally, if \(x\in \mathcal {C}\) and \(\mu '\in \mathcal {P}_2({\mathbb {R}}^d)\), then

    $$\begin{aligned} |\sigma (t,x,\mu ,\alpha )-\sigma (t,x,\mu ',\alpha )|^2\le L_1(1+|x|^2)W_2(\mu ,\mu ')^2. \end{aligned}$$
    (21)
  2. (L2)

    – (regularity of the drift coefficient) – There exists \(q\in {\mathbb {N}}\) with \(4q\le r\), such that for all \(t\in [0,T]\), \(x,x'\in \mathcal {C}\), \(\alpha ,\alpha '\in A\) and \(\mu \in \mathcal {P}_2({\mathbb {R}}^d)\)

    $$\begin{aligned} \begin{aligned}&|b(t,x,\mu ,\alpha )-b(t,x',\mu ,\alpha ')| \\&\le \sqrt{L_2}(1+|x|^{q-1}+|x'|^{q-1}+|\alpha |^{q-1}+|\alpha '|^{q-1}+\mathcal {M}_2(\mu )^2)(|x-x'|+|\alpha -\alpha '|). \end{aligned} \end{aligned}$$
    (22)

    In addition, b satisfies the following Lipschitz property with respect to the Wasserstein distance: for all \(t\in [0,T]\), \(x\in \mathcal {C}\), \(\alpha \in A\) and \(\mu ,\mu '\in \mathcal {P}_2({\mathbb {R}}^d)\)

    $$\begin{aligned} |b(t,x,\mu ,\alpha )-b(t,x,\mu ',\alpha )|^2\le L_2(1+|x|^2)W_2(\mu ,\mu ')^2. \end{aligned}$$
    (23)
  3. (L3)

    – (monotonicity of the drift) – The drift coefficient b is such that \(\sup _{0\le t\le T}|b(t,0,\delta _0,0)|<\infty .\) Moreover, for all \(t\in [0,T]\), \(x\in {\mathcal {C}}\), \(\alpha \in A\) and \(\mu \in \mathcal {P}^{{\mathcal {C}}}_2({\mathbb {R}}^d)\) it holds

    $$\begin{aligned} \langle x,b(t,x,\mu ,\alpha )\rangle \le L_3(1+|\alpha |^2+|x|^2) \end{aligned}$$
    (24)

    and if \(x'\in {\mathcal {C}}\), \(\alpha '\in A\), then

    $$\begin{aligned} \left\langle x-x',b(t,x,\mu ,\alpha )-b(t,x',\mu ,\alpha ')\right\rangle \le L_3(|x-x'|^2+|\alpha -\alpha '|^2). \end{aligned}$$
    (25)

Example 2.2

(Analysis of the FitzHugh–Nagumo model) Let us go back to the settings of (7)–(9) for a coupled system of FitzHugh–Nagumo neurons. Trivially, one has \(\sup _{0\le t\le T}|\sigma (t,0,\delta _0,0)|=|\sigma _{ext}|<\infty .\) The map \(v\mapsto S(v)\) being positive and bounded, we further see that the (3, 3)-th entry of \(\sigma \) is Lipschitz, as deduced immediately from the fact that \(\chi \) is supported in (0, 1). For the remaining non-trivial component, we have

$$\begin{aligned} \sigma ^{1,2}(x,\mu ,\alpha )^2 \le (\sigma ^J)^2(V_{rev} + |v|)^2 |\beta (\mu )|^2 \end{aligned}$$

where to ease notation we introduce the barycenter \(\beta (\mu )\), defined as the quantity

$$\begin{aligned} \beta (\mu ):=\int _{{\mathbb {R}}^3}z_3\mu (dz_1\times dz_2\times dz_3). \end{aligned}$$
(26)

The condition \({\mathrm {Supp}}\mu \subset {\mathcal {C}}\) implies trivially that \(|\beta (\mu )|\le 1\), and thus this component satisfies the bound (19) with \(L_1=2(\sigma ^J)^2(V_{rev}^2\vee 1).\) The Lipschitz-type property (20) is shown in a similar fashion.

The Wasserstein-type regularity (21) is hardly more problematic: using the Kantorovitch duality Theorem [8, Prop. 5.3 & Cor. 5.4] and the fact that the projector \(z=(z_1,z_2,z_3)\mapsto z_3\) is Lipschitz, one finds that

$$\begin{aligned} |\beta (\mu -\mu ')|=\Big |\int _{{\mathbb {R}}^3}z_3(\mu -\mu ')(dz)\Big |\le W_1(\mu ,\mu '), \end{aligned}$$
(27)

hence

$$\begin{aligned} |\sigma (x,\mu ) - \sigma (x,\mu ') | \le \sigma ^J|v-V_{rev}|W_1(\mu ,\mu '). \end{aligned}$$

As is classical, the 1-Wasserstein distance \(W_1(\mu ,\mu ')\) can be estimated by \(W_2(\mu ,\mu '),\) which in turn implies (21), completing the verification of (L1) in Assumption 2.2.

As for the drift coefficient, since \(b(t,0,\delta _0,0)\) is independent of t, the supremum condition in (L3) is clear. Moreover, b has polynomial dependency on the variables \(v,w,y\), which implies the local Lipschitz property (22) with \(q=3\). We also have

$$\begin{aligned} |b(t,x,\mu ,\alpha )-b(t,x,\mu ',\alpha )|&\le J|v-V_{rev}||\beta (\mu -\mu ')| \end{aligned}$$

and we conclude by (27) that (23) holds.

To show (24) and (25), it is enough to prove the corresponding bounds when \(c=0={{\overline{b}}} ,\) since the related contributions are affine linear in the variables. Similarly, by linearity we can let \(w=\alpha =0\). But in that case, it holds

$$\begin{aligned} \langle x,b(t,x,\mu ,0)\rangle \le v^2-\dfrac{v^4}{3} +{\overline{a}}S(v)(1-y)y -Jv^2\beta (\mu )+JV_{rev}v\beta (\mu )\,. \end{aligned}$$

Observe that, since \(\mu \) is supported inside \({\mathcal {C}}\), one has in particular \(\beta (\mu )\ge 0\). Consequently, the fourth term in the right hand side can be ignored, showing (24) with \(L_3=L_3({{\overline{a}}} ,|S|_{\infty },J,V_{rev})>0.\)

Similarly, if \(x'=(v',0,y')\in {\mathbb {R}}^3\)

$$\begin{aligned}&\langle x-x',b(t,x,\mu ,0)-b(t,x',\mu ,0)\rangle \\&= (v-v')^2-\frac{1}{3}(v^3-v'^3)(v-v') -J(v-v')^2\beta (\mu ) \\&\quad \quad \quad +{\overline{a}}(1-y)(y-y')(S(v)-S(v'))-{\overline{a}}S(v')(y-y')^2 \\&\le |S'|_{\infty }(1\vee {{\overline{a}}})(1+y^2)(|y-y'|^2 + |v-v'|^2)\,. \end{aligned}$$

It follows that (25) holds with \(L_3=L_3({{\overline{a}}},{{\overline{b}}},c,|S|_{C^1})>0.\)

Assumption 2.3

(Weak continuity) For any \(t\in [0,T]\), \(x\in {\mathbb {R}}^d\) and \(\mu \in \mathcal {P}_2({\mathbb {R}}^d)\), the function \(A\rightarrow {\mathbb {R}}\), \(\alpha \mapsto f(t,x,\mu ,\alpha )\) is convex and the functions \(A\rightarrow {\mathbb {R}}^d\times {\mathbb {R}}^{d\times m}, \alpha \mapsto (b,\sigma )(t,x,\mu ,\alpha )\) are affine. Furthermore, for all \(x\in C([0,T];{\mathbb {R}}^d)\) and \(\mu \in C([0,T];\mathcal {P}_2^{\mathcal {C}}({\mathbb {R}}^d))\) the functions

$$\begin{aligned} {\mathbb {A}}\rightarrow L^2([0,T];{\mathbb {R}}^d),\quad \alpha \mapsto b(\cdot ,x_\cdot ,\mu _\cdot ,\alpha _\cdot ),\\ {\mathbb {A}}\rightarrow L^2([0,T];{\mathbb {R}}^{d\times m}),\quad \alpha \mapsto \sigma (\cdot ,x_\cdot ,\mu _\cdot ,\alpha _\cdot ), \end{aligned}$$

are weakly sequentially continuous, that is, for all \((\alpha ^n)_{n\in {\mathbb {N}}}\subset {\mathbb {A}}\) with \(\alpha ^n\rightharpoonup \alpha \) for some \(\alpha \in {\mathbb {A}}\), it holds \((b,\sigma )(\cdot ,x_\cdot ,\mu _\cdot ,\alpha ^n_\cdot )\rightharpoonup (b,\sigma )(\cdot ,x_\cdot ,\mu _\cdot ,\alpha _\cdot )\).

Remark 2.1

The continuity and convexity of \(f(t,x,\mu ,\cdot )\) lead to weak lower semicontinuity of the map

$$\begin{aligned} {\mathbb {A}}\rightarrow {\mathbb {R}},\quad \alpha \mapsto \int _{0}^{T}f(t,x_t,\mu _t,\alpha _t)dt, \end{aligned}$$

for all \(x\in C([0,T];{\mathbb {R}}^d)\) and \(\mu \in C([0,T];\mathcal {P}_2({\mathbb {R}}^d))\).

We can now present our main results. First, we investigate the existence of an optimal control for the following problem

$$\begin{aligned} \min _{\alpha \in {{\mathbb {A}}}}J(\alpha ), \end{aligned}$$
(SM)

subject to

$$\begin{aligned} \left\{ \begin{aligned} dX_t&=b(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)dt+\sigma (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)dW_t\,,&t\in [0,T], \\ X_0&\in L^r(\Omega ,\mathcal {F}_0,{\mathbb {P}};{\mathbb {R}}^d). \end{aligned}\right. \end{aligned}$$
(28)

Theorem 2.1

Under Assumptions 2.1–2.3, the problem (SM) is finite and admits an optimal control. Namely, \(\inf _{\alpha \in {{\mathbb {A}}}}J(\alpha )<\infty \) and there is \({\overline{\alpha }}\in {{\mathbb {A}}}\), such that

$$\begin{aligned} J({{\overline{\alpha }}})\le J(\alpha ), \end{aligned}$$

for all \(\alpha \in {{\mathbb {A}}}\).

In order to address the corresponding maximum principle, we now introduce further assumptions on our coefficients.

Assumption 2.4

(Pontryagin Principle) The coefficients \(b,\sigma ,f\) and g are continuously differentiable with respect to \((x,\alpha )\) and continuously \(\mathrm {L}\)-differentiable with respect to \(\mu \in {\mathcal {P}}_2({\mathbb {R}}^d).\) Furthermore there exist \(A_1,A_2,A_3>0\) such that:

  1. (A1)

    For every \((t,x,\mu ,\alpha )\in [0,T]\times {\mathcal {C}}\times {\mathcal {P}}_2^{\mathcal {C}}({\mathbb {R}}^d)\times A,\) and each \(y,z\in {\mathbb {R}}^d\):

    $$\begin{aligned} \langle b_x(t,x,\mu ,\alpha )\cdot z,z\rangle&\le A_1|z|^2,\\ |b_x(t,x,\mu ,\alpha )|&\le A_1(1+|x|^{q-1}),\\ |b_\alpha (t,x,\mu ,\alpha )|&\le A_1,\\ |b_\mu (t,x,\mu ,\alpha )(y)|&\le A_1(1+|x|), \end{aligned}$$

    where q is the same constant as in 2.2.

  2. (A2)

    For every \((t,x,\mu ,\alpha )\in [0,T]\times {\mathcal {C}}\times {\mathcal {P}}_2^{\mathcal {C}}({\mathbb {R}}^d)\times A,\) and \(y\in {\mathbb {R}}^d\):

    $$\begin{aligned} |\sigma _x(t,x,\mu ,\alpha )|&\le A_2,\\ |\sigma _\alpha (t,x,\mu ,\alpha )|&\le A_2,\\ |\sigma _\mu (t,x,\mu ,\alpha )(y)|&\le A_2(1+|x|). \end{aligned}$$
  3. (A3)

    For all \(R>0\) and every \((t,x,\mu ,\alpha )\in [0,T]\times {\mathbb {R}}^d\times {\mathcal {P}}_2 ({\mathbb {R}}^d)\times A\) such that \(|x|\vee \mathcal {M}_2(\mu )\vee |\alpha |\le R\), the quantities

    $$\begin{aligned}&f_x(t,x,\mu ,\alpha ),\, f_\alpha (t,x,\mu ,\alpha ),\, g_x(x,\mu ),\, \int _{{\mathbb {R}}^d}|f_\mu (t,x,\mu ,\alpha )(y)|^2\mu (dy),\\&\quad \int _{{\mathbb {R}}^d}|g_\mu (x,\mu )(y)|^2\mu (dy) \end{aligned}$$

    are all bounded in norm by \(A_3(1+R)\).

Example 2.3

Again, we investigate the above properties for the setting of a FitzHugh–Nagumo neural network. Property (A3) depends on the choice of f and g, hence we do not discuss it here (it is however clear for the ansatz (47) below). Concerning assumptions (A1) and (A2), we have

$$\begin{aligned} b_x(t,x,\mu ,\alpha )=\begin{pmatrix} 1-v^2-J\beta (\mu ) &{} -1 &{} 0\\ c &{} -cb &{} 0\\ {\overline{a}}S'(v)(1-y) &{} 0 &{} -{\overline{a}}S(v)-{\overline{b}} \end{pmatrix}, \end{aligned}$$

where we recall the notation (26). Using that \({\mathrm {Supp}}(\mu )\subset {\mathcal {C}}\), together with the boundedness of \(S'(v)\), this leads to

$$\begin{aligned} \langle b_x(t,x,\mu ,\alpha )\cdot z,z\rangle \le A_1(b,c,{\overline{a}},{\overline{b}},|S|_\infty ,|S'|_{\infty })|z|^2, \end{aligned}$$

hence the first estimate. Letting as before \(\beta (\mu ):=\int _{{\mathbb {R}}^3}z_3\mu (dz),\) it is easily seen by definition of the L-derivative that

$$\begin{aligned} \beta _\mu (\mu )({{\tilde{x}}})\cdot h= h_3\quad \text {for all }{{\tilde{x}}}\text { and }h\equiv (h_1,h_2,h_3)\in {\mathbb {R}}^3 . \end{aligned}$$

In a matrix representation, this gives the following constant value for the L-derivative of the drift coefficient at a given point \(x\equiv (v,w,y)\in {\mathbb {R}}^3\)

$$\begin{aligned} b_\mu (t,x,\mu ,\alpha )({{\tilde{x}}})=\begin{pmatrix} 0 &{} 0 &{} -J(v-V_{rev})\\ 0 &{} 0 &{} 0\\ 0 &{} 0 &{} 0 \end{pmatrix},\quad \text {for all }{{\tilde{x}}}\in {\mathbb {R}}^3\,. \end{aligned}$$

Thus we have \(|b_\mu (t,x,\mu ,\alpha )({{\tilde{x}}})|\le \big (J\vee (JV_{rev})\big )(1+|x|),\) showing the desired property.
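The Jacobian \(b_x\) displayed above is easy to double-check symbolically; in the following sympy sketch, beta plays the role of the frozen scalar \(\beta (\mu )\):

```python
import sympy as sp

v, w, y, alpha, beta = sp.symbols('v w y alpha beta')
a, b, c, J, V_rev, a_bar, b_bar = sp.symbols('a b c J V_rev a_bar b_bar', positive=True)
S = sp.Function('S')

b_vec = sp.Matrix([
    v - v**3/3 - w + alpha - J*(v - V_rev)*beta,  # beta stands for the barycenter (26)
    c*(v + a - b*w),
    a_bar*S(v)*(1 - y) - b_bar*y,
])
print(b_vec.jacobian(sp.Matrix([v, w, y])))  # first row: [1 - v**2 - J*beta, -1, 0]
```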

Next, we introduce the corresponding adjoint equation, which will be essential for the maximum principle. For a solution \(X\in \mathcal {S}^{2,d}\) of (28) consider the following backward SDE

$$\begin{aligned} \left\{ \begin{aligned} dP_t&=-\Big \{\langle b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t),P_t\rangle +\langle \sigma _x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t),Q_t\rangle +f_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\\&\quad -{\tilde{{\mathbb {E}}}}\left( \langle b_\mu (t,{{\tilde{X}}}_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)( X_t),{\tilde{P}}_t\rangle +f_\mu (t,{{\tilde{X}}}_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)( X_t)\right) \Big \}dt +Q_tdW_t\\ P_T&=g_x(X_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))+{\tilde{{\mathbb {E}}}}\left( g_\mu ({{\tilde{X}}}_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))({X}_T)\right) , \end{aligned}\right. \end{aligned}$$
(29)

where the tilde variables \({{\tilde{X}}},{{\tilde{P}}}\) are independent copies of the corresponding random variables (carried on some arbitrary probability space \(({{\tilde{\Omega }}},\mathcal {{{\tilde{F}}}},\mathbb {{{\tilde{P}}}})\)), and \(\mathbb {{{\tilde{E}}}}\) denotes integration in \({{\tilde{\Omega }}}\) (this convention will be adopted throughout the paper). Herein, we recall that \(\langle \sigma (t,x,\mu ,\alpha ),q\rangle \) is a synonym for \(\text {tr}(\sigma (t,x,\mu ,\alpha )^\dagger q)\).

A pair of processes \((P,Q)\in \mathcal {H}^{2,d}\times \mathcal {H}^{2,d\times m}\) will be called a solution to the adjoint equation corresponding to X if it satisfies (29) for all \(t\in [0,T]\), \({\mathbb {P}}\)-almost surely.

We are now in position to formulate the maximum principle. For that purpose, we introduce the Hamiltonian, which for each \(x,p\in {\mathbb {R}}^d,\) \(q\in {\mathbb {R}}^{d\times m},\) \(\mu \in {\mathcal {P}}_2\) and \(\alpha \in A\) is the quantity

$$\begin{aligned} H(t,x,\mu ,p,q,\alpha ):=\langle b(t,x,\mu ,\alpha ),p\rangle +\langle \sigma (t,x,\mu ,\alpha ), q\rangle +f(t,x,\mu ,\alpha )\,. \end{aligned}$$

Theorem 2.2

Let Assumptions 2.1–2.4 hold. Let \({\overline{\alpha }}\in {{\mathbb {A}}}\) be an optimal control for the problem (SM). If \((P,Q)\in \mathcal {H}^{2,d}\times \mathcal {H}^{2,d\times m}\) is the solution to the corresponding adjoint equation, then for Lebesgue-almost every \(t\in [0,T]\) we have

$$\begin{aligned}&{\mathbb {E}}\left( H(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),P_t,Q_t,{\overline{\alpha }}_t)\right) \le {\mathbb {E}}\left( H(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),P_t,Q_t,\alpha )\right) , \end{aligned}$$

for all \(\alpha \in A\).

It should be noted that, in contrast to the maximum principle stated in [8, Thm. 6.14 p. 548], the maximum principle here is formulated in terms of the expectation for almost every \(t\in [0,T]\) instead of \(dt\otimes {\mathbb {P}}\)-almost everywhere, since we only consider deterministic controls and thus we only alter the control in deterministic directions.
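Theorem 2.2 also suggests a numerical strategy for (SM): since the controls are deterministic, the function \(t\mapsto {\mathbb {E}}\big (H_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),P_t,Q_t,{\overline{\alpha }}_t)\big )\) acts as a gradient, and one may alternate a forward simulation of (7), a backward simulation of (29) and a projected gradient step. The skeleton below is only a sketch: forward, backward and grad_H are hypothetical user-supplied routines (e.g. Euler–Maruyama particles and a regression scheme for the BSDE), and the projection onto \(A=[-1,1]^k\) is a placeholder.

```python
import numpy as np

def projected_gradient(alpha0, forward, backward, grad_H, n_iter=50, lr=0.1,
                       proj=lambda a: np.clip(a, -1.0, 1.0)):
    """Pontryagin-based descent sketch for deterministic controls.

    alpha0   : initial control on the time grid, shape (n_steps, k)
    forward  : alpha -> particle approximation of the state equation (7)
    backward : (X, alpha) -> discretised solution (P, Q) of the adjoint equation (29)
    grad_H   : (X, P, Q, alpha) -> Monte Carlo estimate of t -> E(H_alpha(t, ...))
    """
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(n_iter):
        X = forward(alpha)
        P, Q = backward(X, alpha)
        alpha = proj(alpha - lr * grad_H(X, P, Q, alpha))  # project back onto A
    return alpha
```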

3 Well-Posedness of the Optimal Control Problem

The main purpose of this section is to prove the existence of an optimal control for the stated control problem. For that purpose, we will need to show (among other results) that the state equation (7) is well-posed, and that the solution satisfies uniform moment bounds up to a certain level. Hereafter, we suppose that Assumptions 2.1, 2.2 and 2.3 are fulfilled.

3.1 Well-Posedness of the State Equation

Our first task is to show that the level-set constraint which was alluded to in Sect. 2.3 is preserved along the flow of solutions. This statement is contained in the next result. The proof is partially adapted from that of [6, Prop. 3.3].

Lemma 3.1

For every \(\alpha \in {{\mathbb {A}}}\) and \(\mu \in C([0,T];\mathcal {P}^{{\mathcal {C}}}_2({\mathbb {R}}^d))\) we have that

$$\begin{aligned} {\mathbb {P}}\big (\pi (X_t^{\alpha ,\mu }) \le 0,\forall t\in [0,T]\big )=1 \end{aligned}$$
(30)

where \(X_t^{\alpha ,\mu }\) is the unique solution to

$$\begin{aligned} \left\{ \begin{aligned} dX_t&=b(t,X_t,\mu _t,\alpha _t)dt+\sigma (t,X_t,\mu _t,\alpha _t)dW_t,&t\in [0,T] \\ X_0&\in L^r(\Omega ,\mathcal {F}_0,{\mathbb {P}};{\mathbb {R}}^d). \end{aligned}\right. \end{aligned}$$
(31)

Proof

First, observe that given \(\mu \in C([0,T];\mathcal {P}_2({\mathbb {R}}^d)),\) equation (31) has a unique strong solution \(X^\mu \) in \(\mathcal {S}^2.\) Indeed, if we let

$$\begin{aligned} b^\mu (t,x,\alpha ):=b(t,x,\mu ,\alpha ),\quad \sigma ^\mu (t,x,\alpha ):=\sigma (t,x,\mu ,\alpha ), \end{aligned}$$

then from (20) we see that \(\sigma ^\mu \) is Lipschitz, while (22) and (25) imply the local Lipschitz continuity and the monotonicity of the drift coefficient \(b^\mu \). Hence, by standard results on monotone SDEs (see for instance [20, Thm. 3.26 p. 178]), (31) has a unique strong solution, this solution being progressively measurable and square integrable. This proves our assertion.

In order to show (30), consider a family \((\Psi _\epsilon )_{\epsilon >0}\) of non-negative and non-decreasing functions in \( C^2({\mathbb {R}})\) which for all \(\epsilon >0\) satisfy:

$$\begin{aligned} \Psi _\epsilon (x)=0\text { on }(-\infty ,0]\,,\quad \Psi _\epsilon (x)=1\text { on }[\epsilon ,\infty )\,,\quad \quad \sup _\epsilon |\Psi _\epsilon |_\infty \le 1\,, \end{aligned}$$

and such that \(\Psi _\epsilon \) converges pointwise to \({\mathbf {1}}_{(0,\infty )}\) as \(\epsilon \rightarrow 0\). Let \(\tau _n:=\inf \{t\ge 0\text { s.t.\ }|X_t|\ge n\}\). By Itô Formula, we have for each \(n\ge 0\) and \(\epsilon >0\)

$$\begin{aligned} \Psi _\epsilon (\pi (X_{t\wedge \tau _n}))-M_t^{\epsilon }&=\int _{0}^{\tau _n\wedge t} \big (\pi _x(X_s)\cdot b(s,X_s,\mu _s,\alpha _s)\big )\Psi _\epsilon '(\pi (X_s))ds \\&\quad \quad +\dfrac{1}{2} \int _{0}^{\tau _n\wedge t} \Psi _\epsilon ''(\pi (X_s)) |\pi _x(X_s)\sigma (s,X_s,\mu _s,\alpha _s)|^2ds \\&\quad \quad +\frac{1}{2}\int _{0}^{\tau _n\wedge t}\Psi _\epsilon '(\pi (X_s)) \pi _{xx}(X_s)\cdot (\sigma \sigma ^\dagger (s,X_s,\mu _s,\alpha _s)) ds\,, \end{aligned}$$

where we let \(M_t^{\epsilon }:=\sum \nolimits _{k=1}^m\int _{0}^{\tau _n\wedge t} \pi _x(X_s)\cdot \sigma ^{\cdot ,k}(s,X_s,\mu _s,\alpha _s) \Psi _\epsilon '(\pi (X_s))dW^k_s\). Since \(\Psi _\epsilon \) is supported on the positive real axis, only the values of X which satisfy \(\pi (X)>0\) contribute to the above expression. Hence, making use of Assumption 2.1, we see that the first term in the previous right hand side is bounded above by 0, while the last two terms simply vanish. We arrive at the relation

$$\begin{aligned} {\mathbb {E}}\left( \sup _{t\in [0,T]}\Psi _\epsilon (\pi (X_{t\wedge \tau _n}))\right) \le 0. \end{aligned}$$

Letting first \(n\rightarrow \infty \), and then \(\epsilon \rightarrow 0,\) we observe by Fatou Lemma that

$$\begin{aligned} {\mathbb {E}}\left( \sup _{t\in [0,T]}{\mathbf {1}}_{(0,\infty )}(\pi (X_t))\right) =0, \end{aligned}$$

and our claim follows. \(\square \)

We are now able to prove the existence of a unique solution to equation (7).

Theorem 3.1

There exists a unique strong solution to equation (7) in \(\mathcal {S}^2\), which is supported in \({\mathcal {C}}\) for all times. Furthermore, for each \(p\in [2,r]\) and every \(\alpha \in {{\mathbb {A}}},\) the solution satisfies the moment estimate

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left( \sup _{t\in [0,T]}|X_t|^p\right)&\le C\left( \Vert X_0\Vert _{p},L_1,L_3,p\right) \left( 1+\int _{0}^{T}|\alpha _t|^pdt\right) . \end{aligned} \end{aligned}$$
(32)

where the constant C depends only upon the indicated quantities.

Proof

Recall that \(\mathcal {P}_2^{\mathcal {C}}\) denotes the set of probability measures in \(\mathcal {P}_2({\mathbb {R}}^d)\) which are supported in \({\mathcal {C}}:=\pi ^{-1}((-\infty ,0]).\) Equipped with the standard Wasserstein distance, it is a closed subset of \(\mathcal {P}_2({\mathbb {R}}^d)\). Indeed, it is standard (see for instance [15]) that given probability measures \(\{\mu _n,n\in {\mathbb {N}}\}\) and \(\mu \) such that \(\mu _n\Rightarrow \mu \), then

$$\begin{aligned} \text {supp}\mu \subset \liminf _{n\rightarrow \infty }(\text {supp}\mu _n):=\left\{ x\in {\mathbb {R}}^d\Big |\limsup _{n\rightarrow \infty }\inf _{y\in \text {supp}\mu _n}|x-y|=0\right\} , \end{aligned}$$

so that our claim follows. Thus, for fixed \(\alpha \in {{\mathbb {A}}}\), we can rightfully consider the operator

$$\begin{aligned} \Theta :C([0,T];\mathcal {P}_2^{\mathcal {C}})\rightarrow C([0,T];\mathcal {P}_2^{\mathcal {C}}),\quad \mu \mapsto ({{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha ,\mu }_t))_{t\in [0,T]}, \end{aligned}$$

where \(X^{\mu }=X^{\alpha ,\mu }\) is the unique solution to equation (31). Using similar arguments as in [8], the existence of a unique solution to (28) follows if one can show that \(\Theta \) has a unique fixed point. In fact, we are going to show that it is a contraction (for a well-chosen metric). The moment estimate (32) will follow from the fixed point argument, provided one can show that

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left( \sup _{t\in [0,T]}|X_t^\mu |^p\right)&\le C\left( \Vert X_0\Vert _{p},L_1,L_3,p\right) \left( 1+\int _{0}^{T}|\alpha _t|^pdt\right) \end{aligned} \end{aligned}$$
(33)

where the displayed constant depends on the indicated quantities but not on the particular element \(\mu \) in \(C([0,T];{\mathcal {P}}^{{\mathcal {C}}}_2)\). We now divide the proof into two steps.

Step 1: moment bounds. We need a localization argument. For each \( n>0 \), introduce \( \tau _n:=\inf \{t\in (0,T], |X^{\mu }_t|>n\} \), and denote by \( X^{\mu ,n}_t:= X^{\mu }_{t\wedge \tau _n}\). Itô Formula gives

$$\begin{aligned} \begin{aligned} \frac{1}{p}|X_t^{\mu ,n}|^p -N^{\mu ,n}_t&=\frac{1}{p}|X_0^{\mu ,n}|^p + \int _0^{t\wedge \tau _n}\Big \{\left\langle X^\mu _s, b(s,X^\mu _s,\mu _s,\alpha _s)\right\rangle |X^\mu _s|^{p-2} \\&\quad \quad +\frac{1}{2}|\sigma (s,X^\mu _s,\mu _s,\alpha _s)|^2|X^\mu _s|^{p-2} + \frac{p-2}{2}|\sigma ^{\dagger }X^\mu _s|^2|X^\mu _s|^{p-4} \Big \}ds \end{aligned} \end{aligned}$$
(34)

where \(N_t^{n,\mu }:=\int _0^{t\wedge \tau _n}|X^\mu _s|^{p-2}\langle X^\mu _s, \sigma (s,X^\mu _s,\mu _s,\alpha _s)dW_s\rangle \) is the corresponding martingale term. Denoting by \(\kappa >0\) the constant in the Burkholder-Davis-Gundy Inequality, the latter is estimated for each \( t\in [0,T] \) thanks to (19) and the Cauchy-Schwarz Inequality as

$$\begin{aligned}\begin{aligned} {\mathbb {E}}(\sup _{s\in [0,t]}N_t^{n,\mu })&\le \kappa {\mathbb {E}}\left( \left( \int _0^{t\wedge \tau _n}|X^{\mu }_s|^{2p-4}|\sigma (s,X^{\mu }_s,\mu _{s},\alpha _{s})^\dagger X^{\mu }_s|^2ds\right) ^{\frac{1}{2}}\right) \\&\le \kappa \sqrt{L_1}{\mathbb {E}}\left( \sup _{0\le s\le t}|X^{\mu ,n}_s|^{\frac{p}{2}}\left( \int _0^t|X^{\mu ,n}_s|^{p-2}(1+|X^{\mu ,n}_s|^2+|\alpha _{s}|^2{\mathbf {1}}_{[0,\tau _n]}(s))ds\right) ^{\frac{1}{2}}\right) . \end{aligned} \end{aligned}$$

But from Young’s inequality, the previous right hand side is also bounded by

$$\begin{aligned}&\frac{1}{2p}{\mathbb {E}}\left( \sup _{0\le s\le t}|X^{\mu ,n}_s|^p\right) \\&\quad + \frac{p\kappa ^2L_1}{2}{\mathbb {E}}\left( \int _0^t(|X^{\mu ,n}_s|^{p-2}+|X^{\mu ,n}_s|^p+|X^{\mu ,n}_s|^{p-2}|\alpha _{s}|^2\mathbf{1}_{[0,\tau _n]}(s))ds\right) . \end{aligned}$$

Define \(\Psi ^n_t:={\mathbb {E}}\left( \sup _{0\le s\le t}|X_s^{n,\mu }|^p\right) .\) Taking the expectation in (34), we infer from (24), (19), Young’s inequality \(ab\le \frac{2}{p}a^{\frac{p}{2}}+\frac{p-2}{p}b^{\frac{p}{p-2}}\) and the previous discussion that

$$\begin{aligned} \frac{1}{2p}\Psi _t^n \le \frac{1}{p}{\mathbb {E}}(|X_0|^p) + C_p\left( L_1+ L_3\right) \int _0^t(1+\Psi ^n_s+|\alpha _{s}|^p)ds \end{aligned}$$

for some universal constant \(C_p>0.\) Applying Gronwall Inequality, we obtain the corresponding moment estimate for \( X^{\mu ,n} \), for each \( n>0 \). Since none of the constants used in the above computations depend on \( n \), the localization is removed by letting \( n\nearrow \infty \) and using the monotone convergence theorem.

Step 2: the fixed point argument. From Lemma 3.1, it is clear that for all \(t\in [0,T],\) the probability measure \({\mathbb {P}}\circ (X_t^\mu )^{-1}\) is supported in \({\mathcal {C}}\). For simplicity, let \(L:=L_1\vee L_2\vee L_3\) and introduce the weight

$$\begin{aligned} \phi _t:=\exp \left( -4Lt\right) ,\quad t\in [0,T].\end{aligned}$$

Then, Itô Formula gives

$$\begin{aligned} \begin{aligned} d\left( \frac{1}{2}|X_t^{\mu }-X_t^{\nu }|^2\phi _t\right)&+2L|X_t^{\mu }-X_t^{\nu }|^2\phi _tdt \\&=\phi _t\left\langle X^{\mu }_t-X^{\nu }_t,b(t,X^{\mu }_t,\mu _t,\alpha _t)-b(t,X^{\nu }_t,\mu _t,\alpha _t)\right\rangle dt \\&\quad +\phi _t\left\langle X^{\mu }_t-X^{\nu }_t,b(t,X^{\nu }_t,\mu _t,\alpha _t)-b(t,X^{\nu }_t,\nu _t,\alpha _t)\right\rangle dt \\&\quad \quad + \phi _t\left\langle X^{\mu }_t-X^{\nu }_t,\sigma (t,X^{\mu }_t,\mu _t,\alpha _t)-\sigma (t,X^{\nu }_t,\nu _t,\alpha _t)dW_t\right\rangle \\&\quad \quad \quad + \frac{1}{2}\phi _t |\sigma (t,X^{\mu }_t,\mu _t,\alpha _t)-\sigma (t,X^{\nu }_t,\nu _t,\alpha _t)|^2 dt. \end{aligned} \end{aligned}$$
(35)

The first term in the right hand side of (35) is evaluated thanks to (25). For the second term, we use the quadratic growth assumption (23). As for the Itô correction, we can estimate it similarly, using this time (20) and (21). With \(M_t:=\int _{0}^{t} \phi _s\left\langle X^{\mu }_s-X^{\nu }_s,\sigma (s,X^{\mu }_s,\mu _s,\alpha _s)-\sigma (s,X^{\nu }_s,\nu _s,\alpha _s)dW_s\right\rangle \) we get

$$\begin{aligned} \begin{aligned} \frac{1}{2}|X_t^{\mu }-X_t^{\nu }|^2\phi _t&+2L \int _0^t|X_s^{\mu }-X_s^{\nu }|^2\phi _sds -M_t \\&\le \int _0^t\Big \{(L_1+L_3)|X^\mu _s-X^\nu _s|^2 +(L_1+L_2)(1+|X_s^\nu |^2) W_2(\mu _s,\nu _s)^2 \Big \}\phi _sds \\&\le 2L\left( \int _{0}^{t}|X^\mu _s-X^\nu _s|^2 \phi _sds+ \Big (1+\sup _{s\in [0,t]}|X^\nu _s|^2\Big )\int _0^t W_2(\mu _s,\nu _s)^2\phi _sds \right) \,. \end{aligned} \end{aligned}$$

Taking expectations, supremum in t, then absorbing to the left yields

$$\begin{aligned} \begin{aligned}&\sup _{0\le s\le t}{\mathbb {E}}\left( |X_s^{\mu }-X_s^{\nu }|^2\right) \phi _s \le 4L\Big (1+{\mathbb {E}}\big (\sup _{s\in [0,t]}|X^\nu _s|^2\big )\Big )\int _0^t W_2(\mu _s,\nu _s)^2\phi _sds \end{aligned} \end{aligned}$$

Using the estimate (32) with \(p=2\), the fact that \(\exp (-4TL)\le \phi \le 1,\) inequality (13) and the boundedness of the control state space \( A \subset {\mathbb {R}}^k\), we arrive at

$$\begin{aligned} \begin{aligned} \sup _{0\le s\le t} W_2(\Theta (\mu )_s,\Theta (\nu )_s)^2&\le C(\Vert X_0\Vert _p,T,L)\int _0^t W_2(\mu _s,\nu _s)^2ds\,. \end{aligned} \end{aligned}$$

Considering the k-th composition of the map \(\Theta \), we get

$$\begin{aligned} \begin{aligned} \sup _{0\le s\le T} W_2(\Theta ^k(\mu )_s,\Theta ^k(\nu )_s)^2&\le \frac{C(\Vert X_0\Vert _p,T,L)^kT^k}{k!}\sup _{0\le s\le T}W_2(\mu _s,\nu _s)^2\,, \end{aligned} \end{aligned}$$

hence contractivity of \(\Theta ^k\) follows for \(k>0\) large enough, and the result then follows from the Banach fixed point theorem. \(\square \)
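The contraction argument above is also a blueprint for computing the mean-field law numerically: freeze a flow of measures, solve the decoupled equation (31) by Euler–Maruyama, and update the flow with the resulting empirical law. In the FitzHugh–Nagumo setting, the coefficients (8)–(9) see the measure only through the barycenter \(\beta (\mu _t)\), so the iteration can be run on a scalar flow. In the sketch below, solve_frozen is a hypothetical routine returning the particles' gating variables.

```python
import numpy as np

def picard_mean_field(solve_frozen, n_steps, n_iter=10, beta0=0.2):
    """Fixed-point iteration mu -> Theta(mu), summarised by the flow t -> beta(mu_t).

    solve_frozen(beta_flow) : solves (31) with the measure flow frozen and returns
                              the gating variables y, shape (n_particles, n_steps + 1)
    """
    beta_flow = np.full(n_steps + 1, beta0)  # initial guess for t -> beta(mu_t)
    for _ in range(n_iter):
        y = solve_frozen(beta_flow)          # decoupled SDE (31) with frozen flow
        beta_flow = y.mean(axis=0)           # empirical update, i.e. applying Theta
    return beta_flow
```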

We now investigate some regularity of the control-to-state operator, which will be needed in the proof of the optimality principle.

Lemma 3.2

For \(p\in [2,r],\) the solution map

$$\begin{aligned} G:{{\mathbb {A}}}\rightarrow \mathcal {S}^p\cap \mathcal {S}^2,\quad \alpha \mapsto X^{\alpha } \end{aligned}$$

is well-defined and Lipschitz continuous. More precisely, there exists a constant \(C(L_1,L_2,L_3,T)>0\), such that for all \(\alpha ,\beta \in {{\mathbb {A}}}\)

$$\begin{aligned} {\mathbb {E}}\left( \sup _{t\in [0,T]}|X^{\alpha }_t-X^{\beta }_t|^2\right) \le C(L_1,L_2,L_3,T)\int _{0}^{T}|\alpha _t-\beta _t|^2dt. \end{aligned}$$

Proof

That G is well-defined follows immediately from Theorem 3.1. Towards Lipschitz continuity, the property is shown by similar considerations as in the proof of Theorem 3.1. Indeed, fix \(\alpha ,\beta \in {{\mathbb {A}}}\) and let M be the martingale \(M_t:=\int _0^t\left\langle X^{\alpha }_s-X^{\beta }_s,\big (\sigma (s,X^{\alpha }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_s),\alpha _s)-\sigma (s,X^{\beta }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\beta }_s),\beta _s)\big )dW_s\right\rangle \); then, using Itô Formula together with (20), (23) and (25), we arrive at

$$\begin{aligned}&\frac{1}{2}|X^{\alpha }_t-X^{\beta }_t|^2-M_t \\&= \int _0^t\Bigg \{\left\langle X^{\alpha }_s-X^{\beta }_s, b(s,X^{\alpha }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_s),\alpha _s)-b(s,X^{\beta }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_s),\beta _s)\right\rangle \\&\quad +\left\langle X^{\alpha }_s-X^{\beta }_s,b(s,X^{\beta }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_s),\beta _s)-b(s,X^{\beta }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\beta }_s),\beta _s)\right\rangle \\&\quad \quad \quad + \frac{1}{2}|\sigma (s,X^{\alpha }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_s),\alpha _s)-\sigma (s,X^{\beta }_s,{{\,\mathrm{\mathcal {L}}\,}}(X^{\beta }_s),\beta _s)|^2\Bigg \}ds \\&\le \int _0^t\Big \{ (L_3+\frac{1}{2}+L_1)(|X^\alpha _s-X^\beta _s|^2 + |\alpha _s-\beta _s|^2) \\&\quad \quad \quad \quad + (\frac{L_2}{2}+L_1)(1+|X^\alpha _s|^2+|X^\beta _s|^2)W_2({{\,\mathrm{\mathcal {L}}\,}}(X^\alpha _s),{{\,\mathrm{\mathcal {L}}\,}}(X^\beta _s))^2 \Big \}ds\,. \end{aligned}$$

Letting \(\kappa >0\) be the constant in the BDG inequality, the estimate (13) and \(ab\le \frac{a^2}{4}+ b^2\) yield

$$\begin{aligned}&\frac{1}{4}{\mathbb {E}}\left( \sup _{s\in [0,t]}|X^{\alpha }_s-X^{\beta }_s|^2\right) \\&\quad \le C_L(3+\kappa ^2)\Big ( 2 + {\mathbb {E}}\big (\sup _{s\in [0,T]} |X^\alpha _s|^2+|X_s^\beta |^2\big )\Big )\int _{0}^{t}\left\{ {\mathbb {E}}\Big (\sup _{r\in [0,s]}|X_r^{\alpha }-X_r^{\beta }|^2\Big ) + |\alpha _s-\beta _s|^2\right\} ds \end{aligned}$$

where \(C_L:=\frac{1}{2}\vee L_1\vee \frac{L_2}{2}\vee L_3.\) The result now follows from the uniform bound (32), together with Gronwall Lemma. \(\square \)

Remark 3.1

Since we have \(W_2({{\,\mathrm{\mathcal {L}}\,}}(X_s^{\alpha }),{{\,\mathrm{\mathcal {L}}\,}}(X_s^{\beta }))\le {\mathbb {E}}\left( \sup _{t\in [0,T]}|X_t^{\alpha }-X_t^{\beta }|^2\right) ^\frac{1}{2}\) for every \(s\in [0,T]\), we also get the Lipschitz continuity of the map

$$\begin{aligned} {{\mathbb {A}}}\rightarrow \mathcal {P}_2(\mathcal {S}^2), \quad \alpha \mapsto {{\,\mathrm{\mathcal {L}}\,}}(G(\alpha )). \end{aligned}$$

Remark 3.2

(Fokker-Planck equation) Given the settings of Example 2.2, we define

$$\begin{aligned} b_0(t,x,\alpha )&:=\begin{pmatrix} v-\frac{v^3}{3}-w+\alpha \\ c(v+a-bw) \\ {\overline{a}}S(v)(1-y)-{\overline{b}}y \end{pmatrix} ,\quad b_1(x,z):=\begin{pmatrix} -J(v-V_{rev})z_3 \\ 0 \\ 0 \end{pmatrix}, \\ {\tilde{\sigma }}(x,z)&:=\begin{pmatrix} \sigma _{ext} &{} -\sigma ^J(v-V_{rev})z_3 &{} 0 \\ 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} \chi (y)\sqrt{{\overline{a}}S(v)(1-y)+{\overline{b}}y} \end{pmatrix}. \end{aligned}$$

If we assume that the solution to the corresponding mean-field equation has a density p(t,x) with respect to the 3-dimensional Lebesgue measure, then the McKean-Vlasov-Fokker-Planck equation is given by the nonlinear PDE:

$$\begin{aligned} \partial _tp(t,x)&=-\text {div}\left( \left( b_0(t,x,\alpha )+\int _{{\mathbb {R}}^3}b_1(x,z)p(t,z)dz\right) p(t,x) \right) \\&\quad +\dfrac{1}{2}\nabla ^2\cdot \left( \left( \iint _{{\mathbb {R}}^3\times {\mathbb {R}}^3}{\tilde{\sigma }}(x,z){\tilde{\sigma }}(x,{\bar{z}})^{\dagger }p(t,z)p(t,{\bar{z}})\,dz d{\bar{z}}\right) p(t,x)\right) \end{aligned}$$

(see [2]). It is degenerate parabolic because the diffusion matrix \(\sigma \sigma ^\dagger \) is not strictly positive definite.

3.2 Proof of Theorem 2.1

We now prove the existence of an optimal control for (28). The strategy follows the so-called “direct method” of the calculus of variations. As a trivial consequence of the assumptions made in Sect. 2.2 and the uniform estimate (32), note at first that our control problem is indeed finite. Next, consider a sequence \((\alpha ^n)_{n\in {\mathbb {N}}}\subset {{\mathbb {A}}}\) realizing the infimum of J asymptotically, i.e.

$$\begin{aligned} \lim \limits _{n\rightarrow \infty }J(\alpha ^n)=\inf _{\alpha \in {{\mathbb {A}}}}J(\alpha ). \end{aligned}$$

Since \({{\mathbb {A}}}\subset L^2([0,T];{\mathbb {R}}^k)\) is bounded and closed, by the Banach–Alaoglu Theorem there exists an \(\alpha \in L^2([0,T];{\mathbb {R}}^k)\) and a subsequence, also denoted by \((\alpha ^n)_{n\in {\mathbb {N}}}\), such that

$$\begin{aligned} \alpha ^n\rightharpoonup \alpha \quad \text {weakly in } L^2(0,T;{\mathbb {R}}^k). \end{aligned}$$

Since \({{\mathbb {A}}}\) is also convex, it is weakly closed, so \(\alpha \in {{\mathbb {A}}}\) and \(\alpha \) is indeed an admissible control. We now divide the proof into four steps.

Step 1: tightness. In the sequel, we denote by \(X^n\) the solution of the state equation (7) with respect to the control \(\alpha ^n\), \(n\in {\mathbb {N}}.\) Adding and subtracting in (7), we have

$$\begin{aligned} \Vert X_t^n-X_s^n\Vert _4^4 \le 4^3\bigg \{ \big \Vert \int _{s}^{t}b(r,0,\delta _0,0)dr\big \Vert _4^4 + \big \Vert \int _s^tb(r,0,{{\,\mathrm{\mathcal {L}}\,}}(X_r^n),0)-b(r,0,\delta _0,0)dr\big \Vert _4^4 \\ +\big \Vert \int _{s}^{t}b(r,X_r^n,{{\,\mathrm{\mathcal {L}}\,}}(X_r^n),\alpha _r^n)-b(r,0,{{\,\mathrm{\mathcal {L}}\,}}(X_r^n),0)dr\big \Vert _4^4 +\kappa \big \Vert \int _{s}^{t}|\sigma (r,X^n_r,{{\,\mathrm{\mathcal {L}}\,}}(X^n_r),\alpha ^n_r)|^2dr\big \Vert _2^2\bigg \}\,, \end{aligned}$$

where \(\kappa >0\) is the constant in the BDG inequality. Using the assumptions of Sect. 2.2, the fact that \(0\in {\mathcal {C}}\) and the basic inequality (13), we obtain that

$$\begin{aligned} \Vert X_t^n-X_s^n\Vert _4^4 \le 4^3\bigg \{ (t-s)^4 \sup _{r\in [0,T]}|b(r,0,\delta _0,0)|^4 + (t-s)^4L_2^2\sup _{r\in [0,T]}|X^n_r|^4 \\ + L_2^2\big \Vert \int _{s}^{t}\big (1+|X^n_r|^{q-1}+|\alpha ^n_r|^{q-1}+\mathcal {M}_2({{\,\mathrm{\mathcal {L}}\,}}(X_r^n))^2\big )(|X^n_r|+|\alpha ^n_r|)dr\big \Vert _4^4 \\ + \kappa L_1^2\big \Vert \int _{s}^{t}(1+|X_r^n|^2+|\alpha _r^n|^2)dr\big \Vert _2^2 \bigg \}\,. \end{aligned}$$

Using Hölder's inequality and Young's inequality \(ab\le \frac{q-1}{q}a^\frac{q}{q-1}+\frac{1}{q}b^q\), we arrive at the following estimate, valid for all \(n\in {\mathbb {N}}\) and \(0\le s\le t\le T\):

$$\begin{aligned} {\mathbb {E}}\left( |X_t^n-X_s^n|^4\right)&\le C(L,T)\bigg \{ (t-s)^4 \left[ \sup _{r\in [0,T]}|b(r,0,\delta _0,0)|^4 +{\mathbb {E}}\left( 1+\sup _{r\in [0,T]}|X^n_r|^{4q}\right) \right] \\&\quad + C(t-s)^{4/3}(1+(t-s))\bigg \}\,, \end{aligned}$$

where we used the fact that \( A \) is bounded. Note that the above constants depend on the indicated quantities, but not on \(n\in \mathbb N.\)

Making use of the uniform estimate (32), the Kolmogorov continuity criterion then asserts that the sequence of probability measures \(({\mathbb {P}}\circ (X^n)^{-1})_{n\in {\mathbb {N}}}\), defined on the space

$$\begin{aligned}E:=\left( C([0,T];{\mathbb {R}}^d),\mathcal {B}(C([0,T];{\mathbb {R}}^d))\right) \end{aligned}$$

is tight. In the same way, we can prove that the sequence of probability measures \(({\mathbb {P}}_n)_{n\in {\mathbb {N}}}:=\left( {\mathbb {P}}\circ (X^n, B^n)^{-1}\right) _{n\in {\mathbb {N}}}\), with

$$\begin{aligned} B^n(t):=\int _{0}^{t}b(s,X_s^n,{{\,\mathrm{\mathcal {L}}\,}}(X_s^n),\alpha _s^n)ds, \end{aligned}$$

is tight on the product space \(E\times E,\) with respect to the product topology, where for two E-valued random variables \(Z_1,Z_2\) defined on \((\Omega ,\mathcal {A},{\mathbb {P}})\), \({\mathbb {P}}\circ (Z_1, Z_2)^{-1}\) denotes the joint law of \(Z_1\) and \(Z_2\). Thus by Prokhorov’s theorem there exists a subsequence of \(({\mathbb {P}}_n)_{n\in {\mathbb {N}}}\), which converges weakly to some probability measure \({\mathbb {P}}^*\) on \(E\times E\).

Step 2: passage to the limit in the drift. By Skorokhod’s representation theorem we can then find random variables \({\overline{X}},{\overline{B}}\), \(({\overline{X}}^n)_{n\in {\mathbb {N}}},({\overline{B}}^n)_{n\in {\mathbb {N}}}\) defined on some probability space \(({\overline{\Omega }},\overline{\mathcal {F}},{\overline{{\mathbb {P}}}})\) and with values in \(E\times E\) such that

  • \({\overline{{\mathbb {P}}}}\circ ({\overline{X}}^n,{\overline{B}}^n)^{-1}={\mathbb {P}}_n\) for all \(n\in {\mathbb {N}}\) and \({\overline{{\mathbb {P}}}}\circ ({\overline{X}},{\overline{B}})^{-1}={\mathbb {P}}^*\) and

  • \(\lim \limits _{n\rightarrow \infty }({\overline{X}}^n,{\overline{B}}^n) =({\overline{X}},{\overline{B}})\), \({\overline{{\mathbb {P}}}}\)-almost surely with respect to the uniform topology.

From (33) and by the definition of \({\mathbb {A}}\) we get for any \(p\le r\)

$$\begin{aligned} {{\overline{{\mathbb {E}}}}}\left( \sup _{0\le t\le T}|{\overline{X}}_t^n|^p\right) \le C(p,\Vert X_0\Vert _p,L_1,L_2,L_3), \end{aligned}$$

for some constant independent of n. Thus we can conclude by the dominated convergence theorem that

$$\begin{aligned} W_2(\mathcal {L}({\overline{X}}_t^n),\mathcal {L}({\overline{X}}_t))^2\le {{\overline{{\mathbb {E}}}}}\left( \sup _{0\le s\le T}|{\overline{X}}_s^n-{\overline{X}}_s|^2\right) \rightarrow 0, \end{aligned}$$

as \(n\rightarrow \infty \). This also implies \((\mathcal {L}({\overline{X}}_t))_{t\in [0,T]}\subset \mathcal {P}_2^{\mathcal {C}}\), since \( \mathcal {P}_2^{\mathcal {C}}\) is closed.

To identify the almost sure limit \({\overline{B}}\), we first claim that for each \(t\in [0,T]\)

$$\begin{aligned} {\overline{B}}^n(t)\rightharpoonup \int _{0}^{t}b(s,{\overline{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_s),\alpha _s)ds, \end{aligned}$$
(36)

weakly in \(L^2({\overline{\Omega }};{\mathbb {R}}^d)\). Indeed, by (22) and the dominated convergence theorem we have

$$\begin{aligned} {{\overline{{\mathbb {E}}}}}\left( \int _{0}^{t}|b(s,{\overline{X}}_s^n,\mathcal {L}({\overline{X}}_s^n),\alpha _s^n)-b(s,{\overline{X}}_s,\mathcal {L}({\overline{X}}_s),\alpha _s^n)|^2ds\right) \rightarrow 0. \end{aligned}$$

Likewise, for \(h\in L^2({\overline{\Omega }};{\mathbb {R}}^d)\) we have by Assumption 2.3 and dominated convergence

$$\begin{aligned} {{\overline{{\mathbb {E}}}}}\left( \int _{0}^{t}\langle \left( b(s,{\overline{X}}_s,\mathcal {L}({\overline{X}}_s),\alpha _s^n)-b(s,{\overline{X}}_s,\mathcal {L}({\overline{X}}_s),\alpha _s)\right) ,h\rangle ds\right) \rightarrow 0, \end{aligned}$$

as \(n\rightarrow \infty \), thus proving our claim.

The desired identification then follows from (36), the Banach–Saks theorem and the uniqueness of the almost sure limit. Since the processes \({{\overline{B}}}\) and \(\int _{0}^{\cdot }b(s,{\overline{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_s),\alpha _s)ds\) both have continuous paths, they are indistinguishable, hence the identity

$$\begin{aligned} {\overline{B}}(t)=\int _{0}^{t}b(s,{\overline{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_s),\alpha _s)ds \end{aligned}$$
(37)

for all \(t\in [0,T],\) \({\overline{{\mathbb {P}}}}\)-almost surely.

Step 3: identification of the martingale. Letting \(\sigma \sigma ^\dagger (t,x,\mu ,\alpha ):=\sigma (t,x,\mu ,\alpha )\sigma (t,x,\mu ,\alpha )^\dagger \) for short, similar arguments as above show that

$$\begin{aligned} \sigma \sigma ^\dagger (t,{\overline{X}}^n_t,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}^n_t),\alpha ^n_t)\rightharpoonup \sigma \sigma ^\dagger (t,{\overline{X}}_t,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_t),\alpha _t) \end{aligned}$$

weakly in \(L^2([0,T]\times {\overline{\Omega }};{\mathbb {R}}^{d\times d})\). Since the process

$$\begin{aligned} M_t^n:=X_t^n-X_0-B^n(t)=\int _{0}^{t}\sigma (s,X_s^n,{{\,\mathrm{\mathcal {L}}\,}}(X_s^n),\alpha _s^n)dW_s \end{aligned}$$

is, for each n,  a \(\mathcal {G}_t^n:=\sigma ({X_s^n|s\le t})\) martingale under \({\mathbb {P}}\), we can conclude that

$$\begin{aligned} {\overline{M}}_t^n:={\overline{X}}_t^n-X_0-{\overline{B}}^n(t) \end{aligned}$$

is a \(\mathcal {{\overline{G}}}_t^n:=\sigma ({{\overline{X}}_s^n|s\le t})\) martingale under \({\overline{{\mathbb {P}}}}\) with quadratic variation

$$\begin{aligned} \langle {\overline{M}}^n\rangle _t=\int _{0}^{t}\sigma \sigma ^\dagger (s,{\overline{X}}_s^n,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_s^n),\alpha _s^n)ds. \end{aligned}$$

From the previous considerations, we can conclude that

$$\begin{aligned} {\overline{M}}_t^n\rightarrow {\overline{X}}_t-X_0-\int _{0}^{t}b(s,{\overline{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_s),\alpha _s)ds=:{\overline{M}}_t, \end{aligned}$$

\({\overline{{\mathbb {P}}}}\)-almost surely for all \(t\in [0,T]\). Thus by the dominated convergence theorem the process \(({\overline{M}}_t)_{t\in [0,T]}\) is a \(\overline{\mathcal {G}}_t:=\sigma ({\overline{X}}_s|s\le t)\) martingale under \({\overline{{\mathbb {P}}}}\), and with standard arguments we also obtain that \(({\overline{M}}_t)_{t\in [0,T]}\) has quadratic variation

$$\begin{aligned} \langle {\overline{M}}\rangle _t=\int _{0}^{t}\sigma \sigma ^\dagger (s,{\overline{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_s),\alpha _s)ds. \end{aligned}$$

By the martingale representation theorem we can find an extended probability space \(({\hat{\Omega }},\hat{\mathcal {F}},(\hat{\mathcal {F}}_t)_{t\in [0,T]},{\hat{{\mathbb {P}}}})\) with an m-dimensional Brownian motion \({\hat{W}}\), such that the natural extension \({\hat{X}}\) of \({\overline{X}}\) satisfies \({\hat{{\mathbb {P}}}}\circ ({\hat{X}}^{-1})={\overline{{\mathbb {P}}}}\circ ({\overline{X}}^{-1})\) and

$$\begin{aligned} {\hat{X}}_t=X_0+\int _{0}^{t}b(s,{\hat{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\hat{X}}_s),\alpha _s)ds+\int _{0}^{t}\sigma (s,{\hat{X}}_s,{{\,\mathrm{\mathcal {L}}\,}}({\hat{X}}_s),\alpha _s)d{\hat{W}}_s, \end{aligned}$$

\({\hat{{\mathbb {P}}}}\)-almost surely for all \(t\in [0,T]\).

Step 4: end of the proof. It remains to show that the infimum is attained at \(\alpha .\) Due to the uniqueness of solutions to equation (7), we have \({\mathbb {P}}\circ (X^{\alpha })^{-1}={\hat{{\mathbb {P}}}}\circ ({\hat{X}}^{-1})\). Using Fatou’s lemma, the continuity of \(f,g\), Assumption 2.3 and Remark 2.1, we obtain

$$\begin{aligned} \inf _{{{\tilde{\alpha }}}\in {{\mathbb {A}}}}J({{\tilde{\alpha }}})&=\lim \limits _{n\rightarrow \infty }J(\alpha ^n)\\&\ge \liminf _{n\rightarrow \infty }{\mathbb {E}}\left( \int _{0}^{T}f(t,X_t^n,{{\,\mathrm{\mathcal {L}}\,}}(X_t^n),\alpha _t^n)dt+g(X_T^n,{{\,\mathrm{\mathcal {L}}\,}}(X_T^n))\right) \\&=\liminf _{n\rightarrow \infty }{\overline{{\mathbb {E}}}}\left( \int _{0}^{T}f(t,{\overline{X}}_t^n,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_t^n),\alpha _t^n)dt+g({\overline{X}}_T^n,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_T^n))\right) \\&\ge {\overline{{\mathbb {E}}}}\left( \int _{0}^{T}f(t,{\overline{X}}_t,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_t),\alpha _t)dt+g({\overline{X}}_T,{{\,\mathrm{\mathcal {L}}\,}}({\overline{X}}_T))\right) \\&={\hat{{\mathbb {E}}}}\left( \int _{0}^{T}f(t,{\hat{X}}_t,{{\,\mathrm{\mathcal {L}}\,}}({\hat{X}}_t),\alpha _t)dt+g({\hat{X}}_T,{{\,\mathrm{\mathcal {L}}\,}}({\hat{X}}_T))\right) \\&={\mathbb {E}}\left( \int _{0}^{T}f(t,X^{\alpha }_t,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_t),\alpha _t)dt+g(X^{\alpha }_T,{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha }_T))\right) \\&=J(\alpha ). \end{aligned}$$

This shows that \(\alpha \) has the desired properties, and hence the proof is finished. \(\square \)

4 The Maximum Principle: Proof of Theorem 2.2

In this section, it will be assumed implicitly that Assumptions 2.1, 2.2, 2.3 and 2.4 hold. Hereafter, we let \(({\tilde{\Omega }},\tilde{\mathcal {A}},{\tilde{{\mathbb {P}}}})\) be a copy of the probability space \((\Omega ,\mathcal {A},{\mathbb {P}})\). The corresponding expectation map will be denoted by \({{\tilde{{\mathbb {E}}}}}\).

4.1 Gâteaux Differentiability

In this subsection we aim to complete Lemma 3.2 by showing the Gâteaux-differentiability of the control-to-state operator

$$\begin{aligned} G:{{\mathbb {A}}}\subset L^p([0,T];{\mathbb {R}}^k)\rightarrow \mathcal {S}^2,\quad \alpha \mapsto X^\alpha . \end{aligned}$$

The Gâteaux derivative of the solution map will be given by the solution of a mean-field equation with random coefficients. We deal with this problem in a similar fashion to [8, Thm. 6.10, p. 544].

Lemma 4.1

The solution map G is Gâteaux-differentiable. Moreover, for each \(\alpha \in {{\mathbb {A}}}\), its derivative in the direction \(\beta \in {{\mathbb {A}}}\) is given by

$$\begin{aligned} dG(\alpha )\cdot \beta =Z^{\alpha ,\beta }, \end{aligned}$$

where, introducing

$$\begin{aligned} B_\mu (t,x,\mu )&:=\iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}b_\mu (t,x,\mathcal {L}(X_t),\alpha _t)({\tilde{x}})\cdot {\tilde{y}}\mu (d{\tilde{x}}\times d{\tilde{y}}) \\ \Sigma _\mu (t,x,\mu )&:=\iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}\sigma _\mu (t,x,\mathcal {L}(X_t),\alpha _t)({\tilde{x}})\cdot {\tilde{y}}\mu (d{\tilde{x}}\times d{\tilde{y}})\,, \end{aligned}$$

the process \(Z=Z^{\alpha ,\beta }\) is characterized as the unique solution to

$$\begin{aligned} \left\{ \begin{aligned} dZ_t&=\big \{b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot Z_t+b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t+B_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t,Z_t))\big \}dt\\&\quad + \big \{\sigma _x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot Z_t+\sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t+\Sigma _\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t,Z_t))\big \}dW_t\\ Z_0&=0. \end{aligned}\right. \end{aligned}$$
(38)

Proof

We will start by showing that (38) has a unique solution. For that purpose, we define

$$\begin{aligned} \mathcal {R}:=\left\{ \mu \in C([0,T];\mathcal {P}_2({\mathbb {R}}^d\times {\mathbb {R}}^d))\ \text {such that }\mu _t\circ p_1^{-1}={{\,\mathrm{\mathcal {L}}\,}}(X_t)\ \forall t\right\} , \end{aligned}$$

where \(p_1\) denotes the projector onto the first d-coordinates, namely

$$\begin{aligned} p_1:{\mathbb {R}}^d\times {\mathbb {R}}^d\rightarrow {\mathbb {R}}^d,\quad (x,y)\mapsto x. \end{aligned}$$

Clearly, if \(\mu ^n_t\) is a sequence converging weakly to \(\mu _t\) for every \(t\in [0,T]\), the constraint \(\mu _t^n\circ p_1^{-1}={{\,\mathrm{\mathcal {L}}\,}}(X_t),\forall t\) remains true for \(\mu \) itself. Since the Wasserstein distance metrizes the weak topology, we see that \(\mathcal {R}\) is closed in \(C([0,T];{\mathcal {P}}_2({\mathbb {R}}^d\times {\mathbb {R}}^d))\). Next, define

$$\begin{aligned} \Psi :{\mathcal {R}}\rightarrow {\mathcal {R}}, \end{aligned}$$

which maps \(\mu \in C([0,T];\mathcal {P}_2({\mathbb {R}}^d\times {\mathbb {R}}^d))\) to \(({{\,\mathrm{\mathcal {L}}\,}}(X_t,V_t))_{t\in [0,T]}\), where \((V_t)_{t\in [0,T]}\) is the unique solution to

$$\begin{aligned} \left\{ \begin{aligned} dV_t&=\big \{b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot V_t+b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t+B_\mu (t,X_t,\mu _t)\big \}dt\\&\quad + \big \{\sigma _x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot V_t+\sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t+\Sigma _\mu (t,X_t,\mu _t)\big \}dW_t\\ V_0&=0. \end{aligned}\right. \end{aligned}$$
(39)

For fixed \(\mu \in C([0,T];\mathcal {P}_2({\mathbb {R}}^d\times {\mathbb {R}}^d))\) we first need to check the existence of a unique solution V. But letting

$$\begin{aligned} B(t,\omega ,v,\mu ,\alpha )&:=b_x(t,X_t(\omega ),\mathcal {L}(X_t),\alpha _t)\cdot v+b_\alpha (t,X_t(\omega ),\mathcal {L}(X_t),\alpha _t)\\&\quad \cdot \beta _t+B_\mu (t,X_t(\omega ),\mu _t), \\ \Sigma (t,\omega ,v,\mu ,\alpha )&:=\sigma _x(t,X_t(\omega ),\mathcal {L}(X_t),\alpha _t)\cdot v+\sigma _\alpha (t,X_t(\omega ),\mathcal {L}(X_t),\alpha _t)\\&\quad \cdot \beta _t+\Sigma _\mu (t,X_t(\omega ),\mu _t), \end{aligned}$$

we have the following properties:

$$\begin{aligned} \langle B(t,\omega ,v,\mu ,\alpha )-B(t,\omega ,v',\mu ,\alpha ),v-v'\rangle&\le A_1|v-v'|^2 \\ \int _{0}^{T}\sup _{|v|\le c}|B(t,\omega ,v,\mu ,\alpha )|dt&<\infty ,\quad \forall c\ge 0, \end{aligned}$$

for all \(t\in [0,T]\) and \({\mathbb {P}}\)-almost every \(\omega \). The first estimate is a result of Assumption 2.4 and the fact that \({\mathbb {P}}(X_t\in {\mathcal {C}},\forall t)=1\). The second estimate follows from

$$\begin{aligned}&|B(t,\omega ,v,\mu ,\alpha )|\\&\quad \le C\Bigg \{(1+|X_t(\omega )|^{q-1})|v|+|\beta _t|+(1+|X_t(\omega )|)\iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}|y|\mu _t(dx\times dy)\Bigg \}, \end{aligned}$$

together with the continuity of \(t\mapsto \iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}|y|\mu _t(dx\times dy),\) and the uniform estimate (32). Using (30) we get with similar arguments

$$\begin{aligned} |\Sigma (t,\omega ,v,\mu ,\alpha )-\Sigma (t,\omega ,v',\mu ,\alpha )|&\le A_2|v-v'|, \\ \int _{0}^{T}\sup _{|v|\le c}|\Sigma (t,\omega ,v,\mu ,\alpha )|^2dt&<\infty ,\quad \forall c\ge 0, \end{aligned}$$

for all \(t\in [0,T]\), \({\mathbb {P}}\)-almost every \(\omega \). It follows then by classical SDE results that (39) is well-posed. Moreover, adapting the arguments yielding the moment estimates of Theorem 3.1, it is shown mutatis mutandis that for \(2\le p\le r\)

$$\begin{aligned} {\mathbb {E}}\left( \sup _{0\le t\le T}|V_t|^p\right) <\infty . \end{aligned}$$

Therefore \((V_t)\) (and hence \(\Psi (\mu )\equiv {{\,\mathrm{\mathcal {L}}\,}}(X,V)\)) is uniquely determined by the probability measure \(\mu \).

We now aim to prove that \(\Psi \) is a contraction, but for that purpose it is convenient to introduce another (stronger) metric. For any \(\mu ,\nu \in {\mathcal {P}}_2({\mathbb {R}}^d\times {\mathbb {R}}^d)\) with \(\mu \circ p_1^{-1}=\nu \circ p_1^{-1}\), we let

$$\begin{aligned} d(\mu ,\nu )^2:= \inf _{m\in \Lambda (\mu ,\nu )}\iiint _{{\mathbb {R}}^d\times {\mathbb {R}}^d\times {\mathbb {R}}^d}|v-w|^2 m(dx\times dv\times dw)\,, \end{aligned}$$

where \(\Lambda (\mu ,\nu )\) is the set of all probability measures m on \(({\mathbb {R}}^d)^3\) such that for any \(A,B\in {\mathcal {B}}({\mathbb {R}}^d)\)

$$\begin{aligned} m(A\times B\times {\mathbb {R}}^d)=\mu (A\times B)\quad \text {and}\quad m(A\times {\mathbb {R}}^d\times B)=\nu (A\times B). \end{aligned}$$

That d is stronger than \(W_2\) can be seen as follows. If m is any element in \(\Lambda (\mu ,\nu )\), one can define

$$\begin{aligned} \rho (dx\times dv\times dy\times dw):= m(dx\times dv\times dw)\delta _x(dy) \end{aligned}$$

where \(\delta _x\) is the Dirac mass centered at x. Clearly, \(\rho \) belongs to the set of transport plans \(\Pi (\mu ,\nu )\) between \(\mu \) and \(\nu ,\) so that in particular

$$\begin{aligned}&W_2(\mu ,\nu )^2=\inf _{\pi \in \Pi (\mu ,\nu )}\iiiint \limits _{({\mathbb {R}}^d)^4}(|x-y|^2+|v-w|^2 )\pi (dx\times dv\times dy\times dw) \\&\quad \le \iiint \limits _{({\mathbb {R}}^d)^3}|v-w|^2m(dx\times dv\times dw). \end{aligned}$$

Then, taking the infimum over all such m yields \(W_2(\mu ,\nu )\le d(\mu ,\nu )\), as claimed.

Next, let \(m\in \Lambda (\mu ,\nu )\). Using the marginal condition on m, we have

$$\begin{aligned}&|B_\mu (t,X_t,\mu _t)-B_\mu (t,X_t,\nu _t)| \\&= \Big |\iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}b_\mu (t,X_t,\mathcal {L}(X_t),\alpha _t)(x)\cdot v \mu (dx\times dv) \\&\quad \quad \quad \quad -\iint _{{\mathbb {R}}^d\times {\mathbb {R}}^d}b_\mu (t,X_t,\mathcal {L}(X_t),\alpha _t)(x)\cdot w\nu (dx\times dw)\Big | \\&= \Big |\iiint _{{\mathbb {R}}^d\times {\mathbb {R}}^d\times {\mathbb {R}}^d}b_\mu (t,X_t,\mathcal {L}(X_t),\alpha _t)(x)\cdot v m (dx\times dv\times dw) \\&\quad \quad \quad \quad -\iiint _{{\mathbb {R}}^d\times {\mathbb {R}}^d\times {\mathbb {R}}^d}b_\mu (t,X_t,\mathcal {L}(X_t),\alpha _t)(x)\cdot wm(dx\times dv\times dw)\Big | \,. \end{aligned}$$

Thus,

$$\begin{aligned}&|B_\mu (t,X_t,\mu _t)-B_\mu (t,X_t,\nu _t)|=\Big |\iiint _{{\mathbb {R}}^d\times {\mathbb {R}}^d\times {\mathbb {R}}^d}b_\mu (t,X_t,\mathcal {L}(X_t),\alpha _t)(x)\cdot (v-w) \\&\quad m(dx\times dv\times dw)\Big |\,. \end{aligned}$$

Applying the Cauchy–Schwarz inequality and the growth bound on \(b_\mu \) from Assumption 2.4, and since m is arbitrary, we obtain

$$\begin{aligned} |B_\mu (t,X_t,\mu _t)-B_\mu (t,X_t,\nu _t)|\le A_1(1+|X_t|)d(\mu _t,\nu _t)\,, \end{aligned}$$

and a similar result can be shown for \(\Sigma _\mu \). Now, if we equip \({\mathcal {R}}\) with a metric \(\delta \) inherited from d,  for instance \(\delta (\mu ,\nu ):=\sup _{t\in [0,T]}e^{-\gamma t}d(\mu _t,\nu _t)\) for \(\gamma >0\) large enough, the proof that \(\Psi \) is a contraction follows with simple arguments. Since it is similar to the proof of Theorem 3.1, we omit the details.

Let now \(\alpha ,\beta \in {{\mathbb {A}}}\) and \(\epsilon >0\) small enough, such that \(\alpha +\epsilon \beta \in {{\mathbb {A}}}\). By X we denote the solution of (7) with respect to \(\alpha \) and by \(X^\epsilon \) we denote the solution to (7) with respect to \(\alpha +\epsilon \beta \). Furthermore for \(\lambda \in [0,1]\) we introduce \(X^{\lambda ,\epsilon }:=X+\lambda (X^\epsilon -X)\) and \(\alpha ^{\lambda ,\epsilon }:=\alpha +\lambda \epsilon \beta \). Note that, since \(\pi \) is convex, we have

$$\begin{aligned} \pi (X_t+\lambda (X_t^\epsilon -X_t))=\pi ((1-\lambda )X_t+\lambda X_t^\epsilon )\le (1-\lambda )\pi (X_t)+\lambda \pi (X_t^\epsilon )\le 0\,, \end{aligned}$$
(40)

hence \(X_t^{\lambda ,\epsilon }\) takes values in \({\mathcal {C}}\).

Next, by Lemma 3.2 we get

$$\begin{aligned} {\mathbb {E}}\left( \sup _{\lambda \in [0,1]}\sup _{t\in [0,T]}|X_t^{\lambda ,\epsilon }-X_t|^2\right)&\le {\hat{C}}_{L,T}\epsilon ^2\int _{0}^{T}|\beta _t|^2dt\,. \end{aligned}$$

Thus, we can conclude that \(X^{\lambda ,\epsilon }\underset{\epsilon \rightarrow 0}{\longrightarrow } X\) in \(L^2(\Omega ,C([0,T];{\mathbb {R}}^d))\), uniformly in \(\lambda \). By a simple Taylor expansion we get

$$\begin{aligned}&b(t,X_t^\epsilon ,{{\,\mathrm{\mathcal {L}}\,}}(X_t^\epsilon ),\alpha _t+\epsilon \beta _t)\\&\quad =b(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)+[b_x]^\epsilon _{t}\cdot (X_t^\epsilon -X_t) +\epsilon [b_\alpha ]^\epsilon _{t}\cdot \beta _t +{\tilde{{\mathbb {E}}}}\left( [b_\mu ]^\epsilon _{t}\cdot \widetilde{(X_t^\epsilon -X_t)}\right) \end{aligned}$$

where, given \(\varphi =\varphi (t,x,\mu ,\alpha )({{\tilde{x}}})\), we use the shorthand notation

$$\begin{aligned}{}[\varphi ]^\epsilon _{t}:=\int _{0}^{1}\varphi \left( t,X_t^{\lambda ,\epsilon },{{\,\mathrm{\mathcal {L}}\,}}(X_t^{\lambda ,\epsilon }),\alpha _t^{\lambda ,\epsilon }\right) \left( {{\tilde{X}}}^{\lambda ,\epsilon }_t\right) d\lambda \,,\end{aligned}$$

with the convention that the last input is ignored whenever \(\varphi \) does not depend on the tilde variable. Similarly, we have

$$\begin{aligned}&\sigma (t,X_t^\epsilon ,{{\,\mathrm{\mathcal {L}}\,}}(X_t^\epsilon ),\alpha _t+\epsilon \beta _t)\\&\quad =\sigma (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)+[\sigma _x]^\epsilon _{t}\cdot (X_t^\epsilon -X_t)+\epsilon [\sigma _\alpha ]^\epsilon _{t}\cdot \beta _t+{\tilde{{\mathbb {E}}}}\left( [\sigma _\mu ]^\epsilon _{t}\cdot \widetilde{(X_t^\epsilon -X_t)}\right) . \end{aligned}$$

Thus, for \(\varDelta ^\epsilon _t:=\dfrac{X_t^\epsilon -X_t}{\epsilon }-Z_t^{\alpha ,\beta }\) we have

$$\begin{aligned} d\varDelta _t^\epsilon&=\Bigg \{[b_x]^\epsilon _{t}\cdot \varDelta _t^\epsilon +{\tilde{{\mathbb {E}}}}\left( [b_\mu ]^\epsilon _{t}\cdot {\tilde{\varDelta }}_t^\epsilon \right) + ([b_x]^\epsilon _{t}-b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t))\cdot Z_t^{\alpha ,\beta } \\&\quad \quad \quad +\epsilon ([b_\alpha ]^\epsilon _{t}- b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t))\cdot \beta _t\\&\quad \quad \quad +{\tilde{{\mathbb {E}}}}\left( ([b_\mu ]^\epsilon _{t}-b_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t))\cdot {\tilde{Z}}_t^{\alpha ,\beta }\right) \Bigg \}dt\\&\quad + \Bigg \{[\sigma _x]^\epsilon _{t}\cdot \varDelta _t^\epsilon +{\tilde{{\mathbb {E}}}}\left( [\sigma _\mu ]^\epsilon _{t}\cdot {\tilde{\varDelta }}_t^\epsilon \right) + ([\sigma _x]^\epsilon _{t}-\sigma _x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t))\cdot Z_t^{\alpha ,\beta }\\&\quad \quad \quad +\epsilon ([\sigma _\alpha ]^\epsilon _{t}- \sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t))\cdot \beta _t\\&\quad \quad \quad +{\tilde{{\mathbb {E}}}}\left( ([\sigma _\mu ]^\epsilon _{t}-\sigma _\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t))\cdot {\tilde{Z}}_t^{\alpha ,\beta }\right) \Bigg \}dW_t. \end{aligned}$$

By Itô's formula, (40) and Assumption 2.4, we get

$$\begin{aligned} d\left( \frac{|\varDelta _t^\epsilon |^2}{2}\right)&\le \bigg \{A_1 |\varDelta _t^\epsilon |^2 +{\tilde{{\mathbb {E}}}}\left( |[b_\mu ]^\epsilon _{t}||{\tilde{\varDelta }}_t^\epsilon |\right) |\varDelta _t^\epsilon |\\&\quad \quad \quad +|[b_x]^\epsilon _{t}-b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)||Z_t^{\alpha ,\beta }||\varDelta _t^\epsilon |\\&\quad \quad \quad +\epsilon |[b_\alpha ]^\epsilon _{t}- b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)||\beta _t||\varDelta _t^\epsilon |\\&\quad \quad \quad + {\tilde{{\mathbb {E}}}}\left( |[b_\mu ]^\epsilon _{t}-b_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t)||{\tilde{Z}}_t^{\alpha ,\beta }|\right) |\varDelta _t^\epsilon |\bigg \}dt\\&\quad + \Bigg \langle \varDelta _t^\epsilon ,\Bigg ([\sigma _x]^\epsilon _{t}\cdot \varDelta _t^\epsilon +{\tilde{{\mathbb {E}}}}\left( [\sigma _\mu ]^\epsilon _{t}\cdot {\tilde{\varDelta }}_t^\epsilon \right) +([\sigma _x]^\epsilon _{t}-\sigma _x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t))\cdot Z_t^{\alpha ,\beta }\\&\quad \quad \quad +\epsilon ([\sigma _\alpha ]^\epsilon _{t}- \sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t))\cdot \beta _t\\&\quad \quad \quad +{\tilde{{\mathbb {E}}}}\left( ([\sigma _\mu ]^\epsilon _{t}-\sigma _\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t))\cdot {\tilde{Z}}_t^{\alpha ,\beta }\right) \Bigg )dW_t\Bigg \rangle \\&\quad +\frac{5}{2}\bigg \{A_2|\varDelta _t^\epsilon |^2 +\Big (\int _0^1A_2(1+|X_t^{\lambda ,\epsilon }|^2)d\lambda \Big ){\tilde{{\mathbb {E}}}}\left( |{\tilde{\varDelta }}_t^\epsilon |\right) ^2\\&\quad \quad \quad +|[\sigma _x]^\epsilon _{t}-\sigma _x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)|^2|Z_t^{\alpha ,\beta }|^2\\&\quad \quad \quad +\epsilon ^2|[\sigma _\alpha ]^\epsilon _{t}- \sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)|^2|\beta _t|^2\\&\quad \quad \quad + {\tilde{{\mathbb {E}}}}\left( |[\sigma _\mu ]^\epsilon _{t}-\sigma _\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t)||{\tilde{Z}}_t^{\alpha ,\beta }|\right) ^2\bigg \}dt. \end{aligned}$$

By Young's and Jensen's inequalities and Assumption 2.4 we have

$$\begin{aligned} {\tilde{{\mathbb {E}}}}\left( |[b_\mu ]^\epsilon _{t}||{\tilde{\varDelta }}_t^\epsilon |\right) |\varDelta _t^\epsilon |&\le \dfrac{1}{2}\left( {\tilde{{\mathbb {E}}}}\left( |[b_\mu ]^\epsilon _{t}|^2|{\tilde{\varDelta }}_t^\epsilon |^2\right) +|\varDelta _t^\epsilon |^2\right) \\&\le \dfrac{A_1}{2}\left( \int _{0}^{1}(1+|X_t^{\lambda ,\epsilon }|^2)d\lambda \right) {\tilde{{\mathbb {E}}}}\left( |{\tilde{\varDelta }}_t^\epsilon |^2\right) +\frac{1}{2}|\varDelta _t^\epsilon |^2. \end{aligned}$$

Since \(\epsilon >0\) is chosen in a way that \(\alpha +\epsilon \beta \in {\mathbb {A}}\), we can conclude by the a priori bound (32) and the definition of \({\mathbb {A}}\), that

$$\begin{aligned} {\mathbb {E}}\left( \sup _{s\in [0,t]}{\tilde{{\mathbb {E}}}}\left( |[b_\mu ]^\epsilon _{s}||{\tilde{\varDelta }}_s^\epsilon |\right) |\varDelta _s^\epsilon |\right) \le C(T,\Vert X_0\Vert _p){\tilde{{\mathbb {E}}}}\left( \sup _{s\in [0,t]}|{\tilde{\varDelta }}_s^\epsilon |^2\right) +{\mathbb {E}}\left( \sup _{s\in [0,t]}\frac{|\varDelta _s^\epsilon |^2}{2}\right) , \end{aligned}$$

for some constant \(C(T,\Vert X_0\Vert _p)>0\) which does not depend on \(\epsilon \). By the Burkholder–Davis–Gundy, Young and Jensen inequalities we arrive at

$$\begin{aligned} {\mathbb {E}}\left( \sup _{t\in [0,T]}|\varDelta _t^\epsilon |^2\right)&\le I_1+I_2+I_3+I_4+I_5+I_6 + C\int _{0}^{T}{\mathbb {E}}\left( \sup _{s\in [0,t]}|\varDelta _s^\epsilon |^2\right) ds, \end{aligned}$$

for a constant \(C>0\) which does not depend on \(\epsilon \) and

$$\begin{aligned} I_1&={\mathbb {E}}\left( \int _{0}^{T}|[b_x]^\epsilon _{t}-b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)|^2|Z_t^{\alpha ,\beta }|^2dt\right) \\ I_2&=\epsilon ^2{\mathbb {E}}\left( \int _{0}^{T}|[b_\alpha ]^\epsilon _{t}-b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)|^2|\beta _t|^2dt\right) \\ I_3&={\mathbb {E}}\left( \int _{0}^{T}{\tilde{{\mathbb {E}}}}\left( |[b_\mu ]^\epsilon _{t}-b_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t)|^2|{\tilde{Z}}_t^{\alpha ,\beta }|^2\right) dt\right) \end{aligned}$$

and \(I_4,I_5,I_6\) are the analogous terms for \(\sigma \). We will only show \(I_1\rightarrow 0\) as \(\epsilon \rightarrow 0\), the other terms being handled by similar arguments. By Assumption 2.4 we have

$$\begin{aligned} |[b_x]^\epsilon _{t}-b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)|^4\le C(1+|X_t|^{4q-4}+|X_t^{\lambda ,\epsilon }|^{4q-4}). \end{aligned}$$

Furthermore, for any \(p\le r\),

$$\begin{aligned} {\mathbb {E}}\left( \sup _{0\le t\le T}|X_t^{\lambda ,\epsilon }|^p\right)&\le C_p\Bigg \{(1+\lambda ^p) {\mathbb {E}}\left( \sup _{0\le t\le T}|X_t|^p\right) +\lambda ^p{\mathbb {E}}\left( \sup _{0\le t\le T}|X_t^{\epsilon }|^p\right) \Bigg \}, \end{aligned}$$

which is bounded from above by a constant that does not depend on \(\epsilon \), for \(\epsilon >0\) small enough. Since \(X^{\lambda ,\epsilon }\rightarrow X\) in \(L^2(\Omega ;C([0,T];{\mathbb {R}}^d))\), by the a priori bound (32), the estimate \({\mathbb {E}}\left( \sup _{t\in [0,T]}|Z_t|^4\right) <\infty \), the continuity of \(b_x\) and the dominated convergence theorem, one concludes that \(I_1\rightarrow 0\) as \(\epsilon \rightarrow 0\). Similar arguments combined with Gronwall’s lemma finish the proof. \(\square \)

As an important consequence, we obtain the following formula for the Gâteaux derivative of the cost functional. Given Lemma 4.1, the next result is proven as in [8], and its proof is thus omitted.

Corollary 4.1

The cost functional

$$\begin{aligned} J:{{\mathbb {A}}}\rightarrow {\mathbb {R}}\end{aligned}$$

is Gâteaux differentiable and its Gâteaux derivative at \(\alpha \in {{\mathbb {A}}}\) in direction \(\beta \in {{\mathbb {A}}}\) is given by

$$\begin{aligned} dJ(\alpha )\cdot \beta&={\mathbb {E}}\left( \int _{0}^{T}\Big \{f_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot Z_t^{\alpha ,\beta }+f_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t\Big \}dt\right) \\&\quad +{\mathbb {E}}\left( \int _{0}^{T}{\tilde{{\mathbb {E}}}}\left( f_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t)\cdot {\tilde{Z}}_t^{\alpha ,\beta }\right) dt\right) \\&\quad + {\mathbb {E}}\left( g_x(X_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))\cdot Z_T^{\alpha ,\beta }+{\tilde{{\mathbb {E}}}}\left( g_\mu (X_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))({\tilde{X}}_T)\cdot {\tilde{Z}}_T^{\alpha ,\beta }\right) \right) . \end{aligned}$$

4.2 Maximum Principle

For the reader’s convenience, we now rewrite the adjoint equation (29) using the Hamiltonian formalism. Recall that for \(x,p\in {\mathbb {R}}^d,\) \(q\in {\mathbb {R}}^{d\times m},\) \(\mu \in {\mathcal {P}}_2\) and \(\alpha \in A,\) we introduced the quantity

$$\begin{aligned} H(t,x,\mu ,p,q,\alpha ):=\langle b(t,x,\mu ,\alpha ),p\rangle +\langle \sigma (t,x,\mu ,\alpha ), q\rangle +f(t,x,\mu ,\alpha )\,. \end{aligned}$$

Thus, given a control \(\alpha \in {{\mathbb {A}}},\) one sees that the pair \((P,Q)\in \mathcal {S}^{2,d}\times \mathcal {H}^{2,d\times m}\) solves the adjoint equation if and only if for all \(t\in [0,T]\), \({\mathbb {P}}\)-almost surely

$$\begin{aligned} \left\{ \begin{aligned} dP_t&=-\left[ H_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),P_t,Q_t,\alpha _t) + \mathbb {{{\tilde{E}}}}\left( H_\mu (t,{{\tilde{X}}}_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),{{\tilde{P}}}_t,{{\tilde{Q}}}_t,\alpha _t)( X_t)\right) \right] dt +Q_tdW_t \\ P_T&=g_x(X_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))+{\tilde{{\mathbb {E}}}}\left( g_\mu ({{\tilde{X}}}_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))( X_T)\right) . \end{aligned}\right. \end{aligned}$$
(41)

where \(({{\tilde{X}}},{{\tilde{P}}},{{\tilde{Q}}},{{\tilde{\alpha }}})\) is an independent copy of \((X,P,Q,\alpha )\) on the space \(({{\tilde{\Omega }}},\mathcal {{{\tilde{F}}}},\mathbb {{{\tilde{P}}}}).\)

Let us point out that the above coefficients fail to satisfy [8, Assumption MKV BSDE, Chap. 4]. Hence, we first need to address the solvability of the BSDE (41) under the assumptions of Theorem 2.2.

Lemma 4.2

Under the assumptions of Theorem 2.2, there exists a unique solution \((P,Q)\in \mathcal {S}^2\times \mathcal {H}^{2,d\times m}\) of (41).

Proof

Fix \(\alpha \in {{\mathbb {A}}}\) and, for simplicity, write \(H_x(t,\omega ,p,q):=H_x(t,X_t(\omega ),\mathcal {L}(X_t),p,q,\alpha _t)\) and \(H_\mu (t,\omega ,x,p,q):=H_\mu (t,x,\mathcal {L}(X_t), p,q,\alpha _t)(X_t(\omega )).\) Consider the map \(\Gamma :\mathcal {H}^{2,d}\times \mathcal {H}^{2,d\times m}\rightarrow \mathcal {H}^{2,d}\times \mathcal {H}^{2,d\times m}\) which maps a given pair

$$\begin{aligned} (Y,Z)\in \mathcal {H}^{2,d}\times \mathcal {H}^{2,d\times m} \end{aligned}$$

to the solution (PQ) of

$$\begin{aligned} \left\{ \begin{aligned} dP_t&=-\left[ H_x(t,\omega ,P_t,Q_t)+{\mathbb {E}}\left( H_\mu (t,\omega ,X_t,Y_t,Z_t)\right) \right] dt+ Q_tdW_t \\ P_T&=g_x(X_T(\omega ),{{\,\mathrm{\mathcal {L}}\,}}(X_T))+{\tilde{{\mathbb {E}}}}\left( g_\mu ({\tilde{X}}_T,{{\,\mathrm{\mathcal {L}}\,}}(X_T))(X_T)\right) , \end{aligned}\right. \end{aligned}$$
(42)

where the expectation is to be understood in the following way:

$$\begin{aligned} {\mathbb {E}}\left( H_\mu (t,\omega ,X_t,Y_t,Z_t)\right) =\int _{\Omega }H_\mu (t,\omega ,X_t(\omega '),Y_t(\omega '),Z_t(\omega ')){\mathbb {P}}(d\omega '). \end{aligned}$$

In the following we drop the dependence on \(\omega \) for \(H_\mu \).

Since the above equation is a standard backward SDE with monotone coefficients, the existence of a solution is well-known by standard results. We will now show that the map \(\Gamma \) is a contraction, when the space \(\mathcal {H}^{2,d}\times \mathcal {H}^{2,d\times m}\) is equipped with the norm

$$\begin{aligned} |||(P,Q)|||_\gamma :=\left( \int _{0}^{T}e^{\gamma t}(\Vert P_t\Vert _2^2+\Vert Q_t\Vert _2^2)dt\right) ^{1/2} ,\end{aligned}$$

for a sufficiently large parameter \(\gamma >0\). If we denote by \((P^1,Q^1),(P^2,Q^2)\) two solutions of (42) for \((Y^1,Z^1)\) and \((Y^2,Z^2)\) respectively, then by the backward Itô Formula [20, p. 356] applied to \(e^{\gamma t}|P^1_t-P^2_t|^2\) we get

$$\begin{aligned}&|P^1_t-P^2_t|^2 +{\mathbb {E}}\left( \int _{t}^{T}\gamma e^{\gamma (r-t)}|P^1_r-P^2_r|^2dr\Bigg |\mathcal {F}_t\right) \nonumber \\&\quad +{\mathbb {E}}\left( \int _{t}^{T} e^{\gamma (r-t)}|Q^1_r-Q^2_r|^2dr\Bigg |\mathcal {F}_t\right) \nonumber \\&\quad \le 2{\mathbb {E}}\Bigg (\int _{t}^{T}e^{\gamma (r-t)}\Bigg \{\big ( H_x(r,P^1_r,Q^1_r)- H_x(r,P^2_r,Q^2_r)\big )\cdot (P^1_r-P^2_r) +|P^1_r-P^2_r|\times \nonumber \\&\quad \int _{\Omega }|H_\mu (r,X_r(\omega '),Y_r^1(\omega '),Z_r^1(\omega '))-H_\mu (r,X_r(\omega '),Y_r^2(\omega '),Z_r^2(\omega '))|{\mathbb {P}}(d\omega ')\Bigg \}dr\Bigg |\mathcal {F}_t\Bigg ).\nonumber \\ \end{aligned}$$
(43)

From Assumption 2.4, Young’s inequality and Lemma 3.1, we infer that

$$\begin{aligned}&\Vert (H_x(t,\omega ,P^1,Q^1) -H_x(t,\omega ,P^2,Q^2))\cdot (P^1-P^2)\Vert _1\\&\quad \quad \le (A_1+A_2^2) \Vert P^1-P^2\Vert _2^2+\frac{1}{4}\Vert Q^1-Q^2\Vert _2^2 \end{aligned}$$

and

$$\begin{aligned}&\int _{\Omega }|H_\mu (r,X_r(\omega '),Y_r^1(\omega '),Z_r^1(\omega '))-H_\mu (r,X_r(\omega '),Y_r^2(\omega '),Z_r^2(\omega '))|{\mathbb {P}}(d\omega ')\\&\quad \quad \le (A_1\vee A_2)(1+\Vert X_r\Vert _2^2)^\frac{1}{2}(\Vert Y_r^1-Y_r^2\Vert _2+\Vert Z_r^1-Z_r^2\Vert _2)\,. \end{aligned}$$

Invoking (32) and the Cauchy–Schwarz and Young inequalities, we can conclude that

$$\begin{aligned}&\int _{0}^{T}\gamma e^{\gamma r}\Vert P^1_r-P^2_r\Vert _2^2dr+\int _{0}^{T} e^{\gamma r}\Vert Q^1_r-Q^2_r\Vert _2^2dr\\&\quad \le 2(A_1+A_2^2)\int _{0}^{T}e^{\gamma r}\Vert P_r^1-P_r^2\Vert _2^2dr+\frac{1}{2}\int _{0}^{T}e^{\gamma r}\Vert Q_r^1-Q_r^2\Vert ^2_2dr\\&\quad + \frac{1}{2}\int _{0}^{T}e^{\gamma r}\Vert P^1_r-P^2_r\Vert _2^2dr +C(A,\Vert X_0\Vert _2)\int _{0}^{T}e^{\gamma r}\left( \Vert Y_r^1-Y_r^2\Vert _2^2+\Vert Z_r^1-Z_r^2\Vert _2^2\right) dr. \end{aligned}$$

For \(\gamma \) large enough this leads to

$$\begin{aligned} |||(P^1-P^2,Q^1-Q^2)|||_{\gamma } ^2 \le \frac{1}{2} |||(Y^1_r-Y^2_r,Z^1_r-Z^2_r)|||_\gamma ^2\,, \end{aligned}$$

showing that \(\Gamma \) is a contraction. The conclusion follows. \(\square \)

The following corollary follows immediately by integration by parts and an application of Fubini's theorem. We therefore omit the proof and refer to [8, Lemma 6.12, p. 547].

Corollary 4.2

Let \((P,Q)\) be a solution to (41); then it holds

$$\begin{aligned}&{\mathbb {E}}\left( \langle P_T,Z_T^{\alpha ,\beta }\rangle \right) ={\mathbb {E}}\left( \int _{0}^{T}\langle P_t,b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta \rangle +\langle Q_t,\sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta \rangle dt\right) \nonumber \\&\quad - {\mathbb {E}}\left( \int _{0}^{T}f_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot Z_t^{\alpha ,\beta }+{\tilde{{\mathbb {E}}}}\left( f_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t)\cdot {\tilde{Z}}_t^{\alpha ,\beta }\right) dt\right) . \end{aligned}$$
(44)

Remark 4.1

An immediate consequence of (44) is the following formula for the Gâteaux derivative of the cost functional:

$$\begin{aligned} d J(\alpha )\cdot \beta&={\mathbb {E}}\bigg (\int _{0}^{T}\Big \{\langle b_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t,P_t\rangle +\langle \sigma _\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t,Q_t\rangle \\&\quad \quad \quad \quad +f_\alpha (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)\cdot \beta _t\Big \}dt \bigg ) \\&={\mathbb {E}}\bigg (\int _{0}^{T} H_\alpha (t,X_t,\mathcal {L}(X_t),P_t,Q_t,\alpha _t)\cdot \beta _tdt\bigg ). \end{aligned}$$

An application of Fubini's theorem then leads to the following representation of the gradient of J:

$$\begin{aligned} \nabla J(\alpha )_t ={\mathbb {E}}\left( H_\alpha (t,X_t,\mathcal {L}(X_t),P_t,Q_t,\alpha _t)\right) \,,\quad t\in [0,T]. \end{aligned}$$
(45)

Formula (45) is of fundamental importance for numerical purposes; see Sect. 5 below.

We are now in a position to prove the maximum principle.

Proof of Theorem 2.2

Let \(\alpha \in {\mathbb {A}}\) be an optimal control for (SM), X the corresponding solution to (7) and \((P,Q)\) the associated solution to (41). For \(\beta \in {\mathbb {A}}\) we have, by the optimality of \(\alpha \),

$$\begin{aligned} dJ(\alpha )\cdot (\beta -\alpha )=\langle \nabla J(\alpha ),\beta -\alpha \rangle _{L^2([0,T];{\mathbb {R}}^k)}\ge 0\,. \end{aligned}$$

Invoking the convexity of the Hamiltonian (see Assumption 2.3), we get

$$\begin{aligned} \int _{0}^{T}{\mathbb {E}}\Big (H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,\beta _t)-H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,\alpha _t)\Big )dt\ge 0\,. \end{aligned}$$

For an arbitrary measurable set \(C\subset [0,T]\) and \({{\tilde{\alpha }}}\in A\), we can define the admissible control

$$\begin{aligned} \beta _t={\left\{ \begin{array}{ll} {{\tilde{\alpha }}} &{} \text {for } t\in C,\\ \alpha _t &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

hence

$$\begin{aligned} \int _{0}^{T}{\mathbf {1}}_C(t){\mathbb {E}}\big (H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,{{\tilde{\alpha }}})-H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,\alpha _t)\big )dt\ge 0. \end{aligned}$$

Therefore we get

$$\begin{aligned} {\mathbb {E}}\big (H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,{{\tilde{\alpha }}})-H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,\alpha _t)\big )\ge 0, \end{aligned}$$

for dt-almost every \(t\in [0,T]\); that is, \(\alpha _t\) minimizes \({{\tilde{\alpha }}}\mapsto {\mathbb {E}}\big (H(t,X_t,\mathcal {L}(X_t),P_t,Q_t,{{\tilde{\alpha }}})\big )\) over A for almost every t. This proves the theorem. \(\square \)

5 Numerical Examples

In this section we focus on the FitzHugh–Nagumo model with external noise only, i.e. the system of 3N stochastic differential equations:

$$\begin{aligned} \left\{ \begin{aligned}&dv^i_t=\Big (v^i_t-\frac{(v^ i_t)^3}{3}-w_t^i + \alpha _t - \frac{1}{N}\sum \nolimits _{j=1}^N{\bar{J}}(v^i_t-V_{rev})y^j_t\Big )dt +\sigma _{ext} d W^i_t \\&dw^i_t=c (v^i_t+a -b w^i_t)dt, \\&dy^i_t=(a_r S (v_t^i)(1-y_t^i)-a_d y_t^i)dt , \end{aligned}\right. \end{aligned}$$
(46)

where we recall that \(S(v):= T_{max}/[1+\exp (\lambda (v-V_T))].\)
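
Before turning to the control problem, we record a minimal particle-simulation sketch of (46) in Python. It uses an explicit Euler–Maruyama step, placeholder parameter values and an arbitrary initial cloud, all of which are our assumptions; the actual experiments below use an implicit scheme (cf. step (1) of Algorithm 5.1), the parameters of Table 1 and initial states on a reference orbit.

```python
import numpy as np

def simulate_network(alpha, T=200.0, dt=0.01, N=1000, seed=0,
                     a=0.7, b=0.8, c=0.08, J_bar=1.0, V_rev=1.0,
                     a_r=1.0, a_d=0.5, T_max=1.0, lam=1.0, V_T=2.0,
                     sigma_ext=0.5):
    """Explicit Euler-Maruyama scheme for the particle system (46).
    `alpha` is the deterministic control, passed as a function of time."""
    rng = np.random.default_rng(seed)
    S = lambda v: T_max / (1.0 + np.exp(lam * (v - V_T)))
    v = rng.normal(-0.8, 0.1, N)   # arbitrary initial cloud
    w = np.full(N, -0.14)
    y = np.full(N, 0.59)
    n_steps = int(T / dt)
    lfp = np.empty(n_steps)        # local field potential = average of v
    for k in range(n_steps):
        t = k * dt
        coupling = J_bar * (v - V_rev) * y.mean()  # (1/N) sum_j y^j
        dW = rng.normal(0.0, np.sqrt(dt), N)
        dv = (v - v**3 / 3.0 - w + alpha(t) - coupling) * dt + sigma_ext * dW
        dw = c * (v + a - b * w) * dt
        dy = (a_r * S(v) * (1.0 - y) - a_d * y) * dt
        v, w, y = v + dv, w + dw, y + dy
        lfp[k] = v.mean()
    return lfp

# Example: uncontrolled network, alpha identically zero
# lfp = simulate_network(lambda t: 0.0)
```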

We are interested in steering the average membrane potential (referred to in the following as the “local field potential”) of a network of FitzHugh–Nagumo neurons towards a desired state. Our cost functional is given by

$$\begin{aligned} \begin{aligned} f(t,x,\mu ,\alpha )&:=\Big |\int _{{\mathbb {R}}^3}v\,\mu (dv\times dw\times dy)-{\overline{v}}_t\Big |^2 \\ g(x,\mu )&:=0, \end{aligned} \end{aligned}$$
(47)
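
In the particle approximation, the measure integral in (47) is simply the empirical mean of the voltage coordinates, so evaluating the cost amounts to a discretised \(L^2\)-distance between the simulated local field potential and the reference profile. A minimal sketch, reusing the lfp array produced by the simulation sketch above (the left-endpoint quadrature is our choice):

```python
import numpy as np

def running_cost(lfp, v_ref, dt):
    # Discretisation of J(alpha) for the cost (47): integrate the squared
    # deviation of the local field potential from the reference profile;
    # the terminal cost vanishes since g = 0.
    return float(np.sum((lfp - v_ref) ** 2) * dt)
```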

where \(({\overline{v}}_t)_t\) is a given reference profile. We should mention that the average membrane potential only gives an idea of the average activity of the network at each time. For example, a high average membrane potential indicates that a large number of neurons are in the regenerative or active phase, while a low average membrane potential means that a large number of neurons are in the absolute refractory or silent phase.

In the case described above, the adjoint equation reduces to

$$\begin{aligned} \left\{ \begin{aligned} dP_t&=-\bigg \{\left\langle b_x(t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t),P_t\right\rangle + \mathbb {{{\tilde{E}}}}\left( \left\langle b_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t),{\tilde{P}}_t\right\rangle \right) \\&\quad \quad \quad +\mathbb {{{\tilde{E}}}}\left( f_\mu (t,X_t,{{\,\mathrm{\mathcal {L}}\,}}(X_t),\alpha _t)({\tilde{X}}_t)\right) \bigg \}dt +Q_tdW_t \\ P_T&=0. \end{aligned}\right. \end{aligned}$$
(48)

In the following subsection we give a short introduction on how to solve (48) numerically.

5.1 Numerical Approximation of the Adjoint Equation

In general, we consider the following non-fully coupled mean-field forward-backward SDE (MFFBSDE):

$$\begin{aligned} \left\{ \begin{aligned}&dX_t=b(t,X_t,\mathcal {L}(X_t))dt+\sigma (t,X_t,\mathcal {L}(X_t))dW_t \\&dY_t= \left[ f(t,X_t,Y_t)+h(t,X_t,\mathcal {L}(X_t,Y_t))\right] dt-Z_tdW_t\\&X_0=\xi \\&Y_T=g(X_T). \end{aligned}\right. \end{aligned}$$
(49)

For the approximation of the forward component we consider an implicit Euler scheme for McKean-Vlasov equations. Since this is standard, we will not go into further details. Concerning the backward component, we consider a scheme similar to the one presented in [10]. We should mention that, since we are not dealing with a fully coupled MFFBSDE, our situation is considerably easier to handle than the one treated in [10]. For a given discrete time grid \(\pi :0=t_0<t_1<...<t_N=T\), we consider the following numerical scheme:

$$\begin{aligned} Y_{t_k}^{\pi }&={\mathbb {E}}\left( Y_{t_{k+1}}^{\pi }|\mathcal {F}_{t_k}\right) -(t_{k+1}-t_{k})\bigg \{f(t_k,X_{t_k}^{\pi },Y_{t_k}^{\pi })+h(t_{k+1},X_{t_{k}}^{\pi },\mathcal {L}(X_{t_{k+1}}^\pi ,Y_{t_{k+1}}^\pi ))\bigg \}\\ Z_{t_k}^{\pi }&:=(t_{k+1}-t_{k})^{-1}{\mathbb {E}}\left( Y_{t_{k+1}}^{\pi }(W_{t_{k+1}}-W_{t_{k}})|\mathcal {F}_{t_k}\right) ,\\ Y_{t_N}^{\pi }&=g(X_{t_N}^{\pi }),\quad Z_{t_N}^{\pi }=0. \end{aligned}$$

For the approximation of the conditional expectation, we make use of the decoupling field mentioned in [8], to write

$$\begin{aligned} Y_{t_{k+1}}^{\pi }=u(t_{k+1},X_{t_{k+1}}^{\pi },\mathcal {L}(X_{t_{k+1}}^{\pi }))=:{{\hat{u}}}(t_{k+1},X_{t_{k+1}}^{\pi }). \end{aligned}$$

Thus we can represent the conditional expectation in terms of a function \({\tilde{u}}\) by

$$\begin{aligned} {\mathbb {E}}\left( Y_{t_{k+1}}^{\pi }|\mathcal {F}_{t_k}\right) ={\tilde{u}}(t_{k+1},X_{t_k}^{\pi }). \end{aligned}$$

We approximate \({\tilde{u}}(t_{k+1},\cdot )\) with Gaussian radial basis functions by solving the following minimization problem for fixed nodes \(x_1,...,x_L\):

$$\begin{aligned} \min _{\alpha }{\mathbb {E}}\left( \Big |Y_{t_{k+1}}^{\pi }-\sum _{i=1}^{L}\alpha _i(t_{k+1})e^{-\frac{1}{2 \delta }\Vert X_{t_k}^\pi -x_i\Vert ^2}\Big |^2\right) , \end{aligned}$$

for \(\alpha =(\alpha _1(t_{k+1}),...,\alpha _L(t_{k+1}))^\dagger \), where \(\delta >0\) and \(L\in {\mathbb {N}}\) are fixed. We initialize the nodes \(x_1,...,x_L\) as L independent realizations of \(X_{t_k}^\pi \). For m realizations of \(Y_{t_{k+1}}^{\pi }\) and \(X_{t_k}^\pi \), denoted by \(y_{t_{k+1}}^1,...,y_{t_{k+1}}^m\) and \(x_{t_{k}}^1,...,x_{t_{k}}^m\) respectively, we then write

$$\begin{aligned} y_{t_{k+1}}&=(y_{t_{k+1}}^1,...,y_{t_{k+1}}^m)^\dagger \\ A(t_k)&=(e^{-\frac{1}{2 \delta }\Vert x_{t_k}^i-x_j\Vert ^2})_{i=1,...,m,\,j=1,...,L}. \end{aligned}$$

Thus we need to minimize

$$\begin{aligned} \Vert y_{t_{k+1}}-A(t_k)\alpha (t_{k+1})\Vert ^2. \end{aligned}$$
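
A minimal NumPy sketch of this least-squares step (the function and variable names are ours; the fitted function plays the role of \({\tilde{u}}(t_{k+1},\cdot )\)):

```python
import numpy as np

def fit_conditional_expectation(x_samples, y_samples, nodes, delta):
    """Solve min_alpha ||y - A(t_k) alpha||^2 for Gaussian radial basis
    functions with fixed nodes x_1, ..., x_L.

    x_samples: (m, d) realisations of X_{t_k}^pi
    y_samples: (m,)   realisations of one component of Y_{t_{k+1}}^pi
    nodes:     (L, d) fixed nodes, e.g. L realisations of X_{t_k}^pi
    """
    diff = x_samples[:, None, :] - nodes[None, :, :]          # (m, L, d)
    A = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * delta))    # (m, L)
    coeff, *_ = np.linalg.lstsq(A, y_samples, rcond=None)
    # returned map: x -> sum_i coeff_i * exp(-||x - x_i||^2 / (2 delta))
    return lambda x: np.exp(-np.sum((x - nodes) ** 2, axis=1)
                            / (2.0 * delta)) @ coeff
```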

A similar approach for BSDEs can be found in [18]. There is no convergence analysis of this scheme under our assumptions on the coefficients; the above is only meant to give an idea of how to solve the adjoint equation in practice. Furthermore, we should mention that in the case where only external noise is present, the duality (44) and the resulting gradient representation still hold true for any non-adapted solution of (41), i.e. for any solution to the random backward ODE arising from equation (41) for \(Q\equiv 0\). Thus one can also implement a numerical scheme for the adjoint equation without any conditional expectations involved. For more general diffusion coefficients, however, this is not true, since the proof of the duality is based on integration by parts, and the stochastic integral which appears is not defined if the integrand is not adapted.
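
In that external-noise case, a sketch of the resulting pathwise backward Euler step is particularly simple. The rhs callable, assumed to evaluate the bracket in (48) along the already simulated particle trajectories, is a hypothetical interface of ours:

```python
import numpy as np

def solve_backward_ode(P_T, rhs, t_grid):
    # Pathwise backward Euler for the random backward ODE obtained from
    # (41) with Q = 0: dP_t = -rhs(t, P_t) dt with terminal value P_T.
    # No conditional expectations are needed since adaptedness is dropped.
    P = [None] * len(t_grid)
    P[-1] = np.asarray(P_T, dtype=float)
    for k in range(len(t_grid) - 2, -1, -1):
        dt = t_grid[k + 1] - t_grid[k]
        P[k] = P[k + 1] + dt * rhs(t_grid[k + 1], P[k + 1])
    return P
```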

5.2 Gradient Descent Algorithm

We will now briefly sketch our gradient descent algorithm.

Algorithm 5.1

Take an initial control \(\alpha _0\in {\mathbb {A}}\), \(s_0>0\), and recursively for \(n=0,1,\dots :\)

  (1) determine \(X^{\alpha _n}\) by solving the state equation with an implicit particle scheme to avoid particle corruption;

  (2) solve the adjoint equation for given \(X^{\alpha _n}\) in order to approximate \((P^{\alpha _n},Q^{\alpha _n})\);

  (3) approximate the gradient

    $$\begin{aligned} \nabla J(\alpha _n)_s ={\mathbb {E}}\Big [\langle b_\alpha (s,X_s^{\alpha _n},{{\,\mathrm{\mathcal {L}}\,}}(X^{\alpha _n}_s),\alpha _s^n),P^{\alpha _n}_s\rangle + f_\alpha (s,X_s^{\alpha _n},{{\,\mathrm{\mathcal {L}}\,}}( X_s^{\alpha _n}),\alpha ^n_s) \Big ] \end{aligned}$$

    via a Monte-Carlo method, where \((P^{\alpha _n},Q^{\alpha _n})\) solves the adjoint equation;

  (4) update the step size \(s_n\) according to a suitable step size rule (e.g. the Armijo rule), repeating the preceding steps if necessary;

  (5) update the control in the direction of steepest descent: \(\alpha _{n+1}:=\alpha _n -s_n\nabla J(\alpha _n)\);

  (6) stop as soon as \(\Vert \nabla J(\alpha _n)\Vert <\epsilon \).

To compute the expectation term, one is in fact reduced to simulating the solution of the network equation itself and using the particles as samples for the Monte-Carlo simulation.
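
A compact sketch of the outer loop of Algorithm 5.1, with steps (1)-(3) abstracted into user-supplied callables J and grad_J (hypothetical names of ours, wrapping the particle scheme, the adjoint solver and the Monte-Carlo gradient (45)), and Armijo backtracking for step (4):

```python
import numpy as np

def gradient_descent(alpha0, J, grad_J, s0=1.0, eps=1e-4,
                     c1=1e-4, max_iter=100):
    # alpha0 is the initial control, discretised on the time grid.
    alpha = np.asarray(alpha0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad_J(alpha)                    # steps (1)-(3)
        if np.linalg.norm(g) < eps:          # step (6): stopping rule
            break
        s, J_cur = s0, J(alpha)
        # step (4): Armijo backtracking, re-running the forward solve
        while J(alpha - s * g) > J_cur - c1 * s * np.sum(g ** 2):
            s *= 0.5
        alpha = alpha - s * g                # step (5): steepest descent
    return alpha
```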

5.3 Numerical Examples for Systems of FitzHugh–Nagumo Neurons

Although the solution to the adjoint equation is a 3-dimensional process, in the following we only plot its first component, since the other components are irrelevant for the gradient in our situation.

To illustrate some problems we encountered in the simulations, we consider the deterministic uncoupled case of equation (46), where \({\overline{J}}=0\) and \(\sigma _{ext}=0\). In this situation the membrane potential v becomes highly sensitive to small perturbations of the control at specific times when the constant control \(\alpha _t\equiv \alpha \) is chosen close to the value \(\alpha _c\approx 0.33\) of the supercritical Hopf bifurcation, where the fixed point is still stable but there is no periodic orbit yet; in particular, we chose \(\alpha =0.3251<\alpha _c\). The bifurcation point of the deterministic system can be determined in a similar way as in [23]. This sensitivity can lead to large solutions of the corresponding adjoint equation for specific reference profiles. One example is to choose the reference profile as the v-trajectory of a solution to (46) for a control parameter \(\alpha \) in the limit-cycle regime. This situation is illustrated by Figs. 2, 3 and 4.

Fig. 2: Membrane potential of the solution to (46) for \(\alpha \equiv 0.3251\)

Fig. 3: Reference profile generated by solving (46) for \(\alpha \equiv 0.332\)

Fig. 4: Solution to the corresponding adjoint equation

The same type of phenomenon also occurs for the coupled system of stochastic FitzHugh–Nagumo neurons. Here it can lead to large fluctuations of the sample mean for the adjoint equation, so that a large number of particles is required to compute the expectation of the solution to the adjoint equation. A small illustration is given by Figs. 5, 6 and 7.

Fig. 5: Local field potential of the solution to (46) for \(\alpha \equiv 0\)

Fig. 6: Reference profile chosen as the local field potential of (46) for \(\alpha (t)= 0.8\) if \(t\le 7\)

Fig. 7: Samples of the solution to the adjoint equation

In this example and in the following ones, the initial states are uniformly distributed on the orbit of a solution to (46) with \(\alpha \equiv 0\), \(\sigma _{ext}=0\) and initial conditions \(v_0=-0.828\), \(w_0=-0.139\), \(y_0=0.589\). The other parameters are given below in Table 1. In our example only external noise is present, so we can approximate the non-adapted solution to the BSDE using the simple Euler scheme with no conditional expectations involved, replacing \({\mathbb {E}}\left( Y_{t_{k+1}}^\pi |\mathcal {F}_{t_{k}}\right) \) by \(Y_{t_{k+1}}^\pi \) in our numerical scheme, to get a good approximation of the gradient. For the sake of completeness, we mention that for the approximation of the adapted solution we would use the parameters \(L=200\) and \(\delta =2\). Furthermore, we always use \(N=1000\) particles for the particle approximation of (46).

5.3.1 Control of a Coupled System of FitzHugh–Nagumo Neurons

For our first example, we consider a parameter regime where the activity of a large number of neurons of the network at some time t leads to further activity at a later time, without any external current applied to the system. To this end we slow down the gating variable by decreasing the closing rate of the synaptic gates; this way its impact on the network is still high enough when a large part of the network is excitable again. Figure 8 shows the uncontrolled local field potential in this case (i.e. when \(\alpha \equiv 0\)).

Our goal is now to increase the activity of the network up to time \(t=100\) and then steer the network back into its resting potential. Up to time \(t=100\), the reference profile shown in Fig. 9 is the local field potential of a network of coupled FitzHugh–Nagumo neurons to which a constant input current of magnitude 0.8 is applied for a time period of \(\Delta t=7\) starting at \(t=0\). For times \(t>100\) it is the resting potential of a single FitzHugh–Nagumo neuron.

Fig. 8: Uncontrolled local field potential

Fig. 9: Reference profile

We expect the optimal control to raise the membrane potential for a small time period at \(t=0\) and then to counteract the stimulating effect of the coupling around \(t=100\). However, these effects should not occur in the uncoupled setting, which we consider afterwards.

Figures 10 and 11 show the optimal control and the corresponding optimal local field potential. We point out that this control might only be locally optimal, since we cannot expect to find a globally optimal control with our gradient descent algorithm.

Fig. 10: Optimal control

Fig. 11: Local field potential with optimal control

Since our terminal cost vanishes, i.e. \(g\equiv 0\), the solution to the adjoint equation is always zero at terminal time T. Consequently, the gradient is always zero at time T, and the gradient descent algorithm does not change the control there. This is why the approximated optimal control stays at the value of the initial control \(\alpha _0\) at terminal time. We started with the initial control \(\alpha _0\equiv 0\), which explains the small peak in Figs. 10 and 11.

5.3.2 Control of an Uncoupled System of FitzHugh–Nagumo Neurons

Now we investigate the control problem for the uncoupled equation (46), where \({\bar{J}}=0\). Since the reference profile is still the same as in Example 5.3.1, we only present the corresponding optimal control (Fig. 12).

Fig. 12: Optimal control

As expected, the control does not need to counteract any stimulating effects for times \(t>100\). Furthermore, in the uncoupled case it is not sufficient to apply an input current for a small time period at \(t=0\) in order to reach the desired local field potential up to time \(t=100\) (Table 1).

Table 1 Parameters used for the examples