Optimal control of mean field equations with monotone coefficients and applications in neuroscience

We are interested in the optimal control problem associated with certain quadratic cost functionals depending on the solution $X=X^\alpha$ of the stochastic mean-field type evolution equation in $\mathbb R^d$ $dX_t=b(t,X_t,\mathcal L(X_t),\alpha_t)dt+\sigma(t,X_t,\mathcal L(X_t),\alpha_t)dW_t,$ $X_0\sim \mu$ given, under assumptions that enclose a system of FitzHugh-Nagumo neuron networks, and where for practical purposes the control $\alpha_t$ is deterministic. To do so, we assume that we are given a drift coefficient that satisfies a one-sided Lipschitz condition, and that the dynamics is subject to a (convex) level set constraint of the form $\pi(X_t)\leq0$. The mathematical treatment we propose follows the lines of the recent monograph of Carmona and Delarue for similar control problems with Lipschitz coefficients. After addressing the existence of minimizers via a martingale approach, we show a maximum principle and then numerically investigate a gradient algorithm for the approximation of the optimal control.


Introduction Motivations
Based on a modification of a model by van der Pol, FitzHugh [ ] proposed the following system of equations to describe the dynamics of a single neuron subject to an external current for some constants a, b, c > 0, where the unknowns v, w correspond respectively to the so-called voltage and recovery variables (see also Nagumo [ ]). In presence of interactions, one has to enlarge the previous pair by an additional unknown y that counts a fraction of open channels (synaptic channels), and which is sometimes referred to as gating variable.
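For orientation, the single-neuron dynamics can be integrated in a few lines. The sketch below uses one standard parameterization, dv/dt = v − v³/3 − w + I and dw/dt = c(v + a − bw); the constants and the external current are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

# Euler integration of the deterministic single-neuron FitzHugh-Nagumo model:
#   dv/dt = v - v^3/3 - w + I_ext
#   dw/dt = c * (v + a - b * w)
# All constants are illustrative, not the paper's values.
a, b, c, I_ext = 0.7, 0.8, 0.08, 0.5
dt, n_steps = 0.01, 5000

v, w = -1.0, 1.0
vs = np.empty(n_steps)
for k in range(n_steps):
    dv = v - v**3 / 3 - w + I_ext
    dw = c * (v + a - b * w)
    v, w = v + dt * dv, w + dt * dw
    vs[k] = v
```

For these parameter values the trajectory settles onto the familiar relaxation-oscillation limit cycle of the FitzHugh-Nagumo model.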
When it comes to an interacting network of neurons, it is customary to assume that the corresponding graph is fully connected, which is arguably a good approximation at small scales [ ]. This implies that all the neurons in the given network add a contribution to the interaction terms in the equation. Precisely, for a population of size N ∈ N, the state at time t of the i-th neuron is described by the three-dimensional vector and one is led to study the system of 3N stochastic differential equations: ( ) In the above formula, B i , W i ,B i are i.i.d. Brownian motions modelling independent sources of noise with respective intensities σ J , σ ext , σ y i (v i ) > 0. The last of these intensities depends on the solution, through the formula σ y (v) = χ(y) aS(v)(1 − y) + by ( ) with given constants a, b > 0 and some smooth cut-off function χ : R → R supported in (0, 1). Various physical constants appear in ( ), which we now briefly introduce: • V rev is the synaptic reversal potential; •J is (the mean of) the maximum conductance; • S(v i ) is the concentration of neurotransmitters released into the synaptic cleft by the presynaptic neuron i; explicitly for v ∈ R where T max is a given maximal concentration and λ −1 > 0, V T > 0, are constants setting the steepness, resp. the value, at which S(v) is half-activated (for typical values, see for instance [ ]); • a r , a d > 0 correspond to rise and decay rates, respectively, for the synaptic conductance.
In this model, the voltage variable v i describes the membrane potential of the i-th neuron in the network, while the recovery variable w i models the dynamics of the corresponding ion channels. As already alluded to, the gating variable y i models a fraction of open ion channels in the postsynaptic neurons, and thus ought to be a number between 0 and 1 (hence the cut-off χ(y i ) in ( )). Loosely speaking, y i should be thought of as the output contribution of the neuron i to adjoining postsynaptic neurons, resulting from the concentration S(v i ) of neurotransmitters. The resulting synaptic current from i to j affecting the postsynaptic neuron j is then given by −J(v j − V rev )y i where J is the maximum conductance. This latter term is affected by noise coming from the environment, which in turn explains the structure of the interaction terms in the first equation. For a thorough presentation of ( ) and its applications in the field of neurosciences, we refer for instance to the monograph of Ermentrout and Terman [ ].
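A minimal Euler-Maruyama simulation of the mean-field-coupled network described above might look as follows. It keeps only the external voltage noise, omits the gating noise, and uses illustrative constants and a hypothetical sigmoid S; it is a sketch of the structure of the system, not a faithful reproduction of the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def S(v, T_max=1.0, lam=0.2, V_T=2.0):
    # Sigmoidal neurotransmitter concentration; constants are illustrative.
    return T_max / (1.0 + np.exp(-lam * (v - V_T)))

# Illustrative constants (not the paper's values).
a, b, c, I_ext = 0.7, 0.8, 0.08, 0.5
J_bar, V_rev = 0.5, 1.0        # mean maximal conductance, reversal potential
a_r, a_d = 1.0, 0.5            # rise and decay rates of the synaptic gate
sigma_ext = 0.2                # external-noise intensity
N, dt, n_steps = 200, 0.01, 2000

v = rng.normal(-1.0, 0.1, N)
w = np.ones(N)
y = np.full(N, 0.3)
for _ in range(n_steps):
    # fully-connected interaction: empirical mean of the gating variables
    coupling = -J_bar * (v - V_rev) * y.mean()
    dW = rng.normal(0.0, np.sqrt(dt), N)
    v = v + (v - v**3 / 3 - w + I_ext + coupling) * dt + sigma_ext * dW
    w = w + c * (v + a - b * w) * dt
    y = y + (a_r * S(v) * (1.0 - y) - a_d * y) * dt
```

Note that with dt·(a_r·S + a_d) < 1 the explicit Euler update for y is a convex-type combination, so the discretized gating variable also stays in [0, 1].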

Propagation of chaos
The system ( ) has the generic form For N → ∞, one is naturally pushed to investigate the convergence in law of the solutions of ( ) towards the probability measure µ = L(X|P), where X solves dX t = b(t, X t , L(X t ), α t )dt + σ(t, X t , L(X t ), α t )dW t , t ∈ [0, T ] X 0 ∈ L 2 (Ω, F 0 , P; R d ).
( ) and where b, σ are the coefficients obtained by substituting expectations in ( ) in place of empirical means. In the context of ( ), a first mathematical investigation of such convergence is due to Baladron, Fasoli, Faugeras and Touboul [ ], who showed that the sequence of laws of the particle system is µ-chaotic. Namely, for each k ∈ N, k ≤ N and φ 1 , . . . , φ k ∈ C b (C([0, T ]; R d )) it holds This situation is usually referred to as "propagation of chaos".
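The reverse direction of this limit is what one exploits numerically: the law L(X_t) in the McKean-Vlasov equation is replaced by the empirical measure of N interacting particles. Here is a toy sketch with a linear mean-field drift (unrelated to the FitzHugh-Nagumo model), for which the mean of the limiting equation is constant in time, so the empirical mean should barely move.

```python
import numpy as np

rng = np.random.default_rng(1)

# Particle approximation of the toy McKean-Vlasov SDE
#   dX_t = (E[X_t] - X_t) dt + sigma dW_t,
# replacing the law L(X_t) by the empirical measure of N particles.
N, dt, n_steps, sigma = 5000, 0.01, 100, 1.0
X = rng.normal(1.0, 0.5, N)
m0 = X.mean()
for _ in range(n_steps):
    m = X.mean()                      # empirical surrogate for E[X_t]
    X = X + (m - X) * dt + sigma * np.sqrt(dt) * rng.normal(size=N)
mT = X.mean()
```

Since E[dX_t] = 0 for this drift, the deviation |mT − m0| is a pure finite-N fluctuation of order N^{-1/2}.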

Mean-field limit and control
In this regard, taking N ≫ 1 guarantees that a "good enough" approximation of ( ) is given by the mean-field limit ( ), where the corresponding coefficients (b, σ) : [0, T ] × R 3 × P(R 3 ) × R → R 3 × R 3×3 are given by In this paper, we concentrate our attention on the optimal control problem associated with a cost functional of the form for suitable functions f and g, and where X α is subject to the dynamical constraint ( ). The cost functional ought to be minimized over some convex, admissible set of controls A.
Because of potential applications in the treatment of neuronal diseases, the control of the stochastic FHN model has gained a lot of attention during the last years (see, e.g., [ , ]). The need to introduce random perturbations in the original model is widely justified from a physics perspective (see for instance [ ] and the references therein). In [ ] the authors investigate a FitzHugh-Nagumo SPDE which results from the continuum limit of a network of coupled FitzHugh-Nagumo equations. We have a similar structure in mind regarding the dependence of the coefficients on the control (namely, the dynamics of the membrane potential depends linearly on the control). Our approach here is however completely different, in that we hinge on the McKean-Vlasov type SDE ( ) that originates from the propagation of chaos.
McKean-Vlasov control problems of this type were investigated in the past decade by Bensoussan, Frehse and Yam [ ], but also by Carmona and co-authors (see for instance [ ]). These developments culminated with the monograph of Carmona and Delarue [ ], where a systematic treatment is made (under reasonable assumptions). Other related works include [ , , , ]. These results fail however to encompass ( )-( ), due for instance to the lack of Lipschitz property for the drift coefficient.
From the analytic point of view, the FitzHugh-Nagumo model also suffers from the fact that the diffusion matrix is degenerate, making it difficult to obtain energy estimates for the Kolmogorov equation (see Remark . ).
Our objective in this work is twofold. At first, our purpose is to extract some of the qualitative features of FitzHugh-Nagumo system and its mean field limit, in a broader treatment that encloses ( ) and ( )-( ). In this sense, our intention is not to deal with the previous models "as such" but instead, we aim to take a step further by dealing with a certain class of equations that possess the following attributes: • (Monotonicity) -though the drift coefficient in ( ) displays a cubic non-linearity, it satisfies the monotonicity condition • (Constrained dynamics) -the dynamics of the coupling variable ensures that the convex constraint y t ∈ [0, 1] holds for all times.
• (Interaction with quadratic dependence on the unknown) -In spite of the order type interaction in ( )-( ) (in the sense of [ , p. ]), the corresponding nonlinearity displays the quadratic behaviour |b(t, Under the above setting, we aim to develop and implement direct variational methods, in the spirit of the stochastic approach of Yong and Zhou [ ] for classical control problems (note that some work in this direction has been already done by Pfeiffer [ , ], in a slightly different setting). Second, we aim to derive a Pontryagin maximum principle for mean-field type control problems of the previous form, with a view towards efficient numerical approximations of optimal controls (e.g. gradient descent).

Organization of the paper
In Section we introduce our assumptions on the coefficients and give the main results. Section is devoted to the well-posedness of the main optimal control problem (Theorem . ). In Section , we show the corresponding maximum principle (Theorem . ). Finally, Section will be devoted to numerical examples.

Preliminaries . Notation and settings
In the whole manuscript, we consider an arbitrary but finite time horizon T > 0. We fix a dimension d ≥ 1, and denote the scalar product in R d by ·, · . If A, B are matrices of the same size, we shall also write A, B for their scalar product, namely where A † is the transposed matrix, and tr the trace operator. For a continuously differentiable function f : R d → R, we adopt the suggestive notation f x to denote its Jacobian (seen for each x ∈ R d as an element of the dual of R d ). Given h ∈ R d , we let be the evaluation of f x (x) at h. A similar convention will be used for vector-valued functions. Throughout the paper, we fix a complete filtered probability space (Ω, F, (F t ) t∈[0,T ] , P) carrying an m-dimensional Wiener process (W t ) t∈[0,T ] . Given p ∈ [1, ∞) and a p-integrable random variable X, we denote its usual L p -norm by X p := E(|X| p ) 1/p . We further introduce the spaces For m ∈ N, the notations S 2,d×m , H 2,d×m will also be used to denote the corresponding sets of d × m matrix-valued processes. Whenever clear from the context, we will omit to indicate dimensions and write S 2 or H 2 instead.
We will denote by P(R d ) the set of all probability measures on (R d , B(R d )). For p ∈ [1, ∞), µ ∈ P(R d ) we define the moment of order p: and we let P p (R d ) := {µ ∈ P(R d ) : M p (µ) < ∞}. By W p , p ∈ [1, ∞), we denote the usual p-Wasserstein distance on P p , that is for µ, ν ∈ P p (R d ) where Π(µ, ν) denotes the set of probability measures on R d × R d with µ and ν as respective first and second marginals (we refer to [ , Chap. ] for a thorough introduction to the subject). Moreover, we recall the following elementary but useful consequence of the previous definition. Let µ, ν be in P p , and assume that there are random variables X, Y on (Ω, F, P) such that X ∼ µ and Y ∼ ν. Then, it holds Finally, whenever f : P 2 → R is continuously L-differentiable at some µ ∈ P 2 , we write f µ (µ)(x) to denote its Lions derivative at the point (µ, x) ∈ P 2 × R d . In keeping with the notation ( ) on differentials, we will let f µ (ν)(x) · h be its evaluation (as an element of the dual of R d ) at h ∈ R d .
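In dimension one, W_p between two empirical measures with the same number of atoms reduces to a comparison of sorted samples (the monotone coupling is optimal among pairings), which also makes the coupling inequality W_p(µ, ν)^p ≤ E|X − Y|^p recalled above easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(2)

def wasserstein_p(xs, ys, p=2):
    # For two empirical measures on R with equally many atoms, the sorted
    # (monotone) pairing minimizes sum |x_i - y_{sigma(i)}|^p over pairings.
    xs, ys = np.sort(xs), np.sort(ys)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

x = rng.normal(0.0, 1.0, 1000)   # sample of mu ~ N(0, 1)
y = rng.normal(0.5, 1.0, 1000)   # sample of nu ~ N(0.5, 1)
w2 = wasserstein_p(x, y, p=2)

# Coupling inequality: ANY pairing with the right marginals gives an upper
# bound; here the naive (unsorted) pairing.
naive = (np.mean(np.abs(x - y) ** 2)) ** 0.5
```

For two Gaussians with equal variance, W_2 equals the distance between the means (here 0.5), while the naive pairing overestimates it substantially.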

. Controlled dynamics and cost functional
Our controlled dynamics will be given by a McKean-Vlasov type SDE (state equation) of the form ( ), where X 0 ∈ L r (Ω, F 0 , P; R d ) for some fixed r ≥ 6 and α is an admissible control, i.e. for some convex set A ⊂ R k and some constant K > 0 fixed throughout the paper, In the whole manuscript, we assume that we are given continuous running and terminal cost functions which have quadratic growth in the following sense: there exists C > 0 such that for all t ∈ [0, T ], |g(x, µ)| ≤ C(1 + |x| + M 2 (µ)) 2 .
We will then consider the cost functional

. Level set constraint
A formal application of Itô Formula reveals that the constraint is preserved along the flow of the state equation associated with a network of FitzHugh-Nagumo neurons. This is of course coherent with the intuition that y is a fraction of open channels. In other words, we have π(X) ≤ 0 where π : R 3 → R is the map x = (v, w, y) → y(y − 1). Motivated by this example, we will assume in the sequel that we are given a convex function π ∈ C 2 (R d , R) such that any solution X is supported in C ⊂ R d for all times, where C is the set We suppose moreover that C contains at least one element, which for convenience is assumed to be 0. To ensure that the constraint is preserved, we need to assume that π(X 0 ) ≤ 0, P-almost surely. Furthermore we need to make the following compatibility assumptions on π : R d → R.
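For the FitzHugh-Nagumo gating variable, the invariance of C = {π ≤ 0} can be checked directly: on the boundary {y = 0} ∪ {y = 1} the drift of y points inward and the noise coefficient vanishes (χ is supported in (0, 1)), so the Itô generator applied to π is nonpositive there. A numerical check, with illustrative constants and a hypothetical bump function χ:

```python
import numpy as np

# Invariance condition for pi(x) = y(y - 1): with gating drift
#   mu_y(v, y) = a_r * S(v) * (1 - y) - a_d * y
# and pi_y = 2y - 1, we need pi_y * mu_y <= 0 on {y = 0} and {y = 1},
# while the noise term vanishes there because chi is supported in (0, 1).
a_r, a_d = 1.0, 0.5  # illustrative rise/decay rates

def S(v, T_max=1.0, lam=0.2, V_T=2.0):
    return T_max / (1.0 + np.exp(-lam * (v - V_T)))

def chi(y):
    # Smooth bump supported in (0, 1); vanishes at the boundary.
    return np.where((y > 0) & (y < 1),
                    np.exp(-1.0 / np.maximum(y * (1 - y), 1e-300)), 0.0)

def mu_y(v, y):
    return a_r * S(v) * (1.0 - y) - a_d * y

v_grid = np.linspace(-5.0, 5.0, 101)
gen_at_0 = -1.0 * mu_y(v_grid, 0.0)   # pi_y = -1 at y = 0
gen_at_1 = +1.0 * mu_y(v_grid, 1.0)   # pi_y = +1 at y = 1
```

Both arrays are nonpositive for every membrane potential on the grid, and χ vanishes at y = 0 and y = 1, which is exactly the structure exploited in the invariance proof below.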
The polynomial P (y) has discriminant (q − b) 2 , hence real roots, which both lie in the interval (0, 1). It follows that P (y) is negative outside C, implying ( ).

. Regularity assumptions and main results
Besides Assumption . , one needs to make suitable hypotheses on the regularity of the drift and diffusion coefficients. In the sequel, we denote by P C 2 (R d ) the subset of all probability measures in P 2 (R d ) which are supported in C = π −1 ((−∞, 0]).

Assumption . (MKV Regularity). We assume that the coefficients
are locally Lipschitz. Moreover, there are constants L 1 , L 2 , L 3 > 0 such that the following properties hold.
( ) In addition, b satisfies the following Lipschitz property with respect to the Wasserstein distance: Example . (Analysis of the FitzHugh-Nagumo model). Let us go back to the settings of ( )-( ) for a coupled system of FitzHugh-Nagumo neurons. Trivially, one has sup 0≤t≤T |σ(t, 0, δ 0 , 0)| = |σ ext | < ∞. The map v → S(v) being positive and bounded, we further see that the (3, 3)-th entry of σ is Lipschitz, as deduced immediately from the fact that χ is supported in (0, 1). For the remaining non-trivial component, we have where to ease notation we introduce the barycenter β(µ), defined as the quantity ( ) The condition Supp µ ⊂ C implies trivially that |β(µ)| ≤ 1 and thus we obtain ( ) for L 1 = (V rev J) ∨ 1. The Lipschitz-type property ( ) is shown in a similar fashion. The Wasserstein-type regularity ( ) is hardly more problematic: using the Kantorovich duality Theorem [ , Prop. . & Cor. . ] and the fact that the projector z = (z 1 , z 2 , z 3 ) → z 3 is Lipschitz, one finds that As is classical, the 1-Wasserstein distance W 1 (µ, µ ) can be estimated by W 2 (µ, µ ), which in turn implies ( ), and thus (L ). As for the drift coefficient, since b(t, 0, δ 0 , 0) is also independent of t, the supremum condition in (L ) is clear. Moreover, it depends polynomially on the variables v, w, y, which implies the local Lipschitz property ( ) with q = 3. We also have and we conclude by ( ) that (L ) holds.
To show ( ) and ( ), it is enough to prove the corresponding bounds when c = 0 = b, since the related contributions are affine linear in the variables. Similarly, by linearity we can let w = α = 0. But in that case, it holds Observe that, since µ is supported inside C, one has in particular β(µ) ≥ 0. Consequently, the fourth term in the right hand side can be ignored, showing ( ) with the indicated constant. We assume in addition that the coefficients are weakly sequentially continuous.
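The boundedness and global Lipschitz continuity of S used in this example are easy to verify numerically: for the logistic form S(v) = T_max/(1 + e^{−λ(v−V_T)}) (one common parameterization; constants below are illustrative) the derivative is maximal at v = V_T, where it equals T_max·λ/4, so this is a global Lipschitz constant.

```python
import numpy as np

# Finite-difference check that S is bounded by T_max and Lipschitz with
# constant T_max * lam / 4. Constants are illustrative.
T_max, lam, V_T = 1.0, 1.0, 2.0

def S(v):
    return T_max / (1.0 + np.exp(-lam * (v - V_T)))

v = np.linspace(-50.0, 50.0, 200001)
slopes = np.abs(np.diff(S(v)) / np.diff(v))
```

Each finite-difference slope is an average of the derivative over a small interval, hence bounded by the supremum T_max·λ/4, and the maximum slope is attained near v = V_T.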
Remark . . The continuity and convexity of f (t, x, µ, ·) lead to weak lower semicontinuity of the map We can now present our main results. At first, we investigate the existence of an optimal control for the following problem subject to . Under assumptions . -. , the problem (SM) is finite and has an optimal control. Namely, inf α∈A J(α) < ∞ and there is α ∈ A such that In order to address the corresponding maximum principle, we now introduce further assumptions on our coefficients.
Assumption . (Pontryagin Principle). The coefficients b, σ, f and g are continuously differentiable with respect to (x, α) and continuously L-differentiable with respect to µ ∈ P 2 (R d ). Furthermore there exist A 1 , A 2 , A 3 > 0 such that: where q is the same constant as in (L ).

are all bounded in norm by
Example . . Again, we investigate the above properties for the setting of a FitzHugh-Nagumo neural network. The property (A ) depends on the choice of f and g, hence we do not discuss it here (it is however clear for the ansatz ( ) below). Concerning assumptions (A ) and (A ), we have hence the first estimate. Letting as before β(µ) :=´R 3 z 3 µ(dz), it is easily seen by definition of the L-derivative that In a matrix representation, this gives the following constant value for the L-derivative of the drift coefficient at a given point showing the desired property.
Next, we introduce the corresponding adjoint equation, which will be essential for the maximum principle. For a solution X ∈ S 2,d of ( ) consider the following backward SDE ( ) where the tilde variables X̃, P̃ are independent copies of the corresponding random variables (carried on some arbitrary probability space (Ω̃, F̃, P̃)), and Ẽ denotes integration in Ω̃ (this convention will be adopted throughout the paper). Herein, we recall that σ(t, x, µ, α), q is a synonym for tr(σ(t, x, µ, α) † q).
A pair of processes (P, Q) ∈ H 2,d × H 2,d×m will be called a solution to the adjoint equation corresponding to X if it satisfies ( ) for all t ∈ [0, T ], P-almost surely.
We are now in position to formulate the maximum principle. For that purpose, we introduce the Hamiltonian, which for each x, y, p ∈ R d , q ∈ R d×m , µ ∈ P 2 and α ∈ A, is the quantity Theorem . . Let assumptions . -. hold. Let α ∈ A be an optimal control for the problem (SM). If (P, Q) ∈ H 2,d × H 2,d×m is the solution to the corresponding adjoint equation, then we have for Lebesgue-almost every t ∈ [0, T ] It should be noticed that in contrast to the maximum principle stated in [ , Thm. . p. ], the one here is formulated in terms of the expectation for almost every t ∈ [0, T ] instead of dt ⊗ P-almost everywhere, since we only consider deterministic controls and thus we only alter the control in deterministic directions.
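To make the objects in the theorem concrete, here is the Hamiltonian H = ⟨b, p⟩ + ⟨σ, q⟩ + f for an illustrative scalar model with linear control and quadratic running cost (these are not the paper's coefficients), together with a numerical check of its convexity in α via the midpoint inequality:

```python
import numpy as np

# Hamiltonian H(x, mu, p, q, alpha) = b * p + sigma * q + f for the toy model
#   b = -x^3 + mean(mu) + alpha,  sigma = s0 (constant),  f = x^2 + alpha^2.
s0 = 0.3

def hamiltonian(x, mean, p, q, alpha):
    b = -x**3 + mean + alpha
    sigma = s0
    f = x**2 + alpha**2
    return b * p + sigma * q + f

# H is affine plus quadratic in alpha, hence convex in alpha:
# check H((a+b)/2) <= (H(a) + H(b)) / 2 on a grid of control values.
alphas = np.linspace(-2.0, 2.0, 9)
vals = hamiltonian(1.0, 0.2, -0.5, 0.1, alphas)
mid = hamiltonian(1.0, 0.2, -0.5, 0.1, 0.5 * (alphas[:-1] + alphas[1:]))
```

Convexity of H in the control is precisely what turns the first-order condition of the maximum principle into a genuine minimality statement for the optimal control.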

Well-Posedness of the Optimal Control Problem
The main purpose of this section is to prove the existence of an optimal control for the stated control problem. For that purpose, we will need to show (among other results) that the state equation ( ) is well-posed, and that the solution satisfies uniform moment bounds up to a certain level. Hereafter, we suppose that assumptions . , . and . are fulfilled.

. Well-posedness of the State equation
Our first task is to show that the level-set constraint which was alluded to in Section . is preserved along the flow of solutions. This statement is contained in the next result. The proof is partially adapted from that of [ , Prop. . ].

Lemma . . For every α ∈
then from Assumption (L ) we see that σ µ is Lipschitz, while (L ) and (L ) imply the local Lipschitz continuity and the monotonicity of the drift coefficient b µ . Hence, by standard results on monotone SDEs (see for instance [ , Thm. . p. ]) ( ) has a unique strong solution, this solution being progressively measurable and square integrable. This proves our assertion.
In order to show ( ), consider a family (Ψ ε ) ε>0 of non-negative and non-decreasing functions in C 2 (R) which for all ε > 0 satisfy: and such that Ψ ε converges pointwise to 1 (0,∞) as ε → 0. Let τ n := inf{t ≥ 0 s.t. |X t | ≥ n}. By Itô Formula, we have for each n ≥ 0 and ε > 0 where we let M t := m k=1´τ n∧t 0 π x (X s ) · σ ·,k (s, X s , µ s , α s )Ψ ′ ε (π(X s ))dW k s . Since Ψ ε is supported on the positive real axis, only the values of X which satisfy π(X) > 0 contribute to the above expression. Hence, making use of Assumption . , we see that the first term in the previous right hand side is bounded above by 0, while the last two terms simply vanish. We arrive at the relation Letting first n → ∞, and then ε → 0, we observe by Fatou Lemma that E sup t∈[0,T ] 1 (0,∞) (π(X t )) = 0, and our claim follows.
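One admissible family (our construction; the paper's exact requirements on Ψ ε may differ) is obtained by rescaling a fixed C² profile:

```latex
% Set \psi(x) = 0 for x \le 0 and \psi(x) = 1 - e^{-x^3} for x > 0.
% Then \psi \in C^2(\mathbb{R}) with \psi(0) = \psi'(0) = \psi''(0) = 0,
% \psi'(x) = 3x^2 e^{-x^3} \ge 0, and we define
\Psi_\varepsilon(x) := \psi\!\left(\frac{x}{\varepsilon}\right),
\qquad \varepsilon > 0.
% Each \Psi_\varepsilon is nonnegative, nondecreasing, of class C^2 and
% vanishes on (-\infty, 0]; for fixed x > 0,
% \Psi_\varepsilon(x) = 1 - e^{-(x/\varepsilon)^3} \longrightarrow 1
% as \varepsilon \to 0, so \Psi_\varepsilon \to \mathbf{1}_{(0,\infty)}
% pointwise, as required.
```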
We are now able to prove the existence of a unique solution to equation ( ).
Theorem . . There exists a unique strong solution to equation ( ) in S 2 , which is supported in C for all times. Furthermore, for each p ∈ [2, r] and every α ∈ A, the solution satisfies the moment estimate where the constant C depends only upon the indicated quantities.
Proof. Recall that P C 2 denotes the set of probability measures in P 2 (R d ) which are supported in C := π −1 ((−∞, 0]). Equipped with the standard Wasserstein distance, it is a closed subset of P 2 (R d ). Indeed, it is standard (see for instance [ ]) that given probability measures {µ n , n ∈ N} and µ such that µ n ⇒ µ, the support condition Supp µ n ⊂ C passes to the limit, so that our claim follows. Thus, for fixed α ∈ A, we can rightfully consider the operator where X µ = X α,µ is the unique solution to equation ( ). Using similar arguments as in [ ], the existence of a unique solution to ( ) follows if one can show that Θ has a unique fixed point. In fact, we are going to show that it is a contraction (for a well-chosen metric). The moment estimate ( ) will follow from the fixed point argument, provided one can show that where the displayed constant depends on the indicated quantities but not on the particular element µ in C([0, T ]; P C 2 ). We now divide the proof into two steps.
Step : moment bounds. Itô Formula gives where N t :=´t 0 |X µ | p−2 X µ , σ(s, X µ , µ, α)dW s is the corresponding martingale term. Denoting by κ > 0 the constant in the Burkholder-Davis-Gundy Inequality, the latter is estimated thanks to ( ) and the Cauchy-Schwarz Inequality as But from Young's inequality, the previous right hand side is also bounded by Taking the expectation in ( ), we infer from ( ), ( ), Young's inequality ab ≤ (2/p)a^{p/2} + ((p−2)/p)b^{p/(p−2)} and the previous discussion that for some universal constant C p > 0. Applying Gronwall Inequality, we obtain the desired moment estimate.
Step : the fixed point argument. From Lemma . , it is clear that for all t ∈ [0, T ], the probability measure P • (X µ t ) −1 is supported in C. For simplicity, let L := L 1 ∨ L 2 ∨ L 3 and introduce the weight Then, Itô Formula gives ( ) The first term in the right hand side of ( ) is evaluated thanks to ( ). For the second term, we use the quadratic growth assumption ( ). As for the Itô correction, we can estimate it similarly, using this time Assumption (L ).
Taking expectations, supremum in t, then absorbing to the left yields Using the estimate ( ) with p = 2, the fact that exp(−2T L) ≤ φ ≤ 1, and the basic inequality ( ), we arrive at The contractivity now follows by considering the k-th composition of the map Θ, for some k ∈ N large enough, and the result then follows from the Banach fixed point theorem.
We now investigate some regularity of the control-to-state operator, which will be needed in the proof of the optimality principle.
Lemma . . For p ∈ [2, r], the solution map G : A → S p ∩ S 2 , α → X α is well-defined and Lipschitz continuous. More precisely, there exists a constant C(L 1 , L 2 , L 3 , T, K) > 0 (here K is the constant associated to A through ( )), such that for all α, β ∈ A E sup Proof. That G is well-defined follows immediately from Theorem . . Towards Lipschitz continuity, the property is shown by similar considerations as in the proof of Theorem . . Indeed, fixing α, β ∈ A and letting M be the martingale M t :=´t 0 σ(s, X α , L(X α ), α) − σ(s, X β , L(X β ), β), (X α − X β ) dW , then using Itô Formula with assumptions (L ), (L ) and (L ), we arrive at Letting κ > 0 be the constant in the BDG inequality, the estimate ( ) and ab ≤ a 2 /4 + b 2 yield The result now follows from the uniform bound ( ), together with Gronwall Lemma.
Remark . (Fokker-Planck equation). Given the settings of Example . , we define If we assume that the solution to the corresponding mean-field equation has a density p(t, x) with respect to the 3-dimensional Lebesgue measure, then the McKean-Vlasov-Fokker-Planck equation is given by the nonlinear PDE: It is degenerate parabolic because the matrix σσ † is not strictly positive.

Proof of Theorem .
We now prove the existence of an optimal control for ( ). The strategy we use follows the so-called "direct method" of the calculus of variations. As a trivial consequence of the assumptions made in Section . and the uniform estimate ( ), note at first that our control problem is indeed finite. Next, consider a sequence (α n ) n∈N ⊂ A realizing the infimum of J asymptotically, i.e. Since A ⊂ L 2 ([0, T ]; R k ) is bounded and closed, by the Banach-Alaoglu Theorem there exists an α ∈ L 2 ([0, T ]; R k ) and a subsequence also denoted by (α n ) n∈N , such that α n ⇀ α weakly in L 2 (0, T ; R k ).
Since A is also convex, we get α ∈ A, so α is indeed an admissible control. We now divide the proof into four steps.
Step : tightness. In the sequel, we denote by X n the solution of the state equation ( ) with respect to the control α n , n ∈ N. Adding and subtracting in ( ), we have where κ > 0 is the constant in the BDG inequality. Using the assumptions (L ), (L ), (L ), the fact that 0 ∈ C and the basic inequality ( ), we obtain that (1 + |X n r | q−1 + |α n r | q−1 + M 2 (L(X n r )) 2 (|X n r | + |α n r |)dr (1 + |X n r | 2 + |α n r | 2 )dr 2 2 .
Using Hölder Inequality, our assumption that 4 ≤ 4q ≤ r together with Young Inequality ab ≤ q−1 q a q q−1 + 1 q b q , we arrive at the following estimate, for all n ∈ N and 0 ≤ s ≤ t ≤ T where the above constant depends upon the indicated quantities, but not on n ∈ N.
Making use of the uniform estimate ( ), the Kolmogorov continuity criterion then asserts that the sequence of probability measures (P • (X n ) −1 ) n∈N , defined on the space is tight. In the same way, we can prove that the sequence of probability measures (P n ) n∈N := is tight on the product space E × E, with respect to the product topology. Thus by Prokhorov's theorem there exists a subsequence of (P n ) n∈N , which converges weakly to some probability measure P * on E × E.
Step : passage to the limit in the drift. By Skorokhod's representation theorem we can then find random variables X, B, (X n ) n∈N , (B n ) n∈N defined on some probability space (Ω, F, P) and with values in E × E such that • P • (X n , B n ) −1 = P n for all n ∈ N and P • (X, B) −1 = P * and • lim n→∞ (X n , B n ) = (X, B), P-almost surely with respect to the uniform topology.
From ( ) and by the definition of A we get for any p ≤ r E sup 0≤t≤T |X n t | p ≤ C(p, X 0 p , L 1 , L 2 , L 3 , K), for some constant independent of n. Thus we can conclude by the dominated convergence theorem that as n → ∞. This also implies (L(X t )) t∈[0,T ] ⊂ P C 2 , since P C 2 is closed. To identify the almost sure limit B, we first claim that for each t ∈ [0, T ] Likewise, for h ∈ L 2 (Ω; R d ) we have by Assumption . and dominated convergence as n → ∞, thus proving our claim. The desired identification then follows from ( ), the Banach-Saks theorem and the uniqueness of the almost sure limit. The processes B and´· 0 b(s, X s , L(X s ), α s )ds being both continuous pathwise, they are indistinguishable, hence the identity for all t ∈ [0, T ], P-almost surely.
Step : identification of the martingale. Letting σσ † (t, x, µ, α) := σ(t, x, µ, α)σ(t, x, µ, α) † for short, similar arguments as above show that Since the process is, for each n, a martingale under P with respect to G n t := σ(X n s | s ≤ t), we can conclude that the limit process is a G t -martingale. By the martingale representation theorem we can find an extended probability space (Ω,F, (F t ) t∈[0,T ] ,P) with an m-dimensional Brownian motion Ŵ , such that the natural extension X̂ of X satisfies Step : end of the proof. It remains to show that the infimum is attained for α. Due to the uniqueness of equation ( ), we have P • (X α ) −1 = P̂ • (X̂) −1 . Using Fatou's lemma, the continuity of f, g, Assumption . and Remark . , we obtain This shows that α has the desired properties, and hence the proof is finished.

The maximum principle: proof of Theorem .
In this section, it will be assumed implicitly that assumptions . , . , . and . hold. Hereafter, we let (Ω,Ã,P) be a copy of the probability space (Ω, A, P). The corresponding expectation map will be denoted byẼ.

. Gâteaux differentiability
In this subsection we aim to complete Lemma . by showing the Gâteaux-differentiability of the control-to-state operator The Gâteaux derivative of the solution map will be given by the solution of a mean-field equation with random coefficients. We will deal with this problem in a similar fashion as is done in [ , Thm. . p. ].
Lemma . . The solution map G is Gâteaux-differentiable. Moreover, for each α ∈ A, its derivative in the direction β ∈ A is given by the process Z = Z α,β , which is characterized as the unique solution to Proof. We will start by showing that ( ) has a unique solution. For that purpose, we define where p 1 denotes the projector onto the first d coordinates, namely Clearly, if µ n t is a sequence converging weakly to µ t for every t ∈ [0, T ], the constraint µ n t • p −1 1 = L(X t ), ∀t remains true for µ itself. Since the Wasserstein distance metrizes the weak topology, we see that R is closed in C([0, T ]; P 2 (R d × R d )). Next, define we first need to check the existence of a unique solution V . But letting we have the following properties: for all t ∈ [0, T ] and P-almost every ω. The first estimate is a result of Assumption . and the fact that P(X t ∈ C, ∀t) = 1. The second estimate follows from together with the continuity of t →´R d ×R d |y|µ t (dx × dy), and the uniform estimate ( ). Using ( ) we get with similar arguments for all t ∈ [0, T ], P-almost every ω. It follows then by classical SDE results that ( ) is well-posed. Moreover, adapting the arguments yielding the moment estimates of Theorem . , it is shown mutatis mutandis that for 2 ≤ p ≤ r E sup 0≤t≤T |V t | p < ∞.
We now aim to prove that Ψ is a contraction, but for that purpose it is convenient to introduce another (stronger) metric. For any µ, ν ∈ P 2 (R where Λ(µ, ν) is the set of all probability measures m on (R d ) 3 such that for any A, B ∈ B(R d ) That d is stronger than W 2 can be seen as follows. If m is any element in Λ(µ, ν), one can define where δ x is the Dirac mass centered at x. Clearly, ρ belongs to the set of transport plans Π(µ, ν) between µ and ν, so that in particular Then, taking the infimum over all such m yields our conclusion. Next, let m ∈ Λ(µ, ν). Using the marginal condition on m, we have Thus, Since m is arbitrary, we obtain and a similar result can be shown for Σ µ . Now, if we equip R with a metric δ inherited from d, for instance δ(µ, ν) := sup t∈[0,T ] e −γt d(µ t , ν t ) for γ > 0 large enough, the proof that Ψ is a contraction follows with simple arguments. Since it is similar to the proof of Theorem . , we omit the details.
Let now α, β ∈ A and ε > 0 small enough, such that α + εβ ∈ A. By X we denote the solution of ( ) with respect to α and by X ε we denote the solution to ( ) with respect to α + εβ. Furthermore for λ ∈ [0, 1] we introduce X λ,ε := X + λ(X ε − X) and α λ,ε := α + λεβ. Note that, since π is convex, we have Next, by Lemma . we get E sup Thus, we can conclude that X λ,ε −→ X as ε → 0 in L 2 (Ω, C([0, T ]; R d )), uniformly in λ. By a simple Taylor expansion we get where, given ϕ = ϕ(t, x, µ, α)(x) we use the shorthand notation with the convention that the last input is ignored whenever ϕ does not depend on the tilde variable. Similarly, we have By Itô formula, ( ) and Assumption . , we get By Young Inequality, Jensen Inequality and assumption (A ) we have Since ε > 0 is chosen in a way that α + εβ ∈ A, we can conclude by the a priori bound ( ) and the definition of A, that E sup for some constant C(T, K, X 0 p ) > 0 which does not depend on ε. By the Burkholder-Davis-Gundy inequality, Young and Jensen inequalities we arrive at for a constant C > 0 which does not depend on ε, and where I 4 , I 5 , I 6 are analogues for σ. We will only show I 1 → 0 as ε → 0, the other terms being handled by similar arguments. By assumption (A ) we have Furthermore we have for any p ≤ r that is bounded from above by some constant that does not depend on ε, for ε > 0 small enough. Then, by the a priori bound ( ), the estimate E sup t∈[0,T ] |Z t | 4 < ∞, the continuity of b x and the dominated convergence theorem, one concludes that I 1 → 0 as ε → 0. Similar arguments combined with Gronwall's lemma finish the proof.
As an important consequence, we obtain the following formula for the Gâteaux derivative of the cost functional. Given Lemma . , the next result is proven in the same way as is done in [ ], and its proof is thus omitted.

Corollary . . The cost functional
is Gâteaux differentiable and its Gâteaux derivative at α ∈ A in direction β ∈ A is given by
Thus, given a control α ∈ A, one sees that the pair (P, Q) ∈ S 2,d × H 2,d×m solves the adjoint equation if and only if for all t ∈ [0, T ], P-almost surely ( ) where (X̃, P̃, Q̃, α̃) is an independent copy of (X, P, Q, α) on the space (Ω̃, F̃, P̃).
Let us point out that the above coefficients fail to satisfy [ , Assumption MKV SDE, Chap. ]. Hence, we first need to address the solvability of the BSDE ( ) under the assumptions of Theorem . .

Lemma . .
Under the assumptions of Theorem . , there exists a unique solution (P, Q) ∈ S 2 × H 2,d×m of ( ).
For γ large enough, this shows that Γ is a contraction, and the conclusion follows.
The following corollary follows immediately by integration by parts and an application of Fubini's theorem. We therefore omit the proof and refer to [ , Lemma . , p. ].
Corollary . . Let (P, Q) be a solution to ( ), then it holds

Remark . . An immediate consequence of ( ) is the following formula for the Gâteaux derivative of the cost functional An application of Fubini's theorem then leads to the following representation for the gradient of J: ( ) The formula ( ) is of fundamental importance for numerical purposes; see Section below.
We are now in position to prove the maximum principle.
Proof of Theorem . . Let α ∈ A be an optimal control for (SM), X the corresponding solution to ( ) and (P, Q) the associated solution to ( ). For β ∈ A we have, by the optimality of α, Invoking the convexity of the Hamiltonian (see Assumption . ), we get For an arbitrary measurable set C ⊂ [0, T ] and ᾱ ∈ A we can define the admissible control and deduce that the minimality condition holds dt-almost everywhere. This proves the theorem.

Numerical examples
In this section we focus on the FitzHugh-Nagumo model with external noise only, i.e. the system of 3N stochastic differential equations: We are interested in controlling the average membrane potential (referred to in the following as the "local field potential") of a network of FitzHugh-Nagumo neurons towards a desired state. Our cost functional is given by where (v t ) t is a given reference profile. We should mention that the average membrane potential only gives an idea of the average activity of the network at each time. For example, a high average membrane potential indicates that a large number of neurons are in the regenerative or active phase, while a low average membrane potential means that a large number of neurons are in the absolute refractory or silent phase.
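Given a particle approximation of the network, the local field potential and a quadratic tracking cost can be estimated by Monte-Carlo as in the following sketch. The function names and the exact form of the cost (including the optional control penalty `lam`) are illustrative assumptions; the paper's precise functional is the one displayed above.

```python
import numpy as np

def local_field_potential(v):
    """Average membrane potential over the N particles at each time step.

    v : array of shape (N, T) -- particle trajectories of the voltage variable.
    """
    return v.mean(axis=0)

def tracking_cost(v, v_ref, alpha, dt, lam=0.0):
    """Monte-Carlo estimate of a quadratic tracking cost of the assumed form
        J = int_0^T |lfp(t) - v_ref(t)|^2 dt + lam * int_0^T |alpha(t)|^2 dt,
    discretized by a left-endpoint Riemann sum with step dt.
    """
    lfp = local_field_potential(v)
    return dt * np.sum((lfp - v_ref) ** 2) + lam * dt * np.sum(alpha ** 2)
```

In practice `v` would come from the particle scheme for ( ), and `v_ref` is the sampled reference profile.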
In this case the adjoint equation reduces to In the following section we give a short introduction on how to solve ( ) numerically.

. Numerical approximation of the adjoint equation
In general, we consider the following non-fully coupled mean-field forward-backward SDE (MFFBSDE)

( )
For the approximation of the forward component we consider an implicit Euler scheme for McKean-Vlasov equations. Since this is standard, we do not go into further detail. Concerning the backward component, we consider a scheme similar to the one presented in [ ]. We should mention that since we are not dealing with a fully coupled MFFBSDE, our situation is considerably easier to handle than the one treated in [ ]. For a given discrete time grid π : 0 = t 0 < t 1 < ... < t N = T , we consider the following numerical scheme:
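A minimal sketch of such an implicit Euler particle scheme for the forward component, assuming the law enters the drift only through its empirical mean and solving each implicit step by Picard iteration. All names and the choice of fixed-point solver are illustrative, not the paper's exact scheme.

```python
import numpy as np

def implicit_euler_mkv(b, sigma_ext, x0, alpha, dt, n_steps, rng, n_picard=10):
    """Implicit Euler particle scheme for a McKean-Vlasov SDE (sketch).

    b(t, x, m, a): drift evaluated particle-wise; x has shape (N, d),
                   m is the empirical mean (standing in for the law),
                   a is the deterministic control at time t.
    Each implicit step X_{k+1} = X_k + b(t_{k+1}, X_{k+1}, mean, a) dt + noise
    is solved by Picard (fixed-point) iteration.
    """
    x = np.array(x0, dtype=float)               # particles, shape (N, d)
    n, d = x.shape
    path = [x.copy()]
    for k in range(n_steps):
        t_next = (k + 1) * dt
        noise = sigma_ext * rng.standard_normal((n, d)) * np.sqrt(dt)
        y = x.copy()                             # initial guess for X_{k+1}
        for _ in range(n_picard):
            y = x + b(t_next, y, y.mean(axis=0), alpha[k]) * dt + noise
        x = y
        path.append(x.copy())
    return np.array(path)                        # shape (n_steps + 1, N, d)
```

For a linear drift b = −x without noise, the scheme reproduces the implicit Euler recursion X_{k+1} = X_k /(1 + dt).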
Thus we need to minimize A similar approach for BSDEs can be found in [ ]. There is no convergence analysis of this scheme under our assumptions on the coefficients; it is only meant to give an idea of how to solve the adjoint equation in practice. Furthermore, we should mention that in the case where only external noise is present, the duality ( ) and the resulting gradient representation still hold true for any non-adapted solution of ( ). Thus one can also implement a numerical scheme for the adjoint equation without any conditional expectations involved.
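The minimization step above amounts to a least-squares Monte-Carlo regression of conditional expectations onto a finite function basis. A minimal sketch, where the basis and function names are illustrative assumptions:

```python
import numpy as np

def regress_conditional_expectation(x, y, basis):
    """Least-squares Monte-Carlo approximation of E[Y | X] (sketch).

    x : samples of the conditioning variable, shape (N,)
    y : samples of the target, shape (N,)
    basis : list of callables phi_j; E[Y | X = x] is approximated in
            span{phi_j} by minimizing sum_i |y_i - sum_j c_j phi_j(x_i)|^2.
    Returns a callable z -> regression estimate at z.
    """
    a = np.column_stack([phi(x) for phi in basis])   # design matrix (N, L)
    coef, *_ = np.linalg.lstsq(a, y, rcond=None)
    return lambda z: np.column_stack([phi(z) for phi in basis]) @ coef
```

In the backward scheme, `y` would be the next-step value of the adjoint process along the particles and `x` the current forward state.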

. Gradient descent algorithm
We will now briefly sketch our gradient descent algorithm.
Algorithm . . Take an initial control α 0 ∈ A and s 0 > 0, and proceed recursively for n = 0, 1, . . . :
- determine X αn by solving the state equation with an implicit particle scheme (to avoid particle corruption);
- solve the adjoint equation for the given X αn in order to approximate (P αn , Q αn );
- approximate the gradient ∇J(α n ) s = E[ b α (s, X αn s , L(X αn s ), α n s ), P αn s + f α (s, X αn s , L(X αn s ), α n s ) ] via a Monte-Carlo method, where (P αn , Q αn ) solves the adjoint equation;
- update the control in the direction of steepest descent: α n+1 := α n − s n ∇J(α n );
- accept the new control if the corresponding cost is smaller than the previous one; otherwise decrease the step size, s n = s n /2, and go back to the previous step;
- the algorithm stops if ∇J(α n ) < .
To compute the expectation term, one is in fact reduced to simulating the solution of the network equation itself and using the particles as samples for the Monte-Carlo simulation.
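The loop above can be sketched as follows, with `cost` and `gradient` standing for the Monte-Carlo estimates obtained from the particle and adjoint schemes (hypothetical callables); the step-size halving and stopping rule follow the algorithm:

```python
def gradient_descent(alpha0, cost, gradient, s0=1.0, tol=1e-6, max_iter=100):
    """Gradient descent with step-size halving (sketch of the algorithm above).

    cost(alpha) and gradient(alpha) are placeholders for the Monte-Carlo
    estimates of J and grad J; alpha is the time-discretized control.
    """
    alpha, s = alpha0, s0
    j = cost(alpha)
    for _ in range(max_iter):
        g = gradient(alpha)
        if max(abs(gi) for gi in g) < tol:       # stopping criterion
            break
        candidate = [a - s * gi for a, gi in zip(alpha, g)]
        j_new = cost(candidate)
        if j_new < j:                            # accept the new control
            alpha, j = candidate, j_new
        else:                                    # otherwise halve the step
            s /= 2.0
    return alpha
```

On a simple quadratic cost, the iteration converges to the unique minimizer, which is a quick sanity check for the bookkeeping.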

. Numerical examples for systems of FitzHugh-Nagumo Neurons
Although the solution to the adjoint equation is a 3-dimensional process, in the following we only plot its first component, since the other components are irrelevant for the gradient in our situation.
To illustrate some problems we encountered in the simulations, we consider the example of the deterministic uncoupled case of equation ( ), where J = 0 and σ ext = 0. In this situation the membrane potential v becomes highly sensitive to small perturbations of the control at specific times when we choose the control α t ≡ α close to the value of the supercritical Hopf bifurcation point of the equation. This sensitivity can lead to large-valued solutions of the corresponding adjoint equation for specific reference profiles. One example is to choose the reference profile as the v-trajectory of a solution to ( ) for a control parameter α in the limit cycle regime. This situation is illustrated by the figures below. In this example and in the following, the initial states are uniformly distributed on the orbit of a solution to ( ) with α ≡ 0, σ ext = 0 and initial conditions v 0 = −0.828, w 0 = −0.139, y 0 = 0.589. The other parameters are given below in Table . Furthermore, we always use N = 1000 particles for the particle approximation of ( ).
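The qualitative change at the Hopf bifurcation can be reproduced with the classical deterministic FitzHugh-Nagumo equations. The following sketch integrates the uncoupled system (J = 0, σ ext = 0) by explicit Euler; the functional form v' = v − v³/3 − w + α, w' = c(v + a − bw) and the parameters a, b, c are illustrative textbook values, not necessarily those of the paper's Table.

```python
import numpy as np

def fhn_uncoupled(alpha, v0=-0.828, w0=-0.139, dt=0.01, n_steps=20000,
                  a=0.7, b=0.8, c=0.08):
    """Explicit Euler for the deterministic uncoupled FitzHugh-Nagumo model
        v' = v - v^3/3 - w + alpha,   w' = c (v + a - b w),
    returning the v-trajectory. Parameters are illustrative assumptions.
    """
    v, w = v0, w0
    vs = np.empty(n_steps + 1)
    vs[0] = v
    for k in range(n_steps):
        # simultaneous update (right-hand sides use the old state)
        v, w = (v + dt * (v - v ** 3 / 3.0 - w + alpha),
                w + dt * c * (v + a - b * w))
        vs[k + 1] = v
    return vs
```

For a subthreshold constant input the trajectory settles at the resting potential, while past the bifurcation it enters the limit cycle regime with large-amplitude oscillations; the cited reference-profile construction corresponds to the latter case.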

. . Control of a coupled system of FitzHugh-Nagumo Neurons
For our first example, we consider a parameter regime where the activity of a large number of neurons of the network at some time t leads to further activity at a later time, without any external current applied to the system. To this end we slow down the gating variable by decreasing the closing rate of the synaptic gates. This way its impact on the network remains high enough when a large part of the network becomes excitable again.
Our goal is now to increase the activity of the network up to time t = 100 and then control the network back into its resting potential. Up to time t = 100, the following reference profile shows the local field potential of a network of coupled FitzHugh-Nagumo neurons when a constant input current of magnitude 0.8 is applied for a time period of ∆t = 7 at t = 0. For times t > 100 it shows the resting potential of a single FitzHugh-Nagumo neuron. We expect the optimal control to raise the membrane potential for a small time period at t = 0 and then to counteract the stimulating effect of the coupling around t = 100. However, these effects should not occur in the uncoupled setting, which we consider afterwards.
The following shows the optimal control and the corresponding optimal local field potential. We recall that this control might only be locally optimal, since we cannot expect to find a globally optimal control with our gradient descent algorithm. Next we investigate the control problem for the uncoupled equation ( ), where J = 0. Since the reference profile is still the same as in Example . . , we only present the corresponding optimal control.

Figure : Optimal control
As expected, the control does not need to counteract any stimulating effects for times t > 100. Furthermore, in the uncoupled case it is not sufficient to apply an input current for a short time period at t = 0 in order to reach the desired local field potential up to time t = 100.