Optimal Control of Nonlocal Continuity Equations: Numerical Solution

The paper addresses an optimal ensemble control problem for nonlocal continuity equations on the space of probability measures. We admit a general nonlinear cost functional and an option to directly control the nonlocal terms of the driving vector field. For this problem, we design a descent method based on Pontryagin's maximum principle (PMP). To this end, we derive a new form of the PMP with a decoupled Hamiltonian system. Specifically, we extract the adjoint system of linear nonlocal balance laws on the space of signed measures and prove its well-posedness. As an implementation of the designed descent method, we propose an indirect deterministic numerical algorithm with backtracking. We prove the convergence of the algorithm and illustrate its modus operandi by treating a simple case involving a Kuramoto-type model of a population of interacting oscillators.


Introduction
Nonlocal continuity equations on spaces of probability measures arise as macroscopic mathematical models of multi-agent dynamical systems describing the time evolution of large ensembles (beams, crowds, swarms, populations, networks) of structurally identical objects (e.g., elementary particles, people, animals, "neurons" of natural or artificial neural networks, etc.). The main idea is to treat the many-particle dynamics as a whole by focusing on its "statistical" behavior, assuming that the agents are homotypic and, therefore, indistinguishable.
Passing to the limit in the number of agents, a large set of individuals (described by a system of many similar ODEs) is replaced by their continual probability distribution, named the "mean field" (driven by a single transport PDE). This idea, rooted in statistical mechanics [28], has been found useful in different areas of applied mathematics such as mathematical biology [20,21,27,39], modeling of pedestrian and urban traffic [25,26,42], mathematical neuroscience [37], and even the theoretical foundations of artificial intelligence [10,40,49,50], just to name a few.
Recent results in analysis on the space of measures, achieved in the works of L. Ambrosio, N. Gigli, J. Lott, F. Otto, F. Santambrogio, G. Savaré, C. Villani, and others, have proved fruitful for mathematical control theory, largely spurred by the variety of mentioned applications and the needs of control engineering. The starting point was the derivation of a mathematically rigorous "mean field limit" of the classical multi-agent optimal control problem [30,31] (see also [13,29]). In the following few years, the cornerstones of the classical optimal control theory, such as Pontryagin's maximum principle (PMP) [7,9,11,12,25,43,45,46] and the dynamic programming method [5,6,24,38], were extended to the area of mean field control.
The mean field PMP, which is the focus of the present paper, was obtained at different levels of generality by various mathematical strategies. A particular version was first derived in [43] for a specific "shepherd's" problem over the local continuity equation, and subsequently for a general linear problem with relaxed controls [45]; these versions of the PMP are mainly reconstructed from the differential properties of flows of the driving vector field by standard analytical methods such as Filippov's lemma. A result of a similar spirit for another particular local problem was recently obtained in [10] as a specification of a more general PMP [12] by an original technique of generalized Lagrange multipliers on the convex subset of Radon measures with unit mass. Notice that, in the local case, the PMP takes the familiar form, as it is formulated in terms of a certain decoupled optimality system with an explicit backward adjoint equation, namely a non-conservative transport PDE.
The first result in this line was obtained in [8] for a particular "bi-level" optimization problem; a natural strategy was to pass to the limit in the usual PMP conditions for conventional control problems obtained by "finite-agent" approximations in Dobrushin's framework. Similar arguments, based on a finite-dimensional approximation and Ekeland's variational principle, were used in [47] to prove an impulsive version of the PMP for a nonlocal transport equation with states being measure-valued curves of bounded variation. For general (non-impulsive) nonlocal transport equations, the PMP was first proved by an appropriate extension of the classical technique of needle-shaped control variations for problems without [12] and with [9] additional state constraints. Another approach, relying on an appropriate linearization of the nonlocal dynamics, was further proposed in [11]. A different method to derive the necessary optimality conditions for mean-field control problems was suggested in [16], exploiting an appropriate generalization of the Karush-Kuhn-Tucker conditions. Also, in [19,34,50], alternative versions of the mean-field PMP were obtained for stochastic optimal control problems.

Numerical solution: mainstream approaches and their pitfalls
The use of existing analytical methods is limited to the simplest mean-field control problems, while the transition of these results to the numerical context is fraught with critical technical difficulties. Here, the PMP would be a promising footing if it were not for a number of significant flaws. The key drawback is due to the mentioned coupling in the Hamiltonian system. The state of such a Hamiltonian equation, a measure on the cotangent bundle of the state space, is always singular, even if the solution of the primal continuity equation, a measure on the state space, has a density. This makes it impossible to solve the Hamiltonian system by standard numerical schemes and, consequently, the existing forms of the PMP do not provide a descent algorithm.
In the finite-dimensional case, a wide range of direct and indirect numerical methods is described in numerous works. For nonlocal continuity equations, the numerical solution of optimal control problems still remains a burning question, which is principal for the transfer of the mean-field control theory to the practice of control engineering. The mainstream approaches are represented by the following two families:
1. Semi-direct (finite-particle) method: approximation of the initial distribution by a discrete measure and transformation of a distributed control system to a high-dimensional ODE. The resulting finite-dimensional control problem is solved directly or using special techniques such as, e.g., "random batch" methods [35].
2. Direct method: total discretization of a nonlocal equation and reduction of a variational problem to mathematical programming.
In practice, both of the mentioned approaches typically lead to unsatisfactory results. The first one returns one to a high-dimensional classical optimal control problem plagued by the "curse of dimensionality"; in fact, this approach rejects the very heart of the mean-field approximation along with all the profits of statistical averaging, drawing us back to the need to keep track of every individual representative of a large population. The second approach leads to a complex (high-dimensional, nonlinear and non-convex) mathematical programming problem, which is not always satisfactorily solved even by commercial solvers. Here, the main difficulty is the presence of nonlocal terms depending on the density distribution over the entire spatial grid, making the computations much more demanding. This feature also leads to a dramatic loss in the efficiency of parallelization, since integration steps require interprocessor communications of the "all-to-all" pattern.
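To make the first family concrete, here is a minimal, purely illustrative sketch (not taken from the paper) of the semi-direct method in one dimension: the empirical measure of N particles replaces µ, and the nonlocal field is assumed to split into a drift f and an interaction kernel K as in (4). The names `particle_step`, `f`, `K` and the explicit Euler discretization are our own choices; note the O(N²) all-to-all interaction sum discussed above.

```python
def particle_step(xs, t, f, K, dt):
    """One explicit Euler step of the finite-particle system
    dx_i/dt = f(t, x_i) + (1/N) * sum_j K(x_i - x_j),
    i.e., the empirical-measure approximation of the nonlocal field.
    The interaction sum is the O(N^2), all-to-all bottleneck."""
    n = len(xs)
    new = []
    for xi in xs:
        drift = f(t, xi)
        inter = sum(K(xi - xj) for xj in xs) / n
        new.append(xi + dt * (drift + inter))
    return new

# toy 1-d example: no external force, linear attraction to the barycenter
xs = [-1.0, 0.0, 2.0]
f = lambda t, x: 0.0
K = lambda r: -r          # K(x - y) = -(x - y) pulls particles together
for _ in range(100):
    xs = particle_step(xs, 0.0, f, K, 0.01)
```

With this kernel the barycenter is preserved while the spread of the cloud contracts, mimicking aggregation.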
In contrast to the classical setting, the bibliography on indirect numerical algorithms for optimal mean-field control is scarce. There are only a few results [2,10,44,48,49], all focusing on particular problems and relying on adequate necessary optimality conditions. The work [44] deals with the so-called "shepherd's problem", where one has to steer a population of non-interacting individuals to a given target set; the proposed numerical algorithm is based on a specific form of the PMP. On the conceptual level, the algorithm of [10] (named in the cited paper a "shooting method") is a variant of the classical Krylov-Chernous'ko algorithm, probably the first indirect algorithm based on the PMP in the history of optimal control. The convergence of the algorithm essentially depends on the convexity of the cost functional and is not guaranteed in general, even in the finite-dimensional case µ t = δ x(t) . An alternative algorithm was proposed in [49] for the linear problem of ensemble control, employing an exact formula for the increment of the cost functional and feedback control variations. In [48], a version of the gradient descent method was constructed for a mean-field optimal control problem over a nonlocal Fokker-Planck-Kolmogorov equation modeling interactions in a Kuramoto-type model: the first variation of the objective functional and the adjoint equation are obtained by a formal Lagrange method due to the model specifics. Finally, to the best of our knowledge, there are no results of this sort for the general µ-nonlinear problem.

Goals, contribution, and organization of the paper
In the present work, we put forth an indirect numerical method for optimal mean-field control. Namely, we design a PMP-based indirect deterministic numerical algorithm with backtracking line search for a class of optimal ensemble control problems involving nonlocal continuity equations in the space of probability measures. The method can be viewed as an adequate version of the classical gradient descent method and demonstrates encouraging results in a series of numerical experiments. To our knowledge, this is the first indirect descent algorithm for mean-field control problems that are nonlinear in the measure.
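As a conceptual illustration of what "descent with backtracking line search" means here, the following toy sketch runs Armijo backtracking on a finite-dimensional surrogate, where `gradJ` stands in for the gradient that an adjoint-based increment formula would deliver. All names and constants are hypothetical; this is not the algorithm of Section 5.

```python
def backtracking_descent(J, gradJ, u0, alpha=0.5, beta=0.5, c=1e-4, iters=50):
    """Generic descent with Armijo backtracking: a finite-dimensional
    caricature of an indirect scheme.  `u` plays the role of the
    (discretized) control, `gradJ` of the adjoint-computed gradient."""
    u = list(u0)
    for _ in range(iters):
        g = gradJ(u)
        step = alpha
        Ju = J(u)
        # shrink the step until the Armijo sufficient-decrease test holds
        while True:
            trial = [ui - step * gi for ui, gi in zip(u, g)]
            if J(trial) <= Ju - c * step * sum(gi * gi for gi in g):
                break
            step *= beta
            if step < 1e-12:
                return u  # stagnation: (approximate) stationarity
        u = trial
    return u

# quadratic test: J(u) = sum(u_i^2), minimizer at the origin
J = lambda u: sum(x * x for x in u)
gradJ = lambda u: [2 * x for x in u]
u_star = backtracking_descent(J, gradJ, [3.0, -2.0])
```

The backtracking loop is what guarantees monotone decrease of the cost along the iterations, which is the property exploited in the convergence analysis of descent methods.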
The derivation of the algorithm is based on a set of new theoretical results, which are of independent interest. First, we derive the linearized form of the original nonlocal transport PDE. In contrast to [11], our arguments apply to nonlocal perturbations of the vector field and, therefore, cover the case when the control is injected into the nonlocal term of the dynamics. As a byproduct, we compute the first variation of the cost functional within the class of weak variations of the control function. Another contribution is a new, equivalent articulation of the PMP, where the Hamiltonian equation on the cotangent bundle of the state space is decoupled into the primal (forward) and dual (backward) parts; the dual system turns out to be a system of nonlocal linear balance laws (continuity equations with sources).
The rest of the paper is organized as follows. A statement of the optimal control problem is presented in Sect. 1.3. Section 2 collects the necessary notation and several noteworthy facts from the topology, analysis, and differential calculus over the space of probability measures. In Sect. 2.6, we introduce the concept of a flow of a nonlocal vector field and calculate a "directional derivative" of the flow along a nonlocal vector field. Sections 3-5 dwell on a simplified version of the stated optimization problem, where the running cost is dropped and the driving vector field is affine in the control variable; this technical simplification is not critical but enables us to shorten the presentation of the main results.
In Section 3, we exhibit two standard representations of the increment of the cost functional. The first one is formulated in the language of flows of nonlocal vector fields, while the second formula is written down in terms of the mentioned Hamiltonian system. In Section 4, noting that neither of these representations is suitable for numerical purposes, we derive a third version of the cost increment, which relies on the notion of the adjoint equation. The corresponding numerical algorithm is presented in Section 5. We study the convergence of the algorithm, discuss certain principal aspects of its technical implementation, and demonstrate its modus operandi by treating a simple but illustrative case, namely, an aggregation problem for a mean-field Kuramoto-type oscillatory model. Finally, in Section 6, the obtained results are extended to the general problem, involving the running cost and a nonlinear dependence on the control variable.
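For readers unfamiliar with the Kuramoto model mentioned above, the following self-contained toy simulation (our own sketch, not the paper's experiment) shows the finite-particle dynamics and the standard order parameter used to quantify aggregation; the coupling strength `kappa`, the common control `u`, and identical natural frequencies are assumptions made for illustration.

```python
import math, cmath

def kuramoto_step(thetas, omegas, u, kappa, dt):
    """Explicit Euler step for the finite-particle Kuramoto model
    d(theta_i)/dt = omega_i + u + (kappa/N) * sum_j sin(theta_j - theta_i),
    with a common control u acting simultaneously on all oscillators."""
    n = len(thetas)
    out = []
    for i, ti in enumerate(thetas):
        coupling = sum(math.sin(tj - ti) for tj in thetas) / n
        out.append(ti + dt * (omegas[i] + u + kappa * coupling))
    return out

def order_parameter(thetas):
    """|r| = 1 means full synchrony (aggregation), |r| ~ 0 incoherence."""
    z = sum(cmath.exp(1j * t) for t in thetas) / len(thetas)
    return abs(z)

# identical frequencies and attractive coupling: the phases synchronize
thetas = [0.0, 1.0, 2.0, 3.0]
omegas = [0.0] * 4
for _ in range(2000):
    thetas = kuramoto_step(thetas, omegas, u=0.0, kappa=2.0, dt=0.01)
```

In an aggregation problem of the kind treated in Section 5, the control would be chosen to drive the order parameter towards 1.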

Problem statement
Given the data V, consider the following optimal control problem (P) on a fixed finite time interval I .= [0, T ]: We assume that control signals are functions t → u(t) of the time variable only and take values in a given set U ⊆ R m , i.e., U .= L ∞ (I; U), where L ∞ is equipped with the weak* topology σ (L ∞ , L 1 ).
Optimization problems of this sort appear in the framework of multi-agent dynamical systems, where the measure µ t represents the spatial distribution of agents at time t. The specified class of controls implies that u acts simultaneously on all agents (one can imagine that we are able to influence a common agents' environment rather than each agent in person). An important example of the nonlocal vector field is the one where f models an external force pushing the agents and K stands for their internal interaction. Typical terminal cost functionals are as follows: ℓ 1 represents the potential (l) and interaction (W ) energy terms, while ℓ 2 is related to the averaged control problem [51], where the goal is to bring the expectation of the distribution µ to some target position m T . Finally, among common versions of the running cost term, L 1 represents the "total energy" of the control action, and L 2 captures the problem of following a desired path t → m(t).
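To fix ideas, the following hedged sketch evaluates discrete analogues of the described costs on an empirical measure (1/N) Σ δ_{x_i} in one dimension; the functions `l`, `W` and the target `m_T` are the objects named in the text, instantiated here with toy choices of our own.

```python
def potential_interaction_energy(xs, l, W):
    """ell_1-type cost on an empirical measure mu = (1/N) sum_i delta_{x_i}:
    the integral of l d(mu) plus the double integral of W d(mu x mu)."""
    n = len(xs)
    pot = sum(l(x) for x in xs) / n
    inter = sum(W(x, y) for x in xs for y in xs) / (n * n)
    return pot + inter

def averaged_target_cost(xs, m_T):
    """ell_2-type cost: squared distance of the ensemble expectation
    (the first moment of mu) to the target position m_T."""
    mean = sum(xs) / len(xs)
    return (mean - m_T) ** 2

# toy instantiation: quadratic potential and quadratic interaction kernel
cost1 = potential_interaction_energy([0.0, 2.0],
                                     l=lambda x: x * x,
                                     W=lambda x, y: (x - y) ** 2)
cost2 = averaged_target_cost([0.0, 2.0], m_T=1.0)
```

Both quantities depend on the measure only through averages, which is precisely why they survive the mean-field limit.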

Preliminaries
In this section, we introduce some notation and recall several useful facts from analysis on the metric space of probability measures.

Notation
Throughout the paper, we use the following notation: • B r ⊂ R n the closed ball of radius r centered at the origin.
• f ♯ µ the pushforward of a measure µ ∈ P(R n ) under a Borel function f : R n → R m .
• spt µ the support of a measure µ.
• M m,n the space of matrices A with m rows and n columns.
Below, we will also deal with vector measures whose values belong to M 1,n , i.e., ν = (ν 1 , . . . , ν n ), where ν 1 , . . ., ν n are Radon measures on R n . Given ϕ : Let X be a Polish space. From measures on X one can construct several important topological spaces: M (X) ⊃ P(X) ⊃ P 2 (X) ⊃ P c (X). Here M (X) consists of all signed Radon measures, P(X) of all probability measures, P 2 (X) of all probability measures with finite second moments, and P c (X) of all compactly supported probability measures. Below, the Wasserstein distance [1] on P 2 (X) is always denoted by W 2 .
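As a small illustration of the pushforward operation f ♯ µ and of the distance W 2 (both used throughout the paper), here is a toy sketch for equal-weight empirical measures on the line, where the optimal transport plan is the monotone pairing; this sorting shortcut is specific to dimension one and is our own illustrative choice.

```python
import math

def pushforward(xs, f):
    """f # mu for an empirical measure mu = (1/N) sum_i delta_{x_i}:
    simply transport each atom by f (masses are unchanged)."""
    return [f(x) for x in xs]

def w2_1d(xs, ys):
    """W_2 between two equal-size empirical measures on the line.
    In 1-d the optimal plan is monotone, so sort both supports and pair."""
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

# translating a measure by c moves it exactly a W_2-distance of |c|
dist = w2_1d([0.0, 1.0], [1.0, 2.0])
```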
Given a Radon measure µ on R n , denote by L p µ (R n ; R m ) the space of all µ-measurable maps (equivalence classes)

2.2 The space P c (R n ) and functions of probability measures

The role of the main arena of our paper will be played by the space P c (R n ) endowed with the so-called final topology.
Definition 2.1 Let (X n , τ n ) be a sequence of topological spaces such that X n ⊂ X n+1 with continuous inclusion for every n. Let X = ∪ n X n . The final topology is the strongest topology τ on X which makes the inclusions id n : X n → X continuous for every n.
In our case, (X n , τ n ) .= (P(B n ), W 2 ) and X .= P c (R n ). The final topology τ on P c (R n ) enjoys the following properties [33]: • τ is a Hausdorff topology, but it is not induced by any distance.
Below, we will constantly deal with mappings Φ : 2. Φ is locally bounded if its restriction to any compact subset of 4. Φ is sublinear if and only if there exists Thanks to the outlined properties of the final topology, the definitions of local boundedness and local Lipschitzianity can be given in the following equivalent way: 2′. Φ is locally bounded if and only if, for any compact Ω ⊂ R n , there exists

Derivatives in the space of probability measures
There are several concepts of the derivative of a function P → R. In this paper, we shall employ the notion of the "intrinsic derivative" [18].
Since δ F δ µ is defined up to an additive constant, we adopt the normalization convention Some important properties of the intrinsic derivative are gathered in the following proposition, which combines the statements of Propositions 2.2-2.4 from [17].
Proposition 2.1 Let δ F δ µ be C 1 in y, and D µ F be sequentially continuous and locally bounded. Then, the following holds: 1. For any Borel measurable, locally bounded map ϕ : R n → R n , the function s → F (id + sϕ) ♯ µ is differentiable at zero, and
3. The quantity δ F δ µ can be calculated as follows: The first property links the intrinsic derivative with a "directional" derivative, where ϕ plays the role of the direction. The second one relates the notion of the intrinsic derivative to the so-called localized Wasserstein derivative [11]: , for any µ ′ ∈ P(Ω) and any transport plan Π between µ and µ ′ . Such a ξ is uniquely defined and called the localized Wasserstein derivative of F at µ.

Recall that the tangent space Tan
Proposition 2.1 says that any C 1 functional on the space of probability measures with a sequentially continuous and locally bounded intrinsic derivative D µ F is locally differentiable at any µ ∈ P c (R n ), and the projection of D µ F(µ, •) onto Tan µ P 2 (R n ) coincides with the corresponding localized Wasserstein derivative.
The third assertion of Proposition 2.1 offers a convenient tool for the practical calculation of the intrinsic derivative. We illustrate this machinery using the following paradigmatic example.
the flat derivative is easily found by a direct computation. Recall another useful fact: Lemma 2.1 Let F be the same as in Proposition 2.1. Then F is locally Lipschitz.
Proof Fix a compact set Ω and two measures µ, µ ′ ∈ P(Ω). Denote by Π an optimal plan between µ and µ ′ and let µ t = (1 − t)µ + tµ ′ . Then, we have The difference in the square brackets is Hence the statement follows from the local boundedness of

Definition 2.6 We say that F : and the intrinsic derivative D µ F is locally Lipschitz and locally bounded.

Nonlocal vector fields and their flows
A time-dependent nonlocal vector field is a map V : I × R n × P c (R n ) → R n . When the measure argument is fictitious, we say that V is a local vector field (or simply a "vector field"). The basic regularity of nonlocal vector fields is understood in the sense of Definition 2.2.
It is well known that local transport PDEs can be studied using their characteristic flows. Recall the following Definition 2.7 We say that a vector field v : is Carathéodory, locally bounded and locally Lipschitz.
Any sublinear C 1,1 vector field v generates a unique continuous map P : I × I × R n → R n named the flow of v t ; this map is defined such that, for each t 0 ∈ I and x ∈ R n , t → P t 0 ,t (x) is a solution of the Cauchy problem For any t 0 , t ∈ I, the map P t 0 ,t : R n → R n is a C 2 diffeomorphism. Moreover, it satisfies the semigroup property: P t 1 ,t 2 • P t 0 ,t 1 = P t 0 ,t 2 for all t 0 , t 1 , t 2 ∈ I.
In fact, the concept of a flow can be extended to the case of nonlocal vector fields. To this end, we modify Definition 2.7 as follows: Definition 2.8 We say that a nonlocal vector field V : 2) V is C 1 in x for each t and µ, and C 1 in µ for each t and x; Carathéodory, locally bounded and locally Lipschitz. Now, observe that any sublinear C 1,1 nonlocal vector field V generates a unique sequentially continuous function X : We abbreviate X ϑ t = X ϑ 0,t and stress that µ t = (X ϑ t ) ♯ ϑ is the unique solution of the nonlocal continuity equation We call the map X the flow of the nonlocal vector field V .
Notice that, for a given ϑ , we can define v t (x) .= V t (x, X ϑ t♯ ϑ ) and denote by P the flow of v. It is clear that X ϑ 0,t = P 0,t . We will use this fact several times below. The outlined facts (existence of the flow, well-posedness of the nonlocal continuity equation, and the representation formula for its solution) are well known; refer, e.g., to [12,31,41].
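The representation µ t = (X ϑ t ) ♯ ϑ and the identity X ϑ 0,t = P 0,t can be visualized with a toy particle scheme (our own illustration, not the paper's construction): all atoms are advected by V evaluated at the current empirical cloud, and a tracer integrated through the "frozen" local field v t reproduces the trajectory of the atom it starts from. The barycentric field `V(x, mu) = mean(mu) - x` is an assumed example.

```python
def nonlocal_flow(xs0, V, dt, steps):
    """Euler scheme for the flow X of a nonlocal field: every particle is
    advected by V evaluated at the current empirical measure, so
    mu_t = X_t # vartheta is approximated by the particle cloud itself."""
    traj = [list(xs0)]
    for _ in range(steps):
        xs = traj[-1]
        traj.append([x + dt * V(x, xs) for x in xs])
    return traj

# nonlocal field V(x, mu) = barycenter(mu) - x (attraction to the mean)
V = lambda x, xs: sum(xs) / len(xs) - x
traj = nonlocal_flow([0.0, 4.0], V, dt=0.01, steps=100)

# "frozen" local field v_t(x) = V_t(x, mu_t): a tracer started at an atom
# of vartheta must reproduce that atom's trajectory (X_{0,t} = P_{0,t})
tracer = 0.0
for k in range(100):
    tracer += 0.01 * V(tracer, traj[k])
```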

O loc (λ 2 ) families of vector fields
In this section, we discuss some differential properties of nonlocal vector fields and their flows. Definition 2.9 Let Φ λ : X → R m , λ ∈ [0, 1], be a family of functions on a topological space X . We say that Φ λ is an O loc (λ 2 ) family and write In particular, a family of nonlocal vector fields Lemma 2.2 Let V be a nonlocal vector field of class C 1,1 . Then, for any locally bounded ϕ, ψ : R n → R, one has Moreover, the constant C Ω that guarantees the estimate depends only on the data Proof We split the proof into several steps. 1. Fix a compact set Ω and a triple (t, x, µ) ∈ I × Ω × P(Ω). Consider the identity: By the mean value theorem, the first difference on the right-hand side takes the form and the second one yields where µ λ ,τ = (1 − τ)µ + τ(id + λ ψ) ♯ µ.
Arguments similar to those of the previous proof lead to the following slight modification of Lemma 2.2.
Lemma 2.3 Let V be a nonlocal vector field of class C 1,1 and X : I × R n × P c (R n ) → R n be a sequentially continuous and locally bounded map such that x → X µ t (x) is bijective for all t and µ. Then, for any locally bounded Carathéodory maps ϕ, ψ : Moreover, the constant C Ω which guarantees the estimate and L r which bounds, for all t, the Lipschitz constants of D x V t and D µ V t : The following presents a refined version of the formula (5) for the intrinsic derivative.
Lemma 2.4 Let F : P c (R n ) → R be of class C 1,1 , and Φ λ : R n → R n , λ ∈ [0, 1], be a family of Borel maps which can be expanded as follows: for some ϕ : R n → R n . Then, Proof In view of Lemma 2.3, it suffices to show that According to Lemma 2.1, F is locally Lipschitz. Hence, for any compact Ω ⊂ R n , there exists for all µ ∈ P(Ω). Now, using (11), we complete the proof.

Derivative of the flow
Recall that, for a fixed initial measure, any sublinear C 1,1 nonlocal vector field (n.v.f.) V generates a map X that can be thought of as its flow. We shall study the flow X λ of the perturbed n.v.f. V λ = V + λW , where W is also a sublinear C 1,1 field and λ ∈ [0, 1]. The results of this section, which provide the linearization of the nonlocal flow, are largely similar to those of [11, Sec. 3.2] (both in their statements and their proofs). However, in contrast to [11], we admit here nonlocal perturbations of the vector field. On the other hand, we impose slightly more restrictive assumptions, enabling us to expand the nonlocal flow up to a term of order O(λ 2 ) rather than o(λ ) as demonstrated in [11]. This fact will play a crucial role in establishing the convergence of our numerical algorithm in Section 5.1.
Theorem 2.1 Let V,W be sublinear C 1,1 nonlocal vector fields, X be the flow of V and X λ be the flow of V λ .
Remark 2.1 First, notice that ϑ in (12) can be considered as a parameter. Thus, (12) can be thought of as a "linear transport equation with a nonlocal source term". One can easily show (for example, by fixed-point arguments) that (12), (13) has a unique continuous solution w (see also [11,12], where such a solution is constructed explicitly for the case W t (x, µ) ≡ W t (x)). Moreover, w is sequentially continuous as a function of t, x, ϑ .
Before presenting the proof, note that our assumptions on V and W imply that there exists Using the local boundedness of D x V , D µ V and W , we can find C ρ > 0 such that Now, it follows from (12), (13) that This implies that (t, x, ϑ ) → t, (X λ ,ϑ Finally, since V λ , D x V λ and D µ V λ are locally Lipschitz, we choose L r > 0 such that for all t ∈ I and λ ∈ [0, 1]. Fix a compact set Ω ⊂ R n and a measure ϑ ∈ P(Ω). From now on, we will omit the index ϑ in X λ ,ϑ t and w ϑ t . Consider the following set: and equip it with the norm ∥ϕ∥ σ = max I×Ω e −σt |ϕ t (x)|, σ > 0. Since ∥ • ∥ σ is equivalent to the standard sup norm, X (Ω) becomes a complete metric space. Finally, for any λ ∈ [0, 1] and ϕ ∈ X (Ω), we define One can easily check that F maps [0, 1] × X (Ω) to X (Ω).

Increment formula
Now, we turn to the analysis of the increment of the cost functional along an adequate class of control variations. The theory of Pontryagin's maximum principle is commonly built around the class of needle-shaped variations. However, for the specified control-affine case, the latter can be replaced by a simpler class of weak control variations.

Problem specification
In this section, in order to simplify the presentation, we assume that the driving vector field V is affine in the control variable u, i.e., and the running cost is identically zero, i.e., L ≡ 0. Later, in Sect. 6, we will discuss how to deal with the general case. We begin by listing our basic assumptions.
Assumption (A 1 ):
1. V takes the form (20), where all V j with 0 ≤ j ≤ m are of class C 1,1 ;
2. U ⊂ R m is compact and convex;
3. ℓ :
Assumption (A 2 ): all maps D x V j with 0 ≤ j ≤ m are continuously differentiable in x, and their derivatives are locally bounded.
Remark 3.1 We use Assumption (A 2 ) only once: to show that, for any fixed control function u ∈ U , the solution w of (12), (13) corresponding to V t (x, µ) is C 1 in x. Indeed, t → w ϑ t (x) satisfies the ODE: , where both functions are continuously differentiable. Hence, w is C 1 in x, according to the standard ODE theory.

Increment formula I
Further in this section, ϑ is supposed to be fixed, so we will omit it when writing the arguments of X and w.
Let us fix a pair of control functions u, ū ∈ U , u ≠ ū. We call u a reference control and ū a target control. A weak variation of u towards ū is the convex combination In view of (20), the variation (21) implies the following perturbation of the reference vector field V t (x, µ) .= V t (x, µ, u(t)): Note that, by Assumption (A 1 ), there exists C > 0 such that |V λ t (x, µ)| ≤ C(1 + |x|) for all t, x, µ, λ , u, ū. This means that ρ from (14) can be chosen independently of u, ū ∈ U . Again, by Assumption (A 1 ), we can find C ρ which guarantees, for all u, ū ∈ U , the estimate (15); we then construct r by (17) and find L r such that (18) holds for all u, ū ∈ U . Now, Theorem 2.1 implies that where w is a solution of (12), (13). Here, we think of U as a compact topological space equipped with the weak-* topology σ (L ∞ , L 1 ).
Here we write O instead of O loc because U is already compact. Our next goal is to rewrite this formula in a "constructive" form, namely, in terms of a Hamiltonian system associated with our optimal control problem.
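A quick numerical sanity check of the weak-variation mechanism (on a finite-dimensional toy functional, not on problem (P)): the convex combination keeps the control admissible whenever U is convex, and the increment along λ is first-order in λ with the expected slope. The functional `I` and the controls below are hypothetical.

```python
def weak_variation(u, u_bar, lam):
    """Convex combination u + lam * (u_bar - u): the varied control stays
    admissible because the control set U is convex."""
    return [(1 - lam) * ui + lam * vi for ui, vi in zip(u, u_bar)]

# first-order check on a toy functional I[u] = sum(u_i^2):
I = lambda u: sum(x * x for x in u)
u = [1.0, 2.0]
ub = [0.0, 0.0]
lam = 1e-6
d = I(weak_variation(u, ub, lam)) - I(u)
# analytic slope at lam = 0 is 2 * sum(u_i * (ub_i - u_i)) = -2 * I[u] = -10
```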

Hamiltonian system
The Hamiltonian system associated with Problem (P) (see (1)-(3)) is merely a continuity equation on the cotangent bundle of R n , i.e., on the space R n × (R n ) * ≃ R 2n comprised of pairs (x, p), where x is the primal and p the dual state variable. In our case, this equation takes the form This equation is supplemented with the terminal condition where µ t satisfies (2). The standard well-posedness result for nonlocal continuity equations (see, e.g., [46]) guarantees that (23), (25) has a unique solution γ t . Moreover, the projection of γ t onto the x space coincides with µ t :

Increment formula II
Let us go back to (22). First, recalling that µ T .= X T ♯ ϑ , we express the integral entering its right-hand side as follows: By Lemma 8.1.2 of [1], the following version of the classical Newton-Leibniz formula holds for any function ψ ∈ C 1 (I × R 2n ): Remark 3.1 allows us to take ψ t (x, p) .= p • w t (X −1 t (x)) in the above expression. Recall that X t (x) = P 0,t (x), where P is the flow of the nonautonomous vector field v t (x) = V t (x, µ t , u(t)); in particular, X −1 t = P t,0 , and we can use the standard rules of flow differentiation (Theorem 2.3.3 of [15]) to perform the calculations: Then, In view of (12), the right-hand side reduces to Renaming the variables (x, p) ↔ (y, q) in the latter term shows that the last two terms cancel out. Hence, ∫ Ξ t dγ t = ∫ p W t (x, µ t ) dγ t (x, p).
Finally, noticing that ψ 0 .= p • w 0 ≡ 0 and then using (27) and the definition of W t , we have Proposition 3.2 Under assumptions (A 1 ), (A 2 ), it holds and t → γ t is a solution of the Hamiltonian system (23)-(25).

Pontryagin's maximum principle
A consequence of the increment formula (29) is the following version of Pontryagin's maximum principle.
Theorem 3.1 (PMP in terms of the Hamiltonian system) Assume that (A 1,2 ) hold and ϑ ∈ P c (R n ). Let (µ, u) be an optimal pair for (P). Then u(t) satisfies, for a.e. t ∈ I, the maximum condition where γ is the unique solution of the Hamiltonian system (23)-(25) and H t is defined by (30).
Proof Since u is optimal, we have I [u λ ] − I [u] ≥ 0 for any target control ū. Now, the increment formula implies that On the other hand, Let ψ(t, υ) .= H t (γ t , υ) and α(t) .= max υ∈U ψ(t, υ). It is easy to check that ψ is a Carathéodory map. Since α(t) ∈ ψ(t,U) for a.e. t ∈ I, we deduce from Filippov's lemma [4, Theorem 8.2.10] that there exists ũ ∈ U satisfying α(t) = ψ(t, ũ(t)) for a.e. t ∈ I. Hence, the inequality in (32) can be replaced by the equality, which completes the proof.

Remark 3.2 Pontryagin's maximum principle displayed by Theorem 3.1 is essentially the same as in [11,13]. However, in these papers, the driving vector field has a specific form: it can be represented as the sum of a nonlocal drift term and an external Lipschitz vector field u = u(t, x) playing the role of the control action. In our case, the control is a measurable function of the time variable only, u = u(t), which may enter the nonlocal term itself, thus enabling us, e.g., to govern convolution kernels as in (4). Finally, note that Theorem 3.1 can be derived from the (most general) version of the PMP recently obtained in [7], which relies on the so-called Lagrangian interpretation [23] of the mean-field control problem (P).
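In computations, the measurable selection produced by Filippov's lemma is typically replaced by a pointwise maximization of H t over a finite sample of the compact set U. The following sketch (our own, with a hypothetical concave Hamiltonian) illustrates this step.

```python
def maximize_hamiltonian(H, U_grid):
    """Pointwise maximization of the Hamiltonian over a finite sample of
    the compact control set U: the practical stand-in for the Filippov
    measurable selection in the maximum condition of the PMP."""
    best = max(U_grid, key=H)
    return best, H(best)

# toy concave Hamiltonian H(u) = -(u - 0.3)^2 on U = [0, 1]
U_grid = [i / 100 for i in range(101)]
u_star, val = maximize_hamiltonian(lambda u: -(u - 0.3) ** 2, U_grid)
```

Doing this for each node of a time grid yields a piecewise-constant candidate control satisfying the maximum condition up to the discretization error.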

Remark 3.3
We conclude this section by stressing two obvious drawbacks of the presented form of the necessary optimality condition, which are critical for its numerical implementation.
1. Equation (23) is defined on a space of dimension 2n, which makes its numerical solution computationally demanding even for n = 2.
2. Even if µ is absolutely continuous, γ is not. In other words, γ never takes the form ρ t L 2n with a density function (t, x, p) → ρ t (x, p). This is due to the fact that γ T is supported on the graph of the map x → −D µ ℓ(µ T )(x), which is always an L 2n -null set. This means that system (23) cannot be solved by standard numerical methods for hyperbolic PDEs, which can be used only when densities exist.
These issues motivate the development of a new version of Theorem 3.1, which is obtained by extracting the "adjoint system" from the Hamiltonian PDE (23).

Adjoint equation
In this section, we shall see that the Hamiltonian system (23) can be decoupled into primal and dual parts, just as in the classical optimal control theory. This fact will allow us to rewrite the increment formula and Pontryagin's maximum principle in an equivalent form, suitable for numerics.

Derivation
After reflecting upon the formula (29), one comes up with the idea to take, as the adjoint trajectory, the family of signed vector (namely, row-vector) measures defined by where γ t is the solution of (23)-(25). Indeed, return to the representation (27), (28) and specify the class of test functions ψ as follows: In this case, the left-hand side of (27) vanishes, which implies where Ξ is defined in (28). In terms of ν, the parts of the integral in the left-hand side of (34) can be represented as follows: and, according to (26), Substituting these expressions into (34), we obtain The choice ϕ(x) = (0, . . . , ϕ i (x), . . . , 0) T , where only the i-th component of ϕ is nonzero, shows that this is merely the weak formulation of the following system of balance laws: Here, for the sake of readability, we omit the lower index t of ν t and abbreviate where m j i = m j i (t, x, y) are the elements of the matrix D µ V t (y, µ t , u(t), x). At the final time instant T , one has which can be rewritten in terms of the Radon-Nikodym derivative as Definition 4.1 We call the backward system (36), (37) of nonlocal linear PDEs the adjoint system associated to the optimal control problem (P).

Well-posedness
We observe that there exists a solution of the adjoint system, namely, the one defined by (33). Let us show that this solution is unique. Basically, the adjoint system (36) is a system of linear balance laws with sources of the form (38). To proceed, recall basic properties [47] of the linear balance law with a Carathéodory, locally Lipschitz, sublinear vector field f_t and an integrable source ς_t.
We say that a curve ς : I → M(R^n), t → ς_t, is integrable if it is measurable and ∫_0^T ‖ς_t‖_TV dt < +∞, where ‖·‖_TV denotes the total variation norm on M(R^n).
For integrable curves, we can define a notion of integral in the usual way.

Theorem 4.1 Under our assumptions, there exists a unique solution of (39) with the initial condition ρ_0 = ξ. Moreover, it can be expressed explicitly in terms of the flow P of f_t.
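The representation of Theorem 4.1 can be illustrated numerically by a particle scheme. Below is a minimal sketch, assuming the standard Duhamel-type form ρ_t = P_{0,t♯} ξ + ∫_0^t P_{s,t♯} ς_s ds: the initial measure and the source are approximated by weighted particles pushed forward along the flow of f_t. The 1-D setting, the Euler integrator, the left-endpoint quadrature in s and all names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def flow(x, s, t, f, n_steps=100):
    """Flow map P_{s,t}: integrate dx/dt = f(t, x) from time s to t (explicit Euler)."""
    h = (t - s) / n_steps
    for k in range(n_steps):
        x = x + h * f(s + k * h, x)
    return x

def solve_balance_law(xi_pts, xi_w, source, f, T, n_t=50):
    """Weighted particles approximating rho_T = P_{0,T#} xi + int_0^T P_{s,T#} sigma_s ds."""
    pts = [flow(xi_pts, 0.0, T, f)]          # transported initial measure
    wts = [xi_w]
    ts = np.linspace(0.0, T, n_t)
    dt = ts[1] - ts[0]
    for s in ts:                             # quadrature over the source term
        sp, sw = source(s)                   # particle approximation of sigma_s
        pts.append(flow(sp, s, T, f))
        wts.append(sw * dt)
    return np.concatenate(pts), np.concatenate(wts)

# toy data: xi = delta_0, sigma_s = s * delta_0, f(t, x) = 1 (pure drift)
pts, wts = solve_balance_law(np.array([0.0]), np.array([1.0]),
                             lambda s: (np.array([0.0]), np.array([s])),
                             lambda t, x: np.ones_like(x), T=1.0)
total_mass = wts.sum()   # approaches 1 + int_0^1 s ds = 1.5 as n_t grows
```

In this toy run the initial Dirac mass is transported to x = 1, and the accumulated source mass approximates ∫_0^1 s ds = 1/2, so the total mass tends to 3/2.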
The following lemma collects several well-known properties of the total variation norm (since their proofs are quite standard, we omit them for brevity).
The well-posedness of the adjoint system is established by the following result, where M_c(R^n) denotes the subset of M(R^n) composed of signed measures with compact support.

Proposition 4.1 Under assumptions (A_{1,2}), the adjoint system (36), (37) has a unique solution.

Proof Take two terminal measures ξ, ξ′ ∈ [M_c(R^n)]^n and denote by ν_t and ν′_t the corresponding (potentially non-unique) trajectories of (36). Then, from Theorem 4.1 and Lemma 4.1, it follows that an estimate holds, where ς and ς′ are the corresponding sources defined by (38). Since these sources are controlled by the terminal data, we obtain, again by Lemma 4.1, a further estimate, where Ω is a compact set containing the supports of the measures ν_t, ν′_t, µ_t, t ∈ I (one can show that such a set exists by reasoning as in [47]). Now, Grönwall's lemma gives the uniqueness.

Increment formula III
The increment formula (29) and Pontryagin's maximum principle (Theorem 3.1) are trivially reformulated in terms of a solution to the adjoint system.
Remark 4.1 Since the adjoint system (36), (37) has a unique solution ν_t, it must coincide with the one given by (33). In particular, ν_t acts on test functions ϕ ∈ C^1(R^n; R^n) by the rule (44), where γ^x_t is the disintegration of γ_t with respect to µ_t (see [1, Theorem 5.3.1]). If the initial measure ϑ is absolutely continuous with respect to the Lebesgue measure L^n, then so are all µ_t, t ∈ I (thanks to the representation µ_t = X_{t♯} ϑ). Now, (44) implies that every ν_t, t ∈ I, must be absolutely continuous as well.
This fact has important consequences, which answer the challenges outlined in Remark 3.3:
1. In contrast to the Hamiltonian continuity equation (23) as a whole, the adjoint system is numerically solvable.
2. While handling the adjoint equation, we deal with a system of n + 1 first-order hyperbolic PDEs, each one "living" on R^n. Solving this system is less computationally expensive than treating a single equation on R^{2n}.

Linear case
Now, we establish a connection between Theorem 4.3 and the well-known version of the PMP for µ-independent vector fields V_t(x, µ, u) = V_t(x, u) (see, e.g., [10,45]). For such fields, the role of the adjoint state is played by a solution (t, x) → ψ_t(x) of a single non-conservative transport equation (45). It is reasonable to expect that, under sufficient regularity, the adjoint system (36) boils down to (45). This ansatz is confirmed by the following result.

Proposition 4.2 Assume that (A_{1,2}) hold, x → (δℓ/δµ)(µ, x) is of class C^2, ϑ ∈ P_c(R^n) and V_t(x, µ, u) = V_t(x, u). Let u ∈ U, and let µ_t and ν_t be the corresponding solutions of (2) and (36), (37), respectively. Then, for a.e. t ∈ I, the representation (46) holds, where (t, x) → ψ_t(x) is a solution of the transport equation (45) with the terminal condition (47).

Proof It is clear that the representation (46), (47) agrees with the terminal condition (37). Due to the uniqueness of a solution to (36), we only need to check formally that ν^i := (∂_{x_i} ψ) µ, 1 ≤ i ≤ n, meets the identity (35) with the vector field V_t(x, µ_t, u(t)) = v_t(x).
A solution of (45), (47) can be written explicitly as ψ_t(x) = −(δℓ/δµ)(µ_T, P_{t,T}(x)), where P is the flow of v. This formula, together with our assumptions, implies that ψ admits the partial derivatives ∂_{x_i} ψ, ∂_{x_i}∂_{x_j} ψ and ∂_{x_i}∂_t ψ for all 1 ≤ i, j ≤ n. These derivatives are at least measurable in t, continuous in x and locally bounded. Take the standard mollification kernel η_ε : R → R and consider the convolution ψ^ε = η_ε ∗ ψ. It is easy to see that ∂_{x_i}∂_t ψ^ε = ∂_t ∂_{x_i} ψ^ε and that ∂^α ψ^ε → ∂^α ψ as ε → 0, where ∂^α denotes any of the above derivatives; more precisely, the convergence holds against any test function ϕ. It remains to use (48) for passing to the limit as ε → 0. The first term on the right-hand side vanishes thanks to (45), so we obtain the desired identity in the sense of distributions. Since D_µ V = 0, we conclude that ν does satisfy (36).

Descent method
Now, we are able to construct an algorithm for the numerical solution of Problem (P) with a vector field as in (20). Note that similar algorithms were proposed earlier for solving classical [3] and stochastic [2] optimal control problems.

Algorithm
Let u be a reference control, and let µ_t and ν_t be the corresponding trajectory and co-trajectory. Construct the target control ū as follows: The increment formula (41) shows that ū − u is a descent direction. Let us introduce the functional E. It is clear that E[u] ≥ 0, and E[u] = 0 implies that the pair (µ[u], u) satisfies the PMP. In other words, E measures the "non-extremality" of u. Now, we can use the descent direction ū − u to develop the following version of the classical backtracking algorithm.
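The overall loop can be sketched as follows. This is a minimal illustration with the PDE solves abstracted into callables: `solve_forward`, `solve_adjoint`, `pmp_control` (pointwise maximization of the Hamiltonian, producing the target control ū), `nonextremality` (the functional E) and the Armijo-type acceptance rule are all assumptions of this sketch, not the paper's literal algorithm; the control is represented as a discretized list of values.

```python
def descent(u0, cost, solve_forward, solve_adjoint, pmp_control,
            nonextremality, tol=1e-8, beta=0.5, max_iter=100):
    """Sketch of a PMP-based descent with backtracking in the spirit of Sect. 5.1."""
    u = list(u0)
    for _ in range(max_iter):
        mu = solve_forward(u)                   # primal trajectory mu[u]
        nu = solve_adjoint(u, mu)               # adjoint co-trajectory nu[u]
        u_bar = pmp_control(u, mu, nu)          # target control: maximizes H_t pointwise
        eps = nonextremality(u, u_bar, mu, nu)  # the functional E[u] >= 0
        if eps < tol:                           # E[u] = 0 <=> the PMP holds
            return u
        J, lam = cost(u), 1.0
        # backtracking along the descent direction u_bar - u (Armijo-type rule)
        while (cost([x + lam * (v - x) for x, v in zip(u, u_bar)])
               > J - 0.5 * lam * eps) and lam > 1e-12:
            lam *= beta
        u = [x + lam * (v - x) for x, v in zip(u, u_bar)]
    return u
```

For a quick sanity check one can plug in a toy finite-dimensional problem (quadratic cost, ū = −u), for which the loop contracts the control to the extremal point.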
The convergence analysis of the algorithm is provided by the following theorem.
Theorem 5.1 For any initial control u_0 ∈ U, the sequence {u_k} generated by the algorithm
Since the right-hand side tends to zero, we come to a contradiction.

Implementation
In the algorithm described in Sect. 5.1, the primal and adjoint equations are solved numerically. If the original problem is periodic in space and the driving vector field has a convolutional structure (4), then, for the numerical integration, one gives preference to so-called spectral methods [14]. Assume that the initial measure ϑ ∈ P_c(R) is absolutely continuous. This implies that the corresponding trajectories µ_t and ν_t are absolutely continuous as well, and all ingredients of the algorithm can be recast in terms of their densities ρ_t and ζ_t, respectively. Moreover, since ϑ is compactly supported, there exists a segment [a, b] such that spt µ_t ⊂ [a, b] and spt ν_t ⊂ [a, b] for all t ∈ I. This implies that µ_t and ν_t can be considered as measures on the circle S^1 (i.e., the measures can be viewed as 2π-periodic in x).
The primal and the adjoint equations can be written in a common form. Suppose that all densities are of class L^2(S^1). Upon substitution of the truncated Fourier series into (51), the partial differential equation transforms into the system of ODEs (53), where the hat over the nonlinear terms denotes their Fourier coefficients, and ρ̂ stands for the vector of Fourier coefficients of ρ(x, t).
The system (53) can be integrated by any appropriate numerical method (e.g., a Runge-Kutta method). Transformations between the physical and spectral (Fourier) spaces are computed using the Fast Fourier Transform (FFT). Multiplications of fields are usually computed in the physical space, while derivatives and convolutions are evaluated in the Fourier space.
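This division of labor between the two spaces can be sketched as follows for a scalar conservation law on S^1 with a convolutional velocity v[ρ] = u1 + u2 (K ∗ ρ), in the spirit of (51)-(53). The grid size, the kernel K(x) = sin(x), the constant controls and the explicit Euler time stepper are illustrative assumptions.

```python
import numpy as np

N = 128
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)             # integer wave numbers
K_hat = np.fft.fft(np.sin(x))                # kernel transform, computed once

def rhs(rho, u1, u2):
    """Right-hand side of d/dt rho = -d/dx((u1 + u2*(K*rho)) rho), pseudo-spectrally."""
    rho_hat = np.fft.fft(rho)
    conv = np.fft.ifft(K_hat * rho_hat).real * (2 * np.pi / N)   # convolution: Fourier space
    flux = (u1 + u2 * conv) * rho                                # product: physical space
    return -np.fft.ifft(1j * k * np.fft.fft(flux)).real          # derivative: Fourier space

def step(rho, dt, u1, u2):
    """One explicit Euler step of the Fourier-Galerkin system (53)."""
    return rho + dt * rhs(rho, u1, u2)

rho = 1.0 / (2 * np.pi) + 0.1 * np.cos(2 * x)    # perturbed uniform density
for _ in range(200):
    rho = step(rho, 1e-3, u1=1.0, u2=0.5)
mass = rho.sum() * 2 * np.pi / N                  # total mass is conserved by the scheme
```

Note that the k = 0 mode of the spectral derivative vanishes identically, so the scheme conserves the total mass of ρ exactly, as a continuity equation should.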

Numerical experiment
As an example, we consider the paradigmatic Kuramoto model [36], which describes an assembly of pairwise interacting homotypic oscillators. Specifically, we consider an optimization problem in the spirit of [48], where the goal is to synchronize a continuous oscillatory network by a given time moment T.
The prototypic ODE representing the dynamics of N oscillators takes the form (55). Here, x_i(t) ∈ S^1 and ω_i ∈ R are the phase and natural frequency of the i-th oscillator, respectively, and α is the phase shift. The control inputs are t → u(t) := (u_1(t), u_2(t)), where u_1 affects the angular velocity and u_2 modulates the connectivity of the network.
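A minimal simulation of the finite-N dynamics can be sketched as follows, assuming the controlled Kuramoto form dx_i/dt = ω_i + u_1 + (u_2/N) Σ_j sin(x_j − x_i + α); the Euler integrator, the parameter values and the constant-in-time controls are illustrative choices, not part of the paper.

```python
import numpy as np

def kuramoto_rhs(x, omega, u1, u2, alpha):
    """dx_i/dt = omega_i + u1 + (u2/N) * sum_j sin(x_j - x_i + alpha)."""
    diff = x[None, :] - x[:, None]           # pairwise phase differences x_j - x_i
    return omega + u1 + u2 * np.mean(np.sin(diff + alpha), axis=1)

def simulate(x0, omega, u1, u2, alpha=0.0, T=20.0, dt=1e-2):
    """Explicit Euler integration of the controlled Kuramoto system."""
    x = x0.copy()
    for _ in range(int(T / dt)):
        x = x + dt * kuramoto_rhs(x, omega, u1, u2, alpha)
    return np.mod(x, 2 * np.pi)

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 2 * np.pi, size=50)      # random initial phases
xT = simulate(x0, omega=np.zeros(50), u1=0.0, u2=1.0)   # attractive coupling
r = np.abs(np.mean(np.exp(1j * xT)))         # Kuramoto order parameter, r ~ 1 means sync
```

With zero natural frequencies, α = 0 and positive coupling u_2, the phases lock and the order parameter r approaches 1, which is the synchronization targeted in the control problem.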
As in [48], we assume that all oscillators have a common natural frequency ω, which, in this case, can be set to ω = 0. As the number of oscillators N → ∞, the limiting mean-field version of (55) is described by a curve t → µ_t ∈ P(S^1) satisfying the nonlocal continuity equation driven by the vector field (56).

Remark 5.1 Let us stress several differences between the problem we solve here and the one addressed in [48]. First of all, in [48] the authors consider so-called mean-field type controls, i.e., they assume that u depends not only on t but also on x. Clearly, this choice greatly improves the controllability of the system. Moreover, the system in [48] is subject to common noise, which also contributes to the controllability. Indeed, let the initial density be given by ρ_0 = 1 + sin kx, k ≥ 2. Then the convolution in (56) vanishes, which means that our control options reduce to shifting the wave ρ_0 back and forth. On the other hand, in the presence of common noise, the Fourier coefficient corresponding to sin x immediately becomes nonzero and, as a result, the system is self-synchronizing for any positive u_2. A similar effect can be observed if we try to solve (2), (56) with a discretization scheme that involves numerical diffusion (such as the classical Lax-Friedrichs method).
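The vanishing of the convolution for ρ_0 = 1 + sin kx, k ≥ 2, is a pure orthogonality effect and is easy to check numerically. The sketch below assumes a Kuramoto-type kernel sin(· + α) (the exact sign convention in (56) is immaterial for the argument); the grid size and the value of α are illustrative.

```python
import numpy as np

N, alpha, k = 256, 0.3, 2
x = 2 * np.pi * np.arange(N) / N

# trapezoidal-type quadrature of int_0^{2pi} sin(xi - y + alpha) * rho0(y) dy
def kuramoto_conv(rho0):
    return np.array([np.sum(np.sin(xi - x + alpha) * rho0) * 2 * np.pi / N
                     for xi in x])

rho0 = 1.0 + np.sin(k * x)                   # k >= 2: orthogonal to the mode-1 kernel
max_conv = np.abs(kuramoto_conv(rho0)).max() # vanishes up to machine precision

rho1 = 1.0 + np.sin(x)                       # k = 1: resonant mode
conv1 = kuramoto_conv(rho1)                  # equals -pi * cos(x + alpha) != 0
```

So for k ≥ 2 the interaction term provides no restoring force at all, while for k = 1 it is of order π, illustrating why synchronization from such initial data hinges on exciting the first Fourier mode (by noise or numerical diffusion).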

General case
In this section, we discuss a natural extension of the obtained results to the control-nonlinear case and to the general cost functional (1).

Nonlinear dependence on control
To handle the case of a nonlinear dependence u → V_t(x, µ, u), we resort to the standard technique based on extending the original class U of control signals to the broader space of Young measures [22], {η ∈ P(I × U) : [(t, u) → t]_♯ η = (1/T) L^1}. It is well known that such an extension provides a linearization of the vector field w.r.t. the driving signal and, in a certain sense, reduces the general model to the control-affine case above. Recall that (i) the original class U is densely embedded in this space via u → η with η_t = δ_{u(t)}, where t → η_t ∈ P(U) is the weakly measurable family of probability measures obtained by disintegrating η w.r.t. (1/T) L^1; and (ii) the space of Young measures is compact in the topology of weak convergence of probability measures (and therefore in any metric W_p, p ≥ 1) as soon as U is compact, thanks to the classical Prohorov theorem.
This passage, which is routine in mathematical control theory, leads to the following relaxation of the original dynamics (2), driven by the averaged vector field V_t(x, µ_t, η_t) := ∫_U V_t(x, µ_t, u) dη_t(u) = ⟨η_t, V_t(x, µ_t, ·)⟩; the original cost is reformulated accordingly as I = ℓ(μ_T), where μ_t is a solution of (60).
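The averaging above is the entire computational content of the relaxation: for a discrete Young measure η_t = Σ_j w_j δ_{u_j}, the relaxed field is a convex combination of the original fields. A minimal sketch, where the field V(x, u) = sin(u)·x and all names are illustrative assumptions:

```python
import numpy as np

def relaxed_field(V, x, atoms, weights):
    """V_bar(x) = sum_j w_j * V(x, u_j): the action of eta_t on V_t(x, mu, .)."""
    return sum(w * V(x, u) for u, w in zip(atoms, weights))

V = lambda x, u: np.sin(u) * x               # a control-nonlinear field (assumption)
x = np.array([1.0, 2.0])

# an ordinary control u(t) = 0.5 embeds as the Dirac Young measure delta_{0.5}
dirac = relaxed_field(V, x, atoms=[0.5], weights=[1.0])       # equals V(x, 0.5)

# a genuinely mixed measure: "chattering" between u = 0 and u = pi/2
mixed = relaxed_field(V, x, atoms=[0.0, np.pi / 2], weights=[0.5, 0.5])
```

The Dirac case recovers the original dynamics, illustrating point (i) above, while the mixed case produces velocities that are convex combinations of the original ones, which is exactly the linearization in the control exploited in the sequel.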
Observing that the dependence ω → V_t(x, µ, ω) is linear, we invite the reader to consider the weak variation η^λ = η + λ(η̂ − η) and the respective cost increment I[η^λ] − I[η] in place of (21) and (22), and to reproduce the arguments of Sects. 3 and 4. By doing so, one verifies that the resulting increment formula and necessary condition for the optimality of a Young measure η keep the form of Theorems 4.2 and 4.3, where V and H_t are replaced by the averaged versions V̄ and H̄_t, respectively, with H̄_t(µ, ν, ω) := ∫_U H_t(µ, ν, u) dω(u), and the maximum condition (43) becomes the maximization of H̄_t(μ̂_t, ν̂_t, ω) over ω, which is equivalent to spt(η_t) ⊆ arg max_{υ∈U} H_t(μ̂_t, ν̂_t, υ), where ν̂_t is the adjoint backward solution associated with η̂. Now, if the addressed control-nonlinear problem (P) does have a usual minimizer u ∈ U, then the PMP for u is recovered by taking η such that η_t = δ_{u(t)}.

Running cost
If the map u → L_t(x, µ, u) is affine, one easily adapts the PMP by reformulating the dynamics (24) of the Hamiltonian PDE and the Hamiltonian (42) accordingly, with H_t := V_t · dν − L_t. Further details can be found, e.g., in [11]. The general case of a nonlinear dependence on u is handled by the relaxation technique exhibited in Sect. 6.1.

Lemma 4.1
Let ρ ∈ M(R^n) and ς : I → M(R^n) be an integrable curve. Then: 1. for any Borel measurable bijective map f : R^n
Compute the trajectories µ_k (forward in time) and ν_k (backward in time, starting from (37)) of the primal and adjoint systems corresponding to u_k, and take d_k(t) := (d_1, . . ., d_m)[u_k](t), where the d_j[u] are introduced in (49).