Nonlocal-interaction equation on graphs: gradient flow structure and continuum limit

We consider dynamics driven by interaction energies on graphs. We introduce graph analogues of the continuum nonlocal-interaction equation and interpret them as gradient flows with respect to a graph Wasserstein distance. The particular Wasserstein distance we consider arises from the graph analogue of the Benamou-Brenier formulation where the graph continuity equation uses an upwind interpolation to define the density along the edges. While this approach has both theoretical and computational advantages, the resulting distance is only a quasi-metric. We investigate this quasi-metric both on graphs and on more general structures where the set of"vertices"is an arbitrary positive measure. We call the resulting gradient flow of the nonlocal-interaction energy the nonlocal nonlocal-interaction equation (NL$^2$IE). We develop the existence theory for the solutions of the NL$^2$IE as curves of maximal slope with respect to the upwind Wasserstein quasi-metric. Furthermore, we show that the solutions of the NL$^2$IE on graphs converge as the empirical measures of the set of vertices converge weakly, which establishes a valuable discrete-to-continuum convergence result.

• P 2 (A) ⊆ P(A) stands for the elements of P(A) with finite second moment, that is, • ρ 1 ⊗ ρ 2 is the product measure of ρ 1 , ρ 2 ∈ M(R d ).
• C b (A) is the set of bounded continuous functions from A to R.
• a + := max{0, a} and a − := (−a) + are the positive and negative parts of a ∈ R.
• µ ∈ M + (R d ) sets the underlying geometry of the state space; it is sometimes referred to as base measure. • ρ ∈ P(R d ) denotes a configuration; the natural setting is that supp ρ ⊆ supp µ, although we allow for general supports as needed for stability results. • γ 1 = ρ ⊗ µ and γ 2 = µ ⊗ ρ.
• G := {(x, y) ∈ R d × R d : x = y} is the set of possible edges.
• V as is the set of antisymmetric graph vector fields, defined in (1.6).
• ∇f is the nonlocal gradient of a function f : R d → R, while ∇ · j is the nonlocal divergence of a measure-valued flux j ∈ M(G); see Definition 2.7. • A stands for the action functional; see Definition 2.3.
Let us also specify the notions of narrow convergence and convolution. A sequence (ρ n ) n ⊂ M(A) is said to converge narrowly to ρ ∈ M(A), in which case we write ρ n ρ, provided that Given a function f : A × A → R and ρ ∈ M(A), we write f * ρ the convolution of f and ρ, that is, f (x, y) dρ(y) for any x ∈ A such that the right-hand side exists.

Introduction
We investigate dynamics driven by interaction energies on graphs, and their continuum limits. We interpret the relevant dynamics as gradient flows of the interaction energy with respect to a particular graph analogue of the Wasserstein distance. We prove the convergence of the dynamics on finite graphs to a continuum dynamics as the number of vertices goes to infinity. To do so we create a unified setup where the continuum and the discrete dynamics are both seen as particular instances of the gradient flow of the same energy, with respect to a nonlocal Wasserstein quasi-metric whose state space is adapted to the configuration space considered.
Let us first introduce the problem on finite graphs where it is the simplest to describe.

Graph setting with general interactions.
Consider an undirected graph with vertices X = {x 1 , . . . , x n } and edge weights w x,y ≥ 0, satisfying w x,y = w y,x for all x, y ∈ X. Although technically not necessary, we impose the natural requirement that w x,x = 0. The interaction potential is a symmetric function K : X × X → R, while the external potential is denoted P : X → R. We consider a "mass" distribution ρ : X → [0, ∞), and we require x∈X ρ x = 1. The total energy E : P(X) → R is a combination of the interaction energy E I and the potential energy E P : The gradient descent of E X that we study is described by the following system of ODE for the mass distribution: The quantities v : X ×X → R and j : X ×X → R are defined on edges and model the graph analogues of velocity and flux. An evolution by such system is illustrated on Figure 1. The system (1.2)-(1.4) is the gradient flow of the energy E X with respect to a new graph equivalent of the Wasserstein metric. The concept of Wasserstein metrics on finite graphs were introduced independently by Chow, Huang, Li, and Zhou [11], Maas [28], and Mielke [29,30]. All of the approaches rely on graph analogues of the continuity equation to describe the paths in the configuration space. On graphs the mass is distributed over the vertices and is exchanged over the edges. Hence the analogues of the vector field and the flux are defined over the edges. However, the flux should be the product of the velocity (an edge-based quantity) by the density (a vertex-based quantity). Thus one has to interpolate the densities at vertices to define the density (and hence the flux) along the edges. The choice of interpolation is not unique, and has important ramifications.
While the overall structure of our setup is closest to one in [28], which we recall in Section 1.4, the form of the interpolation used is related to the upwind interpolation used in [11]. However, while in [11] the authors considered only the direction of the flux due to the potential energy to determine which density to use on the edges, in our case the density chosen depends on the total velocity and we furthermore include the interaction term which itself depends on the configuration. In particular we use an upwind interpolation based on the total velocity.
The "velocities" v we consider can be assumed to be antisymmetric: v x,y = −v y,x for all x, y ∈ X. In the graph setting, which we normalize in order to consider limit n → ∞, the continuity equation with upwind interpolation is obtained by combining (1.2) with the flux-velocity relation (1.3). As in [28], we define the graph Wasserstein distance by minimizing the action, which is the graph analogue of the kinetic energy: As in [11,28,30], the graph Wasserstein distance is defined by adapting the Benamou-Brenier formula: T (ρ 0 , ρ 1 ) 2 = inf (ρ,v)∈CE X (ρ 0 ,ρ 1 ) where CE X (ρ 0 , ρ 1 ) is the set of all paths (i.e., solutions of (1.2)-(1.4)) connecting ρ 0 and ρ 1 . We consider a graph based on 240 sample points X from a 2D two-moon data set. The connectivity distance is ε = 0.7. The edge weights are w x,y = exp(−6|x − y| 2 ) if |x − y| ≤ ε and zero otherwise. The interaction potential is K x,y = 1 − exp(−d(x, y) 2 /10), where d(x, y) is the graph distance between vertices x and y of X with edge weights 1/w x,y , and the external potential is P ≡ 0. The solution, starting from a uniform distribution, is shown at time t = 60. Brighter color indicates more mass.
It is important to observe that, in our setting, T is not symmetric (that is, T (ρ 0 , ρ 1 ) is in general different from T (ρ 1 , ρ 0 )). The reason is that in general A(ρ, v) = A(ρ, −v). Therefore the nonlocal Wasserstein distance which arises from the upwind interpolation is only a quasi-metric. The action A(ρ, v) provides a Finsler structure to the tangent space, instead of the usual Riemannian structure. Formally the system (1.2)-(1.4) is the gradient flow of E X with respect to this Finsler structure; we present a derivation of this fact in a more general setting in Section 3.1. The system is also the curve of steepest descent with respect to quasi-metric T , which is the point of view we use to create rigorous theory in the general setting.
Remark 1.1. The well-posedness of (1.2)-(1.4) is a straightforward consequence of the Picard existence theorem. Namely, note that the simplex 1 ≥ ρ x ≥ 0, x∈X ρ x = 1 is an invariant region of the dynamics and that on it the vector field (1.4) is Lipschitz continuous in ρ x , x ∈ X. Remark 1.2. One could consider other interpolations instead of the upwind. In particular, if we considered an interpolation of the form I(ρ x , ρ y ) instead of the upwind one, the only change in the gradient flow would be that the velocity-flux relation (1.3) would become j x,y = 1 n I(ρ x , ρ y )v x,y . We note that this can have major implications on the resulting dynamics. In particular for the logarithmic interpolation, I(r, s) = (r − s)/(ln r − ln s), or the geometric interpolation, I(r, s) = √ rs, the resulting dynamics would never expand the support of the solutions, so even for repulsive potentials the mass may not spread throughout the domain. On the other hand, using the arithmetic interpolation, I(r, s) = (r + s)/2, would not work directly since the solutions may become negative. It should still be possible to consider the arithmetic interpolation by restricting the set of allowed velocities.
Before we turn to the general setting we point out that the system (1.2)-(1.4) offers a new model of graph-based clustering, which is briefly discussed in Section 1.5.
1.2. General setting for vertices in Euclidean space. Here we introduce the general framework for studies of interaction equations on families of graphs and their limits as the number of vertices n goes to ∞. In particular, in the applications to machine learning which we briefly discuss in Section 1.5, the graphs considered are random samples of some underlying measure in Euclidean space, and the edge weights, as well as the interaction energy, depend on the positions of the vertices. The vertices are points in R d and the set of possible edges is The edge weights are set by a nonnegative symmetric function η : G → [0, ∞). From the discrete setting, the set of vertices is replaced by the more general notion of a measure on R d ; the discrete graphs with vertices X = {x 1 , . . . , x n } ⊂ R d correspond to µ being the empirical measure of the set of points, µ = 1 n n i=1 δ x i . The distribution of mass over the vertices is described by the measure ρ ∈ P(R d ) and in most applications we consider supp ρ ⊆ supp µ. However, in order to prove general stability results (e.g., Theorem 3.14), we need to allow that initially part of the support of ρ is outside of the support of µ; we think of such mass as outside of the domain specified by µ. The mass starting outside of the support of µ can only flow into the support of µ. Here we present the evolution assuming ρ µ, while in Sections 2 and 3 we present the setup in full generality. Furthermore, we denote by ρ both the measure and its density with respect to µ.
The evolution of interest is the gradient descent of the energy E : P(R d ) → R given by where K : R d × R d → R is symmetric and P : R d → R. This energy generalizes (1.1) in terms of the configurations ρ and specializes it in terms of the type of potentials K and P considered. The gradient flow we consider takes the form (NL 2 IE) The system (NL 2 IE) consists first of a nonlocal continuity equation, where the divergence ∇· is encoded with the graph structure described through µ and η (see Definition 2.7). Secondly, it involves a mapping from velocity to flux, which in our case is the upwind flux and encodes the geometry of the gradient structure. Finally, the third equation identifies the driving velocity as the nonlocal gradient of the variation of the energy (1.5). Overall, we obtain that (NL 2 IE) is the gradient flow of the energy E with respect to a generalization of the graph Wasserstein metric we now introduce.
Nonlocal continuity equation. Let us set and call its elements nonlocal (antisymmetric) vector fields on G; for any pair (x, y) ∈ G the value v(x, y) can be regarded as a jump rate from x to y. Let us fix a final time T > 0 throughout the paper and let a family {v t } t∈[0,T ] ⊂ V as be given. In the case ρ t µ for all t ∈ [0, T ], it is possible to combine the first two equations in (NL 2 IE) in order to arrive at the nonlocal continuity equation For general curves ρ : [0, T ] → P(R d ), it is necessary to consider the weak form of (1.7), which is discussed in Section 2.3. We remark that the general setup we develop allows for the solution ρ to develop atoms and persist even after the atoms have formed. Heuristic arguments and numerical experiments indicate that there are equations covered by our theory for which this is the case. For example, if µ is the Lebesgue measure on R, ρ 0 the restriction of the Lebesgue measure to [−0.5, 0.5], K(x, y) = |x − y| and η(x, y) = 1/(x − y) 2 , then the solutions develop delta mass concentrations at 0 in finite time. Understanding for which K and η solutions do develop finite time singularities is an interesting open problem.
We note that when defining the flux in (1.7) we define the density along edges to be the density at the source; analogously to an upwind numerical scheme. While, as we show, this leads to a convenient framework to consider the dynamics, it creates the difficulty that the resulting distance, that we are about to define, is not symmetric and is thus only a quasi-metric.
Upwind nonlocal transportation metric. We use the nonlocal continuity equation (1.7) to define a nonlocal Wasserstein quasi-distance in analogy to the Benamou-Brenier formulation [5] for the classical Kantorovich-Wasserstein distances [39]. That is, for two probability measures ρ 0 , ρ 1 ∈ P 2 (R d ), let where CE(ρ 0 , ρ 1 ) is the set of weak solutions ρ to the nonlocal continuity equation (see Definition 2.14) on [0, 1] with ρ(0) = ρ 0 and ρ(1) = ρ 1 . We note that the notion of the nonlocal Wasserstein distance for measures on R d was introduced by Erbar in [19], who used it to study the fractional heat equation. One difference is that the interpolation we consider is beyond the scope of [19]. Another difference is that here the measure µ plays an important role in how the action is measured and allows one to incorporate seamlessly both the continuum case (e.g., µ is the Lebesgue measure on R d ) and the graph case (µ is the empirical measure of the set of vertices). The notions above are rigorously developed in Section 2, where we list the precise assumption (W) on the edge weight function η and the joint Assumptions (A1) and (A2) on η and the underlying measure µ. We then rigorously introduce the action (Definition 2.3), which is a nonlocal analogue of kinetic energy; we show its fundamental properties, in particular joint convexity (Lemma 2.12) and lower semicontinuity with respect to narrow convergence (Lemma 2.9). In Section 2.3 we rigorously introduce the nonlocal continuity equation in measure-valued flux form (2.11); we introduce the notion on all of R d where µ does not initially play a role. The measure µ enters the framework by considering paths of finite action. Proposition 2.17 establishes an important compactness property of sequences of solutions. In Section 2.4 we turn our attention to the nonlocal Wasserstein quasi-metric based on the upwind interpolation, which we introduce in Definition 2.18. The compactness of solutions of the nonlocal continuity equation and the lower semicontinuity of the action imply the existence of (directed) geodesics (Proposition 2.20). Following the work of Erbar [19] we show that the nonlocal Wasserstein quasi-metric generates a topology on the set of probability measures which is stronger than the W 1 topology (i.e., the Monge distance or the 1-Wasserstein metric). Analogously to [2] we show the equivalence between the paths of finite length with respect to the quasi-metric and the solutions of the nonlocal continuity equation with finite action (Proposition 2.20). The set of probability measures endowed with the quasi-metric T has a formal structure of a Finsler manifold, and parts of this structure can be described; in particular, in (2.26) we describe the tangent space at a given measure ρ using the fluxes. We note that using fluxes, instead of velocities, is necessary since, because of the upwinding, the relation between the velocities and the tangent vectors is not linear (Proposition 2.26). We conclude Section 2 by showing that, given a measure µ, the finiteness of the action ensures that any path starting within the support of µ will remain within the support of µ (Proposition 2.28).
Nonlocal nonlocal-interaction equation. In Section 3 we develop the existence theory of the equation (NL 2 IE) based on the interpretation as the gradient flow of E with respect to the quasi-metric T defined in (1.8). We begin by listing the precise conditions (K1)-(K3) on the interaction kernel K. We note that these are less restrictive than the typical conditions for the well-posedness of the standard nonlocal-interaction equation in Euclidean setting [2,8].
Before we turn to the rigorous theory of weak solutions as curves of maximal slope on quasi-metric space, we discuss the gradient flow structure in a more geometric setting, namely the Finsler structure related to T . Indeed, the action (formally given by the time integrand in (1.8), and rigorously defined by (2.3)) defines a positively homogeneous norm (namely a Minkowski norm) on the tangent space. The Hessian of the square of the norm endows the tangent space at each measure with the formal structure of a Riemann manifold. We compute this Riemann metric in Appendix A under an absolute-continuity assumption. With this assumption, we show that (NL 2 IE) is the gradient flow of E with respect to the Finsler structure in Section 3.1. For simplicity, we consider P ≡ 0, since the extension to P ≡ 0 is straightforward, as it is explained in Remark 3.2.
In Section 3.2 we develop the rigorous gradient descent formulation based on curves of maximal slope in the space of probability measures endowed with the quasi-metric T . The theory of gradient flows in the spaces of probability measures endowed with the standard Wasserstein metric was developed in [2]. Here we extend it to the setting of quasi-metric spaces, endowed with the nonlocal Wasserstein distance. This requires several delicate arguments. We start by introducing the notions of one-sided strong upper gradient (Definition 3.12) and curves of maximal slope (Definition 3.8). We define the local slope D in (3.17) by using a heuristically derived gradient of the energy E, and show, using a chain rule established in Proposition 3.10, that √ D is a one-sided strong upper gradient for E with respect to T . One of our main results is Theorem 3.9, which establishes the equivalence between curves of maximal slope and weak solutions of (NL 2 IE). In Section 3.4 we prove several important results. Namely Theorem 3.14 establishes that the De Giorgi functional G T is stable under variations of the base measure µ and of the solutions. A consequence of this result is the convergence of solutions of (NL 2 IE) on graphs defined on random samples of a measure to solutions of (NL 2 IE) corresponding to the full underlying measure (Remark 3.17). The proof of Theorem 3.14 relies on the lower semicontinuity of the local slope (Lemma 3.12) and the lower semicontinuity of the De Giorgi functional (3.13). Another important consequence is the existence of weak solutions of (NL 2 IE), which is proved in Theorem 3.15.
1.3. Relation to the numerical finite-volume upwind scheme. Equation (1.7) can be interpreted in several ways. For example, it can be understood as the master equation of a continuous-time and continuous-space Markov jump process on the graphon (R d , G, η), that is, a continuous graph with vertices R d , possible edges G and symmetric weight η(x, y) for (x, y) ∈ G. The stochastic interpretation is that a particle at position x ∈ R d jumps according to the measure v(x, y) + η(x, y) dµ(y) to y ∈ R d . In this way it gives rise to a Markov jump process related to the numerical upwind scheme.
The numerical upwind scheme is one of the basic finite-volume methods used to solve conservation laws; see [23]. To draw the connection, let {x 1 , . . . , x n } be a suitable representative of a tessellation {K 1 , . . . , K n }, for instance a Voronoi tessellation, of some bounded domain Ω ⊂ R d . Let µ be the Lebesgue measure on Ω and take η to be the transmission coefficient common in finite-volume Hausdorff measure of the common face between K i and K j . With this choice the equation (1.7) becomes the (continuous-time) discretization of the classical continuity equation for some vector field v t : Ω → R d . Hereby, the discretized vector field v t is obtained from v t by taking the average over common interfaces: where ν K i ,K j is the unit normal to K i pointing from K i to K j . We refer to the recent work [7] for a variational interpretation of the upwind scheme, which is close to that we propose for the more general equation (1.7). Earlier results in this direction are contained in [17,30]. The connection to finite-volume schemes explains also that the nonlocality in (1.7) introduces a regularization, which in the numerical literature is referred to as numerical diffusion. That the numerical diffusion is actually an honest Markov jump process, as described at the beginning of this section, was observed and used to find optimal convergence rates in the works [15,16,35,36].

1.4.
Comparison with other discrete metrics and gradient structures. The interpretation of diffusion on graphs as gradient flows of the entropy was independently carried out in [11,28,29].
Here we recall the descriptions of the flows relying on reversible Markov chains, which was the framework used in [21,22,28]. Starting with Markov chains, which then determine the edge weights, offers an additional layer of modeling flexibility. In particular, consider the Markov chain with state space X = {x 1 , . . . , x n } and jump rates {Q x,y } x,y∈X . Let π x be the reversible probability measure for the Markov chain, meaning that it satisfies the detailed balance condition π x Q x,y = π y Q y,x . The edge weights {w x,y } x,y∈X are given by w x,y = π x Q x,y . The energy considered is the relative entropy: for ρ : X → [0, 1] with x∈X ρ x = 1 we define The paths in the configuration space are given as the solution of the continuity equation which for the flux {j x,y : [0, T ] → R} x,y∈X takes the form (1.2). To compute the flux from a given velocity {v x,y } x,y∈X (an edge-based quantity) and density {ρ x } x∈X (a vertex-based quantity), one interpolates the densities at vertices to define the density (and hence the flux) along the edges. The literature so far has considered a proportional constitutive relation of the form where the function θ : R + × R + → R + needs to be one-homogeneous for dimensional reasons. In addition, it is assumed that the function θ is an interpolation, that is, min{a, b} ≤ θ(a, b) ≤ max{a, b}. The choice providing a gradient flow characterization for linear Markov chains is the logarithmic mean, defined by θ(a, b) = a−b log a−log b for a = b and θ(a, a) = a. The associated transportation distance is obtained by minimizing the action functional The corresponding transportation distance is induced as the minimum of the action along paths: 1] solves (1.2) and ρ(0) = ρ 0 , ρ(1) = ρ 1 .
As we do in Corollary 2.8, it was shown that it suffices to consider antisymmetric fluxes. To arrive at a gradient flow formulation, one considers the metric induced by the action function (1.11): Then the gradient grad H of the relative entropy (1.9) with respect to this metric is given as the antisymmetric flux j * of minimal norm satisfying for any curve (ρ(t)) t≥0 such that ∂ t ρ(0) = − ∇ · j . Expanding (1.13) and using that j * is antisymmetric gives Since this identity holds for all j x,y , the flux j * is identified by where the last equality holds for the particular choice of the logarithmic mean interpolation θ(r, s) = r−s ln r−ln s . By plugging j * x,y into the continuity equation (1.2), one recovers the (linear) heat equation on graphs.
The next relevant step is the introduction of the interaction and the potential energies as in (1.1). In particular, [21] provides a gradient structure for free energy functionals of the form where β > 0 is the inverse temperature. This is nontrivial since considering the logarithmic interpolation, which makes the diffusion term linear, would make the potential term nonlinear, and thus the Fokker-Planck equation on graphs would be nonlinear. To cope with this, the framework of [21] extends the linear theory outlined above to a family of nonlinear Markov chains satisfying a local detailed balance condition. The consequence for the resulting gradient structure is that the quantities {π x } x∈X , {Q x,y } x,y∈X and {w x,y } x,y∈X depend on the current state ρ in such a way that the detailed balance condition w x, is still valid for all ρ ∈ P(X).
In particular, for F β defined in (1.14), it holds that It would be natural to try to build the framework for the case β = ∞, which we consider in this paper, by taking the limit β → ∞ in the framework of [21]. It turns out that this limit is singular for the constructed gradient structure. First of all, the measure π x [ρ] degenerates at all points except at the argmin of the effective potential x → P x + y K x,y ρ y . This causes the constitutive relation (1.10) to become meaningless. A more detailed analysis also shows that the metric in (1.12) degenerates.
We also note that in this setting the potential functions P and K and inverse temperature β enter the metric in (1.11) through the weights w x,y and rate matrix Q x,y . This is in stark contrast with the continuous classical gradient flow formulation for free energies of the form F β form (1.14), where the metric is always the L 2 -Wasserstein distance, independently of the potentials P and K and also of the inverse temperature β > 0, including β = ∞ [2,8,9,26].
The above problems lead us to consider the upwind interpolation in the flux-velocity relation (1.10). In view of (1.2), this relation is replaced in the present setting by Note that the relation (1.15) is a functional relation between velocity and flux with the interpolation Θ depending on the velocity. We remark that solutions of system (1.2)-(1.4) are not the limit of the gradient flows in [21] as β → ∞. We emphasize here that the limit of these dynamics as β → ∞ would in fact not be the desirable gradient flow of the nonlocal-interaction energy, since the initial support of the solutions would never expand; see the related Remark 1.2.
We conclude this section by observing that it seems possible to generalize the upwind interpolation in a continuous way to define a flux-velocity relation to deal with free energies F β for β > 0. A candidate, inspired by the Scharfetter-Gummel scheme [34], is the following constitutive flux-velocity relation depending on β: In particular it holds that j β x,y → j x,y as β → ∞, where j x,y is as in (1.15). The form of j β x,y can be physically deduced from the one-dimensional cell problem for the unknown value j β x,y ∈ R and function ρ : Note that j β x,y = ρx−ρy β for v x,y = 0, which is the flux due to Fick's law. Likewise, j β x,y = 0 for v x,y = β −1 log ρy ρx , which is the velocity needed to counteract the diffusion. This direction of inquiry will be pursued in future research.
1.5. Connections to data science. Part of the motivation for the present work comes from applications to data science. Here we introduce a family of nonlinear gradient flows that is relevant to discovering local concentrations in networks akin to modes of a distribution.
Our main interest is in equations posed on graphs whose vertices are random samples of some underlying distribution and whose edge weights are a function of distances between vertices. In machine learning one often deals with data in the form of a point cloud in high-dimensional space. While the ambient dimension may be very large, the data often possess an underlying low-dimensional structure that can be used in making reliable inferences about the underlying data distribution. To use the geometric information, we follow one of the standard approaches and consider graphs associated to point clouds. Formulating the machine learning tasks directly on the point cloud enables one to access the geometric structure of the distribution in a simple and computationally efficient way. The works in the literature have mostly focused on models based on minimizing objective functionals modeling tasks such as clustering or dimensionality reduction [4,24,25,27,31], or based on characterizing clusters through estimating some property of the data distribution (most often the density); see [10] and references therein. Only few dynamical models have been considered-notable among them are diffusion maps [12], where the heat equation is used to redistance the points.
Here we focus on models that are motivated by nonlocal PDEs. Consider a probability measure µ on R d with finite second moments. Let X = {x 1 , . . . , x n } be random i.i.d. samples of the measure µ. Let µ n = 1 n n i=1 δ x i be the empirical measure of the sample and let K : . We consider a random geometric graph based on 240 sample points X from a 2D bean data set. The connectivity distance is ε = 0.23. The edge weights are w x,y = exp(−24|x − y| 2 ), provided that the vertices x and y of X are connected. The interaction potential is K x,y = 1 − exp(−8|x − y| 2 ) and the external potential is P ≡ 0. The solution, starting from a uniform distribution, is shown at time t = 200. Brighter color indicates more mass (right). and P : R d → R. The total energy E X : P(X) → R, given in (1.1), for the empirical measure µ n can be rewritten as The gradient flow of E X with respect to the graph Wasserstein metric T µ n defined in (1.8) is described by Another evolution by such system is illustrated on Figure 2.
Here we remark on the contrast between (1.2)-(1.4) and the gradient flow of (1.16) in the ambient space R d , with respect to the standard Wasserstein metric, which takes the form The first notable difference is that, on the graph, masses change and the positions remain fixed, while in R d positions change and the masses remain fixed. This difference is somewhat superficial, since both equations describe the rearrangement of mass in order to decrease the same energy in the most efficient way measured by two different metrics. The main difference is that the graph encodes the geometry of the space that mass is allowed to occupy. In particular it ensures that the geometric mode discovered will be a data point itself.
We note that the popular mean-shift algorithm [13] can be interpreted as a time-stepping algorithm to approximate solutions of (1.17) with K ≡ 0 and P = ln(θ * ρ), where θ * ρ is the kernel density estimate for the density ρ of the underlying distribution. We note that considering the gradient flow of the corresponding energy on the graph (1.2)-(1.4) ensures that the modes of the distribution discovered by the (graph) mean-shift algorithm will remain within the data set. Furthermore, we note that adding nonlocal attraction on the graph progressively clumps nearby masses together and thus provides an approach to agglomerative clustering.
One of our main results, stated in Theorem 3.14, is that as n → ∞ the solutions of the graphbased equation (1.2)-(1.4) narrowly converge along a subsequence to a solution of the nonlocal nonlocal-interaction equation (NL 2 IE).

Nonlocal continuity equation and upwind transportation metric
2.1. Weight function. Throughout the paper we set G := {(x, y) ∈ R d × R d : x = y} and we consider the weight function η : G → R, which shall always satisfy Since η is symmetric, we regard (G, η) as undirected weighted graph. Many of the edge-based quantities we consider, like vector fields and fluxes, will lie in an η-weighted where the factor 1 2 ensures that each undirected edge is counted only once. Below we state two assumptions on the base measure µ ∈ M + (R d ) and the weight function η, where we use the notation ∨ to denote the maximum.
(A2) (local blow-up control) The family of measures {|x − ·| 2 η(x, ·)µ} x∈R d is locally uniformly integrable with respect to µ, that is, Example 2.2. Typically the function η is a function of the distance: where ϑ : (0, ∞) → [0, ∞) is lower semicontinuous and satisfies analogues of (A1) and (A2). An important example are geometric graphs with connectivity distance given by ε > 0 and weight In this example, fixing µ = Leb(R d ), we conjecture that the weak formulation of (NL 2 IE)-see Section 3-converges to the nonlocal aggregation equation ∂ t ρ t = ∇ · (ρ t ∇K * ρ t + ρ t ∇P ) as ε → 0 for sufficiently smooth potentials K and P . See Section 3.5 for a discussion on the local limit.

2.2.
Action. The form of the action inside (1.8) seems practical, but it does not have any obvious convexity and lower semicontinuity properties. Therefore, we define the action in flux variables. We start by introducing some notation. For a signed measure j ∈ M(G), we denote by j = j + − j − its Jordan decomposition. Moreover, for any measurable A ⊆ G, let A = {(y, x) ∈ R d ×R d : (x, y) ∈ A} be its transpose. Likewise, for j ∈ M(G), we denote by j the transposed measure defined by For any measures µ ∈ M + (R d ) and ρ ∈ P(R d ), we define the (restricted) product measures and Note that γ 1 = γ 2 . We define the action for general η which we only require to satisfy Assumption (W), i.e., lower semicontinuity, symmetry and positivity.
Remark 2.4. We note that the action is inversely proportional to the measure µ: doubling the measure µ leads to halving the action. This has important consequence to the way µ influences the geometry of the space of measures. In particular, µ not only sets the region where mass can be transported, but also makes the transport less costly in the regions of high density of µ.
Remark 2.5. If ρ µ, then we denote its density by ρ by abuse of notation, and if furthermore j µ ⊗ µ with density j, then it holds In the following lemma we can see that the action takes the form from the tentative definition of the metric in (1.8) as soon as it is bounded.
and it holds In particular, if v ∈ V as , then y), and d|j| =j dλ for some measurableγ 1 ,γ 2 ,j : G → R. Without loss of generality we can assume λ to be symmetric; for instance by considering 1 By the definition of the function α in (2.4), it immediately follows that the vector fieldṽ The statement (2.6) follows by using the positively one-homogeneity of α, the identity α(j, r) = α(j + , r) and the symmetry of λ:

Definition 2.7 (Nonlocal gradient and divergence). For any function
In particular, for j ∈ M as (G) : If j is given by (2.5) for some v ∈ V as , then the flux satisfies an antisymmetric relation on the support of η γ 1 , i.e., j + = (j ) − η γ 1 -a.e. The following corollary shows that those antisymmetric fluxes are the relevant ones for the minimization of the action functional. For this reason, the natural class of fluxes are those which are antisymmetric with positive part absolutely continuous with respect to η γ 1 , that is, Corollary 2.8 (Antisymmetric vector fields have lower action). Let µ ∈ M + (R d ), ρ ∈ P(R d ) and j ∈ M(G) be such that A(µ; ρ, j) < ∞. Then there exists an antisymmetric flux j as ∈ M as ηγ 1 such that ∇ · j = ∇ · j as , with lower action: Proof. Let us set j as = (j − j )/2. Since η is symmetric and ∇φ = −∇φ, we get By an application of Lemma 2.6 and comparison of (2.6) and (2.7) it is enough to show that, for all (x, y) ∈ G, This estimate is a consequence of Jensen's inequality applied to the convex functions

Lemma 2.9 (Lower semicontinuity of the action). The action is lower semicontinuous with respect to the narrow convergence in
Proof. The narrow convergence of (ρ n ) n and (µ n ) n imply the narrow convergence of the product: . For any n ∈ N, let us define the vector-valued measure and the function Since the function η is lower semicontinuous by (W) and α defined in (2.4) is lower semicontinuous, jointly convex and positively one-homogeneous, f satisfies the assumptions of [6, Theorem 3.4.3], whence the claim follows.
As a consequence of the previous results we have the following corollary, which will be useful in Section 2.3.
Proof. Let us consider the case where the last estimate follows from (A1) and the integral is finite since ρ ∈ P(R d ).

Lemma 2.12 (Convexity of the action). Let
Proof. Let us consider a measure λ ∈ M(G) such that dγ i j =γ i j dλ and dj i =j i dλ for i = 0, 1 and j = 1, 2. Then, the convex combinations are such that dγ τ j =γ τ j dλ and dj τ =j τ dλ, wherẽ Using the convexity of the function α we get the result, that is,

Nonlocal continuity equation.
In view of the considerations made in Section 2.2, we now deal with the nonlocal continuity equation where (ρ t ) t∈[0,T ] and (j t ) t∈[0,T ] are unknown Borel families of measures in P(R d ) and M(G), respectively. Equation (2.11) is understood in the weak form: Since |∇ϕ(x, y)| ≤ ||ϕ|| C 1 (2 ∧ |x − y|), the weak formulation is well-defined under the integrability condition Remark 2.13. The integrability condition (2.13) is automatically satisfied by a pair (ρ t , j t ) t∈[0,T ] such that T 0 A(µ; ρ t , j t ) dt < ∞, due to Corollary 2.11. Hence we arrive at the following definition of weak solution of the nonlocal continuity equation:

We denote the set of all weak solutions on the time interval
(2.14) We now prove propagation of second-order moments.

Lemma 2.16 (Uniformly bounded second moments). Let
We proceed by considering the time derivative of the second-order moment of ρ n t for all t ∈ [0, T ] and n ∈ N. Since x → |x| 2 is not an admissible test function in (2.12), we introduce a smooth cut-off function R . Then, we can use the definition of solution with the function ψ R (x) = ϕ R (x) 2 (|x| 2 + 1) and apply Lemma 2.10 with Φ = ∇ψ R to obtain, for all t ∈ [0, T ] and n ∈ N, For R ≥ 1, we estimate, for all (x, y) ∈ G, 15) and observe that Hence the first term in (2.15) is bounded by 32|x − y| 2 . For the second term in (2.15), we abbreviate by setting r = ϕ R (x)|x| and s = ϕ R (y)|y| and compute the bound It is easy to check that x → ϕ R (x)|x| is globally Lipschitz and we can conclude that, for some numerical constant C > 0, for all (x, y) ∈ G we have Thus, by sending R → ∞ and using (A1), it follows that By integrating the above differential inequality, we arrive at the bound Then, there exists (ρ, j) ∈ CE T such that, up to a subsequence, as n → ∞ it holds with ρ t ∈ P 2 (R d ) for any t ∈ [0, T ]. Moreover, the action is lower semicontinuous along the above subsequences (µ n ) n , (ρ n ) n and (j n ) n , i.e., Proof. We argue similarly to [18,Lemma 4.5], [19,Proposition 3.4]. For each n ∈ N we define j n ∈ M(G × [0, T ]) as dj n (x, y, t) = η(x, y) dj n t (x, y) dt = dĵ n t (x, y) dt. In view of Lemma 2.16 there exists C 2 > 0 such that sup t∈[0,T ] sup n∈N M 2 (ρ n t ) ≤ C 2 < +∞. Let us consider any compact sets K ⊂ supp η ⊆ G and I ⊆ [0, T ]. We apply the bound (2.10) of Corollary 2.11 and the Cauchy-Schwarz inequality to get sup n∈N |j n |(K × I) ≤ sup Now, as we need to pass to the limit in (2.12), we consider a function ξ ∈ C ∞ c (R d ) and an (x, y) has no compact support in [t 0 , t 1 ] × G, so we proceed by a truncation argument. Let ε > 0 and let us set We need to estimate terms for which ϕ ε (t, x) < 1. First, setting I c ε = [t 0 , t 1 ] \ I ε , we note that Likewise, using the symmetry, we arrive at which vanishes as ε → 0 in view of Assumption (A2). Finally, the last term is estimated again using (A1): By means of the last convergence, the tightness of (ρ n 0 ) n , and (2.14) with ϕ(t, x) = ξ(x), t 0 = 0 and t 1 = T , we obtain that (ρ n t ) n locally narrowly converges to some finite nonnegative measure ρ t ∈ M + (R d ) for any t ∈ [0, T ]. In particular, for any ξ ∈ C ∞ c (R d ) and any t ∈ [0, T ], we have Now, for R > 0, let us consider a function ξ R ∈ C ∞ c (R d ) such that 0 ≤ ξ ≤ 1, ξ = 1 on B R , and ξ C 1 ≤ 1. Because of the integrability condition (2.13), satisfied thanks to Corollary 2.11, we have Hence the measure ρ t is actually a probability measure on R d for all t ∈ [0, T ]. Moreover Lemma 2.16 ensures that the convergence is global and not only local. As a direct consequence of the previous considerations, (ρ, j) ∈ CE T and the lower semicontinuity follows from Lemma 2.9.

Nonlocal upwind transportation quasi-metric.
Here, we give a rigorous definition of the nonlocal transportation quasi-metric we introduced in (1.8). Let us recall that η : G → R is the weight function satisfying (W).  (A2), and ρ 0 , ρ 1 ∈ P 2 (R d ), the nonlocal upwind transportation cost between ρ 0 and ρ 1 is defined by If µ is clear from the context, the notation T is used in place of T µ .
Note that Proposition 2.17 ensures the existence of minimizers to (2.21), when T µ < ∞, which holds when there exists a path of finite action. On the other hand, if this is not the case, the nonlocal upwind transportation cost is infinite. For example, consider the graph with vertices set by µ and η which is disconnected, meaning that there are x, y ∈ supp µ such that there is no sequence (x 0 = x, x 1 , . . . , x n−1 , x n = y) n with η(x i , x i+1 ) > 0 for all i = 0, . . . , n − 1; in this case, T µ (δ x , δ y ) = ∞ since the set of solutions to the continuity equation CE(δ x , δ y ) is empty.
Due to the one-homogeneity of the action density function α in (2.4), we have the following reparametrization result, which is similar to [18,Theorem 5.4]. (A1) and (A2), and any ρ 0 , ρ T ∈ P 2 (R d ), it holds that

Proposition 2.20. For any µ ∈ M + (R d ) satisfying Assumptions (A1) and
The next proposition establishes a link between T µ and the W 1 -distance. only on µ and η). Then for any ρ 0 , ρ 1 ∈ P 2 (R d ) it holds Proof. By a standard regularization argument and the truncation procedure as in the proof of Lemma 2.16, we can actually consider any 1-Lipschitz function ψ as a test function in the weak formulation (2.12) for some (ρ, j) ∈ CE(ρ 0 , ρ 1 ). Then we can estimate, by Lemma 2.10 and assumption (A1), Taking the supremum over all 1-Lipschitz functions and the infimum in the couplings (ρ, j) ∈ CE(ρ 0 , ρ 1 ) gives the result.
The results above show that T µ is an extended (meaning that it can take value ∞) quasi-metric on the set of probability measures which induces a topology stronger than the W 1 -topology: Theorem 2.22. Let µ ∈ M + (R d ) satisfy Assumptions (A1) and (A2). The nonlocal upwind transportation cost T µ defines an extended quasi-metric on P 2 (R d ). The map (ρ 0 , ρ 1 ) → T µ (ρ 0 , ρ 1 ) is lower semicontinuous with respect to the narrow convergence. The topology induced by T µ is stronger than the W 1 -topology and the narrow topology. In particular, bounded sets are narrowly relatively compact in (P 2 (R d ), T µ ).
Proof. If T µ (ρ 0 , ρ 1 ) = 0, then A(µ; ρ t , j t ) = 0 for a.e. t ∈ [0, 1]. Hence j t ≡ 0 γ t -a.e., which implies that ρ 0 ≡ ρ 1 by the nonlocal continuity equation (2.14). The triangle inequality is a consequence of Lemma 2.19 and the fact that solutions to the nonlocal continuity equation can be concatenated. The lower semicontinuity and compactness properties of T µ are inherited from the action functional A via Proposition 2.17. In view of the comparison with W 1 from Proposition 2.21, we have that the topology induced by T µ is stronger than that induced by W 1 and the narrow topology.
The next lemma provides a quantitative illustration of asymmetry of T .
Now, let us assume without loss of generality that ρ 0 < ν 0 . Obviously, in this configuration we can restrict the above infimum among non-decreasing g, as it gives a lower action. Therefore, by applying Jensen's inequality, we have The equality case is obtained by noting that the solution to − d with consistent boundary values g 0 = ρ 0 and g 1 = ν 0 , is given by The case ν 0 < ρ 0 is obtained in a similar manner, which gives formula (2.22).

Hereby, the previous identity holds if and only ifj
Proof. The first statement on the characterization of absolutely continuous curves as curves of finite action follows from [ Recalling the notation for the Jordan decomposition of a measure from Section 2.2, note that we use that the functional j → A(µ; ρ, j) is strictly convex for j ∈ M(G) such that j + η ρ ⊗ µ and j − η µ ⊗ ρ, which is guaranteed above since A(µ; ρ, j) < ∞ and j ∈ M as ηγ 1 (G). Then, we observe the set {j ∈ M as ηγ 1 (G) : ∇ · j = ∇ · j t } is closed with respect to the narrow convergence. In addition, the estimate (2.9) from Lemma 2.10 with Φ(x, y) = |x − y| ∨ |x − y| 2 gives showing that the sublevel sets of j → A(µ; ρ t , j) are locally relatively compact with respect to the narrow convergence, arguing as in the proof of Proposition 2.17. Hence the elementj t is well-defined by applying the direct method of calculus of variations.
We defined the tangent space T ρ P 2 (R d ) in (2.26) using the nonlocal fluxes j. We note that this is in some way a nonlocal, Lagrangian description of the tangent vectors and that the relationship between this Lagrangian description and the Eulerian description is the nonlocal continuity equation which is satisfied in the weak sense. This provides a useful heuristic, but as for classical Wasserstein gradient flows [2] the precise, rigorous definition of the tangent space is in Lagrangian form; we note, however, that here we use fluxes instead of velocities. This is not just a superficial difference. Namely, as can be seen in Proposition 2.26, the relation between velocities and fluxes is not linear and thus the velocities do not provide a linear parametrization of the tangent space. We use the argument from [18,Theorem 5.21] to characterize the tangent space T ρ P 2 (R d ) in more detail:  (A2), and ρ ∈ P 2 (R d ). Then, it holds that j ∈ T ρ P 2 (R d ) if and only if j + γ 1 , j − γ 2 , and v + : Proof. If A(µ; ρ, j) < ∞, then by Lemma 2.6 it holds for some v ∈ V as that where γ + = γ 1 | J + , with J + = supp j + , and we used that (J + ) = supp j − . Then, by recalling the definition of the norm on L 2 (η γ 1 ) from (2.1), . By using the relation between j and v from above, we can rewrite the divergence ∇ · j in weak form for any ψ ∈ C ∞ c (R d ): Hence v + belongs to the closure of {∇ϕ : ϕ ∈ C ∞ c (R d )} in L 2 (η γ + ). From the antisymmetry of v follows that v − belongs to the closure of {∇ϕ : ϕ ∈ C ∞ c (R d )} in L 2 (η γ − ). Thus, the conclusion follows from the identity γ + + γ + =γ v .
Remark 2.27. Proposition 2.26 shows that for µ as in its statement, ρ ∈ P 2 (R d ) and j chosen from a dense subset of T ρ P 2 (R d ), there exists a measurable ϕ : R d → R such that we have the identity Finally, we provide an interesting property of absolutely continuous curves. Proof. Since (ρ t ) t∈[0,T ] is absolutely continuous, there exists by Proposition 2.25 a unique family (j t ) t∈[0,T ] such that (ρ, j) ∈ CE T and j t ∈ T ρt P 2 (R d ) ⊆ M as ηγ 1,t (G), where γ 1,t = ρ t ⊗ µ, and |ρ t | 2 = A(µ; ρ t , j t ) for a.e. t ∈ [0, T ]. In particular, by Lemma 2.6, there exists a measurable family Without loss of generality, let (ρ t ) t∈[0,T ] be the weakly continuous curve from Lemma 2.15 satisfying, for any test function ϕ ∈ C ∞ c (R d ) and t ∈ [0, T ], which implies that supp ρ t ⊆ supp µ, since ρ t ∈ P(R d ) is in particular a nonnegative measure for all t ∈ [0, T ] by Lemma 2.15.

Nonlocal nonlocal-interaction equation
In this section we consider gradient flows in the spaces of probability measures P 2 (R d ) endowed with the nonlocal transportation quasi-metric T µ , defined by (2.21). From now until Section 3.4 (excluded) we fix µ ∈ M + (R d ) satisfying (A1) and (A2), unless otherwise specified. For this reason we shall use the simplifications A(ρ, j) for A(µ; ρ, j) and T for T µ .
In this section investigate the nonlocal nonlocal-interaction equation (NL 2 IE) as a gradient flow with respect to the metric T . We restate it in a one-line form and note that from now on we consider the external potential P ≡ 0. The extension to P ≡ 0 is straightforward; see Remark 3.2.
In the classical setting of gradient flows in the spaces of probability measures endowed with the Wasserstein metric [2,8], the nonlocal-interaction equation is the gradient flow of the nonlocal-interaction energy We start by discussing the geometry of (NL 2 IE) and interpret it as the gradient flow of (3.2) in the infinite-dimensional Finsler manifold of measures endowed with the Finsler metric associated to T . Following this, we develop a framework of gradient flows in the quasi-metric space T , which extends the setup of gradient flows in metric spaces [2] to quasi-metric spaces. In particular we build the existence theory for (NL 2 IE) based on this approach. Above, for simplicity, (NL 2 IE) was written for ρ µ, where we recall that we used the notation ρ to denote both the measure and the density with respect to µ. Our framework, however, also applies to the case when ρ is not absolutely continuous with respect to µ. The general weak form of (NL 2 IE) is obtained in terms of the nonlocal continuity equation as introduced in Section 2.3. Specifically, we have the following definition.
the pair (ρ, j) is a weak solution to the continuity equation according to Definition 2.14.
Here we list the assumptions on the interaction kernel K : R d × R d → R we refer to throughout this section: (K3) K is L-Lipschitz near the diagonal and at most quadratic far away, that is there exists some L ∈ (0, ∞) such that, for all (x, y), Note that the bound (3.3) implies that E : Remark 3.2. Assumption (K3) implies that, for some C > 0 and all x, y ∈ R d , and bounding the maximum (∨) by the sum, we arrive at |K(x, y)| ≤ L + 2L |(x , y )| 2 + |(x, y)| 2 + |K(x , y )|, which gives (3.3) with C = 2L 1 + |(x , y )| 2 + |K(x , y )|. As mentioned previously, the theory in this section can be easily extended to energies of the form (1.5) including potential energies E P (ρ) = R d P dρ for some external potential P : R d → R satisfying a local Lipschitz condition with at-most-quadratic growth at infinity; that is, similarly to (K3), there exists L ∈ (0, ∞) so that for all x, y ∈ R d we have We now show that, under the above assumptions on the interaction potential K, we have narrow continuity of the energy.

Proposition 3.3 (Continuity of the energy). Let the interaction potential K satisfy Assumptions (K1)-(K3). Then, for any sequence
Proof. Let (ρ n ) n ⊂ P 2 (R d ) and ρ ∈ P 2 (R d ) be such that ρ n ρ as n → ∞. For all R > 0, we write B R the closed ball of radius R centered at the origin in (R d ) 2 and ϕ R : (R d ) 2 → R a continuous function such that ϕ(z) = 1 for all z ∈ B R , ϕ R (z) = 0 for all z ∈ (R d ) 2 \ B 2R , and ϕ R (z) ≤ 1 for all z ∈ (R d ) 2 . For all R > 0, we then set K R = ϕ R K and Since (ρ n ) n converges narrowly to ρ as n → ∞ and K R is bounded and continuous, we get Furthermore, since K R → K pointwise as R → ∞, |K R | ≤ |K| for all R > 0, the domain of E is P 2 (R d ) and ρ ∈ P 2 (R d ), we also have by the Lebesgue dominated convergence theorem. Similarly, we also have By a diagonal argument, we deduce the result.
3.1. Identification of the gradient in Finsler geometry. Since the nonlocal upwind transportation cost T is only a quasi-metric, the underlying structure of P 2 (R d ) does not have the formal Riemannian structure as it does in the classical gradient flow theory, but a Finslerian structure instead. This highlights the fact that at every point ρ ∈ P 2 (R d ) the tangent space T ρ P 2 (R d ) is not a Euclidean space, but rather a manifold in its own right.
In this section we provide calculations, in the spirit of Otto's calculus, that characterize the gradient descent in the infinite-dimensional Finsler manifold of probability measures endowed with the nonlocal transportation quasi-metric T . To keep the following considerations simple, we assume that ρ is a given probability measure which is absolutely continuous with respect to µ. In this way, we avoid the need to introduce yet another measure λ ∈ M + (G) with respect to which all of the occurring measures are absolutely continuous, similar to how we proceeded in Definition 2.3 for the action. This restriction is done solely to make the presentation clearer and highlight the geometric structure. Hence any flux j of interest is absolutely continuous with respect to µ ⊗ µ and we can think of j via its density with respect to µ ⊗ µ, which we shall denote by j (using a letter which is not bold).
At every tangent flux j ∈ T ρ P 2 (R d ) we define an inner product g ρ,j : where {j > 0} is an abbreviation for {(x, y) ∈ G : j(x, y) > 0} and similarly for {j < 0}. The ratios are well-defined since ρ cannot be zero where j is not zero. We note that this is the bilinear form that corresponds to the quadratic form defining the action (see Definition 2.3 and Remark 2.5); namely, g ρ,j (j, j) = A(µ; ρ, j).
We refer the reader to Appendix A for a derivation of this inner product from a Minkowski norm on T ρ P 2 (R d ) as it is required in Finslerian geometry. We recall that from Proposition 2.26 follows that j is a tangent flux if and only if there is a potential ϕ ∈ C ∞ c (R d ) such that, for µ ⊗ µ-a.e. (x, y) ∈ G, In this Finsler setting, we now want to determine the direction of steepest descent from ρ, for the underlying energy defined in (3.2). The gradient vector of some energy E : P(R d ) → R at ρ, which we denote by grad E(ρ), is defined as the tangent vector which satisfies provided this vector exists and is unique. Here, we use the continuity equation Definition 2.14 to define variations via whereρ is any curve such thatρ 0 = ρ and d dt t=0ρ t = −∇ · j. From Definition 2.7, due to µ ⊗ µ-absolute continuity of j we have that for µ-a.e. x ∈ R d .
In the case, when M is a finite-dimensional Finsler manifold, such gradient vector exists and is unique since the mapping : T ρ M → (T ρ M) * , j → g ρ,j (j, ·), is a bijection; see [14,Proposition 1.9]. For further details into Finsler geometry, we refer the reader to [3,38]. In our case, we can at least claim that the functional ρ : is injective η µ ⊗ µ-a.e.; that is, the existence of a gradient implies its uniqueness (η µ ⊗ µ-a.e.), in which case we have ρ (grad E(ρ)) = Diff ρ E. To see the injectivity of (3.6), we first note that ρ is positively 1-homogeneous by definition. Moreover, we have the following one-sided version of a Cauchy-Schwarz-type estimate Here, we also used that Note that the above inequalities become strict if any of the integrands j 2 (x, y) + j(x, y) − or j 2 (x, y) − j(x, y) + have a contribution. In particular, we could have ρ (j)(j 2 ) = −∞ although the right-hand side is finite. Despite this, we still have equality in (3.7) if and only if j 2 = βj 1 η µ ⊗ µ-a.e. for some β ≥ 0.
The direction of the steepest descent on Finsler manifolds is in general not − grad E(ρ), but is defined to be the tangent flux, which we denote by grad − E(ρ), such that for all j ∈ T ρ P 2 (R d ).
By injectivity and positive 1-homogeneity of ρ , we get The gradient flows with respect to E in the Finsler space (P 2 (R d ), T ) can thus be written These considerations stay valid for general energy functionals E : Let us compute the gradient flux for the specific case of the interaction energy (3.2). A direct computation using the symmetry of K gives, for all j ∈ T ρ P 2 (R d ), where by comparison with (3.6), we observe that grad − E(ρ) is given for µ ⊗ µ-a.e. (x, y) ∈ G by This shows by (3.8) the existence and by our previous argument also uniqueness of grad − E(ρ).
It is easily observed that it has exactly the form (3.5) with the corresponding potential given by ϕ = −K * ρ. We conclude this section by mentioning that the Finslerian gradient flow structure of differential equations has been discovered and investigated in other systems; see [1,32,33].

Variational characterization for the nonlocal nonlocal-interaction equation. Section
3.1 shows that the nonlocal nonlocal-interaction equation (NL 2 IE) can in fact be written as the gradient descent of the energy E according to the Finsler gradient operator; see (3.10) and (3.11). This is why we refer to weak solutions of (NL 2 IE) as gradient flows.
In this section we consider (P 2 (R d ), T ) as a quasi-metric space rather than a Finsler manifold, which allows us to be prove rigorous statements more easily. In particular we show that the weak solutions of (NL 2 IE) are curves of maximal slope for the energy (3.2) in the quasi-metric space (P 2 (R d ), T ) and vice versa. We then establish the existence and stability of gradient flows using the variational framework of curves of maximal slope. To develop the variational formulation, we adapt the approach of [2] to curves of maximal slope in metric spaces to the quasi-metric space (P 2 (R d ), T ). This requires introducing a one-sided version of the usual concepts from [2] to cope with the asymmetry of the quasi-metric T .
where |ρ | is the metric derivative of ρ as defined in (2.24).
The above one-sided definition is sufficient to characterize the curves of maximal slope:

) is a curve of maximal slope for E with respect to its one-sided strong upper gradient h if and only if t → E(ρ t ) is nonincreasing and
Remark 3.6. Note that by using Young's inequality in (3.12), we get Hence, if the curve (ρ t ) t∈[0.T ] is a curve of maximal slope for E with respect to its strong upper gradient h, we actually have an equality in (3.13).
Therefore, in order to give a variational characterization of (NL 2 IE) we need to detect the right one-sided strong upper gradient. As showed in [20], the variation of the energy along the solution to the equation provides the suitable candidate. In the following we clarify this point as well as the strategy.
It will be convenient to work directly with this vector field (w t ) t∈[0,T ] : from now on we write (ρ, w) ∈ CE T for (ρ, j) ∈ CE T as well as A(ρ t , w t ) for A(ρ t , j t ) according to (2.7). With this convention, we can define a Finsler-type product on velocities in analogy to (3.4) as Note that, under the absolute-continuity assumptions of Section 3.1, by comparing with (3.4) we have that g ρ,w (u, v) = g ρ,j (j 1 , j 2 ), where j 1 , j 2 are obtained from u, v by (3.14), respectively. Moreover, taking (3.6) into account, we also define ρ (w)(v) = g ρ,w (w, v). Arguing as in (3.7), we arrive at the following one-sided Cauuchy-Schwarz inequality.
Proof. Using v = v + − v − and the usual Cauchy-Schwarz inequality in L 2 (η ρ ⊗ µ), we get Now note that, from the weak formulation of the nonlocal continuity equation (2.14), we have for any ϕ ∈ C ∞ c (R d ) and any 0 ≤ s < t ≤ T the following chain rule: Moreover, we still have the identification of the product g with the action in the form of Lemma 2.6: (3.16) which shows that the action is the norm with respect to the Finslerian structure.
A crucial step toward the variational characterization of (NL 2 IE) mentioned above is to obtain the chain rule (3.15) for the energy functional (3.2), which is done in Proposition 3.10 below by a suitable regularization. As a consequence, by using the one-sided Cauchy-Schwarz inequality from Lemma 3.7, we obtain in Corollary 3.11 that the square root √ D of the local slope, defined below in (3.17), is a one-sided strong upper gradient for E with respect to the quasi-metric T in the sense of Definition 3.4, where |ρ t | 2 =Â(ρ t , w t ) = g ρt,wt (w t , w t ) for a.e. t ∈ [0, T ] due to Proposition 2.25 and (3.16). This allows us to define the De Giorgi functional, which provides the characterization of weak solutions as curves of maximal slope. Definition 3.8 (Local slope and De Giorgi functional). For any ρ ∈ P 2 (R d ), let the local slope at ρ be given by For any ρ ∈ AC([0, T ]; (P 2 (R d ), T )), the De Giorgi functional at ρ is defined as When the dependence on the base measure µ needs to be explicit, the local slope and the De Giorgi functional are denoted by D(µ; ρ) and G T (µ; ρ), respectively.
If the potential K satisfies Assumptions (K1)-(K3), we note that whenever ρ is a weak solution to (NL 2 IE) and ρ ∈ AC([0, T ]; P 2 (R d )) the quantity G T (ρ) is finite; indeed, the domain of the energy is all of P 2 (R d ) and Proposition 2.25 yields that both the local slope (since it is equal to the action of (ρ, j), where j is given in Definition 3.1) and metric derivative are finite.
We are ready to state our main theorem.

Theorem 3.9. Suppose that µ satisfies Assumptions (A1) and (A2) and K satisfies Assumptions (K1)-(K3). A curve
where G T is the De Giorgi functional as given in Definition 3.8.
Note that in the above theorem, the implicit assumption that √ D is a one-sided strong upper gradient for E is made; this is in fact true thanks to Corollary 3.11 below. In light of this we can represent the result via the following diagram: ρ is a weak solution of (NL 2 IE) ⇐⇒ ρ is a curve of maximal slope for E w.r.t.
3.3. The chain rule and proof of Theorem 3.9. Firstly, we focus on the chain-rule property, which is the main technical step for proving Theorem 3.9.
where (w t ) t∈[0,T ] is the antisymmetric vector field associated to (ρ, j) ∈ CE T .
Proof. Since the curve ρ ∈ AC [0, T ]; (P 2 (R d ), T ) , according to Proposition 2.25 there exists a unique family (j t ) t∈[0,T ] belonging to T ρ P 2 (R d ) for a.e. t ∈ [0, T ] such that: Then the identity (3.20) is equivalent to proving We proceed by applying two regularization procedures. First, for all (x, y) for all z ∈ R 2d and ε > 0, where m is a standard mollifier on R 2d . We also introduce a smooth cut-off function ϕ R on R 2d such that ϕ(z) = 1 on B R , ϕ(z) = 0 on R 2d \ B 2R and |∇ϕ R | ≤ 2 R , where B R is the ball of radius R in R 2d centered at the origin. We set K ε R := ϕ R K ε and note that it is a C ∞ c (R 2d ) function. We now introduce the approximate energies, indexed by ε and R, Let us extend ρ and j to [−T, 2T ] periodically in time, meaning that ρ −s = ρ T −s and ρ T +s = ρ s for all s ∈ (0, T ] and likewise for j. We regularise ρ and j in time by using a standard mollifier n on R supported on [−1, 1], by setting n σ (t) = 1 σ n( t σ ) and for any σ ∈ (0, T ); whence ρ σ t ∈ P 2 (R d ). Let us now show that the integral of the action is uniformly bounded with respect to σ. Let |λ| ∈ M + (G) be such that γ 1,t , γ 2,t , |j t | |λ| for all t ∈ [0, T ]. Then by using the joint convexity of the function α from (2.4), Jensen's inequality and Fubini's Theorem, we get It is easy to check that (ρ σ , j σ ) is still a solution to the nonlocal continuity equation on [0, T ]. By arguing as in the proof of Proposition 2.17, we get that along subsequences it holds ρ σ with dĵ := η dj t dt, for some curve (j t ) t∈[0,T ] in M(G). Note that n σ δ 0 as σ → 0, and, as a consequence, ρ σ t ρ t for all t ∈ [0, T ] in the view of Proposition 2.21. Thus, we actually havẽ ρ = ρ andj = j by uniqueness of the limit and the flux, as highlighted above. Using the regularity for ε > 0 and σ > 0, we get For the sake of completeness, we note that the second equality follows from the definition of CE T by using again a cut-off argument on the function K ε R * ρ σ t . We omit this step as it is a standard procedure. By integrating in time between s and t, with s ≤ t, it follows In order to obtain (3.21) we need to let ε and σ go to 0 and R go to ∞ in (3.22). The left-hand side is easy to handle since ρ σ t ρ t as σ → 0 for any t ∈ [0, T ], and K ε R → K R uniformly on compact sets as ε → 0. Finally, by letting R go to ∞ we have convergence to E(ρ t ).
In order to pass to the limit in the right-hand side of (3.22), we use a truncation argument similar to that in the proof of Proposition 2.17. By using the same notation as above in (2.18), we find a family (ϕ δ ) δ>0 ⊂ C ∞ c (R d × G) of truncation functions such that, for all δ > 0, By using the assumption (K3), Lemma 2.10 with Φ(x, y) = |x − y| ∨ |x − y| 2 and (A1), we can bound the modulus of the right-hand side of (3.22) by This implies that: s D(ρ τ ) + |ρ τ | 2 dτ = 0, by Corollary 3.11. Whence the first part of the theorem follows for s = 0 and t = T since G T (ρ) = 0.
Consider now ρ ∈ AC([0, T ]; (P 2 (R d ), T )) satisfying the equality (3.19). Let us verify that it is a weak solution of (NL 2 IE). By Proposition 2.25 there exists a unique family Moreover, by Lemma 2.6 we find an antisymmetric measurable vector field w : Thanks to Proposition 3.10, by applying the one-sided Cauchy-Schwarz, using the identification (3.16), the definition of the local slope (3.17) and Young inequality, we get Thanks to the equality (3.19), we actually have that the above inequalities are equalities, which holds if and only if w t (x, y) = −∇ δE δρ (x, y) for a.e. t ∈ [0, T ] and γ 1,t -a.e. (x, y) ∈ G. Hence (ρ, j) ∈ CE T with w = −∇ δE δρ , that is, ρ is a weak solution to (NL 2 IE).
3.4. Stability and existence of weak solutions. Theorem 3.9 provides a characterization of (weak) solutions to (NL 2 IE) as minimizers of G T attaining the value 0. The direct method of calculus of variations gives existence of minimizers of G T . However, it is not clear a priori whether they attain the value 0 and are thus actually weak solutions to (NL 2 IE). Hence we prove compactness and stability of gradient flows (see Theorem 3.14) and approximate the desired problem by discrete problems for which the existence of solutions is easy to show; see the proof of Theorem 3.15. We start by proving that the local slope D is narrowly lower semicontinuous jointly in its arguments, µ and ρ; see Lemma 3.12. We then establish the compactness coming from a uniform control of the De Giorgi functional G T , as well as its joint narrow lower semicontinuity (see Lemma 3.13), which we prove using compactness in CE T and the joint narrow lower semicontinuity of the action (see Proposition 2.17) and of the local slope. (See also [37,Theorem 2] for an analogous strategy.) In Theorem 3.14 we prove one of our main results, namely that the functional G T is stable under variations in base measures, defining the vertices of the graph, and absolutely continuous curves. A particular consequence of this theorem is that weak solutions to (NL 2 IE) with respect to graphs defined by random samples of a measure µ converge to weak solutions to (NL 2 IE) with respect to µ; see Remark 3.17.
The existence of weak solutions of (NL 2 IE) (and thus gradient flows) with respect to E proved in Theorem 3.15 shows that, indeed, the De Giorgi functional (3.18) corresponding to an interaction potential K satisfying (K1)-(K3) admits a minimizer when µ(R d ) is finite. Lemma 3.12. Let (µ n ) n ⊂ M + (R d ) and suppose that (µ n ) n narrowly converges to µ. Assume that the base measures (µ n ) n and µ are such that (A1) and (A2) hold uniformly in n, and let K satisfy Assumptions (K1)-(K3). Let moreover (ρ n ) n be a sequence such that ρ n ∈ P 2 (R d ) for all n ∈ N and ρ n ρ as n → ∞ for some ρ ∈ P 2 (R d ). Then lim inf n→∞ D(µ n ; ρ n ) ≥ D(µ; ρ).
Let us also prove the compactness and narrow lower semicontinuity of the De Giorgi functional. Proof. For any n ∈ N, recall the definition where we are careful to take the metric derivative of ρ n with respect to T µ n (as given in Definition 2.18).
Since the domain of the energy E is all of P 2 (R d ) and the local slope D is nonnegative, the bound sup n∈N G T (µ n ; ρ n ) < ∞ ensures that For all n ∈ N, since ρ n ∈ AC([0, T ]; (P 2 (R d ), T µ n )), Proposition 2.25 yields the existence of a flux j n such that (ρ n , j n ) ∈ CE T and |(ρ n t ) | 2 = A(µ n ; ρ n t , j n t ) for almost all t ∈ [0, T ]. We then get By Proposition 2.17, there now exists (ρ, j) ∈ CE T such that, up to subsequences, ρ n t ρ t for all t ∈ [0, T ] and j n j as n → ∞, and By Proposition 2.25, we therefore have ρ ∈ AC([0, T ]; (P 2 (R d ), T µ )) and |(ρ t ) | 2 Tµ ≤ A(µ; ρ t , j t ) for almost all t ∈ [0, T ], which finally gives (3.24) lim inf By the narrow continuity of the energy proved in Proposition 3.3, we get which ends the proof.
We now get our stability result.
Theorem 3.14 (Stability of gradient flows). Let (µ n ) n ⊂ M + (R d ) and suppose that (µ n ) n narrowly converges to µ. Assume that the base measures µ n and µ satisfy (A1) and (A2) uniformly in n, and let the interaction potential K satisfy (K1)-(K3). Suppose that ρ n is a gradient flow of E with respect to µ n for all n ∈ N, that is, G T (µ n ; ρ n ) = 0 for all n ∈ N, such that (ρ n 0 ) n satisfies sup n∈N M 2 (ρ n 0 ) < ∞ and ρ n t ρ t as n → ∞ for all t ∈ [0, T ] for some curve (ρ t ) t∈[0,T ] ⊂ P 2 (R d ). Then, ρ ∈ AC([0, T ]; (P 2 (R d ), T µ )) and ρ is a gradient flow of E with respect to µ, that is, G T (µ; ρ) = 0.
Note that, via Theorem 3.9, the above theorem also shows stability of weak solutions to (NL 2 IE). Typically, in Theorem 3.14, (µ n ) n is a sequence of atomic measures used to approximate, or sample, the support of µ. Indeed, we now use this approach to show the existence of weak solutions to the nonlocal nonlocal-interaction equation. Theorem 3.15 (Existence of weak solutions). Let K be an interaction potential satisfying Assumptions (K1)-(K3). Suppose that µ ∈ M + (R d ) is finite, i.e., µ(R d ) < ∞, and satisfies (A2). Assume furthermore that for some C η > 0 it holds that (3.27) sup (x,y)∈G∩supp µ⊗µ |x − y| 2 ∨ |x − y| 4 η(x, y) ≤ C η .
Proof. Let (µ n ) n ⊂ M + (R d ) be a sequence of atomic measures such that (µ n ) n converges narrowly to µ. Moreover, assume that µ n has finitely many atoms and µ n (R d ) ≤ µ(R d ) and supp µ n ⊆ supp µ for all n ∈ N. Letμ n be the normalization of µ n which has the same total mass as µ, that is, and let π n be optimal transportation plan between µ andμ n for the quadratic cost. Let ρ n 0 be the second marginal ofρ 0 π n , whereρ 0 is the density of the measure ρ 0 with respect to µ; namely, let ρ n 0 (A) = R d ×Aρ 0 (x) dπ n (x, y) for any Borel set A ⊂ R d . Note that ρ n 0 (R d ) = ρ 0 (R d ) and ρ n 0 µ n for all n ∈ N, and that, sinceρ 0 π n is a transport plan between ρ 0 and ρ n 0 , ρ n 0 ρ 0 as n → ∞. Thanks to Assumption (3.27), it holds, for all n ∈ N, Since by construction ρ n 0 µ n , we have supp ρ n 0 ⊆ supp µ n ⊆ supp µ. This nested support property is, thanks to Proposition 2.28, preserved in time, so that supp ρ n t ⊆ supp µ for all t ∈ [0, T ] and n ∈ N. For this reason, (3.28) can be used, under the stated support restriction on ρ 0 , instead of Assumption (A1) uniformly in n when calling Lemma 3.13 and Theorem 3.14 later in this proof. Since µ n consists of finitely many atoms and µ satisfies (A2), it satisfies (A2) uniformly in n.
By Remark 1.1, we know that the ODE system (1.2)-(1.4) admits a unique solution for all n ∈ N. It can be easily checked that this solution, which we denote by ρ n , is a weak solution to (NL 2 IE) with respect to µ n starting from ρ n 0 , according to Definition 3.1. By Theorem 3.9, we then get that ρ n is a gradient flow of E with respect to µ starting from ρ n 0 for all n ∈ N. Combining the compactness part of Lemma 3.13 and the stability from Theorem 3.14, we get that, up to a subsequence, ρ n t ρ t as n → ∞ for all t ∈ [0, T ], where ρ ∈ AC([0, T ]; (P 2 (R d ), T µ )) is a gradient flow of E with respect to µ starting from ρ 0 . Theorem 3.9 finally shows that ρ is a weak solution to (NL 2 IE) with respect to µ starting from ρ 0 . Remark 3.16. Assumption (3.27) is only needed to arrive at an atomic approximation sequence (µ n ) n of µ such that Assumptions (A1) and (A2) hold uniformly in n. On a case-by-case basis, one could drop (3.27) and try to construct the sequence (µ n ) n explicitly in such a way as to satisfy both assumptions uniformly in n.
Remark 3.17. We conclude the section by remarking on the relevance of the Theorem 3.14 to the setting of machine learning. Namely, there µ is the measure modeling the true data distribution, which can be assumed to be compact. Let (x i ) i be a sequence of i.i.d samples of µ and let µ n = 1 n n i=1 δ x i be the empirical measure of the first n sample points. Assume (ρ n ) n is a narrowly converging sequence of probability measures such that supp ρ n ⊆ {x 1 , . . . , x n } for all n ∈ N, and denote by ρ its limit. Assume that η is an edge weight kernel such that µ and η satisfy (A2) and (3.27). Let K be an interaction kernel satisfying (K2) and (K3). Finally, let (ρ n ) n be the sequence of solutions of (NL 2 IE) in the sense of Definition 3.1 such thatρ n 0 = ρ n for all n ∈ N. Then, by Lemma 3.13, the sequence (ρ n t ) n narrowly converges along a subsequence for all t ∈ [0, T ], and furthermore, by Theorem 3.15, any curve (ρ t ) t∈[0,T ] of subsequential limits yields a solutionρ of (NL 2 IE) with initial condition ρ.
3.5. Discussion of the local limit. Here we discuss at a formal level the connection between the nonlocal nonlocal-interaction equation and its limit as the graph structure localizes. We first present a very formal justification as to why we expect the solutions of (NL 2 IE) to converge to the solutions of a nonlocal-interaction equation as the localizing parameter ε → 0 + , i.e., as the edge-weight function η = η ε localizes. We conclude this section with an example that cautions that the formal argument cannot be justified in full generality. Proving the convergence of (NL 2 IE) in the limit ε → 0 + , under appropriate conditions, remains an intriguing open problem.
Take µ = Leb(R d ) and choose η ε given by (2.2). Consider a smooth interaction potential K : R d × R d → R and a compactly supported initial condition ρ 0 which has a continuous density with respect to µ. Let ρ ε be the solution of (NL 2 IE) starting from ρ 0 for the edge weight function η ε . Assume that ρ ε t is absolutely continuous with respect to µ for all t. In the following we drop the t-dependence of ρ ε for brevity. From (NL 2 IE), by adding and subtracting ρ ε (x) R d (∇K * ρ ε (x, y)) + η ε (x, y) dy, it follows that ∇ρ ε (x, y)(∇K * ρ ε (x, y)) + η ε (x, y) dy.
Then, for almost all x ∈ R d we have A standard calculation, using a second-order Taylor expansion, shows that the right-hand side approximates ∆K * ρ ε (x) when ε is small, provided that derivatives of ρ ε remain uniformly bounded. Similarly, by Taylor expanding ∇ρ ε and ∇K * ρ ε to first order and changing variable over the unit sphere while carefully tracking the positive part, one gets R d ∇ρ ε (x, y)(∇K * ρ ε (x, y)) + η ε (x, y) dy ≈ ∇ρ ε (x) · ∇K * ρ ε (x) for small ε.
This suggests that if ρ ε converge as ε → 0 + , then the limiting ρ is a solution of the standard nonlocal interaction equation (3.1). A possible way to attack the local limit within the variational framework is via a stability statement similar to that of Theorem 3.14, but now with respect to the family (η ε ) ε>0 in the limit ε → 0 + . The following remark indicates that this will require further regularity assumptions on the interaction kernel K.
Remark 3.18. We present an example that indicates that, in certain situations, solutions of (NL 2 IE) cannot be expected to converge to solutions of (3.1) as the interaction kernel η ε becomes more concentrated. Namely, consider d = 1, Ω = (−2, 2) and µ = Leb(Ω). Let K(x, y) = 1 − e −|x−y| for all x, y ∈ Ω and η be a smooth, even function, positive on (−0.2, 0.2) and zero otherwise. Consider ρ 0 = 1 2 (δ −1 + δ 1 ). It is straightforward to verify that ρ t = ρ 0 for all t ∈ [0, T ] yields a weak solution of (NL 2 IE) for all ε > 0. In particular, note that the corresponding velocity field satisfies v(−1, y) = −(K * ρ 0 (y) − K * ρ 0 (−1)) ≤ 0 for all y ∈ (−1.2, −0.8), and thus the flux from x = −1 remains zero, and analogously from x = 1. Therefore, one cannot expect the weak solutions for the interaction potential K to converge to weak solutions of (3.1) as ε → 0 + . We believe that, for these particular kernel K and edge weights η, the problem persists for strong solutions for initial data close to ρ 0 , only that explicit solutions are not available.
Appendix A. Minkowski norm of the underlying Finsler structure In this appendix we show that, given ρ ∈ P 2 (R d ) and j ∈ T ρ P 2 (R d ), the inner product g ρ,j from Section 3.1 derives from a so-called Minkowski norm, as it should be in the theory of Finsler geometry; see [3,14,38].