Nonlocal Wasserstein distance: metric and asymptotic properties

The seminal result of Benamou and Brenier provides a characterization of the Wasserstein distance as the minimal action of paths in the space of probability measures, where paths are solutions of the continuity equation and the action is the kinetic energy. Here we consider a fundamental modification of the framework where the paths are solutions of nonlocal (jump) continuity equations and the action is a nonlocal kinetic energy. The resulting nonlocal Wasserstein distances are relevant to fractional diffusions and Wasserstein distances on graphs. We characterize the basic properties of the distance and obtain sharp conditions on the (jump) kernel specifying the nonlocal transport that determine whether the topology metrized is the weak or the strong topology. Key results of the paper are the quantitative comparisons between the nonlocal and local Wasserstein distances.


Introduction and Summary of
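The 2-Wasserstein distance between $\mu_0$ and $\mu_1$ is given by
$$W_2(\mu_0,\mu_1) := \left(\inf_{\pi\in\Pi(\mu_0,\mu_1)} \int_{\mathbb{R}^d\times\mathbb{R}^d} |x-y|^2\,d\pi(x,y)\right)^{1/2},$$
where $\mu_0$ and $\mu_1$ lie in $\mathcal{P}_2(\mathbb{R}^d)$, the space of probability measures on $\mathbb{R}^d$ with finite second moments, and $\Pi(\mu_0,\mu_1)$ is the space of all couplings of $\mu_0$ and $\mu_1$. The celebrated work of Benamou and Brenier [4] establishes that the Wasserstein metric has a dynamical reformulation inspired by fluid mechanics. There, one considers the space of all narrowly continuous curves $\rho_t : [0,1] \to \mathcal{P}_2(\mathbb{R}^d)$ and velocity vector fields $v_t : [0,1] \to L^2(\rho_t;\mathbb{R}^d)$ such that the continuity equation
$$\partial_t \rho_t + \operatorname{div}(\rho_t v_t) = 0 \tag{1.1}$$
holds in the sense of distributions. In [4] it is shown that
$$W_2^2(\mu_0,\mu_1) = \inf\left\{\int_0^1\int_{\mathbb{R}^d} |v_t(x)|^2\,d\rho_t(x)\,dt \;:\; (\rho_t,v_t)_{t\in[0,1]} \text{ solves (1.1)},\ \rho_0=\mu_0,\ \rho_1=\mu_1\right\}. \tag{1.2}$$
The integral $\int_0^1\int_{\mathbb{R}^d} |v_t(x)|^2\,d\rho_t(x)\,dt$ represents the total action along $(\rho_t,v_t)_{t\in[0,1]}$; in other words, the 2-Wasserstein distance between $\mu_0$ and $\mu_1$ can be reformulated as the least action of a curve in $\mathcal{P}_2(\mathbb{R}^d)$ connecting $\mu_0$ to $\mu_1$ in which the flow of mass is continuous, in the sense of satisfying equation (1.1).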
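As an illustration of the dynamical reformulation, the following minimal sketch treats the simplest possible case: two uniform empirical measures on the real line with the same number of atoms. It assumes the standard fact that, in one dimension, the monotone (sorted) matching is optimal for the quadratic cost, and it compares the resulting squared distance with the kinetic action of the curve that moves each atom at constant velocity. The function names and the sample data are hypothetical and chosen only for this example.

```python
def w2_squared_1d(xs, ys):
    """Squared W_2 distance between two uniform empirical measures on R
    with the same number of atoms, using the sorted (monotone) matching."""
    if len(xs) != len(ys):
        raise ValueError("this sketch needs equally many atoms in both measures")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((x - y) ** 2 for x, y in pairs) / len(xs)


def straight_line_action(xs, ys, steps=1000):
    """Riemann sum of int_0^1 int |v_t|^2 d rho_t dt along the curve that moves
    each sorted atom of xs to the matching sorted atom of ys at constant velocity."""
    xs, ys = sorted(xs), sorted(ys)
    velocities = [y - x for x, y in zip(xs, ys)]
    # Each atom carries mass 1/n and keeps its velocity, so the kinetic
    # energy is the same at every time t.
    kinetic = sum(v * v for v in velocities) / len(xs)
    dt = 1.0 / steps
    return sum(kinetic * dt for _ in range(steps))


mu0 = [0.0, 1.0, 4.0]
mu1 = [2.0, 3.0, 5.0]
static_value = w2_squared_1d(mu0, mu1)
dynamic_value = straight_line_action(mu0, mu1)
print(static_value, dynamic_value)
```

In this toy case the constant-velocity curve attains the infimum, so the two printed values agree up to rounding.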
In this article, our object of study is the class of nonlocal transportation metrics on $\mathcal{P}(\mathbb{R}^d)$, introduced by Erbar in [19], which are defined in terms of a nonlocal action minimization problem on the space of curves in $\mathcal{P}(\mathbb{R}^d)$.
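A key difference is that the curves connecting the measures are not solutions of the continuity equation (1.1), but solutions of the nonlocal continuity equation
$$\partial_t \rho_t(x) + \int_{\mathbb{R}^d} v_t(x,y)\,\theta(\rho_t(x),\rho_t(y))\,\eta(x,y)\,dy = 0, \tag{1.3}$$
where $v_t : \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$ is the nonlocal velocity, $\eta : \mathbb{R}^d\times\mathbb{R}^d \to [0,\infty)$ is the weight kernel which encodes the ability to transport mass directly from $x$ to $y$, and $\theta : [0,\infty)\times[0,\infty) \to [0,\infty)$ allows one to define an "interpolated density" $\theta(\rho_t(x),\rho_t(y))$, which is a generalized average of $\rho_t(x)$ and $\rho_t(y)$. The total action along $(\rho_t,v_t)_{t\in[0,1]}$ is the nonlocal kinetic energy
$$\frac{1}{2}\int_0^1\iint_{\mathbb{R}^d\times\mathbb{R}^d} v_t(x,y)^2\,\theta(\rho_t(x),\rho_t(y))\,\eta(x,y)\,dx\,dy\,dt. \tag{1.4}$$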
Together, (1.3) and (1.4) allow us to consider the family of nonlocal transportation distances (since the distance now depends on the choice of $\eta$ and $\theta$)
$$W_{\eta,\theta}(\mu_0,\mu_1)^2 := \inf \frac{1}{2}\int_0^1\iint_{\mathbb{R}^d\times\mathbb{R}^d} v_t(x,y)^2\,\theta(\rho_t(x),\rho_t(y))\,\eta(x,y)\,dx\,dy\,dt,$$
where the infimum runs over all $(\rho_t,v_t)_{t\in[0,1]}$ which solve (1.3), such that $\rho_0=\mu_0$ and $\rho_1=\mu_1$. This family of distances can be viewed simultaneously as nonlocal analogues of the Benamou-Brenier formulation of the $W_2$ metric, and also as a "continuum state space" analogue of the graph Wasserstein distance defined in [10,36,37], wherein the underlying space is a finite graph or irreducible Markov chain, rather than $\mathbb{R}^d$ as in the case of $W_{\eta,\theta}$.

In this paper we investigate topological and metric properties of the family of distances $W_{\eta,\theta}$ and compare them to the Wasserstein metric $W_2$. Some properties have already been established in [19]: it is known that the $W_{\eta,\theta}$ distance is lower semicontinuous with respect to narrow convergence, and that if the kernel $\eta$ has finite second moments, then the topology induced by $W_{\eta,\theta}$ is at least as fine as that of narrow convergence.
Here we show that the topology metrized by $W_{\eta,\theta}$ can be strictly stronger than that of narrow convergence. In particular we characterize for which kernels $W_{\eta,\theta}$ metrizes the narrow, the strong, or an even stronger topology on the space of measures supported within a compact domain. The key to establishing the result is the following proposition which, loosely speaking, characterizes the effort needed to spread mass from a point to the surrounding region.

Proposition 1.1. Let $W_{\eta,\theta}$ be defined as in Definition 2.15. Suppose that $\eta$ and $\theta$ satisfy Assumptions 2.1 (i-v) and 2.2 respectively. If $\nu$ is any compactly supported probability measure singular to $\delta_0$ (the Dirac measure at the origin), then depending on the choice of $\eta$ and $\theta$:
(i) If $\theta(1,0)=0$ and $\int_{B(0,1)}\eta(|y|)\,dy<\infty$ then $W_{\eta,\theta}(\delta_0,\nu)=\infty$.
We remark that the proposition is not a trichotomy: the case where $\int_{B(0,1)}\eta(|y|)\,dy=\infty$, but $\eta(|y|)$ does not grow strictly faster than $|y|^{-d}$ near the origin, remains open.
We prove Proposition 1.1 in Section 3.2. While Proposition 1.1 shows that in case (i), the topology induced by $W_{\eta,\theta}$ on $\mathcal{P}(\mathbb{R}^d)$ is highly disconnected, we give a further structural description of the topology of $W_{\eta,\theta}$ in the other two cases. Moreover, in case (ii) we establish a quantitative comparison to a combination of total variation and transportation distances. Let $W_1$ be the Monge distance, that is, the optimal transportation distance with linear cost.

Theorem 1.2. Let $W_{\eta,\theta}$ be defined as in Definition 2.15. Suppose that $\eta$ and $\theta$ satisfy Assumptions 2.1 (i-v) and 2.2 respectively. If $\theta(1,0)>0$ and $\int_{B(0,1)}\eta(|y|)\,dy<\infty$, then on a compact domain, there exists an explicit constant $C$ such that for all $\mu,\nu\in\mathcal{P}(\mathbb{R}^d)$,
$$TV^{1/2}(\mu,\nu) \le W_{\eta,\theta}(\mu,\nu) \le C\cdot TV^{1/2}(\mu,\nu).$$
The lower bounds asserted in Theorem 1.2 are established in Section 3.1; the corresponding upper bound makes use of the estimates from Section 3.2, and is proved in Section 3.3.
(ii) Suppose that $\rho_0$ and $\rho_1$ are supported inside some domain of radius $R$. Then, there is a constant $C_{R,\eta}$ depending solely and explicitly on $R$ and $\eta$ such that

In particular, when restricting attention to probability measures supported on a compact domain, these estimates imply the Gromov-Hausdorff convergence of $\sqrt{\frac{M_2(\eta)}{2d}}\,\varepsilon W_{\eta_\varepsilon,\theta}$ to $W_2$ as $\varepsilon\to0$. Part 1 of Theorem 1.3 is deduced, with an explicit constant for $C_{d,\theta,\eta}$, as Corollary 4.3; likewise, Part 2 of Theorem 1.3 is deduced, with an explicit constant for $C_{R,\eta}$, in Corollary 5.12.
1.1. Related work. Having stated the main results of the article, let us give some further motivating discussion regarding nonlocal Wasserstein distances, and why their topological and asymptotic properties are of interest.
1.1.1. Nonlocal Wasserstein metric, and associated gradient flows. Nonlocal Wasserstein distances were introduced in the work of Erbar [19]. A central result of [19] is that the $W_{\eta,\theta}$ gradient flow of the entropy $E(\rho)=\int\ln\rho\,d\rho$ is the fractional heat equation $\partial_t\rho_t+(-\Delta)^{s/2}\rho_t=0$ when $\theta$ is chosen to be the logarithmic mean, and $\eta$ is chosen to be the jump kernel of an $s$-stable Lévy process: $\eta(x,y)=c|x-y|^{-s-d}$. More broadly, this result suggests that nonlocal parabolic equations may be studied in an analogous fashion to those parabolic equations (such as the heat equation, the Fokker-Planck equation, and the porous medium equation) which can be cast as $W_2$ gradient flows [3]. Erbar has made important contributions in that direction by establishing the lower semicontinuity of the nonlocal action, showing that the topology of nonlocal Wasserstein distances is at least as strong as that of the Wasserstein distance, and showing that the entropy is geodesically convex with respect to the nonlocal Wasserstein distance.
As with the regular Wasserstein distance, it is of interest to consider gradient flows of functionals that combine some or all of: entropy, potential, and interaction energy. For $\beta_i\ge0$ for $i=1,2,3$ where $K(x,y)=K(y,x)$. If $\beta_1\beta_2>0$ and $\beta_3=0$ this would be a nonlocal Fokker-Planck equation; for $\beta_1\beta_3>0$ this would be a nonlocal McKean-Vlasov equation. The issue that arises in nonlocal Wasserstein gradient flows is that the behavior of the solutions at low temperature, for $\beta_1\ll1$, crucially depends on the kernel $\eta$ and interpolation $\theta$. In particular, the result of Proposition 1.1(i) indicates that when $\beta_1=0$, $\eta$ is integrable, and $\theta$ is, for example, the logarithmic mean, then the gradient flow of the potential (or interaction) energy is unable to move a delta mass! This follows from the fact that, since the potential energy is finite, the gradient flow curves have finite action. This highlights the need to better understand the influence of the choice of $\eta$ and $\theta$ on the nonlocal Wasserstein metric and the resulting gradient flows.
To overcome the issues with the freezing of support for the gradient flow of potential and interaction energies, Esposito, Patacchini, Schlichting, and Slepčev [25] studied a modification of the nonlocal transportation framework, inspired by upwind numerical schemes, which allows for the interpolation θ to depend on the velocity. For antisymmetric v They study gradient flow of the nonlocal interaction energy (β 1 = β 2 = 0) with respect to nonlocal Wasserstein distances both on graphs and in the continuum.
The resulting upwind nonlocal Wasserstein "distance" is not symmetric, and is shown to be a quasimetric. It provides a formal Finslerian (rather than Riemannian) differential structure on P(R d ), which is nonetheless sufficient to develop gradient flows as curves of maximal slope.
Lastly, in [24], the authors show that the 1D aggregation equation can be cast as the gradient flow of the kinetic energy with respect to a nonlocal transportation metric, which they call the nonlocal collision metric. This metric falls outside the scope of this article because the action they consider is 2-homogeneous (rather than 1-homogeneous). While the aggregation equation is also known to be a 2-Wasserstein gradient flow (of the nonlocal interaction energy, rather than the kinetic energy), it is nonetheless notable that a nonlocal transportation metric has recently been shown to be physically relevant in kinetic theory.

Let $\mathcal{X}$ be a finite set. Let $\pi$ be some distinguished probability measure on $\mathcal{X}$. Define
$$\mathcal{P}_\pi(\mathcal{X}) := \Big\{\rho:\mathcal{X}\to\mathbb{R}_+ \;:\; \sum_{x\in\mathcal{X}}\rho(x)\pi(x)=1\Big\}.$$
In other words, $\mathcal{P}_\pi(\mathcal{X})$ is the set of probability densities on $\mathcal{X}$, w.r.t. $\pi$. Consider some irreducible Markov kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}_+$ such that $\pi$ is the unique stationary measure for $K$, that is,
$$\sum_{x\in\mathcal{X}}\pi(x)K(x,y)=\pi(y)\quad\text{for all } y\in\mathcal{X}.$$
Furthermore, assume that $K$ is reversible, that is, the detailed balance condition $K(x,y)\pi(x)=K(y,x)\pi(y)$ holds. Equivalently [35, Chapter 9], one may consider a connected weighted graph on $\mathcal{X}$ with weights $w(x,y)$, where $\pi$ is the stationary distribution for the uniform random walk on $(\mathcal{X},w)$.
Let $\theta(x,y)$ be an interpolation function$^1$; define the shorthand $\hat\rho(x,y):=\theta(\rho(x),\rho(y))$. We introduce the graph continuity equation
$$\dot\rho_t(x)+\sum_{y\in\mathcal{X}}v_t(x,y)\,\hat\rho_t(x,y)\,K(x,y)=0,$$
where $v_t:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is thought of as a "vector field" on $\mathcal{X}$, analogous to the vector field $v_t$ appearing in the (continuum) continuity equation. The term $\sum_{y\in\mathcal{X}}v_t(x,y)\hat\rho_t(x,y)K(x,y)$ can be interpreted as a graph analogue of the term $\operatorname{div}(\rho v)$ from the continuity equation on $\mathbb{R}^d$.
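The conservation of mass built into the graph continuity equation can be checked on a toy example. The sketch below assumes the equation takes the form $\dot\rho_t(x)+\sum_y v_t(x,y)\,\theta(\rho_t(x),\rho_t(y))\,K(x,y)=0$, takes $\theta$ to be the logarithmic mean, and uses a gradient velocity $v(x,y)=\psi(y)-\psi(x)$. The three-point chain, the potential `psi`, and all function names are hypothetical choices made for illustration only.

```python
import math


def log_mean(r, s):
    """Logarithmic mean, extended by 0 on the boundary and by r on the diagonal."""
    if r == 0.0 or s == 0.0:
        return 0.0
    if r == s:
        return r
    return (r - s) / (math.log(r) - math.log(s))


# Lazy random walk on a 3-point path. K is symmetric, so it is reversible
# with respect to the uniform stationary measure pi.
K = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
pi = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]


def density_rate(rho, psi):
    """Time derivative of rho under the assumed graph continuity equation,
    with the antisymmetric velocity v(x, y) = psi(y) - psi(x)."""
    n = len(rho)
    rate = []
    for x in range(n):
        divergence = sum(
            (psi[y] - psi[x]) * log_mean(rho[x], rho[y]) * K[x][y]
            for y in range(n)
        )
        rate.append(-divergence)
    return rate


rho = [1.5, 0.9, 0.6]   # density w.r.t. pi: sum_x rho(x) pi(x) = 1
psi = [0.0, 1.0, 3.0]   # hypothetical potential
rate = density_rate(rho, psi)
total_mass = sum(r * p for r, p in zip(rho, pi))
mass_rate = sum(r * p for r, p in zip(rate, pi))
print(total_mass, mass_rate)
```

The total mass is stationary because $v$ is antisymmetric while $\theta$ and $K(x,y)\pi(x)$ are symmetric, so the contributions of each edge cancel in pairs.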
The action of a density-potential pair is given by

$^1$ By this, we informally mean: a function which is a well-behaved, but possibly nonlinear, average of $x$ and $y$. Rigorously, we mean a function from $\mathbb{R}_+\times\mathbb{R}_+\to\mathbb{R}$ satisfying all the conditions of Assumption 2.2 below.
From here, one can define a geodesic metric on $\mathcal{P}_\pi(\mathcal{X})$ in a variational fashion, by setting where the infimum runs over all pairs $(\rho_t,v_t)_{t\in[0,1]}$ satisfying the graph continuity equation, with $\rho_0=\bar\rho_0$ and $\rho_1=\bar\rho_1$. This Benamou-Brenier-type formulation of a distance on a discrete base space is more technically straightforward than its continuum ancestor $W_2$. For one, it is shown in [36] that, at least on the "interior" of $\mathcal{P}_\pi(\mathcal{X})$ (namely, the subset $\{\rho\in\mathcal{P}_\pi(\mathcal{X}) : \forall x\in\mathcal{X},\ \rho(x)>0\}$), we can interpret $W$ as a geodesic metric arising from a bona fide Riemannian metric structure; this is in contrast to the continuum setting, where the space $\mathcal{P}_2(\mathbb{R}^d)$ can only be understood "formally" as a Riemannian manifold. Indeed, for the metric $W_{\theta,\eta,\pi}$ and related gradient flows, numerous heuristic arguments from the Otto calculus can be translated to rigorous arguments in the discrete setting. This has been exploited to study a variety of evolution equations on discrete spaces as $W_{\theta,\eta,\pi}$ gradient flows, for instance discrete analogs of the porous medium equation [22] and the McKean-Vlasov equation [20].

The need for an interpolation $\theta$ in the discrete setting is rather clear. Because mass configurations are defined on the set of nodes, and vector fields are defined on edges, the discrete analogue of the flux of the continuity equation, $\rho v$, must combine node- and edge-defined quantities in a noncanonical fashion. In the definition of $W_{\theta,\eta,\pi}$, the role of the flux $\rho v$ is played by the quantity $\theta(\rho(x),\rho(y))v(x,y)$, where the interpolation $\theta$ is introduced in order to define an edge-based quantity (flux) based on vertex-defined quantities (mass). While there are many possible choices for $\theta$, the choice of $\theta$ can significantly alter the geometry of $W_{\theta,\eta,\pi}$.
From graphs to continuum. A number of works have investigated the asymptotic properties of graph Wasserstein distances as the graphs converge to a continuum limit. This is of particular interest in data science and for mesh-free numerical schemes. For instance, if one considers a sequence of finite graphs $G_n$ equipped with the shortest-path metric, converging in some sense to a continuous domain $G\subset\mathbb{R}^d$, is it the case that $(\mathcal{P}(G_n),W_{\theta,\eta,\pi})$ also converges to $(\mathcal{P}(G),W_2)$? Similar Gromov-Hausdorff-type stability results are well-established for a sequence of continuous domains where each respective space of probability measures is equipped with the $W_2$ metric [46]. However, the problem of discrete-to-continuum stability for the graph Wasserstein distance turns out to be considerably more delicate. In [31], it is shown that if we consider a sequence $\mathcal{X}_n$ of finer and finer $d$-dimensional regular lattices on the flat $d$-torus $\mathbb{T}^d$, and take for our Markov chain the uniform random walk on said lattice, then under appropriate rescaling the sequence of spaces $(\mathcal{P}(\mathcal{X}_n),W_{\theta,\eta,\pi})$ converges to $(\mathcal{P}(\mathbb{T}^d),W_2)$ in the sense of Gromov-Hausdorff. On the other hand, such convergence does not hold for an arbitrary sequence of regular meshes [33]. Despite this failure of convergence for general sequences of meshes, García Trillos has shown [29] that $W_{\theta,\eta,\pi}$ corresponding to weighted random geometric graphs (e.g. where vertices are random i.i.d. samples from the Lebesgue measure on the torus) converges in the sense of Gromov-Hausdorff to $(\mathcal{P}(\mathbb{T}^d),W_2)$ as the number of vertices goes to infinity and the graph bandwidth converges to zero at an appropriate rate (which is such that unweighted graph degrees go to infinity).
Our own Theorem 1.3 provides another result in this vein, whereby the 2-Wasserstein distance is recovered in the limit; but ours gives a nonlocal-to-local convergence result, rather than discrete-to-continuum.
We should also draw attention to one other question regarding the graph Wasserstein metrics: what can be said about the geodesics on (P(X ), W θ,η,π )? Two works [23,28] have independently investigated graph Wasserstein geodesics via their dual description in terms of solutions to a suitable discrete Hamilton-Jacobi equation. In the present article, an analogous Hamilton-Jacobi duality result for the nonlocal Wasserstein metric is exploited in Section 5 to prove Part 2 of Theorem 1.3.

1.1.3. Other related work. In [39], Peletier, Rossi, Savaré and Tse consider a far-reaching generalization of the results of [19]. While [19] shows that the fractional heat equation can be viewed as the gradient flow of $KL(\cdot\,|\,\mathrm{Leb})$ with respect to a Wasserstein-like metric, [39] investigates a large class of reversible Markov jump processes whose Kolmogorov forward equations may be cast as so-called generalized gradient systems, which are a further abstraction of the gradient flows in metric spaces studied in [3].
Finally, let us remark on some related work on the subject of nonlocal conservation laws. We draw attention to two distinct uses of the term in the literature (although the works cited in the following discussion refer to some other uses of the term in the literature). On the one hand, certain authors use "nonlocal conservation law" to refer to solutions to the (local) continuity equation $\partial_tu_t+\operatorname{div}(u_tv_t)=0$ where $v_t$ is itself a nonlocal functional of $u_t$; see [5,13] for a general discussion of this class of equations and an overview of related literature. In particular, we draw attention to recent work [11,12] investigating the local limit (namely, as $\varepsilon\to0$) of nonlocal conservation laws of the form
$$\partial_tu^\varepsilon_t+\operatorname{div}\big(u^\varepsilon_t\,b(u^\varepsilon_t*\eta_\varepsilon)\big)=0.$$
Formally, the singular limit is given by the conservation law $\partial_tu_t+\operatorname{div}(u_tb(u_t))=0$, but [12] exhibits counterexamples where $u^\varepsilon_t\not\to u_t$ (e.g. in $L^p$ for $p>1$) even when $b$ and $\eta$ are regular. More recently, sufficient conditions for nonlocal-to-local convergence in one dimension have been given in [11]; but it remains the case that formally "obvious" nonlocal-to-local convergence problems can present unexpected technical phenomena.
In a distinct line of work, the articles [16,17] introduce new classes of "nonlocal conservation laws" where one replaces the divergence term in the continuity equation with a nonlocal divergence-type operator. The nonlocal divergence and gradient in these papers (as well as in [17,32]) share some properties of the objects we study, but also have important differences. More importantly, the authors are interested in conservation laws where the flux $j$ is a nonlinear function of $\rho$. Several ways to encode the nonlinear and nonlocal dependence on $\rho$ are developed. In [16] the authors present, in the same vein as in [11], sufficient conditions which allow one to recover a local conservation law in the limit, under a suitable rescaling of the nonlocal divergence operator.

2. PRELIMINARIES
Wasserstein metric. Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the space of probability measures on $\mathbb{R}^d$ with finite second moments. The 2-Wasserstein distance $W_2$ on $\mathcal{P}_2(\mathbb{R}^d)$ is defined by
$$W_2(\mu,\nu):=\left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^d\times\mathbb{R}^d}|x-y|^2\,d\pi(x,y)\right)^{1/2},$$
where $\Pi(\mu,\nu)$ is the set of all transport plans (couplings) of $\mu$ and $\nu$. The 2-Wasserstein distance also has a well-known dynamical formulation, due to Benamou and Brenier [4]:
$$W_2^2(\mu,\nu)=\inf\left\{\int_0^1\int_{\mathbb{R}^d}|v_t(x)|^2\,d\rho_t(x)\,dt\;:\;\partial_t\rho_t+\operatorname{div}(\rho_tv_t)=0,\ \rho_0=\mu,\ \rho_1=\nu\right\}.$$
Here, the continuity equation $\partial_t\rho_t+\operatorname{div}(\rho_tv_t)=0$ is interpreted in a suitable distributional sense, in particular to allow $\rho_t$ to be a probability measure (rather than a smooth density). The Benamou-Brenier formulation of the $W_2$ metric can be interpreted as showing that the squared $W_2$ distance between $\mu$ and $\nu$ is given by the minimal total kinetic energy of a unit-time flow of mass with initial and terminal distribution specified by $\mu$ and $\nu$ respectively. Classically, kinetic energy is either formulated in terms of position and velocity, or position and momentum; accordingly, as was observed in [4] (but see also further discussion and extensions in [8,14]), one can also rewrite the Benamou-Brenier formulation of $W_2$ in "mass-flux" coordinates:
$$W_2^2(\mu,\nu)=\inf\left\{\int_0^1\int_{\mathbb{R}^d}\left|\frac{dv_t}{d\rho_t}(x)\right|^2d\rho_t(x)\,dt\;:\;\partial_t\rho_t+\operatorname{div}v_t=0,\ \rho_0=\mu,\ \rho_1=\nu\right\},$$
where $v_t$ is a locally finite signed measure, which formally takes the place of $\rho_tv_t$. This presentation of $W_2$ has the technical advantage that the action $\int_{\mathbb{R}^d}\left|\frac{dv}{d\rho}(x)\right|^2d\rho(x)$ is jointly convex and lower semicontinuous in $\rho$ and $v$ (this is a well-known consequence of Reshetnyak's theorem; for completeness, we provide a proof of this result which covers the case where $v\in\mathcal{M}_{loc}(\mathbb{R}^d)$ in Theorem A.1). In particular, see Remark 2.23 for a useful consequence of these properties.
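The advantage of mass-flux coordinates can already be seen pointwise. The sketch below is a scalar illustration only, not the measure-theoretic statement of Theorem A.1: it compares the kinetic energy density written as $\rho v^2$ in (density, velocity) variables with the same quantity written as $m^2/\rho$ in (density, flux) variables, where $m=\rho v$. The sample points are hypothetical and were picked so that midpoint convexity fails in the first set of variables.

```python
def kinetic_in_velocity(rho, v):
    """Kinetic energy density in (density, velocity) variables."""
    return rho * v * v


def kinetic_in_flux(rho, m):
    """Kinetic energy density in (density, flux) variables, with m = rho * v,
    using the conventions 0/0 = 0 and m^2/0 = +infinity for m != 0."""
    if m == 0.0:
        return 0.0
    if rho <= 0.0:
        return float("inf")
    return m * m / rho


# Two states in (rho, v) variables and their midpoint.
a_v, b_v, mid_v = (0.5, 1.5), (1.5, 0.5), (1.0, 1.0)
average_velocity_form = 0.5 * (kinetic_in_velocity(*a_v) + kinetic_in_velocity(*b_v))
midpoint_velocity_form = kinetic_in_velocity(*mid_v)

# The same two physical states in (rho, m) variables, and their midpoint there.
a_m, b_m, mid_m = (0.5, 0.75), (1.5, 0.75), (1.0, 0.75)
average_flux_form = 0.5 * (kinetic_in_flux(*a_m) + kinetic_in_flux(*b_m))
midpoint_flux_form = kinetic_in_flux(*mid_m)

print(midpoint_velocity_form, average_velocity_form)  # midpoint exceeds average
print(midpoint_flux_form, average_flux_form)          # midpoint is below average
```

In (density, velocity) variables the value at the midpoint exceeds the average of the endpoint values, so joint convexity fails, whereas the flux form satisfies the midpoint inequality for the same pair of states.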
2.1. Nonlocal structure: weight kernel, interpolation, vector calculus. We equip R d with an "underlying nonlocal structure", as follows.
(iii) The radial profile $\eta$ is non-increasing.
(iv) $\eta$ satisfies the tail moment bound $\int_{\mathbb{R}^d}\left(1\wedge|y|^2\right)\eta(|y|)\,dy<\infty$.
Additionally, for some results we require that
(v) The support of $\eta$ contains $(0,1]$,
or, furthermore, that
(vi) The support of $\eta$ is equal to $(0,1]$.

The assumption of isotropy is largely imposed to simplify the statements of our results. When combined with the tail moment bound, these suffice to guarantee that Assumption 1.1 from [19] is satisfied; the relevance for us is that we make use of several results from [19] which require this assumption. On the other hand, the arguments of Sections 4 and 5 make use of the assumption that $\eta$ is compactly supported. Note also that under the assumption that $\eta$ is isotropic, if $\eta$ is compactly supported then the support of $\eta(|y|)$ is equal to $\bar B(0,R)$ for some $R>0$; in assumption (vi), we fix $R=1$ merely as a convention. Likewise, in view of assumption (iii), assumption (v) may be viewed as merely a convention unless $\eta$ is identically zero.
Since $\eta$ is taken to be isotropic, we often write $\eta(|x-y|)$ rather than $\eta(x,y)$; in other words, we abusively identify $\eta$ with its radial profile.
We write $M_p(\eta):=\int_{\mathbb{R}^d}|x-y|^p\,\eta(|x-y|)\,dy$ to denote the $p$-th central moment of $\eta(x,y)$ with $x$ fixed. Note that due to isotropy, $M_p(\eta)$ does not depend on the choice of $x$. Unless otherwise stated, we do not explicitly assume $p$-th moment bounds on $\eta$, but note that Theorem 1.3 assumes that $M_2(\eta)<\infty$.
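For concreteness, the moment $M_p(\eta)$ can be evaluated numerically in a simple case. The sketch below works in dimension $d=1$ with the indicator kernel $\eta(r)=\mathbf{1}_{(0,1]}(r)$, for which $M_2(\eta)=\int_{-1}^{1}y^2\,dy=2/3$. It also evaluates the second moment of a rescaled kernel $\eta_\varepsilon(r)=\varepsilon^{-d}\eta(r/\varepsilon)$; this particular normalization of $\eta_\varepsilon$ is an assumption made for illustration, as are the function names.

```python
def indicator_kernel(r):
    """Radial profile eta(r) = 1 on (0, 1], and 0 elsewhere."""
    return 1.0 if 0.0 < r <= 1.0 else 0.0


def moment_1d(eta, p, radius=1.0, n=20000):
    """Midpoint-rule approximation of M_p(eta) = int |y|^p eta(|y|) dy in d = 1,
    for a radial profile eta supported in [0, radius]."""
    h = 2.0 * radius / n
    total = 0.0
    for i in range(n):
        y = -radius + (i + 0.5) * h
        total += abs(y) ** p * eta(abs(y)) * h
    return total


def rescaled(eta, eps, d=1):
    """Assumed rescaling eta_eps(r) = eps^(-d) * eta(r / eps)."""
    return lambda r: eta(r / eps) / eps ** d


m2 = moment_1d(indicator_kernel, 2)
eps = 0.1
m2_eps = moment_1d(rescaled(indicator_kernel, eps), 2, radius=eps)
print(m2, m2_eps, eps ** 2 * m2)
```

Under the assumed normalization, $M_2(\eta_\varepsilon)=\varepsilon^2M_2(\eta)$, which is the scaling that makes a factor of $\varepsilon$ appear in front of $W_{\eta_\varepsilon,\theta}$ in the nonlocal-to-local limit.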
In what follows, given a choice of weight kernel η, we denote G := {(x, y) ∈ R d × R d \{x = y} : η(x, y) > 0}. The intended interpretation is that G is the set of edges we have placed on R d , with η(x, y) being the edge weight between x and y.
We also assume that the interpolation function θ satisfies the following:  [19,Assumption 2.1], except that Erbar also assumes that θ is zero on the boundary, namely θ(0, t) = 0 for all t ≥ 0. However, a careful reading of [19] indicates that this extra assumption is never used (except in the sense that Erbar proves some results only for the logarithmic mean, which is indeed zero on the boundary). We do not assume θ is zero on the boundary; moreover, we will see below that whether or not θ is zero on the boundary has significant topological consequences for the nonlocal Wasserstein distance.
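To make the role of the boundary value $\theta(1,0)$ concrete, the following sketch evaluates three familiar means: the arithmetic, logarithmic, and geometric means. It checks numerically, on a small hypothetical grid, that each is 1-homogeneous and bounded above by the arithmetic mean (the bound stated in Lemma 2.4). Whether each of these means satisfies every item of Assumption 2.2 is not verified here; the checks are only a sanity test of the listed properties.

```python
import math


def arithmetic_mean(r, s):
    return 0.5 * (r + s)


def logarithmic_mean(r, s):
    """(r - s) / (ln r - ln s), extended by 0 on the boundary and by r on the diagonal."""
    if r == 0.0 or s == 0.0:
        return 0.0
    if r == s:
        return r
    return (r - s) / (math.log(r) - math.log(s))


def geometric_mean(r, s):
    return math.sqrt(r * s)


MEANS = {
    "arithmetic": arithmetic_mean,
    "logarithmic": logarithmic_mean,
    "geometric": geometric_mean,
}
GRID = [0.0, 0.1, 0.5, 1.0, 2.0, 7.5]  # hypothetical sample points


def below_arithmetic_mean(theta, tol=1e-12):
    """Check theta(r, s) <= (r + s) / 2 on the grid."""
    return all(theta(r, s) <= 0.5 * (r + s) + tol for r in GRID for s in GRID)


def is_one_homogeneous(theta, scale=3.0):
    """Check theta(c r, c s) = c theta(r, s) on the grid, for one scale c."""
    return all(
        math.isclose(theta(scale * r, scale * s), scale * theta(r, s),
                     rel_tol=1e-9, abs_tol=1e-12)
        for r in GRID for s in GRID
    )


boundary_values = {name: theta(1.0, 0.0) for name, theta in MEANS.items()}
for name, theta in MEANS.items():
    print(name, boundary_values[name], below_arithmetic_mean(theta), is_one_homogeneous(theta))
```

Among these examples only the arithmetic mean has $\theta(1,0)>0$; the logarithmic and geometric means vanish on the boundary.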
We call point (vii) "connectedness" because, as discussed in [36], if instead $C_\theta=\infty$, the discrete $W_\theta$ distance on the space of probability measures supported on a symmetric graph with two points becomes topologically disconnected. The assumption that $C_\theta<\infty$ is required for several arguments in Section 3.

Lemma 2.4. Any $\theta$ satisfying Assumption 2.2 also satisfies $\theta(r,s)\le\frac{r+s}{2}$.

Proof. First, note that by 1-homogeneity and the normalization $\theta(1,1)=1$, we have $\theta(r,r)=r$ for all $r\in\mathbb{R}_+$. At the same time, symmetry implies that at any point along the line $(r,r)$ in $\mathbb{R}_+\times\mathbb{R}_+$ (excluding $(0,0)$), the directional derivative of $\theta$ in the direction orthogonal to the vector $(1,1)$ must be zero. Therefore, by concavity, $\theta(r,s)$ is upper-bounded by the hyperplane which takes the value $r$ at $(r,r)$ and has directional derivative zero in the direction orthogonal to $(1,1)$ at every point $(r,r)$, and the only such hyperplane is given by $\frac{r+s}{2}$.

Definition 2.5 (Nonlocal gradient and divergence). For any function $\phi:\mathbb{R}^d\to\mathbb{R}$, its nonlocal gradient $\overline\nabla\phi:G\to\mathbb{R}$ is defined by $\overline\nabla\phi(x,y):=\phi(y)-\phi(x)$. For any $j\in\mathcal{M}_{loc}(G)$, its nonlocal divergence $\overline\nabla\cdot j$ is defined as the $\eta$-weighted adjoint of $\overline\nabla$, that is,
$$\int_{\mathbb{R}^d}\phi\,d(\overline\nabla\cdot j)=-\frac{1}{2}\iint_G\overline\nabla\phi(x,y)\,\eta(x,y)\,dj(x,y).$$
Remark 2.6. It is the nonlocal divergence operator $\overline\nabla\cdot$ which replaces the usual divergence operator in Definition 2.12. However, we will simply write out the integral operator $\frac{1}{2}\iint_G\overline\nabla\phi(x,y)\,\eta(x,y)\,dj(x,y)$ explicitly in the sequel; the definition here is just presented for easier comparison with articles such as [19,25]. Additionally, we caution the reader that other conventions for the definition of the nonlocal gradient and divergence operators are present in the literature; in particular, our definition is not the same as the one presented in [15].

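2.2. Action. We rigorously define the action in the "flux form".

Definition 2.7 (Action). Let $\eta$ satisfy Assumption 2.1 and $\theta$ satisfy Assumption 2.2. Let $\mu\in\mathcal{P}(\mathbb{R}^d)$ and $j\in\mathcal{M}_{loc}(G)$. Let $m\in\mathcal{M}_{loc}(\mathbb{R}^d)$ be any reference measure. Define the action of the pair $(\mu,j)$ by
$$\mathcal{A}_{\theta,\eta}(\mu,j;m):=\frac{1}{2}\iint_G\frac{\left(\frac{dj}{d\lambda}(x,y)\right)^2}{\theta\left(\frac{d(\mu\otimes m)}{d\lambda}(x,y),\frac{d(m\otimes\mu)}{d\lambda}(x,y)\right)}\,\eta(x,y)\,d\lambda(x,y),$$
where $\lambda$ is taken to be any nonnegative measure in $\mathcal{M}^+_{loc}(G)$ such that $|j|,\ \mu\otimes m,\ m\otimes\mu\ll\lambda$. Here, the fraction in the integrand is understood with the convention that $\frac{0}{0}=0$. If $\theta$ and $\eta$ are obvious from context, and $m$ is chosen to be Leb, the Lebesgue measure on $\mathbb{R}^d$, we simply write $\mathcal{A}(\mu,j)$.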
Remark 2.9. The choice of "reference measure" $m$ in Definition 2.7 allows us to encode alternate geometries on $\mathbb{R}^d$ besides the usual Euclidean one; phrased differently, Definition 2.7 makes sense when working with any metric measure space $(\mathbb{R}^d,d,m)$ (however, if we use a metric $d$ other than the Euclidean one, note this would change what it means for $\eta$ to be isotropic). In particular, if we consider a "weighted measured and $m_n$ is a measure supported on $V$, then selecting $m_n$ as the reference measure in Definition 2.7 causes $\mathcal{A}_{\eta,\theta,m_n}$ to coincide with the action associated to the graph Wasserstein distance discussed in the introduction, if we also restrict to the case where $\mu\ll m_n$ and $j\ll\sum_{i,j}w(x_i,x_j)m_n(x_i)m_n(x_j)$.

Lemma 2.10. The action $\mathcal{A}_{\theta,\eta}(\mu,j;m)$ is jointly convex in $(\mu,j)$, and is jointly lower semicontinuous in $(\mu,j)$ with respect to the narrow topology on $\mathcal{P}(\mathbb{R}^d)$ and the weak* topology on $\mathcal{M}_{loc}(G)$.
Proof. This is proved in Corollary A.2.
Remark 2.11 (Comparison with action given in Erbar). Our Definition 2.7 is superficially different from the definition of the action given in [19,Section 2]. Nonetheless, it can be seen that the two definitions are, in fact, equivalent, and so our choice of an alternate presentation of the action is largely one of taste.
Specialized to our setting, Erbar defined the action where dµ 1 = η(x, y)dµ(x)dm(y) and dµ 2 = η(x, y)dm(x)dµ(y), and λ is any measure in M loc (G) dominating v, µ 1 , and µ 2 (and the fraction in the integrand is understood with the convention that 0/0 = 0). Now, given a pair (µ, j) ∈ P(R d ) × M loc (G), it is routine to check (by using the chain rule for Radon-Nikodym derivatives and the 1-homogeneity of θ) that
Remark 2.13 (Comparison with nonlocal continuity equation given in [19]). In [19], a slightly different nonlocal continuity equation is considered. There, equation 1.3 is replaced with the following equation: satisfies the nonlocal continuity equation in the sense of Definition 2.12, then satisfies the nonlocal continuity equation in the sense of Definition 2.12. Note that this latter definition (of j t ) is unproblematic on G, since by definition η(x, y) > 0 everywhere on G.
It is sometimes advantageous to work with a stronger notion of solution to the nonlocal continuity equation: 1] is a classical solution to the nonlocal continuity equation provided that j t (x, y) is antisymmetric and Note that if (ρ t , j t ) t∈[0,1] is a classical solution to the nonlocal continuity equation, then it holds that the time-dependent measures (ρ t dx, j t dxdy) t∈[0,1] are again a solution to the nonlocal continuity equation in the sense of Definition 2.12.
We will write W η,θ to denote the case where m = Leb. Furthermore, we will usually drop the explicit reference to the choice of θ, and simply write W η .
Remark 2.16. In view of Remarks 2.11 and 2.13, our definition of the nonlocal Wasserstein distance is equivalent to a special case of the nonlocal Wasserstein distance defined in [19].
The proof of this fact is exactly as in [19] and is therefore omitted.
Proof. The proof of (i) is identical to the argument presented in [25, Corollary 2.8] and is therefore omitted.
(ii) This likewise follows from a minor modification of arguments given in [25, Lemma 2.6 and Corollary 2.8]. Namely, we reason from Lemma 2.10, in particular from the convexity of the action in the $j$ variable: Note that without loss of generality we can take $\lambda_t$ to also dominate $j_t$, and furthermore we can take $\lambda_t$ to be symmetric (by replacing $\lambda_t$ with $(\lambda_t+\lambda_t^T)/2$). Then, computing that and using the fact that $\theta\left(\frac{d(\mu\otimes m)}{d\lambda}(x,y),\frac{d(m\otimes\mu)}{d\lambda}(x,y)\right)$, $\eta(x,y)$, and $\lambda_t$ are all symmetric, we see that

Remark 2.19. Thus far, we have only defined $W_{\eta,\theta,m}(\mu,\nu)$ in the case where $\mu,\nu\in\mathcal{P}(\mathbb{R}^d)$. However, it is occasionally useful to consider the nonlocal Wasserstein distance between nonnegative Radon measures of equal mass: we do so, in particular, in the proof of Proposition 3.12 below. In particular, the action $\mathcal{A}(\rho,j)$ is still well-defined for $\rho\in\mathcal{M}^+(\mathbb{R}^d)$, and it is trivial to modify the definition of a solution to the nonlocal continuity equation we have given in Definition 2.12, to allow for the case where $\rho_t$ : but $\rho_t$ has fixed total mass for all $t\in[0,T]$. Therefore, in the case where $\mu,\nu\in\mathcal{M}^+(\mathbb{R}^d)$ and $\mu(\mathbb{R}^d)=\nu(\mathbb{R}^d)$, $W_{\eta,\theta,m}(\mu,\nu)$ can be defined exactly as in Definition 2.15. More precisely, we have the following well-definedness and homogeneity result: In particular, 1] is also a solution to the nonlocal continuity equation, with endpoints $\mu$ and $\nu$; consequently, using the 1-homogeneity of the action, By identical reasoning, we also deduce that $W_2$

2.5. Convolutions. We make frequent use of convolution estimates in Sections 4 and 5. A number of elementary computations relating to convolutions are deferred to Appendix B; here, we fix notation and state some basic convolution stability results concerning the $W_\eta$ distances. Lastly, we show that a specific convolution kernel, the Laplace kernel $K:=c_Ke^{-|x-y|}$, has the property that smoothed densities $K*\mu$ have relative Lipschitz regularity of the form $\frac{K*\mu(x)}{K*\mu(x')}\le1+C|x-x'|$, a fact which we exploit repeatedly in Sections 4 and 5.
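The relative regularity of Laplace-smoothed densities can be checked numerically. The sketch below uses the elementary observation, which follows from the triangle inequality $|x'-y|\le|x-x'|+|x-y|$, that $e^{-|x-y|}\le e^{|x-x'|}e^{-|x'-y|}$ and hence $K*\mu(x)\le e^{|x-x'|}\,K*\mu(x')$; for $|x-x'|\le\delta$ this is a bound of the form $1+C|x-x'|$. This is an illustration and not necessarily the constant obtained in the text. The measure $\mu$ is taken to be a finite sum of weighted atoms in one dimension, the normalizing constant $c_K$ is omitted since it cancels in the ratio, and all names and sample values are hypothetical.

```python
import math


def laplace_smooth(atoms, weights, x):
    """(K * mu)(x) for mu = sum_i weights[i] * delta_{atoms[i]} in d = 1,
    with K(x - y) = exp(-|x - y|) (normalizing constant omitted)."""
    return sum(w * math.exp(-abs(x - y)) for y, w in zip(atoms, weights))


atoms = [-2.0, 0.3, 5.0]
weights = [0.2, 0.5, 0.3]
points = [-3.0, -0.5, 0.0, 0.4, 2.0, 6.0]

# Largest value of [K*mu(x) / K*mu(x')] / exp(|x - x'|) over all pairs of points.
worst_ratio = max(
    laplace_smooth(atoms, weights, x)
    / laplace_smooth(atoms, weights, xp)
    / math.exp(abs(x - xp))
    for x in points
    for xp in points
)
print(worst_ratio)
```

The printed quantity never exceeds 1 (up to rounding), which is the numerical counterpart of the bound $K*\mu(x)/K*\mu(x')\le e^{|x-x'|}$.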
Given a convolution kernel k and a (possibly signed, possibly vector-valued) measure µ, we denote We also use the following notation: given a convolution kernel k and a measure µ, we write k * µ to denote the measure whose Lebesgue density is given by Separately, for measures j ∈ M loc (G), we follow [19] and define the convolution k * j ∈ M loc (G) of a convolution kernel k (on R d ) with j as follows: Note that this definition may be understood as a special case of convolution with respect to a translation-invariant group action on a space: in this case, R d acts on G by the translation (x, y) ↦ (x + z, y + z), and this action is indeed translation-invariant since η(|x − y|) is preserved under this translation.
Proposition 2.22 (Stability of action and metric under convolution). Let k be any convolution kernel.
The result now follows according to the reasoning given in [

Remark 2.23. Similarly, it is known [3, Lemmas 8.1.9 and 8. 1] solves the (local) continuity equation $\partial_t\rho_t+\operatorname{div}v_t=0$ in the sense of distributions, and $k$ is a convolution kernel, then 1] is again a solution of $\partial_t\rho_t+\operatorname{div}v_t=0$ in the sense of distributions; and similarly, Therefore, applying the mass-flux presentation of the $W_2$ metric described in Section 2 above, we can reason exactly as in the proof of part (iii) of the previous proposition to deduce that for any convolution kernel $k$ and probability measures $\mu_0$ and $\mu_1$, Furthermore if $|x-y|\le\delta$ then Hence, for $|y-x|<\delta$,

3. METRIC STRUCTURE OF $W_{\eta,\theta}$ DISTANCES

3.1. General lower bounds for nonlocal Wasserstein distances. In this subsection, we consider nonlocal Wasserstein distances $W_{\eta,\theta,m}$ with general reference measure $m$, since this complicates our analysis only minimally. As in Definition 2.15, we assume that $\eta$ satisfies Assumption 2.1 (i-iv), and that $\theta$ satisfies Assumption 2.2. We first recall the following result of Erbar (which we specialize somewhat), which gives a partial characterization of the topology induced by the nonlocal Wasserstein distance.
In particular, observe that this W 1 lower bound is vacuous in the case where m = Leb and the second moment of η is infinite.
When the lower bound in the previous proposition is non-vacuous, this shows, in particular, that the topology induced by $W_{\eta,\theta}$ is as strong as or stronger than the narrow topology on $\mathcal{P}(\mathbb{R}^d)$. What we show below is that, when $\eta$ is integrable, the topology is strictly stronger. More precisely, we show that the topology induced by $W_{\eta,\theta,m}$ is at least as strong as the strong topology on $\mathcal{P}(\mathbb{R}^d)$. This indicates that the nonlocal Wasserstein distances are fundamentally different from the standard Wasserstein distances.
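The gap between the narrow and the strong topology is already visible for Dirac masses. The sketch below compares $\delta_{1/n}$ with $\delta_0$: the $W_1$ distance between two Dirac masses is the distance between their locations, so it tends to zero, while the total variation distance, here normalized as $\sup_A|\mu(A)-\nu(A)|$, stays equal to 1. Representing finitely supported measures as dictionaries from atoms to masses is a convenience of this example, and the function names are hypothetical.

```python
def total_variation(mu, nu):
    """sup_A |mu(A) - nu(A)| for finitely supported probability measures,
    each given as a dict mapping atom -> mass."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)


def w1_between_diracs(x, y):
    """W_1 distance between the Dirac masses at x and y on the real line."""
    return abs(x - y)


delta_0 = {0.0: 1.0}
rows = []
for n in (1, 10, 100, 1000):
    delta_n = {1.0 / n: 1.0}
    rows.append((n, w1_between_diracs(1.0 / n, 0.0), total_variation(delta_n, delta_0)))

for n, w1, tv in rows:
    print(n, w1, tv)
```

A distance that dominates a power of the total variation therefore cannot make $\delta_{1/n}$ converge to $\delta_0$, in contrast with $W_1$ or $W_2$.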
The nonlocal Wasserstein distance W η,θ,m with arbitrary reference measure m of Definition 2.15 satisfies This lower bound is vacuous when η is non-integrable. This suggests that the upwind nonlocal transportation distance induces a weaker topology when η is non-integrable; we address this point later on in Lemma 3.10.
Proof. Recall that one of the several equivalent definitions of the TV norm is as follows: Any mass-flux pair (ρ t , j t ) connecting ρ 0 to ρ 1 must therefore move at least T V (ρ 0 , ρ 1 ) − ε of mass from A to A C (or vice versa). Without loss of generality, we can take A to be compact.
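For reference, the set-based characterization of the total variation norm that this step appears to rely on can be sketched as follows. The normalization (no factor of 2, so that TV takes values in [0, 1] on probability measures) is an assumption, since the displayed formula is not reproduced above.

```latex
% Assumed normalization: TV(\rho_0,\rho_1) \in [0,1] for probability measures.
\[
  TV(\rho_0,\rho_1) \;=\; \sup_{A \subseteq \mathbb{R}^d \ \text{Borel}}
  \bigl|\rho_0(A)-\rho_1(A)\bigr| .
\]
% Hence, for every \varepsilon > 0 there is a Borel set A with
\[
  \rho_0(A)-\rho_1(A) \;\ge\; TV(\rho_0,\rho_1)-\varepsilon ,
\]
% after possibly exchanging the roles of \rho_0 and \rho_1.
% By inner regularity of \rho_0, A can be shrunk to a compact K \subseteq A
% at the cost of an arbitrarily small additional error, because
% \rho_1(K) \le \rho_1(A).
```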
Let λ t be some measure which dominates all of j t , m ⊗ ρ t , and ρ t ⊗ m. The action for Applying the reverse Hölder inequality, and using the fact that $\left|\frac{dj_t}{d\lambda_t}\right| = \frac{d|j_t|}{d\lambda_t}$, we get By the Radon-Nikodym theorem, we have the estimate The same estimate works if we replace ρ t (y) with ρ t (x); hence, we conclude that Therefore, (applying the Radon-Nikodym theorem once more) Let ξ δ be a cutoff function for the set A; more precisely, ξ δ = 1 on A, ξ δ = 0 on A c δ , and ξ δ is continuous on R d (the existence of such a ξ δ is guaranteed by Urysohn's lemma). We use ξ δ (x) as a test function in the nonlocal continuity equation: by [25, Lemma 2.15] we find that , and ε > 0 was arbitrary, we conclude that as desired.
We also have the following technical corollary, which will be used in Proposition 3.6.
into a unit-time solution to the nonlocal continuity equation; compute that 1] is also an action-minimizing solution to the nonlocal continuity equation; otherwise, we could locally replace (ρ t , j t ) t∈[t 0 ,t 1 ] and get a lower-action mass-flux pair connecting ρ 0 and ρ 1 , which is ruled out by assumption. Therefore, by Proposition 3.2, Finally, since W η,θ,m (ρ t 0 , ρ t 1 ) ≤ W η,θ,m (ρ 0 , ρ 1 ), we find that which shows that t → ρ t (A) is 1/2-Hölder continuous, as desired.

3.2. Expel problem for W η,θ . In this subsection we consider the expel problem for nonlocal Wasserstein distances. That is, given a Dirac mass, say at the origin δ 0 for concreteness, we wish to estimate inf ν⊥δ 0 W η,θ (δ 0 , ν). Throughout, we only consider the case where the reference measure is the Lebesgue measure. In this subsection, we also assume that η satisfies Assumption 2.1 (i-v), and that θ satisfies Assumption 2.2.
We shall make repeated use of an adaptation of a specific computation from [36], which we present separately as Lemma B.1.
dr and C d,s is given explicitly in the proof.
An important consequence of the result above is that W η,ε (δ x 0 , m B(x 0 ,δ) ) → 0 as δ → 0 and hence the expel cost is zero: , that is, the annulus of outer radius r and inner radius r 2 . We let m A(x 0 ,r) denote the uniform probability measure on A(x 0 , r). Fix δ > 0. Applying Lemma B.1 with A = A(x 0 , δ2 −n ) and B = A(x 0 , 2 −n−1 ), we find that .
Let us write this upper bound in a more explicit fashion. We know that where α d is the volume of the d-dimensional unit ball. On the other hand, η ε δ 3 Putting C d,s = 2 α d c s 2 −d−s 1/2 , this shows that Summing the geometric series, we find To see why, observe that (m A(x 0 ,δ2 −n ) ) n∈N converges to δ x 0 in W 1 and thus in the narrow topology. Since W η,ε is jointly l.s.c. with respect to the narrow topology, we find that

Finally, we can easily upper bound W η,ε (m A(x 0 ,δ) , m B(x 0 ,δ) ). We use yet another construction based on the W geodesic in the two-point space: we use exactly the same computation as in the proof of Lemma B.1. Indeed, consider the curve ρ t : Additionally, let j t be chosen exactly as in the proof of Lemma B.1. This (ρ t , j t ) is constructed so that ρ 0 = m A(x 0 ,δ) and ρ 1 = m B(x 0 ,δ/2) , and so that the mass on A(x 0 , δ) is decreasing uniformly on the set, and continuously in time; therefore, there is a t 0 ∈ (0, 1) such that ρ t 0 has uniform distribution on B(x 0 , δ).
In particular, it follows that Note however that $\int_0^1 A(\rho_t, j_t)\,dt$ is none other than the upper bound for W η,ε (m A(x 0 ,δ) , m B(x 0 ,δ/2) ), so we have that Therefore, by the triangle inequality, we have that

The previous lemma computed an upper bound on the expel cost for general interpolation θ, in the case where η(| · |) is non-integrable in B(0, δ). It is also possible to provide an expel upper bound in the case where θ is nonzero on the boundary; this condition is satisfied, for instance, by the arithmetic mean, but not by the logarithmic mean.

Lemma 3.5. Suppose that θ(1, 0) = κ θ > 0. (Note that if θ(r, s) = (r + s)/2, then κ θ = 1/2.) In this case,

Proof. Let 0 < δ < ε. Let g : [0, 1] → [0, 1] be a function to be determined later, such that g(0) = 0 and g(1) = 1.
Note that with our given boundary conditions on g(t), ρ 0 is the uniform measure on B(x 0 , δ 0 ), and ρ 1 = δ x 0 . Note also that by construction, Let j t (x, y) be a flux so that (ρ t , j t ) solves the nonlocal continuity equation; in particular we set Together, since γ ⊗ γ dominates all of j t , ρ t ⊗ Leb, and Leb ⊗ ρ t , the action of (ρ t , j t ) is then Observe that Clearly, if it were the case that θ (g(t), 0) = 0, then the action of (ρ t , j t ) would be infinite. However, we have instead assumed that θ(1, 0) = κ θ > 0, so by 1-homogeneity, θ (g(t), 0) = κ θ g(t) for all t, and hence
Finally, we select $g(t) = t^2$. With this choice, $\int_0^1$

Proposition 3.6. If $\int_{\mathbb{R}^d} \eta(|y|)\,dy < \infty$, and θ(1, 0) = 0, then for all probability measures ν ∈ P(R d ) which are singular to

Proof. Let (ρ t , j t ) t∈[0,1] solve the nonlocal continuity equation, and let ρ 0 = δ 0 and ρ 1 = ν. We assume for the sake of contradiction that $\int_0^1 A(\rho_t, j_t)\,dt < \infty$. We define the set Note that this set is measurable since ρ t : [0, 1] → P(R d ) is narrowly continuous and j t : [0, 1] → M loc (G) is a Borel function. We claim that for any t ∈ T, it holds that To see this, let λ t ∈ M + (G) be any measure such that ρ t ⊗ Leb + Leb ⊗ ρ t + |j t | ≪ λ t . In particular, if ρ t ({0}) > 0 (and, so, for any t ∈ T), the fact that ρ t ⊗ Leb ≪ λ t implies that λ t ↾ {0} × R d is not identically zero. At the same time, compute that Note that for all t ∈ T we may select a representative of This implies that (up to a choice of a.e.-equivalent representative) $\frac{d(\mathrm{Leb}\otimes\rho_t)}{d\lambda_t}(0, y) = 0$ for all y ∈ R d .
Therefore, compute as follows: for all t ∈ T, Therefore, if $\int_0^1 A(\rho_t, j_t)\,dt < \infty$, it must be the case that Leb(T) = 0. However, we claim that Leb(T) > 0. Indeed, consider the following. Let ξ ∈ C ∞ c (R d , [0, 1]) be such that ξ(0) = 1 and let $\xi_\delta(x) := \xi(x/\delta)$. Note that 0 ≤ ξ δ (x) ≤ 1, and that as δ → 0, ξ δ (x) converges pointwise to the indicator 1 {x=0} . Plugging ξ δ (x) into the continuity equation, we find that Using the fact that $\frac{d|j_t|}{d\lambda_t} = \left|\frac{dj_t}{d\lambda_t}\right|$, and observing that dλt whenever the right hand side is well defined, we deduce (now using the convention that 0 · ∞ = ∞) that Applying Hölder's inequality, and using the fact that θ(r, s) ≤ (r + s)/2, we find that this last expression is bounded above by The term in the left parentheses is bounded above, in turn, by $2t\int_{\mathbb{R}^d} \eta(|y|)\,dy$. To see this, compute that

The computation for $\int_0^t \iint_G \frac{d(\mathrm{Leb}\otimes\rho_t)}{d\lambda_t}(x, y)\,\eta(x, y)\,d\lambda_t(x, y)\,dt$ is identical. Therefore, we find that η ε (x, y) is integrable with respect to dλ t (x, y)dt on G × [0, t] (assuming that $\int_0^t A(\rho_t, j_t)\,dt < \infty$); since, moreover, (ξ δ (y) − ξ δ (x)) 2 converges pointwise to 1 {0}×R d \{0} , we can apply the dominated convergence theorem to deduce that as δ → 0, At the same time, as δ → 0, So in the limit we find that In particular, if t is taken to be the first time such that ρ t ({0}) = 0 (note that such a first t exists, thanks to Corollary 3.3), then we find that which implies, in particular, that on a subset of [0, t) of positive measure, which means that on a subset of [0, t) of positive measure, |j t |({0} × R d \{0}) > 0. However, for all of [0, t), we know that ρ t ({0}) > 0. Therefore, Leb(T) > 0.
In order to estimate W η,ε (m B(x,δ) , m B(y,δ) ), we use Lemma B.1. Indeed, applying Lemma B.1 in the case where A = B(x 0 , δ) and B = B(x 1 , δ), and |x 0 − x 1 | + 2δ ≤ ε 2 , it follows that In particular, since η(|x − y|) ≥ c s |x − y| −d−s , we find that Note that 2δ < |x 0 − x 1 | ≤ ε 2 − 2δ but otherwise δ is arbitrary. In particular, if we pick δ = 1 6 |x 0 − x 1 |, this leads to the constraint on x 0 and x 1 that |x 0 − x 1 | < 3 8 ε, and the estimate At the same time, we know from Lemma 3 and similarly for x 1 , so altogether, For arbitrary x, y ∈ R d , it suffices to repeatedly apply the triangle inequality. Namely, construct a sequence x = x 0 , x 1 , . . . , x k = y so that |x i − x i+1 | < 3 8 ε for each i ∈ {0, . . . , k − 1}; note in particular we can take Thus, for |x − y| ≥ 3 8 ε, we have that Note that when |x − y| = 3 8 ε, we have C d,θ,s More generally, one has the following upper bound: where C d,θ,η is an explicit constant given in the proof.
Proof. To estimate the distance between delta masses at x and y when |x − y| ≫ ε, we first spread the mass to a ball of width δ, comparable to ε, and then jump between identical balls placed at distance comparable to ε along the line segment between x and y. The number of balls is comparable to |x − y|/ε, which explains the scaling of the right hand side in our estimate.
In order to estimate W η,ε (m B(x,δ) , m B(y,δ) ), we use Lemma B.1. Indeed, applying Lemma B.1 in the case where A = B(x 0 , δ) and B = B(x 1 , δ), and |x 0 − x 1 | + 2δ ≤ ε/2, it follows that Then, in order to estimate W η,ε (m B(x,δ) , m B(y,δ) ), it suffices to select a sequence x = x 0 , x 1 , . . . , x k = y so that δ < |x i − x i+1 | ≤ ε/2 − 2δ for all i ∈ {0, . . . , k − 1}; in particular, we can take Hence, For convenience, we also select δ = ε/6. Plugging this in, we get In other words, putting we see that Finally, we can compute as follows: where we have used the fact that δ = ε/6. In particular, applying Lemma 3.4, we find that if η(x, y) ≥ c s |x − y| −d−s when |x − y| ≤ 1/6, it holds that ; applying Lemma 3.5, we find that if θ(1, 0) = κ θ > 0, then So altogether, we have that where we have the case-wise definition of C d,θ,η (if both conditions obtain, either case can be chosen for the value of C d,θ,η ):

In order to proceed, we prove the following disintegration inequality for the W η metric, which is of independent interest, in addition to being needed in the proof of Lemma 3.10 below. An analogous result was established in [21, Proposition 2.14] in the discrete case; however, their proof does not readily adapt to our continuum setting.

Theorem 3.9 (Disintegration inequality). Let µ, ν ∈ P(R d ). Then,

Morally speaking, this theorem is just an instance of Jensen's inequality, since W 2 is a convex l.s.c. function, and $W_\eta^2(\mu, \nu) = W_\eta^2\left(\int \delta_x\,d\pi(x, y), \int \delta_y\,d\pi(x, y)\right)$. And indeed, in the discrete case, the proof of [21, Proposition 2.14] proceeds rather directly from Jensen's inequality, albeit applied to the action A(ρ, j) rather than to W. However, standard proofs of Jensen's inequality (see for instance [40]) require the underlying space (in this case P(R d ) 2 ) to carry the structure of a topological vector space, which we do not have here. We are aware of one more abstract version of Jensen's inequality [44] which does not require a t.v.s. structure, but in our situation a direct proof turns out to be readily available.
We now use the disintegration inequality and the estimates on the nonlocal transport between delta masses to obtain the initial, crude, upper bound on nonlocal transport between general probability measures in terms of the Wasserstein distance.

Lemma 3.10. Let ν 0 , ν 1 be any measures in P 2 (R d ). Suppose that η(x, y) ≥ c s |x − y| −d−s when |x − y| ≤ 1/6, or θ(1, 0) = κ θ > 0 (or both). Then, dr, and C d,θ,η is the constant from Lemma 3.8.
Furthermore, in the case where η(x, y) ≥ c s |x − y| −d−s when |x − y| ≤ 1 6 , one has the alternative upper bound where Φ is the function from the statement of Lemma 3.7.
Remark 3.11. This lemma actually helps to address a question posed by Erbar. In [19], it is mentioned that it is unclear for which probability measures ν 0 and ν 1 on R d we have that W(ν 0 , ν 1 ) < ∞ (in the case of the Wasserstein distances W p , these are precisely the measures which have finite pth moments). The proposition we are about to prove gives a sufficient condition on η and θ ensuring that W(ν 0 , ν 1 ) < ∞ for all ν 0 , ν 1 ∈ P 2 (R d ).
Proof. In Proposition 3.9, we proved the disintegration inequality On the other hand, Therefore, let π be W 2 -optimal plan for (ν 0 , ν 1 ). Using the assumption that η(x, y) ≥ c s |x − y| −d−s when |x − y| ≤ 1 6 , or θ(1, 0) = κ θ > 0 (or both), it follows that Alternatively, in the case where η(|x − y|) ≥ c s |x − y| −d−s when |x − y| ≤ 1 6 , we can apply Lemma 3.7, and deduce that Let us mention the following consequence of Lemma 3.10. Together with Proposition 3.2, this shows that when η is integrable, on a bounded domain the topology induced by W η is equivalent to the strong topology on probability measures. Proposition 3.12 (T V upper bound on a bounded set). Suppose that η(x, y) ≥ c s |x − y| −d−s when |x − y| ≤ 1 6 , or θ(1, 0) = κ θ > 0 (or both), and that ν 0 , ν 1 ∈ P(R d ) are both supported inside some bounded set K ⊂ R d . Then, there exists some constant C (independent of ν 0 and ν 1 , but allowed to depend on d, θ, η, and the diameter of K) such that Proof. The idea is that we simply rerun the argument for Lemma 3.10, but allow the mass in the "overlap" between ν 0 and ν 1 to stay put. To wit, define the measure Θ := min{ν 0 , ν 1 }. We suppose that Θ is not identically zero, since otherwise the desired inequality holds trivially. Observe that is well-defined. Moreover, by Lemma 3.10 (with ε = 1) it holds that By the 1-homogeneity of the action A η,θ , and also the 1-homogeneity of W 2 2 , this implies that . At the same time, any solution to the nonlocal continuity equation with endpoints ν 0 − Θ and ν 1 − Θ extends trivially to a solution to the nonlocal continuity equation with endpoints ν 0 and ν 1 : this is because the nonlocal continuity equation is additive, and so we can just add on the constant solution (Θ, 0) t∈[0,1] to the NCE. By the convexity of the action, this implies that Therefore,  One can check that under this assumption, ζ is integrable in R d even when η is not. Consider a solution of the continuity equation
Thus, using symmetry of η, While the preceding argument is formal, and written for strong solutions, making the argument rigorous and extending to weak solutions is straightforward, and is done in the lemma below.
In what follows, we use slightly more burdensome notation: given a specific kernel η, we write $\zeta_\eta(r) := \int_r^\infty s\,\eta(s)\,ds$; so in particular $\zeta_{(\eta_\varepsilon)}(r) := \int_r^\infty s\,\eta_\varepsilon(s)\,ds$. Note however that it need not hold that ζ η (|x|) is a convolution kernel, i.e. it may not be normalized when integrating on R d . We therefore introduce the (normalized) convolution kernels (that these are the correct normalization constants is shown in Lemma Note that the normalized ζ η is a convolution kernel supported on the unit ball, while the normalized ζ (ηε) is a convolution kernel supported on the ball of radius ε.
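A short computation indicates what the normalization constant should be. The sketch below assumes that η is radial and nonnegative with finite second moment, writes α_d for the volume of the unit ball as earlier in the text, and uses M_2(η) to denote the second moment of η; attaching exactly this meaning to the symbol M_2 is an assumption made here for illustration.

```latex
% Assumption: M_2(\eta) := \int_{\mathbb{R}^d} |y|^2 \eta(|y|)\,dy < \infty.
\begin{align*}
\int_{\mathbb{R}^d} \zeta_\eta(|x|)\,dx
  &= \int_{\mathbb{R}^d} \int_{|x|}^{\infty} s\,\eta(s)\,ds\,dx \\
  &= \int_0^\infty s\,\eta(s)\,\bigl|B(0,s)\bigr|\,ds
     && \text{(Tonelli; the integrand is nonnegative)} \\
  &= \alpha_d \int_0^\infty s^{d+1}\,\eta(s)\,ds \\
  &= \frac{1}{d}\int_{\mathbb{R}^d} |y|^2\,\eta(|y|)\,dy
     && \text{(polar coordinates; the unit sphere has area } d\,\alpha_d\text{)} \\
  &= \frac{M_2(\eta)}{d}.
\end{align*}
% Under these assumptions, (d / M_2(\eta))\,\zeta_\eta(|x|) integrates to 1.
```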
Then, it holds that ζ (ηε) * ρ t , j t t∈[0,1] solves the nonlocal continuity equation, where the measure j t : [0, 1] → M loc (G) is defined in the following way: for all test functions Φ(x, y) ∈ C ∞ c (G), we define

Proof. Suppose that for all test functions ϕ t ∈ C c ((0, 1) Then, it also holds that ζ (ηε) * ρ t , ζ (ηε) * j t t∈[0,1] is a solution to this form of the continuity equation. Our goal is to show that which establishes the lemma. Recalling that the nonlocal gradient is ∇ϕ t (x, y) := ϕ t (y) − ϕ t (x), we see that for each t ∈ [0, 1], (Note that the two integrals on the right hand side are well-defined, since ϕ t (x) is smooth and compactly supported in R d , η ε (x, y)(y − x) is integrable and compactly supported in x for each y, and By identical reasoning, Next, compute that since the function η ε (x, y)(y − x) is radially anti-symmetric around y, implying that $\int_{\mathbb{R}^d} \eta_\varepsilon(x, y)(y - x)\,dx = 0$. By identical reasoning, it also holds that Therefore, for all t ∈ [0, 1], which establishes the claim.
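The vanishing of the first-moment integral used in the proof can be checked by a change of variables. The sketch below assumes that the kernel depends only on the distance, η_ε(x, y) = η_ε(|x − y|), and that |z| η_ε(|z|) is integrable, as the proof asserts.

```latex
% Fix y and substitute z = x - y.
\[
  I \;:=\; \int_{\mathbb{R}^d} \eta_\varepsilon(|x-y|)\,(y-x)\,dx
    \;=\; -\int_{\mathbb{R}^d} \eta_\varepsilon(|z|)\,z\,dz .
\]
% The reflection z \mapsto -z preserves Lebesgue measure and |z|, so
\[
  \int_{\mathbb{R}^d} \eta_\varepsilon(|z|)\,z\,dz
    \;=\; \int_{\mathbb{R}^d} \eta_\varepsilon(|-z|)\,(-z)\,dz
    \;=\; -\int_{\mathbb{R}^d} \eta_\varepsilon(|z|)\,z\,dz ,
\]
% which forces the integral, and hence I, to be zero.
```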

4.2. A quantitative upper bound on the nonlocal Wasserstein distance. In this section we establish a bound on the nonlocal Wasserstein distance of the form W η,ε ≤ CW 2 + O(√ε), where C is the exact proportionality constant that also appears in the matching lower bound on W η,ε presented in Corollary 5.12.
Then, (ζ (ηε) * K s * ρ t , j t ) t∈[0,1] solves the nonlocal continuity equation, and for all 0 < ε < s, and all t ∈ [0, 1], In particular, for all ρ 0 , ρ 1 ∈ P 2 (R d ), Proof. Let (ρ t , j t ) t∈[0,1] be a solution to the (local) continuity equation in flux form. Let s > 0 be some fixed quantity chosen later on; we define ρ s t := K s * ν t and j s t := K s * j t . Note that these objects are measures; the corresponding Lebesgue densities are K s * ρ t and j s t := K s * j t respectively. By [3, Lemma 8.1.9], (ρ s t , j s t ) also solves the continuity equation in flux form. We then smooth ρ s t and j s t again, using the kernelζ (ηε) , so that ζ (ηε) * ρ s t ,ζ (ηε) * j s t again solves the continuity equation in flux form. By Lemma B.4, we know (since η ε (|x − y|) is supported on B(0, ε)) that if ε < s, then the corresponding Lebesgue densityζ (ηε) * ρ s t has local relative Lipschitz regularity of the form Define j t : [0, 1] → M loc (G) as in Lemma 4.1 with respect to j s t , namely In this case, since j s t has a density with respect to the Lebesgue measure given by j s t , it follows that j t has density with respect to the product Lebesgue measure restricted to G, given by Furthermore, by Lemma 4.1, we know that ζ (ηε) * ρ s t , j t solves the nonlocal continuity equation. Now, let us compare the nonlocal action A η,ε ζ (ηε) * ρ s t , j t with the local action A(ρ s t , j s t ). Relying on the homogeneity of the interpolation θ, we observe (using Lemma B.4) that Therefore, By [29,Corollary 2.16], together with Lemma B.2, we have that Consequently, also using the fact that the support of η ε (x, y) has diameter ε, dy. Likewise, dx and so we deduce that We now take (ρ t , j t ) t∈[0,1] to be a curve of least action for W 2 , connecting ρ 0 and ρ 1 . Then, by using the fact that A(ρ s t , j s t ) ≤ A(ρ t , j t ), and integrating in t, we have that as desired.

NONLOCAL HAMILTON-JACOBI SUBSOLUTION
We first sketch the argument of this section. Our aim is to prove a bound of the following form: Ultimately, we will show that the error terms are of order √ε. Our strategy is to use Hamilton-Jacobi duality for W 2 (and also W η,ε ). Simplifying somewhat:
(i) Let (φ t ) t∈[0,1] be a (viscosity) solution to the Hamilton-Jacobi equation ∂ t φ t + |∇φ t | 2 = 0. Duality theory for W 2 (specifically [45,Proposition 5.48]) tells us that
(ii) We expect that a similar duality theorem holds for W η,ε (but there is a different notion of "nonlocal Hamilton-Jacobi equation"): 1] is a n.l. HJ subsolution .
(iii) We will use solutions of the Hamilton-Jacobi equation to construct subsolutions to the nonlocal Hamilton-Jacobi equations, thus obtaining a lower bound on W 2 η,ε .
Unfortunately, things are not so simple, and we will need to introduce a layer of approximations. Because of the (conjectured, at this point, based on the outcome of Section 4) asymptotic proportionality constant between W 2 and W η,ε , the constant prefactor in any mapping which takes a (local) HJ solution and gives us a nonlocal HJ subsolution must be $\frac{2d}{\varepsilon^2 M_2(\eta)}$. Before proceeding, we present several preparatory lemmas.
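The appearance of d, ε and the second moment of η in this prefactor can be motivated by a first-order heuristic. The sketch below rests on two assumptions not confirmed by the text above: that M_2(η) denotes the second moment of the radial kernel η, and that the rescaled kernel is η_ε(r) = ε^{-d} η(r/ε). It does not account for the remaining factor 2, which depends on conventions for the action that are not reproduced here.

```latex
% Assumptions: M_2(\eta) := \int |z|^2 \eta(|z|)\,dz and
%              \eta_\varepsilon(r) := \varepsilon^{-d}\,\eta(r/\varepsilon).
% Step 1: for a radial kernel and any fixed vector v, the mixed moments
% \int z_i z_j \eta(|z|)\,dz vanish for i \ne j, and the d diagonal
% moments are equal, so
\[
  \int_{\mathbb{R}^d} \langle v, z\rangle^2\,\eta(|z|)\,dz
  \;=\; \frac{|v|^2}{d}\int_{\mathbb{R}^d} |z|^2\,\eta(|z|)\,dz
  \;=\; \frac{|v|^2}{d}\,M_2(\eta).
\]
% Step 2: by the substitution z = \varepsilon w,
\[
  \int_{\mathbb{R}^d} |z|^2\,\eta_\varepsilon(|z|)\,dz
  \;=\; \varepsilon^2\,M_2(\eta).
\]
% Step 3 (heuristic): replacing \varphi(y)-\varphi(x) by its first-order
% approximation \langle \nabla\varphi(x),\, y-x\rangle gives
\[
  \int_{\mathbb{R}^d} \bigl(\varphi(y)-\varphi(x)\bigr)^2\,
     \eta_\varepsilon(|x-y|)\,dy
  \;\approx\; \frac{\varepsilon^2\,M_2(\eta)}{d}\,|\nabla\varphi(x)|^2 .
\]
```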
Note that by Rademacher's theorem, if φ is Lipschitz then ∇φ exists Lebesgue-almost everywhere; the lemma therefore also shows that for the optimal Kantorovich potential, |∇φ| ≤ R.
Proof. This is an easy refinement of standard results concerning Kantorovich duality, such as [46, Theorem 5.10].
Indeed, as discussed on [42, p. 11], if c(x, y) is any continuous cost function and ψ c is any c-convex (resp. c-concave) function, it holds automatically that any modulus of continuity for c(x, y) is also a modulus of continuity for ψ c . In the case of $c(x, y) = \frac{1}{2}|x - y|^2$ on a domain of diameter R, we can take |c(x, y) − c(x ′ , y)| ≤ R|x − x ′ | as a crude global modulus of continuity in the x variable (and of course the same reasoning applies to the y variable). Since the optimal Kantorovich potential φ can always be taken to be c-convex, the claim follows.
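The crude modulus of continuity quoted above follows from a one-line expansion of the quadratic cost. In the sketch below, x, x′ and y are assumed to lie in a common domain of diameter R, so that |x − y| ≤ R and |x′ − y| ≤ R.

```latex
\begin{align*}
  c(x,y) - c(x',y)
    &= \tfrac{1}{2}\bigl(|x-y|^2 - |x'-y|^2\bigr) \\
    &= \tfrac{1}{2}\bigl\langle (x-y)-(x'-y),\ (x-y)+(x'-y) \bigr\rangle \\
    &= \tfrac{1}{2}\bigl\langle x-x',\ (x-y)+(x'-y) \bigr\rangle ,
\end{align*}
% so by the Cauchy-Schwarz and triangle inequalities,
\[
  |c(x,y)-c(x',y)|
    \;\le\; \tfrac{1}{2}\,|x-x'|\,\bigl(|x-y|+|x'-y|\bigr)
    \;\le\; R\,|x-x'| .
\]
```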
The Kantorovich duality formula for W 2 also has a "dynamic" counterpart in terms of solutions to a Hamilton-Jacobi equation. This fact was initially observed in [6,38]; here we just give a proof for convenience.
Definition 5.4. Following [23, Definition 3.1], we say that φ t (x) : [0, 1]×R d → R is a nonlocal Hamilton-Jacobi (HJ) subsolution, and write φ t (x) ∈ HJ 1 NL , if, for a.e. t ∈ (0, 1), the partial derivative ∂ t φ t (x) exists for every x ∈ R d , and sup x∈R d |∂ t φ t (x)| < ∞; and we have, for all probability measures µ ∈ P(R d ), and for any (hence all) λ such that µ ≪ λ,

Remark 5.5 (Nonlocal Hamilton-Jacobi solutions). In the present work, it is the notion of nonlocal HJ subsolution which is relevant. Nonetheless, we mention here an associated notion of solution. Following [28], we say that φ t (x) is a nonlocal Hamilton-Jacobi solution if it is a nonlocal Hamilton-Jacobi subsolution, and moreover and we have that We leave investigation of this rather atypical PDE to future work.
The duality formula we expect to hold for W η,ε is However, in this work, we do not attempt to prove this duality formula directly. Rather, for technical reasons, we introduce a "smoothed version" of the nonlocal Wasserstein distance (for which we do prove a partial duality result), as follows.
We also use the notation µ s := K s * µ and j s := K s * j. In this argument, we allow µ t to take values in M + (R d ) rather than just P(R d ). Note that in this situation, the action A(µ, j) and the nonlocal continuity equation are still well-defined. Additionally, there will be no loss of generality in assuming that, for paths we consider, A(µ s t , j s t ) < ∞ for almost all t ∈ [0, 1]; since otherwise, we will find that W 2 η,ε,s (K s * µ 0 , K s * µ 1 ) = ∞, and so the proposition holds trivially. In particular, quantification over "all" (µ t , j t ) t∈[0,1] shall be understood to mean that: to pass to the limit in the nonlocal continuity equation, and deduce that more generally, Therefore, for any two s-smoothed probability measures µ s 0 and µ s 1 with W 2 η,ε,s (µ s 0 , µ s 1 ) < ∞, we have that (Note that the inner supremum takes the value +∞ unless (5.1) holds.) Using the general fact that sup inf ≤ inf sup, we see that which in turn implies that (now letting the infimum quantify over a larger set, without fixed endpoints µ 0 and µ 1 ) Observe that due to the 1-homogeneity in (µ s t , j s t ) t∈[0,1] of both the total action and nonlocal continuity equation, the inner infimum evaluates to −∞ unless φ t is chosen so that, for all (µ t , j t ) t∈[0,1] , Therefore, we conclude that Again, this condition holds provided that the inner infimum is ≥ 0 (as opposed to −∞). Of course, if it is the case that the inner infimum is nonnegative (for a given φ t (x) ∈ BL([0, 1] × R d )), we certainly have that Therefore, we deduce the duality relation In turn, since in general, this implies that However, the quantity ( †) is independent of j t . Therefore the quantification over (µ t ) t∈[0,1] is over all weak*-continuous "curves" in M + (R d ) such that there exists a (j t ) t∈[0,1] for which A(µ s t , j s t ) < ∞ for almost all t ∈ [0, 1]; but we do not even require satisfaction of the nonlocal continuity equation, so we can always just take j t = 0. From this, by restricting the quantification only to "curves" (µ t ) t∈[0,1] which are constant and belong to P(R d ), we deduce that as desired.
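The general minimax inequality sup inf ≤ inf sup invoked in the argument above has a short proof, recorded here for convenience. The sets X, Y and the function F are generic placeholders introduced for illustration; they are not notation from the text.

```latex
% Let F : X \times Y \to [-\infty, +\infty] be arbitrary.
% For any fixed x_0 \in X and y_0 \in Y,
\[
  \inf_{x \in X} F(x, y_0) \;\le\; F(x_0, y_0) \;\le\; \sup_{y \in Y} F(x_0, y).
\]
% The left side does not depend on x_0 and the right side does not depend
% on y_0, so taking the supremum over y_0 and then the infimum over x_0 gives
\[
  \sup_{y \in Y}\,\inf_{x \in X} F(x,y) \;\le\; \inf_{x \in X}\,\sup_{y \in Y} F(x,y).
\]
```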
Proof. In this appendix, we establish the following variant of Reshetnyak's theorem as well as some direct consequences of this theorem which are used in the main body of the article (for other variants on this theorem, we refer the reader to [1,2,9,41,43]).
Theorem A.1. (locally finite, topological and sequential "Reshetnyak's theorem") Let Ω be a locally compact Polish space, and let f : Ω × R n → [0, ∞] be a (topologically) lower semicontinuous function such that for every ω ∈ Ω, the function f (ω, ·) is convex and positively 1-homogeneous. Then the functional is convex, and both topologically and sequentially weak* lower semicontinuous.
In the statement of this theorem, the "weak* topology on M loc (Ω, R n )" has the following sense: we consider M loc (Ω, R n ) as the dual space of C c (Ω, R n ), where C c (Ω, R n ) is understood as a locally convex t.v.s. which is a direct limit of the Banach spaces C c (K, R n ) for every compact K ⊂ Ω. (See also the discussion on this point in [2,Chapter 1].) The reason why the statement of the theorem carefully specifies that F is both topologically and sequentially weak* l.s.c. is that in this setting, it is not known to the authors whether the two notions coincide. (In particular, M loc (Ω, R n ) is not the weak* dual of a separable Banach space, so we cannot apply [9, Proposition 1.1.6 (iii)].) For our immediate purposes, the sequential weak* l.s.c. property is used in variational arguments, while F being topologically weak* l.s.c. implies directly that F is weak* Borel measurable.
Proof. Convexity of F follows directly, regardless of the underlying domain, from the convexity and 1-homogeneity of f .
Let K ⊂ Ω be compact. Consider the functional namely precomposition of the continuous projection π K with the functional F (· ↾ K) (which has domain M(K, R n )). The functional F (· ↾ K) is known, by Reshetnyak's theorem (more precisely, [9, Theorem 3.4.3]), to be sequentially weak* lower semicontinuous on M(K, R n ). Since M(K, R n ) is the dual of a separable Banach space, and F (· ↾ K) is convex, F (· ↾ K) is also topologically weak* lower semicontinuous on M(K, R n ). Since the precomposition of a continuous map with a (topologically) l.s.c. map is again l.s.c., it follows that F (· ↾ K) • π K is a topologically lower semicontinuous map from M loc (Ω, R n ) to [0, ∞]. Now, consider a sequence (K n ) of compact sets which exhausts Ω.
In fact, from this one can also deduce that´Ω f ω, dλ d|λ| (ω) d|λ|(ω) is also sequentially weak* l.s.c.: this follows from the fact that a function is sequentially l.s.c. with respect to some topology τ iff it is l.s.c. with respect to the sequential topology induced by τ , and this topology is at least as fine as τ itself (see discussion in [9], especially [9, Proposition 1.1.5]).
As an application, we give a proof of the following convolution inequality.

Lemma A.3. Let µ ∈ P(R d ) and j ∈ M loc (G), and suppose that A η,θ (µ, j) < ∞. Let k be a convolution kernel, and for each z ∈ R d , let µ z and j z denote the z-translates of µ and j respectively: namely for all ψ ∈ C ∞ c (R d ) and ϕ ∈ C ∞ c (G), $\int_{\mathbb{R}^d} \psi(x)\,d\mu_z(x) := \int_{\mathbb{R}^d} \psi(x + z)\,d\mu(x)$; $\iint_G \varphi(x, y)\,dj_z(x, y) = \iint_G \varphi(x + z, y + z)\,dj(x, y)$.
We note that morally, this is just an application of Jensen's inequality, but we are unaware of a version of Jensen's inequality in the literature that applies to this setting. Therefore we give an ad hoc proof, essentially the same (albeit with additional complications) as the proof of Proposition 3.9.
Since A η,θ is jointly convex, we observe that Now, suppose that $\int_{\mathbb{R}^d} A_{\eta,\theta}(\mu_z, j_z)\,k(|z|)\,dz < \infty$, since otherwise the lemma holds trivially. Since the action is nonnegative, the strong law of large numbers shows that almost surely. Therefore, if we can verify that $\frac{1}{n}\sum_{i=1}^n \mu_{Z_i} \overset{*}{\rightharpoonup} k * \mu$ and $\frac{1}{n}\sum_{i=1}^n j_{Z_i} \overset{*}{\rightharpoonup} k * j$ almost surely, it then follows from the joint sequential lower semicontinuity of A η,θ that j Z i and the lemma is proved. We therefore verify that $\frac{1}{n}\sum_{i=1}^n \mu_{Z_i} \overset{*}{\rightharpoonup} k * \mu$ and $\frac{1}{n}\sum_{i=1}^n j_{Z_i} \overset{*}{\rightharpoonup} k * j$ almost surely; note that this is a similar result to the Glivenko-Cantelli theorem.
Let ψ ∈ C 0 (R d ). Then $\int_{\mathbb{R}^d} \psi(x)\,d\mu_Z(x)$ is an L 1 random variable, in fact L ∞ since for all ω ∈ Ω It follows from the strong law of large numbers that compact support when restricted to any K ⊂ G, this implies that almost surely, again in duality with C c (K). Since K was arbitrary, by the characterization of weak* convergence in M loc (G) we have that almost surely, in duality with C c (G). By identical reasoning, we have that 1 dr.