The Sinkhorn algorithm, parabolic optimal transport and geometric Monge-Amp\`ere equations

We show that the discrete Sinkhorn algorithm, as applied in the setting of Optimal Transport on a compact manifold, converges in the large-scale limit to the solution of a parabolic PDE of Monge-Amp\`ere type. The latter evolution equation has previously appeared in different contexts (in particular, on the torus it can be identified with the Ricci flow). This leads to a meshfree algorithmic approximation of the potential of the optimal transport map (or equivalently, of solutions to certain geometric Monge-Amp\`ere equations) with explicit bounds on the time-complexity of the construction and the approximation errors. Applications to the far-field antenna problem are given and connections to Quasi-Monte Carlo methods are provided.


Introduction
The theory of Optimal Transport [60,61] is used in a multitude of applications ranging from economics, statistics, cosmology, geometric optics and meteorology to more recent applications in data processing (see the surveys [57,58]). In recent years there has been a flurry of numerical work applying the Sinkhorn algorithm [55] (aka the Iterative Proportional Fitting Procedure [48]) as a fast and efficient way of computing approximations to optimal transport maps, or equivalently, solutions to certain geometric Monge-Ampère type equations. This is motivated by applications to machine learning [25] (concerning optimal transport in Euclidean R^n) and computer graphics and image processing [57] (where the general setting of optimal transport on Riemannian manifolds is considered). The key advantage of the Sinkhorn algorithm in this context is its favorable large-scale computational properties (parallelization, quadratic complexity, linear time convergence, etc. [3]).
The main aim of the present paper is to show that, in the large-scale limit, the Sinkhorn algorithm converges towards the solution of a parabolic PDE of Monge-Ampère type, which, incidentally, has previously appeared in [50,40,41] and is called the parabolic optimal transport equation in [40]. The convergence is shown with explicit error estimates. This leads, in particular, to an algorithmic approximation of the potential of the optimal transport map with explicit bounds on the time-complexity of the construction and the approximation errors introduced by the discretization.
1.1.1. The Sinkhorn algorithm. Let p and q be two vectors in R^N_+ whose entries sum to one. Given any matrix K ∈ R^{N×N}_+ there exist, by Sinkhorn's theorem [55], two diagonal positive matrices D_a and D_b, with diagonal vectors a and b in R^N_+, such that the matrix B := D_b K D_a has the property that the rows sum to p and the columns sum to q. The construction is essentially unique (i.e. up to scaling D_a and D_b^{−1} by a positive number) and B can be obtained as the limit of the process of alternately normalizing the rows and columns of the matrix. In other words,

B = lim_{m→∞} D_{b(m)} K D_{a(m)},

where the pair of positive vectors (a(m), b(m)) is defined by the following recursion, formulated in terms of matrix-vector multiplications and component-wise division of vectors:

b(m) := p ⊘ (K a(m)), a(m+1) := q ⊘ (K^T b(m)),

with initial data a(0) taken as the vector with all entries equal to 1. In fact, any initial positive vector a(0) will do, and a(m) and b(m) even converge without any need of rescaling, as follows from Theorem 2.8.
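The alternating normalization above can be sketched in a few lines of Python (a minimal illustration of the recursion, not the paper's implementation; the names are ours):

```python
import numpy as np

def sinkhorn(K, p, q, num_iter=500):
    """Alternate row/column normalization of K: returns (a, b) such that
    B = diag(b) @ K @ diag(a) has row sums p and column sums q."""
    a = np.ones_like(p)          # any positive initial vector will do
    for _ in range(num_iter):
        b = p / (K @ a)          # normalize the row sums
        a = q / (K.T @ b)        # normalize the column sums
    return a, b

# toy example: a random positive kernel and uniform marginals
rng = np.random.default_rng(0)
N = 4
K = rng.random((N, N)) + 0.1
p = np.full(N, 1.0 / N)
q = np.full(N, 1.0 / N)
a, b = sinkhorn(K, p, q)
B = np.diag(b) @ K @ np.diag(a)
print(B.sum(axis=1), B.sum(axis=0))
```

Note that scaling the initial vector a(0) only shifts (a, b) by reciprocal factors, leaving B unchanged, in line with the essential uniqueness above.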
The same algorithm has appeared in various fields. In its most general (infinite dimensional) form, known as the Iterative Proportional Fitting Procedure in the statistics literature, the roles of p and q are played by probability measures on two (possibly non-finite) topological spaces. In this setting the corresponding convergence of B(m) towards a limit B was established in [48] using a maximum entropy characterization of B, which in the finite setting above says that B is the unique element realizing the infimum

inf_{γ∈Π(p,q)} I(γ|K),

where I denotes the Kullback-Leibler divergence of γ relative to K, when γ and K are identified with measures on the discrete product {1, ..., N}², and Π(p, q) denotes the set of all matrices γ in R^{N×N}_+ with row sum p and column sum q, i.e. such that the corresponding measures on {1, ..., N}² have marginals p and q, respectively (−I(γ|K) is the corresponding "physical" entropy). An alternative proof of the convergence follows from Theorem 2.8 below, which also shows that a(m) and b(m) have unique limits (determined by the initial value a(0)).

1.1.2.
Discrete optimal transport. Now replace K with a family of matrices K_ǫ of the form

(K_ǫ)_{ij} := e^{−C_{ij}/ǫ},

for a given cost matrix C_{ij}, parametrized by a positive number ǫ. Then the corresponding matrix B_ǫ ∈ Π(p, q) furnished by Sinkhorn's theorem converges, as ǫ → 0, to a matrix B_0 realizing the infimum

C := inf_{γ∈Π(p,q)} ⟨C, γ⟩.

In the terminology of discrete optimal transport theory [60,26] this means that B_0 is an optimal transport plan (coupling) between p and q, with respect to the cost matrix C. The convergence follows from noting that the maximum entropy characterization of B_ǫ shows that B_ǫ realizes the perturbed minimum C_ǫ := inf_{γ∈Π(p,q)} ⟨C, γ⟩ + ǫ I(γ|1), where 1 denotes the matrix with all entries equal to one.
While the matrix B_0 is sparse (and is typically supported on the graph of a transport map), the approximation B_ǫ always has full support (by Sinkhorn's theorem) and is thus more "regular" than B_0. Accordingly, the small parameter ǫ is sometimes referred to as the entropic regularization parameter. This is illustrated by the simulations in [3] for the case when p and q represent the discretization of two probability measures on the unit interval in R, using a large number N of points and with C_ij the cost matrix defined by the squared distance function on R. When ǫ is taken to be of the order 1/N, [3, Fig 1] shows how the discrete probability measures on R × R appear as smoothed out versions of the graph of the corresponding optimal transport map. It should also be pointed out that the entropy minimization problem above goes back to the work by Schrödinger on Quantum Mechanics in the 1930s [54] (see the survey [46], where the connection to optimal transport is emphasized).
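The regularizing effect can be seen in a toy computation (our own illustration, not reproduced from [3]): for the squared distance cost on five equally spaced points with equal uniform marginals, the unregularized optimal plan is the identity coupling, and the diagonal mass of B_ǫ tends to 1 as ǫ → 0:

```python
import numpy as np

def entropic_plan(C, p, q, eps, num_iter=5000):
    """Sinkhorn fixed point B_eps for the kernel K_eps = exp(-C/eps)."""
    K = np.exp(-C / eps)
    a = np.ones_like(p)
    for _ in range(num_iter):
        b = p / (K @ a)
        a = q / (K.T @ b)
    return b[:, None] * K * a[None, :]

# quadratic cost between five equally spaced points on the line
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2
p = q = np.full(5, 0.2)
for eps in (1.0, 0.1, 0.01):
    B = entropic_plan(C, p, q, eps)
    print(eps, np.trace(B))   # diagonal mass grows as eps shrinks
```

For ǫ = 1 the plan is diffuse (full support), while for ǫ = 0.01 it is numerically concentrated on the diagonal, illustrating the sparsity of the limit B_0.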
1.1.3. Discretization of Optimal Transport on the torus. Let now X be a compact manifold (without boundary) endowed with a Riemannian distance function d(x, y). In applications one typically discretizes the manifold by fixing N points on X, i.e. a "point cloud". One then fixes an entropic regularization parameter ǫ, whose size depends on N (and is typically taken to be of the order of the "spatial resolution" of X [57]). To keep things as simple as possible we will start by taking the manifold X to be the n-dimensional torus T^n := (R/Z)^n endowed with the standard distance function d(x, y), induced from the Euclidean distance function on R^n. Let µ and ν be two probability measures on T^n (which thus correspond to two periodic measures on R^n) with C²-smooth and strictly positive densities e^{−f} and e^{−g}, respectively [1]:

µ = e^{−f} dV, ν = e^{−g} dV,

where dV is the Riemannian normalized volume form on T^n. As is well-known, a map F transporting (i.e. pushing forward) µ to ν is optimal with respect to the cost function d(x, y)² iff it can be expressed in terms of a potential u ∈ C⁴(T^n):

F(x) := x + ∇u(x), T^n → T^n,

which is strictly quasi-convex in the sense that the symmetric matrix ∇²u + I is positive definite (we identify u with a Z^n-periodic function on R^n, so that the gradient ∇u on R^n descends to define a self-map of T^n). The function u is uniquely determined, up to an additive constant, by the following Monge-Ampère equation:

(1.2) exp(−g(x + ∇u(x))) det(I + ∇²u(x)) = exp(−f(x)).

A standard way to discretize this setting is to replace the torus T^n with a regular point set Λ_k in T^n such that the corresponding "grid" has edge length k^{−1}, where k is a positive integer, i.e. Λ_k is the discrete torus

Λ_k := (k^{−1}Z/Z)^n.

Let x_1, ..., x_{N_k} be an enumeration of the points in Λ_k; the total number N_k of points is thus given by N_k = k^n. [1] If f and g are merely assumed to be in the Hölder space C^α(T^n), for α > 0, then the convergence results below still apply (with similar proofs), but with an error term of the order k^{−α/2}. Denote by p^{(k)} and q^{(k)} the corresponding discrete approximations in R^{N_k} of µ and ν, defined by the normalized values of the densities of µ and ν at the points in Λ_k. Defining a sequence of N_k × N_k matrices K^{(k)} by

(1.3) K^{(k)}_{ij} := K^{(k)}(x_i, x_j), K^{(k)}(x, y) := e^{−k d(x,y)²/2},

and applying Sinkhorn's theorem to the triple (K^{(k)}, p^{(k)}, q^{(k)}) furnishes two positive vectors a^{(k)} and b^{(k)} in R^{N_k}, uniquely determined by the normalization condition that a^{(k)}_{i_k} = 1 for the index i_k corresponding to the point x_{i_k} = 0 in Λ_k.
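Concretely, the discretization data (Λ_k, p^{(k)}, q^{(k)}, K^{(k)}) can be assembled as follows (a schematic Python construction for n = 1; the densities below are hypothetical, chosen only for illustration):

```python
import numpy as np

def torus_dist2(x, y):
    """Squared distance on the torus (R/Z)^n: per coordinate, take the
    shorter of the two ways around the circle."""
    d = np.abs(x - y)
    d = np.minimum(d, 1.0 - d)
    return np.sum(d ** 2, axis=-1)

k = 32                                    # grid parameter; N_k = k^n with n = 1
pts = (np.arange(k) / k)[:, None]         # the discrete torus Lambda_k
# hypothetical smooth positive densities e^{-f}, e^{-g} (illustration only)
f_vals = 0.1 * np.cos(2 * np.pi * pts[:, 0])
g_vals = 0.1 * np.sin(2 * np.pi * pts[:, 0])
p = np.exp(-f_vals); p /= p.sum()         # p^(k): normalized density values
q = np.exp(-g_vals); q /= q.sum()
# K^(k)_{ij} = exp(-k d(x_i, x_j)^2 / 2), the kernel of formula 1.3
K = np.exp(-k * torus_dist2(pts[:, None, :], pts[None, :, :]) / 2)
print(K.shape)
```

The triple (K, p, q) is then exactly the input to which Sinkhorn's theorem is applied.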

1.2.
Main results in the torus setting. We will investigate the joint limit where the entropic regularization parameter ǫ coincides with the edge length in the discretization scheme above: ǫ := k^{−1}. Our first result shows that the potential u for the optimal transport problem between µ and ν can be recovered from the positive vectors a^{(k)} and b^{(k)} furnished by the Sinkhorn theorem:

Theorem 1.1. (Static case) If x_{i_k} is a sequence of points in the discrete torus Λ_k converging to the point x in the torus T^n, as k → ∞, then

lim_{k→∞} −k^{−1} log a^{(k)}_{i_k} = u(x),

where u is the unique optimal transport potential solving the Monge-Ampère equation 1.2 and normalized by u(0) = 0. Moreover, the convergence is uniform with respect to x.
This convergence result should come as no surprise, and it holds in a very general setting (see Theorems 3.3-3.4). But the main point of the present paper is that the Sinkhorn algorithm itself, when viewed as a discrete dynamical system for the positive vectors a^{(k)}(m), also admits a continuous large-scale limit u_t(x), evolving according to the following parabolic PDE:

(1.4) ∂u_t(x)/∂t = log( det(I + ∇²u_t(x)) e^{f(x) − g(x + ∇u_t(x))} ).

The existence of a C⁴-smooth solution u_t, given strictly quasi-convex C²-smooth initial data u_0, follows from the results in [50]. In order to prove the convergence we first observe that the function

u^{(k)}_m := −k^{−1} log a^{(k)}(m)

on the discrete torus Λ_k admits a canonical extension, defining a quasi-convex function on X, expressed in terms of a Fourier/Gauss type sum, with k playing the role of the band-width.
Theorem 1.2. (Dynamic case) For any sequence of discrete times m_k such that m_k/k → t we have lim_{k→∞} u^{(k)}_{m_k} = u_t uniformly on T^n, where u_t is the smooth and strictly quasi-convex solution of the parabolic PDE 1.4 with initial data u_0 = 0. More precisely, an explicit error estimate holds, where the constant C depends explicitly on upper bounds on the C²-norms of f and g and a strictly positive lower bound on f and g. More generally, if at the initial discrete time m = 0 we take u^{(k)}_0 = u_0, where u_0 is the restriction to the discrete torus Λ_k of a fixed C⁴-smooth and strictly quasi-convex function on the torus T^n, then the corresponding result still holds.
Using that u_t converges exponentially, as t → ∞, to a potential u for the optimal transport problem [40], we thus arrive at the following corollary.

Corollary 1.3. For suitable sequences of discrete times m_k, the functions u^{(k)}_{m_k}(x_{i_k}) converge uniformly to the optimal transport potential u(x) as k → ∞. Moreover, the discrete probability measures γ_k on T^n × T^n determined by the Sinkhorn algorithm converge weakly towards the corresponding optimal transport plan (I × (∇u + I))_* µ, concentrating exponentially, at a rate governed by some positive constant p, on the graph Γ of the transport map ∇u + I, where d_Γ denotes the distance to the graph Γ in T^n × T^n.
Since each iteration in the Sinkhorn algorithm may be formulated in terms of matrix-vector multiplication, which requires O(N²) arithmetic operations, and since k = N^{1/n}, the direct construction of the function u^{(k)} uses, in general, O(N^{2+1/n} log N) elementary arithmetic operations. Moreover, in the present case of the torus the matrix-vector operations in question are discrete convolutions and can thus be performed using merely O(N log N) arithmetic operations, by using the Fast Fourier Transform (see Section 5.3.1).
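On the torus the matrix K^{(k)} is circulant, so K^{(k)}a is a circular convolution; the sketch below (our own illustration, for n = 1) checks that the O(N log N) FFT evaluation agrees with the dense O(N²) matrix-vector product:

```python
import numpy as np

k = 64
x = np.arange(k) / k
d = np.minimum(x, 1.0 - x)                 # torus distance from each grid point to 0
kernel = np.exp(-k * d ** 2 / 2)           # one row of the circulant matrix K^(k)
kernel_hat = np.fft.rfft(kernel)

def K_mul(a):
    """Evaluate K @ a as a circular convolution via the FFT, in O(N log N)."""
    return np.fft.irfft(kernel_hat * np.fft.rfft(a), n=k)

# compare against the dense O(N^2) matrix-vector product
diff = np.abs(x[:, None] - x[None, :])
K = np.exp(-k * np.minimum(diff, 1.0 - diff) ** 2 / 2)
a = np.random.default_rng(1).random(k)
print(np.max(np.abs(K_mul(a) - K @ a)))    # agreement up to rounding error
```

Precomputing the Fourier transform of the kernel once, each Sinkhorn step then costs two FFTs and two inverse FFTs.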
By symmetry, Theorem 1.2 also shows that, on the one hand, the functions v^{(k)}_{m_k} := −k^{−1} log b^{(k)}(m_k) converge, as k → ∞, towards the solution v_t of the parabolic equation obtained by interchanging the roles of µ and ν. On the other hand, by Lemma 3.1, the function v^{(k)}_{m_k} is equal to the Legendre transform (in the space variable) of u^{(k)}_{m_k}, up to a negligible error term of the order k^{−1} log k. Thus Theorem 1.2 is consistent, as it must be, with the fact that the Legendre transform of the solution u_t of equation 1.4 solves the parabolic equation obtained by interchanging the roles of µ and ν (as can be checked by a direct calculation).

1.3.
Generalizations to compact manifolds. The result in the static setting is shown to hold in a very general setting of optimal transport between two probability measures µ and ν defined on compact topological spaces X and Y, respectively (see Theorem 3.3 and also Theorem 3.5 which holds in the classical (non-compact) Euclidean setting where X is equal to R n ). Then the roles of the positive vectors p (k) and q (k) , discretizing µ and ν, are played by two sequences µ (k) and ν (k) , satisfying certain density properties with respect to µ and ν, which are almost always satisfied in practice. Moreover, the cost function c is merely assumed to be continuous and can even be replaced by any sequence c k converging uniformly to c, as the inverse k of the entropic regularization parameter tends to infinity (which applies, in particular, to the convolutional Wasserstein distances introduced in [57], where the matrix in formula 1.3 is replaced by a heat kernel; see Section 3.1.2).
The corresponding result in the dynamic setting (see Theorem 4.8) requires that X and Y be compact Riemannian manifolds (without boundary), together with further regularity assumptions on c, µ and ν. In particular, a "local density property" of the approximations µ^{(k)} and ν^{(k)} is required, roughly saying that the approximations of µ and ν hold up to length scales of the order k^{−1/2} around any point, with an O(1/k) error term. Interestingly, the local density properties turn out to be satisfied when the approximations µ^{(k)} and ν^{(k)} are defined by (weighted) point clouds, generated using suitable Quasi-Monte Carlo methods from numerical analysis [9,14,10] (as explained in Section 4.2). Then the entropic regularization parameter k^{−1} is comparable to the covering radius (mesh norm) of the corresponding point clouds [9]. This provides a theoretical justification for taking the entropic regularization parameter in the Sinkhorn algorithm to be comparable to the "spatial resolution" of the corresponding discretization scheme (as was done in the numerical simulations on Riemannian manifolds in [57]).
The general results are applied to the case of the round sphere endowed with two different cost functions: (1) d(x, y)² and (2) −log |x − y|. These two cases appear, for example, in applications to (1) computer graphics (texture mapping), medical imaging [29,65] and mesh adaptation for global weather and climate prediction [63], and (2) the reflector antenna problem in geometric optics [62,35]. Nearly linear complexity of the corresponding Sinkhorn iteration is achieved in both cases, using fast transforms.
1.4. Relation to previous results. To the best of the author's knowledge these are the first convergence results concerning the Sinkhorn algorithm (and its fixed points) in the limit when the number N of points and the inverse of the regularization parameter ǫ jointly tend to infinity (see [19] and references therein for the static case when only ǫ^{−1} tends to infinity in the Euclidean R^n setting, and [45,46] for a very general setting). This kind of joint limit is, in practice, what is studied in numerical simulations (see for example [57]), and the convergence analysis in the present paper thus provides a theoretical basis for the simulations. We do not address questions of numerical stability (in connection with floating-point arithmetic) and stabilization, and instead refer to [19, Section 4] and the in-depth study in [53] (where various numerical speed-ups are explored).
The convergence in Corollary 1.3 should be compared with previous results in the rapidly growing literature on numerical approximation schemes for solutions to Optimal Transport problems and (generalized) Monge-Ampère equations (satisfying a second boundary value condition). However, the author is not aware of any previous results providing both complexity bounds (in terms of N) and a quantified rate of convergence of the error of the approximate solution, as N → ∞. We recall that the oldest approach is to approximate the optimal transport potential u by solving the linear program which is dual to the discrete optimal transport problem. This can be done in about O(N³) time-complexity (see, for example, the exposition in [11], where applications to cosmology are given). Another influential approach is the Benamou-Brenier approach, using computational fluid mechanics [2] (numerical experiments suggest that it has O(N³) time-complexity, as pointed out in [4]). There is also a rapidly expanding literature on other discretization schemes in the PDE and numerics literature (mainly in the flat case of R^n and the torus). We refer the reader to the introduction of the very recent work [4], where references to the two main numerical branches are provided (the semi-discrete approach, e.g. [42], and finite difference methods, e.g. [52]), which, experimentally, show nearly linear complexity. A survey of numerical methods for general fully non-linear partial differential equations, with an emphasis on Monge-Ampère equations, is given in [33]. See also the recent work [47] for a new convex optimization approach and [59] for numerical simulation (based on finite difference schemes) of the analog of the parabolic equation in the case of bounded convex Euclidean domains. One advantage of the present framework over finite difference and finite element type approaches, when applied to general manifolds, is that it is meshfree.
In other words, it does not require generating a grid or a polyhedral tessellation of the manifolds, but only a suitable point cloud, which can be efficiently generated using Quasi-Monte Carlo methods. In the case of the round sphere various different numerical algorithms have previously been explored in the literature: see [29,65,63] for experimental work on the case of the cost function d(x, y)² and [16,27] for the case of the cost function −log |x − y|, as applied to the reflector antenna problem in geometric optics.
The present results are very much inspired by an analogous setup which appears in complex (Kähler) geometry. Briefly, the role of the Sinkhorn algorithm is then played by Donaldson's iteration [31], whose fixed points are called balanced metrics, and k appears as the power of a given ample line bundle over X. From this point of view Theorem 1.1 (and its generalization Theorem 3.3) is the analog of [8, Thm B], and Theorem 1.2 is the analog of the result in [6] showing that Donaldson's iteration converges to the Kähler-Ricci flow [18]. In fact, identifying the real torus T^n with a reduction of the complex torus X := C^n/(Z^n + iZ^n) (or more generally, an abelian variety), the parabolic flow 1.4 can, in the case when g = 0, be identified with a twisted Kähler-Ricci flow whose stationary solutions are Kähler potentials solving the corresponding complex Monge-Ampère equation (known as the Calabi-Yau equation in this context). 2 A new feature of the theoretical analysis in the present paper, compared to the usual situation in Kähler geometry, is that the target measure µ is taken to be discrete and to depend on k (in practice, such discretizations are used in implementations of Donaldson's iteration, such as the experimental work [32], motivated by String Theory). Interestingly, the density condition on the sequence µ^{(k)} appearing in Lemma 3.1 can be viewed as a real analog of the Bernstein-Markov property for a sequence µ_k, as studied in the complex geometric and pluripotential theoretic setting (see the discussion on page 8 in [7]). The relations between the real and complex settings will be expanded on elsewhere.
1.5. Acknowledgements. I am grateful to Klas Modin for many discussions and, in particular, for drawing my attention to the recent numerical work on the applications of the Sinkhorn algorithm to Optimal Transport, which was the starting point of the present paper. Also thanks to Gabriel Peyré for providing me with further references. Lacking background in Numerics and Optimal Transport I would like to apologize for any omission in accrediting results properly. This work was supported by grants from the ERC and the KAW foundation.
1.6. Organization. In Section 2 a general setting for iterations on C(X), generalizing the Sinkhorn algorithm, is introduced. The iteration in question, which is determined by a triple (µ, ν, c), is essentially equivalent to the Iterative Proportional Fitting Procedure, and the results in Section 2 are probably more or less well-known (except perhaps Theorem 2.8). But one point of the presentation is to exploit the variational structure. It can be viewed as a real analogue of the formalism introduced in [6], in the setting of Donaldson's iteration [31], and it lends itself to various generalizations of the optimal transport problem (such as Monge-Ampère equations with exponential non-linearities). The main results in the torus setting, stated in the introduction above, are proved in Section 3. In fact, the general form of the static result given in Theorem 1.1 is proved, as it requires not much more work than in the torus setting. Applications to convolutional Wasserstein distances are also given. Then, in the following Section 4, the dynamic result in Theorem 1.2 is generalized from the torus setting to a rather general setting of optimal transport on compact manifolds, where the connection to the general parabolic transport equation introduced in [40] is made. In Section 5 it is shown that nearly linear complexity can be achieved in the case of optimal transport on the torus and the sphere (which applies, in particular, to the reflector antenna problem). Section 6 gives an outlook on relations to singularity formation in the parabolic optimal transport equations. In the appendix a proof of a discrete version of the classical stationary phase approximation is provided.

General setup and preliminaries
If Z is a compact topological space then we will denote by C(Z) the space of continuous functions on Z endowed with the sup-norm and by P(Z) the space of all (Borel) probability measures on Z, endowed with the weak topology. Given a subset S of Z we will denote by χ S the function which is equal to 1 on S and infinity otherwise.
Throughout the paper we will assume given a triple (µ, ν, c), where µ and ν are probability measures on compact topological spaces X and Y, respectively, and c is a continuous function on X × Y. The supports of µ and ν will be denoted by X_µ and Y_ν, respectively. Given u ∈ C(X) and v ∈ C(Y) we will, abusing notation slightly, identify u and v with their pull-backs to X × Y.

2.1.
Recap of Optimal Transport and the c-Legendre transform. Let us start by recalling the standard setup for optimal transport (see the book [60] for further background). A probability measure γ ∈ P(X × Y) is said to be a transport plan (or coupling) between µ and ν if the push-forwards of γ to X and Y are equal to µ and ν, respectively. The subspace of all such probability measures in P(X × Y) will be denoted by Π(µ, ν). A transport plan in Π(µ, ν) is said to be optimal wrt the cost function c if it realizes the following infimum:

inf_{γ∈Π(µ,ν)} ∫_{X×Y} c γ.

By weak compactness such an optimal transport plan always exists.
The c-Legendre transform u^c of a function u ∈ C(X) is defined as the following function on Y:

u^c(y) := sup_{x∈X} (−c(x, y) − u(x)),

and a function u ∈ C(X) is said to be c-convex if u^{cc} = u. The transform u^c is always c-convex. Indeed, this follows from the observation that u^{ccc} = u^c for any u ∈ C(X), which in turn follows from u^{cc} ≤ u.
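With the convention u^c(y) := sup_x(−c(x, y) − u(x)) (so that u + u^c + c ≥ 0), the identities u^{cc} ≤ u and u^{ccc} = u^c can be checked numerically on a finite point set; the following sketch is our own illustration, with a randomly generated cost:

```python
import numpy as np

def c_transform(u, C):
    """u^c(y) = max_x (-C[x, y] - u[x]) on a finite point set."""
    return np.max(-C - u[:, None], axis=0)

rng = np.random.default_rng(2)
N = 50
C = rng.random((N, N))                 # a generic cost matrix c(x_i, y_j)
u = rng.random(N)
uc = c_transform(u, C)                 # u^c, a function on Y
ucc = c_transform(uc, C.T)             # u^{cc}, back on X (roles of X, Y swapped)
uccc = c_transform(ucc, C)
print(np.all(ucc <= u + 1e-9), np.allclose(uccc, uc))
```

The double transform u^{cc} is the largest c-convex function below u, which is why applying the transform a third time reproduces u^c.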
Proposition 2.1. ("Optimality criterion") A transport plan γ ∈ Π(µ, ν) is optimal iff there exists u ∈ C(X) which is c-convex and such that γ is supported in the contact set

Γ_u := {(x, y) ∈ X × Y : u(x) + u^c(y) + c(x, y) = 0}.

Proof. This is standard and known as the Knott-Smith optimality criterion (in the Euclidean setting) [60]. For completeness we provide the simple proof of the direction that we shall use later on. Consider the following functional on C(X) (often called the Kantorovich functional):

(2.2) J(u) := ∫_X u µ + ∫_Y u^c ν.
Since u + u^c + c ≥ 0 on X × Y, the following lower bound holds for any given γ ∈ Π(µ, ν):

(2.3) ∫_{X×Y} c γ ≥ −J(u).

Now, if γ is supported in Γ_u it follows directly that ∫_{X×Y} c γ = −J(u), and hence γ attains the lower bound above, i.e. γ is optimal. q.e.d.
Remark 2.2. A byproduct of Theorem 3.3 below (applied to the case when µ_k = µ and ν_k = ν for all k) is a proof that there always exists a transport plan γ_* with support in Γ_u for some c-convex function u. Since γ_* saturates the lower bound 2.3, it follows that taking the infimum over all γ in Π(µ, ν) yields equality in 2.3. As a consequence,

inf_{γ∈Π(µ,ν)} ∫_{X×Y} c γ = sup_{u∈C(X)} −J(u).

This is the content of Kantorovich duality, which is usually shown using Rockafellar-Fenchel duality in topological vector spaces [60].
2.1.1. The torus setting. In the case when X = Y = T^n := (R/Z)^n we will take c(x, y) to be half the squared standard distance function on T^n:

(2.4) c(x, y) := d(x, y)²/2.

Proposition 2.3. Let u be a smooth function on T^n. Then the following conditions are equivalent and will be referred to as strict quasi-convexity:
• u(x) is strictly c-convex;
• u satisfies the following inequality on T^n: ∇²u + I > 0;
• the map

(2.5) x ↦ x + ∇u(x)

is a diffeomorphism of T^n.
Moreover, the previous conditions on u are equivalent to the corresponding conditions for u^c. The corresponding map

(2.6) y ↦ x_y, where x_y denotes the unique solution of x_y + ∇u(x_y) = y,

is the inverse of the map 2.5, and the following matrix relation holds:

(I + ∇²u^c)(y) = (I + ∇²u)^{−1}(x_y).

Proof. This is well-known and can be deduced from the corresponding classical properties of the ordinary Legendre transform φ → φ* on R^n corresponding to c(x, y) := −x · y (see for example [60, Section 2.1.3]). Indeed, setting φ(x) := u(x) + |x|²/2 it follows directly from the definitions that φ*(x) = u^c(x) + |x|²/2 (compare formula 2.8 below). q.e.d.
We will also make use of the following

Lemma 2.4. Assume that u is C¹-smooth and strictly quasi-convex. Then, for any fixed y ∈ T^n, the infimum of the function x → d(x, y)²/2 + u(x) is uniquely attained at x = x_y (defined by formula 2.6). Moreover, the function x → d(x, y)²/2 is smooth on some neighborhood of x_y and its Hessian is equal to the identity there.
Proof. Given two points x̄ and ȳ in T^n we can identify them with two points in R^n in the fundamental domain [0, 1)^n for the Z^n-action, so that

d(x̄, ȳ)² = inf_{m∈Z^n} |x̄ + m − ȳ|².

We claim that, under the assumptions of the lemma, the inf above is uniquely attained at m = 0 when x̄ = x̄_ȳ. To see this we identify u with a periodic function on R^n and note that

(2.8) inf_{x∈T^n} (d(x, ȳ)²/2 + u(x)) = inf_{x∈R^n} (|x − ȳ|²/2 + u(x)).

The inf in the left hand side above is attained at x̄_ȳ and hence so is the inf in the right hand side. Now assume, to get a contradiction, that the claim above does not hold, i.e. that there exists a non-zero m ∈ Z^n such that |x̄_ȳ + m − ȳ| = |x̄_ȳ − ȳ|. This implies that the inf in the right hand side in formula 2.8 is attained both at x̄_ȳ and at x̄_ȳ + m (since u is periodic). But this contradicts the fact that the function x → |x − ȳ|²/2 + u(x) is strictly convex on R^n (by the strict quasi-convexity of u on T^n). Finally, the claim shows, since the strict inequality in the claim is preserved when x̄ is perturbed slightly, that d(x̄, ȳ)² = |x̄ − ȳ|² for all x̄ sufficiently close to x̄_ȳ. Hence, x → d(x, ȳ)²/2 is smooth there and its Hessian is constant, equal to the identity. q.e.d.

2.2.
The iteration on C(X). In this section we will consider an iteration on C(X), which can be viewed as a generalization of the Sinkhorn algorithm (see Section 2.3.1). Given data (µ, ν, c), as in Section 2.1, we first introduce the following maps:

v[u](y) := log ∫_X e^{−c(x,y)−u(x)} µ, u[v](x) := log ∫_Y e^{−c(x,y)−v(y)} ν

(abusing notation slightly we will write T_ν(u) = v[u] etc). This yields an iteration on C(X) defined by

(2.9) u_{m+1} := S(u_m),

where S is defined as the composed operator on C(X) taking u to u[v[u]]. It will be convenient to rewrite the iteration 2.9 for u_m as the following difference equation:

(2.10) u_{m+1} − u_m = log ρ_{u_m},

where ρ_u is defined by

(2.11) ρ_u(x) := ∫_Y e^{−c(x,y)−u(x)−v[u](y)} ν,

and has the property that ρ_u µ is a probability measure on X (as follows directly from the definitions).
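For discrete µ and ν the maps u ↦ v[u] and v ↦ u[v] are log-sum-exp operations, and the iteration u_{m+1} = S(u_m) becomes the Sinkhorn algorithm in the log domain. The following sketch (our reading of the maps above, with hypothetical data) verifies that the induced plan e^{−c−u−v} µ ⊗ ν has the correct marginals at convergence:

```python
import numpy as np

def lse(z, w, axis):
    """Stable log(sum_i w_i exp(z_i)) along the given axis."""
    m = z.max(axis=axis, keepdims=True)
    s = m + np.log(np.sum(w * np.exp(z - m), axis=axis, keepdims=True))
    return s.squeeze(axis)

rng = np.random.default_rng(3)
N = 20
C = rng.random((N, N))                 # cost c(x_i, y_j), hypothetical data
mu = np.full(N, 1.0 / N)
nu = np.full(N, 1.0 / N)

u = np.zeros(N)
for _ in range(300):
    v = lse(-C - u[:, None], mu[:, None], axis=0)   # v[u](y) = log ∫ e^{-c-u} dµ
    u = lse(-C - v[None, :], nu[None, :], axis=1)   # u[v](x) = log ∫ e^{-c-v} dν

v = lse(-C - u[:, None], mu[:, None], axis=0)
gamma = np.exp(-C - u[:, None] - v[None, :]) * np.outer(mu, nu)
print(np.abs(gamma.sum(axis=1) - mu).max(), np.abs(gamma.sum(axis=0) - nu).max())
```

Working in the log domain avoids the numerical underflow that affects the matrix formulation when the effective regularization parameter is small.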

2.2.1.
Existence and uniqueness of fixed points. Consider the following functional F on C(X):

(2.12) F(u) := I_µ(u) − L_ν(u), I_µ(u) := ∫_X u µ, L_ν(u) := −∫_Y v[u] ν.

Note that I_µ and L_ν are equivariant under the additive action of R and hence F is invariant.

Lemma 2.5. The following are equivalent:
• u is a critical point for the functional F on C(X);
• ρ_u = 1 a.e. with respect to µ.
Moreover, if u is a critical point, then u_* := S(u) is a fixed point for the operator S on C(X).

Proof. First observe that the differential of the functional L_ν defined in formula 2.12, at an element u ∈ C(X), is represented by the probability measure ρ_u µ, where ρ_u is defined by formula 2.11. This means that for any u̇ ∈ C(X)

dL_ν|_u(u̇) = ∫_X u̇ ρ_u µ.

This follows readily from the definitions by differentiating t ↦ v[u + t u̇] to get an integral over (X, µ) and then switching the order of integration. As a consequence, u is a critical point of the functional F on C(X) iff ρ_u µ = µ, i.e. iff ρ_u = 1 a.e. with respect to µ. Finally, if this is the case then S(u) = u a.e. wrt µ and hence S(S(u)) = S(u) (since S(f) only depends on f viewed as an element of L¹(X, µ)). q.e.d.
The following basic compactness property holds:

Lemma 2.6. Fix x_0 ∈ X. Then the subset K_{x_0} := {S(u) − S(u)(x_0) : u ∈ C(X)} of C(X) is compact.

Proof. First observe that, since X × Y is assumed compact, the continuous function c is, in fact, uniformly continuous on X × Y. Hence, it follows from the very definition of S that S(C(X)) is an equicontinuous family of continuous functions on X. By the Arzelà-Ascoli theorem the set K_{x_0} is thus compact in C(X). q.e.d.
Using the previous two lemmas gives the following

Proposition 2.7. The operator S has a fixed point u_* in C(X). Moreover, u_* is uniquely determined a.e. wrt µ up to an additive constant. More precisely, there exists a unique fixed point in S(C(X))/R.

Proof. We start by noting that F(Su) ≤ F(u) (this is shown in the first step of the proof of Theorem 2.8 below). Since F is invariant under the natural R-action we conclude that

inf_{C(X)} F = inf_{K_{x_0}} F,

where K_{x_0} denotes the compact subset of C(X) appearing in Lemma 2.6. Since F is clearly continuous on C(X), this implies the existence of a minimizer of F which is moreover in K_{x_0}. Next observe that F is convex on C(X). Indeed, for any fixed y ∈ Y, u → v[u](y) is convex on C(X), as follows directly from Jensen's inequality. Hence, −L_ν is convex, and since I_µ is affine we conclude that F is convex. More precisely, Jensen's (or Hölder's) inequality implies that F is strictly convex on C(X)/R, viewed as a subset of L¹(µ)/R. Hence, if u_0 and u_1 are two minimizers, then there exists a constant C such that u_0 = u_1 + C a.e. wrt µ. In particular, S(u_0) and S(u_1) coincide up to an additive constant, giving the same fixed point u_* in S(C(X))/R. q.e.d.

2.2.2.
Monotonicity and convergence properties of the iteration. We next establish the following result, which can be seen as a refinement, in the present setting, of the convergence of the general Iterative Proportional Fitting Procedure established in [48]. The result will be used in the proof of Proposition 6.1.
Theorem 2.8. Given u_0 ∈ C(X), the corresponding iteration u_m := S^m u_0 converges uniformly to a fixed point u_∞ of S.

Proof.
Step 1: I_µ and −L_ν are decreasing along the iteration, and hence F is also decreasing. The functionals are strictly decreasing at u_m unless S(u_*) = u_* for u_* := S(u_m).

Using the difference equation 2.10 for u_m and Jensen's inequality, we have

I_µ(u_{m+1}) − I_µ(u_m) = ∫_X log ρ_{u_m} µ ≤ log ∫_X ρ_{u_m} µ = 0.

Moreover, the inequality is strict unless ρ_{u_m} = 1 a.e. wrt µ, i.e. S(u_m) = u_m a.e. wrt µ, and then S(u_*) = u_* everywhere on X. Similarly, by symmetry, −L_ν is decreasing, the role of ρ_u µ now being played by the probability measure on Y defined as in formula 2.11, but with the roles of µ and ν interchanged.
Step 2: Convergence in C(X)/R. Given the initial data u_0 we denote by K_{u_0} the closure of the orbit of u_0 in C(X). By Lemma 2.6, K_{u_0}/R is compact in C(X)/R. Hence, after perhaps passing to a subsequence, u_m → u_∞ in C(X)/R. Now, since F is decreasing along the orbit, we have F(S(u_∞)) = F(u_∞). Hence, by the condition for strict monotonicity in the previous step, it must be that S(u_∞) = u_∞ a.e. wrt µ and hence, since u_∞ is in the image of S, it follows that S(u_∞) = u_∞ on all of X. It then follows from Proposition 2.7 that u_∞ is uniquely determined in C(X)/R (by the initial data u_0), i.e. the whole sequence converges in C(X)/R.
Step 3: Convergence in C(X). Let us first show that there exists a number λ ∈ R such that
(2.13) lim_{m→∞} I_µ(u_m) = λ.
By Step 1, I_µ is decreasing, and hence it is enough to show that I_µ(u_m) is bounded from below. But I_µ = F + L_µ, where, by Prop 2.7 (or the previous step), F is bounded from below (by F(u_∞)). Moreover, by the first step, L_µ(u_m) ≥ L_µ(u_0), which concludes the proof of 2.13. Next, decompose u_m as the sum of its normalization ũ_m, vanishing at a fixed point x_0, and the constant u_m(x_0). By Lemma 2.6 the sequence (ũ_m) is relatively compact in C(X), and we claim that |u_m(x_0)| ≤ C for some constant C. Indeed, if this is not the case, then there is a subsequence u_{m_j} such that |u_{m_j}| → ∞ uniformly on X. But this contradicts that I_µ(u_m) is uniformly bounded (by 2.13). It follows that the sequence (u_m) is also relatively compact. Hence, by the previous step, the whole sequence u_m converges to the unique minimizer u_* of F in S(C(X)) satisfying I_µ(u_*) = λ.
Remark 2.9. The convergence result in [48] is, in the present setting, equivalent to convergence in C(X)/R of the iteration induced by S. In fact, the latter convergence holds at a linear rate: there exists a metric d on C(X)/R and a positive number δ < 1 such that the induced iteration contracts d by the factor δ. Taking d to be the metric induced by the Hilbert metric on the cone of positive functions in C(X), this follows from Birkhoff's theorem about positive operators on cones, precisely as in the finite dimensional situation of the Sinkhorn iteration considered in [34]; see also [26, Thm 4.2].
Fixing an initial datum u_0 ∈ C(X), the corresponding evolution m → u_m induces a sequence of pairs (u_m, v_m) ∈ C(X) × C(Y), defined by the corresponding recursion.
2.2.3. The induced discrete evolution on P(X × Y) and entropy. Let us briefly explain the dual point of view involving the space M(X × Y) of measures on X × Y (which, however, is not needed for the proofs of the main results). The data (µ, ν, c) induces the element γ_c := e^{−c} µ ⊗ ν of M(X × Y), and to any u ∈ C(X) we may associate the probability measure γ_u := e^{−c(x,y)} e^{−u(x)} e^{−v[u](y)} µ ⊗ ν.
Lemma 2.10. If u is a fixed point of S, then γ_u ∈ Π(µ, ν).
Proof. This follows immediately from the fixed point property of u (and the definition of v[u]). q.e.d.
The discrete dynamical system u_m thus induces a sequence γ_m := γ_{u_m} in P(X × Y).
Proposition 2.11. The unique minimizer γ_* of the functional I(·|γ_c) on Π(µ, ν) is characterized by the property that it has the form γ_{u_*}, where u_* is a fixed point for S on C(X) (or more generally, on L¹(X, µ)). Moreover, given any function u_0 ∈ C(X), the corresponding sequence γ_m converges in L¹ (i.e. in variation norm) towards γ_* (and moreover I(γ_m|γ_*) → 0).
Proof. By construction, γ_* := γ_{u_*} lies in Π(µ, ν) (by Lemma 2.10) and its density wrt γ_c is the product of a function of x and a function of y. But a standard calculus argument reveals that any such γ_* is the unique minimizer of the restriction of I to the affine subspace Π(µ, ν) of P(X × Y) (using that I is strictly convex). The last convergence statement then follows directly from Theorem 2.8 (only the easier convergence in C(X)/R is needed). q.e.d.
Rewriting, the equality 2.15 can be viewed as an entropic variant of the Kantorovich duality 2.4 in the limit when c is replaced by kc, for a large positive number k. In fact, it follows from Theorem 3.3, applied to µ_k = µ and ν_k = ν, that the corresponding infima over Π(µ, ν) converge towards the infimum in the Kantorovich duality 2.4. In the next section we will consider the setting where µ and ν also change with k.
2.3. The k−parametrized setting and discretization. Let us next consider the following variant of the previous setting, parametrized by a positive number k (the parameter that will later on tend to infinity, and which corresponds to the entropic regularization parameter ε := k^{−1}). This means that we replace the triple (µ, ν, c) with a sequence (µ^{(k)}, ν^{(k)}, kc).
As explained in Section 1.1.2, replacing c with kc corresponds to introducing the entropic regularization parameter ε = k^{−1}. We then rescale the functions in C(X) and C(Y) by k and consider the corresponding rescaled operators S^{(k)}, v^{(k)}[·], etc. The corresponding rescaled iteration u^{(k)}_m is defined by iterating S^{(k)}, given the initial value u^{(k)}_0 ∈ C(X). By Theorem 2.8, u^{(k)}_m converges in C(X), as m → ∞, to a fixed point u^{(k)} of the operator S^{(k)} (uniquely determined by the initial value u^{(k)}_0). We observe that the following compactness property holds (and is proved exactly as in Lemma 2.6):
Lemma 2.12. The union ∪_{k≥0} S^{(k)}(C(X)) is relatively compact in C(X)/R (where C(X)/R is identified with the set of all continuous functions vanishing at a given point x_0).
2.3.1. Discretization and the Sinkhorn algorithm. Now assume that µ^{(k)} and ν^{(k)} are discrete probability measures whose supports are finite sets consisting of the same number N_k of points in X and Y, respectively. This means that there exist vectors p^{(k)} and q^{(k)} in R^{N_k} recording the weights of the corresponding Dirac masses. Moreover, since µ^{(k)} and ν^{(k)} are probability measures, the vectors p^{(k)} and q^{(k)} are elements of the simplex Σ_{N_k} in R^{N_k} (consisting of vectors with non-negative entries summing to one), which we identify with P({1, ..., N_k}). Similarly, restricting kc to the supports yields an N_k × N_k cost matrix kC_{ij}, where C_{ij} is viewed as a cost function on {1, ..., N_k}². Under these identifications, the scaled iteration 2.16 gets identified with the recursion a^{(k)}(m) defined by the Sinkhorn algorithm determined by the matrix K^{(k)} := e^{−kC} and the positive vectors p^{(k)} and q^{(k)} (see Section 1.1.1). Given an initial positive vector a^{(k)}(0), Theorem 2.8 thus shows that (a^{(k)}(m), b^{(k)}(m)) converges, as m → ∞, to a pair of positive vectors (a^{(k)}, b^{(k)}) such that the scaled matrix D_b K^{(k)} D_a has the property that its rows sum to p^{(k)} and its columns sum to q^{(k)}.
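In matrix form, one Sinkhorn step alternately rescales the rows and columns until the scaled matrix has the prescribed marginals. A minimal sketch (the cost matrix and marginals below are generic illustrations, not tied to a particular discretization from the paper):

```python
import numpy as np

def sinkhorn(C, p, q, k, iters=2000):
    """Scale K = exp(-k C) so that diag(a) K diag(b) has row sums p and column sums q."""
    K = np.exp(-k * C)
    a, b = np.ones_like(p), np.ones_like(q)
    for _ in range(iters):
        a = p / (K @ b)       # enforce the row marginals p
        b = q / (K.T @ a)     # enforce the column marginals q
    G = a[:, None] * K * b[None, :]   # the scaled matrix
    return a, b, G
```

The potentials of the rescaled iteration are then recovered from the scaling vectors, e.g. as u = −k^{−1} log a (up to the chosen sign and normalization conventions).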
Remark 2.13. By construction, the functions u_k(x) on X can be expressed in terms of a Fourier-type sum whose coefficients are given by the Sinkhorn algorithm. In the non-compact setting when X = Y = R^n, with c(x, y) = −x·y, this is the analytic continuation to iR^n of a bona fide Fourier sum with Fourier coefficients in k times the support of ν^{(k)}. Hence, k plays the role of the "bandwidth".
3.1. Proof of Theorem 1.1 (and generalizations). Throughout the section we will consider the parametrized setting in Section 2.3 and assume that the sequences µ^{(k)} and ν^{(k)} converge to µ and ν in P(X) and P(Y), respectively. We will denote by u^{(k)} the fixed point of the corresponding operator S^{(k)} on C(X), uniquely determined by the normalization condition u^{(k)}(x_0) = 0, at a given point x_0 in X. Similarly, v^{(k)} denotes the corresponding fixed point obtained by reversing the roles of µ and ν.
We start by giving a density condition on µ^{(k)} ensuring that v^{(k)}[u] converges uniformly to the c−Legendre transform u^c of u, when µ has full support:
Lemma 3.1. Assume that the sequence µ^{(k)}, converging to µ in P(X), has the following "density property": for any given open subset U intersecting the support X_µ of µ,
(3.1) lim_{k→∞} k^{−1} log µ^{(k)}(U) = 0.
Then, for any given u ∈ C(X), the sequence v^{(k)}[u] converges uniformly to (χ_{X_µ} u)^c in C(Y).
Proof. Replacing the integral over µ^{(k)} by a supremum directly yields one of the two inequalities, for any y ∈ Y. To prove the reversed inequality, let x_y be a point in X_µ where the sup defining (χ_{X_µ} u)^c(y) is attained, and U_δ a neighborhood of x_y where the oscillation of c(·, y) + u is bounded from above by δ (by compactness/continuity, U_δ can be taken to be independent of y).
Example 3.2. The density property is satisfied, for example, when µ^{(k)} is a weighted point cloud such that the inverses of the total masses µ^{(k)}(U), assigned to any given open set U intersecting X_µ, have sub-exponential growth in k.
Theorem 3.3. Suppose that µ^{(k)} → µ and ν^{(k)} → ν in P(X) and P(Y), respectively, and assume that µ^{(k)} and ν^{(k)} satisfy the density property 3.1. Let u^{(k)} be a fixed point for the scaled operator S^{(k)} on C(X). Then, after perhaps passing to a subsequence, the following holds: u^{(k)} → u uniformly on X, where u is a c−convex minimizer of the Kantorovich functional J (formula 2.2) satisfying u = (χ_{Y_ν}(χ_{X_µ} u)^c)^c. As a consequence, the corresponding probability measures γ^{(k)} converge weakly to a transport plan γ between µ and ν, which is optimal wrt the cost function c.

Proof. Step 1: Convergence of a subsequence of u (k)
In the following, all functions will be normalized by demanding that their values vanish at a given point. By Lemma 2.12 we may assume that u^{(k)} → u^{(∞)} uniformly on X, for some element u^{(∞)} in C(X). Now, by the previous lemma, for any given u ∈ C(X) the corresponding functionals F^{(k)}(u) converge towards J(u), with J defined by formula 2.2. Next, take a sequence ε_k of positive numbers tending to zero realizing the rate of this convergence along u^{(k)}. By definition, F^{(k)}(u^{(k)}) ≤ F^{(k)}(u) for any given u in C(X). Hence, letting k → ∞ gives that u^{(∞)} minimizes J. Combining this with the previous lemma, applied twice, gives u^{(∞)} = (χ_{Y_ν}(χ_{X_µ} u^{(∞)})^c)^c.
Step 2: Convergence of γ^{(k)} towards an optimizer. By Lemma 2.10, γ^{(k)} is in Π(µ, ν). Hence, by weak compactness, we may assume that γ^{(k)} converges towards an element γ^{(∞)} in P(X × Y). By Prop 2.1 it will thus be enough to show that γ^{(∞)} is supported in Γ_{u^{(∞)}}. To this end, let Γ_δ be the closed subset of X × Y where u + u^c ≥ δ > 0, for u := u^{(∞)}. By the previous lemma, γ^{(k)} ≤ e^{−kδ/2} µ^{(k)} ⊗ ν^{(k)} on Γ_δ when k is sufficiently large, and hence the limit γ^{(∞)} is indeed supported on Γ_{u^{(∞)}}. q.e.d.
Taking the cost function c(x, y) to be the squared distance function on a compact Riemannian manifold, we arrive at the following generalization of Theorem 1.1 stated in the introduction (using that the density property is satisfied for the discretization scheme used in Theorem 1.1, as explained in Example 3.2):
Theorem 3.4. Let X be a compact Riemannian manifold (possibly with boundary) and set c(x, y) := d(x, y)²/2, where d is the Riemannian distance function. Suppose that µ^{(k)} → µ and ν^{(k)} → ν in P(X), and assume that ν is absolutely continuous wrt the Riemannian volume form on X and has full support. If ν^{(k)} satisfies the density property 3.1, then v^{(k)} converges uniformly on Y to the normalized potential v for the unique optimal Borel map transporting ν to µ, i.e. the map that can be expressed as
(3.5) y → x_y := exp_y(∇v)
(which means that x_y is obtained by transporting y along a geodesic in the direction of (∇v)(y), of length |(∇v)(y)|). Moreover, u^{(k)} converges uniformly on X towards the d²/2−Legendre transform of v, and x → exp_x(∇u) defines the optimal transport of µ to ν.
Proof. This follows from the previous theorem combined with the results in [44]. Indeed, it is shown in [44] that when v is taken as a minimizer of the Kantorovich functional, then the map 3.5 is well-defined a.e. wrt ν and the corresponding transport plan is the unique optimal transport plan (the result in [44] is a Riemannian version of Brenier's theorem in R^n [13]; see also [21] for a direct proof in the torus setting). Since ν has full support, it follows that the minimizer v is uniquely determined modulo additive constants and, since u = v^c, so is the limit of u^{(k)}. q.e.d.
The previous theorem applies more generally as soon as a unique Borel optimal map exists (see for example [61,Thm 10.38] for conditions on c ensuring that this is the case).
3.1.1. The Euclidean setting in R^n. Theorem 3.4 applies, in particular, to the Euclidean setting when µ and ν are probability measures on R^n whose supports X and Y are bounded domains (i.e. closures of open connected subsets). This yields an analog of the recent convergence result [47, Thm 1.9], where the role of u^{(k)} is played by the solution to a convex optimization problem formulated using "almost triangulations" of X and Y (assuming that Y is convex). Moreover, the present result also applies in a non-compact Euclidean setting when X = R^n and gives the following result, formulated in terms of the Monge-Ampère measure MA (in the sense of Alexandrov) and the sub-gradient mapping ∂u [60]:
Theorem 3.5. Assume that µ and ν are probability measures on Euclidean R^n such that µ and ν have compact support and ν has convex support Y and a bounded density g. If µ^{(k)} and ν^{(k)} satisfy the density property 3.1, then the (normalized) fixed points u^{(k)} of the corresponding iteration converge uniformly on compacts to a convex function u on R^n satisfying the two conditions (i) and (ii) constituting equation 3.6 (the second boundary value problem for the Monge-Ampère operator).
Proof. In the following, v^* will denote the classical Legendre transform of v. Applying Theorem 3.3 to a ball X_R of radius R containing the support of µ and the support Y of ν gives that, after perhaps passing to a subsequence, v^{(k)} → v uniformly, where v is such that the L^∞−map ∇v has the transport property (∇v)_*ν = µ (note that v^{(k)}, and hence v, are independent of R, since they only depend on the support of µ). As a consequence, by Brenier's theorem [13] (or, more generally, by [44]), v is uniquely determined in the interior of Y and hence the whole sequence v^{(k)} converges towards v on Y. Next, observe that since Y is convex the function χ_Y v is convex on R^n, and hence u := (χ_Y v)^* is a convex function on R^n such that u^* = χ_Y v. But then it follows from basic convex analysis that conditions (i) and (ii) hold.
Moreover, since, by definition, u |X R = (χ Y v) * |X R it follows from Theorem 3.3 applied to (X R , Y ) that u (k) → u uniformly on X R . Since R can be taken arbitrarily large this concludes the proof of the theorem. q.e.d.
The equation 3.6 is usually referred to as the second boundary value problem for the Monge-Ampère operator in R^n. It admits a unique normalized convex solution u (as follows, for example, from the argument using Legendre transforms in the proof of the previous theorem). The previous theorem can be viewed as an analog of [4, Thm 5.5], where the role of u^{(k)} is played by the solution to a finite-difference type discretization of equation 3.6 (in [4] it is assumed that g is constant, in order to ensure the existence of a discrete solution).

3.1.2. Application to convolutional Wasserstein distances. Theorem 3.3 holds more generally (with essentially the same proof) when the function c is replaced by a sequence c_k satisfying the approximation property 3.7. For example, in the Riemannian setting of Theorem 3.4, denoting by K_t(x, y) the corresponding heat kernel and setting t := (2k)^{−1}, the sequence c_k defined in terms of K_t (formula 3.8) satisfies 3.7, by Varadhan's formula (which holds more generally on Lipschitz Riemannian manifolds [49]). Replacing c by c_k in this setting thus has the effect of replacing the matrix A_{ij} := e^{−kd²(x_i,x_j)/2}, appearing in the corresponding Sinkhorn algorithm, with the heat kernel matrix K_{(2k)^{−1}}(x_i, x_j), which, as emphasized in [57], has computational advantages. Following [57], we consider the squared convolutional Wasserstein distance W²_{(k)}(µ, ν) between µ and ν, defined wrt approximations µ^{(k)} and ν^{(k)}, for example given by weighted point clouds, as in Example 3.2. In [57, Page 3], the problem of developing conditions for the convergence of W²_{(k)}(µ, ν) was posed. The following result provides an answer:
Theorem 3.6. Let X be a compact Riemannian manifold (possibly with boundary) and set c(x, y) := d(x, y)²/2, where d is the Riemannian distance function. Suppose that µ^{(k)} → µ and ν^{(k)} → ν in P(X) and that µ^{(k)} and ν^{(k)} satisfy the density property 3.1. Then W²_{(k)}(µ, ν) → W²(µ, ν), as k → ∞, where W²(µ, ν) denotes the squared L²−Wasserstein distance between µ and ν.
Proof. Repeating the argument in the proof of Theorem 3.3, with c replaced by c_k as above, shows that the corresponding infima over C⁰(X) converge, as k → ∞. According to formula 2.8, the infimum appearing in the left hand side is precisely W²_{(k)}(µ, ν). Since the infimum in the right hand side is equal to W²(µ, ν), by Kantorovich duality (formula 2.4), the result follows. q.e.d.
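For intuition, the entropic approximation of the transport cost can be computed directly on weighted point clouds, and its value approaches the transport cost as k grows. A small illustrative one-dimensional sketch (the Gaussian matrix below stands in for the heat-kernel matrix; all names and parameter choices are ours):

```python
import numpy as np

def entropic_cost(x, y, k, iters=5000):
    """<G, C> for the Sinkhorn plan G between uniform measures on points x and y,
    with cost C_ij = |x_i - y_j|^2 / 2 and kernel exp(-k C)."""
    C = 0.5 * (x[:, None] - y[None, :]) ** 2
    K = np.exp(-k * C)
    p = np.full(len(x), 1.0 / len(x))
    q = np.full(len(y), 1.0 / len(y))
    a, b = np.ones_like(p), np.ones_like(q)
    for _ in range(iters):
        a = p / (K @ b)
        b = q / (K.T @ a)
    G = a[:, None] * K * b[None, :]
    return float((G * C).sum())
```

For two uniform measures on a grid and its translate by s, the limit value is s²/2 (the cost of the translation map), and the entropic value exceeds it by a term of order k^{−1}.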

3.2. Proof of Theorem 1.2. We will denote by δ_{Λ_k} the uniform discrete measure supported on the discrete torus Λ_k with edge-length 1/k, and by µ^{(k)} and ν^{(k)} the corresponding discretizations of µ = e^{−f} dx and ν = e^{−g} dy. We start with the following discrete version of the classical stationary phase approximation, proved in the appendix.
Lemma 3.7. Let α be a C⁴−smooth function on T^n with a unique minimum at x_0 which is non-degenerate, i.e. ∇²α(x_0) > 0, and let h be a C²−smooth function. Then the corresponding discrete Laplace-type sums admit a Gaussian asymptotic expansion with relative error O(k^{−1}), where the constant C in the error term only depends on upper bounds on the C⁴−norm of α, the C²−norm of h and a strict lower bound on the smallest eigenvalue of ∇²α close to x_0.
We next prove the key asymptotic result that will be used in the proof of Theorem 1.2, giving the asymptotics of the function ρ_{ku}(x), explicitly defined by formula 2.18 (the result can be viewed as a refinement of Lemma 3.1).
Proposition 3.8. Let u be a smooth and strictly quasi-convex function on T^n. Then ρ_{ku} admits an asymptotic expansion with relative error O(k^{−1}), where the constant C in the error term only depends on upper bounds on the C⁴−norm of u and the C²−norms of f and g, and a strictly positive lower bound on the matrix I + ∇²u(x).
Proof. In the proof we will denote by ε_k any sequence of functions satisfying |ε_k| ≤ Ck^{−1}. First observe that the integral defining ρ_{ku} localizes around a small neighborhood U of x = x_y (defined as in Lemma 2.3), i.e. the integration may be restricted to U, up to introducing an exponentially small error term. Indeed, the unique infimum of the function x → d(x, y)²/2 + u(x) is attained at x = x_y and is equal to u^c(y). Applying Lemma 3.7 then yields the stated asymptotics, whose leading term involves the factors exp(−f(x_y)) and det(I + ∇²u(x_y)).
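The localization argument is the classical Laplace method. The following numerical sketch (with our own illustrative choices of α and h, on the real line rather than the torus) checks the leading-order approximation and its O(k^{−1}) relative error:

```python
import numpy as np

def laplace_error(k):
    """Relative error of the Laplace approximation
    ∫ h e^{-k α} dx ≈ e^{-k α(x0)} h(x0) sqrt(2π / (k α''(x0)))
    for α = x²/2 + 0.1 x⁴ (minimum x0 = 0, α''(0) = 1) and h = 1 + x + x²."""
    x = np.linspace(-5.0, 5.0, 200001)
    dx = x[1] - x[0]
    integral = np.sum((1.0 + x + x**2) * np.exp(-k * (0.5 * x**2 + 0.1 * x**4))) * dx
    approx = np.sqrt(2.0 * np.pi / k)   # e^{-kα(0)} h(0) sqrt(2π/(k α''(0)))
    return abs(integral / approx - 1.0)
```

Doubling k roughly halves the relative error, consistent with the O(k^{−1}) expansion.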
Lemma 3.9. Let X be a compact topological space and consider the following family of difference equations on C(X), parametrized by a positive number k and a discrete time m:
(3.10) u^{(k)}_{m+1} = u^{(k)}_m + k^{−1} D^{(k)}(u^{(k)}_m),
where D^{(k)} is an operator on C(X) which descends to C(X)/R and has the property that I + k^{−1}D^{(k)} is an increasing operator (wrt the usual order relation on C(X)). Assume that there exists a subset H of C(X) and an operator D on H such that
(3.11) sup_X |D^{(k)}u − Du| ≤ C_u ε_k for any u in the class H,
where C_u is a positive constant depending on u and ε_k is a sequence of positive numbers converging towards zero. Assume that u^{(k)}_0 = u_0, where u_0 is a fixed function in H, and that there exists a family u_t ∈ H, two times differentiable wrt t, solving the evolution equation ∂u_t/∂t = D(u_t) with initial value u_0, for t ∈ [0, T]. Then there exists a constant C such that
(3.13) sup_X |u^{(k)}_m − u_{m/k}| ≤ C m k^{−1}(ε_k + k^{−1}) for all m ≤ kT.
Proof. We will write ψ_{k,m} := u_{m/k}.
Step 1: For all (k, m), ψ_{k,m+1} = (I + k^{−1}D^{(k)})ψ_{k,m} + O(k^{−1}ε_k + k^{−2}). Indeed, using the mean value theorem we can write ψ_{k,m+1} = ψ_{k,m} + k^{−1}(∂u_t/∂t)_{t=m/k} + O(k^{−2}), where the term O(k^{−2}) may be estimated in terms of the second t−derivatives of u_t. Using the evolution equation for u_t and applying formula 3.11 thus proves Step 1.
Step 2: The discrete evolution on C(X) defined by the difference equation 3.10 contracts the sup-norm.
Set C_m := sup_X |φ_m − ψ_m|, where φ_m and ψ_m satisfy the difference equation 3.10 for a fixed k. Then φ_m ≤ ψ_m + C_m and hence, since I + k^{−1}D^{(k)} is assumed to be increasing and descends to C(X)/R, φ_{m+1} ≤ (I + k^{−1}D^{(k)})(ψ_m + C_m) = ψ_{m+1} + C_m. Applying the same argument with the roles of φ and ψ interchanged concludes the proof.
Step 3: Conclusion. We will prove the estimate 3.13 by induction over m (for k fixed), the statement being trivially true for m = 0. We fix the integer k and assume, as an induction hypothesis, that 3.13 holds for m. Applying first Step 2 and then the induction hypothesis bounds the distance between u^{(k)}_{m+1} and the solution of the difference equation started at ψ_{k,m}; combining this with Step 1 proves the induction step and hence the final Step 3. q.e.d.
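The content of the lemma can be illustrated by a scalar toy model: an Euler-type iteration with a perturbed operator D^{(k)} = D + O(k^{−1}) tracks the flow of D with error of order t(ε_k + k^{−1}). A hypothetical one-dimensional instance, with D(u) = −u (so u_t = e^{−t}u_0) and ε_k = k^{−1}:

```python
import numpy as np

def track_error(k, T=1.0, u0=1.0):
    """Iterate u <- u + (1/k) D_k(u) for m = 0, ..., kT - 1, with the perturbed
    operator D_k(u) = -u + 1/k, and compare with the exact flow e^{-T} u0."""
    u = u0
    for _ in range(int(k * T)):
        u += (1.0 / k) * (-u + 1.0 / k)
    return abs(u - np.exp(-T) * u0)
```

Increasing k by a factor of ten reduces the tracking error by roughly the same factor, in line with the O(k^{−1}) bound.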
In the present setting, H will be taken to be the subspace of C⁴(T^n) consisting of all strictly quasi-convex functions u, and D the corresponding parabolic optimal transport operator. The existence and large-time convergence properties of the corresponding flow follow from the results in [50,40]:
Proposition 3.10. Let f and g be two functions in C²(T^n). Then, for initial data u_0 ∈ C²(T^n) which is strictly quasi-convex, there exists a solution u(x, t) to the corresponding parabolic PDE in C²(]0, ∞[×T^n) such that u_t ∈ C⁴(X) and u_t is strictly quasi-convex (and u(x, t) ∈ C⁴([0, ∞[×T^n) if u_0 ∈ C⁴(X)). Moreover, as t → ∞, the functions u_t converge in the C⁴−topology to a static solution u_∞ (i.e. a potential of the corresponding optimal transport problem) and there exists a positive constant A such that
(3.14) sup_{T^n} |u_t − u_∞| ≤ C e^{−t/A}
(if moreover f and g are smooth, then so is u_t and the convergence holds in the C^∞−topology).
Proof. We first observe that the existence of a solution u_t as above follows from the results in [50], as does the large-t convergence. Strictly speaking, the setting in [50] is a bit different, as the role of T^n is played by a bounded smooth domain in R^n and some boundary conditions are imposed on u_t. But the proof in the case of T^n is actually easier than the one in [50], since T^n is compact and without boundary. Indeed, the key point is the interior C²−estimate in [50, Lemma 4.6], which directly applies to the present setting. Then parabolic Krylov-Safonov theory and (Schauder) bootstrapping can be applied in the standard way. As for the exponential convergence, it follows from the Li-Yau type Harnack inequality established in [40]; see Sections 5-7 in [40] (in the case when g = 0, the usual Li-Yau Harnack inequality can be applied, precisely as in [18]). q.e.d.
Finally, Corollary 1.3 follows directly by combining Theorem 1.2 with the exponential convergence in formula 3.14. Indeed, setting m_k = k t_k with t_k := A log k, where A is the constant appearing in formula 3.14, gives the desired rate (also using the C^∞−convergence of u_t, which ensures that C is bounded from above independently of t). Setting u_k := u^{(k)}_{m_k}, this proves the estimate 1.5, which also implies the estimate 1.6 for γ_k. Indeed, by definition, γ_k := e^{−kd(x,y)²/2} e^{−ku_k(x)} e^{−kv_k(y)} µ^{(k)} ⊗ ν^{(k)}, where v_k := v[u_k]. By the estimate 1.5 and Lemma 7.1, there exists a positive number C controlling the corresponding error terms. The proof is thus concluded by invoking the following elementary inequality, which we claim holds for any smooth quasi-convex function u on T^n, for some δ > 0:
(3.15) u(x) + u^c(y) + d(x, y)²/2 ≥ (δ/2) d(y, x + ∇u(x))².
To see this, we identify u and u^c with Z^n−periodic functions on R^n and set φ(x) := u(x) + |x|²/2.
Then φ^*(y) = u^c(y) + |y|²/2, where φ^* is the classical Legendre-Fenchel transform on R^n, and it will be enough to show that
(3.16) φ(x) + φ^*(y) − x·y ≥ (δ/2)|y − ∇φ(x)|².
Indeed, the claimed inequality 3.15 follows from the latter one after replacing x with x + m and taking the infimum over all m ∈ Z^n. Now, by assumption, ∇²φ ≤ δ^{−1} I, and hence ∇²φ^* ≥ δ I (by 2.7). As a consequence, φ^*(y) ≥ φ^*(y − t) + t · ∇φ^*(y − t) + δ|t|²/2 for any t ∈ R^n. Setting t := y − ∇φ(x), and using that φ^*(∇φ(x)) = ∇φ(x) · x − φ(x) and (∇φ^*)(∇φ(x)) = x, this implies the desired inequality 3.16.

Convergence towards parabolic optimal transport equations on compact manifolds
Let X and Y be compact smooth manifolds (without boundary) and c a lsc function on X × Y, taking values in ] − ∞, ∞] which is smooth on the complement of a closed proper subset, denoted by sing (c).
We will denote by ∂_x the vector of partial derivatives defined wrt a choice of local coordinates around a fixed point x ∈ X. Given two normalized volume forms µ and ν in P(X) and P(Y), we locally express µ = e^{−f} dx, ν = e^{−g} dy in terms of the local volume forms dx and dy determined by a choice of local coordinates. Following standard practice, we will assume that the cost function satisfies the following assumptions:
• (A1) ("Twist condition") The map y → ∂_x c(x, y) is injective for any (x, y) ∈ X × Y − sing(c).
• (A2) ("Non-degeneracy") det(∂_{x_i} ∂_{y_j} c)(x, y) ≠ 0 for any (x, y) ∈ X × Y − sing(c).
Remark 4.1. See [61, pages 246, 313] for an in-depth discussion of various assumptions on cost functions. In [61], A1+A2 is called the strong twist condition and, as pointed out in [61, Remark 12.23], it holds for the cost function derived from any well-behaved Lagrangian (including the Riemannian setting where c = d²/2).

Definition 4.2. The space H(X) (or H for short) of all smooth potentials on X is defined as the subspace of C^∞(X) consisting of all c−convex (i.e. such that (u^c)^c = u) smooth functions u on X such that the subset Γ_u of X × Y defined by formula 2.1 is the graph of a diffeomorphism, denoted by F_u, and c is smooth in a neighborhood of Γ_u.
The definition has been made so that, if u ∈ H and (F u ) * µ = ν, then F u is an optimal map (diffeomorphism) wrt the cost function c (by Prop 2.1). Accordingly, we will call u the potential of the map F u .
We note that it follows immediately from assumption A1 that, for a given x ∈ X expressed in terms of a choice of local coordinates, the map F_u is determined by ∂_x u through the relation 4.1, and the corresponding evolution equation 4.2 may be written in terms of det(∂_x F_{u_t}), where det(∂_x F_u) denotes the local Jacobian of the map x → F_u(x). We note that:
• The right hand side in the equation 4.2 is globally well-defined (i.e. independent of the choice of local coordinates around (x, F_u(x)) in X × Y). Indeed, it is equal to the logarithm of the quotient of (F_u)_*µ and ν. Accordingly, u is a stationary solution iff it is the potential of an optimal transport map.
• Differentiating the equation 4.1 yields an expression for ∂_x F_u in terms of ∇²u and the mixed second derivatives of c.

4.2. Sequences of measures satisfying a local density property and Quasi-Monte Carlo methods. We next introduce the following stronger local form of the "global" density property appearing in Lemma 3.1.
Definition 4.4. Given a positive integer s, a sequence of probability measures µ^{(k)} on X, converging weakly towards a measure µ, is said to have the local density property (at the length scale k^{−1/2}) of order s if, for any fixed x_0 ∈ X, there exist local coordinates ξ := (ξ_1, ..., ξ_n), centered at x_0, such that, for any sequence h_k defined on the polydisc D_k of radius log k, centered at 0, and satisfying |∂^α h_k(x)| ≤ C e^{−|x|²/C} on D_k for all |α| ≤ s, the integrals of h_k ∘ F^{(k)}_{x_0} against µ^{(k)} approximate the corresponding Lebesgue integrals of h_k to the appropriate order. Here F^{(k)}_{x_0} is the scaled coordinate map from a neighborhood of x_0 in X into R^n, defined by F^{(k)}_{x_0}(x) := k^{1/2} ξ(x), and dλ denotes Lebesgue measure.
Example 4.5. The model case for the local density property introduced above (of order two) is the case when X is the torus and µ^{(k)} is obtained from a volume form µ using the standard discretization scheme described in Section 1.1.3, based on a regular grid. This means that µ^{(k)} = f δ_{Λ_k}/Z_k, where Z_k is the normalization constant and Λ_k is the discrete torus.
In practice, for a general manifold X, it is convenient to fix a "discretization of X", that is, a discrete "reference sequence" λ_k converging towards a "reference volume form" λ ∈ P(X) and satisfying the local density property (of some order). Then any volume form µ can be "discretized" by writing µ = ρλ and setting µ^{(k)} := ρ λ_k / Z_k, where Z_k is the normalizing constant; the sequence µ^{(k)} then automatically has the local density property (and similarly for Y and ν). Such reference sequences λ_k can, for example, be constructed using quasi-Monte Carlo methods in numerical analysis, as we next explain.
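The reweighting step just described can be sketched as follows (the grid and the density ρ in the usage below are illustrative choices of ours):

```python
import numpy as np

def discretize(points, ref_weights, rho):
    """Discretize µ = ρ·λ by reweighting a reference point cloud for λ
    and normalizing, so that the result is again a probability measure."""
    w = ref_weights * rho(points)
    return w / w.sum()
```

For instance, with λ_k the uniform grid on R/Z and ρ(x) = 1 + cos(2πx)/2, the reweighted measure reproduces the moments of µ up to a discretization error.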

4.2.1. Quasi-Monte Carlo systems. Let (X, g) be an n−dimensional compact Riemannian manifold and denote by dV the corresponding normalized volume form on X. Following [9] (and [14] in the case of a sphere), the worst case error of integration of points {x_i}_{i=1}^N ⋐ X and weights {w_i}_{i=1}^N ⋐ R_+ (assumed to sum to one), with respect to some Banach space W of continuous functions on X, is defined as the supremum of the integration errors |∫_X f dV − Σ_{i=1}^N w_i f(x_i)| over the unit-ball of W. Let now W := W^s_p(X) be the Sobolev space of all functions f on X such that all (fractional) distributional derivatives of order s are in L^p(X), and assume that s > n/p (which ensures that W^s_p(X) ⊂ C(X)). Suppose now that a sequence of N_k weighted points, with N_k ∼ k^n, is a quasi-Monte Carlo system for W^s_p(X) (where s > n/p), i.e. the corresponding worst case errors are O(N_k^{−s/n}). Then the corresponding sequence of probability measures λ_k := Σ_i w_i δ_{x_i} ∈ P(X) has the local density property of order s − n(1 − 1/p).
Proof. Given a sequence h_k as in Definition 4.4, we may assume that f_k := (F^{(k)})^* h_k extends to define a sequence of functions in C^s(X) (supported in a fixed neighborhood of x_0). By the assumed quasi-Monte Carlo property, the integration error of f_k is bounded by a constant times N_k^{−s/n} times its W^s_p−norm. Multiplying both sides with k^{n/2} and using the chain rule thus gives the estimate in Definition 4.4. By the assumption on h_k, the integral in the right hand side above is uniformly bounded from above, which concludes the proof. q.e.d.
In view of the applications to the present setting, we introduce the following (non-standard) definition: a sequence of weighted points as above is said to be a good quasi-Monte Carlo system if N_k ∼ k^n and there exists (s, p) such that s − n(1 − 1/p) ≥ 2. In particular, the corresponding sequence of measures λ_k then has the local density property of order 2.
The existence of good quasi-Monte Carlo systems follows from [10, Cor 2.13] on any compact Riemannian manifold (X, g) (in fact, [10, Cor 2.13] shows that, for any given p, the quasi-Monte Carlo property holds for any s). The corresponding weighted points are taken as "weighted cubature points" for the space H_k(X) ⊂ C^∞(X) spanned by all eigenfunctions of the Laplacian with eigenvalue bounded from above by k (i.e. the corresponding integration error vanishes for any f ∈ H_k(X)). In the particular case of the standard n−dimensional sphere, it follows from [12] that all the weights can be taken to be equal to 1/N_k, i.e. the points can be taken to be spherical k−designs. Such points have been generated for large values of k [64].
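As a toy analog of such cubature points, equal-weight grid points on the circle integrate all Fourier modes of degree below N exactly, so for smooth integrands their integration error decays far faster than that of random (Monte Carlo) points. A one-dimensional, purely illustrative sketch:

```python
import numpy as np

def quadrature_error(points, weights, f, reference):
    """Integration error of a weighted point set for a single test function f."""
    return abs(np.sum(weights * f(points)) - reference)

# Smooth test integrand on R/Z; a fine grid value serves as the reference
# integral (exact to machine precision for this analytic integrand).
f = lambda x: np.exp(np.cos(2.0 * np.pi * x))
reference = np.mean(f(np.arange(4096) / 4096))
```

Comparing N equally weighted grid points with N uniform random points shows the "design-like" behavior of the grid: its error sits at machine-precision level, while the Monte Carlo error decays only like N^{−1/2}.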

4.3. Convergence of the Sinkhorn algorithm towards parabolic optimal transport.
We are now ready for the following generalization of Theorem 1.2 stated in the introduction (as in the torus setting in Section 1.1.2 the entropic regularization parameter is expressed as ǫ = k −1 , where k is a positive number).
Theorem 4.8. Let c be a function satisfying the assumptions A1 and A2, and let µ^{(k)} and ν^{(k)} be two sequences converging towards µ and ν in P(X) and P(Y), respectively, satisfying the local density property of order s, for some s ≥ 2. Given u_0 ∈ H, assume that there exists a solution u_t in C²(X × [0, T]) of the parabolic PDE 4.2 with initial condition u_0 and such that u_t ∈ H for any t ∈ [0, T]. Denote by u^{(k)}_m the iteration 2.16 defined by the data (µ^{(k)}, ν^{(k)}, c) and such that u^{(k)}_0 = u_0 for any given k. Then there exists a constant C such that sup_X |u^{(k)}_m − u_{m/k}| ≤ C m k^{−2} for any non-negative integer m ≤ kT.
Proof. The assumptions have been made precisely to ensure that the proof of Theorem 1.2 can be generalized, almost verbatim. Hence, we will be rather brief.
Step 1: Let α be a C⁴−smooth function on X with a unique minimum at x_0 which is non-degenerate, i.e., in local coordinates, ∂²_x α(x_0) > 0, and let h be a C²−smooth function. Then, in local coordinates centered at x_0, the discrete Laplace-type asymptotics of Lemma 3.7 hold. Using the local density assumption as a replacement of Lemma 7.1, this is shown essentially as before.
Step 2: If u ∈ H, then the analog of the asymptotics in Proposition 3.8 holds. To prove this, first observe that if u ∈ H(X), then u^c ∈ H(Y). Indeed, by assumption there is a unique x such that y = F_u(x), and u^c(y) may be expressed in terms of c(x, y) and u(x) at this point. Since c is assumed to be smooth in a neighborhood of Γ_u, the right hand side defines a smooth function of x, and since F_u is a diffeomorphism it follows that u^c(y) is smooth. Moreover, by symmetry, Γ_{u^c} = Γ_u, which can be identified with the graph of the diffeomorphism F_u^{−1}. This shows that u^c ∈ H(Y), with F_{u^c} = F_u^{−1} (relation 4.4). Setting y_x := F_u(x) and x_y := F_{u^c}(y), we can now apply the previous step, essentially as before. Finally, differentiating the relation 4.4 reveals that det((∂_y F_{u^c})(y_x)) = det((∂_x F_u)(x))^{−1}, and hence using equation 4.3 and symmetry (which ensures that the denominator appearing in equation 4.3 coincides with the one obtained when u is replaced by u^c) concludes the proof of Step 2.
Step 3: Conclusion of proof The proof is concluded, as before, by invoking Lemma 3.9. q.e.d.
It follows from standard short-time existence results for parabolic PDEs that the existence of a solution u_t, as in the previous theorem, holds for some T > 0. Moreover, by [40], long-time existence, i.e. T = ∞, holds under the following further assumptions on c:
• (A3) ("Stay-away property") For any 0 < λ_1, λ_2 there exists ε > 0, only depending on λ_1, λ_2, such that λ_1 ≤ |det ∂_x F_u| ≤ λ_2 =⇒ dist(Γ_u, sing(c)) ≥ ε for any u ∈ H.
• (A4) ("Semi-concavity") c is locally semi-concave, i.e. the sum of a concave and a smooth function on the domain where it is finite.
Theorem 4.9. Assume that c satisfies the assumptions A1-A4. Then, for any initial datum u_0 ∈ H, there exists a solution u_t of the parabolic PDE 4.2 with initial condition u_0 and such that u_t ∈ H for any t > 0. Moreover, u_t converges exponentially (in any C^p(X)), as t → ∞, to a potential u ∈ H of a diffeomorphism F_u transporting µ to ν, which is optimal wrt the cost function c.
Proof. Let us explain how to translate the result in [40] to the present setting. Following [40], a function u ∈ C²(X) is said to be locally strictly c−convex if, in local coordinates, the corresponding matrix is positive definite. This condition is independent of the choice of local coordinates; indeed, it equivalently means that any given x_0 ∈ X is a non-degenerate local minimum for the function
(4.5) x → c(x, F_u(x_0)) + u(x) on X.
It follows that for any such u the corresponding map F_u is a local diffeomorphism. The main result in [40] says that, under the assumptions on c in the statement above, for any initial datum u_0 ∈ C²(X) which is locally strictly c−convex, there exists a solution u(x, t) in C^∞(X×]0, ∞[) which is also locally strictly c−convex. To make the connection to the present setting, first note that if u ∈ H, then any given x_0 is even an absolute minimum for the function 4.5, which is moreover non-degenerate (since F_u is a diffeomorphism), and hence u is locally strictly c−convex. Conversely, if u is locally strictly c−convex, then [40, Cor 7.1] says that u ∈ C²(X) is c−convex (i.e. (u^c)^c = u), and the proof given in [40, Cor 7.1] moreover shows that F_u is a global diffeomorphism. Hence, u ∈ C^∞(X) is locally strictly c−convex iff u ∈ H, which concludes the proof of the theorem. q.e.d.
Remark 4.10. Under the assumptions in the previous theorem it follows, in particular, that the optimal transport map is smooth. Conversely, the assumptions are "almost necessary" for regularity of the optimal transport map (see [61, Chapter 12] and references therein). Also note that the semi-concavity assumption is always satisfied in the case when X = Y is a compact Riemannian manifold and c = d^2/2 [61, (b), Page 278].
Combining the exponential large-time convergence of u_t in the previous theorem with Theorem 4.8 gives, just as in the torus setting, the following corollary. Its assumptions are satisfied when X = Y is the n-sphere and c(x, y) = d^2(x, y)/2 for the standard round metric, or c(x, y) = −log |x − y|, where |x − y| denotes the chordal distance (see [40] and references therein). The latter case appears in the reflector antenna problem, as explained in Section 5.3.3.

5. Nearly linear complexity on the torus and the sphere
In this section we start by showing that the convergence results in Section 4 hold in a more general setting, where the kernel K^(k)(x, y) := e^{−kc(x,y)} is replaced with an appropriate approximate kernel. This extra flexibility is then applied in the setting of optimal transport on the two-sphere, using "band-limited" heat kernels, where it leads to a nearly linear algorithmic cost for the corresponding Sinkhorn iterations. This implies, in particular, that if Theorem 3.3 holds for a given kernel K^(k), then it also holds for any other kernel K̃^(k) which has error O(k^{−1}) as an approximation relative to K^(k) (property 5.3), or which has absolute error e^{−Ck}, for C sufficiently large (property 5.4).
Heat kernel approximations in the Riemannian setting. Consider now the Riemannian setting, where X is a compact Riemannian manifold (without boundary), c = d^2/2 and c_k is defined in terms of the heat kernel (formula 3.8):

Theorem 5.1. Let X be a compact Riemannian manifold (without boundary) and set c(x, y) := d(x, y)^2/2, where d is the Riemannian distance function. Then the results in Theorem 4.8 and Corollary 4.11 still hold when the matrix kernel e^{−kd^2(x,y)/2} is replaced with the heat kernel K_{2k^{−1}}(x, y) (at time t = 2k^{−1}).

Proof. As discussed above, this follows from the following heat kernel asymptotics (which are a special case of [5, Thm 3.1] and more generally hold for the heat kernel associated to a suitable hypoelliptic operator). Assume that x and y are contained in a compact subset of the complement of the cut-locus. Then
K_t(x, y) = t^{−n/2} e^{−d(x,y)^2/2t} (h_0(x, y) + t r_1(t, x, y)),
where h_0 is smooth and h_0 > 0, and r_1 is smooth and uniformly bounded on ]0, t_0] × X. This is not exactly of the form 5.2, due to the presence of the factor t^{−n/2} := A_k. But it is, in fact, enough to know that 5.2 holds when the right hand side is multiplied with a sequence A_k, only depending on k. Indeed, the iteration u^(k)_m is unaltered when the cost function c_k(x, y) is replaced by c_k(x, y) + C_k for some constant C_k (which is consistent, as it must be, with the fact that the parabolic equation 4.2 is unaltered when a constant is added to c). q.e.d.
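On the circle the relation between the heat kernel and a Gaussian kernel can be checked in closed form: by Poisson summation, the spectral expansion of the kernel of e^{t∆} equals a sum of Gaussians over the 2πZ-translates. A minimal numerical sketch (the normalization e^{t∆} is an assumption here; the paper's convention may differ by a constant rescaling of t):

```python
import numpy as np

def heat_kernel_spectral(x, t, M=200):
    # K_t(x, 0) on the circle R/2piZ via the eigenfunction expansion
    m = np.arange(-M, M + 1)
    return np.sum(np.exp(-t * m**2) * np.cos(m * x)) / (2 * np.pi)

def heat_kernel_gaussian(x, t, J=50):
    # the same kernel via Poisson summation: Gaussians over 2*pi*j shifts
    j = np.arange(-J, J + 1)
    return np.sum(np.exp(-(x + 2 * np.pi * j)**2 / (4 * t))) / np.sqrt(4 * np.pi * t)

t, x = 0.05, 0.7
print(heat_kernel_spectral(x, t), heat_kernel_gaussian(x, t))
```

The two evaluations agree to machine precision; for small t the j = 0 term dominates, which is the Gaussian approximation underlying the proof above.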
The use of the heat kernel in the Sinkhorn algorithm for optimal transport on Riemannian manifolds was advocated in [57], where it was found numerically that discretized heat kernels provide substantial speedups when compared to other methods. The previous theorem offers a theoretical basis for the experimental findings in [57], as long as the discretized heat kernels K̃^(k) satisfy one of the approximation properties 5.3 and 5.4 (when compared with the corresponding bona fide heat kernel). However, the author is not aware of any general such approximation results in the discretized setting (but see [22] and references therein for various numerical approaches to discretizations of heat kernels). We will instead follow a different route, based on "band-limited" heat kernels and fast Fourier type transforms, applied to the case when X is the two-sphere.
Remark 5.2. In the case of a general Riemannian manifold X one might hope that the use of band-limited kernels and fast Fourier type transforms can be replaced by fast multipole methods [1].

5.3. Near-linear complexity using fast transforms. Each iteration in the Sinkhorn algorithm amounts to computing two vector-matrix products of the form
(5.5) a(x_i) = Σ_{j=1}^N K(x_i, y_j) b(y_j), i = 1, ..., N,
for a given function K on X × Y (followed by N inversions), where b and a denote a generic "input vector" and "output vector", respectively. In general, this requires O(N^2) arithmetic operations. But, as we will next exploit, in the presence of suitable symmetry, fast summation techniques can be used to lower the complexity to nearly linear, i.e. to at most CN(log N)^p operations (for some positive constants C and p).
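In concrete terms, one Sinkhorn iteration is the pair of products 5.5 followed by entrywise divisions. A minimal dense O(N^2) reference implementation (the grid, cost and marginals below are illustrative, not those of the paper):

```python
import numpy as np

def sinkhorn_iterations(K, p, q, num_iters):
    """Plain Sinkhorn scaling for a positive N x N kernel matrix K: each
    iteration costs two dense matrix-vector products (O(N^2)) plus N inversions."""
    b = np.ones(K.shape[0])
    for _ in range(num_iters):
        a = p / (K @ b)          # enforce the first marginal
        b = q / (K.T @ a)        # enforce the second marginal
    return a, b

# toy example: k = 20, cost |x - y|^2 / 2 on a grid of the unit interval
N, k = 64, 20.0
x = np.linspace(0.0, 1.0, N)
K = np.exp(-k * (x[:, None] - x[None, :])**2 / 2)
p = np.full(N, 1.0 / N)
q_raw = 1.0 + 0.5 * np.cos(2 * np.pi * x); q = q_raw / q_raw.sum()
a, b = sinkhorn_iterations(K, p, q, 500)
P = a[:, None] * K * b[None, :]          # entropic transport plan
print(np.abs(P.sum(axis=1) - p).max())   # first marginal error
```

The fast-summation techniques discussed next replace the two dense products `K @ b` and `K.T @ a` without changing the rest of the loop.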

5.3.1. Optimal transport on the flat torus. Let us first come back to the case of the flat torus T^n, discretized by the discrete torus Λ_k, considered in Section 1.2. Since K(x, y) := e^{−kd^2(x,y)} is invariant under the diagonal action of the torus T^n, it follows from standard arguments that the sums 5.5 can be computed in O(N log N) arithmetic operations. Indeed, using the group structure on T^n we can write K(x, y) = h(x − y), for some function h on Λ_k. Then the classical convolution theorem of Fourier analysis on the discrete torus Λ_k (viewed as a finite abelian group) gives
Kb = h ⋆ b = F^{−1}(F(h) · F(b)).
This requires evaluating two Discrete Fourier Transforms (DFT) at N = k^n points. Using the Fast Fourier Transform (FFT) this can be done in O(N log N) arithmetic operations. Note that since the heat kernel is also torus invariant, the same argument can be used for the kernel appearing in Theorem 5.1, in the torus case.
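On the discrete torus the product Kb is a cyclic convolution, so the FFT route can be checked directly in one dimension (grid size and kernel are illustrative):

```python
import numpy as np

k = 256                      # grid size: N = k points on the discrete 1-torus
x = np.arange(k) / k
d = np.minimum(x, 1.0 - x)   # distance to 0 on the torus
h = np.exp(-k * d**2)        # kernel row: K(x, y) = h(x - y)
b = np.random.default_rng(0).random(k)

# dense O(N^2) product vs FFT convolution (convolution theorem on Z/kZ)
dense = np.array([np.sum(h[(i - np.arange(k)) % k] * b) for i in range(k)])
fast = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(b)))
print(np.abs(dense - fast).max())
```

The FFT route costs O(N log N) per product versus O(N^2) for the dense loop, matching the complexity claim above.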

5.3.2. Optimal transport on the round two-sphere. Consider the round two-sphere S^2, embedded as the unit-sphere in R^3. Removing the north and south poles of S^2, we have the standard spherical (longitude-colatitude) coordinates (ϕ, θ) ∈ [0, 2π[ × ]−π, π[. A complete set of (non-normalized) eigenfunctions for the Laplacian on L^2(S^2) is given by the spherical harmonics Y_l^m(ϕ, θ) := e^{imϕ} P_l^m(cos θ), |m| ≤ l, which have eigenvalue λ^2_{l,m} := l(l + 1). Here P_l^m(x) denotes, as usual, the Legendre function of degree l and order m (aka the associated Legendre polynomial); see, for example, [30].
Given a positive number W (the "band-width"), we consider the band-limited heat kernel on the two-sphere:
K_t(x, y)^W := Σ_{l ≤ W} Σ_{|m| ≤ l} e^{−t l(l+1)} Y_l^m(x) \overline{Y_l^m(y)} / ‖Y_l^m‖^2.
(By the spectral theorem this means that K_t(x, y)^W is the integral kernel of e^{−t∆} Π_W, where Π_W is the orthogonal projection onto the space of all band-limited functions.)
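By the Legendre addition theorem, the band-limited heat kernel only depends on x · y, so it can be evaluated as a short Legendre series; a sketch (orthonormalized harmonics are assumed, so the (l, m)-sum collapses to (2l + 1)/(4π) P_l(x · y); the values of t and W are illustrative):

```python
import numpy as np
from numpy.polynomial.legendre import legval

def band_limited_heat_kernel(dot_xy, t, W):
    """K_t(x, y)^W on S^2, written via the addition theorem as a Legendre
    series in s = x.y: sum_{l <= W} e^{-t l(l+1)} (2l+1)/(4 pi) P_l(s)."""
    l = np.arange(W + 1)
    coeffs = np.exp(-t * l * (l + 1)) * (2 * l + 1) / (4 * np.pi)
    return legval(dot_xy, coeffs)

# the truncation error decays rapidly once W exceeds roughly 1/sqrt(t)
t = 0.05
for W in (10, 20, 40, 80):
    print(W, band_limited_heat_kernel(1.0, t, W))
```

The fast decay in W visible here is the quantitative content of the estimate in Step 1 of the proof of Theorem 5.3 below.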
Theorem 5.3. Consider the two-sphere, discretized by a given good Quasi-Monte Carlo (QMC) system, and take R > 1. Then the analogs of all the results in Section 1.2 are valid when the matrix kernel e^{−kd^2(x,y)/2} is replaced by the band-limited heat kernel K_{2k^{−1}}(x, y)^{Rk}. Moreover, the arithmetic complexity of each Sinkhorn iteration is O(N^{3/2}).
Proof. As recalled in Example 4.12, the cost function d(x, y)^2/2 on the sphere satisfies the assumptions in Theorem 4.8 (with T = ∞) and Corollary 4.11.
Step 1: The asymptotics 5.1 and 5.2 are satisfied. By Theorem 5.1 it is enough to observe that the following basic estimate holds if t = 2k^{−1} and W = Rk:
|K_t(x, y) − K_t(x, y)^W| ≤ C_δ e^{−2R^2 k(1−δ)}
for any given δ ∈ ]0, 1[. To prove the estimate, note that the tail of the spectral expansion can be bounded using that the quotient involving Y_l^m is dominated by Cl (and that a given l corresponds to 2l + 1 values of m). Indeed, this is a special case of the universal L^2-estimates for an eigenfunction Ψ_λ of the Laplacian (with eigenvalue λ^2) on a general n-dimensional Riemannian manifold [56], which give the growth factor Cλ^{n−1}. Finally, dominating the resulting Riemann sum by the integral of the function e^{−ks^2} s^2 over [R, ∞[ concludes the proof.
Step 3: Complexity analysis. Using formula 5.6 gives
a_i = Σ_{l ≤ W} Σ_{|m| ≤ l} c_{l,m} \hat b_{l,m} Y_l^m(x_i),
where \hat b_{l,m} is the "forward discrete spherical Fourier transform" evaluated at (l, m). Once it has been computed for all (l, m), a_i becomes an "inverse discrete spherical Fourier transform" (with coefficients c_{l,m} \hat b_{l,m}). By separation of variables, both these transforms can be computed using a total of O(k^3) (= O(N^{3/2})) arithmetic operations (see the discussion after formula 1.9 in [51]). q.e.d.
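The separation of variables can be made concrete on a tensor grid: an FFT over ϕ extracts each order m in O(k^2 log k) operations, after which each coefficient costs an O(k) quadrature in θ, hence O(k^3) in total. A self-contained sketch (the tensor grid, the Gauss quadrature in cos θ and the single-harmonic test input are illustrative choices, not the QMC system of the theorem):

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def Ylm(m, l, phi, theta):
    """Orthonormal spherical harmonic Y_l^m (m >= 0 suffices for this demo)."""
    N = np.sqrt((2 * l + 1) / (4 * np.pi) * factorial(l - m) / factorial(l + m))
    return N * lpmv(m, l, np.cos(theta)) * np.exp(1j * m * phi)

L = 8                                              # band limit (illustrative)
n_phi = 4 * L
s, w = np.polynomial.legendre.leggauss(2 * L + 2)  # Gauss nodes/weights in cos(theta)
theta, phi = np.arccos(s), 2 * np.pi * np.arange(n_phi) / n_phi

l0, m0 = 5, 2                                      # sample a single harmonic as input
f = Ylm(m0, l0, phi[None, :], theta[:, None])      # values on the tensor grid

# step 1 (O(k^2 log k)): FFT over phi extracts each Fourier order m per ring
g = np.fft.fft(f, axis=1) / n_phi
# step 2 (O(k) per pair (l, m), hence O(k^3) total): quadrature in theta
coef = {}
for l in range(L + 1):
    for m in range(l + 1):                         # negative m are handled analogously
        ylm_theta = Ylm(m, l, 0.0, theta)          # theta-profile of Y_l^m
        coef[(l, m)] = 2 * np.pi * np.sum(w * np.conj(ylm_theta) * g[:, m])
print(abs(coef[(l0, m0)]), abs(coef[(3, 1)]))
```

Since the input is the single harmonic Y_5^2, the transform recovers coefficient 1 at (5, 2) and (numerically) 0 elsewhere, confirming that the quadrature is exact for band-limited functions.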
In the special case when the good Quasi-Monte Carlo system is defined by an "equi-angular" grid of O(k^2) points on S^2 (in terms of the spherical coordinates (ϕ, θ)), the arithmetic complexity of each iteration can be reduced to O(N(log N)^2) operations, using a fast discrete spherical Fourier transform [36]. Such points indeed determine a good Quasi-Monte Carlo system with explicit weights, as follows from the fact that they define cubature formulas for band-limited functions [30, Thm 3].

5.3.3. Application to the reflector antenna problem. The extensively studied far field reflector antenna problem appears when X = Y = S^n is the n-dimensional sphere, embedded as the unit-sphere in R^{n+1}, and the cost function is taken as c(x, y) := −log |x − y| [62, 35]. Briefly, the problem is to design a perfectly reflecting surface Σ in R^{n+1} with the following property: when Σ is illuminated with light emitted from the origin with intensity µ ∈ P(S^n), the output reflected intensity becomes ν ∈ P(S^n) (of course, n = 2 in the usual applications). Representing Σ as a radial graph over S^n, Σ = {h(x)x : x ∈ S^n}, for a positive function h on S^n, it follows from the reflection law and conservation of energy that h satisfies a Monge-Ampère type equation (5.7), expressed in terms of the covariant derivatives ∇_i in local orthonormal coordinates, where F_h(x) denotes the reflected direction of the ray emitted in the direction x (and µ and ν are represented as in 1.1). The equation is also supplemented with the "second boundary value condition" that F_h maps the support of µ onto the support of ν. Assuming that f and g are smooth, there exists a smooth solution h, which is unique up to scaling (see [17] and references therein).
Theorem 5.4. Consider the two-sphere, discretized by a given good Quasi-Monte Carlo (QMC) system. Let K^(k) be the N_k × N_k matrix obtained by evaluating the kernel K^(k)(x, y) := e^{−kc(x,y)} at the points of the QMC system, and consider the Sinkhorn algorithm associated to (p^(k), q^(k), K^(k)). Then the function h_k on S^n defined by the kth root of a_k := a^(k)(m_k), after m_k = Ck log k Sinkhorn iterations, converges uniformly, as k → ∞, towards a solution h of the antenna equation 5.7 satisfying the corresponding second boundary value condition, with an explicit error bound. Moreover, the arithmetic complexity of each iteration is O(N^{3/2}) in general, and O(N(log N)^2) in the case of an equi-angular grid.
Proof. As recalled in Example 4.12, the cost function −log d(x, y)^2 on the sphere satisfies the assumptions in Corollary 4.11. For the complexity analysis we first recall the general fact that any kernel K^(k)(x, y) which is radial, i.e. only depends on |x − y|, may be expressed as
(5.8) K^(k)(x, y) = Σ_l Σ_{|m| ≤ l} C_{l,m} Y_l^m(x) \overline{Y_l^m(y)}
for some constants C_{l,m} (a proof will be given below). By the argument in the proof of Step 2 in Theorem 5.3, it will be enough to show that C_{l,m} = 0 when l > k, i.e. that K^(k)(x, y) is already band-limited with W = k. To this end we follow the general approach in [39]. First observe that when x and y are in S^2 we can write |x − y|^2 = 2(1 − x · y). Hence, K^(k)(x, y) is a polynomial of degree at most k in the variable s := x · y ∈ [−1, 1]. The Legendre polynomials P_l (= P_l^0) form a basis in the space of all polynomials of degree at most k (which is orthogonal wrt Lebesgue measure on [−1, 1]), and hence we can decompose K^(k)(x, y) = Σ_{l ≤ k} c_l P_l(x · y). Formula 5.8 now follows from the classical spherical harmonic (Legendre) addition theorem:
P_l(x · y) = (4π/(2l + 1)) Σ_{|m| ≤ l} Y_l^m(x) \overline{Y_l^m(y)} / ‖Y_l^m‖^2.
q.e.d.
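The band-limitedness used in the proof can be verified directly: |x − y|^{2k} = (2 − 2s)^k is a polynomial of degree k in s = x · y, so its Legendre expansion stops at degree k. A short check (the value of k is illustrative):

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial import legendre as Lg

# |x - y|^2 = 2(1 - s) with s = x.y on the unit sphere, so |x - y|^(2k) is
# the polynomial (2 - 2s)^k; its Legendre expansion therefore has degree k,
# i.e. the kernel is band-limited with W = k.
k = 6
power_coeffs = P.polypow([2.0, -2.0], k)   # (2 - 2s)^k in the monomial basis
leg_coeffs = Lg.poly2leg(power_coeffs)     # the same polynomial in the Legendre basis

s = np.linspace(-1, 1, 7)
print(len(leg_coeffs) - 1)                 # degree = k
print(np.abs(Lg.legval(s, leg_coeffs) - (2 - 2 * s)**k).max())
```

Combined with the addition theorem, the degree-k Legendre expansion is exactly the finite sum 5.8.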
6. Outlook

6.1. Generalized parabolic optimal transport and singularity formation. Consider the setting in Section 4, with a cost function c satisfying the assumptions A1 and A2, but assume for simplicity that c is globally continuous (for example, c = d^2/2 in the Riemannian setting).
Recall that, given initial data u_0 ∈ H and volume forms µ and ν, the parabolic equation 4.2 admits a smooth solution u_t on some maximal time-interval [0, T[, and the corresponding maps F_{u_t} give an evolution of diffeomorphisms of X. It does not seem to be known whether T = ∞ in general, i.e. it could be that there are no solutions in C^∞(X × ]0, ∞[), in general. Still, using the corresponding iteration u^(k)_m (say, defined with respect to µ_k = µ and ν_k = ν), a generalized notion of solution can be defined:

Proposition 6.1. Given a c-convex function u_0, define the curve u_t of functions on X, emanating from u_0, as the pointwise maximum of the limit points in C(X) of the iterates u^(k)_m with m/k → t. Then u_t is c-convex for any fixed t (and, in particular, continuous), and there exists a constant C such that sup_{X×[0,∞[} |u_t(x)| ≤ C.

Proof.
Step 1: there exists a constant C such that sup_X |u^(k)_m| ≤ C. Indeed, J(u^(k)_m) → inf_{C^0(X)} J, and hence the lhs is uniformly bounded in k. Thus, there exists a constant C such that −C ≤ I_µ(u^(k)_m) ≤ C. The proof of Step 1 is now concluded by observing that there exist constants A_1 and A_2 such that, for any c-convex function u,
sup_X u ≤ I_µ(u) + A_1 and inf_X u ≥ I_µ(u) − A_2.
Indeed, both functionals f_1(u) := sup_X u − I_µ(u) and f_2(u) := inf_X u − I_µ(u) are continuous on C(X) and descend to C(X)/R. But the space of c-convex functions is compact in C(X)/R (as is shown precisely as in Lemma 2.6), and hence any continuous functional on this space is uniformly bounded, which implies the two inequalities above.
Step 2: If {u α } α∈A is a finite family of c−convex functions, then u := max{u α } α∈A is c−convex.
It is enough to find a function v ∈ C(X) such that u = v^c. We will show that v := min{u_α^c}_{α∈A} does the job. To this end, first observe that the c-Legendre transform u ↦ u^c is order reversing. Hence, u_α ≤ u implies that u_α^c ≥ u^c, giving v ≥ u^c. Applying the c-Legendre transform again thus gives v^c ≤ u^{cc} = u. To prove the reversed inequality, first observe that, by definition, u_α^c ≥ v and hence u_α = (u_α^c)^c ≤ v^c. Finally, taking the sup over all α proves the desired reversed inequality.
Step 3: Conclusion. Denote by K_t the closure in C(X) of the set S_t of all u^(k)_m such that m/k → t and k → ∞. By Step 1 and Lemma 2.12, K_t is compact. Let u_1, ..., u_m be the limit points of S_t. By the argument towards the end of Step 1 in the proof of Theorem 3.3, each u_i is c-convex. Hence, by Step 2, so is u := max{u_i}. q.e.d.
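The generalized solution curve can be probed numerically by running m ≈ tk Sinkhorn iterations at entropic scale 1/k and rescaling the resulting scaling vector into a potential; a 1D periodic sketch (the grid, the marginals and the normalization u = −k^{-1} log a are illustrative choices):

```python
import numpy as np

def sinkhorn_potential(k, t, N=128):
    """Run m = round(t * k) Sinkhorn iterations at entropic parameter 1/k and
    return the rescaled potential u = -(1/k) log a on a periodic grid."""
    x = np.arange(N) / N
    d = np.abs(x[:, None] - x[None, :]); d = np.minimum(d, 1 - d)  # torus distance
    K = np.exp(-k * d**2 / 2)
    p = np.full(N, 1.0 / N)
    g = 1.0 + 0.5 * np.sin(2 * np.pi * x); q = g / g.sum()
    a = np.ones(N)
    for _ in range(int(round(t * k))):
        b = q / (K.T @ a)
        a = p / (K @ b)
    return -np.log(a) / k

u1 = sinkhorn_potential(k=50, t=2.0)
u2 = sinkhorn_potential(k=100, t=2.0)
print(np.abs((u1 - u1.mean()) - (u2 - u2.mean())).max())  # stabilizes as k grows
```

The mean-normalized potentials at different k nearly coincide, consistent with the existence of a scaling limit u_t at fixed t.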
The curve u_t in Proposition 6.1 is well-defined for any probability measures µ and ν on compact topological spaces X and Y and for any continuous cost function c. Moreover, if µ and ν are normalized volume forms on compact manifolds, assumptions A1 and A2 hold and u_0 ∈ H, then, by Theorem 2.8, u_t coincides with the classical solution of the parabolic equation 4.2, as long as such a solution exists in H, i.e. as long as F_{u_t} is a well-defined diffeomorphism. This makes the curve u_t a candidate for a solution to the problem posed in [28, Problem 9] of defining some kind of weak solution to the parabolic equation 4.2, without making assumptions on the MTW-tensor etc. (as in Theorem 4.9). The connection to the Sinkhorn algorithm also opens up the possibility of numerically exploring singularity formation of classical solutions u_t to the parabolic equation 4.2 as t → T (the maximal existence time). As indicated in [28, Problem 9], one could expect that the first derivatives of a classical solution u_t blow up along a subset S of X of measure zero as t → T (moreover, in the light of the discussion in [28, Problem 8], the subset S might be expected to be rectifiable and of Hausdorff codimension at least one).
Finally, it may be illuminating to point out that, even if the construction of the generalized solution u_t may appear to be rather non-standard from a PDE point of view, it bears some similarities to the method of "vanishing viscosity" for constructing solutions to PDEs by adding small regularizing terms. This is reinforced by the interpretation of the inverse of k as an "entropic regularization parameter", discussed in the introduction of the paper (also note that the approximations u^(k)_{m_k} are smooth when the heat kernel is used, as in Theorem 5.1). One is thus led to ask whether, under suitable regularity assumptions on (µ, ν, c), the curve u_t is a viscosity solution of the parabolic PDE [23]?

7. Appendix: proof of Lemma 3.7 (discrete stationary phase approximation)

We will use the standard notation O(k^{−∞}) for a sequence of numbers a_k such that |a_k| ≤ C_p k^{−p} for any given p > 0. We start with the following elementary

Lemma 7.1. Let h_k be a sequence of continuous convex functions on the polydisc D_k in R^n of radius log k centered at 0, whose derivatives up to order s ≤ 2 are bounded by Ce^{−x^2/C} on D_k, for some positive constant C. Then
k^{−n/2} Σ_{x^(k)_i ∈ D_k ∩ k^{−1/2}Z^n} h_k(x^(k)_i) = ∫_{D_k} h_k dx + O(k^{−1}).

Proof. By restricting the summation to one variable at a time, it is enough to consider the case when n = 1. Fix x^(k)_i, which, by symmetry, may be assumed non-negative. For any fixed x in the interval I_k(x^(k)_i) of length k^{−1/2} centered at x^(k)_i, the difference |h_k(x) − h_k(x^(k)_i)| is controlled by the derivative bounds, using that e^{−x^2/C} is decreasing in the last step; by symmetry, the first-order contributions on the two halves of I_k(x^(k)_i) cancel. Hence, summing over all points x^(k)_i ∈ D_k ∩ k^{−1/2}Z except the end points, and using that |h_k(x^(k)_i)| ≤ Ce^{−(log k)^2/C} = O(k^{−∞}) at the end points, concludes the proof, after dominating the accumulated error terms by the integral of e^{−(s−1/(2k^{1/2}))^2/C} over 0 ≤ s ≤ log k. q.e.d.
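The Gaussian-weighted Riemann sum estimate in the lemma is easy to test numerically in one dimension (the model amplitude e^{−x^2} and the truncation radius log k follow the lemma; the displayed error is dominated by the cut-off at the boundary of D_k):

```python
import numpy as np
from math import erf, sqrt, pi

def lattice_sum_vs_integral(k):
    """Compare the sum over the lattice k^{-1/2} Z, restricted to [-log k, log k],
    with the corresponding integral, for the model amplitude e^{-x^2}."""
    h = 1.0 / np.sqrt(k)                   # lattice spacing
    R = np.log(k)                          # radius of the disc D_k
    x = np.arange(-np.floor(R / h), np.floor(R / h) + 1) * h
    lattice = h * np.sum(np.exp(-x**2))
    integral = sqrt(pi) * erf(R)           # int_{-R}^{R} e^{-x^2} dx
    return abs(lattice - integral)

for k in (10, 100, 1000):
    print(k, lattice_sum_vs_integral(k))
```

For a Gaussian the lattice sum is in fact accurate far beyond the O(k^{−1}) rate required by the lemma, by Poisson summation.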
In the sequel we will denote by + the ordinary group structure on T^n and by 0 the corresponding zero element. Without loss of generality we may as well assume that α(x_0) = 0. First note that the sum localizes around x_0, i.e. it may be restricted to an arbitrarily small neighborhood U of x_0, up to adding an exponentially small error term. Indeed, on the subset where α > δ the summand is dominated by Ce^{−δk}.
Step 1: The case when x_0 = 0 in T^n. We observe that the sum localizes to the polydisc U_k of radius k^{−1/2} log k centered at 0, up to introducing an error term of the order O(k^{−1}). Indeed, by assumption α(x) ≥ δ|x|^2 on the complement of a fixed neighborhood U, and on the complement of U_k the summand is thus negligible. Introducing the notation α^(k)(x) := kα(k^{−1/2}x), we can write the localized sum as a sum of h_k := e^{−α^(k)} over the rescaled lattice points. Now, Taylor expanding α gives, when |x| ≤ log k, and denoting by p^(3) the third order term (i.e. the term of homogeneous degree three),
α^(k)(x) = Ax · x/2 + k^{−1/2} p^(3)(x) + k^{−1} O(|x|^4),
which may be differentiated twice, by assumption. Thus h_k := e^{−α^(k)} satisfies the assumptions of the previous lemma (with s = 2 if α ∈ C^2). This shows that in the present discrete setting we get the same result, up to the negligible error term O(k^{−1}), as the ordinary stationary phase approximation, which can hence be invoked to conclude. Alternatively, a direct argument goes as follows. Taylor expanding the exponential gives
h_k(x) = e^{−Ax·x/2}(1 + k^{−1/2} p^(3)(x) + k^{−1} O(|x|^4)),
and hence the sum is approximated by the integral of e^{−Ax·x/2}(1 + k^{−1/2} p^(3)(x) + k^{−1} O(|x|^4)) over {|x| ≤ log k}. Using the exponential decay of e^{−Ax·x/2}, the integral may be taken over all of R^n, up to introducing an error term O(k^{−∞}). Hence, computing the Gaussian integral concludes the proof, once one has verified that the integral over p^(3) vanishes. In the case when A is the identity, the vanishing follows directly from the fact that p^(3) is odd. In the general case, one first observes that the space of polynomials of homogeneous degree three is invariant under the action of the invertible linear maps. Hence the problem reduces, by a linear change of variables, to the previous case.
Step 2: The case of a general x_0. Set α̃(x) := α(x + x_0) and f̃(x) := f(x + x_0), and decompose x_0 = m_k + r_k, where m_k ∈ Λ_k and |r_k| ≤ 1/k (where we have identified a small neighborhood of 0 in T^n, containing r_k, with R^n). Then we can write
⟨e^{−kα} f, δ_{Λ_k}⟩ = ⟨e^{−kα̃} f̃, δ_{Λ_k − r_k}⟩.
Indeed, since m_k ∈ Λ_k, we have Λ_k − x_0 = Λ_k − r_k. Now, we note that the conclusion in the previous lemma remains true when Λ_k is replaced by the shifted set Λ_k − r_k (with essentially the same proof), and hence we can conclude as before.
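The conclusion of the appendix, that the lattice sum reproduces the classical stationary phase (Laplace) approximation up to a relative error O(1/k), can be verified on the one-dimensional torus (the phase α below is an illustrative choice with a single non-degenerate minimum):

```python
import numpy as np

def discrete_laplace(k):
    """Normalized sum of e^{-k alpha} over Lambda_k = (1/k)Z / Z, compared with
    the Gaussian (Laplace) approximation (2 pi / (k alpha''(x0)))^{1/2}."""
    x = np.arange(k) / k                   # the discrete torus Lambda_k
    alpha = 1.0 - np.cos(2 * np.pi * x)    # minimum at x0 = 0, alpha''(0) = (2 pi)^2
    sum_k = np.sum(np.exp(-k * alpha)) / k
    gauss = np.sqrt(2 * np.pi / (k * (2 * np.pi)**2))
    return sum_k, gauss

for k in (50, 200, 800):
    s, g = discrete_laplace(k)
    print(k, s / g - 1.0)                  # relative error shrinks like 1/k
```

For this phase the exact value is e^{−k} I_0(k), whose expansion (2πk)^{−1/2}(1 + 1/(8k) + ...) makes the O(1/k) relative error visible in the printed numbers.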