1 Introduction

The theory of Optimal Transport [67, 68] is used in a multitude of applications, ranging from economics, statistics, cosmology, geometric optics and meteorology to more recent applications in the data sciences (including machine learning, vision, graphics and imaging; see [27, 63, 64]). In recent years there has been a flurry of numerical work applying the Sinkhorn algorithm [61] (aka the Iterative Proportional Fitting Procedure [55]) as a fast and efficient way of computing approximations to optimal transport maps, or equivalently, solutions to certain geometric Monge–Ampère type equations. This is motivated by applications to machine learning [26] (concerning optimal transport in Euclidean \(\mathbb {R}^{n}\)) and computer graphics and image processing [63] (where the general setting of optimal transport on Riemannian manifolds is considered). The key advantage of the Sinkhorn algorithm in this context is its favorable large-scale computational properties (parallelization, linear time convergence, etc. [3]).

The main aim of the present paper is to show that, in a large-scale limit, the Sinkhorn algorithm converges towards the solution of a parabolic PDE of Monge–Ampère type, which, incidentally, has previously appeared in [44, 45, 57] and is called the parabolic optimal transport equation in [44]. The convergence is shown with explicit error estimates. This leads, in particular, to the first constructive approximation of the potential of the optimal transport map with explicit bounds on the time-complexity of the construction and the approximation errors introduced by the discretization.

1.1 Background and setup

1.1.1 The Sinkhorn iteration

Let p and q be two vectors in \(\mathbb {R}_{+}^{N}\) whose entries sum to one. Given any matrix \(K\in \mathbb {R}_{+}^{N\times N}\) there exist, by Sinkhorn’s theorem [61], two diagonal positive matrices \(D_{a}\) and \(D_{b}\) with diagonal vectors a and b in \(\mathbb {R}_{+}^{N}\) such that the matrix

$$\begin{aligned} B:=D_{b}KD_{a} \end{aligned}$$
(1.1)

has the property that the rows sum to p and the columns sum to q. The diagonal matrices \(D_{a}\) and \(D_{b}\) are uniquely determined up to scaling \(D_{a}\) and \(D_{b}^{-1}\) by the same positive number. Moreover, B can be obtained as the limit of the algorithm defined by alternately normalizing the rows and columns of the matrix. In other words,

$$\begin{aligned} B=\lim _{m\rightarrow \infty }B(m),\,\,\,B(m)=D_{b(m)}KD_{a(m)}, \end{aligned}$$

where the pair of positive vectors (a(m), b(m)) is defined by the following recursion, formulated in terms of matrix-vector multiplications and component-wise division of vectors:

$$\begin{aligned} b(m)&=p/(K\cdot a(m)),\\ a(m+1)&=q/(K^{T}\cdot b(m)) \end{aligned}$$

with initial data a(0) taken as the vector with entries 1. In fact, any initial positive vector a(0) will do and the vectors a(m) and b(m) are convergent as \(m\rightarrow \infty \) (without any need of scaling) as follows from Theorem 2.9. The corresponding iteration

$$\begin{aligned} a(m+1):=q/\left( K^{T}\cdot \left( p/(K\cdot a(m))\right) \right) \end{aligned}$$

will be called the Sinkhorn iteration. Its fixed point (uniquely determined up to scaling) is thus the vector a appearing in formula 1.1.
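In concrete terms, the iteration amounts to alternating component-wise divisions and matrix-vector products. The following minimal sketch (in Python with NumPy; the function name sinkhorn, the fixed iteration count and the test data are illustrative choices, not taken from the paper) implements it and verifies the marginal conditions:

```python
import numpy as np

def sinkhorn(K, p, q, num_iters=1000):
    """Alternately rescale so that B = diag(b) K diag(a) has row sums p
    and column sums q; a sketch without stopping criterion or log-domain
    stabilization."""
    a = np.ones(K.shape[1])          # initial data a(0) = (1, ..., 1)
    for _ in range(num_iters):
        b = p / (K @ a)              # b(m)   = p / (K a(m))
        a = q / (K.T @ b)            # a(m+1) = q / (K^T b(m))
    return a, b

# Example: the rescaled matrix B recovers the prescribed marginals.
rng = np.random.default_rng(0)
K = rng.random((4, 4)) + 0.1
p = np.full(4, 0.25)
q = rng.random(4)
q /= q.sum()
a, b = sinkhorn(K, p, q)
B = np.diag(b) @ K @ np.diag(a)
print(B.sum(axis=1) - p, B.sum(axis=0) - q)  # both vanish up to round-off
```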

The same algorithm has appeared in various fields (economics, traffic planning, statistics,...; see [27]). In its most general (infinite dimensional) form, known as the Iterative Proportional Fitting Procedure in the statistics literature, the roles of p and q are played by probability measures on two (possibly non-finite) topological spaces. In this setting the corresponding convergence of B(m) towards a limit B was established in [55] using a maximum entropy characterization of B, which in the discrete setting above says that B is the unique element realizing the infimum

$$\begin{aligned} \inf _{\gamma \in \Pi (p,q)}\mathcal {I}(\gamma |K) \end{aligned}$$

where \(\mathcal {I}\) denotes the Kullback–Leibler divergence of \(\gamma \) relative to K, when \(\gamma \) and K are identified with measures on the discrete product \(\{1,\ldots ,N\}^{2}\) and \(\Pi (p,q)\) denotes the set of all matrices \(\gamma \) in \(\mathbb {R}_{+}^{N\times N}\) with row sums p and column sums q (i.e. the corresponding measures on \(\{1,\ldots ,N\}^{2}\) have marginals p and q, respectively). Since \(-\mathcal {I}(\gamma |K)\) is the “physical” entropy of \(\gamma \) relative to K this can indeed be viewed as a maximum entropy characterization of B. An alternative proof of the convergence follows from Theorem 2.9 below, which also shows that a(m) and b(m) have unique limits (determined by the initial value a(0)).
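For illustration, the entropy functional being minimized can be evaluated directly; the following sketch (reusing B, K, p and q from the example above) checks that the Sinkhorn limit does at least as well as the independent coupling \(p\otimes q\), which also lies in \(\Pi (p,q)\):

```python
def kl_divergence(gamma, K):
    """Kullback-Leibler divergence I(gamma | K), with gamma and K viewed as
    measures on the discrete product space (assumes gamma is positive)."""
    return float(np.sum(gamma * np.log(gamma / K)))

# B minimizes I(.|K) over Pi(p, q), so it beats the independent coupling:
print(kl_divergence(B, K) <= kl_divergence(np.outer(p, q), K))  # True
```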

1.1.2 Discrete optimal transport

Now replace K with a family of matrices \(K_{\epsilon }\) of the form

$$\begin{aligned} (K_{\epsilon })_{ij}=e^{-\epsilon ^{-1}C_{ij}}, \end{aligned}$$

for a given cost matrix \(C_{ij}\), parametrized by a positive number \(\epsilon \). Then the corresponding matrix \(B_{\epsilon }\in \Pi (p,q)\) furnished by Sinkhorn’s theorem converges, as \(\epsilon \rightarrow 0\), to a matrix \(B_{0}\) realizing the infimum

$$\begin{aligned} \mathcal {C}:=\inf _{\gamma \in \Pi (p,q)}\left\langle C,\gamma \right\rangle . \end{aligned}$$
(1.2)

In the terminology of discrete optimal transport theory [27, 67] this means that \(B_{0}\) is an optimal transport plan (coupling) between p and q, with respect to the cost matrix C. The convergence follows from the maximum entropy characterization of \(B_{\epsilon }\) (recalled in the previous section), which reveals that \(B_{\epsilon }\) realizes the perturbed minimum

$$\begin{aligned} \mathcal {C}_{\epsilon }:=\inf _{\gamma \in \Pi (p,q)}\left\langle C,\gamma \right\rangle +\epsilon \mathcal {I}(\gamma |I). \end{aligned}$$

While the matrix \(B_{0}\) is sparse (and is typically supported on the graph of a transport map), the approximation \(B_{\epsilon }\) always has full support (by Sinkhorn’s theorem) and is thus more “regular” than \(B_{0}\). Accordingly, the small parameter \(\epsilon \) is sometimes referred to as the entropic regularization parameter. This is illustrated by the simulations in [3] for the case when p and q represent the discretization of two probability measures on the unit interval in \(\mathbb {R}\), using a large number N of points and with \(C_{ij}\) the cost matrix defined by the squared distance function on \(\mathbb {R}\). When \(\epsilon \) is taken to be of the order 1/N, [3, Fig. 1] shows how the corresponding discrete probability measures on \(\mathbb {R}\times \mathbb {R}\) appear as smoothed out versions of the graph of the optimal transport map.
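The following sketch illustrates this effect on the unit interval, reusing the sinkhorn function above (the grid size, densities and values of \(\epsilon \) are illustrative; no log-domain stabilization is used, so \(\epsilon \) cannot be taken too small here):

```python
# Entropic regularization on the unit interval.
N = 200
x = (np.arange(N) + 0.5) / N                 # grid on [0, 1]
C = 0.5 * (x[:, None] - x[None, :])**2       # squared-distance cost matrix
p = np.full(N, 1.0 / N)                      # two discretized densities
q = np.exp(-x)
q /= q.sum()

for eps in [1e-1, 1e-2, 1.0 / N]:
    K_eps = np.exp(-C / eps)
    a, b = sinkhorn(K_eps, p, q, num_iters=2000)
    B_eps = b[:, None] * K_eps * a[None, :]
    # B_eps has full support but concentrates near the graph of the optimal
    # map as eps decreases, and <C, B_eps> decreases towards the OT cost:
    print(eps, (B_eps * C).sum())
```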

It should also be pointed out that the entropy minimization problem above can be traced back to the work by Schrödinger on Quantum Mechanics in the 30s [60] (see the survey [48], where the connection to optimal transport is emphasized).

1.1.3 Discretization of optimal transport on the torus

Let now X be a compact manifold endowed with a cost function c(x, y). To keep things as simple as possible we will start by taking the manifold X to be the n-dimensional torus

$$\begin{aligned} T^{n}:=\left( \frac{\mathbb {R}}{\mathbb {Z}}\right) ^{n} \end{aligned}$$

endowed with the standard distance function \(d_{T^{n}}(x,y)\), induced from the Euclidean distance function on \(\mathbb {R}^{n}\). Let \(\mu \) and \(\nu \) be two probability measures on \(T^{n}\) (which thus correspond to two periodic measures on \(\mathbb {R}^{n}\)) with Hölder continuous and strictly positive densities \(e^{-f}\) and \(e^{-g}\), respectively:

$$\begin{aligned} \mu =e^{-f}dV,\,\,\,\nu =e^{-g}dV \end{aligned}$$
(1.3)

where dV is the normalized Riemannian volume form on \(T^{n}\). Define the cost function \(c(x,y)\) by

$$\begin{aligned} c(x,y):=d_{T^{n}}(x,y)^{2}/2. \end{aligned}$$

As is well-known, a continuous self-map F of \(T^{n}\) transporting (pushing forward) \(\mu \) to \(\nu \) is optimal with respect to this cost function, i.e. the corresponding transport plan \(\gamma _{F}:=(I\times F)_{*}\mu \) realizes the infimum

$$\begin{aligned} d^{2}(\mu ,\nu )/2:=\inf _{\gamma \in \Pi (\mu ,\nu )}\left\langle c,\gamma \right\rangle , \end{aligned}$$
(1.4)

if and only if F can be expressed in terms of a potential \(u\in C^{2}(T^{n})\):

$$\begin{aligned} F(x):=x+\nabla u(x),\,\,\,T^{n}\rightarrow T^{n}, \end{aligned}$$

which is strictly quasi-convex in the sense that the symmetric matrix \(\nabla ^{2}u+I\) is positive definite:

$$\begin{aligned} \nabla ^{2}u+I>0 \end{aligned}$$

(we identify u with a \(\mathbb {Z}^{n}\)-periodic function on \(\mathbb {R}^{n}\) so that F(x) descends to define a self-map of \(T^{n}\)). The function u is uniquely determined, up to an additive constant, by the following Monge–Ampère equation

$$\begin{aligned} \exp (-g(x+\nabla u(x)))\det (I+\nabla ^{2}u(x))=\exp (-f(x)). \end{aligned}$$
(1.5)

We recall that the distance \(d(\mu ,\nu )\) between \(\mu \) and \(\nu \), defined by formula 1.4, is usually called the Wasserstein \(L^{2}\)-distance or the Optimal Transport distance.

1.2 Main results in the torus setting

For notational reasons it will be convenient to express the entropic regularization parameter as

$$\begin{aligned} \epsilon :=k^{-1} \end{aligned}$$

for k a positive integer (but the results apply to any real parameter \(k\ge k_{0}>0\)). For each k we fix a positive integer \(N_{k}\) and denote by \(\Lambda _{k}\) the corresponding discrete torus in \(T^{n}\) with \(N_{k}\) points. In other words, \(\Lambda _{k}\) is the grid on \(T^{n}\) with edge-length \(N_{k}^{-1/n}\):

$$\begin{aligned} \Lambda _{k}:=\left( \frac{(N_{k}^{-1/n}\mathbb {Z})}{\mathbb {Z}}\right) ^{n}\subset T^{n}. \end{aligned}$$

We will assume that

$$\begin{aligned} \lim _{k\rightarrow \infty }N_{k}=\infty . \end{aligned}$$

Denote by \(p^{(k)}\) and \(q^{(k)}\) the corresponding discrete approximations in \(\mathbb {R}^{N_{k}}\) of \(\mu \) and \(\nu \), defined by the normalized values of the densities of \(\mu \) and \(\nu \) at the points in \(\Lambda _{k}\). Defining a sequence of \(N_{k}\times N_{k}\) matrices \(K^{(k)}\) by

$$\begin{aligned} K_{ij}^{(k)}:=\mathcal {K}^{(k)}(x_{i}^{(k)},x_{j}^{(k)}),\,\,\,\mathcal {K}^{(k)}(x,y):=e^{-kd(x,y)^{2}/2} \end{aligned}$$
(1.6)

and applying Sinkhorn’s theorem to the triple \((K^{(k)},p^{(k)},q^{(k)})\) furnishes two positive vectors \(a^{(k)}\) and \(b^{(k)}\) in \(\mathbb {R}^{N_{k}}\), uniquely determined by the normalization condition that \(a_{i_{k}}^{(k)}=1\) for the index \(i_{k}\) corresponding to the point \(x_{i_{k}}=0\) in \(\Lambda _{k}\).
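The following sketch sets up this discretization in the one-dimensional case \(n=1\) (so that \(N_{k}=k\)); the densities \(e^{-f}\) and \(e^{-g}\) are hypothetical smooth examples, chosen only for illustration:

```python
import numpy as np

def torus_dist(x, y):
    """Distance matrix on T^1 = R/Z between points represented in [0, 1)."""
    d = np.abs(x[:, None] - y[None, :])
    return np.minimum(d, 1.0 - d)

k = 50                                  # inverse regularization parameter
N_k = k                                 # N_k = k^n with n = 1
grid = np.arange(N_k) / N_k             # the discrete torus Lambda_k

f = lambda x: 0.2 * np.cos(2 * np.pi * x)      # illustrative data
g = lambda x: 0.2 * np.sin(2 * np.pi * x)
p = np.exp(-f(grid))
p /= p.sum()                            # discrete approximation of mu
q = np.exp(-g(grid))
q /= q.sum()                            # discrete approximation of nu

K = np.exp(-k * torus_dist(grid, grid)**2 / 2)  # kernel matrix (1.6)
```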

Our first result shows that the potential u for the optimal transport problem between \(\mu \) and \(\nu \) can be recovered from the positive vectors \(a^{(k)}\) and \(b^{(k)}\), furnished by the Sinkhorn theorem, i.e. the fixed points of the corresponding iteration:

Theorem 1.1

(Static case) If \(x_{i_{k}}^{(k)}\) is a sequence of points in the discrete torus \(\Lambda _{k}\) converging to the point x in the torus \(T^{n}\), as \(k\rightarrow \infty \), then

$$\begin{aligned} -\lim _{k\rightarrow \infty }k^{-1}\log a_{i_{k}}^{(k)}=u(x) \end{aligned}$$

where u is the unique optimal transport potential solving the Monge–Ampère equation 1.5 and normalized by \(u(0)=0\).

This convergence result should come as no surprise; it holds in a very general setting (see Theorems 3.3, 3.4 and 3.6). But the main point of the present paper is that the Sinkhorn algorithm itself, when viewed as a discrete dynamical system for the positive vectors \(a_{i_{k}}^{(k)}\), also admits a continuous large-scale limit \(u_{t}(x)\), evolving according to the following fully non-linear parabolic PDE:

$$\begin{aligned} \frac{\partial u_{t}(x)}{\partial t}=\log \det (I+\nabla ^{2}u_{t}(x))-g(x+\nabla u_{t}(x))+f(x),\,\,\,u_{0}=0 \end{aligned}$$
(1.7)

The existence of a \(C^{4}\)-smooth solution \(u_{t}\) to this PDE, given \(f,g\in C^{2,\alpha }(T^{n})\), essentially follows from the results in [44, 45, 57] (for completeness a proof is provided in “Appendix B”). In order to formulate the convergence in the next theorem we first observe that for \(m\ge 1\) the function

$$\begin{aligned} u_{m}^{(k)}(x_{i})=-k^{-1}\log \frac{a_{i}^{(k)}(m)}{p_{i}} \end{aligned}$$
(1.8)

on the discrete torus \(\Lambda _{k}\) admits a canonical extension, defining a quasi-convex function on X:

$$\begin{aligned} u_{m}^{(k)}(x):=k^{-1}\log \sum _{i=1}^{N_{k}}e^{-kd(x,y_{i}^{(k)})^{2}/2}b_{i}^{(k)}(m-1) \end{aligned}$$
(1.9)

expressed in terms of a Fourier/Gauss type sum, with k playing the role of the band-width.
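Numerically, the extension 1.9 is best evaluated in the log domain, in line with the change of variables 1.8; the following sketch (reusing torus_dist from the discretization sketch above, with log_b_prev denoting the vector \(\log b^{(k)}(m-1)\)) does this with a log-sum-exp:

```python
from scipy.special import logsumexp
import numpy as np

def u_extension(x, grid, log_b_prev, k):
    """Evaluate the canonical extension (1.9) of u_m^{(k)} at the points x,
    using a log-sum-exp over the grid for numerical stability (a sketch)."""
    expo = -k * torus_dist(np.atleast_1d(x), grid)**2 / 2 + log_b_prev
    return logsumexp(expo, axis=1) / k
```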

Theorem 1.2

(Dynamic case) Assume that f and g are in \(C^{2,\alpha }(T^{n})\) for some \(\alpha >0\) and that \(N_{k}=k^{n}\), i.e. the edge-length of the grid on \(T^{n}\) is equal to \(k^{-1}\). For any sequence of discrete times \(m_{k}\) (iterations) such that \(m_{k}/k\rightarrow t\) we have

$$\begin{aligned} \lim _{k\rightarrow \infty }u_{m_{k}}^{(k)}=u_{t} \end{aligned}$$

uniformly on \(T^{n}\), where \(u_{t}\) is the smooth and strictly quasi-convex solution of the parabolic PDE 1.7 with initial data \(u_{0}=0\). More precisely, for any positive integers k and m

$$\begin{aligned} \sup _{T^{n}}\left| u_{m}^{(k)}-u_{m/k}\right| \le C\frac{m}{k}k^{-1}, \end{aligned}$$

for a constant C independent of t. More generally, if at the initial discrete time \(m=0\),

$$\begin{aligned} u_{0|\Lambda _{k}}^{(k)}=u_{0|\Lambda _{k}}, \end{aligned}$$

for a given strictly quasi-convex function \(u_{0}\) in \(C^{4,\alpha }(T^{n})\), then the corresponding result still holds (with a constant C depending on \(u_{0}\)).

An immediate consequence is that if the number \(m_{k}\) of iterations is too small, of the order o(k), then \(u_{m_{k}}^{(k)}\rightarrow 0\) under the Sinkhorn iterations, i.e. “nothing happens”. As discussed in Sect. 4.2 the estimate in the previous theorem can be expected to be sharp under the regularity assumption in the theorem. Moreover, the proof of the theorem yields an essentially explicit control on the constant C appearing in the error bound.

Using that \(u_{t}\) converges exponentially to a potential u for the optimal transport problem, as \(t\rightarrow \infty \) (which follows from results in [44]), we deduce the following

Corollary 1.3

(Constructive approximation of the potential u) Assume that f and g are in \(C^{2,\alpha }(T^{n})\) for some \(\alpha >0\) and that \(N_{k}=k^{n}\), i.e. the “edge length” of the grid on \(T^{n}\) is equal to \(k^{-1}\). There exists a positive constant \(A_{0}\) such that for any \(A>A_{0}\) the following holds: after \(m_{k}=\left\lfloor Ak\log k\right\rfloor \) iterations the corresponding quasi-convex functions \(u_{k}(x):=u_{m_{k}}^{(k)}(x)\) satisfy the estimate

$$\begin{aligned} \sup _{T^{n}}\left| u_{k}-u\right| \le Ck^{-1}\log k, \end{aligned}$$
(1.10)

for some constant C (depending on A), where u is a potential for the corresponding optimal transport map. Moreover, the discrete probability measures \(\gamma _{k}\) on \(T^{n}\times T^{n}\), determined by the Sinkhorn algorithm, converge weakly towards the corresponding optimal transport plan \((I\times (\nabla u+I))_{*}\mu \), concentrating exponentially on the graph \(\Gamma \) of the transport map \(\nabla u+I:\)

$$\begin{aligned} \gamma _{k}\le pk^{p}e^{-kd_{\Gamma }^{2}/p}\delta _{\Lambda _{k}}, \end{aligned}$$
(1.11)

for some positive constant p, where \(d_{\Gamma }\) denotes the vertical distance to the graph \(\Gamma \) in \(T^{n}\times T^{n}\), i.e. \(d_{\Gamma }(x,y):=d_{T^{n}}(y,\nabla u(x)+x)\) and \(\delta _{\Lambda _{k}}\) denotes the discrete uniform probability measure on the discrete torus \(\Lambda _{k}\).

More generally, we will show in Sect. 5 that if f and g are in \(C^{\infty }(T^{n})\), then the previous theorem and its corollary still hold as long as the number of discretization points in the grid satisfies

$$\begin{aligned} N_{k}\ge C_{\delta }k^{n/2(1+\delta )},\,\,\,C_{\delta }>0 \end{aligned}$$

for some \(\delta \in ]0,1/2]\). In particular, this means that one can then take a grid with larger edge-length

$$\begin{aligned} h=O(1/k^{1/2+\delta }) \end{aligned}$$

without affecting the quality of the approximation \(u_{k}\), that is, without affecting the order \(O(k^{-1}\log k)\) of the corresponding error terms. In other words, for smooth data the error terms are nearly of order \(O(h^{2})\) if the entropic regularization parameter is taken to be close to the square of the edge-length h of the grid.

Each iteration of the Sinkhorn algorithm may be formulated in terms of a matrix-vector multiplication, which requires \(O(N^{2})\) arithmetic operations, so the direct construction of the function \(u_{k}\) uses, in general, \(O\left( kN_{k}^{2}\log k\right) \) elementary arithmetic operations. However, in the present case of the torus the matrix-vector operations in question are discrete convolutions and can thus be performed using merely \(O\left( N\log N\right) \) arithmetic operations, by means of the Fast Fourier Transform (or \(O(N^{1+1/n})\) operations, using separability; see Sect. 6.3.1). Thus the construction of \(u_{k}\) in the previous corollary requires merely \(O\left( kN_{k}(\log N_{k})\log k\right) \) arithmetic operations to obtain an error term of order \(O(k^{-1}\log k)\). The same number of arithmetic operations (modulo a negligible term \(O(N_{k})\) needed for the summation) yields the following constructive approximation of the squared Optimal Transport distance \(d(\mu ,\nu )^{2}\) (formula 1.4) with an additive error of the order \(O(k^{-1}\log k)\):

Corollary 1.4

(Constructive approximation of \(d(\mu ,\nu )^{2}\)) Assume that f and g are in \(C^{2,\alpha }(T^{n})\) for some \(\alpha >0\). There exists a positive constant A such that after \(m_{k}=\left\lfloor Ak\log k\right\rfloor \) iterations

$$\begin{aligned} \frac{1}{2}d(\mu ,\nu )^{2}=\frac{1}{k}\sum _{i=1}^{N_{k}}p_{i}^{(k)}\log a_{i}^{(k)}(m_{k})+\frac{1}{k}\sum _{i=1}^{N_{k}}q_{i}^{(k)}\log b_{i}^{(k)}(m_{k})+O\left( \frac{1}{k}\log k\right) \end{aligned}$$

and, as a consequence, such an equality also holds for the squared discrete Optimal Transport distance \(d(\mu ^{(k)},\nu ^{(k)})^{2}\) (defined by formula 1.2 with \(C_{ij}=d(x_{i},x_{j})^{2}/2\)).
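Continuing the discretization sketch above (with A taken to be 1 for illustration and no log-domain stabilization), the convolution structure and the dual sum of Corollary 1.4 can be realized as follows; K_apply_fft replaces the \(O(N^{2})\) matrix-vector product by an \(O(N\log N)\) FFT-based circular convolution:

```python
def K_apply_fft(kernel_row, a):
    """Apply the circulant kernel matrix K (with first row kernel_row) to a
    vector in O(N log N) operations, using that K is a discrete convolution
    on the periodic grid."""
    return np.real(np.fft.ifft(np.fft.fft(kernel_row) * np.fft.fft(a)))

kernel_row = np.exp(-k * torus_dist(grid[:1], grid)[0]**2 / 2)
a = np.ones(N_k)
for _ in range(int(k * np.log(k)) + 1):    # m_k ~ A k log k iterations
    b = p / K_apply_fft(kernel_row, a)
    a = q / K_apply_fft(kernel_row, b)     # K is symmetric, so K^T = K

# Dual-sum approximation of d(mu, nu)^2 / 2, up to O(k^{-1} log k):
half_d2 = (p * np.log(a)).sum() / k + (q * np.log(b)).sum() / k
print(half_d2)
```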

Finally we point out that, by symmetry, Theorem 1.2 also shows that, on the one hand, the functions

$$\begin{aligned} v_{m_{k}}^{(k)}(x_{i}):=-k^{-1}\log \frac{b_{i}^{(k)}(m_{k})}{q_{i}} \end{aligned}$$
(1.12)

converge, as \(k\rightarrow \infty \), towards the solution \(v_{t}\) of the parabolic equation obtained by interchanging the roles of \(\mu \) and \(\nu \). On the other hand, by Lemma 3.1, the function \(v_{m_{k}}^{(k)}\) is equal to the Legendre transform (in the space variable) of \(u_{m_{k}}^{(k)}\), up to a negligible \(O\left( k^{-1}\log k\right) \) error term. Thus Theorem 1.2 is consistent, as it must be, with the fact that the Legendre transform of the solution \(u_{t}\) of Eq. 1.7 solves the parabolic equation obtained by interchanging the roles of \(\mu \) and \(\nu \) (as can be checked by a direct calculation).

1.3 Generalizations

1.3.1 The static case

The result in the static setting is shown to hold for optimal transport between two probability measures \(\mu \) and \(\nu \) defined on general compact topological spaces X and Y, respectively: see Theorems 3.3, 3.4 and also Theorem 3.6, which, in particular, applies to the classical Euclidean setting where X and Y are convex domains in \(\mathbb {R}^{n}\) and \(c(x,y):=-x\cdot y\). In the case when Y is convex the corresponding limit \(\phi \) obtained from the Sinkhorn iteration is the unique convex normalized solution to the second boundary value problem for the Monge–Ampère equation in the interior of X:

$$\begin{aligned} e^{-g(\nabla \phi )}\det (\nabla ^{2}\phi )dx=\mu ,\,\,\,(\partial \phi )(X)\subset Y. \end{aligned}$$
(1.13)

The result holds without any regularity assumptions on \(\mu \) and g if Alexandrov’s classical notion of a Monge–Ampère measure is employed (see Sect. 3.2). Moreover, the convergence holds even when the target set Y is not convex, but then \((\partial \phi )(X)\) is contained in the convex hull of Y (see the discussion in Sect. 3.2).

In the general setting the roles of the positive vectors \(p^{(k)}\) and \(q^{(k)}\), discretizing \(\mu \) and \(\nu \), are played by two sequences \(\mu ^{(k)}\) and \(\nu ^{(k)}\), satisfying certain density properties with respect to \(\mu \) and \(\nu \) (which are almost always satisfied in practice). Moreover, the cost function c is merely assumed to be continuous and can even be replaced by any sequence \(c_{k}\) converging uniformly to c, as the inverse k of the entropic regularization parameter tends to infinity (which applies, in particular, to the convolutional Wasserstein distances introduced in [63], where the matrix in formula 1.6 is replaced by a heat kernel; see Sect. 3.3).

1.3.2 The dynamic case

The corresponding result in the dynamic setting (Theorem 5.7) requires that X and Y be compact Riemannian manifolds, together with further regularity assumptions on c, \(\mu \) and \(\nu \) (the case when X and Y have boundaries is left for the future). A “local density property” on the approximations \(\mu ^{(k)}\) and \(\nu ^{(k)}\) is required, roughly saying that the approximations of \(\mu \) and \(\nu \) hold up to length scales of the order \(k^{-1/2}\), with an O(1/k)-error term. Interestingly, the local density properties turn out to be satisfied when the approximations \(\mu ^{(k)}\) and \(\nu ^{(k)}\) are defined by (weighted) point clouds, generated using Quasi-Monte Carlo methods for numerical integration [12, 13, 17]. The general results are applied to the case of the round two-sphere endowed with two different cost functions: (1) \(d(x,y)^{2}\) and (2) \(-\log |x-y|\). These two cases appear, for example, in applications to (1) computer graphics (texture mapping), medical imaging [30, 72] and mesh adaptation for global weather and climate prediction [70], and (2) the reflector antenna problem in geometric optics [37, 69]. Nearly linear complexity of the corresponding Sinkhorn iteration is achieved in both cases using fast transforms, and \(O(N^{3/2})\)-complexity using separability.

1.4 Relation to previous results

To the best of the author’s knowledge these are the first convergence results concerning the Sinkhorn algorithm (and its fixed points) in the limit when the number N of points and the inverse of the regularization parameter \(\epsilon \) jointly tend to infinity (see [21] and references therein for the static case when only \(\epsilon ^{-1}(=k)\) tends to infinity in the Euclidean \(\mathbb {R}^{n}\)-setting and [47, 48] for a very general setting). This kind of joint limit is, in practice, what is studied in numerical simulations in the context of geometric optimal transport (see for example [63] and the in-depth studies in [35, 59] on CPU and GPU hardware, respectively). Thus the convergence analysis in the present paper provides a theoretical basis for these simulations and yields concrete rates, under appropriate regularity assumptions (see Sect. 4.2 for a comparison with previous rates for the Sinkhorn algorithm). In particular, the present results rigorously establish and quantify the experimental observations in [59, Fig. 2] that the Sinkhorn algorithm converges after essentially \(O(\epsilon ^{-1})\) iterations. Moreover, the experimental findings in [59, Fig. 2] that \(\epsilon \) can be taken to be close to the order \(O(h^{2})\) on a grid with “edge length” h are confirmed, if the data is assumed to be \(C^{\infty }\)-smooth (note that the data in [59, Fig. 2] is even real-analytic).

It should also be pointed out that the “change of variables” in formula 1.8, which plays an important role in the present paper, is also crucial for numerical simulations as it ensures numerical stability, as emphasized in [27, 35, 59]. As stressed in [35], the corresponding log-sum-exp KeOps routines [22], used in [35] to implement the iteration \(u_{m}\rightarrow u_{m+1}\) on GPU hardware, are just as efficient as matrix-vector products with the kernel matrix \(K_{ij}\). Moreover, the present results provide a theoretical justification for the kernel truncations employed in [35, 59]. To briefly explain this we recall that the starting point of the stabilization scheme advocated in [59] (see also [27, Remark 4.22]) is to write the iteration in terms of the variables \((u_{m+1}-u_{m},v_{m+1}-v_{m})\). In the setting of a general cost function c(x, y) this corresponds to replacing the matrix kernel \(e^{-kc(x,y)}\) with the “stabilized” kernel

$$\begin{aligned} e^{-kc(x,y)}e^{-ku_{m-1}(x)}e^{-kv_{m-1}(y)} \end{aligned}$$

By Theorems 1.2 and 5.7 (and the argument in the proof of the bound 1.11) the latter kernel is, for k large, exponentially concentrated on the graph \(\Gamma _{t}\) in \(X\times Y\) of the diffeomorphism \(F_{u_{t}}\) corresponding to the parabolic solution \(u_{t}\) for \(t=(m-1)/k\). This means that the corresponding \(N_{k}\times N_{k}\) matrices \(\left( e^{-kc(x_{i},y_{j})}e^{-ku_{m-1}(x_{i})}e^{-kv_{m-1}(y_{j})}\right) \) are effectively sparse. Building on the results in the present paper this is exploited in [6] to modify the Sinkhorn algorithm in order to obtain a numerically stable algorithm on the torus, which is shown to have nearly O(N)-complexity at each iteration.

1.4.1 Comparison with other numerical schemes for optimal transport

The quantitative convergence in Corollary 1.3 should be compared with previous results in the rapidly growing literature on numerical approximation schemes for solutions to Optimal Transport problems and the corresponding Monge–Ampère type equations. However, the author is not aware of any previous results providing both complexity bounds (in terms of N) and a quantified rate of convergence of the error of the approximate solution, as \(N\rightarrow \infty \). Recall that a time-honored approach is to approximate the optimal transport potential u by solving the linear program which is dual to the discretized optimal transport problem. This can be done using combinatorial algorithms. However, these do not scale well for large N (see, for example, the exposition in [14], where applications to cosmology are given in the periodic setting). Moreover, it is not clear how to establish quantitative rates for the convergence towards u. Another influential approach is the Benamou–Brenier Augmented Lagrangian approach, using computational fluid mechanics, introduced in [2] in the periodic setting (numerical experiments suggest that it has \(O(N^{3})\) time-complexity, as pointed out in [4]). There is also a rapidly expanding literature concerning other discretization schemes in the field of numerical analysis of PDEs, mainly concerning the case of domains in \(\mathbb {R}^{n}\) (as in Sect. 3.2) and the periodic case of the torus (as in Sect. 1.1.3). Here we will focus on the schemes where convergence results, analogous to Theorems 3.6 and 1.1, have been established. The most studied approaches are based on either finite differences [4, 5] or semi-discrete Optimal Transport [46] (which goes back to the classical work of Alexandrov and Pogorelov). In these approaches the fully non-linear Monge–Ampère equation for the potential u (and the corresponding boundary conditions 1.13 or periodicity conditions) is replaced with a finite dimensional non-linear algebraic equation for a “discrete” function \(u_{h}\), where h denotes the corresponding spatial resolution. In our torus setting we thus have \(h=1/k\) (the entropic regularization parameter) when the data is \(C^{2,\alpha }\)-smooth, while for \(C^{\infty }\)-data h can be taken arbitrarily close to \(1/k^{1/2}\). The equation for \(u_{h}\) may be expressed as the fixed point condition for a non-linear map \(S_{h}\):

$$\begin{aligned} S_{h}(u_{h})=u_{h}, \end{aligned}$$

(whose role in our setting is played by the scaled logarithm of the Sinkhorn operator defined by formula 2.20). In practice, the map \(S_{h}\) is often taken to be a Newton type iteration. In experiments it has been computed with almost linear time-complexity \(O(N_{h})\) [4, 5]. Moreover, merely a few dozen iterations \(S_{h}^{(m_{h})}(u_{0})\) appear to be needed in order to obtain a good approximation of \(u_{h}\). In the semi-discrete approach the convergence of \(u_{h}\) towards u, as \(h\rightarrow 0\), follows directly from basic stability results for optimal transport plans. Moreover, in the case of the finite difference approach the convergence is established in [4] in the setting of domains. Another discretization of the Optimal Transport problem in domains, called the logarithmic discrete Monge–Ampère optimization problem, is introduced in [49] and the corresponding solution \(u_{h}\) is shown to converge towards u. Very recently, a general convergence framework was introduced in [38], showing how to slightly modify a range of existing numerical schemes (including [5]) to establish the convergence of the corresponding discrete solutions \(u_{h}\) towards u in the setting of domains.

However, the problem of establishing convergence rates as \(h\rightarrow 0\) is still open in these schemes (see the discussion in the survey [34]). From this point of view one of the main points of the present paper is to rigorously quantify how many iterations \(m_{h}\) are needed in the setting of the Sinkhorn iteration in order to approximate the solution u to nearly order O(h), for \(C^{2,\alpha }\)-data (and nearly order \(O(h^{2})\) for \(C^{\infty }\)-data). The answer is that nearly \(O(h^{-1})\) iterations are needed, according to Corollary 1.3 (see the discussion in Sect. 4.2). This is in line with the experimental findings in [35, 59] (see, in particular, [35, Figure 3.19] where \(\epsilon :=k^{-1}=10^{-4}\) and [59, Fig. 2]).

Remark 1.5

Interestingly, as discussed in detail in [35, 59], heuristics inspired by simulated annealing in numerics (aka \(\epsilon \)-scaling) and multi-scale techniques can be used to effectively reduce the number of Sinkhorn iterations needed to approximate the optimal transport potential u to a few dozen. In a nutshell, the idea is to fine-tune and gradually decrease both the parameter \(\epsilon \) and the spatial resolution h during the Sinkhorn iterations (see [35, Figure 3.26]). One can anticipate that variants of Theorems 1.2, 5.7 will play a role in making these heuristics and experimental findings mathematically rigorous with quantitative error estimates. The local density property in Definition 5.6 should be useful in this regard (note that the length scale \(k^{-1/2}(=\epsilon ^{1/2})\) appearing in Definition 5.6 corresponds to the notion of blurring scale in [35]). On the other hand, as discussed in Sect. 7, it is of independent interest to be able to approximate the solution \(u_{t}\) of the parabolic optimal transport equations at any finite time t, and then \(\epsilon \) must be kept fixed during the iterations (so that an \(O(\epsilon )\)-approximation of \(u_{t}\) is obtained after \(O(\epsilon ^{-1}t)\) iterations, if logarithmic factors are ignored).

In the setting of semi-discrete optimal transport the convergence of a damped Newton iteration towards \(u_{h}\) is established in [46] (under assumptions similar to those in Theorem 5.10) at a linear rate (in the exponential sense). Furthermore, rates of convergence of \(u_{h}\) towards u are established in [11]. However, the damping parameter and the rate established in [46] depend on the discrete solution \(u_{h}\) (and hence on h) in a rather complicated way and degenerate as \(N_{h}\) is increased, i.e. when h is decreased; see [46, Remark 1.3] and compare also with the discussion in Sect. 4.2. This means that, for the moment, the analog of Corollary 1.3 in the semi-discrete setting seems to be out of reach.

A damped Newton approach has also been applied directly to the Monge–Ampère equation and shown to converge to u at a linear rate in the periodic setting in [51] when the target density is constant and then in [58] in the general periodic setting. In these approaches the iteration \(u^{(m+1)}\) is obtained by inverting a second order linear elliptic operator (in non-divergence form) depending on \(u^{(m)}\). Various discretizations of these schemes are studied experimentally in [42, 51, 58, 70].

One advantage of the Sinkhorn framework over many other approaches, when applied to general manifolds, is that it is meshfree. In other words, it does not require generating a grid or a polyhedral tessellation of the manifolds, but only a suitable point cloud, which can be efficiently generated using Quasi-Monte Carlo methods. In the case of the round sphere various numerical algorithms have previously been explored in the literature: see [30, 70, 72] for experimental work on the case of the cost function \(d(x,y)^{2}\) and [18, 28] for the case of the cost function \(-\log |x-y|\), as applied to the reflector antenna problem in geometric optics.

1.4.2 Kähler geometry

The present results are very much inspired by an analogous setup which appears in complex (Kähler) geometry. Briefly, the role of the Sinkhorn algorithm is then played by Donaldson’s iteration, introduced in [32], whose fixed points are called balanced metrics and k appears as the power of a given ample line bundle L over X, with \(e^{-ku}\) playing the role of a Hermitian metric on L. Moreover, the role of the function \(\rho _{ku}\) (formula 2.23) is played by the (normalized) point-wise norm of the Bergman kernel on the diagonal, induced by the pair \((u,\mu )\). From this point of view the “static case” of Theorem 1.1 (and its generalization Theorem 3.3) is the analog of [10, Thm B] and the “dynamic case” of Theorem 1.2 is the analog of the result in [8] showing that Donaldson’s iteration converges to the Kähler-Ricci flow [20], as conjectured in [32]. In fact, identifying the real torus \(T^{n}\) with a reduction of the complex torus \(X:=\mathbb {C}^{n}/(\mathbb {Z}^{n}+i\mathbb {Z}^{n})\), the parabolic flow 1.7 can, in the case when g is constant, be identified with a twisted Kähler-Ricci flow [20] whose stationary solutions are Kähler potentials solving the corresponding complex Monge–Ampère equation (known as the Calabi–Yau equation in this context). But it should be stressed that a new feature of the analysis in the present paper, compared to the usual situation in Kähler geometry (apart from allowing a non-uniform target measure \(\nu \), i.e. a non-constant g), is that the source measure \(\mu \) is taken to be discrete and to depend on k, i.e. it is given by a discrete sequence \(\mu ^{(k)}\). In practice, such discretizations are used in implementations of Donaldson’s iteration, such as the experimental work [33], motivated by String Theory. The discreteness of \(\mu ^{(k)}\) leads to various technical complications that do not seem to have been studied rigorously in the Kähler geometry setting. Interestingly, the density condition on the sequence \(\mu ^{(k)}\) appearing in Lemma 3.1 can be viewed as a real analog of the Bernstein–Markov property for a sequence \(\mu ^{(k)}\), as studied in the complex geometric and pluripotential theoretic setting (see the discussion on page 8 in [9]). The relations between the real and complex settings will be expanded on elsewhere.

1.5 Organization

In Sect. 2 a general setting for iterations on C(X), generalizing the Sinkhorn algorithm, is introduced. The iteration in question, which is determined by a triple \((\mu ,\nu ,c)\), is essentially equivalent to the Iterative Proportional Fitting Procedure and the results in Sect. 2 are probably more or less well-known (except perhaps Theorem 2.9). But one point of the presentation is to exploit the variational structure. It can be viewed as a real analogue of the formalism introduced in [8], in the setting of Donaldson’s iteration [32] and it lends itself to various generalizations of the optimal transport problem (such as Monge–Ampère equations with exponential non-linearities). In Sect. 3 the variational structure is used to give a general convergence result for the Sinkhorn fixed points, which when specialized to the torus setting yields Theorem 1.1. Applications to the second boundary value problem for the Monge–Ampère operator in \(\mathbb {R}^{n}\) and to convolutional Wasserstein distances are also given. Then in Sect. 4 the convergence of the Sinkhorn iteration towards the parabolic Optimal Transport equations on the torus is shown (Theorem 1.2). The proof leverages the regularity theory and a priori estimates of the corresponding parabolic PDE on the torus (shown in “Appendix B”). In the following Sect. 5 the result is generalized to a rather general setting of optimal transport on compact manifolds. In Sect. 6 it is shown that nearly linear complexity can be achieved in the case of optimal transport on the torus and the sphere (which applies, in particular, to the reflector antenna problem). Section 7 gives an outlook on relations to singularity formation in the parabolic optimal transport equations. In particular, the variational approach to the Sinkhorn iteration introduced in Sect. 2 is exploited in order to propose a generalized notion of solution to the corresponding parabolic PDE. In “Appendix A” a proof of a discrete version of the classical stationary phase approximation is provided.

2 General setup and preliminaries

If Z is a compact topological space then we will denote by C(Z) the space of continuous functions on Z endowed with the sup-norm and by \(\mathcal {P}(Z)\) the space of all (Borel) probability measures on Z, endowed with the weak topology. Given a subset S of Z we will denote by \(\chi _{S}\) the function which is equal to 0 on S and infinity otherwise.

Throughout the paper we will assume given a triple \((\mu ,\nu ,c)\), where \(\mu \) and \(\nu \) are probability measures on compact topological spaces X and Y, respectively, and c is a function on \(X\times Y\). The function c will be assumed to be continuous in all sections except Sect. 5 (where we assume that c is lower semi-continuous). The supports of \(\mu \) and \(\nu \) will be denoted by \(X_{\mu }\) and \(Y_{\nu }\), respectively. Given \(u\in C(X)\) and \(v\in C(Y)\) we will, abusing notation slightly, identify u and v with their pull-backs to \(X\times Y\).

2.1 Recap of optimal transport and the c-Legendre transform

Let us start by recalling the standard setup for optimal transport (see the book [67] for further background). A probability measure \(\gamma \in \mathcal {P}(X\times Y)\) is said to be a transport plan (or coupling) between \(\mu \) and \(\nu \) if the push-forwards of \(\gamma \) to X and Y are equal to \(\mu \) and \(\nu \), respectively. The subspace of all such probability measures in \(\mathcal {P}(X\times Y)\) will be denoted by \(\Pi (\mu ,\nu )\). A transport plan in \(\Pi (\mu ,\nu )\) is said to be optimal wrt the cost function c if it realizes the following infimum:

$$\begin{aligned} \inf _{\gamma \in \Pi (\mu ,\nu )}\int _{X\times Y}c\gamma \end{aligned}$$

By weak compactness such an optimal transport plan always exists. The c-Legendre transform \(u^{c}\) of a function \(u\in C(X)\) is defined as the following function in C(Y):

$$\begin{aligned} u^{c}(y):=\sup _{x\in X}\left( -c(x,y)-u(x)\right) . \end{aligned}$$

Similarly, if \(v\in C(Y)\) then \(v^{c}\) is the function in C(X) defined by replacing u in the previous formula with v and taking the sup over Y. A function \(u\in C(X)\) is said to be c-convex if

$$\begin{aligned} (u^{c})^{c}=u \end{aligned}$$

Equivalently, u is c-convex iff there exists some \(v\in C(Y)\) such that \(u=v^{c}\). Indeed, this follows from the observation that \(u^{ccc}=u^{c}\) for any \(u\in C(X)\), which in turn follows from \(u^{cc}\le u\). The following functional on C(X) will be called the Kantorovich functional:

$$\begin{aligned} J(u):=\int u\mu +\int u^{c}\nu . \end{aligned}$$
(2.1)
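On finite grids both \(u^{c}\) and J are straightforward to evaluate; the following sketch (with C the cost matrix \(C_{ij}=c(x_{i},y_{j})\) and mu, nu the weight vectors of the discretized measures, all illustrative names) also checks the identity \(u^{ccc}=u^{c}\) used below:

```python
import numpy as np

def c_transform(u, C):
    """Discrete c-Legendre transform u^c(y_j) = max_i (-C[i, j] - u[i])."""
    return np.max(-C - u[:, None], axis=0)

def kantorovich_J(u, C, mu, nu):
    """The Kantorovich functional J(u) = int u dmu + int u^c dnu (2.1)."""
    return (u * mu).sum() + (c_transform(u, C) * nu).sum()

# Sanity check of u^{ccc} = u^c on random data:
rng = np.random.default_rng(1)
C = rng.random((5, 7))
u = rng.random(5)
v = c_transform(u, C)                        # u^c, on the y-grid
u_cc = c_transform(v, C.T)                   # (u^c)^c <= u, on the x-grid
assert np.allclose(c_transform(u_cc, C), v)  # u^{ccc} = u^c
```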

Proposition 2.1

(“Optimality criterion”) A transport plan \(\gamma \in \Pi (\mu ,\nu )\) is optimal iff there exists \(u\in C(X)\) which is c-convex and such that \(\gamma \) is supported in

$$\begin{aligned} \Gamma _{u}:=\left\{ (x,y)\in X\times Y:\,u(x)+u^{c}(y)+c(x,y)=0\right\} \end{aligned}$$
(2.2)

Moreover, if this is the case then

$$\begin{aligned} \int _{X\times Y}c\gamma =-J(u) \end{aligned}$$

Proof

This is standard and known as the Knott–Smith optimality criterion (in the Euclidean setting) [67]. For completeness we provide the simple proof of the direction that we shall use later on. Since \(u+u^{c}+c\ge 0\) on \(X\times Y\) the following lower bound holds for any given \(\gamma \in \Pi (\mu ,\nu )\)

$$\begin{aligned} \int _{X\times Y}c\gamma \ge -\inf _{u\in C(X)}J(u) \end{aligned}$$
(2.3)

Now, if \(\gamma \) is supported in \(\Gamma _{u}\) it follows directly that \(\int _{X\times Y}c\gamma =-J(u)\) and hence \(\gamma \) attains the lower bound above, i.e. \(\gamma \) is optimal. \(\square \)

Remark 2.2

A byproduct of Theorem 3.3 below (applied to the case when \(\mu _{k}=\mu \) and \(\nu _{k}=\nu \) for all k) is a proof that there always exists a transport plan \(\gamma _{*}\) with support in \(\Gamma _{u}\) for some c-convex function. Since \(\gamma _{*}\) saturates the lower bound 2.3 it follows that taking the infimum over all \(\gamma \) in \(\Pi (\mu ,\nu )\) yields equality in 2.3. As a consequence,

$$\begin{aligned} \inf _{\gamma \in \Pi (\mu ,\nu )}\int _{X\times Y}c\gamma =\sup _{(u,v)\in \Phi _{c}}\left( \int u\mu +\int v\nu \right) ,\,\,\,\Phi _{c}:=\left\{ (u,v)\in C(X)\times C(Y):\,u+v\le c\right\} \end{aligned}$$
(2.4)

This is the content of Kantorovich duality, which is usually shown using Rockafellar–Fenchel duality in topological vector spaces [67].

2.1.1 The torus setting

In this section we consider the case when \(X=Y=T^{n}:=\mathbb {R}^{n}/\mathbb {Z}^{n}\) and the cost function \(c:=d_{T^{n}}^{2}/2\) is half the squared standard distance function on \(T^{n}\). We will identify a function u on \(T^{n}\) with a \(\mathbb {Z}^{n}\)-periodic function on \(\mathbb {R}^{n}\) in the usual way. Similarly, we identify the cost function c on \(T^{n}\) with a function on \(\mathbb {R}^{n}\times \mathbb {R}^{n}\), which is \(\mathbb {Z}^{n}\)-periodic in each argument

$$\begin{aligned} c(x,y):=\frac{1}{2}d_{T^{n}}(x,y)^{2}:=\frac{1}{2}\inf _{m\in \mathbb {Z}^{n}}|x+m-y|^{2} \end{aligned}$$
(2.5)

Note that \(c(x,\cdot )\) is Lipschitz with Lipschitz constant \(\sqrt{n}\) on \(T^{n}\) (endowed with its standard metric). As a consequence any c-convex function u on \(T^{n}\) is also Lipschitz with Lipschitz constant \(\sqrt{n}\). In this particular case a c-convex function will be called quasi-convex and we will say that \(u\in C^{2}(T^{n})\) is strictly quasi-convex if \(\nabla ^{2}u+I>0\).

Lemma 2.3

Let \(u\in C^{2}(T^{n})\) be strictly quasi-convex. Then

  • the map

    $$\begin{aligned} x\mapsto y_{x}:=x+(\nabla u)(x) \end{aligned}$$
    (2.6)

    defines a \(C^{1}\)-diffeomorphism of \(T^{n}\).

  • \(u^{c}\) is also a strictly quasi-convex \(C^{2}\)-function on \(T^{n}\) and the corresponding map

    $$\begin{aligned} y\mapsto x_{y}:=y+(\nabla u^{c})(y) \end{aligned}$$
    (2.7)

    is the inverse of the map 2.6 and the following matrix relation holds

    $$\begin{aligned} (\nabla ^{2}u+I)(x_{y})^{-1}=(\nabla ^{2}u^{c}+I)(y) \end{aligned}$$
    (2.8)

Conversely, if \(u\in C^{2}(T^{n})\) is quasi-convex and the map 2.6 is a \(C^{1}\)-diffeomorphism of \(T^{n}\), then u is strictly quasi-convex. Moreover, if \(u\in C^{k}(T^{n})\) is strictly quasi-convex, for k a positive integer and \(k\ge 2\), then \(u^{c}\in C^{k}(T^{n})\).

Proof

Given a function \(\phi \) on \(\mathbb {R}^{n}\) we denote by \(\phi ^{*}\) its classical Legendre transform:

$$\begin{aligned} \phi ^{*}(y):=\sup _{x\in \mathbb {R}^{n}}x\cdot y-\phi (x) \end{aligned}$$
(2.9)

(in other words, this is the c-Legendre transform wrt \(c(x,y):=-x\cdot y)\). Given a \(\mathbb {Z}^{n}\)-invariant quasi-convex function u on \(\mathbb {R}^{n}\) we set \(\phi (x):=u(x)+|x|^{2}/2\). Then it follows directly from the definitions that \(\phi \) is convex and

$$\begin{aligned} \phi ^{*}(y)=u^{c}(y)+|y|^{2}/2,\,\,\,x+(\nabla u)(x)=\nabla \phi (x),\,\,\nabla ^{2}u+I=\nabla ^{2}\phi \end{aligned}$$
(2.10)

Next note that since \(u^{c}\) is continuous and periodic it is bounded and hence \(\phi ^{*}\) is finite on all of \(\mathbb {R}^{n}\) with quadratic growth. As a consequence, given \(y\in \mathbb {R}^{n}\), the function \(x\mapsto x\cdot y-\phi (x)\) attains its sup at some point \(x_{y}\in \mathbb {R}^{n}\)

$$\begin{aligned} \phi ^{*}(y)=x_{y}\cdot y-\phi (x_{y}) \end{aligned}$$
(2.11)

and since the point is a local maximum we have \(y=\nabla \phi (x_{y})\). This shows that \(\nabla \phi \) maps \(\mathbb {R}^{n}\) surjectively onto \(\mathbb {R}^{n}\). Moreover, since \(\nabla ^{2}\phi >0\), the function \(\phi \) is strictly convex and \(x_{y}\) is uniquely determined. Thus the \(C^{1}\)-map \(\nabla \phi \) is a bijection and moreover its inverse \(y\mapsto x_{y}\) is also a \(C^{1}\)-map (by the inverse function theorem), which proves the first claim in the lemma. Moreover, since \(x_{y}\) is the unique maximizer in the sup defining \(\phi ^{*}\), the function \(\phi ^{*}\) is differentiable with gradient \(x_{y}\) at y. Hence, the \(C^{1}\)-inverse of \(\nabla \phi \) is given by \(\nabla \phi ^{*}\), showing that \(\phi ^{*}\) is in \(C^{2}(\mathbb {R}^{n})\). Differentiating the identity \(\nabla \phi ^{*}\circ \nabla \phi =I\) finally proves 2.8, and the last statement follows from the implicit function theorem: \(\nabla \phi \) being \(C^{k-1}\) implies (since \(\nabla ^{2}\phi >0\)) that its inverse \(\nabla \phi ^{*}\) is also \(C^{k-1}\). Finally, if \(u\in C^{2}(T^{n})\) is quasi-convex and the map 2.6 is a diffeomorphism of \(T^{n}\), then differentiating \((\nabla \phi )^{-1}\circ \nabla \phi =I\) reveals that the non-negative matrix \(\nabla ^{2}\phi \) is non-degenerate, hence strictly positive, i.e. u is strictly quasi-convex. \(\square \)
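The relation 2.10 between the c-Legendre transform on \(T^{n}\) and the classical Legendre transform is easy to test numerically; the following one-dimensional sketch (with a hypothetical amplitude 0.02, chosen small enough that u is strictly quasi-convex) compares the two transforms on a grid:

```python
import numpy as np

h = 1e-3
xs_torus = np.arange(0.0, 1.0, h)       # one period, representing T^1
xs_real = np.arange(-1.0, 2.0, h)       # window of R covering the sup
ys = np.arange(0.0, 1.0, h)

u = lambda x: 0.02 * np.cos(2 * np.pi * x)   # strictly quasi-convex on T^1

# u^c(y) = max_x ( -d_T(x, y)^2 / 2 - u(x) ), with the torus distance:
d = np.abs(xs_torus[:, None] - ys[None, :])
c = np.minimum(d, 1.0 - d)**2 / 2
u_c = np.max(-c - u(xs_torus)[:, None], axis=0)

# phi*(y) = sup_x ( x y - phi(x) ) for phi(x) = u(x) + |x|^2 / 2:
phi = u(xs_real) + xs_real**2 / 2
phi_star = np.max(xs_real[:, None] * ys[None, :] - phi[:, None], axis=0)

# Relation (2.10): phi*(y) = u^c(y) + |y|^2 / 2, up to grid error:
print(np.abs(u_c - (phi_star - ys**2 / 2)).max())  # small
```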

We will also have use for the following

Lemma 2.4

Assume that u is \(C^{1}\)-smooth and strictly quasi-convex. Then, for any fixed \(y\in T^{n}\), the infimum of the function \(x\mapsto c(x,y)+u(x)\) on \(T^{n}\) is attained at the unique point \(x=x_{y}\) (defined by formula 2.7). Moreover, the function \(x\mapsto c(x,y)\) is smooth on some neighborhood of \(x_{y}\) in \(T^{n}\) and its Hessian is equal to the identity there.

Proof

First observe that, given \(y\in T^{n}\), the infimum in question is attained at \(x_{y}\) (defined by formula 2.7), as follows directly from combining formulas 2.11 and 2.10. Representing \(x_{y}\) and y with points in \(\mathbb {R}^{n}\) we thus have

$$\begin{aligned} d_{T^{n}}(x_{y},y)^{2}=\inf _{m\in \mathbb {Z}^{n}}|x_{y}+m-y|^{2}=|x_{y}+m_{0}-y|^{2} \end{aligned}$$

for some \(m_{0}\in \mathbb {Z}^{n}\). We claim that, under the assumptions of the lemma, the inf above is uniquely attained at \(m_{0}\), i.e.

$$\begin{aligned} m\ne m_{0}\implies |x_{y}+m-y|^{2}>|x_{y}+m_{0}-y|^{2}. \end{aligned}$$

To see this we note that, since u is periodic, when viewed as a function on \(\mathbb {R}^{n}\), we have

$$\begin{aligned} \inf _{x\in \mathbb {R}^{n}}d_{T^{n}}(x,y)^{2}/2+u(x)=\inf _{x\in \mathbb {R}^{n}}|x-y|^{2}/2+u(x) \end{aligned}$$
(2.12)

Since the inf in the left hand side above is attained at \(x_{y}\) so is the inf in the right hand side. Now assume, to get a contradiction, that the claim above does not hold, i.e. there exists a non-zero \(m\in \mathbb {Z}^{n}\) such that \(|x_{y}+m-y|=|x_{y}+m_{0}-y|\). This implies that the inf in the right hand side in formula 2.12 is attained both at \(x_{y}\) and at \(x_{y}+m\) (since u is periodic). But this contradicts the fact that the function \(x\mapsto |x-y|^{2}/2+u(x)\) is strictly convex on \(\mathbb {R}^{n}\) (by the assumed strict quasi-convexity of u on \(T^{n}\)). Finally, the claim shows, since the inequality in the claim is preserved when \(x_{y}\) is perturbed slightly, that \(d_{T^{n}}(x,y)^{2}=|x-y|^{2}\) for all x sufficiently close to \(x_{y}\). Hence, \(x\mapsto d_{T^{n}}(x,y)^{2}/2\) is smooth there and its Hessian is constant, as desired. \(\square \)

2.2 The log Sinkhorn iteration on C(X)

In this section we will consider an iteration on C(X), which can be viewed as a reformulation of the Sinkhorn algorithm and the Iterative Proportional Fitting Procedure recalled in Sect. 1.1.1 (see Sect. 2.3.1). Given data \((\mu ,\nu ,c)\), as in Sect. 2.1, we first introduce the following maps

$$\begin{aligned} T_{\mu }:\,C(X)\rightarrow C(Y),\,\,\,u\mapsto v[u]:=\log \int e^{-c(x,\cdot )-u(x)}\mu (x) \end{aligned}$$

and

$$\begin{aligned} T_{\nu }:\,C(Y)\rightarrow C(X),\,\,\,v\mapsto u[v]:=\log \int e^{-c(\cdot ,y)-v(y)}\nu (y) \end{aligned}$$

(abusing notation slightly we will write \(T_{\mu }(u)=v[u]\) etc). This yields an iteration on C(X) defined by

$$\begin{aligned} u_{m+1}:=S[u_{m}], \end{aligned}$$
(2.13)

where S is defined as the composed operator \(T_{\nu }\circ T_{\mu }\) on C(X):

$$\begin{aligned} S:\,C(X)\rightarrow C(X),\,\,\,u\mapsto u[v[u]] \end{aligned}$$

For lack of a better name, the iteration 2.13 will be called the log Sinkhorn iteration and the operator S will be called the log Sinkhorn operator. It will be convenient to rewrite it as the following difference equation:

$$\begin{aligned} u_{m+1}-u_{m}=\log (\rho _{u_{m}}), \end{aligned}$$
(2.14)

where \(\rho _{u}\) is defined by

$$\begin{aligned} \rho _{u}:=e^{S[u]-u} \end{aligned}$$
(2.15)

and has the property that \(\rho _{u}\mu \) is a probability measure on X (as follows directly from the definitions).
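On finite grids the maps \(T_{\mu }\), \(T_{\nu }\), S and the density \(\rho _{u}\) take the following form (a sketch; C denotes the cost matrix on the product of the two grids and log_mu, log_nu the logarithms of the weights of the discretized measures, all illustrative names):

```python
import numpy as np
from scipy.special import logsumexp

def T_mu(u, C, log_mu):
    """v[u](y) = log int e^{-c(x, y) - u(x)} dmu(x), on finite grids."""
    return logsumexp(-C - u[:, None] + log_mu[:, None], axis=0)

def T_nu(v, C, log_nu):
    """u[v](x) = log int e^{-c(x, y) - v(y)} dnu(y)."""
    return logsumexp(-C - v[None, :] + log_nu[None, :], axis=1)

def S(u, C, log_mu, log_nu):
    """The log Sinkhorn operator S = T_nu o T_mu of the iteration 2.13."""
    return T_nu(T_mu(u, C, log_mu), C, log_nu)

def rho(u, C, log_mu, log_nu):
    """rho_u = e^{S[u] - u} of formula 2.15."""
    return np.exp(S(u, C, log_mu, log_nu) - u)

# Check that rho_u mu is a probability measure, on random discrete data:
rng = np.random.default_rng(2)
C = rng.random((6, 8))
mu = rng.random(6)
mu /= mu.sum()
nu = rng.random(8)
nu /= nu.sum()
u = rng.random(6)
print((rho(u, C, np.log(mu), np.log(nu)) * mu).sum())  # ~ 1.0
```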

In this section we will use a variational approach to study the log Sinkhorn iteration. An alternative approach will also be used in Sect. 4, which relies on the observation that the log Sinkhorn iteration contracts the \(L^{\infty }\)-distance on C(X) (see Step 2 in the proof of Lemma 4.4).

2.2.1 Existence and uniqueness of fixed points

Consider the following functional \(\mathcal {F}\) on C(X):

$$\begin{aligned} \mathcal {F}:=I_{\mu }-\mathcal {L},\,\,\,I_{\mu }(u)=\int _{X}u\mu ,\,\,\,\mathcal {L}(u):=-\int _{Y}v[u]\nu . \end{aligned}$$
(2.16)

Note that \(I_{\mu }\) and \(\mathcal {L}\) are equivariant under the additive action of \(\mathbb {R}\) and hence \(\mathcal {F}\) is invariant.

Remark 2.5

This functional can be viewed as an analog of the Kantorovich functional J(u) (formula 2.1), where the c-Legendre transform \(u^{c}\) is replaced by v[u]. From a numerical perspective this amounts to replacing the supremum defining \(u^{c}\) by a “soft max” [27]. It is well-known that \(\mathcal {F}\) decreases under the log Sinkhorn iteration, i.e. that \(\mathcal {F}(Su)\le \mathcal {F}(u)\). This is usually shown using block coordinate descent on the “dual functional” to the “primal” minimization problem in Prop 2.12; see [27, Prop 4.21] and [66, Section 3.1]. But here we observe that, in fact, both functionals \(I_{\mu }\) and \(-\mathcal {L}\) decrease along the iteration (see Step 1 in the proof of Theorem 2.9). This will be important in the proof of the last step of Theorem 2.9 and in the proof of Prop 7.1.

Lemma 2.6

The following are equivalent:

  • u is a critical point for the functional \(\mathcal {F}\) on C(X)

  • \(\rho _{u}=1\) a.e. with respect to \(\mu \)

Moreover, if u is a critical point, then \(u_{*}:=S(u)\) is a fixed point for the operator S on C(X)

Proof

First observe that the differential of the functional \(\mathcal {L}\) defined in formula 2.16, at an element \(u\in C(X)\), is represented by the probability measure \(\rho _{u}\mu \), where \(\rho _{u}\) is defined by formula 2.15. This means that for any \(\dot{u}\in C(X)\)

$$\begin{aligned} \frac{d(\mathcal {L}(u+t\dot{u}))}{dt}|_{t=0}=\int \dot{u}\rho _{u}\mu . \end{aligned}$$

This follows readily from the definitions by differentiating \(t\mapsto v[(u+t\dot{u})]\) to get an integral over \((X,\mu )\) and then switching the order of integration. As a consequence, u is a critical point of the functional \(\mathcal {F}\) on C(X) iff \(\rho _{u}\mu =\mu \), i.e. iff \(\rho _{u}=1\) a.e. with respect to \(\mu \). Finally, if this is the case then \(S(u)=u\) a.e. wrt \(\mu \) and hence \(S(S(u))=S(u)\) (since S(f) only depends on f viewed as an element in \(L^{1}(X,\mu )\)).

\(\square \)
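The computation of the differential in the proof can be checked by finite differences, reusing the discrete sketch after formula 2.15 above:

```python
# dL(u)[du] should equal int du * rho_u dmu, where L(u) = -int v[u] dnu:
L = lambda w: -(T_mu(w, C, np.log(mu)) * nu).sum()
du = rng.random(6)
t = 1e-6
lhs = (L(u + t * du) - L(u)) / t                        # d/dt L(u + t du)
rhs = (du * rho(u, C, np.log(mu), np.log(nu)) * mu).sum()
print(lhs, rhs)                                         # agree to O(t)
```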

The following basic compactness property holds:

Lemma 2.7

Given a point \(x_{0}\in X\) the subset \(\mathcal {K}_{x_{0}}\) of C(X) defined as all elements u in the image of S satisfying \(u(x_{0})=0\) is compact in C(X).

Proof

First observe that, since \(X\times Y\) is assumed compact, the continuous function c is, in fact, uniformly continuous on \(X\times Y\). Hence, it follows from the very definition of S that S(C(X)) is an equicontinuous family of continuous functions on X. By the Arzelà–Ascoli theorem it follows that the set \(\mathcal {K}_{x_{0}}\) is compact in C(X). \(\square \)

Using the previous two lemmas gives the following

Proposition 2.8

The operator S has a fixed point \(u_{*}\) in C(X). Moreover, \(u_{*}\) is uniquely determined a.e. wrt \(\mu \) up to an additive constant and \(u_{*}\) minimizes the functional \(\mathcal {F}\). More precisely, there exists a unique fixed point in \(S(C(X))/\mathbb {R}\).

Proof

We start by noting that

$$\begin{aligned} \mathcal {F}(Su)\le \mathcal {F}(u) \end{aligned}$$

(this is shown in the first step of Theorem 2.9 below). Since \(\mathcal {F}\) is invariant under the natural \(\mathbb {R}\)-action we conclude that

$$\begin{aligned} \inf _{C(X)}\mathcal {F}=\inf _{\mathcal {K}_{0}}\mathcal {F}, \end{aligned}$$

where \(\mathcal {K}_{0}\) denotes the compact subset of C(X) appearing in Lemma 2.7. Since \(\mathcal {F}\) is clearly continuous on C(X) this implies the existence of a minimizer of \(\mathcal {F}\) which is moreover in \(\mathcal {K}_{0}\).

Next observe that \(\mathcal {F}\) is convex on C(X). Indeed, for any fixed \(y\in Y\), \(u\mapsto v[u](y)\) is convex on C(X), as follows directly from Jensen’s inequality. Hence, \(-\mathcal {L}\) is convex and since \(I_{\mu }\) is affine we conclude that \(\mathcal {F}\) is convex. More precisely, Jensen’s (or Hölder’s) inequality implies that \(\mathcal {F}\) is strictly convex on \(C(X)/\mathbb {R}\) viewed as a subset of \(L^{1}(\mu )/\mathbb {R}\). Hence, if \(u_{0}\) and \(u_{1}\) are two minimizers, then there exists a constant C such that \(u_{0}=u_{1}+C\) a.e. wrt \(\mu \). In particular, if \(C=0\) then \(u_{*}:=S(u_{0})=S(u_{1})\) gives the same fixed point of S. \(\square \)

2.2.2 Monotonicity and convergence properties of the iteration

We next establish the following result, which can be seen as a refinement, in the present setting, of the convergence of the general Iterative Proportional Fitting Procedure established in [55]. The result will be used in the proof of Proposition 7.1.

Theorem 2.9

Given \(u_{0}\in C(X)\) the corresponding iteration \(u_{m}:=S^{m}u_{0}\) converges uniformly to a fixed point \(u_{\infty }\) of S.

Proof

Step 1: \(I_{\mu }\) and \(-\mathcal {L}\) are decreasing along the iteration and hence \(\mathcal {F}\) is also decreasing. The functionals are strictly decreasing at \(u_{m}\) unless \(S(u_{*})=u_{*}\) for \(u_{*}:=S(u_{m})\).

Using the difference equation 2.14 for \(u_{m}\) and Jensen’s inequality, we have

$$\begin{aligned} I_{\mu }(u_{m+1})-I_{\mu }(u_{m})=\int \log \rho _{u_{m}}\mu \le \log \int \rho _{u_{m}}\mu =\log 1=0 \end{aligned}$$

Moreover, equality holds unless \(\rho _{u_{m}}=1\) a.e. wrt \(\mu \), i.e. \(S(u_{m})=u_{m}\) a.e. wrt \(\mu \), in which case \(S(u_{*})=u_{*}\) everywhere on X. Similarly, by symmetry,

$$\begin{aligned} \mathcal {L}(u_{m})-\mathcal {L}(u_{m+1})=\int \log \rho _{v_{m}}\nu \le \log \int \rho _{v_{m}}\nu =\log 1=0, \end{aligned}$$

where now \(\rho _{v}\), for \(v\in C(Y)\), denotes the function on Y defined as in formula 2.15, but with the roles of \(\mu \) and \(\nu \) interchanged (so that \(\rho _{v}\nu \) is a probability measure on Y).

Step 2: Convergence in \(C(X)/\mathbb {R}\).

Given the initial data \(u_{0}\) we denote by \(\mathcal {K}_{u_{0}}\) the closure of the orbit of \(u_{0}\) in C(X) under repeated application of S. By Lemma 2.7, \(\mathcal {K}_{u_{0}}/\mathbb {R}\) is compact in \(C(X)/\mathbb {R}\). Hence, after perhaps passing to a subsequence, \(u_{m}\rightarrow u_{\infty }\) in \(C(X)/\mathbb {R}\). Now, since \(\mathcal {F}\) is decreasing along the orbit we have

$$\begin{aligned} \mathcal {F}(u_{\infty })=\inf _{\mathcal {K}_{0}}\mathcal {F}. \end{aligned}$$

Hence, by the condition for strict monotonicity it must be that \(Su_{\infty }=u_{\infty }\) a.e. wrt \(\mu \) and hence, since \(u_{\infty }\) is in the image of S, it follows that \(Su_{\infty }=u_{\infty }\) on all of X. It then follows from Proposition 2.8 that \(u_{\infty }\) is uniquely determined in \(C(X)/\mathbb {R}\) (by the initial data \(u_{0}\)), i.e. the whole sequence converges in \(C(X)/\mathbb {R}\).

Step 3: Convergence in C(X)

Let us first show that there exists a number \(\lambda \in \mathbb {R}\) such that

$$\begin{aligned} \lim _{m\rightarrow \infty }I_{\mu }(u_{m})=\lambda . \end{aligned}$$
(2.17)

By Step 1 \(I_{\mu }\) is decreasing and hence it is enough to show that \(I_{\mu }(u_{m})\) is bounded from below. But \(I_{\mu }=\mathcal {F}+\mathcal {L}\), where, by Prop 2.8 (or the previous step) \(\mathcal {F}\) is bounded from below (by \(\mathcal {F}(u_{\infty }))\). Moreover, by the first step \(\mathcal {L}(u_{m})\ge \mathcal {L}(u_{0})\), which concludes the proof of 2.17. Next, decompose

$$\begin{aligned} u_{m}=\tilde{u}_{m}+u_{m}(x_{0}),\,\,\,\tilde{u}_{m}(x_{0})=0. \end{aligned}$$

By Lemma 2.7 the sequence \((\tilde{u}_{m})\) is relatively compact in C(X) and we claim that \(|u_{m}(x_{0})|\le C\) for some constant C. Indeed, if this is not the case then there is a subsequence \(u_{m_{j}}\) such that \(|u_{m_{j}}|\rightarrow \infty \) uniformly on X. But this contradicts that \(I_{\mu }(u_{m})\) is uniformly bounded (by 2.17). It follows that the sequence \((u_{m})\) is also relatively compact. Hence, by the previous step the whole sequence \(u_{m}\) converges to the unique minimizer \(u_{*}\) of \(\mathcal {F}\) in S(C(X)) satisfying \(I_{\mu }(u_{*})=\lambda \). \(\square \)

Remark 2.10

The convergence result in [55] is, in the present setting, equivalent to convergence of the induced iteration on \(C(X)/\mathbb {R}\). In fact, the latter convergence holds at linear rate, i.e. there exists a norm \(\left\| \cdot \right\| _{C(X)/\mathbb {R}}\) on \(C(X)/\mathbb {R}\) and a positive number \(\delta \) such that \(\left\| u_{m}-u_{\infty }\right\| _{C(X)/\mathbb {R}}\le Ce^{-\delta m}\). Indeed, setting \(\left\| u-u'\right\| _{C(X)/\mathbb {R}}:=\sup _{X}(u-u')-\inf _{X}(u-u')\) (which corresponds, under \(u\mapsto e^{-u}\), to the Hilbert metric on the cone of positive functions in C(X)) this follows from Birkhoff’s theorem about positive operators on cones, precisely as in the finite dimensional situation of the Sinkhorn iteration considered in [36]; see also [27, Thm 4.2] and [66].
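
The linear rate is easy to observe numerically on a finite problem. The following minimal sketch (the kernel, marginals and sizes are our own toy choices) runs the Sinkhorn iteration of Sect. 1.1.1 and prints the distance of \(\log a(m)\) to \(\log a(\infty )\) in the oscillation seminorm; the decay is geometric, in accordance with Birkhoff’s theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
K = np.exp(-rng.random((N, N)))   # a generic positive kernel
p = np.full(N, 1.0 / N)           # prescribed row sums
q = np.full(N, 1.0 / N)           # prescribed column sums

def osc(w):
    # oscillation seminorm sup(w) - inf(w), i.e. the norm on C(X)/R
    return w.max() - w.min()

a, iterates = np.ones(N), []
for m in range(80):               # the Sinkhorn iteration of Sect. 1.1.1
    b = p / (K @ a)
    a = q / (K.T @ b)
    iterates.append(a.copy())

a_lim = iterates[-1]              # proxy for the limit vector
for m in [0, 10, 20, 30, 40]:
    print(m, osc(np.log(iterates[m] / a_lim)))
```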

2.2.3 The induced discrete evolution on \(C(X)\times C(Y)\)

Fixing an initial function \(u_{0}\in C(X)\) the corresponding evolution \(m\mapsto u_{m}\) induces a sequence of pairs \((u_{m},v_{m})\in C(X)\times C(Y)\) defined by the following recursion:

$$\begin{aligned} (u_{m+1},v_{m+1}):=(u[v_{m+1}],v[u_{m}]) \end{aligned}$$

2.2.4 The induced discrete evolution on \(\mathcal {P}(X\times Y)\) and entropy

Let us briefly explain the dual point of view involving the space \(\mathcal {M}(X\times Y)\) of measures on \(X\times Y\) (which, however, is not needed for the proofs of the main results). The data \((\mu ,\nu ,c)\) induces the following element \(\gamma _{c}\in \mathcal {M}(X\times Y):\)

$$\begin{aligned} \gamma _{c}:=e^{-c}\mu \otimes \nu \end{aligned}$$

Given a function \(u\in C(X)\) we will write

$$\begin{aligned} \gamma _{u}:=e^{-(u+v[u])}\gamma _{c} \end{aligned}$$
(2.18)

Lemma 2.11

u satisfies \(S(u)=u\) a.e. wrt \(\mu \) iff \(\gamma _{u}\in \Pi (\mu ,\nu )\).

Proof

A direct computation reveals that the push-forwards of \(e^{-(u+v[u])}\gamma _{c}\) to Y and X, respectively, are given by

$$\begin{aligned} \int _{X}e^{-(u+v[u])}\gamma _{c}=\nu ,\,\,\,\,\int _{Y}e^{-(u+v[u])}\gamma _{c}=\rho _{u}\mu \end{aligned}$$

Hence, \(\gamma _{u}\in \Pi (\mu ,\nu )\) iff \(\rho _{u}\mu =\mu \), which, by the definition 2.15 of \(\rho _{u}\), concludes the proof. \(\square \)

The discrete dynamical system \(u_{m}\) induces a sequence

$$\begin{aligned} \gamma _{m}:=\gamma _{u_{m}}(=e^{-u_{m}(x)}e^{-v_{m}(y)}\gamma _{c})\in \mathcal {P}(X\times Y) \end{aligned}$$

Proposition 2.12

The unique minimizer \(\gamma _{*}\) of the functional \(\mathcal {I}(\cdot |\gamma _{c})\) on \(\Pi (\mu ,\nu )\) is characterized by the property that it has the form

$$\begin{aligned} \gamma _{*}=e^{-\Phi }\gamma _{c} \end{aligned}$$

for some \(\Phi \in C(X)+C(Y)\). Moreover, \(\gamma _{*}=\gamma _{u_{*}}\), where \(u_{*}\) is a fixed point for S on C(X) (or more generally, on \(L^{1}(X,\mu )\)) and

$$\begin{aligned} \inf _{\Pi (\mu ,\nu )}\mathcal {I}(\cdot |\gamma _{c})=\inf _{C(X)\times C(Y)}\mathcal {F} \end{aligned}$$
(2.19)

and given any function \(u_{0}\in C(X)\), the corresponding sequence \(\gamma _{m}\) converges in \(L^{1}\) (i.e. in variation norm) towards \(\gamma _{*}\) (and moreover \(\mathcal {I}(\gamma _{m}|\gamma _{*})\rightarrow 0)\).

Proof

By construction \(\gamma _{*}:=\gamma _{u_{*}}\) has the property that

$$\begin{aligned} \gamma _{*}=e^{-\Phi }\gamma _{c},\,\,\,\gamma _{*}\in \Pi (\mu ,\nu ) \end{aligned}$$

for some \(\Phi \in L^{\infty }(X)+L^{\infty }(Y)\). But a standard calculus argument reveals that any such \(\gamma _{*}\) is the unique minimizer of the restriction of \(\mathcal {I}\) to the affine subspace \(\Pi (\mu ,\nu )\) of \(\mathcal {P}(X\times Y)\) (using that \(\mathcal {I}\) is strictly convex). The last convergence statement then follows directly from Theorem 2.9 (only the easier convergence in \(C(X)/\mathbb {R}\) is needed).

\(\square \)

Rewriting

$$\begin{aligned} k^{-1}\mathcal {I}(\gamma |\gamma _{kc})=\int c\gamma +k^{-1}\mathcal {I}(\gamma |\mu \otimes \nu ), \end{aligned}$$

the equality 2.19 can be viewed as an entropic variant of Kantorovich duality 2.4 in the limit when c is replaced by kc for a large positive number k. In fact, it follows from Theorem 3.3 applied to \(\mu _{k}=\mu \) and \(\nu _{k}=\nu \) that

$$\begin{aligned} \lim _{k\rightarrow \infty }\inf _{\gamma \in \Pi (\mu ,\nu )}k^{-1}\mathcal {I}(\gamma |\gamma _{kc})=\inf _{\gamma \in \Pi (\mu ,\nu )}\int c\gamma =\sup _{\Phi _{c}}\int u\mu +\int v\nu , \end{aligned}$$

as in the Kantorovich duality 2.4. In the next section we will consider the setting where \(\mu \) and \(\nu \) also change with k.

2.3 The scaled setting and discretization

Let us next consider the following variant of the previous setting, parametrized by a parameter k (which is the parameter that will later on tend to infinity and which corresponds to the entropic regularization parameter \(\epsilon :=k^{-1}\)). This means that we replace the triple \((\mu ,\nu ,c)\) with a sequence \((\mu ^{(k)},\nu ^{(k)},kc)\). As explained in Sect. 1.1.2 replacing c with kc corresponds to introducing the entropic regularization parameter \(\epsilon =k^{-1}\). We then rescale the functions in C(X) and C(Y) by k and consider the corresponding rescaled operators:

$$\begin{aligned} v^{(k)}[u]&:=k^{-1}\log \int e^{-kc(x,\cdot )-ku(x)}\mu ^{(k)}(x)\nonumber \\ u^{(k)}[v]&:=k^{-1}\log \int e^{-kc(\cdot ,y)-kv(y)}\nu ^{(k)}(y)\nonumber \\ S^{(k)}(u)&:=k^{-1}S(ku) \end{aligned}$$
(2.20)

etc. The corresponding rescaled iteration is thus defined by

$$\begin{aligned} u_{m+1}^{(k)}:=S^{(k)}u_{m}^{(k)}\in C(X), \end{aligned}$$
(2.21)

given the initial value \(u_{0}^{(k)}\in C(X)\). It will be called the scaled log Sinkhorn iteration (at level k). Equivalently,

$$\begin{aligned} u_{m+1}^{(k)}-u_{m}^{(k)}=k^{-1}\log (\rho _{ku_{m}^{(k)}}), \end{aligned}$$
(2.22)

where

$$\begin{aligned} \rho _{ku}(x):=\frac{e^{ku^{(k)}\left[ v^{(k)}[u]\right] (x)}}{e^{ku(x)}}, \end{aligned}$$
(2.23)

which can be explicitly expressed as

$$\begin{aligned} \rho _{ku}(x)=\int _{Y}\frac{e^{-kc(x,y)-ku(x)}}{\int _{X}e^{-kc(x',y)-ku(x')}\mu ^{(k)}(x')}\nu ^{(k)}(y) \end{aligned}$$

We also set

$$\begin{aligned} \mathcal {F}^{(k)}(u):=k^{-1}\mathcal {F}(ku)=\int u\mu ^{(k)}+\int v^{(k)}[u]\nu ^{(k)}. \end{aligned}$$
(2.24)

By Theorem 2.9 (applied to a fixed k) as \(m\rightarrow \infty \) the iteration \(u_{m}^{(k)}\) converges in C(X) to a fixed point \(u^{(k)}\) of the operator \(S^{(k)}\) (uniquely determined by the initial value \(u_{0}^{(k)}\)).
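
In floating point arithmetic the scaled operators 2.20 are best evaluated through log-sum-exp rather than by forming \(e^{-kc}\) directly. The following minimal sketch of the iteration 2.21 for discrete measures \(\mu ^{(k)}=\sum _{i}p_{i}\delta _{x_{i}}\) and \(\nu ^{(k)}=\sum _{j}q_{j}\delta _{y_{j}}\) is our own illustration (including the toy cost on \(\mathbb {R}/\mathbb {Z}\)):

```python
import numpy as np
from scipy.special import logsumexp

def S_k(u, C, logp, logq, k):
    """One scaled log Sinkhorn step u -> u^(k)[v^(k)[u]] (formula 2.20),
    for the cost matrix C[i, j] = c(x_i, y_j)."""
    # v^(k)[u](y_j) = k^{-1} log sum_i exp(-k c_ij - k u_i) p_i
    v = logsumexp(-k * (C + u[:, None]) + logp[:, None], axis=0) / k
    # u^(k)[v](x_i) = k^{-1} log sum_j exp(-k c_ij - k v_j) q_j
    return logsumexp(-k * (C + v[None, :]) + logq[None, :], axis=1) / k

# toy data: N grid points on R/Z with c(x, y) = d(x, y)^2 / 2
N, k = 128, 40.0
x = np.arange(N) / N
d = np.abs(x[:, None] - x[None, :]); d = np.minimum(d, 1.0 - d)
C = 0.5 * d ** 2
logp = -np.cos(2 * np.pi * x); logp -= logsumexp(logp)
logq = np.sin(2 * np.pi * x);  logq -= logsumexp(logq)

u = np.zeros(N)                   # initial value u_0
for m in range(500):              # the scaled log Sinkhorn iteration 2.21
    u = S_k(u, C, logp, logq, k)
u -= u[0]                         # normalize so that u(x_0) = 0
```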

We observe that the following compactness property holds (and is proved exactly as in Lemma 2.7):

Lemma 2.13

The union \(\bigcup _{k\ge 0}S^{(k)}(C(X))\) is relatively compact in \(C(X)/\mathbb {R}\) (identifying \(C(X)/\mathbb {R}\) with the set of all continuous functions vanishing at a given point \(x_{0}\)).

2.3.1 Discretization and the Sinkhorn algorithm

Now assume that \(\mu ^{(k)}\) and \(\nu ^{(k)}\) are discrete probability measures whose supports are finite sets

$$\begin{aligned} X^{(k)}:=\{x_{i}^{(k)}\}_{i=1}^{N_{k}},\,\,\,Y^{(k)}:=\{y_{i}^{(k)}\}_{i=1}^{N_{k}} \end{aligned}$$

of the same number \(N_{k}\) of points in X and Y, respectively. This means that there exist vectors \(p^{(k)}\) and \(q^{(k)}\) in \(\mathbb {R}^{N_{k}}\) such that

$$\begin{aligned} \mu ^{(k)}=\sum _{i=1}^{N_{k}}\delta _{x_{i}^{(k)}}p_{i}^{(k)},\,\,\,\nu ^{(k)}=\sum _{i=1}^{N_{k}}\delta _{y_{i}^{(k)}}q_{i}^{(k)}. \end{aligned}$$

Moreover, since \(\mu ^{(k)}\) and \(\nu ^{(k)}\) are probability measures the vectors \(p^{(k)}\) and \(q^{(k)}\) are elements in the simplex \(\Sigma _{N_{k}}\) in \(\mathbb {R}^{N_{k}}\) defined by

$$\begin{aligned} \Sigma _{N}:=\left\{ v\in \mathbb {R}^{N}:v_{i}\ge 0,\,\,\,\sum _{i=1}^{N}v_{i}=1\right\} , \end{aligned}$$
(2.25)

which we identify with \(\mathcal {P}(\{1,\ldots ,N\})\). Similarly, we identify the discrete measure

$$\begin{aligned} \gamma _{c}^{(k)}:=e^{-kc}\mu ^{(k)}\otimes \nu ^{(k)} \end{aligned}$$

on \(X\times Y\) with the matrix \(\tilde{K}\in \mathbb {R}^{N_{k}}\times \mathbb {R}^{N_{k}}\) defined by

$$\begin{aligned} \tilde{K}_{ij}:=K_{ij}^{(k)}p_{i}^{(k)}q_{j}^{(k)},\,\,\,K_{ij}^{(k)}:=\exp (-kC_{ij}),\,\,\,C_{ij}:=c(x_{i}^{(k)},y_{j}^{(k)}), \end{aligned}$$

where \(C_{ij}\) is viewed as a cost function on \(\{1,\ldots ,N\}^{2}\). Under the identifications

$$\begin{aligned} C(X^{(k)})\leftrightarrow \mathbb {R}_{+}^{N_{k}},\,\,\,u\mapsto a,\,\,\,a_{i}:=e^{-ku(x_{i}^{(k)})}p_{i}^{(k)} \end{aligned}$$

and

$$\begin{aligned} C(Y^{(k)})\leftrightarrow \mathbb {R}_{+}^{N_{k}},\,\,\,v\mapsto b,\,\,\,b_{i}:=e^{-kv(y_{i}^{(k)})}q_{i}^{(k)} \end{aligned}$$

the scaled iteration 2.21 gets identified with the recursion \(a^{(k)}(m)\) defined by the Sinkhorn algorithm determined by the matrix \(K^{(k)}\) and the positive vectors \(p^{(k)}\) and \(q^{(k)}\) (see Sect. 1.1.1). Given an initial positive vector \(a^{(k)}(0)\) Theorem 2.9 thus shows that \((a^{(k)}(m),b^{(k)}(m))\) converges, as \(m\rightarrow \infty \), to a pair of positive vectors \((a^{(k)},b^{(k)})\) such that the scaled matrix \(D_{b}K^{(k)}D_{a}\) has the property that the rows sum to \(p^{(k)}\) and the columns sum to \(q^{(k)}\).
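
The identification can be checked numerically. In the sketch below (entirely our own naming; we write `alpha` for the scaling vector attached to the X-side and `beta` for the Y-side one, to sidestep the clash of conventions between Sect. 1.1.1 and the identifications above) the alternating normalizations are run until the row and column sums match \(p^{(k)}\) and \(q^{(k)}\), and the potential is then read off from the scaling:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 50, 20.0
x, y = np.sort(rng.random(N)), np.sort(rng.random(N))
C = 0.5 * (x[:, None] - y[None, :]) ** 2   # C_ij = c(x_i, y_j)
K = np.exp(-k * C)                         # K^(k)_ij = exp(-k C_ij)
p = rng.random(N); p /= p.sum()
q = rng.random(N); q /= q.sum()

alpha = p.copy()                           # corresponds to u_0 = 0
for m in range(5000):                      # alternate normalizations
    beta = q / (K.T @ alpha)
    alpha = p / (K @ beta)

B = alpha[:, None] * K * beta[None, :]     # the scaled matrix
print(np.abs(B.sum(axis=1) - p).max())     # rows sum to p^(k)
print(np.abs(B.sum(axis=0) - q).max())     # columns sum to q^(k)

# the X-side potential, via the identification alpha_i = exp(-k u(x_i)) p_i
u = -np.log(alpha / p) / k
```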

Remark 2.14

By construction, the functions \(u_{m}^{(k)}(x)\) on X can be expressed in terms of a Fourier type sum:

$$\begin{aligned} u_{m}^{(k)}(x)=k^{-1}\log \sum _{i=1}^{N_{k}}e^{-kc\left( x,y_{i}^{(k)}\right) }b_{i}^{(k)}(m-1) \end{aligned}$$

where the “Fourier coefficients” \(b_{i}^{(k)}(m-1)\) are given by the Sinkhorn algorithm. In the case when X and Y are domains in \(\mathbb {R}^{n}\), with \(c(x,y)=-x\cdot y\), this is the analytic continuation to \(i\mathbb {R}^{n}\) of a bona fide Fourier sum with Fourier coefficients in k times the support of \(\nu ^{(k)}\). Hence, k plays the role of the “band-width”.
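
Concretely, the discrete output thus extends off-grid: given the coefficients \(b^{(k)}(m-1)\), the function \(u_{m}^{(k)}\) can be evaluated at any point by a single log-sum-exp. A sketch (with the quadratic cost on the real line standing in for the cost at hand):

```python
import numpy as np
from scipy.special import logsumexp

def u_m_at(x, y_pts, log_b, k):
    """Evaluate u_m^(k)(x) = k^{-1} log sum_i exp(-k c(x, y_i)) b_i(m-1)
    at an arbitrary point x, here for c(x, y) = |x - y|^2 / 2."""
    return logsumexp(-k * 0.5 * (x - y_pts) ** 2 + log_b) / k
```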

3 Convergence of the fixed points

In this section we will prove various generalizations of Theorem 1.1, stated in the introduction. Throughout the section we will consider the parametrized setting in Sect. 2.3 and assume that the sequences \(\mu ^{(k)}\) and \(\nu ^{(k)}\) converge to \(\mu \) and \(\nu \) in \(\mathcal {P}(X)\) and \(\mathcal {P}(Y)\), respectively (in the standard weak topology). We will denote by \(u^{(k)}\) the fixed point of the corresponding operator \(S^{(k)}\) on C(X), uniquely determined by the normalization condition \(u^{(k)}(x_{0})=0\), at a given point \(x_{0}\) in X, and set \(v^{(k)}:=T_{\mu }u^{(k)}=v^{(k)}[u^{(k)}]\), which is a fixed point of the corresponding operator \(S^{(k)}\) on C(Y).

3.1 A general convergence result for the fixed points

We start by giving a density condition on \(\mu ^{(k)}\) ensuring that \(v^{(k)}[u]\) converges uniformly to the c-Legendre transform \(u^{c}\) of u, when \(\mu \) has full support:

Lemma 3.1

Assume that the sequence \(\mu ^{(k)}\) converging to \(\mu \) in \(\mathcal {P}(X)\) has the following “density property”: for any given open subset U intersecting the support \(X_{\mu }\) of \(\mu \)

$$\begin{aligned} \liminf _{k\rightarrow \infty }k^{-1}\log \mu ^{(k)}(U)\ge 0 \end{aligned}$$
(3.1)

Then, for any given \(u\in C(X)\), the sequence \(v^{(k)}[u]\) converges uniformly to \((\chi _{X_{\mu }}+u)^{c}\) in C(Y).

Proof

Replacing the integral over \(\mu ^{(k)}\) with a sup directly gives

$$\begin{aligned} v^{(k)}[u](y)\le (\chi _{X_{\mu }}+u)^{c}(y) \end{aligned}$$
(3.2)

for any \(y\in Y\). To prove a reversed inequality let \(x_{y}\) be a point in \(X_{\mu }\) where the sup defining \((\chi _{X_{\mu }}+u)^{c}(y)\) is attained and \(U_{\delta }\) a neighborhood of \(x_{y}\) where the oscillation of \(c(\cdot ,y)+u\) is bounded from above by \(\delta \) (the existence of \(U_{\delta }\) is ensured by the continuity of c and the compactness of X and Y). Then

$$\begin{aligned} v^{(k)}[u](y)\ge k^{-1}\log \int _{U_{\delta }}e^{-k(c(x,y)+u(x))}\mu ^{(k)}(x)\ge k^{-1}\log \mu ^{(k)}(U_{\delta })+(\chi _{X_{\mu }}+u)^{c}(y)-\delta \end{aligned}$$

Hence, as \(k\rightarrow \infty \), \(v^{(k)}[u](y)\rightarrow (\chi _{X_{\mu }}+u)^{c}(y)\) and since \(v^{(k)}[u]\) is equicontinuous (by the assumed compactness of \(X\times Y\) and the continuity of c) this implies the desired uniform convergence. \(\square \)

Example 3.2

(Weighted point clouds) If \(\mu ^{(k)}=\mu \) for any k then the density property is trivially satisfied. More generally, the density property 3.1 is satisfied by any reasonable approximation \(\mu ^{(k)}\). For example, in the discrete case where \(\mu ^{(k)}=\sum _{i=1}^{N_{k}}w_{i}^{(k)}\delta _{x_{i}^{(k)}}\) the property in question holds if \(\sup _{i}1/w_{i}^{(k)}\) and the inverse of the number of points \(x_{i}^{(k)}\) in any given open set U intersecting \(X_{\mu }\) have sub-exponential growth in k.
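
For a concrete point cloud the density property can be probed directly. A one-dimensional sketch (the grid, weights and test interval are our own choices): with \(N_{k}=k\) equally spaced points carrying weights proportional to a positive continuous density, \(k^{-1}\log \mu ^{(k)}(U)\) tends to 0 from below, as required in 3.1:

```python
import numpy as np

def log_mass(k, U=(0.2, 0.4)):
    x = (np.arange(k) + 0.5) / k              # N_k = k grid points
    w = np.exp(-np.cos(2 * np.pi * x))        # weights prop. to a density
    w /= w.sum()
    inside = (x > U[0]) & (x < U[1])
    return np.log(w[inside].sum()) / k        # k^{-1} log mu^(k)(U)

for k in [10, 100, 1000, 10000]:
    print(k, log_mass(k))                     # tends to 0 from below
```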

Theorem 3.3

Suppose that \(\mu ^{(k)}\rightarrow \mu \) and \(\nu ^{(k)}\rightarrow \nu \) in \(\mathcal {P}(X)\) and \(\mathcal {P}(Y)\), respectively and assume that \(\mu ^{(k)}\) and \(\nu ^{(k)}\) satisfy the density property 3.1. Let \(u^{(k)}\) be the normalized fixed point for the scaled log Sinkhorn operator \(S^{(k)}\) on C(X). Then, after perhaps passing to a subsequence, the following holds:

$$\begin{aligned} u^{(k)}\rightarrow u \end{aligned}$$

uniformly on X, where u is a c-convex minimizer of the Kantorovich functional J (formula 2.1) satisfying

$$\begin{aligned} u=(\chi _{Y_{\nu }}+(\chi _{X_{\mu }}+u)^{c})^{c} \end{aligned}$$
(3.3)

As a consequence, the corresponding probability measures

$$\begin{aligned} \gamma ^{(k)}:=e^{-k(u^{(k)}+v^{(k)})}e^{-kc}\mu ^{(k)}\otimes \nu ^{(k)}\in \mathcal {P}(X\times Y) \end{aligned}$$

converge weakly to a transport plan \(\gamma \) between \(\mu \) and \(\nu \), which is optimal wrt the cost function c.

Proof

Step 1: Convergence of a subsequence of \(u^{(k)}\)

In the following all functions will be normalized by demanding that the values vanish at a given point. By Lemma 2.13 we may, after perhaps passing to a subsequence, assume that \(u^{(k)}\rightarrow u^{(\infty )}\) uniformly on X, for some element \(u^{(\infty )}\) in C(X). By the previous lemma, for any given \(u\in C(X)\) we have

$$\begin{aligned} \mathcal {F}^{(k)}(u)=J(\chi _{X_{\mu }}+u)+o(1), \end{aligned}$$
(3.4)

where \(\mathcal {F}^{(k)}\) is defined by formula 2.24 and J denotes the Kantorovich functional, defined by formula 2.1. Now take a sequence \(\epsilon _{k}\) of positive numbers tending to zero such that

$$\begin{aligned} u^{(\infty )}-\epsilon _{k}\le u^{(k)}\le u^{(\infty )}+\epsilon _{k} \end{aligned}$$
(3.5)

Since \(u\mapsto v^{(k)}[u]\) is decreasing it follows that

$$\begin{aligned} \mathcal {F}^{(k)}(u^{(\infty )})-2\epsilon _{k}\le \mathcal {F}^{(k)}(u^{(k)})\le \mathcal {F}^{(k)}(u^{(\infty )})+2\epsilon _{k} \end{aligned}$$
(3.6)

Next note that since \(u^{(k)}\) minimizes the functional \(\mathcal {F}^{(k)}\) (by Prop 2.8) we have \(\mathcal {F}^{(k)}(u^{(k)})\le \mathcal {F}^{(k)}(u)\) for any given u in C(X). Hence, combining 3.6 and 3.4 and letting \(k\rightarrow \infty \) gives

$$\begin{aligned} J(u^{(\infty )})\le \inf _{u\in C(X)}J(u), \end{aligned}$$

showing that \(u^{(\infty )}\) minimizes J on C(X). To see that \(u^{(\infty )}\) is c-convex first recall that, by definition, \(u^{(k)}\) satisfies

$$\begin{aligned} u^{(k)}=u^{(k)}[v^{(k)}[u^{(k)}]]. \end{aligned}$$

Hence, combining 3.5 with the previous lemma, applied twice, gives

$$\begin{aligned} u^{(k)}=u^{(k)}[(\chi _{X_{\mu }}+u^{(\infty )})^{c}]+o(1)=(\chi _{Y_{\nu }}+(\chi _{X_{\mu }}+u^{(\infty )})^{c})^{c}+o(1) \end{aligned}$$

This shows that \(u^{(\infty )}=(\chi _{Y_{\nu }}+(\chi _{X_{\mu }}+u^{(\infty )})^{c})^{c}\), proving that \(u^{(\infty )}=f^{c}\) for some \(f\in C(Y)\). Hence \(u^{(\infty )}\) is c-convex.

Step 2: Convergence of \(\gamma ^{(k)}\) (for the subsequence in Step 1) towards an optimizer

By Lemma 2.11, \(\gamma ^{(k)}\) is in \(\Pi (\mu ^{(k)},\nu ^{(k)})\). Hence, by weak compactness, we may assume that \(\gamma ^{(k)}\) converges towards an element \(\gamma ^{(\infty )}\) in \(\mathcal {P}(X\times Y)\), which lies in \(\Pi (\mu ,\nu )\) since \(\mu ^{(k)}\rightarrow \mu \) and \(\nu ^{(k)}\rightarrow \nu \). By Prop 2.1 it will thus be enough to show that \(\gamma ^{(\infty )}\) is supported in \(\Gamma _{u^{(\infty )}}\). To this end let \(\Gamma _{\delta }\) be the closed subset of \(X\times Y\) where \(u+u^{c}+c\ge \delta >0\) for \(u:=u^{(\infty )}\). By the previous lemma \(\gamma ^{(k)}\le e^{-k\delta /2}\mu ^{(k)}\otimes \nu ^{(k)}\) on \(\Gamma _{\delta }\), when k is sufficiently large, and hence the limit \(\gamma ^{(\infty )}\) is indeed supported on \(\Gamma _{u^{(\infty )}}\). \(\square \)

In order to ensure that the whole sequence \(u^{(k)}\) is convergent some conditions on the cost function c and the measures \(\mu \) and \(\nu \) need to be imposed. Exploiting well-known uniqueness results for optimal transport plans/maps this can, in particular, be achieved in the following Riemannian setting.

Theorem 3.4

Let M be a Riemannian manifold and denote by d the Riemannian distance function. Let X and Y be compact subsets of M such that Y is a topological domain, i.e. Y is equal to the closure of the interior of Y, and take c(x, y) to be the restriction of \(d(x,y)^{2}/2\) to \(X\times Y\). Assume that \(\nu \) is absolutely continuous wrt the Riemannian volume form and has support Y and that \(\mu ^{(k)}\rightarrow \mu \) in \(\mathcal {P}(X)\) and \(\nu ^{(k)}\rightarrow \nu \) in \(\mathcal {P}(Y)\). Denote by \(u^{(k)}\) the normalized fixed point of the scaled log Sinkhorn operator \(S^{(k)}\) on C(X). Then

  • \(v^{(k)}\) converges uniformly in Y to a c-convex function v, which is a potential for the unique optimal Borel map transporting \(\nu \) to \(\mu \), i.e. the map that can be expressed as

    $$\begin{aligned} y\mapsto x_{y}:=\text {exp}_{y}(\nabla v), \end{aligned}$$
    (3.7)

    (which means that \(x_{y}\) is obtained by transporting y a distance \(|(\nabla v)(y)|\) along the geodesic emanating from y in the direction of \((\nabla v)(y)\)).

  • \(u^{(k)}\) converges uniformly on X towards the c-convex function u given by the c-Legendre transform \(v^{c}\) of v. Moreover, u and v satisfy

    $$\begin{aligned} u=(u+\chi _{X_{\mu }})^{cc},\,\,\,v=(u+\chi _{X_{\mu }})^{c}=u^{c} \end{aligned}$$
  • If \(\mu \) is absolutely continuous wrt the Riemannian volume form, then \(x\mapsto \exp _{x}(\nabla u)\) defines the optimal transport of \(\mu \) to \(\nu \).

Proof

This will be shown to follow from the previous theorem combined with results in [68], generalizing Brenier’s theorem in \(\mathbb {R}^{n}\) [16] and its Riemannian version in [52] (and, in particular, [23], concerning the torus case). After passing to a subsequence, as in the previous theorem, we may assume that \(u^{(k)}\rightarrow u:=u^{(\infty )}\) and that \(v^{(k)}\rightarrow v:=(\chi _{X_{\mu }}+u)^{c}\). In particular, v is c-convex. Denote by \(\gamma \) the corresponding optimal transport plan, furnished by the previous theorem, which is supported in the subset of \(X\times Y\) where \(u+v+c=0\). Hence, it follows from [68, Thm 10.41] (and its proof) that the Borel map \(y\mapsto \text {exp}_{y}(\nabla v)\) is the unique optimal transport (Borel) map, pushing forward \(\nu \) to \(\mu \). Since Y is assumed to be a topological domain it follows that v is uniquely determined on Y, modulo additive constants (see [68, Remark 10.30]). Now, by formula 3.3 we have \(u=v^{c}\) and since \(u(x_{0})=0\) it follows that u is uniquely determined. But, as shown in the proof of the previous theorem, we have \(v=(u+\chi _{X_{\mu }})^{c}\) and hence v is also uniquely determined (i.e. not only determined modulo an additive constant). Next, since, by assumption, \(Y=Y_{\nu }\), formula 3.3 says that \((u+\chi _{X_{\mu }})^{cc}=u\). In general, \(w^{ccc}=w^{c}\) for any function w and hence it follows that \((u+\chi _{X_{\mu }})^{c}=u^{c}\), which shows that \(v=u^{c}\). \(\square \)

The previous theorem applies more generally as soon as a unique Borel optimal map exists (see for example [68, Thm 10.38] for conditions on c ensuring that this is the case).

3.1.1 The torus case: proof of Theorem 1.1

First assume only that the probability measure \(\nu \) is absolutely continuous wrt Lebesgue measure. By the previous theorem \(u^{(k)}\) then converges uniformly towards a c-convex function u such that

$$\begin{aligned} (\nabla u^{c}+I)_{*}\nu =\mu \end{aligned}$$

If moreover \(\mu \) and \(\nu \) have densities \(e^{-f}\) and \(e^{-g}\) which are Hölder continuous, then it is well-known that there exists a unique optimal transport map and its potential (which is uniquely determined up to a constant) is in \(C^{2,\alpha }(T)\) for some \(\alpha >0\) (see [23], where this is deduced from the regularity results of Caffarelli in \(\mathbb {R}^{n}\)). It follows that \(u^{c}\), and hence also u, is \(C^{2}\)-smooth and strictly quasi-convex and solves the Monge–Ampère equation 1.5.

3.2 Application to the second boundary value problem for the Monge–Ampère operator in \(\mathbb {R}^{n}\)

Now consider the Euclidean case of Theorem 3.4, i.e. the case where \(M=\mathbb {R}^{n}\) and \(d(x,y)=|x-y|\) and assume that X and Y are compact convex domains (and, in particular, topological domains) and take \(x_{0}=0\). As before we also assume that the support of \(\nu \) is equal to Y and that \(\nu \) is absolutely continuous wrt dx and fix discretizations \(\mu ^{(k)}\) and \(\nu ^{(k)}\) satisfying the density property 3.1.

In order to conform to classical notation in \(\mathbb {R}^{n}\) we set \(\phi ^{(k)}(x):=u^{(k)}(x)+|x|^{2}/2\) and \(\psi ^{(k)}(y):=v^{(k)}(y)+|y|^{2}/2\) etc. This corresponds to replacing the cost function \(d^{2}/2\) with

$$\begin{aligned} c(x,y):=-x\cdot y. \end{aligned}$$

We will use the classical notation \(\phi ^{*}\) for the corresponding Legendre transform (formula 2.9). Then, by Theorem 3.4, \(\phi ^{(k)}\) converges uniformly to a convex function \(\phi \) on X and

$$\begin{aligned} (\nabla \psi )_{*}\nu =\mu ,\,\,\,\psi :=(\chi _{X}+\phi )^{*} \end{aligned}$$
(3.8)

We next observe that this means that \(\phi \) satisfies the following Monge–Ampère equation on the interior \(\Omega \) of X

$$\begin{aligned} MA_{\nu }(\phi )=\mu , \end{aligned}$$
(3.9)

where \(MA_{\nu }(\phi )\) denotes the Monge–Ampère measure of \(\phi \) relative to the target measure \(\nu \), in the sense of Alexandrov. We recall that if \(\nu \) is a given probability measure on \(\mathbb {R}^{n}\) which is absolutely continuous wrt dx and \(\phi \) is a finite convex function on a convex open set \(\Omega \) in \(\mathbb {R}^{n}\), then \(MA_{\nu }(\phi )\) is the Borel measure on \(\Omega \) defined by

$$\begin{aligned} \int _{E}MA_{\nu }(\phi )=\int _{(\partial \phi )(E)}\nu , \end{aligned}$$

where E is a given Borel set in \(\Omega \) and \((\partial \phi )(E)\) denotes the image of E under the multivalued sub-gradient map \(\partial \phi \) [67]. As is well-known, the assumption that \(\nu \) is absolutely continuous wrt dx ensures that \(MA_{\nu }(\phi )\), as defined above, is indeed a measure on \(\Omega \), i.e. countably additive (see [53, Section 3] for a more general setting involving a cost function c). Moreover, if \(\phi \) is \(C^{2}\)-smooth and strictly convex, then making the change of variables \(y=\nabla \phi (x)\) reveals that

$$\begin{aligned} MA_{\nu }(\phi )=\rho _{\nu }(\nabla \phi )\det (\nabla ^{2}\phi ), \end{aligned}$$
(3.10)

where \(\rho _{\nu }\) denotes the density of \(\nu \) wrt dx. We will use the following general representation of the Monge–Ampère measure (see [11, Section 2.2] and references therein).

Lemma 3.5

Let \(\phi \) be a bounded convex function in \(\Omega \). Then

$$\begin{aligned} MA_{\nu }(\phi )=(\nabla \psi )_{*}\nu ,\,\,\,\psi :=(\chi _{X}+\phi )^{*} \end{aligned}$$

Hence, by formula 3.8, the limit \(\phi \) of \(\phi ^{(k)}\) indeed solves the Monge–Ampère equation 3.9. It should be stressed that, in general, there can be many different convex solutions to the Monge–Ampère equation (differing by more than an additive constant). But the point is that the Sinkhorn iteration singles out a particular solution \(\phi \). On the other hand, when Y is convex we have the following well-known uniqueness result:

Let \(\Omega \) be a convex open set in \(\mathbb {R}^{n}\) and \(\mu \) and \(\nu \) probability measures on \(\mathbb {R}^{n}\) such that \(\mu \) is supported in \(\Omega \), \(\nu \) is absolutely continuous wrt dx and the support Y of \(\nu \) is a convex body, i.e. a compact convex set with non-empty interior. Then a solution \(\phi \) to the second boundary value problem

$$\begin{aligned} MA_{\nu }(\phi )=\mu ,\,\,\,(\partial \phi )(\Omega )\subset Y \end{aligned}$$
(3.11)

is uniquely determined up to an additive constant.

Proof

A simple proof is given in [11], which we recall here since it fits well with the spirit of the present paper (see also [53, Thm 3.1] for a different proof in the setting of more general cost functions). The starting point is the observation that any convex function \(\phi \) on an open convex set \(\Omega \) with the property that \((\partial \phi )(\Omega )\subset Y\), for a convex body Y, may be expressed as \(\phi =\left( \chi _{Y}+(\chi _{X}+\phi )^{*}\right) ^{*}\). But, by the previous lemma, the equation \(MA_{\nu }(\phi )=\mu \) implies that the function \(\psi :=(\chi _{X}+\phi )^{*}\) in Y is uniquely determined up to an additive constant. Hence, so is \(\phi \). \(\square \)

We thus arrive at the following result, which also applies to the second boundary value problem in all of \(\mathbb {R}^{n}:\)

Theorem 3.6

Let X and Y be compact convex domains in \(\mathbb {R}^{n}\) endowed with probability measures \(\mu \) and \(\nu \) respectively. Assume that the support of \(\nu \) is equal to Y and that \(\nu \) is absolutely continuous wrt dx. Let \(\mu ^{(k)}\) and \(\nu ^{(k)}\) be sequences of probability measures on X and Y converging weakly towards \(\mu \) and \(\nu \) respectively and satisfying the density property 3.1. Denote by \(\phi ^{(k)}\) the unique normalized fixed point of the scaled log Sinkhorn operator on C(X) corresponding to the cost function \(c(x,y)=-x\cdot y\). Then

$$\begin{aligned} \phi ^{(k)}\rightarrow \phi ,\,\,\,k\rightarrow \infty \end{aligned}$$
(3.12)

uniformly on X, where \(\phi \) is the unique normalized convex solution to the second boundary value problem 3.11 in the interior of X. More generally, the corresponding result holds when X is replaced by \(\mathbb {R}^{n}\) under the assumption that there exists a compact subset containing the support of \(\mu ^{(k)}\) for all k. Then the corresponding convergence 3.12 is uniform on compact subsets of \(\mathbb {R}^{n}\).

Proof

First assume that X is compact. As explained above it then follows from Theorem 3.4 that \(\phi ^{(k)}\) converges uniformly to a normalized convex function \(\phi \) on X, satisfying \(MA_{\nu }(\phi )=\mu \). Next, note that, it follows directly from the definition of the fixed point \(\phi ^{(k)}\) that \(\phi ^{(k)}\) may be expressed as

$$\begin{aligned} \phi ^{(k)}(x)=k^{-1}\log \int _{Y}e^{kx\cdot y}\nu '(y) \end{aligned}$$

for a measure \(\nu '\) supported on Y. As a consequence, \(\phi ^{(k)}\) is smooth and \(\nabla \phi ^{(k)}\) is contained in the convex hull of Y, i.e. in Y, since Y is assumed to be convex. But then it follows that \((\partial \phi )(\Omega )\subset Y\), which concludes the proof in the case when X is compact. Note that, alternatively, we could have used directly that, by Theorem 3.4, the limit \(\phi \) satisfies \(\phi =\left( \chi _{Y}+(\chi _{X}+\phi )^{*}\right) ^{*}\), which is the unique normalized solution to the second boundary value problem in question (by the proof of the previous lemma). To prove the non-compact case when X is replaced by \(\mathbb {R}^{n}\) denote by \(X_{R}\) the ball of radius R centered at 0 and assume that R is sufficiently large to ensure that \(\mu \) and all \(\mu ^{(k)}\) are supported in the interior of \(X_{R}\). Denote by \(\phi _{R}^{(k)}(x)\) the normalized fixed point of the corresponding iteration on \(C(X_{R})\). Since \(\mu ^{(k)}\) is supported in \(X_{R}\) it follows, from the very definition of the iteration, that \(\phi _{R}^{(k)}(x)\) is, in fact, independent of R. Accordingly, we may define a normalized convex function \(\phi ^{(k)}\) on \(\mathbb {R}^{n}\) by setting \(\phi ^{(k)}:=\phi _{R}^{(k)}\) on \(X_{R}\) for any R sufficiently large. Then \(\phi ^{(k)}\) is the unique normalized fixed point of the corresponding iteration on \(C(\mathbb {R}^{n})\). Moreover, since \(X_{R}\) is compact, \(\phi ^{(k)}\) converges uniformly on \(X_{R}\) to a convex function \(\phi \) solving the second boundary value problem 3.11 on \(X_{R}\). Since R is arbitrary this shows that \(\phi \), in fact, solves the problem on all of \(\mathbb {R}^{n}\). \(\square \)

3.3 Application to convolutional Wasserstein distances

Theorem 3.3 holds more generally (with essentially the same proof) when the function c is replaced by a sequence \(c_{k}\) such that

$$\begin{aligned} \left\| c_{k}-c\right\| _{L^{\infty }(X\times Y)}\rightarrow 0 \end{aligned}$$
(3.13)

For example, in the Riemannian setting of Theorem 3.4, denoting by \(\mathcal {K}_{t}(x,y)\) the corresponding heat kernel and setting \(t:=2k^{-1}\), the sequence

$$\begin{aligned} c_{k}:=-t^{-1}\log \mathcal {K}_{t}(x,y) \end{aligned}$$
(3.14)

satisfies 3.13, by Varadhan’s formula (which holds more generally on Lipschitz Riemannian manifolds [54]). Replacing c by \(c_{k}\) in this setting thus has the effect of replacing the matrix \(A_{ij}:=e^{-kd^{2}(x_{i},x_{j})/2}\) appearing in the corresponding Sinkhorn algorithm with the heat kernel matrix \(\mathcal {K}_{2k^{-1}}(x_{i},x_{j})\) which, as emphasized in [63], has computational advantages. Following [63] we consider the squared convolutional Wasserstein distance between \(\mu \) and \(\nu :\)

$$\begin{aligned} \mathcal {W}_{(k)}^{2}(\mu ,\nu ):=k^{-1}\inf _{\gamma \in \Pi (\mu ^{(k)},\nu ^{(k)})}\mathcal {I}(\gamma \,|\,\mathcal {K}_{2k^{-1}}\mu ^{(k)}\otimes \nu ^{(k)}), \end{aligned}$$

defined wrt approximations \(\mu ^{(k)}\) and \(\nu ^{(k)}\), for example given by weighted point clouds, as in Example 3.2. In [63, Page 3], the problem of developing conditions for the convergence of \(\mathcal {W}_{(k)}^{2}(\mu ,\nu )\) was posed. The following result provides an answer:

Theorem 3.7

Let X be a compact Riemannian manifold (possibly with boundary) and set \(c(x,y):=d(x,y)^{2}/2\), where d is the Riemannian distance function. Suppose that \(\mu ^{(k)}\rightarrow \mu \) and \(\nu ^{(k)}\rightarrow \nu \) in \(\mathcal {P}(X)\) and that \(\mu ^{(k)}\) and \(\nu ^{(k)}\) satisfy the density property 3.1. Then

$$\begin{aligned} \lim _{k\rightarrow \infty }\mathcal {W}_{(k)}^{2}(\mu ,\nu )=\mathcal {W}^{2}(\mu ,\nu ), \end{aligned}$$

where \(\mathcal {W}^{2}(\mu ,\nu )\) denotes the squared \(L^{2}\)-Wasserstein distance between \(\mu \) and \(\nu \).

Proof

Repeating the argument in the proof of Theorem 3.3, with c replaced by \(c_{k}\) as above, gives

$$\begin{aligned} \lim _{k\rightarrow \infty }\inf _{u\in C(X)}\mathcal {F}^{(k)}(u)=\inf _{u\in C(X)}J(u) \end{aligned}$$

According to formula 2.12 the infimum appearing in the left hand side above is precisely \(\mathcal {W}_{(k)}^{2}(\mu ,\nu )\). Since the infimum in the right hand side above is equal to \(\mathcal {W}^{2}(\mu ,\nu )\), by Kantorovich duality (formula 2.4), the result follows. \(\square \)
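
In practice the theorem licenses running the Sinkhorn iteration with the heat-kernel matrix \(\mathcal {K}_{2k^{-1}}(x_{i},x_{j})\) in place of \(e^{-kd^{2}/2}\). A sketch on the unit circle \(\mathbb {R}/\mathbb {Z}\), where the heat kernel is a wrapped Gaussian (the truncation to three images and the test densities are our own choices; the overall normalization of the kernel is irrelevant for the scaling iteration):

```python
import numpy as np

N, k = 256, 100.0
t = 2.0 / k                                    # heat-kernel time t = 2/k
x = np.arange(N) / N
delta = x[:, None] - x[None, :]
d = np.minimum(np.abs(delta), 1.0 - np.abs(delta))   # distance on R/Z

# truncated wrapped Gaussian = heat kernel on R/Z (3 images suffice, t small)
Kt = sum(np.exp(-(delta + m) ** 2 / (2.0 * t)) for m in (-1.0, 0.0, 1.0))

p = np.exp(-np.cos(2 * np.pi * x)); p /= p.sum()
q = np.exp(np.sin(2 * np.pi * x)); q /= q.sum()

a = np.ones(N)
for m in range(2000):                          # Sinkhorn with the kernel Kt
    b = q / (Kt.T @ a)
    a = p / (Kt @ b)

gamma = a[:, None] * Kt * b[None, :]           # approximately optimal plan
print((gamma * 0.5 * d ** 2).sum())            # plug-in cost <c, gamma>
```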

4 Convergence of the iteration towards parabolic optimal transport equations on the torus

4.1 Proof of Theorem 1.2

We will denote by \(\delta _{\Lambda _{k}}\) the uniform discrete probability measure supported on the discrete torus \(\Lambda _{k}\) with edge-length 1/k:

$$\begin{aligned} \delta _{\Lambda _{k}}:=\frac{1}{N_{k}}\sum _{x_{i}\in \Lambda _{k}}\delta _{x_{i}} \end{aligned}$$

Given two probability measures \(\mu =e^{-f}dx\) and \(\nu =e^{-g}dy\) we can then define their discretizations as the probability measures

$$\begin{aligned} \mu ^{(k)}=\frac{1}{\int _{T^{n}}e^{-f}\delta _{\Lambda _{k}}}e^{-f}\delta _{\Lambda _{k}},\,\,\,\nu ^{(k)}=\frac{1}{\int _{T^{n}}e^{-g}\delta _{\Lambda _{k}}}e^{-g}\delta _{\Lambda _{k}} \end{aligned}$$

Note that the normalization constants have the asymptotics

$$\begin{aligned} \left| \int _{T^{n}}e^{-f}\delta _{\Lambda _{k}}-1\right| \le C_{f}k^{-1},\,\,\,\,\left| \int _{T^{n}}e^{-g}\delta _{\Lambda _{k}}-1\right| \le C_{g}k^{-1} \end{aligned}$$
(4.1)

where \(C_{f}\) only depends on an upper bound on \(|\nabla f|\) on \(T^{n}\) (and similarly for \(C_{g}\)); see [13, Page 2].
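
A sketch of this discretization for \(n=1\) (our own choice of density), including a numerical check of the normalization asymptotics 4.1; note that for smooth periodic f the discrete sums actually converge faster than the general \(O(k^{-1})\) bound:

```python
import numpy as np

def discretize(k):
    """Weights of mu^(k) = e^{-f} delta_{Lambda_k} / Z on the discrete
    torus Lambda_k = {0, 1/k, ..., (k-1)/k} (n = 1, so N_k = k)."""
    x = np.arange(k) / k
    # density e^{-f(x)} = e^{cos(2 pi x)} / I_0(1), normalized so that
    # mu = e^{-f} dx is a probability measure
    rho = np.exp(np.cos(2 * np.pi * x)) / np.i0(1.0)
    w = rho / k                      # e^{-f(x_i)} times the mass 1/N_k
    Z = w.sum()                      # the normalization constant in 4.1
    return x, w / Z, Z

for k in [10, 100, 1000]:
    x, w, Z = discretize(k)
    print(k, abs(Z - 1.0))           # the deviation bounded in 4.1
```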

Remark 4.1

As will be clear from the proof, Theorem 1.2 also holds with the simpler discretizations \(e^{-f}\delta _{\Lambda _{k}}\) and \(e^{-g}\delta _{\Lambda _{k}}\) (which, in general, are not probability measures).

We start with the following discrete version of the classical Laplace method of integration, proved in “Appendix A”:

Lemma 4.2

Let \(\alpha \) be a lower-semicontinuous (lsc) function on \(T^{n}\) with a unique minimum at \(x_{0}\) and assume that \(\alpha \) is \(C^{4}\)-smooth on a neighborhood of \(x_{0}\) and \(\nabla ^{2}\alpha (x_{0})>0\). Then, if h is \(C^{2}\)-smooth

$$\begin{aligned} k^{n/2}\int e^{-k\alpha }h\delta _{\Lambda _{k}}=(2\pi )^{n/2}e^{-k\alpha (x_{0})}\frac{h(x_{0})}{\sqrt{\det (\nabla ^{2}\alpha (x_{0}))}}(1+\epsilon _{k}),\,\,\,|\epsilon _{k}|\le Ck^{-1} \end{aligned}$$

where the constant C only depends on upper bounds on the \(C^{4}\)-norm of \(\alpha \), the \(C^{2}\)-norm of h and a strict lower bound on the smallest eigenvalue of \(\nabla ^{2}\alpha \) on a neighborhood of \(x_{0}\).
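
A quick one-dimensional numerical check of the lemma (the specific \(\alpha \) and h are our own choices):

```python
import numpy as np

# alpha(x) = 1 - cos(2 pi x) has its unique minimum at x0 = 0 on R/Z,
# with alpha(x0) = 0 and alpha''(x0) = 4 pi^2
alpha = lambda x: 1.0 - np.cos(2 * np.pi * x)
h = lambda x: 2.0 + np.sin(2 * np.pi * x)

predicted = np.sqrt(2 * np.pi / (4 * np.pi ** 2)) * h(0.0)
for k in [10, 100, 1000, 10000]:
    x = np.arange(k) / k                        # the grid Lambda_k
    lhs = np.sqrt(k) * np.mean(np.exp(-k * alpha(x)) * h(x))
    print(k, lhs / predicted - 1.0)             # eps_k, of order 1/k
```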

We next prove the key asymptotic result that will be used in the proof of Theorem 1.2, giving the asymptotics of the function \(\rho _{ku}(x)\) defined by formula 2.23 (the result can be viewed as a refinement of Lemma 3.1).

Proposition 4.3

Let u be a strictly quasi-convex function in \(C^{4}(T^{n})\). Then the following asymptotics hold

$$\begin{aligned} \rho _{ku}(x)=\det (I+\nabla ^{2}u(x))e^{f(x)-g(x+\nabla u(x))}(1+\epsilon _{k}),\,\,\,|\epsilon _{k}|\le Ck^{-1} \end{aligned}$$

where the constant C only depends on upper bounds on the \(C^{4}\)-norm of u and the \(C^{2}\)-norms of f and g and a strictly positive lower bound on the eigenvalues of the matrix \(\left( I+\nabla ^{2}u(x)\right) \).

Proof

First recall that, by Lemma 2.3, \(u^{c}\) is also strictly quasi-convex and in \(C^{4}(T^{n})\). Set \(c(x,y)=d_{T^{n}}^{2}(x,y)/2\). In the proof we will denote by \(O(k^{-1})\) any sequence of functions satisfying \(|O(k^{-1})|\le Ck^{-1}\), for a constant C depending on data as in the statement of the proposition. Fix \(y\in Y\). The function \(x\mapsto c(x,y)+u(x)\) is lsc on \(T^{n}\) and has a unique minimum \(x_{y}\) and it is \(C^{4}\)-smooth close to \(x_{y}\) (by Lemma 2.4). Applying the asymptotics 4.1 and Lemma 4.2 thus gives

$$\begin{aligned} k^{n/2}e^{kv^{(k)}[u](y)}=k^{n/2}\int e^{-k\left( c(x,y)+u(x)\right) }\mu ^{(k)}(x)=e^{ku^{c}(y)}\left( h(y)+O(1)k^{-1}\right) , \end{aligned}$$

where

$$\begin{aligned} h(y):=(2\pi )^{n/2}\frac{\exp (-f(x_{y}))}{\sqrt{\det (I+\nabla ^{2}u(x_{y}))}}. \end{aligned}$$

As a consequence, the inverse of \(k^{n/2}e^{kv^{(k)}[u](y)}\) is given by \(e^{-ku^{c}(y)}\left( h(y)^{-1}+O(1)k^{-1}\right) \) and hence

$$\begin{aligned}&e^{-ku(x)}e^{ku^{(k)}\left[ v^{(k)}[u]\right] (x)}=k^{n/2}\int e^{-k\left( c(x,y)+u(x)\right) }\left( \frac{1}{k^{n/2}e^{kv^{(k)}[u](y)}}\right) \nu ^{(k)}(y)\nonumber \\&\quad =k^{n/2}\int e^{-k\left( c(x,y)+u(x)+u^{c}(y)\right) }h(y)^{-1}\nu ^{(k)}(y)+R_{k}(x), \end{aligned}$$
(4.2)

where

$$\begin{aligned} R_{k}(x)\le O(k^{-1})k^{n/2}\int e^{-k\left( c(x,y)+u(x)+u^{c}(y)\right) }\nu ^{(k)}(y). \end{aligned}$$

By Lemma 4.2 (and the asymptotics 4.1) we have

$$\begin{aligned} k^{n/2}\int e^{-k\left( c(x,y)+u(x)+u^{c}(y)\right) }\nu ^{(k)}(y)\le C' \end{aligned}$$

and hence it follows that

$$\begin{aligned} R_{k}(x)=O(k^{-1}). \end{aligned}$$

Now, the same localization argument as above shows that the integral over \(\nu ^{(k)}(y)\) in formula 4.2 localizes around a small neighborhood V of \(y=y_{x}\). Hence, applying Lemma 4.2 (and the asymptotics 4.1) again gives

$$\begin{aligned}&k^{n/2}\int e^{-k(d(x,y)^{2}/2+u(x)+u^{c}(y))}h(y)^{-1}\nu ^{(k)}(y)\\&\quad =e^{k\left( ((u^{c})^{c})(x)-u(x)\right) }h^{-1}(y_{x})\frac{(2\pi )^{n/2}\exp (-g(y_{x}))}{\sqrt{\det (I+\nabla ^{2}u^{c}(y_{x}))}}(1+O(k^{-1})). \end{aligned}$$

All in all, since \(\left( (u^{c})^{c}\right) =u\), this shows that

$$\begin{aligned} e^{ku^{(k)}\left[ v^{(k)}[u]\right] (x)}=e^{ku(x)}h^{-1}(y_{x})\frac{(2\pi )^{n/2}\exp (-g(y_{x}))}{\sqrt{\det (I+\nabla ^{2}u^{c}(y_{x}))}}(1+O(k^{-1})). \end{aligned}$$

The proof is thus concluded by invoking the inverse properties of the Hessians in Lemma 2.3. \(\square \)

Lemma 4.4

Let X be a compact topological space and consider the following family of difference equations on C(X), parametrized by a positive number k and a discrete time m:

$$\begin{aligned} u_{m+1}^{(k)}-u_{m}^{(k)}=k^{-1}D^{(k)}(u_{m}^{(k)}), \end{aligned}$$
(4.3)

where \(D^{(k)}\) is an operator on C(X), which descends to \(C(X)/\mathbb {R}\) and with the property that \(I+k^{-1}D^{(k)}\) is an increasing operator (wrt the usual order relation on C(X)). Assume that there exists a subset \(\mathcal {H}\) of C(X) and an operator D on \(\mathcal {H}\) such that, for any u in \(\mathcal {H}\),

$$\begin{aligned} \left| D^{(k)}(u)-D(u)\right| \le C_{u}\epsilon _{k}, \end{aligned}$$
(4.4)

where \(C_{u}\) is a positive constant depending on u and \(\epsilon _{k}\) is a sequence of positive numbers converging towards zero. Assume that \(u_{0}^{(k)}=u_{0}\), where \(u_{0}\) is a fixed function in \(\mathcal {H}\), and that there exists a family \(u_{t}\in \mathcal {H}\), which is twice differentiable wrt t and solves

$$\begin{aligned} \frac{\partial u_{t}}{\partial t}=D(u_{t}),\,\,\,\,(u_{t})_{|t=0}=u_{0}, \end{aligned}$$
(4.5)

for \(t\in [0,T]\). Then, for any (k, m) such that \(m/k\in [0,T]\),

$$\begin{aligned} \sup _{X}\left| u_{m}^{(k)}-u_{m/k}\right| \le C_{T}\frac{m}{k}\cdot 2\max \{\epsilon _{k},k^{-1}\},\,\,\,C_{T}:=\max \left\{ \sup _{t\in [0,T]}C_{u_{t}},\sup _{X\times [0,T]}\left| \frac{\partial ^{2}u_{t}}{\partial ^{2}t}\right| \right\} . \end{aligned}$$

Proof

We will write \(\psi _{k,m}=u_{m/k}\).

Step 1: The following holds for all (k, m)

$$\begin{aligned} \sup \left| \psi _{k,m+1}-\psi _{k,m}-k^{-1}D^{(k)}(\psi _{k,m})\right| \le k^{-1}C_{T}\epsilon '_{k},\,\,\,\epsilon '_{k}:=2\max \{\epsilon _{k},k^{-1}\}. \end{aligned}$$

Indeed, using the mean value theorem we can write

$$\begin{aligned} \psi _{k,m+1}-\psi _{k,m}=\frac{1}{k}\left( \frac{u_{m/k+1/k}-u_{m/k}}{1/k}\right) =\frac{1}{k}\left( \frac{\partial u_{t}}{\partial t}_{|t=m/k}+O(k^{-1})\right) , \end{aligned}$$

where the term \(O(k^{-1})\) may be estimated as \(|O(k^{-1})|\le A_{T}k^{-1}\), where \(A_{T}=\sup _{X\times [0,T]}|\frac{\partial ^{2}u_{t}}{\partial ^{2}t}|\). Using the evolution equation for \(u_{t}\) and applying formula 4.4 thus proves Step 1.

Step 2: The discrete evolution on C(X) defined by the difference equation 4.3 decreases the sup-norm, i.e.

$$\begin{aligned} \sup _{X}|\phi _{m+1}-\psi _{m+1}|\le \sup _{X}|\phi _{m}-\psi _{m}| \end{aligned}$$

if \(\phi _{m}\) and \(\psi _{m}\) satisfy the difference equation 4.3 for a fixed k.

To see this set \(C:=\sup |\phi _{m}-\psi _{m}|\). Then, \(\phi _{m}\le \psi _{m}+C\) and hence, since \(I+k^{-1}D^{(k)}\) is assumed to be increasing,

$$\begin{aligned} \phi _{m+1}=\phi _{m}+k^{-1}D^{(k)}(\phi _{m})\le \psi _{m}+C+k^{-1}D^{(k)}(\psi _{m}+C)=\psi _{m+1}+C \end{aligned}$$

In particular, \(\sup (\phi _{m+1}-\psi _{m+1})\le C\). Applying the same argument with the roles of \(\phi \) and \(\psi \) interchanged concludes the proof.

Step 3: Conclusion:

$$\begin{aligned} \sup _{X}|u_{m}^{(k)}-\psi _{k,m}|\le C_{T}\frac{m}{k}\epsilon '_{k}. \end{aligned}$$
(4.6)

We will prove this by induction over m (for k fixed), the statement being trivially true for \(m=0\). We fix the integer k and assume as an induction hypothesis that 4.6 holds for m. Applying first Step 2 and then the induction hypothesis thus gives

$$\begin{aligned}&\sup _{X}|\left( \psi _{k,m}+k^{-1}D^{(k)}(\psi _{k,m})\right) -\left( u_{m}^{(k)}+k^{-1}D^{(k)}(u_{m}^{(k)})\right) |\\&\quad \le \sup _{X}|\psi _{k,m}-u_{m}^{(k)}|\le C_{T}\frac{m}{k}\epsilon '_{k}. \end{aligned}$$

Now, by Step 1,

$$\begin{aligned} \sup _{X}|\psi _{k,m+1}-(\psi _{k,m}+k^{-1}D^{(k)}(\psi _{k,m}))|\le \frac{C_{T}}{k}\epsilon '_{k} \end{aligned}$$

for all (m, k). Hence,

$$\begin{aligned} \sup _{X}|\psi _{k,m+1}-u_{m+1}^{(k)}|\le C_{T}\frac{m}{k}\epsilon '_{k}+C_{T}\frac{1}{k}\epsilon '_{k}=C_{T}\frac{(m+1)}{k}\epsilon '_{k}, \end{aligned}$$

proving the induction step and hence the final Step 3. \(\square \)
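
The lemma is a pure stability statement, and its mechanism can already be seen on a scalar toy model (entirely our own example): take X to be a single point, \(D(u)=-u\) (so that \(u_{t}=u_{0}e^{-t}\)) and \(D^{(k)}(u)=-u+k^{-1}\), i.e. \(\epsilon _{k}=k^{-1}\); note that \(I+k^{-1}D^{(k)}\) is increasing as soon as \(k>1\):

```python
import numpy as np

def discrete_evolution(k, T, u0=1.0):
    # u_{m+1} = u_m + k^{-1} D^(k)(u_m), run for m = 0, ..., Tk - 1
    u = u0
    for m in range(int(T * k)):
        u = u + (-u + 1.0 / k) / k
    return u

T = 1.0
for k in [10, 100, 1000, 10000]:
    err = abs(discrete_evolution(k, T) - np.exp(-T))
    print(k, err)   # of order max(eps_k, 1/k) = 1/k, as in the lemma
```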

In the present setting \(\mathcal {H}\) will be taken as the subspace of \(C^{4}(T^{n})\) consisting of all strictly quasi-convex functions u and

$$\begin{aligned} D(u)(x):=\log (\det (\nabla ^{2}u(x)+I))-g(\nabla u(x)+x)+f(x). \end{aligned}$$

The following proposition essentially follows from the results in [44, 45, 57] (for completeness a proof is provided in “Appendix B”). The space \(C^{s,\alpha }(T^{n})\) denotes, as usual, the space of all functions whose sth order derivatives are Hölder continuous with Hölder exponent \(\alpha \in ]0,1[\).

Proposition 4.5

Let f and g be two given functions in \(C^{2,\alpha }(T^{n})\). Then, for strictly quasi-convex initial data \(u_{0}\in C^{4,\alpha }(T^{n})\) there exists a solution u(x, t) to the corresponding parabolic PDE 1.7 in \(C^{2}([0,\infty [\times T^{n})\) such that, for any \(t>0\), \(u_{t}\in C^{4}(T^{n})\) and \(u_{t}\) is strictly quasi-convex. Moreover,

  • There exists a positive constant \(C_{1}\)—only depending on upper bounds on the \(C^{2}\)-norms of \(u_{0},f\) and g and a strictly positive lower bound on the eigenvalues of the matrix \((I+\nabla ^{2}u_{0}(x))\)—such that

    $$\begin{aligned} C_{1}^{-1}I\le \left( \nabla ^{2}u_{t}(x)+I\right) \le C_{1}I \end{aligned}$$

    and a constant \(C_{2}\) which, moreover, depends on the \(C^{4,\alpha }\)-norm of \(u_{0}\) and the \(C^{2,\alpha }\)-norms of f and g such that

    $$\begin{aligned} \left\| u_{t}\right\| _{C^{4}(T^{n})}\le C_{2} \end{aligned}$$
  • There exist positive constants \(A_{1}\) and \(A_{0}\) such that

    $$\begin{aligned} \sup _{T^{n}}|u_{t}-u_{\infty }|\le A_{1}e^{-t/A_{0}} \end{aligned}$$
    (4.7)

    where \(u_{\infty }\) is a potential solving the corresponding optimal transport problem (i.e. a solution to the corresponding stationary equation).

  • If \(u_{0},f\) and g are in \(C^{\infty }(T^{n})\), then so is \(u_{t}\) and \(\left\| u_{t}\right\| _{C^{s}(T^{n})}\le C_{s}\) for any positive integer s.

Note that differentiating the parabolic PDE 1.7 gives

$$\begin{aligned} \frac{\partial ^{2}u_{t}}{\partial ^{2}t}=L_{u_{t}}[D(u_{t})], \end{aligned}$$

where \(L_{u_{t}}\) is the linearization of D at \(u_{t}\). By the first point in the previous proposition \(L_{u_{t}}\) is a uniformly elliptic second order operator (see formula 9.2 in “Appendix B”) and hence

$$\begin{aligned} \sup _{X\times [0,\infty [}\left| \frac{\partial ^{2}u_{t}}{\partial ^{2}t}\right| \le C_{3} \end{aligned}$$

where \(C_{3}\) only depends on the constants \(C_{1}\) and \(C_{2}\) in the previous proposition and the \(C^{2}\)-norms of f and g.

4.1.1 Conclusion of proof of Theorem 1.2 and Corollary 1.3

Consider the log Sinkhorn iteration and define \(D^{(k)}(u)\) to be \(\log (\rho _{ku})\), as in Eq. 2.22. The proof of Theorem 1.2 follows directly from combining Lemma 4.4 with Propositions 4.3 and 4.5, also using that the corresponding operator \(S^{(k)}=I+k^{-1}D^{(k)}\), defined by formula 2.20, is clearly increasing and invariant under the additive \(\mathbb {R}\)-action. Finally, Corollary 1.3 follows by combining Theorem 1.2 with the exponential convergence in formula 4.7. Indeed, setting \(m_{k}=kt_{k}\) with \(t_{k}:=A\log k\), for a constant A to be chosen below (cf. formula 4.7), gives

$$\begin{aligned} \left\| u_{m_{k}}^{(k)}-u_{\infty }\right\| _{C(X)}\le & {} \left\| u_{m_{k}}^{(k)}-u_{t_{k}}\right\| _{C(X)}+\left\| u_{t_{k}}-u_{\infty }\right\| _{C(X)}\\\le & {} ACk^{-1}\log k+A_{1}(k^{-1})^{A/A_{0}} \end{aligned}$$

where C is bounded from above independently of t, as follows from Theorem 1.2 together with the uniform bounds on \(u_{t}\) in Prop 4.5. Setting \(u_{k}:=u_{m_{k}}^{(k)}\) this proves the estimate 1.10 when \(A>A_{0}\).

Next note that the estimate 1.10 implies the estimate 1.11 for \(\gamma _{k}\). Indeed, by definition, \(\gamma _{k}:=e^{-kd_{T^{n}}(x,y)^{2}/2}e^{-ku_{k}(x)}e^{-kv_{k}(y)}\mu ^{(k)}\otimes \nu ^{(k)}\), where \(v_{k}:=v[u_{k}]\). By the estimate 1.10 and Lemma 8.1 there exists a positive number C such that

$$\begin{aligned} u_{k}(x)+v_{k}(y)\ge u(x)+u^{c}(y)-Ck^{-1}\log k. \end{aligned}$$

The proof is thus concluded by invoking the following elementary inequality, which we claim holds for any strictly quasi-convex function u in \(C^{2}(T^{n}):\)

$$\begin{aligned} d_{T^{n}}(x,y)^{2}/2+u(x)+u^{c}(y)\ge \frac{\delta }{2}d_{T^{n}}(y,F_{u}(x))^{2},\,\,\,F_{u}(x)=x+(\nabla u)(x), \end{aligned}$$
(4.8)

under the assumption that

$$\begin{aligned} 0<\delta I\le \left( \nabla ^{2}u(x)+I\right) \le \delta ^{-1}I. \end{aligned}$$

To see this we identify u and \(u^{c}\) with \(\mathbb {Z}^{n}\)-periodic functions on \(\mathbb {R}^{n}\) and set \(\phi (x):=u(x)+|x|^{2}/2\). Then \(\phi ^{*}(y)=u^{c}(y)+|y|^{2}/2\), where \(\phi ^{*}\) is the classical Legendre-transform on \(\mathbb {R}^{n}\) (compare with the proof of Lemma 2.3). Hence, it will be enough to show that

$$\begin{aligned} -x\cdot y+\phi (x)+\phi ^{*}(y)\ge \frac{\delta }{2}|y-\nabla \phi (x)|^{2}. \end{aligned}$$
(4.9)

Indeed, the claimed inequality 4.8 follows from the latter one after replacing x with \(x+m\) and taking the infimum over all \(m\in \mathbb {Z}^{n}\). Now, by assumption \(\nabla ^{2}\phi \le \delta ^{-1}I\) and hence \(\nabla ^{2}\phi ^{*}\ge \delta I\) (by 2.8). As a consequence, \(\phi ^{*}(y)\ge \phi ^{*}(y-t)+t\cdot \nabla \phi ^{*}(y-t)+\delta |t|^{2}/2\) for any \(t\in \mathbb {R}^{n}\). Setting \(t:=y-\nabla \phi (x)\) and using that \(\phi ^{*}(\nabla \phi (x))=\nabla \phi (x)\cdot x-\phi (x)\) and \((\nabla \phi ^{*})(\nabla \phi (x))=x\) this implies the desired inequality 4.9.
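
Taken together, the proof yields an explicit recipe: discretize \(\mu \) and \(\nu \) on \(\Lambda _{k}\), run \(m_{k}=\lceil Ak\log k\rceil \) scaled log Sinkhorn steps and read off the potential, with error \(O(k^{-1}\log k)\). A one-dimensional sketch (the constant A and the densities are our own choices; the last line returns the dual value appearing in Corollary 1.4):

```python
import numpy as np
from scipy.special import logsumexp

def potential_on_torus(k, A=2.0):
    """m_k = ceil(A k log k) scaled log Sinkhorn steps on T^1, with the
    grid Lambda_k coupled to the regularization as in Theorem 1.2."""
    N = int(k)                                   # N_k = k^n with n = 1
    x = np.arange(N) / N
    d = np.abs(x[:, None] - x[None, :]); d = np.minimum(d, 1.0 - d)
    C = 0.5 * d ** 2                             # c = d_{T^1}(x, y)^2 / 2
    logp = np.cos(2 * np.pi * x); logp -= logsumexp(logp)
    logq = np.sin(2 * np.pi * x); logq -= logsumexp(logq)
    u = v = np.zeros(N)
    for m in range(int(np.ceil(A * k * np.log(k)))):
        v = logsumexp(-k * (C + u[:, None]) + logp[:, None], axis=0) / k
        u = logsumexp(-k * (C + v[None, :]) + logq[None, :], axis=1) / k
    # approximation of C(mu, nu) = -J(u) (cf. Prop 2.1 and Corollary 1.4)
    cost = -(np.exp(logp) @ u + np.exp(logq) @ v)
    return x, u - u[0], cost

x, u, cost = potential_on_torus(k=64)
```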

4.1.2 Proof of Corollary 1.4

By Prop 2.1, \(C(\mu ,\nu )=-J(u)\), where u is an optimal transport potential and J is the Kantorovich functional. Now, since u and \(u^{c}\) are Lipschitz continuous (with Lipschitz constant \(\sqrt{n}\)) we can approximate the integrals defining J(u) to get

$$\begin{aligned} -C(\mu ,\nu )=J(u):=\int u\mu +\int u^{c}\nu =\int u\mu ^{(k)}+\int u^{c}\nu ^{(k)}+O(k^{-1}) \end{aligned}$$

(see [13, Page 2]). Next, by Corollary 1.3, \(u=u_{m_{k}}^{(k)}+O(k^{-1}\log k)\). Similarly, applying Corollary 1.3 to the situation where the roles of \(\mu \) and \(\nu \) have been reversed shows that \(v=v_{m_{k}}^{(k)}+O(k^{-1}\log k)\) and hence

$$\begin{aligned} C(\mu ,\nu )=-\sum _{x_{i}\in \Lambda _{k}}p_{i}u_{m_{k}}^{(k)}(x_{i})-\sum _{y_{i}\in \Lambda _{k}}q_{i}v_{m_{k}}^{(k)}(y_{i})+O(k^{-1}\log k) \end{aligned}$$

The proof is thus concluded by expressing \(u_{m_{k}}^{(k)}\) and \(v_{m_{k}}^{(k)}\) in terms of \(a^{(k)}(m)\) and \(b^{(k)}(m)\), respectively (formulas 1.8, 1.12) and using that \(k^{-1}\log p_{i}=k^{-1}\log \left( e^{-f(x_{i})}(1+O(k^{-1}))\right) =O(k^{-1})\) and similarly for \(q_{i}\). This proves the first statement in Corollary 1.4. The last statement then follows from the observation that

$$\begin{aligned} \frac{1}{2}d(\mu ^{(k)},\nu ^{(k)})^{2}=\frac{1}{2}d(\mu ,\nu )^{2}+O(k^{-1}) \end{aligned}$$

This observation is without doubt standard but, for completeness, we provide a proof. Denote by \(X_{k}\) and \(Y_{k}\) the support of \(\mu ^{(k)}\) and \(\nu ^{(k)}\), respectively. By definition, \(\frac{1}{2}d(\mu ^{(k)},\nu ^{(k)})^{2}\) is the infimum of \(\left\langle c,\gamma \right\rangle \) over all probability measures on \(X_{k}\times Y_{k}\) with marginals \(\mu ^{(k)}\) and \(\nu ^{(k)}\), respectively (with the usual identifications). Since any probability measure \(\gamma \) on \(T^{n}\times T^{n}\) with marginals \(\mu ^{(k)}\) and \(\nu ^{(k)}\) is automatically supported on \(X_{k}\times Y_{k}\) we may as well take the infimum over all such \(\gamma \). Hence, \(\frac{1}{2}d(\mu ^{(k)},\nu ^{(k)})^{2}=C(\mu ^{(k)},\nu ^{(k)})\) on \(T^{n}\times T^{n}\). Kantorovich duality then gives

$$\begin{aligned} -\frac{1}{2}d(\mu ^{(k)},\nu ^{(k)})^{2}= & {} \inf _{u:\,u^{cc}=u}\left( \int u\mu ^{(k)}+\int u^{c}\nu ^{(k)}\right) \\= & {} \inf _{u:\,u^{cc}=u}\left( \int u\mu +\int u^{c}\nu \right) +O(k^{-1}), \end{aligned}$$

where the last equality follows from the argument in the beginning of the proof of the corollary. Using Kantorovich duality again thus concludes the proof.

4.2 Comparison with previous convergence rates for the Sinkhorn algorithm and sharpness

As pointed out in Remark 2.10 it is well-known that, for k fixed, \(u_{m}^{(k)}\) converges exponentially in \(C(X)/\mathbb {R}\) towards a fixed point \(u_{\infty }^{(k)}:\)

$$\begin{aligned} \left\| u_{m}^{(k)}-u_{\infty }^{(k)}\right\| _{C(X)/\mathbb {R}}\le C_{0}e^{-\delta _{k}m} \end{aligned}$$

(i.e. it converges at linear rate in the terminology of numerical analysis). The general estimate on \(e^{-\delta _{k}}\) provided in [36] gives that \(1-e^{-\delta _{k}}\) is comparable to \(1-2e^{-k}\) if we assume, for simplicity, that c is Lipschitz continuous with Lipschitz constant 1 and that the diameter of X is equal one (see [66]). Hence, \(\delta _{k}\) is comparable to \(e^{-k}\), which means that, for k large, on would needs to run \(O(e^{k})\) iterations to get close to the fixed point \(u_{\infty }^{(k)}\). On the order hand experimental findings reported in [27, Remark 4.15] and [35] suggest that in the “geometric settings”, such as in present torus setting, \(\delta _{k}\) is of the order \(O(k^{-1})\). Hence, only O(k) iterations should be needed to get close to the limiting fixed point (see also [59, Prop 18], where this is shown for a modified asymmetric version of the Sinkhorn algorithm, inspired by the auction algorithm). This is confirmed by Theorem 1.2 (and the proof of Corollary 1.3). Indeed, setting \(m_{k}=t_{k}k\) and applying Theorem 1.2 gives

$$\begin{aligned} \left\| u_{m_{k}}^{(k)}-u_{\infty }^{(k)}\right\| _{C(X)}\le Ct_{k}k^{-1}+\left\| u_{t_{k}}-u_{\infty }\right\| _{C(X)}+\left\| u_{\infty }-u_{\infty }^{(k)}\right\| _{C(X)} \end{aligned}$$

Hence, by the exponential convergence in Prop. 4.5 and Theorem 1.1

$$\begin{aligned} \left\| u_{m_{k}}^{(k)}-u_{\infty }^{(k)}\right\| _{C(X)}\le Ct_{k}k^{-1}+Ae^{-t_{k}/A}+\epsilon _{k}. \end{aligned}$$

In particular, by taking \(m_{k}\sim Tk\) for a sufficiently large fixed number T one can make the right hand side above arbitrarily small, for k large. But an important point of Corollary 1.3 is that it provides a quantitative estimate on \(\left\| u_{m_{k}}^{(k)}-u_{\infty }\right\| _{C(X)}\) when \(m_{k}\sim Ak\log k\), where \(u_{\infty }\) is the optimal transport potential, which arises as the limit of the corresponding parabolic equation. It should also be pointed out that it can be shown that the asymptotics in Prop 4.3 are sharp in the sense that the error term cannot be improved to \(o(k^{-1})\), in general. Moreover, inspection of the proof of the Laplace method suggests that f and g should have at least two derivatives to get the order \(O(k^{-1})\). Hence, the rate of convergence in Theorem 1.2 can be expected to be sharp.

5 Convergence towards parabolic optimal transport equations on compact manifolds

Let X and Y be compact smooth manifolds (without boundary) and c a lsc function on \(X\times Y\), taking values in \(]-\infty ,\infty ]\), which is smooth on the complement of a closed proper subset, denoted by \(\text {sing } {(c)}\).

Remark 5.1

Note that, in contrast to previous sections, we do not assume that c is continuous on all of \(X\times Y\). See for example Sect. 6.3.3 for a relevant example where \(c(x,y)=-\log |x-y|\) for X and Y given by the unit-sphere in \(\mathbb {R}^{n+1}\) (hence \(\text {sing } {(c)}\) is the diagonal in \(X\times Y\)). In the Riemannian case when \(X=Y\) and \(c=d^{2}/2\) (where d is the Riemannian distance function) c is, of course, continuous on all of \(X\times Y\), but \(\text {sing } {(c)}\) is non-empty, due to the presence of the cut-locus.

We will denote by \(\partial _{x}\) the vector of partial derivatives defined wrt a choice of local coordinates around a fixed point \(x\in X\). Given two normalized volume forms \(\mu \) and \(\nu \) in \(\mathcal {P}(X)\) and \(\mathcal {P}(Y)\) we locally express

$$\begin{aligned} \mu =e^{-f}dx,\,\,\,\nu =e^{-g}dy \end{aligned}$$

in terms of the local volume forms dx and dy determined by a choice of local coordinates. We assume that f and g are \(C^{\infty }\)-smooth.

Following standard practice we will assume that the cost function satisfies the following assumptions

  • (A1) (“Twist condition”) The map \(y\mapsto \partial _{x}c(x,y)\) is injective for any \((x,y)\in X\times Y-\text {sing } {(c)}\)

  • (A2) (“Non-degeneracy”) \(\det (\partial _{x_{i}}\partial _{y_{j}}c)(x,y)\ne 0\) for any \((x,y)\in X\times Y-\text {sing } {(c)}\)

Remark 5.2

See [68] for an in-depth discussion of various assumptions on cost functions. In [68] A1+A2 is called the strong twist condition and, as pointed out in [68, Remark 12.23], it holds for the cost function derived from any well-behaved Lagrangian, including the Riemannian setting where \(c=d^{2}/2\).

Definition 5.3

The space \(\mathcal {H}(X)\) (or \(\mathcal {H}\) for short) of all smooth potentials on X is defined as the subspace of \(C^{\infty }(X)\) consisting of all c-convex (i.e. such that \((u^{c})^{c}=u\)) smooth functions u on X with the property that the subset \(\Gamma _{u}\) of \(X\times Y\) defined by formula 2.2 is the graph of a diffeomorphism, denoted by \(F_{u}\), and that c is smooth in a neighborhood of \(\Gamma _{u}\).

The definition has been made so that, if \(u\in \mathcal {H}\) and \((F_{u})_{*}\mu =\nu \), then \(F_{u}\) is an optimal map (diffeomorphism) wrt the cost function c (by Prop 2.1). Accordingly, we will call u the potential of the map \(F_{u}\).

Lemma 5.4

Assume that c satisfies condition A1. If \(u\in \mathcal {H}\), then

$$\begin{aligned} y=F_{u}(x)\Longleftrightarrow (\partial _{x}c)(x,y)+(\partial _{x}u)(x)=0 \end{aligned}$$
(5.1)

Proof

By the very definition of \(F_{u}\) we have that \(y=F_{u}(x)\) iff \(u^{c}(y)=-c(x,y)-u(x)\). In turn, by the definition of \(u^{c}\) this equivalently means that x maximizes the usc function \(x'\mapsto \mathcal {C}(x'):=-c(x',y)-u(x')\) on X. Now assume first that \(y=F_{u}(x)\). Then, by assumption, the function \(\mathcal {C}\) is smooth close to x and hence the differential of \(\mathcal {C}\) vanishes at x, i.e. \((\partial _{x}c)(x,y)+(\partial _{x}u)(x)=0\). Conversely, if the latter equation holds then, by the twist assumption A1, y is uniquely determined by x and since (as explained above) \(F_{u}(x)\) satisfies the equation in question it follows that \(y=F_{u}(x)\). \(\square \)

Example 5.5

Assume that \(X=Y\), that \(c(x,y)\ge 0\) with equality iff \(x=y\), and that c is smooth in a neighborhood of the diagonal. Then \(u=0\) is in \(\mathcal {H}\) (with \(F_{0}\) given by the identity) and, since \(\mathcal {H}\) is open in the \(C^{\infty }\)-topology, small perturbations of 0 yield further examples of potentials. In particular, this applies in the “Riemannian setting”, where \(c=d^{2}/2\) on a compact Riemannian manifold X.

5.1 Parabolic optimal transport equations

Consider now the following parabolic PDE, introduced in [44]:

$$\begin{aligned} \frac{\partial u_{t}(x)}{\partial t}=\log \det \left( \partial _{x}F_{u_{t}}\right) -g(F_{u_{t}}(x))+f(x) \end{aligned}$$
(5.2)

expressed in terms of a choice of local coordinates, where \(\det \left( \partial _{x}F_{u}\right) \) denotes the local Jacobian of the map \(x\mapsto F_{u}(x)\). We note that

  • The right hand side in the equation 5.2 is globally well-defined (i.e. independent of the choice of local coordinates around \((x,F_{u}(x))\) in \(X\times Y)\). Indeed, it is equal to the logarithm of the quotient of \((F_{u})_{*}\mu \) and \(\nu \). Accordingly, u is a stationary solution iff it is the potential of an optimal transport map.

  • Differentiating Eq. 5.1 reveals that (see [68, Section 12]; a numerical sketch of the resulting evolution follows this list)

    $$\begin{aligned} \det (\partial _{x}F_{u})=\frac{\det \left( (\partial _{x}^{2}c)(x,F_{u}(x))+(\partial _{x}^{2}u)(x)\right) }{\det \left( (\partial _{x}\partial _{y}c)(x,F_{u}(x))\right) }. \end{aligned}$$
    (5.3)
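The following is a minimal numerical sketch (illustrative choices throughout) of an explicit Euler discretization of the evolution equation 5.2, assuming the flat one-dimensional torus with \(c=d^{2}/2\), so that \(F_{u}(x)=x+u'(x)\) and \(\det (\partial _{x}F_{u})=1+u''(x)\) by formula 5.3; the densities f and g, the grid size and the step size are arbitrary choices, not taken from the paper.

```python
# A minimal sketch of one explicit Euler step for (5.2) on the flat 1d torus,
# where F_u(x) = x + u'(x) and det(dF_u) = 1 + u''(x) by formula (5.3).
import numpy as np

N = 256
x = np.linspace(0.0, 1.0, N, endpoint=False)
freq = 2j * np.pi * np.fft.fftfreq(N, d=1.0 / N)   # spectral derivative symbols

f = lambda x: 0.1 * np.sin(2 * np.pi * x)   # mu = e^{-f} dx (illustrative)
g = lambda y: 0.1 * np.cos(2 * np.pi * y)   # nu = e^{-g} dy (illustrative)

def euler_step(u, dt=1e-3):
    u_hat = np.fft.fft(u)
    du = np.real(np.fft.ifft(freq * u_hat))         # u'
    d2u = np.real(np.fft.ifft(freq ** 2 * u_hat))   # u''
    Fu = (x + du) % 1.0                             # the map F_u on the torus
    return u + dt * (np.log(1.0 + d2u) - g(Fu) + f(x))

u = 0.01 * np.sin(2 * np.pi * x)   # smooth initial potential with 1 + u'' > 0
for _ in range(100):
    u = euler_step(u)
```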

5.2 Convergence of the Sinkhorn algorithm towards parabolic optimal transport

We next introduce the following stronger local form of the “global” density property appearing in Lemma 3.1.

Definition 5.6

A sequence (or family) of probability measures \(\mu ^{(k)}\) on X, converging weakly towards a measure \(\mu \), is said to have the local density property at order s (and length scale \(k^{-1/2}\)) if there exist a positive integer \(s\in [2,\infty [\) and constants \(C_{1}\) and \(C_{2}\) such that for any fixed \(x_{0}\in X\) there exist local coordinates \(\xi :=(\xi _{1},\ldots ,\xi _{n})\) centered at \(x_{0}\) with the following property: for any sequence \(h_{k}\) defined on the polydisc \(2D_{k}\) of radius \(2\log k\), centered at 0 in \(\mathbb {R}^{n}\), and satisfying \(\left| \partial ^{\alpha }h_{k}\right| (x)\le C_{1}e^{-|x|^{2}/C_{1}}\) on \(2D_{k}\) for all multiindices \(\alpha \) with \(|\alpha |\le s\), the bound

$$\begin{aligned} k^{n/2}\int _{D_{k}}h_{k}(F_{x_{0}}^{(k)})_{*}(\mu ^{(k)}-\mu )\le C_{2}k^{-1}, \end{aligned}$$

holds, where \(F_{x_{0}}^{(k)}\) is the scaled coordinate map from a neighborhood of \(x_{0}\) in X into \(\mathbb {R}^{n}\) defined by \(F_{x_{0}}^{(k)}(x):=k^{1/2}\xi (x)\). If this property holds for some \(s\in [2,\infty [\) we will simply say that \(\mu ^{(k)}\) has the local density property.

We are now ready for the following generalization of Theorem 1.2 stated in the introduction (as in the torus setting in Sect. 1.1.2 the entropic regularization parameter is expressed as \(\epsilon =k^{-1}\), where k is a positive number).

Theorem 5.7

Let c be a function satisfying the assumptions A1 and A2 and \(\mu ^{(k)}\) and \(\nu ^{(k)}\) be two sequences converging towards \(\mu \) and \(\nu \) in \(\mathcal {P}(X)\) and \(\mathcal {P}(Y)\), respectively, satisfying the local density property. Given \(u_{0}\in \mathcal {H}\) assume that there exists a solution \(u_{t}\) in \(C^{2}(X\times [0,T])\) of the parabolic PDE 5.2 with initial condition \(u_{0}\) and such that \(u_{t}\in \mathcal {H}\) for any \(t\in [0,T]\). Denote by \(u_{m}^{(k)}\) the iteration 2.21 defined by the data \((\mu ^{(k)},\nu ^{(k)},c)\) and such that \(u_{0}^{(k)}=u_{0}\) for any given k. Then there exists a constant C such that for any (m, k) satisfying \(m/k\in [0,T]\)

$$\begin{aligned} \sup _{X}\left| u_{m}^{(k)}-u_{m/k}\right| \le C\frac{m}{k}k^{-1}. \end{aligned}$$

Proof

The assumptions have been made precisely to ensure that the proof of Theorem 1.2 can be generalized, almost verbatim. Hence, we will be rather brief.

Step 1: Let \(\alpha \) be a lower-semicontinuous (lsc) function on X with a unique minimum at \(x_{0}\) and assume that \(\alpha \) is \(C^{4}\)-smooth on a neighborhood of \(x_{0}\) and \(\nabla ^{2}\alpha (x_{0})>0\) (in local coordinates centered at \(x_{0})\). Then, in local coordinates centered at \(x_{0}\),

$$\begin{aligned} k^{n/2}\int _{X}e^{-k\alpha }h\mu ^{(k)}=(2\pi )^{n/2}e^{-k\alpha (x_{0})}\frac{h(x_{0})e^{-f(x_{0})}}{\sqrt{\det (\partial _{x}^{2}\alpha (x_{0}))}}(1+O(k^{-1})). \end{aligned}$$
(5.4)

Using the local density assumption as a replacement of Lemma 8.1, this is shown essentially as before. The point is that the derivatives \(\partial ^{\alpha }\) of any fixed order of \(h_{k}(x):=\left( (F_{x_{0}}^{(k)})^{-1}\right) ^{*}\left( he^{-k\alpha }\right) \) are bounded by \(Ce^{-|x|^{2}/C}\) when \(|x_{i}|\le k^{1/2}\) (and in particular, for \(|x_{i}|\le 2\log k\)). This follows readily from the chain rule, as before. In fact, the proof of formula 5.4 is even a bit simpler than in the torus case, since the last step (Step 3) is not needed when the local density assumption holds at any point \(x_{0}\).

Step 2: If \(u\in \mathcal {H}\), then the following asymptotics holds

$$\begin{aligned} \rho _{ku}(x)=\det (\partial _{x}F_{u}(x))e^{f(x)-g(F_{u}(x))}(1+O(k^{-1})) \end{aligned}$$

To prove this, first observe that if \(u\in \mathcal {H}(X)\), then \(u^{c}\in \mathcal {H}(Y)\). Indeed, by assumption there is a unique \(x(=x_{y})\) such that \(y=F_{u}(x)\). Moreover, by the very definition of \(F_{u}\) (Definition 5.3) we can express

$$\begin{aligned} u^{c}(y)=-c(x_{y},F_{u}(x_{y}))-u(x_{y}). \end{aligned}$$

Since c is assumed to be smooth in a neighborhood of \(\Gamma _{u}\), the right hand side above defines a smooth function of \(x_{y}\) and, since \(F_{u}\) is a diffeomorphism, it follows that \(u^{c}(y)\) is smooth. Moreover, by symmetry \(\Gamma _{u^{c}}=\Gamma _{u}\), which can be identified with the graph of the diffeomorphism \(F_{u}^{-1}\). This shows that \(u^{c}\in \mathcal {H}(Y)\) and

$$\begin{aligned} F_{u^{c}}=(F_{u})^{-1} \end{aligned}$$
(5.5)

Setting \(y_{x}:=F_{u}(x)\) and \(x_{y}:=F_{u^{c}}(y)\) we can now apply the previous step, essentially as before, to get

$$\begin{aligned} \rho _{ku}(x)=\sqrt{\frac{\det \left( (\partial _{x}^{2}c)(x_{y},x)+\partial _{x}^{2}u(x_{y})\right) }{\det \left( (\partial _{y}^{2}c)(x,y_{x})+\partial _{y}^{2}u^{c}(y_{x})\right) }}e^{f(x)-g(F_{u}(x))}(1+O(k^{-1})) \end{aligned}$$

Finally, differentiating the relation 5.5 reveals that

$$\begin{aligned} \det ((\partial _{y}F_{u^{c}})(y_{x}))=\det ((\partial _{x}F_{u})(x))^{-1} \end{aligned}$$

and hence using Eq. 5.3 and symmetry (which ensures that the denominator appearing in Eq. 5.3 coincides with the one obtained when u is replaced by \(u^{c}\)) concludes the proof of Step 2.

Step 3: Conclusion of proof

The proof is concluded, as before, by invoking Lemma 4.4. \(\square \)

As pointed out in [44], it follows from standard short-time existence results for parabolic PDEs that a solution \(u_{t}\) as in the previous theorem exists for some \(T>0\) (see, for example, [1, Main Thm 1]). Moreover, by [44], long-time existence, i.e. \(T=\infty \), holds under the following further assumptions on c:

  • (A3) (“Stay-away property”) For any \(0<\lambda _{1},\lambda _{2}\) there exists \(\epsilon >0\), only depending on \(\lambda _{1},\lambda _{2}\), such that \(\lambda _{1}\le |\det \partial _{x}F_{u}|\le \lambda _{2}\implies \text {dist}(\Gamma _{u},\text {sing}(c))\ge \epsilon \) for any \(u\in \mathcal {H}\)

  • (A4) (“Semi-concavity”) c is locally semi-concave, i.e. locally the sum of a concave and a smooth function on the domain where it is finite.

  • (A5) (“Strong MTW-condition”) The Ma–Trudinger–Wang tensor of c is bounded from below on \(X\times Y-\text {sing } {(c)}\) by a uniform positive constant \(\delta \).

Theorem 5.8

(Kim-Streets-Warren [44]) Assume that c satisfies the assumptions A1–A5. Then, for any given \(u_{0}\in \mathcal {H}\) there exists a solution u(x, t) in \(C^{\infty }(X\times [0,\infty [)\) of the parabolic PDE 5.2 with initial condition \(u_{0}\) and such that \(u_{t}\in \mathcal {H}\) for any \(t>0\). Moreover, \(u_{t}\) converges exponentially in \(C^{0}(X)\), as \(t\rightarrow \infty \), to a potential \(u\in \mathcal {H}\) of a diffeomorphism \(F_{u}\) transporting \({\mu }\) to \({\nu }\), which is optimal wrt the cost function c.

Proof

Let us explain how to translate the result in [44] to the present setting. Following [44], a function \(u\in C^{2}(X)\) is said to be locally strictly c-convex if, in local coordinates, the matrix \((\partial _{x}^{2}c)(x,F_{u}(x))+(\partial _{x}^{2}u)(x)\) is positive definite. This condition is independent of the choice of local coordinates. Indeed, it equivalently means that any given \(x_{0}\in X\) is a non-degenerate local minimum for the function

$$\begin{aligned} x\mapsto c(x,F_{u}(x_{0}))+u(x)\,\,\,\text {on}\,X. \end{aligned}$$
(5.6)

It follows that for any such u the corresponding map \(F_{u}\) is a local diffeomorphism. The main result in [44] says that, under the assumptions on c in the statement above, for any initial datum \(u_{0}\in C^{\infty }(X)\) which is locally strictly c-convex, there exists a solution u(x, t) in \(C^{\infty }(X\times ]0,T])\) which is also locally strictly c-convex. To make the connection to the present setting, first note that if \(u\in \mathcal {H}\) then any given \(x_{0}\in X\) is even an absolute minimum for the function 5.6, which is non-degenerate (since \(F_{u}\) is a diffeomorphism), and hence u is locally strictly c-convex. Conversely, if u is locally strictly c-convex then [44, Cor 7.1] says that \(u\in C^{2}(X)\) is c-convex (i.e. \((u^{c})^{c}=u)\) and the proof given in [44, Cor 7.1] moreover shows that \(F_{u}\) is a global \(C^{1}\)-diffeomorphism. It then follows from Assumption A3 that c is smooth in a neighborhood of \(\Gamma _{u}\). Hence, \(u\in C^{\infty }(X)\) is locally strictly c-convex iff \(u\in \mathcal {H}\), which concludes the proof of the theorem. \(\square \)

Remark 5.9

Under the assumptions in the previous theorem it follows, in particular, that the optimal transport map is smooth. Conversely, the assumptions are “almost necessary” for regularity of the optimal transport map (see [68, Chapter 12] and references therein). Also note that the semi-concavity assumption is always satisfied in the case when \(X=Y\) is a compact Riemannian manifold and \(c=d^{2}/2\) [68, (b), Page 278].

Combining the exponential large-time convergence of \(u_{t}\), in the previous theorem, with Theorem 5.7 gives, just as in the torus setting, the following

Corollary 5.10

Assume that the cost function c satisfies the assumptions in Theorem 5.7 (or that \(c=d_{T^{n}}^{2}/2\) in the case of the torus \(T^{n}\)). Assume, moreover, that \(\mu ^{(k)}\) and \(\nu ^{(k)}\) satisfy the local density property. Then, for any given \(u_{0}\in \mathcal {H}\), there exists a positive constant \(A_{0}\) such that for any \(A>A_{0}\) the following holds: \(u_{k}:=u_{m_{k}}^{(k)}\), with \(m_{k}:=\left\lfloor Ak\log k\right\rfloor \), converges uniformly to the optimal transport potential u. More precisely, there exists a constant C (depending on A) such that

$$\begin{aligned} \sup _{X}\left| u_{k}-u\right| \le Ck^{-1}\log k \end{aligned}$$

In turn, this corollary implies, just as before, that the analog of Corollary 1.4 holds (under the same assumptions as in the previous corollary).

Example 5.11

The assumptions on c in the previous corollary are satisfied when \(X=Y\) is the n-sphere and \(c(x,y)=d^{2}(x,y)/2\) for the standard round metric or \(c(x,y)=-\log |x-y|\), where \(|x-y|\) denotes the chordal distance (see [44] and references therein). The latter case appears in the reflector antenna problem, as explained in Sect. 6.3.3.

5.3 Constructing discretizations using Quasi-Monte Carlo systems

In order to construct discretizations satisfying the local density property (Definition 5.6) on general compact manifolds we will employ Quasi-Monte Carlo systems, familiar from the theory of numerical integration. These can be viewed as generalizations of the standard grids with N points on the torus.

Let (Xg) be an n-dimensional compact Riemannian manifold and denote by dV the corresponding normalized volume form on X. Following [12] (and [17] in the case of a sphere) the worst case error of integration of points \(\{x_{i}\}_{i=1}^{N}\Subset X\) and weights \(\{w_{i}\}_{i=1}^{N}\Subset \mathbb {R}_{+}\) (assumed to sum to one) with respect to some Banach space W of continuous functions on X, is defined as

$$\begin{aligned} \text {wce } (\left\{ (x_{i},w_{i})\right\} _{i=1}^{N}):={\sup }\left\{ \left| \int fdV-\sum _{i=1}^{N}f(x_{i})w_{i}\right| :\,\,f\in W,\,\,\left\| f\right\| \le 1\right\} \end{aligned}$$

We will use the shorthand \(X_{N}\) for the weighted point cloud \(\left\{ (x_{i},w_{i})\right\} _{i=1}^{N}\) and call the corresponding discrete probability measure

$$\begin{aligned} \lambda _{X_{N}}:=\sum _{i=1}^{N}w_{i}\delta _{x_{i}} \end{aligned}$$
(5.7)

the sampling measure associated to \(X_{N}\). Let now \(W:=W_{p}^{s}(X)\) be the Sobolev space of all functions f on X such that all (fractional) distributional derivatives of order s are in \(L^{p}(X)\), and assume that \(s>n/p\) and \(p\in [1,\infty ]\) (which ensures that \(W_{p}^{s}(X)\subset C^{0}(X)\)). Then a sequence of weighted point clouds \(X_{N}:=\left\{ (x_{i},w_{i})\right\} _{i=1}^{N}\) is said to be a quasi-Monte Carlo system for \(W_{p}^{s}(X)\) if

$$\begin{aligned} \text {wce } {(X_{N})\le \frac{C}{\left( N^{1/n}\right) ^{s}}} \end{aligned}$$

for some uniform constant C (the corresponding lower bound holds for any sequence \(X_{N})\). In view of the applications to the present setting we introduce the following definition (modeled on the corresponding definition for \(p=2\) in [17]):

Definition 5.12

Given \(p\in [1,\infty ]\), a sequence of weighted point clouds \(X_{N}:=\left\{ (x_{i},w_{i})\right\} _{i=1}^{N}\) is said to be a generic Quasi-Monte Carlo p-system if \(X_{N}\) is a quasi-Monte Carlo system for \(W_{p}^{s}(X)\) for any positive integer \(s>n/p\). Moreover, \(X_{N}\) is a completely generic QMC system if it is a generic Quasi-Monte Carlo p-system for any \(p\in [1,\infty ]\).

Example 5.13

When \(X=T^{n}\) the standard grid with N points on \(T^{n}\) (i.e. with “edge length” \(1/N^{1/n}\)) and with all weights taken to be equal to 1/N defines a completely generic Quasi-Monte Carlo system \(X_{N}\) [13]. Moreover, in the case of the standard n-dimensional sphere \(S^{n}\) it follows from [15, 17] that there exists a sequence of completely generic quasi-Monte Carlo systems with all weights equal to 1/N. The corresponding point sets are taken as spherical t-designs with \(t\sim N^{1/n}\) (these are defined as quadrature points, discussed in Sect. 5.3.1 below, with all weights equal). Such points have been generated for large values of N [71]. But allowing different weights has the advantage that explicit completely generic QMC systems can be constructed, as discussed below.
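As a sanity check of the grid case in the previous example (in one dimension, with an arbitrary smooth test function), the following sketch verifies numerically that the uniform grid with equal weights integrates smooth periodic functions with an error decaying faster than any power of 1/N.

```python
# A small numerical check of Example 5.13 for n = 1: spectral accuracy of the
# uniform grid with equal weights (the test function is an arbitrary choice).
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(np.sin(2 * np.pi * x))   # smooth periodic test function
exact, _ = quad(f, 0.0, 1.0, limit=200)

for N in (4, 8, 16, 32):
    xs = np.arange(N) / N                     # grid points, all weights 1/N
    print(N, abs(np.mean(f(xs)) - exact))     # error decays faster than any power
```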

Remark 5.14

Recall that the covering radius \(\rho _{N}\) of a point cloud \(\{x_{1},\ldots ,x_{N}\}\) on (Xg) is defined by \(\sup _{x\in X}\min _{i\le N}d(x,x_{i})\). By [12, Prop 4.1, 4.3], if \(X_{N}\) is a weighted generic QMC 1-system then the corresponding covering radius \(\rho _{N}\) is comparable to \(N^{-1/n}\).

In order to apply this setup to the present setting of the Sinkhorn iteration we will need to adapt the number N of points to the value of the parameter k (defined as the inverse of the entropic regularization parameter). This is made precise by the following lemma, quantifying how large \(N(=N_{k})\) needs to be:

Lemma 5.15

Let s be a positive integer. Assume that \(X_{N_{k}}:=\left\{ (x_{i}^{(k)},w_{i}^{(k)})\right\} _{i=1}^{N_{k}}\) is a sequence of quasi-Monte Carlo systems for \(W_{p}^{s}(X)\) (where \(s>n/p\)) indexed by a parameter k. Then the corresponding sampling measures \(\lambda _{k}(:=\lambda _{N_{k}})\) have the local density property at order s and length scale \(k^{-1/2}\) (Definition 5.6) if the “number condition”

$$\begin{aligned} N_{k}^{-1/n}\le \frac{C}{k^{1/2}}\frac{1}{k^{\left( 1+\frac{n}{2}(1-1/p)\right) /s}} \end{aligned}$$
(5.8)

holds for some positive constant C.

Proof

Given a sequence \(h_{k}\) as in Definition 5.6 set \(f_{k}:=(F_{x_{0}}^{(k)})^{*}(\chi _{k}h_{k})\), where \(\chi _{k}(x):=\chi (x/\log k)\) for a given smooth function \(\chi \) in \(\mathbb {R}^{n}\) equal to one on the polydisc \(D_{1}\) centered at 0 and with support in \(2D_{1}\). By the assumed quasi-Monte Carlo property there exists a constant C such that

$$\begin{aligned} \left| \int _{X}f_{k}(\lambda _{k}-dV)\right| \le C\frac{1}{N_{k}^{s/n}}\sum _{|\alpha |\le s}\left( \int |\partial ^{\alpha }f_{k}|^{p}dV\right) ^{1/p}. \end{aligned}$$

Multiplying both sides by \(k^{n/2}\) and using the chain rule thus gives

$$\begin{aligned}&k^{n/2}\left| \int _{D_{k}}h_{k}(F_{x_{0}}^{(k)})_{*}(\lambda _{k}-dV)\right| \le k^{n/2}\left| \int _{2D_{k}}\chi _{k}h_{k}(F_{x_{0}}^{(k)})_{*}(\lambda _{k}-dV)\right| \\&\quad \le \frac{Ck^{n/2}}{N_{k}^{s/n}}\left( \sum _{|\alpha |\le s}k^{|\alpha |/2}\int _{2D_{k}}|\partial ^{\alpha }(\chi _{k}h_{k})|^{p}(F_{x_{0}}^{(k)})_{*}dV\right) ^{1/p}\\&\quad \le C'\frac{k^{n/2}k^{-n/2p}}{N_{k}^{s/n}}k^{s/2}\sum _{|\alpha |\le s}\left( \int _{2D_{k}}|\partial ^{\alpha }(\chi _{k}h_{k})|^{p}dx\right) ^{1/p}. \end{aligned}$$

Now, by assumption, \(|\partial ^{\alpha }h_{k}|(x)\le Ce^{-|x|^{2}/C}\) on \(2D_{k}\) for all \(|\alpha |\le s\). Moreover, \(|\partial ^{\alpha }\chi _{k}|\) is uniformly bounded (since \(\partial ^{\alpha }\chi _{k}=(\partial ^{\alpha }\chi )(\cdot /\log k)/(\log k)^{|\alpha |}\)) and hence the sum in the right hand side above is uniformly bounded. Since the number condition ensures that the factor in front of the sum is bounded by a constant times \(k^{-1}\), this concludes the proof. \(\square \)

Remark 5.16

When \(X=T^{1}\) and \(s=2\) and \(p=1\) (which satisfies \(s>n/p)\) the previous lemma is closely related to the case \(n=1\) of Lemma 8.1 (recall that the case \(X=T^{n}\) can be reduced to the case when \(n=1)\).

5.3.1 Constructing completely generic QMC systems

Finally, we recall how to construct completely generic QMC systems on any n-dimensional compact Riemannian manifold (X, g), generalizing the standard grid on the torus, following [13]. Given a positive number W (the “bandwidth”) we denote by \(H_{\le W}(X,g)\) the finite dimensional subspace of \(C^{\infty }(X)\) consisting of all eigenfunctions of the Laplacian with eigenvalue bounded from above by \(W^{2}\). A weighted point cloud \(X_{N}\) is said to consist of weighted quadrature points for \(H_{\le W}(X,g)\) if the corresponding numerical integration error vanishes for any \(f\in H_{\le W}(X)\). For any sufficiently large constant \(C_{X}\) there exists a sequence of weighted quadrature points \(X_{N}\) for \(H_{\le C_{X}N^{1/n}}(X)\) (as follows from [13, Cor 2.11] and the Weyl asymptotics, saying that the dimension of \(H_{\le W}(X)\) grows like a constant times \(W^{n}\)). Moreover, by [13, Cor 2.13] such a sequence \(X_{N}\) defines a completely generic QMC system.

Example 5.17

When (X, g) is the round two-sphere \(S^{2}\) it is customary to rewrite \(W^{2}=l(l+1)\). Letting l range over all non-negative integers then enumerates all eigenvalues of the Laplacian and the corresponding space \(H_{\le W}(X)\) coincides with the space of all spherical polynomials of degree at most l (see Sect. 6.3.2). In this case the most commonly used explicit weighted quadrature points are obtained using various longitude-latitude rules; see [40, Section 4.1] and also [31, Thm 3] (applied to \((l,m)=(0,0)\)) for the “equi-angular” case.

5.4 Reducing the numbers \(N_{k}\) of discretization points by exploiting higher regularity of the data

Coming back to the setup in the beginning of Sect. 5 we fix two sequences \(X_{N}\) and \(Y_{N}\) of N weighted points on X and Y, respectively. We will assume that \(X_{N}\) and \(Y_{N}\) are completely generic QMC systems defined with respect to some Riemannian volume forms \(dV_{X}\) and \(dV_{Y}\) on X and Y, respectively. Fix \(\delta \in ]0,1/2]\) and index the number of points N by the parameter k so that

$$\begin{aligned} N_{k}^{1/n}\ge C_{\delta }k^{1/2+\delta },\,\,\,\delta>0,\,\,C_{\delta }>0 \end{aligned}$$
(5.9)

(by Remark 5.14 this equivalently means that the covering radius of the corresponding point clouds is of the order \(O(1/k^{1/2+\delta })\)). We then define a family of “discretizations” \(\mu ^{(k)}\) and \(\nu ^{(k)}\) of \(\mu \) and \(\nu \), by proceeding as in the torus case, but replacing the sampling measure \(\delta _{\Lambda _{k}}\) with the sampling measures corresponding to \(X_{N}\) and \(Y_{N}\). In other words, setting \(\rho _{X}:=\mu /dV_{X}\), the discrete probability measure \(\mu ^{(k)}\) is defined as

$$\begin{aligned} \mu ^{(k)}:=\rho _{X}\lambda _{X_{N_{k}}}/Z_{N_{k}}, \end{aligned}$$

where \(Z_{N_{k}}\) is the normalizing constant (and \(\nu ^{(k)}\) is defined in a similar fashion).

Lemma 5.18

The sequences \(\mu ^{(k)}\) and \(\nu ^{(k)}\) have the local density property.

Proof

First observe that it follows directly from the assumptions on \(X_{N}\) and \(Y_{N}\) that the corresponding normalizing constants are equal to \(1+O(k^{-1})\) (in fact, the error terms are of the order \(O(k^{-\infty })\) since the densities are assumed to be smooth). Hence, it will be enough to show that the local density property holds for the sequences \(\rho _{X}\lambda _{X_{N_{k}}}\) and \(\rho _{Y}\lambda _{Y_{N_{k}}}\). But then the result is reduced to Lemma 5.15 by simply replacing \(h_{k}\) with \(\tilde{h}_{k}:=h_{k}\cdot \left( (F_{x_{0}}^{(k)})^{-1}\right) ^{*}\rho \), where \(\rho \) is the density of \(\mu \) or \(\nu \). Indeed, by the Leibniz rule and the chain rule \(\tilde{h}_{k}\) also satisfies the estimates in the lemma. Thus taking \(s=1/\delta \) and \(p=1\) and invoking Lemma 5.15 concludes the proof. \(\square \)

This means that Theorem 5.7 and Corollary 5.10 apply to the discretization scheme above, as long as the number \(N_{k}\) of points satisfies the bound 5.9 for some \(\delta >0\).

Remark 5.19

The previous argument shows that if the densities of \(\mu \) and \(\nu \) are in \(C^{s}(X)\) for some integer \(s\in [2,\infty [\) with \(s>n/p\) for \(p\in [1,\infty [\), then the local density condition (at order s) holds if \(N_{k}\) satisfies the condition 5.8. In particular, if \(s>n\) one can take \(p=1\) and then the condition is that \(N_{k}^{1/n}\ge k^{1/2+1/s}\). In the particular case of a uniform grid on the torus the condition is thus that the edge-length of the grid is bounded from above by \(Ck^{-(1/2+1/s)}\) (the assumption \(s>n\) is not needed in this case, since one can reduce to the case when \(n=1\)).
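For orientation, the following tiny sketch evaluates the number condition 5.8 numerically, with the constant C set to one and \(p=1\) as in the remark above; all numerical choices are illustrative.

```python
# A tiny sketch of the number condition (5.8), with C = 1 and p = 1 as in
# Remark 5.19; the parameter values are illustrative.
import math

def min_points(k, n, s, p=1.0):
    # (5.8): N_k^{-1/n} <= k^{-1/2} * k^{-(1 + (n/2)(1 - 1/p))/s}
    exponent = 0.5 + (1.0 + (n / 2.0) * (1.0 - 1.0 / p)) / s
    return math.ceil(k ** (exponent * n))

print(min_points(k=100, n=2, s=4))   # smoother densities (larger s): fewer points
print(min_points(k=100, n=2, s=2))
```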

6 Nearly linear complexity on the torus and the sphere

In this section we start by showing that the convergence results in Sect. 5 hold in a more general setting where the kernel \(\mathcal {K}^{(k)}(x,y):=e^{-kc(x,y)}\) is replaced with an appropriate approximate kernel. This extra flexibility is then applied in the setting of optimal transport on the two-sphere, using “band-limited” heat-kernels, where it leads to a nearly linear algorithmic cost for the corresponding Sinkhorn iterations.

6.1 Sequences \(c_{k}\) and approximate kernels \(K_{k}\)

Just as in the generalization of the (static) Theorem 3.3, considered in Sect. 3.7, the (dynamic) Theorem 5.7 can be generalized by replacing the cost function c with a suitable sequence \(c_{k}\). But then the uniform convergence of \(c_{k}\) towards c (formula 3.13) has to be supplemented with further asymptotic properties on the complement of the singularity locus of c. For example, the proof of Theorem 5.7 goes through, almost word for word, if the upper bound corresponding to 3.13 holds globally, i.e.:

$$\begin{aligned} e^{-kc_{k}(x,y)}\le O(e^{\epsilon k})e^{-kc(x,y)} \end{aligned}$$
(6.1)

(where \(O(e^{\epsilon k})\) denotes a sequence of sub-exponential growth) and \(c_{k}\) has the following further property: on any given compact subset in the complement of \(\text {sing } {(c)}\) there exists a strictly positive smooth function \(h_{0}(x,y)\) and a uniformly bounded sequence \(r_{k}(x,y)\) of functions such that

$$\begin{aligned} \mathcal {K}^{(k)}(x,y):=e^{-kc_{k}(x,y)}=e^{-kc(x,y)}(h_{0}(x,y)+k^{-1}r_{k}(x,y)) \end{aligned}$$
(6.2)

This implies, in particular, that if Theorem 3.3 holds for a given kernel \(\mathcal {K}^{(k)}\), then it also holds for any other kernel \(\tilde{\mathcal {K}}^{(k)}\) with relative error \(O(k^{-1})\) as an approximation of \(\mathcal {K}^{(k)}\), i.e. such that

$$\begin{aligned} |\mathcal {K}^{(k)}-\tilde{\mathcal {K}}^{(k)}|\le Ck^{-1}\mathcal {K}^{(k)} \end{aligned}$$
(6.3)

or such that \(\tilde{\mathcal {K}}^{(k)}\) has absolute error \(e^{-Ck}\), for C sufficiently large, i.e.

$$\begin{aligned} |\mathcal {K}^{(k)}-\tilde{\mathcal {K}}^{(k)}|\le e^{-Ck},\,\,\,C>\inf _{X\times Y}c \end{aligned}$$
(6.4)

6.2 Heat kernel approximations in the Riemannian setting

Consider now the Riemannian setting, where X is a compact Riemannian manifold (without boundary), \(c=d^{2}/2\) and \(c_{k}\) is defined in terms of the heat kernel (formula 3.14):

Theorem 6.1

Let X be a compact Riemannian manifold (without boundary) and set \(c(x,y):=d(x,y)^{2}/2\), where d is the Riemannian distance function. Then the results in Theorem 5.7 and Corollary 5.10 still hold when the matrix kernel \(e^{-kd^{2}(x,y)/2}\) is replaced with the heat kernel \(\mathcal {K}_{2k^{-1}}(x,y)\) (at time \(t=2k^{-1}\)).

Proof

As discussed above this follows from the following heat kernel asymptotics (which are a special case of [7, Thm 3.1] and more generally hold for the heat kernel associated to a suitable hypoelliptic operator). Assume that x and y are contained in a compact subset of the complement of the cut-locus. Then

$$\begin{aligned} \mathcal {K}_{t}(x,y)=t^{-n/2}e^{-t^{-1}d^{2}(x,y)/4}\left( h_{0}(x,y)+tr_{1}(t,x,y)\right) , \end{aligned}$$

where \(h_{0}\) is smooth and \(h_{0}>0\) and \(r_{1}\) is smooth and uniformly bounded for \(t\in ]0,t_{0}]\). This is not exactly of the form 6.2 due to the presence of the factor \(t^{-n/2}:=A_{k}\). But it is, in fact, enough to know that 6.2 holds when the right hand side is multiplied with a sequence \(A_{k}\), only depending on k. Indeed, the iteration \(u_{m}^{(k)}\) is unaltered when the cost function \(c_{k}(x,y)\) is replaced by \(c_{k}(x,y)+C_{k}\) for some constant \(C_{k}\) (which is consistent, as it must be, with the fact that the parabolic equation 5.2 is unaltered when a constant is added to c). \(\square \)

The use of the heat kernel in the Sinkhorn algorithm for optimal transport on Riemannian manifolds was advocated in [63], where it was found numerically that discretized heat kernels provide substantial speedups, when compared to other methods. The previous theorem offers a theoretical basis for the experimental findings in [63], as long as the discretized heat kernels \(\tilde{\mathcal {K}}^{(k)}\) satisfy one of the approximation properties 6.3 and 6.4 (when compared with the corresponding bona fide heat kernel). However, the author is not aware of any general such approximation results in the discretized setting (but see [24] and references therein for various numerical approaches to discretizations of heat kernels). We will instead follow a different route, based on “band-limited” heat kernels and fast Fourier type transforms, applied to the case when X is the two-sphere.

6.3 Near linear complexity using fast transforms

Each iteration in the Sinkhorn algorithm amounts to computing two vector-matrix products of the form

$$\begin{aligned} a_{i}=\sum _{j=1}^{N}\mathcal {K}(x_{i},y_{j})b_{j},\,\,\,\,i=1,\ldots ,N, \end{aligned}$$
(6.5)

for a given function \(\mathcal {K}\) on \(X\times Y\) (followed by N inversions), where b and a denote generic “input” and “output” vectors, respectively. In general, this requires \(O(N^{2})\) arithmetic operations. But, as we will next exploit, in the presence of suitable symmetry, fast summation techniques can be used to lower the complexity to nearly linear, i.e. to at most \(CN(\log N)^{p}\) operations (for some positive constants C and p). Alternatively, separability properties of the kernels in question can often be exploited to directly decrease the complexity (as in [27, Remark 4.17]). For example, consider the case when \(X=X_{1}\times X_{2}\) and \(Y=Y_{1}\times Y_{2}\) and denote by \(\pi _{1}\) and \(\pi _{2}\) the projections on the first and second factors, respectively. If \(\mathcal {K}(x,y)\) is separable, i.e. factors as

$$\begin{aligned} \mathcal {K}(x,y)=\mathcal {K}_{1}\left( \pi _{1}(x),\pi _{1}(y)\right) \mathcal {K}_{2}\left( \pi _{2}(x),\pi _{2}(y)\right) \end{aligned}$$

one can take the point clouds \((x_{i})_{i\le N}\) and \((y_{i})_{i\le N}\) as the “grids” induced by given point clouds on each factor (consisting of \(N_{1}\) and \(N_{2}\) points, respectively). Then, by summing over one factor at a time, the complexity of the computation is reduced to \(O\left( N(N_{1}+N_{2})\right) \) operations; a minimal numerical sketch is given below. More generally, if the separability holds with r factors, then, by induction, the complexity becomes \(O\left( N(\max _{i\le r}N_{i})\right) \).
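The following is a minimal sketch of the separability trick just described, assuming Gaussian kernel factors and random point clouds (illustrative choices not taken from the paper).

```python
# A minimal sketch of the separable-kernel summation: summing over one factor
# at a time reduces the O(N^2) mat-vec to O(N (N_1 + N_2)), N = N_1 * N_2.
import numpy as np

N1, N2 = 64, 48
x1, y1 = np.random.rand(N1), np.random.rand(N1)
x2, y2 = np.random.rand(N2), np.random.rand(N2)
K1 = np.exp(-(x1[:, None] - y1[None, :]) ** 2)   # kernel on the first factor
K2 = np.exp(-(x2[:, None] - y2[None, :]) ** 2)   # kernel on the second factor
b = np.random.rand(N1, N2)                       # input vector on the grid

# Naive full kernel K[(i1,i2),(j1,j2)] = K1[i1,j1] * K2[i2,j2]: O(N^2).
a_full = np.einsum('ij,kl,jl->ik', K1, K2, b)

# Factored summation: first over the second factor, then over the first.
a_fast = K1 @ (b @ K2.T)
assert np.allclose(a_full, a_fast)
```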

6.3.1 Optimal transport on the flat torus

Let us first come back to the case of the flat torus \(T^{n}\) discretized by the discrete torus \(\Lambda _{k}\), considered in Sect. 1.1.3. Since \(\mathcal {K}(x,y):=e^{-kd^{2}(x,y)}\) is invariant under the diagonal action of the torus \(T^{n}\) it follows from standard arguments that the sums 6.5 can be computed in \(O(N\log N)\) arithmetic operations. Indeed, using the group structure on \(T^{n}\) we can write \(\mathcal {K}(x,y)=h(x-y)\), for some function h on \(\Lambda _{k}\). Then the classical convolution theorem of Fourier analysis, on the discrete torus \(\Lambda _{k}\) (viewed as a finite abelian group), gives (with \(m_{1},\ldots ,m_{N}\) the points of the dual discrete torus, identified with \(\Lambda _{k}\)):

$$\begin{aligned} a_{i}=\sum _{j=1}^{N}h(x_{i}-y_{j})b_{j}=\sum _{j=1}^{N}\hat{h}(m_{j})\hat{b}(m_{j})e^{2\pi im_{j}\cdot x_{i}},\,\,\,\,\hat{f}(m_{j}):=\sum _{i=1}^{N}f_{i}e^{-2\pi ix_{i}\cdot m_{j}} \end{aligned}$$

This requires evaluating two Discrete Fourier Transforms (DFT) at the \(N=k^{n}\) points \(m_{1},\ldots ,m_{N}\). Using the Fast Fourier Transform (FFT) this can be done in \(O(N\log N)\) arithmetic operations; see the sketch below. Note that, since the heat kernel is also torus invariant, the same argument can also be used for the kernel appearing in Theorem 6.1, in the torus case. Alternatively, using that \(\mathcal {K}(x,y)\) is separable on \(T^{n}\) (since, in general, the squared Riemannian distance function on a Riemannian product is the sum of the squared distances of the factors), the summing can directly be achieved with complexity \(O(N^{1+1/n})\) for any fixed point cloud on \(S^{1}\).
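The following is a minimal sketch of this FFT-based evaluation of the sums 6.5 in the one-dimensional case; the kernel h below is an illustrative Gaussian stand-in for \(e^{-kd^{2}/2}\) or a heat kernel.

```python
# A minimal sketch of the FFT-based evaluation of the sums (6.5) on the
# discrete torus (one dimension; the kernel h is an illustrative choice).
import numpy as np

k = 50
N = k                                  # grid points x_j = j/N on T^1
x = np.arange(N) / N
d = np.minimum(x, 1.0 - x)             # torus distance to 0
h = np.exp(-k * d ** 2 / 2.0)          # K(x,y) = h(x - y)
b = np.random.rand(N)

# Naive O(N^2) summation.
K = h[(np.arange(N)[:, None] - np.arange(N)[None, :]) % N]
a_slow = K @ b

# Convolution theorem on the finite abelian group Z/NZ: O(N log N).
a_fast = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(b)))
assert np.allclose(a_slow, a_fast)
```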

6.3.2 Optimal transport on the round two-sphere

Consider the round two-sphere \(S^{2}\) embedded as the unit-sphere in \(\mathbb {R}^{3}\). Removing the north and south poles on \(S^{2}\) we have the standard spherical (longitude-colatitude) coordinates \((\varphi ,\theta )\in [0,2\pi [\times ]0,\pi [\). A complete set of (non-normalized) eigenfunctions for the Laplacian on \(L^{2}(S^{2})\) is given by the spherical harmonics

$$\begin{aligned} Y_{l}^{m}(\varphi ,\theta ):=e^{im\varphi }P_{l}^{m}(\cos \theta ),\,\,\,\,|m|\le l, \end{aligned}$$

which have eigenvalue \(\lambda _{l,m}^{2}:=l(l+1)\). Here \(P_{l}^{m}(z)\) denotes, for \(z\in [-1,1]\), as usual, the Legendre function of degree l and order m (aka the associated Legendre polynomial); see, for example, [31, 40].

Given a positive number W (the “band-width”) we consider the band-limited heat kernel on the two-sphere:

$$\begin{aligned} \mathcal {K}_{t}(x,y)_{W}:=\sum _{|m|\le l\le W}c_{m,l}Y_{l}^{m}(x)\overline{Y_{l}^{m}(y)},\,\,\,c_{m,l}:=e^{-tl(l+1)}\left\| Y_{l}^{m}\right\| _{L^{2}}^{-2} \end{aligned}$$
(6.6)

(By the spectral theorem this means that \(\mathcal {K}_{t}(x,y)_{W}\) is the integral kernel of \(e^{-t\Delta }\Pi _{W}\), where \(\Pi _{W}\) is the orthogonal projection onto the space of all band-limited functions.)
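The following is a minimal sketch of the band-limited heat kernel 6.6, using the orthonormalized spherical harmonics provided by scipy (so that the \(L^{2}\)-normalizing factors in 6.6 are already absorbed and \(c_{m,l}=e^{-tl(l+1)}\)); the evaluation points and parameters are illustrative, and a naive double loop is used for clarity instead of the fast transforms discussed below.

```python
# A minimal sketch of the band-limited heat kernel (6.6) on S^2, with scipy's
# orthonormalized spherical harmonics; points and parameters are illustrative.
import numpy as np
from scipy.special import sph_harm

def heat_kernel_W(t, W, phi1, theta1, phi2, theta2):
    # phi = longitude in [0, 2*pi), theta = colatitude in (0, pi).
    val = 0.0
    for l in range(W + 1):
        for m in range(-l, l + 1):
            Y1 = sph_harm(m, l, phi1, theta1)   # args: (m, l, azimuth, polar)
            Y2 = sph_harm(m, l, phi2, theta2)
            val = val + np.exp(-t * l * (l + 1)) * Y1 * np.conj(Y2)
    return val.real

k = 20
print(heat_kernel_W(2.0 / k, 3 * k, 0.3, 1.0, 1.1, 0.8))
```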

Theorem 6.2

Consider the two-sphere, discretized by a given good Quasi-Monte Carlo (QMC) system, and take R such that \(R>1\). Then the analogs of all the results in Sect. 1.1.3 are valid when the matrix kernel \(e^{-kd^{2}(x,y)/2}\) is replaced by the band-limited heat kernel \(\mathcal {K}_{2k^{-1}}(x,y)_{Rk}\). Moreover, the arithmetic complexity of each Sinkhorn iteration is \(O(N^{3/2})\).

Proof

As recalled in Example 5.11 the cost function \(d(x,y)^{2}/2\) on the sphere satisfies the assumptions in Theorem 5.7 (with \(T=\infty \)) and Corollary 5.10.

Step 1: The asymptotics 6.1 and 6.2 are satisfied.

By Theorem 6.1 it is enough to observe that the following basic estimate holds if \(t=2k^{-1}\) and \(W=Rk:\)

$$\begin{aligned} \left| \mathcal {K}_{t}(x,y)-\mathcal {K}_{t}(x,y)_{W}\right| \le C_{\delta }e^{-2R^{2}k(1-\delta )} \end{aligned}$$

for any given \(\delta \in ]0,1[\). To prove the estimate note that

$$\begin{aligned} \left| \mathcal {K}_{t}(x,y)-\mathcal {K}_{t}(x,y)_{W}\right|&\le \sum _{l>W}\sum _{|m|\le l}e^{-2k^{-1}l(l+1)}\frac{\left\| Y_{l}^{m}\right\| _{L^{\infty }}^{2}}{\left\| Y_{l}^{m}\right\| _{L^{2}}^{2}}\\&\le 2Ck^{3}\sum _{l/k>R}e^{-2k(\frac{l}{k})^{2}}\frac{(l+1)^{2}}{k^{2}}\frac{1}{k}, \end{aligned}$$

using that the quotient involving \(Y_{l}^{m}\) is dominated by Cl (and that a given l corresponds to \(2l+1\) values of m). Indeed, this is a special case of the universal sup-norm estimates for an eigenfunction \(\Psi _{\lambda }\) of the Laplacian (with eigenvalue \(\lambda ^{2}\)) on a general n-dimensional Riemannian manifold [62], which give the growth factor \(C\lambda ^{n-1}\). Finally, dominating the Gaussian Riemann sum above with the integral of the function \(e^{-2ks^{2}}s^{2}\) over \([R,\infty [\) concludes the proof.

Step 2: Complexity analysis

Using formula 6.6 gives

$$\begin{aligned} a_{i}=\sum _{|m|\le l\le W}c_{m,l}\hat{b}_{l,m}Y_{l}^{m}(x_{i}),\,\,\,c_{m,l}=e^{-tl(l+1)}\left\| Y_{l}^{m}\right\| _{L^{2}}^{-2},\,\,\,\hat{b}_{l,m}:=\sum _{j=1}^{N}b_{j}\overline{Y_{l}^{m}(x_{j})} \end{aligned}$$

Here \(\hat{b}_{l,m}\) is the “forward discrete spherical Fourier transform” evaluated at (l, m). Once it has been computed for all (l, m), \(a_{i}\) becomes an “inverse discrete spherical Fourier transform” (with coefficients \(c_{m,l}\hat{b}_{l,m}\)). By separation of variables, both these transforms can be computed using a total of \(O(k^{3})(=O(N^{3/2}))\) arithmetic operations (see the discussion after formula 1.9 in [56]). \(\square \)

In the special case when the good Quasi-Monte Carlo system is defined by an “equi-angular” colatitude-longitude rule of N weighted points on \(S^{2}\), the arithmetic complexity of each iteration can be reduced to \(O(N(\log N)^{2})\) operations, using a fast discrete spherical Fourier transform [39].

6.3.3 Application to the reflector antenna problem

The extensively studied far field reflector antenna problem appears when \(X=Y=S^{n}\) is the n-dimensional sphere \(S^{n}\), embedded as the unit-sphere in \(\mathbb {R}^{n+1}\) and the cost function is taken as \(c(x,y):=-\log |x-y|\) [37, 69]. Briefly, the problem is to design a perfectly reflecting surface \(\Sigma \) in \(\mathbb {R}^{n+1}\) with the following property: when \(\Sigma \) is illuminated with light emitted from the origin with intensity \(\mu \in \mathcal {P}(S^{n})\) the output reflected intensity becomes \(\nu \in \mathcal {P}(S^{n})\) (of course, \(n=2\) in the usual applications). Representing \(\Sigma \) as a radial graph over \(S^{n}:\)

$$\begin{aligned} \Sigma :=\{h(x)x\},\,\,x\in S^{n}, \end{aligned}$$

for a positive function h on \(S^{n}\) it follows from the reflection law and conservation of energy that h satisfies the following Monge–Ampère type equation, expressed in terms of the covariant derivatives \(\nabla _{i}\) in local orthonormal coordinates:

$$\begin{aligned} \frac{\det (-\nabla _{i}\nabla _{j}h+2h^{-1}\nabla _{i}h\nabla _{j}h+(h-\eta )\delta _{ij})}{((|\nabla h|^{2}+h^{2})/2h)^{n}}=e^{g(F_{h}(x))-f(x)}, \end{aligned}$$
(6.7)

where \(\eta :=(|\nabla h|^{2}+h^{2})/2h\) and \(F_{h}(x)\) denotes the reflected direction of the ray emitted in the direction x (and \(\mu \) and \(\nu \) are represented as in 1.3). The equation is also supplemented with the “second boundary value condition” that \(F_{h}\) maps the support of \(\mu \) onto the support of \(\nu \). Assuming that f and g are smooth there exists a smooth solution h, which is unique up to scaling (see [19] and references therein).

Theorem 6.3

Consider the two-sphere, discretized by a given good Quasi-Monte Carlo (QMC) system. Let \(K^{(k)}\) be the \(N_{k}\times N_{k}\) matrix defined by

$$\begin{aligned} K_{ij}^{(k)}=|x_{i}^{(k)}-x_{j}^{(k)}|^{k} \end{aligned}$$

Consider the Sinkhorn algorithm associated to \((p^{(k)},q^{(k)},K^{(k)})\). There exists a positive constant \(A_{0}\) such that for any \(A>A_{0}\) the following holds: after \(m_{k}=\left\lfloor Ak\log k\right\rfloor \) Sinkhorn iterations the function \(h_{k}\) on \(S^{n}\) defined by the kth root of \(a_{k}:=a^{(k)}(m_{k})\) converges uniformly, as \(k\rightarrow \infty \), towards a solution h of the antenna equation 6.7 satisfying the corresponding second boundary value condition. More precisely, there exists a constant C (depending on A) such that

$$\begin{aligned} \sup _{S_{N}}|h_{k}-h|\le Ck^{-1}\log k, \end{aligned}$$

Moreover, the arithmetic complexity of each iteration is \(O(N^{3/2})\) in general and \(O(N(\log N)^{2})\) in the case of an equi-angular grid.

Proof

As recalled in Example 5.11 the cost function \(-\log |x-y|\) on the sphere satisfies the assumptions in Corollary 5.10. For the complexity analysis we first recall the general fact that any kernel \(K^{(k)}(x,y)\) which is radial, i.e. only depends on \(|x-y|\), may be expressed as

$$\begin{aligned} K^{(k)}(x,y)=\sum _{l=0}^{\infty }\sum _{|m|\le l}C_{m,l}Y_{l}^{m}(x)\overline{Y_{l}^{m}(y)} \end{aligned}$$
(6.8)

for some positive constants \(C_{m,l}\) (a proof will be given below). By the argument in the proof of Step 2 in Theorem 6.2, it will be enough to show that \(C_{m,l}=0\) when \(l>k\), i.e. that \(K^{(k)}(x,y)\) is already band-limited with \(W=k\). To this end we follow the general approach in [43]. First observe that when x and y are in \(S^{2}\) we can write \(|x-y|^{2}=2(1-x\cdot y)\). Hence, \(K^{(k)}(x,y)=2^{k/2}f^{(k)}(x\cdot y)\), where \(f^{(k)}(s)=(1-s)^{k/2}\) for \(s\in [-1,1]\) (assuming, for simplicity, that k is even, so that \(f^{(k)}\) is a polynomial of degree k/2). The Legendre polynomials \(p_{l}\,(=P_{l}^{0})\) form a basis in the space of all polynomials of degree at most k (which is orthogonal wrt Lebesgue measure on \([-1,1]\)) and hence we can decompose

$$\begin{aligned} 2^{k/2}f^{(k)}=\sum _{l=0}^{k/2}c_{l}^{(k)}p_{l}. \end{aligned}$$

Formula 6.8 now follows from the classical Spherical Harmonic (Legendre) addition theorem:

$$\begin{aligned} p_{l}(x\cdot y)=\frac{4\pi }{2l+1}\sum _{|m|\le l}Y_{l}^{m}(x)\overline{Y_{l}^{m}(y)}. \end{aligned}$$

\(\square \)
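The following is a minimal sketch of the scheme in Theorem 6.3, assuming a small random point cloud with equal weights in place of a genuine QMC system, uniform p and q, a naive \(O(N^{2})\) mat-vec, and an illustrative value of A; it returns the approximate reflector height \(h_{k}\) as the kth root of the Sinkhorn potential, determined up to an overall scaling.

```python
# A minimal sketch of the scheme in Theorem 6.3 (random points with equal
# weights, uniform p and q, A = 2: all choices are illustrative).
import numpy as np

rng = np.random.default_rng(0)
k, N = 20, 400
X = rng.normal(size=(N, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # a point cloud on S^2

dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
K = dist ** k                    # K_ij = |x_i - x_j|^k (zero on the diagonal)
p = q = np.full(N, 1.0 / N)

a = np.ones(N)
m_k = int(np.floor(2 * k * np.log(k)))   # m_k = floor(A k log k)
for _ in range(m_k):
    b = p / (K @ a)
    a = q / (K.T @ b)

h_k = a ** (1.0 / k)   # kth root: approximate radial graph of the reflector
print(h_k[:5])
```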

7 Outlook

7.1 Generalized parabolic optimal transport and singularity formation

Consider the setting in Sect. 5 with a cost function c satisfying the assumptions A1 and A2, but assume for simplicity that c is globally continuous (for example, \(c=d^{2}/2\) in the Riemannian setting). Recall that, given initial data \(u_{0}\in \mathcal {H}\) and volume forms \(\mu \) and \(\nu \), the parabolic equation 5.2 admits a smooth solution \(u_{t}\) on some maximal time-interval [0, T[ and the corresponding maps \(F_{u_{t}}\) give an evolution of diffeomorphisms of X. It does not seem to be known whether \(T=\infty \) in general, i.e. it could be that there are no solutions in \(C^{\infty }(X\times ]0,\infty [)\). Still, using the corresponding iteration \(u_{m}^{(k)}\) (say, defined with respect to \(\mu ^{(k)}=\mu \) and \(\nu ^{(k)}=\nu \)) a generalized notion of solution can be defined:

Proposition 7.1

Given a c-convex function \(u_{0}\), define the following curve \(u_{t}\) of functions on X, emanating from \(u_{0}:\)

$$\begin{aligned} u_{t}:=\sup \left\{ u_{m}^{(k)}:\,\,\,(m,k):\,\,m/k\rightarrow t,\,\,k\rightarrow \infty \right\} \end{aligned}$$

Then \(u_{t}\) is c-convex for any fixed t (and, in particular, continuous) and there exists a constant C such that \(\sup _{X\times [0,\infty [}|u_{t}(x)|\le C\).

Proof

Step 1: There exists a constant C such that \(|u_{m}^{(k)}|\le C\).

By the argument in Step 3 in the proof of Theorem 2.9 we have

$$\begin{aligned} \mathcal {F}^{(k)}(u_{m}^{(k)})+\mathcal {L}^{(k)}(u_{0})\le I_{\mu }(u_{m}^{(k)})\le I_{\mu }(u_{0}) \end{aligned}$$

By Lemma 3.1, \(\mathcal {L}^{(k)}(u_{0})\rightarrow -\int u_{0}^{c}\nu \) and, by Theorem 3.3, \(\mathcal {F}^{(k)}(u_{m}^{(k)})\rightarrow \inf _{C^{0}(X)}J\), and hence the lhs above is uniformly bounded in k. Thus, there exists a constant C such that \(-C\le I_{\mu }(u_{m}^{(k)})\le C\). The proof of Step 1 is now concluded by observing that there exist constants \(A_{1}\) and \(A_{2}\) such that, for any c-convex function u,

$$\begin{aligned} \sup _{X}u\le I_{\mu }(u)+A_{1},\,\,\,\inf _{X}u\ge I_{\mu }(u)-A_{2}. \end{aligned}$$

Indeed, both functionals \(f_{1}(u):=\sup _{X}u-I_{\mu }(u)\) and \(f_{2}(u):=\inf _{X}u-I_{\mu }(u)\) are continuous on C(X) and descend to \(C(X)/\mathbb {R}\). But the space of c-convex functions is compact in \(C(X)/\mathbb {R}\) (as shown exactly as in Lemma 2.7) and hence any continuous functional on this space is uniformly bounded, which implies the two inequalities above.

Step 2: If \(\{u_{\alpha }\}_{\alpha \in A}\) is a finite family of c-convex functions, then \(u:=\max \{u_{\alpha }\}_{\alpha \in A}\) is c-convex.

It is enough to find a function \(v\in C(X)\) such that \(u=v^{c}\). We will show that \(v:=\min \{u_{\alpha }^{c}\}_{\alpha \in A}\) does the job. To this end first observe that \(u\mapsto u^{c}\) is order reversing. Hence, \(u_{\alpha }\le u\) implies that \(u_{\alpha }^{c}\ge u^{c}\), giving \(v\ge u^{c}\). Applying the c-Legendre transform again thus gives \(v^{c}\le u^{cc}=u\). To prove the reversed inequality first observe that, by definition, \(u_{\alpha }^{c}\ge v\) and hence \(u_{\alpha }=(u_{\alpha }^{c})^{c}\le v^{c}\). Finally, taking the sup over all \(\alpha \) proves the desired reversed inequality.

Step 3: Conclusion

Denote by \(\mathcal {K}_{t}\) the closure in C(X) of the set \(S_{t}\) of all \(u_{m}^{(k)}\) such that \(m/k\rightarrow t\) and \(k\rightarrow \infty \). By Step 1 and Lemma 2.13, \(\mathcal {K}_{t}\) is compact. Let \(u_{1},\ldots ,u_{m}\) be the limit points of \(S_{t}\). By the argument towards the end of Step 1 in the proof of Theorem 3.3, each \(u_{i}\) is c-convex. Hence, by Step 2, so is \(u:=\max \{u_{i}\}\). \(\square \)

The curve \(u_{t}\) is well-defined for any probability measures \(\mu \) and \(\nu \) on compact topological spaces X and Y and for any continuous cost function c. Moreover, if \(\mu \) and \(\nu \) are normalized volume forms on compact manifolds, assumptions A1 and A2 hold and \(u_{0}\in \mathcal {H}\), then, by Theorem 5.7, \(u_{t}\) coincides with the classical solution of the parabolic equation 5.2, as long as such a solution exists in \(\mathcal {H}\), i.e. as long as \(F_{u_{t}}\) is a well-defined diffeomorphism. This makes the curve \(u_{t}\) a candidate for a solution to the problem posed in [29, Problem 9] of defining some kind of weak solution to the parabolic equation 5.2, without making assumptions on the MTW-tensor etc (as in Theorem 5.8). The connection to the Sinkhorn algorithm also opens the possibility of numerically exploring singularity formation of classical solutions \(u_{t}\) to the parabolic equation 5.2 as \(t\rightarrow T\) (the maximal existence time). As indicated in [29, Problem 9] one could expect that the first derivatives of a classical solution \(u_{t}\) blow up along a subset S of X of measure zero as \(t\rightarrow T\) (moreover, in the light of the discussion in [29, Problem 8], the subset S might be expected to be rectifiable and of Hausdorff codimension at least one).

Finally, it may be illuminating to point out that, even if the construction of the generalized solution \(u_{t}\) may appear to be rather non-standard from a PDE point of view, it bears some similarities to the method of “vanishing viscosity” for constructing solutions to PDEs by adding small regularizing terms. This is reinforced by the interpretation of the inverse of k as an “entropic regularization parameter” discussed in the introduction of the paper (also note that the approximations \(u_{m_{k}}^{(k)}\) are smooth when the heat kernel is used, as in Theorem 6.1). One is thus led to ask whether, under suitable regularity assumptions on \((\mu ,\nu ,c)\), the curve \(u_{t}\) is a viscosity solution of the parabolic PDE 5.2, in the sense of [25].

8 “Appendix A”: proof of Lemma 4.2 (discrete Laplace method)

We start with the following elementary lemma:

Lemma 8.1

Let \(h_{k}\) be a sequence of \(C^{2}\)-smooth functions on the polydisc \(D_{k}\) in \(\mathbb {R}^{n}\) of radius \(\log k\) centered at 0 such that there exists a constant C such that \(|h_{k}(x)|+|\nabla _{e_{1}}\nabla _{e_{2}}h_{k}|(x)\le Ce^{-|x|^{2}/C}\), where \(e_{1}\) and \(e_{2}\) are any two unit-vectors. Then there exists a constant \(C'\) (only depending on C) such that

$$\begin{aligned} \left| k^{-n/2}\sum _{x_{i}^{(k)}\in D_{k}\cap (k^{-1/2}\mathbb {Z})^{n}}h_{k}(x_{i}^{(k)})-\int _{D_{k}}h_{k}dx\right| \le C'/k \end{aligned}$$

Proof

By restricting the integration to one variable at a time it is enough to consider the case when \(n=1\). Fix \(x_{i}^{(k)}\), which, by symmetry, may be assumed non-negative. For any fixed x in the interval \(I_{k}(x_{i}^{(k)})\) centered at \(x_{i}^{(k)}\), of length \(k^{-1/2}\), Taylor expanding \(h_{k}\) gives

$$\begin{aligned} |h_{k}(x_{i}^{(k)})-h_{k}(x)-(x_{i}^{(k)}-x)h_{k}'(x_{i}^{(k)})|\le k^{-1}Ce^{-(x_{i}^{(k)})^{2}/C}\le Ck^{-1}e^{-(x-\frac{1}{2}k^{-1/2})^{2}/C}, \end{aligned}$$

using that \(e^{-x^{2}/C}\) is decreasing in the last step. By symmetry, the integral over \(I_{k}(x_{i}^{(k)})\) of the linear term \((x_{i}^{(k)}-x)h_{k}'(x_{i}^{(k)})\) vanishes, giving

$$\begin{aligned} k^{-1/2}h_{k}(x_{i}^{(k)})=\int _{I_{k}(x_{i}^{(k)})}h_{k}(x_{i}^{(k)})dx=\int _{I_{k}(x_{i}^{(k)})}h_{k}(x)dx+\epsilon _{i}^{(k)}, \end{aligned}$$

where

$$\begin{aligned} |\epsilon _{i}^{(k)}|\le Ck^{-1}\int _{I_{k}(x_{i}^{(k)})}e^{-(x-\frac{1}{2}k^{-1/2})^{2}/C}dx \end{aligned}$$

Hence, summing over all points \(x_{i}^{(k)}\in D_{k}\cap k^{-1/2}\mathbb {Z}\) except the end points, and using that \(|h_{k}(x_{i}^{(k)})|\le Ce^{-(\log k)^{2}/C}\le C'k^{-1}\) at the end points, gives

$$\begin{aligned}&\left| k^{-1/2}\sum _{x_{i}^{(k)}\in D_{k}\cap k^{-1/2}\mathbb {Z}}h_{k}(x_{i}^{(k)})-\int _{D_{k}}h_{k}dx\right| \\&\quad \le C'k^{-1}+k^{-1}C''\int _{0\le s\le \log k}e^{-(s-\frac{1}{2}k^{-1/2})^{2}/C}ds, \end{aligned}$$

which concludes the proof. \(\square \)
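The following small sketch illustrates the lemma in the case \(n=1\), assuming an arbitrary test function satisfying the hypotheses: the lattice sum over \(k^{-1/2}\mathbb {Z}\) matches the integral over \(D_{k}\) up to an error of order 1/k.

```python
# A small numerical check of Lemma 8.1 for n = 1 (test function arbitrary).
import numpy as np
from scipy.integrate import quad

h = lambda x: (1.0 + 0.5 * np.sin(x)) * np.exp(-x ** 2)

for k in (10, 100, 1000):
    R = np.log(k)
    i_max = np.floor(R * np.sqrt(k))
    pts = np.arange(-i_max, i_max + 1) / np.sqrt(k)   # lattice points in D_k
    lattice_sum = np.sum(h(pts)) / np.sqrt(k)
    integral, _ = quad(h, -R, R)
    print(k, k * abs(lattice_sum - integral))         # roughly bounded in k
```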

In the sequel we will denote by \(+\) the ordinary group structure on \(T^{n}\) and by 0 the zero element with respect to this group structure. Without loss of generality we may as well assume that \(\alpha (x_{0})=0\) and \(h(x_{0})=1\). We will denote by \(O(k^{-1})\) any sequence of functions satisfying \(|O(k^{-1})|\le Ck^{-1}\), for a constant C depending on the data as in the statement of the proposition to be proved.

8.1 Step 1: Localization to a polydisc \(U_{k}\) of radius \(k^{-1/2}\log k\) centered at \(x_{0}\)

First fix a neighborhood U of \(x_{0}\). Since \(\alpha \) is assumed lower-semicontinuous, with a unique minimum at \(x_{0}\), we have \(\alpha (x)\ge \delta >0\) on \(T^{n}-U\) and hence

$$\begin{aligned} k^{n/2}\int _{T^{n}-U}e^{-k\alpha }h\delta _{\Lambda _{k}}\le k^{n/2}\sup _{T^{n}}|h|e^{-k\delta }=O(k^{-1}), \end{aligned}$$

using that \(\delta _{\Lambda _{k}}\) is a probability measure. Next, assume that U is a small coordinate neighborhood of \(x_{0}\) and denote by \(U_{k}\) the polydisc of radius \(k^{-1/2}\log k\) centered at \(x_{0}\). After possibly shrinking U we may assume that \(\alpha (x)\ge |x-x_{0}|^{2}/C\) on U. Hence,

$$\begin{aligned} k^{n/2}\int _{U-U_{k}}e^{-k\alpha }h\delta _{\Lambda _{k}}\le k^{n/2}\sup _{T^{n}}|h|e^{-(\log k)^{2}/C}=O(k^{-1}) \end{aligned}$$

8.2 Step 2: The case when \(x_{0}=0\) in \(T^{n}\)

Introducing the notation \(\alpha ^{(k)}(x):=k\alpha (k^{-1/2}x)\) and \(h_{k}(x):=h(k^{-1/2}x)\exp (-\alpha ^{(k)}(x))\) we can write

$$\begin{aligned} I_{k}:=k^{n/2}\int _{U_{k}}e^{-k\alpha }h\delta _{\Lambda _{k}}=k^{-n/2}\sum _{x_{i}^{(k)}\in D_{k}\cap (k^{-1/2}\mathbb {Z})^{n}}h_{k}(x_{i}^{(k)}) \end{aligned}$$

Now, Taylor expanding \(\alpha \) and denoting by \(p^{(3)}\) the third order term (i.e. defining a polynomial of homogeneous degree three) gives, when \(|x|\le k^{1/2}/C\) (and, in particular, when \(|x|\le \log k)\)

$$\begin{aligned} \alpha ^{(k)}(x)=Ax\cdot x/2+k^{-1/2}p^{(3)}+k^{-1}O(|x|^{4}) \end{aligned}$$

Thus \(h_{k}(x)\) may be Taylor expanded as follows

$$\begin{aligned} h_{k}(x)=e^{-Ax\cdot x/2}\left( 1+k^{-1/2}(q^{(1)}+p^{(3)})+k^{-1}O(|x|^{4})\right) ,\,\,\,\,q^{(1)}(x):=(\nabla h)(0)\cdot x \end{aligned}$$

Next note that \(h_{k}(x)\) satisfies the assumptions of the previous lemma. Indeed, by the chain rule, \((\nabla _{e_{1}}\nabla _{e_{2}}\alpha ^{(k)})(x)\) is, when \(|x_{i}|\le k^{1/2}/C\), equal to the corresponding second derivative of \(\alpha \) at the point \(k^{-1/2}x\) in U, and these are uniformly bounded on U, by assumption. Hence, applying the lemma in question gives

$$\begin{aligned} I_{k}=\int _{D_{k}}h_{k}dx+O(k^{-1}) \end{aligned}$$

This shows that in the present discrete setting we get the same result, up to the negligible error term \(O(k^{-1})\), as the ordinary Laplace method of integration, which can hence be invoked to conclude. For completeness we provide an alternative direct argument. By the expansion above we have

$$\begin{aligned} I_{k}=\int _{D_{k}}e^{-Ax\cdot x/2}\left( 1+k^{-1/2}(q^{(1)}+p^{(3)})+k^{-1}O(|x|^{4})\right) dx \end{aligned}$$

Using the exponential decay of \(e^{-Ax\cdot x/2}\), the integral may be taken over all of \(\mathbb {R}^{n}\), up to introducing an error term \(O(k^{-\infty })\). Hence computing the Gaussian integral concludes the proof, once one has verified that the Gaussian integrals of \(q^{(1)}\) and \(p^{(3)}\) vanish. In the case when A is the identity matrix the vanishing follows directly from the fact that \(q^{(1)}\) and \(p^{(3)}\) are odd. In the general case one first observes that the space of polynomials of homogeneous degree d on \(\mathbb {R}^{n}\) is invariant under the action of the group of invertible linear maps. Hence the problem reduces, by a linear change of variables, to the previous case of an identity matrix.

8.3 Step 3: The case of a general \(x_{0}\)

Set \(\tilde{\alpha }(x):=\alpha (x+x_{0})\) and \(\tilde{h}(x):=h(x+x_{0})\) and decompose \(x_{0}=m_{k}+r_{k}\), where \(m_{k}\in \Lambda _{k}\) and \(|r_{k}|\le 1/k\) (where we have identified a small neighborhood of 0 in \(T^{n}\), containing \(r_{k}\), with a neighborhood of 0 in \(\mathbb {R}^{n}\)). Then we can write

$$\begin{aligned} \int e^{-k\alpha }h\delta _{\Lambda _{k}}=\int e^{-k\tilde{\alpha }}\tilde{h}\delta _{(\Lambda _{k}-r_{k})} \end{aligned}$$

Indeed, for any function h on \(T^{n}\) we have, since \(m_{k}\in \Lambda _{k}\),

$$\begin{aligned} \sum _{x_{i}\in \Lambda _{k}}h(x_{i}+m_{k}+r_{k})=\sum _{x_{i}}h(x_{i}+r_{k}) \end{aligned}$$

Now, we note that the conclusion in the previous lemma remains true when \(\Lambda _{k}\) is replaced by the shifted set \(\Lambda _{k}-r_{k}\) (with essentially the same proof) and hence we can conclude as before.

9 “Appendix B”: Proof of Prop 4.5 (the parabolic PDE on the torus)

The main difference between Proposition 4.5, concerning the n-dimensional torus \(T^{n}\), and the corresponding result for cost functions c on general compact manifolds in [44] (see Theorem 5.8) is that in [44] it is assumed that the strong MTW-condition holds, i.e. that the Ma–Trudinger–Wang tensor of c is bounded from below by a strictly positive constant, while the Ma–Trudinger–Wang tensor vanishes identically for the standard cost function c on \(T^{n}\). The only place where the strong MTW-condition enters in [44] is in the proof of the \(C^{2}\)-estimate of \(u_{t}\). Here we will provide a proof of the \(C^{2}\)-estimate in question on \(T^{n}\), building on estimates in [45]. To make the connection to the setting in [45] we identify a solution \(u_{t}\) of the parabolic flow appearing in Prop 4.5 with a periodic function on \(\mathbb {R}^{n}\), i.e. \(u_{t}\) is a \(\mathbb {Z}^{n}\)-invariant function on \(\mathbb {R}^{n}\). Using the notation in [45] the parabolic equation in question may be expressed as

$$\begin{aligned} \frac{\partial u_{t}(x)}{\partial t}=D[u_{t}],\,\,\,\,D[u_{t}]:=\log \det \left( \nabla ^{2}u_{t}(x)-A\right) -\log B\left( x,\nabla u_{t}(x)\right) \end{aligned}$$
(9.1)

with

$$\begin{aligned} A=-I,\,\,\,\log B\left( x,\nabla u(x)\right) =g\left( x+\nabla u(x)\right) -f(x). \end{aligned}$$

Recall that the initial data \(u_{0}\) is assumed to be a periodic function in \(C^{4,\alpha }(\mathbb {R}^{n})\), which is strictly quasi-convex, i.e. the matrix \((\nabla ^{2}u_{0})(x)+I\) is positive definite for all x. This is precisely the parabolic equation appearing in [45] in the case of the smooth cost function \(c(x,y)=|x-y|^{2}/2\) in \(\mathbb {R}^{n}\times \mathbb {R}^{n}\). However, in [45] the equation is considered on a compact domain \(\Omega \) in \(\mathbb {R}^{n}\) with additional boundary conditions, while here we assume that \(u_{t}\) is a periodic function on \(\mathbb {R}^{n}\). Denote by \(L_{u}\) the linearization of the operator D at u. A direct calculation reveals that

$$\begin{aligned} L_{u}=\Delta _{w}-(\nabla g)\cdot \nabla , \end{aligned}$$
(9.2)

where \(\Delta _{w}\) denotes the Laplacian operator defined wrt the Hessian metric

$$\begin{aligned} (w_{ij})=(\nabla ^{2}u_{t})(x)+I, \end{aligned}$$
(9.3)

i.e., denoting by \((w^{ij})\) the inverse matrix,

$$\begin{aligned} \Delta _{w}=\sum _{i,j}w^{ij}\nabla _{i}\nabla _{j} \end{aligned}$$

and the gradient \(\nabla g\) is evaluated at the point \(x+\nabla u(x)\) and \((\nabla g)\cdot \nabla \) denotes the corresponding first order differential operator. Note that u is strictly quasi-convex iff \(L_{u}\) is an elliptic operator.

9.1 The \(C^{2}\)-estimate

First observe that

$$\begin{aligned} |\nabla u_{t}|\le \sqrt{n}. \end{aligned}$$
(9.4)

This follows directly from the fact that \(u_{t}\) is c-convex, when viewed as a function on the torus \(T^{n}\) (see the beginning of Sect. 2.1.1).

Lemma 9.1

Let \(u_{t}(x)\) be a solution to the parabolic equation 9.1 on \(T^{n}\times [0,t_{max}[\) which is \(C^{2}\)-smooth in x and jointly \(C^{1}\)-smooth in x and t. There exists a constant \(C_{0}\), independent of t, such that

$$\begin{aligned} |\nabla ^{2}u_{t}|\le C_{0}. \end{aligned}$$

More precisely, \(C_{0}\) only depends on upper bounds on the \(C^{2}\)-norms of \(u_{0}\), f and g and a strictly positive lower bound on the eigenvalues of the matrix \((I+\nabla ^{2}u_{0}(x))\).

Proof

We will adapt the proof of the interior \(C^{2}\)-estimate in [45, Thm 10.1] to the present periodic setting (the estimate in [45, Thm 10.1] is a parabolic version of the corresponding estimate in the elliptic setting considered in [65]). Following the convention used in [45] we will omit the sum sign \(\sum \) from the formulas below. Moreover, C will denote a constant only depending on an upper bound on \(|\nabla u_{t}|\) and the first and second order derivatives of \(\log B\) (i.e. of g and f), whose precise value may change from line to line. By the a priori estimate 9.4 these bounds are thus independent of t. Given a unit-vector \(\xi \in S^{n-1}\) and \(a>0\) we consider the function

$$\begin{aligned} v(x,t)=w_{\xi \xi }(x)+a|\nabla u_{t}(x)|^{2},\,\,\,w_{\xi \xi }(x):=w_{ij}(x)\xi _{i}\xi _{j}, \end{aligned}$$

which coincides with the function defined in the beginning of the proof of [45, Thm 10.1]. Fix a point \(x_{0}\in \mathbb {R}^{n}\) and take \(\xi \) to be a unit-vector \(\xi _{0}\) maximizing \(w_{ij}(x_{0})\xi _{i}\xi _{j}\).

Step 1: The following inequality holds at the fixed point \(x_{0}\) if \(w_{ij}(x_{0})\xi _{i}\xi _{j}\ge 1\)

$$\begin{aligned} \left( -\frac{\partial }{\partial t}+L_{u_{t}}\right) [v]\ge (2a-C)w_{ii}-Cw^{ii}-Ca. \end{aligned}$$

This inequality is shown in the course of the proof of [45, Thm 10.1], only using that the MTW-tensor is non-negative (in the present case where the matrix A is constant the MTW-tensor vanishes identically). Indeed, first it is shown, by differentiating the parabolic equation for \(u_{t}\), that

$$\begin{aligned} \left( -\frac{\partial }{\partial t}+L_{u_{t}}\right) [v]\ge & {} \frac{w^{il}w^{jk}\nabla _{\xi }w_{ij}\nabla _{\xi }w_{lk}}{(w_{\xi \xi })^{2}}-\frac{w^{ij}\nabla _{i}w_{\xi \xi }\nabla _{j}w_{\xi \xi }}{w_{\xi \xi }}\\&+(2a-C)w_{ii}-Cw^{ii}-Ca. \end{aligned}$$

Then it is shown that the sum of the first two terms is bounded from below by a constant times \(-w^{ii}\). But, in fact, in the present case, where the matrix A is constant, the sum in question is actually non-negative (see the proof of [50, Thm 3.3] for a different proof of this fact).

Step 2: The following inequality holds at any point x:

$$\begin{aligned} L_{u}[-u]\ge w^{ii}-n-C|\nabla u|. \end{aligned}$$

To see this, first observe that the following identity holds:

$$\begin{aligned} -\Delta _{w}u+n=w^{ii}. \end{aligned}$$

This is seen by multiplying the matrix identity 9.3 by the inverse \((w^{ij})\) of \((w_{ij})\) and taking the trace. Hence,

$$\begin{aligned} L_{u}[-u]=-\Delta _{w}u+(\nabla g)\cdot \nabla u=w^{ii}-n+(\nabla g)\cdot \nabla u\ge w^{ii}-n-C|\nabla u|, \end{aligned}$$

which concludes the proof of Step 2 (in Step 3 the dimensional constant n is simply absorbed into the additive constant \(C(a+b)\)).

Step 3: Conclusion using the parabolic maximum principle

Now fix \(b>0\) and consider the function

$$\begin{aligned} v_{a,b}(x,t)=w_{ij}\xi _{i}\xi _{j}+a|\nabla u_{t}|^{2}-b(u_{t}-u_{t}(0)). \end{aligned}$$

Combining the previous steps we get, at the fixed point \(x_{0}\):

$$\begin{aligned} \left( -\frac{\partial }{\partial t}+L_{u_{t}}\right) [v_{a,b}]\ge (2a-C)w_{ii}+(b-C)w^{ii}-C(a+b). \end{aligned}$$

Take the parameters a and b sufficiently large to ensure that \(2a-C>0\) and \(b-C>0\) (this condition is independent of the choice of fixed point \(x_{0}\)). Now fix \(T<t_{max}\) and consider the smooth function \(v_{a,b}\) on \(\mathbb {R}^{n}\times [0,T]\times S^{n-1}\). Since \(v_{a,b}\) is periodic in the x-variable its sup is attained at some \((x_{0},t_{0},\xi _{0})\) in \([0,1]^{n}\times [0,T]\times S^{n-1}\). Note that \(\xi _{0}\) maximizes \(w_{ij}(x_{0})\xi _{i}\xi _{j}\). We may assume that \(w_{ij}(x_{0})\xi _{i}\xi _{j}\ge 1\) (otherwise we are already done). Now, if \(t_{0}>0\) then

$$\begin{aligned} 0+0\ge \left( -\frac{\partial }{\partial t}+L_{u_{t}}\right) [v_{a,b}] \end{aligned}$$

at \((x_{0},t_{0})\). But then the combined inequality above implies an upper bound on \(w_{ii}\) at \((x_{0},t_{0})\) depending only on C: since \(w^{ii}>0\), it yields \(w_{ii}\le C(a+b)/(2a-C)\). Since \(a|\nabla u_{t}|^{2}-b(u_{t}-u_{t}(0))\) is uniformly bounded by a constant, this shows that the sup of \(w_{ii}\) over \(\mathbb {R}^{n}\times [0,T]\) is also uniformly bounded by a constant. Finally, if \(t_{0}=0\) then we get an upper bound on \(w_{ii}\) in terms of the Hessian of \(u_{0}\). \(\square \)
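As a purely illustrative sanity check on the lemma (and only under the toy assumptions of the one-dimensional sketch given after the equation 9.1, whose variables u, x, h, dt, f, g are reused here), one can track the discrete analogues of the two-sided Hessian bounds along the flow:

```python
# Continuing the toy 1D sketch above: monitor the discrete analogue of
# sup |∇²u_t| (Lemma 9.1) and of the lower eigenvalue bound on w (cf. 9.6 below).
hess_sup, w_min = [], []
for _ in range(20000):
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * h)
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / h**2
    u = u + dt * (np.log(uxx + 1.0) - (g(x + ux) - f))
    hess_sup.append(np.abs(uxx).max())  # expected to stay below a t-independent C_0
    w_min.append((uxx + 1.0).min())     # expected to stay bounded away from zero
```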

9.2 Conclusion of the proof of Prop 4.5

The rest of the argument proceeds as in [44], but for the convenience of the reader we provide some details, highlighting the dependence on the regularity assumptions on the data \(u_{0},f\) and g.

9.2.1 Short-time existence

First assume that \(u_{0}\in C^{2,\alpha }(T^{n})\) and that \(u_{0}\) is strictly quasi-convex. Then the linearization \(L_{u_{0}}\) is uniformly elliptic and hence standard short-time existence results for parabolic equations imply that there exists a maximal \(t_{max}>0\) and a unique solution u(x, t) to the equation 9.1 on \(T^{n}\times [0,t_{max}[\) with the property that \(u_{t}\) is in \(C^{2,\alpha }(T^{n})\) and the corresponding Hölder derivatives of \(u_{t}\) are continuous as \(t\rightarrow 0\). In particular, the norms \(\left\| u_{t}\right\| _{C^{2,\alpha }(T^{n})}\) are uniformly bounded for \(t\in [0,T]\) for any given \(T<t_{max}\). This follows, for example, from [1, Main Thm 1] or [41, Thm 1.2]. Next, given two unit-vectors \(e_{1}\) and \(e_{2}\), denote by \(\nabla _{1}\) and \(\nabla _{2}\) the corresponding directional derivatives. Applying \(\nabla _{1}\) to the equation 9.1 gives

$$\begin{aligned} \frac{\partial (\nabla _{1}u_{t})}{\partial t}=L_{u_{t}}[\nabla _{1}u_{t}]-(\nabla _{1}g)\left( x+\nabla u_{t}\right) +\nabla _{1}f. \end{aligned}$$
(9.5)
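Here we have used the chain rule identities

$$\begin{aligned} \nabla _{1}\left[ \log \det \left( \nabla ^{2}u_{t}+I\right) \right] =\Delta _{w}(\nabla _{1}u_{t}),\,\,\,\nabla _{1}\left[ g\left( x+\nabla u_{t}\right) \right] =(\nabla g)\cdot \left( e_{1}+\nabla (\nabla _{1}u_{t})\right) , \end{aligned}$$

with \(\nabla g\) evaluated at \(x+\nabla u_{t}(x)\).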

Next assume that \(u_{0}\in C^{4,\alpha }(T^{n})\). Since \(L_{u_{t}}\) is uniformly elliptic when \(t\le T<t_{max}\) and the coefficients of \(L_{u_{t}}\) are Hölder continuous (with exponent \(\alpha )\), the global linear parabolic Schauder estimates [41, Thm 2.3] yield, since \(e_{1}\) was arbitrary, that \(\nabla u_{t}\in C^{2,\alpha }(T^{n})\) and that \(\left\| \nabla u_{t}\right\| _{C^{2,\alpha }(T^{n})}\) is uniformly bounded for \(t\in [0,T]\). Now, applying \(\nabla _{2}\) to the equation 9.5 gives

$$\begin{aligned} \frac{\partial (\nabla _{2}\nabla _{1}u_{t})}{\partial t}= & {} L_{u_{t}}[\nabla _{2}\nabla _{1}u_{t}]-F_{t},\,\,\,F_{t}:=-\sum _{i,j}(\nabla _{2}w^{ij})\nabla _{i}\nabla _{j}(\nabla _{1}u_{t})\\&+\nabla _{2}\left[ (\nabla g)(x+\nabla u_{t})\right] \cdot \nabla (\nabla _{1}u_{t})+\nabla _{2}\left[ (\nabla _{1}g)(x+\nabla u_{t})\right] -\nabla _{2}\nabla _{1}f. \end{aligned}$$

By assumption, \(\nabla ^{2}g\) and \(\nabla ^{2}f\) are Hölder continuous and hence (also using that \(\nabla u_{t}\in C^{2,\alpha }(T^{n}))\) the function \(F_{t}\) is Hölder continuous on \(T^{n}\) for some positive Hölder exponent \(\alpha '(\le \alpha )\). We can thus apply the global linear parabolic Schauder estimates again (now to the linear operator \(L_{u_{t}}[\cdot ]-F_{t}\)) and deduce that the norms \(\left\| \nabla ^{2}u_{t}\right\| _{C^{2,\alpha '}(T^{n})}\) are uniformly bounded for \(t\in [0,T]\).

9.2.2 t-independent a priori estimates

Next note that there exists a constant C (independent of x and t) such that

$$\begin{aligned} C^{-1}I\le (w_{ij}(x,t))\le CI. \end{aligned}$$
(9.6)

Indeed, the upper bound is the content of Lemma 9.1. To prove the lower bound, first note that the time derivative \(\dot{u}_{t}\) of \(u_{t}\) solves the linear parabolic equation

$$\begin{aligned} \frac{\partial U_{t}}{\partial t}=L_{u_{t}}[U_{t}]. \end{aligned}$$
(9.7)

Hence, by the parabolic maximum principle

$$\begin{aligned} \sup _{T^{n}\times [0,t_{max}[}|\dot{u}_{t}|\le \sup _{T^{n}}|\dot{u}_{0}|. \end{aligned}$$
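Explicitly, exponentiating the equation 9.1 and using this bound gives

$$\begin{aligned} \det \left( w_{ij}(x,t)\right) =e^{\dot{u}_{t}(x)+g\left( x+\nabla u_{t}(x)\right) -f(x)}\ge e^{-\sup _{T^{n}}|\dot{u}_{0}|-\sup _{T^{n}}|g|-\sup _{T^{n}}|f|}. \end{aligned}$$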

This means that the determinant of \((w_{ij}(x,t))\) is uniformly bounded from below, by a constant only depending on the sup norms of f and g and on \(\sup _{T^{n}}|\dot{u}_{0}|\) (which, by the equation 9.1 at \(t=0\), is in turn controlled by f, g and the Hessian of \(u_{0}\)). Combined with the upper bound in 9.6 this implies the lower bound in 9.6. Next note that the bound 9.6 says that the linearized operator \(L_{u_{t}}\) is uniformly elliptic (with constants independent of t). Hence, the Krylov–Safonov theory for fully non-linear parabolic equations yields uniform interior \(C^{2,\alpha }(T^{n})\)-estimates for \(u_{t}\) for some \(\alpha >0\) (i.e. they hold for \(t\ge \epsilon >0\) with constants only depending on \(\epsilon \) and the \(C^{2}\)-norms of \(u_{t}\), f and g, i.e. on the previous uniform constant C). It follows that \(t_{max}=\infty \) (otherwise the flow could be restarted at \(t_{max}\), contradicting maximality). Then, by the argument in the previous section, using parabolic Schauder estimates, we get a bound

$$\begin{aligned} \left\| \nabla ^{2}u_{t}\right\| _{C^{2,\alpha '}(T^{n})}\le C' \end{aligned}$$

for a constant independent of \(t\in [0,\infty [\). In particular, such a bound holds for \(\left\| u_{t}\right\| _{C^{4}(T^{n})}\), which concludes the proof of the first point in Prop 4.5.

9.2.3 Exponential convergence

Finally, the exponential convergence follows from the generalization of the Li-Yau Harnack inequality in [40, Thm 5.2, Cor 5.3]. Briefly, by [40, Thm 5.2, Cor 5.3] the a priori bounds in the first point of Prop 4.5 imply that there exists a constant \(C'\) such that any positive solution \(U_{t}\) to the linear parabolic equation 9.7 satisfies

$$\begin{aligned} \sup _{T^{n}}U_{t+1/2}\le C'\inf _{T^{n}}U_{t} \end{aligned}$$

when \(n\ge 3\). Using a standard induction argument this implies an exponential decay of the sup-norm of the oscillation of \(\dot{u}\), which in turn implies the exponential convergence in Prop 4.5. The assumption that \(n\ge 3\) in [40, Cor 5.3] is then bypassed by taking a product of \(T^{n}\) with \(T^{2}\), just as in [44, Section 7.1.2].