Skip to main content

An Optimal Transport Approach for the Schrödinger Bridge Problem and Convergence of Sinkhorn Algorithm

Abstract

This paper exploit the equivalence between the Schrödinger Bridge problem (Léonard in J Funct Anal 262:1879–1920, 2012; Nelson in Phys Rev 150:1079, 1966; Schrödinger in Über die umkehrung der naturgesetze. Verlag Akademie der wissenschaften in kommission bei Walter de Gruyter u, Company, 1931) and the entropy penalized optimal transport (Cuturi in: Advances in neural information processing systems, pp 2292–2300, 2013; Galichon and Salanié in: Matching with trade-offs: revealed preferences over competing characteristics. CEPR discussion paper no. DP7858, 2010) in order to find a different approach to the duality, in the spirit of optimal transport. This approach results in a priori estimates which are consistent in the limit when the regularization parameter goes to zero. In particular, we find a new proof of the existence of maximizing entropic-potentials and therefore, the existence of a solution of the Schrödinger system. Our method extends also when we have more than two marginals: the main new result is the proof that the Sinkhorn algorithm converges even in the continuous multi-marginal case. This provides also an alternative proof of the convergence of the Sinkhorn algorithm in two marginals.

Introduction

Let \((X,d_X)\) and \((Y,d_Y)\) be Polish spaces, \(c:X\times Y\rightarrow {\mathbb {R}}\) be a cost function, \(\rho _1\in \mathcal {P}(X)\) and \(\rho _2\in \mathcal {P}(Y)\) be probability measures. We consider the Schrödinger Bridge problem

$$\begin{aligned} {\text {OT}}_{\varepsilon }(\rho _1,\rho _2) = \inf _{\gamma \in \Pi (\rho _1,\rho _2)}\varepsilon {\text {KL}}(\gamma |\mathcal {K}), \end{aligned}$$
(1.1)

where \({\mathcal {K}}\) is the so-called Gibbs Kernel associated with the cost c:

$$\begin{aligned} {\mathcal {K}} = k(x,y)\rho _1\otimes \rho _2 = e^{-\frac{c(x,y)}{\varepsilon }}\rho _1\otimes \rho _2. \end{aligned}$$
(1.2)

The function \({\text {KL}}(\gamma |\mathcal {K})\) in (1.1) is the Kullback–Leibler divergence between the probability measures \(\gamma \) and \(\mathcal {K}\in \mathcal {P}(X\times Y)\) which is defined as

$$\begin{aligned} \displaystyle {\text {KL}}(\gamma |\mathcal {K}) = \displaystyle {\left\{ \begin{array}{ll} \displaystyle \int _{X\times Y}\gamma \log \left( \dfrac{\gamma }{k}\right) d(\rho _1\otimes \rho _2) &{}\text { if } \gamma \ll \rho _1\otimes \rho _2 \\ +\infty &{}\text { otherwise }\\ \end{array}\right. }. \end{aligned}$$

Here, by abuse of notation we are denoting by \(\gamma \) the Radon-Nikodym derivative of \(\gamma \) with respect to the product measure \(\rho _1\otimes \rho _2\). Geometrically speaking, when we interpret the Kullback–Leibler divergence as a distance, the problem (1.1) defines the so called Kullback–Leibler projection of \(\mathcal {K}\) on the set \(\Pi (\rho _1,\rho _2)\).

In the past years, theoretical and numerical aspects of (1.1) has been object of study in mathematical physics (e.g. [13,14,15, 24, 33, 59,60,61, 66, 71,72,73]), probability (e.g. [19, 47, 52, 53]), fluid mechanics (e.g. [3, 6]), metric geometry (e.g. [41, 45]), optimal transport theory (e.g. [5, 16, 18, 20, 21, 32, 44, 51, 54, 55, 58]), data sciences (e.g. [34, 39, 40, 56, 57, 62, 63] see also the book [28] and references therein).

The existence of a minimizer in (1.1) was obtain in different generality by Czisar, Ruschendorf, Borwein, Lewis and Nussbaum, Léonard, Gigli and Tamanini et al. [8, 25, 45, 65]. In the most general case the kernel \(\mathcal {K}\) is not even assumed to be absolutely continuous with respect to \(\rho _1 \otimes \rho _2\), as opposed to our assumption (1.2). In particular, under the assumption (1.2) on \(\mathcal {K}\) (see for example [53]), a unique minimizer for (1.1) exists and \(\gamma ^{\varepsilon }_{opt}\) is the minimizer if and only if

$$\begin{aligned} \gamma ^{\varepsilon }_{opt} = a^{\varepsilon }(x)b^{\varepsilon }(y)\mathcal {K}, \text { where } a^{\varepsilon },b^{\varepsilon } \text { solve } \displaystyle {\left\{ \begin{array}{ll} a^{\varepsilon }(x)\int _{Y} b^{\varepsilon }(y)k(x,y)d\rho _2(y) = 1 \\ b^{\varepsilon }(y)\int _{X} a^{\varepsilon }(y)k(x,y)d\rho _1(x) = 1\\ \end{array}\right. }. \end{aligned}$$
(1.3)

The functions \(a^{\varepsilon }(x)\) and \(b^{\varepsilon }(y)\) are called Entropic potentials. They are unique up to the trivial transformation \(a \mapsto a/\lambda \), \(b \mapsto \lambda b\) for some \(\lambda >0\). The system solved by the Entropic potentials is called the Schrödinger system. Assuming \(\rho _1\) and \(\rho _2\) are everywhere positive and have finite entropy with respect to \(\mathcal {K}\), the minimizer in (1.1) has a special form as is stated in the Theorem below [8, Corollary 3.9].

Theorem 1.1

(J. M. Borwein, A. S. Lewis, and R. D. Nussbaum) Let \((X,d_X)\) and \((Y,d_Y)\) be a Polish spaces, \(c:X\times Y\rightarrow [0, \infty [\) be a bounded cost function, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures such that \(\rho _1(x),\rho _2(y)>0, \forall x\in X, y\in Y\) and \({\mathcal {K}} = e^{-\frac{c(x,y)}{\varepsilon }}\rho _1\otimes \rho _2\). Then, if and \({\text {KL}}(\rho _1\otimes \rho _2|\mathcal {K}) <+\infty \), then for each \(\varepsilon > 0\), there exists a unique minimizer \(\gamma ^{\varepsilon }\in \Pi (\rho _1,\rho _2)\) for the Schrödinger problem \({\text {OT}}_{\varepsilon }(\rho _1,\rho _2)\) that can be written as

$$\begin{aligned} \gamma ^{\varepsilon }_{opt}(x,y) = a^{\varepsilon }(x)b^{\varepsilon }(y) \mathcal {K}(x,y), \quad \ln a^{\varepsilon }(x)\in L_{\rho _1}^1(X), \ln b^{\varepsilon }(x) \in L_{\rho _2}^1(Y). \end{aligned}$$

In this paper, we are interested in the following questions:

  1. (1)

    What is the regularity of the Entropic potentials \(a^{\varepsilon }\) and \(b^{\varepsilon }\)?

  2. (2)

    Can we understand the structure and regularity of the minimizer in (1.1) if we consider the Schrödinger Bridge problem with N given marginals \(\rho _1,\rho _2,\dots ,\rho _N\) instead of 2?

The answers to the questions (1) and (2) relies on the Kantorovich duality formulation of (1.1) and its extension to the multi-marginal setting: we will exploit the parallel with optimal transport to give also a new (variational) proof for the existence of a solution to the Schrödinger system. We believe that also this contribution is important since the only available proofs of that pass through abstract results of closure of “sum type” functions.

The multi-marginal Schrödinger Bridge problem, to be introduced in Sect. 4, has been recently considered in the literature from different viewpoints (e.g. [5, 6, 16, 18, 20, 27, 42, 43]) as, for instance, the Wasserstein Barycenters, Matching problem in Economics, time-discretisation of Euler Equations and Density Functional Theory in computational chemistry.

Finally, we want to mention that G. Carlier and M. Laborde in [18] show the well-posedness (existence, uniqueness and smooth dependence with respect to the data) for the multi-marginal Schrödinger system in \(L^{\infty }\)—see (4.8) in Sect. 4—via a local and global inverse function theorems. This is a different approach and orthogonal result compared to the study presented in this paper; moreover their result is restricted to measures \(\rho _i\) which are absolutely continuous with respect to some reference measure, with density bounded from above and below.

Computational Aspects and Connection with Optimal Transport Theory

In many applications, the method of choice for numerically computing (1.1) is the so-called Iterative Proportional Fitting Procedure (IPFP) or Sinkhorn algorithm [67]. The aim of the Sinkorn algorithm is to construct the measure \(\gamma ^{\varepsilon }\) realizing minimum in (1.1) by fixing the shape of the guess as \(\gamma ^{\varepsilon }_{n} = a^n(x) b^n(y) \mathcal {K}\) (since this is the actual shape of the minimizer) and then alternatively updating either \(a^n\) or \(b^n\), by matching one of the marginal distribution respectively to the target marginals \(\rho _1\) or \(\rho _2\).

The IPFP sequences \((a^n)_{n\in \mathbb {N}}\) and \((b^n)_{n\in \mathbb {N}}\) are defined thus iteratively byFootnote 1

$$\begin{aligned} \begin{array}{lcl} \displaystyle a^0(x) &{} = &{}1, \\ \displaystyle b^0(y) &{} = &{}1, \\ \displaystyle b^n(y) &{} = &{} \dfrac{1}{\int k(x,y) a^{n-1}(x)d\rho _1(x)}, \\ \displaystyle a^n(y) &{} = &{} \dfrac{1}{\int k(x,y) b^{n}(y)d\rho _2(y)}. \end{array} \end{aligned}$$
(1.4)

While \(a^n\) and \(b^n\) will be approximations of the solution of the Schrödinger system, the sequence of probability measures \(\gamma ^{\varepsilon }_{n} = a^{n}(x) b^n(y)\mathcal {K}\) will approximate the minimizer \(\gamma ^{\varepsilon }\).

We stress that the IPFP procedure can be easily generalized in the multi-marginal setting, whose discussion will be detailed in Sect. 4.

  1. (3)

    Can one prove convergence of the Sinkhorn algorithm in two and several marginals case?

In the two marginals case, the IPFP schemes was introduced by Sinkhorn [67]. The convergence of the iterates (1.4) was proven by Franklin and Lorenz [36] in the discreate case and by Ruschendorf [65] in the continuous one. An alternative proof in the continuous setting, based on the Franklin-Lorenz approach via the Hilbert metric, was also provided by Chen et al. Pavon [21], which in particular leads to a linear convergence rate of the procedure (in the Hilbert metric).

Despite the different approaches and theoretical guarantees obtained in the 2-marginal problem, in the multi-marginal case for continuous measures the situation changes completely. Theoretical results guaranteeing convergence and stability were unknown (even if in [65] it is claimed that with his methods the result can be extended to the multi-marginal case, but to our knowledge this has not been done yet). In the discrete case, the convergence has been established by viewing it as a special case of iterative Bergman projection algorithm, whose global convergence is guaranteed (e.g. [23, 49]).

One of the contributions of this paper is to extend the alternate maximization procedure of the multi-marginal Sinkhorn algorithm in the continuous setting. Our approach is an extension of dual block coordinate ascent in the continuous setting and is based on intuitions coming from Optimal Transport Theory. In particular, we exploit the regularity of Entropic potentials to prove by compactness the convergence of IPFP scheme (1.4).

Connection with Optimal Transport Theory: the problem (1.1) allow us to create very efficient numerical scheme approximating solutions to the Monge–Kantorovich formulation of optimal transport and its many generalizations. Indeed, notice that we can rewrite (1.1) as a functional given by the Monge–Kantorovich formulation of Optimal Transport with a cost function c plus an Entropic regularization parameter

$$\begin{aligned} {\text {OT}}_{\varepsilon }(\rho _1,\rho _2)&= \min _{\gamma \in \Pi (\rho _1,\rho _2)}\varepsilon \int _{X\times Y}\gamma \log \left( \dfrac{\gamma }{k}\right) d(\rho _1\otimes \rho _2) \nonumber \\&= \min _{\gamma \in \Pi (\rho _1,\rho _2)}\int _{X\times Y}cd\gamma + \varepsilon \int _{X\times Y}\gamma \log \gamma \, d(\rho _1\otimes \rho _2). \end{aligned}$$
(1.5)

In particular, one can show that [16, 52, 58] if \((\gamma ^{\varepsilon })_{\varepsilon \ge 0}\) is a sequence of minimizers of the above problem, then \(\gamma ^{\varepsilon }\) converges when \(\varepsilon \rightarrow 0\) to a solution of the Optimal Transport \((\varepsilon =0)\), as depicted in Fig. 1. More precisely, let us define the functionals \(C_k,C_0:\mathcal {P}(X\times Y)\rightarrow \mathbb {R}\cup \lbrace +\infty \rbrace \)

$$\begin{aligned}&C_{k}(\gamma ) = {\left\{ \begin{array}{ll} \int _{X\times Y}c d\gamma +\varepsilon _k\int _{X\times Y}\rho _{\gamma }\log \rho _{\gamma }d(\rho _1\otimes \rho _2) &{}\text { if } \gamma \in \Pi (\rho _0,\rho _1) \\ +\infty &{}\text { otherwise }\\ \end{array}\right. },\\&C_{0}(\gamma ) = {\left\{ \begin{array}{ll} \int _{X\times Y}c d\gamma &{}\text { if } \gamma \in \Pi (\rho _0,\rho _1) \\ +\infty &{}\text { otherwise }\\ \end{array}\right. }. \end{aligned}$$

Then in [16, 53, 58] it is shows that the sequence of functionals \((C_k)_{k\in \mathbb {N}}\) \(\Gamma -\)converges to \(C_0\) with respect to the weak\(^*\) topology. In particular the minima and minimal values are converging and so, in particular if \(c(x,y) = d(x,y)^p\), then

$$\begin{aligned} \lim _{k\rightarrow +\infty }{\text {OT}}^p_{\varepsilon _k}(\rho _1,\rho _2) = W^p_p(\rho _1,\rho _2), \end{aligned}$$

where \(W_p(\rho _1,\rho _2)\) is the p-Wasserstein distance between \(\rho _1\) and \(\rho _2\),

$$\begin{aligned} W^p_p(\rho _1,\rho _2) = \min _{\gamma \in \Pi (\rho _1,\rho _2)}\int _{X\times Y}d^p(x,y)d\gamma (x,y). \end{aligned}$$
Fig. 1
figure1

Support of the optimal coupling \(\gamma ^{\varepsilon }\) in (1.1) for the one-dimensional distance square cost with different values of \(\epsilon = 10^{-1},1,10,10^{2},10^{3}\): the densities \(\rho _1 \sim N(0,5)\) (blue) and \(\rho _2 = \frac{1}{2}\eta _1 + \frac{1}{2}\eta _2\) is a mixed Gaussian (red), where \(\eta _1 \sim N(-2,0.8)\) and \(\eta _2 \sim N(2,0.7)\). The numerical computations are done using the POT library [35]

In the context of Optimal Transport Theory, the entropic regularization was introduced by Galichon and Salanié [37] to solve matching problems in economics; and by Cuturi [26] in the context of machine learning and data sciences. Both seminal papers received renewed attention in understanding the theoretical aspects of (1.5) as well as had a strong impact in imagining, data sciences and machine learning communities due to the efficiency of the Sinkhorn algorithm.

Sinkhorn algorithm provides an efficient and scalable approximation to optimal transport. In particular, by an appropriate choice of parameters, the Sinkhorn algorithm is in fact a near-linear time approximation for computing OT distances between discrete measures [2]. However, as explained in [31, 69], the Wasserstein distance suffer from the so-called curse of dimensionality. We refer to the recent book [28] written by Cuturi and Peyré for a complete presentation and references on computational optimal transport.

Main Contributions

   In order to study the regularity of Entropic-potentials, we introduce the dual (Kantorovich) functional

$$\begin{aligned} D_{\varepsilon }(u,v) = \int _X u(x)d\rho _1(x) + \int _Y v(y)d\rho _2(y) - \varepsilon \int _{X\times Y} e^{\frac{u(x)+v(y)-c(x,y)}{\varepsilon }}d(\rho _1\otimes \rho _2). \end{aligned}$$

   The Kantorovich duality of (1.1) is given by the following variational problem (see Proposition 2.12)

$$\begin{aligned} {\text {OT}}_{\varepsilon }(\rho _1,\rho _2) = \sup _{u\in C_b(X),v\in C_b(Y)}D_{\varepsilon }(u,v) +\varepsilon . \end{aligned}$$
(1.6)

The Entropy-Kantorovich duality (1.6) appeared, for instance, in [18, 34, 43,44,45, 53]. The firsts contributions of this paper are (i) prove the existence of maximizers \(u^*\) and \(v^*\) (up to translation) in (1.6) in natural spaces; (ii) show that the Entropy-Kantorovich potentials inherit the same regularity of the cost function (see the precise statement in Proposition 2.4 and Theorem 2.8).

We then link \(u^*\) and \(v^*\) to the solution of the Schrödinger problem; as a byproduct of our results we are able to provide an alternative proof of the convergence of the Sinkhorn algorithm in the 2-marginal case via a purely optimal transportation approach (Theorem 3.1), seeing it as an alternate maximization procedure. The strength of this proof is that it can be easily generalized to the multi-marginal setting for continuous measures (Theorem 4.7).

Summary of Results and Main Ideas of the Proofs

Our approach follows ideas from Optimal Transport and relies on the study of the duality (Kantorovich) problem (1.6) of (1.1). Analogously to the optimal transport case, if one assume some regularity (boundedness, uniform continuity, concavity) of the cost function c, then we can obtain the same type of regularity of the Entropy potentials u and v.

The relation between solution of the dual problem (1.6) and the Entropic-Potentials solving the Schrödinger system was already pointed out by C. Léonard [53]. From our knowledge, the direct proof of existence of maximizers in (1.6) a new result.

Our approach to obtain the existence of Entropic-Kantorovich potentials, follow the direct method of Calculus of Variations. The key idea in the argument is to define a generalized notion of c-transform in the Schrödinger Bridge case, namely the \((c,\varepsilon )\)-transform. The main duality result, in the most general case where we assume only that c is bounded, is given by the Theorem 2.8 and stated below.

Theorem 1.2

Let \((X,d_X)\), \((Y,d_Y)\) be Polish spaces, \(c:X\times Y\rightarrow \mathbb {R}\) be a Borel bounded cost, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures and \(\varepsilon >0\) be a positive number. Then the supremum in (1.6) is attained for a unique couple \((u_0,v_0)\) (up to the trivial tranformation \((u,v) \mapsto (u+a,v-a)\)). Moreover we also have

$$\begin{aligned} u_0 \in L^{\infty }(\rho _1) \quad \text { and } \quad v_0 \in L^{\infty }(\rho _2) \end{aligned}$$

and we can choose the maximizers such that \(\Vert u_0\Vert _{\infty } ,\Vert v_0\Vert _{\infty } \le \frac{3}{2} \Vert c\Vert _{\infty }\).

On the \((c,\varepsilon )-\)transform: Given a measurable function \(u:X\rightarrow \mathbb {R}\) such that \(\int _{X}e^{u/\varepsilon }d\mu <+\infty \), we defined the \((c,\varepsilon )\)-transform of u by

$$\begin{aligned} u^{(c,\varepsilon )}(y) = -\varepsilon \log \left( \int _{X}e^{(u(x)-c(x,y))/\varepsilon }d\mu (x)\right) . \end{aligned}$$

One can show that this operation is well defined and, moreover, \(D_{\varepsilon }(u,u^{(c,\varepsilon )}) \ge D_{\varepsilon }(u,v), \forall u,v\) and \(D_{\varepsilon }(u,u^{(c,\varepsilon )}) = D_{\varepsilon }(u,v) \text { if and only if } v = u^{(c,\varepsilon )}\) (Lemma 2.6). If we assume additionally regularity for the cost function c, for instance that c is \(\omega \)-continuous or that it is merely bounded, the \((c,\varepsilon )\)-transform is actually a compact operator, respectively form \(L^1(\rho _1) \) to C(Y) and from \(L^{\infty }(\rho _1)\) to \(L^p(\rho _2)\) (Proposition 2.4).

IPFP/Sinkhorn algorithm: As a byproduct of the above approach to the duality, we present an alternative proof of the convergence of the IPFP/Sinkhorn algorithm. The main idea in our proof is that we can rewrite the IPFP iteration substituting \(a^n = \exp (u^n/\varepsilon )\) and \(b^n = \exp (v^n/\varepsilon )\); in these new variables the iteration becomes

$$\begin{aligned} \begin{array}{lcl} \displaystyle v^n(y)/\varepsilon = &{} - \log \left( \int _X k(x,y)\otimes \frac{u^{n-1}(x)}{\varepsilon }d\rho _1\right) \\ \displaystyle u^n(x)/\varepsilon = &{} - \log \left( \int _Y k(x,y)\otimes \frac{v^{n}(x)}{\varepsilon }d\rho _2\right) \end{array}. \end{aligned}$$

Or, \(v^n(y) = (u^{(n-1)})^{(c,\varepsilon )}\) and \(u^n(y) = (v^n)^{(c,\varepsilon )}\). In particular we can interpret the IPFP in optimal transportation terms: at each step the Sinkhorn iterations (1.4) are equivalent to take the \((c,\varepsilon )\)-transforms alternatively and therefore the IPFP can be seen as an alternating maximizing procedure for the dual problem (in view also of Lemma 2.6). Therefore, using the aforementioned compactness, it is easy to show that \(u^n\) and \(v^n\) converge to to the optimal solution of the Kantorovich dual problem when \(n\rightarrow +\infty \). A similar approach has been used also in the discrete case in [23, 49], where however global convergence is guaranteed thanks to standard results in optimization.

Theorem 1.3

Let \((X,d_X)\) and \((Y,d_Y)\) be Polish metric spaces, \(\rho _1 \in \mathcal {P}(X)\) and \(\rho _2\in \mathcal {P}(Y)\) be probability measures and \(c:X\times Y\rightarrow \mathbb {R}\) be a Borel bounded cost. If \((a^n)_{n\in \mathbb {N}}\) and \((b^n)_{n\in \mathbb {N}}\) are the IPFP sequences defined in (1.4), then there exists \(\lambda _n >0\) such that

$$\begin{aligned} a^n/\lambda _n\rightarrow a \text { in } L^p(\rho _1) \quad \text { and } \quad \lambda _n b^n\rightarrow b \text { in } L^p(\rho _2), \quad 1\le p <+\infty , \end{aligned}$$

for ab that solve the Schrödinger system. In particular, the sequence \(\gamma ^n = a^nb^n\mathcal {K}\) converges in \(L^p(\rho _1\otimes \rho _2)\) to \(\gamma ^{\varepsilon }_{opt}\) in (1.1), \(1\le p <+\infty \).

We recall that the argument in original proof of convergence of the Sinkhorn algorithm for the continuous case [36] (also in [21]) relies on defining the Hilbert metric on the projection cone of the Sinkhorn iterations. The authors show that the Sinkhorn iterates are a contraction under this metric and therefore the procedure converges. This proof has the advantage of providing automatically the rate of convergence of the iterates; however it is not easily extendable in the several marginals case.

Our approach instead can be extended to obtain the existence and convergence results also in the multi-marginal setting for any probability measures \(\rho _1,\rho _2,\dots ,\rho _N\) (however we don’t get any quantitative convergence):

Theorem 1.4

Let \((X_i,d_{X_i})\) be Polish metric spaces and \(\rho _i \in \mathcal {P}(X_i)\) be probability measures, for \(i \in \lbrace 1,\dots , N \rbrace \) and \(c:X_1\times \dots \times X_N\rightarrow \mathbb {R}\) be a Borel bounded cost. If \((a_i^n)_{n\in \mathbb {N}}\) are the multi-marginal IPFP sequences that will be defined (4.9), then there exist \(\lambda ^n_i >0\) with \(\prod _i \lambda ^n_i=1\) such that

$$\begin{aligned} a_i^n/\lambda _i^n \rightarrow a_i \text { in } L^p(\rho _i) \quad \text { for all } i \in \lbrace 1,\dots , N \rbrace , 1\le p <+\infty , \end{aligned}$$

where \((a_1, \ldots , a_N)\) solve the Schrödinger system. In particular, the sequence \(\gamma _N^n = \otimes ^N_{i=1}a_i^n\mathcal {K}\) converges in \(L^p(\otimes ^N_{i=1} \rho _i)\), \(1\le p <+\infty \), to the optimal coupling \(\gamma ^{\varepsilon }_{N, opt}\) solving the multi-marginal Schrödinger Bridge problem to be defined in (4.2).

Organization of the Paper

The remaining part of the paper is organized as follows: Sect. 2 contains the main structural results of the paper, namely Proposition 2.4 and Theorem 2.8. In particular, we define the main tools for showing the existence of maximizer of the Entropic-Kantorovich problem and prove regularity results of the Entropic-Kantorovich potentials via the \((c,\varepsilon )\)-transform.

In the Sect. 3, we apply the above results to prove convergence of the Sinkhorn algorithm purely via the compactness argument and alternating maximizing procedue (Theorem 3.1) and, in Sect. 4, we extend the main results of the paper to the multi-marginal Schrödinger Bridge problem, including convergence of Sinkhorn algorithm in the multi-marginal case (Theorem 4.7).

The Role of the Reference Measures

In this subsection, we simply give a technical remark, discussing the role of the reference measures \(\mathfrak {m}_1\) and \(\mathfrak {m}_2\). We stress that all the results of the paper can be extended while considering a kind of entropic optimal transport problem, where the penalization occurs with respect to some reference measures \(\mathfrak {m}_1, \mathfrak {m}_2\).

For \(\varepsilon >0\), we in particular may look at the problem

$$\begin{aligned} \mathcal {S}_{\varepsilon }(\rho _1,\rho _2; \mathfrak {m}_1, \mathfrak {m}_2)&:= \min _{\gamma \in \Pi (\rho _1,\rho _2)} \left\{ \int _{X\times Y}c \, d\gamma + \varepsilon {\text {KL}}( \gamma | \mathfrak {m}_1\otimes \mathfrak {m}_2) \right\} \nonumber \\&= \min _{\gamma \in \Pi (\rho _1,\rho _2)} \varepsilon {\text {KL}}( \gamma | {\mathcal {K}}). \end{aligned}$$
(1.7)

where \({\mathcal {K}}\) is the Gibbs Kernel \({\mathcal {K}} = e^{-\frac{c}{\varepsilon }}\mathfrak {m}_1\otimes \mathfrak {m}_2\).

While having a reference measure in some situations can be quite useful (for example the Schrödinger problem is set with \(\mathfrak {m}_1=\mathfrak {m}_2 = {\mathcal {L}}^d\)), in other it is the opposite, for example when we are considering \(\rho _1, \rho _2\) to be sums of diracs. In those cases it is a much better solution to consider \(\mathfrak {m}_1=\rho _1\) and \(\mathfrak {m}_2=\rho _2\). Notice that in this case, we have that

$$\begin{aligned} {\text {OT}}_{\varepsilon }(\rho _1,\rho _2) = \mathcal {S}_{\varepsilon }(\rho _1,\rho _2;\rho _1,\rho _2). \end{aligned}$$

Now we will see that in fact \({\text {OT}}_{\varepsilon }\) is a universal reduction for \(\mathcal {S}_{\varepsilon }\), meaning that we can always assume \(\mathfrak {m}_1=\rho _1\) and \(\mathfrak {m}_2=\rho _2\):

Lemma 1.5

Let \((X,d,\mathfrak {m}_1)\) and \((Y,d,\mathfrak {m}_1)\) be a Polish metric measure spaces and \(c:X\times Y\rightarrow [0,+\infty [\) be a cost function. Assume that \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\). Then we have

$$\begin{aligned} \mathcal {S}_{\varepsilon }(\rho _1, \rho _2; \mathfrak {m}_1, \mathfrak {m}_2) = {\text {OT}}_{\varepsilon }( \rho _1, \rho _2) +\varepsilon {\text {KL}}(\rho _1| \mathfrak {m}_1) + \varepsilon {\text {KL}}(\rho _2|\mathfrak {m}_2); \end{aligned}$$

moreover, whenever one of the two sides is finite the minimizers of \(\mathcal {S}_{\varepsilon }\) and \({\text {OT}}_{\varepsilon }\) are the same.

Proof

The key equality in proving this lemma is that, whenever \(\gamma \in \Gamma (\rho _1, \rho _2)\) one has

$$\begin{aligned} {\text {KL}}( \gamma | \mathfrak {m}_1\otimes \mathfrak {m}_2) = {\text {KL}}(\gamma | \rho _1 \otimes \rho _2) + {\text {KL}}(\rho _1 | \mathfrak {m}_1) + {\text {KL}}(\rho _2 | \mathfrak {m}_2). \end{aligned}$$
(1.8)

While this equality is clear whenever all the terms are finite, we refer to Lemma 1.6 below for a complete proof entailing every case. From this equality we can easily get the conclusions. \(\square \)

Lemma 1.6

Let \((X,\sigma _X)\) and \((Y,\sigma _Y)\) be measurable spaces. Assume that \(\gamma \in \mathcal {P}(X\times Y)\), \(\mathfrak {m}_1\in \mathcal {P}(X)\) and \(\mathfrak {m}_2\in \mathcal {P}(Y)\). Then we have

$$\begin{aligned} {\text {KL}}( \gamma | \mathfrak {m}_1\otimes \mathfrak {m}_2) = {\text {KL}}(\gamma | \rho _1 \otimes \rho _2) + {\text {KL}}(\rho _1 | \mathfrak {m}_1) + {\text {KL}}(\rho _2 | \mathfrak {m}_2), \end{aligned}$$
(1.9)

where \(\rho _1= (e_1)_{\sharp }\gamma \) and \(\rho _2 = (e_2)_{\sharp }\gamma \) are the projections of \(\gamma \) onto X and Y respectively.

Proof

First we assume the right hand side of (1.9) is finite, and in particular this implies \(\gamma \ll \rho _1 \otimes \rho _2\), \(\rho _1 \ll \mathfrak {m}_1\) and \(\rho _2 \ll \mathfrak {m}_2\). In particular we get \(\gamma \ \mathfrak {m}_1\otimes \mathfrak {m}_2\) and we can infer

$$\begin{aligned} \frac{ d \gamma }{ d (\mathfrak {m}_1\otimes \mathfrak {m}_2)} (x,y)= \frac{ d \gamma }{ d (\rho _1 \otimes \rho _2 )} (x,y) \cdot \frac{ d \rho _1}{ d\mathfrak {m}_1} (x) \cdot \frac{ d \rho _2 }{ d \mathfrak {m}_2} (y). \end{aligned}$$

We can now compute

$$\begin{aligned} {\text {KL}}( \gamma |\mathfrak {m}_1\otimes \mathfrak {m}_2)&= \int _{X \times Y} \ln \left( \frac{ d \gamma }{ d (\mathfrak {m}_1\otimes \mathfrak {m}_2)} (x,y) \right) \, d \gamma \\&=\int _{X \times Y} \ln \left( \frac{ d \gamma }{ d (\rho _1 \otimes \rho _2 )} (x,y) \right) \, d \gamma + \int _{X \times Y} \ln \left( \frac{ d \rho _1}{ d\mathfrak {m}_1} (x)\right) \, d \gamma \\&\quad + \int _{X \times Y} \ln \left( \frac{ d \rho _2}{ d\mathfrak {m}_2}(y)\right) \, d \gamma \\&= {\text {KL}}(\gamma | \rho _1 \otimes \rho _2 ) + \int _X \ln \left( \frac{ d \rho _1}{ d\mathfrak {m}_1} (x)\right) \, d \rho _1 + \int _Y \ln \left( \frac{ d \rho _2}{ d\mathfrak {m}_2} (y)\right) \, d \rho _2 \\&= {\text {KL}}(\gamma | \rho _1 \otimes \rho _2) + {\text {KL}}(\rho _1 | \mathfrak {m}_1) + {\text {KL}}(\rho _2 | \mathfrak {m}_2). \end{aligned}$$

We assume now that the left hand side of (1.9) is finite. Thanks to the fact that \({\text {KL}}(F_{\sharp } \mu | F_{\sharp } \nu ) \le {\text {KL}}( \mu | \nu )\) for every measurable function F, we immediately have that \( {\text {KL}}(\rho _1 | \mathfrak {m}_1)\) and \( {\text {KL}}(\rho _2 | \mathfrak {m}_2)\) are finite, and in particular \(\rho _1 \ll \mathfrak {m}_1\) and \(\rho _2 \ll \mathfrak {m}_2\). Now let us introduce \(f = \frac{d \gamma }{d \mathfrak {m}_1\otimes \mathfrak {m}_2}\), \(g_1 = \frac{d \rho _1}{d \mathfrak {m}_1} \) and \(g_2 =\frac{ d \rho _2}{d \mathfrak {m}_2}\); let us then consider any measurable set \(A \subseteq X \times Y\) and assume that \((\rho _1 \otimes \rho _2) (A)=0\). In particular we have

$$\begin{aligned} \int _{X \times Y} \chi _A (x,y) g_1(x)g_2(y) \, d (\mathfrak {m}_1\otimes \mathfrak {m}_2) = (\rho _1 \otimes \rho _2) (A)=0; \end{aligned}$$

from this we deduce that A is \(\mathfrak {m}_1\otimes \mathfrak {m}_2\)-essentially contained in the set \(B=\{g_1(x) g_2(y) =0\} = B_x \cup B_y\), where \(B_x= \{ g_1(x) =0\} \times Y\) and \(B_y = X \times \{ g_2(y)=0\}\). However, by the marginal conditions, we have \(\gamma (B_x)=\rho _1\{ g_1(x) =0\} =0\) and similarly \(\gamma (B_y) =0\), which imply \(\gamma (B)=0\). In particular we have

$$\begin{aligned} \gamma (A)&= \int _{X \times Y} \chi _A (x,y) f(x,y) \, d (\mathfrak {m}_1\otimes \mathfrak {m}_2) = \int _{X \times Y} \chi _{A \cap B} (x,y) f(x,y) \, d (\mathfrak {m}_1\otimes \mathfrak {m}_2) \\&\le \int _{X \times Y} \chi _{ B} (x,y) f(x,y) \, d (\mathfrak {m}_1\otimes \mathfrak {m}_2) \le \gamma (B)=0. \end{aligned}$$

This proves that \(\gamma \ll \rho _1 \otimes \rho _2\) and so we can perform the same calculation we did before to conclude. \(\square \)

Regularity of Entropic-Potentials and Dual Problem

In this section we will treat the case where \(c:X \times Y \rightarrow \mathbb {R}\) is a Borel bounded cost; of course everything extends also to the case when \(c \in L^{\infty }(\rho _1 \otimes \rho _2)\). Some of the results extend naturally for unbounded costs (for example (i), (ii), (v) in Proposition 2.4), but we prefer to keep the setting uniform.

Entropy-Transform and a Priori Estimates

We start by defining the Entropy-Transform. First, let us define the space \(L^{\mathrm{exp}}_{\varepsilon }\), which will be the natural space for the dual problem.

Definition 2.1

(\(L^{\mathrm{exp}}_{\varepsilon }\) spaces) Let \(\varepsilon >0\) be a positive number and \((X,d_X)\) be a Polish space. We define the set \(L^{\mathrm{exp}}_{\varepsilon }(X,\rho _1)\) by

$$\begin{aligned}&L^{\mathrm{exp}}_{\varepsilon }(X,\rho _1) \\&\quad = \left\{ u:X \rightarrow [-\infty , \infty [ \, : \, u \text { is a measurable function in } (X,\rho _1) \text { and } 0<\int _X e^{u/\varepsilon } \, d \rho _1 < \infty \right\} . \end{aligned}$$

For \(u \in L^{\mathrm{exp}}_{\varepsilon }(X, \rho _1)\) we define also \(\lambda _u:=\varepsilon \log \left( \int _X e^{u/\varepsilon } \, d \rho _1 \right) \).

For simplicity, we will use the notation \(L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\) instead of \(L^{\mathrm{exp}}_{\varepsilon }(X,\rho _1)\). Notice that it is possible that \(u\in L^{\mathrm{exp}}_{\varepsilon }(X,\rho _1)\) attains the value \(-\infty \) in a set of positive measure, but not everywhere, because of the positivity constraint \(\int _X e^{u/\varepsilon } \, d \rho _1>0\). On the other hand, we have that \(u \in L^{\mathrm{exp}}_{\varepsilon }(X,\rho _1)\) implies \(u^+ \in L^p(\rho _1)\) for every \( p \ge 1\), where \(u^+(x):= \max \{ u(x),0\} \) denotes the positive part of u.

Definition 2.2

(Entropic c-transform or \((c,\varepsilon )\)-transform) Let \((X,d_X)\), \((Y,d_Y)\) be Polish spaces, \(\varepsilon >0\) be a positive number, \(\rho _1 \in \mathcal {P}(X)\) and \(\rho _2 \in \mathcal {P}(Y)\) be probability measures and let c be a bounded measurable cost on \(X \times Y\). The entropic \((c,\varepsilon )\)-transform \(\mathcal {F}^{(c, \varepsilon )}:L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\rightarrow L^0(\rho _2)\) is defined by

$$\begin{aligned} \mathcal {F}^{(c, \varepsilon )}( u ) (y):= -\varepsilon \log \left( \int _X e^{ \frac{ u (x) - c(x,y) }{\varepsilon }} \, d \rho _1 (x) \right) . \end{aligned}$$
(2.1)

Analogously, we define the \((c,\varepsilon )\)-transform \(\mathcal {F}^{(c, \varepsilon )}:L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\rightarrow L^0(\rho _1)\) by

$$\begin{aligned} \mathcal {F}^{(c, \varepsilon )}( v ) (x):= - \varepsilon \log \left( \int _Y e^{ \frac{ v (y) - c(x,y) }{\varepsilon }} \, d \rho _2 (y) \right) . \end{aligned}$$
(2.2)

Whenever it will be clear we denote \(v^{(c,\varepsilon )}=\mathcal {F}^{(c, \varepsilon )}(v)\) and \(u^{(c,\varepsilon )}=\mathcal {F}^{(c, \varepsilon )}(u)\), in an analogous way to the classical c-transform.

Notice that \(L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\) is the natural domain of definition for \(\mathcal {F}^{(c, \varepsilon )}\) because if \(u \not \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\) we would have either \(\mathcal {F}^{(c, \varepsilon )}(u) \equiv - \infty \) or \(\mathcal {F}^{(c, \varepsilon )}(u) \equiv +\infty \); moreover, thanks to the positivity constraint \(\int _X e^{u/\varepsilon } \, d \rho _1>0\) we also have \(\mathcal {F}^{(c, \varepsilon )}(u)(y) \in \mathbb {R}\) almost everywhere. In fact we will show that \(\mathcal {F}^{(c, \varepsilon )}(u) \in L^{\infty }(\rho _2)\).

We also remark that the \((c,\varepsilon )\)-transform is consistent with the c-transform when \(\varepsilon \rightarrow 0\): \(u^{(c,\varepsilon )}\rightarrow \max \lbrace u (x) - c(x,y) : x\in X \rbrace \), when \(\varepsilon \rightarrow 0\). In other words, \(u^{(c,\varepsilon )}(y) = u^{c}(y) + O(\varepsilon )\) (Fig. 2).

Fig. 2
figure2

Entropy-Kantorovich potentials \(u^{\varepsilon }(x)-\varepsilon \ln ( \rho _1)\) associated to the densities \(\rho _1\) and \(\rho _2\) for different values of the regularization parameter: \(\varepsilon _1< \varepsilon _2 < \varepsilon _3\) (from left to right). The densities \(\rho _1 \sim N(0,5)\) and \(\rho _2 = \frac{1}{2}\eta _1 + \frac{1}{2}\eta _2\) is a mixed Gaussian, where \(\eta _1 \sim N(7,0.9)\) and \(\eta _2 \sim N(14,0.9)\)

Lemma 2.3

Let \((X,d_X)\), \((Y,d_Y)\) be Polish spaces, \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1),v\in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) and \(\varepsilon >0\). Then,

  1. (i)

    \(u^{(c,\varepsilon )}(y) \in L^{\infty }(\rho _2)\) and \(v^{(c,\varepsilon )}(x) \in L^{\infty }(\rho _1)\). More precisely,

    $$\begin{aligned} -\Vert c\Vert _{\infty }-\varepsilon \log \left( \int _X e^{ \frac{ u (x) }{\varepsilon }} \, d \rho _1 \right) \le u^{(c,\varepsilon )}(y) \le \Vert c\Vert _{\infty } -\varepsilon \log \left( \int _X e^{ \frac{ u (x) }{\varepsilon }} \, d\rho _1 \right) \end{aligned}$$
  2. (ii)

    \(u^{(c,\varepsilon )}(y) \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) and \(v^{(c,\varepsilon )}(x) \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\). Moreover \(|\lambda _{u^{(c,\varepsilon )}} + \lambda _u| \le \Vert c \Vert _{\infty }\).

Proof

If \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\) then

$$\begin{aligned} u^{(c,\varepsilon )}(y)&= -\varepsilon \log \left( \int _X e^{\frac{ u(x)-c(x,y) }{\varepsilon }}d\rho _1\right) \\&\le -\varepsilon \log \left( e^{\frac{ -\Vert c\Vert _{\infty }}{\varepsilon }}\int _X e^{\frac{ u(x)}{\varepsilon }}d\rho _1\right) \\&= \Vert c \Vert _{\infty } - \varepsilon \log \left( \int _X e^{\frac{ u(x)}{\varepsilon }}d\rho _1\right) . \end{aligned}$$

Moreover, we get a lower bound for the above quantity using \(c\ge -\Vert c\Vert _{\infty }\):

$$\begin{aligned} u^{(c,\varepsilon )}(y) = -\varepsilon \log \left( \int _X e^{\frac{ u(x)-c(x,y) }{\varepsilon }}d\rho _1\right) \ge -\Vert c\Vert _{\infty } -\varepsilon \log \left( \int _X e^{\frac{u(x)}{\varepsilon }}d\rho _1\right) . \end{aligned}$$

Then we proved that \(u^{(c,\varepsilon )}\in L^{\infty }(\rho _2)\). The fact that \(v^{(c,\varepsilon )}\in L^{\infty }(\rho _1)\) is analogous. This shows the (i). Since \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\), by using the part (i) we have that

$$\begin{aligned} \int _Y e^{\frac{u^{(c,\varepsilon )}(y)}{\varepsilon }}d\rho _2(y) \le \int _Y e^{\Vert c\Vert _{\infty }/\varepsilon }\left( \int _X e^{\frac{u(x)}{\varepsilon }}d\rho _1(x)\right) ^{-1}d\rho _2(y) < +\infty . \end{aligned}$$

Therefore \(u^{(c,\varepsilon )}\in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) and in particular \(\lambda _{u^{(c,\varepsilon )}} \le -\lambda _u + \Vert c\Vert _{\infty }\); the other inequality follows with a similar calculation and the same holds for \(v^{(c,\varepsilon )}\), which proves (ii). \(\square \)

Some of the following properties were already known for the softmax operator (for example in [39]) and they are used in order to get a posteriori regularity of the potentials but, up to our knowledge, were never used to get a priori results. Another very cleverly used properties of the \((c, \varepsilon )\)-transform are in [32] in order to obtain a new proof of the Caffarelli’s contraction theorem [12].

Proposition 2.4

Let \(\varepsilon >0\) be a positive number, \((X,d_X)\) and \((Y,d_Y)\) be Polish metric spaces, \(c:X\times Y\rightarrow [0,\infty ]\) be a bounded cost function, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures and \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\). Then

  1. (i)

    if c is L-Lipschitz, then \(u^{(c,\varepsilon )}\) is L-Lipschitz;

  2. (ii)

    if c is \(\omega \)-continuous, then \(u^{(c,\varepsilon )}\) is \(\omega \)-continuous;

  3. (iii)

    if \(\vert c\vert \le M\), then we have \(|u^{(c,\varepsilon )}+\lambda _u|\le M\);

  4. (iv)

    if \(\vert c\vert \le M\), then \(\mathcal {F}^{(c, \varepsilon )}:L^{\infty }(\rho _1)\rightarrow L^p(\rho _2)\) is a 1-Lipschitz compact operator.

  5. (v)

    if c is K-concave with respect to y, then \(u^{(c,\varepsilon )}\) is K-concave.

Proof

Of course we have that (ii) implies (i); let us prove directly (ii).

  1. (ii)

    Let \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\), \(y_1,y_2\in Y\). We can assume without loss of generality that \(u^{(c,\varepsilon )}(y_1)\ge u^{(c,\varepsilon )}(y_2)\); in that case

    $$\begin{aligned}&\vert u^{(c,\varepsilon )}(y_1) - u^{(c,\varepsilon )}(y_2)\vert \\&\quad = \varepsilon \log \left( \int _X e^{\frac{u(x)-c(x,y_2)}{\varepsilon }}d\rho _1\right) - \varepsilon \log \left( \int _X e^{\frac{u(x)-c(x,y_1)}{\varepsilon }}d\rho _1\right) \\&\quad = \varepsilon \log \left( \int _X e^{\frac{u(x)-c(x,y_1)+c(x,y_1)-c(x,y_2)}{\varepsilon }}d\rho _1\right) - \varepsilon \log \left( \int _X e^{\frac{u(x)-c(x,y_1)}{\varepsilon }}d\rho _1\right) \\&\quad \le \varepsilon \log \left( e^{\frac{\omega (d(y_1,y_2))}{\varepsilon }}\int _X e^{\frac{u(x)-c(x,y_1)}{\varepsilon }}d\rho _1\right) - \varepsilon \log \left( \int _X e^{\frac{u(x)-c(x,y_1)}{\varepsilon }}d\rho _1\right) \\&\quad = \omega (y_1,y_2). \end{aligned}$$
  2. (iii)

    This is a direct consequence of Lemma 2.3 (i);

  3. (iv)

    We first prove that \(\mathcal {F}^{(c, \varepsilon )}\) is 1-Lipschitz. In fact, letting \(u, {\tilde{u}} \in L^{\infty }(\rho _1)\), we can perform a calculation very similar to what has been done in (ii): for every \(y \in Y\) we have

    $$\begin{aligned} \mathcal {F}^{(c, \varepsilon )}(u) (y)&=- \varepsilon \log \left( \int _X e^{\frac{u(x)-c(x,y)}{\varepsilon }}d\rho _1\right) \ge - \varepsilon \log \left( \int _X e^{\frac{{\tilde{u}}(x) + \Vert u - {{\tilde{u}}} \Vert _{\infty } -c(x,y)}{\varepsilon }}d\rho _1\right) \\&= \mathcal {F}^{(c, \varepsilon )}({\tilde{u}}) (y) - \Vert u - {\tilde{u}} \Vert _{\infty } \end{aligned}$$

    We can conlcude that \(\Vert \mathcal {F}^{(c, \varepsilon )}(u) - \mathcal {F}^{(c, \varepsilon )}({\tilde{u}})\Vert _p \le \Vert \mathcal {F}^{(c, \varepsilon )}(u) - \mathcal {F}^{(c, \varepsilon )}({\tilde{u}})\Vert _{\infty } \le \Vert u - {\tilde{u}} \Vert _{\infty }\). This proves in particular that \(\mathcal {F}^{(c, \varepsilon )}: L^{\infty }(\rho _1) \rightarrow L^p(\rho _2)\) is continuous. In order to prove that \(\mathcal {F}^{(c, \varepsilon )}\) is compact it suffices to prove that \(\mathcal {F}^{(c, \varepsilon )}(B)\) is precompact for every bounded set \(B \subset L^{\infty }(\rho _1)\). We will use Proposition 5.1; since \(\mathcal {F}^{(c, \varepsilon )}\) is Lipschitz, for sure if B is bounded we have that \(\mathcal {F}^{(c, \varepsilon )}(B)\) is bounded in \(L^p(\rho _2)\), so it remains to prove part (b) of the criterion of Proposition 5.1.

    Let us consider \(\gamma = \rho _1\otimes \rho _2\). Since \(c \in L^{\infty }(\gamma )\), by Lusin theorem we have that for every \(\sigma >0\) there exists \(N_{\sigma } \subset X \times Y\), with \(\gamma ( N_{\sigma })< \sigma \), such that \(c|_{(N_{\sigma })^c}\) is uniformly continuous, with modulus of continuity \(\omega _{\sigma }\). We now try to mimic what we did in (ii), this time keeping also track of the remainder terms we will have.

    For each \(y\in Y\), we consider the slice of \(N_{\sigma }\) above y, \(N^{\sigma }_y = \left\{ x\in X : (x,y)\in N_{\sigma } \right\} \); then we consider the set of bad \(y \in Y\), where the slice \(N^{\sigma }_y\) is too big:

    $$\begin{aligned} N^{\sigma }_b = \left\{ y \in Y : \rho _1(N^{\sigma }_y)\ge \sqrt{\sigma } \right\} . \end{aligned}$$

    In particular, by definition if \(y\not \in N_b^{\sigma }\) we have \(\rho _1(N^{\sigma }_y) \le \sqrt{\sigma }\), but thanks to Fubini and the condition \(\gamma (N_{\sigma }) < \sigma \) we have also that \(\rho _2(N^{\sigma }_b) \le \sqrt{\sigma }\) .

    Let us now consider \(y, y' \not \in N^{\sigma }_b\), and let us denote \(X^*= X \setminus ( N^{\sigma }_y \cup N^{\sigma }_{y'})\). Then we have that for every \(x \in X^*\), \(|c(x,y)-c(x,y')| \le \omega _{\sigma }(d(y,y'))\). We can assume without loss of generality that \(u^{(c,\varepsilon )}(y)\ge u^{(c,\varepsilon )}(y')\) and we have

    $$\begin{aligned}&|u^{(c,\varepsilon )}(y)-u^{(c,\varepsilon )}(y')| \\&\quad = - \varepsilon \log \left( \int _X e^{(u(x)-c(x,y))/\varepsilon }d\rho _1\right) +\varepsilon \log \left( \int _X e^{(u(x)-c(x,y'))/\varepsilon }d\rho _1\right) \\&\quad = \varepsilon \log \left( \dfrac{\int _X e^{(u(x)-c(x,y)+c(x,y)-c(x,y'))/\varepsilon }d\rho _1}{\int _X e^{(u(x)-c(x,y))/\varepsilon }d\rho _1}\right) \\&\quad \le \varepsilon \log \left( e^{\frac{\omega _{\sigma }(d(y,y'))}{\varepsilon }} +\dfrac{\int _{N^{\sigma }_y \cup N^{\sigma }_{y'}} e^{(u(x)-c(x,y'))/\varepsilon }d\rho _1}{\int _X e^{(u(x)-c(x,y))/\varepsilon }d\rho _1}\right) \\&\quad \le \varepsilon \log \left( e^{\frac{\omega _{\sigma }(d(y,y'))}{\varepsilon }}+\rho _1(N^{\sigma }_y \cup N^{\sigma }_{y'}) e^{2(\Vert c\Vert + \Vert u\Vert )/\varepsilon } \right) \\&\quad \le \varepsilon \log \left( e^{\frac{\omega _{\sigma }(d(y,y'))}{\varepsilon }}+2\sqrt{\sigma }e^{2(\Vert c\Vert + \Vert u\Vert )/\varepsilon }\right) \end{aligned}$$

    Now we denote by \(A= 2 e^{2(\Vert c\Vert + \Vert u\Vert )/\varepsilon }\) and thanks to the fact that if \(a, b \ge 0\) then \(e^a+b \le e^{a+b}\), we have

    $$\begin{aligned} |u^{(c,\varepsilon )}(y)-u^{(c,\varepsilon )}(y')| \le \omega _{\sigma }(d(y,y')) + \varepsilon \sqrt{\sigma }A \qquad \forall y, y' \not \in N_b^{\sigma }. \end{aligned}$$

    Then (having in mind also (iii) and that A depends only on \(\Vert u \Vert _{\infty }\)), we have that also (b) of Proposition 5.1 is satisfied for \(\mathcal {F}^{(c, \varepsilon )}(B)\), for every bounded set \(B \subset L^{\infty }(\rho _1)\), granting then the compactness of \(\mathcal {F}^{(c, \varepsilon )}\).

  4. (v)

    In this case we are assuming that Y is a geodesic space and that there exists \(K \in \mathbb {R}\) such that for each constant speed geodetic \((y_t)_{t \in [0,1]}\) we have

    $$\begin{aligned} c(x,y_t) \ge (1-t)c(x,y_0)+ t c(x,y_1) + 2 t(1-t) K d^2(y_0,y_1) \quad \forall x \in X. \end{aligned}$$

    Then, setting \(f_t(x)=e^{(u(x)-c(x,y_t))/\varepsilon }\), the K-concavity inequality for c implies

    $$\begin{aligned} f_t(x)&=e^{(u(x)-c(x,y_t))/\varepsilon } \\&\le e^{(u(x)-(1-t)c(x,y_0)-tc(x,y_1)-2t(1-t)Kd^2(y_0,y_1))/\varepsilon } \\&= e^{-2 t(1-t) K d^2(y_0,y_1)/\varepsilon } \cdot e^{((1-t)(u(x)-c(x,y_0))+t(u(x)-c(x,y_1)))/\varepsilon } \\&= e^{-2 t(1-t) K d^2(y_0,y_1)/\varepsilon } \cdot f_0(x)^{1-t} \cdot f_1(x)^t \end{aligned}$$

    Using this along with Hölder inequality we get

    $$\begin{aligned} u^{(c,\varepsilon )}(y_t)&= - \varepsilon \log \left( \int _X e^{(u(x)-c(x,y_t))/\varepsilon }\, d\rho _1 \right) = -\varepsilon \log \left( \int _X f_t(x) \, d \rho _1 \right) \\&\ge - \varepsilon \log \left( \int _X e^{-2 t(1-t) K d^2(y_0,y_1)/\varepsilon } \cdot f_0(x)^{1-t} \cdot f_1(x)^t \,d\rho _1\right) \\&= 2t (1-t)Kd^2(y_0,y_1)- \varepsilon \log \left( \int _X f_0(x)^{1-t} \cdot f_1(x)^t \,d\rho _1\right) \\&\ge 2t (1-t)Kd^2(y_0,y_1)- \varepsilon \log \left( \Bigl (\int _X f_0 \, d \rho _1 \Bigr )^{1-t} \cdot \Bigl (\int _X f_1 \, d \rho _1 \Bigr )^{t}\right) \\&=2t (1-t)Kd^2(y_0,y_1) + (1-t)u^{(c,\varepsilon )}(y_0)+ tu^{(c,\varepsilon )}(y_1). \end{aligned}$$

\(\square \)

Remark 2.5

(Entropic c-transform for \(\mathcal {S}_{\varepsilon }(\rho _1,\rho _2;\mathfrak {m}_1,\mathfrak {m}_2)\)). It is possible to define the entropic c-transform also for the Schrödinger problem \(\mathcal {S}_{\varepsilon }(\rho _1,\rho _2;\mathfrak {m}_1,\mathfrak {m}_2)\) with reference measures \(\mathfrak {m}_1\in \mathcal {P}(X)\) and \(\mathfrak {m}_2\in \mathcal {P}(Y)\). In this case,

$$\begin{aligned} \mathcal {F}^{(c, \varepsilon )}_{*} (u) (y)= & {} \varepsilon \log (\rho _2(y)) -\varepsilon \log \left( \int _X e^{ \frac{ u (x) - c(x,y) }{\varepsilon }}d\mathfrak {m}_1(x)\right) , \quad \text {and} \end{aligned}$$
(2.3)
$$\begin{aligned} \mathcal {F}^{(c, \varepsilon )}_{*}(v) (x)= & {} \varepsilon \log (\rho _1(x))-\varepsilon \log \left( \int _X e^{ \frac{ u (x) - c(x,y) }{\varepsilon }}d\mathfrak {m}_2(y)\right) . \end{aligned}$$
(2.4)

It is easy to see that

$$\begin{aligned} {\left\{ \begin{array}{ll} \mathcal {F}^{(c, \varepsilon )}_{*}(v) &{}= \varepsilon \log \rho _1 + \mathcal {F}^{(c, \varepsilon )}( v - \varepsilon \log \rho _2) \\ \mathcal {F}^{(c, \varepsilon )}_{*}(u) &{}= \varepsilon \log \rho _2 + \mathcal {F}^{(c, \varepsilon )}( u - \varepsilon \log \rho _1) \end{array}\right. } \end{aligned}$$

so that in fact the \((c,\varepsilon )-\)transforms with reference measures are in fact the \((c,\varepsilon )\)-trasforms conjugated by the addition of a function. In particular we can get exactely the same estimates we did in Lemma 2.3, up to translate in the appropriate manner. For example we would have if \(u\in L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_1)\), we would have then \(u^{(c,\varepsilon )}_{*}(y) -\varepsilon \log (\rho _2(y)) \in L^{\infty }(\mathfrak {m}_2)\).

Dual Problem

Let \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1), v\in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) and consider the Entropy-Kantorovich functional,

$$\begin{aligned} D_{\varepsilon }(u,v) = \int _X u(x)d\rho _1(x) + \int _Y v(y)d\rho _2(y) - \varepsilon \int _{X\times Y} e^{\frac{u(x)+v(y)-c(x,y)}{\varepsilon }}d(\rho _1\otimes \rho _2). \end{aligned}$$
(2.5)

What are the minimal assumption on uv in order to make sense for \(D_{\varepsilon }(u,v)\)? First of all if \(u^+ \in L^1(\rho _1)\) and \(v^+ \in L^1(\rho _2)\) then \(D_{\varepsilon }(u,v)< \infty \) and in particular in order to have \(D_{\varepsilon }(u,v)>-\infty \) we need \(u\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1), v\in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) which is then a natural assumption (since we want to compute the supremum of \(D_{\varepsilon }\)).

Lemma 2.6

Let us consider \(D_{\varepsilon }: L^{\mathrm{exp}}_{\varepsilon }(\rho _1) \times L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\rightarrow {\mathbb {R}}\) defined as in (2.5), then

$$\begin{aligned}&D_{\varepsilon }(u,u^{(c,\varepsilon )}) \ge D_{\varepsilon }(u,v) \qquad \forall v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2), \end{aligned}$$
(2.6)
$$\begin{aligned}&D_{\varepsilon }(u,u^{(c,\varepsilon )}) = D_{\varepsilon }(u,v) \text { if and only if } v = u^{(c,\varepsilon )}. \end{aligned}$$
(2.7)

In particular we can say that \(u^{(c,\varepsilon )}\in {\mathrm{argmax}} \{ D_{\varepsilon } ( u,v) \; : \; v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2) \}\).

Proof

By Fubini’s theorem and Eq. (2.1), we have

$$\begin{aligned} D_{\varepsilon }(u,v)&= \int _X u(x)d\rho _1(x) + \int _Y v(y)d\rho _2(y) - \varepsilon \int _{X\times Y} e^{\frac{u(x)+v(y)-c(x,y)}{\varepsilon }}d(\rho _1\otimes \rho _2),\\&= \int _X u(x)d\rho _1(x) + \int _Y v(y)d\rho _2(y) - \varepsilon \int _Y e^{\frac{v(y)}{\varepsilon }}\left( \int _X e^{\frac{u(x)-c(x,y)}{\varepsilon }}d\rho _1\right) d\rho _2,\\&= \int _X u(x)d\rho _1(x) + \int _Y v(y) - \varepsilon e^{\frac{v(y)-u^{(c,\varepsilon )}(y)}{\epsilon }}d\rho _2(y). \end{aligned}$$

Therefore, for any \(v\in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\), \( D_{\varepsilon }(u,v) \le D_{\varepsilon }(u,u^{(c,\varepsilon )})\), since the function \(g(t) = t-\varepsilon e^{(t-a)/\varepsilon }\) is strictly concave and attains its maximum in \(t=a\). In particular, \(D_{\varepsilon }(u,u^{(c,\varepsilon )})= D_{\varepsilon }(u,v)\) if and only if \(v = u^{(c,\varepsilon )}\). \(\square \)

Lemma 2.7

Let us consider \(u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\) and \(v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\). Then there exist \(u^* \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\) and \(v^* \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) such that

  • \(D_{\varepsilon }(u,v) \le D_{\varepsilon }(u^*,v^*)\);

  • \( \Vert v^* \Vert _{\infty } \le 3\Vert c\Vert _{\infty }/2 \);

  • \( \Vert u^*\Vert _{\infty } \le 3\Vert c\Vert _{\infty }/2 \).

Moreover \(a \in {\mathbb {R}}\) such that \(u^*= (v+a)^{(c,\varepsilon )}\) and \(v^*=(u^*)^{(c,\varepsilon )}\).

Proof

Let us apply Proposition 2.4 (iii) to v and \({\tilde{u}}=v^{(c,\varepsilon )}\):

$$\begin{aligned}&- \Vert c\Vert _{\infty } \le v^{(c,\varepsilon )}+ \lambda _v \le \Vert c \Vert _{\infty } \\&\quad - \Vert c\Vert _{\infty } \le (v^{(c,\varepsilon )})^{(c,\varepsilon )}+ \lambda _{v^{(c,\varepsilon )}} \le \Vert c \Vert _{\infty } \end{aligned}$$

Let us define \({\tilde{u}}=v^{(c,\varepsilon )}\) and \({\tilde{v}}=(v^{(c,\varepsilon )})^{(c,\varepsilon )}\). Then by Lemma 2.6 we have of course that \(D_{\varepsilon }(u,v) \le D_{\varepsilon }({\tilde{u}},{\tilde{v}})\); now we know that \(D_{\varepsilon }({\tilde{u}}-a, {\tilde{v}}+a)=D_{\varepsilon }({\tilde{u}}, {\tilde{v}})\) for any \(a \in {\mathbb {R}}\) and moreover

$$\begin{aligned} \Vert {\tilde{u}}-a \Vert _{\infty } \le \Vert c\Vert _{\infty } + | a+\lambda _v| \qquad \Vert {\tilde{v}}+a \Vert _{\infty } \le \Vert c\Vert _{\infty } + |\lambda _{v^{(c,\varepsilon )}} -a|. \end{aligned}$$

We can now choose \(a^*= (\lambda _{v^{(c,\varepsilon )}}- \lambda _v) /2\) and, recalling Lemma 2.3 (ii) we can conclude that \(u^*={\tilde{u}}-a^*\) and \(v^*={\tilde{v}}+a^*\) satisfy the required bounds. \(\square \)

Theorem 2.8

Let \((X,d_X)\), \((Y,d_Y)\) be Polish spaces, \(c:X\times Y\rightarrow \mathbb {R}\) be a Borel bounded cost, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures and \(\varepsilon >0\) be a positive number. Consider the problem

$$\begin{aligned} \sup \left\{ D_{\varepsilon }(u,v) \; : \; u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1) , v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2) \right\} . \end{aligned}$$
(2.8)

Then the supremum in (2.8) is attained for a unique couple \((u_0,v_0)\) (up to the trivial tranformation \((u,v) \mapsto (u+a,v-a)\)). In particular we have

$$\begin{aligned} u_0 \in L^{\infty }(\rho _1) \quad \text { and } \quad v_0 \in L^{\infty }(\rho _2); \end{aligned}$$

moreover we can choose the maximizers such that \(\Vert u_0\Vert _{\infty } ,\Vert v_0\Vert _{\infty } \le \frac{3}{2} \Vert c\Vert _{\infty }\).

Proof

Now, we are going to show that the supremum is attainded in the right-hand side of (2.8). Let \((u_n)_{n\in \mathbb {N}} \subset L^{\mathrm{exp}}_{\varepsilon }(\rho _1) \) and \((v_n)_{n\in \mathbb {N}} \subset L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) be maximizing sequences. Due to Lemma 2.7, we can suppose that \(u_n\in L^{\infty }(\rho _1)\), \(v_n\in L^{\infty }(\rho _2)\) and \(\Vert u_n\Vert _{\infty },\Vert v_n\Vert _{\infty } \le \frac{3}{2} \Vert c \Vert _{\infty }\). Then by Banach-Alaoglu theorem there exists subsequences \((u_{n_k})_{n_k\in \mathbb {N}}\) and \((v_{n_k})_{n_k\in \mathbb {N}}\) such that \(u_{n_k}\rightharpoonup {\overline{u}}\) and \(v_{n_k}\rightharpoonup {\overline{v}}\). In particular, \({\tilde{u}}_{n_k}+{\tilde{v}}_{n_k}-c\rightharpoonup {\overline{u}}+{\overline{v}}-c\).

First, notice that since \(t \mapsto e^t\) is a convex function, we have

$$\begin{aligned}&\liminf _{n\rightarrow \infty } \int _{X\times Y}e^{\frac{u_n+v_n-c}{\varepsilon }}d(\rho _1\otimes \rho _2)&= \liminf _{n\rightarrow \infty } \int _{X\times Y}e^{\frac{u_n+v_n-c}{\varepsilon }}d(\rho _1\otimes \rho _2) \\&\quad \ge \int _{X\times Y}e^{\frac{{\overline{u}}+{\overline{v}}-c}{\varepsilon }}d(\rho _1\otimes \rho _2). \end{aligned}$$

Moreover,

$$\begin{aligned} \sup _{u,v} D_{\varepsilon }(u,v)&= \lim _{n\rightarrow \infty }\left\{ \int _X u_nd\rho _1 + \int _Y v_nd\rho _2 - \varepsilon \int _{X\times Y}e^{\frac{u_n+v_n-c}{\varepsilon }}d(\rho _1\otimes \rho _2) \right\} \\&\le \lim _{n\rightarrow \infty }\left\{ \int _X u_n d\rho _1 + \int _Y v_n d\rho _2 \right\} - \varepsilon \liminf _{ n \rightarrow \infty }\left\{ \int _{X\times Y}e^{\frac{u_n+v_n-c}{\varepsilon }}d(\rho _1\otimes \rho _2)\right\} \\&\le \int _X {\overline{u}}d\rho _1 + \int _Y {\overline{v}}d\rho _2 - \varepsilon \int _{X\times Y}e^{\frac{{\overline{u}}+{\overline{v}}-c}{\varepsilon }}d(\rho _1\otimes \rho _2) = D({\overline{u}},{\overline{v}}). \end{aligned}$$

So, \(({\overline{u}},{\overline{v}})\) is a maximizer for \(D_{\varepsilon }\). By construction, we have also that \({\overline{u}} \in L^{\infty }(\rho _1)\) and \(\quad {\overline{v}} \in L^{\infty }(\rho _2)\). Finally, the strictly concavity of \(D_{\varepsilon }\) and Lemma 2.6 implies that the maximizer is unique and, in particular \({\overline{v}} = {\overline{u}}^{(c,\varepsilon )}\). \(\square \)

Corollary 2.9

Let \((X,d_X,\mathfrak {m}_1)\), \((Y,d_Y,\mathfrak {m}_2)\) be Polish metric measure spaces, \(c:X\times Y\rightarrow {\mathbb {R}}\) be a Borel bounded cost function, \(\rho _1\in \mathcal {P}(X)\) and \(\rho _2\in \mathcal {P}(Y)\) be probability measures such that \({\text {KL}}(\rho _1|\mathfrak {m}_1) + {\text {KL}}(\rho _1|\mathfrak {m}_2) < \infty \). Consider the dual functional \({{\tilde{D_{\varepsilon }}}}:L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_1)\times L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_2)\rightarrow \mathbb {R}\),

$$\begin{aligned}&{\tilde{D_{\varepsilon }}}(u,v) = \int _X u(x)\rho _1(x)d\mathfrak {m}_1(x) + \int _Y v(y)\rho _2(y)d\mathfrak {m}_2(y) \\&\quad - \varepsilon \int _{X\times Y} e^{\frac{u(x)+v(y)-c(x,y)}{\varepsilon }}d(\mathfrak {m}_1(x)\otimes \mathfrak {m}_2(y)). \end{aligned}$$

Then the supremum

$$\begin{aligned} \sup \left\{ D_{\varepsilon }(u,v) \; : \; u \in L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_1) , v \in L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_2) \right\} , \end{aligned}$$

is attained for a unique couple \((u_0,v_0)\) and in particular we have

$$\begin{aligned} u_0 - \varepsilon \log \rho _1 \in L^{\infty }(\mathfrak {m}_1) \quad \text { and } \quad v_0 -\varepsilon \log \rho _2 \in L^{\infty }(\mathfrak {m}_2). \end{aligned}$$

Proof

The proof follows by the change of variable \(T:(u,v)\mapsto (u -\varepsilon \log \rho _1, v-\varepsilon \log \rho _2)\) which is such that \({{\tilde{D_{\varepsilon }}}} (u,v)= D_{\varepsilon }( T(u,v))+ \varepsilon {\text {KL}}(\rho _1|\mathfrak {m}_1) + \varepsilon {\text {KL}}(\rho _1|\mathfrak {m}_2)\), and Theorem 2.8. Another way is to apply same arguments of theorem (2.8) by using the Entropic c-transform \(u^{(c,\varepsilon )}_{\mathfrak {m}_1}\) described in Remark 2.5. \(\square \)

In the following proposition an important concept will be that of bivariate transformation. Given \(\mathcal {K}\) a Gibbs measure, a(x) and b(y) two measurable function with respect to \(\kappa \), such that \(a,b \ge 0\), we define the bivariate transformation of \(\mathcal {K}\) through a and b as

$$\begin{aligned} \kappa (a,b):= a(x) b(y) \cdot \mathcal {K}\end{aligned}$$
(2.9)

this is still a (possibily infinite) measure.

Lemma 2.10

Let \(\varepsilon >0\) be a positive number, \((X,d_X)\) and \((Y,d_Y)\) be Polish metric spaces, \(c:X\times Y\rightarrow {\mathbb {R}}\) be a cost function (not necessarily bounded), \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures and let \(\kappa \) as in (1.2). Then for every \(\gamma \in \Pi (\rho _1, \rho _2)\), \(u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1) \) and \( v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) then we have

$$\begin{aligned} \varepsilon {\text {KL}}(\gamma |\mathcal {K}) \ge D_{\varepsilon } ( u,v) + \varepsilon , \quad \text { with equality iff }\gamma =\kappa (e^{u/\varepsilon }, e^{v/\varepsilon }), \end{aligned}$$
(2.10)

where \(\kappa \) is defined as in (2.9).

Proof

First of all we can assume \(\gamma \ll \mathcal {K}\), otherwise the right hand side would be \(+ \infty \) and so the inequality would be verified; then if we denote (with a slight abuse of notation) \(\gamma (x,y)\) the density of \(\gamma \) with respect to \(\mathcal {K}\), we get

$$\begin{aligned} \varepsilon {\text {KL}}(\gamma |\mathcal {K})&= \int _{X\times Y}c d\gamma + \varepsilon \int _{X\times Y}\gamma \log \gamma d\left( \rho _1\otimes \rho _2\right) \\&= \int _{X\times Y}(c+\varepsilon \log \gamma -u-v)\cdot \gamma d\rho _1\otimes \rho _2 + \int _X ud\rho _1 + \int _Y vd\rho _2 \\&= \int _X u d\rho _1 + \int _Y v d\rho _2 + \int _{X\times Y}\left( \varepsilon \log \gamma +c-u-v\right) \cdot \gamma d\left( \rho _1\otimes \rho _2\right) \\&\ge \int _X u d\rho _1 + \int _Y v d\rho _2- \varepsilon \int _{X\times Y} e^{\frac{u+v-c}{\varepsilon }}d\left( \rho _1\otimes \rho _2\right) + \varepsilon \\&= D_{\varepsilon }(u,v) + \varepsilon , \end{aligned}$$

where we used \(ts + \varepsilon t\ln t - \varepsilon \ge -\varepsilon e^{-s/\varepsilon }\), with equality if \(t = e^{-s / \varepsilon }\). Notice in particular that, as we wanted, there is equality iff \(\gamma = e^{(u(x)+v(y)-c(x,y))/\varepsilon } \cdot \rho _1 \otimes \rho _2 = \kappa (e^{u/\varepsilon }, e^{v/\varepsilon }) \). \(\square \)

Proposition 2.11

(Equivalence and complementarity condition) Let \(\varepsilon >0\) be a positive number, \((X,d_X)\) and \((Y,d_Y)\) be Polish metric spaces, \(c:X\times Y\rightarrow {\mathbb {R}}\) be a bounded cost function, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures and let \(\kappa \) as in (1.2). Then given \(u^* \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1) , v^* \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\), the following are equivalent:

  1. 1.

    (Maximizers) \(u^*\) and \(v^*\) are maximizing potentials for (2.8);

  2. 2.

    (Maximality condition) \(\mathcal {F}^{(c, \varepsilon )}(u^*)=v^*\) and \(\mathcal {F}^{(c, \varepsilon )}(v^*)=u^*\);

  3. 3.

    (Schrödinger system) let \(\gamma ^*=\kappa (e^{u^*/\varepsilon }, e^{v^*/\varepsilon })=e^{(u^*(x)+v^*(y)-c(x,y))/\varepsilon } \cdot \rho _1 \otimes \rho _2\), then \(\gamma ^* \in \Pi (\rho _1, \rho _2)\);

  4. 4.

    (Duality attainement) \({\text {OT}}_{\varepsilon }(\rho _1,\rho _2) = D_{\varepsilon } (u^*,v^*) +\varepsilon \).

Moreover in those cases \(\gamma ^*\), as defined in 3, is also the (unique) minimizer for the problem (1.1).

Proof

We will prove \( 1 \Rightarrow 2 \Rightarrow 3 \Rightarrow 4 \Rightarrow 1\).

  • 1. \(\Rightarrow \) 2. This is a straightforward application of Lemma 2.6. In fact thanks to (2.6) we have \(D_{\varepsilon } (u^*, \mathcal {F}^{(c, \varepsilon )}(u^*)) \ge D_{\varepsilon }(u^*,v^*)\); however, by the maximality of \(u^*,v^*\) we have also \(D_{\varepsilon }(u^*,v^*) \ge D_{\varepsilon } (u^*, \mathcal {F}^{(c, \varepsilon )}(u^*)) \), and so we conclude that \(D_{\varepsilon }(u^*,\mathcal {F}^{(c, \varepsilon )}(u^*))=D_{\varepsilon }(u^*,v^*)\). Thanks to (2.7) we then deduce that \(v^*=\mathcal {F}^{(c, \varepsilon )}(u^*)\). We can follow a similar argument to prove that conversely \(u^*=\mathcal {F}^{(c, \varepsilon )}(v^*)\).

  • 2. \(\Rightarrow \) 3. A simple calculation shows for every \(u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1) \) and \( v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2)\) we have \((\pi _1)_{\sharp } ( \kappa (e^{u/\varepsilon }, e^{v /\varepsilon })) = e^{(u-v^{(c, \varepsilon )})/\varepsilon }\rho _1\) and similarly \((\pi _2)_{\sharp } ( \kappa (e^{u/\varepsilon }, e^{v /\varepsilon })) = e^{(v-u^{(c, \varepsilon )})/\varepsilon }\rho _2\). So if we assume 2. it is trivial to see that in fact \(\gamma ^* = \kappa (e^{u^*/\varepsilon }, e^{v^* /\varepsilon }) \in \Pi (\rho _1, \rho _2)\)

  • 3. \(\Rightarrow \) 4. since \(\gamma ^* \in \Pi (\rho _1, \rho _2)\), from Lemma 2.10 we have

    $$\begin{aligned}&\varepsilon {\text {KL}}(\gamma ^*|\mathcal {K}) \ge D_{\varepsilon } ( u,v) + \varepsilon \quad \forall u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1), v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2) \end{aligned}$$
    (2.11)
    $$\begin{aligned}&\varepsilon {\text {KL}}(\gamma |\mathcal {K}) \ge D_{\varepsilon } ( u^*,v^*) + \varepsilon \quad \forall \gamma \in \Pi (\rho _1,\rho _2). \end{aligned}$$
    (2.12)

    Moreover, since by definition \(\gamma ^*= \kappa (e^{u^*/\varepsilon }, e^{v^*/\varepsilon })\), Lemma 2.10 assure us also that

    $$\begin{aligned} \varepsilon {\text {KL}}(\gamma ^*|\mathcal {K}) \ge D_{\varepsilon } ( u^*,v^*) + \varepsilon . \end{aligned}$$
    (2.13)

    Putting now (2.11),(2.12) and (2.13) together we obtain

    $$\begin{aligned} \varepsilon {\text {KL}}(\gamma |\mathcal {K}) \ge D_{\varepsilon } ( u^*,v^*) + \varepsilon = \varepsilon {\text {KL}}(\gamma ^*|\mathcal {K}) \ge D_{\varepsilon } ( u,v) + \varepsilon ; \end{aligned}$$

    in particular we have \(\varepsilon {\text {KL}}(\gamma |\mathcal {K}) \ge \varepsilon {\text {KL}}(\gamma ^*|\mathcal {K})\) which grants us that \(\gamma ^*\) is a minimizer for (1.1) and that in particular \({\text {OT}}_{\varepsilon }(\rho _1,\rho _2) = \varepsilon {\text {KL}}(\gamma ^*|\mathcal {K}) = D_{\varepsilon } ( u^*,v^*)+\varepsilon \).

  • 4. \(\Rightarrow \) 1. Looking at (2.10) and minimizing in \(\gamma \) we find that

    $$\begin{aligned} {\text {OT}}_{\varepsilon }(\rho _1,\rho _2) \ge D_{\varepsilon } ( u,v) + \varepsilon \qquad \forall u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1), v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2); \end{aligned}$$

    using that by hypotesis \({\text {OT}}_{\varepsilon }(\rho _1,\rho _2)= D_{\varepsilon } ( u^*,v^*) + \varepsilon \), we get that

    $$\begin{aligned} D_{\varepsilon } ( u^*,v^*) \ge D_{\varepsilon } ( u,v) \qquad \forall u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1), v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2), \end{aligned}$$

    that is, \(u^*,v^*\) are maximizing potentials for (2.8).

Notice that in proving \(3 \Rightarrow 4\) we incidentally proved that \(\gamma ^*\) is the (unique) minimizer. \(\square \)

Finally, we conclude this section by giving a short proof of the duality between (1.1) and (2.8).

Proposition 2.12

(General duality) Let \(\varepsilon >0\) be a positive number, \((X,d_X)\) and \((Y,d_Y)\) be Polish metric spaces, \(c:X\times Y\rightarrow {\mathbb {R}}\) be a bounded cost function, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures. Then duality holds

$$\begin{aligned} {\text {OT}}_{\varepsilon }(\rho _1,\rho _2) = \max \left\{ D_{\varepsilon }(u,v) \; : \; u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1) , v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2) \right\} +\varepsilon . \end{aligned}$$

Proof

From Theorem 2.8 we have the existence of a maximizing pair of potentials \(u^*,v^*\). In particular we have

$$\begin{aligned} \max \left\{ D_{\varepsilon }(u,v) \; : \; u \in L^{\mathrm{exp}}_{\varepsilon }(\rho _1) , v \in L^{\mathrm{exp}}_{\varepsilon }(\rho _2) \right\} +\varepsilon = D_{\varepsilon } (u^*,v^*) +\varepsilon ; \end{aligned}$$

this, together with point 4 in Proposition 2.11 (which is true since 1 holds true), proves the duality. \(\square \)

By a similar argument, one can show that the duality holds also for the functional \(\mathcal {S}_{\varepsilon }(\rho _1,\rho _2;\mathfrak {m}_1,\mathfrak {m}_2)\).

Corollary 2.13

Let \(\varepsilon >0\) be a positive number, \((X,d_X,\mathfrak {m}_1)\) and \((Y,d_Y,\mathfrak {m}_2)\) be Polish metric measure spaces, \(c:X\times Y\rightarrow {\mathbb {R}}\) be a bounded cost function, \(\rho _1 \in \mathcal {P}(X)\), \(\rho _2 \in \mathcal {P}(Y)\) be probability measures. Then duality holds

$$\begin{aligned} \mathcal {S}_{\varepsilon }(\rho _1,\rho _2;\mathfrak {m}_1,\mathfrak {m}_2) = \max \left\{ {\tilde{D}}_{\varepsilon }(u,v) : u\in L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_1), v\in L^{\mathrm{exp}}_{\varepsilon }(\mathfrak {m}_2) \right\} + \varepsilon . \end{aligned}$$

Convergence of the Sinkhorn/IPFP Algorithm

In this section, we give an alternative proof for the convergence of the Sinkhorn algorithm. The aim of the Iterative Proportional Fitting Procedure (IPFP, also known as Sinkorn algorithm) is to construct the measure \(\gamma ^{\varepsilon }\) realizing minimum in (1.1) by alternatively matching one marginal distribution to the target marginals \(\rho _1\) and \(\rho _2\): this leads to the construction of the IPFP sequences \((a^n)_{n\in \mathbb {N}}\) and \((b^n)_{n\in \mathbb {N}}\), defined in (1.4).

We now look at the new variables \(u_n := \varepsilon \ln (a^n)\) and \(v_n ;= \varepsilon \ln (b^n)\): we can then rewrite the system (1.4) as

$$\begin{aligned} \begin{array}{lcl} \displaystyle v_n(y)/\varepsilon = &{} - \log \left( \int _X k(x,y)e^{ \frac{u_{n-1}(x)}{\varepsilon }}d\rho _1\right) \\ \displaystyle u_n(x)/\varepsilon = &{} - \log \left( \int _Y k(x,y)e^{ \frac{v_{n}(y)}{\varepsilon }}d\rho _2\right) \end{array}. \end{aligned}$$

In other words, using the \((c,\varepsilon )-\)transform and the expression of k(xy) given in (1.2), \(v^n(y) = (u^{(n-1)})^{(c,\varepsilon )}\) and \(u^n(y) = (v^n)^{(c,\varepsilon )}\).

Theorem 3.1

Let \((X,d_X)\) and \((Y,d_Y)\) be Polish metric spaces, \(\rho _1 \in \mathcal {P}(X)\) and \(\rho _2\in \mathcal {P}(Y)\) be probability measures and \(c:X\times Y\rightarrow \mathbb {R}\) be a Borel bounded cost. If \((a^n)_{n\in \mathbb {N}}\) and \((b^n)_{n\in \mathbb {N}}\) are the IPFP sequences defined in (1.4), then there exists a sequence of positive real numbers \((\lambda ^n)_{n \in \mathbb {N}}\) such that

$$\begin{aligned} a^n/\lambda ^n\rightarrow a \text { in } L^p(\rho _1) \quad \text { and } \quad \lambda ^nb^n \rightarrow b \text { in } L^p(\rho _2), \quad 1\le p <+\infty , \end{aligned}$$

where (ab) solve the Schrödinger problem. In particular, the sequence \(\gamma ^n = a^nb^n k\), where k is defined in (1.2) converges in \(L^p(\rho _1\otimes \rho _2)\) to \(\gamma ^{\varepsilon }_{opt}\), the density of the minimizer of (1.1) with respect to \(\rho _1 \otimes \rho _2\), for any \(1\le p <+\infty \).

Proof

Let \((a^n)_{n\in \mathbb {N}}\) and \((b^n)_{n\in \mathbb {N}}\) be the IPFP sequence defined in (1.4). Let us write \(a^n = e^{u_n/\varepsilon }\), \(b^n = e^{v_n/\varepsilon }\); then, in this new variables, we noticed that the iteration can be written with the help of the \((c, \varepsilon )\)-transform:

$$\begin{aligned} {\left\{ \begin{array}{ll} v_{2n+1} = (u_{2n})^{(c,\varepsilon )} \\ u_{2n+1} = u_{2n} \\ \end{array}\right. }, \quad {\left\{ \begin{array}{ll} v_{2n+2} = v_{2n+1} \\ u_{2n+2} = (v_{2n+1})^{(c,\varepsilon )} \\ \end{array}\right. }. \end{aligned}$$

Notice that, as soon as \(n\ge 2\), we have \(u_n \in L^{\infty }(\rho _1)\) and \(v_n \in L^{\infty }(\rho _2)\) thanks to the regularizing properties of the \((c,\varepsilon )\)-transform proven in Lemma 2.3 and, moreover, thanks to (2.6) and Proposition 2.12 we have

$$\begin{aligned} D_{\varepsilon }(u_n,v_n)\le D_{\varepsilon }(u_{n+1},v_{n+1}) \le \dots \le {\text {OT}}_{\varepsilon }(\rho _1,\rho _2) - \varepsilon . \end{aligned}$$

Then, by the same argument used in the proof of Lemma 2.7 it is easy to prove that there for each \(n \ge 2\) there exists \(\ell _n \in \mathbb {R}\) such that \( \Vert u_n - \ell _n \Vert _{\infty }, \Vert v_n +\ell _n\Vert \le \frac{3}{2} \Vert c\Vert _{\infty }\). Now, thanks to Proposition 2.4 we have that the sequeces \(u_n - \ell _n\) and \(v_n +\ell _n\) are precompact in every \(L^p\), for \(1 \le p < \infty \); in particular let us consider any limit point uv. Then we have a subsequence \(u_{n_k},v_{n_k}\) such that \(u_{n_k}\rightarrow u\),\(v_{n_k}\rightarrow v\) in \(L^{\infty }\) and \(u_{n_k+1} = (v_{n_k})^{(c,\varepsilon )}\) (or the opposite). Using the continuity in \(L^p\) of the \((c,\varepsilon )\)-transform, and the fact that an increasing and bounded sequence has vanishing increments, we obtain

$$\begin{aligned} D_{\varepsilon }(v^{(c,\varepsilon )},v) - D_{\varepsilon }(u,v) = \lim _{n_k\rightarrow \infty } D_{\varepsilon }(u_{n_k+1},v_{n_k+1}) - D_{\varepsilon }({u_{n_k}},v_{n_k}) = 0. \end{aligned}$$

In particular, by (2.7), we have \(u=v^{(c,\varepsilon )}\). Analogously, we obtain that \(v=u^{(c,\varepsilon )}\) by doing the same calculation using the potentials \((u_{n_{k+2}}, v_{n_{k+2}})\) and then

$$\begin{aligned} D_{\varepsilon }(u,u^{(c,\varepsilon )}) - D_{\varepsilon }(u,v) = \lim _{n_k\rightarrow \infty } D_{\varepsilon }(u_{n_k+2},v_{n_k+2}) - D_{\varepsilon }({u_{n_k}},v_{n_k}) = 0. \end{aligned}$$

Now we can use Proposition 2.11: the implication \(2 \Rightarrow 1\) proves that (uv) is a maximizer.Footnote 2 In particular \(a=e^{u/\varepsilon }\), \(b=e^{v/\varepsilon }\) are solutions of the Schrödinger equation and taking \(\lambda ^n=e^{\ell _n/\varepsilon }\) we get the convergence result for \(a^n\) and \(b^n\), using that the exponential is Lipschitz in bounded domains.

In order to prove also the convergence of the plans, it is sufficient to note that for free we have \(u_n+v_n \rightarrow u+v\) in \(L^p(\rho _1 \otimes \rho _2)\), since now the translations are cancelled. Again, the fact that the exponential is Lipschitz on bounded domains and the boundedness of k, will let us conclude that in fact \(\gamma ^n \rightarrow \gamma \) in \(L^p(\rho _1 \otimes \rho _2)\) for every \(1 \le p < \infty \). \(\square \)

Remark 3.2

Notice that as long as we have more hypothesis on the smoothness of the cost function c we can use precompactness of the sequences \(u_n-\ell _n\) and \(v_n +\ell _n\) on larger space, obtaining faster convergence. For example if c is uniformly continuous we will get the uniform convergence instead of strong \(L^p\) convergence.

Multi-marginal Schrödinger Problem

In this section we generalize the results obtain previously for the Schrödinger problem with more than two marginals, including a proof of convergence of the Sinkhorn algorithm in the several marginals case.

We consider \((X_1,d_1), \dots , (X_N,d_N)\) Polish spaces, \(\rho _1,\dots ,\rho _N\) probability measures respectively in \(X_1,\dots ,X_N\) and \(c:X_1\times \dots \times X_N\rightarrow \mathbb {R}\) a bounded cost. Define \(\rho ^N= \rho _1\otimes \dots \otimes \rho _N\) by the product measure. For every \(\gamma \in {\mathcal {M}}(X_1\times \dots \times X_N)\), the relative entropy of \(\gamma \) with respect to the Gibbs Kernel \(\mathcal {K}(x_1,\dots ,x_N) = k^N(x_1,\dots ,x_N)\rho ^N = e^{-\frac{c(x_1,\dots ,x_N)}{\varepsilon }}d\rho _1\otimes \dots \otimes \rho _N\) is defined by

$$\begin{aligned} \displaystyle {\text {KL}}^{N}(\gamma |\mathcal {K}) ={\left\{ \begin{array}{ll} \displaystyle \int _{X_1\times \dots \times X_N}\gamma \log \left( \frac{\gamma }{k^N}\right) d\rho ^N\qquad &{} \text { if }\gamma \ll \rho ^N\\ +\infty &{} \text { otherwise.} \end{array}\right. } \end{aligned}$$
(4.1)

An element \(\gamma \in \Pi (\rho _1,\dots ,\rho _N)\) is called coupling and is a probability measure on the product space \(X_1\times \dots \times X_N\) having the ith-marginal equal to \(\rho _i\), i.e \(\gamma \in {\mathcal {P}}(X_1\times \dots \times X_N)\) such that \((e_i)_{\sharp }\gamma = \rho _i, \, \forall i \in \lbrace 1,\dots , N\rbrace \).

The Multi-marginal Schrödinger problem is defined as the infimum of the Kullback–Leibler divergence \({\text {KL}}^{N}(\gamma |\mathcal {K})\) over the couplings \(\gamma \in \Pi (\rho _1,\dots ,\rho _N)\)

$$\begin{aligned} {\text {OT}}^N_{\varepsilon }(\rho _1,\dots ,\rho _N) = \inf _{\gamma \in \Pi (\rho _1,\dots ,\rho _N)}\varepsilon \int _{X_1\times \dots \times X_N}{\text {KL}}(\gamma |\mathcal {K})d\gamma . \end{aligned}$$
(4.2)

Optimal Transport problems with several marginals or its entropic-regularization appears, for instance, in economics Carlier and Ekeland [17], and Chiappori et al. [22]; imaging (e.g. [27, 68]); and in theoretical chemistry (e.g. [30, 42, 46]). The first important instance of such kind of problems is attributed to Brenier’s generalised solutions of the Euler equations for incompressible fluids [9,10,11].

We point out that the entropic-regularization of the multi-marginal transport problem leads to a problem of multi-dimensional matrix scaling [36, 64]. An important example in this setting is the Entropy-Regularized Wasserstein Barycenter introduced by Agueh and Carlier [1]. The Wasserstein Barycenter defines a non-linear interpolation between several probabilities measures generalizing the Euclidian barycenter and turns out to be equivalent to Gangbo-Świeçh cost [38], that is \(c(x_1,\dots ,x_N) = \frac{1}{2}\Vert x_j-x_i\Vert ^2\) .

In the next section we extend to the multi-marginal setting the notions and properties of the Entropy c-transform done in Sect. 2. As a consequence, we generalise the proof of convergence of IPFP.

Entropy-Transform

Analogously to Definitions (2.1) and (2.2) in Sect. 2.1, we define the following Entropy c-transforms \(\hat{u}^{(N,c,\varepsilon )}_1,\hat{u}^{(N,c,\varepsilon )}_2,\dots ,\hat{u}^{(N,c,\varepsilon )}_N\). Notice that the notation \(\hat{u_i}\) stands for \(\hat{u_i}= (u_1,\dots ,u_{i-1},u_{i+1},\dots ,u_N)\).

Definition 4.1

(Entropic c-transform or \((c,\varepsilon )\)-transform) Let \(i\in \lbrace 1,\dots , N\rbrace \) and \(\varepsilon >0\) be a positive number. Consider \((X_i,d_{X_i})\) Polish spaces, \(\rho _i \in \mathcal {P}(X_i)\) probability measures and let c a bounded measurable cost on \(X_1 \times \dots \times X_N\). For every i, the Entropy c-transform \(\hat{u}^{(N,c,\varepsilon )}_i\) is defined by the functional \(\mathcal {F}_i^{(N,c, \varepsilon )}: \prod _{j \ne i} L^{\mathrm{exp}}_{\varepsilon }(\rho _{j})\rightarrow L^0(\rho _i)\),

$$\begin{aligned} \hat{u}^{(N,c,\varepsilon )}_i(x_i)=\mathcal {F}_i^{(N,c, \varepsilon )}( {{\hat{u}}}_i ) (x_{i}) = - \varepsilon \log \left( \int _{\prod _{j\ne i}X_j}e^{\frac{\sum _{j\ne i}u_j(x_j)-c(x_1,\dots ,x_N)}{\varepsilon }} d\left( \otimes _{j\ne i}\rho _j\right) \right) .\nonumber \\ \end{aligned}$$
(4.3)

In particular, we have \(\hat{u}^{(N,c,\varepsilon )}_i\in L^{\mathrm{exp}}_{\varepsilon }( \rho _{i})\). For \(u_i \in L^{\mathrm{exp}}_{\varepsilon }(X_i,\rho )\), we denote the constant \(\lambda _{u_i}\) by

$$\begin{aligned} \lambda _{u_i} = \varepsilon \log \left( \int _{\prod _{j\ne i}X_j}e^{\frac{\sum _{j\ne i}u_j(x_j)}{\varepsilon }} d\left( \otimes _{j\ne i}\rho _j\right) \right) . \end{aligned}$$

There is also the possibility to reconduce us to the case \(N=2\): notice that if one considers the spaces \(X_i \) and \(Y_i=\Pi _{j \ne i} X_j\), then c is also a natural cost function on \(X_i \times Y_i\). We can then consider \(\rho _i\) as a measure on \(X_i\) and \(\otimes _{j \ne i} \rho _j\) as a measure on \(Y_i\). In this way we able to construct an entropic c-trasform \(\mathcal {F}^{(c, \varepsilon )}\) associated to this 2-marginal problem and it is clear that

$$\begin{aligned} \mathcal {F}_i^{(N,c, \varepsilon )}( {{\hat{u}}}_i ) = \mathcal {F}^{(c, \varepsilon )}\Bigl ( \sum _{j \ne i} u_j \Bigr ). \end{aligned}$$

The following lemma extend Lemma 2.3 in the multi-marginal setting. We omit the proof since it follow by similar arguments.

Lemma 4.2

For every \(i\in \lbrace 1,\dots ,N\rbrace \), the Entropy c-transform \(\hat{u}^{(N,c,\varepsilon )}_i\) is well defined. Moreover,

  1. (i)

    \(\hat{u}^{(N,c,\varepsilon )}_i\in L^{\infty }\left( \rho _i\right) \). In particular,

    $$\begin{aligned}&-\Vert c \Vert _{\infty }- \varepsilon \log \left( \int _{\prod _{j\ne i} X_i} e^{ \frac{ \sum _{j\ne i}u_j (x_j)}{\varepsilon }} \, d \left( \otimes _{j\ne i}\rho _j\right) \right) \le \hat{u}^{(N,c,\varepsilon )}_i(x_{i}) \\&\quad \le \Vert c\Vert _{\infty } -\varepsilon \log \left( \int _{\prod _{j\ne i} X_i} e^{ \frac{ \sum _{j\ne i}u_j (x_j)}{\varepsilon }} \, d \left( \otimes _{j\ne i}\rho _j\right) \right) . \end{aligned}$$
  2. (ii)

    \(\hat{u}^{(N,c,\varepsilon )}_i\in L^{\mathrm{exp}}_{\varepsilon }\left( \rho _i\right) \).

  3. (iii)
    $$\begin{aligned} | \hat{u}^{(N,c,\varepsilon )}_i(x_i)+ \sum _{i \ne j} \lambda _{u_j} | \le \Vert c \Vert _{\infty }. \end{aligned}$$
    (4.4)
  4. (iv)

    if c is L-Lipschitz (resp. \(\omega \)-continuous), then \(\hat{u}^{(N,c,\varepsilon )}_i\) is L-Lipschitz (resp. \(\omega \)-continuous);

  5. (v)

    if \(\vert c\vert \le M\), then \({\text {osc}}(\hat{u}^{(N,c,\varepsilon )}_i)\le 2M\) and \(\mathcal {F}_i^{(N,c, \varepsilon )}: \prod _{j \ne i} L^{\infty }(\rho _j)\rightarrow L^p(\rho _i)\) for \(i=1, \ldots , n\) are compact operators for every \(1 \le p < \infty \).

Entropy-Kantorovich Duality

We introduce the dual functional \(D^N_{\varepsilon }:L^{\mathrm{exp}}_{\varepsilon }(\rho _1)\times \dots \times L^{\mathrm{exp}}_{\varepsilon }(\rho _N)\rightarrow [0,+\infty ]\),

$$\begin{aligned} D^N_{\varepsilon }(u_1,\dots ,u_N) = \sum ^N_{i=1}\int _{X_i}u_id\rho _i - \varepsilon \int _{X_1\times \dots \times X_N}e^{\frac{\sum ^N_{i=1}u_i(x_i)-c(x_1,\dots ,x_N)}{\varepsilon }}d\left( \rho _1\otimes \dots \otimes \rho _N\right) .\nonumber \\ \end{aligned}$$
(4.5)

In the sequel we will use the invariance by translation of the dual problem, and thus we introduce the following projection operator:

Lemma 4.3

Let us consider the operator \(P: \prod _{i=1}^N L^{\infty }( \rho _i) \rightarrow \prod _{i=1}^N L^{\infty }( \rho _i) \) defined as

$$\begin{aligned} P_i(u) = {\left\{ \begin{array}{ll} u_i - \lambda _{u_i} \qquad \qquad &{}\text { if } i=1, \ldots , N-1 \\ u_i+ \sum _{j \ne i}^{N-1} \lambda _{u_j} &{} \text { if } i=N. \end{array}\right. } \end{aligned}$$

Then the following properties hold

  1. (i)

    \(D^N_{\varepsilon }(P(u))=D^N_{\varepsilon } (u)\);

  2. (ii)

    \(\Vert P_i(u)\Vert _{\infty } \le osc ( u_i) + |\sum _{i=1}^N \lambda _{u_i}| \), for all \(i=1, \ldots , N\);

  3. (iii)

    let \(v=P(u)\). Then \(u_i= \mathcal {F}_i^{(N,c, \varepsilon )}( {{\hat{u}}}_i )\) if only if \(v_i= \mathcal {F}_i^{(N,c, \varepsilon )}( {{\hat{v}}}_i )\).

Proof

   

  1. (i)

    In order to prove \(D^N_{\varepsilon }(P(u))=D^N_{\varepsilon } (u)\) we first observe that

    $$\begin{aligned} \sum _{i=1}^N P_i ( u ) (x_i) =u_N(x_N) + \sum _{i=1}^{N-1} \lambda _{u_i} + \sum _{i=1}^{N-1} (u_i(x_i) - \lambda _{u_i} ) = \sum _{i=1}^N u_i(x_i). \end{aligned}$$

    In particular we have (here we denote \(X=X_1 \times \cdots \times X_N\)

    $$\begin{aligned} D^N_{\varepsilon }(P(u))&= \sum ^N_{i=1}\int _{X_i}P_i(u)d\rho _i - \varepsilon \int _{X_1\times \dots \times X_N}e^{\frac{\sum ^N_{i=1}P_i(u)(x_i)-c(x_1,\dots ,x_N)}{\varepsilon }}d\left( \rho _1\otimes \dots \otimes \rho _N\right) \\&= \int _{X}\sum ^N_{i=1} P_i(u) (x_i) \, d \rho ^N - \varepsilon \int _{X}e^{\frac{\sum ^N_{i=1}P_i(u)(x_i)-c(x_1,\dots ,x_N)}{\varepsilon }}d \rho ^N \\&= \int _{X}\sum ^N_{i=1} u_i (x_i) d \rho ^N - \varepsilon \int _{X}e^{\frac{\sum ^N_{i=1}u_i(x_i)-c(x_1,\dots ,x_N)}{\varepsilon }}d \rho ^N= D^N_{\varepsilon }(u) \end{aligned}$$
  2. (ii)

    The inequality is not trivial only if \(u_i \in L^{\infty }(\rho _i)\). In this case obviously we have \(\inf u_i \le \lambda _{u_i} \le \sup u_i\) and in particular

    $$\begin{aligned} - osc(u_i) = \inf u_i - \sup u_i \le u_i(x_i) - \lambda _{u_i} \le \sup u_i - \inf u_i = osc(u_i), \end{aligned}$$

    that is \(\Vert u_i - \lambda _{u_i}\Vert _{\infty } \le osc(u_i)\). This proves already the bound for \(i <N\); for \(i=N\) we have, letting \(\lambda = \sum _{i=1}^N \lambda _{u_i}\)

    $$\begin{aligned} \Vert P_N(u ) \Vert _{\infty } = \Vert u_N - \lambda _{u_N} + \sum _{i=1}^N \lambda _{u_i}\Vert _{\infty } \le \Vert u_N - \lambda _{u_N}\Vert _{\infty } + | \lambda | \le osc(u_N) + | \lambda | \end{aligned}$$
  3. (iii)

    This is obvious from the fact that \(\mathcal {F}_i^{(N,c, \varepsilon )}( \widehat{ u_i - \lambda _i} )= \mathcal {F}_i^{(N,c, \varepsilon )}({{\hat{u}}}_i) + \sum _{j \ne i} \lambda _j\).

\(\square \)

This projection operator allows us to generalize Lemma 2.7:

Lemma 4.4

Let us consider \(u_i \in L^{\mathrm{exp}}_{\varepsilon }(\rho _i)\), for every \(i=1, \ldots , N\). Then there exist \(u_i^* \in L^{\mathrm{exp}}_{\varepsilon }(\rho _i)\) for \(i=1, \ldots , N\) such that

  • \(D^N_{\varepsilon }(u_1, \ldots , u_N) \le D^N_{\varepsilon }(u_1^*,\ldots , u_N^*)\);

  • \( \Vert u_i^* \Vert _{\infty } \le 3\Vert c\Vert _{\infty } \) for every \(i=1, \ldots , N\).

Proof

Let us construct the following sequence of potentials:

$$\begin{aligned} {\left\{ \begin{array}{ll} u_1^{1} = \mathcal {F}_i^{(N,c, \varepsilon )}\bigl (\,\widehat{u_1}\,\bigr ) \\ u_2^{1} = u_2 \\ u_3^{} = u_3 \\ \, \cdots \\ u_N^{1} = u_N \\ \end{array}\right. }, \quad {\left\{ \begin{array}{ll} u^{2}_1 = u_1^{1} \\ u_{2} = \mathcal {F}_i^{(N,c, \varepsilon )}\bigl (\,\widehat{u^{1}_2}\,\bigr ) \\ u^{2}_3 = u_3^{1} \\ \, \cdots \\ u^{2}_N = u_N^{1} \\ \end{array}\right. }, \dots \, , \quad {\left\{ \begin{array}{ll} u_1^{N} = u_1^{N-1} \\ u_2^{N} = u_2^{N-1} \\ u_3^{N} = u_3^{N-1} \\ \, \cdots \\ u^{N}_N = \mathcal {F}_i^{(N,c, \varepsilon )}\bigl (\,\widehat{u^{N-1}_N}\,\bigr ). \\ \end{array}\right. } \end{aligned}$$

Then let us consider \(u^*=P(u^N)\). First of all we notice that, using the multimarginal analogous of Lemma 2.6 we have

$$\begin{aligned} D_{\varepsilon }^N(u_1, \ldots , u_N) \le D_{\varepsilon }^N(u^1_1, \ldots , u^1_N) \le \cdots \le D_{\varepsilon }^N(u^N_1, \ldots , u^N_N)= D_{\varepsilon }^N(u^*_1, \ldots , u^*_N). \end{aligned}$$

Then is clear by construction that for every \(i=1, \ldots , N\) we have \(u^N_i=u^i_i\) and in particular, by Lemma 4.2 (iv) we have \(osc(u^N_i) \le 2\Vert c \Vert _{\infty }\). Moreover, thanks to (4.4) it is easy to see that \( | \sum _i \lambda _{u_i^N}| \le \Vert c\Vert _{\infty }\). Now we can use Lemma 4.3 (ii) to conclude that in fact \(\Vert u^*_i\Vert \le 3 \Vert c \Vert _{\infty }.\) \(\square \)

Similarly to Theorem 2.8, Propositions 2.11 and 2.12 , the next theorem and the following Proposition state the existence of a maximizer and the Entropic-Kantorovich duality to the multi-marginal case, along with the complementarity conditions. Since the proofs follows the same lines of the case \(N=2\), without big changes, we will omit them.

Theorem 4.5

For every \(i\in \lbrace 1,\dots ,N\rbrace \), let \((X_i,d_i)\) be Polish metric spaces, \(\rho _i \in \mathcal {P}(X_i)\) be a probability measures and \(c:X_1\times \dots \times X_N\rightarrow \mathbb {R}\) be a bounded cost function. Then for every \(\varepsilon >0\),

  1. (i)

    The dual function \(D^N_{\varepsilon }\) is well defined on its definition domain and moreover

    $$\begin{aligned}&D^N_{\varepsilon }(\hat{u}^{(N,c,\varepsilon )}_1, \hat{u_1}) \ge D^N_{\varepsilon }(u_1,\dots ,u_N), \qquad \forall \, u_i \in L^{\mathrm{exp}}_{\varepsilon }(\rho _i), \end{aligned}$$
    (4.6)
    $$\begin{aligned}&D^N_{\varepsilon }(\hat{u}^{(N,c,\varepsilon )}_1,\hat{u_1}) = D^N_{\varepsilon }(u_1,\dots ,u_N) \text { if and only if } u_1 = \hat{u}^{(N,c,\varepsilon )}_1. \end{aligned}$$
    (4.7)
  2. (ii)

    The supremum is attained, up to trivial transformations, for a unique N-tuple \((u^0_1,\dots ,u^0_N)\) and in particular we have

    $$\begin{aligned} u^0_i\in L^{\infty }(\rho _i) ,\quad \forall i\in \left\{ 1,\dots ,N \right\} . \end{aligned}$$

    Moreover if we consider \(\gamma ^{0,N} = e^{ (\sum _i u^0_i(x_i) - c )/\varepsilon } \rho ^N\) then \(\gamma ^{0,N}\) is the minimizer of (4.2)

  3. (iii)

    Duality holds:

    $$\begin{aligned} {\text {OT}}^N_{\varepsilon }(\rho _1,\dots ,\rho _N) = \sup \left\{ D^N_{\varepsilon }(u_1,\dots ,u_N) \; : \; u_i \in L^{\mathrm{exp}}_{\varepsilon }(\rho _i), i \in \left\{ 1,\dots , N \right\} \right\} + \varepsilon . \end{aligned}$$

Finally, the result extends the main results of the previous section to the multi-marginal case.

Proposition 4.6

(Equivalence and complementarity condition) Let \(\varepsilon >0\) and for every \(i\in \lbrace 1,\dots ,N\rbrace \), let \((X_i,d_i)\) be Polish metric spaces, \(\rho _i \in \mathcal {P}(X_i)\) be a probability measures and \(c:X_1\times \dots \times X_N\rightarrow \mathbb {R}\) be a bounded cost function. Then given \(u_i^* \in L^{\mathrm{exp}}_{\varepsilon }(\rho _i) \) for every \(i=1, \ldots , N\), the following are equivalent:

  1. 1.

    (Maximizers) \(u_1^*, \ldots , u_N^*\) are maximizing potentials for (4.5);

  2. 2.

    (Maximality condition) \(\mathcal {F}_i^{(N,c, \varepsilon )}(\hat{u_i^*})=u_i^*\) for every \(i=1, \ldots , N\);

  3. 3.

    (Schrödinger system) let \(\gamma ^*=e^{(\sum _i u_i^*(x_i)-c)/\varepsilon } \cdot \rho ^N \), then \(\gamma ^* \in \Pi (\rho _1, \ldots , \rho _N)\);

  4. 4.

    (Duality attainement) \({\text {OT}}^N_{\varepsilon }(\rho _1,\ldots , \rho _N) = D^N_{\varepsilon } (u_1^*, \ldots , u_N^*) +\varepsilon \).

Moreover in those cases \(\gamma ^*\), as defined in 3, is also the (unique) minimizer for the problem (4.2).

The part 3. in proposition 4.6 have been already shown in different settings by J.M Borwein, A.S. Lewis and R.D. Nussbaum ([7, Theorem 4.4], see also [8, section 3]) and G. Carlier & M. Laborde [18]. Our approach, being purely variational, allows us to study the convergence of the Sinkhorn algorithm in the several marginal case in a similar way of done in the previous section.

In fact, \({\text {OT}}^N_{\varepsilon }(\rho _1,\dots ,\rho _N)\) defines an unique element \(\gamma ^{\varepsilon }_{N,opt}\) - the \({\text {KL}}^{N}\)-projection on \(\Pi (\rho _1,\dots ,\rho _N)\)—which has product density \(\Pi ^N_{i=1}a_i\) with respect to the Gibbs measures \(\mathcal {K}\), where \(a_i = e^{u^*_i/\varepsilon }\) as defined in proposition 4.6. Also in this case, an equivalent system to (1.3) can be implicity written: \(\gamma ^{\varepsilon }_{N,opt}\) is a solution of (4.2) if and only if \(\gamma ^{\varepsilon }_{N,opt} = \otimes ^N_{i=1}a^{\varepsilon }(x_i)\mathcal {K}, \text { where } a^{\varepsilon }_i \text { solve } \)

$$\begin{aligned} \displaystyle a^{\varepsilon }_i(x)\int _{Y} \otimes ^N_{j\ne i}a_j(x_j)k(x_1,\dots ,x_N)d\rho _i(x_i) = 1, \quad \forall i = 1,\dots , N. \end{aligned}$$
(4.8)

Therefore, by using the marginal condition \(\gamma ^{\varepsilon }\in \Pi (\rho _1,\dots ,\rho _N)\), the functions \(a_i\) can be implicitly computed

$$\begin{aligned} a_i(x_i) = \dfrac{1}{\int _{\Pi ^N_{j\ne i}X_j}\otimes ^N_{j\ne i}a_j(x_j)k(x_1,\dots ,x_N)d(\otimes ^N_{j\ne i}\rho _j)}, \quad \forall i \in \left\{ 1,\dots ,N\right\} . \end{aligned}$$

Convergence of the IPFP/Sinkhorn Algorithm for Several Marginals

The goal of this subsection is to prove the convergence of the IPFP/Sinkhorn algorithm in the multi-marginal setting. Analogously to (1.4), define recursively the sequences \((a^n_j)_{n\in \mathbb {N}}, j\in \lbrace 1,\dots ,N\rbrace \) by

$$\begin{aligned} \begin{array}{lcl} \displaystyle a_1^0(x_1) &{} = &{} 1, \\ \displaystyle a^0_j(x_j) &{} = &{} 1, \quad j \in \lbrace 2,\dots ,N \rbrace , \\ \displaystyle a^n_j(x_j) &{} = &{} \dfrac{1}{\int \otimes ^N_{i<j}a_i^n(x_i)\otimes ^N_{i> j}a_i^{n-1}(x_i)k^N(x_1,\dots ,x_N)d(\otimes ^N_{i\ne j}\rho _i)}, \, \forall n\in \mathbb {N}. \end{array} \end{aligned}$$
(4.9)

Also here, by writing \(a^n_j = \exp (u^n_j/\varepsilon )\), for all \(j\in \lbrace 1,\dots ,N\rbrace \), one can rewrite the IPFP sequences (4.9) in terms of Entropic \((c,\varepsilon )\)-transforms,

$$\begin{aligned} u^n_j(x_j)&= - \varepsilon \log \left( \int _{\Pi _{i \ne j}X_i} k^N(x_1,\dots ,x_N)\otimes _{i\ne j}e^{u^n_i(x_i)/\varepsilon }d\left( \otimes ^N_{i\ne j}\rho _i\right) \right) \\&= (\hat{u^n_j})^{(N,c,\varepsilon )}(x_j). \end{aligned}$$

Then, the proof of convergence of the IPFP in the multi-marginal case, follows a method similar to the one used in Theorem 3.1.

Theorem 4.7

Let \((X_1,d_1), \dots , (X_N,d_N)\) be Polish spaces, \(\rho _1,\dots ,\rho _N\) be probability measures in \(X_1,\dots ,X_N\), \(c:X_1\times \dots \times X_N\rightarrow [0,+\infty ]\) be a bounded cost, p be an integer \(1\le p <\infty \). If \((a^n_j)_{n\in \mathbb {N}}, j\in \lbrace 1,\dots ,N\rbrace \) are the IPFP sequence defined in (4.9), then there exist a sequence \(\lambda ^n \in \mathbb {R}^N\), with \(\lambda ^n_i >0\) and \(\prod _{i=1}^N \lambda ^n_i = 1\) such that

$$\begin{aligned} \forall j\in \lbrace 1,\dots ,N\rbrace , \quad a^n_j/\lambda ^n_j\rightarrow a_j \text { in } L^p(\rho _j), \end{aligned}$$

where \((a_j)_{j=1}^N\) solve the Schrödinger system. In particular, the sequence \(\gamma ^n = \Pi ^N_{i=1}a^n_i\mathcal {K}\) converges in \(L^p(\rho _1\otimes \dots \otimes \rho _N)\) to the optimizer \(\gamma ^{\varepsilon }_{opt}\) in (4.2).

Proof

Let \(i \in \lbrace 1,\dots , N\rbrace \) and consider Let \((a_i^n)_{n\in \mathbb {N}}\) the IPFP sequence defined in (4.9). For every i, we define \(u_i^0 := \varepsilon \ln (a_i^0)\) and then iteratively define the following potentials for every \(p \in \mathbb {N}\)

$$\begin{aligned} {\left\{ \begin{array}{ll} u_1^{pN+1} = (\widehat{u_1^{pN}})^{(c,\varepsilon )} \\ u_2^{pN+1} = u_2^{pN} \\ u_3^{pN+1} = u_3^{pN} \\ \, \cdots \\ u_N^{pN+1} = u_N^{pN} \\ \end{array}\right. }, \quad {\left\{ \begin{array}{ll} u^{pN+2}_1 = u_1^{pN+1} \\ u_{pN+2} = (\widehat{u^{pN+1}_2})^{(c,\varepsilon )} \\ u^{pN+2}_3 = u_3^{pN+1} \\ \, \cdots \\ u^{pN+2}_N = u_N^{pN+1} \\ \end{array}\right. }, \dots \, , \quad {\left\{ \begin{array}{ll} u_1^{pN+N} = u_1^{pN+N-1} \\ u_2^{pN+N} = u_2^{pN+N-1} \\ u_3^{pN+N} = u_3^{pN+N-1} \\ \, \cdots \\ u^{pN+N}_N = (\widehat{u^{pN+N-1}_N})^{(c,\varepsilon )} \\ \end{array}\right. }. \end{aligned}$$

Notice that \(a_i^n = e^{u^{nN}_i / \varepsilon }\) and moreover \(osc ( u^{pN+i}_i ) \le \Vert c \Vert _{\infty }\) by Lemma 4.2 (iv), and in particular \(osc(u^{n}_i) \le \Vert c \Vert _{\infty }\) as long as \(n \ge N\). Moreover, thanks to (4.4) we also have \(|\sum _i \lambda _{u_i^n}| \le \Vert c \Vert _{\infty }\). In particular, defining \(v^n= P(u^n)\), we have \(\Vert v_i^n\Vert _{\infty } \le 3\Vert c\Vert _{\infty }\) thanks to Lemma 4.3 (ii); using (4.6) and Lemma 4.3 we also have

$$\begin{aligned} D_{\varepsilon }^N(v^{n}_1,\dots , v^{n}_N)\le D_{\varepsilon }^N(v^{n+1}_1,\dots , v^{n+1}_N) \le \dots \le {\text {OT}}_{\varepsilon }(\rho _1,\rho _2,\dots , \rho _N). \end{aligned}$$

By the boundedness of \(\Vert v_i^n\Vert _{\infty }\), by the compactness in Lemma 4.2 (iv), there exists a subsequence \(k_n\) such that \(v_i^{k_n}\) converges in \(L^{p}\) to some \(v_i\) for every \(i=1, \ldots , N\); by pigeon-hole principle we have that at least a class of residue modulo N is taken infinitely by the sequence \(k_n\) and we will suppose that without loss of generality this residue class is 0. Up to restricting to the infinite subsequence such that \(k_n \equiv 0 \pmod {N}\), we can assume that \(v_N^{k_{n}} = (\widehat{v_N^{k_n}})^{(N,c,\varepsilon )}\) and \(v_1^{k_{n}+1} = (\widehat{v_1^{k_n}})^{(N,c,\varepsilon )}\)

In particular, by the continuity of the \((N,c, \varepsilon )\)-transform we have

$$\begin{aligned}&D_{\varepsilon }^N(\hat{v_1}^{(N,c,\varepsilon )}, \hat{v_1}) - D_{\varepsilon }^N(v_1,v_2,\dots ,v_N)\\&\quad = \lim _{n\rightarrow \infty } D_{\varepsilon }(v_1^{k_n+1},\dots , v_N^{k_n+1}) - D_{\varepsilon }(v_1^{k_n},\dots , v_N^{k_n}) = 0. \end{aligned}$$

In particular, we have \(v_1 = \hat{v_1}^{(N,c,\varepsilon )}\) by (4.7) and in particular \(u_i^{k_n+1} \rightarrow u_i\) for every \(i=1, \ldots , N\). Now, doing a similar computation, for every \(i=2,\dots ,N\), we can inductively prove that, for every i,

$$\begin{aligned}&D_{\varepsilon }( \hat{v_i}^{(N,c,\varepsilon )}, \hat{v_i}) - D_{\varepsilon }(v_1,\dots , v_i, \dots , v_N) \\&\quad = \lim _{n \rightarrow \infty } D_{\varepsilon }(v_1^{k_n+i},\dots , v_N^{k_n+i}) - D_{\varepsilon }(v_1^{k_n+i-1},\dots , v_N^{k_n+i-1}) = 0. \end{aligned}$$

Hence, \(v_i = \hat{v_i}^{(N,c,\varepsilon )}, \, \forall i \in \lbrace 1,\dots , N\rbrace \). The result follows by noticing that \((e^{v_1/\varepsilon },\dots , e^{v_N/\varepsilon })\) solves the Schrödinger system, by Proposition 4.6. \(\square \)

Remark on the multi-marginal problem \(\mathcal {S}_{\varepsilon }(\rho _1,\dots ,\rho _N;\mathfrak {m}_1,\dots , \mathfrak {m}_N)\): More generally, we could also consider the multi-marginal Schrödinger problem with references measures \(\mathfrak {m}_i \in \mathcal {P}(X_i), i=1,\dots , N\). For simplicity, we denote \({\overline{\rho }} = (\rho _1,\dots ,\rho _N)\) and \({\overline{\mathfrak {m}}} = (\mathfrak {m}_1,\dots ,\mathfrak {m}_N)\). Then the functional \(\mathcal {S}_{\varepsilon }({\overline{\rho }};{\overline{\mathfrak {m}}})\) is defined by

$$\begin{aligned} \mathcal {S}_{\varepsilon }({\overline{\rho }};{\overline{\mathfrak {m}}}) = \min _{\gamma \in \Pi _N(\rho _1,\dots ,\rho _N)} \varepsilon {\text {KL}}^{N}(\gamma | \mathfrak {m}_1\otimes \cdots \otimes \mathfrak {m}_N). \end{aligned}$$

Analogously to the 2 marginal case, the duality results, existence and regularity of entropic-potentials as well as the convergence of the Sinkhorn algorithm can be extended to that case. We omit the details here since the proof follows by similar arguments.

Notes

  1. 1.

    The iterations above system also appeared in [4, 29, 48, 50, 70] with different names (eg. RAS, IPFP).

  2. 2.

    in order to prove that there is a unique limit point at this stage, it is sufficient to take \(\ell _n\) that minimizes \(\Vert u_n - \ell _n - u\Vert _2\).

References

  1. 1.

    Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43, 904–924 (2011)

    MathSciNet  MATH  Google Scholar 

  2. 2.

    Altschuler, J., Weed, J., Rigollet, P.: Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In: Advances in Neural Information Processing Systems, pp. 1964–1974 (2017)

  3. 3.

    Arnaudon, M., Cruzeiro, A. B., Léonard, C., Zambrini, J.-C.: An entropic interpolation problem for incompressible viscid fluids. arXiv preprint arXiv:1704.02126 (2017)

  4. 4.

    Bacharach, M.: Estimating nonnegative matrices from marginal data. Int. Econ. Rev. 6, 294–310 (1965)

    MATH  Google Scholar 

  5. 5.

    Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., Peyré, G.: Iterative Bregman projections for regularized transportation problems. SIAM J. Sci. Comput. 37, A1111–A1138 (2015)

    MathSciNet  MATH  Google Scholar 

  6. 6.

    Benamou, J.-D., Carlier, G., Nenna, L.: Generalized incompressible flows, multi-marginal transport and Sinkhorn algorithm. Numerische Mathematik 142, 1–22 (2017)

    MathSciNet  MATH  Google Scholar 

  7. 7.

    Borwein, J.M., Lewis, A.S.: Decomposition of multivariate functions. Can. J. Math. 44, 463–482 (1992)

    MathSciNet  MATH  Google Scholar 

  8. 8.

    Borwein, J.M., Lewis, A.S., Nussbaum, R.D.: Entropy minimization, DAD problems, and doubly stochastic kernels. J. Funct. Anal. 123, 264–307 (1994)

    MathSciNet  MATH  Google Scholar 

  9. 9.

    Brenier, Y.: The least action principle and the related concept of generalized flows for incompressible perfect fluids. J. Am. Math. Soc. 2, 225–255 (1989)

    MathSciNet  MATH  Google Scholar 

  10. 10.

    Brenier, Y.: The dual least action problem for an ideal, incompressible fluid. Arch. Ration. Mech. Anal. 122, 323–351 (1993)

    MathSciNet  MATH  Google Scholar 

  11. 11.

    Brenier, Y.: Minimal geodesics on groups of volume-preserving maps and generalized solutions of the Euler equations. Commun. Pure Appl. Math. A J. Issued Courant Inst. Math. Sci. 52, 411–452 (1999)

    MathSciNet  MATH  Google Scholar 

  12. 12.

    Caffarelli, L.A.: Monotonicity properties of optimal transportation and the fkg and related inequalities. Commun. Math. Phys. 214, 547–563 (2000)

    MathSciNet  MATH  Google Scholar 

  13. 13.

    Carlen, E.: Stochastic mechanics: a look back and a look ahead. Diffus. Quantum Theory Radic. Eleme. Math. 47, 117–139 (2014)

    MathSciNet  Google Scholar 

  14. 14.

    Carlen, E.A.: Conservative diffusions. Commun. Math. Phys. 94, 293–315 (1984)

    MathSciNet  MATH  Google Scholar 

  15. 15.

    Carlen, E. A.: Existence and sample path properties of the diffusions in Nelson’s stochastic mechanics. In: Stochastic Processes: Mathematics and Physics, Springer, Berlin, pp. 25–51 (1986)

  16. 16.

    Carlier, G., Duval, V., Peyré, G., Schmitzer, B.: Convergence of entropic schemes for optimal transport and gradient flows. SIAM J. Math. Anal. 49, 1385–1418 (2017)

    MathSciNet  MATH  Google Scholar 

  17. 17.

    Carlier, G., Ekeland, I.: Matching for teams. Econom. Theory 42, 397–418 (2010)

    MathSciNet  MATH  Google Scholar 

  18. 18.

    Carlier, G., Laborde, M.: A differential approach to the multi-marginal Schrödinger system. SIAM J. Math. Anal. 52(1), 709–717 (2020)

    MathSciNet  MATH  Google Scholar 

  19. 19.

    Cattiaux, P., Léonard, C.: Minimization of the Kullback information of diffusion processes. Ann. l’IHP Probabilités et statistiques 30, 83–132 (1994)

    MathSciNet  MATH  Google Scholar 

  20. 20.

    Chen, Y., Conforti, G., Georgiou, T. T., Ripani, L.: Multi-marginal Schrödinger bridges. In: International Conference on Geometric Science of Information, pp. 725–732. Springer (2019)

  21. 21.

    Chen, Y., Georgiou, T., Pavon, M.: Entropic and displacement interpolation: a computational approach using the Hilbert metric. SIAM J. Appl. Math. 76, 2375–2396 (2016)

    MathSciNet  MATH  Google Scholar 

  22. 22.

    Chiappori, P.-A., McCann, R.J., Nesheim, L.P.: Hedonic price equilibria, stable matching, and optimal transport: equivalence, topology, and uniqueness. Econ. Theor. 42, 317–354 (2010)

    MathSciNet  MATH  Google Scholar 

  23. 23.

    Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: Scaling algorithms for unbalanced optimal transport problems. Math. Comput. 87, 2563–2609 (2018)

    MathSciNet  MATH  Google Scholar 

  24. 24.

    Cruzeiro, A.B., Zambrini, J.-C.: Malliavin calculus and Euclidean quantum mechanics. I. Functional calculus. J. Funct. Anal. 96, 62–95 (1991)

    MathSciNet  MATH  Google Scholar 

  25. 25.

    Csiszár, I.: I-divergence geometry of probability distributions and minimization problems. In: The Annals of Probability, pp. 146–158 (1975)

  26. 26.

    Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)

  27. 27.

    Cuturi, M., Doucet, A.: Fast computation of Wasserstein barycenters. In: International Conference on Machine Learning, pp. 685–693 (2014)

  28. 28.

    Cuturi, M., Peyré, G.: Computational optimal transport, foundations and trends®. Mach. Learn. 11, 355–607 (2019)

    Google Scholar 

  29. 29.

    Deming, W.E., Stephan, F.F.: On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat. 11, 427–444 (1940)

    MathSciNet  MATH  Google Scholar 

  30. 30.

    Di Marino, S., Gerolin, A., Giesbertz, K., Nenna, L., Seidl, M., Gori-Giorgi, P.: The strictly-correlated electron functional for spherically symmetric systems revisited (in preparation) (2016)

  31. 31.

    Dudley, R.M.: The speed of mean Glivenko–Cantelli convergence. Ann. Math. Stat. 40, 40–50 (1969)

    MathSciNet  MATH  Google Scholar 

  32. 32.

    Fathi, M., Gozlan, N., Prodhomme, M.: A proof of the Caffarelli contraction theorem via entropic regularization. arXiv preprint arXiv:1904.06053 (2019)

  33. 33.

    Fényes, I.: A deduction of Schrödinger equation. Acta Bolyaina 1 (1946)

  34. 34.

    Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-I., Trouvé, A., Peyré, G.: Interpolating between optimal transport and MMD using Sinkhorn divergences. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690 (2019)

  35. 35.

    Flamary, R., Courty, N.: POT Python Optimal Transport library (2017)

  36. 36.

    Franklin, J., Lorenz, J.: On the scaling of multidimensional matrices. Linear Algebra Appl. 114, 717–735 (1989)

    MathSciNet  MATH  Google Scholar 

  37. 37.

    Galichon, A., Salanié, B.: Matching with trade-offs: revealed preferences over competing characteristics. CEPR Discussion Paper No. DP7858 (2010)

  38. 38.

    Gangbo, W., Swiech, A.: Optimal maps for the multidimensional Monge–Kantorovich problem. Commun. Pure Appl. Math. 51, 23–45 (1998)

    MathSciNet  MATH  Google Scholar 

  39. 39.

    Genevay, A., Chizat, L., Bach, F., Cuturi, M., Peyré, G.: Sample complexity of Sinkhorn divergences. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1574–1583 (2019)

  40. 40.

    Genevay, A., Peyré, G., Cuturi, M.: Learning Generative models with Sinkhorn divergences. In: International Conference on Artificial Intelligence and Statistics, pp. 1608–1617 (2018)

  41. 41.

    Gentil, I., Léonard, C., Ripani, L., Tamanini, L.: An entropic interpolation proof of the HWI inequality. Stochas. Process. Appl. 130(2), 907–923 (2018)

    MathSciNet  MATH  Google Scholar 

  42. 42.

    Gerolin, A., Grossi, J., Gori-Giorgi, P.: Kinetic correlation functionals from the entropic regularization of the strictly correlated electrons problem. J. Chem. Theor. Comput. 16(1), 488–498 (2019)

    Google Scholar 

  43. 43.

    Gerolin, A., Kausamo, A., Rajala, T.: Multi-marginal entropy-transport with repulsive cost. Calc. Var. 59, 90 (2020)

  44. 44.

    Gigli, N., Tamanini, L.: Benamou–Brenier and duality formulas for the entropic cost on \( {R}{C}{D}^{*}({K}, {N}) \) spaces. In: Probability, Theory Related Fields (2018)

  45. 45.

    Gigli, N., Tamanini, L.: Second order differentiation formula on \({RCD}^*({K},{N})\) spaces. J. Eur. Math. Soc. (JEMS) (2018)

  46. 46.

    Gori-Giorgi, P., Seidl, M., Vignale, G.: Density-functional theory for strongly interacting electrons. Phys. Rev. Lett. 103, 166402 (2009)

    Google Scholar 

  47. 47.

    Gozlan, N., Léonard, C.: A large deviation approach to some transportation cost inequalities. Probab. Theory Relat. Fields 139, 235–283 (2007)

    MathSciNet  MATH  Google Scholar 

  48. 48.

    Idel, M.: A review of matrix scaling and sinkhorn’s normal form for matrices and positive maps. arXiv preprint arXiv:1609.06349 (2016)

  49. 49.

    Karlsson, J., Ringh, A.: Generalized Sinkhorn iterations for regularizing inverse problems using optimal mass transport. SIAM J. Imaging Sci. 10, 1935–1962 (2017)

    MathSciNet  MATH  Google Scholar 

  50. 50.

    Kruithof, J.: Telefoonverkeersrekening. De Ingenieur 52, 15–25 (1937)

    Google Scholar 

  51. 51.

    Léger, F., Li, W.: Hopf-cole transformation via generalized Schrödinger bridge problem. arXiv preprint arXiv:1901.09051 (2019)

  52. 52.

    Léonard, C.: From the Schrödinger problem to the Monge–Kantorovich problem. J. Funct. Anal. 262, 1879–1920 (2012)

    MathSciNet  MATH  Google Scholar 

  53. 53.

    Léonard, C.: A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete Contin. Dyn. Syst. A 34, 1533–1574 (2014)

    MathSciNet  MATH  Google Scholar 

  54. 54.

    Li, W., Lu, J., Wang, L.: Fisher information regularization schemes for Wasserstein gradient flows. J. Comput. Phys. 416, 109449 (2020)

    MathSciNet  MATH  Google Scholar 

  55. 55.

    Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. J. Sci. Comput. 75, 1581–1595 (2018)

    MathSciNet  MATH  Google Scholar 

  56. 56.

    Luise, G., Rudi, A., Pontil, M., Ciliberto, C.: Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In: Advances in Neural Information Processing Systems, pp. 5859–5870 (2018)

  57. 57.

    Luise, G., Salzo, S., Pontil, M., Ciliberto, C.: Sinkhorn barycenters with free support via Frank-Wolfe algorithm. In: Advances in Neural Information Processing Systems, pp. 9322–9333 (2019)

  58. 58.

    Mikami, T.: Monge’s problem with a quadratic cost by the zero-noise limit of h-path processes. Probab. Theory Relat. Fields 129, 245–260 (2004)

    MathSciNet  MATH  Google Scholar 

  59. 59.

    Nelson, E.: Derivation of the Schrödinger equation from Newtonian mechanics. Phys. Rev. 150, 1079 (1966)

    Google Scholar 

  60. 60.

    Nelson, E.: Dynamical Theories of Brownian Motion, vol. 3. Princeton University Press, Princeton (1967)

    MATH  Google Scholar 

  61. 61.

    Nelson, E.: Quantum Fluctuations. Princeton University Press, Princeton (1985)

    MATH  Google Scholar 

  62. 62.

    Pavon, M., Tabak, E. G., Trigila, G.: The data-driven Schroedinger bridge. arXiv preprint arXiv:1806.01364 (2018)

  63. 63.

    Rabin, J., Peyré, G., Delon, J., Bernot, M.: Wasserstein Barycenter and its application to texture mixing. In: International Conference on Scale Space and Variational Methods in Computer Vision, Springer, pp. 435–446 (2011)

  64. 64.

    Raghavan, T.: On pairs of multidimensional matrices. Linear Algebra Appl. 62, 263–268 (1984)

    MathSciNet  MATH  Google Scholar 

  65. 65.

    Ruschendorf, L.: Convergence of the iterative proportional fitting procedure. Ann. Stat. 23, 1160–1174 (1995)

    MathSciNet  MATH  Google Scholar 

  66. 66.

    Schrödinger, E.: Über die umkehrung der naturgesetze. Verlag Akademie der wissenschaften in kommission bei Walter de Gruyter u, Company (1931)

  67. 67.

    Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 35, 876–879 (1964)

    MathSciNet  MATH  Google Scholar 

  68. 68.

    Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., Guibas, L.: Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Trans. Gr. (TOG) 34, 66 (2015)

    MATH  Google Scholar 

  69. 69.

    Weed, J., Bach, F.: Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25, 2620–2648 (2019)

    MathSciNet  MATH  Google Scholar 

  70. 70.

    Yule, G.U.: On the methods of measuring association between two attributes. J. R. Stat. Soc. 75, 579–652 (1912)

    Google Scholar 

  71. 71.

    Zambrini, J.: Stochastic mechanics according to E. Schrödinger Phys. Rev. A 33, 1532 (1986)

    MathSciNet  Google Scholar 

  72. 72.

    Zambrini, J.-C.: Variational processes and stochastic versions of mechanics. J. Math. Phys. 27, 2307–2330 (1986)

    MathSciNet  MATH  Google Scholar 

  73. 73.

    Zambrini, J.-C.: The research program of stochastic deformation (with a view toward geometric mechanics). In: Stochastic Analysis: A Series of Lectures, Springer, pp. 359–393 (2015)

Download references

Acknowledgements

S. D. M. is member of Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM). A.G. acknowledges the European Research Council under H2020/MSCA-IF“OTmeetsDFT” [Grant ID: 795942]. This work started when the second author visited the first author, while he was working at the Scuola Normale Superiore di Pisa (INdAM unit). The authors wants to thank G. Carlier, C. Léonard and L. Tamanini for useful discussions.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Augusto Gerolin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix

Proposition 5.1

Let \((X,d,\mu )\) be a measurable metric space with \(\mu (X)=1\). Let us assume that \(\mathcal {F}\subset L^p(X,\mu )\) is a family of functions such that:

  1. (a)

    there exists \(M>0\) such that \( \Vert f\Vert _{\infty } \le M\) for every \(f \in \mathcal {F}\);

  2. (b)

    for every \(\sigma \) there exists a set \(N^{\sigma }\), a modulus of continuity \(\omega _{\sigma }\) and a number \(\beta _{\sigma }\ge 0\) such that

    $$\begin{aligned} |f(x)-f(x')| \le \omega _{\sigma }(d(x,x')) + \beta _\sigma \qquad \forall x,x' \not \in N^{\sigma } \end{aligned}$$

    where \(N^{\sigma }\) and \(\beta _{\sigma }\) are such that \(\mu (N^{\sigma })+\beta _\sigma \rightarrow 0\) as \(\sigma \rightarrow 0\).

Then the family \(\mathcal {F}\) is precompact in \(L^p(X, \mu )\).

Proof

Let us fix \(\varepsilon >0\) and let us consider a sequence \(\sigma _n \rightarrow 0\) such that \(\sum _{n =1}^{\infty } \mu (N^{\sigma _n}) \le \varepsilon \); then define \(\omega _n= \omega _{\sigma _n}\) and \({\mathcal {N}}^{\varepsilon } := \bigcup _n N^{\sigma _n}\); in particular we have \(\mu ({\mathcal {N}}^{\varepsilon }) \le \varepsilon \) and

$$\begin{aligned} |f(x)-f(x')| \le \omega _n(d(x,x')) + \beta _{\sigma _n} \qquad \forall x, x' \not \in {\mathcal {N}}^{\varepsilon }, \forall n \in \mathbb {N}.\end{aligned}$$
(5.1)

Let us define \(\omega ^{\varepsilon }(t) = \inf _n \{ \omega _n (t) + \beta _{\sigma _n}\}\): by (5.1) we have that f is \(\omega ^{\varepsilon }\)-continuous outside \({\mathcal {N}}^{\varepsilon }\).

We can verify that \(\omega ^{\varepsilon }\) is a non degenerate modulus of continuity: it is obvious that is it nondecreasing since it is an infimum of noncreasing functions. Then for every \({\tilde{\varepsilon }}>0\) we can choose n big enough such that \(\beta _{\sigma _n}< {\tilde{\varepsilon }}/2\) and then choose t small enough such that \(\omega _n (t) < {\tilde{\varepsilon }}/2\); in this way we have \(\omega ^{\varepsilon }(t) \le \omega _n (t) + \beta _{\sigma _n} < {\tilde{\varepsilon }}\). In particular \(\omega ^{\varepsilon }(t) \rightarrow 0\) as \(t \rightarrow 0\).

Now we conclude by a diagonal argument: let us consider a sequence \((f^0_n)_{n \in \mathbb {N}} \subseteq \mathcal {F}\) and a sequence \(\varepsilon _k\rightarrow 0\). We want to find a subsequence that is converging strongly in \(L^p\). We iteratively extract a subsequence \((f^k_n)\) of \((f^{k-1}_n)\) that is converging uniformly outside \({\mathcal {N}}^{\varepsilon _k}\) (thanks to Ascoli-Arzelà) to some function \(f^k\), which is defined only outside \({\mathcal {N}}^{\varepsilon _k}\). Then let us consider

$$\begin{aligned} f(x)= {\left\{ \begin{array}{ll} f^k(x) \qquad &{}\text { if }x \not \in {\mathcal {N}}^{\varepsilon _k} \\ 0 &{} \text { otherwise.} \end{array}\right. } \end{aligned}$$

First of all f is well defined since if \(x \not \in {\mathcal {N}}^{\varepsilon _k}\) and \(x \not \in {\mathcal {N}}^{\varepsilon _j}\) with \(j >k\) then we have that \(f^k_n (x) \rightarrow f^k(x)\) but since \(f^j_n\) is a subsequence of \(f^k_n\) we have also \(f^j_n(x) \rightarrow f^k(x)\); however by definition \(f^j_n(x) \rightarrow f^j(x)\) and so \(f^j(x)=f^k(x)\). Moreover it is clear that \(\Vert f\Vert _{\infty } \le M\) since this is true for every \(f^0_n\) thanks to property (a). Now we consider the sequence \(g_n = f^n_n\) which is a subsequence of \(f^0_n\). Let us fix \(\varepsilon >0\) and choose k such that \(\varepsilon _k < \varepsilon ^p\); then let \(n_0>k\) such that \(|f^k_n - f| \le \varepsilon \) on \(X \setminus {\mathcal {N}}^{\varepsilon _k}\) for every \(n \ge n_0\). Now we have \(g_{n}=f^k_n\) for some \(n\ge n_0\) and in particular

$$\begin{aligned} \int _X |g_{n_0}(x)-f(x)|^p \,d \mu&= \int _{X} | f^k_n(x) -f(x)|^p \, d \mu \\&= \int _{{\mathcal {N}}^{\varepsilon _k}} | f^k_n(x)-f(x)|^p \, d \mu + \int _{X \setminus {\mathcal {N}}^{\varepsilon _k}} | f^k_n(x)-f(x)|^p \, d \mu \\&\le \int _{{\mathcal {N}}^{\varepsilon _k}} (2M)^p \, d \mu + \int _{X \setminus {\mathcal {N}}^{\varepsilon _k}} \varepsilon ^p \, d \mu \\&\le \mu ({\mathcal {N}}^{\varepsilon _k}) (2M)^p + \varepsilon ^p ) \\&\le \varepsilon _k (2M)^p + \varepsilon ^p \mu (X) \le \varepsilon ^p ( 2^pM^p+ 1) \end{aligned}$$

In particular we get \(g_n \rightarrow f \) in \(L^p\) and so we’re done. \(\square \)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Marino, S.D., Gerolin, A. An Optimal Transport Approach for the Schrödinger Bridge Problem and Convergence of Sinkhorn Algorithm. J Sci Comput 85, 27 (2020). https://doi.org/10.1007/s10915-020-01325-7

Download citation

Keywords

  • Schrödinger problem
  • Entropic regularization of optimal transport
  • Kantorovich duality
  • Sinkhorn algorithm
  • Iterative proportional fitting procedure