Introduction

Optimal transport (OT) [82] studies the geometry of probability measures through the lifting of a cost function between samples. This is carried out by devising a coupling between two probability measures via a transport plan, so that one measure is transported to another with minimal total cost. The resulting geometry offers a favorable way of comparing probability measures with one another, which has led to considerable success in machine learning, especially in generative modelling [6, 24, 28, 55], where one aims at training a model distribution to sample from a given data distribution, and in computer vision, where OT provides intuitive metrics between images [71]. Notably, OT can be used to derive not only divergences but also metrics between probability distributions, referred to as the p-Wasserstein metrics.

To ease the computational aspects of OT, entropic relaxation was introduced, which transforms the constrained convex problem of transportation into an unconstrained strictly convex problem [20]. This is carried out by considering the sum of the total cost and the Kullback–Leibler (KL) divergence between the transport plan and the independent joint distribution, scaled by some regularization magnitude. In addition to computational aspects, the entropic regularization also improves statistical properties [76], specifically, the complexity of estimating the OT quantity between measures through sampling [34, 59, 83]. Theoretical properties of the entropic regularization have been studied in e.g. metric geometry, machine learning and statistics [30, 35, 36, 40, 56, 69, 70]. It has also been applied in a variety of fields, including computer vision, density functional theory in chemistry, and inverse problems (e.g. [36, 38, 52, 67]).

The resulting problem is closely related to the Schrödinger problem [75], which considers the most likely flow of a cloud of gas from an initial position to an observed position after a certain amount of time under a prior assumption on the evolution of the position, given by e.g. a Brownian motion. The Schrödinger problem has found applications in fields such as mathematical physics, economics, optimization and probability [12, 19, 22, 31, 32, 72, 84]. Connections to OT have been considered in e.g. [20, 33, 50, 73, 74].

OT is not the only instance of a geometric framework for probability measures. Other popular choices include information geometric divergences [3, 9] and integral probability metrics [64]. In contrast to these methods, OT and entropic OT have the advantage of metrizing the weak\(^*\)-convergence of probability measures, which results in non-singular behavior when comparing measures of disjoint supports. On top of this, being able to choose the lifted cost function is important in applications, as the cost function can be used to incorporate modelling choices, determining which differences in samples are deemed most important. For example, the standard Euclidean metric is a poor choice for comparing images.

Gaussian distributions provide a meaningful testing ground for such frameworks since, in many cases, they result in closed-form expressions. In addition, the study of Gaussians under the OT framework results in useful divergences. In particular, divergences between centered Gaussians result in divergences between their corresponding covariance matrices. Both instances enjoy many applications in a plethora of fields, such as medical imaging [27], computer vision [79, 80, 81], brain computer interfaces [18], natural language processing [65], and assessing the quality of generative models [43]. Notably, the 2-Wasserstein metric between Gaussians is known as the Bures metric in quantum physics, where it is used to compare quantum states. Other popular divergences for Gaussians include the affine-invariant Riemannian metric [68], corresponding to the Fisher–Rao distance between centered Gaussians, the Alpha Log-Determinant divergences [14], corresponding to Rényi divergences between centered Gaussians, and the log-Euclidean metric [8]. A survey of some of the most common divergences and their resulting geometry on Gaussians can be found in [29]. More recently, applications have driven research into determining optimal divergences for the task at hand, which has raised interest in studying interpolations between different divergences [4, 17, 78]. Generalizations of these divergences to the infinite-dimensional setting of Gaussian processes and covariance operators have also been considered [49, 54, 57, 60, 61].

The Sinkhorn divergence has been proposed in OT, applying the entropic regularization to define a parametric family of divergences, interpolating from the OT quantity to a maximum mean discrepancy (MMD), whose kernel is determined by the cost. In the present work, we provide a closed-form solution to the entropy-regularized 2-Wasserstein distance between multivariate Gaussians, which can then be applied in the computation of the corresponding Sinkhorn divergence between Gaussians. In addition, we study the task of interpolating between two Gaussians under the entropy-regularized 2-Wasserstein distance, and confirm known limiting properties of the divergences with respect to the regularization strength. Finally, we provide fixed-point expressions for the barycenter of a population of Gaussians restricted to the Gaussian manifold, which can be employed in a fixed-point iteration for computing the barycenter. The one-dimensional setting has been studied in [4, 37]. The Schrödinger bridge between multivariate Gaussians has been considered in [15], and the limiting case of bringing the noise of the driving Brownian motion to 0, which results in the 2-Wasserstein case, in [16].

The paper is organized as follows: in Sect. 2, we briefly introduce the necessary background to develop the entropic OT theory of Gaussians, including the formulation of OT, entropic OT, and the corresponding dual and dynamical formulations. In Sect. 3, we compute explicit solutions to the entropy-relaxed 2-Wasserstein distance between Gaussians, including the dynamical formulation that allows for interpolation. As a consequence, we derive a closed-form solution for the corresponding Sinkhorn divergence. In Sect. 4, we study the barycenters of populations of Gaussians, restricted to the Gaussian manifold. We derive fixed-point expressions for the entropic 2-Wasserstein distance and the 2-Sinkhorn divergence. Finally, in Sect. 5, we illustrate the resulting interpolative and barycentric schemes. In particular, we consider varying the regularization magnitude, visualizing the interpolation between the OT and MMD problems in the Sinkhorn case [30, 36, 69].

Related work

Several papers have independently formulated the closed-form solution of entropy-regularized optimal transport between Gaussian measures [23, 45] in any dimension, including the case of unbalanced transport [45]. These results have been generalized to \(\varphi \)-exponential distributions [48], Gaussian measures on infinite-dimensional Hilbert spaces, including in particular reproducing kernel Hilbert spaces, and Gaussian processes [62, 63]. Both the two-marginal and the multi-marginal solutions in the one-dimensional case first appeared in [38].

Background

In this section, we start by recalling the essential background for optimal transport (OT) and its entropy-relaxed version. More in-depth exposition for OT can be found in [82], and for computational aspects and entropic OT in [22].

Optimal transport

Let \(({\mathcal {X}},d)\) be a metric space equipped with a lower semi-continuous cost function \(c:{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}_{\ge 0}\). Then, the optimal transport problem between two probability measures \(\mu , \nu \in {\mathcal {P}}({\mathcal {X}})\) is given by

$$\begin{aligned} \mathrm {OT}(\mu , \nu ) = \min _{\gamma \in \mathrm {ADM}(\mu ,\nu )}{\mathbb {E}}_\gamma [c], \end{aligned}$$
(1)

where \(\mathrm {ADM}(\mu ,\nu )\) is the set of joint probabilities with marginals \(\mu \) and \(\nu \), and \({\mathbb {E}}_\mu [f]\) denotes the expected value of f under \(\mu \)

$$\begin{aligned} {\mathbb {E}}_\mu [f] = \int _{\mathcal {X}}f(x) {\mathrm{d}}\mu (x). \end{aligned}$$
(2)

Additionally, by \({\mathbb {E}}[\mu ]\) we denote the expectation of \(\mu \). A minimizer of (1) is denoted by \(\gamma _{\mathrm{opt}}\) and called a transport plan.

The OT problem admits the following Kantorovich (dual) formulation

$$\begin{aligned} \mathrm {OT}(\mu ,\nu ) = \max \limits _{\varphi , \psi \in \mathrm {ADM}(c)} \left\{ {\mathbb {E}}_\mu [\varphi ] + {\mathbb {E}}_\nu [\psi ] \right\} , \end{aligned}$$
(3)

where \((\varphi , \psi ) \in \mathrm {ADM}(c)\) is required to satisfy

$$\begin{aligned} \varphi (x) + \psi (y) \le c(x,y),\quad \forall (x,y) \in {\mathcal {X}}\times {\mathcal {X}}. \end{aligned}$$
(4)

Potentials \(\varphi _{\mathrm{opt}}, \psi _{\mathrm{opt}}\) achieving the maximum in (3) are called Kantorovich potentials.

Wasserstein distances

The p-Wasserstein distance \(W_p\) between \(\mu \) and \(\nu \) is defined as

$$\begin{aligned} W_p(\mu ,\nu ) = \mathrm {OT}_{d^p}(\mu , \nu )^{\frac{1}{p}}, \end{aligned}$$
(5)

where d is a metric on \({\mathcal {X}}\) and \(p\ge 1\). The case \(p=2\) is particularly interesting, as the resulting metric is then induced by a pseudo-Riemannian metric structure [5, 53].

2-Wasserstein distance between Gaussians

One of the rare cases where the 2-Wasserstein distance admits a closed form solution is between two multivariate Gaussian distributions \(\mu _i={\mathcal {N}}(m_i,K_i)\), \(i=0,1\) with \(d(x,y) = \Vert x-y\Vert \), which is given by [26, 42, 46, 66]

$$\begin{aligned} W_2^2(\mu _0, \mu _1) = ||m_0 -m_1||^2 + \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1) - 2 \mathrm {Tr}\left( K_1^\frac{1}{2} K_0 K_1^\frac{1}{2}\right) ^\frac{1}{2}. \end{aligned}$$
(6)
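As a numerical sanity check of (6), the following Python sketch evaluates the formula for \(2\times 2\) covariance matrices using only the standard library. The helper names (`sqrtm_2x2`, `matmul_2x2`, `w2_squared`) are illustrative and do not come from the paper, and the closed-form \(2\times 2\) square root is a convenience assumed for this example; a general implementation would rather rely on an eigendecomposition.

```python
import math

def sqrtm_2x2(a):
    """Principal square root of a nonzero symmetric PSD 2x2 matrix.

    Closed form from Cayley-Hamilton: sqrt(A) = (A + s*I) / t,
    with s = sqrt(det A) and t = sqrt(tr A + 2*s).
    """
    (p, q), (_, r) = a
    s = math.sqrt(max(p * r - q * q, 0.0))  # clamp tiny negative round-off
    t = math.sqrt(p + r + 2.0 * s)
    return [[(p + s) / t, q / t], [q / t, (r + s) / t]]

def matmul_2x2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def w2_squared(m0, k0, m1, k1):
    """Squared 2-Wasserstein distance of Eq. (6) between N(m0, k0) and N(m1, k1)."""
    root1 = sqrtm_2x2(k1)
    cross = sqrtm_2x2(matmul_2x2(matmul_2x2(root1, k0), root1))
    mean_term = sum((u - v) ** 2 for u, v in zip(m0, m1))
    return (mean_term + k0[0][0] + k0[1][1] + k1[0][0] + k1[1][1]
            - 2.0 * (cross[0][0] + cross[1][1]))
```

When the two covariances coincide, the trace terms cancel and only the squared distance between the means remains. The helper divides by zero if a covariance is the zero matrix, so that degenerate case would need separate handling.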

It can be shown that (6) is induced by a Riemannian metric in the space of n-dimensional Gaussians \({\mathcal {N}}({\mathbb {R}}^n)\), with the metric \(g_K:T_K{\mathcal {N}}({\mathbb {R}}^n)\times T_K{\mathcal {N}}({\mathbb {R}}^n)\rightarrow {\mathbb {R}}\) given by [77]

$$\begin{aligned} g_K(U,V) = \mathrm {Tr}\left[ v_{(K,U)}Kv_{(K,V)}\right] , \quad \forall ~K\in {\mathcal {N}}({\mathbb {R}}^n),~ U,V \in T_K{\mathcal {N}}({\mathbb {R}}^n), \end{aligned}$$
(7)

where \(v_{(K,V)}\) denotes the unique symmetric matrix solving the Sylvester equation

$$\begin{aligned} V = Kv_{(K,V)} + v_{(K,V)}K. \end{aligned}$$
(8)

Moreover, given \({\mathcal {N}}(m_0,K_0),{\mathcal {N}}(m_1,K_1)\in {\mathcal {N}}({\mathbb {R}}^n)\), the geodesics under the metric (6) are given by \({\mathcal {N}}(m_t, K_t)\), with [58]

$$\begin{aligned} \begin{aligned} m_t&= (1-t)m_0 + tm_1, \\ K_t&= \left( (1-t)I + tK_0^{-\frac{1}{2}}\left( K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2}K_0^{-\frac{1}{2}}\right) K_0\\&\quad \times \left( (1-t)I + tK_0^{-\frac{1}{2}}\left( K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2}K_0^{-\frac{1}{2}}\right) \\&= (1-t)^2K_0 + t^2K_1 + t(1-t)[(K_0K_1)^{1/2} + (K_1K_0)^{1/2}]. \end{aligned} \end{aligned}$$
(9)
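In one dimension, (9) reduces to a linear interpolation of the means and of the standard deviations. The sketch below, with hypothetical function names and restricted to \(n=1\) as an assumption of the example, checks the constant-speed property \(W_2(\mu _0,\mu _t) = t\,W_2(\mu _0,\mu _1)\) that one expects along a geodesic.

```python
import math

def w2_squared_1d(m0, var0, m1, var1):
    """Eq. (6) for n = 1, i.e. (m0 - m1)^2 + (sigma0 - sigma1)^2."""
    return (m0 - m1) ** 2 + var0 + var1 - 2.0 * math.sqrt(var0 * var1)

def geodesic_1d(m0, var0, m1, var1, t):
    """Mean and variance of the geodesic of Eq. (9) at time t, for n = 1."""
    m_t = (1.0 - t) * m0 + t * m1
    var_t = ((1.0 - t) ** 2 * var0 + t ** 2 * var1
             + 2.0 * t * (1.0 - t) * math.sqrt(var0 * var1))
    return m_t, var_t

def speed_defect(m0, var0, m1, var1, t):
    """Difference W_2^2(mu_0, mu_t) - t^2 W_2^2(mu_0, mu_1); zero on a geodesic."""
    m_t, var_t = geodesic_1d(m0, var0, m1, var1, t)
    return (w2_squared_1d(m0, var0, m_t, var_t)
            - t ** 2 * w2_squared_1d(m0, var0, m1, var1))
```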

We remark that Eq. (6) is valid for all Gaussian distributions, including the case when \(K_0, K_1\) are positive semi-definite. This is in contrast to the affine-invariant Riemannian distance \(||\log (K_0^{-1/2}K_1K_0^{-1/2})||_F\), the Log-Euclidean distance \(||\log (K_0)-\log (K_1)||_F\), and the Kullback–Leibler divergence (see below), which require that \(K_0,K_1\) be strictly positive definite.

Finally, the 2-Wasserstein barycenter \({\bar{\mu }}\) of a population of probability measures \(\mu _i\) with weights \(\lambda _i\ge 0\), \(i=1,2,\ldots ,N\) and \(\sum _{i=1}^N\lambda _i = 1\), is defined as the minimizer

$$\begin{aligned} {\bar{\mu }} := \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mu \in {\mathcal {P}}({\mathbb {R}}^n)} \sum _{i=1}^N \lambda _i W_2^2(\mu ,\mu _i). \end{aligned}$$
(10)

When the population consists of Gaussians \(\mu _i = {\mathcal {N}}(m_i, K_i)\), one can show that the barycenter is Gaussian given by \({\bar{\mu }} = {\mathcal {N}}({\bar{m}}, {\bar{K}})\), where \({\bar{m}}\), \({\bar{K}}\) satisfy [1, Thm. 6.1]

$$\begin{aligned} {\bar{m}} = \sum _{i=1}^N \lambda _i m_i,\quad {\bar{K}} = \sum _{i=1}^N \lambda _i \left( {\bar{K}}^\frac{1}{2}K_i{\bar{K}}^\frac{1}{2}\right) ^\frac{1}{2}. \end{aligned}$$
(11)
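For \(n=1\), the fixed-point equation in (11) reads \({\bar{k}} = \sum _i \lambda _i \sqrt{{\bar{k}}k_i}\), whose positive solution is \({\bar{k}} = (\sum _i \lambda _i \sqrt{k_i})^2\). The following sketch iterates the scalar equation directly. The function name, the initial value, and the stopping rule are assumptions of this example, and the convergence of this plain iteration is only checked here in the scalar case.

```python
import math

def gaussian_barycenter_1d(means, variances, weights,
                           k_init=1.0, tol=1e-12, max_iter=1000):
    """Barycenter mean and variance for 1-D Gaussians via Eq. (11).

    The mean is the weighted average; the variance is obtained by iterating
    k <- sum_i w_i * sqrt(k * k_i) until successive iterates differ by <= tol.
    """
    m_bar = sum(w * m for w, m in zip(weights, means))
    k = k_init
    for _ in range(max_iter):
        k_next = sum(w * math.sqrt(k * k_i) for w, k_i in zip(weights, variances))
        if abs(k_next - k) <= tol:
            return m_bar, k_next
        k = k_next
    return m_bar, k
```

With variances 1 and 9 and equal weights, the iteration approaches \((0.5\cdot 1 + 0.5\cdot 3)^2 = 4\), in agreement with the closed-form scalar solution.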

Entropic relaxation

Let \(\mu , \nu \in {\mathcal {P}}({\mathcal {X}})\) with densities \(p_\mu \) and \(p_\nu \). Then, we denote by

$$\begin{aligned} D_{\mathrm {KL}}(\mu || \nu ) = - {\mathbb {E}}_\mu \left[ \log \frac{p_\nu }{p_\mu }\right] , \end{aligned}$$
(12)

the Kullback–Leibler divergence (KL-divergence) between \(\mu \) and \(\nu \). The differential entropy of \(\mu \) is given by

$$\begin{aligned} H(\mu ) = -{\mathbb {E}}_\mu [\log p_\mu ]. \end{aligned}$$
(13)

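For a coupling \(\gamma \in \mathrm {ADM}(\mu _0,\mu _1)\) admitting a density and the product measure \(\mu _0 \otimes \mu _1\) of its marginals, we have the identity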

$$\begin{aligned} D_{\mathrm {KL}}(\gamma || \mu _0 \otimes \mu _1) = H(\mu _0) + H(\mu _1) - H(\gamma ). \end{aligned}$$
(14)

A special case that will be used later in this work is the KL-divergence between two non-degenerate multivariate Gaussian distributions \(\mu _0 = {\mathcal {N}}(m_0, K_0)\) and \(\mu _1 = {\mathcal {N}}(m_1, K_1)\) when \({\mathcal {X}} = {\mathbb {R}}^n\), which is given by

$$\begin{aligned} \begin{aligned} D_{\mathrm {KL}}(\mu _0 || \mu _1) =&\frac{1}{2}\left( \mathrm {Tr}\left( K_1^{-1}K_0\right) + \left( m_1 - m_0\right) ^T K_1^{-1} \left( m_1 - m_0\right) \right. \\&\left. - n + \ln \left( \frac{\det K_1}{\det K_0}\right) \right) , \end{aligned} \end{aligned}$$
(15)

and for the entropy we have

$$\begin{aligned} H(\mu _0) = \frac{1}{2}\log \det \left( 2\pi e K_0\right) . \end{aligned}$$
(16)
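The Gaussian closed forms can be checked against the definitions (12) and (13) by numerical quadrature. The sketch below does so for \(n=1\), using the standard univariate expression for \(D_{\mathrm {KL}}({\mathcal {N}}(m_0,k_0)\,||\,{\mathcal {N}}(m_1,k_1))\). The function names, the midpoint rule, and the truncation of the integration range to 12 standard deviations are choices made for this example only.

```python
import math

def log_pdf(x, m, var):
    """Log-density of N(m, var) at x."""
    return -0.5 * (x - m) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)

def kl_gauss_1d(m0, var0, m1, var1):
    """Closed-form D_KL(N(m0, var0) || N(m1, var1)) for n = 1."""
    return 0.5 * (var0 / var1 + (m1 - m0) ** 2 / var1 - 1.0
                  + math.log(var1 / var0))

def entropy_gauss_1d(var):
    """Differential entropy of Eq. (16) for n = 1."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def kl_quadrature(m0, var0, m1, var1, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of Eq. (12), -E_mu0[log(p_mu1 / p_mu0)]."""
    sigma0 = math.sqrt(var0)
    lower = m0 - half_width * sigma0
    h = 2.0 * half_width * sigma0 / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h
        lp0 = log_pdf(x, m0, var0)
        total += math.exp(lp0) * (lp0 - log_pdf(x, m1, var1))
    return total * h
```

Working with log-densities avoids taking the logarithm of a density that has underflowed to zero in the tails.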

Given \(\epsilon > 0\), we relax (1) with a KL-divergence term between the transport plan and the independent joint distribution, yielding the entropic OT problem [20]

$$\begin{aligned} \mathrm {OT}_c^\epsilon (\mu , \nu ) = \min _{\gamma \in \mathrm {ADM}(\mu ,\nu )}\left\{ {\mathbb {E}}_\gamma [c] + \epsilon D_{\mathrm {KL}}(\gamma || \mu \otimes \nu ) \right\} , \end{aligned}$$
(17)

which yields a strictly convex problem with respect to \(\gamma \). Moreover, thanks to the Sinkhorn–Knopp algorithm, this problem is numerically more favorable to solve than (1), for which one would use, for instance, the Hungarian or the auction algorithm. As shown, for instance, in [12, 19, 25, 40, 72], the above problem has a unique minimizer given by

$$\begin{aligned} \gamma ^{\varepsilon }= \alpha ^{\varepsilon }(x)\beta ^{\varepsilon }(y)k(x,y)\mu (x)\nu (y), \end{aligned}$$
(18)

if and only if there exist functions \(\alpha ^\varepsilon \) and \(\beta ^\varepsilon \) such that

$$\begin{aligned} \begin{aligned} \alpha ^{\varepsilon }(x){\mathbb {E}}_\nu \left[ \beta ^{\varepsilon }k(x,\cdot )\right]&= 1, \\ \beta ^{\varepsilon }(y){\mathbb {E}}_\mu \left[ \alpha ^{\varepsilon }k(\cdot ,y)\right]&= 1,\\ \end{aligned} \end{aligned}$$
(19)

where \(k(x,y) = \exp \left( -\frac{1}{\epsilon }c(x,y)\right) \) denotes the Gibbs kernel. We call \(\gamma ^\epsilon \) an entropic transport plan. Moreover, when \(\varepsilon \rightarrow 0\), \(\gamma ^{\varepsilon }\) converges to \(\gamma _{\mathrm{opt}}\), a solution of the OT problem (1) [22, 39, 50]; while when \(\varepsilon \rightarrow \infty \), \(\gamma ^{\varepsilon }\) converges to the independent coupling \(\gamma ^\infty = \mu \otimes \nu \) [36, 69]. The latter property shows in particular that, for large \(\varepsilon \), the entropy-regularized OT behaves like an inner product and not like a norm. In linear algebra, the polarization formula is the usual way of defining a norm from an inner product. This is the main idea behind the Sinkhorn divergence.
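For discrete measures, the system (19) can be solved by alternately enforcing its two equations, which is the idea of the Sinkhorn–Knopp algorithm mentioned above. The following is a minimal sketch under the assumption of finitely supported \(\mu \) and \(\nu \) given as lists of weights; the function name and the fixed iteration count are illustrative choices, not part of the paper.

```python
import math

def sinkhorn_scalings(mu, nu, cost, eps, n_iter=500):
    """Alternating updates of system (19) for discrete measures.

    mu, nu : lists of probabilities (weights of the atoms).
    cost   : cost[i][j] is the cost between atom i of mu and atom j of nu.
    Returns the scalings (alpha, beta) and the plan gamma of Eq. (18).
    """
    n, m = len(mu), len(nu)
    # Gibbs kernel k = exp(-c / eps)
    k = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    alpha, beta = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # first equation of (19): alpha(x) * E_nu[beta * k(x, .)] = 1
        alpha = [1.0 / sum(beta[j] * k[i][j] * nu[j] for j in range(m))
                 for i in range(n)]
        # second equation of (19): beta(y) * E_mu[alpha * k(., y)] = 1
        beta = [1.0 / sum(alpha[i] * k[i][j] * mu[i] for i in range(n))
                for j in range(m)]
    gamma = [[alpha[i] * beta[j] * k[i][j] * mu[i] * nu[j] for j in range(m)]
             for i in range(n)]
    return alpha, beta, gamma
```

After convergence, the rows of `gamma` sum to `mu` and its columns to `nu`, which is exactly the admissibility requirement \(\gamma ^\varepsilon \in \mathrm {ADM}(\mu ,\nu )\). For very small \(\epsilon \) the kernel entries underflow, and a log-domain variant would be preferable.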

Sinkhorn divergence

The KL-divergence term in \(\mathrm {OT}_c^\epsilon \) acts as a bias, as discussed in [30]. This can be removed by defining the p-Sinkhorn divergence as

$$\begin{aligned} S_p^\epsilon (\mu , \nu ) = \mathrm {OT}_{d^p}^\epsilon (\mu , \nu ) - \frac{1}{2}(\mathrm {OT}_{d^p}^\epsilon (\mu ,\mu ) + \mathrm {OT}_{d^p}^\epsilon (\nu ,\nu ) ). \end{aligned}$$
(20)

As shown in [30], if, for example, \(c=d^p\), \(p\ge 1\), the Sinkhorn divergence metrizes the convergence in law in the space of probability measures.

Entropy-Kantorovich duality

In this subsection, we summarize well-known results on the Entropy-Kantorovich duality. For further details and proofs, we refer the reader to [25].

Given a probability measure \(\mu \), the class of Entropy-Kantorovich potentials is defined by the set of measurable functions \(\varphi \) on \({\mathbb {R}}^n\) satisfying

$$\begin{aligned} L^{\mathrm{exp}}_{\varepsilon }({\mathbb {R}}^n,\mu ) = \left\{ \varphi :{\mathbb {R}}^n \rightarrow [-\infty , \infty [ \, : \, 0<{\mathbb {E}}_\mu \left[ \exp \left( \frac{1}{\epsilon } \varphi \right) \right] < \infty \right\} . \end{aligned}$$
(21)

Then, given \(c=d^2\), where \(d(x,y) = \Vert x-y\Vert \), \(\varphi \in L^{\mathrm{exp}}_{\varepsilon }({\mathbb {R}}^n,\mu _0)\) and \(\psi \in L^{\mathrm{exp}}_{\varepsilon }({\mathbb {R}}^n,\mu _1)\), the entropic Kantorovich (dual) formulation of \(\mathrm {OT}^\epsilon _{d^2}(\mu _0,\mu _1)\) is given by [25, 30, 36, 41, 50],

$$\begin{aligned} \begin{aligned} \mathrm {OT}^\epsilon _{d^2}(\mu _0,\mu _1)&= \sup \limits _{\varphi , \psi }\left\{ {\mathbb {E}}_{\mu _0}[\varphi ] + {\mathbb {E}}_{\mu _1}[\psi ]\right. \\&\quad \left. -\varepsilon \left( {\mathbb {E}}_{\mu _0 \otimes \mu _1} \left[ \exp \left( \frac{(\varphi \oplus \psi )-d^2}{\varepsilon }\right) \right] -1\right) \right\} , \end{aligned} \end{aligned}$$
(22)

where \(\left( \varphi \oplus \psi \right) (x,y) = \varphi (x) + \psi (y)\), \(\varphi \in L^{\mathrm{exp}}_{\varepsilon }({\mathbb {R}}^n,\mu _0)\), and \(\psi \in L^{\mathrm{exp}}_{\varepsilon }({\mathbb {R}}^n,\mu _1)\).

Finally, the theorem below illustrates the relationship between the Entropy-Kantorovich potentials and the solution (19) of the entropy-regularized optimal transport problem (17), assuming the cost c is bounded.

Theorem 1

[25] Let \(\varepsilon >0\), let c be a bounded cost, and let \(\mu _0,\mu _1 \in {\mathcal {P}}({\mathbb {R}}^n)\) be probability measures. Then, the supremum in (22) is attained for a unique couple \((\varphi ^\epsilon , \psi ^\epsilon )\) (up to the trivial transformation \((\varphi ^\epsilon , \psi ^\epsilon ) \rightarrow (\varphi ^\epsilon + \alpha , \psi ^\epsilon - \alpha )\)). Moreover, the following are equivalent:

  1. (a)

    (Maximizers) \(\varphi ^\epsilon \) and \(\psi ^\epsilon \) are maximizing potentials for (22).

  2. (b)

    (Schrödinger system) Let

    $$\begin{aligned} \gamma ^\epsilon =\exp \left( \frac{1}{\epsilon }\left( \varphi ^\epsilon \oplus \psi ^\epsilon -d^2\right) \right) \mu _0\otimes \mu _1, \end{aligned}$$
    (23)

    then \(\gamma ^\epsilon \in \mathrm {ADM}(\mu _0, \mu _1)\). Furthermore, \(\gamma ^\epsilon \) is the (unique) minimizer of the problem (17).

Elements of the pair \((\varphi ^\epsilon , \psi ^\epsilon )\) reaching a maximum in (22) are called entropic Kantorovich potentials. Finally, by Theorem 1, the functions \(\alpha ^\epsilon ,\beta ^\epsilon \) in (19) and the entropic Kantorovich potentials \(\varphi ^\epsilon , \psi ^\epsilon \) above are related by

$$\begin{aligned} \varphi ^\epsilon = \epsilon \log \alpha ^\epsilon ,\quad \psi ^\epsilon = \epsilon \log \beta ^\epsilon . \end{aligned}$$
(24)
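To illustrate Theorem 1 and the relation (24) in a setting where everything is computable, the following sketch takes two small discrete measures, which is an assumption of this example (the theorem itself is stated for general probability measures with bounded cost). It runs the alternating updates of system (19) and evaluates the primal objective (17) and the dual objective (22) at \(\varphi = \epsilon \log \alpha \), \(\psi = \epsilon \log \beta \). All function names are hypothetical.

```python
import math

def sinkhorn_scalings(mu, nu, cost, eps, n_iter=500):
    """Alternating updates of system (19); returns (alpha, beta, kernel)."""
    n, m = len(mu), len(nu)
    k = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    alpha, beta = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        alpha = [1.0 / sum(beta[j] * k[i][j] * nu[j] for j in range(m))
                 for i in range(n)]
        beta = [1.0 / sum(alpha[i] * k[i][j] * mu[i] for i in range(n))
                for j in range(m)]
    return alpha, beta, k

def primal_value(alpha, beta, k, mu, nu, cost, eps):
    """Objective of (17) at the plan (18): E_gamma[c] + eps * KL(gamma || mu x nu)."""
    total = 0.0
    for i in range(len(mu)):
        for j in range(len(nu)):
            density = alpha[i] * beta[j] * k[i][j]  # d gamma / d(mu x nu)
            g = density * mu[i] * nu[j]
            total += g * (cost[i][j] + eps * math.log(density))
    return total

def dual_value(phi, psi, mu, nu, cost, eps):
    """Objective of (22) for discrete measures and potentials phi, psi."""
    linear = (sum(w * p for w, p in zip(mu, phi))
              + sum(w * q for w, q in zip(nu, psi)))
    penalty = sum(mu[i] * nu[j] * math.exp((phi[i] + psi[j] - cost[i][j]) / eps)
                  for i in range(len(mu)) for j in range(len(nu)))
    return linear - eps * (penalty - 1.0)
```

At the converged scalings the two values agree up to numerical error, while any other pair of potentials gives a dual value that is not larger, as expected from the supremum in (22).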

Using the dual formulation, we can show the following.

Proposition 1

Let \(\mu ,\nu \in {\mathcal {P}}({\mathbb {R}}^n)\) and c be a bounded cost. Then, \(\mathrm {OT}^\epsilon _c(\mu ,\nu )\) is strictly convex in both arguments.

Proof

Let \(\mu _t = t\mu _0 + (1-t)\mu _1\) for \(t\in (0,1)\), and \((\varphi _j,\psi _j)\) be the entropic Kantorovich potentials associated with \(\mathrm {OT}_c^\epsilon (\mu _j, \nu )\) for \(j=0,1\), and \((\varphi , \psi )\) for \(\mathrm {OT}_c^\epsilon (\mu _t, \nu )\). Then, using the dual formulation (22), we have

$$\begin{aligned} \begin{aligned} \mathrm {OT}_c^\epsilon (\mu _t,\nu )&= t\left( {\mathbb {E}}_{\mu _0}[\varphi ] + {\mathbb {E}}_{\nu }[\psi ]\right) +(1-t)\left( {\mathbb {E}}_{\mu _1}[\varphi ] + {\mathbb {E}}_{\nu }[\psi ]\right) \\&\quad - \epsilon t \left( {\mathbb {E}}_{\mu _0 \otimes \nu }\left[ \exp \left( \frac{(\varphi \oplus \psi )-c}{\epsilon }\right) \right] -1\right) \\&\quad - \epsilon (1-t)\left( {\mathbb {E}}_{\mu _1 \otimes \nu }\left[ \exp \left( \frac{(\varphi \oplus \psi )-c}{\epsilon }\right) \right] -1\right) \\&< t\left( {\mathbb {E}}_{\mu _0}[\varphi _0] + {\mathbb {E}}_{\nu }[\psi _0]\right) +(1-t)\left( {\mathbb {E}}_{\mu _1}[\varphi _1] + {\mathbb {E}}_{\nu }[\psi _1]\right) \\&\quad - \epsilon t \left( {\mathbb {E}}_{\mu _0 \otimes \nu }\left[ \exp \left( \frac{(\varphi _0 \oplus \psi _0)-c}{\epsilon }\right) \right] -1\right) \\&\quad - \epsilon (1-t)\left( {\mathbb {E}}_{\mu _1 \otimes \nu }\left[ \exp \left( \frac{(\varphi _1 \oplus \psi _1)-c}{\epsilon }\right) \right] -1\right) \\&= t\mathrm {OT}_c^\epsilon (\mu _0,\nu ) + (1-t)\mathrm {OT}_c^\epsilon (\mu _1,\nu ), \end{aligned} \end{aligned}$$
(25)

where the first equality results from the linearity of expectations, and the inequality from noticing that the pair \((\varphi , \psi )\) is a competitor for \((\varphi _j, \psi _j)\), \(j=0,1\). Due to the uniqueness of the entropic Kantorovich potentials (up to additive constants, Theorem 1), \((\varphi ,\psi )\) cannot coincide with both \((\varphi _0,\psi _0)\) and \((\varphi _1,\psi _1)\) (unless \(\mu _0 = \mu _1\)), and thus returns a strictly lower value for at least one of the two problems. \(\square \)

Dynamical formulation of entropy relaxed optimal transport

Analogously to unregularized OT theory, the entropic-regularization of OT with distance cost admits a dynamical (aka Benamou–Brenier) formulation.

In the following, we again consider the particular case when the cost function is given by \(c(x,y) = \Vert x-y\Vert ^2\). Then, we can write (17) as [41, 50]

$$\begin{aligned} \mathrm {OT}^\epsilon _{d^2}(\mu _0,\mu _1) = \min _{(\mu ^\epsilon _t,v_t)}\int _0^1{\mathbb {E}}_{\mu ^\epsilon _t}\left[ \Vert v_t\Vert ^2\right] dt + H(\mu _0) + H(\mu _1), \end{aligned}$$
(26)

where \(t\in [0,1]\), \(\mu ^\epsilon _0=\mu _0\), \(\mu ^\epsilon _1 = \mu _1\), and

$$\begin{aligned} \partial _t \mu ^\epsilon _t + \nabla \cdot (v_t \mu ^\epsilon _t) = \frac{\varepsilon }{2}\varDelta \mu ^\epsilon _t. \end{aligned}$$
(27)

Here, the minimum is understood as taken among all couples \((\mu ^\epsilon _t,v_t)\) solving the continuity equation (27) in the distributional sense (see Appendix A); moreover, the minimum is attained if and only if \((\mu ^\epsilon _t,v_t) = (\mu ^\epsilon _t,\nabla \phi _t^\varepsilon )\), for a potential \(\phi _t^\varepsilon :{\mathbb {R}}^n\rightarrow {\mathbb {R}}\), which is defined in the following via the entropic potentials. The resulting \(\mu _t^\epsilon \) is called the entropic interpolation between \(\mu _0\) and \(\mu _1\).

The solution of the static problem (17), characterized through (19) by (while abusing the notation and writing \(\mu (x)\) for the density of \(\mu \), which will be done throughout this work)

$$\begin{aligned} \gamma ^{\varepsilon }(x,y) = \alpha ^{\varepsilon }(x)\beta ^{\varepsilon }(y) \exp \left( -\frac{1}{\epsilon }\Vert x-y\Vert ^2\right) \mu _0(x)\mu _1(y), \end{aligned}$$
(28)

allows us, in conjunction with the heat flow, to compute the entropic interpolation from \(\mu _0\) to \(\mu _1\), which is given by [41, 50, 70]

$$\begin{aligned} \begin{aligned} \mu ^\epsilon _t&={\mathcal {H}}^{\mu _0}_{t\varepsilon }(\alpha ^{\varepsilon })\,{\mathcal {H}}^{\mu _1}_{(1-t)\varepsilon }(\beta ^{\varepsilon }),\\ {\mathcal {H}}^{\mu }_s[f]&= \int _{{\mathbb {R}}^n} \frac{1}{\sqrt{2\pi s}}\exp \left( -\frac{1}{s}\Vert x-z\Vert ^2\right) f(z)\mu (z){\mathrm{d}} z, \end{aligned} \end{aligned}$$
(29)

and \(\alpha ^{\varepsilon }\),\(\beta ^{\varepsilon }\) are the Entropy-Kantorovich potentials solving the system (19). In particular, we have that

$$\begin{aligned} \alpha ^\varepsilon (x) {\mathcal {H}}^{\mu _1}_{\varepsilon }(\beta ^\varepsilon )(x)=1,\quad \beta ^\varepsilon (y)\,{\mathcal {H}}^{\mu _0}_{\varepsilon }(\alpha ^\varepsilon )(y)=1. \end{aligned}$$
(30)

Furthermore, when we send the regularization parameter \(\varepsilon \rightarrow 0\), the curves of measures \(\mu ^\epsilon _t\) converge to the 2-Wasserstein geodesic between \(\mu _0\) and \(\mu _1\) [40, 50]. Moreover, the entropic interpolation \(\mu ^\epsilon _t\) and the dynamic entropic Kantorovich potentials \((\varphi ^\varepsilon _t,\psi ^\varepsilon _t)\) are related via \(\varphi ^\varepsilon _t + \psi ^\varepsilon _t = \varepsilon \log \mu ^\epsilon _t\).

Now, by defining \(\phi _t^\varepsilon = (\varphi ^{\varepsilon }_t-\psi ^{\varepsilon }_t)/2\), it is easy to check that by imposing \(v^{\varepsilon }_t = \nabla \phi ^{\varepsilon }_t\) we have that \((\mu ^\epsilon _t,v^{\varepsilon }_t)\) solves the Fokker–Planck equation

$$\begin{aligned} \partial _t \mu ^\epsilon _t + \nabla \cdot (v_t^{\varepsilon }\mu ^\epsilon _t) = \frac{\varepsilon }{2}\varDelta \mu ^\epsilon _t. \end{aligned}$$
(31)

Entropy-regularized 2-Wasserstein distance between Gaussians

In this section, we consider the special case of (17) and (20) when \(c(x,y) = d^2(x,y) = \Vert x-y\Vert ^2\) is the squared Euclidean distance in \({\mathbb {R}}^n\) and \(\mu _0 = {\mathcal {N}}(m_0,K_0)\), \(\mu _1 = {\mathcal {N}}(m_1,K_1)\) are multivariate Gaussian distributions. We are interested in obtaining explicit formulas for the optimal coupling \(\gamma ^{\varepsilon }\) solving (17), the Entropy-Kantorovich maximizers \((\varphi ^\epsilon ,\psi ^\epsilon )\) in (22) and the entropic displacement interpolation \(\mu ^{\varepsilon }_t\) in (29).

We start by showing that we can assume, without loss of generality, that \(\mu _0\) and \(\mu _1\) are centered Gaussian distributions. The general case is then obtained by a shift that depends on the Euclidean distance between the means of the two Gaussians.

Proposition 2

Let \(c(x,y) = \Vert x-y\Vert ^2\), \(X_i\sim \mu _i\in {\mathcal {P}}({\mathbb {R}}^n)\) for \(i=0,1\) and \(m_i = {\mathbb {E}}\left[ \mu _i\right] \). Denote by \({\hat{X}}_i = X_i - m_i\sim {\hat{\mu }}_i\) the corresponding centered distributions. Then

$$\begin{aligned} \mathrm {OT}_{d^2}^\epsilon (\mu _0,\mu _1) = \Vert m_0-m_1\Vert ^2 + \mathrm {OT}_{d^2}^\epsilon \left( {\hat{\mu }}_0, {\hat{\mu }}_1\right) . \end{aligned}$$
(32)

Proof

Recall the definition given in (17)

$$\begin{aligned} \mathrm {OT}_c^\epsilon (\mu _0, \mu _1) = \min _{\gamma \in \mathrm {ADM}(\mu _0,\mu _1)}\left\{ {\mathbb {E}}_\gamma [c] + \epsilon D_{\mathrm {KL}}(\gamma || \mu _0 \otimes \mu _1) \right\} . \end{aligned}$$
(33)

Then, as \(c=d^2\), for the first term we can write

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_\gamma \left[ d^2\right]&= \int _{{\mathbb {R}}^n\times {\mathbb {R}}^n} \Vert x-y\Vert ^2d\gamma (x,y)\\&= \int _{{\mathbb {R}}^n\times {\mathbb {R}}^n} \left( \Vert (x-m_0) - (y-m_1)\Vert ^2 + \Vert m_0 - m_1\Vert ^2 \right. \\&\quad \left. + 2\left( (x-m_0) - (y-m_1)\right) ^T(m_0-m_1) \right) d\gamma (x,y)\\&= \Vert m_0-m_1\Vert ^2 + \int _{{\mathbb {R}}^n\times {\mathbb {R}}^n}\Vert x-y\Vert ^2d\gamma (x+m_0,y+m_1). \end{aligned} \end{aligned}$$
(34)

We now verify that the requirement \(\gamma \in \mathrm {ADM}(\mu _0, \mu _1)\) is equivalent to \(\gamma (\cdot + m_0, \cdot + m_1)\in \mathrm {ADM}({\hat{\mu }}_0, {\hat{\mu }}_1)\), which results from

$$\begin{aligned} \int _{{\mathbb {R}}^n} \gamma (x+m_0, y+m_1) {\mathrm{d}}y= \mu _0(x+m_0) = {\hat{\mu }}_0(x), \end{aligned}$$
(35)

and similarly for the other margin. Finally, for the entropy term, we use the identity (14). Now, as the entropy of a distribution does not depend on the expected value, we have \(H(\mu _i) = H({\hat{\mu }}_i)\), and therefore

$$\begin{aligned} D_{\mathrm {KL}}(\gamma || {\hat{\mu }}_0 \otimes {\hat{\mu }}_1) = H(\mu _0) + H(\mu _1) - H(\gamma ). \end{aligned}$$
(36)

Putting everything together, we get

$$\begin{aligned} \begin{aligned} \mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)&= \Vert m_0-m_1\Vert ^2 \\&\quad + \min \limits _{\gamma \in \mathrm {ADM}({\hat{\mu }}_0,{\hat{\mu }}_1)}\left\{ {\mathbb {E}}_\gamma [d^2] + \epsilon D_{\mathrm {KL}}(\gamma || {\hat{\mu }}_0 \otimes {\hat{\mu }}_1) \right\} \\&= \Vert m_0-m_1\Vert ^2 + \mathrm {OT}_{d^2}^\epsilon ({\hat{\mu }}_0, {\hat{\mu }}_1). \end{aligned} \end{aligned}$$
(37)

\(\square \)

Proposition 3

Let \(\mu _i = {\mathcal {N}}(0,K_i)\in {\mathcal {N}}({\mathbb {R}}^n)\) for \(i=0,1\). Then, the unique optimal plan \(\gamma ^\epsilon \) in \(\mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)\) is a centered Gaussian distribution.

Proof

Note that \({\mathbb {E}}_\gamma [d^2]\) depends only on the mean and covariance of \(\gamma \), and therefore remains constant, if \(\gamma \) is replaced with a Gaussian with the corresponding mean and covariance (which we can do, as the marginals are Gaussians). Then, for the other term, using the identity (14)

$$\begin{aligned} D_{\mathrm {KL}}(\gamma || \mu _0 \otimes \mu _1) = H(\mu _0) + H(\mu _1) - H(\gamma ). \end{aligned}$$
(38)

It is readily seen that the \(\gamma \) with a fixed covariance matrix minimizing this expression is Gaussian, as Gaussians achieve maximal entropy over distributions sharing a fixed covariance matrix. Therefore, we can deduce that \(\gamma ^\epsilon \) is Gaussian. Finally, as both of the marginals \(\mu _0\) and \(\mu _1\) are centered, so is \(\gamma ^\epsilon \). \(\square \)

We now arrive at the main theorem of this work, detailing the entropic 2-Wasserstein geometry between multivariate Gaussians. The proof is based on studying the Schrödinger system given in (19). We give an alternative proof of part (a) of Theorem 2 in Appendix B, by finding the minimizer of the OT problem. Recall that a noteworthy property of the entropic interpolant is that, even when interpolating from \(\mu \) to itself, the trajectory does not remain constantly at \(\mu \).

Theorem 2

Let \(\mu _i = {\mathcal {N}}(0,K_i)\), for \(i=0,1\), be two centered multivariate Gaussian distributions in \({\mathbb {R}}^n\), write \(N^\epsilon _{ij} = \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}K_jK_i^\frac{1}{2}\right) ^\frac{1}{2}\) and \(M^\epsilon = I + \left( I + \frac{16}{\epsilon ^2}K_0K_1\right) ^\frac{1}{2}\). Then,

  1. (a)

    The density of the optimal entropy relaxed plan \(\gamma ^\epsilon \) is given by

    $$\begin{aligned} \gamma ^\epsilon (x,y) = \alpha ^\epsilon (x)\beta ^\epsilon (y) \exp \left( -\frac{\Vert x-y\Vert ^2}{\epsilon }\right) \mu _0(x)\mu _1(y), \end{aligned}$$
    (39)

    where \(\alpha ^\epsilon (x) = \exp \left( x^TAx + a\right) \), \(\beta ^\epsilon (y) = \exp \left( y^TBy + b\right) \), and

    $$\begin{aligned} \begin{aligned} A&= \frac{1}{4}K_0^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }K_0 - N^\epsilon _{01} \right) K_0^{-\frac{1}{2}}\\ B&= \frac{1}{4}K_1^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }K_1 - N^\epsilon _{10} \right) K_1^{-\frac{1}{2}} \\ \exp (a+b)&= \sqrt{ \frac{1}{2^n} \det \left( M^\epsilon \right) }. \end{aligned} \end{aligned}$$
    (40)
  2. (b)

    The entropic optimal transport quantity is given by

    $$\begin{aligned} \begin{aligned} \mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)&= \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1)\\&\quad - \frac{\epsilon }{2}\left( \mathrm {Tr}(M^\epsilon ) - \log \det (M^\epsilon ) + n\log 2 - 2n \right) \end{aligned} \end{aligned}$$
    (41)
  3. (c)

    The entropic displacement interpolation \(\mu _t^\epsilon \), \(t\in [0,1]\), between \(\mu _0\) and \(\mu _1\) is given by \(\mu _t^\epsilon = {\mathcal {N}}\left( 0, K^{\epsilon }_t\right) \), where

    $$\begin{aligned} \begin{aligned} K^\epsilon _t&= \frac{(1-t)^2\epsilon ^2}{16}K_1^{-\frac{1}{2}}\left( -I + \left( \frac{4t}{(1-t)\epsilon }K_1 + N^\epsilon _{10} \right) ^2 \right) K_1^{-\frac{1}{2}}\\&= \frac{t^2\epsilon ^2}{16}K_0^{-\frac{1}{2}}\left( -I + \left( \frac{4(1-t)}{t\epsilon }K_0 + N^\epsilon _{01} \right) ^2 \right) K_0^{-\frac{1}{2}}\\&= (1-t)^2K_0 + t^2K_1 + t(1-t) \left[ \left( \frac{\epsilon ^2}{16}I + K_0K_1\right) ^{1/2}\right. \\&\quad +\left. \left( \frac{\epsilon ^2}{16}I + K_1K_0\right) ^{1/2}\right] . \end{aligned} \end{aligned}$$
    (42)
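The closed form (41) is easy to evaluate numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names `sqrtm_psd` and `entropic_ot_gaussian` are illustrative choices. It relies on \(M^\epsilon \) having the same trace and determinant as \(I + N^\epsilon _{01}\), since \(K_0K_1\) and \(K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\) share eigenvalues, so only symmetric square roots are needed.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def entropic_ot_gaussian(K0, K1, eps):
    """Evaluate (41) for the centered Gaussians N(0, K0) and N(0, K1)."""
    n = K0.shape[0]
    I = np.eye(n)
    R0 = sqrtm_psd(K0)
    # N01 = (I + 16/eps^2 * K0^{1/2} K1 K0^{1/2})^{1/2}, a symmetric matrix
    N01 = sqrtm_psd(I + 16.0 / eps**2 * R0 @ K1 @ R0)
    # M^eps is similar to I + N01, so trace and log-determinant coincide
    tr_M = n + np.trace(N01)
    logdet_M = np.linalg.slogdet(I + N01)[1]
    return np.trace(K0) + np.trace(K1) - 0.5 * eps * (
        tr_M - logdet_M + n * np.log(2.0) - 2.0 * n
    )


K0 = np.array([[1.0]])
K1 = np.array([[4.0]])
print(entropic_ot_gaussian(K0, K1, eps=1.0))
```

For small \(\epsilon \) the returned value approaches the squared 2-Wasserstein distance between the two Gaussians, which equals 1 for the one-dimensional example above.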

Proof

Part a. Recall that \(\alpha ^\varepsilon \), \(\beta ^\varepsilon \) are the unique functions that give the density of the optimal plan \(\gamma ^\epsilon \)

$$\begin{aligned} \gamma ^\epsilon (x,y) = \alpha ^\varepsilon (x)\beta ^\varepsilon (y)\exp \left( -\frac{\Vert x-y\Vert ^2}{\epsilon }\right) \mu _0(x)\mu _1(y). \end{aligned}$$
(43)

The optimal plan is required to have the right marginals (19), that is,

$$\begin{aligned} \begin{aligned} \mu _0(x)&= \int _{{\mathbb {R}}^n} \gamma ^\epsilon (x,y) {\mathrm{d}}y\\&= \alpha ^\varepsilon (x)\int _{{\mathbb {R}}^n} \beta ^\varepsilon (y)\exp \left( -\frac{\Vert x-y\Vert ^2}{\epsilon }\right) \mu _0(x)\mu _1(y){\mathrm{d}}y,\\ \mu _1(y)&= \int _{{\mathbb {R}}^n} \gamma ^\epsilon (x,y){\mathrm{d}}x\\&= \beta ^\varepsilon (y)\int _{{\mathbb {R}}^n} \alpha ^\varepsilon (x)\exp \left( -\frac{\Vert x-y\Vert ^2}{\epsilon }\right) \mu _0(x)\mu _1(y){\mathrm{d}}x. \end{aligned} \end{aligned}$$
(44)

Assuming \(\alpha ^\varepsilon (x) = \exp (x^TAx+a)\) and \(\beta ^\varepsilon (y) = \exp (y^TBy+b)\), substituting in \(\mu _0\) and \(\mu _1\), and after some simplifications, the system reads

$$\begin{aligned} \begin{aligned} 1&= \frac{\exp (a+b)}{\sqrt{ \det (2\pi K_1)}}\exp \left( x^T\left( A - \frac{1}{\epsilon }I\right) x\right) \\&\quad \times \int _X \exp \left( y^T\left( B-\frac{1}{\epsilon }I - \frac{1}{2}K_1^{-1}\right) y+\frac{2}{\epsilon }x^Ty\right) {\mathrm{d}}y,\\ 1&= \frac{\exp (a+b)}{\sqrt{ \det (2\pi K_0)}}\exp \left( y^T\left( B - \frac{1}{\epsilon }I\right) y\right) \\&\quad \times \int _Y \exp \left( x^T\left( A-\frac{1}{\epsilon }I - \frac{1}{2}K_0^{-1} \right) x+\frac{2}{\epsilon }y^Tx\right) {\mathrm{d}}x. \end{aligned} \end{aligned}$$
(45)

Using the identity

$$\begin{aligned} \int _X \exp \left( -x^TCx + b^Tx\right) {\mathrm{d}}x = \sqrt{\frac{\pi ^n}{\det (C)}} \exp \left( \frac{1}{4}b^TC^{-1}b\right) , \end{aligned}$$
(46)

the system (45) results in

$$\begin{aligned} \begin{aligned} A&= \frac{1}{\epsilon }I +\frac{1}{\epsilon ^2} \left( B-\frac{1}{\epsilon }I - \frac{1}{2}K_1^{-1} \right) ^{-1},\\ B&= \frac{1}{\epsilon }I +\frac{1}{\epsilon ^2} \left( A - \frac{1}{\epsilon }I - \frac{1}{2}K_0^{-1} \right) ^{-1},\\ \exp (a+b)&= \sqrt{\det (2 K_1) \det \left( \frac{1}{\epsilon } I + \frac{1}{2}K_1^{-1} -B \right) }\\ \exp (a+b)&= \sqrt{\det (2 K_0) \det \left( \frac{1}{\epsilon } I + \frac{1}{2}K_0^{-1} -A \right) }\\ \end{aligned} \end{aligned}$$
(47)

Let us solve for A and B first. From system (47), we get that A and B can be written as

$$\begin{aligned} \begin{aligned} A&= \frac{1}{\epsilon }I +\frac{1}{\epsilon ^2}\left( \frac{1}{\epsilon ^2} \left( A - \frac{1}{\epsilon }I - \frac{1}{2}K_0^{-1}\right) ^{-1} - \frac{1}{2}K_1^{-1} \right) ^{-1},\\ B&= \frac{1}{\epsilon }I +\frac{1}{\epsilon ^2}\left( \frac{1}{\epsilon ^2} \left( B - \frac{1}{\epsilon }I - \frac{1}{2}K_1^{-1}\right) ^{-1} - \frac{1}{2}K_0^{-1} \right) ^{-1}. \end{aligned} \end{aligned}$$
(48)

Then, one can show that the matrices A and B given in (40) solve this system. Plugging A and B into the expressions for \(\exp (a+b)\) in (47), we get

$$\begin{aligned} \exp (a+b) = \sqrt{ \frac{1}{2^n} \det \left( I + \left( I + \frac{16}{\epsilon ^2} K_0K_1 \right) ^\frac{1}{2}\right) }, \end{aligned}$$
(49)

for which a possible solution is given by

$$\begin{aligned} a=b= \frac{1}{4}\left( -n\log 2 + \log \det \left( I + \left( I + \frac{16}{\epsilon ^2}K_0K_1 \right) ^\frac{1}{2}\right) \right) . \end{aligned}$$
(50)

Now, we show that A solves the equation given in (48). Manipulating (48) we see that it suffices to show the equality

$$\begin{aligned} \left( A - \frac{1}{\epsilon }I \right) ^{-1} = \left( A - \frac{1}{\epsilon }I - \frac{1}{2}K_0^{-1}\right) ^{-1} - \frac{\epsilon ^2}{2}K_1^{-1}. \end{aligned}$$
(51)

Substituting in A given in (40), the left-hand side reads

$$\begin{aligned} \left( A - \frac{1}{\epsilon }I\right) ^{-1} = 4K_0^\frac{1}{2}\left( I - \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}K_0^\frac{1}{2}, \end{aligned}$$
(52)

whereas the right-hand side is given by

$$\begin{aligned} \begin{aligned}&\left( A - \frac{1}{\epsilon }I - \frac{1}{2}K_0^{-1}\right) ^{-1} - \frac{\epsilon ^2}{2}K_1^{-1}\\&\quad = -4K_0^\frac{1}{2}\left( \frac{\epsilon ^2}{8}\left( K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^{-1} + \left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2}\right) ^{-1} \right) K_0^\frac{1}{2}. \end{aligned} \end{aligned}$$
(53)

Therefore, we need to show the equality

$$\begin{aligned} \begin{aligned} \left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}&= \left( -I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}\\&\quad - \left( \frac{8}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^{-1}, \end{aligned} \end{aligned}$$
(54)

which can be derived as follows

$$\begin{aligned} \begin{aligned}&- \left( \frac{8}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^{-1} + \left( -I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}\\&\quad = -2\left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1} \left( -I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1} \\&\qquad + \left( -I+\left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1}\\&\quad =\left( I - 2\left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1} \right) \left( -I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1}\\&\quad =\left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2} - 2I \right) \left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1} \\&\qquad \times \left( -I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1}\\&\quad =\left( I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2}\right) ^{-1}, \end{aligned} \end{aligned}$$
(55)

where the first step results from writing

$$\begin{aligned} - \left( \frac{8}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) ^{-1} = -2\left( -I + \left( I + \frac{16}{\epsilon ^2}K_0^\frac{1}{2}K_1K_0^\frac{1}{2}\right) \right) ^{-1}, \end{aligned}$$
(56)

and using \(M-I = (M^\frac{1}{2}+I)(M^\frac{1}{2}-I)\) on the right-hand side.
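The claim that A and B from (40) solve the first two equations of (47) can also be checked numerically. The sketch below does this for randomly generated covariance matrices; the function names, the random seed and the value of \(\epsilon \) are arbitrary illustrative choices, and a small residual is only a sanity check, not a proof.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def potential_matrices(K0, K1, eps):
    """Matrices A and B of the quadratic potentials, as given in (40)."""
    I = np.eye(K0.shape[0])
    R0, R1 = sqrtm_psd(K0), sqrtm_psd(K1)
    R0_inv, R1_inv = np.linalg.inv(R0), np.linalg.inv(R1)
    N01 = sqrtm_psd(I + 16.0 / eps**2 * R0 @ K1 @ R0)
    N10 = sqrtm_psd(I + 16.0 / eps**2 * R1 @ K0 @ R1)
    A = 0.25 * R0_inv @ (I + 4.0 / eps * K0 - N01) @ R0_inv
    B = 0.25 * R1_inv @ (I + 4.0 / eps * K1 - N10) @ R1_inv
    return A, B


def system_residuals(K0, K1, eps):
    """Largest entrywise violation of the first two equations of (47)."""
    I = np.eye(K0.shape[0])
    A, B = potential_matrices(K0, K1, eps)
    rhs_A = I / eps + np.linalg.inv(B - I / eps - 0.5 * np.linalg.inv(K1)) / eps**2
    rhs_B = I / eps + np.linalg.inv(A - I / eps - 0.5 * np.linalg.inv(K0)) / eps**2
    return np.abs(A - rhs_A).max(), np.abs(B - rhs_B).max()


# Random symmetric positive-definite test matrices (illustrative data only)
rng = np.random.default_rng(0)
G0, G1 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
K0, K1 = G0 @ G0.T + np.eye(3), G1 @ G1.T + np.eye(3)
res_A, res_B = system_residuals(K0, K1, eps=0.7)
print(res_A, res_B)
```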

Part b. Let \(\varphi _\epsilon (x) = \epsilon \log \alpha _\epsilon (x)\) and \(\psi _\epsilon (y) = \epsilon \log \beta _\epsilon (y)\), and as previously,

$$\begin{aligned} M^\epsilon = I + \left( I + \frac{16}{\epsilon ^2}K_0K_1\right) ^\frac{1}{2}, \end{aligned}$$
(57)

then plugging \(\varphi _\epsilon \) and \(\psi _\epsilon \) into (22) yields

$$\begin{aligned} \begin{aligned} \mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)&= {\mathbb {E}}_{\mu _0}[\varphi _\epsilon ] + {\mathbb {E}}_{\mu _1}[\psi _\epsilon ]\\&\quad -\epsilon \left( {\mathbb {E}}_{\mu _0\otimes \mu _1}\left[ \exp \left( \frac{1}{\epsilon } \left( (\varphi \oplus \psi ) - d^2\right) \right) \right] -1\right) \\&= \epsilon \left( {\mathbb {E}}_{\mu _0}[\log \alpha _\epsilon ] + {\mathbb {E}}_{\mu _1}[\log \beta _\epsilon ] \right) \\&\quad - \epsilon \left( {\mathbb {E}}_{\mu _0 \otimes \mu _1}\left[ \alpha ^\epsilon \beta ^\epsilon \exp \left( -\frac{1}{\epsilon }d^2\right) \right] -1\right) \\&= \epsilon \left( {\mathbb {E}}_{X\sim \mu _0}\left[ X^TAX + a\right] + {\mathbb {E}}_{Y\sim \mu _1}\left[ Y^TBY + b\right] \right) \\&= \epsilon \left( \mathrm {Tr}\left[ K_0A\right] + \mathrm {Tr}\left[ K_1B\right] + a + b \right) \\&= \frac{\epsilon }{4} \mathrm {Tr}\left[ I + \frac{4}{\epsilon } K_0 - \left( I + \frac{16}{\epsilon ^2} K_0^\frac{1}{2}K_1K_0^\frac{1}{2} \right) ^\frac{1}{2} \right] \\&\quad + \frac{\epsilon }{4}\mathrm {Tr}\left[ I + \frac{4}{\epsilon } K_1 - \left( I + \frac{16}{\epsilon ^2} K_1^\frac{1}{2}K_0K_1^\frac{1}{2} \right) ^\frac{1}{2} \right] \\&\quad + \epsilon (a + b)\\&= \mathrm {Tr}K_0 + \mathrm {Tr}K_1 - \frac{\epsilon }{2}\left( \mathrm {Tr}M^\epsilon - \log \det M^\epsilon + n\log 2 - 2n \right) , \end{aligned} \end{aligned}$$
(58)

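where the expectation over \(\mu _0\otimes \mu _1\) equals 1, because \(\gamma ^\epsilon \) is a probability density, so the corresponding term vanishes. We also used the fact that \(C^\frac{1}{2}DC^\frac{1}{2}\) has the same eigenvalues as CD, and so \(\mathrm {Tr}\left[ (I+C^\frac{1}{2}DC^\frac{1}{2})^\frac{1}{2}\right] = \mathrm {Tr}\left[ (I + CD)^\frac{1}{2}\right] \) for any square positive-definite matrices C and D.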

Part c. As we have solved for \(\alpha ^\epsilon \) and \(\beta ^\epsilon \) for the optimal plan, the entropic interpolant \(\mu _t^\epsilon \) between \(\mu _0\) and \(\mu _1\) is given by (29), which we rewrite here

$$\begin{aligned} \mu ^\epsilon _t(x) = \left( {\mathcal {H}}^{\mu _0}_{t\epsilon }[\alpha ^\epsilon ](x)\right) \left( {\mathcal {H}}^{\mu _1}_{(1-t)\epsilon }[\beta ^\epsilon ](x)\right) . \end{aligned}$$
(59)

Then, we can compute

$$\begin{aligned} \begin{aligned} {\mathcal {H}}^{\mu _0}_{t\epsilon }[\alpha ^\epsilon ](x)&= \frac{1}{\sqrt{\det (2\pi ^2t\epsilon K_0)}} \int _{{\mathbb {R}}^n} \exp \left( z^TAz + a-\frac{1}{t\epsilon }\Vert x-z\Vert ^2-\frac{1}{2}z^TK_0^{-1}z\right) {\mathrm{d}}z\\&= \frac{\exp \left( a-\frac{1}{t\epsilon }x^Tx\right) }{\sqrt{\det (2\pi ^2t\epsilon K_0)}} \int _{{\mathbb {R}}^n} \exp \left( z^T\left( A - \frac{1}{t\epsilon }I - \frac{1}{2}K_0^{-1}\right) z + \frac{2}{t\epsilon }x^Tz\right) {\mathrm{d}}z \\&= \frac{\exp (a)}{\sqrt{\det \left( 2\pi t\epsilon K_0\right) \det \left( \frac{1}{t\epsilon }I + \frac{1}{2}K_0^{-1}-A\right) }}\\&\quad \times \exp \left( \frac{1}{t^2\epsilon ^2}x^T\left( \left( \frac{1}{t\epsilon }I + \frac{1}{2}K_0^{-1}-A\right) ^{-1} - t\epsilon I \right) x \right) , \end{aligned} \end{aligned}$$
(60)

and a similar computation yields

$$\begin{aligned} \begin{aligned}&{\mathcal {H}}^{\mu _1}_{(1-t)\epsilon }[\beta ^\epsilon ](x)\\&\quad = \frac{\exp (b)}{\sqrt{\det \left( 2\pi (1-t) \epsilon K_1\right) \det \left( \frac{1}{(1-t)\epsilon }I + \frac{1}{2}K_1^{-1}-B\right) }}\\&\qquad \times \exp \left( \frac{1}{(1-t)^2\epsilon ^2}x^T\left( \left( \frac{1}{(1-t)\epsilon }I + \frac{1}{2}K_1^{-1}-B\right) ^{-1} - (1-t)\epsilon I \right) x \right) . \end{aligned} \end{aligned}$$
(61)

Putting these together, we get

$$\begin{aligned} \begin{aligned} \mu ^\epsilon _t(x)&= \left( {\mathcal {H}}^{\mu _0}_{t\epsilon }[\alpha ^\epsilon ](x)\right) \left( {\mathcal {H}}^{\mu _1}_{(1-t)\epsilon }[\beta ^\epsilon ](x)\right) \\&= N\exp \left( x^T\left[ \frac{1}{t^2\epsilon ^2}\left( \left( \frac{1}{t\epsilon }I + \frac{1}{2}K_0^{-1}- A\right) ^{-1}-t\epsilon I \right) \right. \right. \\&\quad \left. \left. +\frac{1}{(1-t)^2\epsilon ^2}\left( \left( \frac{1}{(1-t)\epsilon }I + \frac{1}{2}K_1^{-1}- B\right) ^{-1}-(1-t)\epsilon I \right) \right] x \right) \\&:= N\exp \left( x^T\left( T_0(A) + T_1(B)\right) x\right) , \end{aligned} \end{aligned}$$
(62)

where N is a normalizing constant. We can simplify the matrix \(T_0(A) + T_1(B)\) in (62). Write

$$\begin{aligned} N^\epsilon _{10} = \left( I + \frac{16}{\epsilon ^2}K_1^\frac{1}{2}K_0K_1^\frac{1}{2}\right) ^\frac{1}{2}, \end{aligned}$$
(63)

and consider the first term

$$\begin{aligned} \begin{aligned} T_0(A)&=\frac{1}{t^2\epsilon ^2}\left( \left( \frac{1}{t\epsilon }I + \frac{1}{2}K_0^{-1}- A\right) ^{-1}-t\epsilon I \right) \\&= \frac{1}{t^2\epsilon ^2}\left( \left( \frac{(1-t)}{t\epsilon }I + \frac{1}{2}K_0^{-1} - \frac{1}{\epsilon ^2}\left( B - \frac{1}{\epsilon }I - \frac{1}{2} K_1^{-1}\right) ^{-1} \right) ^{-1} - t\epsilon I\right) \\&= \frac{1}{t^2\epsilon }\left( \left( \frac{(1-t)}{t}I + \left( I - \epsilon B\right) ^{-1} \right) ^{-1} - t I\right) \\&= \frac{1}{t^2\epsilon }\left( \left( \frac{t}{(1-t)}I - \frac{t^2}{(1-t)^2}\left( \frac{t}{(1-t)}I + (I-\epsilon B) \right) ^{-1} \right) - t I\right) \\&= \frac{1}{(1-t)^2\epsilon ^2}\left( (1-t)\epsilon I - \left( \frac{1}{(1-t)\epsilon }I -B\right) ^{-1}\right) \\&= \frac{1}{(1-t)\epsilon }I - \frac{4}{(1-t)^2\epsilon ^2}K_1^\frac{1}{2}\left( -I + \frac{4t}{\epsilon (1-t)}K_1 + N^\epsilon _{10} \right) ^{-1}K_1^\frac{1}{2}, \end{aligned} \end{aligned}$$
(64)

where the second and third equalities follow from the two equations for A and B in (47), the fourth from the Woodbury matrix inverse identity

$$\begin{aligned} \left( C+D\right) ^{-1} = C^{-1} - C^{-1}\left( C^{-1} + D^{-1}\right) ^{-1}C^{-1}, \end{aligned}$$
(65)

the fifth from collecting terms, and the last one from substituting in B given in (40).

Likewise, we can substitute B in the second term \(T_1(B)\), which yields

$$\begin{aligned} \begin{aligned} T_1(B)&=\frac{1}{(1-t)^2\epsilon ^2}\left( \left( \frac{1}{(1-t)\epsilon }I + \frac{1}{2}K_1^{-1}- B\right) ^{-1}-(1-t)\epsilon I \right) \\&= \frac{4}{(1-t)^2\epsilon ^2}K_1^\frac{1}{2}\left( I + \frac{4t}{\epsilon (1-t)}K_1 + N^\epsilon _{10} \right) ^{-1}K_1^\frac{1}{2} - \frac{1}{(1-t)\epsilon }I. \end{aligned} \end{aligned}$$
(66)

Putting the two terms together, the multiples of the identity cancel, and we get

$$\begin{aligned} \begin{aligned} T_0(A) + T_1(B)&= \frac{4}{(1-t)^2\epsilon ^2}K_1^\frac{1}{2}\left( \left( I + \frac{4t}{\epsilon (1-t)}K_1 + N^\epsilon _{10} \right) ^{-1}\right. \\&\quad - \left. \left( -I + \frac{4t}{\epsilon (1-t)}K_1 + N^\epsilon _{10} \right) ^{-1} \right) K_1^\frac{1}{2} \\&= \frac{8}{(1-t)^2\epsilon ^2}K_1^\frac{1}{2}\left( I - \left( \frac{4t}{(1-t)\epsilon }K_1 + N^\epsilon _{10} \right) ^2 \right) ^{-1}K_1^\frac{1}{2}. \end{aligned} \end{aligned}$$
(67)

Note that we can write (62) as a Gaussian with covariance matrix \(K^\epsilon _t\)

$$\begin{aligned} \begin{aligned} \mu ^\epsilon _t(x)&= N \exp \left( x^T\left( T_0(A)+T_1(B)\right) x\right) \\&=N \exp \left( -\frac{1}{2}x^T\left( K^\epsilon _t\right) ^{-1}x \right) , \end{aligned} \end{aligned}$$
(68)

and so

$$\begin{aligned} \begin{aligned} K^\epsilon _t&= -\frac{1}{2}\left( T_0(A) + T_1(B)\right) ^{-1}\\&= \frac{(1-t)^2\epsilon ^2}{16}K_1^{-\frac{1}{2}}\left( -I + \left( \frac{4t}{(1-t)\epsilon }K_1 + N^\epsilon _{10} \right) ^2 \right) K_1^{-\frac{1}{2}}\\&= (1-t)^2K_0 + t^2K_1 + t(1-t) \left[ \left( \frac{\epsilon ^2}{16}I + K_0K_1\right) ^{1/2} + \left( \frac{\epsilon ^2}{16}I + K_1K_0\right) ^{1/2}\right] . \end{aligned}\nonumber \\ \end{aligned}$$
(69)

For the last step, we use the formula

$$\begin{aligned} \left( I + \frac{16}{\epsilon ^2}K_0K_1\right) ^{1/2} = K_0^{1/2}\left( I + \frac{16}{\epsilon ^2}K_0^{1/2}K_1K_0^{1/2}\right) ^{1/2} K_0^{-1/2}. \end{aligned}$$
(70)

\(\square \)
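The first and third expressions for \(K^\epsilon _t\) in (42) look quite different, so a numerical cross-check is a useful sanity test. The sketch below evaluates both; the function names are illustrative, and the non-symmetric square root \(\left( \frac{\epsilon ^2}{16}I + K_0K_1\right) ^{1/2}\) is computed through the similarity relation (70) rather than by a general-purpose matrix square root.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def interpolant_cov_sum_form(K0, K1, eps, t):
    """Third expression in (42); valid for all t in [0, 1]."""
    I = np.eye(K0.shape[0])
    R0 = sqrtm_psd(K0)
    N01 = sqrtm_psd(I + 16.0 / eps**2 * R0 @ K1 @ R0)
    # X = (eps^2/16 I + K0 K1)^{1/2}, obtained from (70); X.T is the K1 K0 version
    X = 0.25 * eps * R0 @ N01 @ np.linalg.inv(R0)
    return (1 - t) ** 2 * K0 + t**2 * K1 + t * (1 - t) * (X + X.T)


def interpolant_cov_product_form(K0, K1, eps, t):
    """First expression in (42); requires t < 1."""
    I = np.eye(K0.shape[0])
    R1 = sqrtm_psd(K1)
    R1_inv = np.linalg.inv(R1)
    N10 = sqrtm_psd(I + 16.0 / eps**2 * R1 @ K0 @ R1)
    Q = 4.0 * t / ((1 - t) * eps) * K1 + N10
    return (1 - t) ** 2 * eps**2 / 16.0 * R1_inv @ (Q @ Q - I) @ R1_inv


K0 = np.array([[2.0, 0.3], [0.3, 1.0]])
K1 = np.array([[1.0, -0.4], [-0.4, 3.0]])
for t in (0.25, 0.5, 0.9):
    gap = np.abs(
        interpolant_cov_sum_form(K0, K1, 1.5, t)
        - interpolant_cov_product_form(K0, K1, 1.5, t)
    ).max()
    print(t, gap)
```

At the endpoints, the sum form reduces to \(K_0\) for \(t=0\) and to \(K_1\) for \(t=1\).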

Above we only considered centered Gaussians. Now we combine the results obtained in Proposition 2 and Theorem 2 to deduce the general case. As a consequence, we also derive the corresponding formula for the Sinkhorn divergence between two Gaussians.

Corollary 1

Let \(\mu _i = {\mathcal {N}}(m_i,K_i)\), for \(i=0,1\), be two multivariate Gaussian distributions in \({\mathbb {R}}^n\). Then,

  1. (a)
    $$\begin{aligned} \begin{aligned} \mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)&=\Vert m_0-m_1\Vert ^2 + \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1)\\&\quad - \frac{\epsilon }{2}\left( \mathrm {Tr}(M^\epsilon ) - \log \det (M^\epsilon ) + n\log 2 - 2n \right) \end{aligned} \end{aligned}$$
    (71)
  2. (b)

    The entropic interpolant between \(\mu _0\) and \(\mu _1\) is \(\mu _t^\epsilon = {\mathcal {N}}\left( m_t, K^\epsilon _t\right) \), \(t\in [0,1]\), where \(m_t = (1-t)m_0 + tm_1\), and \(K^\epsilon _t\) is given in (42).

  3. (c)

    Write \(M_{ij}^\epsilon = I + \left( I + \frac{16}{\epsilon ^2}K_iK_j\right) ^\frac{1}{2}\), then

    $$\begin{aligned} \begin{aligned} S_2^\epsilon (\mu _0,\mu _1)&= \Vert m_0 - m_1\Vert _2^2 + \frac{\epsilon }{4} \left( \mathrm {Tr}\left( M_{00}^\epsilon - 2 M_{01}^\epsilon + M_{11}^\epsilon \right) \right. \\&\quad + \left. \log \left( \frac{\det ^2( M_{01}^\epsilon )}{\det (M_{00}^\epsilon )\det (M_{11}^\epsilon )} \right) \right) . \end{aligned} \end{aligned}$$
    (72)
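Formula (72) can be evaluated in the same way as (41). The following sketch is an illustration under the same assumptions as before: the function names are hypothetical, and the trace and log-determinant of \(M^\epsilon _{ij}\) are computed from the symmetric matrix \(I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}K_jK_i^\frac{1}{2}\right) ^\frac{1}{2}\), which has the same eigenvalues.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def trace_and_logdet_M(Ki, Kj, eps):
    """Return Tr(M_ij) and log det(M_ij) using a symmetric matrix similar to M_ij."""
    n = Ki.shape[0]
    I = np.eye(n)
    Ri = sqrtm_psd(Ki)
    N = sqrtm_psd(I + 16.0 / eps**2 * Ri @ Kj @ Ri)
    return n + np.trace(N), np.linalg.slogdet(I + N)[1]


def sinkhorn_divergence_gaussian(m0, K0, m1, K1, eps):
    """Evaluate (72) for N(m0, K0) and N(m1, K1)."""
    tr00, ld00 = trace_and_logdet_M(K0, K0, eps)
    tr01, ld01 = trace_and_logdet_M(K0, K1, eps)
    tr11, ld11 = trace_and_logdet_M(K1, K1, eps)
    mean_term = float(np.sum((np.asarray(m0) - np.asarray(m1)) ** 2))
    return mean_term + 0.25 * eps * (
        tr00 - 2.0 * tr01 + tr11 + 2.0 * ld01 - ld00 - ld11
    )


m = np.zeros(1)
print(sinkhorn_divergence_gaussian(m, np.array([[1.0]]), m, np.array([[4.0]]), 1.0))
```

Unlike \(\mathrm {OT}_{d^2}^\epsilon \), the value vanishes when both arguments coincide, which the formula makes visible because all three \(M^\epsilon _{ij}\) are then equal.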

We will now emphasize an identity that can be derived from the calculations of Theorem 2, which we find useful.

Lemma 1

Let C and D be symmetric positive-definite matrices. Then,

$$\begin{aligned} \begin{aligned}&\frac{4}{\epsilon } D^\frac{1}{2}\left( I + \left( I + \frac{16}{\epsilon ^2}D^\frac{1}{2}C D^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}D^\frac{1}{2}\\&\quad = I - \frac{\epsilon }{4}C^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }C - \left( I + \frac{16}{\epsilon ^2}C^\frac{1}{2}DC^\frac{1}{2}\right) ^\frac{1}{2} \right) C^{-\frac{1}{2}}. \end{aligned} \end{aligned}$$
(73)

Proof

Similarly to (40), let

$$\begin{aligned} \begin{aligned} A&= \frac{1}{4}C^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }C - \left( I + \frac{16}{\epsilon ^2}C^\frac{1}{2}DC^\frac{1}{2}\right) ^\frac{1}{2} \right) C^{-\frac{1}{2}}\\ B&= \frac{1}{4}D^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }D - \left( I + \frac{16}{\epsilon ^2}D^\frac{1}{2}CD^\frac{1}{2}\right) ^\frac{1}{2} \right) D^{-\frac{1}{2}}. \end{aligned} \end{aligned}$$
(74)

Then, substituting B into the first equation of (47) (while remembering to replace \(K_0\) by C and \(K_1\) by D) results in

$$\begin{aligned} \begin{aligned} A&= \frac{1}{\epsilon }I +\frac{1}{\epsilon ^2} \left( B-\frac{1}{\epsilon }I - \frac{1}{2}D^{-1} \right) ^{-1}\\&= \frac{1}{\epsilon }I + \frac{1}{\epsilon ^2}\left( \frac{1}{4}D^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }D - \left( I + \frac{16}{\epsilon ^2}D^\frac{1}{2}CD^\frac{1}{2}\right) ^\frac{1}{2} \right) D^{-\frac{1}{2}}\right. \\&\quad \left. -\frac{1}{\epsilon }I - \frac{1}{2}D^{-1} \right) ^{-1}\\&= \frac{1}{\epsilon }I - \frac{4}{\epsilon ^2}D^\frac{1}{2}\left( I + \left( I + \frac{16}{\epsilon ^2}D^\frac{1}{2}CD^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}D^\frac{1}{2}, \end{aligned} \end{aligned}$$
(75)

and so the result follows from substituting in A, multiplying both sides by \(-\epsilon \), and moving \(-I\) from right-hand side to left-hand side. \(\square \)
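A quick numerical check of identity (73) can be written as follows. This is only a sketch with illustrative names and randomly generated test matrices; agreement up to rounding error supports, but of course does not replace, the proof above.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def lemma1_sides(C, D, eps):
    """Return the left- and right-hand sides of (73)."""
    I = np.eye(C.shape[0])
    Rc, Rd = sqrtm_psd(C), sqrtm_psd(D)
    Rc_inv = np.linalg.inv(Rc)
    N_dc = sqrtm_psd(I + 16.0 / eps**2 * Rd @ C @ Rd)
    N_cd = sqrtm_psd(I + 16.0 / eps**2 * Rc @ D @ Rc)
    lhs = 4.0 / eps * Rd @ np.linalg.inv(I + N_dc) @ Rd
    rhs = I - 0.25 * eps * Rc_inv @ (I + 4.0 / eps * C - N_cd) @ Rc_inv
    return lhs, rhs


# Random symmetric positive-definite matrices (illustrative data only)
rng = np.random.default_rng(1)
G, H = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
C, D = G @ G.T + np.eye(4), H @ H.T + np.eye(4)
lhs, rhs = lemma1_sides(C, D, eps=2.0)
print(np.abs(lhs - rhs).max())
```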

Next, we study the limiting cases of \(\epsilon \) going to 0 and \(\infty \), reconfirming that the Sinkhorn divergence interpolates between 2-Wasserstein and MMD [30, 36, 69].

Proposition 4

Let \(\mu _i = {\mathcal {N}}(m_i,K_i)\), for \(i=0,1\), be two multivariate Gaussian distributions in \({\mathbb {R}}^n\). Then,

  1. (a)
    $$\begin{aligned} \begin{aligned} \lim \limits _{\epsilon \rightarrow 0} \mathrm {OT}^\epsilon _{d^2}(\mu _0,\mu _1)&= W_2^2(\mu _0, \mu _1)\\ \lim \limits _{\epsilon \rightarrow \infty } \mathrm {OT}^\epsilon _{d^2}(\mu _0, \mu _1)&= \Vert m_0-m_1\Vert ^2 + \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1) \end{aligned} \end{aligned}$$
    (76)
  2. (b)
    $$\begin{aligned} \begin{aligned} \lim \limits _{\epsilon \rightarrow 0} S^\epsilon _{2}(\mu _0,\mu _1)&= W_2^2(\mu _0, \mu _1) \\ \lim \limits _{\epsilon \rightarrow \infty } S^\epsilon _{2}(\mu _0,\mu _1)&= \Vert m_0-m_1\Vert ^2 \\ \end{aligned} \end{aligned}$$
    (77)
  3. (c)

    For \(t\in [0,1]\), denote by \(\mu _t\) the 2-Wasserstein geodesic given in (9), and by \(\mu _t^\epsilon \) the entropic 2-Wasserstein interpolant between \(\mu _0\) and \(\mu _1\) given in (42). Then,

    $$\begin{aligned} \lim \limits _{\epsilon \rightarrow 0} \mu _t^\epsilon = \mu _t. \end{aligned}$$
    (78)

Proof

Part a. The \(\epsilon \rightarrow 0\) case is a straight-forward computation

$$\begin{aligned} \begin{aligned} \mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)&= \Vert m_0 - m_1\Vert ^2 + \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1)\\&\quad - \frac{\epsilon }{2}\left( \mathrm {Tr}(M^\epsilon ) - \log \det (M^\epsilon ) + n\log 2 - 2n \right) \\&= \Vert m_0-m_1\Vert ^2 + \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1)\\&\quad - 2 \mathrm {Tr}\left( \frac{\epsilon }{4}I + \left( \frac{\epsilon ^2}{16}I + K_0K_1\right) ^\frac{1}{2}\right) \\&\quad + \frac{\epsilon }{2}\log \left( \det \left( \frac{\epsilon }{4}I + \left( \frac{\epsilon ^2}{16}I + K_0K_1\right) ^\frac{1}{2}\right) \right) \\&\quad +\frac{\epsilon n}{2}(\log {2}-\log {\epsilon }+2). \end{aligned} \end{aligned}$$
(79)

Therefore, since \(\epsilon \log \epsilon \rightarrow 0\) when \(\epsilon \rightarrow 0\),

$$\begin{aligned} \begin{aligned} \lim _{\epsilon \rightarrow 0}\mathrm {OT}_{d^2}^\epsilon (\mu _0, \mu _1)&= \Vert m_0-m_1\Vert ^2 + \mathrm {Tr}(K_0) + \mathrm {Tr}(K_1) - 2 \mathrm {Tr}\left( \left( K_0K_1\right) ^\frac{1}{2}\right) \\&= W_2^2(\mu _0,\mu _1). \end{aligned} \end{aligned}$$
(80)

We now compute the limit when \(\varepsilon \rightarrow \infty \). It is enough to show that the term

$$\begin{aligned} \frac{\epsilon }{2}\left( \mathrm {Tr}(M^\epsilon ) - \log \det \left( M^\epsilon \right) + n\log 2 - 2n\right) , \end{aligned}$$
(81)

goes to 0 when \(\epsilon \rightarrow \infty \). Indeed, denote by \(\{\lambda _i\}_{i=1}^n\) the eigenvalues of \(K_0K_1\). Then,

$$\begin{aligned} \begin{aligned}&\frac{\epsilon }{2}\left( \mathrm {Tr}(M^\epsilon ) - \log \det \left( M^\epsilon \right) + n\log 2 - 2n\right) \\&\quad =\frac{\epsilon }{2}\sum _{i=1}^n \left( -1 + \left( 1 + \frac{16}{\epsilon ^2}\lambda _i\right) ^\frac{1}{2} - \log \left( \frac{1}{2}\left( 1 + \left( 1+\frac{16}{\epsilon ^2}\lambda _i\right) ^\frac{1}{2} \right) \right) \right) . \end{aligned} \end{aligned}$$
(82)

So, first notice that for any \(\lambda > 0\),

$$\begin{aligned} \epsilon \left( -1 + \left( 1 + \frac{16}{\epsilon ^2}\lambda \right) ^\frac{1}{2}\right) = \frac{16\lambda }{\epsilon + \left( \epsilon ^2 + 16\lambda \right) ^\frac{1}{2}} \overset{\epsilon \rightarrow \infty }{\longrightarrow } 0. \end{aligned}$$
(83)

Second, applying L'Hospital's rule to the quotient of the logarithm and \(1/\epsilon \), we have

$$\begin{aligned} \begin{aligned}&\lim \limits _{\epsilon \rightarrow \infty }\epsilon \log \left( \frac{1}{2}\left( 1 + \left( 1 + \frac{16}{\epsilon ^2}\lambda \right) ^\frac{1}{2} \right) \right) \\&\quad \overset{\mathrm {L'Hospital}}{=} \lim \limits _{\epsilon \rightarrow \infty } \frac{16\lambda }{\epsilon \left( 1+\left( 1 + \frac{16}{\epsilon ^2}\lambda \right) ^\frac{1}{2}\right) \left( 1 + \frac{16}{\epsilon ^2}\lambda \right) ^\frac{1}{2}}\\&\quad =0, \end{aligned} \end{aligned}$$
(84)

and so the result follows.

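Part b. This follows by applying part (a) to each of the three entropic OT terms that make up the Sinkhorn divergence (72), noting that \(W_2^2(\mu _i,\mu _i) = 0\).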

Part c. By a straight-forward computation on (42),

$$\begin{aligned} \begin{aligned} K^{\epsilon }_t&= (1-t)^2K_0 + t^2K_1 + t(1-t) \left[ \left( \frac{\epsilon ^2}{16}I + K_0K_1\right) ^{1/2}\right. \\&\quad +\left. \left( \frac{\epsilon ^2}{16}I + K_1K_0\right) ^{1/2}\right] \\&\overset{\epsilon \rightarrow 0}{\longrightarrow }(1-t)^2K_0 + t^2K_1 + t(1-t) [(K_0K_1)^{1/2} + (K_1K_0)^{1/2}]\\&= K_t. \end{aligned} \end{aligned}$$
(85)

\(\square \)
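The two limits in part (a) can be observed numerically from the eigenvalue form (82). The sketch below treats the centered case; the function names and the test matrices are illustrative assumptions, and the tolerances one observes depend on how extreme \(\epsilon \) is chosen.

```python
import numpy as np


def entropic_ot_centered(K0, K1, eps):
    """Entropic OT between N(0, K0) and N(0, K1) via the eigenvalue form (82)."""
    lam = np.linalg.eigvals(K0 @ K1).real  # positive for positive-definite K0, K1
    s = np.sqrt(1.0 + 16.0 * lam / eps**2)
    penalty = 0.5 * eps * np.sum(s - 1.0 - np.log(0.5 * (1.0 + s)))
    return np.trace(K0) + np.trace(K1) - penalty


def wasserstein2_sq_centered(K0, K1):
    """Squared 2-Wasserstein distance between N(0, K0) and N(0, K1)."""
    lam = np.linalg.eigvals(K0 @ K1).real
    return np.trace(K0) + np.trace(K1) - 2.0 * np.sum(np.sqrt(np.clip(lam, 0.0, None)))


K0 = np.diag([1.0, 2.0])
K1 = np.array([[2.0, 0.5], [0.5, 1.0]])
for eps in (1e-4, 1e-2, 1.0, 1e2, 1e4):
    print(eps, entropic_ot_centered(K0, K1, eps))
print("W2^2:", wasserstein2_sq_centered(K0, K1))
print("Tr(K0) + Tr(K1):", np.trace(K0) + np.trace(K1))
```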

Entropic and Sinkhorn barycenters

In this section, we compute barycenters under the entropic regularization of the 2-Wasserstein distance (e.g. [10, 11, 13, 21, 25, 47, 51]) and the 2-Sinkhorn divergence of a population of multivariate Gaussians, restricted to the manifold of Gaussians.

Entropic 2-Wasserstein barycenter

Given N probability measures \(\mu _i\in {\mathcal {P}}({\mathbb {R}}^n)\), \(i=1,2,\ldots ,N\), the entropic barycenter \({\bar{\mu }}\) with weights \(\lambda _i\ge 0\) is defined in the vein of Karcher and Fréchet means, given as

$$\begin{aligned} {\bar{\mu }} := \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mu \in {\mathcal {P}}({\mathbb {R}}^n)} \sum _{i=1}^N \lambda _i\mathrm {OT}^\epsilon _{d^2}(\mu , \mu _i), \quad \sum ^N_{i=1}\lambda _i = 1. \end{aligned}$$
(86)

Then, (86) is strictly convex, as \(\mathrm {OT}_c^\epsilon (\mu ,\nu )\) is strictly convex in both \(\mu \) and \(\nu \) as stated by Prop. 1.

Next, let us focus on the Gaussian case. We do not have a proof that the barycenter of a population of Gaussians is itself Gaussian, so note that the following statement restricts the candidate barycenters to Gaussians.

Theorem 3

(Entropic Barycenter of Gaussians) Let \(\mu _i={\mathcal {N}}\left( m_i,K_i\right) \), \(i=1,2,\ldots ,N\) be a population of multivariate Gaussians. Then, their entropic barycenter (86) with weights \(\lambda _i\ge 0\) such that \(\sum ^N_{i=1}\lambda _i = 1\), restricted to the manifold of Gaussians \({\mathcal {N}}({\mathbb {R}}^n)\), is given by \({\bar{\mu }}={\mathcal {N}}({\bar{m}}, {\bar{K}})\), where

$$\begin{aligned} \begin{aligned} {\bar{m}} = \sum _{i=1}^N \lambda _i m_i, \quad {\bar{K}} = \frac{\epsilon }{4}\sum _{i=1}^N\lambda _i\left( -I + \left( I + \frac{16}{\epsilon ^2}{\bar{K}}^\frac{1}{2}K_i{\bar{K}}^\frac{1}{2}\right) ^\frac{1}{2}\right) . \end{aligned} \end{aligned}$$
(87)

Proof

Proposition 2 allows us to split the geometry into the \(L^2\)-geometry between the means and the entropic 2-Wasserstein geometry between the centered Gaussians (or their covariances). Then, it immediately follows that

$$\begin{aligned} {\bar{m}} = \sum _{i=1}^N \lambda _i m_i. \end{aligned}$$
(88)

Therefore, we restrict our analysis to the case of centered distributions. Note again that, in general, the minimizer of (86) might not be Gaussian, even when the population consists of Gaussians. Here, however, we look for the barycenter on the manifold of Gaussian measures.

We begin with a straight-forward computation of the gradient of the objective given in (86)

$$\begin{aligned} \begin{aligned}&\nabla _K\sum _{i=1}^N \lambda _i \mathrm {OT}^\epsilon _{d^2}\left( {\mathcal {N}}(0,K),{\mathcal {N}}(0,K_i)\right) \\&\quad = \nabla _K \sum _{i=1}^N \lambda _i \left( \mathrm {Tr}K + \mathrm {Tr}K_i - \frac{\epsilon }{2} \mathrm {Tr}\left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) \right. \\&\qquad + \frac{\epsilon }{2}\log \det \left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) \\&\qquad \left. -\frac{\epsilon }{2}\left( n\log 2 - 2n\right) \right) ,\\&\quad = \sum _{i=1}^N \lambda _i \left( \nabla _K \mathrm {Tr}K - \frac{\epsilon }{2} \nabla _K\mathrm {Tr}\left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) \right. \\&\qquad +\left. \frac{\epsilon }{2}\nabla _K\log \det \left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) \right) . \end{aligned} \end{aligned}$$
(89)

where we used the closed-form solution obtained in part (b) of Theorem 2. Now, recall that \(\nabla _K \mathrm {Tr}K = I\). For the second term, it holds that

$$\begin{aligned} \begin{aligned}&\nabla _K \mathrm {Tr}\left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2}\right) \\&\quad = \frac{8}{\epsilon ^2}K_i^\frac{1}{2}\left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^{-\frac{1}{2}}K_i^\frac{1}{2}. \end{aligned} \end{aligned}$$
(90)

Finally, for the third term, we have

$$\begin{aligned} \begin{aligned}&\nabla _K \log \det \left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) \\&\quad = \nabla _K \mathrm {Tr}\left( \mathrm {Log}\left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) \right) \\&\quad = \frac{8}{\epsilon ^2}K_i^\frac{1}{2}\left( \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}K_i^\frac{1}{2}, \end{aligned} \end{aligned}$$
(91)

where \(\mathrm {Log}(M)\) denotes the matrix logarithm, and we use the results

$$\begin{aligned} \log \det (M) = \mathrm {Tr}\left( \mathrm {Log}(M)\right) ,\quad \nabla _M \mathrm {Tr}f(M) = f'(M), \end{aligned}$$
(92)

when f is a matrix function given by a Taylor series, such as the matrix square-root or the matrix logarithm.

Using the Woodbury matrix identity (65), one gets

$$\begin{aligned} (I+A)^{-1} = A^{-1} - (A^2 + A)^{-1}, \end{aligned}$$
(93)

for an invertible A. Substituting (90) and (91) in (89), and using (93) with \(A = \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2}\), we get

$$\begin{aligned} \begin{aligned}&\nabla _K\sum _{i=1}^N \lambda _i \mathrm {OT}^\epsilon _{d^2}\left( {\mathcal {N}}(0,K),{\mathcal {N}}(0,K_i)\right) \\&\quad =\sum _{i=1}^N \lambda _i\left( I - \frac{4}{\epsilon } K_i^\frac{1}{2}\left( I + \left( I + \frac{16}{\epsilon ^2}K_i^\frac{1}{2}KK_i^\frac{1}{2}\right) ^\frac{1}{2} \right) ^{-1}K_i^\frac{1}{2}\right) \\&\quad = \frac{\epsilon }{4}\sum ^N_{i=1}\lambda _iK^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }K - \left( I + \frac{16}{\epsilon ^2}K^\frac{1}{2}K_iK^\frac{1}{2}\right) ^\frac{1}{2} \right) K^{-\frac{1}{2}}. \end{aligned} \end{aligned}$$
(94)

The last equality follows from Lemma 1 with the substitutions \(C = K\) and \(D = K_i\). Finally, setting (94) to zero, we get that the optimal \({\bar{K}}\) satisfies the expression given in (87). \(\square \)
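Equation (87) defines \({\bar{K}}\) only implicitly. One natural way to look for a solution is to iterate the right-hand side, as in the sketch below. This is an illustration under stated assumptions: the function names, the initial guess (the weighted mean of the \(K_i\)) and the stopping rule are choices made here, and the sketch makes no claim about convergence in general.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def entropic_step(K, Ks, weights, eps):
    """Evaluate the right-hand side of (87) at the current iterate K."""
    I = np.eye(K.shape[0])
    R = sqrtm_psd(K)
    return 0.25 * eps * sum(
        w * (sqrtm_psd(I + 16.0 / eps**2 * R @ Ki @ R) - I)
        for w, Ki in zip(weights, Ks)
    )


def entropic_barycenter_cov(Ks, weights, eps, max_iter=500, tol=1e-12):
    """Iterate (87) starting from the weighted mean of the covariances."""
    K = sum(w * Ki for w, Ki in zip(weights, Ks))  # assumed initial guess
    for _ in range(max_iter):
        K_next = entropic_step(K, Ks, weights, eps)
        if np.linalg.norm(K_next - K) < tol:
            return K_next
        K = K_next
    return K


Ks = [np.eye(2), np.eye(2)]
print(entropic_barycenter_cov(Ks, [0.5, 0.5], eps=0.5))
```

When all \(K_i\) equal the identity, (87) reduces to a scalar equation whose non-zero solution is \(1 - \epsilon /2\), so the example prints a matrix close to \(0.75\,I\). This illustrates the shrinking effect of the entropic regularization on the barycenter.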

Sinkhorn barycenter

Now, we compute the barycenter of a population of Gaussians under the Sinkhorn divergence, defined by

$$\begin{aligned} {\bar{\mu }} := \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\mu \in {\mathcal {P}}({\mathbb {R}}^n)} \sum _{i=1}^N \lambda _i S^\epsilon _2(\mu , \mu _i), \quad \lambda _i\ge 0 \text { and } \sum ^N_{i=1}\lambda _i = 1. \end{aligned}$$
(95)

Note that \(S^\epsilon _2(\mu ,\nu )\) is convex in both \(\mu \) and \(\nu \) [30, Thm. 1], and so (95) is convex in \(\mu \). Now, similarly to the entropic barycenter case, we look for the barycenter of a population of Gaussians in the space of Gaussians \({\mathcal {N}}({\mathbb {R}}^n)\).

Theorem 4

(Sinkhorn Barycenter of Gaussians) Let \(\mu _i={\mathcal {N}}\left( m_i,K_i\right) \), \(i=1,2,\ldots ,N\) be a population of multivariate Gaussians. Then, their Sinkhorn barycenter (95) with weights \(\lambda _i\ge 0\) such that \(\sum ^N_{i=1}\lambda _i = 1\), restricted to the manifold of Gaussians \({\mathcal {N}}({\mathbb {R}}^n)\), is given by \({\bar{\mu }}={\mathcal {N}}({\bar{m}}, {\bar{K}})\), where

$$\begin{aligned} {\bar{m}} = \sum _{i=1}^N \lambda _i m_i, \quad {\bar{K}} = \frac{\epsilon }{4}\left( -I + \left( \sum _{i=1}^N\lambda _i\left( I + \frac{16}{\epsilon ^2}{\bar{K}}^\frac{1}{2}K_i{\bar{K}}^\frac{1}{2}\right) ^\frac{1}{2}\right) ^2 \right) ^\frac{1}{2}. \end{aligned}$$
(96)

Proof

As in the entropic 2-Wasserstein case, we take \(\mu ={\mathcal {N}}(0,K)\) to be of Gaussian form. Then, we can compute the gradient

$$\begin{aligned} \begin{aligned}&\nabla _K\sum _{i=1}^N \lambda _i S_2^\epsilon \left( {\mathcal {N}}(0,K), {\mathcal {N}}(0,K_i)\right) \\&\quad = \nabla _K\sum _{i=1}^N \lambda _i \Big ( \mathrm {OT}^\epsilon _{d^2}\left( {\mathcal {N}}(0,K), {\mathcal {N}}(0,K_i)\right) \\&\qquad -\frac{1}{2} \mathrm {OT}^\epsilon _{d^2}\left( {\mathcal {N}}(0,K), {\mathcal {N}}(0,K)\right) \\&\qquad -\frac{1}{2}\mathrm {OT}^\epsilon _{d^2}\left( {\mathcal {N}}(0,K_i), {\mathcal {N}}(0,K_i)\right) \Big ), \end{aligned} \end{aligned}$$
(97)

where the last term does not depend on K, so its gradient vanishes. For the first term, we can use the gradient computed in (94). A very similar computation yields

$$\begin{aligned} \begin{aligned} \nabla _K\mathrm {OT}^\epsilon _{d^2}\left( K,K\right) = \frac{\epsilon }{2}K^{-\frac{1}{2}}\left( I + \frac{4}{\epsilon }K - \left( I + \frac{16}{\epsilon ^2}K^2\right) ^\frac{1}{2} \right) K^{-\frac{1}{2}}. \end{aligned} \end{aligned}$$
(98)

Substituting (94) and (98) into (97) yields

$$\begin{aligned} \begin{aligned}&\nabla _K\sum _{i=1}^N \lambda _i S_2^\epsilon \left( {\mathcal {N}}(0,K), {\mathcal {N}}(0,K_i)\right) \\&\quad = \frac{\epsilon }{4}\sum _{i=1}^N\lambda _i K^{-\frac{1}{2}}\left( \left( I + \frac{16}{\epsilon ^2}K^2\right) ^\frac{1}{2} - \left( I + \frac{16}{\epsilon ^2}K^\frac{1}{2}K_iK^\frac{1}{2}\right) ^\frac{1}{2} \right) K^{-\frac{1}{2}}. \end{aligned} \end{aligned}$$
(99)

When (99) is set to zero, we find that the optimal \({\bar{K}}\) satisfies the relation given in (96). \(\square \)
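As with (87), relation (96) can be explored numerically by iterating its right-hand side. The sketch below is again only an illustration: the function names, the identity matrix as initial guess and the stopping rule are assumptions made here, not part of the theorem.

```python
import numpy as np


def sqrtm_psd(S):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def sinkhorn_step(K, Ks, weights, eps):
    """Evaluate the right-hand side of (96) at the current iterate K."""
    I = np.eye(K.shape[0])
    R = sqrtm_psd(K)
    S = sum(
        w * sqrtm_psd(I + 16.0 / eps**2 * R @ Ki @ R)
        for w, Ki in zip(weights, Ks)
    )
    return 0.25 * eps * sqrtm_psd(S @ S - I)


def sinkhorn_barycenter_cov(Ks, weights, eps, max_iter=500, tol=1e-12):
    """Iterate (96) starting from the identity matrix (assumed initial guess)."""
    K = np.eye(Ks[0].shape[0])
    for _ in range(max_iter):
        K_next = sinkhorn_step(K, Ks, weights, eps)
        if np.linalg.norm(K_next - K) < tol:
            return K_next
        K = K_next
    return K


K0 = np.array([[2.0, 0.5], [0.5, 1.0]])
print(sinkhorn_barycenter_cov([K0, K0], [0.5, 0.5], eps=1.0))
```

If all \(K_i\) equal the same matrix \(K_0\), then \({\bar{K}} = K_0\) satisfies (96) for every \(\epsilon \), and the example recovers \(K_0\) up to numerical tolerance, in contrast to the shrinkage seen for the entropic barycenter.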

Fig. 1

Entropic interpolants \(\mu _t^\epsilon \) between two one-dimensional Gaussians given by \(\mu _0 = {\mathcal {N}}(-2,0.1)\) (blue) and \(\mu _1 = {\mathcal {N}}(2,0.5)\) (red), with varying regularization strengths \(\epsilon \), accompanied by the 2-Wasserstein interpolant in the top-left corner (corresponding to \(\epsilon =0\))

Existence and uniqueness of solution

Theorems 3 and 4 derive the fixed-point equations, namely Eqs. (87) and (96), respectively, that the corresponding barycenter must satisfy, under the assumption that it is strictly positive. For the Sinkhorn barycenter in Theorem 4, existence and uniqueness of the solution were shown in [45] via the Brouwer fixed-point theorem, under the assumption that all \(K_i\) are strictly positive. For the entropic barycenter in Theorem 3, a non-trivial solution exists, in which case it is unique, only when \(\epsilon \) is sufficiently small; otherwise the barycenter is the Dirac \(\delta \)-measure. This was shown in one dimension by [44] and in any finite dimension by [62]. The more general setting, where the barycenter can be singular, is treated in [62].

Fig. 2

Interpolants between two three-dimensional Gaussians with varying regularization strengths \(\epsilon \), accompanied by the 2-Wasserstein interpolant, given by the first row (parallel to the time axis). The following rows visualize the interpolation for \(\epsilon \in \{0.01, 1, 2, 5, 20\}\) in increasing order

Fixed-point iteration

The fixed-point iteration algorithm is defined by

$$\begin{aligned} x_{k+1} = F(x_{k}), \end{aligned}$$
(100)

where the initial point \(x_0\) is chosen by the user. The Banach fixed-point theorem is a well-known result stating that, if F is a contraction mapping on a complete metric space, such an iteration converges to the unique fixed point of F, i.e. the element x satisfying \(x = F(x)\).
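As a minimal sketch of iteration (100), the following Python snippet applies it to the scalar map \(F = \cos \), which is a contraction on \([0, 1]\). The function name `fixed_point_iterate`, the stopping tolerance `tol` and the cap `max_iter` are illustrative assumptions and are not taken from the text.

```python
import math


def fixed_point_iterate(F, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{k+1} = F(x_k) until successive iterates differ by less than tol.

    Returns the last iterate and the number of applications of F.
    """
    x = x0
    for k in range(max_iter):
        x_next = F(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    # No convergence detected within max_iter steps; return the last iterate.
    return x, max_iter


# cos maps [0, 1] into itself and |cos'(x)| <= sin(1) < 1 there,
# so the Banach fixed-point theorem applies.
x_star, n_steps = fixed_point_iterate(math.cos, 1.0)
```

Stopping on the distance between successive iterates is a common heuristic; for a contraction with constant \(q < 1\) it bounds the true error up to the factor \(q/(1-q)\).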

Fig. 3

Barycentric spans of the four corner tensors under the entropic 2-Wasserstein metric and the 2-Sinkhorn divergence for varying \(\epsilon \)

In the case of the 2-Wasserstein barycenter given in (11), the fixed-point iteration can be shown to converge [2] to the unique barycenter. In the entropic 2-Wasserstein and the 2-Sinkhorn cases, we leave such a proof as future work. However, while computing the numerical results in Sect. 5, the fixed-point iteration always converged.
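For the Sinkhorn barycenter, iteration (100) can be sketched by taking F to be the right-hand side of (96). The NumPy code below is an illustrative sketch rather than the implementation used for the figures: the helper names, the identity initialization, the stopping rule and the eigenvalue clipping in `sym_sqrt` are all assumptions, and no convergence guarantee is claimed.

```python
import numpy as np


def sym_sqrt(A):
    """Principal square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh((A + A.T) / 2)  # symmetrize against round-off
    w = np.clip(w, 0.0, None)             # clip tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T


def sinkhorn_map(K, Ks, lambdas, eps):
    """Right-hand side of (96), evaluated at the current iterate K."""
    I = np.eye(K.shape[0])
    K_half = sym_sqrt(K)
    M = sum(
        lam * sym_sqrt(I + (16.0 / eps**2) * K_half @ Ki @ K_half)
        for lam, Ki in zip(lambdas, Ks)
    )
    return (eps / 4.0) * sym_sqrt(M @ M - I)


def sinkhorn_barycenter_cov(Ks, lambdas, eps, K0=None, tol=1e-12, max_iter=1000):
    """Fixed-point iteration K_{k+1} = F(K_k) for the barycenter covariance."""
    K = np.eye(Ks[0].shape[0]) if K0 is None else K0
    for _ in range(max_iter):
        K_next = sinkhorn_map(K, Ks, lambdas, eps)
        if np.linalg.norm(K_next - K) < tol:
            return K_next
        K = K_next
    return K  # last iterate if the tolerance was not reached


# Example: two one-dimensional Gaussians with variances 1 and 4, equal weights.
Ks = [np.array([[1.0]]), np.array([[4.0]])]
lambdas = [0.5, 0.5]
K_bar = sinkhorn_barycenter_cov(Ks, lambdas, eps=1.0)
```

The barycenter mean needs no iteration, since by (96) it is the weighted average \(\sum _{i}\lambda _i m_i\). A simple sanity check of the sketch is that, for a single input Gaussian with weight one, the iteration returns the input covariance.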

Numerical illustrations

We now illustrate the resulting entropic 2-Wasserstein distance and 2-Sinkhorn divergence for Gaussians by employing the closed-form solutions to visualize entropic interpolations between endpoint Gaussians. Furthermore, we employ the fixed-point iteration (100) in conjunction with the fixed-point expressions of the barycenters for their visualization.

First, we consider the interpolant between one-dimensional Gaussians given in Fig. 1, where the densities of the interpolants are plotted. As one can see, increasing \(\epsilon \) causes the middle of the interpolation to flatten out. This results from the Fokker–Planck equation (31), which governs the diffusive evolution of processes that are subjected to Brownian noise. In the limit \(\epsilon \rightarrow \infty \), we would witness a heat death of the distribution.

The same can be seen in the three-dimensional case, depicted in Fig. 2, visualized using the code accompanying [29]. Here, the ellipsoids are determined by the eigenvectors and eigenvalues of the covariance matrix of the corresponding Gaussian, and the colors visualize the level sets of the ellipsoids. Note that a large ellipsoid corresponds to high variance in each direction, and does not actually increase the mass of the distribution. Such visualizations are common in diffusion tensor imaging (DTI), where the tensors (covariance matrices) define Gaussian diffusion of water at the voxels of images produced by magnetic resonance imaging (MRI) [7].
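The relation between a covariance matrix and its ellipsoid can be sketched as follows. The snippet assumes the common convention that the semi-axes point along the eigenvectors and have lengths proportional to the square roots of the eigenvalues; the function `ellipsoid_axes` and its `scale` parameter are hypothetical, and the code accompanying [29] may use a different scaling.

```python
import numpy as np


def ellipsoid_axes(K, scale=1.0):
    """Semi-axis lengths and directions of the ellipsoid x^T K^{-1} x = scale^2.

    Returns (lengths, directions); the columns of `directions` are unit
    eigenvectors of K, ordered by ascending eigenvalue.
    """
    w, V = np.linalg.eigh(K)
    lengths = scale * np.sqrt(np.clip(w, 0.0, None))
    return lengths, V


# Example covariance with variances 4, 1 and 9 along the coordinate axes.
K = np.diag([4.0, 1.0, 9.0])
semi_axes, directions = ellipsoid_axes(K)
```

With `scale=1.0` the semi-axes equal one standard deviation along each principal direction, which is why a larger ellipsoid indicates larger variance rather than more mass.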

Finally, we consider the entropic 2-Wasserstein and Sinkhorn barycenters in Fig. 3. We consider four different Gaussians, placed in the corners of the square fields in the figure, and plot the barycenters for varying weights, resulting in the barycentric span of the four Gaussians. As the results show, the barycenters are very similar under the two frameworks with small \(\epsilon \). However, as \(\epsilon \) is increased, the Sinkhorn barycenter seems to be more resilient to the fattening of the barycenters that can be seen in the entropic 2-Wasserstein case.
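One way to generate weights for such a barycentric span is bilinear interpolation over the unit square. The text does not specify how the weights were chosen, so the scheme below, the function name `bilinear_weights` and the grid size are assumptions made purely for illustration.

```python
import numpy as np


def bilinear_weights(s, t):
    """Weights of the four corner Gaussians at position (s, t) in [0, 1]^2.

    Order: (s, t) = (0, 0), (1, 0), (0, 1), (1, 1). The weights are
    non-negative and sum to one, as required in (96).
    """
    return np.array([(1 - s) * (1 - t), s * (1 - t), (1 - s) * t, s * t])


# A 5 x 5 grid of weight vectors; weights[j, i] belongs to (s, t) = (grid[i], grid[j]).
grid = np.linspace(0.0, 1.0, 5)
weights = np.array([[bilinear_weights(s, t) for s in grid] for t in grid])
```

Each weight vector can then be passed to a barycenter routine to obtain the Gaussian drawn at the corresponding grid position; at the four corners the weights reduce to unit vectors, so the corner Gaussians themselves are recovered.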