Information Geometry, Volume 1, Issue 2, pp 137–179

# Wasserstein Riemannian geometry of Gaussian densities

• Luigi Malagò
• Luigi Montrucchio
• Giovanni Pistone
Research Paper

## Abstract

The Wasserstein distance on multivariate non-degenerate Gaussian densities is a Riemannian distance. After reviewing the properties of the distance and the metric geodesic, we present an explicit form of the Riemannian metric on positive-definite matrices and compute its tensor form with respect to the trace inner product. The tensor is a matrix which is the solution to a Lyapunov equation. We compute the explicit formula for the Riemannian exponential, the normal coordinates charts and the Riemannian gradient. Finally, the Levi-Civita covariant derivative is computed in matrix form together with the differential equation for the parallel transport. Although all computations are given in matrix form, we also discuss the use of a special moving frame.

## Keywords

Information geometry · Gaussian distribution · Wasserstein distance · Riemannian metrics · Natural gradient · Riemannian exponential · Normal coordinates · Levi-Civita covariant derivative · Optimization on positive-definite symmetric matrices

## Mathematics Subject Classification

15B48 53C23 53C25 60D05

## 1 Introduction

Given two probability measures $$\nu _1$$ and $$\nu _2$$ on $$\mathbb {R}^n$$, with finite second moments, consider the set $$\mathscr {P}(\nu _1,\nu _2)$$ of probability measures on the product sample space $$\mathbb {R}^{2n}$$, such that the two n-dimensional margins have the prescribed distributions, $$X_1 \sim \nu _1$$ and $$X_2 \sim \nu _2$$. The index
\begin{aligned} W^2 = \inf \left\{ {{\mathrm{\mathbb E}}}_{\mu }\left[ \left\| X_1-X_2\right\| ^2\right] \vert \mu \in {\mathscr {P}}(\nu _1,\nu _2) \right\} \end{aligned}
as a measure of dissimilarity between distributions has been considered by many classical authors, e.g., C. Gini, P. Lévy, and M.R. Fréchet. There is a considerable contemporary literature discussing the index W, which is usually called the Wasserstein distance; see, e.g., the monograph by C. Villani [37]. We also want to mention Y. Brenier [9] and R.J. McCann [27].
There is an important particular case, where the above problem reduces to the Monge transport problem. Borrowing the argument from M. Knott and C.S. Smith [18], assume $$\varPhi :\mathbb {R}^n \rightarrow \mathbb {R}$$ is a smooth convex function and $$\nabla \varPhi (X_1) \sim \nu _2$$. Clearly, the condition
\begin{aligned} {{\mathrm{\mathbb E}}}_{\mu }\left[ \left\| X_1 - \nabla \varPhi (X_1)\right\| ^2\right] \le {{\mathrm{\mathbb E}}}_{\mu }\left[ \left\| X_1-X_2\right\| ^2\right] , \quad \mu \in {\mathscr {P}}(\nu _1,\nu _2) , \end{aligned}
turns out to be equivalent to $${{\mathrm{\mathbb E}}}_{\mu }\left[ X_1 \cdot \nabla \varPhi (X_1)\right] \ge {{\mathrm{\mathbb E}}}_{\mu }\left[ X_1 \cdot X_2\right]$$, because the second moments of $$X_1$$ and $$X_2$$ are fixed by the margins. The latter inequality shows that the minimum quadratic distance is attained. In view of this new formulation, let $$\varPhi ^*$$ be the convex conjugate of $$\varPhi$$. By the Young inequality we have
\begin{aligned} X_1 \cdot X_2 \le \varPhi (X_1) + \varPhi ^*(X_2) \end{aligned}
as well as the Young equality
\begin{aligned} X_1 \cdot \nabla \varPhi (X_1) = \varPhi (X_1) + \varPhi ^*(\nabla \varPhi (X_1)) . \end{aligned}
By assumption $$X_2 \sim \nabla \varPhi (X_1)$$, so that
\begin{aligned}&{{\mathrm{\mathbb E}}}_{\mu }\left[ X_1 \cdot \nabla \varPhi (X_1)\right] = {{\mathrm{\mathbb E}}}_{\mu }\left[ \varPhi (X_1) + \varPhi ^*(\nabla \varPhi (X_1))\right] = \\&\quad {{\mathrm{\mathbb E}}}_{\mu }\left[ \varPhi (X_1) + \varPhi ^*(X_2)\right] \ge {{\mathrm{\mathbb E}}}_{\mu }\left[ X_1 \cdot X_2\right] , \end{aligned}
which proves that $$\nabla \varPhi (X_1)$$ solves the Monge problem.

This argument, including an existence proof, is in Y. Brenier [9]. In the present paper we shall study the same problem where all the involved distributions are Gaussian. It would be feasible to reduce the Gaussian case to the general one. However, we resort to methods specially suited for this case.

### 1.1 The Gaussian case

Given two Gaussian distributions $$\nu _i={{\mathrm{N}}}_{n}\left( \mu _i,\varSigma _i\right)$$, $$i=1,2$$, consider the set $${\mathscr {G}}(\nu _1,\nu _2)$$ of Gaussian distributions on $$\mathbb {R}^{2n}$$ such that the two n-dimensional margins have the prescribed distributions, $$X_i \sim \nu _i$$. The corresponding index is
\begin{aligned} W^2 = \inf \left\{ {{\mathrm{\mathbb E}}}_{\mu }\left[ \left\| X_1-X_2\right\| ^2\right] \vert \mu \in {\mathscr {G}}(\nu _1,\nu _2) \right\} . \end{aligned}
(1)
Observe that if $$\mu _1=\mu _2=0$$ and U is a symmetric non-negative definite matrix such that $$U\varSigma _1U = \varSigma _2$$, then the previous argument applies by means of the convex function $$\varPhi (x) = \frac{1}{2} x^t U x$$.
The value of $$W^2$$ in Eq. (1) as a function of the mean and the dispersion matrix has been computed by some authors, in particular: I. Olkin and F. Pukelsheim [28], D. C. Dowson and B. V. Landau [12], C. R. Givens and R. M. Shortt [14], M. Gelbrich [13]. They found the (equivalent) forms
\begin{aligned} \begin{aligned} W^2&= \left\| \mu _1-\mu _2\right\| ^2 + {{\mathrm{Tr}}}\left( \varSigma _1+\varSigma _2 - 2 \left( \varSigma _1^{1/2}\varSigma _2\varSigma _1^{1/2}\right) ^{1/2}\right) \\&= \left\| \mu _1-\mu _2\right\| ^2 + {{\mathrm{Tr}}}\left( \varSigma _1+\varSigma _2 - 2 \left( \varSigma _1\varSigma _2\right) ^{1/2}\right) . \end{aligned} \end{aligned}
(2)
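As a numerical sanity check of Eq. (2), the following sketch evaluates both forms with NumPy. The helper names `sym_sqrt`, `w2_gaussian` and `w2_gaussian_alt` are illustrative and do not come from the paper. The second form relies on the fact that the trace of $$(\varSigma _1\varSigma _2)^{1/2}$$ is the sum of the square roots of the eigenvalues of $$\varSigma _1\varSigma _2$$.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def w2_gaussian(m1, S1, m2, S2):
    """Squared distance, first form of Eq. (2)."""
    r = sym_sqrt(S1)
    cross = sym_sqrt(r @ S2 @ r)
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)

def w2_gaussian_alt(m1, S1, m2, S2):
    """Squared distance, second form of Eq. (2)."""
    # S1 @ S2 is similar to a non-negative definite matrix, so its
    # eigenvalues are real and non-negative up to rounding.
    eig = np.clip(np.linalg.eigvals(S1 @ S2).real, 0.0, None)
    return (np.sum((m1 - m2) ** 2) + np.trace(S1 + S2)
            - 2.0 * np.sum(np.sqrt(eig)))

# Random well-conditioned test data (an arbitrary choice for illustration).
rng = np.random.default_rng(0)
n = 4
M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
S1, S2 = M1 @ M1.T + n * np.eye(n), M2 @ M2.T + n * np.eye(n)
m1, m2 = rng.standard_normal(n), rng.standard_normal(n)

first = w2_gaussian(m1, S1, m2, S2)
second = w2_gaussian_alt(m1, S1, m2, S2)
print(first, second)  # the two values should agree up to rounding
```

In one dimension with zero means the formula reduces to $$(\sigma _1-\sigma _2)^2$$, which gives a convenient hand check.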
Further interpretations of W are available. R. Bhatia et al. [8] showed that W is also the solution of constrained minimization problems for the Frobenius matrix norm $$\left\| M\right\| = \sqrt{{{\mathrm{Tr}}}\left( M^*M\right) }$$, when $$\mu _1=\mu _2=0$$. In particular,
\begin{aligned} W = \min \left\{ \left\| \varSigma _1^{1/2}U-\varSigma _2^{1/2}V\right\| \vert U \text { and } V \text { orthogonal} \right\} . \end{aligned}
Notice that $$\varSigma ^{1/2}U$$ is the generic transformation of the standard Gaussian to the Gaussian with dispersion matrix $$\varSigma$$.
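The minimization over orthogonal matrices can be illustrated numerically. Since the Frobenius norm is invariant under orthogonal transformations, the two-sided problem reduces to minimizing $$\Vert \varSigma _1^{1/2}-\varSigma _2^{1/2}R\Vert$$ over a single orthogonal R, which is an orthogonal Procrustes problem solvable by a singular value decomposition. The sketch below assumes this reduction; the function names are hypothetical.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def procrustes_distance(S1, S2):
    """Minimum over orthogonal R of ||S1^{1/2} - S2^{1/2} R||_F, with a minimizer."""
    A, B = sym_sqrt(S1), sym_sqrt(S2)
    # Minimizing the norm amounts to maximizing Tr(A B R). With the SVD
    # B A = P diag(s) Qt, the maximizer is R = P Qt and the maximum is sum(s).
    P, s, Qt = np.linalg.svd(B @ A)
    R = P @ Qt
    return np.linalg.norm(A - B @ R), R

def w_gaussian(S1, S2):
    """Zero-mean distance W from the trace formula of Eq. (2)."""
    r = sym_sqrt(S1)
    val = np.trace(S1 + S2 - 2.0 * sym_sqrt(r @ S2 @ r))
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(1)
n = 4
M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
S1, S2 = M1 @ M1.T + n * np.eye(n), M2 @ M2.T + n * np.eye(n)

dist, R = procrustes_distance(S1, S2)
print(dist, w_gaussian(S1, S2))  # should agree up to rounding
```

Any other orthogonal matrix in place of `R` should give a Frobenius distance that is at least as large.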

Because of the exponent 2 in Eq. (1), the W distance is more precisely called $$L^2$$-Wasserstein distance. Other exponents or other distances could be used in the definition. The quadratic case is particularly relevant as W is a Riemannian distance. More references will be given later.

From an Information Geometry perspective, we can mimic the argument of the seminal paper by Amari [4], who derived the notions of both the Fisher metric and the natural gradient from the second-order approximation of the Kullback-Leibler divergence.

It will be shown (see Sect. 2) that the value $$W^2$$ of Eq. (2) has the differential second-order expansion for small H:
\begin{aligned} {{\mathrm{Tr}}}\left( \varSigma + (\varSigma +H) - 2\left( \varSigma ^{1/2}(\varSigma +H)\varSigma ^{1/2}\right) ^{1/2}\right) \simeq {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] \varSigma \mathscr {L} _ {\varSigma } \left[ H\right] \right) ,\nonumber \\ \end{aligned}
(3)
where $$\mathscr {L} _ {\varSigma } \left[ H\right] = X$$ is the solution to the Lyapunov equation $$X \varSigma + \varSigma X = H$$.
The quadratic form in the RHS of Eq. (3) provides a candidate to be the Riemannian inner product associated with the distance W. In addition, if f is a smooth real function defined on a small W-sphere, i.e., $$W(\varSigma ,\varSigma +H) = \epsilon$$ for small $$\epsilon$$, then the increment $$f(\varSigma +H)-f(\varSigma )$$ is maximized along the direction
\begin{aligned} {{\mathrm{grad}}}f(\varSigma ) =\nabla f(\varSigma )\varSigma +\varSigma \nabla f(\varSigma ) , \end{aligned}
where $$\nabla$$ denotes the Euclidean gradient. The operator $${{\mathrm{grad}}}$$ is Amari’s natural gradient, i.e., the Riemannian gradient.

It is remarkable that all geometric objects in the previous equations may be expressed as matrix operations. In this paper, we systematically develop the Wasserstein geometry of Gaussian models according to such a formalism.

### 1.2 Relations with the literature on the general transport theory

The Wasserstein distance and its relevant geometry can be studied non-parametrically also for general distributions. We do not pursue this direction and refer to the monograph by C. Villani [37]. The $$L^2$$-Wasserstein metric geometry has been shown to be Riemannian by F. Otto [29, §4] and J. Lott [21]. Cf. the earlier account by J.D. Lafferty [19].

Let us briefly discuss Otto’s approach in the language of Information Geometry, i.e., with reference to S. Amari and H. Nagaoka [3]. In view of the non-parametric approach first introduced in [33], and denoting by $${\mathscr {M}}$$ the set of n-dimensional Gaussian densities with zero mean, the vector bundle
\begin{aligned} H{\mathscr {M}} = \left\{ (\rho ,\phi ) \vert \rho \in \mathscr {M}, \phi \in L^2(\rho ), \int \phi \ \rho = 0 \right\} \end{aligned}
is the Amari Hilbert bundle on $${\mathscr {M}}$$. The Hilbert bundle contains the statistical bundle whose fibers consist of the scores $$\left. \frac{d}{dt} \log \rho (t) \right| _{t=0}$$ for all smooth curves $$t \mapsto \rho (t) \in {\mathscr {M}}$$ with $$\rho (0)=\rho$$. In turn, the statistical bundle is the tangent space of $${\mathscr {M}}$$ considered as an exponential manifold, see [32, 33].
In our present case, since the model $${\mathscr {M}}$$ is an exponential family, the natural parameter is the concentration matrix $$C = \varSigma ^{-1}$$. The log-likelihood is
\begin{aligned} \log \rho (y;C) = -\frac{n}{2} \log 2\pi + \frac{1}{2} \log \det C - \frac{1}{2} y^*C y . \end{aligned}
If V is a symmetric matrix, the derivative of $$C \mapsto \log \rho (y;C)$$ in the direction V is
\begin{aligned} d_V \log \rho (y;C) = \frac{1}{2} {{\mathrm{Tr}}}\left( C^{-1}V\right) - \frac{1}{2} y^*Vy = {{\mathrm{Tr}}}\left( \phi (y;C)V\right) \end{aligned}
where $$\phi (y;C) = \frac{1}{2}(C^{-1} - yy^*)$$ is a symmetric matrix identified with a linear operator on symmetric matrices $${{\mathrm{Sym}}}\left( n\right)$$, equipped with the Frobenius inner product. The fiber at $$\rho (\cdot ;C)$$ consists of the vector space of functions $${{\mathrm{Tr}}}\left( \phi (\cdot ;C)V\right)$$, $$V \in {{\mathrm{Sym}}}\left( n\right)$$. The inner product in the Hilbert bundle, restricted to the parameterized statistical bundle, is the Fisher metric
\begin{aligned}&F_C(U,V) = \int d_U \log \rho (y;C)d_V \log \rho (y;C) \ \rho (y;C) \ dy = \nonumber \\&\quad - \int d_U {{\mathrm{Tr}}}\left( \phi (y;C)V\right) \ \rho (y;C) \ dy = \frac{1}{2} {{\mathrm{Tr}}}\left( UC^{-1}VC^{-1}\right) . \end{aligned}
(4)
The study of the Fisher metric in the Gaussian case has been done first by L.T. Skovgaard [35].
F. Otto [29, §1.3], who was motivated by the study of a class of partial differential equations, considered an inner product defined on smooth functions of the $$\rho$$-fiber of the Hilbert bundle, as
\begin{aligned} (\phi _1,\phi _2) \mapsto \int \nabla \phi _1(x) \cdot \nabla \phi _2(x) \ \rho (x) \ dx . \end{aligned}
(5)
In the non-parametric case, Otto’s metric of Eq. (5) is related to the Wasserstein distance. For a detailed study of such a metric see J. Lott [21].
If we apply this definition to our score $${{\mathrm{Tr}}}\left( \phi (y;C)V\right) = {{\mathrm{Tr}}}\left( \frac{1}{2}(C^{-1}-yy^*)V\right)$$ and $$V \in {{\mathrm{Sym}}}\left( n\right)$$, the gradient is $$\nabla {{\mathrm{Tr}}}\left( \phi (y;C)V\right) = - V y$$ and the metric becomes
\begin{aligned}&G_C(U,V) = \int \nabla {{\mathrm{Tr}}}\left( \phi (y;C)U\right) \cdot \nabla {{\mathrm{Tr}}}\left( \phi (y;C)V\right) \ \rho (y;C) \ dy \nonumber \\&\quad = \int y^* V U y \ \rho (y;C) \ dy = {{\mathrm{Tr}}}\left( UC^{-1}V\right) . \end{aligned}
(6)
The equivalence between the metric in Eq. (6) and the quadratic form in Eq. (3) can be seen by a change of parameterization both in $${\mathscr {M}}$$ and in each fiber. First, one must define the inner product at $$\varSigma$$ to be the inner product computed in the bijection $$\varSigma \leftrightarrow C$$, to get $${{\mathrm{Tr}}}\left( U \varSigma V\right)$$, which is the form of the metric provided by A. Takatsu [36, Prop. A]. Second, one has to change the parameterization on each fiber of the statistical bundle by $$U \mapsto U \varSigma + \varSigma U$$. The resulting change of parameterization in the statistical bundle, $$(C,U) \mapsto (C^{-1},UC^{-1}+C^{-1}U)$$, whose inverse is $$(\varSigma ,X) \mapsto (\varSigma ^{-1},\mathscr {L} _ {\varSigma } \left[ X\right] )$$, produces the desired inner product.
We mention also that the Machine Learning literature discusses a divergence introduced by A. Hyvärinen [16], which is related to Otto’s metric. Precisely, in the concentration parameterization the Hyvärinen divergence is
\begin{aligned} {\text {DH}}\left( D \vert C \right)= & {} \frac{1}{2} \int \left| \nabla \log \rho (y;D) - \nabla \log \rho (y;C)\right| ^2 \rho (y;C) \ dy\\= & {} \frac{1}{2} \int \left| Dy-Cy\right| ^2 \rho (y;C) \ dy = \frac{1}{2} {{\mathrm{Tr}}}\left( C^{-1}(D-C)^2\right) , \end{aligned}
and the second derivative of $$D \mapsto {\text {DH}}\left( D \vert C \right)$$ at C is
\begin{aligned} d^2{\text {DH}}\left( C \vert C \right) [X,Y] = {{\mathrm{Tr}}}\left( XC^{-1}Y\right) . \end{aligned}
In Statistics, Hyvärinen divergence is related to local proper scoring rules, see M. Parry et al. [31].
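The second derivative of the Hyvärinen divergence can be checked numerically. The sketch below evaluates $$\frac{1}{2}\int |Dy-Cy|^2 \rho (y;C)\,dy$$ in closed form as $$\frac{1}{2}{{\mathrm{Tr}}}\left( C^{-1}(D-C)^2\right)$$, using $${{\mathrm{\mathbb E}}}[yy^*]=C^{-1}$$, and compares a mixed second difference at $$D=C$$ with $${{\mathrm{Tr}}}\left( XC^{-1}Y\right)$$. The function names and the step size `h` are illustrative choices.

```python
import numpy as np

def hyvarinen(D, C):
    """0.5 * E|Dy - Cy|^2 for y ~ N(0, C^{-1}), i.e. 0.5 * Tr(C^{-1} (D - C)^2)."""
    Delta = D - C
    return 0.5 * np.trace(np.linalg.solve(C, Delta @ Delta))

def second_derivative(C, X, Y, h=1e-2):
    """Mixed second difference of D -> DH(D | C) at D = C in directions X, Y."""
    f = lambda s, t: hyvarinen(C + s * X + t * Y, C)
    return (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4.0 * h * h)

def random_sym(n, rng):
    G = rng.standard_normal((n, n))
    return 0.5 * (G + G.T)

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)          # concentration matrix
X, Y = random_sym(n, rng), random_sym(n, rng)

numeric = second_derivative(C, X, Y)
closed_form = np.trace(X @ np.linalg.solve(C, Y))   # Tr(X C^{-1} Y)
print(numeric, closed_form)
```

Because the divergence is quadratic in D, the finite difference is exact up to rounding, so no limit in `h` is needed.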

### 1.3 Overview

The first two sections of the paper are mostly a review of known material. In Sec. 2 we recall some properties of the space of symmetric matrices. In particular, we study the Riccati equation, the Lyapunov equation, and we calculate derivatives for the two mappings $${{\mathrm{sq}}}:A \mapsto A^2$$ and $${{\mathrm{sqrt}}}:A \mapsto A^{1/2}$$. The mapping $$\sigma :A \mapsto AA^*$$, where A is a non-singular square matrix, is shown to be a submersion and the horizontal vectors at each point are computed. Despite our manifold being finite dimensional, there is no need to choose a basis, as all operations of interest are matrix operations. For that reason, we rely on the language of non-parametric differential geometry of W. Klingenberg [17] and S. Lang [20].

In Sec. 3 we discuss known results about the metric geometry induced by the Wasserstein distance. These results are re-stated in Prop. 3 and, for the sake of completeness, we provide a further proof inspired by [12]. Prop. 4 provides an explicit metric geodesic, as done by R.J. McCann [27, Example 1.7].

The space of non-degenerate Gaussian measures (or, equivalently, the space of positive definite matrices) can be endowed with a Riemann structure that induces the Wasserstein distance. This is elaborated in Sec. 4, where we use the presentation given by [36], cf. also [8], which in turn adapts to the Gaussian case the original work [29, §4].

The remaining part of the paper is offered as a new contribution to this topic. The Wasserstein Riemannian metric turns out to be
\begin{aligned} W_\varSigma (U,V) = {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ U\right] \varSigma \, \mathscr {L} _ {\varSigma } \left[ V\right] \right) = \frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ U\right] V\right) , \end{aligned}
(7)
at each matrix $$\varSigma$$, where U and V are symmetric matrices. By submersion methods we study the more general problem of the horizontal surfaces in $${\text {GL}}\left( n\right)$$, characterized in Prop. 8. As a specialized case we get the Riemannian geodesic which agrees with the metric geodesic of Sect. 3.
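The two expressions of the metric in Eq. (7) can be compared numerically. The sketch below is an illustration under stated assumptions: the Lyapunov operator is computed through an eigendecomposition of $$\varSigma$$ (one of several possible methods), and the names `lyap`, `metric_long` and `metric_short` are hypothetical.

```python
import numpy as np

def lyap(S, V):
    """Solve X S + S X = V for symmetric positive-definite S."""
    w, Q = np.linalg.eigh(S)
    Vt = Q.T @ V @ Q                      # V in the eigenbasis of S
    return Q @ (Vt / (w[:, None] + w[None, :])) @ Q.T

def metric_long(S, U, V):
    """Tr(L_S[U] S L_S[V]), first form of Eq. (7)."""
    return np.trace(lyap(S, U) @ S @ lyap(S, V))

def metric_short(S, U, V):
    """(1/2) Tr(L_S[U] V), second form of Eq. (7)."""
    return 0.5 * np.trace(lyap(S, U) @ V)

def random_sym(n, rng):
    G = rng.standard_normal((n, n))
    return 0.5 * (G + G.T)

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)
U, V = random_sym(n, rng), random_sym(n, rng)

print(metric_long(S, U, V), metric_short(S, U, V))  # should coincide
```

The short form needs a single Lyapunov solve, which is why it may be preferable in numerical work.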

The explicit form of Riemannian exponential is obtained in Sect. 5. The natural (Riemannian) gradient is discussed in Sect. 6 and some applications to optimization are provided in Sect. 6.1. The analysis of the second-order geometry is treated in Sect. 7, where we compute the Levi-Civita covariant derivative, the Riemannian Hessian, and discuss other related topics. However, the curvature tensor will not be taken into consideration in the present paper.

In the final Sect. 8, we discuss the results in view of applications and in Information Geometry of statistical sub-models of the Gaussian manifold.

## 2 Symmetric matrices

The set $${\mathscr {G}}^{n}$$ of Gaussian distributions on $${\mathbb {R}}^{n}$$ is in 1-to-1 correspondence with the space of its parameters $${\mathscr {G}}^{n}\ni {\text {N}}_{n}\left( \mu ,\varSigma \right) \leftrightarrow (\mu ,\varSigma )\in {\mathbb {R}}^{n}\times {\text {Sym}}^{+}\left( n\right)$$. Moreover, $${\mathscr {G}}^{n}$$ is closed for the weak convergence and the identification is continuous in both directions. A reference for Gaussian distributions is the monograph by T.W. Anderson [6].

For ease of later reference, we recall a few results on spaces of matrices. General references are the monographs by P. R. Halmos [15], J. R. Magnus and H. Neudecker [22], and R. Bhatia [7].

The vector space of $$n\times m$$ real matrices is denoted by $${\text {M}}(n\times m)$$, while square matrices are denoted $${\text {M}}(n)={\text {M}}(n\times n)$$. The space $${\text {M}}(n\times m)$$ is a Euclidean space of dimension nm and the vectorization mapping $${\text {M}} (n\times m)\ni A\mapsto \mathbf {vec}\left( A\right) \in {\mathbb {R}}^{nm}$$ is an isometry for the Frobenius inner product $$\left\langle A,B\right\rangle =(\mathbf {vec}\left( A\right) )^{*}(\mathbf {vec}\left( B\right) )={\text {Tr}}\left( AB^{*}\right)$$.

Symmetric matrices $${{\mathrm{Sym}}}\left( n\right)$$ form a vector subspace of M(n) whose orthogonal complement is the space of anti-symmetric matrices $${{\mathrm{Sym}}}^{\perp }\left( n\right)$$. For symmetric matrices, we will find it convenient to use the equivalent inner product $$\left\langle A,B\right\rangle _{2}=\frac{1}{2}{\text {Tr}}\left( AB\right)$$, see e.g. Eq. (18) below. The closed pointed cone of non-negative-definite symmetric matrices is denoted by $${{\mathrm{Sym}}}^+\left( n\right)$$ and its interior, the open cone of the positive-definite symmetric matrices, by $${{\mathrm{Sym}}}^{++}\left( n\right)$$.

Given $$A,B\in {\text {Sym}}\left( n\right)$$, the equation $$TAT=B$$ is called the Riccati equation. If $$A\in {\text {Sym}}^{++}\left( n\right)$$ and $$B\in {\text {Sym}}^{+}\left( n\right)$$, then the equation $$TAT=B$$ has a unique solution $$T\in {\text {Sym}}^{+}\left( n\right)$$. In fact, from $$TAT=B$$ it follows $$A^{1/2}TA^{1/2}A^{1/2}TA^{1/2}=A^{1/2}BA^{1/2}$$ and, in turn, $$A^{1/2}TA^{1/2}=\left( A^{1/2}BA^{1/2}\right) ^{1/2}$$ because $$T \in {{\mathrm{Sym}}}^+\left( n\right)$$. Hence, the solution to the Riccati equation is
\begin{aligned} T=A^{-1/2}\left( A^{1/2}BA^{1/2}\right) ^{1/2}A^{-1/2} . \end{aligned}
(8)
Notice that $${{\mathrm{det}}}\left( T\right) = {{\mathrm{det}}}\left( A\right) ^{-1/2} {{\mathrm{det}}}\left( B\right) ^{1/2}$$, consequently $${{\mathrm{det}}}\left( T\right) > 0$$ if $${{\mathrm{det}}}\left( B\right) > 0$$. In terms of random variables, if $$X \sim {{\mathrm{N}}}_{n}\left( 0,A\right)$$ and $$Y \sim {{\mathrm{N}}}_{n}\left( 0,B\right)$$, then T is the unique matrix of $${{\mathrm{Sym}}}^+\left( n\right)$$ such that $$Y \sim TX$$.
A more compact closed-form solution of the Riccati equation is available. Given $$A \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ and $$B \in {{\mathrm{Sym}}}^+\left( n\right)$$, observe that $$AB = A^{1/2}(A^{1/2}BA^{1/2})A^{-1/2}$$. By similarity, the eigenvalues of AB are non-negative, hence the square root
\begin{aligned} (AB)^{1/2} = A^{1/2}(A^{1/2}BA^{1/2})^{1/2}A^{-1/2} \end{aligned}
(9)
is well defined, see [7, Ex. 4.5.2]. Therefore, an equivalent formulation of Eq. (8) is
\begin{aligned} T = A^{-1} A^{1/2}\left( A^{1/2}BA^{1/2}\right) ^{1/2}A^{-1/2} = A^{-1} (AB)^{1/2} . \end{aligned}
(10)
Since $$AB = A(BA)A^{-1}$$, the eigenvalues of AB and BA are identical, so that the same argument used before also yields
\begin{aligned} T = (BA)^{1/2} A^{-1} . \end{aligned}
(11)
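The three expressions (8), (10) and (11) for the Riccati solution can be compared numerically. In the sketch below, which is only an illustration, $$(AB)^{1/2}$$ is built from Eq. (9) and $$(BA)^{1/2}$$ is obtained as its transpose; the names `riccati` and `sqrt_AB` are not from the paper.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def riccati(A, B):
    """Symmetric non-negative definite solution T of T A T = B, Eq. (8)."""
    Ah = sym_sqrt(A)
    Ahi = np.linalg.inv(Ah)
    return Ahi @ sym_sqrt(Ah @ B @ Ah) @ Ahi

def sqrt_AB(A, B):
    """Square root of the product A B as in Eq. (9)."""
    Ah = sym_sqrt(A)
    return Ah @ sym_sqrt(Ah @ B @ Ah) @ np.linalg.inv(Ah)

rng = np.random.default_rng(4)
n = 4
M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A, B = M1 @ M1.T + n * np.eye(n), M2 @ M2.T + n * np.eye(n)

T = riccati(A, B)
R = sqrt_AB(A, B)
T10 = np.linalg.inv(A) @ R        # Eq. (10): A^{-1} (AB)^{1/2}
T11 = R.T @ np.linalg.inv(A)      # Eq. (11): (BA)^{1/2} A^{-1}, with (BA)^{1/2} = ((AB)^{1/2})^T
print(np.linalg.norm(T @ A @ T - B), np.linalg.norm(T - T10), np.linalg.norm(T - T11))
```

All three printed residuals should be at rounding level for well-conditioned inputs.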
The square mapping $${{\mathrm{sq}}}:A \mapsto A^2$$ is a bijection of $${{\mathrm{Sym}}}^{++}\left( n\right)$$ onto itself with derivative $$d_X {{\mathrm{sq}}}(A) = XA + AX$$. Hence, the derivative operator $$d{{\mathrm{sq}}}(A)$$ is invertible. An alternative notation for the derivative, which we find convenient to use now and then, is $$d_X {{\mathrm{sq}}}(A) = d {{\mathrm{sq}}}(A)[X]$$.
For each assigned matrix $$V \in {{\mathrm{Sym}}}\left( n\right)$$, the matrix $$X = (d {{\mathrm{sq}}}(A))^{-1} V$$ is the unique solution X in the space $${{\mathrm{Sym}}}\left( n\right)$$ to the Lyapunov equation
\begin{aligned} V= X A + A X . \end{aligned}
(12)
Its solution will be written $$X = \mathscr {L} _ {A} \left[ V\right]$$. Clearly we have also
\begin{aligned} V =\mathscr {L} _ {A} \left[ V\right] A + A \mathscr {L} _ {A} \left[ V\right] \quad \text {and} \quad X = \mathscr {L} _ {A} \left[ XA+AX\right] . \end{aligned}
(13)
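A minimal sketch of the Lyapunov operator, assuming A is symmetric positive-definite: in the eigenbasis of A the equation decouples entrywise, so the solution is obtained by dividing by the sums of eigenvalues. A second, slower version based on vectorization and the Kronecker product is included only as a cross-check. Both function names are illustrative.

```python
import numpy as np

def lyap(A, V):
    """Solve X A + A X = V via the eigendecomposition A = Q diag(w) Q^T."""
    w, Q = np.linalg.eigh(A)
    Vt = Q.T @ V @ Q
    # In the eigenbasis the equation reads Xt[i, j] * (w[i] + w[j]) = Vt[i, j].
    return Q @ (Vt / (w[:, None] + w[None, :])) @ Q.T

def lyap_kron(A, V):
    """Same solve through the n^2 x n^2 linear system on vec(X) (row-major)."""
    n = A.shape[0]
    I = np.eye(n)
    K = np.kron(I, A) + np.kron(A, I)
    return np.linalg.solve(K, V.reshape(-1)).reshape(n, n)

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
G = rng.standard_normal((n, n))
V = 0.5 * (G + G.T)

X = lyap(A, V)
print(np.linalg.norm(X @ A + A @ X - V))       # residual of Eq. (12)
print(np.linalg.norm(X - lyap_kron(A, V)))     # agreement of the two solvers
```

The eigendecomposition version costs a few dense n-by-n products, whereas the Kronecker version solves a system of size n squared and is meant for small checks only.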
The Lyapunov operator itself can be seen as a derivative. In fact, the inverse of the square mapping $${{\mathrm{sq}}}$$ is the square root mapping $${{\mathrm{sqrt}}}:\varSigma \mapsto \varSigma ^{1/2}$$. By the derivative-of-the-inverse rule,
\begin{aligned} d_{V} {{\mathrm{sqrt}}}(\varSigma ) = (d{{\mathrm{sq}}}({{\mathrm{sqrt}}}(\varSigma )))^{-1}[V] = \mathscr {L} _ {\varSigma ^{1/2}} \left[ V\right] . \end{aligned}
(14)
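Equation (14) lends itself to a finite-difference check. The sketch below compares a central difference quotient of the matrix square root with the Lyapunov solution at $$\varSigma ^{1/2}$$; the step size `h` and the helper names are assumptions made for illustration.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(w)) @ Q.T

def lyap(A, V):
    """Solve X A + A X = V for symmetric positive-definite A."""
    w, Q = np.linalg.eigh(A)
    return Q @ ((Q.T @ V @ Q) / (w[:, None] + w[None, :])) @ Q.T

rng = np.random.default_rng(6)
n = 4
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)
G = rng.standard_normal((n, n))
V = 0.5 * (G + G.T)

h = 1e-5
# Central difference of Sigma -> Sigma^{1/2} in the direction V.
numeric = (sym_sqrt(Sigma + h * V) - sym_sqrt(Sigma - h * V)) / (2.0 * h)
# Right-hand side of Eq. (14): Lyapunov solve at the square root of Sigma.
analytic = lyap(sym_sqrt(Sigma), V)
print(np.linalg.norm(numeric - analytic))
```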
If $$\varSigma$$ is the dispersion of a non-singular Gaussian distribution, then $$C = \varSigma ^{-1} \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ is the concentration matrix and represents an alternative and useful parameterization. From the Lyapunov equation $$V = X\varSigma + \varSigma X$$ we obtain $$\varSigma ^{-1}V\varSigma ^{-1} = \varSigma ^{-1}X + X\varSigma ^{-1}$$, hence
\begin{aligned} \mathscr {L} _ {\varSigma } \left[ V\right] = \mathscr {L} _ {\varSigma ^{-1}} \left[ \varSigma ^{-1}V\varSigma ^{-1}\right] \quad \text {and} \quad \mathscr {L} _ {\varSigma ^{-1}} \left[ U\right] = \mathscr {L} _ {\varSigma } \left[ \varSigma U \varSigma \right] . \end{aligned}
Likewise, another useful formula is
\begin{aligned} \mathscr {L} _ {\varSigma } \left[ V\right] = \varSigma ^{-1/2} \mathscr {L} _ {\varSigma } \left[ \varSigma ^{1/2}V\varSigma ^{1/2}\right] \varSigma ^{-1/2} . \end{aligned}
(15)
There is also a relation between the Lyapunov equation and the trace. From $$X \varSigma + \varSigma X = V$$, it follows $$\varSigma ^{-1} X \varSigma + X = \varSigma ^{-1}V$$. Then
\begin{aligned} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ V\right] \right) = \frac{1}{2} {{\mathrm{Tr}}}\left( \varSigma ^{-1}V\right) . \end{aligned}
(16)
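The identities relating the Lyapunov operator to the concentration matrix, to congruence by $$\varSigma ^{1/2}$$, and to the trace can all be verified numerically. The sketch below checks $$\mathscr {L} _ {\varSigma } [V] = \mathscr {L} _ {\varSigma ^{-1}} [\varSigma ^{-1}V\varSigma ^{-1}]$$, the congruence rule $$\mathscr {L} _ {\varSigma } [\varSigma ^{1/2}V\varSigma ^{1/2}] = \varSigma ^{1/2}\mathscr {L} _ {\varSigma } [V]\varSigma ^{1/2}$$, and Eq. (16). Helper names are illustrative.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(w)) @ Q.T

def lyap(A, V):
    """Solve X A + A X = V for symmetric positive-definite A."""
    w, Q = np.linalg.eigh(A)
    return Q @ ((Q.T @ V @ Q) / (w[:, None] + w[None, :])) @ Q.T

rng = np.random.default_rng(7)
n = 4
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)
Sinv = np.linalg.inv(Sigma)
Sh = sym_sqrt(Sigma)
G = rng.standard_normal((n, n))
V = 0.5 * (G + G.T)

X = lyap(Sigma, V)
via_concentration = lyap(Sinv, Sinv @ V @ Sinv)   # same X from the concentration side
congruence_lhs = lyap(Sigma, Sh @ V @ Sh)
congruence_rhs = Sh @ X @ Sh
trace_lhs = np.trace(X)
trace_rhs = 0.5 * np.trace(Sinv @ V)              # Eq. (16)
print(trace_lhs, trace_rhs)
```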
We will later need the derivative of the mapping $$A \mapsto \mathscr {L} _ {A} \left[ V\right]$$, for a fixed V. Differentiating the first identity in Eq. (13) in the direction U, we have
\begin{aligned} 0= d_U \mathscr {L} _ {A} \left[ V\right] A + \mathscr {L} _ {A} \left[ V\right] U + U \mathscr {L} _ {A} \left[ V\right] + A \, d_U \mathscr {L} _ {A} \left[ V\right] . \end{aligned}
Hence $$d_{U}\mathscr {L} _ {A} \left[ V\right]$$ is the solution to the Lyapunov equation
\begin{aligned} d_U \mathscr {L} _ {A} \left[ V\right] A + A \ d_U \mathscr {L} _ {A} \left[ V\right] = - (\mathscr {L} _ {A} \left[ V\right] U + U \mathscr {L} _ {A} \left[ V\right] ) , \end{aligned}
so that we get
\begin{aligned} d_U \mathscr {L} _ {A} \left[ V\right] = - \mathscr {L} _ {A} \left[ \mathscr {L} _ {A} \left[ V\right] U + U \mathscr {L} _ {A} \left[ V\right] \right] . \end{aligned}
(17)
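Equation (17) can also be checked by finite differences: perturb the base point A in a symmetric direction U, solve the Lyapunov equation at the perturbed points, and compare the difference quotient with the nested Lyapunov expression. The step size and names below are illustrative assumptions.

```python
import numpy as np

def lyap(A, V):
    """Solve X A + A X = V for symmetric positive-definite A."""
    w, Q = np.linalg.eigh(A)
    return Q @ ((Q.T @ V @ Q) / (w[:, None] + w[None, :])) @ Q.T

def random_sym(n, rng):
    G = rng.standard_normal((n, n))
    return 0.5 * (G + G.T)

rng = np.random.default_rng(8)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
U, V = random_sym(n, rng), random_sym(n, rng)

h = 1e-5
# Central difference of A -> L_A[V] in the direction U.
numeric = (lyap(A + h * U, V) - lyap(A - h * U, V)) / (2.0 * h)
# Right-hand side of Eq. (17).
LV = lyap(A, V)
analytic = -lyap(A, LV @ U + U @ LV)
print(np.linalg.norm(numeric - analytic))
```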
It will be useful in the following to evaluate the second derivative of the mapping $${{\mathrm{sqrt}}}:\varSigma \mapsto \varSigma ^{1/2}$$. From Eqs. (14) and (17) it follows
\begin{aligned} d^2 {{\mathrm{sqrt}}}(\varSigma )[U,V] = - \mathscr {L} _ {\varSigma ^{1/2}} \left[ \mathscr {L} _ {\varSigma ^{1/2}} \left[ V\right] \mathscr {L} _ {\varSigma ^{1/2}} \left[ U\right] + \mathscr {L} _ {\varSigma ^{1/2}} \left[ U\right] \mathscr {L} _ {\varSigma ^{1/2}} \left[ V\right] \right] . \end{aligned}
The Lyapunov equation plays a crucial role, as the linear operator $${\mathscr {L}}_A$$ enters the expression of the Riemannian metric with respect to the standard inner product, see Eq. (7). As a consequence, the numerical implementation of the inner product $$W_\varSigma (U,V)$$ will require the computation of the matrix $$\mathscr {L} _ {\varSigma } \left[ U\right]$$. There are many ways to write down the closed-form solution to Eq. (12). They are discussed in [7]. However, efficient numerical solutions are not based on the closed forms, but rely on specialized numerical algorithms, as discussed by E. L. Wachspress [38] and by V. Simoncini [34].

We now turn to the computation of the second-order approximation of $$W^2$$ in Eq. (2).

Fix $$\varSigma \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ and let $$H \in {{\mathrm{Sym}}}\left( n\right)$$ so that $$(\varSigma \pm H) \in {{\mathrm{Sym}}}^{++}\left( n\right)$$. Hence, $$\varSigma + \theta H\in {{\mathrm{Sym}}}^{++}\left( n\right)$$ for all $$\theta \in [-1,+1]$$. Consider the expression of $$W^2$$ with $$\mu _1=\mu _2=0$$, $$\varSigma _1=\varSigma$$, $$\varSigma _2=\varSigma +\theta H$$, namely
\begin{aligned} \theta \mapsto W^2(\varSigma ,\varSigma +\theta H) = 2 {{\mathrm{Tr}}}\left( \varSigma \right) + \theta {{\mathrm{Tr}}}\left( H\right) - 2 {{\mathrm{Tr}}}\left( \left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{1/2}\right) . \end{aligned}
By Eq. (14) and Eq. (16), the first-order derivative is
\begin{aligned}&\frac{d}{d\theta }W^2(\varSigma ,\varSigma +\theta H) = {{\mathrm{Tr}}}\left( H\right) - 2 {{\mathrm{Tr}}}\left( \mathscr {L} _ {\left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{1/2}} \left[ \varSigma ^{1/2}H\varSigma ^{1/2}\right] \right) \\&\quad = {{\mathrm{Tr}}}\left( H\right) - {{\mathrm{Tr}}}\left( \left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{-1/2}\left( \varSigma ^{1/2}H\varSigma ^{1/2}\right) \right) . \end{aligned}
Observe that $$\left. \frac{d}{d\theta }W^2(\varSigma ,\varSigma +\theta H)\right| _{\theta =0} = 0$$.
The second derivative is
\begin{aligned} \frac{d^2}{d{\theta }^2}W^2(\varSigma ,\varSigma +\theta H) = - {{\mathrm{Tr}}}\left( \frac{d}{d\theta }\left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{-1/2}\left( \varSigma ^{1/2}H\varSigma ^{1/2}\right) \right) \end{aligned}
with
\begin{aligned}&\frac{d}{d\theta }\left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{-1/2} = - \left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{-1/2} \\&\quad \times \mathscr {L} _ {\left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{1/2}} \left[ \varSigma ^{1/2}H\varSigma ^{1/2}\right] \left( \varSigma ^2+ \theta \varSigma ^{1/2}H\varSigma ^{1/2}\right) ^{-1/2} , \end{aligned}
so that
\begin{aligned}&\left. \frac{d^2}{d{\theta }^2}W^2(\varSigma ,\varSigma +\theta H) \right| _{\theta =0} = {{\mathrm{Tr}}}\left( \varSigma ^{-1}\mathscr {L} _ {\varSigma } \left[ \varSigma ^{1/2}H\varSigma ^{1/2}\right] \varSigma ^{-1}\varSigma ^{1/2}H\varSigma ^{1/2}\right) \\&\quad = {{\mathrm{Tr}}}\left( \varSigma ^{-1/2}\mathscr {L} _ {\varSigma } \left[ \varSigma ^{1/2}H\varSigma ^{1/2}\right] \varSigma ^{-1/2}H\right) = {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] H\right) , \end{aligned}
where Eq. (15) has been used. Finally, observe that
\begin{aligned}&{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] \varSigma \mathscr {L} _ {\varSigma } \left[ H\right] \right) = {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] \mathscr {L} _ {\varSigma } \left[ H\right] \varSigma \right) \nonumber \\&\quad = \frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] \left( \mathscr {L} _ {\varSigma } \left[ H\right] \varSigma + \varSigma \mathscr {L} _ {\varSigma } \left[ H\right] \right) \right) = \frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] H\right) \end{aligned}
(18)
We can conclude that
\begin{aligned} W^2(\varSigma ,\varSigma + \theta H)= & {} \frac{\theta ^2}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] H\right) + {\text {o}}(\theta ^2)\\= & {} \theta ^2 {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ H\right] \varSigma \mathscr {L} _ {\varSigma } \left[ H\right] \right) + {\text {o}}(\theta ^2) . \end{aligned}
Therefore, the bi-linear form in the RHS suggests the form of the Riemannian metric to be derived.
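The second-order expansion can be observed numerically by comparing the exact value of $$W^2(\varSigma ,\varSigma +\theta H)$$ with the quadratic form for a small $$\theta$$. This is a sketch under stated assumptions: the value of `theta`, the random test matrices and the helper names are arbitrary illustrative choices.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def lyap(A, V):
    """Solve X A + A X = V for symmetric positive-definite A."""
    w, Q = np.linalg.eigh(A)
    return Q @ ((Q.T @ V @ Q) / (w[:, None] + w[None, :])) @ Q.T

def w2(S1, S2):
    """Zero-mean squared distance from the first form of Eq. (2)."""
    r = sym_sqrt(S1)
    return np.trace(S1 + S2 - 2.0 * sym_sqrt(r @ S2 @ r))

rng = np.random.default_rng(9)
n = 4
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)
G = rng.standard_normal((n, n))
H = 0.5 * (G + G.T)

theta = 1e-3
L = lyap(S, H)
exact = w2(S, S + theta * H)
quadratic = theta ** 2 * np.trace(L @ S @ L)    # theta^2 Tr(L[H] Sigma L[H])
half_form = 0.5 * theta ** 2 * np.trace(L @ H)  # same value by Eq. (18)
print(exact / quadratic)                        # ratio should be close to 1
```

The ratio differs from 1 by a term of order $$\theta$$, consistent with the $${\text {o}}(\theta ^2)$$ remainder.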

### 2.1 The mapping $$A \mapsto AA^*$$

We study now the extension of the square operation to general invertible matrices, namely the mapping $${{\mathrm{\sigma }}}: {\text {GL}}\left( n\right) \rightarrow {{\mathrm{Sym}}}^{++}\left( n\right)$$, defined by $${{\mathrm{\sigma }}}(A) = AA^*$$. The next proposition shows that this operation is a submersion. We recall first its definition, see [10, Ch. 8, Ex. 8–10] or [20, §II.2 ].

Let $${\mathscr {O}}$$ be an open set of the Hilbert space H, and $$f:{ \mathscr {O}}\rightarrow {\mathscr {N}}$$ a smooth surjection of $${\mathscr {O}}$$ onto a manifold $${\mathscr {N}}$$ which is a submersion, i.e., assume that for each $$A\in { \mathscr {O}}$$ the derivative at A, $$df(A):H\rightarrow T_{f(A)}{ \mathscr {N}}$$ is surjective. In such a case, for each $$C\in {\mathscr {N}}$$, the fiber $$f^{-1}(C)$$ is a sub-manifold. Given a point $$A\in f^{-1}(C)$$, a vector $$U\in H$$ is called vertical if it is tangent to the manifold $$f^{-1}(C)$$. Each such tangent vector U is the velocity at $$t=0$$ of some smooth curve $$t\mapsto \gamma (t)$$ with $$\gamma (0)=A$$ and $${\dot{\gamma }}(0)=U$$. Precisely, from $$f(\gamma (t))=C$$ for all t we derive the characterization of vertical vectors. We have $$df(A)[{\dot{\gamma }}(0)]=0$$ i.e., the tangent space at A is $$T_{A}f^{-1}(f(A))={\text {Ker}}(df(A))$$. The orthogonal space to the tangent space $$T_{A}f^{-1}(f(A))$$ is called the space of horizontal vectors at A,
\begin{aligned} {\mathscr {H}}_{A}={\text {Ker}}(df(A))^{\perp }={\text {Im}}\left( df(A)^{*}\right) . \end{aligned}
Let us apply this argument to our specific case. Let $${\text {GL}}(n)\subset {\text {M}}(n)$$ be the open set of invertible matrices; $${\text {O}}\left( n\right)$$ the subgroup of $${\text {GL}}(n)$$ of orthogonal matrices; $${\text {Sym}}^{\perp }\left( n\right)$$ the subspace of $${\text {M}}(n)$$ of anti-symmetric matrices.

### Proposition 1

1. For each given $$A\in {\text {GL}}(n)$$ we have the orthogonal splitting
\begin{aligned} {\text {M}}(n)={\text {Sym}}\left( n\right) A\oplus {\text {Sym}}^{\perp }\left( n\right) (A^{*})^{-1}. \end{aligned}

2. The mapping
\begin{aligned} \sigma :{\text {GL}}(n)\ni A\mapsto AA^{*}\in {\text {Sym}}^{++}\left( n\right) \end{aligned}
has derivative at A given by $$d_{X}\sigma (A)=XA^{*}+AX^{*}$$. It is a submersion with fibers
\begin{aligned} \sigma ^{-1}(C)=\left\{ C^{1/2}R \vert R\in {\text {O}}(n) \right\} . \end{aligned}

3. The kernel of the differential is
\begin{aligned} {\text {Ker}}(d\sigma (A))={\text {Sym}}^{\perp }\left( n\right) (A^{*})^{-1}\ \end{aligned}
and its orthogonal complement, $${\mathscr {H}}_{A}={\text {Ker}}(d\sigma (A))^{\perp },$$ is
\begin{aligned} {\mathscr {H}}_{A}={\text {Sym}}\left( n\right) A. \end{aligned}

4. The orthogonal projection of $$X \in M(n)$$ onto $${\mathscr {H}}_A$$ is $$\mathscr {L} _ {AA^*} \left[ XA^*+AX^*\right] A$$.

### Proof

We provide here the proof for the sake of completeness. See also [36] and [8].

1. If $$\left\langle B,CA\right\rangle =0$$ for all $$C\in {{\mathrm{Sym}}}\left( n\right)$$, i.e., for all $$CA \in {{\mathrm{Sym}}}\left( n\right) A$$, then $${\text {Tr}}\left( BA^{*}C\right) =0$$, so that $$BA^{*}\in {\text {Sym}}^{\perp }\left( n\right)$$, that is, $$B \in {{\mathrm{Sym}}}^{\perp }\left( n\right) (A^*)^{-1}$$.

2. Let the matrix A be an element in the fiber manifold $$\sigma ^{-1}(AA^{*})$$. The derivative of $$\sigma$$ at A, $$X \mapsto XA^{*}+AX^{*}$$, is surjective, because for each $$W\in {\text {Sym}}\left( n\right)$$ we have $$d\sigma (A)\left[ \frac{1}{2} W(A^{*})^{-1}\right] =W$$. Hence $$\sigma$$ is a submersion and the fiber $$\sigma ^{-1}(AA^*)=\left\{ (AA^*)^{1/2}R \vert R \in {\text {O}}(n) \right\}$$ is a sub-manifold of $${\text {GL}}(n)$$.

3.
Let us compute the splitting of $${{\mathrm{M}}}(n)$$ into the kernel of $$d\sigma (A)$$ and its orthogonal: $${\text {M}}(n)={\text {Ker}}(d\sigma (A))\oplus {\mathscr {H}}_{A}$$. The vector space tangent to $$\sigma ^{-1}(AA^{*})$$ at A is the kernel of the derivative at A:
\begin{aligned} {\text {Ker}}\left( d\sigma (A)\right)= & {} \left\{ X\in {\text {M}}(n)|\text { }XA^{*}+AX^{*}=0\right\} \\= & {} \left\{ X\in {\text {M}}(n)|\text { }(AX^{*})^{*}=-AX^{*}\right\} . \end{aligned}
Therefore, $$X\in {\text {Ker}}(d\sigma (A))$$ if, and only if, $$AX^{*}\in {\text { Sym}}^{\perp }\left( n\right)$$, i.e., $${\text {Ker}}(d\sigma (A))={\text {Sym}} ^{\perp }\left( n\right) (A^{*})^{-1}$$. We have just proved that this implies $$\mathscr {H}_{A}= {{\mathrm{Sym}}}\left( n\right) A$$.

4.

Consider the decomposition of X into the horizontal and the vertical part: $$X = C A + D (A^*)^{-1}$$ with $$C \in {{\mathrm{Sym}}}\left( n\right)$$ and $$D \in {{\mathrm{Sym}}}^{\perp }\left( n\right)$$. By transposition, we get $$X^* = A^* C - A^{-1} D$$. From the previous two equations, we obtain the two equations $$XA^* = C (AA^*) + D$$ and $$AX^* = (AA^*)C - D$$. The sum of the two previous equations is $$XA^* + AX^* = C(AA^*)+(AA^*)C$$, which is a Lyapunov equation having solution $$C = \mathscr {L} _ {AA^*} \left[ XA^* + AX^*\right]$$. It follows that the projection is $$CA = \mathscr {L} _ {AA^*} \left[ XA^* + AX^*\right] A$$. $$\square$$
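The projection formula of item 4 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `lyapunov` and `horizontal_projection` are ours and do not appear in the paper. The Lyapunov operator is evaluated through the spectral decomposition of the symmetric positive-definite matrix $$AA^*$$.

```python
import numpy as np

def lyapunov(sigma, v):
    """Solve X @ sigma + sigma @ X = v for a symmetric positive-definite sigma."""
    w, q = np.linalg.eigh(sigma)            # sigma = q @ diag(w) @ q.T
    vt = q.T @ v @ q                        # right-hand side in the eigenbasis
    return q @ (vt / (w[:, None] + w[None, :])) @ q.T

def horizontal_projection(a, x):
    """Projection of x onto H_A = Sym(n) A, as in item 4 of Proposition 1."""
    c = lyapunov(a @ a.T, x @ a.T + a @ x.T)
    return c @ a

rng = np.random.default_rng(0)
n = 4
a = np.eye(n) + 0.2 * rng.standard_normal((n, n))   # test matrix, assumed invertible
x = rng.standard_normal((n, n))

h = horizontal_projection(a, x)   # horizontal part C A
v = x - h                         # vertical part D (A*)^{-1}

c = h @ np.linalg.inv(a)          # should be symmetric
d = v @ a.T                       # should be anti-symmetric
print(np.allclose(c, c.T), np.allclose(d, -d.T), abs(np.sum(h * v)) < 1e-8)
```

The last printed value tests the orthogonality of the two parts with respect to the trace inner product $$\left\langle X,Y\right\rangle ={\text {Tr}}\left( XY^{*}\right)$$.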

## 3 Wasserstein distance

The aim of this section is to discuss the Wasserstein distance for the Gaussian case as well as the equation for the associated metric geodesic. Most of its content is an exposition of known results.

### 3.1 Block-Gaussian

Let us suppose that the dispersion matrix $$\varSigma \in {{\mathrm{Sym}}}^+\left( 2n\right)$$ is partitioned into $$n\times n$$ blocks, and consider random variables X and Y such that
\begin{aligned} \begin{bmatrix} X \\ Y \end{bmatrix} \sim {\text {N}}_{2n}\left( \mu ,\varSigma \right) ,\quad \varSigma = \begin{bmatrix} \varSigma _{1}&K \\ K^{*}&\varSigma _{2} \end{bmatrix} , \end{aligned}
so that $$K_{ij}={\text {Cov}}\left( X_{i},Y_{j}\right)$$ for $$i,j=1,\dots ,n$$. It follows that $$K_{ij}^{2} \le (\varSigma _{1})_{ii}(\varSigma _{2})_{jj} \le \frac{1}{4} \left( (\varSigma _{1})_{ii} + (\varSigma _{2})_{jj}\right) ^{2}$$, which in turn implies the bounds
\begin{aligned} \left\| K\right\| _{2}^{2}\le {\text {Tr}}\left( \varSigma _{1}\right) {\text {Tr}}\left( \varSigma _{2}\right) \quad \text {and} \quad \sup _{ij}\left| K_{ij}\right| \le \frac{1}{2}({\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) ) . \end{aligned}
(19)
For mean vectors $$\mu _{1},\mu _{2}\in {\mathbb {R}}^{n}$$ and dispersion matrices $$\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right)$$, define the set of jointly Gaussian distributions with given marginals to be
\begin{aligned} {\mathscr {G}}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2}))=\left\{ {\text {N}} _{2n}\left( \begin{bmatrix} \mu _{1} \\ \mu _{2} \end{bmatrix} , \begin{bmatrix} \varSigma _{1}&K \\ K^{*}&\varSigma _{2} \end{bmatrix} \right) \right\} , \end{aligned}
and the Gini dissimilarity index
\begin{aligned}&W^{2}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})) \nonumber \\&\quad = \inf \left\{ \mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right] \vert \begin{bmatrix} X \\ Y \end{bmatrix} \sim \gamma , \gamma \in {\mathscr {G}}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})) \right\} \nonumber \\&\quad = \left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) -2\sup \left\{ {\text {Tr}} \left( K\right) \vert \begin{bmatrix} \varSigma _{1}&K \\ K^{*}&\varSigma _{2} \end{bmatrix} \in {\text {Sym}}^{+}\left( 2n\right) \right\} \nonumber \\ \end{aligned}
(20)
Actually, in view of either of the bounds in Eq. (19), the set $${ \mathscr {G}}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2}))$$ is compact and the $$\inf$$ is attained.
It is easy to verify that
\begin{aligned}&W((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2}))\\&\quad =\sqrt{\min \left\{ \mathbb {E} \left[ \left\| X-Y\right\| ^{2}\right] \vert \begin{bmatrix} X \\ Y \end{bmatrix} \sim \gamma ,\gamma \in {\mathscr {G}}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})) \right\} } \end{aligned}
defines a distance on the space $${\mathscr {G}}_{n}\simeq {\mathbb {R}}^{n}\times {\text { Sym}}^{+}\left( n\right)$$. The symmetry of W is clear as well as the triangle inequality, by considering Gaussian distributions on $${\mathbb {R}} ^{n}\times {\mathbb {R}}^{n}\times {\mathbb {R}}^{n}$$ with given marginals. To conclude, assume that the $$\min$$ is reached at some $${\overline{\gamma }}$$. Then
\begin{aligned} 0=W^{2}((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2}))=\mathbb {E}_{{\overline{\gamma }}} \left[ \left\| X-Y\right\| ^{2}\right] \Leftrightarrow \mu _{1}=\mu _{2}\quad {\text {and}}\quad \varSigma _{1}=\varSigma _{2}. \end{aligned}
A further observation is that the distance W is homogeneous, i.e.,
\begin{aligned} W((\lambda \mu _{1},\lambda ^{2}\varSigma _{1}),(\lambda \mu _{2},\lambda ^{2}\varSigma _{2}))=\lambda W((\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})),\quad \lambda \ge 0. \end{aligned}

### 3.2 Computing the quadratic dissimilarity index

We will present a proof as given by Dowson and Landau [12], but with some corrections.

Given $$\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right)$$, each admissible K in (20) belongs to a compact set of $${\text {M}} (n)$$ thanks to the bound (19), so the maximum of the function $$2 {\text {Tr}}\left( K\right)$$ is reached. Therefore, we are led to study the problem
\begin{aligned} \left\{ \begin{aligned} \alpha&(\varSigma _1,\varSigma _2)=\max _{K\in {\text {M}}(n)}2{\text {Tr}}\left( K\right) \\&{\text{ subject } \text{ to }}\\&\varSigma = \begin{bmatrix} \varSigma _1&K\\ K^*&\varSigma _2\end{bmatrix} \in {\text {Sym}}^+\left( 2n\right) \end{aligned}\right. \end{aligned}
(21)
The value of the similar problem with $$\max$$ replaced by $$\min$$ will be denoted by $$\beta (\varSigma _{1},\varSigma _{2}).$$

### Proposition 2

1.
Let $$\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right)$$. Then
\begin{aligned} \alpha (\varSigma _{1},\varSigma _{2})=2{\text {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) \text { and } \beta (\varSigma _{1},\varSigma _{2})=-\alpha (\varSigma _{1},\varSigma _{2}). \end{aligned}

2.
If moreover $${{\mathrm{det}}}\left( \varSigma _1\right) > 0$$, then
\begin{aligned} \alpha (\varSigma _{1},\varSigma _{2}) = 2 {{\mathrm{Tr}}}\left( (\varSigma _1\varSigma _2)^{1/2}\right) . \end{aligned}

### Proof

(point (1)) A symmetric matrix $$\varSigma \in {\text {Sym}}\left( 2n\right)$$ is non-negative definite if, and only if, it is of the form $$\varSigma =SS^{*}$$, with $$S\in {\text {M}}\left( 2n\right)$$. Given the block structure of $$\varSigma$$ in (21), we can write
\begin{aligned} \begin{bmatrix} \varSigma _{1}&K \\ K^{*}&\varSigma _{2} \end{bmatrix} = \begin{bmatrix} A \\ B \end{bmatrix} \begin{bmatrix} A^{*}&B^{*} \end{bmatrix} = \begin{bmatrix} AA^{*}&AB^{*} \\ BA^{*}&BB^{*} \end{bmatrix} , \end{aligned}
where A and B are two matrices in $${\text {M}}(n\times 2n).$$
Therefore, problem (21) becomes
\begin{aligned} \left\{ \begin{aligned} \alpha&(\varSigma _1,\varSigma _2)=\max _{A,B\in {\text {M}}(n\times 2n)}2{\text {Tr}}\left( AB^*\right) \\&{\text{ subject } \text{ to }}\\&\varSigma _1=AA^*,\quad \varSigma _2=BB^*\end{aligned} \right. \end{aligned}
We have already observed that the optimum exists, so the necessary conditions of the Lagrange theorem allow us to characterize this optimum. However, the two constraints $$\varSigma _{1}=AA^{*}$$ and $$\varSigma _{2}=BB^{ *}$$ are not necessarily regular at every point (i.e., the Jacobian of the transformation may fail to be of full rank at some point), so we must take into account that the optimum could be an irregular point. For this purpose, as is customary, we shall adopt the Fritz John first-order formulation for the Lagrangian (see [25]).

We shall initially assume that both $$\varSigma _{1}$$ and $$\varSigma _{2}$$ are non-singular.

Let then $$\left( \nu _{0},\varLambda ,\varGamma \right) \in \left\{ 0,1\right\} \times {\text {Sym}}\left( n\right) \times {\text {Sym}}\left( n\right)$$, $$\left( \nu _{0},\varLambda ,\varGamma \right) \ne \left( 0,0,0\right)$$, where the symmetric matrices $$\varLambda$$ and $$\varGamma$$ are the Lagrange multipliers. The Lagrangian function will be
\begin{aligned} L&=2\nu _{0}{\text {Tr}}\left( AB^{*}\right) -{\text {Tr}}\left( \varLambda AA^{*}\right) -{\text {Tr}}\left( \varGamma BB^{*}\right) \\&=2\nu _{0}{\text {Tr}}\left( AB^{*}\right) -{\text {Tr}}\left( A^{*}\varLambda A\right) -{\text {Tr}}\left( B^{*}\varGamma B\right) \end{aligned}
The first-order conditions of L lead to
\begin{aligned} \left\{ \begin{aligned}&\nu _{0}B=\varLambda A,\quad \nu _{0}A=\varGamma B \\&\varSigma _1=AA^*,\quad \varSigma _2=BB^*\end{aligned}\right. . \end{aligned}
(22)
In the case $$\nu _{0}=1,$$ i.e., the case of stationary regular points, Eq. (22) becomes
\begin{aligned} \left\{ \begin{aligned}&B=\varLambda A,\quad A=\varGamma B\\&\varSigma _1=AA^*,\quad \varSigma _2=BB^*\end{aligned}\right. , \end{aligned}
(23)
which in turn implies
\begin{aligned} \left\{ \begin{aligned} \varLambda \varSigma _{1}\varLambda&=\varSigma _{2}\\ \varGamma \varSigma _{2}\varGamma&=\varSigma _{1}\end{aligned}\right. ,\quad \varLambda ,\varGamma \in {\text {Sym}}\left( n\right) \end{aligned}
(24)
and further
\begin{aligned} K=\varSigma _{1}\varLambda =\varGamma \varSigma _{2}. \end{aligned}
Of course, Eqs. (24) could be more general than Eqs. (23) and thus possibly contain undesirable solutions. In this light, we establish the following facts, in which both matrices $$\varSigma _{1}$$ and $$\varSigma _{2}$$ must be nonsingular. Notice that in this case Eqs. (24) imply that both $$\varLambda$$ and $$\varGamma$$ are nonsingular as well.

Claim 1: If $$(\varGamma ,\varLambda )$$ is a solution to (24) and $$\varLambda ^{-1}=\varGamma$$, then the couple $$(\varGamma ,\varLambda )$$ are Lagrange multipliers of Problem (21).

Actually, let $$\varSigma _{1}=AA^{*}$$, $$A\in {\text {M}}(n\times 2n)$$ be any representation of the matrix $$\varSigma _{1}$$. Define $$B=\varLambda A$$ so that $$A=\varLambda ^{-1}B=\varGamma B$$. Moreover
\begin{aligned} BB^{*}=\varLambda AA^{*}\varLambda =\varLambda \varSigma _{1}\varLambda =\varSigma _{2}, \end{aligned}
and so $$\left( \varLambda ,\varGamma \right)$$ are multipliers associated with the feasible point (A, B).

Claim 2: The set of solutions to (24), such that $$\varGamma ^{-1}=\varLambda$$, is not empty. In particular, there is a unique pair $$\left( {\widetilde{\varLambda }},{\widetilde{\varGamma }}\right)$$ where both $${\widetilde{\varLambda }}$$ and $${\widetilde{\varGamma }}$$ are positive definite.

We have already observed that Eqs. (24) imply that $$\varLambda$$ and $$\varGamma$$ are nonsingular. Moreover, we have $$\varGamma ^{-1}\varSigma _{1}\varGamma ^{-1}=\varSigma _{2}$$. Recalling that the Riccati equation $$X\varSigma _{1}X=\varSigma _{2}$$ has one and only one solution in the class of positive definite matrices, we obtain $$X=\varLambda =\varGamma ^{-1}$$.

Now we proceed to study the solutions to $$\varLambda \varSigma _{1}\varLambda = \varSigma _{2}$$ and we shall show that Eq. (24) has infinitely many solutions. For each such solution $$\varLambda$$, the value of the objective function is given by $$2{\text {Tr}}\left( K\right) =2{\text {Tr}} \left( \varSigma _{1}\varLambda \right)$$. Therefore, we must select the matrix $$\varLambda$$ such that $${\mathrm {Tr}}\left( \varSigma _{1}\varLambda \right)$$ is maximized.

Following [12], we define
\begin{aligned} R=\varSigma _{1}^{1/2}\varLambda \varSigma _{1}^{1/2}\in {\text {Sym}}\left( n\right) , \end{aligned}
so that, in view of (24), we have
\begin{aligned} R^{2}=\varSigma _{1}^{1/2}\varLambda \varSigma _{1}^{1/2}\varSigma _{1}^{1/2}\varLambda \varSigma _{1}^{1/2}=\varSigma _{1}^{1/2}\varLambda \varSigma _{1}\varLambda \varSigma _{1}^{1/2}=\varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\in {\text {Sym}}^{+}\left( n\right) .\nonumber \\ \end{aligned}
(25)
Moreover,
\begin{aligned} {\text {Tr}}\left( R\right) ={\text {Tr}}\left( \varSigma _{1}^{1/2}\varLambda \varSigma _{1}^{1/2}\right) ={\text {Tr}}\left( \varSigma _{1}^{1/2}\varSigma _{1}^{1/2}\varLambda \right) ={\text {Tr}}\left( \varSigma _{1}\varLambda \right) ={\text {Tr}}\left( K\right) . \end{aligned}
Eq. (25) shows that, though the Lagrangian can have many rest points (i.e., many solutions $$\varLambda$$) the matrix $$R^{2}=\varSigma _{1}^{1/2} \varSigma _{2}\varSigma _{1}^{1/2}\in {\text {Sym}}^{+}\left( n\right)$$ remains constant. Not so the value of the objective function $${\text {Tr}}\left( K\right) ={\text {Tr}}\left( R\right)$$ which depends on R (i.e., on $$\varLambda$$ ).
Let
\begin{aligned} R^{2}=\sum _{k}\lambda _{k}E_{k} \end{aligned}
denote the spectral decomposition of $$R^{2}$$, then the solutions to R will be
\begin{aligned} R=\sum _{k}\varepsilon _{k}\lambda _{k}^{1/2}E_{k} \end{aligned}
with $$\varepsilon _{k}=\pm 1$$. Hence $${\text {Tr}}\left( K\right) ={\text {Tr}} \left( R\right)$$ will be maximized whenever $$\varepsilon _{k}\equiv 1$$ and so $$R\in {\text {Sym}}^{+}\left( n\right)$$. Clearly the objective function will be minimized if $$\varepsilon _{k}\equiv -1$$. From now on the proof of the $$\min$$ statement follows similarly.
Hence the maximum of the trace occurs at
\begin{aligned} R=\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}, \end{aligned}
namely $$\varLambda =\varSigma _{1}^{-1/2}\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\varSigma _{1}^{-1/2}.$$ Thanks to Claims 1-2 this matrix is a multiplier of the Lagrangian and so we would have
\begin{aligned} \alpha \left( \varSigma _{1},\varSigma _{2}\right) =2{\mathrm {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) , \end{aligned}
(26)
as long as the optimum is attained at a regular point. In fact, to complete the proof, we must still examine the case $$\nu _{0}=0$$, for which Eq. (22) becomes
\begin{aligned} \varLambda A=0,\quad \varGamma B=0. \end{aligned}
It follows
\begin{aligned} \varLambda \varSigma _{1}&=\varLambda AA^{*}=0 \\ \varGamma \varSigma _{2}&=\varGamma BB^{*}=0, \end{aligned}
and consequently $$\varLambda =\varGamma =0$$. Therefore there is no irregular point, provided $$\varSigma _{1}$$ and $$\varSigma _{2}$$ are not singular matrices. So we have proved the relation (26) under the above assumptions.
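The selection of the maximizing multiplier among the sign choices $$\varepsilon _{k}=\pm 1$$ can be illustrated numerically. This is a minimal sketch, assuming NumPy, nonsingular randomly generated $$\varSigma _{1},\varSigma _{2}$$, and a helper `sqrtm_psd` of our own; it enumerates the symmetric square roots R built from one spectral decomposition and compares the traces.

```python
import numpy as np
from itertools import product

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

rng = np.random.default_rng(1)
n = 3
g1, g2 = rng.standard_normal((2, n, n))
s1 = g1 @ g1.T + np.eye(n)          # nonsingular Sigma_1
s2 = g2 @ g2.T + np.eye(n)          # nonsingular Sigma_2

s1h = sqrtm_psd(s1)
s1h_inv = np.linalg.inv(s1h)
lam, e = np.linalg.eigh(s1h @ s2 @ s1h)     # spectral decomposition of R^2

traces = {}
for eps in product([1.0, -1.0], repeat=n):
    r = (e * (np.array(eps) * np.sqrt(lam))) @ e.T   # one square root of R^2
    lam_mat = s1h_inv @ r @ s1h_inv                  # corresponding Lambda
    assert np.allclose(lam_mat @ s1 @ lam_mat, s2)   # Lambda solves Eq. (24)
    traces[eps] = np.trace(s1 @ lam_mat)             # Tr(K) = Tr(R)

best = max(traces, key=traces.get)
worst = min(traces, key=traces.get)
alpha = 2.0 * traces[best]

# the optimal K = Sigma_1 Lambda gives a non-negative definite block matrix
r_plus = (e * np.sqrt(lam)) @ e.T
k = s1 @ (s1h_inv @ r_plus @ s1h_inv)
min_eig = np.linalg.eigvalsh(np.block([[s1, k], [k.T, s2]])).min()
print(best, worst, alpha, min_eig)
```

Every sign pattern yields a solution of Eq. (24), but only the all-positive pattern gives the maximal trace, in agreement with Eq. (26).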

The last step is to extend our result to possibly singular matrices $$\varSigma _{1}$$ and $$\varSigma _{2}$$.

Given the two matrices $$\varSigma _{1},\varSigma _{2}\in {\text {Sym}}^{+}\left( n\right)$$, set
\begin{aligned} \varSigma _{1}\left( \varepsilon \right) =\varSigma _{1}+\varepsilon I_{n}{\text { \ and }}\varSigma _{2}\left( \varepsilon \right) =\varSigma _{2}+\varepsilon I_{n}{ \text {, \ with }}\varepsilon \in [0,1]. \end{aligned}
If $$\varepsilon >0$$, then
\begin{aligned} {\text {det}}\left( \varSigma _{i}+\varepsilon I\right) =\prod _{j=1}^{n}(\lambda _{i,j}+\varepsilon )>0,\quad i=1,2. \end{aligned}
where $$\lambda _{i,j}$$, $$j=1,\dots ,n$$, are the eigenvalues of $$\varSigma _{i}$$, $$i=1,2$$. Let us consider the parametric programming problem
\begin{aligned} \left\{ \begin{aligned} \alpha&(\varSigma _1(\varepsilon ),\varSigma _2(\varepsilon ))=\max _{K\in {\text {M}}(n)}2{\text {Tr}}\left( K\right) \\&{\text{ subject } \text{ to }}\\&\begin{bmatrix} \varSigma _1(\varepsilon )&K\\ K^*&\varSigma _2(\varepsilon )\end{bmatrix} \in {\text {Sym}}^+\left( 2n\right) \end{aligned}\right. \end{aligned}
Observe that the feasible region is contained in a compact set independent of $$\varepsilon \in \left[ 0,1\right]$$ because of the bound (19).
Now the continuity of the optimal value $$\varepsilon \mapsto \alpha (\varSigma _{1}(\varepsilon ),\varSigma _{2}(\varepsilon ))$$ follows easily from Berge maximum theorem, see for instance [2, Th. 17.31]. Hence
\begin{aligned} \alpha (\varSigma _{1},\varSigma _{2})=\lim _{\varepsilon \rightarrow 0}\alpha (\varSigma _{1}(\varepsilon ),\varSigma _{2}(\varepsilon ))=2{\text {Tr}}\left( (\varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2})^{1/2}\right) \end{aligned}
and the assertion is proved for any $$\varSigma _{1},\varSigma _{2}\in {\text {Sym}} ^{+}\left( n\right)$$. $$\square$$
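The continuity argument can be observed on a small singular example. The sketch below assumes NumPy; the rank-one matrices and the helper names `sqrtm_psd` and `alpha` are our own illustrative choices. It evaluates the closed form at $$\varSigma _{i}(\varepsilon )$$ for decreasing $$\varepsilon$$ and at $$\varepsilon =0$$.

```python
import numpy as np

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

def alpha(s1, s2):
    """Closed form 2 Tr((S1^{1/2} S2 S1^{1/2})^{1/2}) of Proposition 2."""
    s1h = sqrtm_psd(s1)
    return 2.0 * np.trace(sqrtm_psd(s1h @ s2 @ s1h))

# two singular (rank-one) dispersion matrices in dimension 3
v = np.array([1.0, 0.0, 0.0])
w = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
s1 = np.outer(v, v)
s2 = 4.0 * np.outer(w, w)
eye = np.eye(3)

eps_grid = [1e-1, 1e-2, 1e-4, 1e-6]
values = [alpha(s1 + eps * eye, s2 + eps * eye) for eps in eps_grid]
limit = alpha(s1, s2)
for eps, val in zip(eps_grid, values):
    print(eps, val, abs(val - limit))
```

For this pair the limit value is $$2\sigma _{1}\sigma _{2}v^{*}w=2\sqrt{2}$$ with $$\sigma _{1}=1$$, $$\sigma _{2}=2$$, and the gap to the regularized values shrinks with $$\varepsilon$$.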

### Proof

(point (2)) From Eq. (9) we have
\begin{aligned} {{\mathrm{Tr}}}\left( \left( \varSigma _1^{1/2} \varSigma _2 \varSigma _1^{1/2}\right) ^{1/2}\right)= & {} {{\mathrm{Tr}}}\left( \varSigma _1^{1/2}\left( \varSigma _1^{1/2} \varSigma _2 \varSigma _1^{1/2}\right) ^{1/2}\varSigma _1^{-1/2}\right) \\= & {} {{\mathrm{Tr}}}\left( \left( \varSigma _1\varSigma _2\right) ^{1/2}\right) . \end{aligned}
$$\square$$
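The trace identity of point (2) rests on the similarity between $$\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}$$ and a square root of $$\varSigma _{1}\varSigma _{2}$$. A minimal numerical sketch, assuming NumPy and randomly generated nonsingular matrices (the helper `sqrtm_psd` is ours):

```python
import numpy as np

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

rng = np.random.default_rng(3)
n = 4
g1, g2 = rng.standard_normal((2, n, n))
s1 = g1 @ g1.T + np.eye(n)
s2 = g2 @ g2.T + np.eye(n)

s1h = sqrtm_psd(s1)
r = sqrtm_psd(s1h @ s2 @ s1h)
z = s1h @ r @ np.linalg.inv(s1h)     # similar to r, and z @ z = s1 @ s2

lhs = np.trace(r)
rhs = np.trace(z)
# independent check: sum of square roots of the eigenvalues of s1 @ s2
eig_sum = np.sum(np.sqrt(np.linalg.eigvals(s1 @ s2).real))
print(lhs, rhs, eig_sum)
```

The matrix `z` is not symmetric in general, but it has the same (positive) spectrum as `r`, which is all the trace identity needs.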

The following result provides sharp lower and upper bounds for $$\mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right]$$.

### Proposition 3

Let X, Y be multivariate Gaussian random variables taking values in $${\mathbb {R}} ^{n}$$ and having means $$\mu _{1}$$ and $$\mu _{2}$$ and dispersion matrices $$\varSigma _{1}$$ and $$\varSigma _{2}$$ respectively. Then
\begin{aligned}&\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}+\varSigma _{2}-2\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) \le \mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right] \le \\&\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}+\varSigma _{2}+2\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) . \end{aligned}
If $$\det \varSigma _{1}\ne 0$$, then the extremal values are attained at the joint distribution of
\begin{aligned}&\begin{bmatrix} X \\ \mu _{2}\pm T(X-\mu _{1}) \end{bmatrix} \sim \\&\quad {\text {N}}_{2n}\left( \begin{bmatrix} \mu _{1} \\ \mu _{2} \end{bmatrix} , \begin{bmatrix} \varSigma _{1}&\pm \varSigma _{1}T \\ \pm T\varSigma _{1}&\varSigma _{2} \end{bmatrix} \right) = {\text {N}}_{2n}\left( \begin{bmatrix} \mu _{1} \\ \mu _{2} \end{bmatrix} , \begin{bmatrix} \varSigma _{1}&\pm (\varSigma _1\varSigma _{2})^{1/2} \\ \pm (\varSigma _2\varSigma _1)^{1/2}&\varSigma _{2} \end{bmatrix} \right) , \end{aligned}
respectively, where $$T\in {\text {Sym}}^{+}\left( n\right)$$ is the solution to the Riccati equation $$T\varSigma _{1}T=\varSigma _{2}$$.

### Proof

From Proposition 2 and Eq. (20), it follows
\begin{aligned} \begin{aligned} \min \mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right]&=\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) -2{\text {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) , \\ \max \mathbb {E}\left[ \left\| X-Y\right\| ^{2}\right]&=\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) +2{\text {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) . \end{aligned} \end{aligned}
To check the extremal points it suffices to observe that, in view of relation (8):
\begin{aligned} {\text {Tr}}\left( T\varSigma _{1}\right) ={\text {Tr}}\left( \varSigma _{1}^{-1/2}\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\varSigma _{1}^{1/2}\right) ={\text {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) . \end{aligned}
Hence it is verified that the extremal values are attained at $$Y=\mu _{2}\pm T(X-\mu _{1})$$. In the second form of the distribution we are using Eq. (10) and Eq. (11). $$\square$$
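The bounds of Proposition 3 can be probed numerically. The sketch below assumes NumPy and nonsingular random $$\varSigma _{1},\varSigma _{2}$$. As an assumption of ours, not taken from the text, admissible cross-covariances are sampled in the form $$K=\varSigma _{1}^{1/2}C\varSigma _{2}^{1/2}$$ with spectral norm $$\left\| C\right\| \le 1$$, which makes the block matrix non-negative definite; the helper names are hypothetical.

```python
import numpy as np

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

def expected_sq_dist(mu1, mu2, s1, s2, k):
    """E||X - Y||^2 for a joint Gaussian with cross-covariance k = Cov(X, Y)."""
    return np.sum((mu1 - mu2) ** 2) + np.trace(s1) + np.trace(s2) - 2.0 * np.trace(k)

rng = np.random.default_rng(2)
n = 3
g1, g2 = rng.standard_normal((2, n, n))
s1 = g1 @ g1.T + np.eye(n)
s2 = g2 @ g2.T + np.eye(n)
mu1, mu2 = rng.standard_normal((2, n))

s1h, s2h = sqrtm_psd(s1), sqrtm_psd(s2)
s1h_inv = np.linalg.inv(s1h)
t = s1h_inv @ sqrtm_psd(s1h @ s2 @ s1h) @ s1h_inv    # solves T S1 T = S2

lower = expected_sq_dist(mu1, mu2, s1, s2, s1 @ t)    # Y = mu2 + T (X - mu1)
upper = expected_sq_dist(mu1, mu2, s1, s2, -s1 @ t)   # Y = mu2 - T (X - mu1)

samples = []
for _ in range(200):
    g = rng.standard_normal((n, n))
    c = rng.uniform() * g / np.linalg.norm(g, 2)      # contraction, ||c|| <= 1
    samples.append(expected_sq_dist(mu1, mu2, s1, s2, s1h @ c @ s2h))

print(lower, min(samples), max(samples), upper)
```

All sampled values fall between the two extremal values attained at $$Y=\mu _{2}\pm T(X-\mu _{1})$$.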

The W-distance defines on $${\mathbb {R}}^{n} \times {{\mathrm{Sym}}}^{++}\left( n\right)$$ a metric geometry with geodesics. This result is due to [27].

### Proposition 4

The relation
\begin{aligned} W\left( (\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})\right) =\sqrt{\left\| \mu _{1}-\mu _{2}\right\| ^{2}+{\text {Tr}}\left( \varSigma _{1}+\varSigma _{2}-2\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) }\nonumber \\ \end{aligned}
(27)
defines a distance on $${\mathbb {R}}^{n}\times {\text {Sym}}^{+}\left( n\right)$$. The geodesic from $$(\mu _{1},\varSigma _{1})$$ to $$(\mu _{2},\varSigma _{2})$$, with $$(\mu _{1},\varSigma _{1}),(\mu _{2},\varSigma _{2})\in {\mathbb {R}}^{n}\times {\text {Sym}}^{++}\left( n\right)$$, is the curve
\begin{aligned} \varGamma :[0,1] \ni t \mapsto \left( \mu (t),\varSigma (t)\right) , \end{aligned}
where $$\mu (t) = (1-t)\mu _{1} + t\mu _{2}$$ and
\begin{aligned}&\varSigma (t) = ((1-t)I+tT)\varSigma _{1}((1-t)I+tT) \\&\quad =(1-t)^2 \varSigma _1 + t^2 \varSigma _2 + t(1-t)\left( (\varSigma _1\varSigma _2)^{1/2} + (\varSigma _2\varSigma _1)^{1/2}\right) , \end{aligned}
and T is the (unique) non-negative definite solution to the Riccati equation $$T\varSigma _{1}T=\varSigma _{2}$$.
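As a quick sanity check of the statement, consider the one-dimensional case $$n=1$$ with $$\varSigma _{i}=\sigma _{i}^{2}>0$$, where $$\sigma _{1},\sigma _{2}>0$$ denote the standard deviations (a worked special case in our own notation). The Riccati equation gives $$T=\sigma _{2}/\sigma _{1}$$, and
\begin{aligned} W^{2}&=(\mu _{1}-\mu _{2})^{2}+\sigma _{1}^{2}+\sigma _{2}^{2}-2\sigma _{1}\sigma _{2}=(\mu _{1}-\mu _{2})^{2}+(\sigma _{1}-\sigma _{2})^{2}, \\ \varSigma (t)&=\left( (1-t)+t\frac{\sigma _{2}}{\sigma _{1}}\right) ^{2}\sigma _{1}^{2}=\left( (1-t)\sigma _{1}+t\sigma _{2}\right) ^{2}, \end{aligned}
so that both the mean and the standard deviation are interpolated linearly along the curve.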

### Proof

Clearly, $$\varGamma (0)=\left( \mu _{1},\varSigma _{1}\right)$$ and $$\varGamma (1)=\left( \mu _{2},\varSigma _{2}\right)$$. Let us compute the distance between $$\varGamma (0)$$ and the point
\begin{aligned} \varGamma (t)=(\mu (t),\varSigma (t))=\left( \mu _{1}+t(\mu _{2}-\mu _{1}),((1-t)I+tT)\varSigma _{1}((1-t)I+tT)\right) . \end{aligned}
We have
\begin{aligned} \begin{aligned} \varSigma _{1}^{1/2}\varSigma (t)\varSigma _{1}^{1/2}&=\varSigma _{1}^{1/2}((1-t)I+tT)\varSigma _{1}((1-t)I+tT)\varSigma _{1}^{1/2} \\&=\left( \varSigma _{1}^{1/2}((1-t)I+tT)\varSigma _{1}^{1/2}\right) \left( \varSigma _{1}^{1/2}((1-t)I+tT)\varSigma _{1}^{1/2}\right) , \end{aligned} \end{aligned}
so that
\begin{aligned} \left( \varSigma _{1}^{1/2}\varSigma (t)\varSigma _{1}^{1/2}\right) ^{1/2}=\varSigma _{1}^{1/2}((1-t)I+tT)\varSigma _{1}^{1/2}, \end{aligned}
and hence
\begin{aligned}&{\text {Tr}}\left( \left( \varSigma _{1}^{1/2}\varSigma (t)\varSigma _{1}^{1/2}\right) ^{1/2}\right) \\&={\text {Tr}}\left( \varSigma _{1}^{1/2}((1-t)I+tT)\varSigma _{1}^{1/2}\right) =(1-t){\text {Tr}}\left( \varSigma _{1}\right) +t{\text {Tr}}\left( T\varSigma _{1}\right) . \end{aligned}
We have
\begin{aligned} \begin{aligned} {\text {Tr}}\left( \varSigma (t)\right)&={\text {Tr}}\left( ((1-t)I+tT)\varSigma _{1}((1-t)I+tT)\right) \\&=(1-t)^{2}{\text {Tr}}\left( \varSigma _{1}\right) +2t(1-t){\text {Tr}}\left( T\varSigma _{1}\right) +t^{2}{\text {Tr}}\left( \varSigma _{2}\right) \end{aligned} \end{aligned}
Collecting all the above results,
\begin{aligned}&{\text {Tr}}\left( \varSigma _{1}+\varSigma (t)-2\left( \varSigma _{1}^{1/2}\varSigma (t)\varSigma _{1}^{1/2}\right) ^{1/2}\right) \\&\quad ={\text {Tr}} \left( \varSigma _{1}\right) + (1-t)^{2}{\text {Tr}}\left( \varSigma _{1}\right) +2t(1-t){\text {Tr}}\left( T\varSigma _{1}\right) \\&\quad \quad +\,t^{2}{\text {Tr}}\left( \varSigma _{2}\right) -2(1-t){\text {Tr}} \left( \varSigma _{1}\right) -2t{\text {Tr}}\left( T\varSigma _{1}\right) \\&\quad = t^{2}{\text {Tr}}\left( \varSigma _{1}\right) +t^{2}{\text {Tr}}\left( \varSigma _{2}\right) -2t^{2}{\text {Tr}}\left( T\varSigma _{1}\right) \\&\quad =t^{2}{\text {Tr}} \left( \varSigma _{1}+\varSigma _{2}-2\left( \varSigma _{1}^{1/2}\varSigma _{2}\varSigma _{1}^{1/2}\right) ^{1/2}\right) . \end{aligned}
In conclusion,
\begin{aligned}&W(\varGamma (0),\varGamma (t)) \\&\quad = \sqrt{\left\| \mu (0)-\mu (t)\right\| ^{2}+{\text {Tr}}\left( \varSigma (0)+\varSigma (t)-2\left( \varSigma (0)^{1/2}\varSigma (t)\varSigma (0)^{1/2}\right) ^{1/2}\right) } \\&\quad = tW(\varGamma (0),\varGamma (1)). \end{aligned}
$$\square$$
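The computation in the proof can be replayed numerically. The following is a minimal sketch, assuming NumPy and nonsingular random dispersion matrices; the function names `wasserstein` and `geodesic` are ours. It implements Eq. (27) and the curve $$\varGamma$$, then compares $$W(\varGamma (0),\varGamma (t))$$ with $$tW(\varGamma (0),\varGamma (1))$$.

```python
import numpy as np

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

def wasserstein(mu1, s1, mu2, s2):
    """Distance of Eq. (27) between N(mu1, s1) and N(mu2, s2)."""
    s1h = sqrtm_psd(s1)
    cross = np.trace(sqrtm_psd(s1h @ s2 @ s1h))
    w2 = np.sum((mu1 - mu2) ** 2) + np.trace(s1) + np.trace(s2) - 2.0 * cross
    return np.sqrt(max(w2, 0.0))        # guard against round-off below zero

def geodesic(mu1, s1, mu2, s2, t):
    """Point Gamma(t) of Proposition 4; s1 is assumed nonsingular."""
    s1h = sqrtm_psd(s1)
    s1h_inv = np.linalg.inv(s1h)
    big_t = s1h_inv @ sqrtm_psd(s1h @ s2 @ s1h) @ s1h_inv   # T S1 T = S2
    m = (1.0 - t) * np.eye(len(s1)) + t * big_t
    return (1.0 - t) * mu1 + t * mu2, m @ s1 @ m

rng = np.random.default_rng(4)
n = 3
g1, g2 = rng.standard_normal((2, n, n))
s1 = g1 @ g1.T + np.eye(n)
s2 = g2 @ g2.T + np.eye(n)
mu1, mu2 = rng.standard_normal((2, n))

w01 = wasserstein(mu1, s1, mu2, s2)
ts = [0.1, 0.25, 0.5, 0.75, 1.0]
dists = [wasserstein(mu1, s1, *geodesic(mu1, s1, mu2, s2, t)) for t in ts]
for t, d in zip(ts, dists):
    print(t, d, t * w01)
```

The two printed columns agree up to round-off, and the endpoint `geodesic(mu1, s1, mu2, s2, 1.0)` reproduces $$(\mu _{2},\varSigma _{2})$$.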

We end this section by adding a few remarks.

In metric spaces, the definition of geodesic we use here is related to the Menger convexity property, see [30, p. 78]. A stronger definition requires the proportionality of the distance between any pair of points on the curve, i.e.,
\begin{aligned} W\left( \varGamma (s),\varGamma (t)\right) =\left| t-s\right| W\left( \varGamma (0),\varGamma (1)\right) , \end{aligned}
for $$s,t\in \left[ 0,1\right]$$. It will be proved later that in fact our geodesics enjoy such a stronger property.

Clearly Proposition 4 still holds under the only assumption that $$\varSigma _{1}$$ is not singular, but the case in which both the distributions are degenerate remains excluded.

The simplest example occurs when the two subspaces, $${\text {Range}}\varSigma _{1}$$ and $${\text {Range}}\varSigma _{2}$$, are orthogonal. In this case, for every joint distribution of the random vector (X, Y) with marginals $$X\sim {\text {N}}_{n}\left( 0,\varSigma _{1}\right)$$ and $$Y\sim {\text {N}}_{n}\left( 0,\varSigma _{2}\right) ,$$ the values of X and Y lie in orthogonal subspaces, so that $$X^{*}Y=0.$$ Hence $$\left\| X-Y\right\| ^{2}=\left\| X\right\| ^{2}+\left\| Y\right\| ^{2}$$, and
\begin{aligned} \mathbb {E}\left\| X-Y\right\| ^{2}=\mathbb {E}\left\| X\right\| ^{2}+\mathbb {E}\left\| Y\right\| ^{2}={\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) . \end{aligned}
So any joint distribution of (X, Y) attains the optimal value $$\sqrt{{\text {Tr}} \left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) }.$$
If we now define $$X(t)=(1-t)X+tY$$, then
\begin{aligned} \mathbb {E}\left[ \left\| X-X(t)\right\| ^{2}\right] =\mathbb {E}\left[ t^{2}\left\| X-Y\right\| ^{2}\right] =t^{2}\left[ {\text {Tr}}\left( \varSigma _{1}\right) +{\text {Tr}}\left( \varSigma _{2}\right) \right] , \end{aligned}
consequently X(t) is the geodesic joining the two random vectors X and Y.
The previous example can be extended by taking two singular matrices
\begin{aligned} \varSigma _{1}=\sigma _{1}^{2}vv^{*}\text { and }\varSigma _{2}=\sigma _{2}^{2}ww^{*} \end{aligned}
where $$v\ne w\in \mathbb {R}^{n}$$ and $$\left\| v\right\| =\left\| w\right\| =1$$. Clearly, $${\text {Range}}\varSigma _{1}\cap {\text {Range}} \varSigma _{2}=\left\{ 0\right\}$$ and they are one-dimensional spaces spanned by vectors v and w, respectively (it is not restrictive to assume $$v^{*}w\ge 0$$, too). By Eq. (27),
\begin{aligned} W\left( \varSigma _{1},\varSigma _{2}\right) =\sqrt{\sigma _{1}^{2}+\sigma _{2}^{2}-2\sigma _{1}\sigma _{2}v^{*}w}. \end{aligned}
Despite the singularity of these matrices, one can find directly the point realizing the minimum in (20), which is the singular matrix in $${\text {Sym}}^{+}\left( 2n\right)$$:
\begin{aligned} \left[ \begin{array}{cc} \sigma _{1}^{2}vv^{*} &{} \sigma _{1}\sigma _{2}vw^{*} \\ \sigma _{1}\sigma _{2}wv^{*} &{} \sigma _{2}^{2}ww^{*} \end{array} \right] =\left[ \begin{array}{c} \sigma _{1}v \\ \sigma _{2}w \end{array} \right] \left[ \begin{array}{cc} \sigma _{1}v^{*}&\sigma _{2}w^{*} \end{array} \right] . \end{aligned}
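The rank-one example lends itself to a direct numerical check. The sketch below assumes NumPy, zero means, and arbitrarily chosen values of $$\sigma _{1},\sigma _{2}$$ and random unit vectors v, w with $$v^{*}w\ge 0$$; all names are ours. It compares Eq. (27), the closed form above, and the value obtained from the displayed rank-one coupling.

```python
import numpy as np

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

rng = np.random.default_rng(5)
n = 4
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
w = rng.standard_normal(n)
w /= np.linalg.norm(w)
if v @ w < 0:                 # not restrictive: flip the sign of w
    w = -w
sig1, sig2 = 1.5, 0.7         # illustrative values

s1 = sig1 ** 2 * np.outer(v, v)
s2 = sig2 ** 2 * np.outer(w, w)

# squared distance from Eq. (27), with zero means
s1h = sqrtm_psd(s1)
w2_formula = np.trace(s1) + np.trace(s2) - 2.0 * np.trace(sqrtm_psd(s1h @ s2 @ s1h))

# closed form for the rank-one case
w2_closed = sig1 ** 2 + sig2 ** 2 - 2.0 * sig1 * sig2 * (v @ w)

# value attained by the rank-one coupling K = sig1 sig2 v w*
k = sig1 * sig2 * np.outer(v, w)
joint = np.block([[s1, k], [k.T, s2]])
w2_coupling = np.trace(s1) + np.trace(s2) - 2.0 * np.trace(k)
min_eig = np.linalg.eigvalsh(joint).min()

print(w2_formula, w2_closed, w2_coupling, min_eig)
```

The joint dispersion matrix is non-negative definite up to round-off, and the three values coincide within the accuracy allowed by square roots of numerically singular matrices.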

## 4 Wasserstein Riemannian geometry

We have seen how to compute the geodesic for the distance W. Since the component $$\mathbb {R}^n$$ carries the standard Euclidean geometry, we focus on the geometry of the matrix part, i.e., we shall restrict our analysis to 0-mean distributions $${\text {N}}_{n}\left( 0,\varSigma \right)$$. Moreover, $$\varSigma$$ will be assumed to be positive definite. Our purpose is to endow the open set $${{\mathrm{Sym}}}^{++}\left( n\right)$$ with a structure of Riemannian manifold whose metric tensor generates the Wasserstein distance. The Riemannian metric is obtained by pushing forward the Euclidean geometry of square matrices to the space of dispersion matrices via the mapping $$\sigma :A \mapsto AA^* = \varSigma$$. This approach has been introduced by F. Otto [29] in the general non-parametric case and developed in the Gaussian case by A. Takatsu [36] and R. Bhatia [8].

In view of Prop. 1, $$\sigma :{\text {GL}} (n)\rightarrow {\text {Sym}}^{++}\left( n\right) \subset {\text {M}}(n)$$ is a submersion and $${\mathscr {H}}_{A}={\text {Sym}}\left( n\right) A$$ is the space of horizontal vectors at A.

We recall that a submersion $$f:{\text {GL}}(n)\rightarrow {\text {Sym}} ^{++}\left( n\right)$$ is called Riemannian if for all A the differential restricted to horizontal vectors
\begin{aligned} \left. df(A)\right| _{{\mathscr {H}}_{A}}:{\mathscr {H}}_{A}\rightarrow T_{f(A)}{\text {Sym}}^{++}\left( n\right) = {{\mathrm{Sym}}}\left( n\right) \end{aligned}
is an isometry i.e.,
\begin{aligned} U,V\in {\mathscr {H}}_{A}\Rightarrow \left\langle df(A)[U],df(A)[V]\right\rangle _{f(A)}=\left\langle U,V\right\rangle . \end{aligned}
(28)
A linear isometry is always 1-to-1 and, if it is onto, we can invert it and write
\begin{aligned} X,Y\in T_{f(A)}{\text {Sym}}^{++}\left( n\right) \Rightarrow \left\langle X,Y\right\rangle _{f(A)}=\left\langle \left( \left. df(A)\right| _{{ \mathscr {H}}_{A}}\right) ^{-1}X,\left( \left. df(A)\right| _{{\mathscr {H} }_{A}}\right) ^{-1}Y\right\rangle \ . \end{aligned}
Conversely, the previous equation provides the definition of a metric on $${{\mathrm{Sym}}}^{++}\left( n\right)$$ for which the submersion f is Riemannian.
If $$U_A$$ is the projection of U on $${\mathscr {H}}_A$$, then $$df(A)[U] = df(A)[U_A]$$ and Eq. (28) becomes
\begin{aligned}&U,V\in {\text {M}}\left( n\right) \Rightarrow \left\langle df(A)[U],df(A)[V]\right\rangle _{f(A)} \\&\quad = \left\langle df(A)[U_A],df(A)[V_A]\right\rangle _{f(A)} = \left\langle U_A,V_A\right\rangle . \end{aligned}
In general, a submersion induces a local diffeomorphism from horizontal spaces to the image manifold. In our case, the submersion $$\sigma$$ provides a global parameterization of the manifold of positive-definite symmetric matrices. Fix a matrix $$A\in {\text {GL}}(n)$$ such that $$\sigma (A)=AA^{*}=\varSigma$$, and consider the open convex cone
\begin{aligned} {\mathscr {H}}_{A}^{++}={\text {Sym}}^{++}\left( n\right) A\subset {\mathscr {H}} _{A}. \end{aligned}
We denote by $$\sigma _{A}$$ the restriction to $${\mathscr {H}}_{A}^{++}$$ of $$\sigma$$.

### Proposition 5

For all $$A \in {\text {GL}}\left( n\right)$$, the mapping
\begin{aligned} \sigma _{A}:{\mathscr {H}}_{A}^{++}\ni B\mapsto BB^{*}=C\in {\text {Sym}} ^{++}\left( n\right) \end{aligned}
is a bijection, with inverse
\begin{aligned} \sigma _{A}^{-1}(C) = \varSigma ^{-1/2}(\varSigma ^{1/2}C\varSigma ^{1/2})^{1/2}\varSigma ^{-1/2}A. \end{aligned}

### Proof

For each $$C\in {\text {Sym}}^{++}\left( n\right)$$, the equation
\begin{aligned} C=BB^{*}=(BA^{-1}A)(BA^{-1}A)^{*}=(BA^{-1})\varSigma (BA^{-1})^{*} \end{aligned}
is a Riccati equation for $$BA^{-1}$$. As $$B \in {\text {Sym}}^{++}\left( n\right) A$$, we have $$BA^{-1}\in {\text {Sym}}^{++}\left( n\right)$$ and
\begin{aligned} BA^{-1}=\varSigma ^{-1/2}(\varSigma ^{1/2}C\varSigma ^{1/2})^{1/2}\varSigma ^{-1/2} \end{aligned}
is the unique solution. $$\square$$
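The inversion of $$\sigma _{A}$$ can be tested by a round trip. The sketch below assumes NumPy; the name `sigma_a_inverse` is ours. The symmetric positive-definite factor $$BA^{-1}$$ is computed with the same closed form used in Section 3 for the solution T of $$T\varSigma _{1}T=\varSigma _{2}$$, here with $$\varSigma _{1}=\varSigma$$ and $$\varSigma _{2}=C$$.

```python
import numpy as np

def sqrtm_psd(s):
    """Symmetric square root of a symmetric non-negative definite matrix."""
    w, q = np.linalg.eigh(s)
    return (q * np.sqrt(np.clip(w, 0.0, None))) @ q.T

def sigma_a_inverse(a, c):
    """Return B in Sym++(n) A with B B* = c, where Sigma = A A*."""
    sigma = a @ a.T
    sh = sqrtm_psd(sigma)
    sh_inv = np.linalg.inv(sh)
    t = sh_inv @ sqrtm_psd(sh @ c @ sh) @ sh_inv    # solves T Sigma T = C
    return t @ a

rng = np.random.default_rng(6)
n = 3
a = np.eye(n) + 0.2 * rng.standard_normal((n, n))   # test matrix, assumed invertible
g = rng.standard_normal((n, n))
p0 = g @ g.T + np.eye(n)      # arbitrary element of Sym++(n)
b0 = p0 @ a                   # a point of the cone Sym++(n) A
c = b0 @ b0.T                 # its image under sigma_A

b = sigma_a_inverse(a, c)     # should recover b0
print(np.allclose(b, b0), np.allclose(b @ b.T, c))
```

Recovering `b0` from its image illustrates injectivity on the cone, while `b @ b.T == c` for an arbitrary positive-definite `c` illustrates surjectivity.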

We come now to the point, i.e., the construction of a metric based on horizontal vectors at a given matrix $$\varSigma$$. We are here using Prop. 1.

### Proposition 6

The inner product
\begin{aligned} \left\langle U,V\right\rangle _{\varSigma }\equiv W_{\varSigma }(U,V)= {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ U\right] \varSigma \mathscr {L} _ {\varSigma } \left[ V\right] \right) ,\quad U,V\in {\text {Sym}}\left( n\right) , \end{aligned}
defines a metric on $${{\mathrm{Sym}}}^{++}\left( n\right)$$ such that $$\sigma :A\mapsto AA^{*}$$ is a Riemannian submersion.

### Proof

Let $$X\in {\text {M}}(n)$$ and consider the decomposition of $$X=X_{V}+X_{H}$$ with $$X_{V}$$ vertical at A and $$X_{H}$$ horizontal at A. Then $$d\sigma (A)[X]=d\sigma (A)[X_{H}]$$ and the restriction of the derivative $$d\sigma (A)$$ to the vector space $${\mathscr {H}}_{A}$$ of horizontal vectors at A is 1-to-1 onto the tangent space of $${\text {Sym}}^{++}\left( n\right)$$ at $$AA^{*}$$, that is, $${\text {Sym}}\left( n\right)$$. For such a restriction, for each $$H\in {\mathscr {H}}_{A},$$
\begin{aligned}&\left. U=d\sigma (A)[H]=HA^{*}+AH^{*}=HA^{-1}AA^{*}+A(HA^{-1}A)^{*}\right. \\&\quad \left. =(HA^{-1})AA^{*}+AA^{*}(HA^{-1})^{*}=(HA^{-1})AA^{*}+AA^{*}(HA^{-1}),\right. \end{aligned}
so that the inverse mapping of the restriction is given by
\begin{aligned} H=\left( \left. d\sigma (A)\right| _{{\mathscr {H}}_{A}}\right) ^{-1}(U)= \mathscr {L}_{AA^{*}}[U]A, \end{aligned}
(29)
Let us push forward the inner product from $${\mathscr {H}}_{A}$$ to $$T_{AA^{*}}{\text {Sym}}^{++}\left( n\right)$$.
From Eq. (29), we have
\begin{aligned}&\left. W_{AA^{*}}(U,V)=\left\langle \left( \left. d\sigma (A)\right| _{{\mathscr {H}}_{A}}\right) ^{-1}(U),\left( \left. d\sigma (A)\right| _{{ \mathscr {H}}_{A}}\right) ^{-1}(V)\right\rangle =\right. \\&\quad \left. \left\langle \mathscr {L}_{AA^{*}}[U]A,\mathscr {L}_{AA^{*}}[V]A\right\rangle ={\text {Tr}}\left( \mathscr {L}_{AA^{*}}[U]AA^{*} \mathscr {L}_{AA^{*}}[V]\right) .\right. \end{aligned}
which depends on $$AA^{*}=\varSigma$$ only. $$\square$$

The next proposition provides a useful tensorial form of the Wasserstein Riemannian metric.

### Proposition 7

It holds
\begin{aligned} W_{\varSigma }(U,V) = \frac{1}{2}\left\langle \mathscr {L}_{\varSigma }[U],V\right\rangle \equiv \left\langle \mathscr {L}_{\varSigma }[U],V\right\rangle _{2}. \end{aligned}

### Proof

We have
\begin{aligned} {\text {Tr}}\left( \mathscr {L}_{\varSigma }[U]\varSigma \mathscr {L}_{\varSigma } [V]\right) ={\text {Tr}}\left( \mathscr {L}_{\varSigma }[V]\varSigma \mathscr {L} _{\varSigma }[U]\right) ={\text {Tr}}\left( \mathscr {L}_{\varSigma }[U]\mathscr {L} _{\varSigma }[V]\varSigma \right) , \end{aligned}
and, taking the semi-sum of the first and the last term of the previous equation,
\begin{aligned} W_{\varSigma }(U,V) = \frac{1}{2}{\text {Tr}}\left\{ \mathscr {L}_{\varSigma }[U] \left[ \mathscr {L}_{\varSigma }[V]\varSigma +\varSigma \mathscr {L}_{\varSigma }[V]\right] \right\} = \frac{1}{2}{\text {Tr}}\left\{ \mathscr {L}_{\varSigma }[U]V\right\} . \end{aligned}
$$\square$$
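The two expressions of the metric in Prop. 6 and Prop. 7 can be compared numerically. The following is a minimal sketch, assuming NumPy; the helper names `lyap` and `w_metric` are hypothetical, and solving the Lyapunov equation through the eigendecomposition of $$\varSigma$$ is a choice made here for illustration, not a method taken from the paper.

```python
import numpy as np

def lyap(sigma, u):
    """Return the symmetric X solving X @ sigma + sigma @ X = u, i.e. L_sigma[u]."""
    lam, q = np.linalg.eigh(sigma)                 # sigma = q @ diag(lam) @ q.T
    u_eig = q.T @ u @ q                            # u expressed in the eigenbasis of sigma
    x_eig = u_eig / (lam[:, None] + lam[None, :])  # entrywise division by lam_i + lam_j
    return q @ x_eig @ q.T

def w_metric(sigma, u, v):
    """Wasserstein inner product Tr(L[u] sigma L[v]) as in Prop. 6."""
    return np.trace(lyap(sigma, u) @ sigma @ lyap(sigma, v))

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal((n, n))
sigma = a @ a.T + n * np.eye(n)                    # a positive-definite test matrix
u = rng.standard_normal((n, n))
u = u + u.T                                        # symmetric tangent vectors
v = rng.standard_normal((n, n))
v = v + v.T

x = lyap(sigma, u)
residual = np.linalg.norm(x @ sigma + sigma @ x - u)  # Lyapunov equation residual
lhs = w_metric(sigma, u, v)                           # form of Prop. 6
rhs = 0.5 * np.trace(x @ v)                           # form of Prop. 7
print(residual, lhs, rhs)
```

Up to rounding, `lhs` and `rhs` should agree, and `residual` should be negligible.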

After having shown in Prop. 4 the existence of a metric geodesic for the Wasserstein distance, connecting a pair of matrices $$\varSigma _{1},\varSigma _{2} \in {{\mathrm{Sym}}}^{++}\left( n\right)$$, we prove that the same curve is a Riemannian geodesic, see R.J. McCann [26] and also [8, 36].

More generally, let us discuss the existence of affine horizontal surfaces in $${\text {GL}}\left( n\right)$$ and the existence of geodesically convex surfaces in $${{\mathrm{Sym}}}^{++}\left( n\right)$$. As a particular case, the results give rise to the desired Riemannian geodesics.

A surface $$\theta \mapsto A(\theta ) \in {\text {GL}}\left( n\right)$$, with $$\theta \in \varTheta$$, where $$\varTheta$$ is an open subset of $$\mathbb {R}^k$$, is called horizontal for the submersion $$\sigma :A \mapsto AA^*$$, if $$\partial /\partial {\theta _j} A(\theta ) \in {\mathscr {H}}_{A(\theta )}$$ for each j and $$\theta$$, i.e.,
\begin{aligned} \left( \frac{\partial }{\partial \theta _j} A(\theta )\right) A(\theta )^{-1} \in {{\mathrm{Sym}}}\left( n\right) . \end{aligned}
(30)
A surface is horizontal if, and only if, every smooth curve which lies in it is horizontal.

### Proposition 8

1.
The surface $$\varTheta \ni \theta \mapsto A(\theta ) \in {\text {GL}}\left( n\right)$$ is horizontal for $$\sigma$$ if, and only if,
\begin{aligned} \frac{\partial }{\partial \theta _j} A^*(\theta ) A(\theta )=A^*(\theta ) \frac{\partial }{\partial \theta _j} A(\theta ), \quad j=1,\dots ,k, \quad \theta \in \varTheta \ . \end{aligned}
(31)

2.
Let
\begin{aligned} A(\theta ) = A_0 + \sum _{i=1}^k \theta _i (A_i - A_0) , \quad \theta \in \varTheta , \end{aligned}
(32)
be a surface in $${\text {GL}}\left( n\right)$$ with the k-simplex of $$\mathbb {R}^k$$ contained in $$\varTheta$$. The surface is horizontal if, and only if,
\begin{aligned} A^*_j A_i = A^*_i A_j , \quad i,j=0,\dots ,k . \end{aligned}

3.
Let $$\varSigma _0,\varSigma _1 \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ be given, and choose $$A_0,A_1$$ such that $$\varSigma _0 = A_0A_0^*$$ and $$\varSigma _1 = A_1A_1^*$$. The line
\begin{aligned} A(\theta ) = (1-\theta ) A_0+\theta A_1 \end{aligned}
(33)
is horizontal for $$\theta$$ in an open interval containing 0 and 1 if, and only if, $$A_1 = TA_0$$ with $$T \in {{\mathrm{Sym}}}^{++}\left( n\right)$$. This implies T is the solution of the Riccati equation $$T\varSigma _0T = \varSigma _1$$.

4.
Let $$\varSigma _j = A_jA_j^*\in {{\mathrm{Sym}}}^{++}\left( n\right)$$, $$j=0,1,\dots ,k$$, be given. The surface
\begin{aligned} \theta \mapsto A_0 + \sum _{j=1}^k \theta _j (A_j-A_0) \end{aligned}
is horizontal in an open set of parameters containing the k-simplex if, and only if, $$A_i = T_{ij} A_j$$ with $$T_{ij} \in {{\mathrm{Sym}}}^{++}\left( n\right)$$, $$i,j=0,\dots ,k$$.

### Proof

1.

Eq. (30) is equivalent to $$A^*(\theta )^{-1}\partial / \partial {\theta _j} A^*(\theta ) = \partial / \partial {\theta _j} A(\theta )A(\theta )^{-1}$$ hence to $$\partial / \partial {\theta _j} A^*(\theta ) A(\theta ) = A^*(\theta ) \partial / \partial {\theta _j} A(\theta )$$.

2.
For the surface in Eq. (32) we have $$\partial /\partial {\theta _j} A(\theta ) = A_j - A_0$$ so that Eq. (31) becomes
\begin{aligned} (A_j - A_0)^* A(\theta )=A^*(\theta ) (A_j - A_0), \quad j=1,\dots ,k\ , \quad \theta \in \varTheta . \end{aligned}
If $$\theta = 0$$, it holds $$A_j^*A_0 = A_0^*A_j$$, $$j = 1,\dots ,k$$. If $$\theta =e_i$$, then $$A(\theta ) = A_i$$ and, using the previous identities, it holds $$A_j^* A_i = A_i^* A_j$$ for $$i,j=1,\dots ,k$$. The converse holds by linearity.

3.
Assume $$\theta \mapsto A(\theta )$$ of Eq. (33) is horizontal on $$\varTheta$$. Then, from the previous item we know $$A_1^*A_0 = A_0^* A_1$$. In turn, this implies $${A_0^*}^{-1}A_1^* = A_1 A_0^{-1}$$, hence $$T = A_1A_0^{-1} \in {{\mathrm{Sym}}}\left( n\right)$$. It follows $$T \varSigma _0 T = A_1A_0^{-1} \varSigma _0 (A_0^*)^{-1} A_1^* = \varSigma _1$$. It remains to show that T is positive definite. Actually, it holds
\begin{aligned} (1 - \theta ) A_0 + \theta A_1 = \left( (1 - \theta ) I + \theta T\right) A_0 \in {\text {GL}}\left( n\right) , \quad \theta \in \varTheta . \end{aligned}
If $$\lambda _i$$ are eigenvalues of the matrix T, then the eigenvalues of the matrix $$(1 - \theta ) I + \theta T$$ are $$(1-\theta ) + \theta \lambda _i$$. As they are never zero for any $$\theta \in [0,1]$$, it follows that no $$\lambda _i$$ can be negative. The $$\lambda _i$$ are not zero by assumption and the conclusion $$T \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ follows.

Conversely, if $$T \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ and $$TA_0 = A_1$$, then $$A_1^* A_0 = A_0^* T A_0$$ is symmetric. Consequently, for all $$\theta$$ such that $$(1-\theta )A_0+\theta A_1 \in {\text {GL}}\left( n\right)$$ the curve is horizontal. On the other hand, for $$\theta \in [0,1]$$ the matrix $$(1-\theta )I + \theta T$$ is a convex combination of positive definite matrices, hence it is positive definite, and by continuity it remains positive definite on an open interval containing [0, 1].

4.

The proof follows exactly the same arguments as in the two-point case of the previous item.$$\square$$
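Item 3 of Prop. 8 can be illustrated numerically. The sketch below assumes NumPy; the helper names `sqrtm_spd`, `riccati` and `asymmetry` are hypothetical, and the random test matrices are arbitrary. It measures how far $$\dot{A}(\theta )A(\theta )^{-1}$$ is from being symmetric along the line of Eq. (33), first with $$A_1 = TA_0$$ and then with another factor of $$\varSigma _1$$.

```python
import numpy as np

def sqrtm_spd(m):
    """Symmetric square root of a symmetric positive-definite matrix."""
    lam, q = np.linalg.eigh(m)
    return (q * np.sqrt(lam)) @ q.T

def riccati(sigma0, sigma1):
    """Positive-definite solution T of T @ sigma0 @ T = sigma1."""
    r = sqrtm_spd(sigma0)
    r_inv = np.linalg.inv(r)
    return r_inv @ sqrtm_spd(r @ sigma1 @ r) @ r_inv

def asymmetry(a0, a1, theta):
    """Norm of the antisymmetric part of A'(theta) A(theta)^{-1} on the line a0 -> a1."""
    a = (1 - theta) * a0 + theta * a1
    s = (a1 - a0) @ np.linalg.inv(a)
    return np.linalg.norm(s - s.T)

rng = np.random.default_rng(1)
n = 3
b0 = rng.standard_normal((n, n))
b1 = rng.standard_normal((n, n))
sigma0 = b0 @ b0.T + np.eye(n)
sigma1 = b1 @ b1.T + np.eye(n)

rot, _ = np.linalg.qr(rng.standard_normal((n, n)))  # an orthogonal matrix
a0 = sqrtm_spd(sigma0) @ rot                        # a factor of sigma0
a1_hor = riccati(sigma0, sigma1) @ a0               # A1 = T A0, so the line is horizontal
a1_other = sqrtm_spd(sigma1)                        # another factor of sigma1

hor = max(asymmetry(a0, a1_hor, th) for th in (0.0, 0.3, 1.0))
other = asymmetry(a0, a1_other, 0.3)
print(hor, other)
```

The first value should vanish up to rounding, while the second is generically far from zero, which shows that horizontality depends on the choice of the factors and not only on $$\varSigma _0,\varSigma _1$$.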

The previous proposition shows that the metric geodesic derived from the Wasserstein distance coincides with the geodesic we obtain from the submersion argument. Moreover, in the next corollary we also characterize the existence of geodesically convex surfaces with given vertices.

### Corollary 1

1.
Given $$\varSigma _0, \varSigma _1 \in {{\mathrm{Sym}}}^{++}\left( n\right)$$, there exists an open interval $$\varTheta \supset [0,1]$$ such that the curve
\begin{aligned} \varSigma (\theta ) = \left( (1-\theta )I + \theta T\right) \varSigma _0 \left( (1-\theta )I + \theta T\right) , \quad \theta \in \varTheta , \end{aligned}
(34)
is the Wasserstein Riemannian geodesic through $$\varSigma _0$$ and $$\varSigma _1$$, with $$T\varSigma _0T = \varSigma _1$$.

2.
Given $$\varSigma _0, \dots ,\varSigma _k \in {{\mathrm{Sym}}}^{++}\left( n\right)$$, there exists an open set $$\varTheta$$ containing the k-simplex such that the surface
\begin{aligned} \varSigma (\theta ) = \left( I + \sum _{j=1}^k\theta _j (T_j-I)\right) \varSigma _0 \left( I + \sum _{j=1}^k\theta _j (T_j-I)\right) , \quad \theta \in \varTheta , \end{aligned}
is the Wasserstein Riemannian geodesic surface through $$\varSigma _0, \dots ,\varSigma _k$$ if, and only if, the matrices $$T_j$$, which are the positive definite solutions of the Riccati equations $$T_j \varSigma _0T_j = \varSigma _j$$, $$j=1,\dots ,k$$, pairwise commute.

### Proof

1.

Pick $$A_0 = \varSigma _0^{1/2} U$$, with $$U \in {\text {O}}(n)$$, and $$A_1 = TA_0$$, where T is the positive definite solution of the Riccati equation $$T\varSigma _0T = \varSigma _1$$ and so $$A_1A_1^* = \varSigma _1$$. By Prop. 8, Item 3, $$\theta \mapsto A(\theta )$$ is horizontal in $${\text {GL}}\left( n\right)$$. Consequently, $$\varSigma (\theta )=A(\theta )A^*(\theta )$$ is a geodesic.

2.

In view of Prop. 8, Item 4, $$T_{ij} = T_iT_j^{-1}$$. The surface is horizontal if, and only if, each $$T_{ij}$$ is symmetric, that is, $$T_iT_j^{-1} = T_j^{-1} T_i$$, which, in turn, is equivalent to $$T_iT_j = T_jT_i$$.$$\square$$

Unlike the two-point case, the commutativity condition puts severe restrictions on the set of matrices $$\varSigma _0$$,...,$$\varSigma _k$$ generating a geodesic surface, when $$k>1$$. For instance, if $$\varSigma _0=I$$, then we have $$T_i=\varSigma _i^{1/2}$$. Hence, Corollary 1 entails that the matrices I,$$\varSigma _1$$,...,$$\varSigma _k$$ generate a geodesic surface if, and only if, they pairwise commute.

## 5 Wasserstein Riemannian exponential

We aim now at reformulating a Riemannian geodesic in terms of the exponential map. In other words, the purpose is that of writing the geodesic arc passing through a given point and having a given velocity at the point itself.

The velocity of the geodesic of Eq. (34) is
\begin{aligned} {\dot{\varSigma }}(\theta )=(T-I)\varSigma _{0}+\varSigma _{0}(T-I)+2\theta (T-I)\varSigma _{0}(T-I) . \end{aligned}
Using the horizontal lift $$\varSigma (\theta ) = A(\theta )A^*(\theta )$$, the velocity turns out to be
\begin{aligned} {{\dot{\varSigma }}}(\theta ) = \dot{A}(\theta ) A^*(\theta ) + A(\theta ) {\dot{A}}^*(\theta ) = \dot{A}(\theta ) A^{-1}(\theta ) \varSigma (\theta ) + \varSigma (\theta ) {A}^*(\theta )^{-1} {\dot{A}}^*(\theta ) , \end{aligned}
where $$\dot{A}(\theta ) A^{-1}(\theta ) \in {{\mathrm{Sym}}}\left( n\right)$$ by Eq. (30). Therefore,
\begin{aligned} \dot{A}(\theta ) A^{-1}(\theta ) = {A}^*(\theta )^{-1} {\dot{A}}^*(\theta ) = \mathscr {L} _ {\varSigma (\theta )} \left[ {{\dot{\varSigma }}}(\theta )\right] . \end{aligned}
In particular, the initial velocity is
\begin{aligned} {\dot{\varSigma }}(0)=(T-I)\varSigma (0)+\varSigma (0)(T-I) . \end{aligned}
(35)
and $$T - I = \mathscr {L} _ {\varSigma (0)} \left[ {{\dot{\varSigma }}}(0)\right]$$.
Let us compute the norm of the velocity in the Riemannian metric. The value of $$W_{\varSigma (\theta )}({{\dot{\varSigma }}}(\theta ),{{\dot{\varSigma }}}(\theta ))$$ is
\begin{aligned}&{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma (\theta )} \left[ \dot{\varSigma }(\theta )\right] \varSigma (\theta )\mathscr {L} _ {\varSigma (\theta )} \left[ \dot{\varSigma }(\theta )\right] \right) \\&\quad = {{\mathrm{Tr}}}\left( \dot{A}(\theta ) A^{-1}(\theta ) A(\theta )A^*(\theta ) {A}^*(\theta )^{-1} {\dot{A}}^*(\theta )\right) \\&\quad = {{\mathrm{Tr}}}\left( \dot{A}(\theta ){\dot{A}}^*(\theta )\right) = {{\mathrm{Tr}}}\left( (T-I)\varSigma (0)(T-I)\right) . \end{aligned}
It is constant, as we expect from the definition by isometric submersion. Also, we can confirm that the length of the geodesic is
\begin{aligned}&\sqrt{{{\mathrm{Tr}}}\left( (T-I)\varSigma (0)(T-I)\right) } = \sqrt{{{\mathrm{Tr}}}\left( \varSigma _0 + \varSigma _1 - T\varSigma _0 - \varSigma _0T\right) } \\&\quad = \sqrt{{{\mathrm{Tr}}}\left( \varSigma _0 + \varSigma _1 - 2(\varSigma _0^{1/2}\varSigma _1\varSigma _0^{1/2})^{1/2}\right) } . \end{aligned}
The last equality follows from the relation $$\varSigma _0^{1/2} T \varSigma _0^{1/2} = (\varSigma _0^{1/2}\varSigma _1\varSigma _0^{1/2})^{1/2}$$.
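The geodesic of Eq. (34) and its length can be checked numerically. This is a minimal sketch assuming NumPy; the helper names `sqrtm_spd`, `riccati` and `geodesic` are hypothetical. It compares the squared speed $${{\mathrm{Tr}}}\left( (T-I)\varSigma _0(T-I)\right)$$ with the closed form of the squared Wasserstein distance between the two centered Gaussians, $${{\mathrm{Tr}}}\left( \varSigma _0+\varSigma _1-2(\varSigma _0^{1/2}\varSigma _1\varSigma _0^{1/2})^{1/2}\right)$$.

```python
import numpy as np

def sqrtm_spd(m):
    """Symmetric square root of a symmetric positive-definite matrix."""
    lam, q = np.linalg.eigh(m)
    return (q * np.sqrt(lam)) @ q.T

def riccati(sigma0, sigma1):
    """Positive-definite solution T of T @ sigma0 @ T = sigma1."""
    r = sqrtm_spd(sigma0)
    r_inv = np.linalg.inv(r)
    return r_inv @ sqrtm_spd(r @ sigma1 @ r) @ r_inv

def geodesic(sigma0, sigma1, theta):
    """Point Sigma(theta) of the geodesic of Eq. (34)."""
    n = sigma0.shape[0]
    m = (1 - theta) * np.eye(n) + theta * riccati(sigma0, sigma1)
    return m @ sigma0 @ m

rng = np.random.default_rng(2)
n = 3
b0 = rng.standard_normal((n, n))
b1 = rng.standard_normal((n, n))
sigma0 = b0 @ b0.T + np.eye(n)
sigma1 = b1 @ b1.T + np.eye(n)

d = riccati(sigma0, sigma1) - np.eye(n)
speed_sq = np.trace(d @ sigma0 @ d)                 # constant squared speed
r = sqrtm_spd(sigma0)
dist_sq = np.trace(sigma0 + sigma1 - 2 * sqrtm_spd(r @ sigma1 @ r))
midpoint = geodesic(sigma0, sigma1, 0.5)
print(speed_sq, dist_sq)
```

The two printed numbers should coincide up to rounding, and the end points of `geodesic` reproduce $$\varSigma _0$$ and $$\varSigma _1$$.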
By substituting Eq. (35) into the equation of the geodesic (34), we get
\begin{aligned} \varSigma (\theta ) =\varSigma (0)+\theta \left[ (T-I)\varSigma (0)+\varSigma (0)(T-I)\right] +\theta ^{2}(T-I)\varSigma (0)(T-I) \\ =\varSigma (0)+\theta {\dot{\varSigma }}(0)+\theta ^{2}\mathscr {L}_{\varSigma (0)}[\dot{\varSigma } (0)]\varSigma (0)\mathscr {L}_{\varSigma (0)}[{\dot{\varSigma }}(0)]. \end{aligned}
We are thus led to the following definition; see, for example, [1, pp. 101–102].

### Definition 1

For any $$C\in {\text {Sym}}^{++}\left( n\right)$$ and $$V\in {\text {Sym}}\left( n\right) \simeq T_{C}{\text {Sym}}^{++}\left( n\right)$$, the Wasserstein Riemannian exponential is
\begin{aligned} {\text {Exp}}_{C}\left( V\right) =C+V+\mathscr {L}_{C}[V]C\mathscr {L}_{C}[V]=( \mathscr {L}_{C}[V]+I)C(\mathscr {L}_{C}[V]+I). \end{aligned}
(36)
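Definition 1 translates directly into a few lines of code. The sketch below assumes NumPy; `lyap` and `exp_map` are hypothetical names, and the Lyapunov equation is again solved through an eigendecomposition for illustration. It checks that the two expressions in Eq. (36) agree and that the curve $$\theta \mapsto {\text {Exp}}_{C}\left( \theta V\right)$$ has velocity V at $$\theta = 0$$.

```python
import numpy as np

def lyap(sigma, u):
    """Return the symmetric X solving X @ sigma + sigma @ X = u, i.e. L_sigma[u]."""
    lam, q = np.linalg.eigh(sigma)
    x_eig = (q.T @ u @ q) / (lam[:, None] + lam[None, :])
    return q @ x_eig @ q.T

def exp_map(c, v):
    """Wasserstein Riemannian exponential, factored form of Eq. (36)."""
    m = np.eye(c.shape[0]) + lyap(c, v)
    return m @ c @ m

rng = np.random.default_rng(3)
n = 3
b = rng.standard_normal((n, n))
c = b @ b.T + n * np.eye(n)                 # base point in Sym++(n)
v = rng.standard_normal((n, n))
v = v + v.T                                 # tangent vector in Sym(n)

x = lyap(c, v)
poly = c + v + x @ c @ x                    # polynomial form of Eq. (36)
h = 1e-6
velocity = (exp_map(c, h * v) - exp_map(c, -h * v)) / (2 * h)  # central difference at 0
print(np.linalg.norm(exp_map(c, v) - poly), np.linalg.norm(velocity - v))
```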

The next proposition collects some properties of the Riemannian exponential.

### Proposition 9

1.
All geodesics emanating from a point $$C\in {{\mathrm{Sym}}}^{++}\left( n\right)$$ are of the form $$\varSigma (\theta )={\text {Exp}}_{C}\left( \theta V\right)$$, with $$\theta \in J_{V}$$, where $$J_{V}$$ is the open interval about the origin:
\begin{aligned} J_{V}=\left\{ \theta \in \mathbb {R} \vert I+\theta \mathscr {L} _ {C} \left[ V\right] \in {{\mathrm{Sym}}}^{++}\left( n\right) \right\} \ . \end{aligned}

2.
The map $$V\mapsto {\text {Exp}}_{C}\left( V\right) ,$$ restricted to the open set
\begin{aligned} \varTheta =\left\{ V\in {\text {Sym}}\left( n\right) :I+\mathscr {L}_{C}[V]\in {\text {Sym}}^{++}\left( n\right) \right\} , \end{aligned}
is a diffeomorphism of $$\varTheta$$ into $${\text {Sym}}^{++}\left( n\right)$$ with inverse
\begin{aligned} {\text {Log}}_{C}\left( B\right) = (BC)^{1/2} + (CB)^{1/2} - 2C \ ; \end{aligned}

3.
The derivative of the Riemannian exponential is
\begin{aligned} d_{X}\left( V\longmapsto {\text {Exp}}_{C}\left( V\right) \right) =X+\mathscr {L} _{C}[X]C\mathscr {L}_{C}[V]+\mathscr {L}_{C}[V]C\mathscr {L}_{C}[X]. \end{aligned}

### Remark 1

Notice that $$I + \theta \mathscr {L} _ {C} \left[ V\right] = \mathscr {L} _ {C} \left[ 2C+\theta V\right]$$, because $$\mathscr {L} _ {C} \left[ 2C\right] = I$$; hence, $$\theta \in J_V$$ if $$2C + \theta V \in {{\mathrm{Sym}}}^{++}\left( n\right)$$.

Clearly, $$0\in J_{V}$$ and $${\text {Exp}}_C(0) = C$$ and the maximal open interval containing 0 in which $${\text {Exp}}_C(\theta V) \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ is precisely $$J_V$$. Moreover, the interval $$J_{V}$$ is unbounded from the right, i.e., it is of the kind $$J_{V}=\left( {\bar{\theta }} ,+\infty \right)$$, provided $$V\in {{\mathrm{Sym}}}^+\left( n\right)$$. Likewise, $$J_{V}=\left( -\infty ,{\bar{\theta }}\right)$$, if $$- V\in {{\mathrm{Sym}}}^+\left( n\right)$$. Similarly, $$\varTheta$$ is an open set containing the origin and so $$V\mapsto {\text {Exp}}_{C}\left( V\right)$$ is a local diffeomorphism around the origin.

Since the geodesics are not defined for all values of the parameter $$\theta \in \mathbb {R}$$, we infer that the Riemannian manifold $${\text {Sym}} ^{++}\left( n\right)$$ is geodesically incomplete. Of course this is not surprising: $${\text {Sym}}^{++}\left( n\right)$$ is not a complete metric space, and hence the Hopf-Rinow theorem implies that it cannot be geodesically complete, see M.P. do Carmo [10].

### Proof

1.
Let
\begin{aligned} \varSigma (\theta ) = {\text {Exp}}_{C}\left( \theta V\right) =C+\theta V+\theta ^{2}\mathscr {L} _{C}[V]C\mathscr {L}_{C}[V] , \quad \theta \in J_{V} . \end{aligned}
Clearly, $$\varSigma (0)=C$$ and $${\dot{\varSigma }}(0)=V$$. Pick a scalar $${\bar{\theta }}\in J_{V}$$ and consider the two matrices $$\varSigma \left( 0\right)$$ and $$\varSigma \left( {\bar{\theta }}\right)$$ belonging to the curve $$\varSigma .$$ Introduce the new parameterization $${\tilde{\varSigma }}\left( \tau \right) =\varSigma \left( \tau {\bar{\theta }}\right)$$, so that $${\tilde{\varSigma }}\left( 0\right) =\varSigma \left( 0\right)$$ and $${\tilde{\varSigma }}\left( 1\right) =\varSigma \left( {\bar{\theta }} \right)$$. We have,
\begin{aligned} {\tilde{\varSigma }}\left( \tau \right) =C+\tau ({\bar{\theta }}V) + \tau ^{2}\mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] C \mathscr {L} _ {C} \left[ {\bar{\theta }}V\right] \ . \end{aligned}
(37)
Setting $${\tilde{T}}-I=\mathscr {L} _ {C} \left[ {\bar{\theta }}V\right]$$, we have $${\tilde{T}} \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ and
\begin{aligned} {\tilde{T}} C {\tilde{T}} = (I+\mathscr {L} _ {C} \left[ {{\bar{\theta }}}V\right] ) C (I+\mathscr {L} _ {C} \left[ {{\bar{\theta }}}V\right] ) = {{\tilde{\varSigma }}}(1) , \end{aligned}
and the Eq. (37) above becomes
\begin{aligned}&{\tilde{\varSigma }}\left( \tau \right) = C+\tau ({\tilde{T}}-I)C+\tau C({\tilde{T}}-I)+\tau ^{2}({\tilde{T}}-I)C({\tilde{T}}-I) \\&\quad =\left[ \left( 1-\tau \right) I+\tau {\tilde{T}}\right] C\left[ \left( 1-\tau \right) I+\tau {\tilde{T}}\right] , \end{aligned}
which is the geodesic connecting $$\varSigma (0) = {{\tilde{\varSigma }}}(0) = C$$ to $${{\tilde{\varSigma }}}(1) = \varSigma ({{\bar{\theta }}})$$.

2.
By Eq. (36) the solution to the Riccati equation
\begin{aligned} {\text {Exp}}_{C}\left( V\right) =(I+\mathscr {L}_{C}[V])C(I+\mathscr {L} _{C}[V])=B \end{aligned}
is
\begin{aligned} I+\mathscr {L}_{C}[V]=C^{-1/2}(C^{1/2}BC^{1/2})^{1/2}C^{-1/2}\ \end{aligned}
provided $$I+\mathscr {L}_{C}[V]\in {\text {Sym}}^{++}\left( n\right)$$. This is true in a sufficiently small neighborhood $$\left\| V\right\| <r$$ of the origin. The inversion of the operator $$\mathscr {L}_{C}[\cdot ]$$ and Eq. (9) provide the desired formula for $${\text {Log}}_{C}\left( B\right)$$.

3.

The derivative follows from a simple bilinear computation.

$$\square$$
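Items 2 and 3 of Prop. 9 lend themselves to a numerical check. The following sketch assumes NumPy; the names `lyap`, `sqrtm_spd`, `exp_map`, `log_map` and `d_exp` are hypothetical. It uses the identity $$(BC)^{1/2} = C^{-1/2}(C^{1/2}BC^{1/2})^{1/2}C^{1/2}$$, whose transpose is $$(CB)^{1/2}$$, to evaluate the logarithm.

```python
import numpy as np

def lyap(sigma, u):
    """Return the symmetric X solving X @ sigma + sigma @ X = u, i.e. L_sigma[u]."""
    lam, q = np.linalg.eigh(sigma)
    x_eig = (q.T @ u @ q) / (lam[:, None] + lam[None, :])
    return q @ x_eig @ q.T

def sqrtm_spd(m):
    """Symmetric square root of a symmetric positive-definite matrix."""
    lam, q = np.linalg.eigh(m)
    return (q * np.sqrt(lam)) @ q.T

def exp_map(c, v):
    """Exp_C(V) = (I + L_C[V]) C (I + L_C[V])."""
    m = np.eye(c.shape[0]) + lyap(c, v)
    return m @ c @ m

def log_map(c, b):
    """Log_C(B) = (BC)^{1/2} + (CB)^{1/2} - 2C."""
    r = sqrtm_spd(c)
    bc_half = np.linalg.inv(r) @ sqrtm_spd(r @ b @ r) @ r   # (BC)^{1/2}
    return bc_half + bc_half.T - 2 * c

def d_exp(c, v, x):
    """Derivative of V -> Exp_C(V) in the direction X (Prop. 9, Item 3)."""
    lv, lx = lyap(c, v), lyap(c, x)
    return x + lx @ c @ lv + lv @ c @ lx

rng = np.random.default_rng(4)
n = 3
a = rng.standard_normal((n, n))
b_raw = rng.standard_normal((n, n))
c = a @ a.T + n * np.eye(n)
b = b_raw @ b_raw.T + np.eye(n)
v = rng.standard_normal((n, n))
v = 0.1 * (v + v.T)                     # small, so that I + L_C[V] stays positive definite
x = rng.standard_normal((n, n))
x = x + x.T

round_trip_b = exp_map(c, log_map(c, b))          # should reproduce b
round_trip_v = log_map(c, exp_map(c, v))          # should reproduce v
h = 1e-5
fd = (exp_map(c, v + h * x) - exp_map(c, v - h * x)) / (2 * h)
print(np.linalg.norm(round_trip_b - b), np.linalg.norm(fd - d_exp(c, v, x)))
```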

The second order properties of the geodesic and the Riemannian exponential will be established in Sect. 7.6.

We have found the form of the Riemannian metric associated to Wasserstein distance. In turn, the inner product equals the second order approximation of $$W^2$$. This is a general fact, whose interpretation is based on the discussion of the natural gradient of the metric as solution to the problem
\begin{aligned} {\left\{ \begin{array}{ll} \max f(X+H) - f(X) \\ \text {subject to} \\ W^2(X,X+H) = {\varepsilon \text { (small and fixed)}} \end{array}\right. } \end{aligned}
which allows the identification of the direction of the maximal increase of the function f with the natural gradient, according to the name introduced by Amari [4], i.e., the Riemannian gradient as defined below.
The Riemannian gradient is the gradient with respect to the inner product of the metric. We denote by $$\nabla$$ the gradient with respect to the inner product $$\left\langle \cdot ,\cdot \right\rangle _{2}$$ and by $${{\mathrm{grad}}}$$ the gradient with respect to the Riemannian metric. By Prop. 7, $$W_\varSigma (X,Y) = \left\langle \mathscr {L} _ {\varSigma } \left[ X\right] ,Y\right\rangle _{2}$$, hence for each smooth scalar field $$\phi$$ we have
\begin{aligned} {{\mathrm{grad}}}\phi (\varSigma ) = {\mathscr {L}}_{\varSigma }^{-1} [\nabla \phi (\varSigma )] = \nabla \phi (\varSigma ) \varSigma + \varSigma \nabla \phi (\varSigma ) , \end{aligned}
where the second equality follows from the definition of $$\mathscr {L}_\varSigma$$. Conversely,
\begin{aligned} \mathscr {L}_{\varSigma }\left[ {\text {grad}}\phi (\varSigma )\right] =\nabla \phi (\varSigma ) . \end{aligned}
The gradient flow of a smooth scalar field $$\phi$$ is the flow generated by the vector field
\begin{aligned} \gamma \mapsto (\gamma , - {\text {grad}}\phi (\gamma )) , \end{aligned}
that is, the flow of the differential equation
\begin{aligned} {\dot{\gamma }}(\theta )=-{\text {grad}}\phi (\gamma (\theta ))=-\left( \nabla \phi (\gamma (\theta ))\gamma (\theta )+\gamma (\theta )\nabla \phi (\gamma (\theta ))\right) . \end{aligned}
The gradient flow equation is the model for many optimization problems which are based on various discrete time approximations of the gradient flow. It should be noted that the expression of the natural gradient in the Wasserstein Riemannian metric is simple and does not require any time-consuming operation, as is the case in optimization methods using the Fisher Riemannian metric. We do not discuss this issue here and refer to [1, 4, 24].
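As an illustration of the formulas above, the following sketch (assuming NumPy) evaluates the Riemannian gradient of a toy scalar field and performs one discrete step. Everything specific here is an assumption made for the example: the field $$\phi (\varSigma ) = \left\langle \varSigma - S,\varSigma - S\right\rangle _{2}/2$$ with a fixed target S, the names `lyap`, `phi`, `riemannian_grad` and `exp_step`, and the choice of stepping along the exponential, $${\text {Exp}}_{\varSigma }(-h\,{\text {grad}}\phi ) = (I-h\nabla \phi )\varSigma (I-h\nabla \phi )$$, which follows from Eq. (36) because $$\mathscr {L}_{\varSigma }[{\text {grad}}\phi ] = \nabla \phi$$. It is not a scheme taken from the paper.

```python
import numpy as np

def lyap(sigma, u):
    """Return the symmetric X solving X @ sigma + sigma @ X = u, i.e. L_sigma[u]."""
    lam, q = np.linalg.eigh(sigma)
    x_eig = (q.T @ u @ q) / (lam[:, None] + lam[None, :])
    return q @ x_eig @ q.T

def phi(sigma, target):
    """Toy scalar field <sigma - target, sigma - target>_2 / 2, with <X,Y>_2 = Tr(XY)/2."""
    d = sigma - target
    return 0.25 * np.trace(d @ d)

def riemannian_grad(euclid_grad, sigma):
    """grad phi = nabla phi Sigma + Sigma nabla phi."""
    return euclid_grad @ sigma + sigma @ euclid_grad

def exp_step(sigma, euclid_grad, h):
    """One step Exp_Sigma(-h grad phi) = (I - h nabla phi) Sigma (I - h nabla phi)."""
    m = np.eye(sigma.shape[0]) - h * euclid_grad
    return m @ sigma @ m

sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
target = np.eye(2)
euclid = sigma - target                        # nabla phi with respect to <.,.>_2
grad = riemannian_grad(euclid, sigma)

# Defining property of the gradient: W_Sigma(grad phi, U) = d_U phi for every symmetric U.
rng = np.random.default_rng(5)
u = rng.standard_normal((2, 2))
u = u + u.T
w_val = np.trace(lyap(sigma, grad) @ sigma @ lyap(sigma, u))
eps = 1e-4
fd = (phi(sigma + eps * u, target) - phi(sigma - eps * u, target)) / (2 * eps)

sigma_next = exp_step(sigma, euclid, 0.1)
print(w_val, fd, phi(sigma, target), phi(sigma_next, target))
```

A step of this form keeps the iterate symmetric and positive definite whenever $$I - h\nabla \phi$$ is nonsingular, which is one reason to prefer it here over a plain Euler step.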

### 6.1 Gradient flow and optimization

With reference to the full Gaussian distribution, one can consider smooth functions defined on $${\mathbb {R}}^n\times {\text {Sym}}^{++}\left( n\right)$$. The first component of the gradient does not require a special gradient as the Riemannian structure is the Euclidean one. The full gradient will thus have two components:
\begin{aligned}&{{\mathrm{grad}}}\phi (\mu ,\varSigma ) = \left( \nabla _{1}\phi (\mu ,\varSigma ),{\text {grad}}_{2}\phi (\mu ,\varSigma )\right) \nonumber \\&\quad =\left( \nabla _{1}\phi (\mu ,\varSigma ),\nabla _{2}\phi (\mu ,\varSigma )\varSigma +\varSigma \nabla _{2}\phi (\mu ,\varSigma )\right) . \end{aligned}
(38)
An important example is based on the gradient flow of the mean value of an objective function $$f :\mathbb {R}^n \rightarrow \mathbb {R}$$. Its Euler scheme is used in optimization, see [1, Ch. 4] and [23]. In the second example in Sect. 6.2 we discuss the gradient flow of the entropy function of a centered Gaussian.
We call relaxation to the full Gaussian model of the objective function $$f:{\mathbb {R}}^{n}\rightarrow {\mathbb {R}}$$ the function
\begin{aligned} \phi (\mu ,\varSigma )=\mathbb {E}\left[ f(X)\right] ,\quad X\sim {\text {N}} _{n}\left( \mu ,\varSigma \right) . \end{aligned}
If we included the Dirac measures in the Gaussian model, then $$f(x)=\phi (x,0)$$ and the function $$\phi$$ would actually be an extension of the given function. However, we consider only $$\varSigma \in {\text {Sym}} ^{++}\left( n\right)$$ in order to work with a function defined on our manifold.

There are two ways to calculate the expected value as a function of $$\mu$$ and $$\varSigma$$. Each of them leads to a peculiar expression of the natural gradient.

The first one arises from the relation
\begin{aligned} \phi (\mu ,\varSigma )=\mathbb {E}\left[ f(\varSigma ^{1/2}Z+\mu )\right] ,\quad Z\sim {\text {N}}_{n}\left( 0,I\right) , \end{aligned}
which will lead to an equation for the gradient involving the derivatives of f. The second one uses
\begin{aligned} \phi (\mu ,\varSigma )=\int f(x)(2\pi )^{-n/2}{\text {det}}\left( \varSigma \right) ^{-1/2}{\text {exp}}\left( -\frac{1}{2}(x-\mu )^{*}\varSigma ^{-1}(x-\mu )\right) \ dx. \end{aligned}
In the second case, the natural gradient will be achieved by an equation not involving the gradient of the function f. Both forms have their own field of application.
Consider the first approach, under standard conditions allowing differentiation under the expectation sign. We have
\begin{aligned} \nabla _{1}\phi (\mu ,\varSigma )=\mathbb {E}\left[ \nabla f(\varSigma ^{1/2}Z+\mu ) \right] =\mathbb {E}\left[ \nabla f(X)\right] \ . \end{aligned}
By means of Eq. (14), it is straightforward to compute $$d_{U}\left( \varSigma \mapsto \phi (\mu ,\varSigma )\right)$$.
Note that $$\nabla f$$ is a column vector and so $$\nabla ^{*}f$$ will be a row vector. We have
\begin{aligned} \begin{aligned} d_{U}\phi (\mu ,\varSigma )&=\mathbb {E}\left[ df(\varSigma ^{1/2}Z+\mu )[\mathscr { L}_{\varSigma ^{1/2}}\left( U\right) Z]\right] \\&=\mathbb {E}\left[ \nabla ^{*}f(\varSigma ^{1/2}Z+\mu )\mathscr {L}_{\varSigma ^{1/2}}\left( U\right) Z\right] \\&=\mathbb {E}\left[ {\text {Tr}}\nabla ^{*}f(\varSigma ^{1/2}Z+\mu )\mathscr {L} _{\varSigma ^{1/2}}\left( U\right) Z\right] . \end{aligned} \end{aligned}
Under symmetrization (and setting $$X=\varSigma ^{1/2}Z+\mu$$):
\begin{aligned} d_{U}\phi (\mu ,\varSigma )&=\frac{1}{2}\mathbb {E}\left[ {\text {Tr}}\mathscr {L} _{\varSigma ^{1/2}}\left( U\right) \left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) \right] \\&=\left\langle U,\mathbb {E}\left( \left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) \right) \right\rangle _{\varSigma ^{1/2}} \\&=\frac{1}{2}\mathbb {E}{\text {Tr}}\mathscr {L}_{\varSigma ^{1/2}}\left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) U \\&=\left\langle \mathbb {E}\mathscr {L}_{\varSigma ^{1/2}}\left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) ,U\right\rangle _{2} . \end{aligned}
It follows that
\begin{aligned} \nabla _{2}\phi (\mu ,\varSigma )=\mathbb {E}\left[ \mathscr {L}_{\varSigma ^{1/2}}\left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) \right] \end{aligned}
and, by Eq. (38),
\begin{aligned}&{\text {grad}}_{2}\phi (\mu ,\varSigma ) \\&\quad = \varSigma \mathbb {E}\left[ \mathscr {L}_{\varSigma ^{1/2}}\left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) \right] + \mathbb {E}\left[ \mathscr {L}_{\varSigma ^{1/2}}\left( Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right) \right] \varSigma . \end{aligned}
If we set $$\varXi =\mathbb {E}\left[ Z\nabla ^{*}f(X)+\nabla f(X)Z^{*}\right]$$, the natural gradient admits the representation
\begin{aligned} {\text {grad}}_{2}\phi (\mu ,\varSigma )=\varSigma \mathscr {L}_{\varSigma ^{1/2}}\left( \varXi \right) +\mathscr {L}_{\varSigma ^{1/2}}\left( \varXi \right) \varSigma . \end{aligned}
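As a sanity check of this representation, consider a worked example that is not part of the original argument: the quadratic objective $$f(x)=\frac{1}{2}x^{*}Qx$$, where the matrix $$Q\in {\text {Sym}}\left( n\right)$$ is an assumption introduced only for illustration. Then $$\nabla f(x)=Qx$$ and
\begin{aligned} \phi (\mu ,\varSigma )=\frac{1}{2}\mu ^{*}Q\mu +\frac{1}{2}{\text {Tr}}\left( Q\varSigma \right) , \quad d_{U}\phi (\mu ,\varSigma )=\frac{1}{2}{\text {Tr}}\left( QU\right) =\left\langle Q,U\right\rangle _{2} , \end{aligned}
so that $$\nabla _{2}\phi (\mu ,\varSigma )=Q$$ by direct computation. On the other hand, with $$X=\varSigma ^{1/2}Z+\mu$$,
\begin{aligned} \mathbb {E}\left[ Z\nabla ^{*}f(X)\right] =\mathbb {E}\left[ Z(\varSigma ^{1/2}Z+\mu )^{*}\right] Q=\varSigma ^{1/2}Q , \end{aligned}
whose transpose is $$Q\varSigma ^{1/2}$$. Hence $$\varXi =\varSigma ^{1/2}Q+Q\varSigma ^{1/2}$$ and $$\mathscr {L}_{\varSigma ^{1/2}}\left( \varXi \right) =Q$$, in agreement with the direct computation. In both cases $${\text {grad}}_{2}\phi (\mu ,\varSigma )=\varSigma Q+Q\varSigma$$.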
We move on to consider the second procedure. Following the standard computation of the Fisher score and starting from the logarithm of the density $$p(x;\mu ,\varSigma )$$ of $${\text {N}}_{n}\left( \mu ,\varSigma \right)$$, we have
\begin{aligned} \begin{aligned}&\left. \log p(x;\mu ,\varSigma )=-\frac{n}{2}\log 2\pi -\frac{1}{2}\log \det \varSigma -\frac{1}{2}(x-\mu )^{*}\varSigma ^{-1}(x-\mu )\right. \\&\left. =-\frac{n}{2}\log 2\pi -\frac{1}{2}\log \det \varSigma -\frac{1}{2}{\text {Tr}}\left( \varSigma ^{-1}(x-\mu )(x-\mu )^{*}\right) .\right. \end{aligned} \end{aligned}
(39)
Denoting the partial derivative $$d_{u}\left( \mu \longmapsto \log p(x;\mu ,\varSigma )\right)$$ as $$d_{u}\log p(x;\mu ,\varSigma )$$, and the other derivative $$d_{U}\left( \varSigma \longmapsto \log p(x;\mu ,\varSigma )\right)$$ as $$d_{U}\log p(x;\mu ,\varSigma )$$, we get:
\begin{aligned} \begin{aligned} d_{u}\log p(x;\mu ,\varSigma )&=(x-\mu )^{*}\varSigma ^{-1}u=\left\langle \varSigma ^{-1}(x-\mu ),u\right\rangle \\ d_{U}\log p(x;\mu ,\varSigma )&=-\frac{1}{2}{\text {Tr}}\left( \varSigma ^{-1}U\right) +\frac{1}{2}{\text {Tr}}\left( \varSigma ^{-1}U\varSigma ^{-1}(x-\mu )(x-\mu )^{*}\right) \\&=\frac{1}{2}\left\langle \varSigma ^{-1}(x-\mu )(x-\mu )^{*}\varSigma ^{-1}-\varSigma ^{-1},U\right\rangle \\&=\left\langle \varSigma ^{-1}\left( (x-\mu )(x-\mu )^{*}-\varSigma \right) \varSigma ^{-1},U\right\rangle _{2} \end{aligned} \end{aligned}
So that
\begin{aligned} \begin{aligned} d_{u}\phi (\mu ,\varSigma )&=\int f(x)\ d_{u}\log p(x;\mu ,\varSigma )\ p(x;\mu ,\varSigma )\ dx \\&=\left\langle \varSigma ^{-1}\int f(x)(x-\mu )p(x;\mu ,\varSigma )\ dx,u\right\rangle \end{aligned} \end{aligned}
and
\begin{aligned} \begin{aligned} d_{U}\phi (\mu ,\varSigma )&=\int f(x)\ d_{U}\log p(x;\mu ,\varSigma )\ p(x;\mu ,\varSigma )\ dx \\&=\left\langle \varSigma ^{-1}\int f(x)\left( (x-\mu )(x-\mu )^{*}-\varSigma \right) p(x;\mu ,\varSigma )\ dx\ \varSigma ^{-1},U\right\rangle _{2}. \end{aligned} \end{aligned}
At last, thanks to Eq. (38), the natural gradient of $$\phi (\mu ,\varSigma )$$ will be
\begin{aligned} \begin{aligned} \nabla _{1}\phi (\mu ,\varSigma )&=\varSigma ^{-1}\int f(x)(x-\mu )p(x;\mu ,\varSigma )\ dx \\ {\text {grad}}_{2}\phi (\mu ,\varSigma )&=\int f(x)\left( (x-\mu )(x-\mu )^{*}-\varSigma \right) p(x;\mu ,\varSigma )\ dx\ \varSigma ^{-1} \\&+\varSigma ^{-1}\int f(x)\left( (x-\mu )(x-\mu )^{*}-\varSigma \right) p(x;\mu ,\varSigma )\ dx. \end{aligned} \end{aligned}

The flow of entropy can be easily calculated by Eq. (39). We have
\begin{aligned} \begin{aligned} \mathscr {E}(\mu ,\varSigma )&=-\int \log p(x;\mu ,\varSigma )p(x;\mu ,\varSigma )\ dx \\&=\frac{n}{2}\log 2\pi +\frac{1}{2}\log \det \varSigma +\frac{1}{2}{\text {Tr}}\left( \varSigma ^{-1}\varSigma \right) \\&=\frac{n}{2}(\log 2\pi +1)+\frac{1}{2}\log \det \varSigma . \end{aligned} \end{aligned}
The entropy does not depend on $$\mu$$ so that $$\nabla _{1}\mathscr {E}(\mu ,\varSigma )=0$$. Moreover (see [22, §8.3]) we know that $$\nabla \mathscr {E}(\varSigma )=\varSigma ^{-1}$$, so that
\begin{aligned} {\text {grad}}\mathscr {E}(\varSigma )=(\varSigma ^{-1}\varSigma +\varSigma \varSigma ^{-1})=2I. \end{aligned}
The entropic flow will be solution to the equations
\begin{aligned} {\dot{\mu }}(t)=0,\quad {\dot{\varSigma }}(t)+2I=0, \end{aligned}
that is
\begin{aligned} \mu (t)=\mu (0),\quad \varSigma (t)=\varSigma (0)-2tI. \end{aligned}
The integral curve is defined for all t such that $$2t<\lambda _{*}$$, $$\lambda _{*}$$ being the minimum of the spectrum of $$\varSigma (0)$$.
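The entropic flow can be checked numerically. This is a minimal sketch assuming NumPy; the function names `entropy` and `entropy_flow` and the 2×2 test matrix are hypothetical. It uses the standard closed form of the Gaussian entropy, $$\frac{n}{2}(1+\log 2\pi )+\frac{1}{2}\log \det \varSigma$$, and compares the rate of change of the entropy along the flow with $$-W_{\varSigma }(2I,2I) = -{\text {Tr}}\left( \varSigma ^{-1}\right)$$, which follows from $$\mathscr {L}_{\varSigma }[2I]=\varSigma ^{-1}$$.

```python
import numpy as np

def entropy(sigma):
    """Entropy of N(mu, sigma): n/2 (1 + log 2 pi) + 1/2 log det sigma."""
    n = sigma.shape[0]
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * n * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

def entropy_flow(sigma0, t):
    """Integral curve Sigma(t) = Sigma(0) - 2 t I of the entropic flow."""
    return sigma0 - 2.0 * t * np.eye(sigma0.shape[0])

sigma0 = np.array([[2.0, 0.5], [0.5, 1.0]])
t_max = 0.5 * np.linalg.eigvalsh(sigma0).min()     # the flow exists for 2 t < lambda_min
t = 0.5 * t_max
sigma_t = entropy_flow(sigma0, t)

h = 1e-6
rate = (entropy(entropy_flow(sigma0, t + h)) - entropy(entropy_flow(sigma0, t - h))) / (2 * h)
expected_rate = -np.trace(np.linalg.inv(sigma_t))  # minus the squared norm of grad = 2I
print(t_max, rate, expected_rate)
```

Beyond `t_max` the matrix $$\varSigma (t)$$ leaves $${\text {Sym}}^{++}\left( n\right)$$, in agreement with the condition $$2t<\lambda _{*}$$.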

## 7 Second order geometry

Recall that $${{\mathrm{Sym}}}^{++}\left( n\right)$$ is an open set of the Hilbert space $${{\mathrm{Sym}}}\left( n\right)$$, endowed with the inner product $$\left\langle X,Y\right\rangle _{2} = \frac{1}{2} {{\mathrm{Tr}}}\left( XY\right)$$. Prop. 7 states that the Wasserstein Riemannian metric W can be expressed through the inner product of $${{\mathrm{Sym}}}\left( n\right)$$, as
\begin{aligned} W_\varSigma (X,Y) = \left\langle X,Y\right\rangle _{\varSigma } = \left\langle \mathscr {L} _ {\varSigma } \left[ X\right] ,Y\right\rangle _{2} , \end{aligned}
for each $$(\varSigma ,X)$$ and $$(\varSigma ,Y)$$ in the trivial tangent bundle $$T {{\mathrm{Sym}}}^{++}\left( n\right) \simeq {{\mathrm{Sym}}}^{++}\left( n\right) \times {{\mathrm{Sym}}}\left( n\right)$$. In the equation above, $${\mathscr {L}} :{{\mathrm{Sym}}}^{++}\left( n\right) \mapsto L({{\mathrm{Sym}}}\left( n\right) ,{{\mathrm{Sym}}}\left( n\right) )$$ is the field of linear operators defining the Wasserstein metric with respect to the standard inner product.

In the trivial chart, a smooth vector field X is a smooth mapping $$X :{{\mathrm{Sym}}}^{++}\left( n\right) \rightarrow {{\mathrm{Sym}}}\left( n\right)$$. The action of the vector field X on the scalar field f, that is, Xf, is expressed in the trivial chart by $$d_Xf$$, i.e., the scalar field whose value at point $$\varSigma$$ is the derivative of f in the direction $$X(\varSigma )$$. Similarly, $$d_YX$$ denotes the vector field whose value at point $$\varSigma$$ is the derivative at $$\varSigma$$ of X in the direction $$Y(\varSigma )$$. The Lie bracket [X, Y] of two smooth vector fields X, Y is given by $$d_XY - d_YX$$.

### 7.1 The moving frame

While we prefer to express our computation by matrix algebra, in some cases it may be useful to employ a vector basis. Let us now introduce a field of vector bases of particular interest.

The set of symmetric matrices
\begin{aligned} E^{p,q} = e_p e_q^* + e_qe_p^*, \quad p,q = 1,\dots ,n , \end{aligned}
(40)
$$e_p$$ being the p-th element of the standard basis of $$\mathbb {R}^n$$, spans the vector space $${{\mathrm{Sym}}}\left( n\right)$$. Notice that $${{\mathrm{Tr}}}\left( E^{p,q}\right) = 2 \delta _{p,q}$$, where $$\delta$$ is the Kronecker symbol. To avoid repeated elements, a unique enumeration is obtained by taking indexes in the set A of the parts of $$\left\{ 1,\dots ,n\right\}$$ having 1 or 2 elements.
The generating set of Eq. (40) is related to the symmetric product of matrices by the equation
\begin{aligned} E^{p,q} E^{r,s} + E^{r,s} E^{p,q} = \delta _{q,r} E^{p,s} + \delta _{q,s}E^{p,r} + \delta _{p,r}E^{q,s} + \delta _{p,s} E^{q,r} , \end{aligned}
where $$\delta$$ is the Kronecker symbol.
In particular, if we take the trace of the equation above, we get
\begin{aligned} \left\langle E^{p,q},E^{r,s}\right\rangle _{2} = \delta _{p,r}\delta _{q,s}+\delta _{p,s}\delta _{q,r} , \end{aligned}
which in turn implies
\begin{aligned} \left\langle E^{p,q},E^{r,s}\right\rangle _{2} = {\left\{ \begin{array}{ll} 0 &{} \text {if }\left\{ p,q\right\} \ne \left\{ r,s\right\} , \\ 1 &{} \text {if }\left\{ p,q\right\} =\left\{ r,s\right\} \text { and } p \ne q, \\ 2 &{} \text {if }\left\{ p,q\right\} =\left\{ r,s\right\} \text { and }p = q \\ \end{array}\right. } \end{aligned}
In the sequel, we denote by $$(E^\alpha )_{\alpha \in A}$$ the vector basis above, properly normalized to obtain an orthonormal basis. We do not write down the normalizing constants in order to simplify the notation.
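The normalizing constants can be read off the case distinction above: the off-diagonal elements already have unit norm, while the diagonal ones must be divided by $$\sqrt{2}$$. The following is a minimal numerical sketch of this construction, assuming the inner product $$\left\langle A,B\right\rangle _{2} = \frac{1}{2}{{\mathrm{Tr}}}\left( AB\right)$$ used in this section; the helper names `inner2` and `orthonormal_basis` are illustrative and do not come from the paper.

```python
import itertools
import numpy as np

def inner2(a, b):
    """Inner product <A, B>_2 = (1/2) Tr(A B) on symmetric matrices."""
    return 0.5 * np.trace(a @ b)

def orthonormal_basis(n):
    """Return E^{p,q} = e_p e_q^T + e_q e_p^T for p <= q, rescaled to unit norm."""
    eye = np.eye(n)
    basis = []
    for p, q in itertools.combinations_with_replacement(range(n), 2):
        e = np.outer(eye[p], eye[q]) + np.outer(eye[q], eye[p])
        # <E^{p,q}, E^{p,q}>_2 equals 1 when p != q and 2 when p == q.
        basis.append(e / np.sqrt(inner2(e, e)))
    return basis

basis = orthonormal_basis(3)
# Gram matrix of the normalized family; it should be the identity.
gram = np.array([[inner2(a, b) for b in basis] for a in basis])
print(len(basis), np.allclose(gram, np.eye(len(basis))))
```

For $$n = 3$$ the index set A has $$n(n+1)/2 = 6$$ elements, which is the dimension of $${{\mathrm{Sym}}}\left( 3\right)$$.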
For each $$\varSigma \in {{\mathrm{Sym}}}^{++}\left( n\right)$$ the sequence
\begin{aligned} \mathscr {E}^{\alpha }(\varSigma ) = E^\alpha \varSigma + \varSigma E^\alpha , \quad \alpha \in A , \end{aligned}
(41)
is a vector basis of $${{\mathrm{Sym}}}\left( n\right) \simeq T_\varSigma {{\mathrm{Sym}}}^{++}\left( n\right)$$, because it is the image of a vector basis under a linear mapping which is onto. We will call such a sequence of vector fields the (principal) moving frame.
Notice the following properties:
\begin{aligned} \mathscr {E}^{\alpha }= d_{E^\alpha }\varSigma ^2 \ ; \quad \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }(\varSigma )\right] = E^\alpha \ ; \quad \mathscr {E}^{\alpha }(I) = 2E^\alpha \ . \end{aligned}
At a generic point $$\varSigma$$, we can express each $$\mathscr {E}^{\alpha }$$ in the $$(E^\beta )_\beta$$’s orthonormal basis as
\begin{aligned} \mathscr {E}^{\alpha }(\varSigma ) = \sum _\beta g_{\alpha ,\beta }(\varSigma ) E^\beta \ , \quad g_{\alpha ,\beta }(\varSigma ) = {{\mathrm{Tr}}}\left( E^\alpha \varSigma E^\beta \right) . \end{aligned}
(42)
Since
\begin{aligned} W_\varSigma (\mathscr {E}^{\alpha },\mathscr {E}^{\beta }) = {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }(\varSigma )\right] \varSigma \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }(\varSigma )\right] \right) = {{\mathrm{Tr}}}\left( E^\alpha \varSigma E^\beta \right) , \end{aligned}
the matrix $$[g_{\alpha ,\beta }]_{\alpha ,\beta }$$ is the expression of the Riemannian metric in such a moving frame. Namely, if X, Y are vector fields expressed in the moving frame as $$X = \sum _\alpha x_\alpha \mathscr {E}^{\alpha }$$ and $$Y = \sum _\beta y_\beta \mathscr {E}^{\beta }$$, then
\begin{aligned}&W_\varSigma (X,Y) = {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ \sum _\alpha x_\alpha (\varSigma )\mathscr {E}^{\alpha }(\varSigma )\right] \varSigma \ \mathscr {L} _ {\varSigma } \left[ \sum _\beta y_\beta (\varSigma ) \mathscr {E}^{\beta }(\varSigma )\right] \right) \\&\quad = {{\mathrm{Tr}}}\left( \left( \sum _\alpha x_\alpha (\varSigma )E^\alpha \right) \varSigma \left( \sum _\beta y_\beta (\varSigma ) E^\beta \right) \right) = \sum _{\alpha ,\beta } x_\alpha (\varSigma )y_\beta (\varSigma )g_{\alpha ,\beta }(\varSigma ) . \end{aligned}
This expression of the inner product is to be compared to that used in [36].
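The identities relating the moving frame, the Lyapunov operator and the matrix $$[g_{\alpha ,\beta }]$$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the operator $$\mathscr {L} _ {\varSigma }$$ is computed through an eigendecomposition of $$\varSigma$$ (one of several possible ways to solve the Lyapunov equation), the test point is random, and the names `lyap`, `orthonormal_basis` and `w_metric` are hypothetical helpers, not part of the paper.

```python
import itertools
import numpy as np

def lyap(sigma, v):
    """Return L_Sigma[V], the symmetric solution L of L Sigma + Sigma L = V."""
    lam, u = np.linalg.eigh(sigma)
    # In the eigenbasis of Sigma the equation decouples entrywise.
    return u @ ((u.T @ v @ u) / (lam[:, None] + lam[None, :])) @ u.T

def orthonormal_basis(n):
    """E^{p,q}, p <= q, normalized for <A, B>_2 = (1/2) Tr(A B)."""
    eye = np.eye(n)
    out = []
    for p, q in itertools.combinations_with_replacement(range(n), 2):
        e = np.outer(eye[p], eye[q]) + np.outer(eye[q], eye[p])
        out.append(e / np.sqrt(0.5 * np.trace(e @ e)))
    return out

def w_metric(sigma, x, y):
    """W_Sigma(X, Y) = Tr(L_Sigma[X] Sigma L_Sigma[Y])."""
    return np.trace(lyap(sigma, x) @ sigma @ lyap(sigma, y))

rng = np.random.default_rng(0)
n = 3
a = rng.standard_normal((n, n))
sigma = a @ a.T + n * np.eye(n)      # arbitrary positive-definite test point

basis = orthonormal_basis(n)
frame = [e @ sigma + sigma @ e for e in basis]        # moving frame, Eq. (41)

# Matrix of the metric in the moving frame, Eq. (42).
g = np.array([[np.trace(ea @ sigma @ eb) for eb in basis] for ea in basis])
# The same matrix computed from the definition of W_Sigma.
w = np.array([[w_metric(sigma, fa, fb) for fb in frame] for fa in frame])

print(np.allclose(g, w), np.linalg.eigvalsh(g).min() > 0)
```

Since $$[g_{\alpha ,\beta }]$$ is the Gram matrix of a basis with respect to an inner product, it is expected to be positive-definite, which is what the eigenvalue check illustrates.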
In this way, any vector field X has two representations: one with respect to the moving frame $$(\mathscr {E}^{\alpha })_\alpha$$ and another one with respect to the basis $$(E^\alpha )_\alpha$$. These two representations are related to each other as follows. We have
\begin{aligned} X = \sum _\alpha x_\alpha \mathscr {E}^{\alpha }= \sum _\alpha x_\alpha \sum _\beta g_{\alpha ,\beta } E^\beta = \sum _\beta \left( \sum _\alpha x_\alpha g_{\alpha ,\beta }\right) E^\beta , \end{aligned}
so that
\begin{aligned} \left\langle X,E^\gamma \right\rangle _{2} = \frac{1}{2} {{\mathrm{Tr}}}\left( XE^\gamma \right) = \sum _\beta \left( \sum _\alpha x_\alpha g_{\alpha ,\beta }\right) \frac{1}{2} {{\mathrm{Tr}}}\left( E^\beta E^\gamma \right) = \sum _\alpha x_\alpha g_{\alpha ,\gamma } , \end{aligned}
hence, by applying the inverse matrix $$[g^{\alpha ,\beta }(\varSigma )]=[g_{\alpha ,\beta }(\varSigma )]^{-1}$$, we have
\begin{aligned} x_\alpha = \sum _\gamma g^{\alpha ,\gamma } \left\langle X,E^\gamma \right\rangle _{2} \ . \end{aligned}
(43)
For example, $$\mathscr {L} _ {\varSigma } \left[ V\right] = \sum _\alpha \ell _{\varSigma }^\alpha (V) \mathscr {E}^{\alpha }(\varSigma )$$, with
\begin{aligned} \ell _{\varSigma }^\alpha (V) = \sum _\gamma g^{\alpha ,\gamma }(\varSigma )\left\langle \mathscr {L} _ {\varSigma } \left[ V\right] ,E^\gamma \right\rangle _{2} = W_\varSigma (V,\sum _\gamma g^{\alpha ,\gamma }E^\gamma ) . \end{aligned}
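As an illustration of Eq. (43), the coordinates of a tangent vector in the moving frame can be obtained by solving a linear system with the matrix $$[g_{\alpha ,\beta }]$$. The sketch below assumes a random positive-definite $$\varSigma$$ and a random symmetric X; `orthonormal_basis` is a hypothetical helper, and solving the linear system instead of forming the inverse matrix is an implementation choice of this example.

```python
import itertools
import numpy as np

def orthonormal_basis(n):
    """E^{p,q}, p <= q, normalized for <A, B>_2 = (1/2) Tr(A B)."""
    eye = np.eye(n)
    out = []
    for p, q in itertools.combinations_with_replacement(range(n), 2):
        e = np.outer(eye[p], eye[q]) + np.outer(eye[q], eye[p])
        out.append(e / np.sqrt(0.5 * np.trace(e @ e)))
    return out

rng = np.random.default_rng(1)
n = 3
a = rng.standard_normal((n, n))
sigma = a @ a.T + n * np.eye(n)          # arbitrary positive-definite test point
x = rng.standard_normal((n, n))
x = x + x.T                              # arbitrary symmetric tangent vector

basis = orthonormal_basis(n)
frame = [e @ sigma + sigma @ e for e in basis]
g = np.array([[np.trace(ea @ sigma @ eb) for eb in basis] for ea in basis])

# Components <X, E^gamma>_2 with respect to the fixed orthonormal basis.
c = np.array([0.5 * np.trace(x @ e) for e in basis])
# Eq. (43): moving-frame coordinates x_alpha = sum_gamma g^{alpha,gamma} c_gamma.
coords = np.linalg.solve(g, c)

rebuilt = sum(ck * f for ck, f in zip(coords, frame))
print(np.allclose(rebuilt, x))
```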

### 7.2 Covariant derivative in the moving frame

If X and Y are vector fields, denote by $$D_YX$$ the action of a covariant derivative, namely, a bilinear operator satisfying, for each scalar field f, the following two conditions:
(CD1) $$D_{fY}X = fD_Y X$$,

(CD2) $$D_Y (fX) = (d_Yf) X + f D_Y X$$;

see, e.g., [10, Sect. 3] or [20, Ch. 8.4].
A convenient way to express a covariant derivative in the moving frame (41) is to define Christoffel symbols in the moving frame as
\begin{aligned} \sum _\gamma \varGamma _{\alpha ,\beta }^\gamma \mathscr {E}^{\gamma }= D_{\mathscr {E}^{\alpha }} \mathscr {E}^{\beta }= E^\beta E^\alpha + E^\alpha E^\beta . \end{aligned}
Each $$\varGamma _{\alpha ,\beta }^\gamma$$ is to be computed by means of Eq. (43).
If $$X=\sum _\alpha x_\alpha \mathscr {E}^{\alpha }$$ and $$Y = \sum _\beta y_\beta \mathscr {E}^{\beta }$$, by using (CD1), (CD2), and Eq. (42), we obtain
\begin{aligned}&D_XY = \sum _{\alpha ,\beta } x_\alpha D_{\mathscr {E}^{\alpha }} (y_\beta \mathscr {E}^{\beta }) = \sum _{\alpha ,\beta } x_\alpha \left( \left( d_{\mathscr {E}^{\alpha }}y_\beta \right) \mathscr {E}^{\beta }+ y_\beta \left( D_{\mathscr {E}^{\alpha }}\mathscr {E}^{\beta }\right) \right) \\&\quad = \sum _{\alpha ,\gamma } x_\alpha \left( d_{\mathscr {E}^{\alpha }}y_\gamma \right) \mathscr {E}^{\gamma }+ \sum _{\alpha ,\beta ,\gamma } x_\alpha y_\beta \varGamma _{\alpha ,\beta }^\gamma \mathscr {E}^{\gamma }= \sum _\gamma \sum _{\alpha } x_\alpha \left( d_{\mathscr {E}^{\alpha }}y_\gamma + \sum _\beta y_\beta \varGamma _{\alpha ,\beta }^\gamma \right) \mathscr {E}^{\gamma }. \end{aligned}
The inner product of $$D_XY$$ and $$Z = \sum _\delta z_\delta \mathscr {E}^{\delta }$$ is
\begin{aligned} \left\langle D_XY,Z\right\rangle _{\varSigma } = \sum _{\alpha ,\gamma ,\delta } x_\alpha \left( d_{\mathscr {E}^{\alpha }}y_\gamma + \sum _\beta y_\beta \varGamma _{\alpha ,\beta }^\gamma \right) g_{\delta ,\gamma } z_\delta . \end{aligned}

### 7.3 Levi-Civita derivative

The Levi-Civita (covariant) derivative is the unique covariant derivative D that, for all vector fields X, Y, Z, is:
• (LC1) compatible with the metric, $$d_{X}W(Y,Z)=W(D_X Y,Z) + W(Y,D_{X}Z)$$;
• (LC2) torsion-free, $$D_{X}Y-D_{Y}X = [X,Y] = d_X Y - d_Y X$$.

In order to keep the notation compact, it will be convenient to use the symmetrization of a matrix $$A \in {{\mathrm{M}}}\left( n\right)$$, defined by $$\left\{ A\right\} _S = \frac{1}{2}\left( A+A^*\right)$$. If either A or B is symmetric, then $${{\mathrm{Tr}}}\left( \left\{ A\right\} _SB\right) = {{\mathrm{Tr}}}\left( AB\right)$$.
We denote by X, Y, Z smooth vector fields on $${{\mathrm{Sym}}}^{++}\left( n\right)$$ and we shall frequently use the derivative of the vector field $$\varSigma \mapsto \mathscr {L} _ {\varSigma } \left[ X\right]$$. In view of Eq. (17) and with our notation for the symmetrization, we have
\begin{aligned} d_Y \mathscr {L} _ {\varSigma } \left[ X\right] = -2 \mathscr {L} _ {\varSigma } \left[ \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] Y\right\} _S\right] . \end{aligned}
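The derivative formula above can be compared with a central finite difference, treating X as a fixed symmetric matrix. This is a minimal sketch under assumptions: the point $$\varSigma$$ and the matrices X, Y are random, the step size is chosen by hand, and `lyap` and `sym` are illustrative helper names.

```python
import numpy as np

def lyap(sigma, v):
    """Return L_Sigma[V], the symmetric solution L of L Sigma + Sigma L = V."""
    lam, u = np.linalg.eigh(sigma)
    return u @ ((u.T @ v @ u) / (lam[:, None] + lam[None, :])) @ u.T

def sym(a):
    """Symmetrization {A}_S = (A + A^T) / 2."""
    return 0.5 * (a + a.T)

rng = np.random.default_rng(2)
n = 3
a = rng.standard_normal((n, n))
sigma = a @ a.T + n * np.eye(n)          # arbitrary positive-definite test point
x = sym(rng.standard_normal((n, n)))     # fixed symmetric matrix X
y = sym(rng.standard_normal((n, n)))     # direction of differentiation Y

h = 1e-5
# Central difference of Sigma -> L_Sigma[X] in the direction Y.
fd = (lyap(sigma + h * y, x) - lyap(sigma - h * y, x)) / (2 * h)
# Closed form: d_Y L_Sigma[X] = -2 L_Sigma[{L_Sigma[X] Y}_S].
closed = -2.0 * lyap(sigma, sym(lyap(sigma, x) @ y))

print(np.max(np.abs(fd - closed)))
```

The printed discrepancy should be at the level of the finite-difference error, which is small compared with the entries of either matrix.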

### Proposition 10

The Levi-Civita derivative $$D_{X}Y$$ is implicitly defined by
\begin{aligned}&\left\langle D_{X}Y,Z\right\rangle _{\varSigma } = \left\langle d_{X}Y,Z\right\rangle _{\varSigma } + \left\langle X,\left\{ \mathscr {L} _ {\varSigma } \left[ Y\right] Z\right\} _S\right\rangle _{\varSigma } \nonumber \\&\qquad - \left\langle X,\left\{ \mathscr {L} _ {\varSigma } \left[ Z\right] Y\right\} _S\right\rangle _{\varSigma } - \left\langle Y,\left\{ \mathscr {L} _ {\varSigma } \left[ Z\right] X\right\} _S\right\rangle _{\varSigma } \nonumber \\&\quad = \left\langle d_{X}Y,Z\right\rangle _{\varSigma } + \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right) \nonumber \\&\qquad -\frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Y \mathscr {L} _ {\varSigma } \left[ Z\right] \right) - \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ Y\right] X \mathscr {L} _ {\varSigma } \left[ Z\right] \right) , \end{aligned}
(44)
while the Levi-Civita derivative itself is given by
\begin{aligned} D_{X}Y= & {} d_{X}Y - \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] Y + \mathscr {L} _ {\varSigma } \left[ Y\right] X\right\} _S\\&+ \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] + \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \right\} _S . \end{aligned}

### Proof

In our case, Eq. MD3 of [20, p. 205] becomes
\begin{aligned} 2 \left\langle D_X Y,\mathscr {L} _ {\varSigma } \left[ Z\right] \right\rangle _{2}= & {} 2\left\langle d_XY,\mathscr {L} _ {\varSigma } \left[ Z\right] \right\rangle _{2} + \left\langle Y,d_X \mathscr {L} _ {\varSigma } \left[ Z\right] \right\rangle _{2} \nonumber \\&+\, \left\langle X,d_Y \mathscr {L} _ {\varSigma } \left[ Z\right] \right\rangle _{2} - \left\langle X,d_Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right\rangle _{2} . \end{aligned}
(45)
By Eq. (17) we have
\begin{aligned} \left\langle Y,d_X \mathscr {L} _ {\varSigma } \left[ Z\right] \right\rangle _{2} = - 2 \left\langle Y,\mathscr {L} _ {\varSigma } \left[ \left\{ \mathscr {L} _ {\varSigma } \left[ Z\right] X\right\} _S\right] \right\rangle _{2} = - 2 \left\langle Y,\left\{ \mathscr {L} _ {\varSigma } \left[ Z\right] X\right\} _S\right\rangle _{\varSigma } , \end{aligned}
and, analogously,
\begin{aligned}&\left\langle X,d_Y \mathscr {L} _ {\varSigma } \left[ Z\right] \right\rangle _{2} = - 2 \left\langle X,\left\{ \mathscr {L} _ {\varSigma } \left[ Z\right] Y\right\} _S\right\rangle _{\varSigma }, \\&\quad \left\langle X,d_Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right\rangle _{2} = - 2 \left\langle X,\left\{ \mathscr {L} _ {\varSigma } \left[ Y\right] Z\right\} _S\right\rangle _{\varSigma } . \end{aligned}
This way, Eq. (45) becomes the first part of Eq. (44).
The second part of Eq. (44) is then easily obtained. For instance,
\begin{aligned} \left\langle X,\left\{ \mathscr {L} _ {\varSigma } \left[ Y\right] Z\right\} _S\right\rangle _{\varSigma } = \frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] \left\{ Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S\right) =\frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right) . \end{aligned}
Regarding the explicit formula of the Levi-Civita derivative in Proposition 10, observe that
\begin{aligned}&\frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right) = \frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] Z \right) \\&\quad = \frac{1}{2} {{\mathrm{Tr}}}\left( \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S Z\right) \\&\quad = \frac{1}{2} {{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S\varSigma + \varSigma \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S\right] Z\right) \\&\quad = \left\langle \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S\varSigma + \varSigma \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S,Z\right\rangle _{\varSigma } \\&\quad = \left\langle \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] \right\} _S + \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \right\} _S,Z\right\rangle _{\varSigma } \\&\quad = \left\langle \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] + \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \right\} _S,Z\right\rangle _{\varSigma } . \end{aligned}
Moreover,
\begin{aligned}&\frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Y \mathscr {L} _ {\varSigma } \left[ Z\right] \right) + \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ Y\right] X \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \\&\quad = \frac{1}{2} {{\mathrm{Tr}}}\left( \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] Y + \mathscr {L} _ {\varSigma } \left[ Y\right] X\right\} _S \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \\&\quad = \left\langle \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] Y + \mathscr {L} _ {\varSigma } \left[ Y\right] X\right\} _S,Z\right\rangle _{\varSigma } \ .\end{aligned}
Therefore, Eq. (44) can be written as
\begin{aligned} \left\langle D_XY,Z\right\rangle _{\varSigma }= & {} \langle d_{X}Y - \left\{ \mathscr {L} _ {\varSigma } \left[ X\right] Y + \mathscr {L} _ {\varSigma } \left[ Y\right] X\right\} _S \\&+ \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ Y\right] + \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \right\} _S , Z \rangle _{\varSigma } , \end{aligned}
and the desired result follows. $$\square$$

Observe that we have computed the Levi-Civita covariant derivative using its explicit expression in terms of derivatives of the metric. However, it is easy to check the result directly using the properties of the Lyapunov operator.

### 7.4 Levi-Civita derivative in a moving frame

Let us make the Levi-Civita derivative explicit in the moving frame (41). Note that $$X(\varSigma ) = \mathscr {E}^{\alpha }(\varSigma ) = E^\alpha \varSigma + \varSigma E^\alpha$$ and $$Y(\varSigma ) = \mathscr {E}^{\beta }(\varSigma ) = E^\beta \varSigma + \varSigma E^\beta$$ are vector fields.

### Proposition 11

For the Levi-Civita covariant derivative D, it holds
\begin{aligned} D_{\mathscr {E}^{\alpha }}\mathscr {E}^{\beta }= E^\beta E^\alpha \varSigma + \varSigma E^\alpha E^\beta . \end{aligned}

### Proof

The explicit formula of Proposition 10 yields
\begin{aligned}&D_{\mathscr {E}^{\alpha }}\mathscr {E}^{\beta }= d_{\mathscr {E}^{\alpha }}{\mathscr {E}^{\beta }} - \left\{ \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] {\mathscr {E}^{\beta }} + \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] {\mathscr {E}^{\alpha }}\right\} _S \nonumber \\&\quad +\left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] + \varSigma \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] \right\} _S . \end{aligned}
(46)
We are going to compute one by one the three terms in this equation.
The first term of Eq. (46) is
\begin{aligned}&d_{\mathscr {E}^{\alpha }} \mathscr {E}^{\beta }= d_{(E^\alpha \varSigma + \varSigma E^\alpha )}(E^\beta \varSigma + \varSigma E^\beta ) \\&\quad = E^\beta (E^\alpha \varSigma + \varSigma E^\alpha ) + (E^\alpha \varSigma + \varSigma E^\alpha ) E^\beta \\&\quad = E^\beta E^\alpha \varSigma + E^\beta \varSigma E^\alpha + E^\alpha \varSigma E^\beta + \varSigma E^\alpha E^\beta . \end{aligned}
The second one is
\begin{aligned}&- \left\{ \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] {\mathscr {E}^{\beta }} + \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] {\mathscr {E}^{\alpha }}\right\} _S \\&\quad = - \left\{ E^\alpha (E^\beta \varSigma + \varSigma E^\beta ) + E^\beta (E^\alpha \varSigma + \varSigma E^\alpha )\right\} _S \\&\quad = - \left\{ E^\alpha E^\beta \varSigma + E^\alpha \varSigma E^\beta + E^\beta E^\alpha \varSigma + E^\beta \varSigma E^\alpha \right\} _S \\&\quad = - \frac{1}{2} \left( E^\alpha E^\beta \varSigma + E^\beta E^\alpha \varSigma + \varSigma E^\beta E^\alpha + \varSigma E^\alpha E^\beta \right) - \left( E^\alpha \varSigma E^\beta + E^\beta \varSigma E^\alpha \right) . \end{aligned}
Their sum is
\begin{aligned} \frac{1}{2}\left( E^\beta E^\alpha \varSigma + \varSigma E^\alpha E^\beta \right) - \frac{1}{2}\left( E^\alpha E^\beta \varSigma + \varSigma E^\beta E^\alpha \right) . \end{aligned}
The third term is
\begin{aligned}&\left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] + \varSigma \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] \right\} _S = \left\{ \varSigma E^\alpha E^\beta + \varSigma E^\beta E^\alpha \right\} _S \\&\quad = \frac{1}{2} \left( \varSigma E^\alpha E^\beta + \varSigma E^\beta E^\alpha + E^\beta E^\alpha \varSigma + E^\alpha E^\beta \varSigma \right) . \end{aligned}
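Adding the third term to the sum of the first two, the terms $$E^\alpha E^\beta \varSigma$$ and $$\varSigma E^\beta E^\alpha$$ cancel, and we are left with $$E^\beta E^\alpha \varSigma + \varSigma E^\alpha E^\beta$$. $$\square$$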
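The statement of Proposition 11 can also be checked numerically against the explicit formula of Proposition 10. In the sketch below, two random constant symmetric matrices A and B stand in for $$E^\alpha$$ and $$E^\beta$$ (an assumption of this example, justified by the bilinearity of both sides in these matrices), and `lyap`, `sym` and `levi_civita` are illustrative helper names.

```python
import numpy as np

def lyap(sigma, v):
    """Return L_Sigma[V], the symmetric solution L of L Sigma + Sigma L = V."""
    lam, u = np.linalg.eigh(sigma)
    return u @ ((u.T @ v @ u) / (lam[:, None] + lam[None, :])) @ u.T

def sym(a):
    """Symmetrization {A}_S = (A + A^T) / 2."""
    return 0.5 * (a + a.T)

def levi_civita(sigma, x, y, dxy):
    """Explicit formula of Proposition 10; dxy is the directional derivative d_X Y."""
    lx, ly = lyap(sigma, x), lyap(sigma, y)
    return (dxy - sym(lx @ y + ly @ x)
            + sym(sigma @ lx @ ly + sigma @ ly @ lx))

rng = np.random.default_rng(3)
n = 4
m = rng.standard_normal((n, n))
sigma = m @ m.T + n * np.eye(n)          # arbitrary positive-definite test point
a = sym(rng.standard_normal((n, n)))     # plays the role of E^alpha
b = sym(rng.standard_normal((n, n)))     # plays the role of E^beta

x = a @ sigma + sigma @ a                # frame field built on A, at Sigma
y = b @ sigma + sigma @ b                # frame field built on B, at Sigma
dxy = b @ x + x @ b                      # derivative of Sigma -> B Sigma + Sigma B along x

lhs = levi_civita(sigma, x, y, dxy)
rhs = b @ a @ sigma + sigma @ a @ b      # right-hand side of Proposition 11

print(np.allclose(lhs, rhs))
```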
The computation of the Christoffel symbols $$\sum _\gamma \varGamma _{\alpha ,\beta }^\gamma \mathscr {E}^{\gamma }= D_{\mathscr {E}^{\alpha }}\mathscr {E}^{\beta }$$ would require the solution of the equations
\begin{aligned} E^\beta E^\alpha \varSigma + \varSigma E^\alpha E^\beta = \sum _\gamma \varGamma _{\alpha ,\beta }^\gamma (\varSigma ) \left( E^\gamma \varSigma + \varSigma E^\gamma \right) . \end{aligned}
We do not discuss that here.
Instead, let us now take $$X = x_\alpha \mathscr {E}^{\alpha }$$ and $$Y=y_\beta \mathscr {E}^{\beta }$$. Properties (CD1) and (CD2) lead to
\begin{aligned}&D_{(x_\alpha \mathscr {E}^{\alpha })} (y_\beta \mathscr {E}^{\beta }) = x_\alpha D_{\mathscr {E}^{\alpha }} (y_\beta \mathscr {E}^{\beta }) = x_\alpha \left( \left( d_{\mathscr {E}^{\alpha }}y_\beta \right) \mathscr {E}^{\beta } + y_\beta D_{\mathscr {E}^{\alpha }} \mathscr {E}^{\beta } \right) \\&\quad = x_\alpha \left( d_{\mathscr {E}^{\alpha }}y_\beta \right) \mathscr {E}^{\beta } + x_\alpha y_\beta \left( E^\beta E^\alpha \varSigma + \varSigma E^\alpha E^\beta \right) . \end{aligned}
Finally, for general X and Y,
\begin{aligned} D_XY = \sum _{\alpha ,\beta } x_\alpha \left( d_{\mathscr {E}^{\alpha }}y_\beta \right) \mathscr {E}^{\beta } + \sum _{\alpha ,\beta } x_\alpha y_\beta \left( E^\beta E^\alpha \varSigma + \varSigma E^\alpha E^\beta \right) , \end{aligned}
which is the desired result.

### 7.5 Parallel transport

The expression of the Levi-Civita derivative in Eq. (44) can be re-written as
\begin{aligned} \left\langle D_{X}Y,Z\right\rangle _{\varSigma } = \left\langle d_{X}Y,Z\right\rangle _{\varSigma } + \left\langle \varGamma (\varSigma ;X,Y),Z\right\rangle _{\varSigma } , \end{aligned}
where $$\varGamma (\varSigma ;\cdot ,\cdot )$$ is the symmetric tensor field defined by
\begin{aligned}&\left\langle \varGamma (\varSigma ;X,Y),Z\right\rangle _{\varSigma } = \\&\quad \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Z \mathscr {L} _ {\varSigma } \left[ Y\right] \right) - \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ X\right] Y \mathscr {L} _ {\varSigma } \left[ Z\right] \right) - \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ Y\right] X \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \\&\quad = \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] Z \right) - \frac{1}{2}{{\mathrm{Tr}}}\left( \left( \mathscr {L} _ {\varSigma } \left[ X\right] Y + \mathscr {L} _ {\varSigma } \left[ Y\right] X\right) \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \\&\quad = \frac{1}{2}{{\mathrm{Tr}}}\left( \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \left( \mathscr {L} _ {\varSigma } \left[ Z\right] \varSigma + \varSigma \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \right) \\&\quad \quad - \frac{1}{2}{{\mathrm{Tr}}}\left( \left( \mathscr {L} _ {\varSigma } \left[ X\right] Y + \mathscr {L} _ {\varSigma } \left[ Y\right] X\right) \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \\&\quad = \frac{1}{2} {{\mathrm{Tr}}}\left( \left( \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] + \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \varSigma - \mathscr {L} _ {\varSigma } \left[ X\right] Y - \mathscr {L} _ {\varSigma } \left[ Y\right] X \right) \mathscr {L} _ {\varSigma } \left[ Z\right] \right) \\&\quad = \left\langle \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] + \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \varSigma - \mathscr {L} _ {\varSigma } \left[ X\right] Y - \mathscr {L} _ {\varSigma } \left[ Y\right] X\right\} _S,Z\right\rangle _{\varSigma } . \end{aligned}
We have
\begin{aligned} \varGamma (\varSigma ;X,Y) = \left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] + \mathscr {L} _ {\varSigma } \left[ Y\right] \mathscr {L} _ {\varSigma } \left[ X\right] \varSigma - \mathscr {L} _ {\varSigma } \left[ X\right] Y - \mathscr {L} _ {\varSigma } \left[ Y\right] X\right\} _S , \end{aligned}
and, on the diagonal,
\begin{aligned} \varGamma (\varSigma ;X,X) = \varSigma \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ X\right] + \mathscr {L} _ {\varSigma } \left[ X\right] \mathscr {L} _ {\varSigma } \left[ X\right] \varSigma - \mathscr {L} _ {\varSigma } \left[ X\right] X - X \mathscr {L} _ {\varSigma } \left[ X\right] . \end{aligned}
$$\varGamma (\varSigma ;X,Y)$$ is the expression in the trivial chart of the Christoffel symbol of the Levi-Civita derivative as in [17]. In [20], $$-\varGamma$$ is called the spray of the Levi-Civita derivative.
Given the Christoffel symbol, the linear differential equation of the parallel transport along a curve $$t \mapsto \varSigma (t)$$ is
\begin{aligned} {\left\{ \begin{array}{ll} \dot{U}_V(t) + \varGamma (\varSigma (t);{{\dot{\varSigma }}}(t), U_V(t)) = 0 , \\ U_V(0) = V , \end{array}\right. } \end{aligned}
see [20, VIII, §3 and §4]. Recall that the parallel transport for the Levi-Civita derivative is isometric.
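The parallel transport equation can be integrated numerically, and the isometry property gives a convenient sanity check. The sketch below makes several assumptions that are not in the paper: the curve is the geodesic $$\varSigma (t) = \varSigma + tV + t^2 \mathscr {L} _ {\varSigma } \left[ V\right] \varSigma \mathscr {L} _ {\varSigma } \left[ V\right]$$, the integrator is a hand-written classical Runge-Kutta scheme with a fixed number of steps, and all function names (`lyap`, `christoffel`, `geodesic`, `transport`) are hypothetical.

```python
import numpy as np

def lyap(sigma, v):
    """Return L_Sigma[V], the symmetric solution L of L Sigma + Sigma L = V."""
    lam, u = np.linalg.eigh(sigma)
    return u @ ((u.T @ v @ u) / (lam[:, None] + lam[None, :])) @ u.T

def sym(a):
    return 0.5 * (a + a.T)

def w_metric(sigma, x, y):
    """W_Sigma(X, Y) = Tr(L_Sigma[X] Sigma L_Sigma[Y])."""
    return np.trace(lyap(sigma, x) @ sigma @ lyap(sigma, y))

def christoffel(sigma, x, y):
    """Gamma(Sigma; X, Y) in the trivial chart, as given in this subsection."""
    lx, ly = lyap(sigma, x), lyap(sigma, y)
    return sym(sigma @ ly @ lx + ly @ lx @ sigma - lx @ y - ly @ x)

def geodesic(sigma0, v, t):
    """Position and velocity of the geodesic starting at sigma0 with velocity v."""
    q = lyap(sigma0, v) @ sigma0 @ lyap(sigma0, v)
    return sigma0 + t * v + t**2 * q, v + 2.0 * t * q

def transport(sigma0, v, u0, t_end=1.0, steps=200):
    """Integrate U' + Gamma(Sigma(t); Sigma'(t), U) = 0 with classical RK4."""
    def rhs(t, u):
        pos, vel = geodesic(sigma0, v, t)
        return -christoffel(pos, vel, u)
    u, h = u0.copy(), t_end / steps
    for k in range(steps):
        t = k * h
        k1 = rhs(t, u)
        k2 = rhs(t + h / 2, u + h / 2 * k1)
        k3 = rhs(t + h / 2, u + h / 2 * k2)
        k4 = rhs(t + h, u + h * k3)
        u = u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return u

rng = np.random.default_rng(4)
n = 3
m = rng.standard_normal((n, n))
sigma0 = m @ m.T + n * np.eye(n)                 # arbitrary starting point
v = 0.5 * sym(rng.standard_normal((n, n)))       # initial velocity of the curve
u0 = sym(rng.standard_normal((n, n)))            # vector to be transported

u1 = transport(sigma0, v, u0)
sigma1, vel1 = geodesic(sigma0, v, 1.0)
print(w_metric(sigma0, u0, u0), w_metric(sigma1, u1, u1))
```

The two printed values should agree up to integration error. As a second check, transporting the initial velocity V along its own geodesic should reproduce the velocity of the geodesic at the end point.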
We do not discuss here the representation in the moving frame of the parallel transport equation above. We limit ourselves to mentioning that the action of the Christoffel symbol on vector fields expressed in the moving frame can be computed from
\begin{aligned}&\varGamma (\varSigma ;\mathscr {E}^{\alpha },\mathscr {E}^{\beta }) \\&\quad =\left\{ \varSigma \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] + \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] \varSigma \right. \\&\quad \quad \left. - \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\alpha }\right] \mathscr {E}^{\beta }- \mathscr {L} _ {\varSigma } \left[ \mathscr {E}^{\beta }\right] \mathscr {E}^{\alpha }\right\} _S \\&\quad = \left\{ \varSigma E^\beta E^\alpha + E^\beta E^\alpha \varSigma - E^\alpha ( E^\beta \varSigma +\varSigma E^\beta ) - E^\beta (E^\alpha \varSigma +\varSigma E^\alpha )\right\} _S \\&\quad = \left\{ \varSigma E^\beta E^\alpha + E^\beta E^\alpha \varSigma - E^\alpha E^\beta \varSigma - E^\alpha \varSigma E^\beta - E^\beta E^\alpha \varSigma - E^\beta \varSigma E^\alpha \right\} _S \\&\quad = - (E^\alpha \varSigma E^\beta + E^\beta \varSigma E^\alpha ) . \end{aligned}

### 7.6 Riemannian Hessian

According to [1, Def. 5.5.1] and [10, p. 141], the Riemannian Hessian of a smooth scalar field $$\phi :{{\mathrm{Sym}}}^{++}\left( n\right) \rightarrow \mathbb {R}$$, is the Levi-Civita covariant derivative of the natural gradient $${{\mathrm{grad}}}\phi$$. Namely, for each vector field X, it is the vector field $${{\mathrm{Hess}}}_X \phi$$ whose value at $$\varSigma$$ is
\begin{aligned} {\text {Hess}}_{X}\phi (\varSigma )=D_{X}({\text {grad}}\phi )(\varSigma )=D_{X}(\nabla \phi (\varSigma )\varSigma +\varSigma \nabla \phi (\varSigma )) . \end{aligned}
The associated symmetric bilinear form is (see [1, Prop. 5.5.3])
\begin{aligned} {\text {Hess}}\phi (\varSigma )\left( X,Y\right) =\left\langle D_{X}({\text {grad}} \phi )(\varSigma ),Y\right\rangle _{\varSigma } . \end{aligned}
For our purposes it will be enough to compute the diagonal of the symmetric form. Therefore, letting $$X=Z=V$$ in the second part of Eq. (44), we obtain
\begin{aligned}&{\text {Hess}}\phi (\varSigma )\left( V,V\right) \\&\quad = \left\langle d_{V}Y,V\right\rangle _{\varSigma } + \frac{1}{2}{\text {Tr}}\left[ \mathscr {L} _{\varSigma }\left[ V\right] V\mathscr {L}_{\varSigma }\left[ Y\right] \right] \\&\quad \quad - \frac{1}{2}{\text {Tr}}\left[ \mathscr {L}_{\varSigma }\left[ V\right] Y\mathscr {L} _{\varSigma }\left[ V\right] \right] -\frac{1}{2}{\text {Tr}}\left[ \mathscr {L}_{\varSigma }\left[ V\right] V\mathscr { L}_{\varSigma }\left[ Y\right] \right] \\&\quad = \left\langle d_{V}Y,V\right\rangle _{\varSigma }-\frac{1}{2}{\text {Tr}}\left[ \mathscr {L}_{\varSigma }\left[ V\right] Y\mathscr {L}_{\varSigma }\left[ V\right] \right] , \end{aligned}
where $$Y={\text {grad}}\phi \left( \varSigma \right)$$. After plugging $$Y = {\text {grad}}\phi \left( \varSigma \right) =\varSigma \nabla \phi \left( \varSigma \right) +\nabla \phi \left( \varSigma \right) \varSigma$$ into it, we easily get
\begin{aligned}&{\text {Hess}}\phi (\varSigma )\left( V,V\right) = \left\langle \nabla _{V}^{2}\phi \left( \varSigma \right) \varSigma +\varSigma \nabla _{V}^{2}\phi \left( \varSigma \right) ,V\right\rangle _{\varSigma }\\&\quad + {\text {Tr}}\left[ \nabla \phi \left( \varSigma \right) V\mathscr {L}_{\varSigma }\left[ V\right] \right] -{\text {Tr}}\left[ \mathscr {L}_{\varSigma }\left[ V\right] \nabla \phi \left( \varSigma \right) \varSigma \mathscr {L}_{\varSigma }\left[ V\right] \right] . \end{aligned}
Plugging $$V = \mathscr {L} _ {\varSigma } \left[ V\right] \varSigma + \varSigma \mathscr {L} _ {\varSigma } \left[ V\right]$$ into the second term of the RHS, we have at last
\begin{aligned}&{\text {Hess}}\phi (\varSigma )\left( V,V\right) \nonumber \\&\quad = \left\langle \nabla _{V}^{2}\phi \left( \varSigma \right) \varSigma +\varSigma \nabla _{V}^{2}\phi \left( \varSigma \right) ,V\right\rangle _{\varSigma }+{\text {Tr}}\left[ \nabla \phi \left( \varSigma \right) \mathscr {L}_{\varSigma }\left[ V\right] \varSigma \mathscr {L}_{\varSigma }\left[ V\right] \right] .\quad \quad \end{aligned}
(47)
Relation (47) substantiates the following important property that links the Hessian to the derivative along a geodesic (see the proof of Prop. 5.5.4 of [1]).

### Proposition 12

Let $$\phi :{\text {Sym}}^{++}\left( n\right) \rightarrow {\mathbb {R}}$$ be a smooth scalar field and define
\begin{aligned} \varphi \left( t\right) =\phi \left( {\text {Exp}}_{\varSigma }\left( tV\right) \right) . \end{aligned}
It holds
\begin{aligned} {\ddot{\varphi }}\left( 0\right) ={\text {Hess}}\phi (\varSigma )\left( V,V\right) . \end{aligned}

### Proof

By Prop. 9
\begin{aligned} \varSigma (t)={\text {Exp}}_{\varSigma }\left( tV\right) =\varSigma +tV+t^{2}\mathscr {L} _{\varSigma }[V]\varSigma \mathscr {L}_{\varSigma }[V] \end{aligned}
where $$\varSigma (0)=\varSigma$$ and $${\dot{\varSigma }}(0)=V.$$ Hence $${\dot{\varphi }} \left( t\right) =\left\langle \nabla \phi (\varSigma (t)),{\dot{\varSigma }} (t)\right\rangle _{2}$$, and
\begin{aligned} {\ddot{\varphi }}\left( t\right) =\left\langle \nabla ^{2}\phi (\varSigma (t))[ {\dot{\varSigma }}(t)],{\dot{\varSigma }}(t)\right\rangle _{2}+\left\langle \nabla \phi (\varSigma (t)),{\ddot{\varSigma }}(t)\right\rangle _{2}\ \end{aligned}
which, evaluated at $$t=0$$, gives
\begin{aligned} {\ddot{\varphi }}\left( 0\right) =\left\langle \nabla ^{2}\phi (\varSigma )[V],V\right\rangle _{2}+2\left\langle \nabla \phi (\varSigma ),\mathscr {L}_{\varSigma }\left[ V\right] \varSigma \mathscr {L}_{\varSigma }\left[ V\right] \right\rangle _{2}. \end{aligned}
In view of Eq. (47),
\begin{aligned} {\text {Hess}}\phi (\varSigma )\left( V,V\right) =\left\langle \nabla _{V}^{2}\phi \left( \varSigma \right) ,V\right\rangle _{2}+2\left\langle \nabla \phi \left( \varSigma \right) ,\mathscr {L}_{\varSigma }\left[ V\right] \varSigma \mathscr {L}_{\varSigma }\left[ V\right] \right\rangle _{2}={\ddot{\varphi }}\left( 0\right) . \end{aligned}
$$\square$$
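Proposition 12 lends itself to a simple numerical illustration. The sketch below evaluates Eq. (47) for the test function $$\phi (\varSigma ) = {{\mathrm{Tr}}}\left( \varSigma ^2\right)$$ and compares it with a central second difference of $$\varphi$$ along the geodesic. The choice of $$\phi$$, the step size and the helper names are assumptions of this example; with the convention $$\left\langle A,B\right\rangle _{2} = \frac{1}{2}{{\mathrm{Tr}}}\left( AB\right)$$ used in this section, the gradient of this $$\phi$$ is $$4\varSigma$$ and its derivative in the direction V is $$4V$$.

```python
import numpy as np

def lyap(sigma, v):
    """Return L_Sigma[V], the symmetric solution L of L Sigma + Sigma L = V."""
    lam, u = np.linalg.eigh(sigma)
    return u @ ((u.T @ v @ u) / (lam[:, None] + lam[None, :])) @ u.T

def w_metric(sigma, x, y):
    """W_Sigma(X, Y) = Tr(L_Sigma[X] Sigma L_Sigma[Y])."""
    return np.trace(lyap(sigma, x) @ sigma @ lyap(sigma, y))

def phi(sigma):
    """Test scalar field phi(Sigma) = Tr(Sigma^2) (chosen only for illustration)."""
    return np.trace(sigma @ sigma)

def hess_eq47(sigma, v):
    """Right-hand side of Eq. (47) for phi(Sigma) = Tr(Sigma^2)."""
    grad = 4.0 * sigma       # gradient w.r.t. <A, B>_2 = (1/2) Tr(A B)
    hess_v = 4.0 * v         # derivative of the gradient in the direction V
    l = lyap(sigma, v)
    first = w_metric(sigma, hess_v @ sigma + sigma @ hess_v, v)
    return first + np.trace(grad @ l @ sigma @ l)

def geodesic(sigma, v, t):
    """Exp_Sigma(tV) = Sigma + tV + t^2 L_Sigma[V] Sigma L_Sigma[V]."""
    l = lyap(sigma, v)
    return sigma + t * v + t**2 * (l @ sigma @ l)

rng = np.random.default_rng(5)
n = 3
m = rng.standard_normal((n, n))
sigma = m @ m.T + n * np.eye(n)          # arbitrary positive-definite test point
v = rng.standard_normal((n, n))
v = 0.5 * (v + v.T)                      # arbitrary symmetric direction

h = 1e-3
second_diff = (phi(geodesic(sigma, v, h)) - 2.0 * phi(sigma)
               + phi(geodesic(sigma, v, -h))) / h**2

print(second_diff, hess_eq47(sigma, v))
```

The two printed numbers should agree up to the finite-difference error.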

## 8 Conclusion

In the present paper we have discussed in some detail the Wasserstein geometric properties of the manifold of Gaussian densities. We have followed a known argument based on the geometric notion of submersion and we have improved upon what is known in the literature by offering a number of further results. In particular, we have studied the geodesic surfaces and provided an explicit form for the Riemannian exponential. More importantly, a new formulation of the metric based on the field of operators $$\varSigma \mapsto \mathscr {L} _ {\varSigma } \left[ \cdot \right]$$ is introduced. This field of operators gives the Riemannian metric by the Frobenius inner product: $$W_\varSigma (X,Y) = \left\langle \mathscr {L} _ {\varSigma } \left[ X\right] ,Y\right\rangle _{2}$$. This gives rise to an explicit identification of the Riemannian gradient as well as to the calculation of the Levi-Civita covariant derivative, through the partial derivatives of the metric. The equations of the parallel transport and of the Riemannian Hessian have also been derived.

While the form of the natural gradient is simple and may be a source of applications such as those of interest in Machine Learning, the Levi-Civita covariant derivative turns out to be more involved and it is not clear how to use it in applications. However, we have produced a simpler form by the introduction of a special moving frame. In view of this issue, we have not proceeded in this paper to compute other geometrical quantities of interest, like the curvature tensor.

Numerical as well as simulation methods for the relevant equations of the geometry, such as geodesics, parallel transport, and Hessians, should also be considered. Applications of special interest are in the area of linear optimization, by means of the natural gradient as direction of increase and by using the Riemannian exponential as a retraction, cf. [1] and Amari's monograph [5]. Second-order optimization methods (Newton method), via the Riemannian Hessian and the Riemannian exponential, cf. [1] and [5], are also a promising direction for research.

The issue of a comparison between Fisher and Wasserstein metric is not taken into account here as it is, for example, in Chevallier et al. [11].

From the point of view of applications in Statistics and Machine Learning, the use of the full Gaussian model is in many cases not realistic. We expect our results to be used to compute the Wasserstein geometry induced on parsimonious sub-manifolds such as those listed below.
1. Sub-manifold of the correlation matrices, i.e., matrices with unit diagonal elements. In this case, the tangent space at each point is the space of symmetric matrices with zero diagonal.

2. Sub-manifold of trace 1 matrices. This case is of particular interest in Physics and calls for a generalization of the theory to complex Gaussians, i.e., Gaussian densities on $$\mathbb C^n$$. Such distributions have Hermitian covariance matrices, a case that is discussed in [8].

3. Sub-manifold of the concentration matrices with a given sparsity pattern. Notice that concentration matrices and dispersion matrices are both elements of the same space $${{\mathrm{Sym}}}^{++}\left( n\right)$$. In this case the statistical interpretation of the Wasserstein distance is not available; nevertheless, other interpretations of the distance are mentioned in the Introduction.

## Notes

### Acknowledgements

The authors wish to thank two anonymous referees for helpful comments. G. Pistone acknowledges the support of de Castro Statistics and Collegio Carlo Alberto. He is a member of GNAMPA-INdAM.

## References

1. Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008). (with a foreword by Paul Van Dooren)
2. Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis. A Hitchhiker’s Guide, 3rd edn. Springer, Berlin (2006)
3. Amari, S., Nagaoka, H.: Methods of information geometry. American Mathematical Society, Providence (2000). (translated from the 1993 Japanese original by Daishi Harada)
4. Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
5. Amari, S.I.: Information geometry and its applications. Appl. Math. Sci. 194 (2016)
6. Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics, 3rd edn. Wiley, Hoboken (2003)
7. Bhatia, R.: Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, Princeton (2007). ([2015] paperback edition of the 2007 original [MR2284176])
8. Bhatia, R., Jain, T., Lim, Y.: On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae (2018). arXiv:1712.01504 (in press)
9. Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Comm. Pure Appl. Math. 44(4), 375–417 (1991)
10. do Carmo, M.P.: Riemannian geometry. Mathematics: Theory and Applications. Birkhäuser Boston Inc., Cambridge (1992). (translated from the second Portuguese edition by Francis Flaherty)
11. Chevallier, E., Kalunga, E., Angulo, J.: Kernel density estimation on spaces of Gaussian distributions and symmetric positive definite matrices. SIAM J. Imaging Sci. 10(1), 191–215 (2017)
12. Dowson, D.C., Landau, B.V.: The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 12(3), 450–455 (1982)
13. Gelbrich, M.: On a formula for the $$L^2$$ Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachr. 147, 185–203 (1990)
14. Givens, C.R., Shortt, R.M.: A class of Wasserstein metrics for probability distributions. Michigan Math. J. 31(2), 231–240 (1984)
15. Halmos, P.R.: Finite-dimensional vector spaces. The University Series in Undergraduate Mathematics, 2nd edn. D. Van Nostrand Co., Inc., Princeton-Toronto-New York-London (1958)
16. Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709 (2005)
17. Klingenberg, W.P.A.: Riemannian Geometry, De Gruyter Studies in Mathematics, vol. 1, 2nd edn. Walter de Gruyter & Co., Berlin (1995)
18. Knott, M., Smith, C.S.: On the optimal mapping of distributions. J. Optim. Theory Appl. 43(1), 39–49 (1984)
19. Lafferty, J.D.: The density manifold and configuration space quantization. Trans. Am. Math. Soc. 305(2), 699–741 (1988)
20. Lang, S.: Differential and Riemannian manifolds, Graduate Texts in Mathematics, vol. 160, 3rd edn. Springer, Berlin Heidelberg (1995)
21. Lott, J.: Some geometric calculations on Wasserstein space. Comm. Math. Phys. 277(2), 423–437 (2008)
22. Magnus, J.R., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Series in Probability and Statistics. Wiley, Chichester (1999). (Revised reprint of the 1988 original)
23. Malagò, L., Pistone, G.: Combinatorial optimization with information geometry: Newton method. Entropy 16, 4260–4289 (2014)
24. Malagò, L., Pistone, G.: Information geometry of the Gaussian distribution in view of stochastic optimization. In: Proceedings of FOGA’15, held on January 17-20, 2015, Aberystwyth, Wales (2015)
25. Mangasarian, O.L., Fromovitz, S.: The Fritz John necessary optimality conditions in the presence of equality and inequality constraints. J. Math. Anal. Appl. 17, 37–47 (1967)
26. McCann, R.J.: A convexity principle for interacting gases. Adv. Math. 128(1), 153–179 (1997)
27. McCann, R.J.: Polar factorization of maps on Riemannian manifolds. Geom. Funct. Anal. 11(3), 589–608 (2001)
28. Olkin, I., Pukelsheim, F.: The distance between two random vectors with given dispersion matrices. Linear Algebra Appl. 48, 257–263 (1982)
29. Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Comm. Partial Differential Equations 26(1-2), 101–174 (2001)
30. Papadopoulos, A.: Metric spaces, convexity and non-positive curvature, IRMA Lectures in Mathematics and Theoretical Physics, vol. 6, 2nd edn. European Mathematical Society (EMS), Zürich (2014)
31. Parry, M., Dawid, A.P., Lauritzen, S.: Proper local scoring rules. Ann. Stat. 40(1), 561–592 (2012)
32. Pistone, G.: Nonparametric information geometry. In: F. Nielsen, F. Barbaresco (eds.) Geometric Science of Information, Lecture Notes in Comput. Sci., vol. 8085, pp. 5–36. Springer, Heidelberg (2013). First International Conference, GSI 2013, Paris, France, August 28-30, 2013 (proceedings)
33. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 23(5), 1543–1561 (1995)
34. Simoncini, V.: Computational methods for linear matrix equations. SIAM Rev. 58(3), 377–441 (2016)
35. Skovgaard, L.T.: A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 11(4), 211–223 (1984)
36. Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48(4), 1005–1026 (2011)
37. Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin Heidelberg (2008)
38. Wachspress, E.L.: Trail to a Lyapunov equation solver. Comput. Math. Appl. 55(8), 1653–1659 (2008)

© Springer Nature Singapore Pte Ltd. 2018

## Authors and Affiliations

• Luigi Malagò (1)
• Luigi Montrucchio (2)
• Giovanni Pistone (3, email author)

1. Romanian Institute of Science and Technology, RIST, Cluj-Napoca, Romania
2. Collegio Carlo Alberto, Turin, Italy
3. de Castro Statistics, Collegio Carlo Alberto, Turin, Italy