Riemannian block SPD coupling manifold and its application to optimal transport

In this work, we study the optimal transport (OT) problem between symmetric positive definite (SPD) matrix-valued measures. We formulate the above as a generalized optimal transport problem where the cost, the marginals, and the coupling are represented as block matrices and each component block is an SPD matrix. The summations of the row blocks and column blocks of the coupling matrix are constrained by the given block SPD marginals. We endow the set of such block-coupling matrices with a novel Riemannian manifold structure. This allows us to exploit the versatile Riemannian optimization framework to solve generic SPD matrix-valued OT problems. We illustrate the usefulness of the proposed approach in several applications.


Introduction
Optimal transport (OT) offers a systematic approach to compare probability distributions by finding a transport plan (coupling) that minimizes the cost of transporting mass from one distribution to another. It has been successfully applied in a wide range of fields, such as computer graphics [Solomon et al., 2015, Solomon et al., 2014], graph representation learning [Chen et al., 2020, Petric Maretic et al., 2019], text classification [Yurochkin et al., 2019], domain adaptation [Courty et al., 2016, Courty et al., 2014, Nath and Jawanpuria, 2020], cross-lingual translation [Alvarez-Melis and Jaakkola, 2018, Jawanpuria et al., 2020], and prototype selection [Gurumoorthy et al., 2021], to name a few. OT-based distances have also been used as loss functions in both discriminative and generative learning settings [Frogner et al., 2015, Arjovsky et al., 2017, Genevay et al., 2018, Jawanpuria et al., 2021]. Despite the popularity of OT, existing OT formulations are mostly limited to scalar-valued distributions. On the other hand, many applications involve symmetric positive definite (SPD) matrix-valued distributions. In diffusion tensor imaging [Le Bihan et al., 2001], the local diffusion of water molecules in the human brain is encoded in fields of SPD matrices [Assaf and Pasternak, 2008].
In image processing, region information of an image can be effectively captured through several SPD covariance descriptors [Tuzel et al., 2006]. For image set/video classification, each set of images/frames can be represented by its covariance matrix, which has shown great promise in modelling intra-set variations [Huang et al., 2015, Harandi et al., 2014]. In addition, fields of SPD matrices are also important in computer graphics for anisotropic diffusion [Weickert, 1998], remeshing [Alliez et al., 2003] and texture synthesis [Galerne et al., 2010], to name a few. In all such cases, being able to compare fields represented by SPD matrices is crucial.

Riemannian optimization preliminaries

A retraction is a map from the tangent space to the manifold. That is, for any x ∈ M, a retraction R_x : T_x M → M satisfies 1) R_x(0) = x and 2) DR_x(0)[u] = u, where Df(x)[u] denotes the derivative of a function f at x along direction u.
The Riemannian gradient of a function F : M → R at x, denoted gradF(x), generalizes the notion of the Euclidean gradient ∇F(x). It is defined as the unique tangent vector satisfying ⟨gradF(x), u⟩_x = DF(x)[u] = ⟨∇F(x), u⟩_2 for any u ∈ T_x M, where ⟨·, ·⟩_2 denotes the Euclidean inner product. To minimize the function, Riemannian gradient descent [Absil et al., 2008] and other first-order solvers apply the retraction to update the iterates along the direction of the negative Riemannian gradient while staying on the manifold, i.e., x_{t+1} = R_{x_t}(−η gradF(x_t)), where η is the step size. Similarly, the Riemannian Hessian HessF(x) : T_x M → T_x M is defined as the covariant derivative of the Riemannian gradient. Popular second-order methods, such as trust regions and cubic regularized Newton's methods, have been adapted to Riemannian optimization [Absil et al., 2007, Agarwal et al., 2018].
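As a concrete illustration of these update rules, here is a minimal NumPy sketch of Riemannian gradient descent on the SPD manifold with the affine-invariant metric. The toy objective F(X) = tr(X) − log det(X) (minimized at X = I), the step size, and the iteration count are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def spd_apply(X, f):
    # Apply a scalar function to the eigenvalues of a symmetric matrix.
    w, V = np.linalg.eigh(X)
    return (V * f(w)) @ V.T

def exp_map(X, U):
    # Exponential map (used as retraction) for the affine-invariant metric.
    Xh = spd_apply(X, np.sqrt)
    Xih = spd_apply(X, lambda w: 1.0 / np.sqrt(w))
    return Xh @ spd_apply(Xih @ U @ Xih, np.exp) @ Xh

# Toy objective F(X) = tr(X) - log det(X), minimized at X = I.
def egrad(X):
    return np.eye(len(X)) - np.linalg.inv(X)

X = np.diag([2.0, 0.5])
eta = 0.1
for _ in range(300):
    rgrad = X @ egrad(X) @ X      # Riemannian gradient under this metric
    X = exp_map(X, -eta * rgrad)  # x_{t+1} = R_{x_t}(-eta * grad F(x_t))
```

The same loop structure underlies the manifold optimization used later in the paper, with extra projection and balancing steps to stay on the coupling manifold.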

Scalar-valued optimal transport
Consider two discrete measures supported on R^d, µ = Σ_{i=1}^m p_i δ_{x_i} and ν = Σ_{j=1}^n q_j δ_{y_j}, where x_i, y_j ∈ R^d and δ_x is the Dirac measure at x. The weights p ∈ Σ_m, q ∈ Σ_n lie in the probability simplex, where Σ_k := {p ∈ R^k : p_i ≥ 0, Σ_i p_i = 1}. The 2-Wasserstein distance between µ and ν is given by solving the Monge-Kantorovich optimal transport problem:

W_2^2(p, q) = min_{γ ∈ Π(p,q)} Σ_{i,j} γ_{i,j} ‖x_i − y_j‖^2,    (1)

where Π(p, q) := {γ ∈ R^{m×n} : γ ≥ 0, γ1 = p, γ^⊤1 = q} is the space of joint distributions between the source and the target marginals. An optimal solution of (1) is referred to as an optimal transport plan (or coupling). Recently, [Cuturi, 2013] proposed to use the Sinkhorn-Knopp algorithm [Sinkhorn, 1964, Knight, 2008] for an entropy-regularized OT formulation. In case µ and ν are general measures (i.e., the setting is not restricted to probability measures), they may have unequal masses.
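For intuition, the entropy-regularized problem mentioned above can be solved in a few lines of NumPy. This is a standard Sinkhorn-Knopp sketch; the point sets and regularization strength are illustrative choices of ours:

```python
import numpy as np

def sinkhorn(p, q, C, eps=0.5, iters=2000):
    # Entropy-regularized OT: the optimal plan has the form
    # gamma = diag(u) K diag(v) with K = exp(-C/eps); the scalings u, v
    # are found by alternately matching the row and column marginals.
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

# Two small discrete measures on the real line.
x, y = np.array([0.0, 1.0, 2.0]), np.array([0.5, 1.5])
p, q = np.full(3, 1 / 3), np.full(2, 1 / 2)
C = (x[:, None] - y[None, :]) ** 2  # squared Euclidean cost
gamma = sinkhorn(p, q, C)
```

The returned plan satisfies the marginal constraints of Π(p, q) up to numerical tolerance.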
For a recent survey of OT literature and related machine learning applications, please refer to [Peyré et al., 2019b].

SPD matrix-valued optimal transport
An SPD matrix-valued measure is a generalization of the (scalar-valued) probability measure discussed in Section 2.2. Let us consider an SPD matrix-valued measure M and a scalar-valued measure µ defined on a space X. Let A be a measurable subset of X. Then, while µ(A) is a non-negative scalar, the "mass" M(A) ∈ S^d_+, where S^d_+ denotes the set of d × d positive semi-definite matrices. SPD matrix-valued measures have been employed in applications such as diffusion tensor imaging [Le Bihan et al., 2001], image set classification [Huang et al., 2015, Harandi et al., 2014], anisotropic diffusion [Weickert, 1998], and brain imaging [Assaf and Pasternak, 2008], to name a few.
Recent works [Carlen and Maas, 2014, Chen et al., 2017, Ryu et al., 2018, Peyré et al., 2019a] have explored optimal transport formulations for SPD matrix-valued measures. While [Carlen and Maas, 2014, Chen et al., 2017, Ryu et al., 2018] discuss the dynamical (geodesic) OT framework, [Peyré et al., 2019a] studies the "static" OT formulation that learns a suitable joint coupling between the input SPD matrix-valued measures. However, [Peyré et al., 2019a] explores an unbalanced OT setup for SPD matrix-valued measures and terms it quantum optimal transport (QOT). Thus, the marginals of the (learned) joint coupling in QOT are not equal to the input SPD matrix-valued measures. As in the case of unbalanced (scalar-valued) OT [Chizat et al., 2018, Liero et al., 2018], the discrepancy between the marginals of the joint coupling and the input measures in QOT is penalized via the Kullback-Leibler divergence (for SPD matrix-valued measures).

Block SPD optimal transport
In this section, we study a balanced OT formulation for SPD matrix-valued measures. Consider P and Q to be (d-dimensional) SPD matrix-valued input measures, i.e., P := {[P_i]_{m×1} : P_i ∈ S^d_{++}} and Q := {[Q_j]_{n×1} : Q_j ∈ S^d_{++}}, such that P and Q have the same total mass. Without loss of generality, we assume Σ_i P_i = Σ_j Q_j = I. Here, [·]_{m×n} denotes a collection of mn matrices organized as a block matrix and I represents the identity matrix. The cost of transporting a positive definite matrix-valued mass A from position x_i (in the source space) to y_j (in the target space) is parameterized by a (given) positive semi-definite matrix C_{i,j} and is computed as tr(C_{i,j}A). Under this setting, we propose the block SPD matrix-valued balanced OT problem

MW(P, Q) := min_{Γ ∈ Π(m,n,d,P,Q)} Σ_{i,j} tr(C_{i,j} Γ_{i,j}),    (2)

where Γ = [Γ_{i,j}]_{m×n} is a block-matrix coupling of size m × n and the set of such couplings is defined as

Π(m, n, d, P, Q) := {Γ = [Γ_{i,j}]_{m×n} : Γ_{i,j} ∈ S^d_+, Σ_j Γ_{i,j} = P_i ∀i, Σ_i Γ_{i,j} = Q_j ∀j}.

Here, S^d_+ is used to denote the set of d × d positive semi-definite matrices and tr(·) is the matrix trace. The problem is well-defined provided that the coupling constraint set Π(m, n, d, P, Q) is non-empty. For arbitrary SPD marginals P, Q, there is no guarantee that the set Π(m, n, d, P, Q) defined in (2) is non-empty [Ning et al., 2014]. Hence, in this work, we assume that the given marginals P and Q are such that Π(m, n, d, P, Q) is non-empty. In Section 4.3 later, we discuss a block matrix balancing algorithm which can be used to check whether Π(m, n, d, P, Q) is empty or not for given marginals P and Q.
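The objective and marginal constraints of (2) are easy to state in code. Below is a small NumPy sketch with the blocks stored as a 4-d array; the shapes and the feasible identity-marginal example are our own illustration:

```python
import numpy as np

def block_ot_cost(C, Gamma):
    # Objective of (2): sum_ij tr(C_ij @ Gamma_ij), with the blocks of C
    # and Gamma stored as (m, n, d, d) arrays.
    return np.einsum('ijab,ijba->', C, Gamma)

def marginal_violation(Gamma, P, Q):
    # Deviation of the row/column block sums from the block SPD marginals.
    row = np.abs(Gamma.sum(axis=1) - P).max()
    col = np.abs(Gamma.sum(axis=0) - Q).max()
    return max(row, col)

# A feasible coupling for identity marginals: Gamma_ij = I / (m * n).
m, n, d = 2, 2, 3
Gamma = np.tile(np.eye(d) / (m * n), (m, n, 1, 1))
P = np.tile(np.eye(d) / m, (m, 1, 1))
Q = np.tile(np.eye(d) / n, (n, 1, 1))
C = np.tile(np.eye(d), (m, n, 1, 1))
```

With these identity marginals, the tiled coupling is feasible and its cost is Σ_{i,j} tr(Γ_{i,j}) = d.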

Metric properties of MW(P, Q)
In the following result, we show that MW(P, Q) is a valid distance metric for a special case of block SPD marginals.
Proposition 3.1. Suppose the input SPD matrix-valued marginals have the same support size n and the costs {C_{i,j}}_{i,j=1}^n satisfy 1) C_{i,j} = C_{j,i}, and 2) C_{i,j} ≻ 0 for i ≠ j and C_{i,j} = 0 for i = j. Then, MW(P, Q) is a metric between SPD matrix-valued marginals of the form P = [p_i I]_{n×1} and Q = [q_i I]_{n×1}, where p, q ∈ Σ_n and I is the d × d identity matrix.
We remark that the conditions on C_{i,j} in Proposition 3.1 generalize the conditions required for W_2(p, q) in (1) to be a metric. See, for example, [Peyré et al., 2019b, Proposition 2.2]. In Appendix B, we discuss some particular constructions of the cost that satisfy these conditions.
Manifold structure for the coupling set Π(m, n, d, P, Q)

We next analyze the coupling constraint set Π(m, n, d, P, Q) and show that it can be endowed with a manifold structure. This allows us to exploit the versatile Riemannian optimization framework to solve (2) and more general problems [Absil et al., 2008].
We propose the following manifold structure, termed the block SPD coupling manifold:

M^d_{m,n}(P, Q) := {Γ = [Γ_{i,j}]_{m×n} : Γ_{i,j} ∈ S^d_{++}, Σ_j Γ_{i,j} = P_i ∀i, Σ_i Γ_{i,j} = Q_j ∀j},    (3)

where Σ_i P_i = Σ_j Q_j = I. In particular, we restrict P_i, Q_j, Γ_{i,j} ∈ S^d_{++}, the set of SPD matrices. This ensures that the proposed manifold M^d_{m,n}(P, Q) in (3) is the interior of the set Π(m, n, d, P, Q). As discussed earlier, Π(m, n, d, P, Q) is not guaranteed to be non-empty for arbitrary choices of block SPD marginals P and Q [Ning, 2013]. To this end, we assume that the given marginals P and Q ensure feasibility of the set Π(m, n, d, P, Q). In particular, the manifold M^d_{m,n}(P, Q) inherits the following assumption.
Assumption 3.2. In this work, we consider block SPD marginals P and Q such that the set M^d_{m,n}(P, Q) is not empty.
It should be noted that Assumption 3.2 is trivially satisfied for diagonal SPD marginals, i.e., when the P_i and Q_j are diagonal. However, non-diagonal SPD marginals may also satisfy Assumption 3.2 in many problem instances. In Section 6, we discuss empirical settings where non-diagonal SPD marginals satisfying Assumption 3.2 are considered. The following proposition implies that we can endow M^d_{m,n}(P, Q) with a differentiable structure.
It should be emphasized that the proposed manifold M^d_{m,n}(P, Q) can be regarded as a generalization of existing manifold structures. For example, when d = 1 and either m = 1 or n = 1, M^d_{m,n}(P, Q) reduces to the multinomial manifold of the probability simplex [Sun et al., 2015]. When d = 1 and m, n > 1, it reduces to the so-called doubly stochastic manifold [Douik and Hassibi, 2019] with uniform marginals, or the more general matrix coupling manifold [Shi et al., 2021]. When d > 1 and either m = 1 or n = 1, our proposed manifold simplifies to the simplex manifold of SPD matrices [Mishra et al., 2019].
In the next section, we derive various optimization-related ingredients on M^d_{m,n}(P, Q) that allow optimization of an arbitrary differentiable objective function on the manifold. In particular, we propose a Riemannian optimization approach following the general treatment of [Absil et al., 2008, Boumal, 2020]. This allows employing the proposed approach not only for (2) but also for other OT problems, as discussed in Section 5.

Algorithm 1 Riemannian optimization over M^d_{m,n} (sketch)
Input: objective F, initialization Γ ∈ M^d_{m,n}.
repeat
  Compute the update step ξ ∈ T_Γ M^d_{m,n}.
  Update Γ via retraction along ξ.
until convergence
Riemannian geometry and optimization over M^d_{m,n}(P, Q)

We consider the general optimization problem

min_{Γ ∈ M^d_{m,n}(P,Q)} F(Γ),    (4)

where F is a differentiable objective function. The proposed manifold M^d_{m,n}(P, Q) can be endowed with a smooth Riemannian manifold structure [Absil et al., 2008, Boumal, 2020]. Consequently, (4) is an optimization problem on a Riemannian manifold, which we solve via the Riemannian optimization framework. This framework provides a principled class of first- and second-order optimization methods and computational tools for manifolds, as long as ingredients such as the Riemannian metric, orthogonal projection, retraction, and Riemannian gradient (and Hessian) of a function are defined [Absil et al., 2008, Boumal et al., 2014, Boumal, 2020]. Conceptually, the Riemannian optimization framework treats (4) as an "unconstrained" optimization problem over the constraint manifold M^d_{m,n} (we omit the marginals P, Q for clarity). In Algorithm 1, we outline the skeletal steps involved in optimization over M^d_{m,n}, where the step ξ can be computed by different Riemannian methods. In Riemannian steepest descent, ξ = −η gradF(Γ), where gradF(Γ) is the Riemannian gradient at Γ. In the Riemannian conjugate gradient method, ξ is given by the "conjugate" direction of gradF(Γ). And for the Riemannian trust-region method, computing ξ involves minimizing a second-order approximation of the objective function in a trust-region ball [Absil et al., 2008]. Below, we show the computations of these ingredients.

Riemannian metric
The tangent space of M^d_{m,n} at Γ is given by

T_Γ M^d_{m,n} = {U = [U_{i,j}]_{m×n} : U_{i,j} ∈ S^d, Σ_j U_{i,j} = 0 ∀i, Σ_i U_{i,j} = 0 ∀j},

where S^d is the set of d × d symmetric matrices. The expression for the tangent space is obtained by linearizing the constraints. We endow each SPD manifold with the affine-invariant Riemannian metric [Bhatia, 2009], which induces a Riemannian metric on the product manifold M^d_{m,n}:

⟨U, V⟩_Γ = Σ_{i,j} tr(Γ_{i,j}^{−1} U_{i,j} Γ_{i,j}^{−1} V_{i,j}),    (5)

for any U, V ∈ T_Γ M^d_{m,n}.
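The induced metric on the product manifold can be evaluated blockwise. A NumPy sketch follows; the identity-block test case is our own illustration:

```python
import numpy as np

def block_metric(Gamma, U, V):
    # <U, V>_Gamma = sum_ij tr(Gamma_ij^{-1} U_ij Gamma_ij^{-1} V_ij):
    # the affine-invariant metric, summed over all (i, j) blocks.
    Ginv = np.linalg.inv(Gamma)  # batched inverse over the (m, n) block grid
    return np.einsum('ijab,ijbc,ijcd,ijda->', Ginv, U, Ginv, V)

# With identity blocks, the metric reduces to the Frobenius inner product.
m, n, d = 2, 3, 2
Gamma = np.tile(np.eye(d), (m, n, 1, 1))
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n, d, d)); U = A + A.transpose(0, 1, 3, 2)
B = rng.standard_normal((m, n, d, d)); V = B + B.transpose(0, 1, 3, 2)
val = block_metric(Gamma, U, V)
```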

Orthogonal projection, Riemannian gradient, and Riemannian Hessian
Since M^d_{m,n} is an embedded submanifold, the orthogonal projection onto its tangent space plays a crucial role in deriving the Riemannian gradient (as the orthogonal projection of the Euclidean gradient in the ambient space).
Proposition 4.1. The orthogonal projection of any S ∈ ×_{m,n} S^d onto T_Γ M^d_{m,n} with respect to the Riemannian metric (5) is given blockwise by

P_Γ(S)_{i,j} = S_{i,j} − Γ_{i,j}(Λ_i + Θ_j)Γ_{i,j},

where the auxiliary symmetric matrices Λ_i, Θ_j are solved from the system of matrix linear equations Σ_j P_Γ(S)_{i,j} = 0 for all i and Σ_i P_Γ(S)_{i,j} = 0 for all j. Subsequently, the Riemannian gradient and Hessian are derived via the orthogonal projection of the gradient and Hessian in the ambient space.
Proposition 4.2. The Riemannian gradient and Hessian of F : M^d_{m,n} → R at Γ are given by

gradF(Γ) = P_Γ([Γ_{i,j} {∇F(Γ_{i,j})}_S Γ_{i,j}]_{m×n}),
HessF(Γ)[U] = P_Γ([DgradF(Γ_{i,j})[U_{i,j}] − {U_{i,j} Γ_{i,j}^{−1} gradF(Γ_{i,j})}_S]_{m×n}),

where U ∈ T_Γ M^d_{m,n} and ∇F(Γ_{i,j}) is the block partial derivative of F with respect to Γ_{i,j}. Here, DgradF(Γ_{i,j})[U_{i,j}] denotes the directional derivative of the Riemannian gradient gradF along U, and {A}_S := (A + A^⊤)/2.

Retraction and block matrix balancing algorithm
The retraction operation on M^d_{m,n} is given by a composition of two operations. The first ensures positive definiteness of the blocks in the coupling matrix; in particular, we use the exponential map associated with the affine-invariant metric on the SPD manifold S^d_{++} [Bhatia, 2009]. The second ensures that the summations of the row blocks and column blocks respect the block SPD marginals. Given an initial block SPD matrix [A_{i,j}]_{m×n}, where A_{i,j} ∈ S^d_{++}, the goal is to find a 'closest' block SPD coupling matrix B ∈ M^d_{m,n}. This is achieved by alternately normalizing the row and column blocks to the corresponding marginals. The procedure is outlined in Algorithm 2. The row and column normalization factors R_j, L_i, which are SPD matrices, are computed by solving the Riccati equation TXT = Y for given X, Y ∈ S^d_{++}, which admits a unique SPD solution T [Bhatia, 2009, Malagò et al., 2018]. Different from the scalar-marginal case, where the scaling can be expressed as a diagonal matrix, we need to symmetrically normalize each SPD block. Algorithm 2 is a generalization of the RAS algorithm for balancing non-negative matrices [Sinkhorn, 1967], which is related to the popular Sinkhorn-Knopp algorithm [Sinkhorn, 1964, Knight, 2008]. We also use Algorithm 2 to test feasibility of the set M^d_{m,n} by checking whether it outputs a balanced block SPD matrix for a random block SPD input A.
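The core normalization step solves the Riccati equation TXT = Y, whose unique SPD solution has the closed form T = X^{-1/2}(X^{1/2} Y X^{1/2})^{1/2} X^{-1/2} [Bhatia, 2009]. A NumPy sketch (the random SPD test matrices are illustrative):

```python
import numpy as np

def sqrtm_spd(X):
    # Principal square root of an SPD matrix via eigendecomposition.
    w, V = np.linalg.eigh(X)
    return (V * np.sqrt(w)) @ V.T

def solve_riccati(X, Y):
    # Unique SPD solution T of T X T = Y:
    # T = X^{-1/2} (X^{1/2} Y X^{1/2})^{1/2} X^{-1/2}.
    Xh = sqrtm_spd(X)
    Xih = np.linalg.inv(Xh)
    return Xih @ sqrtm_spd(Xh @ Y @ Xh) @ Xih

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); X = A @ A.T + 3 * np.eye(3)
B = rng.standard_normal((3, 3)); Y = B @ B.T + 3 * np.eye(3)
T = solve_riccati(X, Y)
```

Substituting back confirms T X T = Y, so a single call performs one symmetric normalization of a block.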
It should be noted that a similar matrix balancing algorithm has been introduced for positive operators [Gurvits, 2004, Georgiou and Pavon, 2015], where convergence is only established in limited cases. Algorithm 2 is different from the quantum Sinkhorn algorithm proposed in [Peyré et al., 2019a], which applies to the unbalanced setting. Although we do not provide a theoretical convergence analysis for Algorithm 2, we empirically observe quick convergence of this algorithm in various settings (see Appendix A).
Based on Algorithm 2, we define a retraction

R_Γ(U) = MBalance([Γ_{i,j}^{1/2} exp(Γ_{i,j}^{−1/2} U_{i,j} Γ_{i,j}^{−1/2}) Γ_{i,j}^{1/2}]_{m×n}),    (6)

where MBalance calls the matrix balancing procedure in Algorithm 2 and exp(·) denotes the matrix exponential. The retraction proposed in (6) is valid (i.e., satisfies the two conditions) for diagonal marginals, and empirically we also observe that it is well-defined for arbitrary block SPD marginals. See Appendix A for more details.
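The first operation of the retraction, before balancing, is the affine-invariant exponential map on each block. A NumPy sketch (the test matrices are our own):

```python
import numpy as np

def expmap_spd(X, U):
    # Exp_X(U) = X^{1/2} expm(X^{-1/2} U X^{-1/2}) X^{1/2},
    # the exponential map at X for the affine-invariant metric.
    w, V = np.linalg.eigh(X)
    Xh = (V * np.sqrt(w)) @ V.T   # X^{1/2}
    Xih = (V / np.sqrt(w)) @ V.T  # X^{-1/2}
    S = Xih @ U @ Xih
    ws, Vs = np.linalg.eigh(S)
    return Xh @ ((Vs * np.exp(ws)) @ Vs.T) @ Xh

# The output is SPD for any symmetric direction U, even for large steps.
X = np.diag([1.0, 4.0])
U = np.array([[0.5, 1.0], [1.0, -0.3]])
Y = expmap_spd(X, U)
```

Applying this map blockwise and then calling the balancing procedure restores the marginal constraints while keeping every block positive definite.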

Convergence and computational complexity
Convergence of Riemannian optimization. Similar to Euclidean optimization, the necessary first-order optimality condition for any differentiable F on M^d_{m,n} is gradF(Γ*) = 0, i.e., the Riemannian gradient vanishes. We call such a Γ* a stationary point. Riemannian methods are known to converge to a stationary point [Absil et al., 2008, Boumal, 2020] under standard assumptions. Additionally, we show the following.
Theorem 4.3. Suppose the objective function of the problem min_{Γ ∈ Π(m,n,d,P,Q)} F(Γ) is strictly convex and the optimal solution Γ* is positive definite, i.e., it lies in the interior of Π(m, n, d, P, Q). Then, Riemannian optimization (Algorithm 1) for (4) converges to the same global optimal solution Γ*.

Theorem 4.3 guarantees the quality of the solution obtained by Riemannian optimization for a class of objective functions that includes the SPD matrix-valued OT problem with convex regularization.
Computational complexity. The per-iteration complexity of the Riemannian optimization algorithm is dominated by the computations of the retraction, the Riemannian gradient, and the Riemannian Hessian, all of which make use of the orthogonal projection operation. All these operations cost O(mnd³). Since the number of parameters to be learned is N = mnd² (the size of the block SPD coupling matrix Γ), the above cost is almost linear in N.

Applications of block SPD coupling manifold
As discussed earlier, we employ the proposed block SPD coupling manifold optimization approach to solve the block SPD matrix-valued balanced OT problem (2). We now present two other OT-related applications of the block SPD coupling manifold: learning Wasserstein barycenters and Gromov-Wasserstein averaging of distance matrices.

Block SPD Wasserstein barycenter learning
We consider the problem of computing the Wasserstein barycenter of a set of block SPD matrix-valued measures. Let ∆_n(S^d_{++}) := {P = [P_i]_{n×1} : P_i ∈ S^d_{++}, Σ_i P_i = I} denote the space of n × 1 block SPD marginals. Then, the Wasserstein barycenter of a set of measures P^ℓ ∈ ∆_n(S^d_{++}), ℓ ∈ {1, ..., K}, is computed as

arg min_{P ∈ ∆_n(S^d_{++})} Σ_ℓ ω_ℓ MW_ε(P, P^ℓ),    (7)

where the given non-negative weights satisfy Σ_ℓ ω_ℓ = 1. It should be noted that we employ a regularized version of the proposed block SPD OT problem (2) to ensure differentiability of the objective function near the boundary in (7). The regularized block SPD OT problem is defined as

MW_ε(P, Q) := min_{Γ ∈ Π(m,n,d,P,Q)} Σ_{i,j} tr(C_{i,j} Γ_{i,j}) + ε Ω(Γ),    (8)

where ε > 0 is the regularization parameter and Ω(·) is a strictly convex regularizer (e.g., entropic regularization) on the block SPD coupling matrices.
To solve (7), we consider Riemannian optimization on ∆_n(S^d_{++}), which has recently been studied in [Mishra et al., 2019]. The following result provides an expression for the Euclidean gradient of the objective function in problem (7).
The complete algorithm for computing the barycenter in (7) is outlined in Algorithm 3 (Appendix E).

Block SPD Gromov-Wasserstein discrepancy
The Gromov-Wasserstein (GW) distance [Mémoli, 2011] generalizes optimal transport to the case where the measures are supported on possibly different metric spaces X and Y. Let D^x ∈ R^{m×m} and D^y ∈ R^{n×n} represent the similarity (or distance) between elements in the metric spaces X and Y, respectively. Let p ∈ Σ_m and q ∈ Σ_n be the marginals corresponding to the elements in X and Y, respectively. Then, the GW discrepancy between the two distance-marginal pairs (D^x, p) and (D^y, q) is defined as

GW((D^x, p), (D^y, q)) := min_{γ ∈ Π(p,q)} Σ_{i,j,k,l} L(D^x_{i,k}, D^y_{j,l}) γ_{i,j} γ_{k,l},

where D_{k,l} denotes the (k, l)-th element of the matrix D and L is a loss between the distance pairs. Common choices of L include the L2 distance and the KL divergence.
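For reference, the GW objective with the L2 loss can be evaluated directly. A NumPy sketch follows; the three-point example is illustrative:

```python
import numpy as np

def gw_objective(Dx, Dy, gamma):
    # sum_{i,j,k,l} (Dx[i,k] - Dy[j,l])^2 * gamma[i,j] * gamma[k,l]
    L = (Dx[:, None, :, None] - Dy[None, :, None, :]) ** 2
    return np.einsum('ijkl,ij,kl->', L, gamma, gamma)

# Identical spaces matched by the identity coupling give zero discrepancy.
x = np.array([0.0, 1.0, 3.0])
Dx = np.abs(x[:, None] - x[None, :])
gamma_id = np.eye(3) / 3.0
```

In practice one minimizes this non-convex objective over γ ∈ Π(p, q); the sketch only evaluates it for a fixed coupling.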
We now generalize the GW framework to our setting where the marginals are SPD matrix-valued measures. Let (D^x, P) and (D^y, Q) be two distance-marginal pairs, where the measures are given by Σ_i P_i δ_{x_i} and Σ_j Q_j δ_{y_j}, respectively. We define the SPD generalized GW discrepancy as

MGW((D^x, P), (D^y, Q)) := min_{Γ ∈ M^d_{m,n}(P,Q)} Σ_{i,j,k,l} L(D^x_{i,k}, D^y_{j,l}) tr(Γ_{i,j}) tr(Γ_{k,l}),    (9)

where we use Riemannian optimization (Algorithm 1) to solve problem (9).

Gromov-Wasserstein averaging of distance matrices. The GW formulation with scalar-valued probability measures has been used for averaging distance matrices [Peyré et al., 2016]. Building on (9), we consider the problem of averaging distance matrices where the marginals are SPD-valued. Let {(D^ℓ, P^ℓ)}_{ℓ=1}^K, with P^ℓ ∈ ∆_n(S^d_{++}), be a set of distance-marginal pairs on K incomparable domains. Given barycenter marginals P̄ ∈ ∆_n(S^d_{++}), the goal is to find the average distance matrix D̄ by solving

D̄ = arg min_D Σ_ℓ ω_ℓ MGW((D, P̄), (D^ℓ, P^ℓ)),    (10)

where the given weights satisfy Σ_ℓ ω_ℓ = 1. Problem (10) can be solved via a block coordinate descent method that iteratively updates the couplings {Γ^ℓ}_{ℓ=1}^K and the distance matrix D. The update of the couplings is performed via Algorithm 1. For the update of the distance matrix, we show that when the loss L is decomposable, including the cases of the L2 distance and the KL divergence, the optimal D admits a closed-form solution. This generalizes the result of [Peyré et al., 2016, Proposition 3] to SPD-valued marginals.
Proposition 5.2. Suppose the loss L can be decomposed as L(a, b) = f_1(a) + f_2(b) − h_1(a)h_2(b) with f_1/h_1 invertible. Then, (10) admits a closed-form solution for each entry D̄_{i,i′}.

Experiments
In this section, we show the utility of the proposed framework in a number of applications. For empirical comparisons, we refer to our approaches, block SPD OT (2), the corresponding Wasserstein barycenter (7), and block SPD Gromov-Wasserstein OT (9) & (10), collectively as RMOT (Riemannian optimized Matrix Optimal Transport). For all the experiments, we use the Riemannian steepest descent method from the Manopt toolbox [Boumal et al., 2014] to implement Algorithm 1. The codes are available at https://github.com/andyjm3/BlockSPDOT.

Domain adaptation
We apply our OT framework to unsupervised domain adaptation, where the goal is to align the distribution of the source with that of the target for subsequent tasks. Suppose we are given the source marginal p ∈ Σ_m and target marginal q ∈ Σ_n, along with samples {X_i}_{i=1}^m, {Y_j}_{j=1}^n from the source and target distributions. The samples are matrix-valued, i.e., X_i, Y_j ∈ R^{d×s}. We define the cost as C_{i,j} = (X_i − Y_j)(X_i − Y_j)^⊤, so that tr(C_{i,j}) = ‖X_i − Y_j‖²_F is the cost function under the 2-Wasserstein OT setting (1). For domain adaptation, we first learn an optimal coupling between the source and target samples by solving the proposed OT problem (2) with marginals constructed as P := {[P_i]_{m×1} : P_i = I/m} and Q := {[Q_j]_{n×1} : Q_j = I/n}. Finally, the source samples are projected to the target domain via barycentric projection. Given the optimal coupling [Γ*_{i,j}]_{m×n}, the barycentric projection of a source sample X_i is computed as

X̂_i = arg min_X Σ_j tr(Γ*_{i,j} (X − Y_j)(X − Y_j)^⊤).    (11)

The above approach also works for structured samples. For instance, when the samples are SPD, i.e., X_i, Y_j ∈ S^d_{++}, the projected source sample X̂_i is the solution to the matrix Lyapunov equation

{P_i X̂_i}_S = Σ_j {Γ*_{i,j} Y_j}_S, with P_i = Σ_j Γ*_{i,j}.

Here, {A}_S = (A + A^⊤)/2. For the scalar-valued OT case discussed in Section 2.2, the barycentric projection of a source sample X_i is computed as

X̂_i = arg min_X Σ_j γ*_{i,j} ‖X − Y_j‖²_F = (1/p_i) Σ_j γ*_{i,j} Y_j,    (12)

where γ* = [γ*_{i,j}] is the optimal coupling matrix of size m × n for the scalar-valued OT problem. Contrasting the barycentric projection (11) with (12), we observe that (11) allows us to capture feature-specific correlations more appropriately. The benefit of matrix-valued OT modeling over scalar-valued OT modeling is reflected in the experiments below.
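For SPD samples, the projection reduces to a linear matrix equation. Here is a small SciPy sketch, assuming the Lyapunov form P_i X + X P_i = Σ_j (Γ*_{i,j} Y_j + Y_j Γ*_{i,j}) with P_i = Σ_j Γ*_{i,j}; the toy coupling blocks are illustrative:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def barycentric_projection_spd(Gammas_i, Ys):
    # Solves P_i X + X P_i = sum_j (G_ij Y_j + Y_j G_ij), P_i = sum_j G_ij,
    # i.e., the assumed first-order condition of the blockwise projection.
    P_i = sum(Gammas_i)
    rhs = sum(G @ Y + Y @ G for G, Y in zip(Gammas_i, Ys))
    return solve_continuous_lyapunov(P_i, rhs)

# Sanity check: if all targets coincide, the projection returns that target.
Y = np.array([[2.0, 0.5], [0.5, 1.0]])
Gammas_i = [0.3 * np.eye(2), 0.2 * np.eye(2)]
Xhat = barycentric_projection_spd(Gammas_i, [Y, Y])
```

`solve_continuous_lyapunov(a, q)` solves a x + x aᵀ = q, which matches the symmetrized stationarity condition above.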
Experimental setup. We employ domain adaptation to classify the test sets (target) of multiclass image datasets, where the training sets (source) have a different class distribution than the test sets. Suppose we are given a training set {X_i}_{i=1}^m and a test set {Y_j}_{j=1}^n, where X_i, Y_j ∈ R^{d×s} are sets of s (normalized) image samples of the same class in d dimensions. Instead of constructing the cost directly on the input space, which is not permutation-invariant, we first compute the sample covariances S^x_i = X_i X_i^⊤/s and S^y_j = Y_j Y_j^⊤/s for all i, j. The cost between i and j is then given by C_{i,j} = (S^x_i − S^y_j)(S^x_i − S^y_j)^⊤. Once the block SPD matrix coupling is learnt, the covariances S^x_i are projected using the barycentric projection to obtain Ŝ^x_i, i ∈ [m]. This is followed by nearest-neighbour classification of each j based on the Frobenius distance ‖Ŝ^x_i − S^y_j‖_F for all i, j.

Figure 1: The skew ratio increases from uniform (uf) to r = 0.5. For MNIST and Fashion MNIST, uf = 0.1 and for Letters, uf = 1/26. We observe that the proposed RMOT performs significantly better than the baselines.
We compare the proposed RMOT (2) with the following baselines: (i) sOT: the 2-Wasserstein OT (1) with the cost c_{i,j} = tr(C_{i,j}) = ‖S^x_i − S^y_j‖²_F [Courty et al., 2016], and (ii) SPDOT: the 2-Wasserstein OT (1) with the cost given by the squared Riemannian geodesic distance between the SPD matrices S^x_i and S^y_j [Yair et al., 2019].
Datasets. We experiment on three multiclass image datasets (handwritten Letters [Frey and Slate, 1991], MNIST [LeCun et al., 1998] and Fashion MNIST [Xiao et al., 2017]) with variously skewed distributions for the training set. MNIST and Fashion MNIST have 10 classes, while Letters has 26 classes. Specifically, we fix the distribution of the test set to be uniform (with the same number of image sets per class). We increase the proportion of a randomly chosen class in the training set to the ratio r, where r ∈ {uf, 0.1, 0.2, 0.3, 0.4, 0.5} and uf is the ratio corresponding to the uniform distribution over all classes. We reduce the dimension of MNIST, Fashion MNIST, and Letters by PCA to d = 5 features. We set s = d, m = 250, and n = 100 for each dataset.
Results. Figures 1a-1c show the classification accuracy on the three datasets. We observe that the proposed RMOT outperforms sOT and SPDOT, especially in the more challenging domain adaptation settings, i.e., at higher skew ratios. This demonstrates the usefulness of the non-trivial correlations learned by the SPD matrix-valued couplings of RMOT.

Tensor Gromov-Wasserstein distance averaging for shape interpolation
We consider an application of the proposed block SPD Gromov-Wasserstein OT formulation (Section 5.2) to interpolating tensor-valued shapes. We are given two distance-marginal pairs (D_0, P_0), (D_1, P_1), where D_0, D_1 ∈ R^{n×n} are distance matrices computed from the shapes and P_0, P_1 are given tensor fields. The aim is to interpolate between the distance matrices with weights ω = (t, 1 − t), t ∈ [0, 1]. The interpolated distance matrix D_t is computed by solving (10) via Riemannian optimization and Proposition 5.2, with the barycenter tensor fields P_t given. Finally, the shape is recovered by performing multi-dimensional scaling on the distance matrix.
Figure 2 presents the interpolated shapes with n = 100 sample points for the input shapes. The matrices D_0, D_1 are given by the Euclidean distance and we use the L2 loss for L. The input tensor fields P_0, P_1 are generated as uniformly random in (a), cross-oriented in (b), and smoothly varying in (c). For simplicity, we consider the barycenter tensor fields given by the linear interpolation of the inputs, i.e., P_t = (1 − t)P_0 + tP_1. We highlight that in [Peyré et al., 2016] the marginals are scalar-valued and fixed to be uniform. Here, on the other hand, the marginals are tensor-valued, and the resulting distance matrix interpolation is affected by the relative mass of the tensors, as shown by Proposition 5.2. The results show that the proposed Riemannian optimization approach (Section 4) converges to reasonable stationary solutions for non-convex OT problems.

Figure 2: Tensor-valued shape interpolation obtained using the proposed Gromov-Wasserstein RMOT formulation (Section 5.2). Each shape consists of several SPD-matrix/tensor-valued fields (displayed using ellipses). In (a), the tensors follow a uniform distribution; in (b), the tensors are generated with multiple orientations; and in (c), the tensors vary smoothly in both size and orientation. The proposed approach takes the anisotropy and orientation of the tensor fields into account while interpolating shapes.

Tensor field optimal transport mass interpolation
We consider performing optimal transport and displacement interpolation between two tensor fields supported on regular 1-d (or 2-d) grids [Peyré et al., 2019a]. We consider a common domain D = [0, 1] (or [0, 1]²) with the cost defined as C_{i,j} = ‖x_i − y_j‖² I for x_i, y_j ∈ D. The marginals P, Q are given tensor fields. We first compute the balanced coupling Γ by solving the entropy-regularized OT problem (8):

min_{Γ ∈ M^d_{m,n}(P,Q)} Σ_{i,j} tr(C_{i,j} Γ_{i,j}) − ε Σ_{i,j} H(Γ_{i,j}),

where the quantum entropy is defined as H(Γ_{i,j}) := −tr(Γ_{i,j} log(Γ_{i,j}) − Γ_{i,j}). Then, the coupling is used to interpolate between the two tensor fields by generalizing the displacement interpolation [McCann, 1997] to SPD-valued marginals. Please refer to [Peyré et al., 2019a, Section 2.2] for more details. It should be noted that, due to the balanced nature of our formulation, we do not need to adjust the couplings after matching, as required in [Peyré et al., 2019a]. We compare the interpolation results of the proposed (balanced) RMOT with both linear interpolation (1 − t)P + tQ for t ∈ [0, 1] and the unbalanced quantum OT (QOT) of [Peyré et al., 2019a]. QOT solves an analogous problem with quantum KL regularization of the marginal discrepancies, where KL(P | Q) := Σ_i tr(P_i log(P_i) − P_i log(Q_i) − P_i + Q_i), Γ1 := [Σ_j Γ_{i,j}]_{m×1}, and Γ^⊤1 := [Σ_i Γ_{i,j}]_{n×1}. For comparability, we set the same ε for both QOT and RMOT. Figure 3 compares the mass interpolation for both 1-d (top) and 2-d (bottom) grids. For the 2-d tensor fields, we further render the tensor fields via a background texture on which we perform anisotropic smoothing determined by the tensor direction. Specifically, we follow the procedure of [Peyré et al., 2019a] by applying the tensor to the gradient vector of the texture on the grid, so that the texture is stretched in the direction of the tensor's main eigenvector. In both settings, we observe that the tensor fields generated by RMOT respect the marginal constraints more closely.
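The quantum entropy of an SPD block depends only on its eigenvalues, which makes it cheap to evaluate. A NumPy sketch:

```python
import numpy as np

def quantum_entropy(G):
    # H(G) = -tr(G log(G) - G), computed via the eigenvalues of the SPD block G.
    w = np.linalg.eigvalsh(G)
    return float(-(w * np.log(w) - w).sum())

# Each eigenvalue contributes -(w log w - w); for G = I_d this gives H = d.
```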

Tensor field Wasserstein barycenter
We also analyze the Wasserstein barycenters learned by the proposed RMOT approach and qualitatively compare them with the QOT barycenter [Peyré et al., 2019a, Section 4.1]. We test on two tensor fields (n = 4) supported on 2-d grids.
Figure 4 compares the barycenters from QOT (top) and RMOT (bottom), where RMOT is initialized from the normalized solution of QOT. We observe that the QOT solution is not optimal when the marginal constraint is enforced, and the barycenter obtained does not lie in the simplex of tensors. This claim is strengthened by comparing the objective value against the optimal value obtained by the CVX toolbox [Grant and Boyd, 2014]. The objective can be further decreased when initialized from the (normalized) QOT solution; see further discussion in Appendix C.

Conclusion
In this paper, we have discussed the balanced optimal transport (OT) problem involving SPD matrix-valued measures. For the SPD matrix-valued OT problem, the coupling matrix is a block matrix in which each block is a symmetric positive definite matrix. The set of such coupling matrices can be endowed with a Riemannian geometry, which enables optimization of both linear and non-linear objective functions. We have also shown how the SPD-valued OT setup extends many optimal transport problems to general SPD-valued marginals, including the Wasserstein barycenter and the Gromov-Wasserstein (GW) discrepancy. Experiments in a number of applications confirm the benefit of our approach.

A Convergence of block matrix balancing algorithm and validity of retraction
In Section 4, we generalize the matrix scaling algorithm to the block matrix case, which is essential to derive the retraction for the manifold M^d_{m,n}. Here, we empirically show that the algorithm converges quickly and that the proposed retraction is valid, i.e., it satisfies the two conditions: 1) R_x(0) = x and 2) DR_x(0)[u] = u, where Df(x)[u] denotes the derivative of a function f at x along the direction u.
Convergence. We show in Figure 5 the convergence of the proposed block matrix balancing procedure in Algorithm 2. We generate the marginals as random SPD matrices for different dimensions d and sizes m, n. The convergence is measured as the relative gap in satisfying the constraints. We observe that the number of iterations required for convergence is similar across the different parameters, while the runtime increases with the dimension and size.
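Algorithm 2 itself is stated in the main text; the following is a minimal sketch of one block Sinkhorn-style balancing scheme via symmetric congruence scaling, together with the relative constraint gap used to measure convergence. This is our illustrative reading, not necessarily the exact updates of Algorithm 2, and the names `mbalance`, `spd_power`, and `relative_gap` are ours:

```python
import numpy as np

def spd_power(A, p):
    """A^p for a symmetric positive (semi)definite matrix A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 1e-12) ** p) @ V.T

def relative_gap(G, P, Q):
    """Largest relative Frobenius-norm violation of the block marginal constraints."""
    m, n = len(P), len(Q)
    row = max(np.linalg.norm(sum(G[i][j] for j in range(n)) - P[i])
              / np.linalg.norm(P[i]) for i in range(m))
    col = max(np.linalg.norm(sum(G[i][j] for i in range(m)) - Q[j])
              / np.linalg.norm(Q[j]) for j in range(n))
    return max(row, col)

def mbalance(G, P, Q, iters=100):
    """Alternately rescale row and column blocks by SPD congruences W (.) W^T
    so that sum_j G[i][j] -> P[i] and sum_i G[i][j] -> Q[j]."""
    m, n = len(P), len(Q)
    G = [[G[i][j].copy() for j in range(n)] for i in range(m)]
    for _ in range(iters):
        for i in range(m):          # make row-block sum i match P[i] exactly
            S = sum(G[i][j] for j in range(n))
            W = spd_power(P[i], 0.5) @ spd_power(S, -0.5)
            G[i] = [W @ Gij @ W.T for Gij in G[i]]
        for j in range(n):          # make column-block sum j match Q[j] exactly
            S = sum(G[i][j] for i in range(m))
            W = spd_power(Q[j], 0.5) @ spd_power(S, -0.5)
            for i in range(m):
                G[i][j] = W @ G[i][j] @ W.T
    return G

# For d = 1 this reduces to classical Sinkhorn matrix scaling.
P = [np.array([[0.6]]), np.array([[0.4]])]
Q = [np.array([[0.5]]), np.array([[0.5]])]
G0 = [[np.array([[0.25]]) for _ in range(2)] for _ in range(2)]
G = mbalance(G0, P, Q, iters=100)
print(relative_gap(G, P, Q))  # close to 0
```

Each congruence W A W^T preserves symmetry and positive definiteness of the blocks, and after a row (resp. column) sweep the corresponding block sums match the marginals exactly; the two sweeps are then alternated until the relative gap is small.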
Validity of retraction. The first condition of retraction is easily satisfied, as R_Γ(0) = MBalance(Γ) = Γ. For the second condition, we need DR_Γ(0)[U] = U for any Γ and any U ∈ T_Γ M^d_{m,n}, i.e., R_Γ(hU) = Γ + hU + o(h). We therefore compute an approximation error ε(h) in terms of the inner product on the tangent space T_Γ M^d_{m,n}. In Figure 5(c), we plot ε(h) as a function of h: for the retraction to be valid, the slope of the continuous line should match that of the dotted reference line, and the error should tend to 0 as h tends to 0, which is what we observe, thereby indicating that the retraction is valid.
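The same two conditions can be checked numerically on any manifold by a finite-difference test. A generic sketch on the unit sphere (an illustrative stand-in for M^d_{m,n}, not the paper's manifold; the first-order defect of a valid retraction shrinks like O(h²)):

```python
import numpy as np

def retract(x, u):
    # metric projection retraction on the unit sphere: R_x(u) = (x + u) / ||x + u||
    v = x + u
    return v / np.linalg.norm(v)

def retraction_error(x, u, h):
    # first-order defect ||R_x(h u) - x - h u||; O(h^2) for a valid retraction
    return np.linalg.norm(retract(x, h * u) - x - h * u)

rng = np.random.default_rng(0)
x = rng.normal(size=5); x /= np.linalg.norm(x)
u = rng.normal(size=5); u -= (u @ x) * x      # project u onto the tangent space at x

# condition 1: R_x(0) = x
assert np.allclose(retract(x, 0 * u), x)

# condition 2: DR_x(0)[u] = u, so the defect shrinks ~ quadratically in h
e1, e2 = retraction_error(x, u, 1e-2), retraction_error(x, u, 1e-3)
print(e2 / e1)  # roughly 1e-2
```

On a log-log plot of the defect versus h, this quadratic decay is exactly the slope test described above.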

B Discussion on construction of matrix-valued cost
As highlighted in Proposition 3.1, for MW(P, Q) to be a metric on probability measures, the cost [C_{i,j}]_{m×n} has to satisfy certain conditions. In the following, we give two examples of how such costs can be constructed from given samples: 1. a cost built from a well-defined scalar-valued distance between the samples; 2. a cost built from the Mahalanobis distance between the (vectorized) samples. Proof.
(1) The first definition of cost trivially satisfies all the conditions due to the metric properties of a well-defined scalar-valued distance.
(2) For the second definition of cost, the first two conditions, i.e., the symmetry and positive definiteness conditions, are easily satisfied, and we only need to verify the third condition in Proposition 3.1. The third condition is also satisfied due to the triangle inequality of the Mahalanobis distance metric in the vectorized form. That is, for any A ⪰ 0, we consider three sets of samples {X_i}, {Y_k}, {Z_j} ⊂ R^{d×s} and apply the triangle inequality to their vectorizations, where vec(C) denotes the vectorization of a matrix C obtained by stacking its columns.
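The triangle inequality invoked here can be checked numerically: for A ⪰ 0, d_A(X, Y) := sqrt(vec(X − Y)^T A vec(X − Y)) is induced by a seminorm and hence satisfies it. A small sketch with hypothetical random samples:

```python
import numpy as np

def mahalanobis(X, Y, A):
    # d_A(X, Y) = sqrt(vec(X - Y)^T A vec(X - Y)), with vec stacking columns
    v = (X - Y).flatten(order="F")
    return float(np.sqrt(v @ A @ v))

rng = np.random.default_rng(0)
d, s = 3, 2
B = rng.normal(size=(d * s, d * s))
A = B @ B.T                                   # A is PSD by construction
X, Y, Z = (rng.normal(size=(d, s)) for _ in range(3))

lhs = mahalanobis(X, Z, A)
rhs = mahalanobis(X, Y, A) + mahalanobis(Y, Z, A)
print(lhs <= rhs + 1e-12)  # True: triangle inequality holds
```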

C Additional experiments
In this section, we give additional experiments to further substantiate the claims made in the main text.

C.1 Tensor field optimal transport mass interpolation
We first provide more details on the displacement interpolation considered in the experiment. After we obtain the optimal Γ*, for t ∈ [0, 1], we compute the interpolated measure at t by placing each block Γ*_{i,j} at the interpolated location x^t_{i,j} on the grid. In addition to the experiments presented in the main text, we also show other examples of tensor field mass interpolation in Figures 6 and 7. In Figure 6, the inputs are given as 1-d tensor fields, shown in the first and last rows of each subfigure. We compare the interpolation given by linear interpolation (first column), QOT with different values of ρ, and RMOT (last column). In Figure 7, Input-1 and Input-5 correspond to t = 0 and t = 1, respectively; QOT-2 and RMOT-2 correspond to t = 0.25, QOT-3 and RMOT-3 to t = 0.5, and QOT-4 and RMOT-4 to t = 0.75.
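A minimal sketch of this interpolation step for 1-d supports, assuming the standard McCann-style locations x^t_{i,j} = (1 − t) x_i + t y_j (our reading of the construction, following [Peyré et al., 2019a]); the toy coupling below is hypothetical:

```python
import numpy as np

def interpolate(x, y, Gamma, t):
    """Place each block Gamma[i][j] at the point (1 - t) * x[i] + t * y[j].

    Returns a list of (location, SPD block) pairs: the measure at time t.
    """
    out = []
    for i in range(len(x)):
        for j in range(len(y)):
            out.append(((1 - t) * x[i] + t * y[j], Gamma[i][j]))
    return out

# toy 1-d example with 1x1 blocks (zero blocks carry no mass)
x, y = np.array([0.0, 1.0]), np.array([0.0, 1.0])
Gamma = [[np.array([[0.5]]), np.array([[0.0]])],
         [np.array([[0.0]]), np.array([[0.5]])]]
mid = interpolate(x, y, Gamma, 0.5)
```

Note that the total (block) mass Σ_{i,j} Γ*_{i,j} is the same for every t; only the supporting locations move, which is what makes this a displacement interpolation rather than a linear blend of the fields.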

C.2 Tensor field Wasserstein barycenter
We first show that neither the linear interpolation nor the QOT solution is optimal. We initialize our Riemannian optimizers for P̄ from the linear interpolation and the (normalized) QOT solutions. We also include the uniform initialization as a benchmark.
We compare the barycenter objective value Σ_ℓ ω_ℓ MW(P̄, P_ℓ) against the optimal objective value obtained from the CVX toolbox [Grant and Boyd, 2014]. This allows us to compute the optimality gap.
In Figure 8, we see that the optimality gap keeps decreasing with iterations, even after properly normalizing the barycenters obtained from linear interpolation and (normalized) QOT. This shows that the linear interpolation and (normalized) QOT solutions are not optimal. Also, the performance of RMOT with uniform initialization is competitive with that initialized from linear interpolation or (normalized) QOT, implying that RMOT is a competitive solver in itself and obtains better solutions.
Additionally, we show the barycenter results for n = 16, along with the convergence of RMOT, in Figures 9 and 10. From Figure 9, we see visually no difference between the solutions obtained with the QOT and uniform identity (uf) initializations. Irrespective of the initialization, RMOT continues to improve the optimality gap (with respect to the CVX optimal solution) over the iterations. As the initial optimality gap is high in all cases, this shows that the linear interpolation (li) and QOT (qot) solutions are not optimal.
At convergence, the barycenters are visually similar for QOT and RMOT, which suggests that the solution given by QOT (with normalization) is close to optimal for this setting. This observation is further validated in Figure 10, where we see that the objective value is already quite small when initialized from the QOT solution.

C.3 Additional experiments on domain adaptation
Here, we perform domain adaptation experiments on more challenging tasks, including video-based face recognition with the YouTube Celebrities (YTC) dataset [Kim et al., 2008] and texture classification with the Dynamic Texture (DynTex) dataset [Ghanem and Ahuja, 2010], where covariance representation learning has shown great promise [Huang et al., 2015, Harandi et al., 2014].
Datasets and experimental setup. YTC [Kim et al., 2008] comprises 1910 low-resolution YouTube videos of 47 celebrities. Here, we select only 9 subjects with more than 15 videos each. Following standard preprocessing techniques [Huang et al., 2015], we first crop the frames of each video to the detected face regions and resize them to 10 × 10 intensity images. We then construct the covariance representation for each video, which is a 100 × 100 SPD matrix. Next, we apply the geometry-aware principal component analysis for the SPD manifold [Horev et al., 2016] with the Bures-Wasserstein Riemannian metric [Bhatia et al., 2019, Han et al., 2021b, Han et al., 2021a] to reduce the dimensionality to d = 5. Finally, we obtain a collection of 194 SPD covariance matrices of size 5 × 5, each representing one video. Given the relatively small sample size, we select 8 videos per class as the test data and treat the rest as the training data. Different from the settings in Section 6.1, we skew the selected class by sub-selecting a ratio α of its samples in the training set, where α = 0.2, 0.4, 0.6, 0.8, 1.0; this is again due to the small data size. To further test the robustness of the algorithms, we then randomly truncate the training set to 100 samples. This results in a training set of 100 videos against a test set of 72 videos. This randomization process is repeated 5 times.
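The covariance-representation step can be sketched as follows. This is a simplified stand-in: plain eigenvalue truncation replaces the Bures-Wasserstein geometry-aware PCA of [Horev et al., 2016], and the frame data and helper names are hypothetical:

```python
import numpy as np

def video_covariance(frames, eps=1e-6):
    """frames: (T, 100) array of vectorized 10x10 intensity images.
    Returns a 100 x 100 SPD covariance representation of the video."""
    C = np.cov(frames, rowvar=False)
    return C + eps * np.eye(C.shape[0])   # regularize to ensure strict positive definiteness

def reduce_spd(C, d=5):
    """Project an SPD matrix onto its top-d principal subspace (simple PCA stand-in)."""
    w, V = np.linalg.eigh(C)
    U = V[:, np.argsort(w)[::-1][:d]]     # top-d eigenvectors
    return U.T @ C @ U                    # d x d, still SPD (congruence with full-rank U)

rng = np.random.default_rng(0)
frames = rng.normal(size=(40, 100))       # 40 frames of a hypothetical video
C5 = reduce_spd(video_covariance(frames), d=5)
print(C5.shape)  # (5, 5)
```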
DynTex [Ghanem and Ahuja, 2010] collects video sequences of 36 moving scenes, such as sea waves, fire, and clouds. For our experiment, we choose 10 classes, each with 20 videos. The subsequent processing steps are the same as for the YTC dataset.
Finally, we also test on Cifar10 [Krizhevsky et al., 2009] under the same settings as in Section 6.1 of the main text. However, because much information is lost at d = 5 for this complex dataset, we choose d = 17, which captures 70% of the variance in the samples.
Results. The final results are shown in Figure 11, where we observe consistently good performance of the proposed RMOT compared to both sOT and SPDOT. This strengthens the finding that matrix-valued OT is able to exploit more variation in the dataset than scalar-valued OT.

D Proofs
Proof of Proposition 3.1. For simplicity, we assume p, q > 0. Otherwise, we can follow [Peyré et al., 2019b] and define p̃_j = p_j if p_j > 0 and p̃_j = 1 otherwise.
We note that P and Q are defined as P := {[P_i]_{m×1} : P_i = p_i I} and Q := {[Q_j]_{n×1} : Q_j = q_j I}, where I is the d × d identity matrix. With a slight abuse of notation and for simplicity, we define MW(p, q) := MW(P, Q).
First, it is easy to verify the symmetry property, i.e., MW(p, q) = MW(q, p). For the definiteness, when p = q, we have C_{i,i} = 0 and C_{i,j} ≻ 0 for i ≠ j. Hence, the optimal coupling is a block-diagonal matrix with Γ_{i,i} = p_i I, and thus MW(p, q) = 0. For the opposite direction, if MW(p, q) = 0, we must have Γ_{i,j} = 0 for i ≠ j, because tr(C_{i,j} Γ_{i,j}) > 0 for any C_{i,j} ≻ 0 and nonzero Γ_{i,j} with i ≠ j. The marginal constraints then give Γ_{i,i} = p_i I = q_i I (and hence C_{i,i} = 0), so p = q.
Finally, for the triangle inequality, given a, b, c ∈ Σ_n, let Γ and Δ be optimal matrix couplings between (a, b) and (b, c), respectively. That is, Σ_j Γ_{i,j} = a_i I, Σ_i Γ_{i,j} = b_j I, and similarly Σ_j Δ_{i,j} = b_i I, Σ_i Δ_{i,j} = c_j I. We now follow the same gluing strategy for the couplings Γ, Δ as in [Peyré et al., 2019b, Villani, 2021]. That is, we define a coupling T by gluing Γ and Δ through the common marginal b. We can verify that T_{i,j} ∈ S^d_+, given Γ_{i,j}, Δ_{i,j} ∈ S^d_+. Furthermore, the block row and column sums of T match a and c, respectively, for all i, j; hence, [T_{i,j}]_{m×n} is a valid coupling between (a, c). Let P_i = a_i I, Q_j = c_j I, and let X, Y, Z be the corresponding samples for the measures a, b, c, respectively. Then the triangle inequality follows, where the second inequality is by assumption (iii) of the proposition and the third inequality is due to the Minkowski inequality. This completes the proof.
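For scalar (1×1) blocks, the gluing reduces to the classical construction T_{i,j} = Σ_k Γ_{i,k} Δ_{k,j} / b_k, whose marginal property is easy to verify numerically. A sketch of this d = 1 case only (the block-matrix gluing in the proof is the general analogue):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()
c = rng.random(n); c /= c.sum()

# independent (product) couplings with the required marginals
Gamma = np.outer(a, b)    # rows sum to a, columns sum to b
Delta = np.outer(b, c)    # rows sum to b, columns sum to c

# glue through the common marginal b
T = Gamma @ np.diag(1.0 / b) @ Delta

# T is a valid coupling between a and c
print(np.allclose(T.sum(axis=1), a), np.allclose(T.sum(axis=0), c))  # True True
```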
Proof of Proposition 3.3. For a given feasible element Γ ∈ M^d_{m,n}(P, Q), we can construct a family of feasible elements. For example, choose 0 ≤ ζ < min_{i,j} {λ_min(Γ_{i,j})}. Then, we can add ζI to some blocks and subtract ζI from an equal number of blocks (so that the marginal constraints remain satisfied), and the result is still feasible. In other words, the set is smooth in a ball of radius ζ around the element Γ.
Proof of Proposition 4.1. Following [Mishra and Sepulchre, 2016], the projection is derived to be orthogonal with respect to the Riemannian metric (5), i.e., P_Γ(S) is the minimizer of problem (13). The Lagrangian of problem (13) involves the dual variables Λ_i, Θ_j for i ∈ [m], j ∈ [n], giving (14). The orthogonal projection then follows from the stationarity conditions of (14).
Proof of Proposition 5.1. For each regularized OT problem, we consider the Lagrange dual of min_{Γ ∈ M^d_{m,n}} Σ_{i,j} tr(C_{i,j} Γ_{i,j}) + Ω(Γ_{i,j}), whose Lagrangian contains the terms tr(C_{i,j} Γ_{i,j}) + Ω(Γ_{i,j}) + tr(Ψ_{i,j} Γ_{i,j}). (17) From the Lagrangian (17), it is easy to see that the Euclidean gradient of the barycenter problem with respect to P̄_i is −Λ_i, where Λ_i is the dual optimal variable of problem (17). The proof is completed by substituting the objective F(Γ) = Σ_{i,j} tr(C_{i,j} Γ_{i,j} + Ω(Γ_{i,j})) as in Theorem 4.3.
Proof of Proposition 5.2. First, we rewrite the SPD matrix-valued GW discrepancy as MGW((D̄, P̄), (D, P)) = Σ_{i,i',j,j'} (f_1(D̄_{i,i'}) + f_2(D_{j,j'}) − h_1(D̄_{i,i'}) h_2(D_{j,j'})) tr(Γ_{i,j} Γ_{i',j'}) = Σ_{i,j} tr(Γ_{i,j} (Σ_{i'} f_1(D̄_{i,i'}) P̄_{i'} + Σ_{j'} f_2(D_{j,j'}) P_{j'} − Σ_{i',j'} h_1(D̄_{i,i'}) h_2(D_{j,j'}) Γ_{i',j'})), where we use the fact that the Γ_{i,j} are optimal and satisfy the constraints Σ_j Γ_{i,j} = P̄_i and Σ_i Γ_{i,j} = P_j. By the first-order condition with respect to D̄_{i,i'} (involving the terms tr(P̄_i f_1(D̄_{i,i'}) P̄_{i'}) and h_1(D̄_{i,i'})), we obtain the desired result.

Figure 3 :
Figure 3: Tensor field mass interpolation on 1-d (top) and 2-d (bottom) grids. On the top, each row corresponds to an interpolation, where we show 7 evenly-spaced interpolated tensor fields. On the bottom, the inputs are given in (f) and (g). We set ρ = 100 for QOT and show 3 evenly-spaced interpolated tensor fields.

Figure 5 :
Figure 5: Convergence of Algorithm 2 in terms of iterations (a) and runtime (b), and the validity test for the retraction (c). For the retraction to be valid, the slope of the continuous line should match that of the dotted line, and the error should start from 0 when h tends to 0.

Figure 10 :
Figure 10: The convergence plots for RMOT initialized from the (normalized) QOT solution. For this setting, the solution from QOT is close to optimal.

Figure 11 :
Figure 11: Additional results on domain adaptation. On these datasets as well, we see the good performance of the proposed RMOT modeling.