1 Introduction

Covariance matrices are used in a variety of machine learning applications. Three prominent examples are computer vision applications (Tuzel et al. 2006, 2008), brain imaging (Pennec et al. 2006; Arsigny et al. 2006; Dryden et al. 2009) and brain computer interface (BCI) (Barachant et al. 2010, 2013) data analysis. In computer vision, covariance matrices in the form of region covariances are used in tasks such as texture classification. For brain imaging, the covariance matrices are diffusion tensors extracted from a physical model of the studied phenomenon. Finally, in the BCI community correlation matrices between different sensor channels are used as discriminating features for classification.

As discussed in Fletcher et al. (2004) and Harandi et al. (2014a), dimensionality can pose difficulties when working with covariance matrices because their size grows quadratically w.r.t. the number of variables. To deal with this issue, it is useful to apply dimensionality reduction techniques to them. A simple, commonly used technique is principal component analysis (PCA) (Jolliffe 2002). However, as we later show in Sect. 2, while vector PCA is optimal in terms of preserving data variance, the commonly used naive extensions of vector PCA to the matrix case (Yang et al. 2004; Lu et al. 2006) are sub-optimal for SPD matrices. Furthermore, we argue that when applied to SPD matrices, the standard Euclidean formulation of PCA disregards the inherent geometric structure of this space. We provide an in-depth review and discussion of the geometry of \(\mathcal {S}_+^{n}\) and its advantages in Sect. 3. For the moment, we give a brief motivation for the use of Riemannian geometry in dimensionality reduction of SPD matrices.

1.1 A case for the use of Riemannian geometry

The set \(\mathcal {S}_+^{n}\) of symmetric positive definite (SPD) matrices of size \(n \times n\), when equipped with the Frobenius inner product \(\langle A, B \rangle _{\mathcal {F}} = {{\mathrm{tr}}}(A^\top B)\), can be viewed as a subset of a Euclidean space. A straightforward approach to measuring similarity between SPD matrices is therefore to use the distance derived from the Euclidean norm. This geometry is readily visualized for \(2\times 2\) SPD matrices. A matrix \(A \in \mathcal {S}_+^{2}\) can be written as \({A = \left[ \begin{matrix} a & c \\ c & b \end{matrix} \right] }\) with \({ab - c^2 > 0 }\), \(a>0\) and \(b>0\). Matrices in \(\mathcal {S}_+^{2}\) can thus be represented as points in \(\mathbb {R}^3\), and the constraints can be plotted as an open convex cone whose interior is populated by SPD matrices (see Fig. 1). In this representation, the Euclidean geometry of symmetric matrices implies that distances are computed along straight lines, shown as blue dashed lines in the figure.

In practice, however, the Euclidean geometry is often inadequate to describe SPD matrices extracted from real-life applications, e.g., covariance matrices. This observation has been discussed in Sommer et al. (2010). We observe similar behavior, as illustrated in Fig. 2. In this figure, we computed the vertical and horizontal gradients at every pixel of the image on the left. We then computed \(2 \times 2\) covariance matrices between the two gradients for patches of pixels in the image. On the right, visualizing the same convex cone as in Fig. 1, every point represents a covariance matrix extracted from an image patch. The interior of the cone is not uniformly populated by the matrices, suggesting that they exhibit some structure that is not captured by the use of the straight geodesics of Euclidean geometry.

Fig. 1

Comparison between Euclidean (blue straight dashed lines) and Riemannian (red curved solid lines) distances measured between points of the space \(\mathcal {S}_+^{2}\) (Color figure online)

Fig. 2

Original image (left) and \(2 \times 2\) covariance matrices between the image gradients, extracted from random patches of the image (right). In the right plot, the mesh represents the border of the cone of positive semi-definite matrices. The image on the left is courtesy of Bernard-Brunel and Dumont

More generally, the use of Euclidean geometry for SPD matrices is known to generate several artifacts that may make it unsuitable for handling these matrices (Fletcher et al. 2004; Arsigny et al. 2007; Sommer et al. 2010). For example, for the simple task of averaging two matrices, it may occur that the determinant of the average is larger than the determinant of either of the two matrices. This effect is a consequence of the Euclidean geometry and is referred to as the swelling effect by Arsigny et al. (2007). It is particularly harmful for data analysis as it adds spurious variation to the data.

As a related example, take the computation of the maximum likelihood estimator of a covariance matrix, the sample covariance matrix (SCM). It is well known that with the SCM, the largest eigenvalues are overestimated while the smallest eigenvalues are underestimated (Mestre 2008). Since the SCM is an average of rank 1 symmetric positive semi-definite matrices, this may be seen as another consequence of the swelling effect.

Another drawback, illustrated in Fig. 1 and documented by Fletcher et al. (2004), is the fact that this geometry forms a non-complete space. This means that the extension of Euclidean geodesics is not guaranteed to stay within the manifold. Hence, interpolation between SPD matrices is possible, but extrapolation may produce indefinite matrices, leading to uninterpretable solutions.

In contrast, geodesics computed using the affine invariant Riemannian metric (AIRM) (Bhatia 2009), discussed thoroughly in Sect. 3.1, are shown in Fig. 1 as curved red lines. Their extension asymptotically approaches the boundary of the cone, but remains within the manifold.

Given the potentially erroneous results obtained with the Euclidean geometry, it would be disadvantageous to use this geometry to retain the principal modes of variation. A natural approach to cope with this issue is then to consider a Riemannian formulation of PCA. And, in fact, the development of geometric methods for applications involving SPD matrices has been a growing trend in recent years. In light of the rich geometric structure of \(\mathcal {S}_+^{n}\), the disadvantages of the Euclidean geometry and the advances in manifold optimization, tools such as kernels (Barachant et al. 2013; Yger 2013; Jayasumana et al. 2013; Harandi et al. 2012) and divergences (Sra 2011; Cherian et al. 2011; Harandi et al. 2014b; Cichocki et al. 2014), as well as methods such as dictionary learning (Ho et al. 2013; Cherian and Sra 2014; Harandi and Salzmann 2015), metric learning (Yger and Sugiyama 2015), clustering (Goh and Vidal 2008; Kusner et al. 2014) and dimensionality reduction (Fletcher et al. 2004; Harandi et al. 2014a), have recently been extended to SPD matrices using Riemannian geometry.

1.2 Dimensionality reduction on manifolds

In statistics, an extension of PCA to the Riemannian setting has been studied for other manifolds. For example, it has been shown in Huckemann et al. (2010) for shape spaces that a Riemannian PCA was able to extract relevant principal components, especially in regions of high curvature of the space where the Euclidean approximation failed to appropriately explain the data variation.

For the space of SPD matrices, a Riemannian extension of PCA, namely principal geodesic analysis (PGA), has been proposed in Fletcher et al. (2004). This algorithm essentially flattens the manifold at the center of mass of the data by projecting every element from the manifold to the tangent space at the Riemannian mean. A classical PCA is then applied in this Euclidean space. Although this approach is generic to any manifold, it does not fully exploit the structure of the manifold, since a tangent space is only a local approximation of it.

In this paper, we propose new formulations of PCA for SPD matrices. Our contribution is twofold: First and foremost, we adapt the basic formulation of PCA to make it suitable to matrix data and, as a result, capture more of the data variance. We tackle this problem in Sect. 2. Secondly, we extend PCA to Riemannian geometries to derive a truly Riemannian PCA which takes into account the curvature of the space and preserves the global properties of the manifold. This topic is covered in Sect. 3. More specifically, using a matrix transformation first used in Harandi et al. (2014a), we derive an unsupervised dimensionality reduction method maximizing a generalized variance of the data on the manifold. Section 4 is dedicated to a discussion of why our optimization problem is actually an approximation of an exact variance maximization objective, and to showing that this approximation is sufficient for our purposes. Through experiments on synthetic data and real-life applications, we demonstrate the efficacy of our proposed dimensionality reduction method. The experimental results are presented in Sect. 5.

2 Variance maximizing PCA for SPD matrices

In the introduction we discussed the need for dimensionality reduction methods for SPD matrices as well as the shortcomings of the commonly used PCA for this task. Essentially, although PCA optimally preserves the data variance for vector data, we show that its naive extensions to matrices do not do so optimally for SPD matrices.

In this section we tackle the problem of defining a PCA suitable for matrices. We begin by stating a formal definition of our problem. Next we show how to properly extend PCA from vectors to matrices so that we retain more of the data variance.

2.1 Problem setup

Let \(\mathcal {S}_+^{n} = \left\{ A \in \mathbb {R}^{n \times n} \vert \; \forall x \ne 0, x \in \mathbb {R}^n, \; x^\top A x > 0, \; A=A^\top \right\} \) be the set of all \(n \times n\) SPD matrices, and let \(\mathbf {X} = \lbrace X_i \in \mathcal {S}_+^{n} \rbrace _{i=1}^N \) be a set of N instances in \(\mathcal {S}_+^{n}\). We assume that these matrices have some underlying structure, whereby their informative part can be described by a more compact, lower dimensional representation. Our goal is to compress the matrices, mapping them to a lower dimensional manifold \(\mathcal {S}_+^{p}\) where \(p < n\). In the process, we wish to keep only the relevant part while discarding the extra dimensions due to noise.

The task of dimensionality reduction can be formulated in two ways: First, as a problem of minimizing the residual between the original matrix and its representation in the target space. Second, it can be stated in terms of variance maximization, where the aim is to find an approximation to the data that accounts for as much of its variance as possible. In a Euclidean setting these two views are equivalent (Bishop 2007, Chap. 12). However, in the case of SPD matrices, \(\mathcal {S}_+^{p}\) is not a sub-manifold of \(\mathcal {S}_+^{n}\) and elements of the input space cannot be directly compared to elements of the target space. Thus, focusing on the second view, we search for a mapping \(\mathcal {S}_+^{n} \mapsto \mathcal {S}_+^{p}\) that best preserves the Fréchet variance \(\sigma _{\delta }^2\) of \(\mathbf {X}\), defined below.

Following the work of Fréchet (1948) we define \(\sigma _{\delta }^2\) via the Fréchet mean, which is unique in our case.

Definition 1

(Fréchet Mean) The (sample) Fréchet mean of the set \(\mathbf {X}\) w.r.t. the metric \(\delta \) is

$$\begin{aligned} \bar{X}_\delta = \underset{X \in \mathcal {S}_+^{n}}{\mathrm {argmin}} \; \frac{1}{N}\sum _{i=1}^N \delta ^2 \left( X_i, X\right) . \end{aligned}$$

Then, the Fréchet variance is:

Definition 2

(Fréchet Variance) The (sample) Fréchet variance of the set \(\mathbf {X}\) w.r.t. the metric \(\delta \) is given by

$$\begin{aligned} \sigma ^2_\delta = \frac{1}{N}\sum _{i=1}^N \delta ^2 \left( X_i, \bar{X}_\delta \right) . \end{aligned}$$
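
As a concrete illustration of Definitions 1 and 2, the following minimal Python/NumPy sketch (our own illustration, not the authors' code) computes the Fréchet mean and variance under the log-Euclidean metric introduced in Sect. 3.1, for which both quantities have closed forms; the helper names sym_logm, sym_expm, frechet_mean_le and frechet_variance_le are hypothetical.

```python
import numpy as np

def sym_logm(X):
    # Matrix logarithm of an SPD matrix through its eigendecomposition
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def sym_expm(S):
    # Matrix exponential of a symmetric matrix
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def frechet_mean_le(Xs):
    # Log-Euclidean Fréchet mean: exp of the arithmetic mean of the logarithms
    return sym_expm(np.mean([sym_logm(X) for X in Xs], axis=0))

def frechet_variance_le(Xs):
    # Definition 2 with delta = delta_le: mean squared distance to the mean
    logs = [sym_logm(X) for X in Xs]
    log_mean = np.mean(logs, axis=0)
    return float(np.mean([np.linalg.norm(L - log_mean, 'fro') ** 2 for L in logs]))
```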

As in Harandi et al. (2014a), we consider for any matrix \(X \in \mathcal {S}_+^{n} \) a mapping to \(\mathcal {S}_+^{p}\) (with \(p < n\)) parameterized by a matrix \(W \in \mathbb {R}^{n \times p}\) which satisfies the orthogonality constraint \(W^\top W =\mathbb {I}_p\), where \(\mathbb {I}_p\) is the \(p \times p\) identity matrix. The mapping then takes the form of \(X^{\downarrow } = W^\top X W\).
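
A small sketch of this mapping (illustrative only): drawing W from the QR factorization of a random Gaussian matrix gives a point with \(W^\top W = \mathbb {I}_p\), and the congruence \(W^\top X W\) keeps the result SPD, as a Cholesky factorization confirms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 17, 6

W, _ = np.linalg.qr(rng.standard_normal((n, p)))   # W^T W = I_p
A = rng.standard_normal((n, n))
X = A @ A.T + 1e-3 * np.eye(n)                     # an SPD test matrix

X_low = W.T @ X @ W                                # the mapping X -> W^T X W
np.linalg.cholesky(X_low)                          # succeeds: X_low is in S_+^p
assert np.allclose(W.T @ W, np.eye(p))
```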

2.2 Variance maximization with gaPCA and PCA

Having framed the optimization of the matrix W in terms of maximizing the data variance, our proposed formulation, written explicitly in terms of the Fréchet variance, is:

Definition 3

(Geometry-aware PCA (gaPCA)) gaPCA is defined as

$$\begin{aligned} W = \underset{W \in \mathcal {G}\left( n,p \right) }{\mathrm {argmax}} \sum _i \delta ^2 \left( W^\top X_i W, W^\top \bar{X}_\delta W \right) , \end{aligned}$$
(1)

where \(\mathcal {G}\left( n,p \right) \) is the Grassmann manifold, the set of all p-dimensional linear subspaces of \(\mathbb {R}^n\).

In reality, the formulation above does not exactly express the Fréchet variance of the compressed set \(\mathbf {X}^{\downarrow }\), but expresses it only approximately. For now we simply state this fact without further explanation and continue with the exposition and analysis. We dedicate Sect. 4 to a detailed investigation of this issue.

As described in Edelman et al. (1998) and Absil et al. (2009), an optimization problem such as the one in Definition 3 can be formulated and solved on either the Stiefel or the Grassmann manifold. The use of the Stiefel manifold, the set of \(n \times p\) matrices with orthonormal columns, would impose the necessary orthogonality constraint on the transformation matrix W. However, note that for our problem the individual components are of little importance to us. Our interest is in reducing the size of the input matrices.Footnote 1 As such, each single component yields a \(1 \times 1\) matrix which is not very informative in and of itself. Rather, we are interested in the principal components as an ensemble, i.e., the p-dimensional linear space that they span. Rotation within the p-dimensional space will not affect our solution. Thus, it suffices to consider a mapping \(X^{\downarrow } = W^\top X W\) with \(W \in \mathcal {G}(n,p)\).

We compare our variance-based definition to PCA, the canonical method for dimensionality reduction which itself aims to preserve maximal data variance. For vector data, PCA is formulated as

$$\begin{aligned} W&= \underset{W^\top W = \mathbb {I}_p}{\mathrm {argmax}} \sum _i \left\| \left( x_i - \bar{x} \right) W\right\| ^2_2 \nonumber \\&= \underset{W^\top W = \mathbb {I}_p}{\mathrm {argmax}} {{\mathrm{tr}}}\left( W^\top \left( \sum _i \left( x_i - \bar{x} \right) ^\top \left( x_i - \bar{x} \right) \right) W \right) , \end{aligned}$$
(2)

where \(\bar{x}\) is the Euclidean mean of the data.

Translating the operations in the right-most formulation of Eq. (2) from the vector case to the matrix case gives

$$\begin{aligned} W = \underset{W^\top W = \mathbb {I}_p}{\mathrm {argmax}} {{\mathrm{tr}}}\left( W^\top \left( \sum _i \left( X_i - \bar{X}_\mathrm {e} \right) ^\top \left( X_i - \bar{X}_\mathrm {e} \right) \right) W \right) , \end{aligned}$$
(3)

where \(\bar{X}_\mathrm {e}\) is the Euclidean mean of the data.

For symmetric matrices, this formulation is equivalent to the one proposed in Yang et al. (2004) and Lu et al. (2006). Note, however, that the matrix W in Eq. (3) acts on the data only by right-hand multiplication. Effectively, it is as if we are performing PCA only on the row space of the data matrices \(\mathbf {X}\).Footnote 2
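
For reference, Eq. (3) has the usual closed-form solution: W is given by the top-p eigenvectors of the scatter matrix \(\sum _i \left( X_i - \bar{X}_\mathrm {e} \right) ^\top \left( X_i - \bar{X}_\mathrm {e} \right) \). A minimal sketch (the function name matrix_pca is ours):

```python
import numpy as np

def matrix_pca(Xs, p):
    # Closed-form solution of Eq. (3): top-p eigenvectors of the scatter matrix
    Xbar = np.mean(Xs, axis=0)
    S = sum((X - Xbar).T @ (X - Xbar) for X in Xs)
    w, V = np.linalg.eigh(S)                # eigenvalues in ascending order
    return V[:, np.argsort(w)[::-1][:p]]    # columns spanning the dominant subspace
```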

Indeed, the key difference between our proposed method and ordinary PCA is that in our cost function the matrix W acts on \(\mathbf {X}\) on both sides. It is only by applying W to both sides of a matrix that we obtain a valid element in \(\mathcal {S}_+^{p}\). So, it is only natural that W should also act on both its sides during optimization.

Although our method can accommodate multiple geometries via various choices of the metric \(\delta \), the difference between Eqs. (3) and (1) becomes apparent when we work, as the standard PCA does, in the Euclidean geometry. In the Euclidean case, the cost function optimization in Definition 3 becomes

$$\begin{aligned} W&= \underset{W \in \mathcal {G}\left( n,p \right) }{\mathrm {argmax}} \sum _i \left\| W^\top \left( X_i - \bar{X}_\mathrm {e} \right) W\right\| ^2_\mathcal {F} \nonumber \\&= \underset{W \in \mathcal {G}\left( n,p \right) }{\mathrm {argmax}} \sum _i {{\mathrm{tr}}}\left( W^\top \left( X_i - \bar{X}_\mathrm {e} \right) ^\top W W^\top \left( X_i - \bar{X}_\mathrm {e} \right) W \right) . \end{aligned}$$
(4)

Note the additional term \(W W^\top \ne \mathbb {I}_n\) as compared to Eq. (3).

In general, the two expressions are not equivalent. Moreover, our proposed formulation consistently retains more of the data variance than the standard PCA method, as will be shown in Sect. 5. It is worth noting, however, that the two methods are equivalent when the matrices \(\left\{ X_i \right\} \) share the same eigenbasis. In this case, the problem can be written in terms of the common basis and the extra term \(W W^\top \) does not contribute to the cost function. Both methods then yield identical results.
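
The two objectives can also be compared numerically; the sketch below (illustrative, with hypothetical function names cost_eq3 and cost_eq4) evaluates both costs for a given W, making the role of the extra \(WW^\top \) term explicit.

```python
import numpy as np

def cost_eq3(W, Xs, Xbar):
    # tr( W^T [ sum_i (X_i - Xbar)^T (X_i - Xbar) ] W ), the objective of Eq. (3)
    S = sum((X - Xbar).T @ (X - Xbar) for X in Xs)
    return float(np.trace(W.T @ S @ W))

def cost_eq4(W, Xs, Xbar):
    # sum_i || W^T (X_i - Xbar) W ||_F^2, the Euclidean gaPCA objective of Eq. (4)
    return float(sum(np.linalg.norm(W.T @ (X - Xbar) @ W, 'fro') ** 2 for X in Xs))
```

For a generic set of SPD matrices the two values differ; when the \(X_i\) share an eigenbasis and W is composed of columns of that basis, the two objectives coincide, in line with the equivalence noted above.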

3 Geometry-aware PCA

Equipped with a PCA formulation adequate for matrix data, we now turn to the issue of geometry awareness. We have already discussed that the use of a Euclidean geometry may potentially lead to erroneous or distorted results when applied to SPD matrices, especially when the distance between matrices is large on the manifold.Footnote 3 Other methods for dimensionality reduction of SPD matrices, while utilizing the structure of the SPD manifold, are nonetheless flawed. First, they use only a local approximation of the manifold, causing a degradation in performance when distances between matrices are large. Second, and more importantly, these methods apply the same faulty formulation of matrix PCA.

In the previous section we presented a general definition of gaPCA in Definition 3. In order to highlight the difference between PCA and gaPCA we studied its instantiation with the Euclidean geometry. However, this formulation can incorporate any metric defined on \(\mathcal {S}_+^{n}\). In this section we explore these other metrics.

We first describe the geometry of the SPD matrix manifold. We will highlight the benefits of our geometry-aware method through a discussion of some of the relevant invariance properties of the various metrics defined on the manifold. These properties will allow for efficient optimization of the transformation matrix W. Finally, we present some useful implementation details.

3.1 The geometry of \(\mathcal {S}_+^{n}\)

The shortcomings of the Euclidean geometry described in the introduction stem from the fact that it does not enforce the positive-definiteness constraint. That is, using the Euclidean metric we are essentially treating the matrices as if they were elements of the (flat) space \(\mathbb {R}^{n \times n}\). The positive-definiteness constraint, however, induces a Riemannian manifold of negative curvature. As noted in Fletcher et al. (2004) and Sommer et al. (2010), it is then more natural to use a Riemannian geometry as it ensures that the solutions will respect the constraint encoded by the manifold.

A standard choice of metric, due to its favorable geometric properties, is the affine invariant Riemannian metric (AIRM) (Bhatia 2009). In this paper we denote it as \(\delta _\mathrm {r}\). Equipped with this metric, the SPD manifold becomes a complete manifold of negative curvature. It prevents the occurrence of the swelling effect and allows for matrix extrapolation without obtaining non-definite matrices. Other beneficial properties that are specifically related to our problem are presented later in this section.

Definition 4

(Affine invariant Riemannian metric (AIRM)) Let \(X,Y \in \mathcal {S}_+^{n}\) be two SPD matrices. Then, the AIRM is given as

$$\begin{aligned} \delta _\mathrm {r}^2 (X,Y) = \left\| \log \left( X^{-1/2} Y X^{-1/2} \right) \right\| ^2_\mathcal {F}, \end{aligned}$$

where \(\log \left( \cdot \right) \) is the matrix logarithm function, which for SPD matrices is \(\log \left( X \right) = U \log \left( \varLambda \right) U^\top \) for the eigendecomposition \(X = U\varLambda U^\top \).
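
Numerically, \(\delta _\mathrm {r}\) can be computed from the generalized eigenvalues of the pair \((Y, X)\), which coincide with the eigenvalues of \(X^{-1/2} Y X^{-1/2}\). A short sketch (illustrative only, with the hypothetical name airm_dist):

```python
import numpy as np
from scipy.linalg import eigh

def airm_dist(X, Y):
    # delta_r(X, Y) = || log(X^{-1/2} Y X^{-1/2}) ||_F
    lam = eigh(Y, X, eigvals_only=True)     # generalized eigenvalues of (Y, X)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```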

Another noteworthy metric, closely related to the AIRM, is the log-determinant (a.k.a symmetric Stein) metric (Sra 2011) which we denote by \(\delta _\mathrm {s}\).

Definition 5

(Log-determinant (symmetric Stein) metric) Let \(X,Y \in \mathcal {S}_+^{n}\) be two SPD matrices. Then, the log-determinant (symmetric Stein) metric is given as

$$\begin{aligned} \delta _\mathrm {s}^2 (X,Y) = \log \left( \det \left( (X+Y)/2 \right) \right) - \log \left( \det (XY) \right) /2. \end{aligned}$$

This metric is dubbed ‘symmetric’ because it is a symmetrized Bregman matrix divergence (Cherian et al. 2011; Sra 2012). These in turn have been shown to be useful for various learning applications (Cherian et al. 2011; Harandi et al. 2014b). While not a Riemannian metric, the log-determinant metric closely approximates the AIRM and shares many of its useful properties. Furthermore, in Sra (2011) it has been suggested that the log-determinant metric is a more computationally efficient alternative to the AIRM. The properties of this metric that are advantageous for our problem are shared by both \(\delta _\mathrm {s}\) and \(\delta _\mathrm {r}\) and are described later in this section.
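
A sketch of \(\delta _\mathrm {s}\) (again, our own illustration): it needs only log-determinants, which is part of its computational appeal relative to the matrix logarithms required by the AIRM.

```python
import numpy as np

def stein_dist2(X, Y):
    # delta_s^2(X, Y) = log det((X + Y)/2) - log det(XY)/2   (Definition 5)
    _, ld_mid = np.linalg.slogdet((X + Y) / 2)
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return ld_mid - 0.5 * (ld_x + ld_y)
```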

Even though \(\mathcal {S}_+^{n}\) is a curved space, flat subspaces (briefly, flats) (Bridson and Haefliger 2011) are a powerful tool with which to study its structure and that of other symmetric spaces. Formally, a subspace \(F \subset \mathcal {S}_+^{n}\) is called a flat of dimension k (a k-flat) if it is isometric to \(\mathbb {E}^k\), the Euclidean space of dimension k. If F is not contained in any flat of bigger dimension, then F is called a maximal flat. Next we give a functional definition of flats of \(\mathcal {S}_+^{n}\).

Definition 6

(Flats of \(\mathcal {S}_+^{n}\)) Let \(A \in S(n)\), where S(n) is the space of symmetric matrices, and let Q denote the eigenvectors of \(e^A\), the matrix exponential of A. Consider the elements of \(\mathcal {S}_+^{n}\) that share the eigenvectors Q with \(e^A\). These elements all commute with each other and with \(e^A\). We call this set the n-flat F.

Since the elements of F commute with each other, i.e., \(UV = VU\) for \(U,V \in F\), the matrix logarithm \(\log \left( \cdot \right) \) satisfies \(\log (UV) = \log (U) + \log (V)\) on F. So, the distance between them is

$$\begin{aligned} \delta (U,V) = \sqrt{{{\mathrm{tr}}}\left( \log (U^{-1} V)^2\right) } = \sqrt{{{\mathrm{tr}}}\left( \left( \log (V)-\log (U)\right) ^2\right) }. \end{aligned}$$
(5)

Since \(\sqrt{{{\mathrm{tr}}}\left( (\cdot )^2\right) }\) is a Euclidean norm, we have that F is isometric, under \(\log (\cdot )\), to \(\mathbb {R}^n\) equipped with a Euclidean metric.

Note that \(\delta (U,V) = \sqrt{{{\mathrm{tr}}}\left( \left( \log (V)-\log (U)\right) ^2\right) }\) is precisely the log-Euclidean metric (Arsigny et al. 2007).

Definition 7

(Log-Euclidean metric) Let \(X,Y \in \mathcal {S}_+^{n}\) be two SPD matrices. Then, the log-Euclidean metric is given as

$$\begin{aligned} \delta _\mathrm {le}^2 (X,Y) = \left\| \log \left( X \right) - \log \left( Y \right) \right\| ^2_\mathcal {F}. \end{aligned}$$

As illustrated in Yger and Sugiyama (2015), this metric uses the logarithmic map to project the matrices to \(\mathcal {T}_\mathbb {I} \mathcal {S}_+^{n}\), the tangent space at the identity, where the standard Euclidean norm is then used to measure distances between matrices. So, the use of this metric in effect flattens the manifold, deforming the distance between matrices when they do not lie in the same n-flat.
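
The flat structure can be checked numerically: for two matrices sharing the same eigenvectors (i.e., lying in the same flat), the AIRM distance of Eq. (5) and the log-Euclidean distance coincide. A small sketch under these assumptions (our own, for illustration):

```python
import numpy as np
from scipy.linalg import eigh

def sym_logm(A):
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def airm_dist(X, Y):
    lam = eigh(Y, X, eigvals_only=True)           # generalized eigenvalues of (Y, X)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def logE_dist(X, Y):
    return np.linalg.norm(sym_logm(X) - sym_logm(Y), 'fro')

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a common eigenbasis
X1 = Q @ np.diag(rng.uniform(0.5, 4.5, 4)) @ Q.T  # two matrices in the same flat
X2 = Q @ np.diag(rng.uniform(0.5, 4.5, 4)) @ Q.T
print(np.isclose(airm_dist(X1, X2), logE_dist(X1, X2)))   # True on a common flat
```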

In the context of this work, flats provide a valuable intuitive understanding of our proposed method, since the goal of our geometry-aware PCA method is to find an optimal transformation matrix W, whose column space determines a p-flat. Viewing \(\mathcal {S}_+^{n}\) in terms of flats is especially instructive for \(\mathcal {S}_+^{2}\), where the eigenvectors form a 2-dimensional rotation matrix that can be summed up by the rotation angle \(\theta \). Then, the matrices in \(\mathcal {S}_+^{2}\) can be plotted in cylindrical coordinates: \(\theta \) is the rotation angle, \(\rho \) is the ratio of eigenvalues and \(z = \log {\det {X}}\), as seen in Fig. 3a. In this figure geodesics between matrices, computed using the AIRM, are shown in red. With this construction, flats of \(\mathcal {S}_+^{2}\) can be imagined as panes of a revolving door, intersecting at the line formed by multiples of the identity \(\alpha \mathbb {I}\), for \(\alpha > 0\) (see Fig. 3b). In Sect. 5 we will use flats to measure the quality of the eigenspace estimation as compared to the true maximal variance eigenspace.

Fig. 3

Riemannian geodesics plotted in cylindrical coordinates (left). Flats in \(\mathcal {S}_+^{2}\) shown as panes of a revolving door (right)

3.2 Invariance properties

As discussed in Bhatia (2009), \(\delta _\mathrm {r}\) is invariant under the congruence transformation \(X \mapsto P^\top X P\) for \(P \in GL_n\left( \mathbb {R}\right) \), the group of \(n\times n\) real valued invertible matrices. So, we have

$$\begin{aligned} \delta _\mathrm {r} (X,Y) =\delta _\mathrm {r}(P^{\top } XP,P^{\top } Y P) \end{aligned}$$

for \(X,Y \in \mathcal {S}_+^{n}\) and \(P \in GL_n\left( \mathbb {R}\right) \). This property is called affine invariance and is also shared by \(\delta _\mathrm {s}\) (Sra 2012). For brevity we will state the subsequent analysis in terms of the AIRM, but the same will hold true for the log-determinant metric.

The affine invariance property has practical consequences for covariance matrices extracted from EEG signals (Barachant and Congedo 2014). Indeed, such a class of transformations includes re-scaling and normalization of the signals, electrode permutations and, if there is no dimensionality reduction, it also includes whitening, spatial filtering or source separation. For covariance matrices extracted from images this property has similar implications, with this class of transformations including changes of illumination when using RGB values (Harandi et al. 2014a).

In contrast, the log-Euclidean metric and the distance derived from it are not affine-invariant. This fact has been used to derive a metric learning algorithm (Yger and Sugiyama 2015). Nevertheless, it is invariant under the action of the orthogonal group. This comes from the fact that for any SPD matrix X and invertible matrix P, we have \(\log (P X P^{-1})=P \log (X) P^{-1}\) (Bhatia 2009, p. 219). Then, using the fact that for any matrix \(O \in \mathbb {O}_p\), \(O^\top =O^{-1}\), it follows that \(\delta _{\mathrm {le}} (O^\top XO,O^\top YO)=\delta _{\mathrm {le}} (X,Y)\).
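
These invariances are easy to verify numerically; the sketch below (ours, for illustration) checks the affine invariance of \(\delta _\mathrm {r}\) and the orthogonal invariance of \(\delta _\mathrm {le}\) on random SPD matrices.

```python
import numpy as np
from scipy.linalg import eigh

def sym_logm(A):
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def airm_dist(X, Y):
    return np.sqrt(np.sum(np.log(eigh(Y, X, eigvals_only=True)) ** 2))

def logE_dist(X, Y):
    return np.linalg.norm(sym_logm(X) - sym_logm(Y), 'fro')

rng = np.random.default_rng(0)
n = 5
spd = lambda: (lambda A: A @ A.T + np.eye(n))(rng.standard_normal((n, n)))
X, Y = spd(), spd()
P = rng.standard_normal((n, n))                    # invertible almost surely
O, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal

print(np.isclose(airm_dist(X, Y), airm_dist(P.T @ X @ P, P.T @ Y @ P)))   # True
print(np.isclose(logE_dist(X, Y), logE_dist(O.T @ X @ O, O.T @ Y @ O)))   # True
print(np.isclose(logE_dist(X, Y), logE_dist(P.T @ X @ P, P.T @ Y @ P)))   # generally False
```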

Fig. 4

Function optimization on manifolds. The Euclidean gradient of the function \(\bar{f}\) is projected onto the tangent space at the point \(X_0\), then onto the manifold using the exponential map or a retraction

3.3 Optimization on manifolds

Our approach may be summed up as finding a lower-dimensional manifold \(\mathcal {S}_+^{p}\) by optimizing a transformation (parameterized by W) that maximizes the (approximate) Fréchet variance w.r.t. a metric \(\delta \). As the parameter W lies in the Grassmann manifold \(\mathcal {G}(n,p)\), we solve the optimization problem on this manifold (Absil et al. 2009; Edelman et al. 1998).

Optimization on matrix manifolds is a mature field and by now most of the classical optimization algorithms have been extended to the Riemannian setting. In this setting, descent directions are not straight lines but rather curves on the manifold. For a function f, applying a Riemannian gradient descent can be expressed by the following steps, illustrated in Fig. 4:

  1.

    At any iteration, at the point W, transform a Euclidean gradient \(D_W f\) into a Riemannian gradient \(\nabla _W f\). In our case, i.e., for the Grassmann manifold, the transformation is \(\nabla _W f = D_W f - WW^\top D_W f\) (Absil et al. 2009).

  2.

    Perform a line search along geodesics at W in the direction \(H=\nabla _W f\). For the Grassmann manifold, on the geodesic going from a point W in direction H (with a scalar step-size \(t \in \mathbb {R}\)), a new iterate is obtained as \(W (t) = WV \cos (\varSigma t)V^\top + U \sin (\varSigma t) V^\top \), where \(U \varSigma V^\top \) is the compact singular value decomposition of H and the matrix sine and cosine functions are defined by \(\cos \left( A \right) = \sum _{k=0}^\infty \left( -1\right) ^k \frac{A^{2k}}{\left( 2k\right) !}\) and \(\sin \left( A \right) = \sum _{k=0}^\infty \left( -1\right) ^k \frac{A^{2k+1}}{\left( 2k+1\right) !}\).

In practice, we employ a Riemannian trust-region method described in Absil et al. (2009) and efficiently implemented in Boumal et al. (2014).
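
For reference, the two steps above can be written in a few lines; the following is a minimal sketch (not the Manopt-based implementation used in the experiments), with illustrative function names.

```python
import numpy as np

def grassmann_grad(W, euclid_grad):
    # Step 1: project the Euclidean gradient onto the tangent space at W
    return euclid_grad - W @ (W.T @ euclid_grad)

def grassmann_geodesic(W, H, t):
    # Step 2: move along the geodesic from W in direction H with step size t
    U, s, Vt = np.linalg.svd(H, full_matrices=False)   # compact SVD, H = U diag(s) V^T
    return (W @ Vt.T) @ np.diag(np.cos(s * t)) @ Vt + U @ np.diag(np.sin(s * t)) @ Vt
```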

3.4 Cost function gradients

In order to make this article self-contained, we provide the Euclidean gradient of our cost function w.r.t. W for each of the four metrics described above. For the Euclidean metric \(\delta _{\mathrm {e}}\) this is:

$$\begin{aligned} D_W \delta ^2_{\mathrm {e}} (W^\top X_i W, W^\top \bar{X}_\mathrm {e} W )= 4 \left( X_i - \bar{X}_\mathrm {e} \right) W W^\top \left( X_i -\bar{X}_\mathrm {e}\right) W. \end{aligned}$$

Its derivation is detailed in “Appendix 1”.
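
This expression is easy to sanity-check against a finite-difference approximation of the directional derivative; a short sketch (our own, with the hypothetical names cost_e and grad_e):

```python
import numpy as np

def cost_e(W, X, Xbar):
    # delta_e^2( W^T X W, W^T Xbar W ) = || W^T (X - Xbar) W ||_F^2
    return np.linalg.norm(W.T @ (X - Xbar) @ W, 'fro') ** 2

def grad_e(W, X, Xbar):
    # The Euclidean gradient stated above
    D = X - Xbar
    return 4 * D @ W @ (W.T @ D @ W)

rng = np.random.default_rng(0)
n, p = 6, 3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
X, Xbar = A @ A.T + np.eye(n), B @ B.T + np.eye(n)
W, _ = np.linalg.qr(rng.standard_normal((n, p)))
H = rng.standard_normal((n, p))                     # an arbitrary perturbation

eps = 1e-6
fd = (cost_e(W + eps * H, X, Xbar) - cost_e(W - eps * H, X, Xbar)) / (2 * eps)
print(np.isclose(fd, np.sum(grad_e(W, X, Xbar) * H), rtol=1e-4))   # True
```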

Then, for the AIRM \(\delta _{\mathrm {r}}\) we have, following Harandi et al. (2014a):

$$\begin{aligned}&D_W \delta ^2_\mathrm {r} \left( W^\top X_i W, W^\top \bar{X}_\mathrm {r} W \right) \\&=4\left( X_i W \left( {W}^\top X_i W \right) ^{-1} - \bar{X}_\mathrm {r}W \left( {W}^\top \bar{X}_\mathrm {r} W \right) ^{-1} \right) \log \left( {W}^\top X_i W \left( {W}^\top \bar{X}_\mathrm {r} W \right) ^{-1} \right) . \end{aligned}$$

It should be noted that using directional derivatives (Bhatia 1997; Absil et al. 2009) we obtain a different (but numerically equivalent) formulation of this gradient. For completeness, we report this formula and its derivation in “Appendix 2” of the supplementary material. In our experiments, as it was computationally more efficient, we use the equation above for the gradient.

Next, the gradient w.r.t. W of the log-determinant metric \(\delta _{\mathrm {s}}\) is given in Harandi et al. (2014a) by:

$$\begin{aligned} D_W \delta ^2_\mathrm {s} \left( W^\top X_i W, W^\top \bar{X}_\mathrm {r} W \right)&= \left( X_i + \bar{X}_\mathrm {r}\right) W \left( {W}^\top \frac{X_i + \bar{X}_\mathrm {r}}{2} W \right) ^{-1} \\&\quad - X_i W \left( {W}^\top X_i W \right) ^{-1} - \bar{X}_\mathrm {r} W \left( {W}^\top \bar{X}_\mathrm {r} W \right) ^{-1}. \end{aligned}$$

Finally, for the log-Euclidean metric:

$$\begin{aligned}&D_W \delta ^2_{\mathrm {le}} (W^\top X_i W, W^\top \bar{X}_{\mathrm {le}} W) \\&= 4 \left( X_i W D \log \left( {W}^\top X_i W \right) \left[ \log \left( {W}^\top X_i W \right) - \log \left( {W}^\top \bar{X}_{\mathrm {le}} W \right) \right] \right. \\&\quad + \left. \bar{X}_{\mathrm {le}} W D \log \left( {W}^\top \bar{X}_{\mathrm {le}} W \right) \left[ \log \left( {W}^\top \bar{X}_{\mathrm {le}} W \right) - \log \left( {W}^\top X_i W \right) \right] \right) , \end{aligned}$$

where \(D f(W)[H] = \underset{h \rightarrow 0}{\lim } \frac{f(W+hH)-f(W)}{h}\) is the Fréchet derivative (Absil et al. 2009) and \(\bar{X}_{\mathrm {le}}\) denotes the mean w.r.t. the log-Euclidean metric. Note that there is no closed-form expression for \(D \log (W)[H]\), but it can be computed efficiently by numerical methods (Boumal 2010; Boumal and Absil 2011; Al-Mohy and Higham 2009). This derivation is given in “Appendix 3”.
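
One standard numerical device for \(D \log (A)[E]\) (not necessarily the routine used by the authors) is the block-matrix identity \(f\left( \left[ \begin{matrix} A & E \\ 0 & A \end{matrix} \right] \right) = \left[ \begin{matrix} f(A) & Df(A)[E] \\ 0 & f(A) \end{matrix} \right] \), which holds for smooth primary matrix functions such as the logarithm. A sketch, with a finite-difference sanity check:

```python
import numpy as np
from scipy.linalg import logm

def dlogm(A, E):
    # Fréchet derivative D log(A)[E] via the block upper-triangular embedding
    n = A.shape[0]
    M = np.block([[A, E], [np.zeros((n, n)), A]])
    return np.real(logm(M)[:n, n:])

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                       # an SPD matrix
E = rng.standard_normal((4, 4)); E = (E + E.T) / 2   # a symmetric direction
eps = 1e-6
fd = np.real(logm(A + eps * E) - logm(A - eps * E)) / (2 * eps)
print(np.allclose(dlogm(A, E), fd, atol=1e-5))   # True
```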

4 Exact variance maximization

As briefly stated in Sect. 2, we now explain why Eq. (1) constitutes only an approximation to the Fréchet variance maximization. In general, the operations of computing the mean and projecting onto the lower dimensional manifold are not interchangeable. That is, let \(\bar{X^{\downarrow }}\) be the mean of the compressed set \(\mathbf {X}^{\downarrow } = \left\{ W^\top X_i W \right\} \) and let \(W^\top \bar{X} W\) be the compressed mean of the original set \(\mathbf {X}\). The two matrices are not equal in general for metrics other than the Euclidean metric. This is because for \(\delta _\mathrm {r}\), \(\delta _\mathrm {s}\) and \(\delta _\mathrm {le}\) the Fréchet mean of the set \(\mathbf {X}\) is not a linear function of the matrices. Since we do not know in advance the mean of the compressed set, the cost function defined in Eq. (1) does not exactly express the Fréchet variance of \(\mathbf {X}^{\downarrow }\). Rather, it serves as an approximation to it.

In the following we present two strategies to address this issue. First, we propose to whiten the input matrices before optimizing the transformation matrix W. Pre-whitening ensures that the Riemannian mean of the set \(\mathbf X \) is known and equal to the identity \(\bar{X}_\mathrm {r}= I\). Second, we consider a constrained formulation of the cost function, where the constraint ensures that we are using the mean of the compressed set \(\mathbf X ^\downarrow \). The constrained formulation is presented here for the AIRM, but a similar technique can also be applied to the other metrics.

Ultimately, we find that the relaxed optimization problem in Definition 3 is sufficient for our purposes. Although solving the exact problem is feasible, our experiments will show that there is no practical benefit in doing so. Nonetheless, the two methods discussed below highlight interesting aspects of the problem that can be applied to other optimization problems based on the distance between SPD matrices.

4.1 Pre-whitening the data

Our first potential strategy is to pre-whiten the data before optimizing over W. Using the Riemannian geometry, each point \(X_i\) is mapped to \(\tilde{X}_i = \bar{X}_\mathrm {r}^{-1/2}X_i \bar{X}_\mathrm {r}^{-1/2}\). After this mapping, the Riemannian mean of \(\tilde{\mathbf {X}}\) is the identity \(\mathbb {I}_n\). Using the whitened data, the resulting cost function is

$$\begin{aligned} W = \underset{W \in \mathcal {G}\left( n,p \right) }{\mathrm {argmax}} \sum _i \delta _\mathrm {r} ^2 \left( W^\top \tilde{X}_i W, \mathbb {I}_p \right) . \end{aligned}$$
(6)

Due to the affine invariance property of the AIRM and the log-determinant metric, we have that \(\delta \left( \tilde{X}_i, \mathbb {I} \right) = \delta \left( X_i, \bar{X}_\mathrm {r} \right) \). That is, for these two metrics the whitened data is simply a translated and rotated copy of the original input data, with the distances between matrices preserved. Thus, up to rotation and scaling, we should expect the solution of this problem to be equivalent to that of the original formulation in Definition 3 in terms of retained variance.
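
A sketch of the whitening step (illustrative; the Karcher fixed-point iteration below is one standard way to compute \(\bar{X}_\mathrm {r}\), not necessarily the authors' implementation). After whitening, the Karcher condition at the identity, \(\sum _i \log (\tilde{X}_i) = 0\), provides a numerical check that the Riemannian mean of the whitened set is indeed \(\mathbb {I}_n\).

```python
import numpy as np

def sym_fun(A, fun):
    # Apply a scalar function to the eigenvalues of a symmetric matrix
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def riemannian_mean(Xs, n_iter=100, tol=1e-10):
    # Karcher fixed-point iteration for the Fréchet mean w.r.t. the AIRM
    M = np.mean(Xs, axis=0)
    for _ in range(n_iter):
        Mh = sym_fun(M, np.sqrt)
        Mih = sym_fun(M, lambda w: 1.0 / np.sqrt(w))
        T = np.mean([sym_fun(Mih @ X @ Mih, np.log) for X in Xs], axis=0)
        M = Mh @ sym_fun(T, np.exp) @ Mh
        if np.linalg.norm(T, 'fro') < tol:
            break
    return M

rng = np.random.default_rng(0)
Xs = [(lambda A: A @ A.T + np.eye(4))(rng.standard_normal((4, 4))) for _ in range(20)]
S = sym_fun(riemannian_mean(Xs), lambda w: 1.0 / np.sqrt(w))   # Xbar_r^{-1/2}
Xw = [S @ X @ S for X in Xs]                                   # whitened data
print(np.linalg.norm(sum(sym_fun(X, np.log) for X in Xw)))     # ~0 up to solver tolerance
```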

4.2 A constrained optimization problem

To obtain an exact variance maximization problem, we may use the fact that the Riemannian mean \(\bar{X}_\mathrm {r}\) of a set \(\mathbf {X} = \lbrace X_i \in \mathcal {S}_+^{n} \rbrace _{i=1}^N\) can be shown to satisfy the following equation (Bhatia 2013):

$$\begin{aligned} C \left( W, \bar{X}_\mathrm {r}\right) = \sum _{j=1}^N \log \left( \left( W^\top X_j W \right) ^{-1/2} \bar{X}_\mathrm {r}\left( W^\top X_j W \right) ^{-1/2} \right) = 0^{p \times p}. \end{aligned}$$
(7)

This is a direct consequence of the fact that the Riemannian mean uniquely minimizes the Fréchet variance \(\sum _{j=1}^N \delta _\mathrm {r}^2 \left( W^\top X_j W, \bar{X}_\mathrm {r}\right) \). While this equation does not have a closed-form solution, we may use the implicit constraint \(C \left( W, \bar{X}_\mathrm {r}\right) = 0^{p \times p}\) to formulate a cost function which exactly embodies the Fréchet variance of the data. By incorporating the relation between W and \(\bar{X}_\mathrm {r}\) through the constraint, we ensure that at each step of the optimization we use the exact mean of \(\mathbf X ^\downarrow \), and not an approximation of it.

Consider the following constrained optimization problem, given by

$$\begin{aligned} \hat{f}\left( W \right) = f \left( W,\varLambda \right) = \sum _{j=1}^N \delta _\mathrm {r}^2 \left( W^\top X_j W, \varLambda \right) \; \text {s.t.} \; C\left( W, \varLambda \right) = 0^{p \times p}, \end{aligned}$$
(8)

where for brevity we leave out the dependence of f and C on \(\mathbf {X}\). Through the use of the implicit function theorem (Krantz and Parks 2012), we are guaranteed the existence of a differentiable function \( \varLambda \left( W \right) \;: \; \mathcal {S} \left( n,p \right) \rightarrow \mathcal {S}_+^{p}\) defined by \(C\left( W, \varLambda \right) = 0^{p \times p}\). Then, the derivative \(\nabla _W \varLambda \) can be used to express the gradient of the cost w.r.t. W:

$$\begin{aligned} \nabla \hat{f}\left( W \right) = \nabla _W \varLambda \left( W \right) \nabla _{\varLambda } f \left( W, \varLambda \right) + \nabla _{W} f \left( W, \varLambda \right) . \end{aligned}$$
(9)

Since in this cost function the transformation matrix W is applied only to one argument of the distance function, our problem is no longer invariant to the action of the orthogonal group \(\mathbb {O}_p\). So, we must optimize this cost over the Stiefel manifold.

Practically, for optimization of the constrained cost function, at each step we:

  1.

    Given the current value of W, compute the Riemannian mean of the set \(\mathbf X ^\downarrow \).

  2.

    Compute \(\nabla \hat{f}\left( W \right) \) using Eq. (9) for use in our line search.

This method will be denoted as \(\delta _\mathrm {c}\) (for constrained). Details of the computation of the cost function gradient are given in “Appendix 4”.

5 Experimental results

To understand the performance of our proposed methods, we test them on both synthetic and real data. First, for synthetically generated data, we examine their ability to compress the data while retaining its variance. We also show that our methods are superior in terms of eigenspace estimation. Next, we apply them to image data in the form of region covariance matrices as well as to brain computer interface (BCI) data in the form of covariance matrices. To assess the quality of the dimensionality reduction, we use the compressed matrices for classification and examine the accuracy rates. Finally, we briefly compare the constrained cost formulation \(\delta _\mathrm {c}\) discussed in Sect. 4.2, which optimizes the exact Fréchet variance, to \(\delta _\mathrm {r}\)PCA, which uses only an approximate expression of the variance.

5.1 Synthetic data

Our first goal is to substantiate the claim that our methods outperform the standard matrix PCA in terms of variance maximization. As shown in Chapter 6 of Jolliffe (2002), it is useful to study the fraction of variance retained by the method as the dimension grows. To this end we randomly generate a set \(\mathbf {X} = \left\{ X_i \in \mathcal {S}_+^{n}\right\} \) of 50 SPD matrices of size \(n=17\) using the following scheme.

For each \(X_i\), we first generate an \(n \times n\) matrix A whose entries are i.i.d. standard normal random variables. Next, we compute the QR decomposition of this matrix \(A = QR\), where Q is an orthonormal matrix and R is an upper triangular matrix. We use Q as the eigenvectors of \(X_i\). Finally, we uniformly draw its eigenvalues \(\lambda = \left( \lambda _1, \ldots , \lambda _n \right) \) from the range [0.5, 4.5]. The resulting matrices are then \(X_i = Q_i \text {diag}\left( \lambda _i\right) Q_i^\top ,\) where each \(X_i\) has a unique matrix \(Q_i\) and spectrum \(\lambda _i\).
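
A sketch of this generation scheme (our own code, for reproducibility of the setup described above; the function name random_spd_dataset is hypothetical):

```python
import numpy as np

def random_spd_dataset(N=50, n=17, lam_range=(0.5, 4.5), seed=0):
    # Random SPD matrices: orthonormal eigenvectors from a QR factorization,
    # eigenvalues drawn uniformly from lam_range
    rng = np.random.default_rng(seed)
    Xs = []
    for _ in range(N):
        Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
        lam = rng.uniform(*lam_range, size=n)
        Xs.append(Q @ np.diag(lam) @ Q.T)
    return Xs

Xs = random_spd_dataset()
assert all(np.all(np.linalg.eigvalsh(X) > 0) for X in Xs)   # every X_i is SPD
```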

Each matrix was compressed to size \(p \times p\) for \(p = 2,\ldots ,9\) using gaPCA with various metrics, 2DPCA (Yang et al. 2004) and PGA (Fletcher et al. 2004). PGA first maps the matrices \(\mathbf {X}\) via the matrix logarithm to \(\mathcal {T}_{\bar{X}_\mathrm {r}} \mathcal {S}_+^{n}\), the tangent space at the point \(\bar{X}_\mathrm {r}\). Then standard linear PCA is performed in the (Euclidean) tangent space. The matrix W was initialized by the same random guess drawn from \(\mathcal {G}\left( n,p \right) \) for each of the metrics of gaPCA. We recorded the fraction of the Fréchet variance contained in the compressed dataset for the various values of p,

$$\begin{aligned} \alpha _\delta \left( p\right) = \frac{\sigma _{\delta }^2 \left( \mathbf {X}^{\downarrow }(p) \right) }{ \sigma _{\delta }^2 \left( \mathbf {X} \right) }, \end{aligned}$$

for the Euclidean and Riemannian metrics. This process was repeated 25 times for different instances of the dataset \(\mathbf {X}\).

The averaged results of the experiment, along with standard deviations, are presented in Fig. 5. For clarity, when the curves of various methods coincide, only one is kept, while the others are omitted. The methods \(\delta _\mathrm {s}\)PCA and \(\delta _\mathrm {le}\)PCA obtained nearly identical results to \(\delta _\mathrm {r}\)PCA. The same is true for the constrained optimization problem \(\delta _\mathrm {c}\)PCA. The subspace spanned by the columns of the transformation matrix W obtained by \(\delta _\mathrm {c}\)PCA is not identical to that of \(\delta _\mathrm {r}\)PCA, but the difference in the retained variance is negligible. This implies that \(W^\top \bar{X}_\mathrm {r}W\) is a good approximation for the Riemannian mean of the compressed set \( \mathbf X ^\downarrow \). The curves of \(\delta _\mathrm {r}\)PCA and \(\delta _\mathrm {g}\)PCA coincide for the Riemannian variance since the AIRM is invariant to the process of data whitening.

Fig. 5

Fraction of Fréchet variance retained by the various compression methods w.r.t. a the Riemannian and b the Euclidean distances. Metrics which obtained identical results to \(\delta _\mathrm {r}\) have been omitted from the figure for clarity. The error bars indicate the standard deviation

We see that our proposed methods retain the greatest fraction of data variance. As expected, each gaPCA method is best at retaining the variance w.r.t. its own metric. That is, for the variance w.r.t. the AIRM, \(\delta _\mathrm {r}\)PCA outperforms \(\delta _\mathrm {e}\)PCA, and for the Euclidean metric the opposite is true. The only exception is \(\delta _\mathrm {g}\)PCA, which performs poorly w.r.t. the Euclidean variance. This is due to the data centering performed before the dimensionality reduction. Recall that the data centering is done using the Riemannian geometry, i.e., \(\tilde{X}_i = \bar{X}_\mathrm {r}^{-1/2}X_i \bar{X}_\mathrm {r}^{-1/2}\). While this transformation preserves the Riemannian distance between matrices, it does not preserve the Euclidean distance. Thus, we obtain poor results for the Euclidean metric using this method.

Next, we examine the quality of the eigenspace estimation. To this end, we generate data using the following scheme: We begin with a basis for \(\mathbb {R}^n\). The first p vectors of this basis are denoted by F, and the remaining \(n-p\) vectors are denoted by G. For each matrix \(X_i\) we randomly generate a \(p \times p\) rotation matrix \(O_i\) and an \(n \times n\) rotation matrix \(R_i\). The input matrices are then given by \(X_i = Q_i \text {diag}\left( \lambda _i \right) Q_i^\top \) where for \(\epsilon \ll 1\), \(Q_i = \left[ FO_i \; G \right] \left( \mathbb {I} + \epsilon R_i \right) \) and \(\lambda _i\) is a set of n eigenvalues uniformly drawn from the range [0.5, 1.5]. By rotating F via the matrix \(O_i\) we have that each \(FO_i\) spans the same p-dimensional subspace but lies in a different p-flat.Footnote 4 The matrix \(R_i\) acts as a small perturbation, slightly rotating the entire eigenbasis. Without the perturbation \(R_i\), projecting onto the p-dimensional subspace spanned by F would, in general, ensure maximal variance.Footnote 5

We ran the experiment for a range of subspaces F of sizes \(p = 5,7,8,10,11,13\) and for a constant number of 15 eigenvectors in G. For each value of p, we computed the distance between the transformation matrix W obtained by each method and the space F. We repeated the experiment 25 times.

The results are presented in Table 1. We see that as p grows, our gaPCA methods manage to more accurately estimate the true eigenspace F as compared to the other methods.

Table 1 Average distance between the estimated p dimensional subspace and the true subspace F

5.2 Texture image data

Following the promising results on the synthetic data, we next test our methods on real data. In computer vision applications, region covariance (RC) matrices (Tuzel et al. 2006, 2008) are useful image descriptors, computed using the covariance between feature vectors taken from image patches. In this set of experiments we performed a texture classification task on the normalized Brodatz (1966) dataset, a set of 112 grayscale texture images (see Fig. 6).

Fig. 6

Samples from the normalized Brodatz texture dataset

In each experiment we randomly selected two texture images. For each image the left side was used to create the training set and the right side was used for the test set. The RC matrices were created by extracting \(128 \times 128\) patches from randomly selected locations in the image. Then, at every pixel the feature vectors were composed of 28 Gabor filter responses (Gabor 1946) for four scales and seven angles, as well as pixel intensity and first and second gradients, for a total of 34 feature dimensions. The RCs were compressed to size \(8 \times 8\) using the standard PCA, PGA and our proposed gaPCA, and then classified using two different classifiers: a nearest neighbor method and a minimum distance to the mean (MDM) scheme (Barachant et al. 2012). The MDM classifier works as follows: We first apply our methods in an unsupervised manner. Next, using the labels of the training set, we compute the mean for each of the two classes. Then, we classify the covariance matrices in the test set according to their distance to the class means; each test covariance matrix is assigned to the class whose mean is closer. This classifier is restricted here to a two-class problem and can be used with the various distances.

The results are presented in Tables 2 and 3. The nearest neighbor classifier works quite well for classification of this data set, making it somewhat difficult to see big differences in the performance of the different methods. Nonetheless, for this classifier the geometric methods dominate, with the log-Euclidean metric achieving the highest overall accuracy rate. Using the MDM classifier the classification rates are generally lower, and it is more readily seen that gaPCA using the log-Euclidean metric has a clear advantage for this task.

Table 2 Accuracy rates for the various PCA methods using a minimum distance to the mean (MDM) classifier
Table 3 Accuracy rates for the various PCA methods using a nearest neighbor classifier

5.3 Brain-computer interface

The use of covariance matrices is prevalent in the brain computer interface (BCI) community (Barachant et al. 2010; Lotte and Guan 2011). EEG signals involve highly complex and non-linear phenomena (Blankertz et al. 2008) which cannot be modeled efficiently using simple Gaussian assumptions. In this context, for some specific applications, covariance matrices (using their natural Riemannian geometry) have been successfully used (Barachant et al. 2010, 2012, 2013; Yger 2013). As emphasized in Blankertz et al. (2008) and Lotte and Guan (2011), dimensionality reduction and spatial filtering are crucial steps for building an efficient BCI system. Hence, an unsupervised dimensionality reduction method utilizing the Riemannian geometry of covariance matrices is of great interest for BCI applications.

In this set of experiments, we apply our methods to BCI data from the BCI competition III datasets IIIa and IV (Schlögl et al. 2005). These datasets contain motor imagery (MI) EEG signals and were collected in a multi-class setting, with the subjects performing more than two different MI tasks. As was done in Lotte and Guan (2011), we evaluate our algorithms on two-class problems by selecting only signals of left- and right-hand MI trials.

Dataset IIIa comprises EEG signals recorded from 60 electrodes from three subjects who performed left-hand, right-hand, foot and tongue MI. A training set and a test set are available for each subject. Both sets contain 45 trials per class for Subject 1, and 30 trials per class for Subjects 2 and 3. Dataset IV comprises EEG signals recorded from 118 electrodes from five subjects who performed left-hand, right-hand, foot and tongue MI. Here 280 trials were available for each subject, among which 168, 224, 84, 56 and 28 composed the training sets for the respective subjects. The remaining trials composed their test sets.

We apply the same pre-processing as described in Lotte and Guan (2011). EEG signals were band-pass filtered between 8 and 30 Hz using a 5th-order Butterworth filter. For each trial, we extracted features from the time segment located from 0.5 to 2.5 s after the cue instructing the subject to perform MI.

For both datasets we reduce the matrices from their original size to \(6 \times 6\) as it corresponds to the number of sources recommended in Blankertz et al. (2008) for common spatial pattern (CSP). Since our problem is non-convex, we restarted the optimization 5 times with different initial values and used the matrix W which produced the lowest value of the cost function. As for the image data, the performance measure for the dimensionality reduction was the classification accuracy with MDM classification. We used both the Riemannian and the Euclidean metrics to compute the class means and distances for the test samples. However, we report the results only for the Riemannian metric, as they were better for all subjects. The results using the Euclidean metric can be found in “Appendix 5”.

Table 4 Accuracy rates for the various PCA methods using the Riemannian metric

The accuracy rates of the classification are presented in Table 4. As a reference on these datasets, we also report the results of a classical method from the literature. This method (Lotte and Guan 2011) consists of supervised dimensionality reduction, namely a CSP, followed by a linear discriminant analysis on the log-variance of the sources extracted by the CSP.

While the results of Lotte and Guan (2011) cannot be compared to those of our unsupervised techniques in a straightforward manner, they nonetheless serve as a motivation.Footnote 6 Since we intend to extend our approach to the supervised learning setting in our future work, it is instructive to quantitatively assess the performance gap even at this stage of research. Encouragingly, our comparatively naive methods work well, obtaining the same classification rates as Lotte and Guan (2011) for some test subjects, and for others even achieving higher rates.

Once again, our \(\delta _\mathrm {c}\)PCA method yields the same classification accuracy as \(\delta _\mathrm {r}\), albeit for slightly different transformation matrices. This again implies that our approximation for the Riemannian mean of the compressed set \(\mathbf X ^\downarrow \) is adequate.

5.4 Approximate versus exact costs

Our experiments indicate that the use of the exact method \(\delta _\mathrm {c}\) is effectively equivalent to using the approximate method \(\delta _\mathrm {r}\). So, to determine which method is preferable, we compare the time complexity of each optimization step. When initialized identically, the number of steps needed to reach convergence is roughly the same for both methods. In terms of matrix inversions and decompositions, it can be seen from the expressions for the gradients that both methods require roughly the same number of operations. However, the gradient computation for \(\delta _\mathrm {c}\) requires the multiplication of larger matrices, as well as the computation of the Riemannian mean of the set \(\mathbf X ^\downarrow \) at each optimization step. Note that this is the mean in the smaller compressed space, and so it does not significantly increase the run time.

Fig. 7

Comparison between run time of \(\delta _\mathrm {r}\) and \(\delta _\mathrm {c}\) formulations

Figure 7 shows the average time per iteration, in seconds, for both methods. We see that the approximate \(\delta _\mathrm {r}\) runs faster than the exact \(\delta _\mathrm {c}\). Since the gain in performance is negligible, we conclude that the formulation optimizing the approximate Fréchet variance is preferable.

5.5 Choosing a metric

Our experiments have shown that our geometric method gaPCA in general performs better than other existing PCA methods. In terms of choosing which metric to use, each offers a different appeal: The Euclidean geometry, while simple and efficient, is usually too simplistic to capture the curvature of the space. The AIRM and the closely related log-determinant metrics are the most natural and offer the most invariance properties. The log-Euclidean metric also exhibits many invariance properties and can be seen as a compromise between the simplicity of the Euclidean metric and the complexity of the AIRM. Thus, an adequate metric must be suited to the data at hand.

6 Conclusion

In this paper, we introduced a novel way to perform unsupervised dimensionality reduction for SPD matrices. We provided a rectified formulation of matrix PCA based on the optimization of a generalized notion of variance for SPD matrices. Extending this formulation to other geometries, we used tools from the field of optimization on manifolds. We showed that it suffices to use a simple approximation for the Fréchet variance and that it is not necessary to use a more complex formulation in order to optimize it precisely. We applied our method to synthetic and real-world data and demonstrated its usefulness.

In future work we consider several promising extensions to our methods. First, we may cast our \(\delta \)PCA to a stochastic optimization setting on manifolds (Bonnabel 2013). Such an approach may be useful for the massive datasets common in applications such as computer vision. In addition, it would be interesting to use our approach with criteria in the spirit of Yger and Sugiyama (2015). This would lead to supervised dimensionality reduction, bridging the gap between the supervised log-Euclidean metric learning proposed in Yger and Sugiyama (2015) and the dimensionality reduction proposed in Harandi et al. (2014a).