1 Introduction

We study the following constrained optimization problem

$$\begin{aligned} \min _{x \in \mathcal {X}\subseteq \mathcal {M}}\quad \phi (x), \end{aligned}$$
(1)

where \(\phi : \mathcal {M}\rightarrow \mathbb {R}\) is a differentiable function and \(\mathcal {X}\) is a compact geodesically convex (henceforth, g-convex) subset of a Riemannian manifold \(\mathcal {M}\). The objective \(\phi \) may be g-convex or nonconvex. When the constraint set \(\mathcal {X}\) is “simple,” one may solve (1) via Riemannian projected-gradient methods. But in many cases, projection onto \(\mathcal {X}\) can be expensive to compute, motivating us to seek projection-free methods.

Euclidean (\(\mathcal {M}\equiv \mathbb {R}^n\)) projection-free methods based on the Frank-Wolfe (FW) scheme [20] have recently witnessed a surge of interest in machine learning and related fields [29, 36]. Instead of projection, such FW methods rely on access to a “linear” oracle, that is, a subroutine that solves the problem

$$\begin{aligned} \min _{z\in \mathcal {X}}\quad \left\langle {z},\, {\nabla \phi (x)} \right\rangle \; , \end{aligned}$$
(2)

which can sometimes be much simpler than projection onto \(\mathcal {X}\). This attractive property of FW methods has been exploited in convex [2, 29], nonconvex [35], submodular [12, 21], and stochastic [24, 52] optimization problems, among others.

But as far as we are aware, FW methods have not been studied for Riemannian manifolds. Our work fills this gap in the literature by developing, analyzing, and experimenting with Riemannian Frank-Wolfe (RFW) methods. In addition to adapting FW to the Riemannian setting, there is one more challenge that we must overcome: RFW requires access to a Riemannian analog of the linear oracle (2), which can be hard even for g-convex problems.

Therefore, to complement our theoretical analysis of RFW, we discuss in detail practical settings that admit efficient Riemannian “linear” oracles. Specifically, we discuss problems where \(\mathcal {M}=\mathbb {P}_d\), the manifold of (Hermitian) positive definite matrices, and \(\mathcal {X}\) is a g-convex semidefinite interval; then problem (1) assumes the form

$$\begin{aligned} \min _{X \in \mathcal {X}\subseteq \mathbb {P}_d}\quad \phi (X),\qquad \text {where}\ \mathcal {X}:= \{ X \in \mathbb {P}_d \mid L \preceq X \preceq U\}, \end{aligned}$$
(3)

where L and U are positive definite matrices. An important instance of (3) is the following g-convex optimization problem (see §4 for details and notation):

$$\begin{aligned} \min _{X \in \mathbb {P}_d}\quad \sum _{i=1}^n w_i\delta _R^2(X,A_i),\quad \text {where}\ w \in \varDelta _n, A_1,\ldots ,A_n \in \mathbb {P}_d, \end{aligned}$$
(4)

which computes the Riemannian centroid of a set of positive definite matrices (also known as the “matrix geometric mean” and the “Karcher mean”) [5, 32, 37]. We will show that RFW offers a simple approach for solving (4) that performs competitively against recently published state-of-the-art Riemannian approaches. As a second application, we show that RFW allows for an efficient computation of Bures-Wasserstein barycenters on the Gaussian density manifold.

Summary of results. The key contributions of this paper are as follows:

  1.

    We introduce a Riemannian Frank-Wolfe (RFW) algorithm for addressing constrained g-convex optimization on Riemannian manifolds. We show that RFW attains a non-asymptotic O(1/k) rate of convergence to the optimal objective value, k being the number of iterations (Theorem 1). Furthermore, under additional assumptions on the objective function and the constraints, we show that RFW can even attain linear convergence rates (Theorem 2). In the nonconvex case, RFW attains a non-asymptotic \(O(1/\sqrt{k})\) rate of convergence to first-order critical points (Theorem 3). These rates are comparable to the best known guarantees for the classical Euclidean Frank-Wolfe algorithm [29, 36].

  2.

    While the Euclidean “linear” oracle is a convex problem, the Riemannian “linear” oracle is nonconvex. Therefore, the key challenge of developing RFW lies in efficiently solving the “linear” oracle. We address this problem with the following contributions:

    • We specialize RFW for g-convex problems of the form (3) on the manifold of Hermitian positive definite (HPD) matrices. Importantly, for this problem we develop a closed-form solution to the Riemannian “linear” oracle, which involves solving a nonconvex semi-definite program (SDP), see Theorem 4. We then apply RFW to computing the Riemannian mean of HPD matrices. In comparison with state-of-the-art methods, we observe that RFW performs competitively. Furthermore, we implement RFW for the computation of Bures-Wasserstein barycenters on the Gaussian density manifold.

    • We show that we can recover a sublinear convergence rate, even if the Riemannian “linear” oracle can only be solved approximately, e.g., using relaxations or iterative solvers. This makes the approach applicable to a wider range of constrained optimization problems.

We believe that the closed-form solutions for the HPD “linear” oracle, which involve a nonconvex SDP, should be of wider interest too. A similar approach can be used to solve the Euclidean linear oracle, a convex SDP, in closed form (Appendix 1). More broadly, we hope that our results encourage others to study RFW as well as other examples of problems with efficient Riemannian “linear” oracles.

Related work. Riemannian optimization has a venerable history. The books [1, 58] provide a historical perspective as well as basic theory. The focus of these books and of numerous older works on Riemannian optimization, e.g.,[19, 25, 41, 53], is almost exclusively on asymptotic analysis. More recently, non-asymptotic convergence analysis quantifying the iteration complexity of Riemannian optimization algorithms has begun to be pursued [3, 10, 66]. Specifically, it is known that first-order methods, such as Riemannian Gradient Descent, achieve a sublinear iteration complexity. However, to the best of our knowledge, all these works either focus on unconstrained Riemannian optimization, or handle constraints via projections. In contrast, we explore constrained g-convex optimization within an abstract RFW framework, by assuming access to a Riemannian “linear” oracle. Several applications of Riemannian optimization are known, including to matrix factorization on fixed-rank manifolds [57, 59], dictionary learning [16, 56], classical optimization under orthogonality constraints [19], averages of rotation matrices [46], elliptical distributions in statistics  [54, 68], and Gaussian mixture models [27]. Explicit theory of g-convexity on HPD matrices is studied in [55]. Additional related work corresponding to the Riemannian mean of HPD matrices is discussed in Sect. 4.

2 Background

We begin by noting some background and notation from Riemannian geometry. For a deeper treatment we refer the reader to [15, 31].

A smooth manifold \({\mathcal {M}}\) is a locally Euclidean space equipped with a differential structure. At any point \(x \in {\mathcal {M}}\), the set of tangent vectors forms the tangent space \(T_x\mathcal {M}\). Our focus is on Riemannian manifolds, i.e., smooth manifolds with a smoothly varying inner product \(\left\langle {\xi },\, {\eta } \right\rangle _x\) defined on the tangent space \(T_x\mathcal {M}\) at each point \(x\in \mathcal {M}\). We write \(\Vert \xi \Vert _x := \sqrt{\left\langle {\xi },\, {\xi } \right\rangle _x}\) for \(\xi \in T_x\mathcal {M}\); for brevity, we will drop the subscript on the norm whenever the associated tangent space is clear from context. Furthermore, we assume \(\mathcal {M}\) to be complete, which ensures that the exponential map is defined on the whole tangent space. The exponential map \({{\,\mathrm{Exp}\,}}_x: T_x {\mathcal {M}} \rightarrow {\mathcal {M}}\) sends \(g_x \in T_x\mathcal {M}\) to the point \(y={{\,\mathrm{Exp}\,}}_x(g_x) = \gamma (1) \in {\mathcal {M}}\), where \(\gamma : [0,1] \rightarrow {\mathcal {M}}\) is the geodesic with \(\gamma (0)= x\) and \({\dot{\gamma }}(0)=g_x \in T_x\mathcal {M}\). The inverse exponential map \({{\,\mathrm{Exp}\,}}_x^{-1}: {\mathcal {M}} \rightarrow T_x {\mathcal {M}}\) is a diffeomorphism from a neighborhood of \(x \in {\mathcal {M}}\) onto a neighborhood of \(0 \in T_x {\mathcal {M}}\) with \({{\,\mathrm{Exp}\,}}_x^{-1}(x)=0\). Note that the completeness of \({\mathcal {M}}\) ensures that both maps are well defined.

Since tangent spaces are local notions, one cannot directly compare vectors lying in different tangent spaces. To tackle this issue, we use the concept of parallel transport: the idea is to transport (map) a tangent vector along a geodesic to the respective other tangent space. More precisely, let \(x, y \in {\mathcal {M}}\) with \(x \ne y\). We transport \(g_x \in T_x {\mathcal {M}}\) along a geodesic \(\gamma \) (where \(\gamma (0)=x\) and \(\gamma (1)=y\)) to the tangent space \(T_y\mathcal {M}\); we denote this by \(\varGamma _x^y g_x\). Importantly, the inner product on the tangent spaces is preserved under parallel transport, so that \(\left\langle {\xi _x},\, {\eta _x} \right\rangle _x = \left\langle {\varGamma _x^y \xi _x},\, {\varGamma _x^y \eta _x} \right\rangle _y\), where \(\xi _x, \eta _x \in T_x {\mathcal {M}}\), while \(\left\langle {\cdot },\, {\cdot } \right\rangle _x\) and \(\left\langle {\cdot },\, {\cdot } \right\rangle _y\) are the respective inner products.

2.1 Gradients, convexity, smoothness

Recall that the Riemannian gradient \({{\,\mathrm{grad}\,}}\phi (x)\) is the unique vector in \(T_x\mathcal {M}\) satisfying

$$\begin{aligned} D\phi (x)[v] = \left\langle {{{\,\mathrm{grad}\,}}\phi (x)},\, {v} \right\rangle _x,\qquad \forall v \in T_x\mathcal {M}. \end{aligned}$$

When optimizing functions using gradients, it is useful to impose some added structure. The two main properties that we require are Lipschitz-continuous gradients and geodesic convexity. Below, \(g_x\) denotes the Riemannian gradient \({{\,\mathrm{grad}\,}}\phi (x)\). We say \(\phi : {\mathcal {M}} \rightarrow {\mathbb {R}}\) is L-smooth, or that it has L-Lipschitz gradients, if

$$\begin{aligned} \Vert {{\,\mathrm{grad}\,}}\phi (y)-\varGamma _x^y{{\,\mathrm{grad}\,}}\phi (x)\Vert \le L d(x,y) \qquad \forall \ x, y \in \mathcal {M}\; , \end{aligned}$$
(5)

where \(d(x,y)\) is the geodesic distance between x and y; equivalently,

$$\begin{aligned} \phi (y) \le \phi (x) + \left\langle {g_x},\, {{{\,\mathrm{Exp}\,}}_x^{-1} (y)} \right\rangle _x + \tfrac{L}{2} d^2(x,y) \qquad \forall x,y \in {\mathcal {M}}; \; g_x \in T_x\mathcal {M}\; . \end{aligned}$$
(6)

We say \(\phi : {\mathcal {M}} \rightarrow {\mathbb {R}}\) is geodesically convex (g-convex) if

$$\begin{aligned} \phi (y) \ge \phi (x) + \left\langle {g_x},\, {{{\,\mathrm{Exp}\,}}_x ^{-1}(y)} \right\rangle _x \qquad \forall x,y \in {\mathcal {M}} ; \; g_x \in T_x\mathcal {M}\; , \end{aligned}$$
(7)

and call it \(\mu \)-strongly g-convex (\(\mu \ge 0\)) if

$$\begin{aligned} \phi (y) \ge \phi (x) + \left\langle {g_x},\, {{{\,\mathrm{Exp}\,}}_x ^{-1}(y)} \right\rangle _x + \tfrac{\mu }{2} d^2(x,y) \qquad \forall x,y \in {\mathcal {M}} ; \; g_x \in T_x\mathcal {M}\; . \end{aligned}$$
(8)

The following observation underscores the reason why g-convexity is a valuable geometric property for optimization.

Proposition 1

(Optimality) Let \(\phi \) be g-convex and let \(x^* \in \mathcal {X}\subset \mathcal {M}\) be a local optimum for (1). Then, \(x^*\) is globally optimal, and \(\left\langle {{{\,\mathrm{grad}\,}}\phi (x^*)},\, {{{\,\mathrm{Exp}\,}}_{x^*}^{-1}(y)} \right\rangle _{x^*} \ge 0\) for all \(y\in \mathcal {X}\).

2.2 Projection-free vs. Projection-based methods

The growing body of literature on Riemannian optimization considers mostly projection-based methods, such as Riemannian Gradient Descent (RGD) or Riemannian Steepest Descent (RSD) [1]. Such methods and their convergence guarantees typically require Lipschitz assumptions. However, the objectives of many classic optimization and machine learning tasks are not Lipschitz on the whole manifold. In such cases, an additional compactness argument is required; yet in projection-based methods, the typically used retraction back onto the manifold is not guaranteed to land in this compact set. Thus, in each iteration, an additional and potentially expensive projection step is needed to ensure that the update remains in the compact region where the gradient is Lipschitz. On the other hand, projection-free methods, such as FW, bypass this issue, because their update is guaranteed to stay within the compact feasible region. Importantly, in some problems, the Riemannian linear oracle at the heart of FW can be less expensive than computing a projection back onto the compact set. A detailed numerical study comparing the complexity of projections with that of computing linear minimizers in the Euclidean case can be found in [18]. The efficiency of linear minimizers is especially significant for the applications highlighted in this paper, where the linear oracle even admits a closed-form solution (see Sect. 4.1).

2.3 Constrained optimization in Riemannian space

A large body of literature has considered the problem of translating a constrained Euclidean optimization problem into an unconstrained Riemannian problem, by encoding the primary constraint in the manifold structure. However, often a problem has additional constraints, requiring a Riemannian approach to constrained optimization. We list below notable examples, including those that will be covered in the application section of the paper.

Examples on the manifold of positive definite matrices include the computation of Riemannian centroids (with interval constraints, see Sect. 4.2.1) and learning determinantal point processes (with interval constraints [44]). A related problem is that of computing Wasserstein barycenters on the Bures manifold (with interval constraints, see Sect. 4.3.1). The k-means clustering algorithm corresponds to an optimization task on the Stiefel manifold with equality and inequality constraints [14]. Non-negative PCA can be computed on the sphere with equality constraints [47]. The synchronization of data matrices can be written as an optimization task on the manifold of the orthogonal group with a determinant constraint (see [61, Sect. 5]). A minimum balanced cut for graph bisection can be computed on the oblique manifold with quadratic equality constraints [41].

3 Riemannian Frank-Wolfe

The condition \(\left\langle {{{\,\mathrm{grad}\,}}\phi (x^*)},\, {{{\,\mathrm{Exp}\,}}_{x^*}^{-1}(y)} \right\rangle _{x^*} \ge 0\) for all \(y\in \mathcal {X}\) in Proposition 1 lies at the heart of Frank-Wolfe (also known as “conditional gradient”) methods. In particular, if this condition is not satisfied, then there must be a feasible descent direction. FW schemes seek such a direction and update their iterates [20, 29]. This high-level idea is equally valid in the Riemannian setting. Algorithm 1 recalls the basic (Euclidean) FW method, which solves \(\min _{x\in \mathcal {X}}\phi (x)\), and Algorithm 2 introduces its Riemannian version (RFW), obtained by simply replacing Euclidean objects with their Riemannian counterparts. In the following, \(\left\langle {\cdot },\, {\cdot } \right\rangle \) will denote \(\left\langle {\cdot },\, {\cdot } \right\rangle _{x_k}\) unless otherwise specified.

[Algorithm 1: Euclidean Frank-Wolfe (FW); pseudocode figure]

Notice that to implement Algorithm 1, \(\mathcal {X}\) must be compact and convex. Convexity ensures that after the update in Step 5, \(x_{k+1}\) remains feasible, while compactness ensures that the linear oracle in Step 3 has a solution. To obtain RFW, we first replace the linear oracle (Step 3 in Algorithm 1) with the Riemannian “linear oracle”:

$$\begin{aligned} \min _{z \in \mathcal {X}}\quad \left\langle {{{\,\mathrm{grad}\,}}\phi (x_k)},\, {{{\,\mathrm{Exp}\,}}_{x_k}^{-1}(z)} \right\rangle , \end{aligned}$$
(9)

where now \(\mathcal {X}\) is assumed to be a compact g-convex set. Similarly, observe that Step 5 of Algorithm 1 updates the current iterate \(x_k\) along a straight line joining \(x_k\) with \(z_k\). Thus, by analogy, we replace this step by moving \(x_k\) along a geodesic joining \(x_k\) with \(z_k\). The resulting RFW algorithm is presented as Algorithm 2.

[Algorithm 2: Riemannian Frank-Wolfe (RFW); pseudocode figure]
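To make the analogy concrete, the following Python sketch mirrors the structure of Algorithm 2. All names (`grad_phi`, `exp_map`, `log_map`, `linear_oracle`) are hypothetical placeholders for user-supplied manifold primitives and the oracle (9); they are not part of any particular library.

```python
def rfw(grad_phi, exp_map, log_map, linear_oracle, x0, num_iters=100):
    """Sketch of Algorithm 2 (RFW) with user-supplied manifold primitives.

    grad_phi(x)         -> Riemannian gradient of phi at x (a tangent vector at x)
    exp_map(x, v)       -> Exp_x(v), a point on the manifold
    log_map(x, y)       -> Exp_x^{-1}(y), a tangent vector at x
    linear_oracle(x, g) -> argmin over z in X of <g, Exp_x^{-1}(z)>_x, cf. (9)
    """
    x = x0
    for k in range(num_iters):
        g = grad_phi(x)
        z = linear_oracle(x, g)             # Riemannian "linear" oracle (9)
        s = 2.0 / (k + 2.0)                 # step size used in Theorem 1
        x = exp_map(x, s * log_map(x, z))   # move along the geodesic from x toward z
    return x
```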

While we obtained Algorithm 2 purely by analogy, we must still show that this analogy results in a valid algorithm. In particular, we need to show that Algorithm 2 converges to a solution of (1). We will in fact prove a stronger result that RFW converges globally at the rate O(1/k), i.e., \(\phi (x_k)-\phi (x^*) = O(1/k)\), which matches the rate of the Euclidean FW method.

3.1 Convergence analysis

We make the following smoothness assumption:

Assumption 1

(Smoothness) The objective \(\phi \) has a locally Lipschitz continuous gradient on \(\mathcal {X}\), that is, there exists a constant L such that for all \(x, y \in \mathcal {X}\) we have

$$\begin{aligned} \Vert {{{\,\mathrm{grad}\,}}\phi (y)-\varGamma _x^y{{\,\mathrm{grad}\,}}\phi (x)}\Vert \le L \; d{(x,y)}. \end{aligned}$$
(10)

Next, we introduce a quantity that will play a central role in the convergence rate of RFW, namely the curvature constant

$$\begin{aligned} M_\phi := \sup _{\begin{array}{c} x,z \in \mathcal {X},\, \eta \in (0,1],\\ y=\gamma _{xz}(\eta ) \end{array}} \tfrac{2}{\eta ^2}\left[ \phi (y) - \phi (x) - \left\langle {{{\,\mathrm{grad}\,}}\phi (x)},\, {{{\,\mathrm{Exp}\,}}_x ^{-1}(y)} \right\rangle \right] \; . \end{aligned}$$
(11)

An analogous quantity is used in the analysis of the Euclidean FW method [29], whose proof techniques we adapt to the Riemannian setting. Here and in the following, \(y=\gamma (\eta )\) for some \(\eta \in [0,1]\), where \(\gamma :[0,1] \rightarrow \mathcal {M}\) is a geodesic with \(\gamma (0) = x\) and \(\gamma (1) = z\) (denoted \(\gamma _{xz}\)). Lemma 1 relates the curvature constant (11) to the Lipschitz constant L.

Lemma 1

Let \(\phi : \mathcal {M}\rightarrow \mathbb {R}\) be L-smooth on \(\mathcal {X}\), and let \(\mathrm {diam}(\mathcal {X}) := \sup _{x,y \in \mathcal {X}}d{(x,y)}\). Then, the curvature constant \(M_\phi \) satisfies the bound \(M_{\phi } \le L \; \mathrm{diam}(\mathcal {X})^2\).

Proof

Let \(x, z \in \mathcal {X}\) and \(\eta \in (0,1)\); let \(y=\gamma _{xz}(\eta )\) be a point on the geodesic joining x with z. This implies \(\frac{1}{\eta ^2} d{(x,y)}^2 = d{(x,z)}^2\). From (6) we know that

$$\begin{aligned} \left| \phi (y) - \phi (x) - \langle {{\,\mathrm{grad}\,}}\phi (x), {{\,\mathrm{Exp}\,}}_{x}^{-1}(y) \rangle \right| \le \frac{L}{2} d(x,y)^2 \end{aligned}$$

whereupon using the definition of the curvature constant we obtain

$$\begin{aligned} M_{\phi } \le \sup \frac{2}{\eta ^2} \frac{L}{2} d{(x,y)}^2 = \sup L \; d{(x,z)}^2 \le L \cdot \mathrm{diam} (\mathcal {X})^2 \; . \end{aligned}$$
(12)

\(\square \)

We note below an analog of the Lipschitz inequality (6) using the constant \(M_\phi \).

Lemma 2

(Lipschitz) Let \(x, y, z \in \mathcal {X}\) and \(\eta \in [0,1]\) with \(y=\gamma _{xz}(\eta )\). Then,

$$\begin{aligned} \phi (y) \le \phi (x) + \eta \left\langle {{{\,\mathrm{grad}\,}}\; \phi (x)},\, {{{\,\mathrm{Exp}\,}}_x^{-1}(z)} \right\rangle + \tfrac{1}{2} \eta ^2 M_{\phi } \; . \end{aligned}$$

Proof

From definition (11) of the constant \(M_\phi \) we see that

$$\begin{aligned} M_{\phi } \ge \tfrac{2}{\eta ^2} \left( \phi (y) - \phi (x) - \left\langle {{{\,\mathrm{grad}\,}}\; \phi (x)},\, {{{\,\mathrm{Exp}\,}}_{x}^{-1}(y)} \right\rangle \right) , \end{aligned}$$

which we can rewrite as

$$\begin{aligned} \phi (y) \le \phi (x) + \left\langle {{{\,\mathrm{grad}\,}}\; \phi (x)},\, {{{\,\mathrm{Exp}\,}}_{x}^{-1}(y)} \right\rangle + \tfrac{1}{2} \eta ^2 M_{\phi }. \end{aligned}$$
(13)

Furthermore, since \(y=\gamma _{xz}(\eta )\), we have \({{\,\mathrm{Exp}\,}}_x^{-1} (y) = \eta {{\,\mathrm{Exp}\,}}_x^{-1} (z)\), and therefore

$$\begin{aligned} \left\langle {{{\,\mathrm{grad}\,}}\phi (x)},\, {{{\,\mathrm{Exp}\,}}_x^{-1}(y)} \right\rangle&= \left\langle {{{\,\mathrm{grad}\,}}\phi (x)},\, {\eta {{\,\mathrm{Exp}\,}}_x^{-1}(z)} \right\rangle = \eta \left\langle {{{\,\mathrm{grad}\,}}\phi (x)},\, {{{\,\mathrm{Exp}\,}}_x^{-1}(z)} \right\rangle . \end{aligned}$$

Plugging this equation into (13) the claim follows. \(\square \)

We need one more technical lemma (easily verified by a quick induction).

Lemma 3

(Stepsize for RFW) Let \((a_k)_{k \in I}\) be a nonnegative sequence that fulfills

$$\begin{aligned} a_{k+1} \le (1-s_k)a_k + \tfrac{1}{2}s_k^2 M_{\phi }. \end{aligned}$$
(14)

If \(s_k = \frac{2}{(k+2)}\), then \(a_k \le \frac{2M_{\phi }}{(k+2)}\).
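For completeness, the inductive step reads as follows: assuming \(a_k \le \frac{2M_{\phi }}{k+2}\), the recursion (14) with \(s_k = \frac{2}{k+2}\) gives

$$\begin{aligned} a_{k+1} \le \Big (1-\tfrac{2}{k+2}\Big )\frac{2M_{\phi }}{k+2} + \tfrac{1}{2}\Big (\tfrac{2}{k+2}\Big )^2 M_{\phi } = \frac{2M_{\phi }(k+1)}{(k+2)^2} \le \frac{2M_{\phi }}{k+3}, \end{aligned}$$

where the last inequality uses \((k+1)(k+3) \le (k+2)^2\).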

We are now ready to state our first main convergence result, Theorem 1, which establishes a global iteration complexity bound for RFW.

Theorem 1

(Rate) Let \(\phi \) be g-convex, let \(s_k = \frac{2}{k+2}\), and let \(X^*\) be a minimizer of (1). Then, the sequence of iterates \(X_k\) generated by Algorithm 2 satisfies \(\phi (X_k)-\phi (X^*) = O(1/k)\).

Proof

The proof of this claim is straightforward; indeed

$$\begin{aligned}&\phi (X_{k+1}) - \phi (X^*)\\&\quad \le \phi (X_k) - \phi (X^*) + s_k\left\langle {{{\,\mathrm{grad}\,}}\phi (X_k)},\, {{{\,\mathrm{Exp}\,}}_{X_k} ^{-1}(Z_k)} \right\rangle + \tfrac{1}{2}s_k^2M_\phi \\&\quad \le \phi (X_k) - \phi (X^*) + s_k\left\langle {{{\,\mathrm{grad}\,}}\phi (X_k)},\, {{{\,\mathrm{Exp}\,}}_{X_k} ^{-1}(X^*)} \right\rangle + \tfrac{1}{2}s_k^2M_\phi \\&\quad \le \phi (X_k)-\phi (X^*) - s_k(\phi (X_k)-\phi (X^*)) + \tfrac{1}{2}s_k^2M_\phi \\&\quad =(1-s_k)(\phi (X_k)-\phi (X^*)) + \tfrac{1}{2}s_k^2M_\phi , \end{aligned}$$

where the first inequality follows from Lemma 2, while the second one from \(Z_k\) being an \({{\,\mathrm{argmin}\,}}\) obtained in Step 3 of the algorithm. The third inequality follows from g-convexity of \(\phi \). Setting \(a_k=\phi (X_k)-\phi (X^*)\) in Lemma 3 we immediately obtain

$$\begin{aligned} \phi (X_k)-\phi (X^*) \le \frac{2M_\phi }{k+2},\qquad k\ge 0, \end{aligned}$$

which is the desired O(1/k) convergence rate. \(\square \)

Theorem 1 provides a global sublinear convergence rate for RFW. Typically, FW methods trade off their simplicity for this slower convergence rate, and even for smooth strongly convex objectives they do not attain linear convergence rates [29]. We study in Sect. 3.2 a setting that permits RFW to attain a linear rate of convergence.

3.2 Linear convergence of RFW

In general, the sublinear convergence rate that we derived in the previous section is best-possible for Frank-Wolfe methods. This is due to the following phenomenon, which has been studied extensively in the Euclidean setting [13, 62]: If the optimum lies on the boundary of the constraint set \(\mathcal {X}\), then the FW updates “zig-zag”, resulting in a slower convergence. In this case, the upper bound on the global convergence rate is tight. If, however, the optimum lies in the strict interior of the constraint set, Euclidean FW is known to converge at a linear rate [22, 23]. Remarkably, under a similar assumption, RFW also displays global linear convergence, which we will formally prove below (Theorem 2). Notably, for the special case of the geometric matrix mean that we analyze in the next section, this strict interiority assumption will always be valid, provided that not all the matrices are the same.

3.2.1 Linear convergence under strict interior assumption

We use a Riemannian extension of the well-known Polyak-Łojasiewicz (PL) inequality [42, 49], which we define below. Consider the minimization problem

$$\begin{aligned} \min _{x\in \mathcal {M}} f(x), \end{aligned}$$

and let \(f^{*}\) be the optimal function value. We say that f satisfies the PL inequality if for some \(\mu > 0\),

$$\begin{aligned} \tfrac{1}{2} \Vert {{{\,\mathrm{grad}\,}}f(x)}\Vert ^2 \ge \mu \left( f(x)-f^{*}\right) \quad \forall x \in \mathcal {M}. \end{aligned}$$
(15)

Inequality (15) is weaker than strong convexity (and is in fact implied by it). It has been widely used for establishing linear convergence rates of gradient-based methods; see [33] for several (Euclidean) examples, and [65] for a Riemannian example. We will make use of inequality (15) for obtaining linear convergence of RFW, by combining it with a strict interiority condition on the minimum.

Theorem 2

(Linear convergence of RFW) Suppose that \(\phi \) is strongly g-convex with constant \(\mu \) and that its minimum lies in a ball of radius r that lies strictly inside the constraint set \(\mathcal {X}\). Define \(\varDelta _k := \phi (X_k)-\phi (X^*)\) and let the step size be \(s_k = \frac{r\sqrt{\mu \varDelta _k}}{\sqrt{2}M_\phi }\). Then, RFW converges linearly, since it satisfies

$$\begin{aligned} \varDelta _{k+1} \le \left( 1 - \frac{r^2\mu }{4M_\phi }\right) \varDelta _k. \end{aligned}$$

Proof

Let \(\mathcal {B}_r(X^*) \subset \mathcal {X}\) be a ball of radius r containing the optimum. Let

$$\begin{aligned} W_k := {{\,\mathrm{argmax}\,}}_{\begin{array}{c} \xi \in \mathcal {T}_{X_k} \\ \Vert {\xi }\Vert \le 1 \end{array}}\left\langle {\xi },\, {{{\,\mathrm{grad}\,}}\phi (X_k)} \right\rangle \end{aligned}$$

be the direction of steepest descent in the tangent space \(\mathcal {T}_{X_k}\). The point \(P_k = {{\,\mathrm{Exp}\,}}_{X_k}(rW_k)\) lies in \(\mathcal {X}\). Consider now the following inequality

$$\begin{aligned} \begin{aligned} \left\langle {-{{\,\mathrm{Exp}\,}}_{X_k}^{-1}(P_k)},\, {{{\,\mathrm{grad}\,}}\phi (X_k)} \right\rangle&= -\left\langle {{{\,\mathrm{grad}\,}}\phi (X_k)},\, {r W_k} \right\rangle = - r\Vert {{{\,\mathrm{grad}\,}}\phi (X_k)}\Vert , \end{aligned} \end{aligned}$$
(16)

which follows upon using the definition of \(W_k\). Thus, we have the bound

$$\begin{aligned} \begin{aligned} \varDelta _{k+1}&\le \varDelta _k + s_k \left\langle {{{\,\mathrm{grad}\,}}\phi (X_k)},\, {{{\,\mathrm{Exp}\,}}_{X_k}^{-1}(X_{k+1})} \right\rangle + \tfrac{1}{2}s_k^2M_\phi \\&\le \varDelta _k - s_k r\Vert {{{\,\mathrm{grad}\,}}\phi (X_k)}\Vert + \tfrac{1}{2}s_k^2M_\phi \\&\le \varDelta _k - s_k r\sqrt{2\mu }\sqrt{\varDelta _k} + \tfrac{1}{2}s_k^2M_\phi , \end{aligned} \end{aligned}$$

where the first inequality follows from the Lipschitz-bound (Lemma 2), the second one from (16), and the third one from the PL inequality (which, in turn holds due to the \(\mu \)-strong g-convexity of \(\phi \)). Now setting the step size \(s_k = \frac{r\sqrt{\mu \varDelta _k}}{\sqrt{2}M_\phi }\), we obtain

$$\begin{aligned} \varDelta _{k+1} \le \left( 1 - \frac{r^2\mu }{4M_\phi }\right) \varDelta _k, \end{aligned}$$

which delivers the claimed linear convergence rate. \(\square \)

Theorem 2 provides a setting where RFW can converge fast; however, it uses step sizes \(s_k\) that require knowing \(\phi (X^*)\) (Footnote 1). In case the optimal value is not available, we can use a worse value, which will still yield the desired inequality.

Remark 1

(Necessity of the strict interior assumption) For optimization tasks with polytope constraints that do not fulfill a strict interior assumption such as the one described above, several EFW variants achieve linear convergence [36]. Notable examples include Away-step FW [23, 62], Pairwise FW [45] and Fully-corrective FW [26]. One may ask whether these variants can be generalized to the Riemannian case. The first difficulty lies in finding a Riemannian equivalent of the polytope constraint set, which is defined as the convex hull of a finite set of vectors (atoms). Naturally, we could consider the convex hull of a finite set of points on a manifold, which is the intersection of all convex sets that contain them. Unfortunately, such a set is in general not compact; compactness is only guaranteed for Hadamard manifolds under additional, restrictive conditions [39]. Even in this special case, ensuring that the resulting constraint sets have sufficiently “good” geometry is not straightforward.

3.3 RFW for nonconvex problems

Finally, we consider the case where \(\phi \) in Eq. 1 may be nonconvex. In this case, we cannot hope to find the global minimum with first-order methods, such as RFW. However, we can compute first-order critical points via RFW. For the setup and analysis, we follow the Euclidean case [35].

We first introduce the Frank-Wolfe gap as a criterion for evaluating convergence rates. For \(X \in \mathcal {X}\), we write

$$\begin{aligned} G(X) := \max _{Z \in \mathcal {X}} \left\langle {{{\,\mathrm{Exp}\,}}_X^{-1}(Z)},\, {- {{\,\mathrm{grad}\,}}\; \phi (X)} \right\rangle \; . \end{aligned}$$
(17)
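Note that \(G(X) = -\min _{Z \in \mathcal {X}} \left\langle {{{\,\mathrm{grad}\,}}\phi (X)},\, {{{\,\mathrm{Exp}\,}}_X^{-1}(Z)} \right\rangle \), so the gap is obtained as a byproduct of the linear oracle evaluated in every RFW iteration and can serve as a stopping criterion. A minimal Python sketch, reusing the hypothetical primitives from the sketch following Algorithm 2 and an assumed placeholder `inner` for the Riemannian inner product:

```python
def fw_gap(x, grad_phi, log_map, inner, linear_oracle):
    """Frank-Wolfe gap (17): G(x) = max over z in X of <Exp_x^{-1}(z), -grad phi(x)>_x.

    inner(x, u, v) -> Riemannian inner product <u, v>_x (hypothetical placeholder);
    the remaining arguments are as in the RFW sketch following Algorithm 2.
    """
    g = grad_phi(x)
    z = linear_oracle(x, g)             # exact minimizer of <g, Exp_x^{-1}(z)>_x over X
    return -inner(x, g, log_map(x, z))  # the gap is minus the oracle's optimal value
```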

With this, we can show the following sublinear convergence guarantee:

Theorem 3

(Rate (nonconvex case)) Let \({\tilde{G}}_K := \min _{0 \le k \le K} G(X_k)\), where \(G(X_k)\) denotes the Frank-Wolfe gap (17) at \(X_k\), and let \(h_0 := \phi (X_0) - \min _{X \in \mathcal {X}}\phi (X)\) denote the initial suboptimality. After K iterations of Algorithm 2, we have \({\tilde{G}}_K \le \frac{\max \lbrace 2 h_0, M_{\phi } \rbrace }{\sqrt{K+1}}\).

The proof utilizes techniques similar to those in [35, Theorem 1].

4 Specializing RFW for HPD matrices

In this section we study a concrete setting for RFW, namely, a class of g-convex optimization problems with Hermitian positive definite (HPD) matrices. We will show that the Riemannian linear oracle (9) admits an efficient solution for this class of problems, thereby allowing an efficient implementation of Algorithm 2. The concrete class of problems that we consider is the following:

$$\begin{aligned} \min _{X \in \mathcal {X}\subseteq \mathbb {P}_d}\quad \phi (X),\qquad \text {where}\ \mathcal {X}:= \{ X \in \mathbb {P}_d \mid L \preceq X \preceq U\}, \end{aligned}$$
(18)

where \(\phi \) is a g-convex function and \(\mathcal {X}\) is a “positive definite interval” (which is easily seen to be a g-convex set). Note that projection onto the set \(\mathcal {X}\) is not easy to compute for matrices. Problem (18) captures several g-convex optimization problems, of which perhaps the best known is the task of computing the matrix geometric mean (also known as the Riemannian centroid or Karcher mean); see Sect. 4.2.1.

We briefly recall some facts about the Riemannian geometry of HPD matrices below. For a comprehensive overview, see, e.g., [5, Chapter 6]. We denote by \(\mathbb {H}_d\) the set of \(d\times d\) Hermitian matrices. The most common Riemannian geometry on \(\mathbb {P}_d\) is induced by

$$\begin{aligned} \left\langle {A},\, {B} \right\rangle _X := {{\,\mathrm{tr}\,}}(X^{-1}AX^{-1}B),\quad \ \text {where}\ A, B \in \mathbb {H}_d. \end{aligned}$$
(19)

This metric induces the geodesic \(\gamma : [0,1]\rightarrow \mathbb {P}_d\) between \(X, Y \in \mathbb {P}_d\) given by

$$\begin{aligned} \gamma (t) := X^{\nicefrac {1}{2}}(X^{-\nicefrac {1}{2}}YX^{-\nicefrac {1}{2}})^{t}X^{\nicefrac {1}{2}},\qquad t \in [0,1]. \end{aligned}$$
(20)

The corresponding Riemannian distance is

$$\begin{aligned} d(X,Y) := \Vert {\log (X^{-\nicefrac {1}{2}}YX^{-\nicefrac {1}{2}})}\Vert _{\text {F}},\qquad X, Y \in \mathbb {P}_d. \end{aligned}$$
(21)

The Riemannian gradient \({{\,\mathrm{grad}\,}}\phi \) is obtained from the Euclidean one (\(\nabla \phi \)) as follows

$$\begin{aligned} {{\,\mathrm{grad}\,}}\phi (X) = X\nabla ^\mathbb {H}\phi (X)X \; , \end{aligned}$$
(22)

where \(\nabla ^\mathbb {H}\phi (X) := \frac{1}{2} \big ( \nabla \phi (X) + (\nabla \phi (X))^* \big )\) denotes the (Hermitian) symmetrization of the gradient. The exponential map and its inverse at a point \(X \in \mathbb {P}_d\) are given by

$$\begin{aligned} {{\,\mathrm{Exp}\,}}_X(A)&= X^{\nicefrac {1}{2}}\exp (X^{-\nicefrac {1}{2}}A X^{-\nicefrac {1}{2}})X^{\nicefrac {1}{2}},\qquad A \in T_X\mathbb {P}_d \equiv \mathbb {H}_d,\\ {{\,\mathrm{Exp}\,}}_X^{-1}(Y)&= X^{\nicefrac {1}{2}}\log (X^{-\nicefrac {1}{2}}YX^{-\nicefrac {1}{2}})X^{\nicefrac {1}{2}},\qquad X, Y \in \mathbb {P}_d, \end{aligned}$$

where \(\exp (\cdot )\) and \(\log (\cdot )\) denote the matrix exponential and logarithm, respectively. Observe that using (22) we obtain the identity

$$\begin{aligned} \left\langle {{{\,\mathrm{grad}\,}}\phi (X)},\, {{{\,\mathrm{Exp}\,}}_X^{-1}(Y)} \right\rangle _X = \left\langle {X^{\nicefrac {1}{2}}\nabla ^\mathbb {H}\phi (X)X^{\nicefrac {1}{2}}},\, {\log (X^{-\nicefrac {1}{2}}YX^{-\nicefrac {1}{2}})} \right\rangle . \end{aligned}$$
(23)
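For illustration, a hedged NumPy/SciPy sketch of these primitives for real symmetric positive definite matrices (in practice one would symmetrize intermediate results and discard negligible imaginary parts returned by the matrix functions):

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

def spd_dist(X, Y):
    """Riemannian distance (21): ||log(X^{-1/2} Y X^{-1/2})||_F."""
    Xih = np.linalg.inv(sqrtm(X))
    return np.linalg.norm(logm(Xih @ Y @ Xih), 'fro')

def spd_exp(X, A):
    """Exponential map: Exp_X(A) = X^{1/2} exp(X^{-1/2} A X^{-1/2}) X^{1/2}."""
    Xh = sqrtm(X)
    Xih = np.linalg.inv(Xh)
    return Xh @ expm(Xih @ A @ Xih) @ Xh

def spd_log(X, Y):
    """Inverse exponential map: Exp_X^{-1}(Y) = X^{1/2} log(X^{-1/2} Y X^{-1/2}) X^{1/2}."""
    Xh = sqrtm(X)
    Xih = np.linalg.inv(Xh)
    return Xh @ logm(Xih @ Y @ Xih) @ Xh

def spd_grad(X, egrad):
    """Riemannian gradient (22) from the (symmetrized) Euclidean gradient."""
    egrad = 0.5 * (egrad + egrad.conj().T)
    return X @ egrad @ X

def spd_inner(X, A, B):
    """Metric (19): <A, B>_X = tr(X^{-1} A X^{-1} B)."""
    Xi = np.linalg.inv(X)
    return np.trace(Xi @ A @ Xi @ B).real
```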

With these details, Algorithm 2 can almost be applied to (18) — the most crucial remaining component is the Riemannian linear oracle, which we now describe.

4.1 Solving the Riemannian linear oracle

For solving (18), the Riemannian linear oracle (see (9)) requires solving

$$\begin{aligned} \min _{L \preceq Z \preceq U}\quad \left\langle {X_k^{\nicefrac {1}{2}}\nabla ^\mathbb {H}\phi (X_k) X_k^{\nicefrac {1}{2}}},\, {\log \left( X_k^{-\nicefrac {1}{2}}ZX_k^{-\nicefrac {1}{2}}\right) } \right\rangle . \end{aligned}$$
(24)

Problem (24) is a nonconvex optimization problem over HPD matrices. Remarkably, it turns out to have a closed-form solution. Theorem 4 presents this solution and is our main technical result for Sect. 4.

Theorem 4

Let \(L, U \in \mathbb {P}_d\) such that \(L \prec U\). Let \(S \in \mathbb {H}_d\) and \(X \in \mathbb {P}_d\) be arbitrary. Then, the solution to the optimization problem

$$\begin{aligned} \min _{L \preceq Z \preceq U}\quad {{\,\mathrm{tr}\,}}(S \log (XZX)), \end{aligned}$$
(25)

is given by \(Z = X^{-1}Q \left( P^* [-{{\,\mathrm{sgn}\,}}(D)]_+ P + {\hat{L}}\right) Q^*X^{-1}\), where \(S=QDQ^*\) is a diagonalization of S, \({\hat{U}} - {\hat{L}}=P^* P\) with \({\hat{L}}=Q^* X L X Q\) and \({\hat{U}}=Q^* X U X Q\).

For the proof of Theorem 4, we need a fundamental lemma about eigenvalues of Hermitian matrices (Lemma 4). First, we need to introduce some additional notation. For \(x \in {\mathbb {R}}^d\), let \(x^\downarrow = \left( x_1^\downarrow , \dots , x_d^\downarrow \right) \) denote the vector with entries of x in decreasing order, i.e., \(x_1^\downarrow \ge \ldots \ge x_d^\downarrow \). For \(x, y \in {\mathbb {R}}^d\) we say that x is weakly majorized by y (\(x \prec _w y\)), if \(\sum _{i=1}^k x_i^\downarrow \le \sum _{i=1}^k y_i^\downarrow \) for \(1 \le k \le d\). We can now recall the following lemma on eigenvalues of Hermitian matrices, which can be found, e.g., in [4, Problem III.6.14]:

Lemma 4

([4]) Let \(X, Y\) be HPD matrices. Then

$$\begin{aligned} \lambda ^\downarrow (X) \cdot \lambda ^\uparrow (Y) \prec _w \lambda (XY) \prec _w \lambda ^\downarrow (X) \cdot \lambda ^\downarrow (Y), \end{aligned}$$

where \(\lambda ^\downarrow (X)\) (\(\lambda ^\uparrow (X)\)) denotes the vector of eigenvalues of X arranged in decreasing (increasing) order, \(\prec _w\) denotes the weak majorization order and \(\cdot \) denotes the elementwise product. If \(X, Y\) are Hermitian, we have

$$\begin{aligned} \left\langle {\lambda ^\downarrow (X)},\, {\lambda ^\uparrow (Y)} \right\rangle \le {{\,\mathrm{tr}\,}}(XY) \le \left\langle {\lambda ^\downarrow (X)},\, {\lambda ^\downarrow (Y)} \right\rangle \; , \end{aligned}$$

with equality if the product XY is symmetric.

Remark 2

We recall a well-known fact on solving the trace minimization problem \(\min _{Y} {{\,\mathrm{tr}\,}}{XY}\). Diagonalizing X and Y allows for rewriting the minimization problem as \(\min _S {{\,\mathrm{tr}\,}}{DQ^T S Q}\), where D, S are diagonal and Q orthogonal. To see for which S the minimum is attained, note that \(\min \sum _{ij}d_i s_j q_{ij}^2 \ge \min \sum _{ij} d_i s_j r_{ij}\), where \((r_{ij})\) is doubly stochastic. By Birkhoff’s theorem the minimum occurs at a permutation matrix, i.e., at \(\min \sum _{i} d_i s_{\sigma (i)}\). By the rearrangement inequality this is minimized if the two orderings of eigenvalues are reversed relative to each other.

Proof

(Theorem 4) First, introduce the variable \(Y=XZX\); then (25) becomes

$$\begin{aligned} \min _{L' \preceq Y \preceq U'}\quad {{\,\mathrm{tr}\,}}(S\log Y), \end{aligned}$$
(26)

where the constraints have been modified to \(L' \preceq Y \preceq U'\) with \(L'=XLX\) and \(U'=XUX\). Diagonalizing S as \(S=QDQ^*\), we see that \({{\,\mathrm{tr}\,}}(S\log Y) = {{\,\mathrm{tr}\,}}(D\log W)\), where \(W=Q^*YQ\). Thus, instead of (26) it suffices to solve

$$\begin{aligned} \min _{L'' \preceq W \preceq U''}\quad {{\,\mathrm{tr}\,}}(D\log W), \end{aligned}$$
(27)

where \(L''=Q^*L'Q\) and \(U''=Q^*U'Q\). We have strict inequality \(U'' \succ L''\), thus, our constraints are \(0 \prec W-L'' \preceq U''-L''\), which we may rewrite as \(0 \prec R \preceq I\), where \(R = (U''-L'')^{-\nicefrac {1}{2}}(W-L'')(U''-L'')^{-\nicefrac {1}{2}}\). Notice that

$$\begin{aligned} W = (U''-L'')^{\nicefrac {1}{2}}R(U''-L'')^{\nicefrac {1}{2}} + L''. \end{aligned}$$

Thus, problem (27) now turns into

$$\begin{aligned} \min _{0 \prec R \preceq I }\quad {{\,\mathrm{tr}\,}}(D\log (P^*R P+L'')) \; , \end{aligned}$$
(28)

where \(U'' - L''=P^* P\). We want to construct an R that attains the minimum. Note that using Lemma 4 we see that

$$\begin{aligned} {{\,\mathrm{tr}\,}}(D \log (P^* R P+L'')) \ge \left\langle {\lambda ^\uparrow (D)},\, {\lambda ^\downarrow \left( \log (P^* R P + L'') \right) } \right\rangle \; . \end{aligned}$$

Remark 2 ensures that the minimum attains the lower bound, i.e., the minimum is attained by matching the eigenvalues of D and \(\log (P^*R P+L'')\) in reverse orders. Note that the matrix logarithm \(\log (\cdot )\) and the map \(R \mapsto P^* R P +L''\) are operator monotone. Now, without loss of generality, assume that R is diagonal and recall that, by construction \(0 \prec R \preceq I\). Consider

$$\begin{aligned} r_{ii} = {\left\{ \begin{array}{ll} 0 &{} d_{ii} \ge 0\\ 1 &{} d_{ii} < 0. \end{array}\right. } \end{aligned}$$
(29)

This ensures that

  1.

    if \(d_{ii}>0\), the corresponding element of \(\lambda ^\downarrow (\log (P^* R P +L''))\) is minimized,

  2.

    if \(d_{ii} \le 0\), the corresponding element of \(\lambda ^\downarrow (\log (P^* R P+L''))\) is maximized.

With that, the minimum of the trace is attained. Thus, we see that

$$\begin{aligned} Y=Q(P^* R P+L'')Q^*=Q(P^* \left[ -{{\,\mathrm{sgn}\,}}(D)\right] _+ P+L'')Q^* \end{aligned}$$

and we immediately obtain the optimal \(Z=X^{-1}YX^{-1}\). \(\square \)

Remark 3

Computing the optimal direction \(Z_k\) takes one Cholesky factorization, two matrix square roots (Schur method), eight matrix multiplications, and one eigenvalue decomposition. This gives a complexity of \(O(N^3)\); in terms of flop counts, it amounts to approximately \(\frac{1}{3} N^3 + 2 \times 28.{\bar{3}} N^3 + 8 \times 2N^3 + 20 N^3 \approx 93 N^3\).
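To make Theorem 4 concrete, the following NumPy sketch evaluates the closed-form solution for real symmetric inputs; a Cholesky factorization is one valid choice of the factor P with \(P^*P = {\hat{U}}-{\hat{L}}\).

```python
import numpy as np

def linear_oracle_spd(S, X, L, U):
    """Closed-form solution of  min_{L <= Z <= U} tr(S log(X Z X))  (Theorem 4).

    S is symmetric, X is positive definite, and L, U are positive definite with L < U.
    Sketch only: real symmetric case; Cholesky provides P with P^* P = U_hat - L_hat.
    """
    d, Q = np.linalg.eigh(S)                   # S = Q diag(d) Q^*
    L_hat = Q.conj().T @ X @ L @ X @ Q
    U_hat = Q.conj().T @ X @ U @ X @ Q
    C = np.linalg.cholesky(U_hat - L_hat)      # U_hat - L_hat = C C^*
    P = C.conj().T                             # hence P^* P = U_hat - L_hat
    R = np.diag((d < 0).astype(float))         # R = [-sgn(D)]_+
    Y = Q @ (P.conj().T @ R @ P + L_hat) @ Q.conj().T
    Xi = np.linalg.inv(X)
    return Xi @ Y @ Xi                         # Z = X^{-1} Y X^{-1}
```

In the RFW iteration for (18), this routine would be called with \(S = X_k^{\nicefrac {1}{2}}\nabla ^\mathbb {H}\phi (X_k)X_k^{\nicefrac {1}{2}}\) and with the theorem's X set to \(X_k^{-\nicefrac {1}{2}}\); compare (24).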

4.2 Application to the Riemannian mean

4.2.1 Computing the Riemannian mean

Statistical inference frequently involves computing averages of input quantities. Typically encountered data lies in Euclidean spaces where arithmetic means are the “natural” notions of averaging. However, the Euclidean setting is not always the most natural or efficient way to represent data. Many applications involve non-Euclidean data such as graphs, strings, or matrices [38, 48]. In such applications, it is often beneficial to represent the data in its natural space and adapt classical tools to the specific setting. In other cases, a problem might be very hard to solve in Euclidean space but may become more accessible when viewed through a different geometric lens. Specifically, with a suitable choice of the Riemannian metric, a Euclidean nonconvex problem may be geodesically convex. Hence, we can give global convergence guarantees for solving such problems with Riemannian optimization methods.

This section considers one of these cases, namely the problem of determining the geometric matrix mean (Karcher mean problem). While there exists an intuitive notion for the geometric mean of sets of positive real numbers, this notion does not immediately generalize to sets of positive definite matrices due to the lack of commutativity on matrix spaces. Over a collection of Hermitian, positive definite matrices the geometric mean can be viewed as the geometric optimization problem

$$\begin{aligned} G := {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{X \succ 0}}\quad \left[ \phi (X) = \sum \nolimits _{i=1}^m w_i\delta _R^2(X,A_i)\right] \; , \end{aligned}$$
(30)

where \(\delta _R\) denotes the Riemannian distance (21). In a Euclidean setting, the problem is nonconvex. However, one can view Hermitian positive definite matrices as points on a Riemannian manifold and compute the geometric mean as the Riemannian centroid. The corresponding optimization problem (Eq. 30) is geodesically convex [51]. In this section, we look at the problem through both geometric lenses and provide efficient algorithmic solutions while illustrating the benefits of switching geometric lenses in geometric optimization problems.

There exists a large body of work on the problem of computing the geometric matrix mean [30]. Classic algorithms like Newton’s method or Gradient Descent (GD) have been successfully applied to the problem. Standard toolboxes implement efficient variants of GD like Steepest Descent or Conjugate Gradient (Manopt [11]) or a Richardson-like gradient descent (Matrix Means Toolbox [9]). Recent work by Yuan et al. [64] analyzes condition numbers of Hessians in Riemannian and Euclidean steepest-descent approaches, which provides theoretical arguments for the good performance of Riemannian approaches.

Recently, T. Zhang developed a majorization-minimization approach with asymptotic linear convergence [67]. In this section, we use the Riemannian Frank-Wolfe algorithm for computing the geometric matrix mean. Here, we exploit the strong geodesic convexity of the problem to achieve global linear convergence. To complement this analysis, we show that recent advances in nonconvex analysis [35] can be used to develop a Frank-Wolfe scheme (EFW) for the nonconvex Euclidean case (see Appendix 1).

In the simple case of two positive definite matrices X and Y, one can view the (weighted) geometric mean as a point on the geodesic joining them, with the unweighted mean being their metric midpoint (\(t=\nicefrac {1}{2}\)); it is computed by [34]

$$\begin{aligned} G(X,Y)= X \#_t Y = X^{\frac{1}{2}} \left( X^{-\frac{1}{2}} Y X^{-\frac{1}{2}} \right) ^t X^{\frac{1}{2}} \; . \end{aligned}$$
(31)

More generally, for a collection of M matrices, the geometric mean is the solution of a minimization problem over the sum of squared distances [6],

$$\begin{aligned} G(A_1,\ldots , A_M)= {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{X \succ 0}} \sum _{i=1}^M \delta _R^2 (X, A_i) \; , \end{aligned}$$
(32)

where \(\delta _R(X,Y)=\Vert {\log (X^{-\frac{1}{2}} Y X^{-\frac{1}{2}})}\Vert _F\) as introduced earlier. Here we consider the more general weighted geometric mean:

$$\begin{aligned} G(A_1,\ldots , A_M)= {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{X \succ 0}} \sum _{i=1}^M w_i \delta _R^2 (X, A_i) \; . \end{aligned}$$
(33)

E. Cartan showed in a Riemannian setting that a global minimum exists, which led to the term Cartan mean frequently used in the literature. In this setting, one can view the collection of matrices as points on a Riemannian manifold. H. Karcher associated the minimization problem with that of finding centers of mass on these manifolds [32], hence motivating a second term to describe the geometric matrix mean (Karcher mean).

The geometric matrix mean enjoys several key properties. We list below the ones of crucial importance to our paper and refer the reader to [37, 40] for a more extensive list. To state these results, we recall the general form of the two other basic means of operators: the (weighted) harmonic and arithmetic means, denoted by H and A respectively.

$$\begin{aligned} H := \left( \sum \nolimits _{i=1}^M w_i A_i^{-1}\right) ^{-1},\qquad A := \sum \nolimits _{i=1}^M w_i A_i. \end{aligned}$$

Then, one can show the following well-known operator inequality that relates H, G, and A:

Lemma 5

(Means Inequality, [5]) Let \(A_1,\ldots ,A_M > 0\), and let H, G, and A denote their (weighted) harmonic, geometric, and arithmetic means, respectively. Then,

$$\begin{aligned} H \preceq G \preceq A \; . \end{aligned}$$
(34)

The key computational burden of all our algorithms lies in computing the gradient of the objective function (30). A short calculation (see e.g., [5, Ch. 6]) shows that if \(f(X) = \delta _R^2(X,A)\), then \(\nabla f(X) = X^{-1}\log (XA^{-1})\). Thus, we immediately obtain

$$\begin{aligned} \nabla \phi (X) = \sum \nolimits _i w_iX^{-1}\log (XA_i^{-1}). \end{aligned}$$

4.2.2 Implementation

We compute the geometric matrix mean with Algorithm 2. For the manifold of positive definite matrices, line 3 can be written as

$$\begin{aligned} Z_k \leftarrow {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{H \preceq Z \preceq A}}\left\langle {X_k^{\nicefrac {1}{2}}\nabla \phi (X_k)X_k^{\nicefrac {1}{2}}},\, {\log \left( X_k^{-\nicefrac {1}{2}}ZX_k^{-\nicefrac {1}{2}}\right) } \right\rangle \; . \end{aligned}$$
(35)

Note that the operator inequality \(H \preceq G \preceq A\) given by Lemma 5 plays a crucial role: it shows that the optimal solution lies in a compact set, so we may as well impose this compact set as a constraint on the optimization problem (i.e., we set \(\mathcal {X}=\lbrace H \preceq Z \preceq A \rbrace \)). We implement the linear oracles as discussed above: in the Euclidean case, a closed-form solution is given by

$$\begin{aligned} Z = H + P^*Q[-{{\,\mathrm{sgn}\,}}(\varLambda )]_{+}Q^*P, \end{aligned}$$
(36)

where \(A-H = P^*P\) and \(P\nabla \phi (X_k)P^* = Q\varLambda Q^*\) (see Thm. 7). Analogously, for the Riemannian case, the “linear” oracle

$$\begin{aligned} \min _{H \preceq Z \preceq A}\quad \left\langle {\nabla \phi (X_k)},\, {\log (Z)} \right\rangle \; , \end{aligned}$$
(37)

is well defined and solved by

$$\begin{aligned} Z=Q\left( P^*[-{{\,\mathrm{sgn}\,}}(\varLambda )]_{+} P + {\hat{H}} \right) Q^{*} \; , \end{aligned}$$
(38)

with \({\hat{A}}-{\hat{H}}=P^{*}P\), \({\hat{A}}=Q^* A Q\), \({\hat{H}}=Q^{*}HQ\) and \(\nabla \phi (X_k) = Q\varLambda Q^*\) (see Thm. 4). The resulting Frank-Wolfe method for the geometric matrix mean is given by Algorithm 3.

[Algorithm 3: RFW for the geometric matrix mean; pseudocode figure]
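For illustration, the following Python sketch shows the structure of this procedure; it is not the authors' Matlab implementation. It reuses the hypothetical helpers `spd_exp` and `spd_log` from the sketch in Sect. 4 and `linear_oracle_spd` from Sect. 4.1, solves the oracle (35) via Theorem 4, and assumes real symmetric positive definite inputs.

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def karcher_mean_rfw(As, w, num_iters=100):
    """Sketch of Algorithm 3: RFW for the weighted geometric (Karcher) mean."""
    # Constraint interval H <= Z <= A from the means inequality (Lemma 5).
    H = np.linalg.inv(sum(wi * np.linalg.inv(Ai) for wi, Ai in zip(w, As)))
    A = sum(wi * Ai for wi, Ai in zip(w, As))
    X = H.copy()                                   # initialize, e.g., at the harmonic mean
    for k in range(num_iters):
        Xh = sqrtm(X)
        Xih = np.linalg.inv(Xh)
        # Euclidean gradient: sum_i w_i X^{-1/2} log(X^{1/2} A_i^{-1} X^{1/2}) X^{-1/2}.
        egrad = sum(wi * Xih @ logm(Xh @ np.linalg.inv(Ai) @ Xh) @ Xih
                    for wi, Ai in zip(w, As))
        # Riemannian "linear" oracle (35), solved via Theorem 4 with
        # S = X^{1/2} egrad X^{1/2} and the theorem's X set to X^{-1/2}.
        S = Xh @ egrad @ Xh
        Z = linear_oracle_spd(S, Xih, H, A)
        # Geodesic step toward Z with step size s_k = 2/(k+2).
        s = 2.0 / (k + 2.0)
        X = spd_exp(X, s * spd_log(X, Z))
    return X
```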

4.3 Application to Bures-Wasserstein barycenters

4.3.1 Computing Bures-Wasserstein barycenters on the Gaussian density manifold

As a second application of RFW, we consider the computation of Bures-Wasserstein barycenters of multivariate (centered) Gaussians. This application is motivated by optimal transport theory; in particular, the Bures-Wasserstein barycenter is the solution to the multi-marginal transport problem [8, 43]: Let \(\lbrace A_i \rbrace _{i=1}^m \subset {\mathbb {P}}_d\) and let \(w=(w_1,\ldots , w_m)\) be a vector of weights (\(\sum _i w_i =1; \; w_i >0, \forall i\)). The minimization task

$$\begin{aligned} \min _{\begin{array}{c} X \in {\mathbb {P}}_d \\ X \succ 0 \end{array}} \sum _{i=1}^m w_i d_W^2(X,A_i) \end{aligned}$$
(39)

is called the multi-marginal transport problem. Its unique minimizer

$$\begin{aligned} \varOmega (w; \lbrace A_i \rbrace ) = {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{X \succ 0}} \sum _{i=1}^m w_i d_W^2 (X,A_i) \in {\mathbb {P}}_d \end{aligned}$$

is the Bures-Wasserstein barycenter. Here, \(d_W\) denotes the Wasserstein distance

$$\begin{aligned} d_W(X,Y) = \bigl [\mathrm{tr} (X+Y) - 2\,\mathrm{tr} \bigl (\bigl ( X^{1/2} Y X^{1/2} \bigr )^{1/2} \bigr ) \bigr ]^{1/2} \; . \end{aligned}$$
(40)
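For reference, a small NumPy/SciPy sketch of (40) for real symmetric positive definite inputs:

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_dist(X, Y):
    """Bures-Wasserstein distance (40) between positive definite matrices X and Y."""
    Xh = sqrtm(X)
    cross = sqrtm(Xh @ Y @ Xh)                         # (X^{1/2} Y X^{1/2})^{1/2}
    val = (np.trace(X + Y) - 2.0 * np.trace(cross)).real
    return np.sqrt(max(val, 0.0))                      # clamp tiny negative rounding errors
```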

To see the connection to optimal transport, note that Eq. 39 corresponds to a least-squares optimization over a set of multivariate centered Gaussians with respect to the Wasserstein distance. The Gaussian density manifold is isomorphic to the manifold of symmetric positive definite matrices, which allows for a direct application of RFW and the setup in the previous section to Eq. 39, albeit with a different set of constraints. In the following, we discuss a suitable set of constraints and adapt RFW for the computation of the Bures-Wasserstein barycenter (short: Wasserstein mean).

First, note that as for the Karcher mean, one can show that the Wasserstein mean of two matrices is given in closed form; namely as

$$\begin{aligned} X \lozenge _t Y = (1-t)^2 X + t^2 Y + t(1-t) \left( (XY)^{1/2} + (YX)^{1/2} \right) \; . \end{aligned}$$
(41)

However, the computation of the barycenter of m matrices (\(m>2\)) requires solving a quadratic optimization problem. Note that Eq. 41 defines a geodesic map from X to Y. Unfortunately, an analog of the means inequality does not hold in the Wasserstein case. However, one can show that the arithmetic matrix mean always gives an upper bound, providing one-sided constraints [7]:

$$\begin{aligned} A_i \lozenge _t A_j \preceq \frac{A_i +A_j}{2} \; , \end{aligned}$$

and similarly for the arithmetic mean \(A(w; \lbrace A_i \rbrace )\) of m matrices. Moreover, one can show that \(\alpha I \preceq X\) [8], where \(\alpha \) is the minimal eigenvalue over \(\lbrace A_i \rbrace _{i=1}^m\), i.e.

$$\begin{aligned} \alpha = \min _{1 \le j \le m} \lambda _{\mathrm{min}} (A_j) \; . \end{aligned}$$

In summary, this gives the following constraints on the Wasserstein mean:

$$\begin{aligned} \alpha I \preceq \varOmega (w; \lbrace A_i \rbrace ) \preceq A(w; \lbrace A_i \rbrace ) \; . \end{aligned}$$
(42)

Next, we will derive the gradient of the objective function. According to Eq. 39, the objective is given as

$$\begin{aligned} \phi (X) := \sum \nolimits _j w_j \bigl [ {{\,\mathrm{tr}\,}}(A_j) + {{\,\mathrm{tr}\,}}(X) - 2 {{\,\mathrm{tr}\,}}\bigl (\bigl (A_j^{1/2} X A_j^{1/2} \bigr )^{1/2} \bigr ) \bigr ] \; . \end{aligned}$$
(43)

Two expressions for the gradient of (43) are derived in the following.

Lemma 6

Let \(\phi \) be given by (43). Then, its gradient is

$$\begin{aligned} \nabla \phi (X, {\mathcal {A}}) = \sum _j w_j \left( I - A_j \# X^{-1} \right) \; . \end{aligned}$$
(44)

Proof

$$\begin{aligned} \nabla \phi (X, {\mathcal {A}})&= \underbrace{\frac{d}{dX} \sum _j w_j {{\,\mathrm{tr}\,}}{A_j}}_{\mathrm{vanishes}} + \frac{d}{dX} \sum _j w_j \Big ( {{\,\mathrm{tr}\,}}(X) - 2 {{\,\mathrm{tr}\,}}\big (A_j^{1/2} X A_j^{1/2} \big )^{1/2} \Big ) \\&= \sum _j w_j \frac{d}{dX} {{\,\mathrm{tr}\,}}(X) - \sum _j w_j 2 \frac{d}{dX} {{\,\mathrm{tr}\,}}\big (A_j^{1/2} X A_j^{1/2} \big )^{1/2} \; . \end{aligned}$$

Note that \(\frac{d}{dX} {{\,\mathrm{tr}\,}}(X) = I\). Consider

$$\begin{aligned} 2 \frac{d}{dX} {{\,\mathrm{tr}\,}}\big ( A^{1/2} X A^{1/2} \big )^{1/2}&\overset{(1)}{=} 2 \frac{d}{dX} \Big [ \Big ( \big ( A^{1/2} X A^{1/2} \big )^{1/2} \Big )^* \Big (\big ( A^{1/2} X A^{1/2} \big )^{1/2} \Big ) \Big ]^{1/2} \\&\overset{(2)}{=} 2 \frac{d}{dX} \big ( A^{1/2} X A^{1/2} \big )^{1/2} \\&= A^{1/2} \big ( A^{1/2} X A^{1/2} \big )^{-1/2} A^{1/2} \\&= A^{1/2} \big ( A^{-1/2} X^{-1} A^{-1/2} \big )^{1/2} A^{1/2} \\&= A \# X^{-1} \; , \end{aligned}$$

where (1) follows from the chain rule and (2) from A and X being symmetric and positive definite: If A and X are symmetric and positive definite, then A has a unique positive definite root \(A^{1/2}\) and therefore

$$\begin{aligned} \big ( A^{1/2} X A^{1/2} \big )^* = \big ( {A^{1/2}}^* X^* {A^{1/2}}^* \big ) = A^{1/2} X A^{1/2} \; . \end{aligned}$$

Putting everything together, the desired expression follows as

$$\begin{aligned} \nabla \phi (X, {\mathcal {A}}) = \sum _j w_j \big ( I - A_j \# X^{-1} \big ) \; . \end{aligned}$$
(45)

\(\square \)

4.3.2 Implementation

Given a set of constraints and an expression for the gradient, we can solve Eq. 39 with RFW using the following setup: We write the log-linear oracle as

$$\begin{aligned} Z_k \in {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{\alpha I \preceq Z \preceq A}} \langle {{\,\mathrm{grad}\,}}\phi (X), {{\,\mathrm{Exp}\,}}_X^{-1}(Z) \rangle \; , \end{aligned}$$
(46)

where \({{\,\mathrm{grad}\,}}\phi (X)\) is the Riemannian gradient of \(\phi \) and \({{\,\mathrm{Exp}\,}}_X^{-1}(Z) \) the inverse exponential map with respect to the Riemannian metric. The constraints are given by Eq. 42. Since the constraint set is given by an interval of the form \(L \preceq Z \preceq U\), the oracle can be solved in closed form using Theorem 4. The resulting algorithm is given in Algorithm 3.
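A hedged NumPy/SciPy sketch of this setup, i.e., the Euclidean gradient (44) and the bounds (42); these helpers would plug into the RFW loop exactly as in Algorithm 3, with \((H, A)\) replaced by \((\alpha I, A(w; \lbrace A_i \rbrace ))\), and the matrix geometric mean \(A\#B\) is formed explicitly.

```python
import numpy as np
from scipy.linalg import sqrtm

def geometric_mean_pair(A, B):
    """Two-matrix geometric mean A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    Ah = sqrtm(A)
    Aih = np.linalg.inv(Ah)
    return Ah @ sqrtm(Aih @ B @ Aih) @ Ah

def bw_egrad(X, As, w):
    """Euclidean gradient (44): sum_j w_j (I - A_j # X^{-1})."""
    Xi = np.linalg.inv(X)
    d = X.shape[0]
    return sum(wj * (np.eye(d) - geometric_mean_pair(Aj, Xi)) for wj, Aj in zip(w, As))

def bw_constraints(As, w):
    """Bounds (42): alpha * I <= Omega <= A(w; {A_i})."""
    d = As[0].shape[0]
    alpha = min(np.linalg.eigvalsh(Aj).min() for Aj in As)
    A_mean = sum(wj * Aj for wj, Aj in zip(w, As))
    return alpha * np.eye(d), A_mean
```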

5 Approximately solving the Riemannian “linear” oracle

In the previous section, we have given examples of applications where the Riemannian “linear” oracle can be solved in closed form, rendering the resulting RFW algorithm a practical method. Unfortunately, the “linear” oracle is in general a nonconvex subproblem. Therefore, it can be challenging to find efficient solutions for constrained problems in practice.

One remedy for such situations is to solve the “linear” oracle only approximately. In the following, we show that we can recover sublinear convergence rates for RFW, even if we solve the “linear” oracle only approximately. This extension greatly widens the range of possible applications for RFW. For instance, while we currently do not have closed-form solutions for the “linear” oracle in some of the examples in Sect. 2.3, we could find approximate solutions via relaxations or by iteratively solving the respective subproblem.

In the following, we say that \(Z' \in \mathcal {X}\) is a \(\delta \)-approximate linear minimizer, if

$$\begin{aligned} \left\langle {{{\,\mathrm{grad}\,}}\; \phi (X_k)},\, {{{\,\mathrm{Exp}\,}}_{X_k}^{-1}(Z')} \right\rangle \le \min _{Z \in \mathcal {X}} \left\langle {{{\,\mathrm{Exp}\,}}_{X_k}^{-1}(Z)},\, {{{\,\mathrm{grad}\,}}\; \phi (X_k)} \right\rangle + \frac{1}{2} \delta \eta M_{\phi } \; . \end{aligned}$$

We give the following sublinear convergence guarantee:

Theorem 5

Let \(X^*\) be a minimum of a geodesically convex function \(\phi \) and \(\delta \ge 0\) the accuracy to which the “linear” oracle is solved in each round. Then, the sequence of iterates \(X_k\) generated by Algorithm 2 satisfies

$$\begin{aligned} \phi (X_k)-\phi (X^*) \le \frac{2M_{\phi }}{k+2} (1+\delta ) \; . \end{aligned}$$

The proof relies on adapting proof techniques from [17, 29] to the Riemannian setting. It utilizes the following auxiliary lemma:

Lemma 7

For a step size \(\eta \in (0,1)\) and accuracy \(\delta \), we have

$$\begin{aligned} \phi (X_{k+1}) \le \phi (X_k) + \eta \left\langle {{{\,\mathrm{grad}\,}}\; \phi (X_k)},\, {{{\,\mathrm{Exp}\,}}_{X_k}^{-1}(Z')} \right\rangle + \frac{1}{2}\eta ^2 M_{\phi }(1+\delta ) \; . \end{aligned}$$

A proof of the lemma can be found in Appendix 1. The proof of Theorem 5 then follows from Lemmas 3 and 7, analogously to the proof of Theorem 1.

6 Computational experiments

In this section, we will remark on the implementation of Algorithm 3 and show numerical results for computing Riemannian centroids for different parameter choices. To evaluate the efficiency of our method, we compare the performance of RFW against that of a selection of state-of-the-art methods. Additionally, we use Algorithm 3 to compute Wasserstein barycenters of positive definite matrices. All computational experiments are performed using Matlab.

6.1 Computational considerations

When implementing the algorithm we can take advantage of the positive definiteness of the input matrices. For example, if using Matlab, rather than computing \(X^{-1}\log (XA_i^{-1})\), it is preferable to compute

$$\begin{aligned} X^{-\nicefrac {1}{2}}\log (X^{\nicefrac {1}{2}}A_i^{-1}X^{\nicefrac {1}{2}})X^{-\nicefrac {1}{2}}, \end{aligned}$$

because both \(X^{-\nicefrac {1}{2}}\) and \(\log (X^{\nicefrac {1}{2}}A_i^{-1}X^{\nicefrac {1}{2}})\) can be computed by suitable eigendecomposition. In contrast, computing \(\log (XA_i^{-1})\) invokes the matrix logarithm (logm in Matlab), which can be much slower.
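A sketch of this computation via symmetric eigendecompositions (real case), avoiding a general matrix logarithm:

```python
import numpy as np

def spd_sqrt_invsqrt(X):
    """X^{1/2} and X^{-1/2} from one symmetric eigendecomposition."""
    lam, V = np.linalg.eigh(X)
    return V @ np.diag(np.sqrt(lam)) @ V.T, V @ np.diag(1.0 / np.sqrt(lam)) @ V.T

def spd_logm(M):
    """log(M) for a symmetric positive definite M via eigendecomposition."""
    lam, V = np.linalg.eigh(M)
    return V @ np.diag(np.log(lam)) @ V.T

def grad_term(X, Ai):
    """X^{-1/2} log(X^{1/2} A_i^{-1} X^{1/2}) X^{-1/2}, equal to X^{-1} log(X A_i^{-1})."""
    Xh, Xih = spd_sqrt_invsqrt(X)
    return Xih @ spd_logm(Xh @ np.linalg.inv(Ai) @ Xh) @ Xih
```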

To save on computation time, we prefer to use a diminishing scalar as the stepsize in Algorithm 3. In principle, this simple stepsize selection could be replaced by a more sophisticated Armijo-like line-search or even exact minimization by solving

$$\begin{aligned} \alpha _k \leftarrow {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{\alpha \in [0,1]}}\quad \phi (X_k + \alpha (Z_k-X_k)) \; . \end{aligned}$$
(47)

This stepsize tuning may accelerate the convergence of the algorithm, but it must be combined with a more computationally intensive strategy of “away” steps [23] to obtain a geometric rate of convergence. However, we prefer Algorithm 3 for its simplicity and efficiency.

Theorems 1 and 3 show that Algorithm 2 attains global (non-asymptotic) iteration complexities of \(O(1/\epsilon )\) (g-convex RFW) and \(O(1/\epsilon ^2)\) (nonconvex RFW) for reaching an accuracy of \(\epsilon \). However, by further exploiting the simple structure of the constraint set and the “curvature” of the objective function, we might obtain a stronger convergence result.

6.2 Numerical results

We present numerical results for computing the Riemannian and Wasserstein means for sets of positive definite matrices. We test our methods on both well- and ill-conditioned matrices; the generation of sample matrices is described in the appendix. An extended arXiv version of the paper [61] contains additional examples and numerical experiments.

6.2.1 Computing the Riemannian Mean

Fig. 1 Performance of EFW and RFW in comparison with state-of-the-art methods for well-conditioned (left) and ill-conditioned inputs (right) of different sizes (n: size of matrices, m: number of matrices, MaxIt: maximum number of iterations). All tasks are initialized with the harmonic mean \(X_0=HM\)

Fig. 2 Performance of EFW and RFW in comparison with state-of-the-art methods for well-conditioned (left) and ill-conditioned (right) inputs and different initializations: \(X_0=HM\) (top) and \(X_0=A_1\) (bottom)

To test the efficiency of our method, we implement RFW (Algorithm 3) in Matlab and compare its performance on the problem of computing the Riemannian mean against related state-of-the-art Riemannian optimization methods:

  1. Riemannian L-BFGS (R-LBFGS) is a quasi-Newton method that iteratively approximates the Hessian to incorporate second-order information [63].

  2. Riemannian Barzilai–Borwein (RBB) is a first-order gradient-based method for constrained and unconstrained optimization; it chooses the stepsize using second-order information obtained from an approximate Hessian [28]. We use the Manopt implementation of RBB.

  3. The Matrix Means Toolbox (MM) [9] is an efficient Matlab toolbox for matrix mean computations. Its implementation of the geometric mean problem uses a Richardson-like iteration of the form

     $$\begin{aligned} X_{k+1} \leftarrow X_k - \alpha X_k \sum _{i=1}^n \log (A_i^{-1} X_k) \; , \end{aligned}$$

     with a suitable \(\alpha >0\) (a short sketch of this iteration is given after this list).

  4. Zhang [67] is a recently published majorization-minimization method for computing the geometric matrix mean.
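As referenced in item 3, the Richardson-like iteration of MM can be transcribed into a few lines of Matlab. The following is our own minimal sketch of the displayed update, not the toolbox's actual implementation; A is assumed to be a cell array of positive definite matrices, and alpha > 0 and MaxIt are user-chosen parameters.

% Minimal sketch of the Richardson-like iteration from item 3 (our own
% transcription of the displayed update, not the MM toolbox code).
for k = 1:MaxIt
    S = zeros(size(X));
    for i = 1:numel(A)
        S = S + logm(A{i} \ X);      % log(A_i^{-1} * X_k)
    end
    X = X - alpha * X * S;           % X_{k+1} = X_k - alpha * X_k * sum_i log(A_i^{-1} X_k)
end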

Note that this selection reflects a broad spectrum of commonly used methods for Riemannian optimization. It ranges from highly specialized approaches targeted at the Karcher mean (MM) to more general and versatile methods (e.g., R-LBFGS). A careful evaluation should take both computing time and algorithmic features, such as the number of calls to oracles and loss functions, into account. We perform and further discuss such an evaluation below. For completeness, we also compare against RFW’s Euclidean counterpart EFW (Algorithm 4).

Fig. 3 Gradient norms at each iteration for EFW and RFW in comparison with R-LBFGS

We generate a set A of m positive definite matrices of size n and compute the geometric mean with RFW and EFW as specified in Algorithms 3 and 4, respectively. To evaluate the performance of the algorithms, we compute the cost function

$$\begin{aligned} f(X,A)=\sum _{i=1}^m \Vert {\log \left( X^{-\frac{1}{2}} A_i X^{-\frac{1}{2}} \right) }\Vert _F ^2 \; , \end{aligned}$$
(48)

after each iteration. We further include an experiment reporting the gradient norm \(\Vert {{{\,\mathrm{grad}\,}}f}\Vert _F\).
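As a point of reference, the cost (48) can be evaluated with a few lines of Matlab; the sketch below is our own helper (not the benchmarking code used in the experiments) and assumes that A is stored as a cell array of positive definite matrices.

function f = karcher_cost(X, A)
% Evaluate (48): sum_i || log( X^(-1/2) * A_i * X^(-1/2) ) ||_F^2.
    [V, D] = eig((X + X') / 2);
    Xih    = V * diag(1 ./ sqrt(diag(D))) * V';   % X^(-1/2)
    f = 0;
    for i = 1:numel(A)
        S   = Xih * A{i} * Xih;                   % X^(-1/2) * A_i * X^(-1/2), Hermitian PD
        lam = eig((S + S') / 2);                  % its (positive) eigenvalues
        f   = f + sum(log(lam).^2);               % ||log(S)||_F^2 = sum_j log(lambda_j)^2
    end
end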

Figure 1 shows performance comparisons of all methods for different parameter choices and condition numbers. In a second experiment (Fig. 2) we compare inputs with different condition numbers for two initializations, the harmonic mean and a matrix \(A_i \in A\). We observe that RFW outperforms RBB for all inputs and initializations. Furthermore, RFW performs competitively in comparison with R-LBFGS, Zhang’s method, and MM. The competitive performance of RFW is consistent across different initializations. In a third experiment, we compare the accuracy reached by RFW with that of R-LBFGS, which was the most accurate Riemannian state-of-the-art method in our experiments (Fig. 3). We observe that RFW reaches medium accuracy quickly; however, R-LBFGS ultimately attains a higher accuracy.

In comparison with EFW, we observe that RFW attains similar performance and accuracy; the slight numerical advantage of EFW may be due to implementation differences between Riemannian and Euclidean methods.

In addition, we include a comparison of the number of internal calls to the cost and gradient functions (Table 1) for all methods. Note that this comparison evaluates algorithmic features and is therefore machine-independent. The reported numbers were obtained by averaging counts over ten experiments with identical parameter choices. We observe that RFW (and EFW) avoid internal calls to the gradient and cost functions, requiring only a single gradient evaluation per iteration. This accounts for the faster convergence observed in the plots.

Table 1 Number of calls to the gradient and cost functions until convergence, averaged over ten experiments with \(N = 40\), \(M = 10\), and \(K = 30\). Note that the competitive performance of EFW/RFW is partially due to avoiding internal calls to the cost function, which significantly increases the efficiency of both methods compared to the other methods

6.2.2 Computing Wasserstein barycenters

Fig. 4 Performance of RFW for computing the Wasserstein mean of m well-conditioned (left) and m ill-conditioned (right) matrices of size n over MaxIt iterations and three initializations: (1) \(X_0\) is set to one of the matrices in the set A, i.e., \(X_0 = A_i \in A\); (2) \(X_0\) is initialized as the arithmetic mean of A, i.e., the upper bound of the constraint set; and (3) \(X_0\) is set to the lower bound \(\alpha I\) of the constraint set, where \(\alpha \) is the minimal eigenvalue over A

To demonstrate the versatility of our approach, we use RFW to compute Bures-Wasserstein barycenters. For this, we adapt the setup described above to implement Algorithm 3. Figure 4 shows the performance of RFW on both well- and ill-conditioned matrices. We compare three different initializations in each experiment: in (1), we choose an arbitrary element of A as the starting point (\(X_0 \sim {\mathcal {U}}(A)\)); in (2), \(X_0\) is set to the arithmetic mean of A, the upper bound of the constraint set; and in (3), \(X_0=\alpha I\), the lower bound, where \(\alpha \) is the smallest eigenvalue over A. Our results suggest that RFW performs well when initialized from any point in the feasible region, even on sets of ill-conditioned matrices.
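For reference, the three initializations can be set up along the following lines; this is a minimal sketch with variable names of our own choosing, assuming A is stored as a cell array of positive definite matrices.

% Three initializations for the Wasserstein-mean experiments (our own sketch).
m     = numel(A);                                  % number of input matrices
X0_1  = A{randi(m)};                               % (1) a uniformly drawn element of A
X0_2  = zeros(size(A{1}));
for i = 1:m
    X0_2 = X0_2 + A{i} / m;                        % (2) arithmetic mean of A (upper bound)
end
alpha = min(cellfun(@(Ai) min(eig(Ai)), A));       % smallest eigenvalue over A
X0_3  = alpha * eye(size(A{1}, 1));                % (3) lower bound alpha * I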

7 Discussion

We presented a Riemannian version of the classical Frank-Wolfe method that enables constrained optimization on Riemannian (more precisely, Hadamard) manifolds. As in the Euclidean case, we recover sublinear convergence rates for Riemannian Frank-Wolfe for both geodesically convex and nonconvex objectives. Under the stricter assumptions of a \(\mu \)-strongly g-convex objective and a strict interiority condition on the constraint set, we show that Riemannian Frank-Wolfe can even attain linear convergence rates. To our knowledge, this work represents the first extension of Frank-Wolfe methods to a manifold setting.

In addition to the general results, we present an efficient algorithm for optimization over Hermitian positive definite matrices. A key highlight of this specialization is a closed-form solution to the Riemannian “linear” oracle needed by Frank-Wolfe (this oracle involves solving a nonconvex problem). While we focus on the specific problem of computing the Karcher mean (also known as the Riemannian centroid or geometric matrix mean), the derived closed-form solutions apply to more general objective functions and should be of wider interest for related nonconvex and g-convex problems. To demonstrate this versatility, we also include an application of RFW to the computation of Wasserstein barycenters on the Gaussian density manifold. In future work, we hope to explore other constraint sets and matrix manifolds that admit an efficient solution to the Riemannian “linear” oracle.

Our algorithm is shown to be competitive with a variety of established and recently proposed approaches [28, 63, 67], providing evidence for its applicability to large-scale statistics and machine learning problems. In follow-up work [60], we show that RFW extends to nonconvex stochastic settings, further increasing the efficiency and versatility of Riemannian Frank-Wolfe methods. Exploring the use of this class of algorithms in large-scale machine learning applications is a promising avenue for future research.