An accelerated first-order method for non-convex optimization on manifolds

We describe the first gradient methods on Riemannian manifolds to achieve accelerated rates in the non-convex case. Under Lipschitz assumptions on the Riemannian gradient and Hessian of the cost function, these methods find approximate first-order critical points faster than regular gradient descent. A randomized version also finds approximate second-order critical points. Both the algorithms and their analyses build extensively on existing work in the Euclidean case. The basic operation consists in running the Euclidean accelerated gradient descent method (appropriately safe-guarded against non-convexity) in the current tangent space, then moving back to the manifold and repeating. This requires lifting the cost function from the manifold to the tangent space, which can be done for example through the Riemannian exponential map. For this approach to succeed, the lifted cost function (called the pullback) must retain certain Lipschitz properties. As a contribution of independent interest, we prove precise claims to that effect, with explicit constants. Those claims are affected by the Riemannian curvature of the manifold, which in turn affects the worst-case complexity bounds for our optimization algorithms.


Introduction
We consider optimization problems of the form

min_{x ∈ M} f(x),    (P)

where f is lower-bounded and twice continuously differentiable on a Riemannian manifold M. For the special case where M is a Euclidean space, problem (P) amounts to smooth, unconstrained optimization. The more general case is important for applications notably in scientific computing, statistics, imaging, learning, communications and robotics: see for example [1,27]. For a general non-convex objective f, computing a global minimizer of (P) is hard. Instead, our goal is to compute approximate first- and second-order critical points of (P). A number of non-convex problems of interest exhibit the property that second-order critical points are optimal [7,11,14,24,30,36,49]. Several of these are optimization problems on nonlinear manifolds. Therefore, theoretical guarantees for approximately finding second-order critical points can translate to guarantees for approximately solving these problems.
It is therefore natural to ask for fast algorithms which find approximate second-order critical points on manifolds, within a tolerance ε (see below). Existing algorithms include RTR [13], ARC [2] and perturbed RGD [20,44]. Under some regularity conditions, ARC uses Hessian-vector products to achieve a rate of O(ε^{-7/4}). In contrast, under the same regularity conditions, perturbed RGD uses only function value and gradient queries, but achieves a poorer rate of O(ε^{-2}). Does there exist an algorithm which finds approximate second-order critical points with a rate of O(ε^{-7/4}) using only function value and gradient queries? The answer was known to be yes in Euclidean space. Can it also be done on Riemannian manifolds, hence extending applicability to applications treated in the aforementioned references? We resolve that question positively with the algorithm PTAGD below.
From a different perspective, the recent success of momentum-based first-order methods in machine learning [42] has encouraged interest in momentum-based first-order algorithms for non-convex optimization which are provably faster than gradient descent [15,28]. We show such provable guarantees can be extended to optimization under a manifold constraint. From this perspective, our paper is part of a body of work theoretically explaining the success of momentum methods in non-convex optimization.
There has been significant difficulty in accelerating geodesically convex optimization on Riemannian manifolds. See "Related literature" below for more details on best known bounds [3] as well as results proving that acceleration in certain settings is impossible on manifolds [26]. Given this difficulty, it is not at all clear a priori that it is possible to accelerate non-convex optimization on Riemannian manifolds. Our paper shows that it is in fact possible.
We design two new algorithms and establish worst-case complexity bounds under Lipschitz assumptions on the gradient and Hessian of f . Beyond a theoretical contribution, we hope that this work will provide an impetus to look for more practical fast first-order algorithms on manifolds.
More precisely, if the gradient of f is L-Lipschitz continuous (in the Riemannian sense defined below), it is known that Riemannian gradient descent can find an ε-approximate first-order critical point¹ in at most O(Δ_f L/ε²) queries, where Δ_f upper-bounds the gap between initial and optimal cost value [8,13,47]. Moreover, this rate is optimal in the special case where M is a Euclidean space [16], but it can be improved under the additional assumption that the Hessian of f is ρ-Lipschitz continuous.
Recently in Euclidean space, Carmon et al. [15] have proposed a deterministic algorithm for this setting (L-Lipschitz gradient, ρ-Lipschitz Hessian) which requires at most Õ(Δ_f L^{1/2} ρ^{1/4}/ε^{7/4}) queries (up to logarithmic factors), independently of dimension. This is a speed-up over gradient descent by a factor of Θ̃(L^{1/2}/(ρ^{1/4} ε^{1/4})). For the Euclidean case, it has been shown under these assumptions that first-order methods require at least Ω(Δ_f L^{3/7} ρ^{2/7}/ε^{12/7}) queries [17, Thm. 2]. This leaves a gap of merely Õ(1/ε^{1/28}) in the ε-dependency.
Soon after, Jin et al. [28] showed how a related algorithm with randomization can find (ε, √(ρε))-approximate second-order critical points² with the same complexity, up to polylogarithmic factors in the dimension of the search space and in the reciprocal of the probability of failure. Both the algorithm of Carmon et al. [15] and that of Jin et al. [28] fundamentally rely on Nesterov's accelerated gradient descent method (AGD) [40], with safeguards against non-convexity. To achieve improved rates, AGD builds heavily on a notion of momentum which accumulates across several iterations. This makes it delicate to extend AGD to nonlinear manifolds, as it would seem that we need to transfer momentum from tangent space to tangent space, all the while keeping track of fine properties.
In this paper, we build heavily on the Euclidean work of Jin et al. [28] to show the following. Assume f has Lipschitz continuous gradient and Hessian on a complete Riemannian manifold satisfying some curvature conditions. With at most Õ(Δ_f L^{1/2} ρ̂^{1/4}/ε^{7/4}) queries (where ρ̂ is larger than ρ by an additive term affected by L and the manifold's curvature):
1. It is possible to compute an ε-approximate first-order critical point of f with a deterministic first-order method;
2. It is possible to compute an (ε, √(ρ̂ε))-approximate second-order critical point of f with a randomized first-order method.
In the first case, the complexity is independent of the dimension of M. In the second case, the complexity includes polylogarithmic factors in the dimension of M and in the probability of failure. This parallels the Euclidean setting. In both cases (and in contrast to the Euclidean setting), the Riemannian curvature of M affects the complexity in two ways: (a) because ρ̂ is larger than ρ, and (b) because the results only apply when the target accuracy ε is small enough in comparison with some curvature-dependent thresholds. It is an interesting open question to determine whether such a curvature dependency is inescapable.
We call our first algorithm TAGD, for tangent accelerated gradient descent,³ and the second algorithm PTAGD, for perturbed tangent accelerated gradient descent. Both algorithms and (even more so) their analyses closely mirror the perturbed accelerated gradient descent algorithm (PAGD) of Jin et al. [28], with one core design choice that facilitates the extension to manifolds: instead of transporting momentum from tangent space to tangent space, we run several iterations of AGD (safeguarded against non-convexity) in individual tangent spaces. After an AGD run in the current tangent space, we "retract" back to a new point on the manifold and initiate another AGD run in the new tangent space. In so doing, we only need to understand the fine behavior of AGD in one tangent space at a time. Since tangent spaces are linear spaces, we can capitalize on existing Euclidean analyses. This general approach is in line with prior work in [20], and is an instance of the dynamic trivializations framework of Lezcano-Casado [33].
¹ That is, a point where the gradient of f has norm smaller than ε.
² That is, a point where the gradient of f has norm smaller than ε and the eigenvalues of the Hessian of f are at least −√(ρε).
In order to run AGD in the tangent space T_x M at x, we must pull the cost function f back from M to T_x M. A geometrically pleasing way to do so is via the exponential map⁴ Exp_x: T_x M → M, whose defining feature is that, for each v ∈ T_x M, the curve γ(t) = Exp_x(tv) is the geodesic of M passing through γ(0) = x with velocity γ′(0) = v. Then, f̂_x = f ∘ Exp_x is a real function on T_x M called the pullback of f at x. To analyze the behavior of AGD applied to f̂_x, the most pressing question is: To what extent does f̂_x = f ∘ Exp_x inherit the Lipschitz properties of f?
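For intuition, the pullback is easy to realize numerically when Exp_x is known in closed form. The sketch below (on the unit sphere, with an arbitrary smooth cost; all names are illustrative, not from the paper) builds f̂_x = f ∘ Exp_x and checks by finite differences that ∇f̂_x(0) equals grad f(x), the tangent projection of the Euclidean gradient:

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere S^{n-1}: geodesics are great circles."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def pullback(f, x):
    """The pullback of f at x: a real function on the tangent space T_x S^{n-1}."""
    return lambda s: f(sphere_exp(x, s))

# Illustrative smooth cost on R^3, restricted to the sphere
f = lambda y: y[0] ** 2 - y[1] * y[2]
egrad = lambda y: np.array([2 * y[0], -y[2], -y[1]])  # Euclidean gradient of f

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
x /= np.linalg.norm(x)
fx = pullback(f, x)

# Orthonormal basis of T_x: columns 2..n of a QR factorization whose first column is x
Q, _ = np.linalg.qr(np.column_stack([x, rng.standard_normal((3, 2))]))
B = Q[:, 1:]

# Central finite differences give the gradient of the pullback at the origin of T_x
h = 1e-5
g_pullback = np.array([(fx(h * B[:, i]) - fx(-h * B[:, i])) / (2 * h) for i in range(2)])

# The Riemannian gradient of f at x is the tangent projection of the Euclidean
# gradient; in the basis B its coordinates are B.T @ egrad(x) (since B.T @ x = 0)
g_riem = B.T @ egrad(x)
```

The two coordinate vectors agree up to finite-difference error, illustrating that ∇f̂_x(0) = grad f(x).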
In this paper, we show that if f has Lipschitz continuous gradient and Hessian and if the gradient of f at x is sufficiently small, then f̂_x restricted to a ball around the origin of T_x M has Lipschitz continuous gradient and retains partial Lipschitz-type properties for its Hessian. The norm condition on the gradient and the radius of the ball are dictated by the Riemannian curvature of M. These geometric results are of independent interest. Because f̂_x retains only partial Lipschitzness, our algorithms depart from the Euclidean case in the following ways: (a) at points where the gradient is still large, we perform a simple gradient step; and (b) when running AGD in T_x M, we are careful not to leave the prescribed ball around the origin: if we ever do, we take appropriate action. For those reasons, and also because we have only radial rather than full Lipschitzness for the Hessian of f̂_x, minute changes throughout the analysis of Jin et al. [28] are in order.
To be clear, in their current state, TAGD and PTAGD are theoretical constructs. As one can see from later sections, running them requires the user to know the value of several parameters that are seldom available (including the Lipschitz constants L and ρ); the target accuracy must be set ahead of time; and the tuning constants as dictated here by the theory are (in all likelihood) overly cautious.
Moreover, to compute the gradient of f̂_x we need to differentiate through the exponential map (or a retraction, as the case may be). This is sometimes easy to do in closed form (see [33] for families of examples), but it could be a practical hurdle. On the other hand, our algorithms do not require parallel transport. It remains an interesting open question to develop practical accelerated gradient methods for non-convex problems on manifolds.
In closing this introduction, we give simplified statements of our main results. These are all phrased under the following assumption (see Sect. 2 for geometric definitions):
A1: The Riemannian manifold M and the cost function f: M → R have these properties:
• M is complete, its sectional curvatures are in the interval [−K, K], and the covariant derivative of its Riemann curvature endomorphism is bounded by F in operator norm; and
• f is lower-bounded by f_low, has L-Lipschitz continuous gradient grad f and ρ-Lipschitz continuous Hessian Hess f on M.

Main Geometry Results
As a geometric contribution, we show that pullbacks through the exponential map retain certain Lipschitz properties of f . Explicitly, at a point x ∈ M we have the following statement.
(Above, ‖·‖ denotes both the Riemannian norm on T_x M and the associated operator norm. Also, ∇f̂_x and ∇²f̂_x are the gradient and Hessian of f̂_x on the Euclidean space T_x M.) We expect this result to be useful in several other contexts. Section 2 provides a more complete (and somewhat more general) statement. At the same time and independently, Lezcano-Casado [35] develops similar geometric bounds and applies them to study gradient descent in tangent spaces; see "Related literature" below for additional details.

Main Optimization Results
We aim to compute approximate first- and second-order critical points of f, as defined here: a point x ∈ M is an ε-approximate first-order critical point (ε-FOCP) for (P) if ‖grad f(x)‖ ≤ ε, and an (ε, √(ρ̂ε))-approximate second-order critical point ((ε, √(ρ̂ε))-SOCP) if moreover λ_min(Hess f(x)) ≥ −√(ρ̂ε), where λ_min(·) extracts the smallest eigenvalue of a self-adjoint operator.
In Sect. 5, we define and analyze the algorithm TAGD. Resting on the geometric result above, that algorithm with the exponential retraction warrants the following claim about the computation of first-order points. The O(·) notation is with respect to scaling in ε.
Theorem 1.3 If A1 holds, there exists an algorithm (TAGD) which, given any x_0 ∈ M and small enough tolerance ε > 0, produces an ε-FOCP for (P) using at most a constant multiple of T function and pullback gradient queries, and a similar number of evaluations of the exponential map.
In Sect. 6, we define and analyze the algorithm PTAGD. With the exponential retraction, the latter warrants the following claim about the computation of second-order points.
Theorem 1.6 If A1 holds, there exists an algorithm (PTAGD) which, given any x_0 ∈ M, any δ ∈ (0, 1) and small enough tolerance ε > 0 (same condition as in Theorem 1.3), produces, with probability at least 1 − δ, an (ε, √(ρ̂ε))-SOCP for (P) using at most a constant multiple of T function and pullback gradient queries, and a similar number of evaluations of the exponential map.

This algorithm uses no Hessian queries and is randomized.
This result is almost dimension-free, and still not curvature-free for the same reasons as above.

Related Literature
At the same time and independently, Lezcano-Casado [35] develops geometric bounds similar to our own. Both papers derive the same second-order inhomogeneous linear ODE (ordinary differential equation) describing the behavior of the second derivative of the exponential map. Lezcano-Casado [35] then uses ODE comparison techniques to derive the geometric bounds, while the present work uses a bootstrapping technique. Lezcano-Casado [35] applies these bounds to study gradient descent in tangent spaces, whereas we study non-convex accelerated algorithms for finding first- and second-order critical points.
The technique of pulling back a function to a tangent space is frequently used in other settings within optimization on manifolds. See for example the recent papers of Bergmann et al. [9] and Lezcano-Casado [34]. Additionally, the use of Riemannian Lipschitz conditions in optimization as they appear in Section 2 can be traced back to [21,Def. 4.1] and [23,Def. 2.2].
Accelerating optimization algorithms on Riemannian manifolds has been well-studied in the context of geodesically convex optimization problems. Such problems can be solved globally, and usually the objective is to bound the suboptimality gap rather than to find approximate critical points. A number of papers have studied Riemannian versions of AGD; however, none of these papers has been able to achieve a fully accelerated rate for convex optimization. Zhang and Sra [48] show that if the initial iterate is sufficiently close to the minimizer, then acceleration is possible. Intuitively this makes sense, since manifolds are locally Euclidean. Ahn and Sra [3] pushed this further, developing an algorithm converging strictly faster than RGD, and which also achieves acceleration when sufficiently close to the minimizer.
Alimisis et al. [4][5][6] analyze the problem of acceleration on the class of non-strongly convex functions, as well as under weaker notions of convexity. Interestingly, they also show that in the continuous limit (using an ODE to model optimization algorithms) acceleration is possible. However, it is unclear whether the discretization of this ODE preserves a similar acceleration.
Recently, Hamilton and Moitra [26] have shown that full acceleration (in the geodesically convex case) is impossible in the hyperbolic plane, in the setting where function values and gradients are corrupted by a very small amount of noise. In contrast, in the analogous Euclidean setting, acceleration is possible even with noisy oracles [22].

Riemannian Tools and Regularity of Pullbacks
In this section, we build up to and state our main geometric result. As we do so, we provide a few reminders of Riemannian geometry. For more on this topic, we recommend the modern textbooks by Lee [31,32]. For book-length, optimization-focused introductions, see [1,12]. Some proofs of this section appear in Appendices A and B.
We consider a manifold M with Riemannian metric ⟨·,·⟩_x and associated norm ‖·‖_x on the tangent spaces T_x M. (In other sections, we omit the subscript x.) The tangent bundle is itself a smooth manifold. The Riemannian metric provides a notion of gradient.

Definition 2.1
The Riemannian gradient of a differentiable function f: M → R is the unique vector field grad f on M which satisfies ⟨grad f(x), s⟩_x = Df(x)[s] for all (x, s) in the tangent bundle TM. The Riemannian metric further induces a uniquely defined Riemannian connection ∇ (used to differentiate vector fields on M) and an associated covariant derivative D_t (used to differentiate vector fields along curves on M). (The symbol ∇ here is not to be confused with its use elsewhere to denote differentiation of scalar functions on Euclidean spaces.) Applying the connection to the gradient vector field, we obtain Hessians.

Definition 2.2
The Riemannian Hessian of a twice differentiable function f: M → R at x is the linear operator Hess f(x) to and from T_x M defined by Hess f(x)[s] = ∇_s grad f = D_t (grad f ∘ c)(0), where in the last equality c can be any smooth curve on M satisfying c(0) = x and c′(0) = s. This operator is self-adjoint with respect to the metric ⟨·,·⟩_x.
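To make this definition concrete, the sketch below (an illustrative numerical check on the unit sphere, not part of the paper's algorithms) verifies that the Hessian of the pullback f̂_x = f ∘ Exp_x at the origin of T_x M coincides with the Riemannian Hessian, for the quadratic cost f(y) = yᵀAy; the basis construction and the step size h are arbitrary choices.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: geodesics are great circles."""
    nv = np.linalg.norm(v)
    return x.copy() if nv < 1e-16 else np.cos(nv) * x + np.sin(nv) * (v / nv)

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
f = lambda y: y @ A @ y

x = rng.standard_normal(n); x /= np.linalg.norm(x)
fx = lambda s: f(sphere_exp(x, s))  # pullback of f at x

# Orthonormal basis of the tangent space T_x (columns satisfy B.T @ x = 0)
Q, _ = np.linalg.qr(np.column_stack([x, rng.standard_normal((n, n - 1))]))
B = Q[:, 1:]

# Hessian of the pullback at the origin, by central second differences in the basis B
h, m = 1e-3, n - 1
H_num = np.empty((m, m))
for i in range(m):
    for j in range(m):
        ei, ej = h * B[:, i], h * B[:, j]
        H_num[i, j] = (fx(ei + ej) - fx(ei - ej) - fx(-ei + ej) + fx(-ei - ej)) / (4 * h * h)

# Riemannian Hessian of f(y) = y'Ay on the sphere, in the same basis:
# Hess f(x)[v] = 2 (Proj_x(A v) - (x'Ax) v) for tangent v
H_riem = 2 * (B.T @ A @ B - (x @ A @ x) * np.eye(m))
```

The agreement reflects the fact that the exponential map is a second-order retraction, so ∇²f̂_x(0) = Hess f(x).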
We can also define the Riemannian third derivative ∇³f (a tensor of order three); see [12, Ch. 10] for details. A retraction R is a smooth map from (a subset of) TM to M with the following property: for all (x, s) ∈ TM, the smooth curve c(t) = R(x, ts) = R_x(ts) on M passes through c(0) = x with velocity c′(0) = s. Such maps are used frequently in Riemannian optimization in order to move on a manifold. For example, a key ingredient of Riemannian gradient descent is the curve c(t) = R_x(−t grad f(x)), which initially moves away from x along the negative gradient direction.
To a curve c, we naturally associate a velocity vector field c′. Using the covariant derivative D_t, we differentiate this vector field along c to define the acceleration c″ = D_t c′ of c: this is also a vector field along c. In particular, the geodesics of M are the curves with zero acceleration.
The exponential map Exp: O → M, defined on an open subset O of the tangent bundle, is a special retraction whose curves are geodesics. Specifically, γ(t) = Exp(x, ts) = Exp_x(ts) is the unique geodesic on M which passes through γ(0) = x with velocity γ′(0) = s. If the domain of Exp is the whole tangent bundle, we say M is complete.
To compare tangent vectors in distinct tangent spaces, we use parallel transports. Explicitly, let c be a smooth curve connecting the points c(0) = x and c(1) = y. We say a vector field Z along c is parallel if its covariant derivative D_t Z is zero. Conveniently, for any given v ∈ T_x M there exists a unique parallel vector field along c whose value at t = 0 is v. Therefore, the value of that vector field at t = 1 is a well-defined vector in T_y M: we call it the parallel transport of v from x to y along c. We introduce the notation P_t^c to denote parallel transport along a smooth curve c from c(0) to c(t). This is a linear isometry: (P_t^c)^{-1} = (P_t^c)^*, where the star denotes an adjoint with respect to the Riemannian metric. For the special case of parallel transport along the geodesic γ(t) = Exp_x(ts), we write P_{ts} with the meaning P_{ts} = P_t^γ. Using these tools, we can define Lipschitz continuity of gradients and Hessians. Note that in the particular case where M is a Euclidean space we have Exp_x(s) = x + s and parallel transports are identities, so that these reduce to the usual definitions.
Definition 2.3 The gradient of f is L-Lipschitz continuous if

‖P_s^* grad f(Exp_x(s)) − grad f(x)‖_x ≤ L ‖s‖_x

for all (x, s) in the domain of Exp, where P_s^* is the adjoint of P_s with respect to the Riemannian metric. The Hessian of f is ρ-Lipschitz continuous if

‖P_s^* ∘ Hess f(Exp_x(s)) ∘ P_s − Hess f(x)‖_x ≤ ρ ‖s‖_x

for all (x, s) in the domain of Exp, where ‖·‖_x denotes both the Riemannian norm on T_x M and the associated operator norm.
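On the unit sphere, the parallel transport P_s entering such definitions has a closed form, which makes them easy to experiment with. The sketch below (sphere-specific, with illustrative random data) transports a tangent vector along a geodesic and checks that the transport is a linear isometry into the new tangent space:

```python
import numpy as np

def sphere_exp(x, v):
    nv = np.linalg.norm(v)
    return x.copy() if nv < 1e-16 else np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_transport(x, s, v):
    """Parallel transport of v in T_x S^{n-1} along t -> Exp_x(t s), evaluated at t = 1.
    The component of v along the geodesic direction rotates with it; the rest is fixed."""
    ns = np.linalg.norm(s)
    if ns < 1e-16:
        return v.copy()
    u = s / ns              # unit initial velocity of the geodesic
    a = u @ v               # component of v along u
    w = v - a * u           # component orthogonal to the geodesic: unchanged
    return w + a * (np.cos(ns) * u - np.sin(ns) * x)

rng = np.random.default_rng(2)
x = rng.standard_normal(4); x /= np.linalg.norm(x)
P = np.eye(4) - np.outer(x, x)      # orthogonal projector onto T_x
s = P @ rng.standard_normal(4)      # direction of the geodesic
v = P @ rng.standard_normal(4)      # vector to transport
y = sphere_exp(x, s)
Pv = sphere_transport(x, s, v)
```

The transported vector Pv lies in T_y and has the same norm as v, as a parallel transport must.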
It is well known that these Lipschitz conditions are equivalent to convenient inequalities, often used to study the complexity of optimization algorithms. More details appear in [12,Ch. 10].

Proposition 2.4
If a function f: M → R has L-Lipschitz continuous gradient, then

|f(Exp_x(s)) − f(x) − ⟨grad f(x), s⟩_x| ≤ (L/2) ‖s‖_x²

for all (x, s) in the domain of Exp. If in addition f is twice differentiable, then ‖Hess f(x)‖ ≤ L for all x ∈ M. If f has ρ-Lipschitz continuous Hessian, then

|f(Exp_x(s)) − f(x) − ⟨grad f(x), s⟩_x − (1/2)⟨s, Hess f(x)[s]⟩_x| ≤ (ρ/6) ‖s‖_x³ and
‖P_s^* grad f(Exp_x(s)) − grad f(x) − Hess f(x)[s]‖_x ≤ (ρ/2) ‖s‖_x²

for all (x, s) in the domain of Exp. If in addition f is three times differentiable, then ‖∇³f(x)‖ ≤ ρ for all x ∈ M. The other way around, if f is three times continuously differentiable and the stated inequalities hold, then its gradient and Hessian are Lipschitz continuous with the stated constants.
For sufficiently simple algorithms, these inequalities may be all we need to track progress in a sharp way. As an example, the iterates of Riemannian gradient descent with constant step-size 1/L satisfy x_{k+1} = Exp_{x_k}(s_k) with s_k = −(1/L) grad f(x_k). It follows directly from the first inequality above that f(x_{k+1}) ≤ f(x_k) − (1/(2L)) ‖grad f(x_k)‖². From there, it takes a brief argument to conclude that this method finds a point with gradient smaller than ε in at most 2L(f(x_0) − f_low)/ε² steps. A similar (but longer) story applies to the analysis of Riemannian trust regions and adaptive cubic regularization [2,13].
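The descent argument above can be checked numerically. The sketch below (a toy instance, not one of the paper's algorithms) runs Riemannian gradient descent with step size 1/L on f(x) = xᵀAx over the unit sphere, where L = 4‖A‖₂ is a crude bound on the Lipschitz constant of grad f specific to this example, and verifies the per-step decrease f(x_{k+1}) ≤ f(x_k) − ‖grad f(x_k)‖²/(2L) until the gradient norm drops below a tolerance ε:

```python
import numpy as np

def sphere_exp(x, v):
    nv = np.linalg.norm(v)
    return x.copy() if nv < 1e-16 else np.cos(nv) * x + np.sin(nv) * (v / nv)

n = 8
A = np.diag(np.arange(n, dtype=float))  # eigenvalues 0, 1, ..., 7; minimum value is 0
L = 4 * np.linalg.norm(A, 2)            # crude bound on the Lipschitz constant of grad f
f = lambda y: y @ A @ y
eps = 1e-6

x = np.ones(n) / np.sqrt(n)             # x_0: nonzero overlap with the minimizer
decreases_ok = True
for _ in range(5000):
    g = 2 * (A @ x - (x @ A @ x) * x)   # Riemannian gradient (tangent projection)
    gnorm = np.linalg.norm(g)
    if gnorm < eps:
        break
    x_new = sphere_exp(x, -g / L)       # x_{k+1} = Exp_{x_k}(-grad f(x_k) / L)
    # Per-step guarantee from the L-Lipschitz-gradient inequality
    decreases_ok &= bool(f(x_new) <= f(x) - gnorm**2 / (2 * L) + 1e-12)
    x = x_new
```

The run terminates at an ε-critical point, and every step satisfied the guaranteed decrease.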
However, the inequalities in Proposition 2.4 fall short when finer properties of the algorithms are only visible at the scale of multiple combined iterations. This is notably the case for accelerated gradient methods. For such algorithms, individual iterations may not achieve spectacular cost decrease, but a long sequence of them may accumulate an advantage over time (using momentum). To capture this advantage in an analysis, it is not enough to apply inequalities above to individual iterations. As we turn to assessing a string of iterations jointly by relating the various gradients and step directions we encounter, the nonlinearity of M generates significant hurdles.
For these reasons, we study the pullbacks of the cost function, namely, the functions f̂_x = f ∘ Exp_x for x ∈ M. Each pullback is defined on a linear space, hence we can in principle run any Euclidean optimization algorithm on f̂_x directly: our strategy is therefore to apply a momentum-based method on f̂_x. To this end, we now work towards showing that if f has Lipschitz continuous gradient and Hessian then f̂_x also has certain Lipschitz-type properties.
The following formulas appear in [2, Lem. 5]; we are interested in the case R = Exp. (We use ∇ and ∇² to designate gradients and Hessians of functions on Euclidean spaces: not to be confused with the connection ∇.) For all s in the domain of R_x,

∇f̂_x(s) = T_s^* grad f(R_x(s))  and  ∇²f̂_x(s) = T_s^* ∘ Hess f(R_x(s)) ∘ T_s + W_s,

where T_s is the differential of R_x at s (a linear operator from T_x M to T_{R_x(s)} M), and W_s is a self-adjoint linear operator on T_x M defined through polarization by

⟨W_s[ṡ], ṡ⟩ = ⟨grad f(R_x(s)), c″(0)⟩,  where c(t) = R_x(s + tṡ).
We turn to curvature. The Lie bracket of two smooth vector fields X, Y on M is itself a smooth vector field, conveniently expressed in terms of the Riemannian connection as [X, Y] = ∇_X Y − ∇_Y X. Using this notion, the Riemann curvature endomorphism R of M is an operator which maps three smooth vector fields X, Y, Z of M to a fourth smooth vector field as:

R(X, Y)Z = ∇_X ∇_Y Z − ∇_Y ∇_X Z − ∇_{[X,Y]} Z.

Whenever R is identically zero, we say M is flat: this is the case notably when M is a Euclidean space and when M has dimension one (e.g., a circle is flat, while a sphere is not). Though it is not obvious from the definition, the value of the vector field R(X, Y)Z at x ∈ M depends on X, Y, Z only through their values at x. Therefore, given u, v, w ∈ T_x M we can make sense of the notation R(u, v)w as denoting the vector in T_x M corresponding to R(X, Y)Z at x, where X, Y, Z are arbitrary smooth vector fields whose values at x are u, v, w, respectively. The map (u, v, w) → R(u, v)w is linear in each input.
Two linearly independent tangent vectors u, v at x span a two-dimensional plane of T_x M. The sectional curvature of M along that plane is a real number K(u, v) defined as

K(u, v) = ⟨R(u, v)v, u⟩ / (‖u‖² ‖v‖² − ⟨u, v⟩²).

Of course, all sectional curvatures of flat manifolds are zero. Also, all sectional curvatures of a sphere of radius r are 1/r², and all sectional curvatures of the hyperbolic space with parameter r are −1/r²; see [32, Thm. 8.34].
Using the connection ∇, we differentiate the curvature endomorphism R as follows. Given any smooth vector field U, we let ∇_U R be an operator of the same type as R itself, in the sense that it maps three smooth vector fields X, Y, Z to a fourth one:

(∇_U R)(X, Y)Z = ∇_U (R(X, Y)Z) − R(∇_U X, Y)Z − R(X, ∇_U Y)Z − R(X, Y)∇_U Z.

Observe that this formula captures a convenient chain rule for differentiating R(X, Y)Z. We say ∇R has operator norm bounded by F if ‖(∇_u R)(v, w)z‖_x ≤ F ‖u‖ ‖v‖ ‖w‖ ‖z‖ holds for all x ∈ M and all u, v, w, z ∈ T_x M. If F = 0 (that is, ∇R ≡ 0), we say R is parallel and M is called locally symmetric. This is notably the case for manifolds with constant sectional curvature (Euclidean spaces, spheres and hyperbolic spaces) and (Riemannian) products thereof [41, pp. 219-221].
We are ready to state the main result of this section. Note that M need not be complete.

Theorem 2.7
Let M be a Riemannian manifold whose sectional curvatures are in the interval [K_low, K_up], and let K = max(|K_low|, |K_up|). Also assume ∇R (the covariant derivative of the Riemann curvature endomorphism R) is bounded by F in operator norm.
Let f : M → R be twice continuously differentiable and select b > 0 such that b ≤ min 1

If f has L-Lipschitz continuous gradient and grad
(For example, the set of symmetric positive-definite matrices endowed with the so-called affine invariant metric is a non-compact symmetric space of non-constant curvature. It is commonly used in practice [10,37,38,43]. One can show that K = 1/2 and F = 0 are the right constants for this manifold.) We only get satisfactory Lipschitzness at points where the gradient is bounded by Lb. Fortunately, for the algorithms we study, whenever we encounter a point with gradient larger than that threshold, it is sufficient to take a simple gradient descent step.
To prove Theorem 2.7, we must control ∇²f̂_x(s). According to Lemma 2.5, this requires controlling both T_s (a differential of the exponential map) and c″(0) (the intrinsic initial acceleration of a curve defined via the exponential map, but which is not itself a geodesic in general). On both counts, we must study differentials of exponentials. Jacobi fields are the tool of choice for such tasks. As a first step, we use Jacobi fields to investigate the difference between T_s and P_s: two linear operators from T_x M to T_{Exp_x(s)} M. We prove a general result in Appendix A (exact for constant sectional curvature) and state a sufficient particular case here. Control of T_s follows as a corollary because P_s (parallel transport) is an isometry.
Proof. By Proposition 2.8, the operator norm of T_s − P_s is bounded above by (1/3) K ‖s‖_x² ≤ 1/3. Furthermore, parallel transport P_s is an isometry: its singular values are equal to 1. Thus, the largest singular value of T_s is at most 1 + 1/3 = 4/3. Likewise, with the minimum taken over unit-norm vectors u ∈ T_x M and writing y = Exp_x(s), the smallest singular value satisfies min_u ‖T_s u‖_y ≥ 1 − 1/3 = 2/3. We turn to controlling the term c″(0) which appears in the definition of the operator W_s in the expression for ∇²f̂_x(s) provided by Lemma 2.5. We present a detailed proof in Appendix B for a general statement, and state a sufficient particular case here. The proof is fairly technical: it involves designing an appropriate nonlinear second-order ODE on the manifold and bounding its solutions. The ODE is related to the Jacobi equation, except we had to differentiate to the next order, and the equation is not homogeneous.

Proposition 2.10
Let M be a Riemannian manifold whose sectional curvatures are in the interval [K_low, K_up], and let K = max(|K_low|, |K_up|). Further assume ∇R is bounded by F in operator norm.
For any ṡ ∈ T_x M, the curve c(t) = Exp_x(s + tṡ) has initial acceleration bounded as follows. Equipped with all of the above, it is now easy to prove the main theorem of this section.

Proof of Theorem 2.7. Consider the pullback f̂_x = f ∘ Exp_x. Using Lemma 2.5, we start bounding the Hessian ∇²f̂_x(s), with the operator W_s defined by (8). Since grad f is L-Lipschitz continuous, ‖Hess f(y)‖_y ≤ L for all y ∈ M (this follows fairly directly from Proposition 2.4). To bound W_s, we start with a Cauchy-Schwarz inequality, then we consider the worst case for the magnitude of c″(0). Combining these steps yields a first bound of the form (14). To proceed, we keep working on the W_s-terms: we use Proposition 2.10, L-Lipschitz continuity of the gradient, and our bounds on the norms of s and grad f(x). Returning to (14) and using Corollary 2.9 to bound T_s confirms that ∇f̂_x is 2L-Lipschitz continuous in the ball of radius b around the origin in T_x M.
To establish the second part of the claim, we use the same intermediate results and the ρ-Lipschitz continuity of the Hessian. First, we apply Lemma 2.5 twice, then bound the resulting expression line by line, calling upon Proposition 2.8, Corollary 2.9 and (15). This shows a type of Lipschitz continuity of the Hessian of the pullback with respect to the origin, in the ball of radius b.

Assumptions and parameters for TAGD and PTAGD
Our algorithms apply to the minimization of f: M → R on a Riemannian manifold M equipped with a retraction R defined on the whole tangent bundle TM; the pullbacks are f̂_x = f ∘ R_x. In light of Sect. 2, we make the following assumptions.
Moreover, f is twice continuously differentiable and there exist constants ℓ, ρ̂ and b such that the conditions of A2 hold. The first three items in A2 confer Lipschitz properties to the derivatives of the pullbacks f̂_x restricted to balls around the origins of tangent spaces: these are the balls where we shall run accelerated gradient steps. We only need these guarantees at points where the gradient is below a threshold. For all other points, a regular gradient step provides ample progress: the last item in A2 serves that purpose only; see Proposition 5.2.
Section 2 tells us that A2 holds in particular when we use the exponential map as a retraction and f itself has appropriate (Riemannian) Lipschitz properties. This is the link between Theorems 1.3 and 1.6 in the introduction and Theorems 5.1 and 6.1 in later sections.

Corollary 3.1
If we use the exponential retraction R = Exp and A1 holds, then A2 holds with the same f_low and with ℓ, ρ̂ and b determined by L, ρ and the curvature bounds K and F, as in Theorem 2.7. With constants as in A2, we further define a number of parameters. First, the user specifies a tolerance ε > 0 which must not be too loose.
Then, we fix a first set of parameters (see [28] for more context; in particular, κ plays the role of a condition number; under A3, we have κ ≥ 2). We define a second set of parameters based on some χ ≥ 1 (as set in some of the lemmas and theorems below) and a universal constant c > 0 (implicitly defined as the smallest real satisfying a finite number of lower-bounds required throughout the paper). When we say "with χ ≥ A ≥ 1" (for example, in Theorems 5.1 and 6.1), we mean: "with χ the smallest value larger than A such that T is a positive integer multiple of 4." Lemma C.1 in Appendix C lists useful relations between the parameters.

Algorithm 1 TSS(x, s_0) with (x, s_0) ∈ TM and parameters ε, η, b, θ, γ, s, T
1: If s_0 is not provided, set s_0 = 0 and perturbed = false; otherwise, set perturbed = true.

Accelerated Gradient Descent in a Ball of a Tangent Space
The main ingredient of algorithms TAGD and PTAGD is TSS: the tangent space steps algorithm. Essentially, the latter runs the classical accelerated gradient descent algorithm (AGD) from convex optimization on f̂_x in a tangent space T_x M, with a few tweaks:
1. Because f̂_x need not be convex, TSS monitors the generated sequences for signs of non-convexity. If f̂_x happens to behave like a convex function along the sequence TSS generates, then we reap the benefits of convexity. Otherwise, the direction along which f̂_x behaves in a non-convex way can be used as a good descent direction. This is the idea behind the "convex until proven guilty" paradigm developed by Carmon et al. [15] and also exploited by Jin et al. [28]. Explicitly, given x ∈ M and s, u ∈ T_x M, for a specified parameter γ > 0, we check the negative curvature condition (one might also call it the non-convexity condition):

f̂_x(s) < f̂_x(u) + ⟨∇f̂_x(u), s − u⟩ − (γ/2) ‖s − u‖².   (NCC)

If (NCC) triggers with a triplet (x, s, u) and ‖s‖ is not too large, we can exploit that fact to generate substantial cost decrease using the negative curvature exploitation algorithm, NCE: see Lemma 4.4. (This is about curvature of the cost function, not the manifold.)
2. In contrast to the Euclidean case in [28], our assumption A2 provides Lipschitz-type guarantees only in a ball of radius 3b around the origin in T_x M. Therefore, we must act if iterates generated by TSS leave that ball. This is done in two places. First, the momentum step in step 4 of TSS is capped so that u_j remains in the ball of radius 2b around the origin. Second, if s_{j+1} leaves the ball of radius b (as checked in step 10), then we terminate this run of TSS by returning to the manifold.
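As a toy illustration of the check (with a hypothetical quadratic standing in for f̂_x; the choice γ = 1 is arbitrary), the condition triggers along a direction of negative curvature of the cost and not along a direction of positive curvature:

```python
import numpy as np

def ncc_triggers(fhat, ghat, s, u, gamma):
    """Check the negative curvature condition between tangent vectors s and u:
    fhat(s) < fhat(u) + <ghat(u), s - u> - (gamma / 2) * ||s - u||^2."""
    d = s - u
    return fhat(s) < fhat(u) + ghat(u) @ d - 0.5 * gamma * (d @ d)

# Hypothetical quadratic pullback with a saddle: one negative-curvature direction
H = np.diag([1.0, -2.0])
fhat = lambda v: 0.5 * v @ H @ v
ghat = lambda v: H @ v

u = np.zeros(2)
along_negative = ncc_triggers(fhat, ghat, np.array([0.0, 1.0]), u, gamma=1.0)
along_positive = ncc_triggers(fhat, ghat, np.array([1.0, 0.0]), u, gamma=1.0)
```

When the check triggers, s − u is a direction along which the cost behaves non-convexly, which NCE then exploits to decrease the cost.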
return argmin_{ṡ ∈ {s_j, s_j + v, s_j − v}} f̂_x(ṡ)
6: end if

Lemma 4.1 guarantees that the iterates indeed remain in appropriate balls, that θ_j (19) in the capped momentum step is uniquely defined, and that if a momentum step is capped, then immediately after that TSS terminates.
The initial momentum v 0 is always set to zero. By default, the AGD sequence is initialized at the origin: s 0 = 0. However, for PTAGD we sometimes want to initialize at a different point (a perturbation away from the origin): this is only relevant for Sect. 6.
In the remainder of this section, we provide four general-purpose lemmas about TSS. Proofs are in Appendix D. We note that TAGD and PTAGD call TSS only at points x where ‖grad f(x)‖ ≤ (1/2)ℓb. The first lemma below notably guarantees that, for such runs, all iterates u_j, s_j generated by TSS remain (a fortiori) in balls of radius 3b, so that the strongest provisions of A2 always apply: we use this fact often without mention.

Lemma 4.1 (TSS stays in balls) Fix parameters and assumptions as laid out in
. . . , u_q (and possibly more), then it also defines vectors s_0, . . . , s_q, and we have: If s_{q+1} is defined, then ‖s_{q+1}‖ ≤ 3b and, if ‖u_q‖ = 2b, then ‖s_{q+1}‖ > b and u_{q+1} is undefined.
Along the iterates of AGD, the value of the cost function f̂_x may not monotonically decrease. Fortunately, there is a useful quantity which does decrease monotonically along iterates: Jin et al. [28] call it the Hamiltonian. In several ways, it serves the purpose of a Lyapunov function. Importantly, the Hamiltonian decreases regardless of any special events that occur while running TSS. It is built as a combination of the cost function value and the momentum. The next lemma makes this precise: we use monotonic decrease of the Hamiltonian often without mention. This corresponds to [28, Lem. 9 and 20].
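Concretely, the Hamiltonian of (20), as it appears later in the proof of Lemma 4.2, is E_j = f̂_x(s_j) + (1/(2η))‖v_j‖². A small numerical illustration of its monotone decrease on a convex quadratic pullback (parameters are illustrative, not the paper's tuned choices):

```python
import numpy as np

# Illustration of Lemma 4.2 on a convex quadratic: along AGD iterates,
# E_j = f(s_j) + (1/(2*eta)) ||v_j||^2 decreases monotonically, even though
# f(s_j) itself need not. Here f(s) = 0.5 ||s||^2, so the gradient is
# 1-Lipschitz and eta = 1/4 matches the step-size convention of the text.

f = lambda s: 0.5 * (s @ s)
grad = lambda s: s

eta, theta = 0.25, 0.1
s, v = np.array([1.0, -0.5]), np.zeros(2)
energies = []
for _ in range(50):
    energies.append(f(s) + (v @ v) / (2 * eta))
    u = s + (1 - theta) * v          # momentum step
    s_next = u - eta * grad(u)       # gradient step taken at u
    v, s = s_next - s, s_next        # new momentum v_{j+1} = s_{j+1} - s_j
```

Running this, the list `energies` is non-increasing, while the raw cost values f(s_j) may oscillate.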

Lemma 4.2 (Hamiltonian decrease) Fix parameters and assumptions as laid out in
If E_{j+1} is defined, then E_j, θ_j and u_j are also defined and: Jin et al. [28] formalize an important property of TSS sequences in the Euclidean case, namely, the fact that "either the algorithm makes significant progress or the iterates do not move much." They call this the improve-or-localize phenomenon. The next lemma states this precisely in our context. This corresponds to [28, Cor. 11].

Lemma 4.3 (Improve or localize) Fix parameters and assumptions as laid out in
. . . , s_q (and possibly more), then E_0, . . . , E_q are defined by (20) and, for all 0 ≤ q′ ≤ q, As outlined earlier, in case the TSS sequence witnesses non-convexity in f̂_x through the (NCC) check, we call upon the NCE algorithm to exploit this event. The final lemma of this section formalizes the fact that this yields appropriate cost improvement. (Indeed, if ‖s_j‖ > L one can argue that sufficient progress was already achieved; otherwise, the lemma applies and we get a result from E_j ≤ E_0 = f̂_x(s_0).) This corresponds to [28, Lem. 10 and 17].

Lemma 4.4 (Negative curvature exploitation) Fix parameters and assumptions as laid
so that s_j, v_j are also defined, and E_j is defined by (20).

First-Order Critical Points
Our algorithm to compute ε-approximate first-order critical points on Riemannian manifolds is TAGD: this is a deterministic algorithm which does not require access to the Hessian of the cost function. Our main result regarding TAGD, namely Theorem 5.1, states that it does so in a bounded number of iterations. As worked out in Theorem 1.3, this bound scales as ε^{−7/4}, up to polylogarithmic terms. The complexity is independent of the dimension of the manifold.
The proof of Theorem 5.1 rests on two propositions introduced hereafter in this section. Interestingly, it is only in the proof of Theorem 5.1 that we track the behavior of iterates of TAGD across multiple points on the manifold. This is done by tracking the decrease of the value of the cost function f. All supporting results (lemmas and propositions) handle a single tangent space at a time. As a result, lemmas and propositions fully benefit from the linear structure of tangent spaces. This is why we can salvage most of the Euclidean proofs of Jin et al. [28], up to mostly minor (but numerous and necessary) changes.
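The two-case structure of TAGD can be sketched as follows. This is a minimal Euclidean specialization (M = R^n, so the retraction is the identity and every tangent space is R^n itself), with a bare-bones AGD loop standing in for TSS; the names and numbers (tss_threshold, T, eta, theta, eps) are illustrative placeholders, not the paper's tuned parameters:

```python
import numpy as np

def tss(grad, x, eta=0.1, theta=0.1, T=50):
    # Bare-bones stand-in for TSS: plain AGD on the pullback u -> f(x + u),
    # run for T iterations from the origin (no NCC/NCE safeguards, no capping).
    s, v = np.zeros_like(x), np.zeros_like(x)
    for _ in range(T):
        u = s + (1 - theta) * v              # momentum step
        s_next = u - eta * grad(x + u)       # gradient step at u
        v, s = s_next - s, s_next
    return s

def tagd(grad, x, eps=1e-6, eta=0.1, tss_threshold=1.0, max_iter=10_000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break                            # approximate first-order critical point
        if np.linalg.norm(g) > tss_threshold:
            x = x - eta * g                  # Case 1: one (Riemannian) gradient step
        else:
            x = x + tss(grad, x)             # Case 2: AGD in the tangent space, then "retract"
    return x

A = np.diag([1.0, 2.0])
x_final = tagd(lambda z: A @ z, np.array([3.0, 2.0]))
```

On this convex quadratic the sketch terminates with a small gradient; on a genuinely non-convex cost, the NCC/NCE safeguards and the ball-capping of the real TSS are essential.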

Theorem 5.1 Fix parameters and assumptions as laid out in Sect. 3, with
Running the algorithm requires at most 2T_1 pullback gradient queries and 3T_1 function queries (but no Hessian queries), and a similar number of calls to the retraction.

Proof of Theorem 5.1
The call to TAGD(x_0) generates a sequence of points x_{t_0}, x_{t_1}, x_{t_2}, . . . on M, with t_0 = 0. A priori, this sequence may be finite or infinite. Considering two consecutive indices t_i and t_{i+1}, we either have t_{i+1} = t_i + 1 (if the step from x_{t_i} to x_{t_{i+1}} is a single gradient step (Case 1)) or t_{i+1} = t_i + T (if that same step is obtained through a call to TSS (Case 2)). Moreover:
• In Case 1, Proposition 5.2 applies and guarantees
• In Case 2, Proposition 5.3 applies and guarantees that if ‖grad f(x_{t_{i+1}})‖ > ε then
It is now clear that TAGD(x_0) terminates after a finite number of steps. Indeed, if it does not, then the above reasoning shows that the algorithm produces an amortized decrease in the cost function f of E/T per unit increment of the counter t, yet the value of f cannot decrease by more than f(x_0) − f_low because f is globally lower-bounded by f_low.
Accordingly, assume TAGD(x_0) generates x_{t_0}, . . . , x_{t_k} and terminates there, returning x_{t_k}. We know that f(x_{t_k}) ≤ f(x_0) and ‖grad f(x_{t_k})‖ ≤ ε. Moreover, from the discussion above and t_0 = 0, we know that

How much work does it take to run the algorithm? Each (regular) gradient step requires one gradient query and increases the counter by one. Each run of TSS requires at most 2T gradient queries and 2T + 3 ≤ 3T function queries (3 ≤ T because T is a positive integer multiple of 4) and increases the counter by T. Therefore, by the time TAGD produces x_t it has used at most 2t gradient queries and 3t function queries.
The following two propositions form the backbone of the proof of Theorem 5.1. Each handles one of the two possible cases in one (outer) iteration of TAGD, namely: Case 1 is a "vanilla" Riemannian gradient descent step, while Case 2 is a call to TSS to run (modified) AGD in the current tangent space. The former has a short and standard proof. The latter relies on all lemmas from Sect. 4 and on two additional lemmas introduced in Appendix F, all following Jin et al. [28].

Proposition 5.2 (Case 1) Fix parameters and assumptions as laid out in Sect. 3.

Proof of Proposition 5.2 This follows directly by property 4 in A2 with f̂
by properties of retractions, and also using η = 1/(4ℓ): To conclude, it remains to use that (7/8)M² ≥ E/T, as shown in Lemma C.1.
The next proposition corresponds mostly to [28,Lem. 12]. A proof is in Appendix F.

Second-Order Critical Points
As discussed in the previous section, TAGD produces ε-approximate first-order critical points at an accelerated rate, deterministically. Such a point might happen to be an approximate second-order critical point, or it might not. In order to produce approximate second-order critical points, PTAGD builds on top of TAGD as follows. Whenever TAGD produces a point with gradient norm smaller than ε, PTAGD generates a random vector ξ close to the origin in the current tangent space and runs TSS starting from that perturbation. The run of TSS itself is deterministic. However, the randomized initialization has the following effect: if the current point is not an approximate second-order critical point, then with high probability the sequence generated by TSS produces significant cost decrease. Intuitively, this is because the current point is then an approximate saddle point, and gradient descent-type methods slowly but likely escape saddles. If this happens, we simply proceed with the algorithm. Otherwise, we can be reasonably confident that the point from which we ran the perturbed TSS is an approximate second-order critical point, and we terminate there.
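The escape mechanism can be illustrated on a toy strict saddle. Below, a pullback on R² with a saddle at the origin is perturbed randomly and then descended; plain gradient steps stand in for the TSS run, and the radius r, budget T and step size eta are illustrative, not the paper's tuned parameters:

```python
import numpy as np

# Hedged sketch of PTAGD's perturbation step at a suspected saddle, for the
# toy pullback f(s) = (s_1^2 - s_2^2)/2 with a strict saddle at the origin.
# A small random perturbation followed by descent produces cost decrease
# with high probability (here deterministically, via a fixed seed).

def escape_attempt(f, grad, r=1e-3, T=150, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    s = r * rng.standard_normal(2)     # perturbation near the origin (Gaussian for simplicity)
    for _ in range(T):                 # plain gradient steps stand in for TSS
        s = s - eta * grad(s)
    return f(s) - f(np.zeros(2))       # cost change relative to the saddle

f = lambda s: 0.5 * (s[0] ** 2 - s[1] ** 2)
grad = lambda s: np.array([s[0], -s[1]])
```

The unstable direction of the saddle is amplified geometrically by the descent steps, so the returned cost change is negative unless the perturbation has exactly zero component along it (a probability-zero event).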
Our main result regarding PTAGD, namely Theorem 6.1, states that it computes approximate second-order critical points with high probability in a bounded number of iterations. As worked out in Theorem 1.6, this bound scales as ε^{−7/4}, up to polylogarithmic terms which include a dependency on the dimension of the manifold and the probability of success.
Mirroring Sect. 5, the proof of Theorem 6.1 rests on the two propositions of that section and on an additional proposition introduced hereafter in this section.
To reach termination, the algorithm requires at most 2T_2 pullback gradient queries and 4T_2 function queries (but no Hessian queries), and a similar number of calls to the retraction.
Notice how this result gives a (probabilistic) guarantee about the smallest eigenvalue of the Hessian of the pullback f̂_x at 0 rather than about the Hessian of f itself at x. Owing to Lemma 2.5, the two are equal in particular when we use the exponential retraction (more generally, when we use a second-order retraction): see also [13, §3.5].
Proof of Theorem 6.1 The proof starts the same way as that of Theorem 5.1. The call to PTAGD(x_0) generates a sequence of points x_{t_0}, x_{t_1}, x_{t_2}, . . . on M, with t_0 = 0. A priori, this sequence may be finite or infinite. Considering two consecutive indices t_i and t_{i+1}, we either have t_{i+1} = t_i + 1 (if the step from x_{t_i} to x_{t_{i+1}} is a single gradient step (Case 1)) or t_{i+1} = t_i + T (if that same step is obtained through a call to TSS, with or without perturbation (Cases 3 and 2, respectively)). Moreover:
• In Case 1, Proposition 5.2 applies and guarantees The algorithm does not terminate here.
and the algorithm does not terminate here.
) ≥ 0 and the step from x_{t_{i+1}} to x_{t_{i+2}} does not fall in Case 2: it must fall in Case 3. (Indeed, it cannot fall in Case 1 because the fact that a Case 2 step occurred tells us < 2M.) The algorithm terminates with In other words, if the algorithm does not terminate with x_{t_{i+1}}, then
• In Case 3, the algorithm terminates with x_{t_i} unless
Clearly, PTAGD(x_0) must terminate after a finite number of steps. Indeed, if it does not, then the above reasoning shows that the algorithm produces an amortized decrease in the cost function f of E/(4T) per unit increment of the counter t, yet the value of f cannot decrease by more than f(x_0) − f_low.
Accordingly, assume PTAGD(x_0) generates x_{t_0}, . . . , x_{t_{k+1}} and terminates there (returning x_{t_k}). The step from x_{t_k} to x_{t_{k+1}} necessarily falls in Case 3: t_{k+1} − t_k = T. The step from x_{t_{k−1}} to x_{t_k} could be of any type. If it falls in Case 2, it could be that the decrease is as small as zero and that t_k − t_{k−1} = T. (All other scenarios are better, in that the cost function decreases more, and the counter increases as much or less.) Moreover, for all steps prior to that, each unit increment of t brings about an amortized decrease in f of E/(4T). Thus, t_{k+1} ≤ t_{k−1} + 2T and Combining, we find What can we say about the point that is returned, x_{t_k}? Deterministically, f(x_{t_k}) ≤ f(x_0) and ‖grad f(x_{t_k})‖ ≤ ε (notice that we cannot guarantee the same about x_{t_{k+1}}). Let us now discuss the role of randomness.
In any run of PTAGD(x_0), there are at most T_2/T perturbations, that is, "Case 3" steps. By Proposition 6.2, the probability of any single one of those steps failing to prevent termination at a point where the smallest eigenvalue of the Hessian of the pullback at the origin is strictly less than −√(ρε) is at most δE 3 f. Thus, by a union bound, the probability of failure in any given run of PTAGD(x_0) is at most (we use In all other events, we have λ_min(∇²f̂_{x_{t_k}}(0)) ≥ −√(ρε). For accounting of the maximal amount of work needed to run PTAGD(x_0), use reasoning similar to that at the end of the proof of Theorem 5.1, adding the cost of checking the condition "f(x_t) − f(x_{t+T}) < (1/2)E" after each perturbed call to TSS. Note: the inequality The next proposition corresponds mostly to [28, Lem. 13]. A proof is in Appendix G.

Conclusions and Perspectives
Our main complexity results for TAGD and PTAGD (Theorems 1.3 and 1.6) recover known Euclidean results when M is a Euclidean space. In particular, they retain the important properties of scaling essentially with ε^{−7/4} and of being either dimension-free (for TAGD) or almost dimension-free (for PTAGD). Those properties extend as is to the Riemannian case.
However, our Riemannian results are negatively impacted by the Riemannian curvature of M, and also by the covariant derivative of the Riemann curvature endomorphism. We do not know whether such a dependency on curvature is necessary to achieve acceleration. In particular, the non-accelerated rates for Riemannian gradient descent, Riemannian trust-regions and Riemannian adaptive regularization with cubics under Lipschitz assumptions do not suffer from curvature [2,13].
Curvature enters our complexity bounds through our geometric results (Theorem 2.7). For the latter, we do believe that curvature must play a role. Thus, it is natural to ask: Can we achieve acceleration for first-order methods on Riemannian manifolds with weaker (or without) dependency on the curvature of the manifold?
For the geodesically convex case, all algorithms we know of are affected by curvature [3,4,5,48]. Additionally, Hamilton and Moitra [26] show that curvature can significantly slow down convergence rates in the geodesically convex case with noisy gradients.
Adaptive regularization with cubics (ARC) may offer insights in that regard. ARC is a cubically regularized approximate Newton method with optimal iteration complexity on the class of cost functions with Lipschitz continuous Hessian, assuming access to gradients and Hessians [19,39]. Specifically, assuming f has ρ-Lipschitz continuous Hessian, ARC finds an (ε, √(ρε))-approximate second-order critical point in at most O(ε^{−3/2}) iterations [19,39, (16), (26)]. Note that this is dimension-free and curvature-free. Each iteration, however, requires solving a separate subproblem more costly than a gradient evaluation. Carmon and Duchi [18, §3] argue that it is possible to solve the subproblems accurately enough so as to find ε-approximate first-order critical points with ∼ 1/ε^{7/4} Hessian-vector products overall, with randomization and a logarithmic dependency on dimension. Compared to TAGD, this has the benefit of being curvature-free, at the cost of randomization, a logarithmic dimension dependency, and of requiring Hessian-vector products. The latter could conceivably be approximated with finite differences of gradients. Perhaps that operation leads to losses tied to curvature? If not, as it is unclear why there ought to be a trade-off between curvature dependency and randomization, this may be an indication that the curvature dependency is not necessary for acceleration.

On a distinct note, and as pointed out in the introduction, TAGD and PTAGD are theoretical constructs. Despite having the theoretical upper hand in worst-case scenarios, we do not expect them to be competitive against time-tested algorithms such as Riemannian versions of nonlinear conjugate gradients or trust-region methods. It remains an interesting open problem to devise a truly practical accelerated first-order method on manifolds.
In the Euclidean case, Carmon et al. [15] showed that if one assumes that not only the gradient and the Hessian of f but also the third derivative of f are Lipschitz continuous, then it is possible to find ε-approximate first-order critical points in just O(ε^{−5/3}) iterations. We suspect that our proof technique could be used to prove a similar result on manifolds, possibly at the cost of also assuming a bound on the second covariant derivative of the Riemann curvature endomorphism.
Funding Open access funding provided by EPFL Lausanne.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Parallel Transport vs Differential of Exponential Map
In this section, we give a proof of Proposition 2.8 regarding the difference between parallel transport along a geodesic and the differential of the exponential map. We use these families of functions parameterized by K_low ∈ R: if K_low = 0, Under the assumptions we make below, these functions are only ever evaluated at points where they are nonnegative. In all cases, the functions are dominated by the case K_low < 0; formally, for all K_low ∈ R, all K ≥ |K_low| and all t ≥ 0: If K_low ≥ 0 and t ≥ 0, then Independently of the sign of K_low, if 0 ≤ t ≤ π/√|K_low|, then For t bounded as indicated, this last line shows that, up to constants, the sign of K_low does not substantially affect bounds.
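The displayed definition of h_{K_low} (25) was lost in extraction. Given the later use of the Jacobi field comparison theorem [32, Thm. 11.9] in the form ‖J⊥(t)‖ ≤ h_{K_low}(t)‖ṡ⊥‖, a standard choice consistent with the text is the following; this reconstruction is an assumption, not the paper's display:

```latex
h_{K_{\mathrm{low}}}(t) \;=\;
\begin{cases}
t, & K_{\mathrm{low}} = 0,\\[4pt]
\dfrac{\sin\!\left(\sqrt{K_{\mathrm{low}}}\,t\right)}{\sqrt{K_{\mathrm{low}}}}, & K_{\mathrm{low}} > 0,\\[10pt]
\dfrac{\sinh\!\left(\sqrt{|K_{\mathrm{low}}|}\,t\right)}{\sqrt{|K_{\mathrm{low}}|}}, & K_{\mathrm{low}} < 0.
\end{cases}
```

With this choice, h is dominated by the K_low < 0 case, is nonnegative for 0 ≤ t ≤ π/√|K_low|, and the restriction on t matches the conjugate-point discussion below.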
To state our result, we need the notion of conjugate points along geodesics on Riemannian manifolds. The following definition is equivalent to the standard one [32]. Note that (25) is only ever evaluated at points where it is nonnegative.
We now state and prove the main result of this section. A similar result appears in [45, Lem. 6] for general retractions. Constants there are not explicit (they are absorbed in O(·) notation). Their proof is based on Taylor expansions of the differential of the exponential map as they appear in [46, Thm. A.2.9], namely, for s → DExp_x(s) around s = 0. In the next section, we investigate a situation away from s = 0. In the appendices, we typically omit subscripts for inner products and norms.
where ṡ⊥ = ṡ − (⟨s, ṡ⟩/⟨s, s⟩) s is the component of ṡ orthogonal to s, T_s = DExp_x(s) and P_{ts} denotes parallel transport along γ from γ(0) to γ(t). (The inequality holds with equality if K_low = K_up.) If it also holds that ‖s‖ ≤ π/√|K_low|, then

Proof For convenience, we consider ‖s‖ = 1: the result follows by a simple rescaling of t. Given any tangent vector ṡ ∈ T_x M, consider the following smooth vector field along γ: By [32, Prop. 10.10], this is the unique Jacobi field satisfying the initial conditions where D_t is the covariant derivative along curves induced by the Riemannian connection. Thus, J is smooth and obeys the ordinary differential equation (ODE) known as the Jacobi equation: where R denotes Riemannian curvature. Fix e_d = s and pick e_1, . . . , e_{d−1} so that e_1, . . . , e_d form an orthonormal basis for T_x M. Parallel transport this basis along γ as so that E_1(t), . . . , E_d(t) form an orthonormal basis for T_{γ(t)} M. Expand J as with uniquely defined smooth, real functions a_1, . . . , a_d. Plugging this expansion into the Jacobi equation yields the ODE where we used the Leibniz rule on D_t, the fact that D_t E_i = 0, linearity of the Riemann curvature endomorphism in its inputs, and the fact that Taking an inner product of this ODE against each one of the fields E_j(t) yields d ODEs: Furthermore, the initial conditions fix a_i(0) = 0 and a_i′(0) = ⟨ṡ, e_i⟩ for i = 1, . . . , d.
Owing to symmetries of Riemannian curvature, the summation above can be restricted to the range 1, . . . , d − 1. For the same reason, a_d″(t) = 0, so that It remains to solve for the first d − 1 coefficients (they are decoupled from a_d). This effectively splits the solution J into two fields: one tangent (aligned with γ′), and one normal (orthogonal to γ′): The normal part is the Jacobi field with initial conditions J⊥(0) = 0 and D_t J⊥(0) = ṡ⊥, where ṡ⊥ = ṡ − ⟨ṡ, s⟩ s is the component of ṡ orthogonal to s.

M_{ji}(t) = ⟨R(E_i(t), E_d(t))E_d(t), E_j(t)⟩.
(41) Then, equations in (38) for j = 1, . . . , d − 1 can be written succinctly as Since a(t) is smooth, it holds that Initial conditions specify a(0) = 0, so that (with ‖·‖ also denoting the standard Euclidean norm and associated operator norm on real spaces): The left-hand side is exactly what we seek to control. Indeed, initial conditions ensure ṡ = a_1′(0)e_1 + · · · + a_d′(0)e_d, and: For the right-hand side of (44), first note that M(t) is a symmetric matrix owing to the symmetries of R. Additionally, for any unit-norm z ∈ R^{d−1}, where v = z_1 E_1(t) + · · · + z_{d−1} E_{d−1}(t) is a tangent vector at γ(t): it is orthogonal to γ′(t) and also has unit norm. By definition of sectional curvature K(·, ·) (10), it follows that zᵀM(t)z = K(v, γ′(t)).
By symmetry of M(t), we conclude that where K ≥ 0 is such that all sectional curvatures of M along γ are in the interval [−K, K]. Going back to (44), we have so far shown that It remains to bound ‖a(θ)‖. By (40), we see that ‖a(t)‖ = ‖J⊥(t)‖. By the Jacobi field comparison theorem [32, Thm. 11.9b] and our assumed lower bound on sectional curvature, we can now claim that, for t ≥ 0, with h_{K_low}(t) as defined by (25), provided γ has no interior conjugate point on [0, t]. Combining with (48) and with the definitions of h_{K_low} (25), g_{K_low} (26) and f_{K_low} (27), we find It only remains to divide through by t, and to rescale s so that t plays the role of ‖s‖. For the special case where K_up = K_low = ±K (constant sectional curvature), one can show (for example by polarization) that M(t) = ±K I_{d−1}, that is, M(t) is a multiple of the identity matrix. As a result, the ODEs separate and are easily solved (see also [32, Prop. 10.12]). Explicitly, with ‖s‖ = 1, where ṡ∥ = ⟨ṡ, s⟩ s is the component of ṡ parallel to s. Hence, and the claim follows easily after dividing through by t and rescaling.
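As a numerical sanity check of the coordinate ODE above (an illustration, not part of the proof): in constant curvature K, each normal coordinate solves a″(t) = −K a(t), and for K = 1 (e.g. the unit sphere) with a(0) = 0, a′(0) = 1 the solution is sin t, matching the comparison function at the heart of the argument. A velocity-Verlet integration reproduces it:

```python
import math

# Integrate a''(t) = -K a(t) with a(0) = 0, a'(0) = 1 on [0, 1] for K = 1,
# using velocity Verlet, and compare against the closed form sin(t).

K = 1.0
dt, n_steps = 1e-4, 10_000
a, da = 0.0, 1.0
for _ in range(n_steps):
    a_new = a + dt * da - 0.5 * dt * dt * K * a          # position update
    da += -0.5 * dt * K * (a + a_new)                    # velocity update (averaged accelerations)
    a = a_new
```

After 10,000 steps, `a` agrees with sin(1) to well below 1e-7, consistent with the second-order accuracy of the scheme.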
As a continuation of the previous proof and in anticipation of our needs in Appendix B, we provide a lemma controlling the Jacobi field J and its covariant derivative, assessing both the full field and its normal component.

Lemma A.4
Let M be a Riemannian manifold whose sectional curvatures are in the interval [K_low, K_up], and let K = max(|K_low|, |K_up|). Consider (x, s) ∈ TM with ‖s‖ = 1 and the geodesic γ(t) = Exp_x(ts). Given a tangent vector ṡ ∈ T_x M, consider the Jacobi field J defined by (40): where J⊥ is the Jacobi field along γ with initial conditions J⊥(0) = 0 and D_t J⊥(0) = ṡ⊥, and ṡ⊥ = ṡ − ⟨ṡ, s⟩ s is the component of ṡ orthogonal to s. For t ≥ 0 such that γ is defined and has no interior conjugate point on the interval [0, t], the following inequalities hold: where h_{K_low}(t) and g_{K_low}(t) are defined by (25) and (26).

Proof
The proof is a continuation of that of Proposition A.3. Using notation as in there, Since J⊥ and D_t J⊥ are orthogonal to γ′ = E_d, we know that The bound ‖J⊥(t)‖ ≤ h_{K_low}(t)‖ṡ⊥‖ appears explicitly as (49). With α denoting the angle between s and ṡ, we may write ⟨ṡ, s⟩² = cos(α)²‖ṡ‖² and ‖ṡ⊥‖² = sin(α)²‖ṡ‖², so that Since the maximum of α → cos(α)² + q sin(α)² with q ∈ R is max(1, q), we find for t ≥ 0 that With the same tools, we may also bound D_t J = ⟨ṡ, s⟩ γ′ + D_t J⊥. Indeed, its coordinates in the frame E_1, . . . , E_d are given by a_1′, . . . , a_d′ with a_d′(t) = ⟨ṡ, s⟩, so that where a(t) ∈ R^{d−1} collects the first d − 1 coordinates. Moreover, Combining with the fact that ‖a′(0)‖ = ‖ṡ⊥‖, we get as announced. We now conclude along the same lines as above with Since max(1, 1 + K g_{K_low}(t)) = 1 + K g_{K_low}(t), we reach the desired conclusion.

B Controlling the Initial Acceleration c″(0)
In this section, we build a proof for Proposition 2.10, whose aim is to control the initial intrinsic acceleration c″(0) of the curve c(t) = Exp_x(s + tṡ). Since c′(t) = DExp_x(s + tṡ)[ṡ], we can think of this result as giving us access to a second derivative of the exponential map Exp_x away from the origin. As a first step, we build an ODE whose solution encodes c″(0).
The smooth vector field W along γ defined by the linear ODE with initial conditions W(0) = 0 and D_t W(0) = 0 is also defined on the same domain as γ. This vector field is related to the initial intrinsic acceleration of the curve c_{ts,ṡ} as follows:

Furthermore, the vector field H is equivalently defined as
where J⊥ is the Jacobi field along γ with initial conditions J⊥(0) = 0 and D_t J a variation through geodesics of the geodesic Then, is the Jacobi field along γ with initial conditions J(0) = 0 and D_t J(0) = ṡ: the same field we considered in the proof of Proposition A.3. Further consider another smooth vector field along γ. This field is related to the acceleration of curves of the form because c_{ts,tṡ}(q) = (q, t). Specifically, To verify the last equality, differentiate the identity c_{ts,tṡ}(q) = c_{ts,ṡ}(tq) twice with respect to q, with the chain rule. This shows in particular that Our goal is to derive a second-order ODE for W. In so doing, we repeatedly use the two following results from Riemannian geometry, which allow us to commute certain derivatives:
• [32, Prop. 7.5] For every smooth vector field V along (meaning V(q, t) is tangent to M at (q, t)), where R is the Riemann curvature endomorphism.
• [32, Lem. 6.2] The symmetry lemma states With the link between W and D_q ∂_q in mind, we compute a first derivative with respect to t: then a second derivative: Our goal is to evaluate this expression at q = 0, in which case the left-hand side yields D_t²W. However, it is unclear how to evaluate the first term on the right-hand side at q = 0. Focusing on that term for now, apply the commutation rule to the first two derivatives: Focusing on the first term once more, apply the symmetry lemma, then the commutation rule: To reach the last equality, we used that D_t ∂_t vanishes identically since t → (q, t) is a geodesic for every fixed q. Combining, we find Using the chain rule for tensors as in (11) (see also [32, pp. 95-103] or [41, Def. 3.17]), we can further expand the right-most term: It is now easier to evaluate the whole expression at q = 0: using repeatedly, and also D_q ∂_t = D_t ∂_q twice so that it evaluates to D_t J at q = 0, we find This is now an ODE in the single variable t, involving smooth vector fields J, W and γ′ along the geodesic γ. We may apply the chain rule for tensors again (we could just as well have done this earlier): here too simplifying one term since γ″ vanishes. The algebraic Bianchi identity [32, p. 203] yields (we also used anti-symmetry of R): Overall, The Jacobi field J splits into its tangent and normal parts (40): Since R(γ′, γ′) = 0 by anti-symmetry of R, and since for the same reason (∇·R)(γ′, γ′) = 0 as well, by linearity, we may simplify H to: This concludes the proof.
To reach our main result, it remains to bound the solutions of the ODE in W . In order to do so, we notably need to bound the inhomogeneous term H . For that reason, we require a bound on the covariant derivative of Riemannian curvature.
Theorem B.2 Let M be a Riemannian manifold whose sectional curvatures are in the interval [K_low, K_up], and let K = max(|K_low|, |K_up|). Also assume ∇R (the covariant derivative of the Riemann curvature endomorphism) is bounded by F in operator norm. Pick any (x, s) ∈ TM such that the geodesic γ(t) = Exp_x(ts) is defined for all t ∈ [0, 1], and such that with some constants C ≤ π and C′. For any ṡ ∈ T_x M, the curve has initial acceleration bounded as where ṡ⊥ = ṡ − (⟨s, ṡ⟩/⟨s, s⟩) s is the component of ṡ orthogonal to s and W̄ ∈ R is only a function of C and C′. In particular, for C, C′ ≤ 1/4, we have W̄ ≤ 3/2.
Proof By Remark A.2, since C ≤ π we know that γ has no interior conjugate point on [0, 1]. Since the claim is clear for either s = 0 or ṡ = 0, assume ‖s‖ = 1 for now (we rescale at the end) and ṡ ≠ 0. We also assume K > 0: the case K = 0 follows easily by inspection of the proof below. Following Proposition B.1, the goal is to bound ‖W‖: the solution of an ODE with right-hand side given by the vector field H. As we did in earlier proofs, pick an orthonormal basis e_1, . . . , e_d for T_x M with e_d = s and transport it along γ as E_i(t) = P_{ts}(e_i). We expand W and H as This allows us to write the ODE in coordinates: where M(t) is as in (41). Thus, To proceed, we need a bound on ‖H(t)‖ = ‖h(t)‖ and a first bound on ‖W(t)‖. We will then improve the latter by bootstrapping. Let us first bound H. Following [29, eq. (9)], we know that R is bounded (as an operator) as follows: where X, Y, Z are arbitrary vector fields along γ. We further assume that for some finite F ≥ 0. Then, Since ‖γ′(t)‖ = ‖s‖ = 1 for all t, this expression simplifies somewhat. Using Lemma A.4, we can also bound all terms involving J and J⊥, so that, also using K_0 ≤ 2K, for all t ≥ 0. Assuming 0 ≤ √K t ≤ C for some C > 0, we find with a = 8 sinh(C) cosh(C) C and b = 2 sinh(C)² C². Let us further assume that 0 ≤ t ≤ C′K/F. Then, Ft ≤ C′K and we write: Let us now obtain a first crude bound on ‖W(t)‖. To this end, introduce Then, Let g(t) = ‖z(t)‖². Then, g(0) = ‖ṡ‖²‖ṡ⊥‖²/K and Grönwall's inequality states that By the triangle inequality and using Thus, ‖z(t)‖² can be bounded above and below: Using our bound on ‖H(t)‖ (68), we find Using √K t ≤ C again, we deduce this crude bound: We now return to (61) and plug in our bounds for H (68) and W (70) to get an improved bound on W: assuming t satisfies the stated conditions, Plug this new and improved bound on W into (61) once again to get: We could now bound √K t and K t² by C and C², respectively, and stop here.
However, this yields a constant which can be quite large. Instead, we plug our new bound into (61) again, repeatedly. Doing so infinitely many times, we obtain a sequence of upper bounds, all of them valid. The limit of these bounds exists, and is hence also a valid bound. It is tedious but not difficult to check that this reasoning leads to the following: It is clear that the series converges. Let z̄ be the value it converges to; then: All in all, we conclude that For example, with C, C′ ≤ 1/4, we have W̄ ≤ 3/2. From Proposition B.1, we know that for the curve c_{ts,ṡ}(q) = Exp_x(ts + qṡ) (recall that s has unit norm) it holds that W(t) = t² c_{ts,ṡ}″(0). Thus, Allowing s to have norm different from one and rescaling t, we conclude that for the curve we have

C Lemma About Parameter Relations
As a general comment: here and throughout, constants are not optimized. In part, this is so that there is leeway in the precise definition of parameters. For example, the step-size η does not need to be exactly equal to 1/(4ℓ), but it is convenient to assume equality to simplify many tedious computations.
Lemma C.1 With parameters and assumptions as laid out in Sect. 3, the following hold:

D Proofs from Sect. 4 About AGD in a Ball of a Tangent Space
We give a proof of the lemma which states that iterates generated by TSS remain in certain balls. Such a lemma is not necessary in the Euclidean case.

Proof of Lemma 4.1
Because of how TSS works, if it defines u_j for some j, then s_j must have already been defined. Moreover, if ‖s_{j+1}‖ > b, then the algorithm terminates before defining u_{j+1}. It follows that if u_0, . . . , u_q are defined, then ‖s_0‖, . . . , ‖s_q‖ are all at most b. Also, TSS ensures ‖u_0‖, . . . , ‖u_q‖ are all at most 2b by construction.
Recall that θ = 1/(4√κ). From Lemma C.1 we know κ ≥ 2, so that θ ≤ 1. Moreover, 2ηγ = 1/(8κ) = θ/(2√κ) ≤ θ. It follows that θ_j as presented in (19) is well defined in the interval [θ, 1]. Indeed, either ‖s_j + (1 − θ)v_j‖ ≤ 2b, in which case θ_j = θ; or the line segment connecting s_j to s_j + (1 − θ)v_j intersects the boundary of the sphere of radius 2b at exactly one point. By definition, this happens at where we used the fact that ‖u_j‖ ≤ 2b and that ∇f̂_x is ℓ-Lipschitz continuous in the ball of radius 3b around the origin (by A2), the fact that grad f(x) = ∇f̂_x(0), and the fact that η = 1/(4ℓ) by definition of η. Consequently, if s_{q+1} is defined, then If additionally it holds that ‖u_q‖ = 2b, then (Mind the strict inequality: this one will matter.) We give a proof of the lemma which states that the Hamiltonian is monotonically decreasing along iterations.
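The capped momentum step can be made concrete: either the default θ keeps u_j = s_j + (1 − θ_j)v_j inside the ball of radius 2b, or θ_j is the unique value placing u_j exactly on that sphere. A sketch (the function name and the explicit quadratic-root computation are ours, not the paper's pseudocode; it assumes ‖s_j‖ ≤ b and v_j ≠ 0 in the capped branch, as Lemma 4.1 guarantees):

```python
import numpy as np

def capped_theta(s, v, b, theta_min):
    # Default momentum coefficient, unless s + (1 - theta_min) v leaves the
    # ball of radius 2b; in that case, pick the unique theta in [theta_min, 1]
    # with || s + (1 - theta) v || = 2b.
    if np.linalg.norm(s + (1 - theta_min) * v) <= 2 * b:
        return theta_min
    # Solve || s + t v ||^2 = (2b)^2 for t = 1 - theta; since ||s|| <= b < 2b,
    # the constant term is negative, so there is exactly one positive root.
    A = v @ v
    B = 2 * (s @ v)
    C = s @ s - 4 * b * b
    t = (-B + np.sqrt(B * B - 4 * A * C)) / (2 * A)
    return 1.0 - t
```

For instance, with s = (0.9, 0), v = (5, 0), b = 1 and theta_min = 0.1, the uncapped step lands at norm 5.4 > 2, and the computed θ_j = 0.78 places u exactly on the sphere of radius 2, matching the uniqueness claim of Lemma 4.1.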

Proof of Lemma 4.2
This follows almost exactly [28,Lem. 9 and 20], with one modification to allow θ j (19) to be larger than 1/2: this is necessary in our setup because we need to cap u j to the ball of radius 2b, requiring values of θ j which can be arbitrarily close to 1.
Since ∇f̂_x is ℓ-Lipschitz continuous in B_x(3b) and u_j, s_{j+1} ∈ B_x(3b), standard calculus and the identity Since ηℓ = 1/4 ≤ 1/2, it follows that Turning to E_{j+1} as defined by (20) and with the identity v_{j+1} = s_{j+1} − s_j, we compute: Notice that Moreover, the fact that s_{j+1} is defined means that (NCC) does not trigger with (x, s_j, u_j); in other words: Combining, we find that Using the identities u_j − s_j = (1 − θ_j)v_j and E_j = f̂_x(s_j) + (1/(2η))‖v_j‖², we can further write: From Lemma 4.1 we know that ηγ ≤ θ_j/2 and that θ_j is in the interval [0, 1]. It is easy to check that the function θ_j ↦ 1 Thus, as announced.
In closing, note that if ‖v_j‖ ≥ ℳ then Lemma C.1 shows which concludes the proof.
We give a proof of the improve-or-localize lemma.

Proof of Lemma 4.3
This follows from [28, Cor. 11], with some modifications for variable θ_j and because we allow θ_j > 1/2. By the triangle inequality then Cauchy–Schwarz, we have Now use the inequality ‖a + b‖² ≤ (1 + C)‖a‖² + ((1 + C)/C)‖b‖² (valid for all vectors a, b and reals C > 0) with C = 2√κ − 1 (positive owing to κ ≥ 1 by Lemma C.1) to see that By construction, we have s_{j+1} = u_j − η∇f̂_x(u_j) and u_j = s_j + (1 − θ_j)v_j. Thus: We focus on the second term: recall from Lemma 4.
This holds a fortiori for all t in [θ, 1] because θ ≤ 1/4 owing to κ ≥ 1. It follows that Apply Lemma 4.2 to the parenthesized expression to deduce that Plug this into the first inequality of this proof to conclude with a telescoping sum.
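The norm inequality invoked in this proof is elementary; as a quick numerical sanity check (illustrative only, with randomly drawn vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_bound_holds(a, b, C):
    # ||a + b||^2 <= (1 + C) ||a||^2 + ((1 + C)/C) ||b||^2 for any C > 0:
    # expand the square and bound 2<a, b> via Young's inequality.
    lhs = np.linalg.norm(a + b) ** 2
    rhs = (1 + C) * np.linalg.norm(a) ** 2 + ((1 + C) / C) * np.linalg.norm(b) ** 2
    return lhs <= rhs + 1e-12

# Try several C, including C = 2*sqrt(kappa) - 1 with kappa = 2 as a sample
# value of the choice made in the proof.
all_hold = all(
    weighted_bound_holds(rng.standard_normal(5), rng.standard_normal(5), C)
    for C in (0.1, 1.0, 2 * np.sqrt(2.0) - 1.0)
    for _ in range(100)
)
```

Note that (1 + C)/C = 1 + 1/C, so this is the usual parameterized triangle inequality; equality is approached as b aligns with C·a.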
We give a proof of the lemma which states that, upon witnessing significant nonconvexity, it is possible to exploit that observation to drive significant decrease in the cost function value.

Proof of Lemma 4.4
This follows almost exactly [28, Lem. 10 and 17]. We need a slight modification because the Hessian ∇²f̂_x may not be Lipschitz continuous in all of B_x(3b): our assumptions only guarantee a type of Lipschitz continuity with respect to the origin of T_x M. Interestingly, even if the last momentum step was capped (that is, if θ_j ≠ θ), something which does not happen in the Euclidean case, the result goes through.
First, consider the case ‖v_j‖ ≥ 𝓈, where 𝓈 is a parameter set in Sect. 3. Then, NCE(x, s_j, v_j) = s_j. It follows from the definition of E_j (20) that Second, consider the case ‖v_j‖ < 𝓈. We know that v_j ≠ 0, as otherwise u_j = s_j + (1 − θ_j)v_j = s_j: this would contradict the assumption that (NCC) triggers with (x, s_j, u_j). Expand f̂_x around u_j in a truncated Taylor series with Lagrange remainder to see that we also know that The last two claims combined yield: as defined in the call to NCE. Let ṽ be either v̄ or −v̄, chosen so that ⟨∇f̂_x(s_j), ṽ⟩ ≤ 0 (at least one of the two choices satisfies this condition). By construction, NCE(x, s_j, v_j) is the element of the triplet {s_j, s_j + v̄, s_j − v̄} where f̂_x is minimized. Since s_j + ṽ belongs to this triplet, it follows through another truncated Taylor series with Lagrange remainder (this time around s_j) that with ζ̃_j = s_j + tṽ for some t ∈ [0, 1]. Since ṽ is parallel to v_j, which itself is parallel to s_j − u_j (by definition of u_j), we deduce from (71) that We aim to use this to work on (72), but notice that ∇²f̂_x is evaluated at two possibly distinct points, namely ζ_j and ζ̃_j: we need to use the Lipschitz properties of the Hessian to relate them. To this end, notice that ζ_j and ζ̃_j both live in B_x(3b). Indeed, ‖ṽ‖ = ‖v̄‖ = 𝓈 ≤ b by Lemma C.1, and ‖s_j‖ ≤ b, ‖u_j‖ ≤ 2b by Lemma 4.1. Thus, ‖ζ_j‖ ≤ ‖s_j‖ + ‖u_j‖ ≤ b + 2b = 3b and ‖ζ̃_j‖ ≤ ‖s_j‖ + ‖ṽ‖ ≤ b + b = 2b.
In contrast to the proof in [28], we have no Lipschitz guarantee for ∇²f̂_x along the line segment connecting ζ_j and ζ̃_j, but A2 still offers such guarantees along the line segments connecting the origin of T_x M to each of ζ_j and ζ̃_j. Thus, we can write: where on the last line we used and also (more directly) that ‖ζ̃_j‖ ≤ ‖s_j‖ + ‖ṽ‖ = ‖s_j‖ + 𝓈. Plugging our findings into (72), it follows that Since f̂_x(s_j) ≤ E_j by definition (20), the main part of the lemma's claim is now proved.
We now turn to the last part of the lemma's claim, for which we further assume ‖s_j‖ ≤ ℒ. Recall from Lemma C.1 that ℒ ≤ 𝓈. We deduce from the main claim that To conclude, use Lemma C.1 anew to bound the right-most term.
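To make the mechanism concrete, here is a minimal sketch of the negative curvature exploitation step as described in the proof above (our own simplified naming; `f_hat` stands for the pullback f̂_x and `s_param` for the parameter 𝓈; this is an illustration, not the paper's exact pseudocode): when the momentum is small, NCE perturbs s_j by ±v̄, a vector of norm 𝓈 parallel to v_j, and keeps whichever of the three candidates has the smallest pullback value.

```python
import numpy as np

def nce(f_hat, s_j, v_j, s_param):
    """Sketch of NCE(x, s_j, v_j): if the momentum is already large,
    return s_j; otherwise try stepping by +/- v_bar (norm s_param,
    parallel to v_j) and return the candidate minimizing f_hat."""
    if np.linalg.norm(v_j) >= s_param:
        return s_j
    v_bar = s_param * v_j / np.linalg.norm(v_j)
    candidates = [s_j, s_j + v_bar, s_j - v_bar]
    return min(candidates, key=f_hat)
```

On a direction of negative curvature, at least one of the two perturbations decreases a locally concave pullback by roughly 𝓈²·|curvature|, which is the source of the "significant decrease" in the lemma.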

E Supporting Lemmas
In this section, we state and prove three additional lemmas about accelerated gradient descent in balls of tangent spaces; they are useful for proofs in subsequent sections. The statements apply more broadly than the setup of parameters and assumptions in Sect. 3, but of course it is under those provisions that the conclusions are useful to us. Throughout this section, we use the following notation. For some x ∈ M, let H = ∇²f̂_x(0). Given s_0 ∈ T_x M, set v_0 = 0 and define for j = 0, 1, 2, …: with some arbitrary θ ∈ [0, 1] and η > 0. Also define s_{−1} = s_0 − v_0 for convenience and where τ ≥ 0 is a fixed index. Notice that iterates generated by TSS(x, s_0) with parameters and assumptions as laid out in Sect. 3 conform to this notation so long as θ_j = θ. Owing to Lemma 4.1, the latter condition holds in particular if TSS runs all its iterations in full, because if at any point θ_j ≠ θ then ‖s_{j+1}‖ > b and TSS terminates early. This is the setting in which we call upon lemmas from this section. The first lemma is a variation on [28, Lem. 18].
Lemma E.1 With notation as above, for all j ≥ 0 we can write Use the definitions of u_k and v_k to verify that u_k = (2 − θ)s_k − (1 − θ)s_{k−1} (we use this several times in subsequent proofs). Plug this into the previous identity to see that Equivalently in matrix form, then reasoning by induction, it follows that This verifies Eq. (76). To prove Eq. (77), observe that (76) together with The last two terms cancel. Indeed, let M_{j−1} To reach the second-to-last line, verify that (A − I)[−I, −I; −I, −I] = [ηH, ηH; 0, 0] using (78). The last line follows by direct calculation.
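The identity u_k = (2 − θ)s_k − (1 − θ)s_{k−1} follows directly from v_k = s_k − s_{k−1}. As an illustrative numerical check (hypothetical quadratic pullback; all names here are ours), one can run the recursion (74) and verify it along the way:

```python
import numpy as np

rng = np.random.default_rng(1)
d, theta, eta = 4, 0.2, 0.1
M = rng.standard_normal((d, d))
H = (M + M.T) / 2                 # stand-in for the Hessian H at the origin
g0 = rng.standard_normal(d)       # stand-in for the gradient at the origin
grad = lambda s: g0 + H @ s       # quadratic model of the pullback gradient

s = [rng.standard_normal(d)]      # s_0
v = np.zeros(d)                   # v_0 = 0
s.insert(0, s[0] - v)             # s_{-1} = s_0 - v_0, as in the text
us = []
for j in range(10):
    u = s[-1] + (1 - theta) * v   # momentum step: u_j = s_j + (1 - theta) v_j
    us.append(u)
    s_next = u - eta * grad(u)    # gradient step: s_{j+1}
    v = s_next - s[-1]            # v_{j+1} = s_{j+1} - s_j
    s.append(s_next)

# u_k = (2 - theta) s_k - (1 - theta) s_{k-1}; list s is offset by 1 (s[0] = s_{-1})
identity_holds = all(
    np.allclose(us[k], (2 - theta) * s[k + 1] - (1 - theta) * s[k])
    for k in range(10)
)
```

This is exactly the substitution that turns the two-term update (74) into the one-step linear recursion in (s_{j+1}, s_j) driven by the matrix A.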
The lemma below is a direct continuation from the lemma above. We use it only for the proof of Lemma G.1.
Lemma E.2 Use notation from Lemma E.1. Given s_0, s′_0 ∈ T_x M, define two sequences {s_j, u_j, v_j} and {s′_j, u′_j, v′_j} by the update equations (74). Let w_j = s_j − s′_j. Then, Proof By Lemma E.1 with τ = 0, both of these identities hold: Taking the difference of these two equations reveals that Conclude with the definition of δ_k.
The next lemma corresponds to [28,Prop. 19]. The claim applies in particular to iterates generated by TSS with parameters and assumptions as laid out in Sect. 3 and R ≤ b, so long as θ j = θ and the s j remain in the appropriate balls. There are a few changes related to indexing and to the fact that our Lipschitz assumptions are limited to balls.

Lemma E.3 Use notation from Lemma E.1. Assume ‖∇²f̂_x(s) − ∇²f̂_x(0)‖ ≤ ρ̂‖s‖ for all s ∈ B_x(3R), with some R > 0, ρ̂ > 0. Also assume ‖s_k‖ ≤ R for all k = q′ − 1, …, q. Then for all k = q′, …, q we have ‖δ_k‖ ≤ 5ρ̂R². Moreover, for all k = q′ + 1, …, q we have Additionally, we can bound their sum as: (Mind the different ranges of summation.) We use this to establish each of the three inequalities. First, by definition of H = ∇²f̂_x(0) and of δ_k, we know that Owing to ‖u_k‖ ≤ 3R, we can use the Lipschitz properties of ∇²f̂_x to find This shows the first inequality for k = q′, …, q.
For the second inequality, first verify that Note that the distance between (1 − φ)u_{k−1} + φu_k and the origin is at most max{‖u_k‖, ‖u_{k−1}‖} for all φ ∈ [0, 1]. Since for k = q′ + 1, …, q we have both ‖u_k‖ ≤ 3R and ‖u_{k−1}‖ ≤ 3R, it follows that ‖(1 − φ)u_{k−1} + φu_k‖ ≤ 3R for all φ ∈ [0, 1]. As a result, we can use the Lipschitz-like properties of ∇²f̂_x and write: From there, it follows that This establishes the second inequality for k = q′ + 1, …, q. The third inequality follows from the second one through squaring and a sum, notably using (a + b)² ≤ 2(a² + b²) for a, b ≥ 0: To conclude, extend the ranges of both sums to q′, …, q.
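Before the closing remarks on the matrix A below, here is an illustrative numerical check of the block-diagonalization J*AJ = diag(A_1, …, A_d). Since (78) and (79) are not reproduced above, we assume (our transcription, following the Euclidean analysis in [28]) that A = [(2−θ)(I−ηH), −(1−θ)(I−ηH); I, 0] and A_m = [(2−θ)(1−ηλ_m), −(1−θ)(1−ηλ_m); 1, 0]:

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta, eta = 3, 0.2, 0.1
lam = np.sort(rng.uniform(-1.0, 1.0, d))           # eigenvalues of H
E = np.linalg.qr(rng.standard_normal((d, d)))[0]   # orthonormal eigenvectors
H = E @ np.diag(lam) @ E.T                         # stand-in for the Hessian
I = np.eye(d)

# Assumed form of A (78), following [28]
A = np.block([[(2 - theta) * (I - eta * H), -(1 - theta) * (I - eta * H)],
              [I, np.zeros((d, d))]])

# J interleaves the eigenvectors: columns (e_m, 0) and (0, e_m)
J = np.zeros((2 * d, 2 * d))
for m in range(d):
    J[:d, 2 * m] = E[:, m]
    J[d:, 2 * m + 1] = E[:, m]

# Expected block-diagonal matrix with 2x2 blocks A_m (79)
expected = np.zeros((2 * d, 2 * d))
for m in range(d):
    expected[2 * m:2 * m + 2, 2 * m:2 * m + 2] = [
        [(2 - theta) * (1 - eta * lam[m]), -(1 - theta) * (1 - eta * lam[m])],
        [1.0, 0.0],
    ]

block_diagonalized = np.allclose(J.T @ A @ J, expected)
```

Each 2×2 block has trace (2−θ)(1−ηλ_m) and determinant (1−θ)(1−ηλ_m), which is what makes the geometric-series arguments of Appendix F tractable eigenvalue by eigenvalue.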
We close this supporting section with important remarks about the matrix A (78), still following [28]. Recall the notation H = ∇²f̂_x(0): this is an operator on T_x M, self-adjoint with respect to the Riemannian inner product on T_x M. Let e_1, …, e_d ∈ T_x M form an orthonormal basis of eigenvectors of H associated to ordered eigenvalues λ_1 ≤ ⋯ ≤ λ_d. We think of A as a linear operator to and from T_x M × T_x M. Conveniently, the eigenvectors of H reveal how to block-diagonalize A. Indeed, from it is a simple exercise to check that J*AJ = diag(A_1, …, A_d) with J = [e_1, 0, e_2, 0, ⋯, e_d, 0; 0, e_1, 0, e_2, ⋯, 0, e_d] and Here, J is a unitary operator from R^{2d} (equipped with the standard Euclidean metric) to T_x M × T_x M, and J* denotes its adjoint (which is also its inverse). In particular, it becomes straightforward to investigate powers of A: For m, m′ in {1, …, d} we have the useful identities where (A_m^k)_{11} is the top-left entry of the 2 × 2 matrix (A_m)^k. Likewise, Additionally, one can also check that [28, Lem. 24]:

F Proofs from Sect. 5 About TAGD

F.1 Proof of Proposition 5.3
The next two lemmas support Proposition 5.3; proofs are in Appendix F.2. They correspond to [28, Lem. 21 and 22]. Notice that it is in Lemma F.2 that the condition on χ originates; it then finds its way into the conditions of Theorem 5.1 through Proposition 5.3. Ultimately, this causes the polylogarithmic factor in the complexity of Theorem 1.3.

Lemma F.1 Fix parameters and assumptions as laid out in Sect. 3. Let S denote the linear subspace of T_x M spanned by the eigenvectors of ∇²f̂_x(0) associated to eigenvalues strictly larger than θ²/(η(2 − θ)²). Let P_S denote orthogonal projection to S. Assume TSS(x) runs its course in full.

Lemma F.2 Fix parameters and assumptions as laid out in Sect. 3, with
In Lemmas F.1 and F.2, if S is empty then P S maps all vectors to the zero vector, and the statements still hold.

Proof of Proposition 5.3 By
Or ‖s_j‖ > ℒ, in which case Lemma 4.3 used with q = j < T and s_0 = 0 implies (See Lemma C.1 for that last equality.) Owing to how NCE works, we always have • (Case 2b) The iterate s_{j+1} leaves the ball of radius b, that is, ‖s_{j+1}‖ > b. In this case, apply Lemma 4.3 with q = j + 1 ≤ T and s_0 = 0 to claim (The first inequality is by definition of E_{j+1} (20); subsequently, we use ‖s_{j+1}‖ > b > ℒ as in Lemma C.1.) • (Case 2c) The iterate s_{j+1} satisfies ‖∇f̂_x(s_{j+1})‖ ≤ ε/2. Recall the chain rule identity relating gradients of f and gradients of the pullback f̂_x. In our situation, x_T = R_x(s_{j+1}) and ‖s_{j+1}‖ ≤ b (otherwise, Case 2b applies). Thus, A2 ensures σ_min(T_{s_{j+1}}) ≥ 1/2 and we deduce that • (Case 2d) None of the other events occur: TSS(x) runs its T iterations in full.
In this case, we apply the logic in the proof of [28, Lem. 12], as follows. We consider two cases. In the first case, E_0 − E_{T/2} > ℰ. Then, we apply Lemma 4.2 to claim that In the second case, Then, Lemma F.2 applies and we learn the following: Let S denote the linear subspace of T_x M spanned by the eigenvectors of ∇²f̂_x(0) associated to eigenvalues strictly larger than θ²/(η(2 − θ)²). Let P_S denote orthogonal projection to S. For each j in {T/4, …, T/2} we have and In the second case, τ ∈ {T/4, …, T/2}. We aim to apply Lemma F.1: there are a few preconditions to check. Here is what we already know: Regarding the third one above: we know that ‖∇f̂_x(s_τ)‖ > ε/2 because TSS(x) did not terminate with s_τ. We deduce that We now have a final pair of cases to check. Either ‖s_τ‖ ≤ ℒ, in which case Lemma F.1 applies: it follows that E_{τ−1} − E_{τ+T/4} ≥ ℰ, and by arguments similar to the above we conclude that f(x) − f(x_T) ≥ ℰ. Or ‖s_τ‖ > ℒ, in which case Lemma 4.3 implies (using s_0 = 0): (For the second and last inequalities, we use τ < T and Lemmas 4.2 and C.1.) This covers all possibilities.

F.2 Proofs of Lemmas F.1 and F.2
We include full proofs for the analogues of [28, Lem. 21 and 22] because we need small but important changes for our setting (as is the case for the other similar results we prove in full), and because of (ultimately inconsequential) small issues with some arguments pertaining to the subspace S in the original proofs. (Specifically, the subspace S is defined with respect to the Hessian of the cost function at a specific reference point, which for notational convenience in Jin et al. [28] is denoted by 0; however, this same convention is used in several lemmas, on at least one occasion referring to distinct reference points; the authors easily proposed a fix, and we use a different fix below; to avoid ambiguities, we keep all iterate references explicit.) Up to those minor changes, the proofs of the next two lemmas are due to Jin et al.
As a general heads-up for this and the next section: we call upon several lemmas from [28] which are purely algebraic facts about the entries of powers of the 2 × 2 matrices A_m (79): they do not change at all in our context, hence we do not include their proofs. We only note that Lemma 33 in [28] The remainder of the proof consists in showing that ‖s_{τ+T/4} − s_τ‖ is in fact larger than ℒ/2. Starting now, consider j = T/4. From (77) in Lemma E.1, we know that Let e_1, …, e_d form an orthonormal basis of eigenvectors for H = ∇²f̂_x(0) with eigenvalues λ_1 ≤ ⋯ ≤ λ_d. Expand v_τ, ∇f̂_x(s_τ) and δ_{τ+k} in that basis as: Owing to (81) and (82), which reveal how A block-diagonalizes in the basis e, we can further write This reveals the expansion coefficients of s_{τ+j} − s_τ in the basis e_1, …, e_d, which is enough to study the norm of s_{τ+j} − s_τ. Explicitly, where we introduce the notation To proceed, we need control over the coefficients a_{m,t} and b_{m,t}, as provided by [28, Lem. 30]. We explore this for m in the set that is, for the eigenvectors orthogonal to S. Under our general assumptions it holds that ‖∇²f̂_x(0)‖ ≤ ℓ, so that |λ_m| ≤ ℓ for all m. This ensures ηλ_m ∈ [−1/4, θ²/(2 − θ)²] for m ∈ S^c. Recall that A_m (79) is a 2 × 2 matrix which depends on θ and ηλ_m. It is reasonably straightforward to diagonalize A_m (or rather, to put it in Jordan normal form), and from there to get an explicit expression for any entry of A_m^k. The quantity Σ_{k=0}^{j−1} a_{m,k} is a sum of such entries over a range of powers: this can be controlled as one would a geometric series. In [28, Lem. 30], it is shown that, for m ∈ S^c, if j ≥ 1 + 2/θ and θ ∈ (0, 1/4], then with some universal constants c_4, c_5.
The lemma applies because θ ∈ (0, 1/4] by Lemma C.1 and also j = Building on the latter comments, we can define the following scalars for m ∈ S^c: In analogy with notation in (85), we also consider vectors δ̃_j and ṽ_j with expansion coefficients as above. These definitions are crafted specifically so that (86) yields: We deduce from (88) that where S^c is the orthogonal complement of S, that is, it is the subspace of T_x M spanned by the eigenvectors {e_m}_{m∈S^c}, and P_{S^c} is the orthogonal projector to S^c. In the last line, we used a triangle inequality and the assumption that ‖P_{S^c}(∇f̂_x(s_τ))‖ ≥ ε/6. Our goal now is to show that ‖P_{S^c}(δ̃_j)‖ and ‖P_{S^c}(ṽ_j)‖ are suitably small.
Consider the following vector, with notation as in (75): By the Lipschitz-like properties of ∇²f̂_x and the assumption ‖s_τ‖ ≤ ℒ < b, we deduce that Note that Σ_{k=0}^{j−1} p_{m,k,j} = 1. This and the fact that is independent of k justify that: where ·^{(m)} denotes the expansion coefficients in the basis e. Define the vector δ̃_j (without "prime") with expansion coefficients δ̃_j^{(m)} = Σ_{k=0}^{j−1} p_{m,k,j} (δ_{τ+k})^{(m)}. Then, by construction, Through a simple reasoning using [28, Lem. 24, 26] one can conclude that, under our setting, both eigenvalues of A_m (for m ∈ S^c) are positive, and as a result that the coefficients a_{m,k} (hence also p_{m,k,j}) are positive. Therefore, Notice that for all 0 ≤ k ≤ j − 1 we have and this right-hand side is independent of k. Thus, we can factor out Σ_{k=0}^{j−1} p_{m,k,j} = 1 in the expression above to get: Use first (a + b)² ≤ 2a² + 2b², then (another) Cauchy–Schwarz, to deduce To bound this further, we call upon Lemma E.3 with R = (3/2)ℒ ≤ b/3, q′ = τ and q = τ + T/4 − 1. To this end, we must first verify that ‖s_{τ+k}‖ ≤ R for k = −1, …, T/4 − 1.
This is indeed the case owing to (84) and the assumption ‖s_τ‖ ≤ ℒ: This confirms that we can use the conclusions of Lemma E.3, reaching: Recall that we aim to make progress from bound (89). The bound ‖P_{S^c}(δ̃_j)‖ ≤ ε/24 we just established is a first step. We now turn to bounding ‖P_{S^c}(ṽ_j)‖. Owing to (88), we have this first bound, assuming j = T/4: (Recall from (85) that v^{(m)} denotes the coefficients of v_τ in the basis e_1, …, e_d.) We split the sum in order to resolve the max. To this end, note that θ ∈ [0, 1] implies θ² ≥ θ²/(2 − θ)², so that the max evaluates to θ² exactly when −θ² ≤ ηλ_m ≤ θ²/(2 − θ)² (remembering that ηλ_m ≤ θ²/(2 − θ)² because m ∈ S^c). Thus, (v^{(m)})² ηλ_m (Recall that P_S projects to the subspace spanned by eigenvectors with eigenvalues strictly above θ²/(η(2 − θ)²).) Combining all work done since (90), it follows that Use the assumptions ‖v_τ‖ ≤ ℳ and ⟨P_S v_τ, H P_S v_τ⟩ ≤ √(ρε) ℳ² to see that (For the last equality, use 2θ² = √(ρε)η/2 and η = 1/(4ℓ).) To proceed, we must bound ⟨v_τ, H v_τ⟩. To this end, notice that by assumption the (NCC) condition did not trigger for (x, s_τ, u_τ). Therefore, we know that Moreover, it always holds that f̂_x(s_τ) = f̂_x(u_τ) + ⟨∇f̂_x(u_τ), s_τ − u_τ⟩ + ½⟨s_τ − u_τ, ∇²f̂_x(u_τ + φ(s_τ − u_τ))[s_τ − u_τ]⟩ for some φ ∈ [0, 1]. Also using u_τ = s_τ + (1 − θ)v_τ, we deduce that ⟨v_τ, ∇²f̂_x With the help of Lemma C.1, note that Thus, the Lipschitz-type properties of ∇²f̂_x apply up to that point and we get Plugging this back into (91) with c ≥ 80√c_5 reveals that This shows that ‖P_{S^c}(ṽ_j)‖ ≤ ε/24 for j = T/4. We plug ‖P_{S^c}(δ̃_j)‖ ≤ ε/24 and ‖P_{S^c}(ṽ_j)‖ ≤ ε/24 into (89) to state that, with j = T/4, (We used 4θ² = √(ρε)η, then we also set c > (3c_4)^{1/3}.) This last inequality contradicts (84). Thus, the proof by contradiction is complete and we conclude that What follows is the equivalent of the proof of [28, Lem. 22], with the small changes needed for our purpose.
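The claim used in the proof above that both eigenvalues of A_m are real and positive for m ∈ S^c can also be checked numerically. The sketch below scans θ ∈ (0, 1/4] and ηλ_m ∈ [−1/4, θ²/(2−θ)²], assuming (our transcription of (79), following [28]) the 2×2 block A_m = [(2−θ)(1−ηλ_m), −(1−θ)(1−ηλ_m); 1, 0]:

```python
import numpy as np

def Am_eigs(theta, eta_lam):
    # Assumed 2x2 block A_m (79): trace (2 - theta)(1 - eta_lam),
    # determinant (1 - theta)(1 - eta_lam), both positive on the scanned range.
    A = np.array([[(2 - theta) * (1 - eta_lam), -(1 - theta) * (1 - eta_lam)],
                  [1.0, 0.0]])
    return np.linalg.eigvals(A)

all_real_positive = True
for theta in np.linspace(1e-3, 0.25, 25):
    hi = theta ** 2 / (2 - theta) ** 2
    for eta_lam in np.linspace(-0.25, hi, 50):
        w = Am_eigs(theta, eta_lam)
        # tolerate tiny imaginary round-off near the double-root boundary
        if np.abs(w.imag).max() > 1e-6 or w.real.min() <= 0:
            all_real_positive = False
```

Analytically, the discriminant (1 − ηλ)[(2 − θ)²(1 − ηλ) − 4(1 − θ)] is nonnegative precisely because ηλ ≤ θ²/(2 − θ)² gives (2 − θ)²(1 − ηλ) ≥ (2 − θ)² − θ² = 4(1 − θ); positivity then follows from the positive trace and determinant.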
Proof of Lemma F.2 Since E_0 − E_{T/2} ≤ ℰ and s_0 = 0, Lemmas 4.2, 4.3 and C.1 yield: By Lemma E.1 with τ = 0 and noting that s_0 = 0, s_{−1} = s_0 − v_0 = 0, we know that, for all j, Define the operator Δ_j = ∫_0^1 (∇²f̂_x(φs_j) − H) dφ with H = ∇²f̂_x(0). We can write: We shall bound this term by term. The third term is straightforward, so let us start with this one. Owing to (92), the Lipschitz-like properties of the Hessian apply to claim ‖Δ_j‖ ≤ (ρ̂/2)‖s_j‖. Therefore, with c ≥ 2 and χ ≥ 1. Below, we work toward bounding the other two terms. As we did in the proof of Lemma F.1, let e_1, …, e_d form an orthonormal basis of eigenvectors for H with eigenvalues λ_1 ≤ ⋯ ≤ λ_d. Expand ∇f̂_x(0) and δ_k in that basis as where S = {m : ηλ_m > θ²/(2 − θ)²} indexes the eigenvalues of the eigenvectors which span S. This identity splits in two parts, each of which we now aim to bound.
In this case, we apply the logic in the proof of [28, Lem. 13], as follows. Define the set X_x^{(stuck)} as containing exactly all tangent vectors s* ∈ B_x(r) such that:
1. TSS(x, s*) runs its T iterations in full, and
2. E*_0 − E*_T ≤ 2ℰ, where E*_j denotes the Hamiltonians associated to TSS(x, s*).
There are two cases. Either s_0 is not in X_x^{(stuck)}, in which case E_0 − E_T > 2ℰ: it is then easy to conclude (using (107)) that f(x) − f(x_T) > (7/4)ℰ. Or s_0 is in X_x^{(stuck)}, in which case we do not lower-bound f(x) − f(x_T). The probability of this happening is Prob{ξ ∈ X_x^{(stuck)}} = Vol(X_x^{(stuck)}) / Vol(B_r^d), where Vol(·) denotes the volume of a set, and Vol(B_r^d) is the volume of a Euclidean ball of radius r in a d-dimensional vector space. In order to upper-bound the volume of X_x^{(stuck)}, we resort to Lemma G.1: this is where we use the assumption λ_min(∇²f̂_x(0)) ≤ −√(ρε). Let e_1 denote an eigenvector of ∇²f̂_x(0) with minimal eigenvalue, and let s_0, s′_0 be two arbitrary vectors in X