Faster Riemannian Newton-type optimization by subsampling and cubic regularization

This work is on constrained large-scale non-convex optimization where the constraint set implies a manifold structure. Solving such problems is important in a multitude of fundamental machine learning tasks. Recent advances on Riemannian optimization have enabled the convenient recovery of solutions by adapting unconstrained optimization algorithms over manifolds. However, it remains challenging to scale up and meanwhile maintain stable convergence rates and handle saddle points. We propose a new second-order Riemannian optimization algorithm, aiming at improving convergence rate and reducing computational cost. It enhances the Riemannian trust-region algorithm that exploits curvature information to escape saddle points through a mixture of subsampling and cubic regularization techniques. We conduct rigorous analysis to study the convergence behavior of the proposed algorithm. We also perform extensive experiments to evaluate it based on two general machine learning tasks using multiple datasets. The proposed algorithm exhibits improved computational speed, e.g., a speed improvement from $12\%$ to $227\%$, and improved convergence behavior, e.g., an iteration number reduction from $\mathcal{O}\left(\max\left(\epsilon_g^{-2}\epsilon_H^{-1},\epsilon_H^{-3}\right)\right)$ to $\mathcal{O}\left(\max\left(\epsilon_g^{-2},\epsilon_H^{-3}\right)\right)$, compared to a large set of state-of-the-art Riemannian optimization algorithms.


Introduction
In modern machine learning, many learning tasks are formulated as non-convex optimization problems. This is because, as compared to linear or convex formulations, they can often capture more accurately the underlying structures within the data, and model more precisely the learning performance (or losses).
There is an important class of non-convex problems whose constraint sets possess manifold structures, e.g., optimizing over a set of orthogonal matrices. A manifold in mathematics refers to a topological space that locally behaves as a Euclidean space near each point. Over a $D$-dimensional manifold $\mathcal{M}$ embedded in a $d$-dimensional ambient space ($D < d$), each local patch around a point (a subset of $\mathcal{M}$) is homeomorphic to a local open subset of the Euclidean space $\mathbb{R}^D$. This special structure enables a straightforward adoption of any unconstrained optimization algorithm for solving a constrained problem over a manifold, simply by applying a systematic way to modify the gradient and Hessian calculation. The modified quantities are called the Riemannian gradients and the Riemannian Hessians, which will be rigorously defined later. Such an accessible method for developing optimized solutions has benefited many applications and encouraged the implementation of various optimization libraries.
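As a concrete illustration of this gradient modification, consider the unit sphere, one of the simplest embedded manifolds: the Riemannian gradient is the Euclidean gradient projected onto the tangent space, and a retraction maps a tangent step back onto the sphere. The following numpy sketch uses our own illustrative function names, not those of any particular library:

```python
import numpy as np

def riemannian_grad_sphere(x, egrad):
    """Riemannian gradient on the unit sphere: project the Euclidean
    gradient onto the tangent space at x, i.e., (I - x x^T) egrad."""
    return egrad - np.dot(x, egrad) * x

def retract_sphere(x, eta):
    """A retraction on the sphere: step in the ambient space, then
    renormalize back onto the manifold."""
    y = x + eta
    return y / np.linalg.norm(y)

# Example: f(x) = x^T A x restricted to the sphere (a Rayleigh quotient).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2.0
x = rng.standard_normal(5); x /= np.linalg.norm(x)
g = riemannian_grad_sphere(x, 2.0 * A @ x)   # Euclidean gradient is 2Ax
```

By construction the Riemannian gradient is orthogonal to $x$, so a small retracted step stays on the manifold while decreasing the cost, which is exactly the modification that lets unconstrained algorithms run on the sphere.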
A representative learning task that gives rise to non-convex problems over manifolds is low-rank matrix completion, widely applied in signal denoising, recommendation systems and image recovery [38]. It is formulated as optimization problems constrained on fixed-rank matrices that belong to a Grassmann manifold. Another example task is principal component analysis (PCA), popularly used in statistical data analysis and dimensionality reduction [51]. It seeks an optimal orthogonal projection matrix from a Stiefel manifold. A more general problem setup than PCA is subspace learning [39], where a low-dimensional subspace is an instance of a Grassmann manifold. When training a neural network, in order to reduce overfitting, the orthogonal constraint that provides a Stiefel manifold structure is sometimes imposed over the network weights [6]. Additionally, in hand gesture recognition [42], optimizing over a symmetric positive definite (SDP) manifold has been shown effective.
Recent developments in optimization on Riemannian manifolds [3] have offered a convenient and unified solution framework for solving the aforementioned class of non-convex problems. The Riemannian optimization techniques translate the constrained problems into unconstrained ones on the manifold whilst preserving the geometric structure of the solution. For example, one Riemannian way to implement a PCA is to preserve the SDP geometric structure of the solutions without explicit constraints [28]. A simplified description of how Riemannian optimization works is that it first applies a straightforward way to modify the calculation of the first-order and second-order gradient information, then it adopts an unconstrained optimization algorithm that uses the modified gradient information. There are systematic ways to compute these modifications by analyzing the geometric structure of the manifold. Various libraries have implemented these methods and are available to practitioners, e.g., Manopt [13] and Pymanopt [55].
Among such techniques, Riemannian gradient descent (RGD) is the simplest. To handle large-scale computation with a finite-sum structure, Riemannian stochastic gradient descent (RSGD) [9] has been proposed to estimate the gradient from a single sample (or a sample batch) in each iteration of the optimization. Here, an iteration refers to the process by which an incumbent solution is updated with gradient and (or) higher-order derivative information; for example, Eq. (3) in the upcoming text defines an RSGD iteration. Convergence rates of RGD and RSGD are compared in [66] together with a global complexity analysis. The work concludes that RGD can converge linearly while RSGD converges sub-linearly, but RSGD becomes computationally cheaper when the number of samples to process increases significantly, and it can potentially prevent overfitting. By using RSGD to optimize over the Stiefel manifold, [45] attempts to improve the interpretability of domain adaptation and has demonstrated its benefits for text classification.
A major drawback of RSGD is the variance issue, where the variance of the update direction can slow down the convergence and result in poor solutions. Typical techniques for variance reduction include the Riemannian stochastic variance reduced gradient (RSVRG) [65] and the Riemannian stochastic recursive gradient (RSRG) [33]. RSVRG reduces the gradient variance by using a momentum term, which takes into account the gradient information obtained from both RGD and RSGD. RSRG follows a different strategy and considers only the information in the last and current iterations. This has the benefit of avoiding large cumulative errors, which can be caused by transporting the gradient vector along a distant path when aligning two vectors at the same tangent plane. It has been shown by [33] that RSRG performs better than RSVRG particularly for large-scale problems.
The RSGD variants can suffer from oscillation across the slopes of a ravine [36]. This also happens when performing stochastic gradient descent in Euclidean spaces. To address this, various adaptive algorithms have been proposed. The core idea is to control the learning process with adaptive learning rates in addition to the gradient momentum. Riemannian techniques of this kind include R-RMSProp [36], R-AMSGrad [17], R-AdamNC [7], RPG [30] and RAGDsDR [5].
Although improvements have been made for first-order optimization, they might still be insufficient for handling saddle points in non-convex problems [40]. They can only guarantee convergence to stationary points and do not have control over getting trapped at saddle points due to the lack of higher-order information. As an alternative, second-order algorithms are normally good at escaping saddle points by exploiting curvature information [34,57]. Representative examples of this are the trust-region (TR) methods. Their capacity for handling saddle points and improved convergence over many first-order methods has been demonstrated in [60] for various non-convex problems. The TR technique has been extended to a Riemannian setting for the first time by [2], referred to as the Riemannian TR (RTR) technique.
It is well known that what prevents the wide use of the second-order Riemannian techniques in large-scale problems is the high cost of computing the exact Hessian matrices. Inexact techniques are therefore proposed to iteratively search for solutions without explicit Hessian computations. They can also handle non-positive-definite Hessian matrices and improve operational robustness. Two representative inexact examples are the conjugate gradient and the Lanczos methods [71,62]. However, their reduced complexity is still proportional to the sample size, and they can still be computationally costly when working with large-scale problems. To address this issue, the subsampling technique has been proposed, and its core idea is to approximate the gradient and Hessian using a batch of samples. It has been proved by [52] that the TR method with subsampled gradient and Hessian can achieve a convergence rate of order $\mathcal{O}\left(1/k^{2/3}\right)$, with $k$ denoting the iteration number. A sample-efficient stochastic TR approach is proposed by [52], which finds an $(\epsilon, \sqrt{\epsilon})$-approximate local minimum within $\mathcal{O}\left(\sqrt{n}/\epsilon^{1.5}\right)$ stochastic Hessian oracle queries, where $n$ denotes the sample number. The subsampling technique has been applied to improve the second-order Riemannian optimization for the first time by [32]. Their proposed inexact RTR algorithm employs subsampling over the Riemannian manifold and achieves faster convergence than the standard RTR method. Nonetheless, subsampling can be sensitive to the configured batch size. Overly small batch sizes can lead to poor convergence.
In the latest development of second-order unconstrained optimization, it has been shown that the adaptive cubic regularization technique [15] can improve the standard and subsampled TR algorithms and the Newton-type methods, resulting in, for instance, improved convergence and effectiveness at escaping strict saddle points [34,61]. To improve performance, the variance reduction techniques have been combined into cubic regularization and extended to cases with inexact solutions. Example works of this include [69,70] which were the first to rigorously demonstrate the advantage of variance reduction for second-order optimization algorithms. Recently, the potential of cubic regularization for solving non-convex problems over constraint sets with Riemannian manifold structures has been shown by [67,4].
We aim at improving the RTR optimization by taking advantage of both the adaptive cubic regularization and subsampling techniques. Our problem of interest is to find a local minimum of a non-convex finite-sum minimization problem constrained on a set endowed with a Riemannian manifold structure. Letting $f_i: \mathcal{M} \rightarrow \mathbb{R}$ be a real-valued function defined on a Riemannian manifold $\mathcal{M}$, we consider a twice differentiable finite-sum objective function
$$\min_{x \in \mathcal{M}} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x). \quad (1)$$
In the machine learning context, $n$ denotes the sample number, and $f_i(x)$ is a smooth, twice differentiable cost (or loss) function computed for the $i$-th sample, where the $n$ samples are assumed to be uniformly sampled. We propose a cubic Riemannian Newton-like (RN) method to solve the problem in Eq. (1) more effectively. Specifically, we enable two key improvements in the Riemannian space: (1) approximating the Riemannian gradient and Hessian using the subsampling technique, and (2) improving the subproblem formulation by replacing the trust-region constraint with a cubic regularization term. The resulting algorithm is named Inexact Sub-RN-CR. After introducing cubic regularization, it becomes more challenging to solve the subproblem, for which we demonstrate two effective solvers based on the Lanczos and conjugate gradient methods. We provide convergence analysis for the proposed Inexact Sub-RN-CR algorithm and present the main results in Theorems 3 and 4. Additionally, we provide analysis for the subproblem solvers, regarding their solution quality, e.g., whether and how they meet a set of desired conditions as presented in Assumptions 1-3, and their convergence properties. The key results are presented in Lemma 1, Theorem 2, and Lemmas 2 and 3. Overall, the proposed algorithm finds an $(\epsilon_g, \epsilon_H)$-optimal solution (defined in Section 3.1) in fewer iterations than the state-of-the-art RTR [32]: the required number of iterations is reduced from $\mathcal{O}\left(\max\left(\epsilon_g^{-2}\epsilon_H^{-1}, \epsilon_H^{-3}\right)\right)$ to $\mathcal{O}\left(\max\left(\epsilon_g^{-2}, \epsilon_H^{-3}\right)\right)$.
When being tested by extensive experiments on PCA and matrix completion tasks with different datasets and applications in image analysis, our algorithm shows much better performance than most state-of-the-art and popular Riemannian optimization algorithms, in terms of both the solution quality and computing time.

Notations and Preliminaries
We start by familiarizing the readers with the notations and concepts that will be used in the paper, and recommend [3] for a more detailed explanation of the relevant concepts. The manifold $\mathcal{M}$ is equipped with a smooth inner product $\langle \cdot, \cdot \rangle_x$ associated with the tangent space $T_x\mathcal{M}$ at any $x \in \mathcal{M}$, and this inner product is referred to as the Riemannian metric. The norm of a tangent vector $\eta \in T_x\mathcal{M}$ is denoted by $\|\eta\|_x$, computed as $\|\eta\|_x = \sqrt{\langle \eta, \eta \rangle_x}$. When we use the notation $\|\eta\|_x$, $\eta$ by default belongs to the tangent space $T_x\mathcal{M}$. We use $0_x \in T_x\mathcal{M}$ to denote the zero vector of the tangent space at $x$. The retraction mapping, denoted by $R_x(\eta): T_x\mathcal{M} \rightarrow \mathcal{M}$, is used to move $x \in \mathcal{M}$ in the direction $\eta \in T_x\mathcal{M}$ while remaining on $\mathcal{M}$, and it is the manifold counterpart of $x + \eta$ in a Euclidean space. The pullback of $f$ at $x$ is defined by $\hat{f}(\eta) = f(R_x(\eta))$, and $\hat{f}(0_x) = f(x)$. The vector transport operator $\mathcal{T}_x^y(v): T_x\mathcal{M} \rightarrow T_y\mathcal{M}$ moves a tangent vector $v \in T_x\mathcal{M}$ from a point $x \in \mathcal{M}$ to another $y \in \mathcal{M}$. We also use a shorthand notation $\mathcal{T}_\eta(v)$ to describe $\mathcal{T}_x^y(v)$ for a moving direction $\eta \in T_x\mathcal{M}$ from $x$ to $y$ satisfying $R_x(\eta) = y$. The parallel transport operator $P_{\eta,\gamma}(v)$ is a special instance of the vector transport. It moves $v \in T_x\mathcal{M}$ in the direction of $\eta \in T_x\mathcal{M}$ along a geodesic $\gamma: [0,1] \rightarrow \mathcal{M}$, where $\gamma(0) = x$, $\gamma(1) = y$ and $\gamma'(0) = \eta$, and during the movement it has to satisfy the parallel condition on the geodesic curve. We simplify the notation $P_{\eta,\gamma}(v)$ to $P_\eta(v)$. Fig. 1 illustrates a manifold and the operations over it. Additionally, we use $\|\cdot\|$ to denote the $l_2$-norm operation in a Euclidean space. The Riemannian gradient of a real-valued differentiable function $f$ at $x \in \mathcal{M}$, denoted by $\mathrm{grad} f(x)$, is defined as the unique element of $T_x\mathcal{M}$ satisfying $\langle \mathrm{grad} f(x), \xi \rangle_x = Df(x)[\xi]$ for all $\xi \in T_x\mathcal{M}$, where $Df(x)[\xi]$ generalizes the notion of the directional derivative to a manifold, defined as the derivative of $f(\gamma(t))$ at $t = 0$, where $\gamma(t)$ is a curve on $\mathcal{M}$ such that $\gamma(0) = x$ and $\dot{\gamma}(0) = \xi$.
When operating in a Euclidean space, we use the same notation $Df(x)[\xi]$ to denote the classical directional derivative. The Riemannian Hessian of a real-valued, twice differentiable function $f$ at $x \in \mathcal{M}$, denoted by $\mathrm{Hess} f(x)[\xi]$, is a linear mapping of $T_x\mathcal{M}$ onto itself, given by the covariant derivative of the Riemannian gradient in the direction $\xi$. When the function $f$ is defined over a quotient manifold, the Riemannian gradient and Hessian can be computed by projecting the Euclidean gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)[\xi]$ onto the horizontal space of the manifold.
Taking the PCA problem as an example (see Eqs. (60) and (61) for its formulation), the general objective in Eq. (1) can be instantiated as a minimization over $U \in \mathrm{Gr}(r,d)$. The Grassmann manifold $\mathrm{Gr}(r,d)$ contains the set of $r$-dimensional linear subspaces of the $d$-dimensional vector space. Each subspace corresponds to a point on the manifold that is an equivalence class of $d \times r$ orthogonal matrices, expressed as $x = [U] := \{UO : U^T U = I, O \in \mathcal{O}(r)\}$, where $\mathcal{O}(r)$ denotes the orthogonal group in $\mathbb{R}^{r \times r}$. A tangent vector $\eta \in T_x\mathrm{Gr}(r,d)$ of the Grassmann manifold has the form $\eta = U_\perp B$ [19], where $B \in \mathbb{R}^{(d-r) \times r}$, and $U_\perp \in \mathbb{R}^{d \times (d-r)}$ is the orthogonal complement of $U$ such that $[U, U_\perp]$ is orthogonal. A commonly used Riemannian metric for the Grassmann manifold is the canonical inner product $\langle \eta, \xi \rangle_x = \mathrm{tr}(\eta^T \xi)$ given $\eta, \xi \in T_x\mathrm{Gr}(r,d)$, resulting in $\|\eta\|_x = \|\eta\|$ (Section 2.3.2 of [19]). As we can see, the Riemannian metric and the norm here are equivalent to the Euclidean inner product and norm. The same result can be derived from another commonly used metric of the Grassmann manifold, i.e., $\langle \eta, \xi \rangle_x = \mathrm{tr}\left(\eta^T \left(I - \frac{1}{2} UU^T\right) \xi\right)$ for $\eta, \xi \in T_x\mathrm{Gr}(r,d)$ (Section 2.5 of [19]). Here we provide a few examples of the key operations explained earlier on the Grassmann manifold, taken from [13]. Given a point $[U]$, a moving direction $\eta$ and a step size $t$, one way to construct the retraction mapping is through performing singular value decomposition (SVD) on $U + t\eta$, i.e., $U + t\eta = \bar{U}SV^T$, and the new point after moving is $\bar{U}V^T$. A transport operation can be implemented by projecting a given tangent vector using the orthogonal projector $I - UU^T$. Both the Riemannian gradient and Hessian can be computed by projecting the Euclidean gradient and Hessian of $f(U)$ using the same projector $I - UU^T$.
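The Grassmann operations just described can be sketched directly in numpy; this is a hedged illustration following the recipe from [13], with our own function names:

```python
import numpy as np

def project_tangent(U, Z):
    """Orthogonal projector (I - U U^T) applied to an ambient matrix Z,
    yielding a tangent vector at [U] on the Grassmann manifold."""
    return Z - U @ (U.T @ Z)

def retract_svd(U, eta, t=1.0):
    """Retraction via thin SVD of U + t*eta: keep the orthonormal factors
    Ubar V^T and discard the singular values S."""
    Ubar, _, Vt = np.linalg.svd(U + t * eta, full_matrices=False)
    return Ubar @ Vt

rng = np.random.default_rng(1)
d, r = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
eta = project_tangent(U, rng.standard_normal((d, r)))
V = retract_svd(U, eta, t=0.5)
```

The retracted point again has orthonormal columns, so it is a valid representative of a point on $\mathrm{Gr}(r,d)$, and the projected matrix satisfies $U^T \eta = 0$ as a tangent vector should.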

First-order Algorithms
To optimize the problem in Eq. (1), the first-order Riemannian optimization algorithm RSGD updates the solution at each $k$-th iteration by using a single randomly selected instance $f_i$:
$$x_{k+1} = R_{x_k}\left(-\beta_k \,\mathrm{grad} f_i(x_k)\right), \quad (3)$$
where $\beta_k$ is the step size. Assume that the algorithm runs for multiple epochs, referred to as the outer iterations. Each epoch contains multiple inner iterations, each of which corresponds to a randomly selected $f_i$ for calculating the update. Letting $x_k^t$ be the solution at the $t$-th inner iteration of the $k$-th outer iteration and $\tilde{x}_k$ be the solution at the last inner iteration of the $k$-th outer iteration, RSVRG employs a variance-reduced extension [65] of the update defined in Eq. (3), in which the stochastic gradient is replaced by a direction $v_k^t$ combining $\mathrm{grad} f_i(x_k^t)$ with the full gradient computed at $\tilde{x}_{k-1}$ and transported to the current tangent space. Here, the full gradient information $\mathrm{grad} f(\tilde{x}_{k-1})$ is used to reduce the variance in the stochastic gradient $v_k^t$. As a later development, RSRG [33] suggests a recursive formulation that improves the variance-reduced gradient $v_k^t$ by using only quantities from the current and preceding inner iterations. This formulation is designed to avoid the accumulated error caused by a distant vector transport.
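To make the RSGD iteration in Eq. (3) concrete, the sketch below runs it on the unit sphere for a finite-sum Rayleigh-quotient objective (a toy leading-eigenvector problem); the projection retraction and the problem instance are our illustrative choices, not the paper's:

```python
import numpy as np

def rsgd_sphere(x0, A_list, beta=0.05, steps=300, seed=0):
    """Riemannian SGD on the unit sphere for the finite sum
    f(x) = -(1/n) sum_i x^T A_i x. Each iteration samples one component
    f_i, projects its Euclidean gradient to the tangent space, and
    retracts: x_{k+1} = R_{x_k}(-beta * grad f_i(x_k)), cf. Eq. (3)."""
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        A = A_list[rng.integers(len(A_list))]
        eg = -2.0 * A @ x                    # Euclidean gradient of f_i
        g = eg - np.dot(x, eg) * x           # tangent-space projection
        y = x - beta * g                     # step along -grad f_i
        x = y / np.linalg.norm(y)            # retraction back to sphere
    return x
```

On a dataset whose mean matrix has a dominant eigenvector, the iterates concentrate around that eigenvector, which is the behavior the PCA instantiation of Eq. (1) relies on.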

Inexact RTR
For second-order Riemannian optimization, the Inexact RTR [32] improves the standard RTR [2] through subsampling. It optimizes an approximation of the objective function, formulated using the second-order Taylor expansion within a trust region of radius $\Delta_k$ around $x_k$ at iteration $k$. A moving direction $\eta_k$ within the trust region is found by solving the subproblem at iteration $k$:
$$\min_{\eta \in T_{x_k}\mathcal{M}} m_k(\eta) = f(x_k) + \langle G_k, \eta \rangle_{x_k} + \frac{1}{2} \langle H_k[\eta], \eta \rangle_{x_k} \quad \text{subject to} \quad \|\eta\|_{x_k} \leq \Delta_k, \quad (7)$$
where $G_k$ and $H_k[\eta]$ are the approximated Riemannian gradient and Hessian calculated by using the subsampling technique. The approximation is based on the current solution $x_k$ and the moving direction $\eta$, calculated as
$$G_k = \frac{1}{|\mathcal{S}_g|} \sum_{i \in \mathcal{S}_g} \mathrm{grad} f_i(x_k), \qquad H_k[\eta] = \frac{1}{|\mathcal{S}_H|} \sum_{i \in \mathcal{S}_H} \mathrm{Hess} f_i(x_k)[\eta],$$
where $\mathcal{S}_g, \mathcal{S}_H \subset \{1, \dots, n\}$ are the sets of the subsampled indices used for estimating the Riemannian gradient and Hessian. The updated solution $x_{k+1} = R_{x_k}(\eta_k)$ will be accepted and $\Delta_k$ will be increased if the decrease of the true objective function $f$ is sufficiently large as compared to that of the approximate objective used in Eq. (7). Otherwise, $\Delta_k$ will be decreased because of the poor agreement between the approximate and true objectives.
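A minimal numpy sketch of the subsampling step, for an embedded manifold where Euclidean derivatives are made Riemannian by a tangent-space projection; for brevity the sketch omits the curvature correction term that a full Riemannian Hessian on a curved manifold requires, and all names are illustrative:

```python
import numpy as np

def subsampled_estimates(x, egrad_i, ehess_i, n, bg, bh, proj, seed=0):
    """Subsampled estimates of the Riemannian gradient G_k and of the
    Hessian action H_k[.]: average per-sample Euclidean derivatives over
    random index sets S_g and S_H, then project onto the tangent space
    at x via proj(x, .)."""
    rng = np.random.default_rng(seed)
    S_g = rng.choice(n, size=bg, replace=False)
    S_H = rng.choice(n, size=bh, replace=False)
    G = proj(x, sum(egrad_i(i, x) for i in S_g) / bg)
    H = lambda eta: proj(x, sum(ehess_i(i, x, eta) for i in S_H) / bh)
    return G, H
```

Because the estimates are projected, both $G_k$ and $H_k[\eta]$ land in the tangent space at $x_k$, as the subproblem in Eq. (7) requires; the batch sizes trade estimation accuracy against per-iteration cost.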

Inexact Sub-RN-CR Algorithm
We propose to improve the subsampling-based construction of the RTR subproblem in Eq. (7) by cubic regularization. This gives rise to the minimization
$$\eta_k^* = \arg\min_{\eta \in T_{x_k}\mathcal{M}} \hat{m}_k(\eta), \quad (10)$$
where the core objective component $h_{x_k}(\eta)$ of $\hat{m}_k(\eta)$ is formulated by extending the adaptive cubic regularization technique [15]: it combines the subsampled first-order term $\langle G_k, \eta \rangle_{x_k}$, included only when $\|G_k\|_{x_k} \geq \epsilon_g$, the second-order term $\frac{1}{2} \langle H_k[\eta], \eta \rangle_{x_k}$, and the cubic regularization term $\frac{\sigma_k}{3} \|\eta\|_{x_k}^3$, with $G_k$ and $H_k[\eta]$ the subsampled estimates defined for the Inexact RTR subproblem. Here, $0 < \epsilon_g < 1$ is a user-specified parameter that plays a role in the convergence analysis, which we will explain later, and the subscript $k$ is used to highlight the pullback of $f$ at $x_k$ as $\hat{f}_k(\cdot)$.
Overall, there are four hyper-parameters to be set by the user: the gradient norm threshold $0 < \epsilon_g < 1$, the dynamic control parameter $\gamma > 1$ that adjusts the cubic regularization weight, the model validity threshold $0 < \tau < 1$, and the initial trust parameter $\sigma_0$. We will discuss the setup of the algorithm in more detail later. We expect the norm of the approximate gradient to approach $\epsilon_g$. Following a similar treatment in [32], when the gradient norm is smaller than $\epsilon_g$, the gradient-based term is ignored. This is important to the convergence analysis shown in the next section.
The trust region radius $\Delta_k$ is no longer explicitly defined, but replaced by the cubic regularization term $\frac{\sigma_k}{3} \|\eta\|_{x_k}^3$, where $\sigma_k$ is related to a Lagrange multiplier on a cubic trust-region constraint. Naturally, the smaller $\sigma_k$ is, the larger a moving step is allowed. Benefits of cubic regularization have been shown in [25,34]: it can not only accelerate the local convergence, especially when the Hessian is singular, but also escape strict saddle points more effectively than the TR methods, providing stronger convergence properties.
The cubic term $\frac{\sigma_k}{3} \|\eta\|_{x_k}^3$ is equipped with a dynamic penalization control through the adaptive trust quantity $\sigma_k \geq 0$. The value of $\sigma_k$ is determined by examining how successful each iteration $k$ is. An iteration $k$ is considered successful if $\rho_k \geq \tau$, and unsuccessful otherwise, where the value of $\rho_k$ quantifies the agreement between the changes of the approximate objective $\hat{m}_k(\eta)$ and the true objective $f(x)$. The larger $\rho_k$ is, the more effective the approximate model is. Given $\gamma > 1$, in an unsuccessful iteration, $\sigma_k$ is increased to $\gamma \sigma_k$, hoping to obtain a more accurate approximation in the next iteration. Conversely, in a successful iteration, $\sigma_k$ is decreased to $\frac{\sigma_k}{\gamma}$, relaxing the approximation, but it remains restricted by a prescribed lower bound. This bound helps avoid solution candidates with overly large norms $\|\eta_k\|_{x_k}$ that can cause an unstable optimization. Below we formally define what an (un)successful iteration is, which will be used in our later analysis.
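The adaptive update of $\sigma_k$ can be summarized in a few lines; the specific constants below (the thresholds and the lower bound) are illustrative defaults, not the paper's recommended values:

```python
def update_sigma(rho, sigma, tau=0.1, gamma=2.0, sigma_min=1e-6):
    """Adaptive trust-quantity update: shrink sigma after a successful
    iteration (rho >= tau), which relaxes the cubic penalty, and inflate
    it after an unsuccessful one. Returns (new_sigma, successful)."""
    if rho >= tau:                               # successful iteration
        return max(sigma / gamma, sigma_min), True
    return gamma * sigma, False                  # unsuccessful iteration
```

A successful iteration therefore never increases $\sigma_k$ and an unsuccessful one always does, matching Definition 1 below.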
Definition 1 (Successful and Unsuccessful Iterations) An iteration $k$ in Algorithm 1 is considered successful if the agreement $\rho_k \geq \tau$, and unsuccessful if $\rho_k < \tau$. In addition, based on Step (12) of Algorithm 1, a successful iteration has $\sigma_{k+1} \leq \sigma_k$, while an unsuccessful one has $\sigma_{k+1} > \sigma_k$.

Optimality Examination
The stopping condition of the algorithm follows the definition of $(\epsilon_g, \epsilon_H)$-optimality [43], stated as below.
Given $0 < \epsilon_g, \epsilon_H < 1$, a point $x \in \mathcal{M}$ is $(\epsilon_g, \epsilon_H)$-optimal if
$$\|\mathrm{grad} f(x)\|_x \leq \epsilon_g \quad \text{and} \quad \langle \mathrm{Hess} f(x)[\eta], \eta \rangle_x \geq -\epsilon_H \|\eta\|_x^2$$
for all $\eta \in T_x\mathcal{M}$, i.e., $\mathrm{Hess} f(x) \succeq -\epsilon_H I$, where $I$ is an identity matrix.
This is a relaxation and a manifold extension of the standard second-order optimality conditions $\|\mathrm{grad} f(x)\|_x = 0$ and $\mathrm{Hess} f(x) \succeq 0$ in Euclidean spaces [40]. The algorithm stops (1) when the gradient norm is sufficiently small and (2) when the Hessian is sufficiently close to being positive semidefinite.
To examine the Hessian, we follow a similar approach as in [27] by assessing the solution of the following minimization problem:
$$\lambda_{\min}(H_k) := \min_{\eta \in T_{x_k}\mathcal{M},\, \eta \neq 0_{x_k}} \frac{\langle H_k[\eta], \eta \rangle_{x_k}}{\|\eta\|_{x_k}^2}, \quad (16)$$
which resembles the smallest eigenvalue of the Riemannian Hessian. As a result, the algorithm stops when $\|\mathrm{grad} f(x)\|_x \leq \epsilon_g$ (referred to as the gradient condition) and when $\lambda_{\min}(H_k) \geq -\epsilon_H$ (referred to as the Hessian condition), where $\epsilon_g, \epsilon_H \in (0,1)$ are the user-set stopping parameters. Note that we use the same $\epsilon_g$ for thresholding as in Eq. (11). Pseudo code of the complete Inexact Sub-RN-CR is provided in Algorithm 1.
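For intuition, the two stopping tests have a direct Euclidean analogue in which the minimized Rayleigh quotient is an ordinary smallest eigenvalue; a hedged numpy sketch with our own names:

```python
import numpy as np

def is_eps_optimal(grad, hess, eps_g, eps_H):
    """Euclidean analogue of the stopping tests: the gradient condition
    ||grad|| <= eps_g and the Hessian condition lambda_min >= -eps_H,
    where lambda_min is the value of the Rayleigh-quotient minimization
    (the flat-space counterpart of Eq. (16))."""
    gradient_ok = np.linalg.norm(grad) <= eps_g
    hessian_ok = np.linalg.eigvalsh(hess).min() >= -eps_H
    return gradient_ok and hessian_ok
```

A point with a small gradient but a strongly negative Hessian eigenvalue fails the test, which is exactly how the algorithm avoids declaring convergence at a saddle point.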

Subproblem Solvers
Step (9) of Algorithm 1 requires solving the subproblem in Eq. (10). For the convenience of explanation, we rewrite its objective function $\hat{m}_k(\eta)$ as
$$\hat{m}_k(\eta) = f(x_k) + \delta \langle G_k, \eta \rangle_{x_k} + \frac{1}{2} \langle H_k[\eta], \eta \rangle_{x_k} + \frac{\sigma_k}{3} \|\eta\|_{x_k}^3, \quad (17)$$
where $\delta = 1$ if $\|G_k\|_{x_k} \geq \epsilon_g$, and otherwise $\delta = 0$. We demonstrate two solvers commonly used in practice.

The Lanczos Method
The Lanczos method [24] has been widely used to solve the cubic regularization problem in Euclidean spaces [61,34,15,31] and has recently been adapted to Riemannian spaces [4]. Let $D$ denote the manifold dimension. Its core operation is to construct a Krylov subspace $\mathcal{K}_D$, whose basis $\{q_i\}_{i=1}^D$ spans the tangent space $T_{x_k}\mathcal{M}$. After expressing the solution as an element of $\mathcal{K}_D$, i.e., $\eta := \sum_{i=1}^D y_i q_i$, the minimization problem in Eq. (17) can be converted to one in the Euclidean space $\mathbb{R}^D$, as
$$\min_{y \in \mathbb{R}^D} f(x_k) + \delta \|G_k\|_{x_k} e_1^T y + \frac{1}{2} y^T T_D y + \frac{\sigma_k}{3} \|y\|^3, \quad (18)$$
where $T_D \in \mathbb{R}^{D \times D}$ is a symmetric tridiagonal matrix determined by the basis construction process, e.g., Algorithm 1 of [31]. We provide a detailed derivation of Eq. (18) in Appendix A. The global solution of this converted problem, i.e., $y^* = [y_1^*, y_2^*, \dots, y_D^*]$, can be found by many existing techniques; see [46]. We employ the Newton root-finding method adopted in Section 6 of [15], originally proposed by [4], which reduces the problem to a univariate root-finding problem. After this, the global solution of the subproblem is computed by $\eta_k^* = \sum_{i=1}^D y_i^* q_i$. In practice, when the manifold dimension $D$ is large, it is more practical to find a good solution rather than the global solution. By working with a lower-dimensional Krylov subspace $\mathcal{K}_l$ with $l < D$, one can derive Eq. (18) in $\mathbb{R}^l$, and its solution $y^{*l}$ results in a subproblem solution $\eta_k^{*l} = \sum_{i=1}^l y_i^{*l} q_i$. Both the global solution $\eta_k^*$ and the approximate solution $\eta_k^{*l}$ are always guaranteed to be at least as good as the solution obtained by performing a line search along the gradient direction, i.e., $\hat{m}_k(\eta_k^{*l}) \leq \min_{\alpha \in \mathbb{R}} \hat{m}_k(\alpha G_k)$, because $\alpha G_k$ is a common basis vector shared by all the constructed Krylov subspaces $\{\mathcal{K}_l\}_{l=1}^D$. We provide pseudo code for the Lanczos subproblem solver in Algorithm 2 (listing abridged here: its inputs are $G_k$, $H_k[\eta]$, $\kappa_\theta \in (0, 1/6]$ and $\sigma_k$; at each $l$ it extends the tridiagonal matrix, obtains $y^*$ by optimizing Eq. (18) with Newton root finding, and returns $\sum_{i=1}^l y_i^* q_i$ once Eq. (47) is satisfied).
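The basis-construction step at the heart of Algorithm 2 is ordinary Lanczos tridiagonalization driven by Hessian-vector products; the following is a self-contained numpy sketch of that step alone, without the cubic-model solve or the termination test:

```python
import numpy as np

def lanczos(hvp, q1, l):
    """Build an orthonormal Krylov basis Q = [q_1, ..., q_l] and the
    symmetric tridiagonal matrix T_l from Hessian-vector products hvp,
    using the standard three-term recurrence."""
    d = q1.size
    Q = np.zeros((d, l)); T = np.zeros((l, l))
    Q[:, 0] = q1 / np.linalg.norm(q1)
    beta = 0.0
    for i in range(l):
        w = hvp(Q[:, i])
        if i > 0:
            w -= beta * Q[:, i - 1]        # subtract previous direction
        alpha = Q[:, i] @ w
        w -= alpha * Q[:, i]               # orthogonalize against q_i
        T[i, i] = alpha
        beta = np.linalg.norm(w)
        if i + 1 < l:
            if beta < 1e-12:
                break                       # invariant subspace found
            T[i, i + 1] = T[i + 1, i] = beta
            Q[:, i + 1] = w / beta
    return Q, T
```

In exact arithmetic $Q^T H Q = T$, which is why the cubic model restricted to the Krylov subspace takes the cheap tridiagonal form of Eq. (18).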
To benefit practitioners and improve understanding of the Lanczos solver, we analyse the gap between a practical solution $\eta_k^{*l}$ and the global minimizer $\eta_k^*$. Firstly, we define $\lambda_{\max}(H_k)$ in a similar manner to $\lambda_{\min}(H_k)$ in Eq. (16). It resembles the largest eigenvalue of the Riemannian Hessian:
$$\lambda_{\max}(H_k) := \max_{\eta \in T_{x_k}\mathcal{M},\, \eta \neq 0_{x_k}} \frac{\langle H_k[\eta], \eta \rangle_{x_k}}{\|\eta\|_{x_k}^2}.$$
We denote a degree-$l$ polynomial evaluated at $H_k[\eta]$ by $p_l(H_k)[\eta]$, such that
$$p_l(H_k)[\eta] = c_0 \mathrm{Id}[\eta] + c_1 H_k[\eta] + c_2 H_k^2[\eta] + \dots + c_l H_k^l[\eta]$$
for some coefficients $c_0, c_1, \dots, c_l \in \mathbb{R}$. The quantity $H_k^l[\eta]$ is recursively defined by $H_k^l[\eta] = H_k\left[H_k^{l-1}[\eta]\right]$ for $l = 2, 3, \dots$. We define an induced norm for an operator $A$ on $T_{x_k}\mathcal{M}$ as
$$\|A\| := \max_{\eta \in T_{x_k}\mathcal{M},\, \eta \neq 0_{x_k}} \frac{\|A[\eta]\|_{x_k}}{\|\eta\|_{x_k}},$$
where the identity mapping operator works as $\mathrm{Id}[\eta] = \eta$. Now we are ready to present our result in the following lemma.

Lemma 1 (Lanczos Solution Gap)
Let $\eta_k^*$ be the global minimizer of the subproblem $\hat{m}_k$ in Eq. (10). Denote the subproblem without cubic regularization in Eq. (7) by $m_k$, and let $\bar{\eta}_k^*$ be its global minimizer. For each $l > 0$, the solution $\eta_k^{*l}$ returned by Algorithm 2 satisfies a gap bound relative to $\hat{m}_k(\eta_k^*)$, expressed in terms of $\|\eta\|_{x_k}$ for a moving direction $\eta$ and of $\phi_l(H_k)$, an upper bound of the induced norm $\|p_{l+1}(H_k) - \mathrm{Id}\|$. We provide its proof in Appendix B. In Euclidean spaces, a corresponding bound has been shown by [14]. With the help of Lemma 1, this could serve as a reference to gain an understanding of the solution quality of the Lanczos method in Riemannian spaces.

The Conjugate Gradient Method
We experiment with an alternative subproblem solver by adapting the non-linear conjugate gradient technique to Riemannian spaces. It starts from the initialization $\eta_k^0 = 0_{x_k}$ and the first conjugate direction $p_1 = -G_k$ (the negative gradient direction). At each inner iteration $i$ (as opposed to the outer iteration $k$ in the main algorithm), it solves a minimization problem in a single variable, the step size:
$$\alpha_i^* = \arg\min_{\alpha} \hat{m}_k\left(\eta_k^{i-1} + \alpha p_i\right). \quad (24)$$
The global solution of this one-variable minimization problem can be computed by zeroing the derivative of $\hat{m}_k$ with respect to $\alpha$, resulting in a polynomial equation in $\alpha$, which can then be solved by eigen-decomposition [20]. The root that attains the minimal value of $\hat{m}_k$ is retrieved. The algorithm updates the next conjugate direction $p_{i+1}$ using the returned $\alpha_i^*$ and $p_i$. Pseudo code for the conjugate gradient subproblem solver is provided in Algorithm 3.
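The one-variable minimization in Eq. (24) can be illustrated in the simplest case $\eta_k^{i-1} = 0_{x_k}$: writing $c_1 = \langle G_k, p \rangle_{x_k}$, $c_2 = \langle H_k[p], p \rangle_{x_k}$ and $s = \sigma_k \|p\|_{x_k}^3$, the model along $p$ becomes $m(a) = c_1 a + \frac{c_2}{2} a^2 + \frac{s}{3} |a|^3$, and zeroing its derivative gives a quadratic on each sign branch, solvable via the companion-matrix eigen-decomposition inside `np.roots`. A hedged sketch under these assumptions:

```python
import numpy as np

def line_min_cubic(c1, c2, s):
    """Minimize m(a) = c1*a + (c2/2)*a**2 + (s/3)*|a|**3 over real a by
    zeroing the derivative c1 + c2*a + s*a*|a| on each sign branch
    (a quadratic per branch) and keeping the best stationary point."""
    m = lambda a: c1 * a + 0.5 * c2 * a**2 + (s / 3.0) * abs(a)**3
    cands = [0.0]
    for sign in (1.0, -1.0):                 # branch a >= 0 / a <= 0
        for a in np.roots([sign * s, c2, c1]):
            if np.isreal(a) and sign * a.real >= 0:
                cands.append(a.real)
    return min(cands, key=m)
```

For general $\eta_k^{i-1} \neq 0_{x_k}$ the cross terms make the stationarity equation a higher-degree polynomial, but the same root-then-compare strategy applies.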
Convergence of a conjugate gradient method largely depends on how its conjugate direction is updated. This is controlled by the setting of $\beta_i$ used for calculating $p_{i+1}$ in Step (17) of Algorithm 3. Working in Riemannian spaces under the subsampling setting, it has been proven by [49] that the solver converges when the Fletcher-Reeves formula [22] is used, i.e.,
$$\beta_i^{FR} = \frac{\left\|G_k^i\right\|_{x_k}^2}{\left\|G_k^{i-1}\right\|_{x_k}^2},$$
where $G_k^i$ denotes the gradient of the subproblem objective at the $i$-th inner iterate.

Algorithm 3 Subproblem Solver by Non-linear Conjugate Gradient
(Listing abridged: the solver repeatedly solves Eq. (24) by zeroing its derivative and solving the resulting polynomial equation; it returns $\eta_k^* = \eta_k^i$ once Eq. (47) is satisfied or Eq. (31) is met, and otherwise computes $\beta_i$ by Eq. (28) to form the next conjugate direction.)

Working in Euclidean spaces, [59] has shown that the Polak-Ribiere-Polyak formula, i.e.,
$$\beta_i^{PRP} = \frac{g_i^T (g_i - g_{i-1})}{\|g_{i-1}\|^2}$$
with $g_i$ denoting the gradient at the $i$-th iterate, performs better than the Fletcher-Reeves formula. Building upon these, we propose to compute $\beta_i$ by a modified Polak-Ribiere-Polyak formula in Riemannian spaces in Step (16) of Algorithm 3, given in Eq. (28). We prove that the resulting algorithm converges to a stationary point, and present the convergence result in Theorem 1, with its proof deferred to Appendix C.
Theorem 1 (Convergence of the Conjugate Gradient Solver) Assume that the step size $\alpha_i^*$ in Algorithm 3 satisfies the strong Wolfe conditions [29], i.e., given a smooth function $f$,
$$f(x_i + \alpha_i^* p_i) \leq f(x_i) + c_1 \alpha_i^* \nabla f(x_i)^T p_i, \qquad \left|\nabla f(x_i + \alpha_i^* p_i)^T p_i\right| \leq c_2 \left|\nabla f(x_i)^T p_i\right|,$$
with $0 < c_1 < c_2 < 1$. When Step (16) of Algorithm 3 computes $\beta_i$ by Eq. (28), Algorithm 3 converges to a stationary point, i.e., $\lim_{i \rightarrow \infty} \|G_k^i\|_{x_k} = 0$.

In practice, Algorithm 3 terminates when there is no obvious change in the solution, which is examined in Step (4) by checking whether the step size is sufficiently small, i.e., whether $\alpha_i \leq 10^{-10}$ (Section 9 in [4]). To improve the convergence rate, the algorithm also terminates when $r_i$ in Step (13) is sufficiently small, following a classical criterion [2], i.e., checking whether
$$\|r_i\| \leq \|r_0\| \min\left(\|r_0\|^\theta, \kappa\right) \quad (31)$$
for some $\theta, \kappa > 0$.

Properties of the Subproblem Solutions
In Algorithm 2, the basis $\{q_i\}_{i=1}^D$ is constructed successively starting from $q_1 = G_k$, while the converted problem in Eq. (18) is solved for each $\mathcal{K}_l$ starting from $l = 1$. This process allows up to $D$ inner iterations. The solution $\eta_k^*$ obtained in the last inner iteration, where $l = D$, is the global minimizer over $\mathbb{R}^D$. Differently, Algorithm 3 converges to a stationary point, as proved in Theorem 1. In practice, a maximum inner iteration number $m$ is set in advance. Algorithm 3 stops when it reaches the maximum iteration number or converges to a status where the change in either the solution or the conjugate direction is very small.
The convergence property of the main algorithm presented in Algorithm 1 relies on the quality of the subproblem solution. Before discussing it, we first familiarize the reader with the classical TR concepts of Cauchy steps and eigensteps, defined here for the Inexact RTR problem introduced in Section 2.2. According to Section 3.3 of [12], when $m_k$ is the RTR subproblem, the closed-form Cauchy step $\hat{\eta}_k^C$ is an improving direction defined by
$$\hat{\eta}_k^C = -\min\left(\frac{\Delta_k}{\|G_k\|_{x_k}}, \frac{\|G_k\|_{x_k}^2}{\langle G_k, H_k[G_k] \rangle_{x_k}}\right) G_k.$$
It points along the negative gradient direction with an optimal step size computed by the $\min(\cdot, \cdot)$ operation, and follows the form of the general Cauchy step defined by
$$\eta_k^C = \arg\min_{\alpha \geq 0} m_k(-\alpha G_k).$$
According to Section 2.2 of [32], for some $\nu \in (0, 1]$, the eigenstep $\eta_k^E$ satisfies
$$\left\langle \eta_k^E, H_k\left[\eta_k^E\right] \right\rangle_{x_k} \leq \nu \lambda_{\min}(H_k) \left\|\eta_k^E\right\|_{x_k}^2 < 0.$$
It is an approximation of the negative curvature direction by an eigenvector associated with the smallest negative eigenvalue.
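The closed-form Cauchy step just described can be sketched in a few lines of numpy for the Euclidean-metric case used in the Grassmann example earlier; names are ours:

```python
import numpy as np

def cauchy_step(G, Hv, delta):
    """Closed-form TR Cauchy step: move along -G with the step size
    capped by the trust-region radius delta, and, when the curvature
    <G, H[G]> is positive, also capped by the 1-D minimizer
    ||G||^2 / <G, H[G]>."""
    gnorm = np.linalg.norm(G)
    t = delta / gnorm                      # boundary step size
    curv = float(G @ Hv(G))
    if curv > 0.0:
        t = min(t, gnorm**2 / curv)        # interior 1-D minimizer
    return -t * G
```

With non-positive curvature along $G$ the model decreases all the way to the boundary, so the step length is the full radius; otherwise the interior minimizer may be taken first.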
The following three assumptions on the subproblem solution are needed by the convergence analysis later. We define the induced norm for the Hessian as
$$\|H_k\| := \max_{\eta \in T_{x_k}\mathcal{M},\, \eta \neq 0_{x_k}} \frac{\|H_k[\eta]\|_{x_k}}{\|\eta\|_{x_k}}.$$

Assumption 1 (Sufficient Descent Step) Given the Cauchy step $\eta_k^C$ and the eigenstep $\eta_k^E$ for $\nu \in (0, 1]$, assume the subproblem solution $\eta_k^*$ satisfies the Cauchy condition
$$\hat{m}_k(\eta_k^*) \leq \hat{m}_k(\eta_k^C)$$
and the eigenstep condition
$$\hat{m}_k(\eta_k^*) \leq \hat{m}_k(\eta_k^E).$$
The last inequalities in Eqs. (36) and (37), concerning the descent amounts achieved by the Cauchy step and the eigenstep, are derived in Lemma 6 and Lemma 7 of [61]. Assumption 1 generalizes Condition 3 in [61] to the Riemannian case. It assumes that the subproblem solution $\eta_k^*$ is at least as good as the Cauchy step and the eigenstep, decreasing the value of the subproblem objective function more. The following two assumptions enable a stronger convergence result for Algorithm 1, which will be used in the proof of Theorem 4.

Lemma 2 (Lanczos Solution)
The subproblem solution obtained by Algorithm 2 when executed for D (the dimension of M) iterations satisfies Assumptions 1, 2 and 3. When executed for l < D iterations, the solution satisfies the Cauchy condition in Assumption 1, as well as Assumptions 2 and 3.

Lemma 3 (Conjugate Gradient Solution)
The subproblem solution obtained by Algorithm 3 satisfies the Cauchy condition in Assumption 1, as well as, approximately, the first condition of Assumption 3. In practice, Algorithm 2 based on lower-dimensional Krylov subspaces with l < D returns a less optimal solution, while Algorithm 3 returns at most a local minimum. Neither is guaranteed to satisfy the eigenstep condition in Assumption 1. However, the early-returned solutions from Algorithm 2 still satisfy Assumptions 2 and 3. Solutions from Algorithm 3 do not satisfy these two assumptions exactly, but they can come close in an approximate manner. For instance, according to Lemma 3, ||∇_η m̂_k(η*_k)||_{x_k} ≈ 0, so there is a fair chance for Eq. (41) in Assumption 2 to be met by the solution from Algorithm 3.

Practical Early Stopping
In practice, it is often more efficient to stop the optimization before the optimality conditions are met and obtain a reasonably good solution much faster. We employ a simple and practical early-stopping mechanism to accommodate this need. Algorithm 1 is allowed to terminate earlier when (1) the norm of the approximate gradient fails to decrease K consecutive times, or (2) the percentage of the function decrement stays below a given threshold for K consecutive iterations, with K and τ_f > 0 being user-defined.
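A minimal sketch of this early-stopping rule, with illustrative names; the relative-decrement test is an assumed reading of the (elided) Eq. (46):

```python
def should_stop_early(f_history, grad_norms, K=5, tau_f=1e-10):
    """Stop when the approximate-gradient norm has failed to decrease K
    times in a row, or when the relative function decrement has stayed
    below tau_f for K consecutive iterations (sketch)."""
    if len(f_history) < K + 1 or len(grad_norms) < K + 1:
        return False
    grad_stalled = all(
        grad_norms[-i] >= grad_norms[-i - 1] for i in range(1, K + 1)
    )
    small_decrement = all(
        abs(f_history[-i - 1] - f_history[-i])
        / max(abs(f_history[-i - 1]), 1e-30) < tau_f
        for i in range(1, K + 1)
    )
    return grad_stalled or small_decrement
```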
For the subproblem, both Algorithms 2 and 3 are allowed to terminate when the current solution η_k, i.e., η_k = η*l_k for Algorithm 2 and η_k = η^i_k for Algorithm 3, satisfies Eq. (47). This implements an examination of Assumption 2. Regarding Assumption 1, both Algorithms 2 and 3 optimize along the direction of the Cauchy step in their first iteration and thus satisfy the Cauchy condition, so there is no need to examine it. As for the eigenstep condition, it is costly to compute and compare with the eigenstep in each inner iteration, so we do not use it as a stopping criterion in practice. Regarding Assumption 3, according to Lemma 2 it is always satisfied by the solution from Algorithm 2, so there is no need to examine it there. As for Algorithm 3, the examination by Eq. (47) also plays a role in checking Assumption 3: for its first condition, Eq. (47) implies ||∇_η m̂_k(η*_k)||_{x_k} ≈ 0, indicating that the first condition of Assumption 3 is met approximately. Also, since ⟨G_k, η*_k⟩_{x_k} ≤ 0 because η*_k is a descent direction, the second condition of Assumption 3 has a fairly high chance of being met.
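The inner stopping test of Eq. (47) can be sketched as follows (Euclidean stand-in with illustrative names): accept the inner solution once the model-gradient norm falls below κ_θ · min(1, ||η||) · ||G||.

```python
import numpy as np

def subproblem_converged(grad_model, eta, G, kappa_theta=0.08):
    """Inner stopping test sketch: ||grad m(eta)|| small relative to
    the (approximate) gradient norm, as in Eq. (47)."""
    return (np.linalg.norm(grad_model)
            <= kappa_theta * min(1.0, float(np.linalg.norm(eta)))
               * np.linalg.norm(G))
```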

Preliminaries
We start by listing the assumptions and conditions from the existing literature that are adopted to support our analysis. Given a function f, the Hessian of its pullback, ∇²f̂(x)[η], and its Riemannian Hessian, Hess f(x)[η], are identical when a second-order retraction is used [12]; this serves as an assumption to ease the analysis.
Assumption 4 (Second-order Retraction [10]) The retraction mapping is assumed to be a second-order retraction. That is, for all x ∈ M and all η ∈ T_x M, the curve γ(t) := R_x(tη) has zero acceleration at t = 0, i.e., γ''(0) = (D²/dt²) R_x(tη)|_{t=0} = 0. The following two assumptions originate from the assumptions required by the convergence analysis of the standard RTR algorithm [12,21], and are adopted here to support the inexact analysis.
Assumption 5 (Restricted Lipschitz Hessian) There exists L_H > 0 such that, for all x_k generated by Algorithm 1 and all η_k ∈ T_{x_k}M, the two Lipschitz-type bounds below hold, where P^{-1} denotes the inverse of the parallel transport operator.
Assumption 6 (Norm Bound on Hessian) For all x_k, there exists K_H ≥ 0 such that the inexact Hessian H_k satisfies ||H_k|| ≤ K_H. The following key conditions on the inexact gradient and Hessian approximations were developed in Euclidean spaces by [48] (Section 2.2) and [61] (Section 1.3), respectively. We make use of them in Riemannian spaces.

Condition 1 (Approximation Error Bounds)
For all x_k and η_k ∈ T_{x_k}M, suppose that there exist δ_g, δ_H > 0 such that the approximate gradient and Hessian satisfy

||G_k − ∇f(x_k)||_{x_k} ≤ δ_g,   ||(H_k − ∇²f̂(x_k))[η_k]||_{x_k} ≤ δ_H ||η_k||_{x_k}.

As will be shown in Theorem 2 later, these bounds allow the sampling sizes used in the gradient and Hessian approximations to be fixed throughout the training process. As a result, they can serve as a guarantee of algorithmic efficiency when dealing with large-scale problems.

Supporting Theorem and Assumption
In this section, we prove a preparatory theorem and present the new conditions required by our results. Below, we re-develop Theorem 4.1 of [32] using the matrix Bernstein inequality [26]. It provides lower bounds on the subsampling sizes required for the gradient and Hessian approximations in order for Condition 1 to hold. The proof is provided in Appendix E.
Theorem 2 (Gradient and Hessian Sampling Size) Define K_g and K_H as the suprema of the norms of the Riemannian gradient and Hessian over the individual data points. Given 0 < δ < 1, Condition 1 is satisfied with probability at least (1 − δ) when the sampling sizes |S_g| and |S_H| exceed the lower bounds in Eqs. (55) and (56), where d and r are the dimensions of x.
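To illustrate the shape of such Bernstein-type bounds, the following sketch uses schematic constants, not the paper's exact Eqs. (55)-(56): the required size grows with the suprema K_g, K_H, shrinks with the tolerated errors δ_g, δ_H, and grows only logarithmically in the dimension and in 1/δ.

```python
import math

def sampling_lower_bounds(K_g, K_H, delta_g, delta_H, delta, d, r):
    """Illustrative Bernstein-style lower bounds on |S_g| and |S_H|
    (schematic constants; hypothetical helper, not Eqs. (55)-(56))."""
    n_g = (8 * K_g**2 / delta_g**2) * math.log(2 * (d + r) / delta)
    n_H = (8 * K_H**2 / delta_H**2) * math.log(2 * (d * r) / delta)
    return math.ceil(n_g), math.ceil(n_H)
```

Note how halving the tolerance δ_g quadruples the required sample size, while tightening δ only adds a logarithmic factor.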
The two quantities δ_g and δ_H in Condition 1 are the upper bounds of the gradient and Hessian approximation errors, respectively. The following assumption bounds δ_g and δ_H.
Assumption 7 (Restrictions on δ_g and δ_H) Given ν ∈ (0,1], K_H ≥ 0, L_H > 0, and 0 < τ, ε_g < 1, we assume that δ_g and δ_H satisfy the restrictions below. As seen in Eqs. (55) and (56), the sampling sizes |S_g| and |S_H| are directly proportional to the probability (1 − δ) and inversely proportional to the error tolerances δ_g and δ_H, respectively. Hence, a higher (1 − δ) and smaller δ_g and δ_H (affected by K_H and L_H) require larger |S_g| and |S_H| for estimating the inexact Riemannian gradient and Hessian.

Main Results
Now we are ready to present our main convergence results for Algorithm 1 in two main theorems. Different from [54], which studies the escape rate from a saddle point to a local minimum, we study the convergence rate from a random starting point.
Theorem 3 (Convergence Complexity of Algorithm 1) Consider 0 < ε_g, ε_H < 1 and δ_g, δ_H > 0. Suppose that Assumptions 4, 5, 6 and 7 hold and that the solution of the subproblem in Eq. (10) satisfies Assumption 1. Then, if the inexact gradient G_k and Hessian H_k satisfy Condition 1, Algorithm 1 returns an (ε_g, ε_H)-optimal solution in O(max(ε_g^{-2}, ε_H^{-3})) iterations. The proof, along with the supporting lemmas, is provided in Appendix F. Corollary 1 Consider 0 < ε_g, ε_H < 1 and δ_g, δ_H > 0. Suppose that Assumptions 4, 5, 6 and 7 hold and that the solution of the subproblem in Eq. (10) satisfies Assumption 1. For any 0 < δ < 1, suppose Eqs. (55) and (56) are satisfied at each iteration. Then, Algorithm 1 returns an (ε_g, ε_H)-optimal solution with high probability. The proof is provided in Appendix F.3. We use an example to illustrate the effect of δ on the approximate gradient sample size |S_g|.
In addition, when δ = O(ε_g²/10), p ≈ 0.9. Replacing δ with O(ε_g²/10) in Eqs. (55) and (56), it can be obtained that the lower bound of |S_g| is proportional to ln(10 ε_g^{-2}). Combining Assumption 3 and the stopping condition in Eq. (47) for the inexact solver, a stronger convergence result can be obtained for Algorithm 1, presented in the following theorem and corollary.

Computational Complexity Analysis
We analyse the number of main operations required by the proposed algorithm. Taking the PCA task as an example, it optimizes over the Grassmann manifold Gr(r, d). Denote by m the number of inner iterations and by D the manifold dimension, i.e., D = r(d − r) for the Grassmann manifold. Starting with the gradient and Hessian computation, the full case requires O(ndr) operations for both in the PCA task. The subsampling technique reduces these to O(|S_g|dr) and O(|S_H|dr), respectively. Following an existing setup for cost computation, i.e., the Inexact RTR method [32], the full function-cost evaluation takes n operations, while the approximate evaluation after subsampling takes O(|S_n|dr), where S_n is the subsampled set of data points used to compute the function cost. Hence, for large-scale practices with n ≫ max(|S_g|, |S_H|, |S_n|), the computational cost reduction gained from the subsampling technique is significant.
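For concreteness, the subsampled PCA gradient realizing the O(|S_g|dr) cost can be sketched as below. This is a Euclidean-gradient sketch with illustrative names; the projection onto the Grassmann tangent space required by the full Riemannian gradient is omitted.

```python
import numpy as np

def subsampled_pca_gradient(Z, U, batch):
    """Subsampled Euclidean gradient of f(U) = -(1/n) sum_i ||U^T z_i||^2
    using only the columns of Z indexed by `batch`.
    Cost: O(|S_g| d r) instead of O(n d r)."""
    Zs = Z[:, batch]                           # d x |S_g| subsample
    return -(2.0 / len(batch)) * (Zs @ (Zs.T @ U))
```

With `batch` covering all n columns, this reduces to the full gradient -(2/n) Z Zᵀ U, which is a quick correctness check.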
For the subproblem solver (Algorithm 2 or 3), the dominant computation within each inner iteration is the Hessian-vector product, which as mentioned above requires O(|S_H|dr) operations after subsampling. Taking this into account, the overall complexity of Algorithm 1 is dominated by the cost of the subproblem solver, by either Algorithm 2 or Algorithm 3. Algorithm 2 is guaranteed to return the optimal subproblem solution within at most m = D = r(d − r) inner iterations, of which the complexity is at most O(|S_H|d²(d − r)²r²). Such a polynomial complexity is at least as good as that of the ST-tCG solver used in the Inexact RTR algorithm. For Algorithm 3, although m is not guaranteed to be bounded and polynomial, we have empirically observed that m is generally smaller than D in practice, yielding a complexity similar to that of Algorithm 2.
For the competing methods, we follow the same parameter settings as the existing implementations, including the batch size (i.e. sampling size), step size (i.e. learning rate) and the number of inner iterations, to ensure the same results as the reported ones. For our method, we first keep the common algorithm parameters the same as for the competing methods, including γ, τ, ε_g and ε_H. Then, we use a grid search to find appropriate values of θ and κ_θ for both Algorithms 2 and 3. Specifically, the search grid for θ is (0.02, 0.05, 0.1, 0.2, 0.5, 1), and the search grid for κ_θ is (0.005, 0.01, 0.02, 0.04, 0.08, 0.16). For the parameter κ in Algorithm 3, we keep it the same as for the other conjugate gradient solvers. The early-stopping approach described in Section 3.4 is applied to all compared algorithms.
Regarding the batch setting, i.e., the sample sizes for approximating the gradient and Hessian, we adopt the same values as used in existing subsampling implementations for consistency, and the same settings are used for both the PCA and matrix completion tasks. Specifically, the batch size |S_g| = n/100 is used for RSGD, RSVRG and RSRG, where S_H is not needed as these are first-order methods. For both the Inexact RTR and the proposed Inexact Sub-RN-CR, |S_H| = n/100 and |S_g| = n are used. This follows the existing setting in [32] for benchmarking purposes, which exploits the approximate Hessian but the full gradient. In addition, we experiment with another batch setting of {|S_H| = n/100, |S_g| = n/10} for both the Inexact RTR and the Inexact Sub-RN-CR, flagged by (G) in the algorithm name, meaning that the algorithm uses the approximate gradient in addition to the approximate Hessian. Its purpose is to evaluate the effect of S_g on the optimization.
Evaluation is conducted on two machine learning tasks, PCA and low-rank matrix completion, using both synthetic and real-world datasets with n ≫ d. Both tasks can be formulated as non-convex optimization problems on the Grassmann manifold Gr(r, d). Algorithm performance is evaluated by oracle calls and run time. The former counts the number of function, gradient, and Hessian-vector product computations; for instance, Algorithm 1 requires n + |S_g| + m|S_H| oracle calls per iteration, where m is the number of iterations of the subproblem solver. Regarding the user-defined parameters in Algorithm 1, we use σ = 10^{-18}. Empirical observations suggest that the magnitude of the data entries affects the optimization in its early stage, and hence this is taken into account in the setting of σ_0. Let S = [s_{ij}] denote the input data matrix containing L rows and H columns. We compute σ_0 by considering the data dimension, as well as the averaged data

magnitude normalized by its standard deviation, where µ_S = (1/(LH)) Σ_{i∈[L], j∈[H]} s_{i,j} and dim(M) is the manifold dimension. Regarding the early-stopping setting in Eq. (46), K = 5 is used for both tasks, and we use τ_f = 10^{-12} for MNIST and τ_f = 10^{-10} for the remaining datasets in the PCA task. In the matrix completion task, we set τ_f = 10^{-10} for the synthetic datasets and τ_f = 10^{-3} for the real-world datasets. For the early-stopping settings in Eq. (47) and in Step (13) of Algorithm 3, we adopt κ_θ = 0.08 and θ = 0.1. Between the two subproblem solvers, we observe that Algorithm 2 (the Lanczos method) and Algorithm 3 (the conjugate gradient method) perform similarly. Therefore, we report the main results using Algorithm 2, and provide supplementary results for Algorithm 3 in Section 5.4.

PCA Experiments
PCA can be interpreted as minimizing the sum of squared residual errors between the projected and the original data points, formulated as

min_{U∈Gr(r,d)} (1/n) Σ_{i=1}^n ||z_i − UU^T z_i||²_2,

where z_i ∈ R^d. The objective function can be re-expressed as one on the Grassmann manifold via

min_{U∈Gr(r,d)} −(1/n) Σ_{i=1}^n z_i^T UU^T z_i.

One synthetic dataset (P1) and two real-world datasets, MNIST [37] and Covertype [8], are used in the evaluation. The P1 dataset is generated by randomly sampling each element of a matrix A ∈ R^{n×d} from the normal distribution N(0,1), followed by a multiplication with a diagonal matrix S ∈ R^{d×d} whose diagonal elements are randomly sampled from the exponential distribution Exp(2), which increases the differences between the feature variances. After that, mean subtraction is applied to AS to obtain the final P1 dataset. The (n, d, r) values are (5×10^5, 10^3, 5) for P1, (6×10^4, 784, 10) for MNIST, and (581012, 54, 10) for Covertype. Algorithm accuracy is assessed by the optimality gap, defined as the absolute difference |f(Û) − f*|, where Û is the solution returned by Algorithm 1. The optimal function value f* is computed from the eigen-decomposition solution Ũ, the classical way to obtain the PCA result without going through an optimization program. Fig. 2 compares the optimality-gap changes over iterations for all the competing algorithms. Additionally, Fig. 3 summarizes their accuracy and convergence performance in terms of optimality gap and run time, and Fig. 4a reports the performance without early stopping on the P1 dataset. The Inexact Sub-RN-CR reaches the minima with the smallest number of iterations for both the synthetic and real-world datasets. In particular, the larger the scale of the problem, the more obvious the advantage of our Inexact Sub-RN-CR, as evidenced by the performance differences.
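The optimality gap against the eigen-decomposition reference can be computed as in the following sketch, assuming the Grassmann-form objective written above (names are illustrative):

```python
import numpy as np

def pca_optimality_gap(Z, U_hat):
    """|f(U_hat) - f(U_tilde)| for f(U) = -(1/n) sum_i ||U^T z_i||^2
    = -tr(U^T C U), with U_tilde the top-r eigenvectors of the sample
    covariance C (the classical closed-form PCA reference)."""
    n = Z.shape[1]
    r = U_hat.shape[1]
    C = (Z @ Z.T) / n
    w, V = np.linalg.eigh(C)        # eigenvalues in ascending order
    U_tilde = V[:, -r:]             # top-r eigenvectors
    f = lambda U: -np.trace(U.T @ C @ U)
    return abs(f(U_hat) - f(U_tilde))
```

Since U_tilde maximizes tr(UᵀCU) over orthonormal U, the gap is zero at the reference and non-negative everywhere else.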
However, both the Inexact RTR and the Inexact Sub-RN-CR achieve their best PCA performance when using a full gradient accompanied by a subsampled Hessian. The subsampled gradient does not seem to yield a satisfactory solution, as shown in Fig. 2 with |S_g| = n/10. Additionally, we report more results for the Inexact RTR and the proposed Inexact Sub-RN-CR in Fig. 4b on the P1 dataset with different gradient batch sizes |S_g| ∈ {n/1.5, n/2, n/5, n/10}. They all perform less well than |S_g| = n; more accurate gradient information is required to produce a high-precision solution in these tested cases. One hypothesis on the cause of this phenomenon is that the variance of the approximate gradients across samples is larger than that of the approximate Hessians, so a sufficiently large sample size is needed for a stable approximation of the gradient information. Errors in the approximate gradients may cause the algorithm to converge to a sub-optimal point with a higher cost, thus performing less well. Another hypothesis is that the quadratic term UU^T in the objective amplifies the error from the approximate gradient, which could significantly increase the PCA reconstruction error.
By fixing the gradient batch size |S_g| = n for both the Inexact RTR and the Inexact Sub-RN-CR, we compare in Figs. 4c and 4d their sensitivity to the batch size used for Hessian approximation, experimenting with |S_H| ∈ {n/10, n/10², n/10³, n/10⁴}. The Inexact Sub-RN-CR outperforms the Inexact RTR in almost all cases, except |S_H| = n/10⁴ on the MNIST dataset, and it possesses a significantly smaller rate of increase in oracle calls for large sample sizes. This implies that the Inexact Sub-RN-CR is more robust than the Inexact RTR to batch-size changes in the inexact Hessian approximation.

Low-rank Matrix Completion Experiments
Low-rank matrix completion aims at completing a partial matrix Z, of which only a small number of entries are observed, under the assumption that the matrix has low rank. One way to formulate the problem is

min_{U∈Gr(r,d), A∈R^{r×n}} ||P_Ω(UA − Z)||²_F,

where Ω denotes the index set of the observed matrix entries. The operator P_Ω : R^{d×n} → R^{d×n} is defined as P_Ω(Z)_{ij} = Z_{ij} if (i,j) ∈ Ω, and P_Ω(Z)_{ij} = 0 otherwise. We generate Ω by uniformly sampling a set of |Ω| = 4r(n + d − r) elements from the dn entries. Let a_i be the i-th column of A, z_i the i-th column of Z, and Ω_i the subset of Ω containing the sampled indices for the i-th column of Z. Then a_i has the closed-form solution a_i = (U_{Ω_i})^† z_{Ω_i}, where U_{Ω_i} contains the selected rows of U, z_{Ω_i} the selected elements of z_i according to the indices in Ω_i, and † denotes the pseudo-inverse operation.
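The per-column closed-form solution can be sketched directly (illustrative names; a boolean mask matrix stands in for Ω):

```python
import numpy as np

def complete_columns(U, Z, mask):
    """Closed-form coefficients a_i = (U_{Omega_i})^dagger z_{Omega_i}:
    for each column, restrict U and z_i to the observed rows and solve
    the least-squares problem via the pseudo-inverse."""
    d, n = Z.shape
    r = U.shape[1]
    A = np.zeros((r, n))
    for i in range(n):
        rows = np.flatnonzero(mask[:, i])          # Omega_i
        A[:, i] = np.linalg.pinv(U[rows, :]) @ Z[rows, i]
    return A
```

When Z is exactly rank r with column space spanned by U and each column has at least r observed rows in general position, UA reproduces Z exactly.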
To evaluate a solution U, we generate another index setΩ, which is used as the test set and satisfiesΩ ∩ Ω = ∅, following the same way as generating Ω.
We compute the mean squared error (MSE) as

MSE = (1/|Ω̄|) ||P_Ω̄(UA − Z)||²_F.

In the evaluation, three synthetic datasets and the real-world dataset Jester [23] are used, where the training and test sets are predefined by [23]. The synthetic datasets are generated following a generative model similar to [41] based on the SVD. Specifically, to develop a synthetic dataset, we generate two matrices A ∈ R^{d×r} and B ∈ R^{n×r} with elements independently sampled from the normal distribution N(0,1). Then, we generate two orthogonal matrices Q_A and Q_B by applying the QR decomposition [56] to A and B, respectively. After that, we construct a diagonal matrix S ∈ R^{r×r} whose diagonal elements are computed by s_{i,i} = 10^{3 + (i−r) log₁₀(c)/(r−1)} for i = 1, ..., r, and the final data matrix is computed as Z = Q_A S Q_B^T. The reason for constructing S in this specific way is to allow easy control of the condition number of the data matrix, denoted by κ(Z), i.e., the ratio between the maximum and minimum singular values of Z, because κ(Z) = σ_max(Z)/σ_min(Z) = s_{r,r}/s_{1,1} = c. We compare the performance across different algorithms and datasets. In Fig. 6, the Inexact Sub-RN-CR outperforms the others in most cases, and it can even be nearly twice as fast as the state-of-the-art methods for cases with a larger condition number (datasets M2 and M3). This shows that the proposed algorithm is efficient at handling ill-conditioned problems. In Fig. 7, the Inexact Sub-RN-CR achieves a sufficiently small MSE in the shortest run time, faster than the Inexact RTR and RTRMC. Unlike in the PCA task, the subsampled gradient approximation actually helps to improve convergence. A hypothesis explaining this phenomenon is that, compared to the quadratic term UU^T in the PCA objective, the linear term U in the matrix-completion objective accumulates fewer errors from the inexact gradient, making the optimization more stable.
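The synthetic construction above can be sketched and sanity-checked as follows (names are illustrative); by design, the ratio between the largest and smallest nonzero singular values of Z equals c:

```python
import numpy as np

def synthetic_completion_matrix(n, d, r, c, seed=0):
    """Generate Z = Q_A S Q_B^T with orthogonal factors from QR and
    diagonal S with s_ii = 10^{3 + (i - r) log10(c) / (r - 1)}, so that
    kappa(Z) = s_rr / s_11 = c by construction."""
    rng = np.random.default_rng(seed)
    Q_A, _ = np.linalg.qr(rng.standard_normal((d, r)))
    Q_B, _ = np.linalg.qr(rng.standard_normal((n, r)))
    i = np.arange(1, r + 1)
    s = 10.0 ** (3 + (i - r) * np.log10(c) / (r - 1))
    return Q_A @ np.diag(s) @ Q_B.T
```

For example, s_11 = 10^{3 − log₁₀ c} = 1000/c and s_rr = 10³, so the nonzero singular values range from 1000/c to 1000.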
Additionally, Fig. 8a compares the Inexact RTR and the Inexact Sub-RN-CR with varying batch sizes for gradient estimation and a fixed |S_H| = n. The M1-M3 results show that our algorithm is more robust to |S_g|: it converges to the minima with only 50% additional oracle calls when reducing |S_g| from n/10 to n/10², whereas the Inexact RTR requires considerably more. To summarize, we have observed from both the PCA and matrix completion tasks that the effectiveness of the subsampled gradient in the proposed approach can be sensitive to the choice of the practical problem, while the subsampled Hessian steadily contributes to a faster convergence rate.

Imaging Applications
In this section, we demonstrate some practical applications of PCA and matrix completion, which are solved by using the proposed optimization algorithm Inexact Sub-RN-CR, for analyzing medical images and scene images.

Functional Connectivity in fMRI by PCA
Functional magnetic resonance imaging (fMRI) can be used to measure brain activity, and PCA is often applied to find functional connectivity between brain regions based on fMRI scans [68]. This method builds on the assumption that the activation is independent of other signal variations such as brain motion and physiological signals [68]. Usually, fMRI images are represented as a 4D data block subject to observational noise, with three spatial dimensions and one time dimension. Following a common preprocessing routine [53,35], we denote an fMRI data block by D ∈ R^{u×v×w×T} and a mask by M ∈ {0,1}^{u×v×w} that contains d nonzero elements marked by 1. Applying the mask to the data block yields a feature matrix F ∈ R^{d×T}, where each column stores the features of the brain at a given time stamp. One can increase the sample size by collecting k fMRI data blocks corresponding to k human subjects, expanding the feature matrix to a larger matrix F ∈ R^{d×kT}. In this experiment, the fMRI dataset referred to as ds000030, provided by the OpenfMRI database [44], is used, where u = v = 64, w = 34, and T = 184. We select k = 13 human subjects and use the provided brain mask with d = 228483. The dimension of the final data matrix is (n, d) = (2392, 228483), where n = kT. We set the rank to r = 5, which is sufficient to capture over 93% of the variance in the data. After the PCA processing, each computed principal component can be rendered back onto the brain reconstruction using the open-source library Nilearn [1]. Fig. 9 displays the optimization performance, where the Inexact Sub-RN-CR converges faster in terms of both run time and oracle calls. For our method and the Inexact RTR, adopting the subsampled gradient leads to a suboptimal solution in less time than using the full gradient. We speculate that imprecise gradients cause an oscillation of the optimization near local optima. Fig. 10 compares the results obtained by our optimization algorithm with those computed by eigen-decomposition. The highlighted regions denote the main activated regions with positive connectivity (yellow) and negative connectivity (blue). The components learned by the two approaches are similar, with some components learned by our approach tending to be more connected (see Figs. 10a and 10c).
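The masking step that produces the d × kT feature matrix can be sketched as follows (illustrative names, not the paper's preprocessing code):

```python
import numpy as np

def fmri_feature_matrix(blocks, mask):
    """Flatten k fMRI blocks (each of shape u x v x w x T) into the
    d x kT feature matrix F: the 3-D mask selects d voxels, and each
    time frame contributes one column."""
    m = mask.astype(bool)
    cols = []
    for D in blocks:                      # D has shape (u, v, w, T)
        for t in range(D.shape[-1]):
            cols.append(D[..., t][m])     # d masked voxel values
    return np.stack(cols, axis=1)         # shape (d, k*T)
```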

Image Recovery by Matrix Completion
In this application, we demonstrate image recovery with matrix completion using a scene image of size (W, H, C) = (2519, 1679, 3) selected from the BIG dataset [16]. As seen in Fig. 12a, this image contains rich texture information. The (n, d, r) values for the matrix completion task are (1679, 2519, 20), where we use a relatively large rank to allow more accurate recovery. The sampling ratio for observing the pixels is set to SR = |Ω|/(W×H×C) = 0.6. Fig. 11 compares the performance of the different algorithms, where the Inexact Sub-RN-CR takes the shortest time to obtain a satisfactory solution. The subsampled gradient increases the optimization speed of the Inexact Sub-RN-CR without sacrificing much MSE. Fig. 12 illustrates the recovered images using three representative algorithms, which provide similar visual results.
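The observation operator P_Ω at a given sampling ratio can be sketched as follows (illustrative names; entries outside the sampled mask are zeroed):

```python
import numpy as np

def observe(Z, sr, seed=0):
    """Sample a mask at ratio SR = |Omega| / (number of entries) and
    apply the observation operator P_Omega (unobserved entries -> 0)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(Z.shape) < sr
    return np.where(mask, Z, 0.0), mask
```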

Results of Conjugate Gradient Subproblem Solver
We experiment with Algorithm 3 for solving the subproblem. In Step (3) of Algorithm 3, the eigen-decomposition method [20] used to solve the minimization problem has complexity O(C³), where C = 4 is the fixed degree of the polynomial. Figs. 13-16 display the results for both the PCA and matrix completion tasks. Overall, Algorithm 3 obtains the optimal results with the fastest convergence speed compared to the competing approaches. In general, Algorithm 3 provides results similar to Algorithm 2, but the two differ in run time. For instance, Algorithm 2 runs 18% faster than Algorithm 3 for the PCA task on the MNIST dataset and 20% faster for the matrix completion task on the M1 dataset, whereas Algorithm 3 runs 17% faster than Algorithm 2 for the matrix completion task on the M2 dataset. A hypothesis is that Algorithm 2 performs well on well-conditioned data (e.g. MNIST and M1) because of its strength in finding the global solution, while for ill-conditioned data (e.g. M2) it may not show a significant advantage over Algorithm 3. Moreover, Step (3) in Algorithm 3 has O(C³) complexity, which tends to be faster than solving Eq. (18) as required by Algorithm 2. Overall, this makes Algorithm 3 possibly a better choice than Algorithm 2 for processing ill-conditioned data.

Examination of Convergence Analysis Assumptions
As explained in Sections 3.3.3 and 3.4, the eigenstep condition in Assumption 1, as well as Assumptions 2 and 3, although required by the convergence analysis, are not always satisfied by a subproblem solver in practice.
In this experiment, we estimate the probability P that an assumption is satisfied during optimization by counting the number of outer iterations of Algorithm 1 in which the assumption holds. We repeat the entire process T = 5 times to obtain a stable result. Let N_i be the number of outer iterations in which the assumption is satisfied, and M_i the total number of outer iterations, in the i-th repetition (i = 1, 2, ..., 5). We compute the probability as P = (Σ_{i∈[T]} N_i) / (Σ_{i∈[T]} M_i). Experiments are conducted for the PCA task on the P1 dataset.
In order to examine Assumption 2, which underlies the stopping criterion in Eq. (47), we temporarily deactivate the other stopping criteria. We observe that Algorithm 2 always produces a solution satisfying Assumption 2. However, Algorithm 3 has only a P ≈ 50% chance of producing such a solution. The reason is probably that, when computing r_i in Step (11) of Algorithm 3, the first-order approximation of ∇f(R_{x_k}(η*_k)) is used rather than the exact ∇f(R_{x_k}(η*_k)) for the sake of computational efficiency, which can introduce an approximation error.
Regarding the eigenstep condition in Assumption 1, it is always met by Algorithm 2 (P ≈ 100%). This indicates that even a few inner iterations are sufficient for it to find a solution pointing in a direction of negative curvature. However, Algorithm 3 has only a P ≈ 70% chance of meeting the eigenstep condition. This might be caused by an insufficient number of inner iterations according to Theorem 1. Moreover, the solution obtained by Algorithm 3 is only guaranteed to be stationary by Theorem 1, rather than to point in a direction of negative curvature, which could be a second cause for Algorithm 3 failing to meet the eigenstep condition in Eq. (34).
Regarding Assumption 3, according to Lemma 2, Algorithm 2 always satisfies it, which is verified by our results with P = 100%. Algorithm 3 meets Assumption 3 empirically with a P ≈ 80% chance. This empirical result matches the theoretical result of Lemma 3, whereby solutions from Algorithm 3 tend to satisfy Assumption 3 approximately.

Conclusion
We have proposed the Inexact Sub-RN-CR algorithm to offer effective and fast optimization for an important class of non-convex problems whose constraint sets possess manifold structures. The algorithm improves the current state of the art in second-order Riemannian optimization by using cubic regularization and subsampled Hessian and gradient approximations. We have also provided rigorous theoretical results on its convergence, and empirically evaluated and compared it with state-of-the-art Riemannian optimization techniques on two general machine learning tasks and multiple datasets. Both the theoretical and experimental results demonstrate that the Inexact Sub-RN-CR offers improved convergence at reduced computational cost. Although the proposed method is promising for solving large-sample problems, there remains an open and interesting question of whether it can be effective in training a constrained deep neural network. This setting is more demanding in its required computational complexity and convergence characteristics than many other machine learning problems, and the Hessian approximation is more challenging to perform. Our future work will pursue this direction.
≥ 0, which results in Eq. (73). This concludes that for any ξ_l, we have ||ζ_l||_{x_k} ≤ ||η*_k||_{x_k}. We introduce the notation m̄_k to denote the subproblem in Eq. (7) using the inexact gradient and Hessian. Eq. (75) also uses (1) the fact that α ≥ −1, which follows from α having the form (a − b)/max(a, b), and thus 1 − α ≤ 2; (2) the definitions of the smallest and largest eigenvalues in Eqs. (16) and (20), which provide bounds for any unit tangent vectors η and ξ; and (3) the bound used to derive Eq. (77). Next, as η*l_k is the optimal solution in the subspace K_l(H_k, G_k), we have m̄_k(η*l_k) ≤ m̄_k(ζ_l). We then show that the Lanczos method exhibits at least the same convergence property for the subsampled Riemannian cubic-regularization subproblem. Let η*_k be the global minimizer of the subproblem m̄_k in Eq. (10); η*_k is equivalent to η̂*_k in the RTR subproblem with Δ_k = ||η*_k||_{x_k} and λ* = σ_k ||η*_k||_{x_k} [14]. Then, letting η*l_k be the minimizer over K_l and combining this with Eq. (81), the claim follows. This completes the proof.
C Appendix: Proof of Theorem 1 Proof We first prove the relationship between G^i_k and r_i. According to Algorithm 3, r_0 = G^0_k = G_k. Then, for i > 0, the relation follows, where ≈ comes from the first-order Taylor expansion and Eq. (85) follows Step (11) of Algorithm 3.
The exact line search in Step (3) of Algorithm 3 approximates the minimization in Eq. (86). Zeroing the derivative of Eq. (86) with respect to α gives the step size, where ≈ results from using the subsampled gradient G^i_k to approximate the full gradient ∇f(x^i_k). We then show that each p_i is a sufficient descent direction, i.e., one satisfying the descent inequality for some constant C > 0 [49]. When i = 0, p_1 = G^0_k, and the inequality holds. By Step (17) of Algorithm 3 and Eq. (85), we have p_{i+1} ≈ −G^i_k + β_i p_i. Applying the inner product with G^i_k to both sides yields the inductive step. Next, we show that β_i is upper bounded. Using Eq. (85), we obtain the bound, where c_3 > 1 is some constant.

Subsequently, this gives
Combining Eqs. (89) and (94) yields a result that contradicts Eq. (90), which completes this part of the proof. Regarding the eigenstep condition, the proof is fairly simple. The solution η*_k from Algorithm 2 with l = D is the global minimizer over R^D. As the subspace spanned by the Cauchy step and the eigenstep belongs to R^D, i.e., Span(η^C_k, η^E_k) ⊆ R^D, we have m̄_k(η*_k) ≤ m̄_k(η^E_k). Hence, the solution from Algorithm 2 satisfies the eigenstep condition in Eq. (37) of Assumption 1. Assumption 2: As stated in Section 3.3 of [15], any minimizer, including the global minimizer η*_k from Algorithm 2, is a stationary point of m̄_k and naturally has the property ∇_η m̄_k(η*_k) = 0_{x_k}. Hence, ||∇_η m̄_k(η*_k)||_{x_k} = 0 ≤ κ_θ min(1, ||η*_k||_{x_k}) ||G_k||_{x_k}. Assumption 2 is then satisfied.
Assumption 3: Given the above-mentioned property of the global minimizer η*_k of m̄_k from Algorithm 2, i.e., ∇_η m̄_k(η*_k) = 0_{x_k}, and using the definition of ∇_η m̄_k(η*_k), we obtain Eq. (97). Applying the inner product with η*_k to both sides of Eq. (97) gives Eq. (98), which corresponds to Eq. (42). As η*_k is a descent direction, we have ⟨G_k, η*_k⟩_{x_k} ≤ 0. Combining this with Eq. (98) yields Eq. (43). This completes the proof.

D.1.2 The Case of l < D
Proof Cauchy condition in Assumption 1: Implied by Eq. (19), any intermediate solution $\eta_k^{*l}$ satisfies the Cauchy condition. Assumption 2: As stated in Section 3.3 of [15], any minimizer $\eta^*$ of $\hat{m}_k$ admits the property $\nabla_\eta \hat{m}_k(\eta^*) = 0_{x_k}$. Since each $\eta_k^{*l}$ is the minimizer of $\hat{m}_k$ over $\mathcal{K}_l$, it holds that $\nabla_\eta \hat{m}_k(\eta_k^{*l}) = 0_{x_k}$, and thus $\|\nabla_\eta \hat{m}_k(\eta_k^{*l})\|_{x_k} = 0 \le \kappa_\theta \min\left(1, \|\eta_k^{*l}\|_{x_k}\right) \|G_k\|_{x_k}$. Assumption 2 is then satisfied.
Assumption 3: According to Lemma 3.2 of [15], a global minimizer of $\hat{m}_k$ over a subspace of $\mathbb{R}^D$ satisfies Assumption 3. As the solution $\eta_k^{*l}$ in each iteration of Algorithm 2 is a global minimizer over the subspace $\mathcal{K}_l$, each $\eta_k^{*l}$ satisfies Assumption 3. ∎

D.2 Proof of Lemma 3
Proof Cauchy condition in Assumption 1: In the first iteration of Algorithm 3, the step size is optimized along the steepest descent direction, as in the classical steepest-descent method of Cauchy, so the first inner iterate coincides with the Cauchy step. At each iteration $i$ of Algorithm 3, the line search in Eq. (24) seeks a step size that achieves a cost decrease; otherwise the step size is zero, meaning that no strict decrease can be achieved and the algorithm stops at Step (4). Because of this, we have $\hat{m}_k(\eta_k^{i+1}) \le \hat{m}_k(\eta_k^i)$. Then, considering all $i = 1, 2, \ldots$, we obtain Eq. (104). This shows that Algorithm 3 always returns a solution better than or equal to the Cauchy step.
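This argument can be checked numerically on a Euclidean stand-in for the model $\hat{m}_k$. The sketch below, under illustrative choices of `H` and `g`, verifies that exact line search along the steepest descent direction reproduces the closed-form Cauchy step $\alpha^* = g^\top g / (g^\top H g)$, and that further exact line searches only decrease the model value.

```python
import numpy as np

# Minimal sketch of the Cauchy-step argument on a quadratic model
# m(eta) = g^T eta + 0.5 eta^T H eta (a Euclidean stand-in for m_k).
rng = np.random.default_rng(1)
A = rng.standard_normal((15, 15))
H = A @ A.T + 15 * np.eye(15)           # positive definite model Hessian
g = rng.standard_normal(15)

def model(eta):
    return g @ eta + 0.5 * eta @ H @ eta

alpha_star = (g @ g) / (g @ H @ g)      # closed-form Cauchy step size
eta_cauchy = -alpha_star * g

# brute-force line search along -g agrees with the closed form
alphas = np.linspace(0.0, 2.0 * alpha_star, 20001)
vals = [model(-a * g) for a in alphas]
assert abs(alphas[int(np.argmin(vals))] - alpha_star) < 1e-3

# subsequent exact line searches never increase the model value
eta, prev = eta_cauchy, model(eta_cauchy)
for _ in range(5):
    d = -(g + H @ eta)                  # model steepest descent direction
    alpha = (d @ d) / (d @ H @ d)       # exact line search along d
    eta = eta + alpha * d
    cur = model(eta)
    assert cur <= prev + 1e-12
    prev = cur
print("Cauchy value:", model(eta_cauchy), "after refinement:", prev)
```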
Let $G_{k+1}$ be the subsampled gradient evaluated at $R_{x_k}(\eta_k^*)$. Based on Theorem 1, it holds that $G_{k+1} = \lim_{i\to\infty} G_k^i$, where $G_k^i$ is the resulting subsampled gradient after $i$ inner iterations. Since $G_{k+1}$ approximates the full gradient $\nabla f(R_{x_k}(\eta_k^*))$, we have $\nabla f(R_{x_k}(\eta_k^*)) = \mathbb{E}[G_{k+1}] = \mathbb{E}[\lim_{i\to\infty} G_k^i]$. Hence, Eq. (105) holds, which indicates that the equality is attained as $\|\nabla_\eta f(R_{x_k}(\eta_k^*))\|_{x_k} = 0$. Combining this with Eq. (104) completes the proof. ∎
E Appendix: Proof of Theorem 2

E.1 Matrix Bernstein Inequality
We build the proof on the matrix Bernstein inequality, which we restate in the following lemma.

Lemma 4 (Matrix Bernstein Inequality [26, 58]) Let $A_1, \ldots, A_n$ be independent, centered random matrices with common dimension $d_1 \times d_2$. Assume that each one is uniformly bounded: $\mathbb{E}[A_i] = 0$ and $\|A_i\| \le \mu$ for $i = 1, \ldots, n$. (108)
Given the matrix sum $Z = \sum_{i=1}^n A_i$, we define its variance $\nu(Z)$ by
$$\nu(Z) = \max\left\{\left\|\mathbb{E}\left[Z Z^\top\right]\right\|, \left\|\mathbb{E}\left[Z^\top Z\right]\right\|\right\}. \quad (109)$$
Then
$$\mathbb{P}\left(\|Z\| \ge t\right) \le (d_1 + d_2)\exp\left(\frac{-t^2/2}{\nu(Z) + \mu t/3}\right). \quad (110)$$
This lemma supports the proof of Theorem 2 below.
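As a sanity check on the tail bound in Eq. (110), the following sketch samples independent, centered, uniformly bounded random matrices (sign-flipped unit rank-1 matrices, an illustrative choice), estimates $\nu(Z)$ empirically, and compares the observed failure frequency against the Bernstein bound. All dimensions and trial counts here are arbitrary.

```python
import numpy as np

# Numerical illustration of the matrix Bernstein inequality (Lemma 4).
rng = np.random.default_rng(2)
d1, d2, n, trials = 5, 4, 100, 1000
mu = 1.0                                  # uniform bound on ||A_i||

def sample_sum():
    # A_i: random sign times a unit-norm rank-1 matrix -> centered, ||A_i|| = mu
    Z = np.zeros((d1, d2))
    for _ in range(n):
        u = rng.standard_normal(d1); u /= np.linalg.norm(u)
        v = rng.standard_normal(d2); v /= np.linalg.norm(v)
        Z += rng.choice([-1.0, 1.0]) * mu * np.outer(u, v)
    return Z

# estimate the variance statistic nu(Z) = max(||E[Z Z^T]||, ||E[Z^T Z]||)
samples = [sample_sum() for _ in range(trials)]
S1 = np.mean([Z @ Z.T for Z in samples], axis=0)
S2 = np.mean([Z.T @ Z for Z in samples], axis=0)
nu = max(np.linalg.norm(S1, 2), np.linalg.norm(S2, 2))

t = 3.0 * np.sqrt(nu)
bound = (d1 + d2) * np.exp(-(t**2 / 2.0) / (nu + mu * t / 3.0))
empirical = np.mean([np.linalg.norm(Z, 2) >= t for Z in samples])
print(f"empirical P(||Z||>=t) = {empirical:.4f}, Bernstein bound = {bound:.4f}")
assert empirical <= bound
```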

E.2 Main Proof
Proof Following the subsampling process, a total of $|S_g|$ matrices are uniformly sampled from the set of $n$ Riemannian gradients $\{\mathrm{grad}\, f_i(x)\}_{i=1}^n \subseteq \mathbb{R}^{d \times r}$. We denote each sampled element by $G_x^{(i)}$ and define the random matrix $X_i = G_x^{(i)} - \mathrm{grad}\, f(x)$ for $i = 1, 2, \ldots, |S_g|$.
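The subsampling setup above can be mimicked numerically. The sketch below uses plain Euclidean $d \times r$ matrices as surrogates for the Riemannian gradients $\mathrm{grad}\, f_i(x)$ (all sizes are illustrative), and checks that the centered matrices $X_i$ average to something whose spectral norm shrinks as $|S_g|$ grows, consistent with $\mathbb{E}[X_i] = 0$.

```python
import numpy as np

# Sketch of the subsampling step with surrogate Euclidean component gradients.
rng = np.random.default_rng(3)
n, d, r = 5000, 10, 3
grads = rng.standard_normal((n, d, r))    # surrogate grad f_i(x), i = 1..n
full = grads.mean(axis=0)                 # grad f(x) = (1/n) sum_i grad f_i(x)

def deviation(sample_size, reps=200):
    # average spectral norm of (1/|S_g|) sum_i X_i over many draws
    errs = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=sample_size)   # uniform sampling
        X = grads[idx].mean(axis=0) - full           # mean of the X_i
        errs.append(np.linalg.norm(X, 2))
    return float(np.mean(errs))

small, large = deviation(50), deviation(5000)
print(f"mean spectral deviation: |S_g|=50 -> {small:.4f}, |S_g|=5000 -> {large:.4f}")
assert large < small                      # larger samples concentrate tighter
```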
Our focus is the type of problem in Eq. (1), for which each $G_x^{(i)}$ is an unbiased sample, i.e., $\mathbb{E}[X_i] = 0$. Define the random variable $X = \frac{1}{|S_g|}\sum_{i=1}^{|S_g|} X_i$. Its variance satisfies Eq. (115). Taking $G_x^{(i)} = \mathrm{grad}\, f_1(x)$ as an example and applying the definition of $K_{g_{\max}}$ in Eq. (53), we obtain Eq. (116), where the first inequality uses $(a+b)^2 \le 2a^2 + 2b^2$. Combining Eq. (115) and Eq. (116) bounds the variance of $X$. Now we are ready to apply Lemma 4. Given $\mathbb{E}\left[\frac{1}{|S_g|} X_i\right] = \frac{1}{|S_g|}\mathbb{E}[X_i] = 0$ and, by the supremum definition, $\left\|\frac{1}{|S_g|} X_i\right\|_x = \frac{1}{|S_g|}\|X_i\|_x \le \frac{K_{g_{\max}}}{|S_g|}$, the matrix Bernstein inequality yields a tail bound whose last equality implies $\epsilon = \sqrt{\frac{2\left(2K_{g_{\max}}^2 + K_{g_{\max}}\right)\ln\frac{d+r}{\delta}}{|S_g|}}$.

In simple words, with probability at least $1 - \delta$, we have $\|X\|_x < \epsilon$. Letting $\epsilon = \delta_g$ results in the sample-size bound as in Eq. (55). Therefore, we have $\|X\|_x \le \delta_g$. Expanding $X$, the following is satisfied with probability at least $1 - \delta$: $\left\|\frac{1}{|S_g|}\sum_{i=1}^{|S_g|} G_x^{(i)} - \mathrm{grad}\, f(x)\right\|_x \le \delta_g$.

For the problem defined in Eq. (1) with a second-order retraction, the subsampled Hessian is likewise unbiased. Define the analogous random variable built from $H_x^{(i)}[\eta] - \nabla^2 \hat{f}(0_x)[\eta]$. Its variance satisfies Eq. (130). Taking $H_x^{(i)}[\eta] = \nabla^2 \hat{f}_1(0_x)[\eta]$ as an example and applying the definition of $K_{H_{\max}}$ in Eq. (54), we obtain Eq. (131), where the first inequality uses $(a+b)^2 \le 2a^2 + 2b^2$, and $\eta$ is the current moving direction being optimized in Eq. (10). Combining Eq. (130) and Eq. (131) bounds the variance. We then apply Lemma 4: since the summands are centered and uniformly bounded, the matrix Bernstein inequality yields the corresponding concentration bound for the subsampled Hessian.

Proof By differentiating the approximate model, we obtain Eq. (166), where the first inequality follows from the triangle inequality and the second from Eqs. (49), (51) and (52). Additionally, from Lemma 3.8 in [33], we have Eq. (167), where $L_l > 0$ is a constant. Then, we obtain Eq. (168), with the first inequality following from the triangle inequality and the second from Eqs. (51), (52) and (167). By combining Eqs. (47), (166) and (168), with $\theta_k := \kappa_\theta \min\left(1, \|\eta_k^i\|_{x_k}\right)$, we obtain Eq. (169), which subsequently leads to Eq. (170). In the above derivation, we use the property of the parallel transport that it preserves the length of the transported vector. The last inequality in Eq. (170) is based on Eq. (51) and the triangle inequality. Now, we consider the following two cases. (i) If $\|\eta_k\|_{x_k} \ge 1$, from Eq. (47) we have $\theta_k = \kappa_\theta$, and therefore
In simple words, with a probability at least 1 − δ, we have X x < . Letting this results in the sample size bound as in Eq. (55) |Sg| ≥ Therefore, we have X x ≤ δg. Expanding X, we have the following satisfied with a probability at least 1 − δ: For the problem defined in Eq. (1) with a second-order retraction, it has Define a random variable Its variance satisfies Take H (i) x [η] = ∇ 2f 1 (0x)[η] as an example and applying the definition of K Hmax in Eq. (54), we have where the first inequality uses (a + b) 2 ≤ 2a 2 + 2b 2 , and η is the current moving direction being optimized in Eq. (10). Combining Eq. (130) and Eq. ((131), we have We then apply Lemma 4. Given E , the following is obtained from the matrix Bernstein inequality: Proof By differentiating the approximate model, we have where the first inequality follows from the triangle inequality and the second inequality from Eqs. (49), (51) and (52). Additionally, from Lemma 3.8 in [33], we have where L l > 0 is a constant. Then, we have with the first inequality following from the triangle inequality and the second from Eqs. (51), (52) and (167). Then, by combining Eqs. (47), (166) and (168), with θ k := κ θ min(1, η i k x k ) we obtain This results in Subsequently, it has In the above derivation, we use the property of the parallel transport that preserves the length of the transported vector. The last inequality in Eq. (170) is based on Eqs. (51) and the triangle inequality. Now, we consider the following two cases. (i) If η k x k ≥ 1, from Eq. (47) we have θ k = κ θ , and therefore