Universal heavy-ball method for nonconvex optimization under Hölder continuous Hessians

We propose a new first-order method for minimizing nonconvex functions with Lipschitz continuous gradients and Hölder continuous Hessians. The proposed algorithm is a heavy-ball method equipped with two particular restart mechanisms. It finds a solution where the gradient norm is less than ε in O(H_ν^{1/(2+2ν)} ε^{-(4+3ν)/(2+2ν)}) function and gradient evaluations, where ν ∈ [0, 1] and H_ν are the Hölder exponent and constant, respectively. Our algorithm is ν-independent and thus universal; it automatically achieves the above complexity bound with the optimal ν ∈ [0, 1] without knowledge of H_ν. In addition, the algorithm does not require other problem-dependent parameters as input, including the gradient's Lipschitz constant or the target accuracy ε. Numerical results illustrate that the proposed method is promising.


Introduction
This paper studies general nonconvex optimization problems

min_{x ∈ R^d} f(x),

where f : R^d → R is twice differentiable and lower bounded, i.e., inf_{x ∈ R^d} f(x) > −∞. Throughout the paper, we impose the following assumption of Lipschitz continuous gradients.

Assumption 1. There exists a constant L > 0 such that ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥ for all x, y ∈ R^d.
First-order methods [3, 31], which access f through function and gradient evaluations, have gained increasing attention because they are suitable for large-scale problems. A classical result is that the gradient descent method finds an ε-stationary point (i.e., x ∈ R^d where ∥∇f(x)∥ ≤ ε) in O(ε^{-2}) function and gradient evaluations under Assumption 1. Recently, more sophisticated first-order methods have been developed to achieve faster convergence for smoother functions. Such methods [2, 6, 28, 33-35, 53] have complexity bounds of O(ε^{-7/4}) or Õ(ε^{-7/4}) under Lipschitz continuity of Hessians in addition to gradients. This research stream raises two natural questions:

Question 1. How fast can first-order methods converge under smoothness assumptions stronger than Lipschitz continuous gradients but weaker than Lipschitz continuous Hessians?
Question 2. Can a single algorithm achieve both of the following complexity bounds: O(ε −2 ) for functions with Lipschitz continuous gradients and O(ε −7/4 ) for functions with Lipschitz continuous gradients and Hessians?
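The two bounds in Question 2 are exactly the ν = 0 and ν = 1 endpoints of the exponent (4 + 3ν)/(2 + 2ν) from the abstract. A quick sanity check (the helper name is ours, not the paper's):

```python
from fractions import Fraction

def complexity_exponent(nu):
    """Exponent of 1/eps in the bound O(eps^{-(4+3*nu)/(2+2*nu)})."""
    return Fraction(4 + 3 * nu, 2 + 2 * nu)

assert complexity_exponent(0) == 2                              # Lipschitz gradients only
assert complexity_exponent(1) == Fraction(7, 4)                 # Lipschitz Hessians too
assert complexity_exponent(Fraction(1, 2)) == Fraction(11, 6)   # an intermediate Hoelder exponent
```

The exponent decreases monotonically from 2 to 7/4 as ν grows, which is how a single ν-independent algorithm can match both rates.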
Question 2 is also crucial from a practical standpoint because it is often challenging for users of optimization methods to check whether a function of interest has a Lipschitz continuous Hessian. It would be convenient if there were no need to switch between different algorithms to achieve faster convergence. Motivated by these questions, we propose a new first-order method and provide its complexity analysis under Hölder continuity of Hessians. Hölder continuity generalizes Lipschitz continuity and has been widely used in complexity analyses of optimization methods [12, 13, 18, 20, 22-25, 30, 38]. Several properties and an example of Hölder continuity can be found in [23, Section 2].
• Our algorithm requires no knowledge of problem-dependent parameters, including the optimal ν, the Lipschitz constant L, or the target accuracy ε.
Let us describe our ideas for developing such an algorithm. We employ the Hessian-free analysis recently developed for Lipschitz continuous Hessians [35] to estimate the Hessian's Hölder continuity with only first-order information. The Hessian-free analysis uses inequalities that involve the Hessian's Lipschitz constant H_1 but not a Hessian matrix itself, enabling us to estimate H_1. Extending this analysis to general ν allows us to estimate the Hölder constant H_ν, given ν ∈ [0, 1]. We thus obtain an algorithm that requires ν as input and has the complexity bound (2) for the given ν. However, the resulting algorithm lacks usability because the ν that minimizes (2) is generally unknown.
Our main idea for developing a ν-independent algorithm is to set ν = 0 in the above ν-dependent algorithm. This may seem strange, but we prove that it works; a carefully designed algorithm for ν = 0 achieves the complexity bound (2) for any ν ∈ [0, 1]. Although we design an estimate for H_0, it also has a relationship with H_ν for ν ∈ (0, 1], as will be stated in Proposition 1. This proposition allows us to obtain the desired complexity bounds without specifying ν.
To evaluate the numerical performance of the proposed method, we conducted experiments on standard machine-learning tasks. The results illustrate that the proposed method outperforms state-of-the-art methods.
Notation. For vectors a, b ∈ R^d, let ⟨a, b⟩ denote the dot product and ∥a∥ the Euclidean norm. For a matrix A ∈ R^{m×n}, let ∥A∥ denote the operator norm, or equivalently the largest singular value.
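The stated equivalence between the operator norm and the largest singular value can be checked numerically (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))

# The operator norm induced by Euclidean norms equals the largest singular value.
op_norm = np.linalg.norm(A, ord=2)
largest_sv = np.linalg.svd(A, compute_uv=False)[0]  # singular values in descending order
assert np.isclose(op_norm, largest_sv)
```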

Related work
This section reviews previous studies from several perspectives and discusses similarities and differences between them and this work.
Complexity of second-order methods using Hölder continuous Hessians. The Hölder continuity of Hessians has been used to analyze second-order methods. Grapiglia and Nesterov [23] proposed a regularized Newton method that finds an ε-stationary point in O(ε^{-(2+ν)/(1+ν)}) evaluations of f, ∇f, and ∇²f, where ν ∈ [0, 1] is the Hölder exponent of ∇²f. This complexity bound generalizes previous O(ε^{-3/2}) bounds under Lipschitz continuous Hessians [10, 11, 14, 40]. We make the same assumption of Hölder continuous Hessians as in [23] but do not compute Hessians in the algorithm. Table 2 summarizes the first-order and second-order methods together with their assumptions.
Universality for Hölder continuity. When Hölder continuity is assumed, it is preferable that algorithms not require the exponent ν as input because a suitable value of ν tends to be hard to find in real-world problems. Such ν-independent algorithms, called universal methods, were first developed as first-order methods for convex optimization [30, 38] and have since been extended to other settings, including higher-order methods and nonconvex problems [12, 13, 20, 22-25]. Within this research stream, this paper proposes a universal method in a new setting: a first-order method under Hölder continuous Hessians. Because of the differences in settings, the existing techniques for universality cannot be applied directly; we obtain a universal method by setting ν = 0 in a ν-dependent algorithm, as discussed in Section 1.
Heavy-ball methods. Heavy-ball (HB) methods are a family of momentum methods first proposed by Polyak [43] for convex optimization. Although some complexity results have been obtained in (strongly) convex settings [21, 32], they are weaker than the optimal bounds given by Nesterov's accelerated gradient method [36, 39]. For nonconvex optimization, HB and its variants [15, 29, 46, 50] have been used in practice with great success, especially in deep learning, while studies on theoretical convergence analysis are few [34, 41, 42]. O'Neill and Wright [42] analyzed the local behavior of the original HB method, showing that the method is unlikely to converge to strict saddle points. Ochs et al. [41] proposed a generalized HB method, iPiano, that enjoys a complexity bound of O(ε^{-2}) under Lipschitz continuous gradients, which is of the same order as that of GD. Li and Lin [34] proposed an HB method with a restart mechanism that achieves a complexity bound of O(ε^{-7/4}) under Lipschitz continuous gradients and Hessians. Our algorithm is another HB method with a different restart mechanism that enjoys more general complexity bounds than that of Li and Lin [34], as discussed in Section 1.
Comparison with [35]. This paper shares some mathematical tools with [35] because we utilize the Hessian-free analysis introduced there to estimate the Hessian's Hölder continuity. While the analysis in [35] is for Nesterov's accelerated gradient method under Lipschitz continuous Hessians, we here analyze Polyak's HB method under Hölder continuity. Thanks to the simplicity of the HB momentum, our estimate for the Hölder constant is easier to compute than the estimate for the Lipschitz constant proposed in [35], which improves the efficiency of our algorithm. We emphasize that a ν-independent algorithm cannot be derived simply by applying the mathematical tools in [35]. It should also be mentioned that we have not confirmed that it is impossible or very challenging to develop a ν-independent algorithm with Nesterov's momentum under Hölder continuous Hessians.
Lower bounds. So far, we have discussed upper bounds on complexity, but there are also some studies on lower bounds. Carmon et al. [8] proved that no deterministic or stochastic first-order method can improve the complexity of O(ε^{-2}) under the assumption of Lipschitz continuous gradients alone. (See [8, Theorems 1 and 2] for more rigorous statements.) This result implies that GD is optimal in terms of complexity under Lipschitz continuous gradients. Carmon et al. [9] showed a lower bound of Ω(ε^{-12/7}) for first-order methods under Lipschitz continuous gradients and Hessians.
Compared with the upper bound of O(ε^{-7/4}) under the same assumptions, there is still a Θ(ε^{-1/28}) gap. Closing this gap would be an interesting research question, though this paper does not focus on it.
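The width of the gap is just the difference of the two exponents:

```python
from fractions import Fraction

upper = Fraction(7, 4)    # exponent in the O(eps^{-7/4}) upper bound
lower = Fraction(12, 7)   # exponent in the Omega(eps^{-12/7}) lower bound
gap = upper - lower
assert gap == Fraction(1, 28)  # hence the Theta(eps^{-1/28}) gap
```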

Preliminary results
The following lemma is standard for the analyses of first-order methods.
Lemma 1 (e.g., [37, Lemma 1.2.3]). Under Assumption 1, the following holds for any x, y ∈ R^d:

|f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (L/2)∥y − x∥².

This inequality helps estimate the Lipschitz constant L and evaluate the decrease in the objective function per iteration. We also use the following inequalities derived from Hölder continuous Hessians.
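Lemma 1, i.e., the standard descent-type inequality |f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (L/2)∥y − x∥², can be checked numerically on a quadratic f(x) = ½xᵀAx, whose gradient Ax has Lipschitz constant L = ∥A∥ (an illustrative sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                 # symmetric Hessian of the quadratic
L = np.linalg.norm(A, 2)          # Lipschitz constant of the gradient x -> A @ x

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(4), rng.standard_normal(4)
lhs = abs(f(y) - f(x) - grad(x) @ (y - x))   # |f(y) - f(x) - <grad f(x), y - x>|
rhs = 0.5 * L * np.linalg.norm(y - x) ** 2
assert lhs <= rhs + 1e-12
```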
Lemma 3. For all x, y ∈ R^d and ν ∈ [0, 1] such that H_ν < +∞, the following holds:

The proofs are given in Appendix A.1. These lemmas generalize [35, Lemmas 2 and 3] for Lipschitz continuous Hessians (i.e., ν = 1). It is important to note that the inequalities in Lemmas 2 and 3 are Hessian-free; they involve the Hessian's Hölder constant H_ν but not a Hessian matrix itself. Accordingly, we can adaptively estimate the Hölder continuity of ∇²f in the algorithm without computing Hessians.

Algorithm
The proposed method, Algorithm 1, is a heavy-ball (HB) method equipped with two particular restart schemes. In the algorithm, the iteration counter k is reset to 0 when HB restarts on Line 8 or 10, whereas the total iteration counter K is not. We refer to the period between one reset of k and the next as an epoch. Note that it is unnecessary to implement K in the algorithm; it is included here only to make the statements in our analysis concise.
The algorithm uses estimates ℓ and h_k for the Lipschitz constant L and the Hölder constant H_0, respectively. The estimate ℓ is fixed during an epoch, while h_k is updated at each iteration, hence the subscript k.

Update of solutions
With an estimate ℓ for the Lipschitz constant L, Algorithm 1 defines a solution sequence (x_k) as follows: v_0 = 0 and

for k ≥ 1. Here, (v_k) is the velocity sequence, and 0 ≤ θ_k ≤ 1 is the momentum parameter. Let x_{−1} := x_0 for convenience, which makes (4) valid for k = 0. This type of optimization method is called a heavy-ball method or Polyak's momentum method.
In this paper, we use the simplest parameter setting, θ_k = 1 for all k ≥ 1. Our choice of θ_k differs from existing ones; the existing complexity analyses [16, 17, 21, 32, 34, 43] of HB prohibit θ_k = 1. For example, Li and Lin [34] proposed

Our new proof technique, described later in Section 5.1, enables us to set θ_k = 1. We will later use the averaged solution x̄_k to compute the estimate h_k for H_0 and to set the best solution x⋆_k. The averaged solution can be computed efficiently with a simple recursion:
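Assuming x̄_k is the plain average of x_1, …, x_k (an assumption on our part; the recursion itself is elided in this extraction), the standard constant-memory update would be x̄_k = x̄_{k−1} + (x_k − x̄_{k−1})/k:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.standard_normal((10, 3))   # stand-ins for the iterates x_1, ..., x_10

x_bar = np.zeros(3)
for k, x_k in enumerate(xs, start=1):
    # Running-average recursion: x_bar_k = x_bar_{k-1} + (x_k - x_bar_{k-1}) / k
    x_bar = x_bar + (x_k - x_bar) / k

assert np.allclose(x_bar, xs.mean(axis=0))
```

This costs O(d) per iteration and avoids storing all iterates.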

Estimation of Hölder continuity
Let

to simplify the notation. Our analysis uses the following inequalities, due to Lemmas 2 and 3.
Algorithm 1 requires no information on the Hölder continuity of ∇²f; it estimates this continuity automatically. To illustrate the trick, let us first consider a prototype algorithm that works when a value of ν ∈ [0, 1] with H_ν < +∞ is given, using estimates that come from Lemma 4. This estimation scheme yields a ν-dependent algorithm that has the complexity bound (2) for the given ν, though we omit the details. The algorithm is not so practical because it requires ν ∈ [0, 1] such that H_ν < +∞ as input. However, perhaps surprisingly, setting ν = 0 in the ν-dependent algorithm gives a ν-independent algorithm that achieves the bound (2) for all ν ∈ [0, 1]. Algorithm 1 is the ν-independent algorithm obtained in that way. Let h_0 := 0 for convenience. At iteration k ≥ 1 of each epoch, we use the estimate h_k for H_0 defined by

The above inequalities were obtained by plugging ν = 0 into (8) and (9).
Although we designed h_k to estimate H_0, it fortunately also relates to H_ν for general ν ∈ [0, 1]. The following upper bound on h_k shows the relationship between h_k and H_ν, which will be used in the complexity analysis.

Proposition 1. For all k ≥ 1 and ν ∈ [0, 1] such that H_ν < +∞, the following holds:

Proof. Lemma 4 gives

Hence, definition (10) of h_k yields

The desired result follows inductively since

For ν = 0, Proposition 1 gives a natural upper bound, h_k ≤ H_0, since the estimate h_k is designed for H_0 based on Lemma 4. For ν ∈ (0, 1], the upper bound can become tighter when kS_k is small. Indeed, the iterates (x_k) are expected to move less significantly within an epoch as the algorithm proceeds. Accordingly, (S_k) increases more slowly in later epochs, yielding a tighter upper bound on h_k. This trick improves the complexity bound from O(ε^{-2}) for ν = 0 to O(ε^{-(4+3ν)/(2+2ν)}) for general ν ∈ [0, 1].

Restart mechanisms
Algorithm 1 is equipped with two restart mechanisms. The first one uses the standard descent condition to check whether the current estimate ℓ for the Lipschitz constant L is large enough. If the descent condition (13) does not hold, HB restarts with a larger ℓ from the best solution x⋆_k := argmin_{x ∈ {x_0, …, x_k, x̄_1, …, x̄_k}} f(x) found during the epoch. We consider not only x_0, …, x_k but also the averaged solutions x̄_1, …, x̄_k as candidates for the next starting point because averaging may stabilize the behavior of the HB method. As we will show later in Lemma 6, the gradient norm at averaged solutions is small, which leads to stability. For strongly convex quadratic problems, Danilova and Malinovsky [16] also showed that averaged HB methods have a smaller maximal deviation from the optimal solution than the vanilla HB method. A similar effect is expected for nonconvex problems in the neighborhood of local optima, where a quadratic approximation is justified.
The second restart scheme resets the momentum effect when k becomes large; if condition (14) is satisfied, HB restarts from the best solution x⋆_k. At the restart, we can reset ℓ to a smaller value in the hope of improving practical performance, though decreasing ℓ is not necessary for the complexity analysis. This restart scheme guarantees that condition (15) holds at iteration k of each epoch. The Lipschitz estimate ℓ increases only when the descent condition (13) is violated. On the other hand, Lemma 1 implies that condition (13) always holds as long as ℓ ≥ L. Hence, we have the following upper bound on ℓ.
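The exact condition (13) is not reproduced in this extraction. The interplay between a descent test and the growth of ℓ can nevertheless be illustrated with a generic backtracking sketch, using the standard descent test f(x⁺) ≤ f(x) − ∥∇f(x)∥²/(2ℓ) as a stand-in for (13) and a hypothetical helper name:

```python
import numpy as np

def gd_step_with_backtracking(f, grad, x, ell, alpha=2.0):
    """One gradient step x+ = x - grad(x)/ell, increasing ell by the factor alpha
    until the standard descent condition f(x+) <= f(x) - ||grad(x)||^2 / (2*ell)
    holds. Schematic only: it stands in for condition (13), which is not shown here."""
    g = grad(x)
    while True:
        x_new = x - g / ell
        if f(x_new) <= f(x) - np.dot(g, g) / (2 * ell):
            return x_new, ell
        ell *= alpha   # estimate was too small: retry ("restart") with a larger ell

# On f(x) = 2*||x||^2 the gradient 4x has Lipschitz constant L = 4,
# and the test first passes once ell has grown to 4.
f = lambda x: 2.0 * np.dot(x, x)
grad = lambda x: 4.0 * x
x, ell = np.ones(3), 1.0
x, ell = gd_step_with_backtracking(f, grad, x, ell)
assert ell >= 4.0
assert np.allclose(x, 0.0)
```

As in Algorithm 1, the estimate only grows when the test fails, so it never exceeds αL by more than one factor of α.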

Objective decrease for one epoch
First, we evaluate the decrease in the objective function value during one epoch.
Lemma 5. Suppose that Assumption 1 holds and that the descent condition (16) holds for all 1 ≤ i ≤ k. Then, the following holds under condition (15):

Before providing the proof, let us remark on the lemma. Evaluating the decrease in the objective function is the central part of a complexity analysis. It is also an intricate part because the function value does not necessarily decrease monotonically in nonconvex acceleration methods. To overcome the non-monotonicity, previous analyses have employed different proof techniques. For example, Li and Lin [33] constructed a quadratic approximation of the objective, diagonalized the Hessian, and evaluated the objective decrease separately for each coordinate; Marumo and Takeda [35] designed a tricky potential function and showed that it is nearly decreasing.
This paper uses another technique to deal with the non-monotonicity. We observe that the solution x_k does not need to attain a small function value; it suffices for at least one of x_1, …, x_k to do so, thanks to our particular restart mechanism. This observation permits the left-hand side of (17) to be min_{1≤i≤k} f(x_i) rather than f(x_k) and makes the proof easier. The proof of Lemma 5 calculates a weighted sum of 2k − 1 inequalities derived from Lemmas 1 and 3, which is elementary compared with the existing proofs. We now provide the proof.
Proof of Lemma 5. Combining (16) with the update rules (3) and (4) yields

for 1 ≤ i ≤ k. For 1 ≤ i < k, we also have (11) and

We will calculate a weighted sum of these 2k − 1 inequalities:

The left-hand side of the weighted sum is

On the right-hand side of the weighted sum, some calculations with v_0 = 0 show that the inner-product terms ⟨v_{i−1}, v_i⟩ cancel out as follows:

The remaining terms on the right-hand side of the weighted sum are

We now obtain

Finally, we evaluate the coefficient on the right-hand side with (15) as

which completes the proof.
The proof elucidates that the second restart condition (14) was designed to derive the lower bound of ℓ(2k − 1)/(4k) in (20).
For an epoch that ends at Line 10 in iteration k ≥ 1, Lemma 5 gives

For an epoch that ends at Line 8 in iteration k ≥ 2, the lemma gives

These bounds will be used to derive the complexity bound.

Upper bound on gradient norm
Next, we prove the following upper bound on the gradient norm at the averaged solution.

Complexity bound
Let ℓ̄ denote the upper bound on the Lipschitz estimate ℓ given in Proposition 2: ℓ̄ := max{ℓ_init, αL}.
The following theorem shows iteration complexity bounds for Algorithm 1. Recall that α > 1 and 0 < β ≤ 1 are the input parameters of Algorithm 1.

Theorem 1. Suppose that Assumption 1 holds and inf_{x ∈ R^d} f(x) > −∞.
In Algorithm 1, when ∥∇f(x_k)∥ ≤ ε holds for the first time, the total iteration count K is at most

In particular, if we set β = 1, then c_1 = 0 and the upper bound simplifies to

inf

Proof. We classify the epochs into three types:
• successful epoch: an epoch that does not find an ε-stationary point and ends at Line 10 with the descent condition (13) satisfied,
• unsuccessful epoch: an epoch that does not find an ε-stationary point and ends at Line 8 with the descent condition (13) violated,
• last epoch: the epoch that finds an ε-stationary point.
Let N_suc and N_unsuc be the numbers of successful and unsuccessful epochs, respectively, and let K_suc be the total iteration number of all successful epochs. Below, we fix ν ∈ [0, 1] arbitrarily such that H_ν < +∞. (Note that such a ν exists since H_0 ≤ 2L < +∞.)

Successful epochs. Let us focus on a successful epoch and let k denote the total number of iterations of the epoch under consideration, i.e., the epoch ends at iteration k. We then have the following: if k = 1, we have

On the other hand, putting the restart condition (14) together with Proposition 1 yields

and hence

Combining (26) and (27) leads to

Plugging them into (21) yields

since ν ≥ 0. Summing these bounds over all successful epochs results in

and hence

Other epochs. Let k_1, …, k_{N_unsuc} and k_{N_unsuc+1} be the iteration numbers of the unsuccessful and last epochs, respectively. Then, the total iteration number of these epochs can be bounded with the Cauchy-Schwarz inequality as follows:

where the sum over i: k_i ≥ 2 runs over i = 1, …, N_unsuc + 1 such that k_i ≥ 2. We will evaluate N_unsuc and the sum of k_i². First, we have ℓ_init β^{N_suc} α^{N_unsuc} ≤ ℓ̄ and hence

from (28), where c_1 and c_2 are defined by (24). Next, let us focus on an epoch that ends at iteration k ≥ 2. Lemma 6 gives ε < ℓ 8S_{k−1}/k³ and hence

Summing this bound over all unsuccessful and last epochs results in

Plugging (30) and (31) into (29) yields

where the last inequality uses

Putting this bound together with (28) gives an upper bound on the total iteration number of all epochs:

where we have used

Algorithm 1 evaluates the objective function and its gradient at two points, x_k and x̄_k, in each iteration. Therefore, the number of evaluations is of the same order as the iteration complexity in Theorem 1.
The complexity bounds given in Theorem 1 may look somewhat unfamiliar since they involve an inf-operation over ν. Such a bound is a significant benefit of ν-independent algorithms. The ν-dependent prototype algorithm described immediately after Lemma 4 achieves the bound only for the given ν. In contrast, Algorithm 1 is ν-independent and automatically achieves the bound with the optimal ν, as shown in Theorem 1. The fact that the optimal ν is difficult to find also points to the advantage of our ν-independent algorithm. The complexity bound (25) also gives a looser bound:

91∆ℓ̄H

where we have taken ν = 0 and used H_0 ≤ 2L ≤ 2ℓ̄. This bound matches the classical bound of O(ε^{-2}) for GD. Theorem 1 thus shows that our HB method has a more refined complexity bound than GD.
Remark 1. Although we assumed global Lipschitz and Hölder continuity in Assumption 1 and Definition 1, these assumptions can be restricted to the region that the iterates reach. More precisely, if the iterates (x_k) generated by Algorithm 1 are contained in some convex set C ⊆ R^d, we can replace R^d with C throughout our analysis and obtain the same complexity bound as Theorem 1 under Lipschitz and Hölder continuity on C.

Numerical experiments
This section compares the performance of the proposed method with several existing algorithms. The experimental setup, including the compared algorithms and problem instances, follows [35]. We implemented the code in Python with JAX [4] and Flax [26] and executed it on a computer with an Apple M3 chip (12 cores) and 36 GB RAM. The source code used in the experiments is available on GitHub.

Compared algorithms
We compared the following six algorithms.
• JNJ2018 [28, Algorithm 2] is an accelerated gradient (AG) method for nonconvex optimization. The parameters were set in accordance with [28, Eq. (3)]. The equation involves constants c and χ, whose values are difficult to determine; we set them as c = χ = 1.
The parameter settings for JNJ2018 and LL2022 require the values of the Lipschitz constants L and H_1 and the target accuracy ε. For these two methods, we tuned the best L among {10^{-4}, 10^{-3}, …, 10^{10}} and set H_1 = 1 and ε = 10^{-16}, following [33, 35]. It should be noted that if these values deviate from the actual ones, the methods do not guarantee convergence.

Problem instances
We tested the algorithms on seven instances. The first four are benchmark functions from [27].
The dimension d of the above problems was fixed at d = 10⁶. The starting point was set as x_init = x* + δ, where x* is the optimal solution and each entry of δ was drawn from the standard normal distribution N(0, 1). For the Qing function (34), we used x* = (√1, √2, …, √d) to set the starting point.
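For reference, the Qing function in its commonly used form is f(x) = Σᵢ (xᵢ² − i)², which is consistent with the minimizer x* = (√1, …, √d) above. A sketch of the function and the starting-point construction (small d for illustration; the experiments use d = 10⁶):

```python
import numpy as np

def qing(x):
    """Qing benchmark: f(x) = sum_i (x_i^2 - i)^2, minimized at x_i = +-sqrt(i)."""
    i = np.arange(1, x.size + 1)
    return np.sum((x ** 2 - i) ** 2)

def qing_grad(x):
    i = np.arange(1, x.size + 1)
    return 4.0 * x * (x ** 2 - i)

d = 10
x_star = np.sqrt(np.arange(1, d + 1))        # (sqrt(1), sqrt(2), ..., sqrt(d))
assert qing(x_star) < 1e-12                  # x* is a global minimizer
assert np.allclose(qing_grad(x_star), 0.0)   # gradient vanishes at x*

# Starting point as in the experiments: x_init = x* + delta, delta ~ N(0, I).
rng = np.random.default_rng(0)
x_init = x_star + rng.standard_normal(d)
```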
The other three instances are more practical examples from machine learning.
• Training a neural network for classification with the MNIST dataset: The vectors x_1, …, x_N ∈ R^M and y_1, …, y_N ∈ {0, 1}^K are given data, ℓ_CE is the cross-entropy loss, and ϕ_1(·; w) : R^M → R^K is a neural network parameterized by w ∈ R^d. We used a three-layer fully connected network with bias parameters. The layer widths are M, 32, 16, and K, where M = 784 and K = 10. The hidden layers use the logistic sigmoid activation, and the output layer uses the softmax activation. The total number of parameters is d = (784 × 32 + 32 × 16 + 16 × 10) + (32 + 16 + 10) = 25818. The data size is N = 10000.
• Training an autoencoder for the MNIST dataset: min

The vectors x_1, …

• Low-rank matrix completion with the MovieLens-100K dataset: min

The set Ω consists of N = 100000 observed entries of a p × q data matrix, and (i, j, s) ∈ Ω means that the (i, j)-th entry is s. The second term with the Frobenius norm ∥·∥_F was proposed in [51] as a way to balance U and V. The size of the data matrix is p = 943 by q = 1682, and we set the rank r ∈ {100, 200}. Thus, the number of variables is pr + qr ∈ {262500, 525000}.
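The variable counts quoted for the MNIST classifier and the matrix-completion instances can be verified directly:

```python
# MNIST classifier: fully connected 784-32-16-10 network with biases.
d_nn = (784 * 32 + 32 * 16 + 16 * 10) + (32 + 16 + 10)
assert d_nn == 25818

# Matrix completion: U is p x r and V is q x r, so pr + qr variables in total.
p, q = 943, 1682
assert [(p + q) * r for r in (100, 200)] == [262500, 525000]
```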
Although we did not check whether the above seven instances have globally Lipschitz continuous gradients or Hessians, we confirmed in our experiments that the iterates generated by each algorithm remained bounded. Since all of the above instances are three times continuously differentiable, both the gradients and the Hessians are Lipschitz continuous on the bounded domain. In view of Remark 1, we can therefore say that, in the experiments, the proposed algorithm achieves the same complexity bound as Theorem 1.

Results
Figure 1 illustrates the results on the four benchmark functions. The horizontal axis is the number of calls to the oracle that computes both f(x) and ∇f(x) at a given point x ∈ R^d.
Let us first focus on the methods other than L-BFGS, which is very practical but, unlike the other methods, does not have complexity guarantees for general nonconvex functions. Figures 1(a) and 1(b) show that Proposed converged faster than the existing methods except for L-BFGS, and Figure 1(c) shows that Proposed and MT2022 converged fast. Figure 1(d) shows that GD and LL2022 attained a small objective function value, while GD and Proposed converged fast in terms of the gradient norm. In summary, the proposed algorithm was stable and fast.
L-BFGS successfully solved the four benchmarks, but we should note that the results do not imply that L-BFGS converged faster than the proposed algorithm in terms of execution time. Figure 2 provides the four figures in the right column of Figure 1 with the horizontal axis replaced by elapsed time. Figure 2 shows that Proposed converged comparably to or faster than L-BFGS in terms of time. One reason for the large difference in the apparent performance of L-BFGS between Figures 1 and 2 is that the computational costs of the non-oracle parts of L-BFGS, such as updating the Hessian approximation and solving linear systems, are not negligible. In contrast, the proposed algorithm requires no heavy computation besides oracle calls and is thus more advantageous in execution time when function and gradient evaluations are cheap.
Figure 3 presents the results on the machine learning instances. As in Figure 1, Figure 3 shows that the proposed algorithm performed comparably to or better than the existing methods except for L-BFGS, especially in reducing the gradient norm.
Figure 4 illustrates the objective function value f(x_k) and the estimates ℓ and h_k at each iteration of the proposed algorithm on the machine learning instances. The iterations at which a restart occurred are also marked; "successful" and "unsuccessful" refer to restarts at Line 10 and Line 8 of Algorithm 1, respectively. The figure shows that the proposed algorithm restarts frequently in the early stages but that the frequency decreases as the iterations progress. The frequent restarts in the early stages help update the estimate ℓ; it reached suitable values within the first few iterations, even though it was initialized to a very small value, ℓ_init = 10^{-3}. The infrequent restarts in later stages allow the algorithm to take full advantage of the HB momentum.

A Omitted proofs

A.1 Proof of Lemmas 2 and 3

Proof of Lemma 2. Since f is twice differentiable, we have

and its weighted average gives

A.2 Proof of (7)

Inequality (7) is an adaptation of [35, Eq. (22)], which was originally for an accelerated gradient method with Lipschitz continuous Hessians, to our heavy-ball method with Hölder continuous Hessians. The following proof of (7) is based on the one for [35, Eq. (22)] but is easier, thanks to our simple choice of θ_k = 1.
Proof. Using the triangle inequality and Lemma 2 with n = k and z_i = x_i gives

and we will evaluate each term. First, it follows from the update rule (3) that

Therefore, the first term on the right-hand side of (39) reduces to

Next, we bound the second term. Using the triangle inequality and the Cauchy-Schwarz inequality yields

We obtain the desired result by evaluating the right-hand side of (39).

Table 1: Complexity of first-order methods for nonconvex optimization. "Exponent in complexity" means