1 Introduction

This paper studies general nonconvex optimization problems:

$$\begin{aligned} \min _{x \in \mathbb R^d} \ f(x), \end{aligned}$$

where \(f :\mathbb R^d \rightarrow \mathbb R\) is twice differentiable and lower bounded, i.e., \(\inf _{x \in \mathbb R^d} f(x) > - \infty \). Throughout the paper, we impose the following assumption of Lipschitz continuous gradients.

Assumption 1

There exists a constant \(L > 0\) such that \(\Vert \nabla f(x) - \nabla f(y)\Vert \le L\Vert x - y\Vert \) for all \(x, y \in \mathbb R^d\).

First-order methods [3, 31], which access f through function and gradient evaluations, have gained increasing attention because they are suitable for large-scale problems. A classical result is that the gradient descent method finds an \(\varepsilon \)-stationary point (i.e., \(x \in \mathbb R^d\) where \(\Vert \nabla f(x)\Vert \le \varepsilon \)) in \(O(\varepsilon ^{-2})\) function and gradient evaluations under Assumption 1. Recently, more sophisticated first-order methods have been developed to achieve faster convergence for smoother functions. Such methods [2, 6, 28, 33,34,35, 53] have complexity bounds of \(O(\varepsilon ^{-7/4})\) or \(\tilde{O}(\varepsilon ^{-7/4})\) under Lipschitz continuity of Hessians in addition to gradients.

This research stream raises two natural questions:

  1. Question 1

    How fast can first-order methods converge under smoothness assumptions stronger than Lipschitz continuous gradients but weaker than Lipschitz continuous Hessians?

  2. Question 2

    Can a single algorithm achieve both of the following complexity bounds: \(O(\varepsilon ^{-2})\) for functions with Lipschitz continuous gradients and \(O(\varepsilon ^{-7/4})\) for functions with Lipschitz continuous gradients and Hessians?

Question 2 is also crucial from a practical standpoint because it is often challenging for users of optimization methods to check whether a function of interest has a Lipschitz continuous Hessian. A single algorithm that attains the faster rate whenever it is available would remove the need to choose among several different algorithms.

Motivated by the questions, we propose a new first-order method and provide its complexity analysis with the Hölder continuity of Hessians. Hölder continuity generalizes Lipschitz continuity and has been widely used for complexity analyses of optimization methods [12, 13, 18, 20, 22,23,24,25, 30, 38]. Several properties and an example of Hölder continuity can be found in [23, Section 2].

Definition 1

The Hölder constant of \(\nabla ^2 f\) with exponent \(\nu \in [0, 1]\) is defined by

$$\begin{aligned} H_{\nu } {:}{=}\sup _{x,y \in \mathbb R^d,\,x \ne y} \frac{\Vert \nabla ^2 f(x) - \nabla ^2 f(y)\Vert }{\Vert x - y\Vert ^\nu }. \end{aligned}$$

The Hessian \(\nabla ^2 f\) is said to be Hölder continuous with exponent \(\nu \), or \(\nu \)-Hölder, if \(H_\nu < +\infty \).

We should emphasize that f determines the value of \(H_{\nu }\) for each \(\nu \in [0, 1]\) and that \(\nu \) is not a constant determined by f. Under Assumption 1, we have \(H_{0} \le 2 L\) because the assumption implies \(\Vert \nabla ^2 f(x)\Vert \le L\) for all \(x \in \mathbb R^d\) [37, Lemma 1.2.2]. For \(\nu \in (0, 1]\), we may have \(H_\nu = +\infty \), but we will allow it. In contrast, all existing first-order methods [2, 6, 28, 33,34,35, 53] with complexity bounds of \(O(\varepsilon ^{-7/4})\) or \(\tilde{O}(\varepsilon ^{-7/4})\) assume \(H_{1} < + \infty \) (i.e., the Lipschitz continuity of \(\nabla ^2 f\)) in addition to Assumption 1. We should note that it is often difficult to compute the Hölder constant \(H_{\nu }\) of a real-world function for a given \(\nu \in [0, 1]\).
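As a brief worked step (spelled out here for convenience rather than quoted from [37]), the bound \(H_{0} \le 2 L\) follows from the triangle inequality together with \(\Vert x - y\Vert ^0 = 1\):

$$\begin{aligned} \Vert \nabla ^2 f(x) - \nabla ^2 f(y)\Vert \le \Vert \nabla ^2 f(x)\Vert + \Vert \nabla ^2 f(y)\Vert \le 2 L \quad \text {for all } x, y \in \mathbb R^d. \end{aligned}$$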

The proposed first-order method is a heavy-ball method equipped with two particular restart mechanisms, enjoying the following advantages:

  • For \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \), our algorithm finds an \(\varepsilon \)-stationary point in

    $$\begin{aligned} O \left( { H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }} }\right) \end{aligned} \tag{2}$$

    function and gradient evaluations under Assumption 1. This result answers Question 1 and covers the classical bound of \(O(\varepsilon ^{-2})\) for \(\nu = 0\) and the state-of-the-art bound of \(O(\varepsilon ^{-7/4})\) for \(\nu = 1\).

  • The complexity bound (2) is simultaneously attained for all \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \) by a single \(\nu \)-independent algorithm. The algorithm thus automatically achieves the bound with the optimal \(\nu \in [0, 1]\) that minimizes (2). This result affirmatively answers Question 2.

  • Our algorithm requires no knowledge of problem-dependent parameters, including the optimal \(\nu \), the Lipschitz constant \(L\), or the target accuracy \(\varepsilon \).
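The way the exponent in the complexity bound interpolates between the two known rates can be checked with a few lines of Python. This is only an illustrative sketch: the function name `complexity_exponent` is hypothetical and not part of the paper, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def complexity_exponent(nu):
    """Exponent alpha in O(eps^{-alpha}), i.e. (4 + 3*nu) / (2 + 2*nu)."""
    nu = Fraction(nu)
    if not 0 <= nu <= 1:
        raise ValueError("nu must lie in [0, 1]")
    return (4 + 3 * nu) / (2 + 2 * nu)


# Endpoints recover the classical and the state-of-the-art exponents.
for nu in (0, Fraction(1, 2), 1):
    print(f"nu = {nu}: exponent = {complexity_exponent(nu)}")
```

The exponent decreases monotonically in \(\nu \) (its derivative is \(-2 / (2 + 2 \nu )^2\)), so a larger admissible Hölder exponent always gives a better dependence on \(\varepsilon \).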

Let us describe our ideas for developing such an algorithm. We employ the Hessian-free analysis recently developed for Lipschitz continuous Hessians [35] to estimate the Hessian’s Hölder continuity with only first-order information. The Hessian-free analysis uses inequalities that include the Hessian’s Lipschitz constant \(H_{1}\) but not a Hessian matrix itself, enabling us to estimate \(H_{1}\). Extending this analysis to general \(\nu \) allows us to estimate the Hölder constant \(H_{\nu }\), given \(\nu \in [0, 1]\). We thus obtain an algorithm that requires \(\nu \) as input and has the complexity bound (2) for the given \(\nu \). However, the resulting algorithm lacks usability because \(\nu \) that minimizes (2) is generally unknown.

Our main idea for developing a \(\nu \)-independent algorithm is to set \(\nu = 0\) for the above \(\nu \)-dependent algorithm. This may seem strange, but we prove that it works; a carefully designed algorithm for \(\nu = 0\) achieves the complexity bound (2) for any \(\nu \in [0, 1]\). Although we design an estimate for \(H_{0}\), it also has a relationship with \(H_{\nu }\) for \(\nu \in (0, 1]\), as will be stated in Proposition 1. This proposition allows us to obtain the desired complexity bounds without specifying \(\nu \).

To evaluate the numerical performance of the proposed method, we conducted experiments with standard machine-learning tasks. The results illustrate that the proposed method outperforms state-of-the-art methods.

Notation. For vectors \(a, b \in \mathbb R^d\), let \(\langle a,b\rangle \) denote the dot product and \(\Vert a\Vert \) denote the Euclidean norm. For a matrix \(A \in \mathbb R^{m \times n}\), let \(\Vert A\Vert \) denote the operator norm, or equivalently the largest singular value.

2 Related work

This section reviews previous studies from several perspectives and discusses similarities and differences between them and this work.

Complexity of first-order methods. Gradient descent (GD) is known to have a complexity bound of \(O(\varepsilon ^{-2})\) under Lipschitz continuous gradients (e.g., [37, Example 1.2.3]). First-order methods [12, 22] for Hölder continuous gradients have recently been proposed to generalize the bound; they enjoy bounds of \(O(\varepsilon ^{-\frac{1+\mu }{\mu }})\), where \(\mu \in (0, 1]\) is the Hölder exponent of \(\nabla f\). First-order methods have also been studied under stronger assumptions. The methods of [2, 6, 28, 53] enjoy complexity bounds of \({\tilde{O}}(\varepsilon ^{-7/4})\) under Lipschitz continuous gradients and Hessians, and the bounds have recently been improved to \(O(\varepsilon ^{-7/4})\) [33,34,35]. This paper generalizes the classical bound of \(O(\varepsilon ^{-2})\) in a different direction from [12, 22] and interpolates the existing bounds of \(O(\varepsilon ^{-2})\) and \(O(\varepsilon ^{-7/4})\). Table 1 compares our complexity results with the existing ones.

Table 1 Complexity of first-order methods for nonconvex optimization. “Exponent in complexity” means \(\alpha \) in \(O(\varepsilon ^{- \alpha })\)

Complexity of second-order methods using Hölder continuous Hessians. The Hölder continuity of Hessians has been used to analyze second-order methods. Grapiglia and Nesterov [23] proposed a regularized Newton method that finds an \(\varepsilon \)-stationary point in \(O(\varepsilon ^{-\frac{2 + \nu }{1 + \nu }})\) evaluations of f, \(\nabla f\), and \(\nabla ^2 f\), where \(\nu \in [0, 1]\) is the Hölder exponent of \(\nabla ^2 f\). The complexity bound generalizes previous \(O(\varepsilon ^{-3/2})\) bounds under Lipschitz continuous Hessians [10, 11, 14, 40]. We make the same assumption of Hölder continuous Hessians as in [23] but do not compute Hessians in the algorithm. Table 2 summarizes the first-order and second-order methods together with their assumptions.

Table 2 Nonconvex optimization methods under smoothness assumptions

Universality for Hölder continuity. When Hölder continuity is assumed, it is preferable that algorithms not require the exponent \(\nu \) as input because a suitable value for \(\nu \) tends to be hard to find in real-world problems. Such \(\nu \)-independent algorithms, called universal methods, were first developed as first-order methods for convex optimization [30, 38] and have since been extended to other settings, including higher-order methods or nonconvex problems [12, 13, 20, 22,23,24,25]. Within this research stream, this paper proposes a universal method with a new setting: a first-order method under Hölder continuous Hessians. Because of the differences in settings, the existing techniques for universality cannot be applied directly; we obtain a universal method by setting \(\nu = 0\) for a \(\nu \)-dependent algorithm, as discussed in Sect. 1.

Heavy-ball methods. Heavy-ball (HB) methods are a kind of momentum method first proposed by Polyak [43] for convex optimization. Although some complexity results have been obtained for (strongly) convex settings [21, 32], they are weaker than the optimal bounds given by Nesterov’s accelerated gradient method [36, 39]. For nonconvex optimization, HB and its variants [15, 29, 46, 50] have been practically used with great success, especially in deep learning, while studies on theoretical convergence analysis are few [34, 41, 42]. O’Neill and Wright [42] analyzed the local behavior of the original HB method, showing that the method is unlikely to converge to strict saddle points. Ochs et al. [41] proposed a generalized HB method, iPiano, that enjoys a complexity bound of \(O(\varepsilon ^{-2})\) under Lipschitz continuous gradients, which is of the same order as that of GD. Li and Lin [34] proposed an HB method with a restart mechanism that achieves a complexity bound of \(O(\varepsilon ^{-7/4})\) under Lipschitz continuous gradients and Hessians. Our algorithm is another HB method with a different restart mechanism that enjoys more general complexity bounds than [34], as discussed in Sect. 1.

Comparison with [35]. This paper shares some mathematical tools with [35] because we utilize the Hessian-free analysis introduced in [35] to estimate the Hessian’s Hölder continuity. While the analysis in [35] is for Nesterov’s accelerated gradient method under Lipschitz continuous Hessians, we here analyze Polyak’s HB method under Hölder continuity. Thanks to the simplicity of the HB momentum, our estimate for the Hölder constant is easier to compute than the estimate for the Lipschitz constant proposed in [35], which improves the efficiency of our algorithm. We would like to emphasize that a \(\nu \)-independent algorithm cannot be derived simply by applying the mathematical tools in [35]. It should also be mentioned that we have not confirmed that it is impossible or very challenging to develop a \(\nu \)-independent algorithm with Nesterov’s momentum under Hölder continuous Hessians.

Lower bounds. So far, we have discussed upper bounds on worst-case complexity, but there are also papers that study lower bounds. Carmon et al. [8] proved that no deterministic or stochastic first-order method can improve the complexity of \(O(\varepsilon ^{-2})\) with the assumption of Lipschitz continuous gradients alone. (See [8, Theorems 1 and 2] for more rigorous statements.) This result implies that GD is optimal in terms of complexity under Lipschitz continuous gradients. Carmon et al. [9] showed a lower bound of \(\Omega (\varepsilon ^{-12/7})\) for first-order methods under Lipschitz continuous gradients and Hessians. Compared with the upper bound of \(O(\varepsilon ^{-7/4})\) under the same assumptions, there is still a \(\Theta (\varepsilon ^{-1/28})\) gap. Closing this gap would be an interesting research question, though this paper does not focus on it.
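For reference, the size of this gap is simply the difference between the two exponents:

$$\begin{aligned} \frac{7}{4} - \frac{12}{7} = \frac{49}{28} - \frac{48}{28} = \frac{1}{28}. \end{aligned}$$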

3 Preliminary results

The following lemma is standard for the analyses of first-order methods.

Lemma 1

(e.g., [37, Lemma 1.2.3]) Under Assumption 1, the following holds for any \(x, y \in \mathbb R^d\):

$$\begin{aligned} f(x) - f(y)&\le \langle \nabla f(y),x - y\rangle + \frac{L}{2} \Vert x - y\Vert ^2. \end{aligned}$$

This inequality helps estimate the Lipschitz constant \(L\) and evaluate the decrease in the objective function per iteration.

We also use the following inequalities derived from Hölder continuous Hessians.

Lemma 2

For any \(z_1, \dots , z_n \in \mathbb R^d\), let \({\bar{z}} {:}{=}\sum _{i=1}^n \lambda _i z_i\), where \(\lambda _1,\dots ,\lambda _n \ge 0\) and \(\sum _{i=1}^n \lambda _i = 1\). Then, the following holds for all \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \):

$$\begin{aligned} \left\| { \nabla f({\bar{z}}) - \sum _{i=1}^n \lambda _i \nabla f \left( { z_i }\right) }\right\|&\le \frac{H_{\nu }}{1 + \nu } \sum _{i=1}^n \lambda _i \Vert z_i - {\bar{z}}\Vert ^{1 + \nu } \\&\le \frac{H_{\nu }}{1 + \nu } \left( { \sum _{1 \le i < j \le n} \lambda _i \lambda _j \Vert z_i - z_j\Vert ^2 } \right) ^{\frac{1 + \nu }{2}}. \end{aligned}$$

Lemma 3

For all \(x, y \in \mathbb R^d\) and \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \), the following holds:

$$\begin{aligned} f(x) - f(y)&\le \frac{1}{2} \langle \nabla f(x) + \nabla f(y),x - y\rangle + \frac{2 H_{\nu }}{(1 + \nu ) (2 + \nu ) (3 + \nu )} \Vert x - y\Vert ^{2 + \nu }. \end{aligned}$$

The proofs are given in Sect. A.1. These lemmas generalize [35, Lemmas 3.1 and 3.2] for Lipschitz continuous Hessians (i.e., \(\nu = 1\)). It is important to note that the inequalities in Lemmas 2 and 3 are Hessian-free; they include the Hessian’s Hölder constant \(H_{\nu }\) but not a Hessian matrix itself. Accordingly, we can adaptively estimate the Hölder continuity of \(\nabla ^2 f\) in the algorithm without computing Hessians.
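As a numerical sanity check of the Hessian-free inequality in Lemma 3, the sketch below evaluates both sides for the one-dimensional test function \(f(x) = \sin x\). The test function, the grid, and the helper name `lemma3_gap` are illustrative assumptions, not taken from the paper. For this f, the second derivative \(-\sin x\) is 1-Lipschitz and bounded by 1, so \(H_{1} = 1\) and \(H_{0} = 2\) are valid Hölder constants.

```python
import math


def lemma3_gap(x, y, nu=1.0, H_nu=1.0):
    """Right-hand side minus left-hand side of Lemma 3 for f = sin.

    A nonnegative value means the inequality holds at (x, y).
    """
    lhs = math.sin(x) - math.sin(y)
    coeff = 2.0 * H_nu / ((1 + nu) * (2 + nu) * (3 + nu))
    rhs = 0.5 * (math.cos(x) + math.cos(y)) * (x - y) \
        + coeff * abs(x - y) ** (2 + nu)
    return rhs - lhs


# Hypothetical evaluation grid; any set of points would do.
grid = [0.37 * i for i in range(-20, 21)]
worst_gap_nu1 = min(lemma3_gap(x, y, nu=1.0, H_nu=1.0) for x in grid for y in grid)
worst_gap_nu0 = min(lemma3_gap(x, y, nu=0.0, H_nu=2.0) for x in grid for y in grid)
print(worst_gap_nu1, worst_gap_nu0)
```

Only values of f and its derivative are evaluated, which is the sense in which the inequality is Hessian-free.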

4 Algorithm

The proposed method, Algorithm 1, is a heavy-ball (HB) method equipped with two particular restart schemes. In the algorithm, the iteration counter k is reset to 0 when HB restarts on Line 8 or 10, whereas the total iteration counter K is not. We refer to the period between one reset of k and the next reset as an epoch. Note that it is unnecessary to implement K in the algorithm; it is included here only to make the statements in our analysis concise.

The algorithm uses estimates \(\ell \) and \(h_k\) for the Lipschitz constant \(L\) and the Hölder constant \(H_{0}\). The estimate \(\ell \) is fixed during an epoch, while \(h_k\) is updated at each iteration, having the subscript k.

4.1 Update of solutions

With an estimate \(\ell \) for the Lipschitz constant \(L\), Algorithm 1 defines a solution sequence \((x_k)\) as follows: \(v_0 = \textbf{0}\) and

$$\begin{aligned} v_k&= \theta _{k-1} v_{k-1} - \frac{1}{\ell } \nabla f(x_{k-1}), \end{aligned} \tag{3}$$
$$\begin{aligned} x_k&= x_{k-1} + v_k \end{aligned} \tag{4}$$

for \(k \ge 1\). Here, \((v_k)\) is the velocity sequence, and \(0 \le \theta _k \le 1\) is the momentum parameter. Let \(x_{-1} {:}{=}x_0\) for convenience, which makes (4) valid for \(k = 0\). This type of optimization method is called a heavy-ball method or Polyak’s momentum method.

In this paper, we use the simplest parameter setting:

$$\begin{aligned} \theta _k = 1 \end{aligned}$$

for all \(k \ge 1\). Our choice of \(\theta _k\) differs from the existing ones; the existing complexity analyses [16, 17, 21, 32, 34, 43] of HB prohibit \(\theta _k = 1\). For example, Li and Lin [34] proposed \(\theta _k = 1 - 5 (H_{1} \varepsilon )^{1/4} / \sqrt{L}\). Our new proof technique described later in Sect. 5.1 enables us to set \(\theta _k = 1\).
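The update rules (3) and (4) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: vectors are represented as lists of floats, `grad` is a user-supplied gradient oracle, and the function name `heavy_ball_step` is hypothetical.

```python
def heavy_ball_step(x_prev, v_prev, grad, ell, theta=1.0):
    """One heavy-ball update: v_k = theta * v_{k-1} - grad(x_{k-1}) / ell,
    followed by x_k = x_{k-1} + v_k. The default theta = 1 matches (5)."""
    g = grad(x_prev)
    v = [theta * vi - gi / ell for vi, gi in zip(v_prev, g)]
    x = [xi + vi for xi, vi in zip(x_prev, v)]
    return x, v


# Hypothetical test problem: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x1, v1 = heavy_ball_step([1.0, -2.0], [0.0, 0.0], lambda x: list(x), ell=4.0)
print(x1, v1)
```

Because \(v_0 = \textbf{0}\), the first step of every epoch reduces to a plain gradient step with step size \(1/\ell \).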

We will later use the averaged solution

$$\begin{aligned} {\bar{x}}_k {:}{=}\frac{1}{k} \sum _{i=0}^{k-1} x_i \end{aligned}$$

to compute the estimate \(h_k\) for \(H_{0}\) and set the best solution \(x^\star _k\). The averaged solution can be computed efficiently with a simple recursion: \({\bar{x}}_{k+1} = \frac{k}{k+1} {\bar{x}}_k + \frac{1}{k+1} x_k\).
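The recursion for the averaged solution can be sketched as below; the helper name `update_average` and the list-based vector representation are assumptions made for illustration. The running average needs only the previous average and the newest iterate, so no history of iterates has to be stored.

```python
def update_average(xbar_k, x_k, k):
    """Return xbar_{k+1} = k/(k+1) * xbar_k + 1/(k+1) * x_k.

    xbar_k is the mean of x_0, ..., x_{k-1}; for k = 0 its value is
    irrelevant because it is multiplied by zero.
    """
    return [(k * a + b) / (k + 1) for a, b in zip(xbar_k, x_k)]


# Hypothetical one-dimensional iterates x_0, x_1, x_2.
iterates = [[1.0], [3.0], [8.0]]
xbar = [0.0]
for k, x_k in enumerate(iterates):
    xbar = update_average(xbar, x_k, k)
print(xbar)  # mean of the three iterates
```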

Algorithm 1 Proposed heavy-ball method

4.2 Estimation of Hölder continuity


Let

$$\begin{aligned} S_k {:}{=}\sum _{i=1}^k \Vert v_i\Vert ^2 \end{aligned}$$

to simplify the notation. Our analysis uses the following inequalities due to Lemmas 2 and 3.

Lemma 4

For all \(k \ge 1\) and \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \), the following hold:

$$\begin{aligned} f(x_k) - f(x_{k-1})&\le \frac{1}{2} \langle \nabla f(x_{k-1}) + \nabla f(x_k),v_k\rangle + \frac{2 H_{\nu }}{(1 + \nu ) (2 + \nu ) (3 + \nu )} \Vert v_k\Vert ^{2 + \nu }, \end{aligned} \tag{6}$$
$$\begin{aligned} \Vert \nabla f ({\bar{x}}_k) \Vert&\le \frac{\ell }{k} \Vert v_k\Vert + \frac{H_{\nu }}{1 + \nu } \left( { \frac{k S_k}{8} }\right) ^{\frac{1 + \nu }{2}}. \end{aligned} \tag{7}$$


Proof

Eq. (6) follows immediately from Lemma 3. The proof of (7) is given in Sect. A.2. \(\square \)

Algorithm 1 requires no information on the Hölder continuity of \(\nabla ^2 f\), automatically estimating it. To illustrate the trick, let us first consider a prototype algorithm that works when a value of \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \) is given. Given such a \(\nu \), one can compute an estimate h of \(H_{\nu }\) such that

$$\begin{aligned} f(x_k) - f(x_{k-1})&\le \frac{1}{2} \langle \nabla f(x_{k-1}) + \nabla f(x_k),v_k\rangle + \frac{2 h}{(1 + \nu ) (2 + \nu ) (3 + \nu )} \Vert v_k\Vert ^{2 + \nu }, \end{aligned} \tag{8}$$
$$\begin{aligned} \Vert \nabla f ({\bar{x}}_k) \Vert&\le \frac{\ell }{k} \Vert v_k\Vert + \frac{h}{1 + \nu } \left( { \frac{k S_k}{8} }\right) ^{\frac{1 + \nu }{2}}, \end{aligned} \tag{9}$$

which come from Lemma 4. This estimation scheme yields a \(\nu \)-dependent algorithm that has the complexity bound (2) for the given \(\nu \), though we will omit the details. The algorithm is not so practical because it requires \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \) as input. However, perhaps surprisingly, setting \(\nu = 0\) for the \(\nu \)-dependent algorithm gives a \(\nu \)-independent algorithm that achieves the bound (2) for all \(\nu \in [0, 1]\). Algorithm 1 is the \(\nu \)-independent algorithm obtained in that way.

Let \(h_0 {:}{=}0\) for convenience. At iteration \(k \ge 1\) of each epoch, we use the estimate \(h_k\) for \(H_{0}\) defined by

$$\begin{aligned} h_k = \max \bigg \{ h_{k-1},\ {}&\frac{3}{\Vert v_k\Vert ^2} \left( { f(x_k) - f(x_{k-1}) - \frac{1}{2} \langle \nabla f(x_{k-1}) + \nabla f(x_k),v_k\rangle } \right) , \\&\sqrt{\frac{8}{k S_k}} \left( { \Vert \nabla f ({\bar{x}}_k) \Vert - \frac{\ell }{k} \Vert v_k\Vert }\right) \bigg \} \end{aligned} \tag{10}$$

so that \(h_k \ge h_{k-1}\) and

$$\begin{aligned} f(x_k) - f(x_{k-1})&\le \frac{1}{2} \langle \nabla f(x_{k-1}) + \nabla f(x_k),v_k\rangle + \frac{h_k}{3} \Vert v_k\Vert ^2, \end{aligned} \tag{11}$$
$$\begin{aligned} \Vert \nabla f ({\bar{x}}_k) \Vert&\le \frac{\ell }{k} \Vert v_k\Vert + h_k \sqrt{\frac{k S_k}{8}}. \end{aligned} \tag{12}$$

The above inequalities were obtained by plugging \(\nu = 0\) into (8) and (9).
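The estimate \(h_k\) in (10) uses only function values and gradients, which the following sketch makes concrete. It is an illustration under assumptions, not the authors' code: vectors are lists of floats, all argument names are hypothetical, and \(v_k \ne \textbf{0}\) (hence \(S_k > 0\)) is assumed so that the divisions are well defined.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def update_h(h_prev, f_prev, f_curr, g_prev, g_curr, g_avg, v, ell, k, S_k):
    """Estimate h_k as in (10).

    f_prev, f_curr: f(x_{k-1}) and f(x_k)
    g_prev, g_curr, g_avg: gradients at x_{k-1}, x_k and the average xbar_k
    v: velocity v_k (assumed nonzero); S_k: sum of ||v_i||^2 for i <= k
    """
    v_norm = math.sqrt(dot(v, v))
    g_sum = [a + b for a, b in zip(g_prev, g_curr)]
    # Candidate from the function-value inequality (11).
    cand_func = 3.0 / v_norm ** 2 * (f_curr - f_prev - 0.5 * dot(g_sum, v))
    # Candidate from the gradient-norm inequality (12).
    cand_grad = math.sqrt(8.0 / (k * S_k)) * (
        math.sqrt(dot(g_avg, g_avg)) - ell / k * v_norm
    )
    return max(h_prev, cand_func, cand_grad)


# Hypothetical quadratic f(x) = 0.5 * x^2 with x_0 = 1 and ell = 4, so that
# v_1 = -0.25, x_1 = 0.75 and xbar_1 = x_0. The Hessian is constant here.
h1 = update_h(0.0, 0.5, 0.28125, [1.0], [0.75], [1.0], [-0.25], 4.0, 1, 0.0625)
print(h1)
```

For a quadratic objective the Hessian does not vary, so both candidates vanish and the estimate stays at its initial value, consistent with \(H_{0} = 0\) in that case.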

Although we designed \(h_k\) to estimate \(H_{0}\), it fortunately also relates to \(H_{\nu }\) for general \(\nu \in [0, 1]\). The following upper bound on \(h_k\) shows the relationship between \(h_k\) and \(H_{\nu }\), which will be used in the complexity analysis.

Proposition 1

For all \(k \ge 1\) and \(\nu \in [0, 1]\) such that \(H_{\nu } < + \infty \), the following holds:

$$\begin{aligned} h_k \le H_{\nu } (k S_k)^{\frac{\nu }{2}}. \end{aligned}$$


Proof

Lemma 4 gives

$$\begin{aligned} \frac{3}{\Vert v_k\Vert ^2} \left( { f(x_k) - f(x_{k-1}) - \frac{1}{2} \langle \nabla f(x_{k-1}) + \nabla f(x_k),v_k\rangle } \right)&\le \frac{6 H_{\nu }}{(1 + \nu ) (2 + \nu ) (3 + \nu )} \Vert v_k\Vert ^{\nu },\\ \sqrt{\frac{8}{k S_k}} \left( { \Vert \nabla f ({\bar{x}}_k) \Vert - \frac{\ell }{k} \Vert v_k\Vert }\right)&\le \frac{H_{\nu }}{1 + \nu } \left( { \frac{k S_k}{8} }\right) ^{\frac{\nu }{2}}\\&\le \frac{H_{\nu }}{1 + \nu } (k S_k)^{\frac{\nu }{2}}. \end{aligned}$$

Hence, definition (10) of \(h_k\) yields

$$\begin{aligned}&h_k \le \max \left\{ h_{k-1},\, \frac{6 H_{\nu }}{(1 + \nu ) (2 + \nu ) (3 + \nu )} \Vert v_k\Vert ^{\nu } ,\, \frac{H_{\nu }}{1 + \nu } (k S_k)^{\frac{\nu }{2}} \right\} \\&\quad \le \max \{ h_{k-1},\, H_{\nu } \Vert v_k\Vert ^{\nu } ,\, H_{\nu } (k S_k)^{\frac{\nu }{2}} \}&\quad (\text {by } {\nu \ge 0})\\&\quad = \max \{ h_{k-1},\, H_{\nu } \left( {k S_k}\right) ^{\frac{\nu }{2}} \}&\quad (\text {by } {\Vert v_k\Vert \le \sqrt{S_k} \le \sqrt{k S_k}}). \end{aligned}$$

The desired result follows inductively since \(H_{\nu } (k S_k)^{\frac{\nu }{2}}\) is nondecreasing in k. \(\square \)

For \(\nu = 0\), Proposition 1 gives a natural upper bound, \(h_k \le H_{0}\), since the estimate \(h_k\) is designed for \(H_{0}\) based on Lemma 4. For \(\nu \in (0, 1]\), the upper bound can become tighter when \(k S_k\) is small. Indeed, the iterates \((x_k)\) are expected to move less significantly in an epoch as the algorithm proceeds. Accordingly, \((S_k)\) increases more slowly in later epochs, yielding a tighter upper bound on \(h_k\). This trick improves the complexity bound from \(O(\varepsilon ^{-2})\) for \(\nu = 0\) to \(O(\varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }})\) for general \(\nu \in [0, 1]\).

4.3 Restart mechanisms

Algorithm 1 is equipped with two restart mechanisms. The first one uses the standard descent condition

$$\begin{aligned} f(x_k) - f(x_{k-1}) \le \langle \nabla f(x_{k-1}),v_k\rangle + \frac{\ell }{2} \Vert v_k\Vert ^2 \end{aligned} \tag{13}$$

to check whether the current estimate \(\ell \) for the Lipschitz constant \(L\) is large enough. If the descent condition (13) does not hold, HB restarts with a larger \(\ell \) from the best solution \(x^\star _k {:}{=}\mathop {\textrm{argmin}}\limits _{x \in \{ x_0,\dots ,x_k,{\bar{x}}_1,\dots ,{\bar{x}}_k \}} f(x)\) during the epoch.

We consider not only \(x_0,\dots ,x_k\) but also the averaged solutions \({\bar{x}}_1,\dots ,{\bar{x}}_k\) as candidates for the next starting point because averaging may stabilize the behavior of the HB method. As we will show later in Lemma 6, the gradient norm of averaged solutions is small, which leads to stability. For strongly-convex quadratic problems, Danilova and Malinovsky [16] also show that averaged HB methods have a smaller maximal deviation from the optimal solution than the vanilla HB method. A similar effect for nonconvex problems is expected in the neighborhood of local optima where quadratic approximation is justified.

The second restart scheme resets the momentum effect when k becomes large; if

$$\begin{aligned} k (k+1) h_k > \frac{3}{8} \ell \end{aligned} \tag{14}$$

is satisfied, HB restarts from the best solution \(x^\star _k\). At the restart, we can reset \(\ell \) to a smaller value in the hope of improving practical performance, though decreasing \(\ell \) is not necessary for the complexity analysis. This restart scheme guarantees that

$$\begin{aligned} k (k-1) h_{k-1} \le \frac{3}{8} \ell \end{aligned} \tag{15}$$

holds at iteration k of each epoch.
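The two restart tests can be written as simple predicates. The sketch below is illustrative only: the function names are hypothetical, vectors are lists of floats, and the bookkeeping that Algorithm 1 performs after a restart (choosing \(x^\star _k\), rescaling \(\ell \), resetting k) is deliberately left out.

```python
def descent_condition_holds(f_prev, f_curr, g_prev, v, ell):
    """Condition (13): f(x_k) - f(x_{k-1}) <= <grad f(x_{k-1}), v_k> + ell/2 * ||v_k||^2."""
    inner = sum(gi * vi for gi, vi in zip(g_prev, v))
    v_sq = sum(vi * vi for vi in v)
    return f_curr - f_prev <= inner + 0.5 * ell * v_sq


def momentum_restart_triggered(k, h_k, ell):
    """Condition (14): k (k + 1) h_k > (3/8) * ell."""
    return k * (k + 1) * h_k > 0.375 * ell


# Hypothetical quadratic f(x) = 0.5 * x^2 (so L = 1), starting from x_0 = 1.
# With ell = 4 >= L the first step x_1 = 0.75 passes the descent test;
# with ell = 0.5 < L the step x_1 = -1 fails it.
print(descent_condition_holds(0.5, 0.28125, [1.0], [-0.25], 4.0))
print(descent_condition_holds(0.5, 0.5, [1.0], [-2.0], 0.5))
```

If (14) is not triggered at iteration \(k - 1\), then \((k-1) k h_{k-1} \le \frac{3}{8} \ell \), which is exactly (15) at iteration k.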

The Lipschitz estimate \(\ell \) increases only when the descent condition (13) is violated. On the other hand, Lemma 1 implies that condition (13) always holds as long as \(\ell \ge L\). Hence, we have the following upper bound on \(\ell \).

Proposition 2

Suppose that Assumption 1 holds. Then, the following is true throughout Algorithm 1: \(\ell \le \max \{ \ell _{\textrm{init}}, \alpha L \}\).

5 Complexity analysis

This section proves that Algorithm 1 enjoys the complexity bound (2) for all \(\nu \in [0, 1]\).

5.1 Objective decrease for one epoch

First, we evaluate the decrease in the objective function value during one epoch.

Lemma 5

Suppose that Assumption 1 holds and that the descent condition

$$\begin{aligned} f(x_i) - f(x_{i-1}) \le \langle \nabla f(x_{i-1}),v_i\rangle + \frac{\ell }{2} \Vert v_i\Vert ^2 \end{aligned} \tag{16}$$

holds for all \(1 \le i \le k\). Then, the following holds under condition (15):

$$\begin{aligned} \min _{1 \le i \le k} f(x_i) \le f(x_0) - \frac{\ell S_k}{4k}. \end{aligned} \tag{17}$$

Before providing the proof, let us remark on the lemma.

Evaluating the decrease in the objective function is the central part of a complexity analysis. It is also an intricate part because the function value does not necessarily decrease monotonically in nonconvex acceleration methods. To overcome the non-monotonicity, previous analyses have employed different proof techniques. For example, Li and Lin [33] constructed a quadratic approximation of the objective, diagonalized the Hessian, and evaluated the objective decrease separately for each coordinate; Marumo and Takeda [35] designed a tricky potential function and showed that it is nearly decreasing.

This paper uses another technique to deal with the non-monotonicity. We observe that the solution \(x_k\) does not need to attain a small function value; it is sufficient for at least one of \(x_1,\dots ,x_k\) to do so, thanks to our particular restart mechanism. This observation permits the left-hand side of (17) to be \(\min _{1 \le i \le k} f(x_i)\) rather than \(f(x_k)\) and makes the proof easier. The proof of Lemma 5 calculates a weighted sum of \(2k-1\) inequalities derived from Lemmas 1 and 3, which is elementary compared with the existing proofs. Now, we provide that proof.

Proof of Lemma 5

Combining (16) with the update rules (3) and (4) yields

$$\begin{aligned} f(x_i) - f(x_{i-1})&\le \langle \nabla f(x_{i-1}),v_i\rangle + \frac{\ell }{2} \Vert v_i\Vert ^2 = \ell \langle v_{i-1},v_i\rangle - \frac{\ell }{2} \Vert v_i\Vert ^2 \end{aligned} \tag{18}$$

for \(1 \le i \le k\). For \(1 \le i < k\), we also have

$$\begin{aligned} f(x_i) - f(x_{i-1})&\le \frac{1}{2} \langle \nabla f(x_{i-1}) + \nabla f(x_i),v_i\rangle + \frac{h_{k-1}}{3} \Vert v_i\Vert ^2&\quad&({\text {by}\,(11)\,\text {and}\,h_i \le h_{k-1}}) \\&= \frac{\ell }{2} \langle v_{i-1},v_i\rangle - \frac{\ell }{2} \langle v_i,v_{i+1}\rangle + \frac{h_{k-1}}{3} \Vert v_i\Vert ^2&\quad&(\text {by}\,{(3)}). \end{aligned} \tag{19}$$

We will calculate a weighted sum of \(2k-1\) inequalities:

  • (18) with weight 1 for \(1 \le i \le k\),

  • (19) with weight \(2(k-i)\) for \(1 \le i < k\).

The left-hand side of the weighted sum is

$$\begin{aligned}&\sum _{i=1}^k \left( { f(x_i) - f(x_{i-1}) }\right) + \sum _{i=1}^{k-1} 2(k-i) \left( { f(x_i) - f(x_{i-1}) }\right) \\&\quad = - (2k-1) f(x_0) + \sum _{i=1}^{k-1} 2 f(x_i) + f(x_k) \ge (2k-1) \left( { \min _{1 \le i \le k} f(x_i) - f(x_0) }\right) . \end{aligned}$$

On the right-hand side of the weighted sum, some calculations with \(v_0 = \textbf{0}\) show that the inner-product terms of \(\langle v_{i-1},v_i\rangle \) cancel out as follows:

$$\begin{aligned}&\ell \sum _{i=1}^k \langle v_{i-1},v_i\rangle + \ell \sum _{i=1}^{k-1} (k-i) \left( { \langle v_{i-1},v_i\rangle - \langle v_i,v_{i+1}\rangle }\right) \\&\quad = \ell \sum _{{i=2}}^k \langle v_{i-1},v_i\rangle + \ell \sum _{{i=2}}^{k-1} (k-i) \langle v_{i-1},v_i\rangle - \ell \sum _{i=2}^k (k-i+1) \langle v_{i-1},v_i\rangle = 0. \end{aligned}$$

The remaining terms on the right-hand side of the weighted sum are

$$\begin{aligned}&{- \frac{\ell }{2}} \sum _{i=1}^k \Vert v_i\Vert ^2 + \frac{h_{k-1}}{3} \sum _{i=1}^{k-1} 2 (k-i) \Vert v_i\Vert ^2\\&\quad \le - \frac{\ell }{2} \sum _{i=1}^k \Vert v_i\Vert ^2 + \frac{h_{k-1}}{3} \sum _{i=1}^k 2 (k-1) \Vert v_i\Vert ^2 = - \left( { \frac{\ell }{2} - \frac{2}{3} (k-1) h_{k-1} }\right) S_k. \end{aligned}$$

We now obtain

$$\begin{aligned} \min _{1 \le i \le k} f(x_i) - f(x_0) \le - \left( { \frac{\ell }{2} - \frac{2}{3} (k-1) h_{k-1} }\right) \frac{S_k}{2k-1}. \end{aligned}$$

Finally, we evaluate the coefficient on the right-hand side with (15) as

$$\begin{aligned} \frac{\ell }{2} - \frac{2}{3} (k-1) h_{k-1} \ge \frac{\ell }{2} - \frac{\ell }{4 k} = \ell \frac{2k-1}{4k}, \end{aligned} \tag{20}$$

which completes the proof. \(\square \)

The proof elucidates that the second restart condition (14) was designed to derive the lower bound of \(\ell \frac{2k-1}{4k}\) in (20).
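Spelled out, the step from (15) to (20) is a one-line computation: dividing (15) by \(k \ge 1\) gives \((k-1) h_{k-1} \le \frac{3 \ell }{8 k}\), and therefore

$$\begin{aligned} \frac{2}{3} (k-1) h_{k-1} \le \frac{2}{3} \cdot \frac{3 \ell }{8 k} = \frac{\ell }{4 k}, \qquad \frac{\ell }{2} - \frac{\ell }{4 k} = \ell \frac{2k-1}{4k}. \end{aligned}$$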

For an epoch that ends at Line 10 in iteration \(k \ge 1\), Lemma 5 gives

$$\begin{aligned} f(x^\star _k) \le \min _{1 \le i \le k} f(x_i) \le f(x_0) - \frac{\ell S_k}{4k}. \end{aligned} \tag{21}$$

For an epoch that ends at Line 8 in iteration \(k \ge 2\), the lemma gives

$$\begin{aligned} f(x^\star _k) \le f(x^\star _{k-1}) \le \min _{1 \le i \le k-1} f(x_i) \le f(x_0) - \frac{\ell S_{k-1}}{4 (k-1)} \le f(x_0) - \frac{\ell S_{k-1}}{4k}. \end{aligned} \tag{22}$$

These bounds will be used to derive the complexity bound.

5.2 Upper bound on gradient norm

Next, we prove the following upper bound on the gradient norm at the averaged solution.

Lemma 6

In Algorithm 1, the following holds at iteration \(k \ge 2\):

$$\begin{aligned} \min _{1 \le i < k} \Vert \nabla f ({\bar{x}}_i) \Vert \le \ell \sqrt{\frac{8 S_{k-1}}{k^3}}. \end{aligned}$$


Proof

For \(k = 2\), the result follows from \(\Vert \nabla f ({\bar{x}}_1) \Vert = \Vert \nabla f (x_0) \Vert = \ell \Vert v_1\Vert = \ell \sqrt{8 S_1 / 2^3}\), where the last equality uses \(S_1 = \Vert v_1\Vert ^2\). Below, we assume that \(k \ge 3\). Let \(A_k {:}{=}\sum _{i=1}^{k-1} i^2\); we have

$$\begin{aligned} A_k = \frac{k (k-1) (2k-1)}{6} \ge \frac{k^3}{6} \end{aligned} \tag{23}$$

for \(k \ge 3\). Summing (12) over iterations \(i = 1, \dots , k-1\) with weights \(i^2\) yields

$$\begin{aligned} A_k \min _{1 \le i < k} \Vert \nabla f ({\bar{x}}_i) \Vert&\le \sum _{i=1}^{k-1} i^2 \Vert \nabla f ({\bar{x}}_i) \Vert \le \ell \sum _{i=1}^{k-1} i \Vert v_i\Vert + h_{k-1} \sqrt{S_{k-1}} \sum _{i=1}^{k-1} i^2 \sqrt{\frac{i}{8}} \end{aligned}$$

since \(h_k\) and \(S_k\) are nondecreasing in k. Each term can be bounded by the Cauchy–Schwarz inequality as

$$\begin{aligned} \sum _{i=1}^{k-1} i \Vert v_i\Vert&\le \sqrt{A_k S_{k-1}},\quad \sum _{i=1}^{k-1} i^2 \sqrt{\frac{i}{8}} = \sum _{i=1}^{k-1} i \sqrt{\frac{i^3}{8}} \le \sqrt{A_k} \left( { \sum _{i=1}^{k-1} \frac{i^3}{8} }\right) ^{1/2} = \sqrt{\frac{A_k}{32}} k (k-1), \end{aligned}$$

and thus

$$\begin{aligned} \min _{1 \le i < k} \Vert \nabla f ({\bar{x}}_i) \Vert \le \ell \sqrt{\frac{S_{k-1}}{A_k}} + \sqrt{\frac{S_{k-1}}{32 A_k}} k (k-1) h_{k-1} \le \ell \sqrt{\frac{S_{k-1}}{A_k}} \left( { 1 + \frac{3}{8 \sqrt{32}} }\right) , \end{aligned}$$

where the last inequality uses (15). Using (23) and \(1 + \frac{3}{8 \sqrt{32}} < \frac{2}{\sqrt{3}}\) concludes the proof. \(\square \)
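The numerical facts used in this proof are easy to verify mechanically. The following sketch checks the closed form and lower bound for \(A_k\) over a finite range and the constant comparison \(1 + \frac{3}{8 \sqrt{32}} < \frac{2}{\sqrt{3}}\); the function names and the range limit are illustrative choices, and a finite check is of course not a substitute for the proof.

```python
import math


def A(k):
    """A_k = sum of i^2 for i = 1, ..., k-1, as defined in the proof of Lemma 6."""
    return sum(i * i for i in range(1, k))


def lemma6_constants_ok(k_max=200):
    # Closed form 6 * A_k = k (k - 1) (2k - 1), checked in integer arithmetic.
    closed_form = all(6 * A(k) == k * (k - 1) * (2 * k - 1) for k in range(1, k_max))
    # Lower bound A_k >= k^3 / 6, claimed for k >= 3.
    lower_bound = all(6 * A(k) >= k ** 3 for k in range(3, k_max))
    # Constant comparison used in the last step of the proof.
    constant = 1 + 3 / (8 * math.sqrt(32)) < 2 / math.sqrt(3)
    return closed_form and lower_bound and constant


print(lemma6_constants_ok())
```

Combining the two facts gives \(\frac{2}{\sqrt{3}} \sqrt{S_{k-1} / A_k} \le \frac{2}{\sqrt{3}} \sqrt{6 S_{k-1} / k^3} = \sqrt{8 S_{k-1} / k^3}\), which is the bound stated in Lemma 6.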

5.3 Complexity bound

Let \(\bar{\ell }\) denote the upper bound on the Lipschitz estimate \(\ell \) given in Proposition 2: \(\bar{\ell }{:}{=}\max \{ \ell _{\textrm{init}}, \alpha L \}\). The following theorem shows iteration complexity bounds for Algorithm 1. Recall that \(\alpha > 1\) and \(0 < \beta \le 1\) are the input parameters of Algorithm 1.

Theorem 1

Suppose that Assumption 1 holds and \(\inf _{x \in \mathbb R^d} f(x) > - \infty \). Let

$$\begin{aligned} \Delta {:}{=}f(x_\textrm{init}) - \inf _{x \in \mathbb R^d} f(x),\quad { c_1 {:}{=}\log _\alpha \left( {\frac{1}{\beta }}\right) , \quad \text {and}\quad c_2 {:}{=}1 + \log _\alpha \left( {\frac{\bar{\ell }}{\ell _{\textrm{init}}}}\right) . } \end{aligned}$$

In Algorithm 1, when \(\Vert \nabla f({\bar{x}}_k)\Vert \le \varepsilon \) holds for the first time, the total iteration count K is at most

$$\begin{aligned} \inf _{\nu \in [0, 1]} \Bigg \{ 91 (1 + \sqrt{{c_1}}) \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 +3 \nu }{2 + 2 \nu }} + 256 {c_1} \Delta H_{\nu }^{\frac{1}{1 +\nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} \Bigg \} { + 6 \sqrt{c_2 \Delta \bar{\ell }} \varepsilon ^{-1} + c_2 }. \end{aligned}$$

In particular, if we set \(\beta = 1\), then \({c_1} = 0\) and the upper bound simplifies to

$$\begin{aligned} \inf _{\nu \in [0, 1]} \Bigg \{ 91 \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }} \Bigg \} { + 6 \sqrt{c_2 \Delta \bar{\ell }} \varepsilon ^{-1} + c_2 }. \end{aligned}$$


Proof

We classify the epochs into three types:

  • successful epoch: an epoch that does not find an \(\varepsilon \)-stationary point and ends at Line 10 with the descent condition (13) satisfied,

  • unsuccessful epoch: an epoch that does not find an \(\varepsilon \)-stationary point and ends at Line 8 with the descent condition (13) unsatisfied,

  • last epoch: the epoch that finds an \(\varepsilon \)-stationary point.

Let \(N_{\textrm{suc}}\) and \(N_{\textrm{unsuc}}\) be the numbers of successful and unsuccessful epochs, respectively, and let \(K_{\textrm{suc}}\) be the total number of iterations over all successful epochs. Below, we fix \(\nu \in [0, 1]\) arbitrarily such that \(H_{\nu } < + \infty \). (Note that there exists such a \(\nu \) since \(H_{0} \le 2 L< + \infty \).)

Successful epochs. Consider a successful epoch and let k denote its total number of iterations, i.e., the epoch ends at iteration k.

We then have

$$\begin{aligned} S_k \ge \frac{\varepsilon ^2 k^3}{8 \ell ^2} \end{aligned}$$

as follows: if \(k = 1\), we have \(S_k = \Vert v_1\Vert ^2 = \frac{1}{\ell ^2} \Vert \nabla f(x_0)\Vert ^2 > \frac{\varepsilon ^2}{\ell ^2} \ge \frac{\varepsilon ^2 k^3}{8 \ell ^2}\); if \(k \ge 2\), Lemma 6 gives \(\varepsilon < \ell \sqrt{8 S_{k-1} / k^3} \le \ell \sqrt{8 S_k / k^3}\).

On the other hand, putting the restart condition (14) together with Proposition 1 yields

$$\begin{aligned} \frac{1}{4} \ell< \frac{3}{8} \ell < k (k+1) h_k \le 2 k^2 h_k \le 2 k^2 H_{\nu } (k S_k)^{\frac{\nu }{2}} \end{aligned}$$

and hence

$$\begin{aligned} S_k \ge \frac{1}{k} \left( { \frac{\ell }{8 k^2 H_{\nu }} }\right) ^{2 / \nu }. \end{aligned}$$

Combining (27) and (26) leads to

$$\begin{aligned} S_k&= S_k^{\frac{2 + \nu }{2 + 2 \nu }} S_k^{\frac{\nu }{2 + 2 \nu }} \ge \left( {\frac{\varepsilon ^2 k^3}{8 \ell ^2}}\right) ^{\frac{2 + \nu }{2 + 2 \nu }} \left( { \frac{1}{k} \left( { \frac{\ell }{8 k^2 H_{\nu }}}\right) ^{2 / \nu } } \right) ^{\frac{\nu }{2 + 2 \nu }} = 2^{- \frac{12 + 3 \nu }{2 + 2 \nu }} H_{\nu }^{- \frac{1}{1 + \nu }} \varepsilon ^{\frac{2 + \nu }{1 + \nu }} \frac{k}{\ell },\\ S_k&= S_k^{\frac{4 + 3 \nu }{4 + 4 \nu }} S_k^{\frac{\nu }{4 + 4 \nu }} \ge \left( {\frac{\varepsilon ^2 k^3}{8 \ell ^2}}\right) ^{\frac{4 + 3 \nu }{4 + 4 \nu }} \left( { \frac{1}{k} \left( { \frac{\ell }{8 k^2 H_{\nu }}}\right) ^{2 / \nu } } \right) ^{\frac{\nu }{4 + 4 \nu }} = 2^{- \frac{18 + 9 \nu }{4 + 4 \nu }} H_{\nu }^{- \frac{1}{2 + 2 \nu }} \varepsilon ^{\frac{4 + 3 \nu }{2 + 2 \nu }} \frac{k^2}{\ell ^{3/2}}. \end{aligned}$$

Plugging them into (21) yields

$$\begin{aligned} f(x_0) - f(x^\star _k) \ge \frac{\ell S_k}{4 k}&\ge 2^{- \frac{16 + 7 \nu }{2 + 2 \nu }} H_{\nu }^{- \frac{1}{1 + \nu }} \varepsilon ^{\frac{2 + \nu }{1 + \nu }} \ge 2^{-8} H_{\nu }^{- \frac{1}{1 + \nu }} \varepsilon ^{\frac{2 + \nu }{1 + \nu }},\\ f(x_0) - f(x^\star _k) \ge \frac{\ell S_k}{4 k}&\ge 2^{- \frac{26 + 17 \nu }{4 + 4 \nu }} H_{\nu }^{- \frac{1}{2 + 2 \nu }} \varepsilon ^{\frac{4 + 3 \nu }{2 + 2 \nu }} \frac{k}{\sqrt{\ell }} \ge 2^{-\frac{13}{2}} H_{\nu }^{- \frac{1}{2 + 2 \nu }} \varepsilon ^{\frac{4 + 3 \nu }{2 + 2 \nu }} \frac{k}{\sqrt{\bar{\ell }}} \end{aligned}$$

since \(\nu \ge 0\). Summing these bounds over all successful epochs results in

$$\begin{aligned} \Delta \ge 2^{-8} H_{\nu }^{- \frac{1}{1 + \nu }} \varepsilon ^{\frac{2 + \nu }{1 + \nu }} N_{\textrm{suc}},\quad \Delta \ge 2^{-\frac{13}{2}} H_{\nu }^{- \frac{1}{2 + 2 \nu }} \varepsilon ^{\frac{4 + 3 \nu }{2 + 2 \nu }} \frac{K_{\textrm{suc}}}{\sqrt{\bar{\ell }}}, \end{aligned}$$

and hence

$$\begin{aligned} N_{\textrm{suc}} \le 2^8 \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }},\quad K_{\textrm{suc}} \le 2^{\frac{13}{2}} \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }}. \end{aligned}$$
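The exponent bookkeeping behind the two lower bounds on \(S_k\) obtained by combining (26) and (27) can be verified mechanically. The following sketch uses exact rational arithmetic and checks the exponents of \(k\), \(\ell \), 2, \(H_{\nu }\), and \(\varepsilon \) for several values of \(\nu \in (0, 1]\); the function names are illustrative and not part of the paper.

```python
from fractions import Fraction


def exponents(nu, p, q):
    """Exponents of (k, ell, 2, H_nu, eps) in the product
    (eps^2 k^3 / (8 ell^2))^p * ((1/k) * (ell / (8 k^2 H_nu))^(2/nu))^q."""
    r = q * 2 / nu  # total power applied to ell / (8 k^2 H_nu)
    return {
        "k": 3 * p - q - 2 * r,
        "ell": -2 * p + r,
        "two": -3 * p - 3 * r,
        "H": -r,
        "eps": 2 * p,
    }


def check(nu):
    """Compare both products with the closed forms stated in the proof."""
    first = exponents(nu, (2 + nu) / (2 + 2 * nu), nu / (2 + 2 * nu))
    second = exponents(nu, (4 + 3 * nu) / (4 + 4 * nu), nu / (4 + 4 * nu))
    ok_first = first == {
        "k": 1,
        "ell": -1,
        "two": -(12 + 3 * nu) / (2 + 2 * nu),
        "H": -1 / (1 + nu),
        "eps": (2 + nu) / (1 + nu),
    }
    ok_second = second == {
        "k": 2,
        "ell": Fraction(-3, 2),
        "two": -(18 + 9 * nu) / (4 + 4 * nu),
        "H": -1 / (2 + 2 * nu),
        "eps": (4 + 3 * nu) / (2 + 2 * nu),
    }
    return ok_first and ok_second


all_ok = all(check(Fraction(n, 10)) for n in range(1, 11))
print(all_ok)
```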

Other epochs. Let \(k_1,\dots ,k_{N_{\textrm{unsuc}}}\) be the numbers of iterations of the unsuccessful epochs and \(k_{N_{\textrm{unsuc}} + 1}\) that of the last epoch. Then, the total number of iterations of these epochs can be bounded with the Cauchy–Schwarz inequality as follows:

$$\begin{aligned} \sum _{i=1}^{N_{\textrm{unsuc}} + 1} k_i&= { \sum _{i:\, k_i = 1} k_i + \sum _{i:\, k_i \ge 2} k_i }\le N_{\textrm{unsuc}} + 1 + \sum _{i:\, k_i \ge 2} k_i \nonumber \\&\le N_{\textrm{unsuc}} + 1 + \sqrt{N_{\textrm{unsuc}} + 1} \sqrt{ \sum _{i:\, k_i \ge 2} k_i^2 }, \end{aligned}$$

where \(\sum _{i:\, k_i \ge 2}\) denotes a sum over \(i = 1, \dots , N_{\textrm{unsuc}} + 1\) such that \(k_i \ge 2\). We will evaluate \(N_{\textrm{unsuc}}\) and the sum of \(k_i^2\). First, we have \(\ell _{\textrm{init}}\beta ^{N_{\textrm{suc}}} \alpha ^{N_{\textrm{unsuc}}} \le \bar{\ell }\) and hence

$$\begin{aligned} N_{\textrm{unsuc}} \le { c_1 N_{\textrm{suc}} + c_2 - 1 \le 2^8 c_1 \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} + c_2 - 1 } \end{aligned}$$

from (28), where \(c_1\) and \(c_2\) are defined by (24). Next, let us focus on an epoch that ends at iteration \(k \ge 2\). Lemma 6 gives \(\varepsilon < \ell \sqrt{8 S_{k-1} / k^3}\) and hence \(S_{k-1} \ge \frac{\varepsilon ^2 k^3}{8 \ell ^2}\). Plugging this bound into (22) yields

$$\begin{aligned} f(x_0) - f(x^\star _k) \ge \frac{\ell S_{k-1}}{4k} \ge \frac{\varepsilon ^2 k^2}{2^5 \ell }. \end{aligned}$$

Summing this bound over all unsuccessful and last epochs results in

$$\begin{aligned} \sum _{i:\, k_i \ge 2} k_i^2 \le \frac{2^5 \Delta \bar{\ell }}{\varepsilon ^2}. \end{aligned}$$

Plugging (30) and (31) into (29) yields

$$\begin{aligned} \sum _{i=1}^{N_{\textrm{unsuc}} + 1} k_i&\le 2^8 {c_1} \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} + {c_2} + \sqrt{ 2^8 {c_1} \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} + {c_2} } \sqrt{ \frac{2^5 \Delta \bar{\ell }}{\varepsilon ^2} }\\&\le 2^8 {c_1} \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} + {c_2} + 2^{\frac{13}{2}} \sqrt{{c_1}} \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }} {+ 2^{\frac{5}{2}} \sqrt{c_2 \Delta \bar{\ell }} \varepsilon ^{-1}}, \end{aligned}$$

where the last inequality uses \(\sqrt{a + b} \le \sqrt{a} + \sqrt{b}\) for \(a, b \ge 0\). Putting this bound together with (28) gives an upper bound on the total iteration number of all epochs:

$$\begin{aligned} K_{\textrm{suc}} + \sum _{i=1}^{N_{\textrm{unsuc}} + 1} k_i&\le 91 (1 + \sqrt{{c_1}}) \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }} \\&\quad + 256 {c_1} \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} { + 6 \sqrt{c_2 \Delta \bar{\ell }} \varepsilon ^{-1} + c_2 }, \end{aligned}$$

where we have used \(2^{\frac{13}{2}} < 91\), \(2^8 = 256\), and \(2^{\frac{5}{2}} < 6\). Since \(\nu \in [0, 1]\) is now arbitrary, taking the infimum completes the proof. \(\square \)
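As a worked step, the bound (30) on \(N_{\textrm{unsuc}}\) can be unpacked by taking \(\log _\alpha \) of the inequality \(\ell _{\textrm{init}}\beta ^{N_{\textrm{suc}}} \alpha ^{N_{\textrm{unsuc}}} \le \bar{\ell }\) used in the proof (recall that \(\alpha > 1\) and \(0 < \beta \le 1\)):

$$\begin{aligned} N_{\textrm{unsuc}} \le \log _\alpha \left( {\frac{\bar{\ell }}{\ell _{\textrm{init}}}}\right) + N_{\textrm{suc}} \log _\alpha \left( {\frac{1}{\beta }}\right) = (c_2 - 1) + c_1 N_{\textrm{suc}}. \end{aligned}$$

In particular, for \(\beta = 1\) we have \(c_1 = 0\), so the number of unsuccessful epochs is at most \(c_2 - 1\) regardless of \(\varepsilon \).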

Algorithm 1 evaluates the objective function and its gradient at two points, \(x_k\) and \({\bar{x}}_k\), in each iteration. Therefore, the number of evaluations is of the same order as the iteration complexity in Theorem 1.

The complexity bounds given in Theorem 1 may look somewhat unfamiliar since they involve an \(\inf \)-operation on \(\nu \). Such a bound is a significant benefit of \(\nu \)-independent algorithms. The \(\nu \)-dependent prototype algorithm described immediately after Lemma 4 achieves the bound

$$\begin{aligned} 91 (1 + \sqrt{{c_1}}) \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }} + 256 {c_1} \Delta H_{\nu }^{\frac{1}{1 + \nu }} \varepsilon ^{- \frac{2 + \nu }{1 + \nu }} { + 6 \sqrt{c_2 \Delta \bar{\ell }} \varepsilon ^{-1} + c_2 }, \end{aligned}$$

only for the given \(\nu \). In contrast, Algorithm 1 is \(\nu \)-independent and automatically achieves the bound with the optimal \(\nu \), as shown in Theorem 1. The fact that the optimal \(\nu \) is difficult to find also points to the advantage of our \(\nu \)-independent algorithm. The complexity bound (25) also gives a looser bound:

$$\begin{aligned} \inf _{\nu \in [0, 1]} \left\{ 91 \Delta \sqrt{\bar{\ell }} H_{\nu }^{\frac{1}{2 + 2 \nu }} \varepsilon ^{- \frac{4 + 3 \nu }{2 + 2 \nu }}\right\} + O(\varepsilon ^{-1})&\le 91 \Delta \sqrt{\bar{\ell }H_{0}} \varepsilon ^{-2} + O(\varepsilon ^{-1})\\ {}&\le 91 \sqrt{2} \Delta \bar{\ell }\varepsilon ^{-2} + O(\varepsilon ^{-1}), \end{aligned}$$

where we have taken \(\nu = 0\) and have used \(H_{0} \le 2 L\le 2 \bar{\ell }\). This bound matches the classical bound of \(O(\varepsilon ^{-2})\) for GD. Theorem 1 thus shows that our HB method has a more elaborate complexity bound than GD.
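To make the role of the \(\inf \) over \(\nu \) concrete, the following sketch evaluates the leading term of the bound for \(\beta = 1\) on a finite, user-supplied table of Hölder constants. The table values are made up for illustration, and minimizing over a finite grid only gives an upper bound on the true infimum; the function names are not from the paper.

```python
import math
from fractions import Fraction


def eps_exponent(nu):
    """Exponent (4 + 3 nu) / (2 + 2 nu) of 1/eps in the leading term."""
    nu = Fraction(nu)
    return (4 + 3 * nu) / (2 + 2 * nu)


def leading_term(nu, H_nu, delta, ell_bar, eps):
    """91 * Delta * sqrt(ell_bar) * H_nu^(1/(2+2nu)) * eps^(-(4+3nu)/(2+2nu))."""
    return (
        91
        * delta
        * math.sqrt(ell_bar)
        * H_nu ** (1 / (2 + 2 * nu))
        * eps ** (-(4 + 3 * nu) / (2 + 2 * nu))
    )


def best_bound(holder_constants, delta, ell_bar, eps):
    """Minimum of the leading term over a finite table {nu: H_nu}."""
    return min(
        leading_term(nu, H, delta, ell_bar, eps)
        for nu, H in holder_constants.items()
    )


# Hypothetical constants: H_0 = 2 and H_1 = 1, with Delta = ell_bar = 1.
table = {0: 2.0, 1: 1.0}
print(eps_exponent(0), eps_exponent(1))  # exponents for nu = 0 and nu = 1
print(best_bound(table, delta=1.0, ell_bar=1.0, eps=1e-4))
```

For \(\nu = 0\) the exponent is 2, matching GD, and for \(\nu = 1\) it is 7/4, so for small \(\varepsilon \) the minimum is attained at the larger \(\nu \) whenever the corresponding \(H_{\nu }\) is finite and moderate.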

Remark 1

Although we employed global Lipschitz and Hölder continuity in Assumption 1 and Definition 1, they can be restricted to the region that the iterates reach. More precisely, if we assume that the iterates \((x_k)\) generated by Algorithm 1 are contained in some convex set \(C \subseteq \mathbb R^d\), we can replace all \(\mathbb R^d\) in our analysis with C; we can obtain the same complexity bound as Theorem 1 with Lipschitz and Hölder continuity on C.Footnote 3

6 Numerical experiments

This section compares the performance of the proposed method with several existing algorithms. The experimental setup, including the compared algorithms and problem instances, follows [35]. We implemented the code in Python with JAX [4] and Flax [26] and executed it on a computer with an Apple M3 Chip (12 cores) and 36 GB RAM. The source code used in the experiments is available on GitHub.Footnote 4

6.1 Compared algorithms

We compared the following six algorithms.

  • Proposed is Algorithm 1 with parameters set as \((\ell _{\textrm{init}}, \alpha , \beta ) = (10^{-3}, 2, 0.1)\).

  • GD is a gradient descent method with Armijo-type backtracking. This method has input parameters \(\ell _{\textrm{init}}\), \(\alpha \), and \(\beta \) similar to those in Proposed, which were set as \((\ell _{\textrm{init}}, \alpha , \beta ) = (10^{-3}, 2, 0.9)\).

  • JNJ2018 [28, Algorithm 2] is an accelerated gradient (AG) method for nonconvex optimization. The parameters were set in accordance with [28, Eq. (3)]. The equation involves constants c and \(\chi \), whose values are difficult to determine; we set them as \(c = \chi = 1\).

  • LL2022 [33, Algorithm 2] is another AG method. The parameters were set in accordance with [33, Theorem 2.2 and Section 4].

  • MT2022 [35, Algorithm 1] is another AG method. The parameters were set in accordance with [35, Section 6.1].

  • L-BFGS is the limited-memory BFGS method [5]. We used SciPy [52] for the method, i.e., scipy.optimize.minimize with option method="L-BFGS-B".

The parameter setting for JNJ2018 and LL2022 requires the values of the Lipschitz constants \(L\) and \(H_{1}\) and the target accuracy \(\varepsilon \). For these two methods, we selected the best-performing \(L\) from \(\{ 10^{-4},10^{-3},\dots ,{10^{10}} \}\) and set \(H_{1} = 1\) and \(\varepsilon = 10^{-16}\), following [33, 35]. It should be noted that if these values deviate from the actual ones, convergence of the methods is not guaranteed.
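To illustrate how the parameters \((\ell _{\textrm{init}}, \alpha , \beta )\) act in a backtracking scheme such as GD, here is a minimal pure-Python sketch. The acceptance rule and the update order are assumptions made for illustration (a standard sufficient-decrease test with step size \(1/\ell \)); they are not taken from the paper or from the authors' code, and the toy quadratic is hypothetical.

```python
def gd_backtracking(f_and_grad, x, ell_init=1e-3, alpha=2.0, beta=0.9,
                    eps=1e-6, max_iter=10_000):
    """Gradient descent with step size 1/ell and a backtracking estimate ell.

    Assumed acceptance rule (illustrative, not from the paper):
        f(x - g / ell) <= f(x) - ||g||^2 / (2 * ell).
    On failure ell is multiplied by alpha > 1; after success by beta <= 1.
    """
    ell = ell_init
    fx, g = f_and_grad(x)
    for _ in range(max_iter):
        g_sq = sum(gi * gi for gi in g)
        if g_sq ** 0.5 <= eps:
            break  # eps-stationary point found
        while True:
            x_new = [xi - gi / ell for xi, gi in zip(x, g)]
            f_new, g_new = f_and_grad(x_new)
            if f_new <= fx - g_sq / (2 * ell):
                break
            ell *= alpha  # step too long: increase the Lipschitz estimate
        x, fx, g = x_new, f_new, g_new
        ell *= beta  # try a slightly smaller estimate in the next iteration
    return x, fx, ell


def quadratic(x):
    """Toy objective 0.5 * (x0^2 + 10 * x1^2) with its gradient."""
    value = 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
    grad = [x[0], 10.0 * x[1]]
    return value, grad


x_final, f_final, ell_final = gd_backtracking(quadratic, [1.0, 1.0])
print(f_final, ell_final)
```

Starting from a deliberately small \(\ell _{\textrm{init}}\), the estimate is doubled until the decrease test passes, which mirrors the observation that a small initial estimate is corrected within the first few iterations.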

6.2 Problem instances

We tested the algorithms on seven different instances. The first four instances are benchmark functions from [27].

  • Dixon–Price function [19]:

    $$\begin{aligned} \min _{(x_1,\dots ,x_d) \in \mathbb R^d}\ (x_1 - 1)^2 + \sum _{i=2}^d i (2 x_i^2 - x_{i-1})^2. \end{aligned}$$

    The optimum is \(f(x^*) = 0\) at \(x^*_i = 2^{2^{1-i} - 1}\) for \(1 \le i \le d\).

  • Powell function [44]:

    $$\begin{aligned}&\min _{(x_1,\dots ,x_d) \in \mathbb R^d}\ \sum _{i=1}^{\lfloor d/4\rfloor } \left( { \left( {x_{4i-3} + 10 x_{4i-2}}\right) ^2 + 5 \left( {x_{4i-1} - x_{4i}}\right) ^2 + \left( {x_{4i-2} - 2 x_{4i-1}}\right) ^4 + 10 \left( {x_{4i-3} - x_{4i}}\right) ^4 }\right) . \end{aligned}$$

    The optimum is \(f(x^*) = 0\) at \(x^* = (0, \dots , 0)\).

  • Qing function [45]:

    $$\begin{aligned} \min _{(x_1,\dots ,x_d) \in \mathbb R^d}\ \sum _{i=1}^{d} (x_i^2 - i)^2. \end{aligned}$$

    The optimum is \(f(x^*) = 0\) at \(x^* = (\pm \sqrt{1}, \pm \sqrt{2}, \dots , \pm \sqrt{d})\).

  • Rosenbrock function [47]:

    $$\begin{aligned} \min _{(x_1,\dots ,x_d) \in \mathbb R^d}\ \sum _{i=1}^{d-1} \left( { 100 \left( {x_{i+1} - x_i^2}\right) ^2 + (x_i - 1)^2 }\right) . \end{aligned}$$

    The optimum is \(f(x^*) = 0\) at \(x^* = (1, \dots , 1)\).

The dimension d of the above problems was fixed as \(d = 10^6\). The starting point was set as \(x_{\textrm{init}}= x^* + \delta \), where \(x^*\) is the optimal solution, and each entry of \(\delta \) was drawn from the normal distribution \({\mathcal {N}}(0, 1)\). For the Qing function (34), we used \(x^* = (\sqrt{1}, \sqrt{2}, \dots , \sqrt{d})\) to set the starting point.
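The four benchmark functions and the starting-point rule can be written down in a few lines. The sketch below is plain Python for small \(d\) (the experiments used JAX with \(d = 10^6\)), takes the Qing sum over all \(d\) coordinates as required by the stated optimum, and uses illustrative function names and an arbitrary seed.

```python
import math
import random


def dixon_price(x):
    d = len(x)
    return (x[0] - 1) ** 2 + sum(
        i * (2 * x[i - 1] ** 2 - x[i - 2]) ** 2 for i in range(2, d + 1)
    )


def powell(x):
    total = 0.0
    for i in range(1, len(x) // 4 + 1):
        # a, b, c, e stand for x_{4i-3}, x_{4i-2}, x_{4i-1}, x_{4i}.
        a, b, c, e = x[4 * i - 4], x[4 * i - 3], x[4 * i - 2], x[4 * i - 1]
        total += (a + 10 * b) ** 2 + 5 * (c - e) ** 2
        total += (b - 2 * c) ** 4 + 10 * (a - e) ** 4
    return total


def qing(x):
    return sum((xi ** 2 - i) ** 2 for i, xi in enumerate(x, start=1))


def rosenbrock(x):
    return sum(
        100 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1) ** 2
        for i in range(len(x) - 1)
    )


def dixon_price_optimum(d):
    """x*_i = 2^(2^(1-i) - 1) for i = 1, ..., d."""
    return [2.0 ** (2.0 ** (1 - i) - 1) for i in range(1, d + 1)]


def starting_point(x_star, seed=0):
    """x_init = x* + delta with i.i.d. standard normal entries in delta."""
    rng = random.Random(seed)
    return [xi + rng.gauss(0.0, 1.0) for xi in x_star]


d = 8
qing_optimum = [math.sqrt(i) for i in range(1, d + 1)]
x_init = starting_point(qing_optimum)
print(dixon_price(dixon_price_optimum(d)), qing(qing_optimum), qing(x_init))
```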

The other three instances are more practical examples from machine learning.

  • Training a neural network for classification with the MNIST dataset:

    $$\begin{aligned} \min _{w \in \mathbb R^d}\&\frac{1}{N} \sum _{i=1}^N \ell _{\textrm{CE}}(y_i, \phi _1(x_i; w)). \end{aligned}$$

    The vectors \(x_1,\dots ,x_N \in \mathbb R^M\) and \(y_1,\dots ,y_N \in \{ 0, 1 \}^K\) are given data, \(\ell _{\textrm{CE}}\) is the cross-entropy loss, and \(\phi _1(\cdot ; w): \mathbb R^M \rightarrow \mathbb R^K\) is a neural network parameterized by \(w \in \mathbb R^d\). We used a three-layer fully connected network with bias parameters. The layers each have M, 32, 16, and K nodes, where \(M = 784\) and \(K = 10\). The hidden layers have the logistic sigmoid activation, and the output layer has the softmax activation. The total number of the parameters is \(d = (784 \times 32 + 32 \times 16 + 16 \times 10) + (32 + 16 + 10) = 25818\). The data size is \(N = 10000\).

  • Training an autoencoder for the MNIST dataset:

    $$\begin{aligned} \min _{w \in \mathbb R^d}\&\frac{1}{2MN} \sum _{i=1}^N \Vert x_i - \phi _2(x_i; w)\Vert ^2. \end{aligned}$$

    The vectors \(x_1,\dots ,x_N \in \mathbb R^M\) are given data, and \(\phi _2(\cdot ; w): \mathbb R^M \rightarrow \mathbb R^M\) is a neural network parameterized by \(w \in \mathbb R^d\). We used a four-layer fully connected network with bias parameters. The layers each have M, 32, 16, 32, and M nodes, where \(M = 784\). The hidden and output layers have the logistic sigmoid activation. The total number of the parameters is \(d = (784 \times 32 + 32 \times 16 + 16 \times 32 + 32 \times 784) + (32 + 16 + 32 + 784) = 52064\). The data size is \(N = 10000\).

  • Low-rank matrix completion with the MovieLens-100K dataset:

    $$\begin{aligned} \min _{\begin{array}{c} U \in \mathbb R^{p \times r}\\ V \in \mathbb R^{q \times r} \end{array}}\ {}&\frac{1}{2 N} \sum _{(i, j, s) \in \Omega } \left( {(U V^\top )_{ij} - s}\right) ^2 + \frac{1}{2 N} \Vert U^\top U - V^\top V\Vert _{\textrm{F}}^2. \end{aligned}$$

    The set \(\Omega \) consists of \(N = 100000\) observed entries of a \(p \times q\) data matrix, and \((i, j, s) \in \Omega \) means that the \((i, j)\)-th entry is s. The second term with the Frobenius norm \(\Vert \cdot \Vert _{\textrm{F}}\) was proposed in [51] as a way to balance U and V. The data matrix has size \(p \times q\) with \(p = 943\) and \(q = 1682\), and we set the rank as \(r \in \{ 100, 200 \}\). Thus, the number of variables is \(pr + qr \in \{ 262500, 525000 \}\).
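The problem dimensions quoted for the three machine learning instances follow directly from the layer widths and matrix sizes. A small sketch that reproduces the arithmetic (the helper name `mlp_param_count` is illustrative):

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    weights = sum(m * n for m, n in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases


classifier_d = mlp_param_count([784, 32, 16, 10])
autoencoder_d = mlp_param_count([784, 32, 16, 32, 784])

# Matrix completion: U is p x r and V is q x r, so there are (p + q) * r variables.
p, q = 943, 1682
completion_d = [(p + q) * r for r in (100, 200)]

print(classifier_d, autoencoder_d, completion_d)
```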

Although we did not check whether the above seven instances have globally Lipschitz continuous gradients or Hessians, we confirmed in our experiments that the iterates generated by each algorithm were bounded. Since all of the above instances are three times continuously differentiable, both the gradients and Hessians are Lipschitz continuous on any bounded domain. In view of Remark 1, the proposed algorithm therefore achieves the same complexity bound as in Theorem 1 in these experiments.

Fig. 1 Numerical results with benchmark functions

Fig. 2 Numerical results with benchmark functions. The horizontal axis is the elapsed time in seconds

Fig. 3 Numerical results with machine learning instances

Fig. 4 The objective function value \(f(x_k)\) and the estimates \(\ell \) and \(h_k\) at each iteration of the proposed method. The iterations at which a restart occurred are marked. Left: the first 500 iterations. Right: later 500 iterations

6.3 Results

Figure 1 illustrates the results with the four benchmark functions.Footnote 5 The horizontal axis is the number of calls to the oracle that computes both f(x) and \(\nabla f(x)\) at a given point \(x \in \mathbb R^d\).
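Counting oracle calls in this sense only requires wrapping the routine that returns both \(f(x)\) and \(\nabla f(x)\). The following sketch shows one way to do this; the class name and the toy objective are hypothetical and not taken from the authors' code.

```python
class CountingOracle:
    """Wrap a function returning (f(x), grad f(x)) and count how often it is called."""

    def __init__(self, f_and_grad):
        self._f_and_grad = f_and_grad
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self._f_and_grad(x)


def toy_objective(x):
    """f(x) = sum of squares, with gradient 2x (illustrative only)."""
    return sum(xi * xi for xi in x), [2.0 * xi for xi in x]


oracle = CountingOracle(toy_objective)
for point in ([1.0, 2.0], [0.5, 1.0], [0.0, 0.0]):
    value, grad = oracle(point)
print(oracle.calls)
```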

Let us first focus on the methods other than L-BFGS, which is very practical but does not have complexity guarantees for general nonconvex functions, unlike the other methods.

Figure 1a and b show that Proposed converged faster than the existing methods except for L-BFGS, and Fig. 1c shows that Proposed and MT2022 converged fast. Figure 1d shows that GD and LL2022 attained a small objective function value, while GD and Proposed converged fast in terms of the gradient norm. In summary, the proposed algorithm was stable and fast.

L-BFGS successfully solved the four benchmarks, but these results do not imply that L-BFGS converged faster than the proposed algorithm in terms of execution time. Figure 2 reproduces the four plots in the right column of Fig. 1, with the horizontal axis replaced by the elapsed time. Figure 2 shows that Proposed converged as fast as or faster than L-BFGS in terms of elapsed time. One reason for the large difference in the apparent performance of L-BFGS in Figs. 1 and 2 is that the computational costs of the non-oracle parts in L-BFGS, such as updating the Hessian approximation and solving linear systems, are not negligible. In contrast, the proposed algorithm does not require heavy computation besides oracle calls and is more advantageous in execution time when function and gradient evaluations are low-cost.

Figure 3 presents the results with the machine learning instances. Similar to Fig. 1, Fig. 3 shows that the proposed algorithm performed comparably to or better than the existing methods except for L-BFGS, especially in reducing the gradient norm.

Figure 4 illustrates the objective function value \(f(x_k)\) and the estimates \(\ell \) and \(h_k\) at each iteration of the proposed algorithm for the machine learning instances. The iterations at which a restart occurred are also marked; “successful” and “unsuccessful” mean restarts at Line 10 and Line 8 of Algorithm 1, respectively. This figure shows that the proposed algorithm restarts frequently in the early stages but that the frequency decreases as the iterations progress. The frequent restarts in the early stages help update the estimate \(\ell \); \(\ell \) reached suitable values in the first few iterations, even though it was initialized to a fairly small value, \(\ell _{\textrm{init}}= 10^{-3}\). The infrequent restarts in later stages enable the algorithm to take full advantage of the HB momentum.