1 Introduction

We consider iterative methods for solving the unconstrained minimization problem

$$\begin{aligned} \min _{x\in V} f(x), \end{aligned}$$
(1)

where V is a Hilbert space and \(f:V\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is a proper, closed, and convex function. We shall first consider smooth f on the entire space V and later focus on the composite case \(f = h + g\), where both h (smooth) and g (non-smooth) are convex on some (simple) closed convex set \(Q\subseteq V\). We are mainly interested in the development and analysis of accelerated first-order methods.

Suppose V is equipped with the inner product \((\cdot ,\cdot )\) and the induced norm \(\left\Vert {\cdot } \right\Vert \). We use \(\left\langle {\cdot ,\cdot } \right\rangle \) to denote the duality pairing between \(V^*\) and V, where \(V^*\) is the continuous dual space of V endowed with the conventional dual norm \(\left\Vert {\cdot } \right\Vert _{*}\). For any interval \(I\subseteq {\mathbb {R}}\), denote by \(C^k(I;V)\) the space of all k-times continuously differentiable V-valued functions on I; the superscript k is dropped when \(k=0\). Let \(\varOmega \subseteq V\) be a closed convex subset. We say \(f\in \mathcal S_{\mu }^{1}(\varOmega )\) if it is continuously differentiable on \(\varOmega \) and there exists \(\mu \geqslant 0\) such that

$$\begin{aligned} f(x) - f(y) - \langle \nabla f(y), x - y \rangle \geqslant \frac{\mu }{2} \Vert x- y \Vert ^2 \quad \forall \, x,y \in \varOmega . \end{aligned}$$
(2)

We call (2) the \(\mu \)-convexity of f and when \(\mu >0\), we say f is strongly convex. We also write \(f\in \mathcal S^{1,1}_{\mu ,L}(\varOmega )\) if \(f\in {\mathcal {S}}_{\mu }^{1}(\varOmega )\) and \(\nabla f\) is Lipschitz continuous on \(\varOmega \): there exists \(0<L<\infty \) such that

$$\begin{aligned} \Vert \nabla f(x) - \nabla f(y)\Vert _{*} \leqslant L\Vert x - y \Vert \quad \forall \, x,y \in \varOmega . \end{aligned}$$
(3)

By [29, Theorem 2.1.5], this implies the inequality

$$\begin{aligned} f(x) - f(y) - \langle \nabla f(y), x - y \rangle \leqslant \frac{L}{2} \Vert x- y \Vert ^2 \quad \forall \, x,y \in \varOmega . \end{aligned}$$
(4)

For \(\varOmega = V\), we shall write \({\mathcal {S}}_{\mu }^{1}(\varOmega )\) and \({\mathcal {S}}^{1,1}_{\mu ,L}(\varOmega )\) as \({\mathcal {S}}_{\mu }^{1}\) and \({\mathcal {S}}^{1,1}_{\mu ,L}\), respectively.

The above functional classes are what we work with in this paper. As for the optimization problem (1), we are also concerned with the global minimizer(s) of f. In the strongly convex case, it is well known that the minimizer exists and is unique. In the merely convex case, however, an additional assumption, such as coercivity, is usually imposed to guarantee the existence of minimizers. Throughout, we denote by \(\mathrm{argmin}\,f\) the set of global minimizers of (1) and assume it is nonempty.

One approach to deriving the gradient descent (GD) method is to discretize an ordinary differential equation (ODE), the so-called gradient flow:

$$\begin{aligned} x'(t) = - \nabla f(x(t)),\quad t>0. \end{aligned}$$
(5)

Here we introduce an artificial time variable t and \(x'\) is the derivative taken with respect to t. For ease of notation, in the sequel, we shall omit t when no confusion arises. The simplest forward (explicit) Euler method with step size \(\eta _k>0\) leads to the GD method

$$\begin{aligned} x_{k+1} = x_k - \eta _k \nabla f(x_k). \end{aligned}$$

In the terminology of numerical analysis, it is well known that this method is conditionally A-stable (cf. Sect. 2), and for \(f\in {\mathcal {S}}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \), the step size \(\eta _k=1/L\) is allowed and yields the rate (see [29, Chapter 2])

$$\begin{aligned} O\left( \min \big \{L/k,(1+\mu /L)^{-k}\big \}\right) . \end{aligned}$$
(6)
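
For illustration only, a minimal Python sketch of this forward Euler discretization is given below; the quadratic test objective and all numerical values are our own illustrative choices and not part of the analysis.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, max_iter=1000):
    """Forward (explicit) Euler for the gradient flow x' = -grad f(x).

    With the step size eta_k = 1/L this is the classical GD method."""
    x = np.array(x0, dtype=float)
    eta = 1.0 / L
    for _ in range(max_iter):
        x = x - eta * grad_f(x)      # x_{k+1} = x_k - eta_k * grad f(x_k)
    return x

# illustrative quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite
A = np.diag([1.0, 10.0, 100.0])
x = gradient_descent(lambda z: A @ z, x0=np.ones(3), L=100.0)
```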

One can also consider the backward (implicit) Euler method

$$\begin{aligned} x_{k+1} = x_k - \eta _k \nabla f(x_{k+1}), \end{aligned}$$
(7)

which is unconditionally A-stable (cf. Sect. 2) and coincides with the well-known proximal point algorithm (PPA) [33]

$$\begin{aligned} x_{k+1} = \mathbf{prox}_{\eta _k f}(x_k):=\mathop {\mathrm{argmin}\,}\limits _{y\in V} \left( f(y) + \frac{1}{2\eta _k}\Vert y - x_k \Vert ^2 \right) . \end{aligned}$$
(8)

Note that this method allows f to be nonsmooth and possesses a linear convergence rate even for convex functions, as long as \(\eta _k\geqslant \eta >0\) for all \(k>0\).
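
As an illustrative sketch only, for the quadratic model \(f(x)=\frac{1}{2}x^{\top }Ax\) the proximal step in (8) has the closed form \(\mathbf{prox}_{\eta f}(x)=(I+\eta A)^{-1}x\), so the backward Euler/PPA iteration can be written in a few lines; for a general f the proximal subproblem has to be solved by an inner routine.

```python
import numpy as np

def ppa_quadratic(A, x0, eta=10.0, max_iter=50):
    """Backward (implicit) Euler for x' = -A x, i.e. the proximal point algorithm
    x_{k+1} = prox_{eta f}(x_k) = (I + eta * A)^{-1} x_k for f(x) = 0.5 x^T A x."""
    x = np.array(x0, dtype=float)
    M = np.eye(len(x)) + eta * A     # I + eta*A is SPD for any eta > 0
    for _ in range(max_iter):
        x = np.linalg.solve(M, x)    # unconditionally A-stable: any eta > 0 is allowed
    return x

x = ppa_quadratic(np.diag([0.0, 1.0, 100.0]), x0=np.ones(3))
```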

1.1 Main results

Let us start from the quadratic objective \(f(x) = \frac{1}{2}x^{\top }Ax\) over \({{\mathbb {R}}}^d\), for which the gradient flow (5) reads simply as

$$\begin{aligned} x' = -Ax, \end{aligned}$$
(9)

where A is symmetric positive semi-definite and makes \(f\in {\mathcal {S}}_{\mu ,L}^{1,1}\). Instead of solving  (9), we turn to a general linear ODE system

$$\begin{aligned} y'=Gy. \end{aligned}$$
(10)

Briefly speaking, our main idea is to seek such a system (10) with some asymmetric block matrix G that transforms the spectrum of A from the real line to the complex plane and reduces the condition number from \(\kappa (A) = L/\mu \) to \(\kappa (G) = O(\sqrt{L/\mu })\). Afterwards, accelerated gradient methods may be constructed from A-stable methods for solving (10) with a significantly larger step size, which improves the contraction rate from \(O((1-\mu /L)^k)\) to \(O((1-\sqrt{\mu /L})^k)\). Furthermore, to handle the convex case \(\mu =0\), we combine the transformation idea with a suitable time scaling technique; for more details, we refer to Sect. 2.

One successful and important transformation is given below

$$\begin{aligned} G = \begin{pmatrix} - I &{} \quad I \\ \mu /\gamma -A/\gamma &{} \quad -\mu /\gamma \, I \end{pmatrix}, \end{aligned}$$
(11)

where the built-in scaling factor \(\gamma \) is positive and satisfies

$$\begin{aligned} \gamma ' =\mu -\gamma ,\quad \gamma (0)=\gamma _0>0. \end{aligned}$$
(12)

Based on this, for general \(f\in {\mathcal {S}}_\mu ^1\) with \(\mu \geqslant 0\), we replace A in (11) with \(\nabla f\) and write \(y=(x,v)\) to obtain a first-order dynamical system:

$$\begin{aligned} \left\{ \begin{aligned} x' = {}&v-x,\\ v'={}&\frac{\mu }{\gamma }(x-v) - \frac{1}{\gamma }\nabla f(x). \end{aligned} \right. \end{aligned}$$
(13)

Eliminating v, we arrive at a second-order ODE of x:

$$\begin{aligned} \gamma x''+ \left( \mu +\gamma \right) x' +\nabla f(x)=0, \end{aligned}$$
(14)

which is actually a heavy ball model (cf. (21)) with variable damping coefficients in front of \(x''\) and \(x'\). Thanks to the scaling factor \(\gamma \), we can handle both the convex case (\(\mu = 0\)) and the strongly convex case (\(\mu > 0\)) in a unified way. Moreover, we shall prove the exponential decay property

$$\begin{aligned} {\mathcal {L}}(t)\leqslant e^{-t}{\mathcal {L}}(0),\quad t>0, \end{aligned}$$
(15)

for a tailored Lyapunov function

$$\begin{aligned} {\mathcal {L}}(t)= f(x(t))-f(x^*)+\frac{\gamma (t)}{2} \left\Vert {v(t)-x^*} \right\Vert ^2,\quad t>0, \end{aligned}$$
(16)

where \(x^*\in \mathrm{argmin}\,f\) is a global minimizer of f.

Accelerated gradient methods based on numerical discretizations of the dynamical system (13) with \(f\in \mathcal S_{\mu ,L}^{1,1}\) are then considered and analyzed by means of a discrete version of the Lyapunov function (16). It will be shown that the implicit scheme (see (72)) possesses a linear convergence rate as long as the time step size is uniformly bounded below. This matches the exponential decay rate (15) at the continuous level. Also, in the convex case \(\mu =0\), this implicit method amounts to an accelerated PPA that is very close to Güler’s PPA [20] and enjoys the same rate \(O(1/k^2)\) (cf. Theorem 4). In Sect. 5, for semi-implicit schemes with suitable corrections (either an extrapolation or a gradient step), we prove the following convergence rate

$$\begin{aligned} O\left( \min \big \{L/k^2,(1+\sqrt{\mu /L})^{-k}\big \}\right) , \end{aligned}$$
(17)

which is optimal in the sense of [29]. Moreover, we can recover Nesterov’s optimal method [27, 29] exactly from a semi-implicit scheme with a gradient descent correction; see Sect. 6. Therefore, instead of using the estimate sequence technique, our ODE approach provides an alternative derivation of Nesterov’s method that is hopefully more intuitive for understanding the acceleration mechanism. From this point of view, we call both (13) and (14) the Nesterov accelerated gradient (NAG) flow.

As a proof of concepts, we also generalize our NAG flow to the composite case

$$\begin{aligned} \min _{x\in Q} f(x):= \min _{x\in Q} \left[ h(x)+g(x) \right] , \end{aligned}$$
(18)

where \(Q\subseteq V\) is a (simple) closed convex set, \(h\in \mathcal S_{\mu ,L}^{1,1}(Q)\) with \(0\leqslant \mu \leqslant L<\infty \) and \(g:V\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is proper, closed, and convex. We use \( \mathbf{dom\,}g\) to denote the effective domain of g and assume that \(Q\cap \mathbf{dom}\, g\ne \emptyset \). Treating (18) as an unconstrained minimization of \(F=f+i_Q\) where \(i_Q\) denotes the indicator function of Q, the generalized version of (14) is a second-order differential inclusion

$$\begin{aligned} \gamma x''+ \left( \mu +\gamma \right) x' +\partial F(x)\ni 0. \end{aligned}$$
(19)

We shall establish the existence of a solution to (19) in a proper sense and then obtain the exponential decay (15) for almost all \(t>0\).

For the unconstrained case \(Q = V\), by using the tool of composite gradient mapping [29, Chapter 2], a semi-implicit scheme with correction for the generalized NAG flow (19) is presented and leads to an accelerated proximal gradient method (APGM); see Algorithm 2. We also give a simplified variant that is closely related to FISTA [12]. For the constrained problem (18), an accelerated forward-backward method is proposed in Algorithm 4. Both algorithms call the proximal operation of g (over Q) only once per iteration, and they are proved to share the same accelerated convergence rate (17).

The rest of this paper is organized as follows. In the remainder of the introduction, we review some existing works devoted to accelerated gradient methods from the ODE point of view. Next, in Sect. 2, we explain the acceleration mechanism via the A-stability theory of ODE solvers and derive our NAG flow. Then in Sect. 3 we focus on the NAG flow and prove its exponential decay. After that, accelerated gradient methods based on numerical discretizations of the NAG flow are proposed and analyzed in Sects. 4, 5 and 6. Finally, in Sect. 7, we extend our NAG flow to composite optimization and propose two new accelerated methods with convergence analysis.

1.2 Related works

The well-known momentum method can be traced back to the 1960s. In [34], Polyak studied the heavy ball (HB) method

$$\begin{aligned} x_{k+1} =x_k -\alpha \nabla f(x_k)+\beta (x_k -x_{k-1}) \end{aligned}$$
(20)

and its continuous analogue, the heavy ball dynamical system:

$$\begin{aligned} x''+\alpha _1 x'+\alpha _2\nabla f(x) = 0. \end{aligned}$$
(21)

Local linear convergence results for (20) via spectral analysis were established in [34, Theorem 9]. Note that the HB method (20) adds a momentum term to the gradient step and is sensitive to its parameters. For \(f\in \mathcal S_{\mu ,L}^{1,1}\), it shares the same theoretical convergence rate (6) as the gradient descent method; see [18, 40]. To the best of our knowledge, no work has established the global accelerated rate (17) for the original HB method (20). Recently, Nguyen et al. [26] developed the so-called accelerated residual method, which combines (20) with an extra gradient descent step:

$$\begin{aligned} \left\{ \begin{aligned} {}&y_{k} =x_k -\alpha \nabla f(x_k)+\beta (x_k -x_{k-1}),\\ {}&x_{k+1} = y_k-\frac{\alpha }{\beta +1}\nabla f(y_k). \end{aligned} \right. \end{aligned}$$

Numerically, they verified the efficiency and usefulness of this method with a restart strategy. We refer to [1, 3, 11, 19] for further investigations of the HB system (21).

To understand an accelerated gradient method with the rate \(O(1/k^2)\) proposed by Nesterov [27], Su, Boyd and Candès [37] derived the following second-order ODE

$$\begin{aligned} x'' + \frac{\alpha }{t}x' + \nabla f(x) = 0,\quad t>0, \end{aligned}$$
(22)

where \(\alpha >0\) and \(f\in {\mathcal {S}}_{0,L}^{1,1}\). If \(\alpha \geqslant 3\) or \(1<\alpha <3\) and \((f-f(x^*))^{(\alpha -1)/2}\) is convex, they proved the decay rate \(O(t^{-2})\). If \(\alpha \geqslant 3\) and f is strongly convex, they also obtained a faster rate \(O(t^{-2\alpha /3})\). Later on, Aujol and Dossal [10] established a generic result:

$$\begin{aligned} f(x(t)) - f(x^*) \leqslant \left\{ \begin{aligned}&Ct^{-2},&\text {if}~\alpha \geqslant 2\beta +1,\\&Ct^{-2\alpha /(2\beta +1)},&\text {if}~0<\alpha <2\beta +1, \end{aligned} \right. \end{aligned}$$
(23)

where \(\beta >0\) and \((f-f(x^*))^\beta \) is convex. Almost at the same time, Attouch et al. [8] obtained the estimate (23) for \(\beta =1\) and considered numerical discretizations for (22) with the convergence rate \(O(k^{-\min \{2,2\alpha /3\}})\). Also, Vassilis et al. [42] studied the non-smooth version of (22):

$$\begin{aligned} x'' + \frac{\alpha }{t}x' + \partial f(x) \ni 0. \end{aligned}$$
(24)

They proved that the solution trajectory of (24) converges to a minimizer of f and derived the decay estimate (23) for \(\beta =1\). For more works and generalizations related to the model (22) and the corresponding algorithms, we refer to [2, 5,6,7, 14] and references therein.

Recently, Wibisono et al. [43] introduced a Lagrangian

$$\begin{aligned} {\mathcal {E}}(y,w,t) = \frac{e^{\int _{0}^{t}\alpha (s)\,{\mathrm{ds}}}}{\alpha (t)\beta (t)} \left( \frac{\beta (t)}{2}\left\Vert {w} \right\Vert ^2-\alpha ^2(t)f(y) \right) , \end{aligned}$$
(25)

for smooth and convex f, where \(\alpha :{\mathbb {R}}_+\rightarrow {\mathbb {R}}_+\) is continuous and \(\beta :\mathbb R_+\rightarrow {\mathbb {R}}_+ \) satisfies

$$\begin{aligned} \beta '\geqslant -\alpha \beta ,\quad \beta (0)=\beta _0>0. \end{aligned}$$
(26)

The Lagrangian (25) itself introduces a variational problem, whose Euler–Lagrange equation is

$$\begin{aligned} \left\{ \begin{aligned} {}&y' =\alpha (w-y),\\ {}&\beta w'=-\alpha \nabla f(y). \end{aligned} \right. \end{aligned}$$
(27)

They then established the convergence rate (cf. [43, Theorem 2.1])

$$\begin{aligned} f(y(t))-f(x^*)\leqslant e^{-\int _{0}^{t}\alpha (s)\,{\mathrm{ds}}} {\mathcal {L}}(0), \end{aligned}$$
(28)

by means of the Lyapunov function

$$\begin{aligned} {\mathcal {L}}(t)=e^{\int _{0}^{t}\alpha (s) \,{\mathrm{ds}}}\big [f(y(t))-f(x^*) \big ]+ \frac{1}{2}\left\Vert {w(t)-x^*} \right\Vert ^2. \end{aligned}$$

Following this work, for \(f\in {\mathcal {S}}_{\mu }^1\) with \(\mu >0\), Wilson et al. [44] introduced another Lagrangian whose Euler–Lagrange equation reads as

$$\begin{aligned} \left\{ \begin{aligned} {}&y' = \alpha (w-y),\\ {}&\mu w'=\mu \alpha (y-w)-\alpha \nabla f(y), \end{aligned} \right. \end{aligned}$$
(29)

with the same scaling function \(\alpha \) in (25). They proved the decay estimate (28) as well, by using the Lyapunov function

$$\begin{aligned} {\mathcal {L}}(t)= e^{\int _{0}^{t}\alpha (s)\,{\mathrm{ds}}} \left[ f(y(t)) - f(x^*) + \frac{\mu }{2}\left\Vert {w(t) - x^*} \right\Vert ^2\right] . \end{aligned}$$
(30)

When \(\alpha =\sqrt{\mu }\), (29) gives the following model

$$\begin{aligned} y''+2\sqrt{\mu }y'+\nabla f(y) = 0, \end{aligned}$$
(31)

which reduces to an HB system (cf. (21)); see also Siegel [38].

In addition, Siegel [38] and Wilson et al. [44] independently proposed semi-explicit schemes for (31). Both of their schemes are supplemented with an extra gradient descent step and share the same linear convergence rate \(O((1-\sqrt{\mu /L})^k)\).

Recently, introducing the so-called duality gap, which is the difference of appropriate upper and lower bound approximations of the objective function, Diakonikolas and Orecchia [17] presented a general framework for the construction and analysis of continuous time dynamical systems and the corresponding numerical discretizations. They recovered several existing ODE models such as the gradient flow (5), the mirror descent dynamical system and its accelerated version. We mention that the derivation of our NAG flow and the analyses of the discrete algorithms are fundamentally different from their duality gap technique.

2 Stability of ODE solvers and acceleration

In what follows, for any \(M\in \mathbb R^{d\times d}\), \(\sigma (M)\) denotes the spectrum of M, i.e., the set of all eigenvalues of M. The spectral radius is then defined by \(\rho (M) := \max _{\lambda \in \sigma (M)} |\lambda |\), and when M is invertible, its condition number \(\kappa (M) := \rho (M^{-1})\rho (M)\). If \(\sigma (M)\subset {\mathbb {R}}\), then \(\lambda _{\min }(M)\) and \(\lambda _{\max }(M)\) stand for the minimum and maximum of \(\sigma (M)\), respectively. Moreover, \(\left\Vert {\cdot } \right\Vert _2\) is the usual 2-norm for vectors and matrices.

To present our main idea as simply as possible, in this section, unless otherwise specified, we restrict ourselves to the quadratic objective \(f(x) = \frac{1}{2}x^{\top }Ax\), where A is a symmetric matrix with the bound

$$\begin{aligned} 0\leqslant \mu :=\lambda _{\min }(A)\leqslant \lambda \leqslant \lambda _{\max }(A):=L\quad \forall \,\lambda \in \sigma (A). \end{aligned}$$

For this model example, \(\nabla f(x) = Ax\) and the gradient flow (5) reads as \(x'=-Ax\). The global minimum is achieved at \(x^*=0\), and when \(\mu >0\), the condition number of A is \(\kappa (A)=L/\mu \).

2.1 A-stability of ODE solvers

Let \(G\in {\mathbb {R}}^{d\times d}\) and assume \({{\mathfrak {R}}}{{\mathfrak {e}}}(\lambda ) < 0\) for all \(\lambda \in \sigma (G)\). For the linear ODE system

$$\begin{aligned} y ' = G y, \quad y(0) = y_0\in {\mathbb {R}}^{d}, \end{aligned}$$
(32)

it is not hard to derive that \(\left\Vert {y(t)} \right\Vert _2\rightarrow 0\) as \(t\rightarrow \infty \) (see [13, Theorem 7] for instance). Hence \(y^*=0\) is an equilibrium of the dynamic system (32).

We now recall the concept of A-stability of ODE solvers [23, 39]. A one-step method \(\phi \) for (32) with step size \(\alpha >0\) can be formally written as

$$\begin{aligned} y_{k+1} = E_{\phi }(\alpha , G) y_{k}. \end{aligned}$$
(33)

As \(y^* = 0\) is an equilibrium point, (33) also gives the error equation. The scheme \(\phi \) is called absolutely stable or A-stable if \(\rho (E_{\phi }( \alpha , G)) < 1\), from which the asymptotic convergence \(y_{k} \rightarrow 0\) follows (cf. [16, Theorem 6.1]). If \(\rho (E_{\phi }( \alpha , G)) < 1\) holds for all \(\alpha >0\), then the scheme is called unconditionally A-stable, and if \(\rho (E_{\phi }( \alpha , G)) < 1\) holds only for \(\alpha \in I\), where I is an interval of the positive half line, then the scheme is called conditionally A-stable.

If \(E_{\phi }(\alpha ,G)\) is normal, then \(\Vert E_{\phi }(\alpha ,G) \Vert _2=\rho (E_{\phi }(\alpha ,G))\). Therefore for A-stable methods the linear convergence follows directly from the norm contraction

$$\begin{aligned} \left\Vert {y_{k+1}} \right\Vert _2\leqslant \rho (E_{\phi }(\alpha ,G))\left\Vert {y_{k}} \right\Vert _2. \end{aligned}$$
(34)

In general cases, however, bounding the spectral radius by one does not imply the norm contraction, i.e., (34) may not be true when \(E_{\phi }(\alpha ,G)\) is non-normal, even if (33) is A-stable. Nevertheless, we shall continue using the tool of A-stability through spectral analysis and comment on its limitation in Sect. 2.6.

2.2 Implicit and explicit Euler methods

It is well known that the implicit Euler (IE) method

$$\begin{aligned} \frac{y_{k+1}-y_k}{\alpha } = Gy_{k+1} \end{aligned}$$

is unconditionally A-stable. Indeed, \(E_{\mathrm{IE}} ( \alpha , G) = (I - \alpha G)^{-1}\) and \(\rho (E_{\mathrm{IE}} ( \alpha , G))<1\) for all \(\alpha >0\), since all eigenvalues of \(\alpha G\) lie in the left half of the complex plane and their distances to 1 are larger than one. Moreover, as it has no restriction on the step size, the implicit Euler method can achieve a faster convergence rate by time rescaling, which is equivalent to choosing a large step size.

In contrast, the explicit Euler method

$$\begin{aligned} \frac{y_{k+1}-y_k}{\alpha } = Gy_{k} \end{aligned}$$
(35)

is only conditionally A-stable. Let us consider the case \(G=-A\) with \(\mu >0\). Then (35) is exactly the gradient descent method for minimizing \(\frac{1}{2}x^{\top }Ax\). It is not hard to obtain that

$$\begin{aligned} \rho (E_{\mathrm{GD}}(\alpha , -A) ) = \rho (I-\alpha A) = \max \big \{ \left|{1-\alpha \mu } \right|, \, \left|{1-\alpha L} \right| \big \}. \end{aligned}$$
(36)

Hence \(\rho (E_{\mathrm{GD}}(\alpha , -A) ) <1\) provided \(0<\alpha <2/ L\). Thanks to the symmetry of A, we have \(\Vert E_{\mathrm{GD}}(\alpha , -A) \Vert _2 = \rho (E_{\mathrm{GD}}(\alpha , -A) ) \) and the norm convergence with linear rate follows. Moreover, based on (36), a standard argument yields the optimal choice \(\alpha ^* = 2/(\mu + L)\), which gives the minimal spectral radius

$$\begin{aligned} \Vert E_{\mathrm{GD}}(\alpha ^*, -A) \Vert _2= \min _{\alpha >0} \rho (I-\alpha A ) =\frac{\kappa (A) -1}{\kappa (A) + 1}. \end{aligned}$$
(37)

A quasi-optimal but simpler choice is \(\alpha _* = 1/ L\) which yields

$$\begin{aligned} \Vert E_{\mathrm{GD}}(\alpha _*, -A) \Vert _2 = \rho (I-\alpha _* A) = 1 - \frac{1}{\kappa (A)}. \end{aligned}$$
(38)

We formulate the convergence rates (37) and (38) in terms of the condition number \(\kappa (A)\) as it is invariant under rescaling of A, i.e., \(\kappa (cA) = \kappa (A)\) for any real number \(c\ne 0\). To be A-stable, one has to choose \(0<\alpha <2/\lambda _{\max }(A)\). It may seem that a simple rescaling to cA can reduce \(\lambda _{\max }(cA)\) and thus enlarge the range of the step size. However, the condition number \(\kappa (cA) = \kappa (A)\) is unchanged. From this we see that for the GD method (35), the simple rescaling cA gains nothing.

The magnitude of the step size is relative to \(\min |\lambda (G)|\). To fix the discussion, we choose \(G = - A/\mu \) in (35) so that \(\lambda _{\min }(A/\mu ) = 1\). Then A-stability of the explicit Euler method requires \(\alpha = O(1/\kappa (A))\), which leads to the contraction rate \(1-1/\kappa (A)\). Consequently, for ill-conditioned problems, a tiny step size proportional to \(1/\kappa (A)\) is required.

Rather than rescaling, our main idea is to seek some transformation G of A that reduces \(\kappa (A)\) to \(\kappa (G)=O(\sqrt{\kappa (A)})\). We wish to construct explicit A-stable methods which can enlarge the step size from \(O(1/\kappa (A))\) to \(O(1/\sqrt{\kappa (A)})\) and consequently improve the contraction rate from \(1 - 1/\kappa (A)\) to \(O(1 - 1/\sqrt{\kappa (A)})\).

2.3 Transformation to the complex plane

Let us first consider the case \(\mu >0\) and embed A into some \(2\times 2\) block matrix G with a rotation built-in. Specifically, we construct two candidates

$$\begin{aligned} G_{_{\mathrm{HB}}} = \begin{pmatrix} 0 &{} I\\ -A/\mu &{}\quad - 2I \end{pmatrix} \quad \text {and}\quad G_{_{\mathrm{NAG}}} = \begin{pmatrix} -I &{} I\\ I-A/\mu &{}\quad -I \end{pmatrix}. \end{aligned}$$
(39)

Due to the asymmetry, \(\sigma (A)\) will be transformed from the real line to the complex plane. This may shrink the condition number; see the following result.

Proposition 1

For \(G=G_{_{\mathrm{HB}}}\) or \(G_{_{\mathrm{NAG}}}\) given in (39), we have \({{\mathfrak {R}}}{{\mathfrak {e}}}(\lambda ) < 0\) for any \(\lambda \in \sigma (G)\), which guarantees the decay property \(\left\Vert {y(t)} \right\Vert _2\rightarrow 0\,\) for the system \(y'=Gy\). Moreover, we have \(\kappa (G_{_{\mathrm{HB}}}) = \kappa (G_{_{\mathrm{NAG}}}) = \sqrt{\kappa (A)}\).

Proof

Let us first consider \(G=G_{_{\mathrm{HB}}}\). As A is symmetric, we can write \(A = U\varLambda U^{\top }\) with an orthogonal matrix U and a diagonal matrix \(\varLambda \) consisting of the eigenvalues of A. By applying the similarity transformation to G with the block diagonal matrix \(\mathrm{diag}(U, U)\), it suffices to consider the eigenvalues of

$$\begin{aligned} R_{_{\mathrm{HB}}}= \begin{pmatrix} 0 &{} 1\\ -\theta &{} \quad -2 \end{pmatrix}, \quad \theta \in \sigma (A/\mu ). \end{aligned}$$

It is clear that \(\det R_{_{\mathrm{HB}}} = \theta \) and \(\mathrm{tr}\, R_{_{\mathrm{HB}}}=-2<0\). In addition, since \(\left|{\mathrm{tr}\,R_{_\mathrm{HB}} } \right|^2\leqslant {}4\det R_{_{\mathrm{HB}}}\), any eigenvalue \(\lambda _R\in \sigma (R_{_{\mathrm{HB}}})\) is a complex number and

$$\begin{aligned} {{\mathfrak {R}}}{{\mathfrak {e}}}(\lambda _R) {}=-1,\quad \left|{\lambda _R} \right| = {}\sqrt{\det R_{_{\mathrm{HB}}}}=\sqrt{\theta }. \end{aligned}$$

As \(1 = \lambda _{\min }(A/\mu )\leqslant \theta \leqslant \lambda _{\max }(A/\mu ) = \kappa (A),\) we conclude \(\kappa (G_{_\mathrm{HB}}) = \sqrt{\kappa (A)}\).

Applying the similarity transformation with \(P = \begin{pmatrix} 1&{}\; 0\\ 1&{}\; 1 \end{pmatrix},\) we observe that

$$\begin{aligned} R_{_{\mathrm{NAG}}} = PR_{_{\mathrm{HB}}}P^{-1} = \begin{pmatrix} -1 &{} 1\\ 1-\theta &{}\quad -1 \end{pmatrix}. \end{aligned}$$

So \(\sigma (R_{_{\mathrm{NAG}}} ) = \sigma (R_{_{\mathrm{HB}}})\) and consequently \(\kappa (G_{_{\mathrm{NAG}}}) = \sqrt{\kappa (A)}\). This completes the proof of this proposition. \(\square \)
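
The following small numerical check, with a randomly generated SPD test matrix A chosen purely for illustration, is consistent with Proposition 1: all eigenvalues of \(G_{_{\mathrm{HB}}}\) and \(G_{_{\mathrm{NAG}}}\) have real part \(-1\), and both condition numbers equal \(\sqrt{\kappa (A)}\).

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)                 # illustrative symmetric positive definite matrix
mu, L = np.linalg.eigvalsh(A)[[0, -1]]  # smallest and largest eigenvalues
I, Z = np.eye(5), np.zeros((5, 5))

G_hb  = np.block([[Z, I], [-A / mu, -2 * I]])
G_nag = np.block([[-I, I], [I - A / mu, -I]])

for G in (G_hb, G_nag):
    lam = np.linalg.eigvals(G)
    kappa_G = np.abs(lam).max() / np.abs(lam).min()
    print(np.allclose(lam.real, -1.0),            # Re(lambda) = -1 for all eigenvalues
          np.isclose(kappa_G, np.sqrt(L / mu)))   # kappa(G) = sqrt(kappa(A))
```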

We write \(y = (x, v)^{\top }\) and eliminate v in \(y'=Gy\) to get a second-order ODE of x, in which we replace Ax by the general form \(\nabla f(x)\). Both \(G_{_{\mathrm{HB}}}\) and \(G_{_{\mathrm{NAG}}} \) yield the same equation

$$\begin{aligned} \mu x'' + 2\mu x' + \nabla f(x) = 0, \end{aligned}$$
(40)

which is a special case of the HB model (cf. (21)).

Note that there are many transformations G from which corresponding ODE models can be derived. Indeed, given any G that meets our demand, both cG and \(QGQ^{-1}\) are acceptable candidates, where \(c>0\) and Q is some invertible matrix. We do not go further beyond the two transformations given in (39) for the strongly convex case \(\mu >0\), but aim to combine the transformation with a refined time scaling to propose another one for the convex case \(\mu =0\) in Sect. 2.5.

2.4 Acceleration from a Gauss–Seidel splitting

We now consider numerical discretization for (32) with \(G=G_{_{\mathrm{HB}}}\) and \(G_{_{\mathrm{NAG}}}\) given in (39). As discussed in Sect. 2.2, the implicit Euler method is unconditionally A-stable. But computing \((I - \alpha G)^{-1}\) needs significant effort and may not be practical.

One may hope that the explicit Euler method \(y_{k+1} = (I + \alpha G) y_k\) will be A-stable with step size \(\alpha = O(1/\kappa (G))= O(1/\sqrt{\kappa (A)})\). Unfortunately, unlike the discussion for (35) with \(G=-A\), where \(\sigma (I-\alpha A)\) lies on the real line and \(\rho (I-\alpha A)\) can be easily shrunk by choosing \(\alpha = 1/\rho (A)\) (cf. (36)), the general asymmetric G spreads the spectrum over the complex plane. For both \(G=G_{_{\mathrm{HB}}}\) and \(G=G_{_{\mathrm{NAG}}}\), we have \(\mathfrak {R}(\lambda ) = -1\) for all \(\lambda \in \sigma (G)\). Denote \(r = \rho (G)\). Then \(\rho ^2(I + \alpha G) = (1-\alpha )^2 + \alpha ^2 (r^2-1)\). To be A-stable, requiring \(\rho (I + \alpha G) < 1\) is equivalent to letting \(0< \alpha < 2/r^2 = O(1/\kappa (A))\), so a small step size \(\alpha = O(1/\kappa (A))\) is still needed. The optimal choice \(\alpha ^*=r^{-2}\) only gives

$$\begin{aligned} \rho (I+\alpha ^*G) = 1-\alpha ^* = 1-O(1/\kappa (A)), \end{aligned}$$

where no acceleration has been obtained.

We then expect that an explicit scheme closer to the implicit Euler method will hopefully have better stability with a larger step size.

Motivated by the Gauss–Seidel (GS) method [45] for computing \((I - \alpha G)^{-1}\), we consider the matrix splitting \(G = M + N\) with M being the lower triangular part of G (including the diagonal) and \(N = G - M\), and propose the following Gauss–Seidel splitting scheme

$$\begin{aligned} \frac{y_{k+1} - y_{k}}{\alpha } = M y_{k+1} + N y_{k} \end{aligned}$$
(41)

which gives the relation

$$\begin{aligned} y_{k+1} = E(\alpha , G) y_{k},\quad E(\alpha , G) :=(I - \alpha M)^{-1} (I + \alpha N). \end{aligned}$$
(42)

Note that for \(G=G_{_{\mathrm{HB}}}\) and \(G_{_{\mathrm{NAG}}}\), the scheme (41) is still explicit as the lower triangular block matrix \(I - \alpha M\) can be inverted easily, without involving \(A^{-1}\).

The spectral bound is given below; for the algebraic proof details, we refer to “Appendix A”.

Theorem 1

For \(G = G_{_{\mathrm{HB}}}\) or \(G_{_{\mathrm{NAG}}}\) given in (39), if \( 0< \alpha \leqslant 2/\sqrt{\kappa (A)}\), then the Gauss–Seidel splitting scheme (41) is A-stable and

$$\begin{aligned} \rho (E(\alpha ,G)) \leqslant \frac{1}{\sqrt{1+2\alpha }}. \end{aligned}$$
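
A quick numerical sanity check of this bound, again with an illustrative random SPD test matrix and with \(G=G_{_{\mathrm{NAG}}}\), M the lower block-triangular part of G and \(N=G-M\) as in (41), may read as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                      # illustrative SPD test matrix
mu, L = np.linalg.eigvalsh(A)[[0, -1]]
I, Z = np.eye(4), np.zeros((4, 4))

G = np.block([[-I, I], [I - A / mu, -I]])    # G_NAG from (39)
M = np.block([[-I, Z], [I - A / mu, -I]])    # lower block-triangular part (incl. diagonal)
N = G - M                                    # strictly upper block-triangular remainder

alpha = 2.0 / np.sqrt(L / mu)                # step size allowed by Theorem 1
E = np.linalg.solve(np.eye(8) - alpha * M, np.eye(8) + alpha * N)
rho = np.abs(np.linalg.eigvals(E)).max()
print(rho <= 1.0 / np.sqrt(1.0 + 2.0 * alpha))   # spectral bound of Theorem 1
```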

2.5 Dynamic time rescaling for the convex case

The ODE model (40) given in Sect. 2.3 cannot handle the case \(\mu =0\), where the previous spectral analysis fails. Equivalently, the condition number \(\kappa (A)\) is infinite and the spectral bound becomes 1. To overcome this, a careful rescaling is needed. Throughout this subsection, we assume \(\mu = 0\).

For the gradient flow

$$\begin{aligned} x'(t) = -\nabla f(x(t)), \end{aligned}$$
(43)

one can easily establish the sub-linear rate \(f(x(t))-f(x^*)\leqslant C/t\); see [37]. To recover the exponential rate, we introduce a time rescaling \(t(s) =e^{s}\) and let \(y(s)=x(t(s))\). Then (43) becomes the following rescaled gradient flow

$$\begin{aligned} \gamma (s) y'(s) = -\nabla f(y(s)), \end{aligned}$$
(44)

with the scaling factor \(\gamma (s)=e^{-s}\). Besides, the previous sublinear rate \(f(x(t))-f(x^*)\leqslant C/t\) turns into \(f(y(s))-f(x^*) \leqslant Ce^{-s}\). That is, at the continuous level, we can achieve exponential decay through a suitable rescaling of time even for \(\mu =0\).

Now let us go back to our model case \(f(x) = \frac{1}{2}x^{\top }Ax\) with \(\mu =0\) and \(\lambda _{\max }(A) = L\). Coupled with the transformation \(G_{_{\mathrm{NAG}}} \), we consider

$$\begin{aligned} y' = G(\gamma ) \, y,\quad G(\gamma ) = \begin{pmatrix} - I &{} \quad I \\ -A/\gamma &{} \quad O \end{pmatrix}, \end{aligned}$$
(45)

where \(y = (x, v)^{\top }\) and

$$\begin{aligned} \gamma ' = -\gamma , \quad \gamma (0)=\gamma _0>0. \end{aligned}$$
(46)

This gives a second-order ODE in terms of x:

$$\begin{aligned} \gamma x'' + \gamma x' + \nabla f(x)=0, \end{aligned}$$
(47)

which is of HB type but with variable damping coefficients.

Obviously, the implicit Euler method for solving (45) is still unconditionally A-stable. We now apply the GS splitting (41) to (45) and get

$$\begin{aligned} y_{k+1} = {}E(\alpha _{k}, G(\gamma _{k+1})) y_{k}, \end{aligned}$$
(48)

where \(E(\alpha _{k}, G(\gamma _{k+1}))\) is defined in (42). The equation (46) is discretized by

$$\begin{aligned} \gamma _{k+1}={}\gamma _{k}-\alpha _{k}\gamma _{k+1}. \end{aligned}$$
(49)

Eliminating \(v_{k}\) in (48) will give an HB method with variable coefficients

$$\begin{aligned} x_{k+1} = x_{k} -\frac{\alpha _k\alpha _{k-1}}{\gamma _k+\alpha _k\gamma _k}\nabla f(x_{k}) + \frac{\alpha _k}{\alpha _{k-1}+\alpha _k \alpha _{k-1}} (x_{k} - x_{k-1}). \end{aligned}$$

Instead of studying the spectral bound of \(E(\alpha _k,G(\gamma _{k+1}))\), which is 1, we apply the scaling technique to obtain a regularized matrix

$$\begin{aligned} {\widetilde{E}}_{k} = \begin{pmatrix} I &{} \; O \\ O &{} \; \gamma _{k+1} I \end{pmatrix} E(\alpha _{k}, G(\gamma _{k+1})) \begin{pmatrix} I &{} \; O \\ O &{} \; \gamma _{k} I \end{pmatrix}^{-1}, \end{aligned}$$

which is nearly similar to \(E(\alpha _k,G(\gamma _{k+1}))\). Set \(z_k=\begin{pmatrix} I &{} \; O \\ O &{} \; \gamma _{k} I \end{pmatrix} y_{k} \); then the discrete system (48) for \(\{y_k\}\) becomes

$$\begin{aligned} z_{k+1}={\widetilde{E}}_{k}z_k. \end{aligned}$$
(50)

With a carefully chosen step size, the spectral bound of \({\widetilde{E}}_{k}\) is given below; for the algebraic proof details, we refer to “Appendix A”. We note that the step size in Theorem 2 is chosen only to agree with the setting of Lemma B2; for the general choice \(L\alpha _k^2/\gamma _k=O(1)\) and a suitable initial value \(\gamma _0\), it is possible to maintain the spectral bound (51) together with the decay estimate (52).

Theorem 2

If \(\gamma _0=L\) and \(L\alpha _{k}^2 =\gamma _{k}(1+\alpha _k)\), then both the scheme (48) and its equivalent form (50) are A-stable and we have

$$\begin{aligned} \rho ({{\widetilde{E}}}_{k}) =\frac{\gamma _{k+1}}{\gamma _{k}} =\frac{1}{1+\alpha _{k}}, \end{aligned}$$
(51)

which further implies that

$$\begin{aligned} \prod _{i=0}^{k-1}\rho ({\widetilde{E}}_i) = \frac{\gamma _{k}}{\gamma _0} = O(k^{-2}). \end{aligned}$$
(52)
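
To make the step size rule of Theorem 2 concrete, the following short sketch (with an illustrative value of L) iterates \(L\alpha _{k}^2 =\gamma _{k}(1+\alpha _k)\) together with (49) and prints \(\gamma _k\), whose behavior is consistent with the \(O(k^{-2})\) decay in (52).

```python
import math

L = 100.0                          # illustrative Lipschitz constant
gamma = L                          # gamma_0 = L, as in Theorem 2
for k in range(1, 201):
    # positive root of L*alpha^2 - gamma*alpha - gamma = 0, i.e. L*alpha^2 = gamma*(1+alpha)
    alpha = (gamma + math.sqrt(gamma**2 + 4.0 * L * gamma)) / (2.0 * L)
    gamma = gamma / (1.0 + alpha)  # discretization (49): gamma_{k+1} = gamma_k / (1 + alpha_k)
    if k % 50 == 0:
        print(k, gamma, gamma * k**2)   # gamma_k * k^2 stays bounded, so gamma_k = O(k^{-2})
```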

2.6 Limitation of spectral analysis

For quadratic objective f, both the ODE models (40) and (47) are linear and the spectral bound of \(E(\alpha ,G)\) for the Gauss–Seidel splitting (42) has been derived. But as pointed out at the beginning, for A-stable methods, bounding the spectral radius by one is not sufficient for the norm convergence if the matrix \(E(\alpha ,G)\) is non-normal; see convincing examples in [23, Appendix D.2] and [23, Appendix D.4].

Moving beyond quadratic f to nonlinear ODE systems, transient growth or instability of perturbed problems can easily lead to nonlinear instabilities. In particular, for the HB system (21), it is shown in [22] that the parameters optimized for linear ODE models do not guarantee global convergence for a nonlinear system.

To provide rigorous convergence analysis at both the continuous and discrete levels, in the sequel we shall introduce the tool of Lyapunov functions. Following many related works [6, 37, 43], we first analyze suitable ODEs via a Lyapunov function, then construct optimization algorithms from numerical discretizations of the continuous models and use a discrete Lyapunov function to establish the convergence rates of the proposed algorithms.

3 Nesterov accelerated gradient flow

3.1 Continuous problem

In the previous section, we have obtained two ODE models for quadratic objective \(f(x) = \frac{1}{2}x^{\top }Ax\) with \(\mu > 0\) and \(\mu = 0\), respectively. To handle those two cases in a unified way, we combine \(G_{_{\mathrm{NAG}}}\) in (39) with \(G(\gamma )\) in (45) and consider the following transformation

$$\begin{aligned} G = \begin{pmatrix} - I &{} \quad I \\ \mu /\gamma -A/\gamma &{} \quad -\mu /\gamma \, I \end{pmatrix}, \end{aligned}$$
(53)

where

$$\begin{aligned} \gamma ' = \mu -\gamma ,\quad \gamma (0)=\gamma _0>0. \end{aligned}$$
(54)

One can solve the above equation and obtain

$$\begin{aligned} \gamma (t) = \mu + (\gamma _0-\mu ) {{\mathrm{e}}}^{-t}, \quad t\geqslant 0. \end{aligned}$$

Since \(\gamma _0>0\), we have \(\gamma (t)>0\) for all \(t\geqslant 0\), and \(\gamma (t)\) converges to \(\mu \) exponentially and monotonically as \(t \rightarrow +\infty \). In particular, if \(\gamma _0=\mu >0\), then \(\gamma (t)=\mu \). Therefore, when \(\mu =0\), (53) reduces to (45), and when \(\gamma _0=\mu >0\), (53) recovers \(G_{_{\mathrm{NAG}}}\) in (39). Correspondingly, the transformation (53) gives the system

$$\begin{aligned} \left\{ \begin{aligned} x'={}&v - x,\\ \gamma v'={}&\mu (x - v ) - Ax. \end{aligned} \right. \end{aligned}$$
(55)

Heuristically, for general \(f\in {\mathcal {S}}_\mu ^1\) with \(\mu \geqslant 0\), we just replace Ax in (55) with \(\nabla f(x)\) and obtain our NAG flow

$$\begin{aligned} \left\{ \begin{aligned} x'={}&v - x,\\ \gamma v'={}&\mu (x - v ) - \nabla f(x), \end{aligned} \right. \end{aligned}$$
(56)

with initial conditions \(x(0)=x_0\) and \(v(0)=v_0\). The equivalent second-order ODE (which will also be referred to as the NAG flow) reads as follows

$$\begin{aligned} \gamma x''+(\mu +\gamma )x'+\nabla f(x)=0, \end{aligned}$$
(57)

with initial conditions \(x(0)=x_0\) and \(x'(0)=v_0-x_0\). Clearly, if \(\gamma _0 = \mu >0\), then (57) becomes (40), and if \( \mu =0\), then (57) coincides with (47).

Motivated by (30), we introduce a Lyapunov function for (56):

$$\begin{aligned} {\mathcal {L}}(t):= f(x(t))-f(x^*)+\frac{\gamma (t)}{2} \left\Vert {v(t)-x^*} \right\Vert ^2,\quad t\geqslant 0. \end{aligned}$$
(58)

In addition, we need the following lemma, which is trivial but very useful for the convergence analysis at both the continuous and discrete levels.

Lemma 1

For any \(u,v,w\in V\), we have

$$\begin{aligned} 2(u-v,v-w) = \left\Vert {u-w} \right\Vert ^2-\left\Vert {u-v} \right\Vert ^2-\left\Vert {v-w} \right\Vert ^2. \end{aligned}$$

We first present the well-posedness of (57) and prove the exponential decay property of the Lyapunov function (58).

Lemma 2

If \(f\in {\mathcal {S}}_{\mu ,L}^{1,1}\) with \(\mu \geqslant 0\), then the NAG flow (57) admits a unique solution \(x\in C^2([0,\infty ); V)\) and moreover

$$\begin{aligned} {\mathcal {L}}'(t)\leqslant -{\mathcal {L}}(t) - \frac{\mu }{2} \Vert x'(t)\Vert ^{2}, \end{aligned}$$
(59)

which implies that

$$\begin{aligned} {\mathcal {L}}(t)+\frac{\mu }{2} \int _{0}^{t}e^{s-t}\left\Vert {x'(s)} \right\Vert ^2\mathrm{d }s \leqslant {{\mathrm{e}}}^{-t} {\mathcal {L}}(0),\quad t\geqslant 0. \end{aligned}$$
(60)

Proof

Since \(\nabla f\) is Lipschitz continuous, applying the standard existence and uniqueness results for ODEs (see [9, Theorem 4.1.4]) yields that the system (56) admits a unique classical solution \((x,v)\in C^1([0,\infty ); V)\times C^1([0,\infty ); V)\). This implies that \(x' =v-x\in C^1([0,\infty ); V)\), and therefore \(x\in C^2([0,\infty ); V)\) is also the unique solution to our NAG flow (57).

It remains to prove (59), which yields the exponential decay (60) immediately. A straightforward calculation yields that

$$\begin{aligned} {\mathcal {L}}'(t)={} \left\langle {\nabla f(x),x'} \right\rangle + \frac{\gamma '}{2}\left\Vert {v-x^*} \right\Vert ^2+ \gamma \left\langle {v', v-x^*} \right\rangle , \end{aligned}$$

and by (54) and (56), we replace \(\gamma '\) and \(v'\) by their right hand side terms and obtain

$$\begin{aligned} {\mathcal {L}}'(t)= \left\langle {\nabla f(x),x'} \right\rangle + \frac{ \mu -\gamma }{2}\left\Vert {v-x^*} \right\Vert ^2+ \left\langle {\mu ( x-v)- \nabla f(x),v-x^*} \right\rangle . \end{aligned}$$
(61)

Let us focus on the last term. Thanks to Lemma 1,

$$\begin{aligned} \mu (x-v, v-x^{*}) = \frac{\mu }{2}\left( \left\| x-x^{*}\right\| ^{2} -\Vert x-v\Vert ^{2}- \left\| v-x^{*}\right\| ^{2}\right) , \end{aligned}$$

and the gradient term is split as follows

$$\begin{aligned} -\left\langle {\nabla f(x), v-x^*} \right\rangle =- \left\langle { \nabla f(x), v-x} \right\rangle -\left\langle {\nabla f(x), x-x^*} \right\rangle . \end{aligned}$$
(62)

By the relation \(x' = v- x\), the first term in (62) becomes \(\left\langle {-\nabla f(x), x'} \right\rangle \) which cancels the first term in (61). Combining all identities together gives

$$\begin{aligned} {\mathcal {L}}'(t) ={} \frac{\mu }{2} \left\Vert {x-x^{*}} \right\Vert ^{2} -\left\langle {\nabla f(x), x-x^*} \right\rangle -\frac{ \gamma }{2}\left\Vert {v-x^*} \right\Vert ^2- \frac{\mu }{2} \Vert x'\Vert ^{2}. \end{aligned}$$
(63)

As f satisfies the \(\mu \)-convexity (2), there holds

$$\begin{aligned} \frac{\mu }{2} \left\Vert {x-x^{*}} \right\Vert ^{2} -\left\langle {\nabla f(x), x-x^*} \right\rangle \leqslant f(x^*) - f(x), \end{aligned}$$

and plugging this into (63) implies that

$$\begin{aligned} {\mathcal {L}}'(t)\leqslant -{\mathcal {L}}(t) - \frac{\mu }{2} \Vert x'(t)\Vert ^{2}, \end{aligned}$$

which proves (59) and thus completes the proof of this lemma. \(\square \)
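
As a purely numerical illustration of Lemma 2 (using a crude forward Euler time integrator with a small step and an arbitrary quadratic test objective, neither of which is part of the analysis), one may integrate (56) together with (54) and check that \({\mathcal {L}}(t)\) stays below \(e^{-t}{\mathcal {L}}(0)\).

```python
import numpy as np

A = np.diag([1.0, 10.0, 100.0])              # illustrative quadratic f(x) = 0.5 x^T A x
mu, x_star = 1.0, np.zeros(3)
x, v, gamma = np.ones(3), np.zeros(3), 5.0   # x(0), v(0), gamma_0 (arbitrary choices)
dt, t, T = 1e-4, 0.0, 5.0

def lyapunov(x, v, gamma):
    # L(t) = f(x) - f(x*) + gamma/2 * ||v - x*||^2, cf. (58)
    return 0.5 * x @ A @ x + 0.5 * gamma * np.dot(v - x_star, v - x_star)

L0 = lyapunov(x, v, gamma)
while t < T:
    dx, dv, dgamma = v - x, (mu * (x - v) - A @ x) / gamma, mu - gamma   # (56) and (54)
    x, v, gamma = x + dt * dx, v + dt * dv, gamma + dt * dgamma
    t += dt
print(lyapunov(x, v, gamma) <= np.exp(-t) * L0)   # exponential decay (60)
```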

Remark 1

According to the proof of Lemma 2, the Eq. (54) for \(\gamma \) can be relaxed to \(\gamma ' \leqslant \mu -\gamma \). This turns (61) and (63) into inequalities but leaves the final estimate (59) unchanged. \(\square \)

3.2 Rescaling property

Based on our NAG flow (56) (or (57)), it is possible to use a time scaling technique to construct more ODE systems with any desired convergence rate. It is worth clarifying the connection to and the difference from existing dynamical models.

Specifically, let \(\alpha \) be any continuous nonnegative function on \({\mathbb {R}}_+\), and consider the time rescaling

$$\begin{aligned} t(\tau )=\int _{0}^{\tau }\alpha (s)\,{\mathrm{ds}}, \quad \tau >0. \end{aligned}$$
(64)

Set \(y(\tau ) = x(t(\tau ))\), \(w(\tau ) = v(t(\tau ))\) and \(\beta (\tau ) = \gamma (t(\tau ))\); then it is clear that

$$\begin{aligned} y'(\tau ) = t'(\tau )x'(t(\tau )) = \alpha (\tau )x'(t(\tau )). \end{aligned}$$

Similarly, \(w'(\tau ) = \alpha (\tau ) v'(t(\tau ))\), and plugging these facts into (56) gives the scaled NAG flow

$$\begin{aligned} \left\{ \begin{aligned} {}&y' =\alpha (w - y ),\\ {}&\beta w' = \mu \alpha \big (y - w \big ) - \alpha \nabla f(y ), \end{aligned} \right. \end{aligned}$$
(65)

with initial conditions \(y(0)=x_0\) and \(y'(0)=\alpha (0)x'(0)\). By Remark 1, the Eq. (54) can be replaced by \(\gamma '\leqslant \mu -\gamma \), which becomes

$$\begin{aligned} \beta '\leqslant \alpha (\mu -\beta ), \quad \beta (0)=\gamma _0. \end{aligned}$$
(66)

Correspondingly, the Lyapunov function (58) reads as follows

$$\begin{aligned} \widetilde{{\mathcal {L}}}(\tau ):= f(y(\tau ))-f(x^*)+\frac{\beta (\tau )}{2} \left\Vert {w(\tau )-x^*} \right\Vert ^2,\quad \tau \geqslant 0. \end{aligned}$$

Analogously to (59), we can prove

$$\begin{aligned} \widetilde{{\mathcal {L}}}'\leqslant -\alpha \widetilde{{\mathcal {L}}} - \frac{\mu \alpha }{2} \Vert w-y\Vert ^{2}, \end{aligned}$$

which implies that

$$\begin{aligned} \widetilde{{\mathcal {L}}}(\tau ) \leqslant {{\mathrm{e}}}^{-\int _0^\tau \alpha (s)\,\mathrm{d} s} \widetilde{{\mathcal {L}}}(0), \quad \tau \geqslant 0. \end{aligned}$$
(67)

Therefore, a larger scaling factor \(\alpha \) yields a faster decay rate.

We note that the scaled NAG flow (65) is very close to the two models (27) and (29), which were derived in [43] and [44], respectively, via the variational perspective. Indeed, they differ mainly in the coefficient of \(w'\). By (66), an elementary calculation gives

$$\begin{aligned} \beta (\tau ) \leqslant \mu + (\gamma _0-\mu ) {{\mathrm{e}}}^{-\int _0^\tau \alpha (s)\,\mathrm{d} s}, \quad \tau \geqslant 0. \end{aligned}$$

Therefore, (65) chooses the variable coefficient \(\beta (\tau )\) for \(\mu \geqslant 0\), while (27) considers the dynamically changing coefficient (26) only for \(\mu =0\) and (29) adopts the fixed parameter \(\mu >0\). In the strongly convex case \(\mu >0\), if we take \(\beta =\mu \), which satisfies (66), then the scaled system (65) coincides with  (29). In the convex case \(\mu =0\), if both (26) and (66) hold with equality, then (65) agrees with (27). Hence, we conclude that our NAG flow system is tighter and provides a unified way to handle \(\mu =0\) and \(\mu >0\).

Now, let us look at a concrete rescaling example. Let the scaling factor \(\alpha \) satisfy

$$\begin{aligned} 2\alpha ' \leqslant \mu -\alpha ^2,\quad \alpha (0)=\sqrt{\gamma _0}. \end{aligned}$$
(68)

For instance, the following choice is allowed:

$$\begin{aligned} \alpha (\tau )=\frac{\sqrt{\gamma _0}\ b}{\sqrt{\gamma _0}\, \tau +b},\quad 0<b\leqslant 2. \end{aligned}$$
(69)

For the equality case of (68), we have a closed-form solution

$$\begin{aligned} \alpha (\tau ) = \left\{ \begin{aligned}&\frac{2\sqrt{\gamma _0}}{\sqrt{\gamma _0}\, \tau +2},&\text { if }\mu =0,\\&\sqrt{\mu }\cdot \frac{e^{\sqrt{\mu } \, \tau }-\alpha _\mu }{e^{\sqrt{\mu } \, \tau }+\alpha _\mu },&\text { if }\mu >0, \end{aligned} \right. \end{aligned}$$
(70)

where

$$\begin{aligned} \alpha _\mu = \frac{\sqrt{\mu }-\sqrt{\gamma _0}}{\sqrt{\mu }+\sqrt{\gamma _0}}\in (-1,1). \end{aligned}$$

We now set \(\beta = \alpha ^2\), which fulfills (66) by our assumption (68); then the scaled NAG flow (65) gives a new HB system

$$\begin{aligned} y'' + \frac{1}{\alpha } \left( \mu +\alpha ^2-\alpha '\right) y'+\nabla f(y)=0. \end{aligned}$$
(71)

According to (67), we have the estimate

$$\begin{aligned} \widetilde{{\mathcal {L}}}(\tau ) \leqslant \left\{ \begin{aligned}&\frac{b^b\widetilde{{\mathcal {L}}}(0)}{(\sqrt{\gamma _0}\tau +b)^b},&\text { if } \alpha \text { satisfies (69)},\\&\frac{(1+\alpha _\mu )^2\widetilde{{\mathcal {L}}}(0)}{ \left( e^{\sqrt{\mu } \tau /2}+\alpha _\mu e^{-\sqrt{\mu } \tau /2} \right) ^2},&\text { if } \alpha \text { satisfies (70) and } \mu >0. \end{aligned} \right. \end{aligned}$$

In particular, if \(\mu >0\) and \(\alpha \) satisfies (70) with \(\gamma _0=\mu \), then \(\alpha (\tau )=\sqrt{\mu }\) and (71) recovers (31) with the same rate \(O(e^{-\sqrt{\mu }\tau })\). Moreover, if \(\mu =0\) and \(\alpha \) satisfies (69) with \(\gamma _0=4\) and \(b=2\), then \(\alpha (\tau )=2/(\tau +1)\) and (71) becomes

$$\begin{aligned} y'' + \frac{3}{\tau +1}y'+\nabla f(y)=0,\quad \tau >0, \end{aligned}$$

which gives the decay rate \(O(\tau ^{-2})\) and coincides with the prevailing ODE model (22) derived in [37].

4 An implicit scheme

Exponential decay of an implicit discretization for solving (56) can be established, which is more or less straightforward since one can easily follow the proof for the continuous problem. However, the implicit scheme requires an efficient solver or proximal calculation and may not always be practical. It is presented here to bridge the analysis from the continuous level to semi-implicit and explicit schemes.

Consider the following implicit scheme

$$\begin{aligned} \left\{ \begin{aligned} \frac{x_{k+1}-x_{k}}{\alpha _k}={}&v_{k+1}-x_{k+1},\\ \frac{v_{k+1}-v_{k}}{\alpha _k}={}&\frac{\mu }{\gamma _k} (x_{k+1}-v_{k+1}) -\frac{1}{\gamma _k}\nabla f(x_{k+1}), \end{aligned} \right. \end{aligned}$$
(72)

where \(\alpha _k>0\) denotes the time step size to discretize the time derivative and the parameter Eq. (54) is also discretized implicitly

$$\begin{aligned} \frac{ \gamma _{k+1} - \gamma _{k}}{\alpha _k} =\mu -\gamma _{k+1}, \quad \gamma _0>0. \end{aligned}$$
(73)

We shall present the convergence result for the implicit scheme (72)–(73). To do so, we introduce a suitable Lyapunov function

$$\begin{aligned} {\mathcal {L}}_k:={} f(x_k)-f(x^*) + \frac{\gamma _k}{2} \left\Vert {v_k-x^*} \right\Vert ^2, \end{aligned}$$
(74)

which is clearly a discrete analogue to the continuous one (58).

Theorem 3

If \( f\in {\mathcal {S}}_{\mu }^{1}\) with \(\mu \geqslant 0\), then for the scheme (72) with \(\alpha _k>0\), we have

$$\begin{aligned} {\mathcal {L}}_{k+1}\leqslant \frac{ {\mathcal {L}}_k}{1+\alpha _k }, \quad k\in {\mathbb {N}}. \end{aligned}$$

Proof

It suffices to prove

$$\begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k}\leqslant -\alpha _k {\mathcal {L}}_{k+1}. \end{aligned}$$
(75)

Let us mimic the proof of Lemma 2. Instead of the derivative, we compute the difference as follows

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} =&f(x_{k+1})-f(x_k) + \frac{\gamma _{k+1}-\gamma _k}{2} \left\Vert {v_{k+1}-x^*} \right\Vert ^2\\&+\frac{\gamma _k}{2} \left( \left\Vert {v_{k+1}-x^*} \right\Vert ^2-\left\Vert {v_{k}-x^*} \right\Vert ^2 \right) \\ =&f(x_{k+1})-f(x_k) + \frac{\alpha _k}{2}(\mu - \gamma _{k+1}) \left\Vert {v_{k+1}-x^*} \right\Vert ^2\\&+\gamma _k \left( {v_{k+1}-v_k, (v_{k+1}+v_{k})/2 -x^*} \right) . \end{aligned} \end{aligned}$$

Analogously to the continuous level, we focus on the last term

$$\begin{aligned} \begin{aligned}&\gamma _k \left( {v_{k+1}-v_k, (v_{k+1}+v_{k})/2 -x^*} \right) \\&\quad = \gamma _k \left( {v_{k+1}-v_k, v_{k+1} -x^*} \right) - \frac{\gamma _k}{2} \left\Vert {v_{k+1}-v_k} \right\Vert ^2. \end{aligned} \end{aligned}$$

By (72), it follows that

$$\begin{aligned} \begin{aligned}&\gamma _k\left( {v_{k+1}-v_k, v_{k+1} -x^*} \right) \\&\quad = \mu \alpha _k\left( {x_{k+1} - v_{k+1}, v_{k+1} - x^{*}} \right) - \alpha _k \left\langle { \nabla f(x_{k+1}), v_{k+1} - x^{*}} \right\rangle , \end{aligned} \end{aligned}$$

and we use Lemma 1 to split the cross term into squares:

$$\begin{aligned} \begin{aligned}&2\left( {x_{k+1} - v_{k+1}, v_{k+1} - x^{*}} \right) \\&\quad =\left\| x_{k+1}-x^{*}\right\| ^{2}-\Vert x_{k+1}-v_{k+1}\Vert ^{2} -\left\| v_{k+1}-x^{*}\right\| ^{2}. \end{aligned} \end{aligned}$$

For the gradient term, we have \(v_{k+1}-x^{*} = v_{k+1}-x_{k+1} + x_{k+1} - x^{*}\) and use (72) to obtain

$$\begin{aligned}&-\alpha _k \left\langle { \nabla f(x_{k+1}), v_{k+1} - x^{*}} \right\rangle \\&\quad = - \left\langle { \nabla f(x_{k+1}), x_{k+1} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla f(x_{k+1}), x_{k+1} - x^{*}} \right\rangle . \end{aligned}$$

Consequently, using the \(\mu \)-convexity (2) of f and dropping the remaining negative square terms, we see

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} \leqslant {}&-\alpha _k \mathcal L_{k+1}. \end{aligned} \end{aligned}$$

This proves (75) and concludes the proof of this theorem. \(\square \)

We observe from Theorem 3 that the fully implicit scheme (72) achieves a linear convergence rate as long as \(\alpha _k\geqslant \alpha >0\) for all \(k>0\), and a larger \(\alpha _k\) yields a faster convergence rate. We also mention that (72) can be rewritten as

$$\begin{aligned} \left\{ \begin{aligned} x_{k+1} ={}&\mathbf{prox}_{\eta _k f}(y_k),\\ v_{k+1}={}&x_{k+1}+ \frac{x_{k+1}-x_k}{\alpha _k}, \end{aligned} \right. \end{aligned}$$
(76)

where the proximal operator \(\mathbf{prox}_{\eta _k f}\) has been introduced in (8) and

$$\begin{aligned} \gamma _{k+1} = {}\frac{\gamma _k + \mu \alpha _k}{1+\alpha _k}, \, \eta _k = {}\frac{\alpha _k^2}{\gamma _k+(\mu +\gamma _k)\alpha _k}, \, y_k ={} \frac{\gamma _k\alpha _kv_k+(\gamma _k+\mu \alpha _k)x_k}{\gamma _k+(\mu +\gamma _k)\alpha _k}. \end{aligned}$$

Therefore, the scheme allows f to be nonsmooth, and we claim that Theorem 3 still holds true in this case. One just replaces the gradient \(\nabla f(x_{k+1})\) with the subgradient \((y_k-x_{k+1})/\eta _k\in \partial f(x_{k+1})\); see (105) and (112).
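
For the quadratic model \(f(x)=\frac{1}{2}x^{\top }Ax\), where the proximal map again has the closed form \((I+\eta A)^{-1}\), a sketch of the proximal reformulation (76) with the parameters listed above may look as follows; the matrix and all parameter values are illustrative only.

```python
import numpy as np

def accelerated_ppa_quadratic(A, x0, v0, mu=0.0, gamma0=1.0, alpha=1.0, max_iter=100):
    """Implicit scheme (72) in its proximal form (76) for f(x) = 0.5 x^T A x."""
    n = len(x0)
    x, v, gamma = np.array(x0, float), np.array(v0, float), gamma0
    for _ in range(max_iter):
        eta = alpha**2 / (gamma + (mu + gamma) * alpha)
        y = (gamma * alpha * v + (gamma + mu * alpha) * x) / (gamma + (mu + gamma) * alpha)
        x_new = np.linalg.solve(np.eye(n) + eta * A, y)    # prox_{eta f}(y) for quadratic f
        v = x_new + (x_new - x) / alpha                    # second line of (76)
        gamma = (gamma + mu * alpha) / (1 + alpha)         # implicit update (73)
        x = x_new
    return x

x = accelerated_ppa_quadratic(np.diag([0.0, 1.0, 100.0]), np.ones(3), np.zeros(3))
```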

In the convex case, i.e., \(\mu =0\), our method (76) is very close to Güler’s proximal point algorithm [20]

$$\begin{aligned} \left\{ \begin{aligned} x_{k+1} ={}&\mathbf{prox}_{\eta _k f}(y_k),\quad \eta _k =\alpha _k^2/\gamma _{k+1},\\ v_{k+1}={}&x_{k}+ \frac{x_{k+1}-x_k}{\alpha _k}, \end{aligned} \right. \end{aligned}$$

where \(\gamma _{k+1}-\gamma _{k}=-\alpha _{k}\gamma _k\) and \( y_k ={} \alpha _kv_k+(1-\alpha _k)x_k\). Indeed, with suitable step sizes, they share a similar rate; see [20, Theorem 2.3] and Theorem 4 below.

Theorem 4

If f is proper, closed and convex and we choose \(\alpha _k^2=\eta _k\gamma _k(1+\alpha _k)\) with \(\eta _k>0\), then for the proximal point algorithm (76) with \(\mu =0\), we have

$$\begin{aligned} \frac{{\mathcal {L}}_{0}}{(1+\sum _{i=0}^{k-1}\sqrt{\gamma _0\eta _i})^2} \leqslant {\mathcal {L}}_{k}\leqslant {} \frac{4\mathcal L_{0}}{(2+\sum _{i=0}^{k-1}\sqrt{\gamma _0\eta _i})^2}, \end{aligned}$$
(77)

which means if \(\sum _{k=0}^{\infty }\sqrt{\eta _k}=\infty \) then \({\mathcal {L}}_k\rightarrow 0\) as \(k\rightarrow \infty \). Moreover, it holds that

$$\begin{aligned} {\mathcal {L}}_{k}\leqslant {} \frac{4}{\big (\sum _{i=0}^{k-1}\sqrt{\eta _i}\big )^2} \left( \frac{1}{\gamma _0}\left( f(x_0)-f(x^*)\right) +\frac{1}{2}\left\Vert {v_0-x^*} \right\Vert ^2 \right) . \end{aligned}$$
(78)

Proof

For convenience and later use, define a sequence \(\{\rho _{k}\}\) by

$$\begin{aligned} \rho _0=1,\quad \rho _k:= \prod _{i=0}^{k-1}\frac{1}{1+\alpha _i},\quad k\geqslant 1. \end{aligned}$$
(79)

As mentioned above, Theorem 3 holds true for such a nonsmooth f and thus it is evident that \(\mathcal L_k\leqslant \rho _k{\mathcal {L}}_0\). Invoking Lemma B2 proves (77) and it is trivial to obtain (78) from (77). This finishes the proof. \(\square \)

Remark 2

Note that the sequence \(\{\gamma _k\}\) in (73) is bounded: \(0<\gamma _{k}\leqslant \max \{\mu ,\gamma _{0}\}\) and \(\gamma _k\rightarrow \mu \) as \(k\rightarrow \infty \). Hence, even for large \(\gamma _0\), the Lyapunov function \({\mathcal {L}}_k\) is asymptotically bounded as \(k\rightarrow \infty \). In addition, from (77) and (78), we see that, for small \(\gamma _0\), the convergence rate depends on \(\gamma _0\) but large \(\gamma _0\) does not pollute the final rate. This fact also holds true for all the forthcoming convergence bounds. \(\square \)

5 Gauss–Seidel splitting with corrections

This section considers the Gauss–Seidel splitting (41), which is a semi-implicit discretization. In Sect. 2.4, we established the spectral bound \(O(1-\sqrt{\mu /L})\) with step size \(\alpha _k=O(\sqrt{\mu /L})\) for quadratic objectives. However, as we summarized in Sect. 2.6, spectral analysis is not sufficient for (norm) convergence.

Indeed, in the sequel, we further show that, for the discrete Lyapunov function (74), with any step size \(\alpha _k>0\), the naive discretization (41), reformulated as (80), does not lead to a contraction property like (75). This motivates us to add suitable correction steps.

5.1 The Gauss–Seidel splitting

Recall the Gauss–Seidel splitting (41): given step size \(\alpha _k>0\) and previous result \((x_k,v_k)\), compute \((x_{k+1},v_{k+1})\) from

$$\begin{aligned} \left\{ \begin{aligned} \frac{x_{k+1}-x_{k}}{\alpha _k}={}&v_{k}-x_{k+1},\\ \frac{v_{k+1}-v_{k}}{\alpha _k}={}&\frac{\mu }{\gamma _k}(x_{k+1}-v_{k+1}) -\frac{1}{\gamma _k}\nabla f(x_{k+1}). \end{aligned} \right. \end{aligned}$$
(80)

In addition, the parameter Eq. (54) of \(\gamma \) is still discretized implicitly via (73).

Lemma 3

If \(f\in {\mathcal {S}}_{\mu }^{1}\) with \(\mu \geqslant 0\), then for (80) with any step size \(\alpha _k>0\), we have

$$\begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} \leqslant -\alpha _k {\mathcal {L}}_{k+1} - \frac{\gamma _k}{2}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 -\alpha _k\left\langle {\nabla f(x_{k+1}),v_{k+1} - v_k} \right\rangle , \end{aligned}$$
(81)

and

$$\begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} \leqslant -\alpha _k {\mathcal {L}}_{k+1} + \frac{\alpha _k^2}{2\gamma _k} \left\Vert {\nabla f(x_{k+1})} \right\Vert _{*}^2. \end{aligned}$$
(82)

Proof

Following the proof of Theorem 3, we start from the difference

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} =&f(x_{k+1})-f(x_k)- \frac{\alpha _k \gamma _{k+1}}{2} \left\Vert {v_{k+1}-x^*} \right\Vert ^2\\&-\frac{\mu \alpha _k}{2} \left\Vert {x_{k+1}-v_{k+1}} \right\Vert ^2- \frac{\gamma _k}{2} \left\Vert {v_{k+1}-v_k} \right\Vert ^2\\&+\frac{\mu \alpha _k}{2} \left\Vert {x_{k+1}-x^*} \right\Vert ^2 -\alpha _k \left\langle { \nabla f(x_{k+1}), v_{k+1} - x^{*}} \right\rangle . \end{aligned} \end{aligned}$$

Using the update for \(x_{k+1}\) in (80), we split the gradient term as below

$$\begin{aligned}&-\alpha _k \left\langle { \nabla f(x_{k+1}), v_{k+1} - x^{*}} \right\rangle \\&\quad = -\alpha _k\left\langle {\nabla f(x_{k+1}),v_{k+1} - v_k} \right\rangle - \left\langle {\nabla f(x_{k+1}),\alpha _k(v_k-x_{k+1})} \right\rangle \\&\qquad -\alpha _k\left\langle { \nabla f(x_{k+1}), x_{k+1} - x^{*}} \right\rangle \\&\quad =-\alpha _k\left\langle {\nabla f(x_{k+1}),v_{k+1} - v_k} \right\rangle - \left\langle { \nabla f(x_{k+1}), x_{k+1} - x_{k}} \right\rangle \\&\qquad - \alpha _k\left\langle { \nabla f(x_{k+1}), x_{k+1} - x^{*}} \right\rangle . \end{aligned}$$

As \(f\in {\mathcal {S}}_{\mu }^{1}\), we obtain that

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} \leqslant&-\alpha _k \mathcal L_{k+1} - \frac{\gamma _k}{2}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 -\alpha _k\left\langle {\nabla f(x_{k+1}),v_{k+1} - v_k} \right\rangle \\&-\frac{\mu \alpha _k}{2}\left\Vert {x_{k+1}-v_{k+1}} \right\Vert ^2 -\frac{\mu }{2}\left\Vert {x_{k+1}-x_k} \right\Vert ^2. \end{aligned} \end{aligned}$$

Ignoring all the negative terms of the second line, the above estimate implies (81).

As we can see, in contrast to (75), the estimate (81) contains a negative term combined with a cross term. An application of the Cauchy–Schwarz inequality yields

$$\begin{aligned} - \frac{\gamma _k}{2}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 -\alpha _k\left\langle {\nabla f(x_{k+1}),v_{k+1} - v_k} \right\rangle \leqslant \frac{\alpha _k^2}{2\gamma _k}\left\Vert {\nabla f(x_{k+1})} \right\Vert _{*}^2. \end{aligned}$$

This proves the alternative bound (82), which involves only a nonnegative gradient norm term. \(\square \)

5.2 A predictor–corrector method

To handle the cross term \( -\alpha _k\left\langle {\nabla f(x_{k+1}),v_{k+1} - v_k} \right\rangle \) in (81), we add an extra extrapolation step to (80), which can be thought of as a semi-implicit discretization of \(x'= v - x\) using the newest update \(v_{k+1}\). More precisely, consider

$$\begin{aligned} \left\{ \begin{aligned} \frac{y_{k}-x_{k}}{\alpha _k}={}&v_{k}-y_{k},\\ \frac{v_{k+1}-v_{k}}{\alpha _k}={}&\frac{\mu }{\gamma _k}(y_{k}-v_{k+1}) -\frac{1}{\gamma _k}\nabla f(y_{k}),\\ \frac{x_{k+1}-x_{k}}{\alpha _k}={}&v_{k+1}-x_{k+1}. \end{aligned} \right. \end{aligned}$$
(83)

This is in line with the spirit of predictor–corrector methods for ODE solvers [39, Section 3.8]. The variable \(y_k\) is the predictor produced by an explicit scheme, and \(x_{k+1}\) is the corrector produced by an implicit scheme. It can also be thought of as a symmetric Gauss–Seidel iteration for approximating the implicit Euler method. Again, the parameter equation (54) for \(\gamma \) is still discretized via (73).
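For concreteness, here is a minimal Python sketch of one predictor–corrector step (83), written with the updates resolved explicitly; the function name `nag_pc_step`, the gradient oracle `grad_f`, and the assumption that (73) is the implicit update \(\gamma _{k+1}=(\gamma _k+\mu \alpha _k)/(1+\alpha _k)\) (cf. the last line of (115)) are our own conventions, not the authors' code.

```python
def nag_pc_step(x, v, gamma, alpha, mu, grad_f):
    """One predictor-corrector step of (83); a sketch only."""
    # predictor: explicit step for x' = v - x
    y = (x + alpha * v) / (1 + alpha)
    # v-update driven by the gradient at the predictor y
    v_new = (gamma * v + mu * alpha * y - alpha * grad_f(y)) / (gamma + mu * alpha)
    # corrector: semi-implicit step for x' = v - x with the newest v
    x_new = (x + alpha * v_new) / (1 + alpha)
    # implicit update (73) of gamma: (gamma_new - gamma) / alpha = mu - gamma_new
    gamma_new = (gamma + mu * alpha) / (1 + alpha)
    return x_new, v_new, gamma_new
```

With the step size of Theorem 5 below, \(\alpha _k\) solves \(L\alpha _k^2=\gamma _k(1+\alpha _k)\), i.e., \(\alpha _k=\big (\gamma _k+\sqrt{\gamma _k^2+4L\gamma _k}\big )/(2L)\).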

As the first two steps of (83) agree with (80) with \(x_{k+1}\) replaced by \(y_k\), recalling the estimate (81), we have

$$\begin{aligned} \widehat{ {\mathcal {L}}}_{k} -{\mathcal {L}}_{k} \leqslant -\alpha _k \widehat{{\mathcal {L}}}_{k} - \frac{\gamma _k}{2}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 -\alpha _k\left\langle {\nabla f(y_{k}),v_{k+1} - v_k} \right\rangle , \end{aligned}$$

where

$$\begin{aligned} \widehat{ {\mathcal {L}}}_k := f(y_k)-f(x^*)+\frac{\gamma _{k+1}}{2}\left\Vert {v_{k+1}-x^*} \right\Vert ^2. \end{aligned}$$
(84)

Therefore, it follows that

$$\begin{aligned} \widehat{ {\mathcal {L}}}_{k} \leqslant \frac{\mathcal L_k}{1+\alpha _k} - \frac{\gamma _k}{2(1+\alpha _k)}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 -\frac{\alpha _k}{1+\alpha _k}\left\langle {\nabla f(y_{k}),v_{k+1} - v_k} \right\rangle . \end{aligned}$$

From the updates for \(y_k\) and \(x_{k+1}\) in (83), both of which are convex combinations of \(x_k\) with \(v_k\) and \(v_{k+1}\) respectively, we find the relation

$$\begin{aligned} x_{k+1}-y_k = \frac{\alpha _k}{1+\alpha _k}(v_{k+1}-v_k), \end{aligned}$$

and if \(f\in {\mathcal {S}}^{1,1}_{\mu ,L}\), then we have the estimate (cf. (4))

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}- \widehat{ {\mathcal {L}}}_k&= f(x_{k+1})-f(y_k)\\&\leqslant \left\langle {\nabla f(y_k),x_{k+1}-y_k} \right\rangle +\frac{L}{2}\left\Vert {x_{k+1}-y_k} \right\Vert ^2\\&=\frac{\alpha _k}{1+\alpha _k}\left\langle {\nabla f(y_{k}),v_{k+1}-v_k} \right\rangle \\&\quad +\frac{L\alpha _k^2}{2(1+\alpha _k)^2}\left\Vert {v_{k+1}-v_k} \right\Vert ^2. \end{aligned} \end{aligned}$$

As a result, we obtain

$$\begin{aligned} {\mathcal {L}}_{k+1}\leqslant \frac{{\mathcal {L}}_k}{1+\alpha _k} +\left( \frac{L\alpha _k^2}{2(1+\alpha _k)^2}-\frac{\gamma _k}{2(1+\alpha _k)} \right) \left\Vert {v_{k+1}-v_k} \right\Vert ^2. \end{aligned}$$
(85)

The second term vanishes if we choose a suitable step size; see the theorem below.

Theorem 5

Assume that \(f\in \mathcal {S}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \) and that \(L\alpha _k^2 =\gamma _k(1+\alpha _k)\). Then for the predictor–corrector scheme (83) together with (73), we have

$$\begin{aligned} {\mathcal {L}}_{k+1} \leqslant \frac{ {\mathcal {L}}_k}{1+\alpha _k }, \quad k\in {\mathbb {N}}, \end{aligned}$$
(86)

where \({\mathcal {L}}_k\) is defined by (74). Consequently, for all \(k\geqslant 0\),

$$\begin{aligned} {\mathcal {L}}_{k}\leqslant {} {\mathcal {L}}_{0}\times \min \left\{ \frac{4L}{(\sqrt{\gamma _0}\, k+2\sqrt{L})^2},\, \left( 1+\sqrt{\frac{\min \{\gamma _0,\mu \}}{L}}\right) ^{-k} \right\} , \end{aligned}$$
(87)

and moreover, for all \(k\geqslant 1\),

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k}\leqslant {}&C_{\gamma _0,L}\times \min \left\{ \frac{4}{k^2},\, \left( 1+\sqrt{\frac{\min \{\gamma _0,\mu \}}{L}}\right) ^{1-k} \right\} , \end{aligned} \end{aligned}$$
(88)

where

$$\begin{aligned} C_{\gamma _0,L}:= \frac{L}{\gamma _0} \big ( f(x_0)-f(x^*)\big )+\frac{L}{2} \left\Vert {v_0-x^*} \right\Vert ^2. \end{aligned}$$
(89)

Proof

The inequality (85) suggests the choice \(L\alpha _k^2 =\gamma _k(1+\alpha _k)\) and yields (86). Recalling the sequence \(\{\rho _{k}\}\) defined by (79), we have \( {\mathcal {L}}_k\leqslant \rho _k{\mathcal {L}}_0\). Hence, using Lemma B2 gives the decay estimate of \(\rho _k\) and proves (87).

It remains to check (88) for all \(k\geqslant 1\).

From Lemma B2 we easily get

$$\begin{aligned} \rho _k{\mathcal {L}}_0 \leqslant \left( f(x_0)-f(x^*)+\frac{\gamma _0}{2} \left\Vert {v_0-x^*} \right\Vert ^2 \right) \times \frac{4L}{(\sqrt{\gamma _0}\, k+2\sqrt{L})^2} \leqslant \frac{4C_{\gamma _{0},L}}{k^2}. \end{aligned}$$
(90)

On the other hand, by the relation \(L\alpha _0^2=\gamma _0(1+\alpha _0)\), it is evident that

$$\begin{aligned} \alpha _0 = \frac{1}{2L} \left( \gamma _0+\sqrt{4\gamma _0L+\gamma _0^2}\right) , \end{aligned}$$

which implies

$$\begin{aligned} \frac{1}{1+\alpha _0}= \frac{2L}{\gamma _0+2L+\sqrt{4\gamma _0L+\gamma _0^2}} \leqslant \frac{L}{\gamma _0}. \end{aligned}$$

The above estimate also indicates that

$$\begin{aligned} \rho _k{\mathcal {L}}_0 = \frac{{\mathcal {L}}_0}{1+\alpha _0} \frac{\rho _k}{\rho _1} \leqslant C_{\gamma _{0},L}\frac{\rho _k}{\rho _1} =C_{\gamma _{0},L} \times \prod _{i=1}^{k-1}\frac{1}{1+\alpha _i}. \end{aligned}$$

Applying Lemma B2 shows \(\alpha _{k}\geqslant \sqrt{\min \{\gamma _0,\mu \}/L}\) and it follows that

$$\begin{aligned} \rho _k{\mathcal {L}}_0 \leqslant C_{\gamma _{0},L} \times \left( 1+\sqrt{\min \{\gamma _0,\mu \}/L}\right) ^{1-k}. \end{aligned}$$

Collecting this estimate and (90) establishes the final rate (88) and thus completes the proof of this theorem. \(\square \)

Remark 3

We mention that the estimate (88) verifies the claim made previously in Remark 2: the convergence rate given in Theorem 5 depends on \(\gamma _0\) when \(\gamma _0\) is small but is robust when \(\gamma _0\geqslant L\). \(\square \)

5.3 Correction via a gradient step

Motivated by the estimate (82), we can also aim to cancel the squared gradient norm. One preferable choice is a gradient descent step; according to the discussion below, any other correction step satisfying the decay property (94) is also acceptable. Note that the two numerical schemes proposed in [38, 44] for the HB equation (31) also involve additional gradient steps.

As before, we replace \(x_{k+1}\) by \(y_k\) in (80) and consider the following corrected scheme: given \(\alpha _k>0\) and \((x_k,v_k)\), compute \((x_{k+1},v_{k+1})\) from

$$\begin{aligned} \left\{ \begin{aligned} \frac{y_{k}-x_{k}}{\alpha _k}={}&v_{k}-y_{k},\\ \frac{v_{k+1}-v_{k}}{\alpha _k}={}&\frac{\mu }{\gamma _k}(y_{k}-v_{k+1}) -\frac{1}{\gamma _k}\nabla f(y_{k}),\\ x_{k+1}-y_k={}&-\frac{1}{L}\nabla f(y_k). \end{aligned} \right. \end{aligned}$$
(91)

The implicit discretization (73) of the parameter equation (54) remains unchanged here. From the first equation, \(y_k\) can be solved in terms of the known data \((x_k, v_k)\). After that, the gradient \(\nabla f(y_k)\) is evaluated once and used to update both \(v_{k+1}\) and \(x_{k+1}\).
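Compared with the sketch given after (83), only the step-size rule and the last update change; a hedged sketch of these two ingredients of (91) (the names are ours):

```python
import math

def step_size(gamma, L):
    # positive root of L * alpha^2 = gamma * (1 + alpha), cf. Theorems 5 and 6
    return (gamma + math.sqrt(gamma * gamma + 4.0 * L * gamma)) / (2.0 * L)

def gradient_correction(y, grad_y, L):
    # last line of (91): x_{k+1} = y_k - grad f(y_k) / L
    return y - grad_y / L
```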

Theorem 6

Assume that \(f\in \mathcal {S}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \) and that \(L\alpha _k^2 =\gamma _k(1+\alpha _k)\). Then for the corrected scheme (91) together with (73), we have

$$\begin{aligned} {\mathcal {L}}_{k+1} \leqslant \frac{ {\mathcal {L}}_k}{1+\alpha _k }, \quad k\in {\mathbb {N}}, \end{aligned}$$
(92)

where \({\mathcal {L}}_k\) is defined by (74), and both estimates (87) and (88) hold true here.

Proof

According to (82) in Lemma 3 (applied with \(x_{k+1}\) replaced by \(y_k\)), we have

$$\begin{aligned} \widehat{{\mathcal {L}}}_{k}-{\mathcal {L}}_{k} \leqslant -\alpha _k \widehat{{\mathcal {L}}}_{k}+ \frac{\alpha _k^2}{2\gamma _k} \left\Vert {\nabla f(y_{k})} \right\Vert _{*}^2, \end{aligned}$$
(93)

where \(\widehat{{\mathcal {L}}}_k\) is defined by (84). Thanks to the additional gradient step in (91), we have the basic gradient descent inequality:

$$\begin{aligned} f(x_{k+1})-f(y_k)\leqslant -\frac{1}{2L} \left\Vert {\nabla f(y_k)} \right\Vert _{*}^2, \end{aligned}$$
(94)

which comes from (4) since \(f\in \mathcal S_{\mu ,L}^{1,1}\) and implies that

$$\begin{aligned} {\mathcal {L}}_{k+1}\leqslant \widehat{{\mathcal {L}}}_k -\frac{1}{2L} \left\Vert {\nabla f(y_k)} \right\Vert _{*}^2. \end{aligned}$$

Plugging this into (93) gives

$$\begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} \leqslant -\alpha _k {\mathcal {L}}_{k+1} +\frac{1}{2L\gamma _k} \left( L\alpha _k^2 -\gamma _k(1+\alpha _k)\right) \left\Vert {\nabla f(y_k)} \right\Vert _{*}^2. \end{aligned}$$

This together with the condition \(L\alpha _k^2 =\gamma _k(1+\alpha _k)\) yields (92).

As the step size is the same as in Theorem 5, the contraction (92) implies that both estimates (87) and (88) hold here as well. This completes the proof of this theorem. \(\square \)

6 A corrected semi-implicit scheme from NAG method

In this section, we consider another semi-implicit scheme, which comes exactly from Nesterov's accelerated gradient method.

6.1 NAG method

In [29, Chapter 2, General scheme of optimal method], by using the estimate sequence, Nesterov presented an accelerated gradient method for solving (1) with \(f\in {\mathcal {S}}_{\mu ,L}^{1,1}\), \(0\leqslant \mu \leqslant L<\infty \); see Algorithm 1 below.

[Algorithm 1 (Nesterov's accelerated gradient method) is displayed as a figure in the original.]

Note that we have many choices for \(x_{k+1}\) in step 5 of Algorithm 1. One notable example is the gradient descent step (see [29, Chapter 2, Constant Step Scheme, I]):

$$\begin{aligned} x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k). \end{aligned}$$
(95)

With this choice, the sequence \(\{v_k\}\) in Algorithm 1 can be eliminated and \(y_{k+1}\) is updated by (see [29, Chapter 2, Constant Step Scheme, II])

$$\begin{aligned} y_{k+1} = x_{k+1} + \frac{\alpha _k-\alpha _k^2}{\alpha _{k+1}+\alpha _k^2} (x_{k+1}-x_k), \end{aligned}$$

where \(\alpha _{k+1}\in (0,1)\) is calculated from the quadratic equation

$$\begin{aligned} L\alpha _{k+1}^2 = L\alpha ^2_k(1-\alpha _{k+1})+\mu \alpha _{k+1}. \end{aligned}$$

If \(\mu >0\) and \(\alpha _0=\sqrt{\mu /L}\), then \(\alpha _k = \sqrt{\mu /L}\) for all k; see [29, Chapter 2, Constant Step Scheme, III]. In particular, if \(\mu =0\), then Algorithm 1 (with \(x_{k+1}\) updated by (95)) coincides with the accelerated scheme proposed by Nesterov in the early 1980s [27].

6.2 NAG method as a corrected semi-implicit scheme

After simple calculations, we can rewrite Algorithm 1 in the equivalent form

$$\begin{aligned} \left\{ \begin{aligned} \frac{\gamma _{k+1} - \gamma _{k} }{\alpha _k} ={}&\mu -\gamma _{k},\\ \frac{y_{k}-x_{k}}{\alpha _k}={}&\frac{\gamma _k}{\gamma _{k+1}}(v_{k}-y_k),\\ \frac{v_{k+1}-v_{k}}{\alpha _k}={}&\frac{\mu }{\gamma _{k+1}}(y_k-v_{k}) -\frac{1}{\gamma _{k+1}}\nabla f(y_k), \end{aligned} \right. \end{aligned}$$
(96)

where, in addition, we update \(x_{k+1}\) so that

$$\begin{aligned} f(x_{k+1})\leqslant f(y_k)- \frac{1}{2L}\left\Vert {\nabla f(y_k)} \right\Vert _{*}^2. \end{aligned}$$
(97)

Surprisingly, (96) constitutes a semi-implicit discretization of our NAG flow (56) with a correction step (97), combined with an explicit discretization of the equation (54) for \(\gamma \). Similarly to (91), we can adopt the gradient descent step (95), which guarantees (97).
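To make the connection explicit, the following is a minimal Python sketch of one iteration of (96) with the gradient descent correction (95) and the step-size rule \(L\alpha _k^2=\gamma _{k+1}\) of Theorem 7; the name `nag_step` and the gradient oracle `grad_f` are our own notation, and this is only our reading of Algorithm 1 in update form.

```python
import math

def nag_step(x, v, gamma, mu, L, grad_f):
    """One iteration of (96) with the correction (95); a sketch only."""
    # alpha solves L * alpha^2 = gamma_{k+1} = gamma + alpha * (mu - gamma)
    alpha = ((mu - gamma) + math.sqrt((mu - gamma) ** 2 + 4.0 * L * gamma)) / (2.0 * L)
    gamma_new = gamma + alpha * (mu - gamma)          # explicit gamma-update in (96)
    # second line of (96): y is a weighted average of x and v
    y = (gamma_new * x + alpha * gamma * v) / (gamma_new + alpha * gamma)
    g = grad_f(y)
    # third line of (96): explicit v-update at y
    v_new = v + alpha * (mu * (y - v) - g) / gamma_new
    # gradient descent correction (95), which guarantees (97)
    x_new = y - g / L
    return x_new, v_new, gamma_new
```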

Based on subtle algebraic calculations of the estimate sequence, Nesterov [29, Chapter 2] proved the convergence rate of Algorithm 1. In the following, we give an alternative proof by using the Lyapunov function (74).

Theorem 7

Assume that \(f\in \mathcal {S}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \). If \(L\alpha _k^2= \gamma _{k+1}\), then for Algorithm 1, i.e., the scheme (96) together with (97), we have \(0<\alpha _k\leqslant 1\) and

$$\begin{aligned} {\mathcal {L}}_{k+1}\leqslant (1-\alpha _k){\mathcal {L}}_k, \quad k\in {\mathbb {N}}, \end{aligned}$$
(98)

where \({\mathcal {L}}_k\) is defined by (74). Consequently for all \(k\geqslant 0\),

$$\begin{aligned} {\mathcal {L}}_{k}\leqslant {} {\mathcal {L}}_{0}\times \min \left\{ \frac{4L}{(\sqrt{\gamma _0}\, k+2\sqrt{L})^2},\, \left( 1-\sqrt{\frac{\min \{\gamma _1,\mu \}}{L}}\right) ^{k} \right\} . \end{aligned}$$
(99)

Moreover, for all \(k\geqslant 1\),

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k}\leqslant {}&C_{\gamma _0,L}\times \min \left\{ \frac{4}{k^2},\, \left( 1-\sqrt{\frac{\min \{\gamma _1,\mu \}}{L}}\right) ^{k-1} \right\} , \end{aligned} \end{aligned}$$
(100)

where \(C_{\gamma _0,L}\) has been defined in (89).

Proof

Let us first prove (98). By (96), we find

$$\begin{aligned} \left\{ \begin{aligned}&v_{k} =y_{k}+ \frac{\gamma _{k+1}}{\alpha _k\gamma _k}(y_{k}-x_{k}),\\&v_{k+1} =y_k+\frac{1-\alpha _k}{\alpha _k}(y_k-x_k) -\frac{\alpha _k}{\gamma _{k+1}}\nabla f(y_k), \end{aligned} \right. \end{aligned}$$

and a direct computation gives

$$\begin{aligned} \begin{aligned}&\frac{\gamma _{k+1}}{2} \left\Vert {v_{k+1}-x^*} \right\Vert ^2 - \frac{\gamma _{k}}{2}(1-\alpha _k) \left\Vert {v_{k}-x^*} \right\Vert ^2 \\&\quad = \alpha _k \left( \left\langle {\nabla f(y_k),x^*-y_k} \right\rangle + \frac{\mu }{2}\left\Vert {x^*-y_k} \right\Vert ^2 \right) \\&\qquad + (1-\alpha _k) \left( \left\langle {\nabla f(y_k),x_k-y_k} \right\rangle + \frac{\mu }{2}\left\Vert {x_k-y_k} \right\Vert ^2 \right) \\&\qquad + \frac{\alpha _k^2}{2\gamma _{k+1}}\left\Vert {\nabla f(y_k)} \right\Vert _*^2-\frac{\mu (1-\alpha _k)}{2\alpha _k\gamma _k} (\gamma _{k}+\mu \alpha _k) \left\Vert {y_k-x_k} \right\Vert ^2. \end{aligned} \end{aligned}$$

Dropping the negative term involving \(\left\Vert {y_k-x_k} \right\Vert ^2\) and using the \(\mu \)-convexity of f, we obtain

$$\begin{aligned} \begin{aligned}&\frac{\gamma _{k+1}}{2} \left\Vert {v_{k+1}-x^*} \right\Vert ^2 - \frac{\gamma _{k}}{2}(1-\alpha _k) \left\Vert {v_{k}-x^*} \right\Vert ^2 \\&\quad \leqslant \alpha _k \left( f(x^*)-f(y_k) \right) + (1-\alpha _k) \left( f(x_k)-f(y_k)\right) + \frac{\alpha _k^2}{2\gamma _{k+1}}\left\Vert {\nabla f(y_k)} \right\Vert _*^2, \end{aligned} \end{aligned}$$

and we get the inequality

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}-(1-\alpha _k){\mathcal {L}}_{k} \leqslant {}&f(x_{k+1})-f(y_{k})+ \frac{\alpha _k^2}{2\gamma _{k+1}}\left\Vert {\nabla f(y_k)} \right\Vert _{*}^2. \end{aligned} \end{aligned}$$

Consequently, by (97) and the relation \(L\alpha _k^2=\gamma _{k+1}\), the right-hand side of the above inequality is nonpositive, which proves (98).

In this case, we modify (79) as follows

$$\begin{aligned} \rho _0=1,\quad \rho _k:= \prod _{i=0}^{k-1}(1-\alpha _i),\quad k\geqslant 1, \end{aligned}$$
(101)

then by (98) it is clear that \( \mathcal L_k\leqslant \rho _k{\mathcal {L}}_0\), and invoking Lemma B1 proves (99). As the proof of (100) is very similar to that of (88), we omit the details here and conclude the proof of this theorem. \(\square \)

Remark 4

Similarly to our corrected schemes (83) and (91), the NAG method (i.e., Algorithm 1) also generates a three-term sequence \(\{(x_k,y_{k},v_{k})\}\). If \(\mu =0\), then they share the same convergence rate bound

$$\begin{aligned} {\mathcal {L}}_k\leqslant \frac{4L {\mathcal {L}}_0}{(\sqrt{\gamma _0}\, k+2\sqrt{L})^2}, \end{aligned}$$

and when \(\gamma _0=\mu >0\), we have

$$\begin{aligned} {\mathcal {L}}_k\leqslant {\mathcal {L}}_0\times \left\{ \begin{aligned}&(1-\sqrt{\mu /L})^{k},&\text {for NAG method},\\&(1+\sqrt{\mu /L})^{-k},&\text {for (91) and~(83) }. \end{aligned} \right. \end{aligned}$$
(102)

In view of the trivial fact

$$\begin{aligned} 1 - \epsilon = \frac{1}{1+\epsilon } - \frac{\epsilon ^2}{1+\epsilon }, \quad \epsilon = \sqrt{\mu /L}\leqslant 1, \end{aligned}$$

we see that the rates in (102) are asymptotically the same, with the NAG method achieving a slightly better convergence rate. However, we note that they share the same computational complexity

$$\begin{aligned} O\left( \min \big \{ \sqrt{L/\epsilon },\,\sqrt{L/\mu }\cdot |\ln \epsilon | \big \} \right) , \end{aligned}$$

which is optimal in the sense that it achieves the complexity lower bound of first-order algorithms for the function class \({\mathcal {S}}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \) [29]. \(\square \)

Remark 5

Unlike the gradient descent method, the function value \(f(x_k)\) of accelerated gradient methods may not decrease in each step. It is the discrete Lyapunov function \({\mathcal {L}}_k\) that is always decreasing; see (86), (92) and (98). \(\square \)

Remark 6

To reduce the function value, one can adopt the restarting strategy [31]. Specifically, given \((\gamma _0,v_0,x_0)\), if \(f(x_k)\) is increasing after k iterations, then set \(k=0\) and restart the iteration process with another initial guess \(({\tilde{\gamma }}_0,{\tilde{v}}_0,{\tilde{x}}_0)\). By Theorems 5, 6 and 7, when \(f\in {\mathcal {S}}_{0,L}^{1,1}\) and \(\gamma _0=L,v_0 = x_0\), we only have the sublinear convergence rate

$$\begin{aligned} f(x_k)-f(x^*) \leqslant \frac{4}{k^2} \left( f(x_0)-f(x^*)+\frac{L}{2} \left\Vert {x_0-x^*} \right\Vert ^2 \right) \leqslant \frac{4L}{k^2} \left\Vert {x_0-x^*} \right\Vert ^2, \end{aligned}$$
(103)

where we used (4), which guarantees

$$\begin{aligned} f(x_0)-f(x^*)\leqslant \frac{L}{2}\left\Vert {x_0-x^*} \right\Vert ^2. \end{aligned}$$

Additionally, assume f satisfies the quadratic growth condition with \(\sigma >0\):

$$\begin{aligned} f(x)-f(x^*)\geqslant \sigma \mathrm{dist}^2(x,\mathrm{argmin}\,f) \quad \forall \,x\in V, \end{aligned}$$

where \(\mathrm{dist}(x,\mathrm{argmin}\,f) = \inf _{x^*\in \mathrm{argmin}\,f}\left\Vert {x-x^*} \right\Vert \). As (103) holds for all \(x^*\in \mathrm{argmin}\,f\), we have immediately that

$$\begin{aligned} f(x_k)-f(x^*) \leqslant {} \frac{4L}{k^2}\mathrm{dist}^2(x_0,\mathrm{argmin}\,f) \leqslant \frac{4L }{\sigma k^2} (f(x_0)-f(x^*)). \end{aligned}$$

Therefore, as analyzed in [30], if we apply the fixed restart technique [31] every k steps, then after \(N=nk\) steps we will get

$$\begin{aligned} f(x_{N})-f(x^*)\leqslant \left( \frac{4L}{\sigma k^2}\right) ^{n}(f(x_0)-f(x^*)). \end{aligned}$$

Evidently, the optimal choice \( k_{\#} =e \sqrt{4L/\sigma }\) yields the linear rate

$$\begin{aligned} f(x_{N})-f(x^*)\leqslant e^{-2N/k_{\#}} (f(x_0)-f(x^*)). \end{aligned}$$

If the parameter \(\sigma \) is unknown, one can use the adaptive restart technique [31].
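As an illustration of the fixed restart strategy discussed above (with \(\sigma \) known), here is a minimal sketch; `accel_run(x0, k)` is a placeholder for any of the accelerated methods above run for k iterations from the cold start \(v_0=x_0\), \(\gamma _0=L\), and the naming is ours.

```python
import math

def fixed_restart(x0, L, sigma, n_cycles, accel_run):
    """Fixed restart every k# = ceil(e * sqrt(4L/sigma)) iterations; a sketch only."""
    k = math.ceil(math.e * math.sqrt(4.0 * L / sigma))
    x = x0
    for _ in range(n_cycles):
        # each cycle restarts from the last iterate, resetting v0 = x and gamma0 = L
        x = accel_run(x, k)
    return x
```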

When f is quadratic and convex, changing \(\gamma _k\) from L to \(\mu \) periodically will smooth out errors of different frequencies and can further optimize the constant in front of the accelerated rate. That is, the dynamically changing parameter \(\{\gamma _k\}\) can hopefully outperform the fixed choice \(\gamma _k=\mu \). For general nonlinear convex functions, a rigorous justification of the restart strategy is under investigation. \(\square \)

7 Composite convex optimization

In this part we mainly focus on the composite optimization problem

$$\begin{aligned} \min _{x\in Q} f(x):= \min _{x\in Q} \left[ h(x)+g(x) \right] , \end{aligned}$$
(104)

where \(Q\subseteq V\) is a simple closed convex set, \(h\in \mathcal S_{\mu ,L}^{1,1}(Q)\) with \(0\leqslant \mu \leqslant L<\infty \), \(g:V\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is proper, closed and convex, and \(Q\cap \mathbf{dom}\, g\ne \emptyset \). In general, g is not differentiable, but its subdifferential \(\partial g\) exists as a set-valued mapping. More precisely, the subdifferential \(\partial g(x)\) of g at x is defined by

$$\begin{aligned} \partial g(x) :=\left\{ p\in V^*:\,g(y)\geqslant g(x)+\left\langle {p,y-x} \right\rangle \quad \,\forall \, y\in V\right\} . \end{aligned}$$
(105)

Remark 7

For the case that \(h\in {\mathcal {S}}_{0,L}^{1,1}(Q)\) and g is \(\mu \)-strongly convex with \(\mu \geqslant 0\), we can split \(h+g\) as \((h(x) + \frac{\mu }{2}\Vert x\Vert ^2) + (g(x) - \frac{\mu }{2}\Vert x\Vert ^2)\), which reduces to our current assumption for (104). \(\square \)

We shall apply our ODE solver approach to the problem (104). The first step is to generalize the dynamical system (56) to the current nonsmooth setting. Basically, we set \(F = f+i_Q\), with \(i_Q\) being the indicator function of Q, and obtain a differential inclusion for minimizing F on V, which is equivalent to minimizing f over Q. After that, optimization methods (see Algorithms 2 and 4) for solving the original problem (104) with the accelerated convergence rate

$$\begin{aligned} O\left( \min \big \{L/k^2,(1+\sqrt{\mu /L})^{-k}\big \}\right) \end{aligned}$$

are proposed from numerical discretizations of the continuous model (106). This demonstrates the effectiveness and usefulness of our NAG flow model (106) and the ODE solver approach, by which we can construct new accelerated methods easily.

7.1 Continuous model

For minimizing a nonsmooth function F over V, our NAG flow (56) becomes a differential inclusion

$$\begin{aligned} \left\{ \begin{aligned} x'={}&v - x,\\ \gamma v'\in {}&\mu (x - v )-\partial F(x). \end{aligned} \right. \end{aligned}$$
(106)

To ensure solution existence, suitable initial conditions shall be imposed later. Correspondingly, the second-order ODE (57) reads as a second-order differential inclusion

$$\begin{aligned} \gamma x''+(\mu +\gamma )x'+\partial F(x)\ni 0. \end{aligned}$$
(107)

As the subdifferential \(\partial F\) is a set-valued maximal monotone operator, a classical \(C^2\) solution to (107) may not exist because discontinuities can occur in \(x'\). Therefore, the concept of an energy-conserving solution has been introduced in [15, 32, 36].

Let us assume the initial data

$$\begin{aligned} x (0) = x_0\in \mathbf{dom} F\quad \text {and}\quad x'(0) = x_1\in {\mathcal {T}}_{\mathbf{dom} F}(x_0), \end{aligned}$$
(108)

where \({\mathcal {T}}_{\mathbf{dom} F}(x_0)\) denotes the tangent cone of \(\mathbf{dom} F\) at \(x_0\):

$$\begin{aligned} {\mathcal {T}}_{\mathbf{dom} F}(x_0):= \overline{\mathop {\cup }\limits _{\tau >0}\tau (\overline{\mathbf{dom} F}-x_0)}. \end{aligned}$$

In addition, we shall introduce some vector-valued function spaces. Given any interval \(I\subset {\mathbb {R}}\), let \(M(I;V)\) be the space of V-valued Radon measures on I; for any \(m\in {\mathbb {N}}\) and \(1\leqslant p\leqslant \infty \), \(W^{m,p}(I;V)\) denotes the standard V-valued Sobolev space [21]; the space of all V-valued functions of bounded variation on I is denoted by \(BV(I;V)\) [4]. Also, \(W_\mathrm{loc}^{m,p}(I;V)\) and \(BV_{\mathrm{loc}}(I;V)\) consist of all functions that belong to \(W^{m,p}(\omega ;V)\) and \(BV(\omega ;V)\), respectively, for every compact subset \(\omega \subset I\).

Definition 1

We call \(x:[0,\infty )\rightarrow V\) an energy-conserving solution to (107) with initial data (108) if it satisfies the following.

1. \(x\in W^{1,\infty }_{\mathrm{loc}}(0,\infty ;V),\,x(0) = x_0\) and \(x(t)\in \mathbf{dom} F\) for all \(t>0\).

2. \(x'\in BV_{\mathrm{loc}}([0,\infty );V),\,x'(0+) = x_1\).

3. For almost all \(t>0\), there holds the energy equality:

    $$\begin{aligned} \begin{aligned} {}&F(x(t)) + \frac{\gamma (t)}{2}\left\Vert {x'(t)} \right\Vert ^2 + \int _{0}^{t}\frac{\mu +3\gamma (s)}{2}\left\Vert {x'(s)} \right\Vert ^2 {{\mathrm{d}}}s = {} F(x_0) + \frac{\gamma _0}{2}\left\Vert {x_1} \right\Vert ^2. \end{aligned} \end{aligned}$$
4. There exists some \(\nu \in M(0,\infty ;V)\) such that

    $$\begin{aligned} \gamma x'' + (\mu +\gamma )x'+\nu = 0 \end{aligned}$$

    holds in the sense of distributions, and for any \(T>0\), we have

    $$\begin{aligned} \int _{0}^{T}\big (F(y(t))-F(x(t))\big ){{\mathrm{d}}}t \geqslant \left\langle { \nu ,y-x} \right\rangle _{C([0,T];V)} \quad \text {for all } y\in C([0,T];V). \end{aligned}$$

In [25], the problem (107) has been extended to a general case

$$\begin{aligned} \gamma x''+(\mu +\gamma )x'+\partial F(x)\ni \xi , \end{aligned}$$

where \(\xi \) stands for a small perturbation. Therefore, according to [25, Theorem 2.1], we have the existence of an energy-conserving solution to (107), and by [25, Theorems 2.2 and 2.3], we obtain the exponential decay, which is a nonsmooth version of (60).

Theorem 8

Assume V is a finite dimensional Hilbert space. In the sense of Definition 1, the differential inclusion (107) admits an energy-conserving solution \(x:[0,\infty )\rightarrow V\) satisfying

$$\begin{aligned} F(x(t))-F(x^*) +\frac{\gamma (t)}{2}\left\Vert {x(t)+x'(t)-x^*} \right\Vert ^2 \leqslant 2 {\mathcal {L}}_0e^{-t}, \end{aligned}$$
(109)

for almost all \(t>0\), where \( {\mathcal {L}}_0 := F(x_0)-F(x^*) +\frac{\gamma _0}{2}\left\Vert {x_0+x_1-x^*} \right\Vert ^2\).

Remark 8

If additionally \(\mathbf{dom} F = V\), then \(x\in W^{2,\infty }_\mathrm{loc}(0,\infty ;V) \cap C^1([0,\infty );V)\) and (109) holds for all \(t>0\). \(\square \)

7.2 An APGM for unconstrained optimization

Let us first consider the unconstrained case \(Q=V\), i.e.,

$$\begin{aligned} \min _{x\in V} f(x):= \min _{x\in V} \left[ h(x)+g(x) \right] , \end{aligned}$$
(110)

where \(h\in {\mathcal {S}}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \) and \(g:V\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is a properly closed convex function, possibly nonsmooth.

7.2.1 Gradient mapping

To treat the nonsmooth part g, we introduce the tool of the gradient mapping. Following [29, Chapter 2], given any \(\eta >0\), the composite gradient mapping \({\mathcal {G}}_f(x,\eta )\) of f at x is defined by

$$\begin{aligned} {\mathcal {G}}_f(x,\eta ):= \frac{x-S_f(x,\eta )}{\eta } \quad x\in V, \end{aligned}$$
(111)

where \(S_f(x,\eta ):=\mathbf{prox}_{\eta g}(x-\eta \nabla h(x))\) and the proximal operator \(\mathbf{prox}_{\eta g}\) has been defined by (8). Note that \(S_f(x,\eta )\) is clearly well-defined and so is \({\mathcal {G}}_f(x,\eta )\).

It is well known [33, 35] that

$$\begin{aligned} \frac{x-\mathbf{prox}_{\eta g}(x)}{\eta }\in \partial g( \mathbf{prox}_{\eta g}(x)) , \end{aligned}$$
(112)

which yields the fact

$$\begin{aligned} {\mathcal {G}}_f(x,\eta )-\nabla h(x)\in \partial g(S_f(x,\eta )). \end{aligned}$$
(113)

From this we conclude that the fixed-point set of \(S_f(\cdot ,\eta )\) is \(\mathrm{argmin}\,f\). Indeed, \(x = S_f(x,\eta )\) if and only if \(0\in \partial f(x) \). We also observe from (113) that the gradient mapping (111) is defined, in the reverse direction, from the proximal-gradient step for minimizing \(f = h+g\), i.e.,

$$\begin{aligned}&\frac{S_f(x,\eta )-x}{\eta }\in - \nabla h(x) - \partial g(S_f(x,\eta ))\\&\quad = -{\mathcal {G}}_f(x,\eta ). \end{aligned}$$

Hence it plays the role of the gradient \(\nabla f\) in the smooth case. In particular, if \(g = 0\), then \({\mathcal {G}}_f(x,\eta ) = \nabla h(x)\) and \(S_f(x,\eta )=x-\eta \nabla h(x)\) is nothing but a gradient descent step.
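In code, \(S_f\) and the gradient mapping are nothing but one proximal-gradient step and its rescaled residual; a minimal sketch with \(\eta =1/L\), where the choice \(g=\lambda \Vert \cdot \Vert _1\) (soft-thresholding) is used purely as an illustration of \(\mathbf{prox}_{\eta g}\) and all names are ours.

```python
import numpy as np

def prox_l1(z, t):
    # proximal operator of t * ||.||_1 (soft-thresholding); only an example of prox_{eta g}
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def S_f(x, grad_h, L, lam):
    # S_f(x) = prox_{g/L}(x - grad h(x) / L), cf. (111) with eta = 1/L
    return prox_l1(x - grad_h(x) / L, lam / L)

def grad_map(x, grad_h, L, lam):
    # G_f(x) = L * (x - S_f(x)), the gradient mapping (111) with eta = 1/L
    return L * (x - S_f(x, grad_h, L, lam))
```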

To move on, we present an auxiliary lemma, which is a key ingredient for our convergence analysis. As we will fix \(\eta =1/L\), for simplicity, we set \({\mathcal {G}}_f(x):= {\mathcal {G}}_f(x,1/L)\) and \(S_f(x):=S_f(x,1/L)\).

Lemma 4

Assume \(f = h+g\), where \(h\in {\mathcal {S}}_{\mu ,L}^{1,1}\) with \(0\leqslant \mu \leqslant L<\infty \) and \(g:V\rightarrow \mathbb R\cup \{+\infty \}\) is properly closed and convex. Then for any \(x,y\in V\),

$$\begin{aligned} \begin{aligned} f(y)\geqslant {}&f(S_f(x)) +\left\langle {{\mathcal {G}}_f(x),y-x} \right\rangle +\frac{\mu }{2}\left\Vert {y-x} \right\Vert ^2 +\frac{1}{2L} \left\Vert {{\mathcal {G}}_f(x)} \right\Vert ^2. \end{aligned} \end{aligned}$$
(114)

Proof

Since \(h\in {\mathcal {S}}_{\mu ,L}^{1,1}\), applying (2) and (4) gives

$$\begin{aligned} \begin{aligned} h(x)-h(y)+\left\langle {\nabla h(x),y-x} \right\rangle \leqslant {}&-\frac{\mu }{2}\left\Vert {x-y} \right\Vert ^2,\\ h(S_f(x))- h(x)+\left\langle {\nabla h(x),x-S_f(x)} \right\rangle \leqslant {}&\frac{L}{2}\left\Vert {S_f(x)-x} \right\Vert ^2, \end{aligned} \end{aligned}$$

which implies that

$$\begin{aligned} \begin{aligned} h(y)\geqslant {}&h(S_f(x)) +\left\langle {\nabla h(x),y-S_f(x)} \right\rangle +\frac{\mu }{2}\left\Vert {y-x} \right\Vert ^2-\frac{1}{2L} \left\Vert {{\mathcal {G}}_f(x)} \right\Vert ^2. \end{aligned} \end{aligned}$$

Observing (113), we get

$$\begin{aligned} \begin{aligned} g(y)\geqslant {}&g(S_f(x)) +\left\langle { {\mathcal {G}}_f(x)-\nabla h(x), y-S_f(x)} \right\rangle . \end{aligned} \end{aligned}$$

Summing the above two inequalities and using the split

$$\begin{aligned} \left\langle { {\mathcal {G}}_f(x), y-S_f(x)} \right\rangle&= \left\langle { {\mathcal {G}}_f(x), y-x} \right\rangle + \left\langle { {\mathcal {G}}_f(x), x-S_f(x)} \right\rangle \\&= \left\langle { {\mathcal {G}}_f(x), y-x} \right\rangle + \frac{1}{L} \left\Vert {\mathcal G_f(x)} \right\Vert ^2, \end{aligned}$$

we finally arrive at (114) and end the proof of this lemma. \(\square \)

Remark 9

For a fixed x, the right hand side of (114) defines a quadratic approximation of f at x, and it is strongly reminiscent of the quadratic lower bound approximation (2) for the smooth case. However, compared to (2), the constant is shifted from f(x) to a lower value \(f(S_f(x)) + \frac{1}{2L} \left\Vert {{\mathcal {G}}_f(x)} \right\Vert ^2\). The first order part is \({\mathcal {G}}_f(x)\) instead of the subgradient at x. The quadratic part \(\frac{\mu }{2}\left\Vert {y-x} \right\Vert ^2\) is due to the \(\mu \)-convexity. \(\square \)

7.2.2 The proposed method

Based on the corrected semi-implicit scheme (91) for the NAG flow (56), we can generalize it to solve the differential inclusion (106): we simply replace the gradient \(\nabla f(y_k)\) with the gradient mapping \({\mathcal {G}}_f(y_k)\) and take the correction step \( x_{k+1} ={}S_f(y_k)\). More precisely, consider

$$\begin{aligned} \left\{ \begin{aligned} \frac{y_k-x_{k}}{\alpha _k}={}&v_{k}-y_k,\\ x_{k+1} ={}&S_f(y_k),\\ \frac{v_{k+1}-v_{k}}{\alpha _k}={}&\frac{\mu }{\gamma _k}(y_k-v_{k+1}) -\frac{1}{\gamma _k} {\mathcal {G}}_f(y_{k}),\\ \frac{ \gamma _{k+1} - \gamma _{k}}{\alpha _k} ={}&\mu -\gamma _{k+1}. \end{aligned} \right. \end{aligned}$$
(115)

Once \(x_{k+1}=S_f(y_k)=\mathbf{prox}_{\eta g}(y_k-\eta \nabla h(y_k))\) with \(\eta =1/L\) is obtained, we can update \(v_{k+1}\) from the known data \(x_k,y_k,v_k\) and \(x_{k+1}\). Thus, in each iteration, (115) calls the proximal operator \(\mathbf{prox}_{\eta g}\) only once.

We still use the step size \(L\alpha _k^2=\gamma _k(1+\alpha _k)\) and summarize the semi-implicit scheme (115) in Algorithm 2, which we call the semi-implicit APGM (Semi-APGM for short). The convergence rate is again derived via the discrete Lyapunov function (74).

[Algorithm 2 (Semi-APGM) is displayed as a figure in the original.]
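Since Algorithm 2 is displayed only as a figure, the following Python sketch spells out one Semi-APGM iteration, i.e., scheme (115) with the step size \(L\alpha _k^2=\gamma _k(1+\alpha _k)\); the proximal oracle `prox_g(z, t)`, which should return \(\mathbf{prox}_{tg}(z)\), and the remaining names are our own conventions.

```python
import math

def semi_apgm_step(x, v, gamma, mu, L, grad_h, prox_g):
    """One Semi-APGM iteration (scheme (115)); a sketch only."""
    alpha = (gamma + math.sqrt(gamma * gamma + 4.0 * L * gamma)) / (2.0 * L)
    # explicit step for x' = v - x
    y = (x + alpha * v) / (1 + alpha)
    # x_{k+1} = S_f(y_k): the single proximal-gradient step per iteration
    x_new = prox_g(y - grad_h(y) / L, 1.0 / L)
    # gradient mapping G_f(y) = L * (y - S_f(y))
    G = L * (y - x_new)
    # v-update of (115), driven by the gradient mapping
    v_new = (gamma * v + mu * alpha * y - alpha * G) / (gamma + mu * alpha)
    # implicit gamma-update: (gamma_new - gamma) / alpha = mu - gamma_new
    gamma_new = (gamma + mu * alpha) / (1 + alpha)
    return x_new, v_new, gamma_new
```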

Theorem 9

For Algorithm 2, we have

$$\begin{aligned} {\mathcal {L}}_{k+1} \leqslant \frac{ {\mathcal {L}}_k}{1+\alpha _k }\quad \forall \,k\in {\mathbb {N}}, \end{aligned}$$
(116)

where \({\mathcal {L}}_k = f(x_k)-f(x^*) + \frac{\gamma _k}{2} \left\Vert {v_k-x^*} \right\Vert ^2 \), and both (87) and (88) hold true here.

Proof

The proof of (116) is very similar to that of (92). Replacing \(x_{k+1}\) and its gradient \(\nabla f(x_{k+1})\) in (80) with \(y_k\) and \({\mathcal {G}}_f(y_k)\), respectively, we can proceed as in the proof of Lemma 3 and use Lemma 4 to obtain

$$\begin{aligned} \begin{aligned} \widehat{ {\mathcal {L}}}_{k}-{\mathcal {L}}_{k} \leqslant {}&-\alpha _k \widehat{ {\mathcal {L}}}_{k} +(1+\alpha _k)\left( f(y_k)-f(x_{k+1})\right) \\ {}&\quad + \frac{\alpha _k^2}{2\gamma _k} \left\Vert {\mathcal G_f(y_k)} \right\Vert ^2-\frac{1+\alpha _k}{2L}\left\Vert {{\mathcal {G}}_f(y_k)} \right\Vert ^2, \end{aligned} \end{aligned}$$
(117)

where \(\widehat{ {\mathcal {L}}}_{k}\) is defined by (84). Thanks to the relation \(L\alpha _k^2=\gamma _k(1+\alpha _k)\), the second line of (117) vanishes, and inserting the identity \(f(y_k)-f(x_{k+1})= \widehat{ {\mathcal {L}}}_{k}- {\mathcal {L}}_{k+1}\) into (117) gives (116). Based on this, it is not hard to see that both (87) and (88) hold true. This finishes the proof of this theorem. \(\square \)

We mention that with another choice

$$\begin{aligned} L\alpha _k^2 = \mu \alpha _k^2+\gamma _{k}(1+\alpha _k), \end{aligned}$$

we can drop the sequence \(\{v_{k}\}\) from (115). The procedure is not straightforward but very similar to that of Nesterov’s optimal method in [29, page 80]. We omit the details and only list the following algorithm.

[Algorithm 3 is displayed as a figure in the original.]

This can be viewed as a generalization of [29, Chapter 2, Constant Step Scheme, II] to problem (110). In particular, for the convex case \(\mu =0\), it is very close to FISTA [12]. Both share the same spirit: apply one proximal gradient step first and then an extrapolation step. The difference comes only from the choice of the two sequences \(\{\alpha _k\}\) and \(\{\beta _k\}\). We also claim that Algorithm 3 has the same accelerated convergence rate as Algorithm 2, i.e., \(O(\min (L/k^2,(1+\sqrt{\mu /L})^{-k}))\). In contrast, FISTA is designed for \(\mu =0\) and has only the sublinear rate \(O(L/k^2)\).

We mention that accelerated proximal gradient methods for solving (110) with only one evaluation of \(\mathbf{prox}_{\eta g}\) in each iteration can be found in [38] (only for the strongly convex case) and [24, Chapter 2, Algorithm 2.2] (for both the convex and strongly convex cases).

Neither Algorithm 2 nor Algorithm 3 can be applied directly to the general constrained case (104). The main issue comes from the definition (111) of the gradient mapping \({\mathcal {G}}_f(x,\eta )\), where we impose the restriction \(x\in Q\) and calculate the proximal operator \(\mathbf{prox}_{\eta g}\) over Q to obtain \(S_f(x)\in Q\). For both algorithms, we have to compute \(x_{k+1}=S_f(y_k)=\mathbf{prox}_{\eta g}(y_k-\eta \nabla h(y_k))\). But the sequence \(\{y_k\}\) in Algorithms 2 and 3 may lie outside the constraint set. This is not acceptable because \(\nabla h(y_k)\) might not exist: for instance, \(Q = [0,\infty )\) and h is the entropy function.

The original FISTA [12] and the methods in [38] and [24, Chapter 2, Algorithm 2.2] mentioned above cannot be applied to the constrained problem (104) either. This motivates us to propose a new operator splitting scheme to overcome this difficulty.

7.3 An accelerated forward–backward method for constrained optimization

We now return to the constrained problem (104). As mentioned above, the tool of the gradient mapping is not convenient for handling this case. To avoid using it, we utilize the separable structure of \(f=h+g\) and apply explicit and implicit schemes for h and g, respectively. This is the so-called operator splitting technique in ODE solvers and is also known as the forward–backward method.

Let us start from the predictor-corrector scheme (83) and rewrite it as follows

$$\begin{aligned} \left\{ \begin{aligned} {}&y_k = \frac{x_k+\alpha _kv_k}{1+\alpha _k},\quad w_k = {}\frac{\gamma _kv_k+\mu \alpha _ky_k}{\gamma _k+\mu \alpha _k},\\ {}&v_{k+1} = {\mathop {\hbox {argmin}}\limits _{v\in V}}\left\{ \left\langle {\nabla f(y_k),v} \right\rangle + \frac{\gamma _{k}+\mu \alpha _k}{2\alpha _k}\left\Vert {v-w_k} \right\Vert ^2 \right\} ,\\ {}&x_{k+1} = \frac{x_k+\alpha _kv_{k+1}}{1+\alpha _k}. \end{aligned} \right. \end{aligned}$$
(118)

For minimizing \(f=h+g\) over Q, we modify the above method as follows

$$\begin{aligned} \left\{ \begin{aligned} {}&y_k = \frac{x_k+\alpha _kv_k}{1+\alpha _k},\quad w_k = {}\frac{\gamma _kv_k+\mu \alpha _ky_k}{\gamma _k+\mu \alpha _k},\\ {}&v_{k+1} = {\mathop {\hbox {argmin}}\limits _{v\in Q}}\left\{ g(v) + \left\langle {\nabla h(y_k),v} \right\rangle + \frac{\gamma _{k}+\mu \alpha _k}{2\alpha _k}\left\Vert {v-w_k} \right\Vert ^2 \right\} ,\\ {}&x_{k+1} = \frac{x_k+\alpha _kv_{k+1}}{1+\alpha _k}, \end{aligned} \right. \end{aligned}$$
(119)

where \(x_0,\,v_0\in Q\) and the parameter sequence \(\{\gamma _k\}\) comes from the implicit discretization (73) of the equation (54). Clearly, as convex combinations are used, the method (119) keeps the three-term sequence \(\{(x_k,y_k,v_k)\}\) in Q, and it requires the proximal computation of g over Q only once in each iteration.

We choose \(L\alpha _k^2=\gamma _{k}(1+\alpha _k)\) as before and rewrite (119) in Algorithm 4, which is called the semi-implicit accelerated forward–backward (Semi-AFB for short) method.

[Algorithm 4 (Semi-AFB) is displayed as a figure in the original.]
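Likewise, since Algorithm 4 appears only as a figure, here is a minimal Python sketch of one Semi-AFB iteration, i.e., scheme (119) with \(L\alpha _k^2=\gamma _k(1+\alpha _k)\); the oracle `prox_gQ(z, t)` is assumed to return the minimizer of \(g(v)+\frac{1}{2t}\Vert v-z\Vert ^2\) over Q, and the naming is ours.

```python
import math

def semi_afb_step(x, v, gamma, mu, L, grad_h, prox_gQ):
    """One Semi-AFB iteration (scheme (119)); a sketch only."""
    alpha = (gamma + math.sqrt(gamma * gamma + 4.0 * L * gamma)) / (2.0 * L)
    # convex combinations: y and w stay in Q whenever x and v are in Q
    y = (x + alpha * v) / (1 + alpha)
    w = (gamma * v + mu * alpha * y) / (gamma + mu * alpha)
    # the v-update of (119) is one (scaled) proximal step of g over Q:
    # v_new = argmin_{v in Q} g(v) + <grad h(y), v> + (gamma + mu*alpha)/(2*alpha) * ||v - w||^2
    t = alpha / (gamma + mu * alpha)
    v_new = prox_gQ(w - t * grad_h(y), t)
    # corrector: x_new is again a convex combination, hence stays in Q
    x_new = (x + alpha * v_new) / (1 + alpha)
    # implicit gamma-update (73)
    gamma_new = (gamma + mu * alpha) / (1 + alpha)
    return x_new, v_new, gamma_new
```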

In [41], Tseng considered problem (104) only under the convexity assumption, i.e., \(\mu =0\), and proposed an APGM that possesses the rate \(O(L/k^2)\). By using the technique of estimate sequences, Nesterov [28] presented an accelerated method for solving (104) under the assumption that h is L-smooth over Q and g is \(\mu \)-strongly convex with \(\mu \geqslant 0\). Both our Algorithm 4 and Nesterov's method generate a three-term sequence \(\{(x_k,y_k,v_k)\}\) and have the same accelerated rate \(O(\min (L/k^2,(1+\sqrt{\mu /L})^{-k}))\); see [28, Theorem 6] and our Theorem 10. However, as mentioned in [12], the latter builds a sequence of estimate functions recursively from an accumulated history of the past iterates, and to update \(x_{k+1}\) and \(v_{k+1}\) in each iteration, Nesterov's method in [28] calls \(\mathbf{prox}_{ g}\) over Q twice.

Below, we shall establish the convergence rate of Algorithm 4 via the analysis of a Lyapunov function. It is well known [28, Eq (2.9)] that the first-order optimality condition for \(v_{k+1}\) in (119) is the variational inequality

$$\begin{aligned} \left\langle {\nabla h(y_k)+\frac{\gamma _{k}+\mu \alpha _k}{\alpha _k}(v_{k+1}-w_k) +p_{k+1},x-v_{k+1}} \right\rangle \geqslant 0\quad \forall \,x\in Q, \end{aligned}$$

where \(p_{k+1}\in \partial g(v_{k+1})\). Expanding \(w_k\), we observe the relation

$$\begin{aligned}&\gamma _k\left( {v_{k+1}-v_k, v_{k+1} -x} \right) \nonumber \\&\quad \leqslant \mu \alpha _k\left( {y_{k} - v_{k+1}, v_{k+1} - x} \right) - \alpha _k \left\langle {\nabla h(y_k)+p_{k+1}, v_{k+1} - x} \right\rangle , \end{aligned}$$
(120)

where \(x\in Q\) is arbitrary.

Theorem 10

For Algorithm 4, we have

$$\begin{aligned} {\mathcal {L}}_{k+1}\leqslant \frac{ {\mathcal {L}}_k}{1+\alpha _k }\quad \forall \,k\in {\mathbb {N}}, \end{aligned}$$
(121)

where \({\mathcal {L}}_k= f(x_k)-f(x^*) + \frac{\gamma _k}{2} \left\Vert {v_k-x^*} \right\Vert ^2\), and both (87) and (88) hold true here.

Proof

As before, we calculate the difference

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k} =&f(x_{k+1})-f(x_k)+ \frac{\alpha _k}{2}(\mu - \gamma _{k+1}) \left\Vert {v_{k+1}-x^*} \right\Vert ^2\\&+\gamma _k \left( {v_{k+1}-v_k, v_{k+1} -x^*} \right) - \frac{\gamma _k}{2} \left\Vert {v_{k+1}-v_k} \right\Vert ^2. \end{aligned} \end{aligned}$$

Thanks to (120), we have

$$\begin{aligned}&\gamma _k\left( {v_{k+1}-v_k, v_{k+1} -x^*} \right) \nonumber \\&\quad \leqslant \mu \alpha _k\left( {y_{k} - v_{k+1}, v_{k+1} - x^{*}} \right) - \alpha _k \left\langle {\nabla h(y_k)+p_{k+1}, v_{k+1} - x^{*}} \right\rangle . \end{aligned}$$
(122)

where \(p_{k+1} \in \partial g(v_{k+1})\).

By Lemma 1, the first term in (122) is split as follows

$$\begin{aligned} \begin{aligned}&2\mu \alpha _k\left( {y_{k} - v_{k+1}, v_{k+1} - x^{*}} \right) \\&\quad =\mu \alpha _k\left( \left\| y_{k}-x^{*}\right\| ^{2} -\Vert y_{k}-v_{k+1}\Vert ^{2} -\left\| v_{k+1}-x^{*}\right\| ^{2}\right) . \end{aligned} \end{aligned}$$

The gradient term in (122) is more subtle. Firstly, by convexity of g, we have

$$\begin{aligned} \begin{aligned}&-\alpha _k \left\langle {p_{k+1}, v_{k+1} - x^{*}} \right\rangle \leqslant -\alpha _k\left( g(v_{k+1})-g(x^*)\right) \\&\quad =-\alpha _k\left( g(x_{k+1})-g(x^*)\right) -\alpha _k\left( g(v_{k+1})-g(x_{k+1})\right) , \end{aligned} \end{aligned}$$

and secondly, according to the update for \(y_{k}\) (see step 4 in Algorithm 4), we find

$$\begin{aligned}&-\alpha _k \left\langle { \nabla h(y_k), v_{k+1} - x^{*}} \right\rangle \\&\quad =-\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle -\alpha _k\left\langle { \nabla h(y_k), v_{k} - x^{*}} \right\rangle \\&\quad =-\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle - \left\langle { \nabla h(y_k), y_{k} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla h(y_k), y_{k} - x^{*}} \right\rangle . \end{aligned}$$

As h is \(\mu \)-convex on Q and \(\{(x_k,y_k,v_k)\}\subset Q\), it follows that

$$\begin{aligned}&- \left\langle { \nabla h(y_k), y_{k} - x_{k}} \right\rangle - \alpha _k\left\langle { \nabla h(y_k), y_{k} - x^{*}} \right\rangle \\&\quad \leqslant h(x_k)-h(y_k) -\frac{\mu }{2}\left\Vert {x_k-y_k} \right\Vert ^2 - \alpha _k\left( h(y_k)-h(x^*)\right) -\frac{\mu \alpha _k}{2}\left\Vert {x^*-y_k} \right\Vert ^2\\&\quad = (1+\alpha _k)\left( h(x_{k+1})-h(y_{k})\right) - \alpha _k\left( h(x_{k+1})-h(x^*)\right) -\frac{\mu \alpha _k}{2}\left\Vert {x^*-y_k} \right\Vert ^2\\&\qquad +h(x_{k})-h(x_{k+1}) -\frac{\mu }{2}\left\Vert {x_k-y_k} \right\Vert ^2. \end{aligned}$$

Therefore, collecting all the estimates and dropping surplus negative terms related to \(-\left\Vert {x_k-y_k} \right\Vert ^2\) and \(-\Vert y_{k}-v_{k+1}\Vert ^{2}\), we get

$$\begin{aligned}&{\mathcal {L}}_{k+1}-{\mathcal {L}}_{k}\nonumber \\&\quad \leqslant -\alpha _k{\mathcal {L}}_{k+1} +(1+\alpha _k)\left( h(x_{k+1})-h(y_{k})\right) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle \nonumber \\&\qquad - \frac{\gamma _k}{2} \left\Vert {v_{k+1}-v_k} \right\Vert ^2 +g(x_{k+1})-g(x_{k})-\alpha _k\left( g(v_{k+1})-g(x_{k+1})\right) . \end{aligned}$$
(123)

Let us consider the additional terms in (123). In view of (4), we have

$$\begin{aligned} h(x_{k+1})-h(y_k)\leqslant \left\langle {\nabla h(y_k),x_{k+1}-y_k} \right\rangle + \frac{L}{2}\left\Vert {x_{k+1}-y_k} \right\Vert ^2. \end{aligned}$$

Thanks to the extrapolation step for \(x_{k+ 1}\) (see step 6 in Algorithm 4), we find a crucial relation

$$\begin{aligned} x_{k+1}-y_k = \frac{\alpha _k}{1+\alpha _k}(v_{k+1}-v_k), \end{aligned}$$

which gives that

$$\begin{aligned} \begin{aligned}&(1+\alpha _k)\left( h(x_{k+1})-h(y_{k})\right) -\alpha _k\left\langle {\nabla h(y_k),v_{k+1} - v_k} \right\rangle - \frac{\gamma _k}{2} \left\Vert {v_{k+1}-v_k} \right\Vert ^2\\&\quad \leqslant \frac{L\alpha _k^2}{2(1+\alpha _k)}\left\Vert {v_{k+1}-v_k} \right\Vert ^2 - \frac{\gamma _k}{2} \left\Vert {v_{k+1}-v_k} \right\Vert ^2=0, \end{aligned} \end{aligned}$$

as \(L\alpha _k^2=\gamma _k(1+\alpha _k)\). Moreover, since \(x_{k+1}\) is a convex combination of \(x_k\) and \(v_{k+1}\), the following estimate holds

$$\begin{aligned} \begin{aligned}&g(x_{k+1})-g(x_{k})-\alpha _k\left( g(v_{k+1})-g(x_{k+1})\right) \\&\quad =(1+\alpha _k)g(x_{k+1})-g(x_{k})-\alpha _kg(v_{k+1}) \leqslant {}0. \end{aligned} \end{aligned}$$

Plugging this and the previous inequality into (123) gives

$$\begin{aligned} {\mathcal {L}}_{k+1}-{\mathcal {L}}_{k}\leqslant -\alpha _k{\mathcal {L}}_{k+1}, \end{aligned}$$

which establishes (121).

By the relation \(L\alpha _k^2=\gamma _{k}(1+\alpha _k)\) and the contraction (121), it is clear that the two estimates (87) and (88) hold true. This completes the proof of this theorem. \(\square \)