From differential equation solvers to accelerated first-order methods for convex optimization

Convergence analysis of accelerated first-order methods for convex optimization problems are presented from the point of view of ordinary differential equation solvers. A new dynamical system, called Nesterov accelerated gradient flow, has been derived from the connection between acceleration mechanism and $A$-stability of ODE solvers, and the exponential decay of a tailored Lyapunov function along with the solution trajectory is proved. Numerical discretizations are then considered and convergence rates are established via a unified discrete Lyapunov function. The proposed differential equation solver approach can not only cover existing accelerated methods, such as FISTA, G\"{u}ler's proximal algorithm and Nesterov's accelerated gradient method, but also produce new algorithms for composite convex optimization that possess accelerated convergence rates.


Introduction
We consider iterative methods for solving the unconstrained minimization problem min where V is a Hilbert space, and f : V → R ∪ {+∞} is a properly closed convex function.We shall first consider smooth f on the entire space V and later focus on the composite case f = h + g where both h (smooth) and g (non-smooth) are convex on some (simple) closed convex set Q ⊆ V .We are mainly interested in the development and analysis of accelerated first-order methods.Suppose V is equipped with the inner product (•, •) and the correspondingly induced norm • .We use •, • to denote the duality pair between V * and V , where V * is the continuous dual space of V and is endowed with the conventional dual norm • * .For any interval I ⊆ R, denote by C k (I; V ) the space of all k-times continuous differentiable V -valued functions on I, and the superscript k is dropped when k = 0. Let Ω ⊆ V be some closed convex subset, we say f ∈ S 1 µ (Ω) if it is continuous differentiable on Ω and there exists µ 0 such that f (x) − f (y) − ∇f (y), x − y µ 2 x − y 2 ∀ x, y ∈ Ω. ( We call (2) the µ-convexity of f and when µ > 0, we say f is strongly convex.We also write f ∈ S 1,1 µ,L (Ω) if f ∈ S 1 µ (Ω) and ∇f is Lipschitz continuous on Ω: there exists 0 < L < ∞ such that By [29,Theorem 2.1.5],this implies the inequality For Ω = V , we shall write S 1 µ (Ω) and S 1,1 µ,L (Ω) as S 1 µ and S 1,1 µ,L , respectively.The above functional classes are what we work with in this paper.As for the optimization problem (1), we also care about the global minimizer(s) of f .For strongly convex f ∈ S 1 µ (Q) with µ > 0, it is well-known that the minimizer exists uniquely.However, for convex case µ = 0, to promise the existence of minimizers, additional assumption, such as coercivity condition, which means f (x) → ∞ when x → ∞, is usually imposed.Throughout, we denote by argminf the set of global minimizers of (1) and assume it is nonempty.
One approach to derive the gradient descent (GD) method is discretizing an ordinary differential equation (ODE), i.e., the so-called gradient flow: x ′ (t) = −∇f (x(t)), t > 0. ( Here we introduce an artificial time variable t and x ′ is the derivative taken with respect to t.For ease of notation, in the sequel, we shall omit t when no confusion arises.The simplest forward (explicit) Euler method with step size η k > 0 leads to the GD method In the terminology of numerical analysis, it is well-known that this method is conditionally A-stable (cf.Section 2), and for f ∈ S 1,1 µ,L with 0 µ L < ∞, the step size η k = 1/L is allowed to get the rate (see [29,Chapter 2]) One can also consider the backward (implicit) Euler method which is unconditionally A-stable (cf.Section 2) and coincides with the wellknown proximal point algorithm (PPA) [33] x k+1 = prox η k f (x k ) := argmin Note that this method allows f to be nonsmooth and possesses linear convergence rate even for convex objective, as long as η k η > 0 for all k > 0.

Main results
Let us start from the quadratic objective f (x) = 1 2 x ⊤ Ax over R d , for which the gradient flow (5) reads simply as where A is symmetric positive semi-definite and makes f ∈ S 1,1 µ,L .Instead of solving (9), we turn to a general linear ODE system y ′ = Gy.(10) Briefly speaking, our main idea is to seek such a system (10) with some asymmetric block matrix G that transforms the spectrum of A from the real line to the complex plane and reduces the condition number from κ(A) = L/µ to κ(G) = O( L/µ).Afterwards, accelerated gradient methods may be constructed from A-stable methods for solving (10) with a significant larger step size and consequently improve the contraction rate from O((1 − µ/L) k ) to O((1 − µ/L) k ).Furthermore, to handle the convex case µ = 0, we combine the transformation G with suitable time scaling technique; for more details, we refer to Section 2. One successful and important transformation example is given below where the built-in scaling factor γ is positive and satisfies Based on this, for general f ∈ S 1 µ with µ 0, we replace A in (11) with ∇f and write y = (x, v) to obtain a first-order dynamical system: where γ solves (12).Eliminating v, we arrive at a second-order ODE of x: which is actually a heavy ball model (cf.(21)) with novel variable damping coefficients in front of x ′′ and x ′ .Thanks to the scaling factor γ, we can handle both the convex case (µ = 0) and the strongly convex case (µ > 0) in a unified way.Moreover, we shall prove that for µ 0, there holds the exponential decay property L(t) e −t L(0), t > 0, (15) for a tailored Lyapunov function where x * ∈ argminf is a global minimizer of f .Accelerated gradient methods based on numerical discretizations of the dynamical system (13) with f ∈ S 1,1 µ,L are then considered and analyzed by means of a discrete version of the Lyapunov function (16).It will be shown that the implicit scheme (see (72)) possesses linear convergence rate as long as the time step size is uniformly bounded below.This matches the exponential decay rate (15) in the continuous level.Also, for convex case µ = 0, this implicit method amounts to an accelerated PPA, that is very close to Güler's PPA [20] and enjoys the same rate O(1/k 2 ) (cf.Theorem 4.2).In Section 5, for semiimplicit schemes with suitable corrections (either an extrapolation or a gradient step), we prove the following convergence rate which is optimal in the sense of [29].Moreover, we can recover Nesterov's optimal method [27,29] exactly from a semi-implicit scheme with gradient descent correction; see Section 6.Therefore, instead of using estimate sequence, our ODE approach provides an alternative derivation of Nesterov's method and hopefully more intuitive for understanding the acceleration mechanism.From this point of view, we name both ( 13) and ( 14) as Nesterov accelerated gradient (NAG) flow.
As a proof of concepts, we also generalize our NAG flow to the composite optimization problem where Q ⊆ V is a (simple) closed convex set, h ∈ S 1,1 µ,L (Q) with 0 µ L < ∞ and g : V → R ∪ {+∞} is proper, closed and convex.As usual, we use dom g to denote the effective domain of g and assume that Q ∩ dom g = ∅.
Treating (18) as an unconstrained minimization of F = f + i Q where i Q denotes the indicator function of Q, the generalized version of ( 14) is a second-order differential inclusion We shall give the solution existence of (19) in proper sense and then obtain the exponential decay (15) for almost all t > 0.
For the unconstrained case Q = V , by using the tool of composite gradient mapping [29, Chapter 2], a semi-implicit scheme with correction for the generalized NAG flow (19) is presented and leads to an accelerated proximal gradient method (APGM); see Algorithm 2. We also give a simplified variant that is closely related to FISTA [12].For the constrained problem (18), an accelerated forward-backward method is proposed in Algorithm 4. Both two algorithms call the proximal operation of g (over Q) only once in each iteration, and they are proved to share the same convergence rate (17).
The rest of this paper is organized as follows.In the continuing of the introduction, we will review some existing works devoting to the accelerated gradient methods from the ODE point of view.Next, in Section 2, we shall explain the acceleration mechanism from A-stability theory of ODE solvers and derive our NAG flow model as well.Then in Section 3 we focus on the NAG flow and prove its exponential decay.After that, accelerated gradient methods based on numerical discretizations are proposed and analyzed in Sections 4, 5 and 6.Finally, in Section 7, we extend the our NAG flow to composite optimization and propose two new accelerated methods with convergence rate analysis.

Related works
The well-known momentum method can be traced back to 1960s.In [34], Polyak studied the heavy ball (HB) method and its continuous analogue, the heavy ball dynamical system: Local linear convergence results for ( 20) and ( 21) via spectrum analysis were established in [34,Theorem 9].Note that the HB method (20) adds a momentum term up to the gradient step and is sensitive to its parameters.For f ∈ S 1,1 µ,L , it shares the same theoretical convergence rate (6) as the gradient descent method; see [18,40].To our best knowledge, no work has established the global accelerated rate (17) for the original HB method (20).Recently, Nguyen et al. [26] developed the so-called accelerated residual method which combines (20) with an extra gradient descent step: Numerically, they verified the efficiency and usefulness of this method with a restart strategy.We refer to [1,3,11,19] for further investigations of the HB system (21).
To understand an accelerated gradient method with the rate O(1/k 2 ) proposed by Nesterov [27], Su, Boyd and Candès [37] derived the following secondorder ODE where α > 0 and f ∈ S 1,1 0,L .If α 3 or 1 < α < 3 and (f − f (x * )) (α−1)/2 is convex, then they proved the decay rate O(t −2 ).If α 3 and f is strongly convex, then they also obtained a faster rate O(t −2α/3 ).Later on, Aujol and Dossal [10] established a generic result: where β > 0 and (f − f (x * )) β is convex.Almost at the same time, Attouch et al. [8] obtained the estimate (23) for β = 1 and considered numerical discretizations for (22) with the convergence rate O(k − min{2,2α/3} ), which matches the continuous decay property (23) for the case β = 1.Also, Vassilis et al. [42] studied the non-smooth version of ( 22): They proved that the solution trajectory of (24) converges to a minimizer of f and derived the decay estimate (23) for β = 1.For more works and generalizations related to the model (22) and the corresponding algorithms, we refer to [2,5,6,7,14] and references therein.Recently, Wibisono et al. [43] introduced a Lagrangian for smooth and convex f , where the scaling function α : R + → R + is continuous and β : R + → R + satisfies The Lagrangian (25) itself introduces a variational problem, the Euler-Lagrange equation to which is They then established the convergence rate (cf.[43,Theorem 2.1]) by means of the Lyapunov function Following this work, for any f ∈ S 1 µ with µ > 0, Wilson et al. [44] introduced another Lagrangian whose Euler-Lagrange equations reads as with the same scaling function α in (25).They proved the decay estimate (28) as well, by using the Lyapunov function When α = √ µ, (29) gives the following model which reduces to an HB system (cf.( 21)).From another Lyapunov function Siegel [38] also derived (31) and proved that In addition, Siegel [38] and Wilson et al. [44] proposed two semi-explicit schemes for (31) individually.Both of their schemes are supplemented with an extra gradient descent step and share the same linear convergence rate O((1 − µ/L) k ).
Recently, introducing the so-called duality gap which is the difference of appropriate upper and lower bound approximations for the objective function, Diakonikolas and Orecchia [17] presented a general framework for the construction and analysis of continuous time dynamical systems and the corresponding numerical discretizations.They recovered several existing ODE models such as the gradient flow (5), the mirror descent dynamic system and its accelerated version.We mention that the derivation of our NAG model and analyses of discrete algorithms are fundamentally different from their duality gap technique.

Stability of ODE Solvers and Acceleration
In what follows, for any square matrix M ∈ R d×d , σ(M ) denotes the spectrum of M , i.e., the set of all eigenvalues of M .The spectral radius is then defined by ρ(M ) := max λ∈σ(M) |λ|, and when M is invertible, its condition number κ(M ) := ρ(M −1 )ρ(M ).If σ(M ) ⊂ R, then λ min (M ) and λ max (M ) stand for the minimum and maximum of σ(M ), respectively.Moreover, • 2 is the usual 2-norm for vectors and matrices.
To present our main idea as simple as possible, in this section, unless other specified, we restrict ourselves to the quadratic objective f (x) = 1 2 x ⊤ Ax, where A is a symmetric matrix with the bound For this model example, ∇f (x) = Ax and the gradient flow (5) reads as x ′ = −Ax.The global minimal is achieved at x * = 0, and when µ > 0, the condition number of A is κ(A) = L/µ.

A-stability of ODE solvers
Let G ∈ R d×d and assume Re(λ) < 0 for all λ ∈ σ(G).For the linear ODE system y ′ = Gy, y(0 it is not hard to derive that y(t) 2 → 0 as t → ∞ (see [13,Theorem 7] for instance).Hence y * = 0 is an equilibrium of the dynamic system (32).We now recall the concept of A-stability of ODE solves [23,39].A one-step method φ for (32) with step size α > 0 can be formally written as As y * = 0 is an equilibrium point, (33) also gives the error equation.The scheme φ is called absolute stable or A-stable if ρ(E φ (α, G)) < 1 from which the asymptotic convergence y k → 0 follows (cf.[16,Theorem 6.1]).If ρ(E φ (α, G)) < 1 holds for all α > 0, then it is called unconditionally A-stable, and if ρ(E φ (α, G)) < 1 for any α ∈ I, where I is an interval of the positive half line, then the scheme is called conditionally A-stable.
. Therefore for Astable methods the linear convergence follows directly from the norm contraction In general cases, however, bounding the spectral radius by one does not imply the norm contraction, i.e., (34) may not be true when E φ (α, G) is non-normal, even if (33) is A-stable.Nevertheless, we shall continue using the tool of Astability through spectral analysis and comment on its limitation in Section 2.6.

Implicit and Explicit Euler methods
It is well known that the implicit Euler (IE) method 1 for all α > 0 since all eigenvalues of αG lie on the left of the complex plane and their distance to 1 is larger than one.Moreover, as it has no restriction on the step size, the implicit Euler method can achieve faster convergent rate by time rescaling which is equivalent to chose a large step size.The explicit Euler method is only conditionally A-stable.Let us consider the case G = −A with µ > 0.
Then (35) is exactly the gradient descent method for minimizing 1 2 x ⊤ Ax.It is not hard to obtain that Hence ρ(E GD (α, −A)) < 1 provided 0 < α < 2/L.Thanks to the symmetry of A, we have E GD (α, −A) 2 = ρ(E GD (α, −A)) and the norm convergence with linear rate follows.Moreover, based on (36), a standard argument outputs the optimal choice α * = 2/(µ + L), which gives the minimal spectrum A quasi-optimal but simpler choice is α * = 1/L which yields We formulate the convergence rates (37) and (38) in terms of the condition number κ(A) as it is invariant to the rescaling of A, i.e., κ(cA) = κ(A) for any real number c = 0. To be A-stable, one has to choose 0 < α < 2/λ max (A).It seems that a simple rescaling to cA can reduce λ max (cA) and thus enlarge the range of the step size.However, the condition number κ(cA) = κ(A) is invariant.From this we see that for the GD method (35), the simple rescaling cA is in vain.
The magnitude of the step size is relative to min |λ(G)|.To fix the discussion, we chose G = −A/µ in (35) so that λ min (A/µ) = 1.Then in order for the explicit Euler method to be A-stable it is equivalent to choose α = O(1/κ(A)) which leads to the contraction rate 1 − 1/κ(A).Consequently for ill-conditioned problems, tiny step size proportional to 1/κ(A) is required.
Rather than the rescaling, our main intuition is to seek some transformation G of A, that keeps min |λ(G)| = 1 and reduces κ(A) to κ(G) = O( κ(A)).We wish to construct explicit A-stable methods which can enlarge the step size from O(1/κ(A)) to O(1/ κ(A)) and consequently improve the contraction rate from 1 − 1/κ(A) to O(1 − 1/ κ(A)).

Transformation to the complex plane
Let us first consider the case µ > 0 and embed A into some 2 × 2 block matrix G with a rotation built-in.Specifically, we construct two candidates Due to the asymmetrical fact, σ(A) will be transformed from the real line to the complex plane.This may shrink the condition number; see the following result.
Proposition 2.1.For G = G HB or G NAG given in (39), it satisfies Re(λ) < 0 for any λ ∈ σ(G), which promises the decay property y(t) 2 → 0 for the system y ′ = Gy.Moreover, we have κ( Proof.Let us first consider G = G HB .As A is symmetric, we can write A = U ΛU ⊤ with unitary matrix U and diagonal matrix Λ consisting of eigenvalues of A. By applying the similar transform to G with the block diagonal matrix diag(U, U ), it suffices to consider eigenvalues of It is clear that det R HB = θ and tr R HB = −2 < 0. In addition, since |tr R HB | 2 4 det R HB , any eigenvalue λ R ∈ σ(R HB ) is a complex number and Apply the similar transformation with P = 1 0 1 1 , we observe that This completes the proof of this proposition.
We write y = (x, v) ⊤ and eliminate v in y ′ = Gy to get a second order ODE of x, in which we replace Ax by general form ∇f (x).Both G HB and G NAG yield the same ODE which is a special case of the HB model (cf.( 21)).Note that we can find a lot of transformations G and derive corresponding ODE models.Indeed, given any G that meets our demand, both cG and QGQ −1 are acceptable candidates, where c > 0 and Q is some invertible matrix.We are not going further deep beyond those two transformations given in (39) for the strongly convex case µ > 0 but aim to combine the transformation with refined time scaling to propose another one for convex case µ = 0 in Section 2.5.

Acceleration from a Gauss-Seidel splitting
We now consider numerical discretization for (32) with G = G HB and G NAG given in (39).As discussed in Section 2.2, the implicit Euler method is unconditional A-stable.But computing (I − αG) −1 needs significant effort and may not be practical.
One may hope that the explicit Euler method y k+1 = (I + αG)y k will be A-stable with step size α = O(1/κ(G)) = O(1/ κ(A)).Unfortunately, unlike the discussion for (35) with G = −A, where σ(I − αA) lies on the real line and ρ(I − αA) can be easily shrunk by choosing α = 1/ρ(A) (cf.( 36)), the general asymmetric G spreads the spectrum on the complex plane.For both where no acceleration has been obtained.
We then expect that an explicit scheme closer to the implicit Euler method will hopefully have better stability with larger step size.Motivated by the Gauss-Seidel (GS) method [45] for computing (I−αG) −1 , we consider the matrix splitting G = M + N with M being the lower triangular part of G (including the diagonal part) and N = G − M , and propose the following Gauss-Seidel splitting scheme which gives the relation Note that for G = G HB and G NAG , the scheme ( 41) is still explicit as the lower triangular block matrix I − αM can be inverted easily, without involving A −1 .The spectrum bound is given below and for the algebraic proof details, we refer to Appendix A.

Dynamic time rescaling for the convex case
The ODE model (40) given in Section 2.3 cannot treat the case µ = 0 and the previous spectral analysis also fails.Equivalently the condition number κ(A) is infinity and the spectrum bound becomes 1.To conquer this, a careful rescaling is needed.Throughout this subsection, we assume µ = 0.For the gradient flow one can easily establish the sub-linear rate f (x(t)) C/t; see [37].To recover the exponential rate, we introduce a time rescaling t(s) = e s and let y(s) = x(t(s)).Then (43) becomes the rescaled gradient flow with the scaling factor γ(s) = e s .Besides, the previous sublinear rate f (x(t)) C/t turns into f (y(s)) Ce −s .That is in the continuous level, we can achieve exponential decay through suitable rescaling of time even for convex case µ = 0 .Now let us go back to our model case f (x) = 1 2 x ⊤ Ax with µ = 0 and λ max (A) = L. Coupled with the transformation G NAG , we consider where y = (x, v) ⊤ and This gives a second-order ODE in terms of x: which is in the HB type but with variable damping coefficients.
Obviously, the implicit Euler method for solving (45) is still unconditional A-stable.We now apply the GS splitting ( 41) to (45) and get where E(α k , G(γ k+1 )) is defined in (42).The equation ( 46) is discretized by that Eliminating v k in (48) will give an HB method with variable coefficients Instead of studying the spectrum bound E(α k , G(γ k+1 )) which is 1, we apply the scaling technique to obtain a regularized matrix With a careful chosen step size, the spectrum bound of E k is given below and for the algebraic proof details, we refer to Appendix A. We note that, the step size choice in Theorem 2.2 is only to agree with the setting of Lemma B.2 and for general choice Lα 2 k /γ k = O(1) and suitable initial value γ 0 , it is possible to maintain the spectrum bound (51) together with the decay estimate (52).
Theorem 2.2.If γ 0 = L and Lα 2 k = γ k (1 + α k ), then both the scheme (48) and its equivalent form (50) are A-stable and we have which further implies that

Limitation of spectral analysis
For quadratic objective f , both the ODE models (40)  Particularly, for the HB system (21), it is shown in [22] that the parameters optimized for linear ODE models does not guarantee the global convergence for a nonlinear system.
To provide rigorous convergence analysis for both continuous and discrete levels, in the sequel we shall introduce the tool of Lyapunov function.Following many related works [6,37,43], we first analyze some proper ODEs via a Lyapunov function, then construct optimization algorithms from numerical discretizations of continuous models and use a discrete Lyapunov function to establish the convergence rates of the proposed algorithms.
Lemma 3.1.For any u, v, w ∈ V , we have We first present the well-posedness of (57) and prove the exponential decay property of the Lyapunov function (58).
µ,L with µ 0, then the NAG flow (57) admits a unique solution x ∈ C 2 ([0, ∞); V ) and moreover which implies that Proof.Basically, as ∇f is Lipschitz continuous, applying the standard existence and uniqueness results of ODE (see [9,Theorem 4.1.4])yields the fact that the system (56) admits a unique classical solution (x, v) ) is also the unique solution to our NAG flow (57).It remains to prove (59), which yields the exponential decay (60) immediately.A straightforward calculation yields that and by ( 54) and (56), we replace γ ′ and v ′ by their right hand side terms and obtain Let us focus on the last term.Thanks to Lemma 3.1, and the gradient term is split as follows By the relation x ′ = v − x, the first term in (62) becomes −∇f (x), x ′ which cancels the first term in (61).Combining all identities together gives As f is µ-strongly convex (cf.( 2)), there holds and plugging this into (63) implies that which proves (59) and thus completes the proof of this lemma.

Rescaling property
Based on our NAG flow (56) (or (57)), it is possible to use time scaling technique to construct more ODE systems with any desirable convergence rate.It is worth distinguishing the connection and difference with existing dynamical models.

An Implicit Scheme
Exponential decay of an implicit discretization for solving (56) can be established, which is more or less straightforward since one can easily follow the proof from the continuous problem.However, the implicit scheme requires efficient solver or proximal calculation and may not be practical sometimes.It is presented here to bridge the analysis from the continuous level to semi-implicit and explicit schemes.Consider the following implicit scheme where α k > 0 denotes the time step size to discretize the time derivative and the parameter equation ( 54) is also discretized implicitly We shall present the convergence result for the implicit scheme (72).To do so, we introduce a suitable Lyapunov function which is clearly a discrete analogue to the continuous one (58).
Theorem 4.1.If f ∈ S 1 µ with µ 0, then for the scheme (72) with α k > 0, we have Proof.It suffices to prove Let us mimic the proof of Lemma 3.2.Instead of the derivative, we compute the difference as follows Analogously to the continuous level, we focus on the last term By (72), it follows that and we use Lemma 3.1 to split the cross term into squares: For the gradient term, we have v k+1 − x * = v k+1 − x k+1 + x k+1 − x * and use (72) to obtain Consequently, using the µ-strongly convex property (cf.( 2)) of f and dropping surplus negative square terms, we see This proves (75) and concludes the proof of this theorem.
We observe from Theorem 4.1 that the fully implicit scheme (72) achieves linear convergence rate as long as α k α > 0 for all k > 0 and larger α k yields faster convergence rate.We also mention that (72) can be rewritten as where the proximal operator prox η k f has been introduced in (8) and Therefore, it allows f to be nonsmooth and we claim that Theorem 4.1 still holds true in this case.One just replaces the gradient ∇f (x k+1 ) with the subgradient (y k − x k+1 )/η k ∈ ∂f (x k+1 ); see (105) and (112).
For convex case, i.e., µ = 0, our method (76) is very close to Güler's proximal point algorithm [20]    then for the proximal point algorithm (76) with µ = 0, we have Proof.For convenience and later use, define a sequence {ρ k } by that As mentioned above, Theorem 4.1 holds true for such a nonsmooth f and thus it is evident that L k ρ k L 0 .Invoking Lemma B.2 proves (77) and it is trivial to obtain (78) from (77).This finishes the proof.
Remark 4.1.Note that the sequence {γ k } in (73) is bounded: 0 < γ k max{µ, γ 0 } and γ k → µ as k → ∞.Hence, even for large γ 0 , the Lyapunov function L k is asymptotically bounded as k → ∞.In addition, from (77) and (78), we see that, for small γ 0 , the convergence rate depends on γ 0 but large γ 0 does not pollute the final rate.This fact also holds true for all the forthcoming convergence bounds.

Gauss-Seidel Splitting with Corrections
This section considers the Gauss-Seidel splitting (41), which is a semi-implicit discretization.In Section 2.4, we have established the spectrum bound O(1 − µ/L) with step size α k = O( µ/L) for quadratic objectives.However, as we summarized in Section 2.6, spectrum analysis is not sufficient for (norm) convergence.Indeed, in the sequel, we further show that, for the discrete Lyapunov function (74), with any step size α k > 0, the naive discretization (41), reformulated as (80), does not lead to the contraction property like (75).Therefore, this motivates us to add some proper correction steps.

The Gauss-Seidel splitting
Recall the Gauss-Seidel splitting (41): given step size α k > 0 and previous result In addition, the parameter equation (54) of γ is still discretized implicitly via (73).
Lemma 5.1.If f ∈ S 1 µ with µ 0, then for (80) with any step size α k > 0, we have and Proof.Following the proof of Theorem 4.1, we start from the difference Using the update for x k+1 in (80), we split the gradient term as below As f ∈ S 1 µ , we obtain that Ignoring all the negative terms of the second line, the above estimate implies (81).
As we see, different from (75), the estimate (81) contains a combination of a negative term and another cross term.Obviously, an easy application of Cauchy-Schwarz inequality yields This yields another bound (82) that only involves a positive gradient norm.

A predictor-corrector method
To conquer the cross term −α k ∇f (x k+1 ), v k+1 − v k in (81), we add an extra extrapolation step to (80) which can be thought as an semi-implicit discretization of x ′ = v − x with the newest update v k+1 .More precisely, consider This is in line with the spirit of the predictor-corrector method for ODE solvers [39, Section 3.8].The variable y k is the predictor produced by an explicit scheme and x k+1 is the corrector by an implicit scheme.It can be also thought of as a symmetric Gauss-Seidel iteration for approximating the implicit Euler method.Again, the parameter equation (54) of γ is still discretized via (73).
As the first two steps of (83) agree with (80), with x k+1 being y k , recalling the estimate (81), we have where Therefore, it follows that From the update for y k and x k+1 in (83), we find the relation and if f ∈ S 1,1 µ,L , then there comes the estimate (cf.( 4)) As a result, we obtain The second term vanishes if we choose suitable step size; see the theorem below.
It remains to check (88) for all k 1.From Lemma B.2 we easily get On the other hand, by the relation Lα 2 0 = γ 0 (1 + α 0 ), it is evident that The above estimate also indicates that Applying Lemma B.2 shows that α k min{γ 0 , µ}/L and it follows that Collecting this estimate and (90) establishes the final rate (88) and thus completes the proof of this theorem.
Remark 5.1.We mention that the estimate (88) verifies the claim made previously in Remark 4.1.That is, the convergence rate given in Theorem 5.1 depends on small γ 0 but is robust when γ 0 L.

Correction via a gradient step
Motivated by the estimate (82), we can also aim to cancel the gradient norm square.One preferable choice is the gradient descent step and according to our discussion below, any other correction step satisfying the decay property (94) is acceptable.Note that the two numerical schemes proposed in [38] and [44] for the HB equation ( 31) also have additional gradient steps.
As what we did before, replace x k+1 by y k in (80) and consider the following corrected scheme: given The implicit discretization (73) for the parameter equation ( 54) keeps unchanged here.In the first equation y k can be solved in terms of the known data (x k , v k ).
After that, we evaluate the gradient ∇f (y k ) once and use it to update (x k+1 , v k+1 ).
, then for the corrected scheme (91) together with (73), we have where L k is defined by (74), and both the two estimates (87) and (88) hold true here.
Proof.According to (82) in Lemma 5.1, we have established that where L k is defined by (84).Thanks to the additional gradient step in (91), we have the basic gradient descent inequality: which comes from (4) since f ∈ S 1,1 µ,L and implies that Plugging this into (93) gives This together with the condition Lα 2 k = γ k (1 + α k ) yields (92).As we choose the same step size as Theorem 5.1, based on the contraction (92), it is trivial to conclude that the two estimates (87) and (88) hold true here indeed.This completes the proof of this theorem.

A Corrected Semi-implicit Scheme from NAG Method
In this section, we consider another semi-implicit scheme which comes exactly from Nesterov accelerated gradient method.

NAG method
In [29, Chapter 2, General scheme of optimal method], by using the estimate sequence, Nesterov presented an accelerated gradient method for solving (1) with f ∈ S 1,1 µ,L with 0 µ L < ∞; see Algorithm 1 below.

2:
Compute 6: 7: end for Note that we have many choices for x k+1 in step 5 of Algorithm 1.One noticeable example is the gradient descent step (see [29,Chapter 2,Constant Step Scheme, I]): With this choice, the sequence {v k } in Algorithm 1 can be eliminated and y k+1 is updated by that (see [29,Chapter 2,Constant Step Scheme, II]) where α k+1 ∈ (0, 1) is calculated from the quadratic equation Step Scheme, III].In particular, if µ = 0, then Algorithm 1 (with x k+1 updated by (95)) coincides with the accelerated scheme proposed by Nesterov early in the 1980s [27].

NAG method as a corrected semi-implicit scheme
After simple calculations, we can rewrite Algorithm 1 as an equivalent form where in addition we update x k+1 satisfying Surprisingly, (96) formulates a semi-implicit discretization for our NAG flow (56) with a correction step (97) and an explicit discretization for the equation (54) of γ.Similar to (91), we can adopt the gradient descent step which promises (97).
Based on subtle algebraic calculations of the estimate sequence, Nesterov [29, Chapter 2] proved the convergence rate of Algorithm 1.In the following, we give an alternative proof by using the Lyapunov function (74).
, then for Algorithm 1, i.e., the scheme (96) together with (97), we have 0 < α k 1 and where L k is defined by (74).Consequently for all k 0, Moreover, for all k 1, where C γ0,L has been defined in (89).
Proof.Let us first prove (98).By (96), we find and a direct computation gives Dropping the negative term − y k − x k 2 and using the µ-convexity of f imply that and we get the inequality Consequently, by (97) and the relation Lα 2 k = γ k+1 , the right hand side of the above inequality is negative, which proves (98).
In this case, we modify (79) as follows then by (98) it is clear that L k ρ k L 0 , and invoking Lemma B.1 proves (99).
As the proof of (100) is very similar with that of (88), we omit the details here and conclude the proof of this theorem.
Remark 6.1.Similar to our corrected schemes (83) and (91), NAG method (i.e., Algorithm 1) generates a three-term sequence {(x k , y k , v k )} as well.If µ = 0, then they share the same convergence rate bound and when γ 0 = µ > 0, we have In view of the trivial fact we see the rates in (102) are asymptotically the same but NAG method can achieve a slightly better convergence rate.However, we note that they share the same computational complexity which is optimal, in the sense that [29] it achieves the complexity lower bound of first-order algorithms for the function class S 1,1 µ,L with 0 µ L < ∞.Remark 6.2.Unlike the gradient descent method, the function value f (x k ) of accelerated gradient methods may not decrease in each step.It is the discrete Lyapunov function L k that is always decreasing; see (86), ( 92) and (98).Remark 6.3.To reduce the function value, one can adopt the restating strategy [31].Specifically, given (γ 0 , v 0 , x 0 ), if f (x k ) is increasing after k-iteration, then set k = 0 and restart the iteration process with another initial guess (γ 0 , ṽ0 , x0 ).By Theorems 5.1, 5.2 and 6.1, when f ∈ S 1,1 0,L and γ 0 = L, v 0 = x 0 , we only have the sublinear convergence rate where we used (4), which promises Additionally, assume f satisfies the quadratic growth condition with σ > 0: where dist(x, argminf ) = inf x * ∈argminf x − x * .As (103) holds for all x * ∈ argminf , we have immediately that Therefore, as analyzed in [30], if we consider fixed restart technique [31] every k steps, then after N = nk steps we will get Evidently, the optimal choice k # = e 4L/σ yields the linear rate If the parameter σ is unknown, one can use the adaptive restart technique [31].When f is quadratic and convex, changing γ k from L to µ periodically will smoothing out error in different frequencies and can further optimize the constant in front of the accelerated rate.That is, the dynamically changing parameter {γ k } hopefully outperforms the fixed one γ k = µ.For general nonlinear convex functions, a rigorous justification of the restart strategy is under investigation.

Composite Convex Optimization
In this part we mainly focus on the composite optimization where Q ⊆ V is a simple closed convex set, h ∈ S 1,1 µ,L (Q) with 0 µ L < ∞ and g : V → R ∪ {+∞} is proper, closed and convex, and Q ∩ dom g = ∅.In general g is not differentiable but its subdifferential ∂g exists as a set-valued function.More precisely, the subdifferential ∂g(x) of g at x is defined by that Remark 7.1.For the case that h ∈ S 1,1 0,L (Q) and g is µ-strongly convex with µ 0, we can split h + g as (h(x) + µ 2 x 2 ) + (g(x) − µ 2 x 2 ), which reduces to our current assumption for (104).
We shall apply our ODE solver approach to the problem (104).The first step is to generalize the dynamical system (56) to the current nonsmooth setting.Basically, we set F = f + i Q with i Q being the indicator function of Q and obtain a differential inclusion for minimizing F on V , which is equivalent to minimize f over Q.After that, optimization methods (see Algorithms 2 and 4) for solving the original problem (104) with the accelerated convergence rate are proposed from numerical discretizations of the continuous model (106).This is a proof of the effective and usefulness of our NAG flow model (106) and the ODE solver approach, by which we can construct new accelerated methods.

Continuous model
For minimizing a nonsmooth function F over V , our NAG flow (56) becomes a differential inclusion To ensure solution existence, suitable initial conditions shall be imposed later.Correspondingly, the second-order ODE (57) reads as a second-order differential inclusion Above, the scaling factor γ is still the solution to (54).
As the subdifferential ∂F is a set-valued maximal monotone operator, classical C 2 solution to (107) may not exist because discontinuity can occur in x ′ .Therefore, the concept of energy-conserving solution has been introduced in [15,32,36].
Let us assume the initial data where T domF (x 0 ) denotes the tangent cone of domF at x 0 : In addition, we shall introduce some vector-valued functional spaces.Given any interval I ⊂ R, let M (I; V ) be the space of V -valued Radon measures on I; for any m ∈ N and 1 p ∞, W m,p (I; V ) denotes the standard V -valued Sobolev space [21]; the space of all V -valued functions with bounded variation is defined by BV (I; V ) [4].Also, W m,p loc (I; V ) and BV loc (I; V ) consist of all the sets W m,p (ω; V ) and BV (ω; V ) respectively, where ω ⊂ I is any compact subset.Definition 7.1.We call x : [0, ∞) → V an energy-conserving solution to (107) with initial data (108) if it satisfies the following.
3. For almost all t > 0, there holds the energy equality: 4. There exists some ν ∈ M (0, ∞; V ) such that holds in the sense of distributions, and for any T > 0, we have In [25], the problem (107) has been extended to a general case where ξ stands for small perturbation.Therefore, according to [25, Theorem 2.1], we have the existence of an energy-conserving solution to (107) and by [25, Theorems 2.2 and 2.3], we obtain the exponential decay, which is a nonsmooth version of (60).

An APGM for unconstrained optimization
Let us first consider the unconstrained case Q = V , i.e., min where f ∈ S 1,1 µ,L with 0 µ L < ∞ and g : V → R ∪ {+∞} is a properly closed and convex function and possibly nonsmooth.

Gradient mapping
To treat the nonsmooth part g, we introduce the tool of gradient mapping.Following [29,Chapter 2], given any η > 0, the composite gradient mapping G f (x, η) of f at x is defined by that where S f (x, η) := prox ηg (x − η∇h(x)) and the proximal operator prox ηg has been defined by (8).Note that S f (x, η) is clearly well-defined and so is G f (x, η).
It is well known [33,35] that which yields the fact From this we conclude that the fixed-point set of S f (•, η) is argminf .Indeed, x = S f (x, η) if and only if 0 ∈ ∂f (x).We also observe from (113) that the gradient mapping (111) is defined reversely from the proximal-gradient step for minimizing f = h + g, i.e., Hence it plays the role of the gradient ∇f in the smooth case.Particularly, if g = 0, then G f (x, η) = ∇h(x) and S f (x, η) = x − η∇h(x) is nothing but a gradient step.
To move on, we present an auxiliary lemma, which is a key ingredient for our convergence analysis.As we will fix η = 1/L, for simplicity, we set G f (x) := G f (x, 1/L) and S f (x) := S f (x, 1/L).
Lemma 7.1.Assume f = h + g, where h ∈ S 1,1 µ,L with 0 µ L < ∞ and g : V → R ∪ {+∞} is properly closed and convex.Then for any x, y ∈ V , Proof.Since h ∈ S 1,1 µ,L , applying (2) and (4) gives which implies that Observing (113), we get Summing the above two inequalities and using the split we finally arrive at (114) and end the proof of this lemma.
Remark 7.3.For a fixed x, the right hand side of (114) defines a quadratic approximation of f at x, and it is strongly reminiscent of the quadratic lower bound approximation (2) for the smooth case.However, compared to (2), the constant is shifted from f (x) to a lower value f The first order part is G f (x) instead of the subgradient at x.The quadratic part µ 2 y − x 2 is due to the µ-convexity.

The proposed method
Based on the corrected semi-implicit scheme (91) for NAG flow (56), it is possible to generalize it to solve the differential inclusion (106).Indeed, we just replace the gradient ∇f (y k ) with the gradient mapping G f (y k ) and set the correction as x k+1 = S f (y k ).More precisely, consider Once x k+1 = S f (y k ) = prox ηg (y k − η∇h(y k )) is obtained, we can update v k+1 with known datum x k , y k , v k and x k+1 .Thus in each iteration, (115) only calls the proximal operation prox ηg once.
We still use the step size Lα 2 k = γ k (1 + α k ) and summarize the semi-implicit scheme (115) in Algorithm 2, which is called semi-implicit APGM (Semi-APGM for short).Also, the convergence rate is derived via the discrete Lyapunov function (74).
Proof.The proof of ( 116) is very similar to that of (92).Indeed, replacing x k+1 and its gradient ∇f (x k+1 ) in (80) respectively with y k and G f (y k ), we can proceed as the proof of Lemma 5.1 and use Lemma 7.1 to obtain where L k is defined by (84).Thanks to the relation Lα 2 k = γ k (1+α k ), the second line of (117) vanishes, and inserting the identity f (y k ) − f (x k+1 ) = L k − L k+1 into (117) gives (116).Based on this, it is not hard to see that both (87) and (88) hold true.This finishes the proof of this theorem.
We mention that with another choice we can drop the sequence {v k } from (115).The procedure is not straightforward but very similar to that of Nesterov's optimal method in [29, page 80].We omit the details and only list the following algorithm.

5:
Update x k+1 = prox ηg (y k − η∇h(y k )).6: end for This can be viewed as a generalization of [29, Chapter 2, Constant Step Scheme, II] to problem (110).Particularly, for convex case µ = 0, it is very close to FISTA [12].Both of them share the same spirit: applying one proximal gradient step first and then using some extrapolation formulae.The difference comes only from the use of the two sequences {α k } and {β k }.We also claim that Algorithm 3 has the same accelerated convergence rate as Algorithm 2, i.e., O(min(L/k 2 , (1 + µ/L) −k )).In contrast FISTA is designed for µ = 0 and has only the sublinear rate O(L/k 2 ).
We also mention that, accelerated proximal gradient methods for solving (110) with only one evaluation of prox ηg in each iteration can be found in [38] (only for strongly convex case) and [24, Chapter 2, Algorithm 2.2] (for both convex and strongly convex cases).
Both Algorithms 2 and 3 cannot be applied directly to the general constraint case (104).The main issue comes from the definition (111) of the gradient mapping G f (x, η), where we shall impose the restriction x ∈ Q and calculate the proximal operator prox ηg over Q to obtain S f (x) ∈ Q.For both two algorithms, we shall compute x k+1 = S f (y k ) = prox ηg (y k − η∇h(y k )).But the sequence {y k } in Algorithms 2 and 3 may be outside the constraint set.This is not acceptable because ∇h(y k ) might not exist: for instance, Q = [0, ∞) and h is the entropy function.
The original FISTA [12] and the methods in [38] and [24, Chapter 2, Algorithm 2.2] mentioned above, cannot be applied to the constrained problem (104) either.This stimulates us to propose a new operator splitting scheme to conquer this problem.

An accelerated forward-backward method for constrained optimization
We now go back to the constrained problem (104).As mentioned above, the tool of gradient mapping is not convenient for us to handle this case.To avoid using it, we utilize the separable structure of f = h + g and apply explicit and implicit schemes for h and g, respectively.This is the so-called operator splitting technique in ODE solvers and is also known as the forward-backward method.
Let us start from the predictor-corrector scheme (83) and rewrite it as follows For minimizing f = h + g over Q, we modify the above method as follows where x 0 , v 0 ∈ Q and the parameter sequence {γ k } comes from the implicit discretization (73) of the equation (54).Clearly, as convex combinations are used, the method (119) preserves the three-term sequence {(x k , y k , v k )} in Q and it requires the proximal computation of g over Q only once in each iteration.We choose Lα 2 k = γ k (1 + α k ) as before and rewrite (119) in Algorithm 4, which is called semi-implicit accelerated forward-backward (Semi-AFB for short) method.
In [41], Tseng considered problem (104) only with convex assumption, i.e., µ = 0, and proposed an APGM that possesses the rate O(L/k 2 ).By using the technique of estimate sequence, Nesterov [28] presented an accelerated method for solving (104) with the assumption that h is L-smooth over Q and g is µstrongly convex with µ 0. Both our Algorithm 4 and Nesterov's method Algorithm 4 Semi-AFB method for solving min x∈Q [h(x) + g(x)] Input: x 0 , v 0 ∈ Q, γ 0 > 0 and L > 0.

4:
Set 7: end for generate a three-term sequence {(x k , y k , v k )} and have the same accelerated rate O(min(L/k 2 , (1 + µ/L) −k )); see [28,Theorem 6] and our Theorem 7.3.However, as mentioned in [12], the later used an accumulated history of the past iterations to build recursively a sequence of estimate functions, and each iteration, to update x k+1 and v k+1 , Nesterov's method in [28] calls prox g over Q twice.Below, we shall establish the convergence rate of Algorithm 4 via the analysis of a Lyapunov function.It is well known [28, Eq (2.9)] that the first-order optimality condition for v k+1 in (119) is the variational inequality where p k+1 ∈ ∂g(v k+1 ).Expanding w k , we observe the relation where x ∈ Q is arbitrary.
Proof.As before, we calculate the difference Thanks to (120), we have As h is µ-strongly convex on Q, by the fact {(x k , y k , v k )} ⊂ Q, it follows that Therefore, collecting all the estimates and dropping surplus negative terms related to − x k − y k 2 and − y k − v k+1 2 , we get Let us consider the additional terms in (123).In view of (4), we have h(x k+1 ) − h(y k ) ∇h(y k ), Thanks to the extrapolation step for x k+1 (see step 6 in Algorithm 4), we find a crucial relation which gives that Thanks to Lemma B.2, there holds This proves (52) and completes the proof of Theorem 2.2.
with suitable step size, they share the similar rate; see [20, Theorem 2.3] and Theorem 4.2 below.Theorem 4.2.If f is proper, closed and convex and we choose α 2 k