Understanding the Acceleration Phenomenon via High-Resolution Differential Equations

Gradient-based optimization algorithms can be studied from the perspective of limiting ordinary differential equations (ODEs). Motivated by the fact that existing ODEs do not distinguish between two fundamentally different algorithms---Nesterov's accelerated gradient method for strongly convex functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative limiting process that yields high-resolution ODEs. We show that these ODEs permit a general Lyapunov function framework for the analysis of convergence in both continuous and discrete time. We also show that these ODEs are more accurate surrogates for the underlying algorithms; in particular, they not only distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the identification of a term that we refer to as "gradient correction" that is present in NAG-SC but not in the heavy-ball method and is responsible for the qualitative difference in convergence of the two methods. We also use the high-resolution ODE framework to study Nesterov's accelerated gradient method for (non-strongly) convex functions, uncovering a hitherto unknown result---that NAG-C minimizes the squared gradient norm at an inverse cubic rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a family of new optimization methods that are shown to maintain the accelerated convergence rates of NAG-C for smooth convex functions.


Introduction
Machine learning has become one of the major application areas for optimization algorithms during the past decade. While there have been many kinds of applications, to a wide variety of problems, the most prominent applications have involved large-scale problems in which the objective function is the sum over terms associated with individual data, such that stochastic gradients can be computed cheaply, while gradients are much more expensive and the computation (and/or storage) of Hessians is often infeasible. In this setting, simple first-order gradient descent algorithms have become dominant, and the effort to make these algorithms applicable to a broad range of machine learning problems has triggered a flurry of new research in optimization, both methodological and theoretical.
We will be considering unconstrained minimization problems,
$$\min_{x \in \mathbb{R}^n} f(x), \tag{1.1}$$
where $f$ is a smooth convex function. Perhaps the simplest first-order method for solving this problem is gradient descent. Taking a fixed step size $s$, gradient descent is implemented as the recursive rule
$$x_{k+1} = x_k - s\nabla f(x_k),$$
given an initial point $x_0$. As has been known at least since the advent of conjugate gradient algorithms, improvements to gradient descent can be obtained within a first-order framework by using the history of past gradients. Modern research on such extended first-order methods arguably dates to Polyak [Pol64, Pol87], whose heavy-ball method incorporates a momentum term into the gradient step. This approach allows past gradients to influence the current step, while avoiding the complexities of conjugate gradients and permitting a stronger theoretical analysis. Explicitly, starting from initial points $x_0, x_1 \in \mathbb{R}^n$, the heavy-ball method updates the iterates according to
$$x_{k+1} = x_k + \alpha(x_k - x_{k-1}) - s\nabla f(x_k), \tag{1.2}$$
where $\alpha > 0$ is the momentum coefficient. While the heavy-ball method provably attains a faster rate of local convergence than gradient descent near a minimum of $f$, it does not come with global guarantees. Indeed, [LRP16] demonstrate that even for strongly convex functions the method can fail to converge for some choices of the step size.

The next major development in first-order methodology was due to Nesterov, who discovered a class of accelerated gradient methods that have a faster global convergence rate than gradient descent [Nes83, Nes13]. For a $\mu$-strongly convex objective $f$ with $L$-Lipschitz gradients, Nesterov's accelerated gradient method (NAG-SC) involves the following pair of update equations:
$$y_{k+1} = x_k - s\nabla f(x_k), \qquad x_{k+1} = y_{k+1} + \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\,(y_{k+1} - y_k), \tag{1.3}$$
given an initial point $x_0 = y_0 \in \mathbb{R}^n$. Equivalently, NAG-SC can be written in a single-variable form that is similar to the heavy-ball method: starting from $x_0$ and $x_1 = x_0 - \frac{2s\nabla f(x_0)}{1+\sqrt{\mu s}}$,
$$x_{k+1} = x_k + \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}(x_k - x_{k-1}) - s\nabla f(x_k) - \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\, s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big). \tag{1.4}$$
Like the heavy-ball method, NAG-SC blends gradient and momentum contributions into its update direction, but it uses the specific momentum coefficient $\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$. Nesterov also developed the estimate sequence technique to prove that NAG-SC achieves an accelerated linear convergence rate, $f(x_k) - f(x^\star) \le O\big((1-\sqrt{\mu s})^k\big)$, if the step size satisfies $0 < s \le 1/L$. Moreover, for a (weakly) convex objective $f$ with $L$-Lipschitz gradients, Nesterov defined a related accelerated gradient method (NAG-C) that takes the following form:
$$y_{k+1} = x_k - s\nabla f(x_k), \qquad x_{k+1} = y_{k+1} + \frac{k}{k+3}(y_{k+1} - y_k), \tag{1.5}$$
with $x_0 = y_0 \in \mathbb{R}^n$. The choice of momentum coefficient $\frac{k}{k+3}$, which tends to one, is fundamental to the estimate-sequence-based argument used by Nesterov to establish the following inverse quadratic convergence rate:
$$f(x_k) - f(x^\star) \le O\!\left(\frac{\|x_0 - x^\star\|^2}{sk^2}\right), \tag{1.6}$$
for any step size $s \le 1/L$. Under an oracle model of optimization complexity, the convergence rates achieved by NAG-SC and NAG-C are optimal for smooth strongly convex functions and smooth convex functions, respectively [NY83].
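For concreteness, the following sketch (a minimal implementation under our own choice of toy objective; not code from the paper) implements the four update rules just described---gradient descent, the heavy-ball method (1.2), NAG-SC (1.3), and NAG-C (1.5)---as plain NumPy loops.

```python
import numpy as np

def gradient_descent(grad, x0, s, n_iter):
    x = x0.copy()
    for _ in range(n_iter):
        x = x - s * grad(x)
    return x

def heavy_ball(grad, x0, x1, s, alpha, n_iter):
    # x_{k+1} = x_k + alpha * (x_k - x_{k-1}) - s * grad f(x_k)   (1.2)
    x_prev, x = x0.copy(), x1.copy()
    for _ in range(n_iter):
        x, x_prev = x + alpha * (x - x_prev) - s * grad(x), x
    return x

def nag_sc(grad, x0, s, mu, n_iter):
    # two-variable form (1.3) with momentum (1 - sqrt(mu s)) / (1 + sqrt(mu s))
    alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x, y = x0.copy(), x0.copy()
    for _ in range(n_iter):
        y_next = x - s * grad(x)
        x, y = y_next + alpha * (y_next - y), y_next
    return x

def nag_c(grad, x0, s, n_iter):
    # two-variable form (1.5) with momentum k / (k + 3)
    x, y = x0.copy(), x0.copy()
    for k in range(n_iter):
        y_next = x - s * grad(x)
        x, y = y_next + k / (k + 3) * (y_next - y), y_next
    return x

# Example usage on a hypothetical quadratic f(x) = 0.5 * x^T diag(d) x.
d = np.array([0.1, 1.0, 10.0])
grad = lambda x: d * x
mu, L = d.min(), d.max()
x0 = np.ones(3)
print(nag_c(grad, x0, s=1 / L, n_iter=100))
```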

Gradient Correction: Small but Essential
Throughout the present paper, we let $\alpha = \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$ and $x_1 = x_0 - \frac{2s\nabla f(x_0)}{1+\sqrt{\mu s}}$ to define a specific implementation of the heavy-ball method in (1.2). This choice of the momentum coefficient and the second initial point renders the heavy-ball method and NAG-SC identical except for the last (small) term in (1.4). Despite their close resemblance, however, the two methods are in fact fundamentally different, with contrasting convergence results (see, for example, [Bub15]). Notably, the former algorithm in general only achieves local acceleration, while the latter achieves acceleration for all initial values of the iterates [LRP16]. As a numerical illustration, Figure 1 presents the trajectories that arise from the two methods when minimizing an ill-conditioned convex quadratic function. We see that the heavy-ball method exhibits pronounced oscillations throughout the iterations, whereas NAG-SC is monotone in the function value once the iteration counter exceeds 50.
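The qualitative difference visible in Figure 1 is easy to reproduce. The sketch below (our own minimal illustration on a hypothetical ill-conditioned quadratic, not the exact experiment behind Figure 1) runs the heavy-ball method and NAG-SC from the same initialization with the same step size and records the function values.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T diag(d) x (a hypothetical example).
d = np.array([1e-2, 1.0])                   # eigenvalues: mu = 1e-2, L = 1.0
mu, L = d.min(), d.max()
f = lambda x: 0.5 * np.sum(d * x**2)
grad = lambda x: d * x

s = 1.0 / (4 * L)
alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
x0 = np.array([1.0, 1.0])

# Heavy-ball (1.2) with the specific alpha and x_1 chosen in the text.
x_prev, x = x0, x0 - 2 * s * grad(x0) / (1 + np.sqrt(mu * s))
hb_vals = [f(x0), f(x)]
for _ in range(200):
    x, x_prev = x + alpha * (x - x_prev) - s * grad(x), x
    hb_vals.append(f(x))

# NAG-SC in its two-variable form (1.3).
x, y = x0, x0
nag_vals = [f(x0)]
for _ in range(200):
    y_next = x - s * grad(x)
    x, y = y_next + alpha * (y_next - y), y_next
    nag_vals.append(f(x))

print("heavy-ball f(x_k), last 5:", np.round(hb_vals[-5:], 10))
print("NAG-SC     f(x_k), last 5:", np.round(nag_vals[-5:], 10))
```

Plotting the two value sequences shows the oscillatory behavior of the heavy-ball method and the eventually monotone decrease of NAG-SC described above.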
This striking difference between the two methods can only be attributed to the last term in (1.4):
$$\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}} \cdot s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big), \tag{1.7}$$
which we refer to henceforth as the gradient correction. This term corrects the update direction in NAG-SC by contrasting the gradients at consecutive iterates. Although an essential ingredient in NAG-SC, the effect of the gradient correction is unclear from the vantage point of the estimate-sequence technique used in Nesterov's proof. Accordingly, while the estimate-sequence technique delivers a proof of acceleration for NAG-SC, it does not explain why the absence of the gradient correction prevents the heavy-ball method from achieving acceleration for strongly convex functions.

A recent line of work takes continuous-time limits of optimization algorithms to obtain ordinary differential equations (ODEs) that can be analyzed using the rich toolbox associated with ODEs, including Lyapunov functions. For instance, [SBC16] shows that
$$\ddot{X}(t) + \frac{3}{t}\dot{X}(t) + \nabla f(X(t)) = 0, \tag{1.8}$$
with initial conditions $X(0) = x_0$ and $\dot{X}(0) = 0$, is the exact limit of NAG-C (1.5) obtained by taking the step size $s \to 0$. Alternatively, the starting point may be a Lagrangian or Hamiltonian framework [WWJ16]. In either case, the continuous-time perspective not only provides analytical power and intuition, but it also provides design tools for new accelerated algorithms. Unfortunately, existing continuous-time formulations of acceleration stop short of differentiating between the heavy-ball method and NAG-SC. In particular, these two methods have the same limiting ODE (see, for example, [WRJ16]):
$$\ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + \nabla f(X(t)) = 0, \tag{1.9}$$
and, as a consequence, this ODE does not provide any insight into the stronger convergence results for NAG-SC as compared to the heavy-ball method. As will be shown in Section 2, this is because the gradient correction $\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\cdot s\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) = O(s^{1.5})$ is an order of magnitude smaller than the other terms in (1.4) when $s = o(1)$. Consequently, the gradient correction is not reflected in the low-resolution ODE (1.9) associated with NAG-SC, which is derived by simply taking $s \to 0$ in both (1.2) and (1.4).

Overview of Contributions
Just as there is not a single preferred way to discretize a differential equation, there is not a single preferred way to take a continuous-time limit of a difference equation. Inspired by dimensional-analysis strategies widely used in fluid mechanics, in which physical phenomena are investigated at multiple scales via the inclusion of various orders of perturbations [Ped13], we propose to incorporate $O(\sqrt{s})$ terms into the limiting process for obtaining an ODE, including the (Hessian-driven) gradient correction $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ arising from (1.7). This will yield high-resolution ODEs that differentiate between the NAG methods and the heavy-ball method.
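Before stating these ODEs, one can check numerically that the gradient correction (1.7) indeed behaves like $s^{3/2}\nabla^2 f(x_k)v_k$ with $v_k = (x_k - x_{k-1})/\sqrt{s}$, an order $\sqrt{s}$ smaller than the gradient step $s\nabla f(x_k)$. The sketch below is our own illustration on a hypothetical log-sum-exp objective; the value of $\mu$ here only sets the momentum coefficient.

```python
import numpy as np

# A smooth, non-quadratic test objective (hypothetical): f(x) = log(sum(exp(A x))).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))

def grad(x):
    p = np.exp(A @ x); p /= p.sum()
    return A.T @ p

def hess(x):
    p = np.exp(A @ x); p /= p.sum()
    return A.T @ (np.diag(p) - np.outer(p, p)) @ A

mu = 0.05
x0 = rng.standard_normal(10)

for s in [1e-2, 1e-3, 1e-4]:
    alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    # A few NAG-SC steps in the single-variable form (1.4).
    x_prev, x = x0, x0 - 2 * s * grad(x0) / (1 + np.sqrt(mu * s))
    for _ in range(5):
        new = x + alpha * (x - x_prev) - s * grad(x) - alpha * s * (grad(x) - grad(x_prev))
        x_prev, x = x, new
    corr = alpha * s * (grad(x) - grad(x_prev))          # gradient correction (1.7)
    v = (x - x_prev) / np.sqrt(s)                        # discrete velocity
    print(f"s = {s:.0e}:  ||gradient correction|| = {np.linalg.norm(corr):.2e},  "
          f"||s^(3/2) Hessian*velocity|| = {np.linalg.norm(s**1.5 * hess(x) @ v):.2e},  "
          f"||s*gradient|| = {np.linalg.norm(s * grad(x)):.2e}")
```

As $s$ shrinks, the first two printed magnitudes track each other, and both are roughly $\sqrt{s}$ times the third, which is the observation that motivates keeping $O(\sqrt{s})$ terms in the limit.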
Concretely, the high-resolution ODE of the heavy-ball method (1.2) is
$$\ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + (1+\sqrt{\mu s})\nabla f(X(t)) = 0, \tag{1.10}$$
that of NAG-SC (1.3) is
$$\ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + \sqrt{s}\,\nabla^2 f(X(t))\dot{X}(t) + (1+\sqrt{\mu s})\nabla f(X(t)) = 0, \tag{1.11}$$
both with initial conditions $X(0) = x_0$ and $\dot{X}(0) = -\frac{2\sqrt{s}\,\nabla f(x_0)}{1+\sqrt{\mu s}}$, and that of NAG-C (1.5) is
$$\ddot{X}(t) + \frac{3}{t}\dot{X}(t) + \sqrt{s}\,\nabla^2 f(X(t))\dot{X}(t) + \Big(1 + \frac{3\sqrt{s}}{2t}\Big)\nabla f(X(t)) = 0 \tag{1.12}$$
for $t \ge 3\sqrt{s}/2$, with $X(3\sqrt{s}/2) = x_0$ and $\dot{X}(3\sqrt{s}/2) = -\sqrt{s}\,\nabla f(x_0)$. The Hessian-driven damping term $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ in (1.11) arises from the gradient correction via the approximation $\nabla f(x_k) - \nabla f(x_{k-1}) \approx \nabla^2 f(X(t_k))\dot{X}(t_k)\sqrt{s}$ for small $s$, with the identification $t = k\sqrt{s}$. The gradient correction $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ in NAG-C arises in the same fashion. Interestingly, although both NAGs are first-order methods, their gradient corrections bring in second-order information from the objective function. Despite being small, the gradient correction has a fundamental effect on the behavior of both NAGs, and this effect is revealed by inspection of the high-resolution ODEs. We provide two illustrations of this.
• Effect of the gradient correction in acceleration. Viewing the coefficient of $\dot{X}$ as a damping coefficient, the coefficient $2\sqrt{\mu} + \sqrt{s}\,\nabla^2 f(X)$ of $\dot{X}$ in the high-resolution ODE (1.11) of NAG-SC is adaptive to the position $X$, in contrast to the fixed damping coefficient $2\sqrt{\mu}$ in the ODE (1.10) for the heavy-ball method. To appreciate the effect of this adaptivity, imagine that the velocity $\dot{X}$ is highly correlated with an eigenvector of $\nabla^2 f(X)$ with a large eigenvalue, so that the large friction $(2\sqrt{\mu} + \sqrt{s}\,\nabla^2 f(X))\dot{X}$ effectively "decelerates" the trajectory of the ODE (1.11) of NAG-SC. This feature of NAG-SC is appealing, as taking a cautious step in the presence of high curvature generally helps avoid oscillations. Figure 1 and the left plot of Figure 2 confirm the superiority of NAG-SC over the heavy-ball method in this respect.
If we can translate this argument to the discrete case, we can understand why NAG-SC achieves acceleration globally for strongly convex functions but the heavy-ball method does not. We will be able to make this translation by leveraging the high-resolution ODEs to construct discrete-time Lyapunov functions that allow maximal step sizes to be characterized for NAG-SC and the heavy-ball method. The detailed analysis is given in Section 3.
• Effect of the gradient correction in gradient norm minimization. We will also show how to exploit the high-resolution ODE of NAG-C to construct a continuous-time Lyapunov function to analyze convergence in the setting of a smooth convex objective with $L$-Lipschitz gradients. Interestingly, the time derivative of the Lyapunov function is not only negative, but is bounded above by $-O\big(\sqrt{s}\,t^2\|\nabla f(X)\|^2\big)$. This bound arises from the gradient correction and, indeed, it cannot be obtained from the Lyapunov function studied in the low-resolution case by [SBC16]. This finer characterization in the high-resolution case allows us to establish a new phenomenon: NAG-C achieves an inverse cubic rate for minimizing the squared gradient norm. By comparison, from (1.6) and the $L$-Lipschitz continuity of $\nabla f$ we can only show that $\|\nabla f(x_k)\|^2 \le O(L^2/k^2)$. See Section 4 for further elaboration on this cubic rate for NAG-C.
As we will see, the high-resolution ODEs are based on a phase-space representation that provides a systematic framework for translating from continuous-time Lyapunov functions to discrete-time Lyapunov functions. In sharp contrast, the process for obtaining a discrete-time Lyapunov function for low-resolution ODEs presented by [SBC16] relies on "algebraic tricks" (see, for example, Theorem 6 of [SBC16]). On a related note, a Hessian-driven damping term also appears in ODEs for modeling Newton's method [AABR02, AMR12, APR16].

Related Work
There is a long history of using ODEs to analyze optimization methods [HM12, Sch00, Fio05]. Recently, the work of [SBC14, SBC16] has sparked a renewed interest in leveraging continuous dynamical systems to understand and design first-order methods and to provide more intuitive proofs for the discrete methods. Below is a necessarily incomplete review of recent work that uses continuous-time dynamical systems to study accelerated methods.

Organization and Notation
The remainder of the paper is organized as follows. In Section 2, we briefly introduce our high-resolution ODE-based analysis framework. This framework is used in Section 3 to study the heavy-ball method and NAG-SC for smooth strongly convex functions. In Section 4, we turn our focus to NAG-C for a general smooth convex objective. In Section 5 we derive some extensions of NAG-C. We conclude the paper in Section 6 with a list of future research directions. Most technical proofs are deferred to the Appendix.

We mostly follow the notation of [Nes13], with slight modifications tailored to the present paper. Let $\mathcal{F}^1_L(\mathbb{R}^n)$ be the class of $L$-smooth convex functions defined on $\mathbb{R}^n$; that is, $f \in \mathcal{F}^1_L(\mathbb{R}^n)$ if $f$ is convex, continuously differentiable, and satisfies
$$\|\nabla f(y) - \nabla f(x)\| \le L\|y - x\|$$
for all $x, y \in \mathbb{R}^n$, where $\|\cdot\|$ denotes the standard Euclidean norm and $L > 0$ is the Lipschitz constant. (Note that this implies that $\nabla f$ is also $L'$-Lipschitz for any $L' \ge L$.) The function class $\mathcal{F}^2_L(\mathbb{R}^n)$ is the subclass of $\mathcal{F}^1_L(\mathbb{R}^n)$ such that each $f$ has a Lipschitz-continuous Hessian. For $p = 1, 2$, let $\mathcal{S}^p_{\mu,L}(\mathbb{R}^n)$ denote the subclass of $\mathcal{F}^p_L(\mathbb{R}^n)$ such that each member $f$ is $\mu$-strongly convex for some $0 < \mu \le L$; that is,
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$$
for all $x, y \in \mathbb{R}^n$. Note that this is equivalent to the convexity of $f(x) - \frac{\mu}{2}\|x - x^\star\|^2$, where $x^\star$ denotes a minimizer of the objective $f$.

The High-Resolution ODE Framework
This section introduces a high-resolution ODE framework for analyzing gradient-based methods, with NAG-SC serving as a guiding example. Given a (discrete) optimization algorithm, the first step in this framework is to derive a high-resolution ODE using dimensional analysis; the next step is to construct a continuous-time Lyapunov function to analyze properties of the ODE; the third step is to derive a discrete-time Lyapunov function from its continuous counterpart; and the last step is to translate properties of the ODE into properties of the original algorithm. The overall framework is illustrated in Figure 3.

Figure 3: An illustration of our high-resolution ODE framework. The three solid straight lines represent Steps 1, 2 and 3, and the two curved lines denote Step 4. The dashed line is used to emphasize that it is difficult, if not impractical, to construct discrete Lyapunov functions directly from the algorithms.
Step 1: Deriving High-Resolution ODEs. Our focus is on the single-variable form (1.4) of NAG-SC. For any nonnegative integer $k$, let $t_k = k\sqrt{s}$ and assume $x_k = X(t_k)$ for some sufficiently smooth curve $X(t)$. Performing a Taylor expansion in powers of $\sqrt{s}$, we get
$$x_{k\pm 1} = X(t_k) \pm \dot{X}(t_k)\sqrt{s} + \frac{1}{2}\ddot{X}(t_k)s \pm \frac{1}{6}\dddot{X}(t_k)s^{3/2} + O(s^2). \tag{2.1}$$
We now use a Taylor expansion for the gradient correction, which gives
$$\nabla f(x_k) - \nabla f(x_{k-1}) = \nabla^2 f(X(t_k))\dot{X}(t_k)\sqrt{s} + O(s). \tag{2.2}$$
Multiplying both sides of (1.4) by $\frac{1+\sqrt{\mu s}}{s}$ and rearranging the equality, we can rewrite NAG-SC as
$$(1+\sqrt{\mu s})\,\frac{x_{k+1}-x_k}{s} - (1-\sqrt{\mu s})\,\frac{x_k - x_{k-1}}{s} = -(1+\sqrt{\mu s})\nabla f(x_k) - (1-\sqrt{\mu s})\big(\nabla f(x_k) - \nabla f(x_{k-1})\big). \tag{2.3}$$
Next, plugging (2.1) and (2.2) into (2.3), we have
$$\ddot{X}(t_k) + 2\sqrt{\mu}\,\dot{X}(t_k) + \sqrt{s}\,\nabla^2 f(X(t_k))\dot{X}(t_k) + (1+\sqrt{\mu s})\nabla f(X(t_k)) = O(s).$$
(The derivation of the low-resolution ODE of NAG-C, in contrast, relies on a lower-accuracy Taylor expansion; we illustrate the derivation of the three low-resolution ODEs in Appendix A.2, and they can be compared to the high-resolution ODEs that we derive here.)

Ignoring the $O(s)$ term, we obtain the high-resolution ODE of NAG-SC,
$$\ddot{X} + 2\sqrt{\mu}\,\dot{X} + \sqrt{s}\,\nabla^2 f(X)\dot{X} + (1+\sqrt{\mu s})\nabla f(X) = 0,$$
which is (1.11); the initial conditions $X(0) = x_0$ and $\dot{X}(0) = -\frac{2\sqrt{s}\,\nabla f(x_0)}{1+\sqrt{\mu s}}$ follow from the choice of $x_0$ and $x_1$. Our analysis is inspired by dimensional analysis [Ped13], a strategy widely used in physics to construct a series of differential equations that involve increasingly high-order terms corresponding to small perturbations. In more detail, taking a small $s$, one first derives a differential equation that consists only of $O(1)$ terms, then derives a differential equation consisting of both $O(1)$ and $O(\sqrt{s})$ terms, and next proceeds to a differential equation consisting of $O(1)$, $O(\sqrt{s})$ and $O(s)$ terms, and so on. Within this hierarchy, the gradient correction $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ is an $O(\sqrt{s})$ term that appears in the high-resolution ODEs of NAG-SC and NAG-C, but is not found in the high-resolution ODE of the heavy-ball method. As shown below, each ODE admits a unique global solution under mild conditions on the objective, and this holds for an arbitrary step size $s > 0$. The solution is accurate in approximating its associated optimization method if $s$ is small. To state the result, we use $C^2(I;\mathbb{R}^n)$ to denote the class of twice-continuously-differentiable maps from $I$ to $\mathbb{R}^n$, where $I = [0, \infty)$ for the heavy-ball method and NAG-SC, and $I = [1.5\sqrt{s}, \infty)$ for NAG-C.
Proposition 2.1. For any $f \in \mathcal{S}^2_\mu(\mathbb{R}^n) := \cup_{L\ge\mu}\mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$, each of the ODEs (1.10) and (1.11) with the specified initial conditions has a unique global solution $X \in C^2([0,\infty);\mathbb{R}^n)$. Moreover, the two methods converge to their high-resolution ODEs, respectively, in the sense that
$$\lim_{s\to 0}\ \max_{0\le k\le \frac{T}{\sqrt{s}}} \big\|x_k - X(k\sqrt{s})\big\| = 0$$
for any fixed $T > 0$.
In fact, Proposition 2.1 holds for $T = \infty$, because both the discrete iterates and the ODE trajectories converge to the unique minimizer when the objective is strongly convex.
Proposition 2.2. For any $f \in \mathcal{F}^2(\mathbb{R}^n) := \cup_{L>0}\mathcal{F}^2_L(\mathbb{R}^n)$, the ODE (1.12) with the specified initial conditions has a unique global solution $X \in C^2([1.5\sqrt{s},\infty);\mathbb{R}^n)$. Moreover, NAG-C converges to its high-resolution ODE in the sense that
$$\lim_{s\to 0}\ \max_{0\le k\le \frac{T}{\sqrt{s}}} \big\|x_k - X\big(k\sqrt{s} + 1.5\sqrt{s}\big)\big\| = 0$$
for any fixed $T > 0$.
The proofs of these propositions are given in Appendix A.3.1 and Appendix A.3.2.
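Proposition 2.1 is easy to check numerically. The sketch below is our own illustration: it assumes the form of the NAG-SC ODE (1.11) stated above together with a hypothetical diagonal quadratic, integrates the ODE with SciPy, and compares $X(k\sqrt{s})$ against the NAG-SC iterates.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Quadratic test problem (hypothetical): f(x) = 0.5 * x^T diag(d) x.
d = np.array([0.02, 1.0])
mu = d.min()
grad = lambda x: d * x

s = 0.01                                   # small step size, so the ODE should track NAG-SC closely
sqmu, sqs = np.sqrt(mu), np.sqrt(s)
x0 = np.array([1.0, 1.0])
alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))

# High-resolution ODE (1.11) for this quadratic, written as a first-order system in (X, V).
def rhs(t, z):
    x, v = z[:2], z[2:]
    dv = -2 * sqmu * v - sqs * d * v - (1 + np.sqrt(mu * s)) * d * x
    return np.concatenate([v, dv])

K = 200
t_grid = sqs * np.arange(K + 1)            # identification t_k = k * sqrt(s)
v0 = -2 * sqs * grad(x0) / (1 + np.sqrt(mu * s))
sol = solve_ivp(rhs, (0.0, t_grid[-1]), np.concatenate([x0, v0]),
                t_eval=t_grid, rtol=1e-9, atol=1e-12)

# NAG-SC iterates from the same initialization.
x, y = x0.copy(), x0.copy()
iterates = [x0.copy()]
for _ in range(K):
    y_next = x - s * grad(x)
    x, y = y_next + alpha * (y_next - y), y_next
    iterates.append(x.copy())

err = np.max(np.linalg.norm(np.array(iterates) - sol.y[:2].T, axis=1))
print("max_k ||x_k - X(k sqrt(s))|| =", err)   # shrinks as s -> 0 (Proposition 2.1)
```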
Step 2: Analyzing ODEs Using Lyapunov Functions. With these high-resolution ODEs in place, the next step is to construct Lyapunov functions for analyzing the dynamics of the corresponding ODEs, as is done in previous work [SBC16, WRJ16, LRP16]. For NAG-SC, we consider the Lyapunov function
$$\mathcal{E}(t) = (1+\sqrt{\mu s})\big(f(X) - f(x^\star)\big) + \frac{1}{4}\|\dot{X}\|^2 + \frac{1}{4}\big\|\dot{X} + 2\sqrt{\mu}(X - x^\star) + \sqrt{s}\,\nabla f(X)\big\|^2. \tag{2.4}$$
The first and second terms, $(1+\sqrt{\mu s})(f(X) - f(x^\star))$ and $\frac{1}{4}\|\dot{X}\|^2$, can be regarded, respectively, as the potential energy and the kinetic energy, while the last term is a mix of the two. For the mixed term, it is interesting to note that, along trajectories of (1.11), the time derivative of $\dot{X} + 2\sqrt{\mu}(X - x^\star) + \sqrt{s}\,\nabla f(X)$ equals $-(1+\sqrt{\mu s})\nabla f(X)$. The differentiability of $\mathcal{E}(t)$ will allow us to investigate properties of the ODE (1.11) in a principled manner. For example, we will show that $\mathcal{E}(t)$ decreases exponentially along the trajectories of (1.11), recovering the accelerated linear convergence rate of NAG-SC. Furthermore, a comparison between the Lyapunov function of NAG-SC and that of the heavy-ball method will explain why the gradient correction $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ yields acceleration in the former case. This is discussed in Section 3.1.
Step 3: Constructing Discrete Lyapunov Functions. Our framework makes it possible to translate continuous Lyapunov functions into discrete Lyapunov functions via a phase-space representation (see, for example, [Arn13]). We illustrate the procedure in the case of NAG-SC. The first step is to formulate explicit position and velocity updates:
$$x_k - x_{k-1} = \sqrt{s}\, v_{k-1}, \qquad v_k - v_{k-1} = -\frac{2\sqrt{\mu s}}{1-\sqrt{\mu s}}\, v_k - \sqrt{s}\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) - \frac{1+\sqrt{\mu s}}{1-\sqrt{\mu s}}\sqrt{s}\,\nabla f(x_k), \tag{2.5}$$
where the velocity variable is defined as $v_k = \frac{x_{k+1} - x_k}{\sqrt{s}}$. Interestingly, this phase-space representation has the flavor of a symplectic discretization, in the sense that the update for $x_k - x_{k-1}$ is explicit (it only depends on the last iterate $v_{k-1}$), while the update for $v_k - v_{k-1}$ is implicit (it depends on the current iterates $x_k$ and $v_k$). (Although this suggestion is a heuristic one, it is also possible to rigorously derive a symplectic integrator of the high-resolution ODE of NAG-SC.) The representation (2.5) suggests translating the continuous-time Lyapunov function (2.4) into a discrete-time Lyapunov function of the form
$$\mathcal{E}(k) = \mathrm{I} + \mathrm{II} + \mathrm{III} + \mathrm{IV}, \tag{2.6}$$
obtained by replacing continuous terms (e.g., $\dot{X}$) by their discrete counterparts (e.g., $v_k$). Akin to the continuous case (2.4), the terms I, II, and III correspond to potential energy, kinetic energy, and mixed energy, respectively, from a mechanical perspective, while IV is a final (small) negative term whose inclusion is a technical necessity; we discuss it in Section 3.2.
Step 4: Analyzing Algorithms Using Discrete Lyapunov Functions The last step is to map properties of high-resolution ODEs to corresponding properties of optimization methods. This step closely mimics Step 2 except that now the object is a discrete algorithm and the tool is a discrete Lyapunov function such as (2.6). Given that Step 2 has been performed, this translation is conceptually straightforward, albeit often calculation-intensive. For example, using the discrete Lyapunov function (2.6), we will recover the optimal linear rate of NAG-SC and gain insights into the fundamental effect of the gradient correction in accelerating NAG-SC. In addition, NAG-C is shown to minimize the squared gradient norm at an inverse cubic rate by a simple analysis of the decreasing rate of its discrete Lyapunov function.

Gradient Correction for Acceleration
In this section, we use our high-resolution ODE framework to analyze NAG-SC and the heavyball method. Section 3.1 focuses on the ODEs with an objective function f ∈ S 2 µ,L (R n ), and in Section 3.2 we extend the results to the discrete case for f ∈ S 1 µ,L (R n ). Finally, in Section 3.3 we offer a comparative study of NAG-SC and the heavy-ball method from a finite-difference viewpoint.
Throughout this section, the strategy is to analyze the two methods in parallel, thereby highlighting the differences between them. In particular, the comparison will demonstrate the vital role of the gradient correction, namely $\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\cdot s\big(\nabla f(x_k) - \nabla f(x_{k-1})\big)$ in the discrete case and $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ in the ODE case, in making NAG-SC an accelerated method.

The ODE Case
The following theorem characterizes the convergence rate of the high-resolution ODE corresponding to NAG-SC.
Theorem 1 (Convergence of NAG-SC ODE). Let $f \in \mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$. For any step size $0 < s \le 1/L$, the solution $X = X(t)$ of the high-resolution ODE (1.11) satisfies
$$f(X(t)) - f(x^\star) \le O\!\left(\frac{\|x_0 - x^\star\|^2}{s}\, e^{-\frac{\sqrt{\mu}\,t}{4}}\right).$$
The theorem states that the function value $f(X)$ tends to the minimum $f(x^\star)$ at a linear rate.
The proof of Theorem 1 is based on analyzing the Lyapunov function $\mathcal{E}(t)$ for the high-resolution ODE of NAG-SC, as defined in (2.4). The next lemma states the key property we need from this Lyapunov function.

Lemma 3.1 (Lyapunov function for NAG-SC ODE). Let $f \in \mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$. For any step size $s > 0$, and with $X = X(t)$ being the solution to the high-resolution ODE (1.11), the Lyapunov function (2.4) satisfies
$$\frac{d\mathcal{E}(t)}{dt} \le -\frac{\sqrt{\mu}}{4}\,\mathcal{E}(t) - \frac{\sqrt{s}}{2}\Big(\|\nabla f(X)\|^2 + \dot{X}^\top\nabla^2 f(X)\dot{X}\Big).$$

The proof of Theorem 1 relies on Lemma 3.1 only through the inequality $\dot{\mathcal{E}}(t) \le -\frac{\sqrt{\mu}}{4}\mathcal{E}(t)$; the remaining term $\frac{\sqrt{s}}{2}\big(\|\nabla f(X)\|^2 + \dot{X}^\top\nabla^2 f(X)\dot{X}\big) \ge 0$ plays no role at the moment, but Section 3.2 will shed light on its profound effect in the discretization of the high-resolution ODE of NAG-SC.
Proof of Theorem 1. Lemma 3.1 implies $\dot{\mathcal{E}}(t) \le -\frac{\sqrt{\mu}}{4}\mathcal{E}(t)$. Integrating over $t$, we get $\mathcal{E}(t) \le e^{-\sqrt{\mu}\,t/4}\,\mathcal{E}(0)$. Recognizing the initial conditions of the ODE (1.11) and applying the Cauchy-Schwarz inequality, these two inequalities yield a bound on $f(X(t)) - f(x^\star)$ that is valid for all $s > 0$. To simplify the constant, note that $L$ can be replaced by $1/s$ in the analysis since $s \le 1/L$. A bit of further analysis, using $\mu s \le \mu/L \le 1$, then completes the proof of Theorem 1.
We now consider the heavy-ball method (1.2). Recall that the momentum coefficient is set to $\alpha = \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$. The following theorem characterizes the rate of convergence of this method.
Theorem 2 (Convergence of heavy-ball ODE). Let $f \in \mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$. For any step size $0 < s \le 1/L$, the solution $X = X(t)$ of the high-resolution ODE (1.10) satisfies
$$f(X(t)) - f(x^\star) \le O\!\left(\frac{\|x_0 - x^\star\|^2}{s}\, e^{-\frac{\sqrt{\mu}\,t}{4}}\right).$$
As in the case of NAG-SC, the proof of Theorem 2 is based on a Lyapunov function:
$$\mathcal{E}(t) = (1+\sqrt{\mu s})\big(f(X) - f(x^\star)\big) + \frac{1}{4}\|\dot{X}\|^2 + \frac{1}{4}\big\|\dot{X} + 2\sqrt{\mu}(X - x^\star)\big\|^2, \tag{3.3}$$
which is the same as the Lyapunov function (2.4) for NAG-SC except for the lack of the $\sqrt{s}\,\nabla f(X)$ term. In particular, (2.4) and (3.3) are identical if $s = 0$. The following lemma considers the decay rate of (3.3).
Lemma 3.2 (Lyapunov function for the heavy-ball ODE). Let $f \in \mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$. For any step size $s > 0$, the Lyapunov function (3.3) for the high-resolution ODE (1.10) satisfies
$$\frac{d\mathcal{E}(t)}{dt} \le -\frac{\sqrt{\mu}}{4}\,\mathcal{E}(t).$$
The proof of Theorem 2 follows the same strategy as the proof of Theorem 1. In brief, Lemma 3.2 gives $\mathcal{E}(t) \le e^{-\sqrt{\mu}\,t/4}\mathcal{E}(0)$ by integrating over the time parameter $t$. Recognizing the initial conditions in the high-resolution ODE of the heavy-ball method and using the $L$-Lipschitz continuity of $\nabla f$, Lemma 3.2 yields the bound stated in Theorem 2 for any step size $s \le 1/L$; the final simplification of the constant uses $0 < \mu s \le \mu/L \le 1$. The proofs of Lemma 3.1 and Lemma 3.2 share similar ideas. In view of this, we present only the proof of the former here, deferring the proof of Lemma 3.2 to Appendix B.1.
Proof of Lemma 3.1. Along trajectories of (1.11), differentiating the Lyapunov function (2.4) and substituting the ODE gives an expression for $\frac{d\mathcal{E}(t)}{dt}$; this is (3.4). Furthermore, $\langle\nabla f(X), X - x^\star\rangle$ is greater than or equal to both $f(X) - f(x^\star) + \frac{\mu}{2}\|X - x^\star\|^2$ and $\mu\|X - x^\star\|^2$, due to the $\mu$-strong convexity of $f$. Together with (3.4), these bounds show that the time derivative of this Lyapunov function can be bounded as in (3.5). Next, the Cauchy-Schwarz inequality yields a bound on the cross term, giving (3.6). Combining (3.5) and (3.6) completes the proof of the lemma.
Remark 3.3. The only inequality in (3.4) is due to the term $\frac{\sqrt{s}}{2}\big(\|\nabla f(X)\|^2 + \dot{X}^\top\nabla^2 f(X)\dot{X}\big)$, which is discussed right after the statement of Lemma 3.1. This term results from the gradient correction $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ in the NAG-SC ODE. For comparison, this term does not appear in Lemma 3.2 in the case of the heavy-ball method, as its ODE does not include the gradient correction and, accordingly, its Lyapunov function (3.3) is free of the $\sqrt{s}\,\nabla f(X)$ term.

The Discrete Case
This section carries the results of Section 3.1 over to the two discrete algorithms, namely NAG-SC and the heavy-ball method. Here we consider an objective $f \in \mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$, since second-order differentiability of $f$ is not required by the two discrete methods. Recall that both methods start with an arbitrary $x_0$ and $x_1 = x_0 - \frac{2s\nabla f(x_0)}{1+\sqrt{\mu s}}$.

Theorem 3 (Convergence of NAG-SC). Let $f \in \mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$. If the step size is set to $s = 1/(4L)$, the iterates $\{x_k\}_{k=0}^\infty$ generated by NAG-SC satisfy
$$f(x_k) - f(x^\star) \le O\!\left(L\|x_0 - x^\star\|^2\Big(1 + \tfrac{1}{6}\sqrt{\mu s}\Big)^{-k}\right)$$
for all $k \ge 0$.
In brief, the theorem states that $\log(f(x_k) - f(x^\star)) \le -O\big(k\sqrt{\mu/L}\big)$, which matches the optimal rate for minimizing smooth strongly convex functions using only first-order information [Nes13].
Although this optimal rate of NAG-SC is well known in the literature, this is the first Lyapunov-function-based proof of the result.
As indicated in Section 2, the proof of Theorem 3 rests on the discrete Lyapunov function (2.6). Recall that this function is derived by writing NAG-SC in the phase-space representation (2.5). Analogous to Lemma 3.1, the following lemma gives an upper bound on the difference $\mathcal{E}(k+1) - \mathcal{E}(k)$.
Lemma 3.4 (Lyapunov function for NAG-SC). Let $f \in \mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$. Taking any step size $0 < s \le 1/(4L)$, the discrete Lyapunov function (2.6) with $\{x_k\}_{k=0}^\infty$ generated by NAG-SC satisfies
$$\mathcal{E}(k+1) - \mathcal{E}(k) \le -\frac{\sqrt{\mu s}}{6}\,\mathcal{E}(k+1).$$
The form of the inequality ensured by Lemma 3.4 is consistent with that of Lemma 3.1. Alternatively, it can be written as $\mathcal{E}(k+1) \le \frac{\mathcal{E}(k)}{1 + \frac{\sqrt{\mu s}}{6}}$. With Lemma 3.4 in place, we give the proof of Theorem 3.
Proof of Theorem 3. Given $s = 1/(4L)$, one first verifies inequality (3.7), which follows by combining two elementary bounds on the terms of the Lyapunov function at initialization. Next, we inductively apply Lemma 3.4, yielding (3.8), namely $\mathcal{E}(k) \le \big(1+\frac{\sqrt{\mu s}}{6}\big)^{-k}\mathcal{E}(0)$. Using the initialization of NAG-SC, one can show (3.9), which relates $f(x_k) - f(x^\star)$ to $\mathcal{E}(k)$. Taking $s = 1/(4L)$ in (3.9), it follows from (3.7) and (3.8) that the claimed bound holds, with a constant factor $C_{\sqrt{\mu/L}}$ that is less than five by virtue of $\mu/L \le 1$. This completes the proof.
We now turn to the heavy-ball method (1.2). Recall that $\alpha = \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$.

Theorem 4 (Convergence of heavy-ball method). Let $f \in \mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$. If the step size is set to $s = \mu/(16L^2)$, the iterates $\{x_k\}_{k=0}^\infty$ generated by the heavy-ball method converge to the minimum at a linear rate satisfying $\log(f(x_k) - f(x^\star)) \le -O(k\mu/L)$ for all $k \ge 0$.

The heavy-ball method thus minimizes the objective at the rate $-O(k\mu/L)$ on the logarithmic scale, as opposed to the optimal rate $-O\big(k\sqrt{\mu/L}\big)$ obtained by NAG-SC. The acceleration phenomenon is therefore not observed in the heavy-ball method for minimizing functions in the class $\mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$. On the surface, this difference is attributable to the much smaller step size $s = \mu/(16L^2)$ in Theorem 4 compared with $s = 1/(4L)$ in Theorem 3. Further discussion of this difference is given after Lemma 3.5 and in Section 3.3.
In addition to allowing us to complete the proof of Theorem 4, Lemma 3.5 sheds light on why the heavy-ball method needs a more conservative step size. To state this lemma, we consider the discrete Lyapunov function (3.10), which is derived by discretizing the continuous Lyapunov function (3.3) using the phase-space representation of the heavy-ball method,
$$x_k - x_{k-1} = \sqrt{s}\, v_{k-1}, \qquad v_k - v_{k-1} = -\frac{2\sqrt{\mu s}}{1-\sqrt{\mu s}}\, v_k - \frac{1+\sqrt{\mu s}}{1-\sqrt{\mu s}}\sqrt{s}\,\nabla f(x_k). \tag{3.11}$$

Lemma 3.5 (Lyapunov function for the heavy-ball method). Let $f \in \mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$. For any step size $s > 0$, the discrete Lyapunov function (3.10) with $\{x_k\}_{k=0}^\infty$ generated by the heavy-ball method satisfies the bound (3.12).

The proof of Lemma 3.5 can be found in Appendix B.3. To apply this lemma in the proof of Theorem 4, we need to ensure that the right-hand side of (3.12) yields a geometric decrease of $\mathcal{E}(k)$; this is condition (3.13). A sufficient and necessary condition for (3.13) involves a bound that is attained with equality for some functions (for example, $f(x) = \frac{L}{2}\|x\|^2$), and it forces the step size $s$ to obey (3.14). In particular, the choice $s = \frac{\mu}{16L^2}$ fulfills (3.14) and, as a consequence, Lemma 3.5 implies a geometric decrease of $\mathcal{E}(k)$. The remainder of the proof of Theorem 4 is similar to that of Theorem 3 and is therefore omitted.
As an aside, [Pol64] uses $s = 4/(\sqrt{L} + \sqrt{\mu})^2$ for local accelerated convergence of the heavy-ball method. This choice of step size is larger than our step size $s = \frac{\mu}{16L^2}$, which yields a non-accelerated but global convergence rate.
The term involving $\|\nabla f(x_{k+1})\|^2$ in (3.12), which arises from finite differencing of (3.10), is a (small) term of order $O(s)$ and, as a consequence, it is not reflected in Lemma 3.2. Turning to the case of NAG-SC, one might ask why this term does not appear in Lemma 3.4. In fact, a similar term can be found in $\mathcal{E}(k+1) - \mathcal{E}(k)$ by taking a closer look at the proof of Lemma 3.4. However, this term is canceled out by the discrete version of the quadratic term in Lemma 3.1 and is, therefore, not present in the statement of Lemma 3.4. Note that this quadratic term results from the gradient correction (see Remark 3.3). In light of the above, the gradient correction is the key ingredient that allows for a larger step size in NAG-SC, which is necessary for achieving acceleration.
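As a numerical sanity check of the step-size gap just highlighted, the sketch below (our own toy experiment on a hypothetical quadratic) runs NAG-SC with the step size $s = 1/(4L)$ of Theorem 3 and the heavy-ball method with $s = \mu/(16L^2)$ of Theorem 4, and reports how many iterations each needs to reach a fixed accuracy.

```python
import numpy as np

d = np.array([1e-2, 0.1, 1.0])        # Hessian eigenvalues of a toy quadratic (hypothetical)
mu, L = d.min(), d.max()
f = lambda x: 0.5 * np.sum(d * x**2)
grad = lambda x: d * x
x0 = np.ones(3)
tol = 1e-10

def nag_sc(s, max_iter=200000):
    alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x, y = x0.copy(), x0.copy()
    for k in range(max_iter):
        if f(x) <= tol:
            return k
        y_next = x - s * grad(x)
        x, y = y_next + alpha * (y_next - y), y_next
    return max_iter

def heavy_ball(s, max_iter=200000):
    alpha = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x_prev = x0.copy()
    x = x0 - 2 * s * grad(x0) / (1 + np.sqrt(mu * s))
    for k in range(max_iter):
        if f(x) <= tol:
            return k
        x, x_prev = x + alpha * (x - x_prev) - s * grad(x), x
    return max_iter

print("NAG-SC,     s = 1/(4L):     ", nag_sc(1 / (4 * L)), "iterations")
print("heavy-ball, s = mu/(16L^2): ", heavy_ball(mu / (16 * L**2)), "iterations")
```

The iteration counts differ by roughly the factor $\sqrt{L/\mu}$ predicted by the two theorems.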
For completeness, we finish Section 3.2 by proving Lemma 3.4.
Proof of Lemma 3.4. Using the Cauchy-Schwarz inequality (applied to the mixed-energy term III in (2.6)), together with the strong-convexity inequality for $f \in \mathcal{S}^1_{\mu,L}(\mathbb{R}^n)$, one shows that the Lyapunov function (2.6) satisfies the bound (3.15). Next, as shown in Appendix B.2, the inequality (3.16) holds for $s \le 1/(2L)$. Comparing the coefficients of the same terms in (3.15) for $\mathcal{E}(k+1)$ and in (3.16), we conclude that the first difference of the discrete Lyapunov function (2.6) satisfies the bound claimed in Lemma 3.4.

A Numerical Stability Perspective on Acceleration
As shown in Section 3.2, the gradient correction is the fundamental cause of the difference in convergence rates between the heavy-ball method and NAG-SC. This section aims to further elucidate this distinction from the viewpoint of numerical stability. A numerical scheme is said to be stable if, roughly speaking, the scheme does not magnify errors in the input data. Accordingly, we address the question of which values of the step size $s$ allow the high-resolution ODEs (1.10) and (1.11) to be solved in a stable fashion. While various discretization schemes for low-resolution ODEs have been explored in [WWJ16, WRJ16, ZMSJ18], we limit our attention to the forward Euler scheme to simplify the discussion (see [SB13] for an exposition on discretization schemes). For the heavy-ball method, the forward Euler scheme applied to (1.10) is given in (3.17). Considering a small perturbation $\epsilon$ of the iterates, we get the characteristic equation of (3.17), in which $I$ denotes the $n \times n$ identity matrix. The numerical stability of (3.17) requires the roots of the characteristic equation to be no larger than one in absolute value. Therefore, a necessary condition for stability is the matrix inequality (3.18). (Here the notation $A \preceq B$ indicates that $B - A$ is positive semidefinite for symmetric matrices $A$ and $B$.) By the $L$-smoothness of $f$, the largest singular value of $\nabla^2 f(X(t - \sqrt{s}))$ can be as large as $L$. Therefore, (3.18) is guaranteed in the worst-case analysis only if the step size obeys $s = O(\mu/L^2)$; this is condition (3.19).
Next, we turn to the high-resolution ODE (1.11) of NAG-SC, for which the forward Euler scheme reads (3.20). Its characteristic equation, derived as earlier, yields the numerical stability condition of (3.20), and this condition is ensured by setting the step size $s = O(1/L)$; this is (3.21). As constraints on the step sizes, both (3.19) and (3.21) are in agreement with the discussion in Section 3.2, albeit from a different perspective. In short, a comparison between (3.17) and (3.20) reveals that the Hessian term $\sqrt{s}\,\nabla^2 f(X(t - \sqrt{s}))$ makes the forward Euler scheme for the NAG-SC ODE numerically stable with a larger step size, namely $s = O(1/L)$. This is yet another reflection of the vital importance of the gradient correction in yielding acceleration for NAG-SC.
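The stability gap can also be seen empirically. The sketch below is our own construction: the discretization shown is one natural forward-Euler scheme in phase space and may differ in details from the schemes (3.17) and (3.20) analyzed above. It integrates the heavy-ball ODE (1.10) and the NAG-SC ODE (1.11) for a worst-case quadratic and searches for the largest step size at which each scheme remains bounded.

```python
import numpy as np

mu, L = 0.01, 1.0          # strong convexity and smoothness constants (hypothetical)

def stable(s, with_hessian_damping, n_steps=2000):
    """Forward Euler in phase space for the worst-case 1-D quadratic f(x) = L*x^2/2."""
    x, v = 1.0, 0.0
    h = np.sqrt(s)                          # time step matching the identification t_k = k*sqrt(s)
    damp = 2 * np.sqrt(mu) + (np.sqrt(s) * L if with_hessian_damping else 0.0)
    for _ in range(n_steps):
        x, v = x + h * v, v + h * (-damp * v - (1 + np.sqrt(mu * s)) * L * x)
        if abs(x) > 1e8:                    # blow-up: the scheme is unstable at this step size
            return False
    return True

def largest_stable_s(with_hessian_damping):
    grid = np.logspace(-6, 1, 200)
    ok = [s for s in grid if stable(s, with_hessian_damping)]
    return max(ok) if ok else 0.0

print("heavy-ball ODE (no Hessian damping): largest stable s ~", largest_stable_s(False))
print("NAG-SC ODE (with Hessian damping):   largest stable s ~", largest_stable_s(True))
```

With these values of $\mu$ and $L$, the heavy-ball scheme is stable only up to a step size on the order of $\mu/L^2$, whereas the scheme with the Hessian-driven damping tolerates a step size on the order of $1/L$, in line with (3.19) and (3.21).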

Gradient Correction for Gradient Norm Minimization
In this section, we extend the use of the high-resolution ODE framework to NAG-C (1.5) in the setting of minimizing an $L$-smooth convex function $f$. The main result is an improved rate of NAG-C for minimizing the squared gradient norm. Indeed, we show that NAG-C achieves the $O(L^2/k^3)$ rate of convergence for minimizing $\|\nabla f(x_k)\|^2$. To the best of our knowledge, this is the sharpest known bound for this problem using NAG-C without any modification. Moreover, we will show that the gradient correction in NAG-C is responsible for this rate; it is therefore unsurprising that this inverse cubic rate was not perceived within low-resolution ODE frameworks such as that of [SBC16]. In Section 4.3, we propose a new accelerated method with the same rate $O(L^2/k^3)$ and briefly discuss the benefit of the phase-space representation in simplifying technical proofs.

The ODE Case
We begin by studying the high-resolution ODE (1.12) corresponding to NAG-C with an objective f ∈ F 2 L (R n ) and an arbitrary step size s > 0. For convenience, let t 0 = 1.5 √ s.
Theorem 5. Assume $f \in \mathcal{F}^2_L(\mathbb{R}^n)$ and let $X = X(t)$ be the solution to the ODE (1.12). The squared gradient norm satisfies
$$\inf_{t_0 \le u \le t}\|\nabla f(X(u))\|^2 \le O\!\left(\frac{\|x_0 - x^\star\|^2}{\sqrt{s}\,(t^3 - t_0^3)}\right)$$
for all $t > t_0$.
By taking the step size $s = 1/L$, this theorem shows that
$$\inf_{t_0 \le u \le t}\|\nabla f(X(u))\|^2 \le O\!\left(\frac{\sqrt{L}\,\|x_0 - x^\star\|^2}{t^3}\right),$$
where the infimum operator is necessary, as the squared gradient norm is generally not decreasing in $t$. In contrast, directly combining the convergence rate of the function value (see Corollary 4.2) and the inequality $\|\nabla f(X)\|^2 \le 2L(f(X) - f(x^\star))$ only gives an $O(L/t^2)$ rate for squared gradient norm minimization.
The proof of the theorem is based on a continuous Lyapunov function (4.1) that augments the Lyapunov function of [SBC16] with $O(\sqrt{s})$ terms, including a term proportional to $t\sqrt{s}\,\nabla f(X)$; it reduces to the continuous Lyapunov function in [SBC16] when setting $s = 0$. The key property of this Lyapunov function is the following lemma.

Lemma 4.1. Under the assumptions of Theorem 5, the Lyapunov function (4.1) satisfies
$$\frac{d\mathcal{E}(t)}{dt} \le -O\big(\sqrt{s}\,t^2\,\|\nabla f(X(t))\|^2\big) \tag{4.2}$$
for all $t \ge t_0$.
The decreasing rate of $\mathcal{E}(t)$ specified in the lemma is sufficient for the proof of Theorem 5. First, note that integrating (4.2) over $[t_0, t]$ and using the fact that $\mathcal{E}(t) \ge 0$ shows that
$$\inf_{t_0 \le u \le t}\|\nabla f(X(u))\|^2 \cdot O\!\big(\sqrt{s}\,(t^3 - t_0^3)\big) \le \mathcal{E}(t_0). \tag{4.3}$$
Recognizing the initial conditions of the ODE (1.12), we can bound $\mathcal{E}(t_0)$ in terms of $\|x_0 - x^\star\|^2$; together with (4.3), this gives the bound (4.4), whose denominator contains the term $\sqrt{s}\,(t^3 - t_0^3)/3$ along with additional nonnegative terms. This bound reduces to the one claimed by Theorem 5 by keeping only the term $\sqrt{s}\,(t^3 - t_0^3)/3$ in the denominator.
The gradient correction $\sqrt{s}\,\nabla^2 f(X)\dot{X}$ in the high-resolution ODE (1.12) plays a pivotal role in Lemma 4.1 and is, thus, key to Theorem 5. As will be seen in the proof of the lemma, the factor $\|\nabla f(X)\|^2$ in (4.2) results from the term $t\sqrt{s}\,\nabla f(X)$ in the Lyapunov function (4.1), which in turn arises from the gradient correction in the ODE (1.12). In light of this, the low-resolution ODE (1.8) of NAG-C cannot yield a result similar to Lemma 4.1; furthermore, we conjecture that the $O(\sqrt{L}/t^3)$ rate does not apply to this ODE. Section 4.2 will discuss this point further in the discrete case.
In passing, it is worth pointing out that the analysis above also applies to the case $s = 0$. In this case, we have $t_0 = 0$, and (4.4) reduces to a bound on the function value of the same form as that obtained for the low-resolution ODE in [SBC16]. This section is concluded with the proof of Lemma 4.1.
Proof of Lemma 4.1. Differentiating the Lyapunov function (4.1) along the solution of (1.12) and substituting the ODE yields an expression for $\frac{d\mathcal{E}(t)}{dt}$. Making use of a basic inequality for convex functions, one obtains the claimed bound (4.2).
Note that Lemma 4.1 shows that $\mathcal{E}(t)$ is a decreasing function; combined with the initial conditions of the high-resolution ODE (1.12), this bounds $\mathcal{E}(t)$ by its value at $t_0$ and gives the following corollary.
Corollary 4.2. Under the same assumptions as in Theorem 5, for any $t > t_0$, we have
$$f(X(t)) - f(x^\star) \le O\!\left(\frac{\|x_0 - x^\star\|^2}{t^2}\right).$$

The Discrete Case
We now turn to the discrete NAG-C (1.5) for minimizing an objective $f \in \mathcal{F}^1_L(\mathbb{R}^n)$. Recall that this algorithm starts from any $x_0$ and $y_0 = x_0$. The discrete counterpart of Theorem 5 is as follows.

Theorem 6. Let $f \in \mathcal{F}^1_L(\mathbb{R}^n)$. For any step size $0 < s \le 1/(3L)$, the iterates $\{x_k\}_{k=0}^\infty$ generated by NAG-C (1.5) satisfy
$$\min_{0\le i\le k}\|\nabla f(x_i)\|^2 \le O\!\left(\frac{\|x_0 - x^\star\|^2}{s^2(k+1)^3}\right) \qquad\text{and}\qquad f(x_k) - f(x^\star) \le O\!\left(\frac{\|x_0 - x^\star\|^2}{s(k+1)^2}\right)$$
for all $k \ge 0$.
Taking $s = 1/(3L)$, Theorem 6 shows that NAG-C minimizes the squared gradient norm at the rate $O(L^2/k^3)$. This theoretical prediction is in agreement with the two numerical examples illustrated in Figure 4. To our knowledge, the bound $O(L^2/k^3)$ is sharper than any existing bound in the literature for NAG-C for squared gradient norm minimization. In fact, the convergence result for NAG-C and the $L$-smoothness of the objective immediately give $\|\nabla f(x_k)\|^2 \le O(L^2/k^2)$. This well-known but loose bound can be improved by using a recent result from [AP16], which shows that a slightly modified version of NAG-C satisfies $f(x_k) - f(x^\star) = o(L/k^2)$ (see Section 5.2 for more discussion of this improved rate). This reveals $\|\nabla f(x_k)\|^2 \le o(L^2/k^2)$, which, however, remains looser than the bound of Theorem 6. In addition, the rate $o(L^2/k^2)$ is not valid for $k \le n/2$ and, as such, the bound $o(L^2/k^2)$ on the squared gradient norm is dimension-dependent [AP16]. For completeness, the rate $O(L^2/k^3)$ can be achieved by introducing an additional sequence of iterates and a more aggressive step size policy in a variant of NAG-C [GL16]. In stark contrast, our result shows that no adjustments are needed for NAG-C to yield an accelerated convergence rate for minimizing the gradient norm.
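The inverse cubic rate is easy to observe numerically. The sketch below (our own toy least-squares example, not the experiment of Figure 4) runs NAG-C with $s = 1/(3L)$ and prints the running minimum of the squared gradient norm scaled by $k^3$; under Theorem 6 this scaled quantity should remain bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
L = np.linalg.norm(A, 2) ** 2            # smoothness constant of f(x) = 0.5 * ||Ax - b||^2
grad = lambda x: A.T @ (A @ x - b)

s = 1.0 / (3 * L)
x = np.zeros(20)
y = x.copy()
min_sq_grad = np.inf
for k in range(1, 100001):
    y_next = x - s * grad(x)
    # momentum coefficient (k-1)/((k-1)+3), matching k/(k+3) in (1.5)
    x, y = y_next + (k - 1) / (k + 2) * (y_next - y), y_next
    min_sq_grad = min(min_sq_grad, np.sum(grad(x) ** 2))
    if k in (100, 1000, 10000, 100000):
        print(f"k = {k:6d}   k^3 * min_i ||grad f(x_i)||^2 = {k**3 * min_sq_grad:.3e}")
```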
An $\Omega(L^2/k^4)$ lower bound has been established by [Nes12] as the optimal convergence rate for minimizing $\|\nabla f\|^2$ with access to only first-order information. (For completeness, Appendix C.3 presents an exposition of this fundamental barrier.) In the same paper, a regularization technique is used in conjunction with NAG-SC to obtain a matching upper bound (up to a logarithmic factor). This method, however, takes as input the distance between the initial point and the minimizer, which is not practical in general [KF18].
Figure 4 (caption, partial): $T$ is a $500\times 500$ positive semidefinite matrix and $b$ is $1\times 500$; all entries of $b$ and $T$ are i.i.d. uniform random variables on $(0, 1)$, and $\|\cdot\|_2$ denotes the matrix spectral norm.

Returning to Theorem 6, we present a proof of this theorem using a Lyapunov function argument. By way of comparison, we remark that Nesterov's estimate sequence technique is unlikely to be useful for characterizing the convergence of the gradient norm, as this technique is essentially based on local quadratic approximations. The phase-space representation of NAG-C (1.5) takes the following form:
$$x_k - x_{k-1} = \sqrt{s}\, v_{k-1}, \qquad v_k - v_{k-1} = -\frac{3}{k}\, v_k - \sqrt{s}\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) - \frac{k+3}{k}\sqrt{s}\,\nabla f(x_k), \tag{4.5}$$
valid for $k \ge 1$, for any initial position $x_0$ and the initial velocity $v_0 = -\sqrt{s}\,\nabla f(x_0)$. This representation allows us to discretize the continuous Lyapunov function (4.1) into a discrete Lyapunov function (4.6). The following lemma characterizes the dynamics of this Lyapunov function.
Lemma 4.3. Under the assumptions of Theorem 6, we have
$$\mathcal{E}(k+1) - \mathcal{E}(k) \le -O\big(s^2 k^2\,\|\nabla f(x_{k+1})\|^2\big)$$
for all $k \ge 0$.
Next, we provide the proof of Theorem 6.
Proof of Theorem 6. We start by establishing (4.7), a refinement of Lemma 4.3 valid for $k \ge 2$; to show it, it suffices to verify an elementary inequality that is self-evident since $s \le 1/(3L)$ by assumption. Next, by a telescoping-sum argument, Lemma 4.3 and (4.7) lead to the inequality (4.9) for $k \ge 4$. To further simplify the bound, observe that the sum of the coefficients appearing in (4.9) can be bounded from below for $k \ge 4$; plugging this bound into (4.9) yields (4.10). It is shown in Appendix C.1 that a further inequality holds for $s \le 1/(3L)$; as a consequence, (4.10) gives the bound (4.11) on $\min_{0\le i\le k}\|\nabla f(x_i)\|^2$. For completeness, Appendix C.1 proves, via a brute-force calculation, that $\|\nabla f(x_0)\|^2$, $\|\nabla f(x_1)\|^2$, $\|\nabla f(x_2)\|^2$, and $\|\nabla f(x_3)\|^2$ are all bounded above by the right-hand side of (4.11). This completes the proof of the first inequality claimed by Theorem 6.
For the second claim in Theorem 6, the definition of the Lyapunov function and its decreasing property ensured by (4.7) imply that the function-value gaps are bounded by the right-hand side of (4.12). This completes the proof.

Now, we prove Lemma 4.3.
Proof of Lemma 4.3. The difference of the Lyapunov function (4.6) can be expanded using the phase-space representation (4.5); this yields the identity (4.13) and, after rearranging, (4.14). The next step is to recognize that the convexity and the $L$-smoothness of $f$ give two inequalities which, plugged into (4.14), bound $\mathcal{E}(k+1) - \mathcal{E}(k)$ from above. To further bound $\mathcal{E}(k+1) - \mathcal{E}(k)$, we make use of (4.13) with $k+1$ in place of $k$. This completes the proof.
In passing, we remark that the gradient correction again sheds light on the superiority of the high-resolution ODE over its low-resolution counterpart, just as in Section 3. Indeed, the absence of the gradient correction in the low-resolution ODE leads to the lack of the term $(k+1)s\nabla f(x_k)$ in the Lyapunov function (see Section 4 of [SBC16]), as opposed to the high-resolution Lyapunov function (4.6). Accordingly, the bound $\mathcal{E}(k+1) - \mathcal{E}(k) \le -O\big(s^2k^2\|\nabla f(x_{k+1})\|^2\big)$ of Lemma 4.3 is unlikely to carry over to the low-resolution case and, consequently, the low-resolution ODE approach pioneered by [SBC16] is insufficient to obtain the $O(L^2/k^3)$ rate for squared gradient norm minimization.

A Modified NAG-C without a Phase-Space Representation
This section proposes a new accelerated method that also achieves the $O(L^2/k^3)$ rate for minimizing the squared gradient norm. This method, displayed in (4.15), starts with $x_0$ and $y_0 = x_0$. As shown by the corresponding theorem (Theorem 7), the new method has the same convergence rates as NAG-C.
We refer readers to Appendix C.2 for the proof of Theorem 7, which is, as earlier, based on a Lyapunov function. However, since both f (x k ) and f (y k ) appear in the iteration, (4.15) does not admit a phase-space representation. As a consequence, the construction of the Lyapunov function is complex; we arrived at it via trial and error. Our initial aim was to seek possible improved rates of the original NAG-C without using the phase-space representation, but the enormous challenges arising in this process motivated us to (1) modify NAG-C to the current (4.15), and (2) to adopt the phase-space representation. Employing the phase-space representation yields a simple proof of the O(L 2 /k 3 ) rate for the original NAG-C and this technique turned out to be useful for other accelerated methods.

Extensions
Motivated by the high-resolution ODE (1.12) of NAG-C, this section considers a family of generalized high-resolution ODEs that take the form
$$\ddot{X}(t) + \frac{\alpha}{t}\dot{X}(t) + \beta\sqrt{s}\,\nabla^2 f(X(t))\dot{X}(t) + \Big(1 + \frac{\alpha\sqrt{s}}{2t}\Big)\nabla f(X(t)) = 0 \tag{5.1}$$
for $t \ge \alpha\sqrt{s}/2$, with initial conditions $X(\alpha\sqrt{s}/2) = x_0$ and $\dot{X}(\alpha\sqrt{s}/2) = -\sqrt{s}\,\nabla f(x_0)$. As demonstrated in [SBC16, ACR17, VJFC18], the low-resolution counterpart of (5.1) (that is, (5.1) with $s = 0$) achieves acceleration if and only if $\alpha \ge 3$. Accordingly, we focus on the case where the friction parameter satisfies $\alpha \ge 3$ and the gradient correction parameter satisfies $\beta > 0$. An investigation of the case $\alpha < 3$ is left for future work.
By discretizing the ODE (5.1), we obtain a family of new accelerated methods, displayed in (5.2), for minimizing smooth convex functions, starting with $x_0 = y_0$. In Section 5.1, we study the convergence rates of this family of generalized NAG-C algorithms along the lines of Section 4. To further our understanding of (5.2), Section 5.2 shows that this method in the super-critical regime (that is, $\alpha > 3$) converges to the optimum at a rate faster than $O(1/(sk^2))$. As earlier, the proofs of all the results follow the high-resolution ODE framework introduced in Section 2. Proofs are deferred to Appendix D. Finally, we note that Section 6 briefly sketches the extensions along this direction for NAG-SC.

Convergence Rates
The theorem below characterizes the convergence rates of the generalized NAG-C (5.2).

Theorem 8. Let $f \in \mathcal{F}^1_L(\mathbb{R}^n)$, $\alpha \ge 3$, and $\beta > \frac{1}{2}$. For any sufficiently small step size $s$ (depending only on $\alpha$, $\beta$, and $L$), the iterates $\{x_k\}_{k=0}^\infty$ generated by the generalized NAG-C (5.2) satisfy
$$f(x_k) - f(x^\star) \le O\!\left(\frac{\|x_0 - x^\star\|^2}{sk^2}\right) \tag{5.3}$$
and $\min_{0\le i\le k}\|\nabla f(x_i)\|^2 \le O\!\left(\frac{\|x_0 - x^\star\|^2}{s^2 k^3}\right)$ for all $k \ge 1$.

The proof of Theorem 8 is given in Appendix D.1 for $\alpha = 3$ and Appendix D.2 for $\alpha > 3$. This theorem shows that the generalized NAG-C achieves the same rates as the original NAG-C in both squared gradient norm and function value minimization. The constraint $\beta > \frac{1}{2}$ suggests that further leveraging the gradient correction does not hurt acceleration, though the same may not be true in the other direction (note that NAG-C in its original form corresponds to $\beta = 1$). It is an open question whether this constraint is a technical artifact or is fundamental to acceleration.

Faster Convergence in Super-Critical Regime
We turn to the case in which α > 3, where we show that the generalized NAG-C in this regime attains a faster rate for minimizing the function value. The following proposition provides a technical inequality that motivates the derivation of the improved rate.
Proposition 5.1. Let $f \in \mathcal{F}^1_L(\mathbb{R}^n)$, $\alpha > 3$, and $\beta > \frac{1}{2}$. There exists $c'_{\alpha,\beta} > 0$ such that, taking any step size $0 < s \le c'_{\alpha,\beta}/L$, the iterates $\{x_k\}_{k=0}^\infty$ generated by the generalized NAG-C (5.2) obey the summability bound (5.4), where the constants $c'_{\alpha,\beta}$ and $C'_{\alpha,\beta}$ only depend on $\alpha$ and $\beta$.

In relation to Theorem 8, one can show that Proposition 5.1 in fact implies (5.3) in Theorem 8: for $k \ge 1$, the function-value gap $f(x_k) - f(x^\star)$ can be bounded by an average over earlier iterates, and the second inequality in that argument follows from Proposition 5.1. Proposition 5.1 can be thought of as a generalization of Theorem 6 of [SBC16]. In particular, this result carries an intriguing and important message: since the series in (5.4) is finite, it is tempting to suggest that there might exist a faster convergence rate in the sense of (5.5), namely that $k^2\big(f(x_k) - f(x^\star)\big)$ tends to zero. This faster rate is indeed achievable, as we show next, though there exist sequences for which (5.4) is satisfied but (5.5) does not hold (a counterexample is given in Appendix D.3).
Theorem 9. Under the same assumptions as in Proposition 5.1, taking the step size $s = c'_{\alpha,\beta}/L$, the iterates $\{x_k\}_{k=0}^\infty$ generated by the generalized NAG-C (5.2) starting from any $x_0$ satisfy $f(x_k) - f(x^\star) = o(1/k^2)$; that is, $k^2\big(f(x_k) - f(x^\star)\big) \to 0$ as $k \to \infty$.

Figure 5 (caption): Scaled errors of the generalized NAG-C (5.2) with various $(\alpha, \beta)$. The setting is the same as the left plot of Figure 4, with the objective $f(x) = \frac{1}{2}\langle Ax, x\rangle + \langle b, x\rangle$ and step size $s = 10^{-1}\|A\|_2^{-1}$. The left panel shows the short-time behavior of the methods, while the right focuses on the long-time behavior. The scaled error curves with the same $\beta$ are very close to each other in the short-time regime, but in the long-time regime the scaled error curves with the same $\alpha$ almost overlap. The four scaled error curves slowly tend to zero.

Figures 5 and 6 present several numerical studies concerning the prediction of Theorem 9. For a fixed dimension $n$, the convergence in Theorem 9 is uniform over functions in $\mathcal{F}^1 = \cup_{L>0}\mathcal{F}^1_L$ and, consequently, is independent of the Lipschitz constant $L$ and the initial point $x_0$. In addition to following the high-resolution ODE framework, the proof of this theorem relies on the finiteness of the series in Proposition 5.1. See Appendix D.2 and Appendix D.4 for the full proofs of the proposition and the theorem, respectively.
In the literature, [AP16, May17, ACPR18] use low-resolution ODEs to establish the faster rate o(1/k 2 ) for the generalized NAG-C (5.2) in the special case of β = 1. In contrast, our proof of Theorem 9 is more general and applies to a broader class of methods.
In passing, we make the observation that Proposition 5.1 also yields a related bound that holds for all $k$ with a constant $c > 0$. In view of this, it might be true that the rate of the generalized NAG-C for minimizing the squared gradient norm can be improved in the super-critical regime. We leave the confirmation or disconfirmation of this asymptotic result for future research.

Discussion
In this paper, we have proposed high-resolution ODEs for modeling three first-order optimization methods---the heavy-ball method, NAG-SC, and NAG-C. These new ODEs are more faithful surrogates for the corresponding discrete optimization methods than existing ODEs in the literature, thus serving as a more effective tool for understanding, analyzing, and generalizing first-order methods. Using this tool, we identified a term that we refer to as the "gradient correction" in NAG-SC and in its high-resolution ODE, and we demonstrated its critical effect in making NAG-SC an accelerated method, as compared to the heavy-ball method. We also showed via the high-resolution ODE of NAG-C that this method minimizes the squared norm of the gradient at a faster rate than expected for smooth convex functions, and again the gradient correction is the key to this rate. Finally, this analysis suggested a new family of accelerated methods with the same optimal convergence rates as NAG-C.
The aforementioned results are obtained using the high-resolution ODEs in conjunction with a new framework for translating findings concerning the amenable ODEs into those of the less "userfriendly" discrete methods. This framework encodes an optimization property under investigation to a continuous-time Lyapunov function for an ODE and a discrete-time Lyapunov function for the discrete method. As an appealing feature of this framework, the transformation from the continuous Lyapunov function to its discrete version is through a phase-space representation. This representation links continuous objects such as position and velocity variables to their discrete counterparts in a faithful manner, permitting a transparent analysis of the three discrete methods that we studied.
There are a number of avenues open for future research using the high-resolution ODE framework. First, the discussion of Section 5 can be carried over to the heavy-ball method and NAG-SC, which correspond to the high-resolution ODE
$$\ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + \beta\sqrt{s}\,\nabla^2 f(X(t))\dot{X}(t) + (1+\sqrt{\mu s})\nabla f(X(t)) = 0$$
with $\beta = 0$ and $\beta = 1$, respectively. This ODE with a general $0 < \beta < 1$ corresponds to a new algorithm that can be thought of as an interpolation between the two methods. It is of interest to investigate the convergence properties of this class of algorithms. Second, we recognize that new optimization algorithms are obtained in [WWJ16, WRJ16] by using different discretization schemes on low-resolution ODEs. Hence, a direction of interest is to apply the techniques therein to our high-resolution ODEs and to explore possible appealing properties of the new methods. Third, the technique of dimensional analysis, which we have used to derive high-resolution ODEs, can be further used to incorporate even higher-order powers of $\sqrt{s}$ into the ODEs. This might lead to further fine-grained findings concerning the discrete methods. More broadly, we wish to remark on possible extensions of the high-resolution ODE framework beyond smooth convex optimization in the Euclidean setting. In the non-Euclidean case, it would be interesting to derive a high-resolution ODE for mirror descent [KBB15, WWJ16]. This framework might also admit extensions to non-smooth optimization and stochastic optimization, where the ODEs are replaced, respectively, by differential inclusions [ORX+16, VJFC18] and stochastic differential equations [KB17, HLLL17, LTE17, LS17, XWG18, HMC+18, GGZ18]. Finally, recognizing that the high-resolution ODEs are well defined for non-convex functions, we believe that this framework will provide a more accurate characterization of the local behaviors of first-order algorithms near saddle points [JGN+17, DJL+17, HLS17]. On a related note, given the centrality of the problem of finding an approximate stationary point in the non-convex setting [CDHS17a, CDHS17b, AZ18], it is worth using the high-resolution ODE framework to explore possible applications of the faster rate for minimizing the squared gradient norm that we have uncovered.

References

[ZMSJ18] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521, 2018.

A.1 Derivation of High-Resolution ODEs
In this section, we formally derive the high-resolution ODEs of the heavy-ball method and NAG-C. Let $t_k = k\sqrt{s}$. For the moment, let $X(t)$ be a sufficiently smooth map from $[0, \infty)$ (the heavy-ball method) or $[1.5\sqrt{s}, \infty)$ (NAG-C) to $\mathbb{R}^n$ such that $x_k = X(t_k)$, where $\{x_k\}$ is the sequence of iterates generated by the heavy-ball method or NAG-C, depending on the context.

The heavy-ball method. For any function $f \in \mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$, setting $\alpha = \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$, multiplying both sides of (1.2) by $\frac{1+\sqrt{\mu s}}{s}$ and rearranging the equality, we obtain
$$(1+\sqrt{\mu s})\,\frac{x_{k+1}-x_k}{s} - (1-\sqrt{\mu s})\,\frac{x_k - x_{k-1}}{s} = -(1+\sqrt{\mu s})\nabla f(x_k). \tag{A.1}$$
Plugging (2.1) into (A.1), we have
$$\ddot{X}(t_k) + 2\sqrt{\mu}\,\dot{X}(t_k) + (1+\sqrt{\mu s})\nabla f(X(t_k)) = O(s).$$
By ignoring the $O(s)$ term, we obtain the high-resolution ODE (1.10) of the heavy-ball method,
$$\ddot{X} + 2\sqrt{\mu}\,\dot{X} + (1+\sqrt{\mu s})\nabla f(X) = 0.$$

NAG-C.
For any function $f \in \mathcal{F}^2_L(\mathbb{R}^n)$, we rewrite NAG-C (1.5) in its single-variable form, rearrange, and divide through by $s$; proceeding exactly as in the NAG-SC derivation of Section 2 then yields the high-resolution ODE (1.12). For convenience, we slightly change the definition of the time identification to $t_k = k\sqrt{s} + (3/2)\sqrt{s}$ instead of $t_k = k\sqrt{s}$.

A.2 Derivation of Low-Resolution ODEs
In this section, we derive the low-resolution ODEs of the accelerated gradient methods for comparison. The results presented here are well known in the literature and are included for ease of reading. As in [SBC16], the second-order Taylor expansions at both $x_{k-1}$ and $x_{k+1}$ with step size $\sqrt{s}$ are
$$x_{k\pm 1} = X(t_k) \pm \dot{X}(t_k)\sqrt{s} + \frac{1}{2}\ddot{X}(t_k)s + O(s^{3/2}). \tag{A.3}$$
With the Taylor expansion (A.3), we obtain the following estimate for the gradient correction of NAG-SC:
$$\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}\, s\,\big(\nabla f(x_k) - \nabla f(x_{k-1})\big) = O(s^{3/2}), \tag{A.4}$$
so this term is negligible after dividing through by $s$ and letting $s \to 0$.

(b) Recall that the equivalent form (A.1) of the heavy-ball method (1.2) is
$$(1+\sqrt{\mu s})\,\frac{x_{k+1}-x_k}{s} - (1-\sqrt{\mu s})\,\frac{x_k - x_{k-1}}{s} = -(1+\sqrt{\mu s})\nabla f(x_k).$$
Plugging (A.3) and (A.4) into (A.1) and taking $s \to 0$, we obtain the low-resolution ODE (1.9) of the heavy-ball method,
$$\ddot{X}(t) + 2\sqrt{\mu}\,\dot{X}(t) + \nabla f(X(t)) = 0.$$
Notably, NAG-SC and the heavy-ball method share the same low-resolution ODE (1.9), which is almost consistent with (1.10). Thus the low-resolution ODE fails to capture the information in the "gradient correction" of NAG-SC.

A.3 Solution Approximating Optimization Algorithms
To investigate the properties of the high-resolution ODEs (1.10), (1.11) and (1.12), we need to state the relationship between them and their corresponding low-resolution ODEs. Here, we denote the solution to a high-resolution ODE by $X_s = X_s(t)$. The low-resolution ODE is simply the special case of the high-resolution ODE with $s = 0$; taking NAG-SC as an example, setting $s = 0$ in (1.11) recovers (1.9). In other words, we consider a family of ODEs indexed by the step size parameter $s$.

A.3.1 Proof of Proposition 2.1
Global Existence and Uniqueness. To prove the global existence and uniqueness of the solution to the high-resolution ODEs (1.10) and (1.11), we first emphasize the fact that if $X_s = X_s(t)$ is the solution of (1.10) or (1.11), there exists some constant $C_1 > 0$ such that
$$\sup_{0\le t<\infty}\|\dot{X}_s(t)\| \le C_1, \tag{A.5}$$
which follows from a Lyapunov-function argument based on (A.6). Now, we proceed to prove the global existence and uniqueness of the solution to the high-resolution ODEs (1.10) and (1.11). Recall the initial value problem (IVP) for a first-order ODE system in $\mathbb{R}^m$, $\dot{z} = F(z)$ with $z(0) = z_0$, for which the classical theory of global existence and uniqueness of solutions is given as Theorem 10. The set $M_{C_1} = \{(X_s, \dot{X}_s) \in \mathbb{R}^{2n} : \|\dot{X}_s\| \le C_1\}$ satisfies the assumptions of Theorem 10 with $m = 2n$.
• For the heavy-ball method, the phase-space representation of the high-resolution ODE (1.10) is
$$\frac{d}{dt}\begin{pmatrix}X_s\\ \dot{X}_s\end{pmatrix} = \begin{pmatrix}\dot{X}_s\\ -2\sqrt{\mu}\,\dot{X}_s - (1+\sqrt{\mu s})\nabla f(X_s)\end{pmatrix}. \tag{A.8}$$
• For NAG-SC, the phase-space representation of the high-resolution ODE (1.11) is
$$\frac{d}{dt}\begin{pmatrix}X_s\\ \dot{X}_s\end{pmatrix} = \begin{pmatrix}\dot{X}_s\\ -2\sqrt{\mu}\,\dot{X}_s - \sqrt{s}\,\nabla^2 f(X_s)\dot{X}_s - (1+\sqrt{\mu s})\nabla f(X_s)\end{pmatrix}. \tag{A.10}$$
Based on the phase-space representations (A.8) and (A.10), together with the Lipschitz conditions (A.9) and (A.11), Theorem 10 leads to the following corollary.

Corollary A.1. For any $f \in \mathcal{S}^2_\mu(\mathbb{R}^n) := \cup_{L\ge\mu}\mathcal{S}^2_{\mu,L}(\mathbb{R}^n)$, each of the two ODEs (1.10) and (1.11) with the specified initial conditions has a unique global solution $X \in C^2(I;\mathbb{R}^n)$.

Approximation. Based on the Lyapunov function (A.6), the gradient norm is bounded along the solution of (1.10) or (1.11); that is,
$$\sup_{0\le t<\infty}\|\nabla f(X_s(t))\| \le C_2. \tag{A.12}$$
Recalling the low-resolution ODE (1.9), its phase-space representation is
$$\frac{d}{dt}\begin{pmatrix}X\\ \dot{X}\end{pmatrix} = \begin{pmatrix}\dot{X}\\ -2\sqrt{\mu}\,\dot{X} - \nabla f(X)\end{pmatrix}. \tag{A.13}$$
Similarly, using a Lyapunov function argument, we can show that if $X = X(t)$ is a solution of (1.9), we have
$$\sup_{0\le t<\infty}\|\dot{X}(t)\| \le C_3. \tag{A.14}$$
A simple calculation also tells us that there exists some constant $L_1 > 0$ such that the Lipschitz-type inequality (A.15) holds. Now, we proceed to show the approximation.

Lemma A.2. Denote the solution to the high-resolution ODE (1.10) or (1.11) by $X_s = X_s(t)$ and the solution to (1.9) by $X = X(t)$. Then, for any fixed $T > 0$,
$$\lim_{s\to 0}\ \max_{0\le t\le T}\|X_s(t) - X(t)\| = 0. \tag{A.16}$$
In order to prove (A.16), we prove a stronger result, (A.17). Before proving (A.17), we first state the standard Gronwall inequality (Lemma A.3); its proof follows from simple calculus and is omitted here.
Proof of Lemma A.2. We separate it into two parts.
• For the heavy-ball method, the phase-space representations (A.8) and (A.13) tell us that By the boundedness (A.12), (A.5) and (A.14) and the inequality (A.15), we have According to Lemma A.3, we have • For NAG-SC, the phase-space representations (A.10) and (A.13) tell us that Similarly, by the boundedness (A.12), (A.5) and (A.14) and the inequality (A.15), we have According to Lemma A.3, we have The proof is complete.
Lemma A.4. The heavy-ball method and NAG-SC both converge to their common low-resolution ODE (1.9), in the sense that the stated limit holds for any fixed T > 0.
A.3.2 Proof of Proposition 2.2
Global Existence and Uniqueness. If X_s = X_s(t) is a solution of the high-resolution ODE (1.12), an analogous a priori bound holds; this follows from the following Lyapunov function alone. Now we proceed to prove the global existence and uniqueness of the solution to the high-resolution ODE (1.12). Recall the initial value problem (IVP) for a first-order nonautonomous system in R^m, for which the classical theory of global existence and uniqueness is stated below.
for any x, y ∈ M. The reader may also refer to [GH13]. Similarly, the set defined above is a compact manifold satisfying the assumptions of Theorem 11 with m = 2n.
For NAG-C, the phase-space representation of the high-resolution ODE (1.12) is given in (A.21). Based on the phase-space representation (A.21), together with (A.22), Theorem 11 leads to the following corollary.
Corollary A.5. For any f ∈ F^2(R^n) := ∪_{L>0} F^2_L(R^n), the ODE (1.12) with the specified initial conditions has a unique global solution X ∈ C^2(I; R^n).
Approximation. Using the linear transformation t + (3/2)√s in place of t, we can rewrite the high-resolution ODE (1.12) as (A.24). Here we adopt the technique of using max{δ, t} in place of t, for any δ > 0, to overcome the singular point t = 0; this technique was first used in [SBC16]. With this substitution, (A.24) becomes the phase-space system (A.25). Recall the low-resolution ODE (1.8); with the same technique, its phase-space representation is given by (A.26), with the initial conditions X^δ_s(0) = x_0 and Ẋ^δ_s(0) = 0. Then, according to (A.25) and (A.26), it suffices to prove that for any δ > 0 and any t ∈ [0, T],
lim_{s→0} ‖X^δ_s(t) − X^δ(t)‖ = 0.
Lemma A.6. Denote the solution to the high-resolution ODE (1.12) by X = X_s(t) and that to (1.8) by X = X(t). Then the stated approximation holds.
Proof. By the bounds (A.27) and (A.28) and the Lipschitz inequality (A.29), we obtain the displayed estimate. According to Lemma A.3, we obtain the result (A.31). The proof is complete.
Lemma A.7. NAG-C converges to its low-resolution ODE in the sense that the stated limit holds for any fixed T > 0.
Combining Corollary A.5, Lemma A.6 and Lemma A.7, we complete the proof of Proposition 2.2.
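For intuition about the max{δ, t} device, the sketch below (illustrative; the form of (1.12) is assumed as Ẍ + (3/t)Ẋ + √s ∇²f(X)Ẋ + (1 + 3√s/(2t))∇f(X) = 0 and is not restated in this appendix) integrates the regularized phase-space system, with t replaced by max{δ, t}, on a one-dimensional quadratic.

```python
# Sketch (not from the paper): integrate the NAG-C high-resolution ODE with the
# max{delta, t} regularization of the singular point t = 0, as in (A.25).  The
# ODE form below is an assumption (standard statement of (1.12)):
#   X'' + (3/t) X' + sqrt(s) Hess f(X) X' + (1 + 3*sqrt(s)/(2t)) grad f(X) = 0,
# with X(0) = x0 and X'(0) = 0.
import numpy as np
from scipy.integrate import solve_ivp

s, T, delta = 0.01, 10.0, 1e-3
theta = 2.0                                  # quadratic f(x) = theta*x^2/2 in 1-D
grad = lambda x: theta * x
hess = lambda x: theta

def F(t, z):
    X, V = z
    tt = max(delta, t)                       # the max{delta, t} technique
    dV = -(3.0/tt)*V - np.sqrt(s)*hess(X)*V - (1 + 3*np.sqrt(s)/(2*tt))*grad(X)
    return [V, dV]

sol = solve_ivp(F, [0, T], [1.0, 0.0], rtol=1e-9, atol=1e-12, dense_output=True)
for t in [1.0, 2.0, 5.0, 10.0]:
    X = sol.sol(t)[0]
    print(f"t = {t:5.1f}:  f(X) - f(x*) = {0.5*theta*X**2:.3e}")
```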

A.4 Closed-Form Solutions for Quadratic Functions
In this section, we present the closed-form solutions to the three high-resolution ODEs for the quadratic objective function (A.32), where θ is a parameter chosen so that the function belongs to S^2_{µ,L}(R^n) or F^2_L(R^n). We compare them with the corresponding low-resolution ODEs and highlight the key differences. Throughout this section, c_1 and c_2 denote arbitrary real constants.

A.4.1 Oscillations and Non-Oscillations
For any function f(x) ∈ S^2_{µ,L}(R^n), the parameter θ lies in [µ, L]. First, plugging the quadratic objective (A.32) into the low-resolution ODE (1.9) of both NAG-SC and the heavy-ball method, we obtain (A.33). The closed-form solution of (A.33) follows from the classical theory of linear ODEs, as below.
• When θ > µ, that is, 4µ − 4θ < 0, the closed-form solution is the superposition of two independent oscillatory solutions, with the asymptotic estimate stated below.
• When θ = µ, that is, 4µ − 4θ = 0, the closed-form solution is the superposition of two independent non-oscillatory solutions, with the asymptotic estimate stated below.
Second, plugging the quadratic objective (A.32) into the high-resolution ODE (1.11) of NAG-SC, we obtain (A.34). The closed-form solutions to (A.34) are given below.
• When s < 4(θ − µ)/θ^2, that is, 4(µ − θ) + sθ^2 < 0, the closed-form solution is the superposition of two independent oscillatory solutions, with the asymptotic estimate stated below.
• When s = 4(θ − µ)/θ^2, that is, 4(µ − θ) + sθ^2 = 0, the closed-form solution is the superposition of two independent non-oscillatory solutions, with the asymptotic estimate stated below.
• When s > 4(θ − µ)/θ^2, that is, 4(µ − θ) + sθ^2 > 0, the closed-form solution is again the superposition of two independent non-oscillatory solutions, with the asymptotic estimate stated below.
Note that a simple calculation gives the bound displayed below. Hence, when the step size satisfies s ≥ 2, there is never oscillation in the closed-form solution of (A.34). Finally, plugging the quadratic objective (A.32) into the high-resolution ODE (1.10) of the heavy-ball method, we obtain (A.35). Since 4µ − 4(1 + √(µs))θ < 0 always holds, the closed-form solution of (A.35) is the superposition of two independent oscillatory solutions, with the asymptotic estimate stated below. In summary, the closed-form solutions to both (A.33) and (A.35) oscillate except in the fragile case θ = µ, and their speed of linear convergence is Θ(e^{−√µ t}). However, the rate of convergence of the closed-form solution to the high-resolution ODE (A.34) is always faster than Θ(e^{−√µ t}). Additionally, when the step size satisfies s ≥ 2, there is never oscillation in the closed-form solution of the high-resolution ODE (A.34).
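As a quick sanity check of these conditions, the following sketch evaluates the three discriminants stated above (4(µ − θ) for (A.33), 4(µ − θ) + sθ² for (A.34), and 4µ − 4(1 + √(µs))θ for (A.35)) and classifies each case as oscillatory or not; the underlying characteristic polynomials are an assumption consistent with these conditions.

```python
# Sketch: classify oscillation vs. non-oscillation in the quadratic case by the
# sign of the discriminant, using the conditions stated above for (A.33), (A.34)
# and (A.35).  The characteristic polynomials themselves are an assumption
# consistent with these conditions, e.g.
#     r^2 + (2*sqrt(mu) + sqrt(s)*theta) r + (1 + sqrt(mu*s))*theta   for (A.34).
import numpy as np

def classify(mu, theta, s):
    disc = {
        "low-resolution (A.33)":      4 * (mu - theta),
        "NAG-SC high-res (A.34)":     4 * (mu - theta) + s * theta**2,
        "heavy-ball high-res (A.35)": 4 * mu - 4 * (1 + np.sqrt(mu * s)) * theta,
    }
    for name, d in disc.items():
        kind = "oscillatory" if d < 0 else "non-oscillatory"
        print(f"  {name:28s} discriminant = {d:8.3f}  ->  {kind}")

mu, L = 1.0, 10.0
# The last value of s is the threshold 4(theta - mu)/theta^2 evaluated at theta = L.
for s in [0.01, 0.1, 4 * (L - mu) / L**2]:
    print(f"s = {s:.4f}, theta = L = {L}:")
    classify(mu, L, s)
```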

A.4.2 Kummer's Equation and Confluent Hypergeometric Function
For any function f(x) ∈ F^2_L(R^n), the parameter θ is required to lie in (0, L]. Plugging the quadratic objective (A.32) into the low-resolution ODE (1.8) of NAG-C, we obtain
Ẍ + (3/t)Ẋ + θX = 0,
whose closed-form solution was derived in [SBC16] in terms of J_1(·) and Y_1(·), the Bessel functions of the first and second kind, respectively. According to the asymptotic properties of Bessel functions, we obtain the corresponding estimate. Now we plug the quadratic objective (A.32) into the high-resolution ODE (1.12) of NAG-C and obtain (A.36). For convenience, we define the two new parameters ρ and ξ. Letting Y = Xe^{ρt} and t′ = ξt, the high-resolution ODE (A.36) can be rewritten in a form that corresponds exactly to Kummer's equation. Using the closed-form solution to Kummer's equation, the high-resolution ODE (A.36) for the quadratic objective can be solved analytically as (A.37), where M(·, ·, ·) and U(·, ·, ·) are the confluent hypergeometric functions of the first and second kind, with the integral representations given below. Since the possible value of arg(ξt) is either 0 or π/2, we obtain the asymptotic estimate (A.38). From (A.38), we have the following cases.
• When s < 4/θ, that is, sθ^2 − 4θ < 0, the closed-form solution (A.37) satisfies the estimate stated below. Hence, when the step size satisfies s < 4/L, this upper bound always holds.
• When s ≥ 4/θ, that is, sθ^2 − 4θ ≥ 0, the closed-form solution (A.37) satisfies the estimate stated below, which we can then bound as shown.

B Technical Details in Section 3

With the Cauchy-Schwarz inequality, the Lyapunov function (3.3) can be estimated as below. Along the solution to the high-resolution ODE (1.10), the time derivative of the Lyapunov function (3.3) is

With (B.1) and the inequality below, which holds for any such function, the time derivative of the Lyapunov function can be estimated accordingly. Hence, the proof is complete.
B.2 Completing the Proof of Lemma 3.4

B.2.2 Derivation of (B.2)
Now we show the derivation of (B.2). Recall the discrete Lyapunov function (2.6). For convenience, we compute the difference between E(k) and E(k + 1) in three parts, I, II and III, respectively.
• For part I, the potential, using convexity we obtain the displayed bound.
• For part II, the kinetic energy, using the phase-space representation of NAG-SC (2.5) we obtain the displayed expression.
• For part III, the mixed energy, using the phase-space representation of NAG-SC (2.5) we obtain the displayed expression.
Both II_2 and III_3 above are the discrete counterparts of the terms −(√s/2)‖∇f(X(t))‖^2 and the corresponding −√s term in (3.1); their impact can be seen in the calculation below. Now we compute the difference of the discrete Lyapunov function (2.6) at the k-th iteration by a simple rearrangement. The term (1/2)I_1 + II_5 + II_6 + III_4 can then be computed as displayed; with the phase-space representation of NAG-SC (2.5), we obtain the corresponding identity. For convenience, we write IV = (1/2)I_1 + III_2 + III_3. Then, with the phase-space representation of NAG-SC (2.5), the difference of the Lyapunov function (2.6) takes the displayed form. Now we can see the impact of the additional term in the Lyapunov function (2.6): II_4 + IV_1 plus the additional term is a perfect square, as shown below. Merging all similar terms, namely II_4 + IV_1 + additional term and I_2 + II_3, we obtain the displayed expression for (II_4 + IV_1 + additional term) + (I_2 + II_3). We thus obtain the difference of the Lyapunov function (2.6). Finally, with the inequality below, which holds for any function f(x) ∈ S^1_{µ,L}(R^n), we conclude.

B.3 Proof of Lemma 3.5

With the phase-space representation of the heavy-ball method (3.11) and the Cauchy-Schwarz inequality, we obtain the displayed bound, so the discrete Lyapunov function (3.10) can be estimated as shown. For convenience, we also split the discrete Lyapunov function (3.10) into three parts, marked as below, where the three parts I, II and III correspond to the potential, kinetic energy and mixed energy in classical mechanics, respectively.
• For part I, the potential, using the basic convexity of f(x) ∈ S^1_{µ,L}(R^n) we obtain the displayed bound.
• For part II, the kinetic energy, using the phase-space representation of the heavy-ball method (3.11) we obtain the displayed expression.
• For part III, the mixed energy, using the phase-space representation of the heavy-ball method (3.11) we obtain the displayed expression.
Now we compute the difference of the discrete Lyapunov function (3.10) at the k-th iteration by a simple rearrangement. With the phase-space representation of the heavy-ball method (3.11), we obtain the displayed identities, and the difference of the discrete Lyapunov function (3.10) can be rewritten as shown. With the inequality below, which holds for any function f(x) ∈ S^1_{µ,L}(R^n), and comparing coefficients with the estimate (B.3) of the Lyapunov function, we obtain the claim. The proof is complete.

C Technical Details in Section 4
C.1 Technical Details in Proof of Theorem 6

C.1.1 Iterates (x_k, y_k) at k = 1, 2, 3

The iterate (x_k, y_k) at k = 1 is given below.

C.2 Proof of Theorem 7
Let w_k = (1/2)[(k + 2)x_k − k y_k + (k − 1)s∇f(y_k)] for convenience. Using the dynamics of {(x_k, y_k)}_{k=0}^∞ generated by the modified NAG-C (4.15), we obtain the displayed identity. Hence, the difference between ‖w_{k+1} − x⋆‖^2 and ‖w_k − x⋆‖^2 takes the displayed form. If the step size satisfies s ≤ 1/L, there is a basic inequality, tighter than [SBC16, Equation (22)] and [Bub15, Lemma 3.6], that holds for any function f(x) ∈ F^1_L(R^n). With (C.10), we obtain the displayed bound. Consider the discrete Lyapunov function (C.11). The difference between E(k + 1) and E(k) in (C.11) then takes the displayed form. When k ≥ 2, we have the first estimate; furthermore, combining it with (C.12), we obtain the second. Similarly, when s ≤ 1/L, the case k = 0 is handled directly, and for k = 1, following the modified NAG-C (4.15), we obtain (x_1, y_1) as displayed. For the function value, (C.12) then completes the proof.
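For a rough numerical feel of the inverse quadratic rate that Theorem 7 establishes, here is a small sketch. Since the update (4.15) is not reproduced in this appendix, the sketch runs standard NAG-C as a stand-in, with the assumed updates y_{k+1} = x_k − s∇f(x_k) and x_{k+1} = y_{k+1} + (k/(k + 3))(y_{k+1} − y_k), and monitors k²(f(y_k) − f(x⋆)), which should remain bounded.

```python
# Sketch (not from the paper): monitor k^2 * (f(y_k) - f(x*)) for standard NAG-C
# on a quadratic.  The update rules are an assumption (standard NAG-C), used only
# as a stand-in for the modified method (4.15), which is not restated here.
import numpy as np

L = 10.0
A = np.diag([1.0, L])
f = lambda x: 0.5 * x @ A @ x              # minimizer x* = 0, so f(x*) = 0
grad = lambda x: A @ x
s = 1.0 / L

x = np.array([1.0, 1.0])
y = x.copy()
for k in range(200):
    y_next = x - s * grad(x)               # y_{k+1}
    x = y_next + k / (k + 3) * (y_next - y)  # x_{k+1} with momentum k/(k+3)
    y = y_next
    kk = k + 1
    if kk in (10, 50, 100, 200):
        print(f"k = {kk:4d}:  k^2 * (f(y_k) - f(x*)) = {kk**2 * f(y):.4f}")
```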

D Technical Details in Section 5
D.1 Proof of Theorem 8: Case α = 3

Before starting to prove Theorem 8, we first revisit our high-resolution ODE framework from Section 2.
• Step 1: the generalized high-resolution ODE has been given in (5.1).
• Step 2: the continuous Lyapunov function is constructed as (D.1). Following this Lyapunov function (D.1), we can obtain results analogous to Theorem 5 and Corollary 4.2. The detailed calculation, concerning the estimate of the optimal constant β and how the constant β influences the initial point, is left to the reader.
• Step 3: before constructing discrete Lyapunov functions, we write the phase-space representation of (5.2) as (D.2).
Now we show how to construct the discrete Lyapunov function and analyze the algorithms (5.2) with α = 3 in order to prove Theorem 8.
With the phase-space representation (D.2) for α = 3, we obtain the displayed identities. The difference of the discrete Lyapunov function (D.3) at the k-th iteration takes the displayed form. With the basic inequality for any such function and the phase-space representation (D.2), the difference of the discrete Lyapunov function (D.3) can be estimated accordingly. Utilizing the phase-space representation (D.2) again, we compute the difference of the discrete Lyapunov function (D.3) as displayed. To guarantee that the Lyapunov function E(k) is decreasing, a sufficient condition is (D.5). A simple calculation shows that (D.5) can be rewritten as (D.6). Apparently, when β → 1, the step size satisfies
0 < s ≤ ((k − 1)/(k + 1)) · (1/L),
which is consistent with (4.8). Now we discuss the parameter range 0 ≤ β < 1 case by case.
• When the parameter satisfies β ≤ 1/2, the sufficient condition (D.5) for the Lyapunov function E(k) to be decreasing cannot be satisfied for sufficiently large k.
In the same way, we obtain the sufficient condition (D.5) for the Lyapunov function E(k) to be decreasing and the corresponding sufficient condition (D.6) on the step size. Now we discuss the parameter range β ≥ 1 case by case.
A simple calculation completes the proof.
D.2 Proof of Theorem 8: Case α > 3

Before proving Theorem 8 in the case α > 3, we again revisit our high-resolution ODE framework from Section 2.
• Step 1: the generalized high-resolution ODE has been given in (5.1).
• Step 2: the continuous Lyapunov function is constructed as (D.8), which is consistent with (D.1) as α → 3. Following this Lyapunov function (D.8), we obtain (D.9) for any t > t_0 = max{√s(α/2 − β)(α − 2)/(α − 3), √s(α/2)}. The two inequalities in (D.9) for the convergence rate of the function value are stronger than Corollary 4.2. The detailed calculation, concerning the estimate of the optimal constant β and how the constant β influences the initial point, is left to the reader.
• Step 3: before constructing discrete Lyapunov functions, we recall the phase-space representation (D.2). The discrete Lyapunov function is constructed as displayed below. When β = 1 and α → 3, the discrete Lyapunov function E(k) degenerates to (4.6).