An Accelerated Lyapunov Function for Polyak’s Heavy-Ball on Convex Quadratics

In 1964, Polyak showed that the Heavy-ball method, the simplest momentum technique, accelerates convergence on strongly-convex problems in the vicinity of the solution. While Nesterov later developed a globally accelerated version, Polyak's original algorithm remains simpler and more widely used in applications such as deep learning. Despite this popularity, the question of whether Heavy-ball is also globally accelerated has not been fully answered yet, and no convincing counterexample has been provided. This is largely due to the difficulty of finding an effective Lyapunov function: indeed, most proofs of Heavy-ball acceleration in the strongly-convex quadratic setting rely on eigenvalue arguments. Our study adopts a different approach: studying momentum through the lens of quadratic invariants of simple harmonic oscillators. By utilizing the modified Hamiltonian of Störmer-Verlet integrators, we are able to construct a Lyapunov function that demonstrates an O(1/k²) rate for Heavy-ball in the case of convex quadratic problems. This is a promising first step towards potentially proving the acceleration of Polyak's momentum method, and we hope it inspires further research in this field.


Introduction
The problem of unconstrained continuous convex optimization consists in finding an element of the set arg min_{x ∈ ℝᵈ} f(x), for some lower-bounded convex f : ℝᵈ → ℝ, generally assumed to be regular, e.g., twice continuously differentiable: f ∈ C²(ℝᵈ, ℝ).

Acceleration in discrete- and continuous-time
In 1979, Nemirovsky and Yudin [17] showed that, if f is convex and L-smooth (a differentiable function f : ℝᵈ → ℝ is said to be L-smooth if it has L-Lipschitz gradients), no gradient-based optimizer can converge to a solution faster than O(1/k²), where k is the number of gradient evaluations. This lower bound holds just for k < d, hence it is only interesting in the high-dimensional setting. While Gradient Descent (GD) converges like O(1/k), the optimal rate O(1/k²) is achieved by the celebrated Accelerated Gradient Descent (AGD) method, proposed by Nesterov.
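For concreteness, these rates can be summarized as follows (the constants differ across references; the versions below are the common textbook ones and should be read as indicative):

```latex
% Gradient Descent with step-size 1/L on a convex, L-smooth f:
f(x_k) - f^\star \;\le\; \frac{L\,\|x_0 - x^\star\|^2}{2k} \;=\; O(1/k),
\qquad
% Nesterov's Accelerated Gradient Descent:
f(x_k) - f^\star \;\le\; \frac{2L\,\|x_0 - x^\star\|^2}{(k+1)^2} \;=\; O(1/k^2),
```

and the Nemirovsky-Yudin construction shows that, as long as k < d, no gradient-based method can improve on O(1/k²) beyond constants.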

Compared to GD, AGD makes two modifications: it adds a momentum term, and it evaluates the gradient at an extrapolated (non-iterate) point. Questions arise immediately:
Are both these modifications necessary for acceleration? In particular, is evaluating the gradient at non-iterate points crucial, or even necessary, for acceleration?
To put these questions in the right historical context, one has to go back to Polyak's 1964 seminal paper [20], where the very first momentum method was proposed for C² and μ-strongly-convex problems. Using an elegant functional-analytic argument on multistep methods, Polyak proved that momentum alone, without shifted gradient evaluations (a.k.a. Heavy-ball (HB), see equation (HB) below), is able to achieve acceleration in a neighborhood of the solution. This local argument becomes of course global in the quadratic case (for a simplified proof, see Proposition 1 in [15]). Many similar formulations are possible; here, we consider the particular version studied in [22] and a physicist's notation, where v_k is a velocity variable. This makes the connection to continuous time cleaner, and is consistent with recent work on the geometry of momentum methods [5]. In the strongly-convex case, the momentum coefficient β_k is not monotonically increasing, but is instead chosen to be a constant depending on the strong-convexity modulus μ.

v_{k+1} = β_k v_k − h ∇f(x_k),    x_{k+1} = x_k + h v_{k+1}.    (HB)

Eliminating the velocity, this reads x_{k+1} = x_k + β_k (x_k − x_{k−1}) − h² ∇f(x_k); in the convex case we take β_k = (k−1)/(k+2).
Despite the many attempts, nobody in the last 56 years has been able to show that HB has a global (i.e., for any initialization) accelerated rate, neither in the strongly-convex case (using a fixed momentum) nor in the non-strongly-convex case (using an increasing momentum (k−1)/(k+2)). Beyond the technical difficulty, another plausible reason may be lack of interest, as the introduction of Nesterov's globally accelerated method in 1982 overshadowed the conceptually simpler method of Polyak. However, many researchers in the last decade, supported by numerical evidence and by the success of Heavy-ball in deep learning [13], expressed their belief that HB is accelerated:

[...] supported by the numerical simulations we envisage that the convergence factor could be strengthened even further. This is indeed left as a future work.
-Ghadimi et al. [7], 2015 Despite the long history of this approach, there is still an open question whether the heavy ball method converges to the optimum globally with accelerated rate when the objective function is twice continuous differentiable.
-Gorbunov et al. [8], 2019 Neither the evaluation of the gradient at a shifted position, nor a specifically engineered damping parameter, as for example proposed in Nesterov (2004, Sec. 2.2), seem necessary.
-Muehlebach and Jordan [16], 2020

Other researchers believe HB is not accelerated:

If we can translate this argument to the discrete case we can understand why AGD achieves acceleration globally for strongly-convex functions but the Heavy-ball method does not.
-Shi et al. [21], 2018

While on the theoretical side the opinion is mixed, on the experimental side no numerical simulation has been able to show that HB is not accelerated. In Figure 1, we provide two examples for the non-strongly-convex case (i.e., μ very small, such that an increasing momentum is preferable, leading to 1/k² convergence as opposed to (1 − √(μ/L))ᵏ). In particular, we show that HB is comparable to AGD through the lens of the pathological lower-bounding quadratic example introduced by [17] and used to construct the O(1/k²) bound in convex optimization, at least until the effect of non-trivial strong-convexity becomes dominant (at around f(x_k) = 10⁻⁶). The optimal step-size 1/L was used.
Figure 1: For both examples, HB with momentum (k−1)/(k+2) exhibits an accelerated 1/k² convergence rate, even though AGD with momentum (k−1)/(k+2) is faster in a neighborhood of the optimizer due to strong-convexity. Instead, GD violates the Nesterov O(1/k²) upper bound. We recall that, while Nesterov's upper bound holds for all k > 0, the O(1/k²) lower bound (originally discovered by Nemirovsky and Yudin [17]) only holds at k = d/2 (for more details, check the discussion in [18]).
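A minimal sketch of the kind of experiment behind Figure 1, assuming the standard worst-case tridiagonal quadratic of [17] and HB with momentum (k−1)/(k+2); the dimension, horizon, and step-size 1/L are our choices for illustration, not necessarily those used to produce the figure:

```python
import numpy as np

def worst_case_quadratic(d):
    # Nemirovsky-Yudin style quadratic: f(x) = 0.5 <x, A x> - <e_1, x>,
    # with A = tridiag(-1, 2, -1); its smoothness constant is L < 4.
    A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
    b = np.zeros(d); b[0] = 1.0
    x_star = np.linalg.solve(A, b)
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad = lambda x: A @ x - b
    L = np.linalg.eigvalsh(A).max()
    return f, grad, L, f(x_star)

def gd(grad, x0, h, iters):
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - h * grad(xs[-1]))
    return xs

def heavy_ball(grad, x0, h, iters):
    # HB: gradient at the iterate, increasing momentum (k-1)/(k+2).
    xs = [x0, x0 - h * grad(x0)]
    for k in range(1, iters):
        beta = (k - 1) / (k + 2)
        xs.append(xs[-1] + beta * (xs[-1] - xs[-2]) - h * grad(xs[-1]))
    return xs

d, iters = 100, 300
f, grad, L, f_star = worst_case_quadratic(d)
x0 = np.zeros(d)
# Compare suboptimality near the end of the run (HB oscillates, so take a min).
gd_gap = min(f(x) - f_star for x in gd(grad, x0, 1 / L, iters)[-20:])
hb_gap = min(f(x) - f_star for x in heavy_ball(grad, x0, 1 / L, iters)[-20:])
print(gd_gap, hb_gap)  # HB ends up markedly closer to f*
```

On this example, GD decays roughly like 1/k while HB tracks the accelerated 1/k² behavior, consistent with the picture in Figure 1.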

Contributions
The purpose of the manuscript at hand is to study the effect of shifts in gradient extrapolation points on acceleration in convex optimization (i.e., to study the difference between Heavy-ball and Nesterov's method). In particular, the next pages are organized as follows: 1. We start from a continuous-time argument: inspired by a recent idea from Flammarion and Bach [4], in Section 2.1 we show how AGD-ODE with damping 2/t can be derived from the equation of a simple harmonic oscillator, ÿ = −y. By using Lyapunov equations and a simple change of variables, we retrieve the Lyapunov function proposed by Su, Boyd and Candès [22] to prove a rate O(1/t²) for AGD-ODE. This procedure is principled and leads to many insights on Lyapunov function design.
2. In Section 2.2, we apply the same methodology in discrete time, and show that HB with momentum (k−1)/(k+1) can be derived from the Störmer-Verlet discretization of the simple harmonic oscillator. Solving again the Lyapunov equations, we are able to show an O(1/k²) rate for a Heavy-ball algorithm for convex quadratics. While this rate is already present in [4], our proof technique is different, as it relies on a Lyapunov function as opposed to an eigenvalue analysis.
3. In Section 3, by generalizing the discrete-time Lyapunov function found in Section 2.2, we derive a modified Heavy-ball method with a rate of convergence O(1/k²) for any α ≥ 2 and all k ≥ 2. Our result not only generalizes the theory in [4], but also provides an interesting connection between the continuous and the discrete setting, as the Lyapunov function we use converges, in the limit h → 0, to the one used in [22] for α ≥ 2.
Recent related works. Very recently, Wang et al. [23] proved that Heavy-ball is accelerated for a class of functions satisfying the Polyak-Łojasiewicz condition. Instead, here we provide a Lyapunov function for the non-strongly-convex setting, where the Polyak-Łojasiewicz constant vanishes. We remark that, for strongly-convex quadratic potentials, Heavy-ball is already known to achieve acceleration [15]. However, the eigenvalue argument used in [15] cannot be leveraged in the non-strongly-convex setting, where the minimum eigenvalue can be arbitrarily small. As such, our work provides insights on how to construct effective Lyapunov functions in the non-quadratic case, where Lyapunov arguments are often the go-to option.

From quadratic invariants of oscillators to accelerated rates
Our procedure in this section is inspired by a beautiful idea presented by Flammarion and Bach [4]: it is sometimes possible to translate a time-dependent convergence rate problem into a time-independent stability problem. Here we go one step further, and show how, with an additional step (the computation of quadratic invariants), it is possible to derive Lyapunov functions and rates for the corresponding algorithms. We first illustrate the idea in continuous time and then proceed with the discrete-time analysis. Our starting point is the following ODE:

Ẍ(t) + (2/t) Ẋ(t) + ∇f(X(t)) = 0.    (AGD-ODE2)

From the analysis in [22], we know that on a quadratic f(x) = f* + ½⟨x − x*, H(x − x*)⟩, with H positive semidefinite and x* ∈ ℝᵈ, the solution converges to x* ∈ arg min_{x∈ℝᵈ} f(x) at the rate O(1/t²). To prove this rate, one can use the following Lyapunov function (in the spirit of [22]):

V(t) = t² (f(X(t)) − f*) + ½ ‖X(t) − x* + t Ẋ(t)‖².    (1)

We show here a constructive way to derive V (Section 2.1), and then (Section 2.2) we apply the same procedure to get a Lyapunov function for Heavy-ball (i.e., the discretization). For simplicity, we consider here x* = 0 and f* = 0.
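As a sanity check of this claim, the sketch below integrates AGD-ODE2 on a random convex quadratic with a hand-rolled RK4 scheme and tracks the energy V(t) = t²(f(X) − f*) + ½‖X − x* + tẊ‖² (our reconstruction of the function discussed above); on quadratics, V should stay numerically constant, which in particular forces f(X(t)) = O(1/t²). The quadratic and the integration grid are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
H = M.T @ M / d          # PSD Hessian; f(x) = 0.5 x^T H x, x* = 0, f* = 0
f = lambda x: 0.5 * x @ H @ x

def rhs(t, state):
    # AGD-ODE2: X'' + (2/t) X' + grad f(X) = 0, as a first-order system in (X, X').
    x, v = state
    return np.stack([v, -(2.0 / t) * v - H @ x])

def rk4_step(t, state, dt):
    k1 = rhs(t, state)
    k2 = rhs(t + dt / 2, state + dt / 2 * k1)
    k3 = rhs(t + dt / 2, state + dt / 2 * k2)
    k4 = rhs(t + dt, state + dt * k3)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def V(t, state):
    x, v = state
    return t**2 * f(x) + 0.5 * np.dot(x + t * v, x + t * v)

t, dt = 0.1, 1e-3
state = np.stack([rng.standard_normal(d), np.zeros(d)])
vals = [V(t, state)]
while t < 10.0:
    state = rk4_step(t, state, dt)
    t += dt
    vals.append(V(t, state))
drift = (max(vals) - min(vals)) / vals[0]
print(drift)                          # tiny: V is conserved on quadratics
print(f(state[0]) <= vals[0] / t**2)  # hence f(X(t)) <= V(t_0) / t^2
```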

Lyapunov functions from continuous-time invariants
Consider a harmonic oscillator on the potential f(y) = ½y², i.e., ÿ = −y. From basic physics, we know that such a system is marginally stable (bounded dynamics). By choosing y = tx, we get ẏ = x + tẋ and ÿ = 2ẋ + tẍ. This implies tẍ + 2ẋ + tx = 0, i.e., ẍ + (2/t)ẋ + x = 0. That is, AGD-ODE2 can be reconstructed from a simple linearized pendulum. By introducing the variable v = ẏ, we can write the pendulum in phase space as a linear dynamical system

ẏ = v,    v̇ = −y.

Hence, the pendulum has the form ż = Az, where z = (y, v) and A = [[0, 1], [−1, 0]]. We would now like to get a Lyapunov function for this system. To do this, we recall a fundamental proposition (check Thm. 4.6 in [12]).
Proposition 1 (Continuous-time Lyapunov equations). The linear system ż = Az is Lyapunov stable if and only if, for all positive semidefinite matrices Q, there exists a symmetric matrix P such that AᵀP + PA = −Q. Moreover, V(z) = zᵀPz is a Lyapunov function and V̇(z) = −zᵀQz.
Since we know that a pendulum is only marginally stable (i.e., not asymptotically stable), we can limit ourselves to the choice of a null matrix Q = 0. Hence, we need to solve the Lyapunov equation AᵀP + PA = 0. The choice P = I works, so V(z) = zᵀz = y² + v² is a quadratic invariant, i.e., V̇(z) = 0. This is well known, since V is actually twice the total energy (Hamiltonian) of the pendulum. Finally, we can change variables back (y = tx, v = x + tẋ) and get that

V(t) = t² x(t)² + (x(t) + t ẋ(t))²

is a Lyapunov function for AGD-ODE2, with V̇(t) = 0. This is precisely equation (1), up to a factor 2 (recall f(x) = ½x², f* = 0, x* = 0).
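The computation above is easy to verify mechanically. The sketch below checks that P = I solves AᵀP + PA = 0 for the pendulum matrix, and that the resulting quadratic zᵀPz (twice the total energy) is constant along the exact flow of ż = Az, which is a rotation:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])           # pendulum in phase space: z' = A z, z = (y, v)
P = np.eye(2)

residual = A.T @ P + P @ A            # Lyapunov equation with Q = 0
print(np.abs(residual).max())         # 0: P = I is a solution

z0 = np.array([1.0, 0.3])
energies = [z @ P @ z for z in (expm(A * t) @ z0 for t in np.linspace(0.0, 10.0, 50))]
print(max(energies) - min(energies))  # ~0: V(z) = y^2 + v^2 is a quadratic invariant
```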
From quadratic to convex. With a small modification (using a factor α − 1 instead of 1), it is possible to get a Lyapunov function that works for AGD-ODE in the more general convex case.

Discrete-time invariants
We apply the construction from the last subsection to the discrete case. Inspired by Flammarion and Bach [4], we consider at first a slightly modified HB:

x_{k+1} = x_k + ((k−1)/(k+1)) (x_k − x_{k−1}) − h² (k/(k+1)) ∇f(x_k).    (HB2)

This algorithm is the discrete-time equivalent of Ẍ + (2/t)Ẋ + ∇f(X) = 0. As for the continuous-time case, we start from f(x) = ½⟨x, Hx⟩. In this case, if we set y_k = t_k x_k with t_k = hk, HB2 can be written as

y_{k+1} = 2 y_k − y_{k−1} − h² H y_k.    (6)

With surprise, we recognize that this is the Störmer-Verlet method [11] on ÿ = −Hy, with step-size h (that is why we had h² from the very beginning).
It would be natural, as for the continuous-time case, to consider the total energy as a quadratic invariant to derive a Lyapunov function. However, it turns out that, interestingly, the Störmer-Verlet method does not precisely conserve the total energy: there are small oscillations (see Section 3 in [9])! Taking into account such small oscillations (Figure 2) is of fundamental importance, since they lead to a crucial modification of the invariants we have to use.

Proposition 3 (Discrete-time Lyapunov equations). The system z_{k+1} = A z_k is Lyapunov stable if and only if, for all positive semidefinite matrices Q, there exists a symmetric matrix P such that AᵀPA − P = −Q. Moreover, V(z) = zᵀPz is a Lyapunov function and V(z_{k+1}) − V(z_k) = −z_kᵀQz_k for all k.

We apply the proposition above (for Q = 0) to the linear system

y_{k+1} = y_k + h v_k − h² H y_k,    v_{k+1} = v_k − h H y_k.

Under the choice v_k = (y_k − y_{k−1})/h, this system is equivalent to equation (6), i.e., the discretized pendulum we want to find a quadratic invariant for. Solving the discrete Lyapunov equation gives us the modified energy

Ẽ(y, v) = ½‖v‖² + ½⟨y, Hy⟩ − (h/2)⟨v, Hy⟩.    (8)

We make the following comments:

• As h → 0, the modified energy approaches the total energy in equation 3. The purpose of the additional cross term is to eliminate the small energy oscillations we see in Figure 2.
• Assuming without loss of generality that H does not have zero eigenvalues, P is positive semidefinite (i.e., yields a valid Lyapunov function) if and only if the Schur complement of the block P_22 in P is positive semidefinite. That is, we need

H − (h²/4) H² = H (I − (h²/4) H) ≥ 0.    (9)

Since H and (I − h²H/4) are co-diagonalizable, the product is positive semidefinite if and only if both H ≥ 0 and I − h²H/4 ≥ 0. This requires every eigenvalue λ of H to satisfy 0 ≤ λ ≤ 4/h², which in turn implies an upper bound on the step-size h²: namely, h² ≤ 4/λ_max(H). The same condition (note that our step-size is h², not h) can be deduced from the analysis of HB2 in [4]. Now it is time to change the variables back: y_k = t_k x_k. If we set v_k := (x_k − x_{k−1})/h, as also done in the introduction, and substitute these formulas in equation (8), we get a final form for an effective Lyapunov function for HB2 in the quadratic case. To better understand this Lyapunov function, we multiply everything by h² and recall that the "time" variable is t_k = hk; this yields equation 10. The resulting function can be easily generalized beyond quadratics by noting that ⟨x_k, H x_k⟩ = 2(f(x_k) − f*) and H x_k = ∇f(x_k) (recall x* = 0, f* = 0). Finally, note that:

• From equation 10, as h → 0, we get equation 4: the continuous-time Lyapunov function of [22].
• The mixing (cross) term is necessary, and it makes the positive definiteness of P (see equation (9)) non-trivial.
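These oscillations, and the exact conservation of the cross-term-corrected energy, can be seen directly in one dimension (H = 1). A quick check, with initial condition and step-size chosen arbitrarily; a direct computation (ours, consistent with the cross term discussed above) shows that ½v² + ½y² − (h/2)yv is exactly conserved, while the total energy is not:

```python
import numpy as np

h = 0.5
y = [1.0, 1.0 - h**2 / 2]            # Stormer-Verlet start: y_0 = 1, v_0 = 0
for _ in range(200):
    y.append(2 * y[-1] - y[-2] - h**2 * y[-1])    # SV on y'' = -y, step h
y = np.array(y)

v = (y[1:] - y[:-1]) / h             # discrete velocity v_k = (y_k - y_{k-1}) / h
yk = y[1:]
total = 0.5 * v**2 + 0.5 * yk**2                  # total energy: small oscillations
modified = total - (h / 2) * yk * v               # cross term removes them exactly
print(np.ptp(total), np.ptp(modified))
```

The total energy fluctuates at order h along the trajectory, while the modified energy is conserved to round-off: this is precisely the role of the cross term.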
All in all, in this subsection, we proved the following result.

Proposition 4 (informal). If h² ≤ 4/λ_max(H), then equation 10 is non-negative and satisfies V_{k+1} − V_k = 0 along the HB2 trajectory, for all k. From this, one can deduce an accelerated rate of O(1/k²) in suboptimality.
Details are given in the proof of Theorem 1, which is more general.
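The step-size threshold can also be checked numerically. Below we build the block matrix P = [[H, −(h/2)H], [−(h/2)H, I]] (our reconstruction of the matrix of the quadratic invariant, whose Schur complement is H(I − h²H/4)) for a random PSD Hessian H of our choosing, and confirm that it loses positive semidefiniteness exactly when h² exceeds 4/λ_max(H):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
H = M.T @ M                          # PSD Hessian of a convex quadratic
lam_max = np.linalg.eigvalsh(H).max()

def min_eig_P(h):
    I = np.eye(H.shape[0])
    P = np.block([[H, -(h / 2) * H],
                  [-(h / 2) * H, I]])
    return np.linalg.eigvalsh(P).min()

h_crit = 2.0 / np.sqrt(lam_max)      # boundary: h^2 = 4 / lam_max(H)
print(min_eig_P(0.9 * h_crit))       # >= 0: valid (PSD) invariant
print(min_eig_P(1.1 * h_crit))       # <  0: the quadratic form is now indefinite
```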

Accelerated Heavy-ball methods for convex quadratics
In this section, we start to lift the discussion to the convex non-quadratic setting, by providing a generalization of HB2. Indeed, we know from the continuous-time analysis in [22] that Ẍ + (2/t)Ẋ + ∇f(X) = 0 may not have an accelerated rate for functions which are convex but not necessarily quadratic. In this case, a rate of O(1/t²) only holds for

Ẍ + (α/t)Ẋ + ∇f(X) = 0

with α ≥ 3. In the same way, we expect that HB2 (which is the discretization for α = 2) may not have an accelerated rate in the convex non-quadratic setting, and a generalization corresponding to higher friction is therefore necessary. Our objective in this chapter is to construct such a generalization of HB2, which we name HBα.
The case 0 < α ≤ 3 was studied by Attouch et al. [3]: a convergence rate of O(1/t^(2α/3)) is shown in this case. The same result also holds in discrete time.

A generalized Heavy-ball with high friction and guarantees on quadratics
After a few weeks of intense calculations, we found that this algorithm gives the desired result (Thm. 1): a Heavy-ball recursion with a general friction parameter α, using a scaled gradient of the iterate x_k.    (HBα)

First, note that α = 2 recovers HB2, which we proved to be accelerated in the last subsection using a novel Lyapunov argument. The second, and perhaps most crucial, thing to note is that HBα recalls the high-friction generalization of AGD proposed by [22] (see Theorem 6 in their paper), which we denote AGDα. Between HBα and AGDα there are a few important differences:

• In AGDα the gradient is evaluated at x_k + ((k−1)/(k+α−1))(x_k − x_{k−1}), while in HBα it is evaluated at x_k.

• In HBα the effective step-size (i.e., what multiplies the gradient) is iteration-dependent, and goes from h²/2 to h² as k → ∞. We believe this should not be regarded as part of the acceleration mechanism: it is just a small modification needed to make the analysis easier.
• Arguably, HBα (neglecting the small correction) is conceptually simpler than AGDα: compared to GD, only a momentum term is added at each iteration, and this can be thought of as the source of acceleration.
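The first bullet, the only conceptual difference between the two families, is easiest to state in code. The sketch below uses a generic parameterization with momentum (k−1)/(k+α−1) and a constant step 1/L (an illustration of the distinction, not the exact HBα/AGDα updates with their iteration-dependent scaling): a single flag switches between evaluating the gradient at the iterate (Heavy-ball) and at the extrapolated point (Nesterov).

```python
import numpy as np

def quadratic(d, seed=0):
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((d, d))
    H = M.T @ M / d
    f = lambda x: 0.5 * x @ H @ x    # minimized at x* = 0, f* = 0
    grad = lambda x: H @ x
    L = np.linalg.eigvalsh(H).max()
    return f, grad, L

def momentum_method(grad, x0, h, iters, alpha=3, shifted=False):
    prev, cur = x0, x0
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + alpha - 1)
        y = cur + beta * (cur - prev)          # extrapolated point
        g = grad(y) if shifted else grad(cur)  # Nesterov vs Heavy-ball
        prev, cur = cur, y - h * g
    return cur

d = 20
f, grad, L = quadratic(d)
x0 = np.ones(d)
hb = momentum_method(grad, x0, 1 / L, 500, shifted=False)
agd = momentum_method(grad, x0, 1 / L, 500, shifted=True)
print(f(hb), f(agd))   # both far below f(x0); f* = 0
```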
We now proceed to prove that HBα is accelerated in the quadratic case. The proof relies on a candidate Lyapunov function, equation 11, with the following properties:
• It reduces to equation 10 in the case α = 2. For α > 2, it is a graspable generalization of equation 10, which we instead derived in a systematic way using Lyapunov equations. The term (α − 1) is inspired by the continuous-time limit in equation 5.
• Consider the Lyapunov function above, but without the cross term. This function works for proving an O(1/k²) rate for AGDα (it is a Lyapunov function, see Thm. 6 from [22]). Therefore, higher complexity (i.e., an additional cross term) is needed to study the acceleration of Heavy-ball, when compared to Nesterov's method.

Proof of the theorem
It is useful to simplify equation 11 and to work with the variables x_k and x_{k−1}, a natural choice in the discrete setting. We split the Lyapunov function into two parts, V_k = V_k^(1) + V_k^(2) (equation 12). First, we are going to study V_k^(2) in the non-quadratic case, and then V_k^(1) in the quadratic case. Theorem 1 will follow from a combination of the two corresponding lemmata. The first lemma shares many similarities with the proof of Theorem 1 in [7].
Lemma 1. For any differentiable function f : ℝᵈ → ℝ (not necessarily convex or L-smooth) and any sequence of iterates (x_k)_{k≥0} returned by HBα, the difference V_{k+1}^(2) − V_k^(2) admits an explicit expression in terms of x_{k+1} − x_k and x_{k+1} + x_k.

Proof. We proceed by computing x_{k+1} − x_k; the symmetric structure of the algorithm is fundamental here. The sum x_{k+1} + x_k is slightly more complex.
The proof is concluded by taking the inner product.
We proceed by computing the difference V_{k+1}^(1) − V_k^(1). Our calculations will be quick, since in the quadratic case we can leverage a simplified expression for V_k^(1).

Lemma 2. Let V_k^(1) be defined as in equation 12. In the context of Theorem 1, V_{k+1}^(1) − V_k^(1) admits an explicit expression.

Proof. From equation 12, we get a simplified form for V_k^(1) in the quadratic case. We proceed computing V_{k+1}^(1) − V_k^(1) using this simplified form, which leaves a residual term Δ_k. Now, recall the definition of HBα, where we subtract x* from both sides. By plugging this into Δ_k, the result follows after taking the inner product h²⟨x_k − x*, Δ_k⟩.
We are finally ready to prove the result.
Crucially, note that the terms including ⟨x_k − x*, H(x_k − x_{k−1})⟩ cancel. This is necessary to make our proof (or, probably, any proof) work, since such an inner product between the gradient and the momentum changes sign (infinitely) many times along the trajectory, and therefore cannot be easily compared to other quantities. For the same reason, in the corresponding continuous-time proof from [22], the terms including ⟨∇f(X), Ẋ⟩ also perfectly cancel out.
All in all, by collecting terms, we get an expression for V_{k+1} − V_k in which the matrix S := H − (h²/4)H² appears; this is the matrix we already studied in the context of the Lyapunov equations (see equation 9). Since α ≥ 2, a sufficient condition for V_{k+1} − V_k ≤ 0 is S ≥ 0, which holds under h² ≤ 4/λ_max(H). As a sanity check, the reader can appreciate the fact that, if α = 2, then V_{k+1} = V_k, as we already proved in Proposition 4 (this follows from the fact that V_k solves the Lyapunov equations). Last, we have to translate the fact that V_k is non-increasing into a convergence rate. This is not trivial in our case, since V_k also contains a cross term which is not necessarily positive. Actually, we do not even know that V_k ≥ 0 yet! Hence, we have to come up with some tricks. We start by rewriting the (simplified) Lyapunov function. Then, we add and subtract a term proportional to h²(k + α − 2)²⟨x_k − x*, H(x_k − x*)⟩, modulated by a constant ε > 0, obtaining a function Ṽ_k. Now, if we show that Ṽ_k is always non-negative, then V_{k+1} ≤ V_k for all k bounds the suboptimality f(x_k) − f* by a quantity proportional to V_0/(k + α − 2)², which gives the desired O(1/k²) rate. Therefore, we only need to show Ṽ_k ≥ 0. To do this, we introduce two new variables and get a simplified quadratic form for Ṽ_k, with matrix P̃. Hence, we just need to show that P̃ is positive semidefinite, for some ε and h². Using the Schur characterization for positive semidefinite matrices, P̃ ≥ 0 if and only if a reduced matrix B(ε) is positive semidefinite.
It is clear that B(ε) is positive semidefinite if and only if a simple scalar condition on ε and h² holds, which concludes the proof of Theorem 1.

Numerical verification of our Lyapunov function
We verify numerically that the Lyapunov function for HBα proposed in equation 11 works on quadratics. To show more clearly the effect of the inner-product correction term, which originated from the quadratic invariant of the Störmer-Verlet method, we use here a slightly different notation and denote the cross term by V_k^(12). We recall that the term V_k^(12) vanishes as h → 0, and is indeed not present in the continuous-time limit. We show that this term, which we derived using Lyapunov equations in Sec. 2, plays a fundamental role in ensuring V_{k+1} − V_k ≤ 0. In Figure 3 we verify Thm. 1 numerically. In Figure 4 we show the essential role of V_k^(12). Here we used h² = 1/L, but HBα can take larger steps (up to h² = 4/L), while the other algorithms become unstable (Figure 5).
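In the same spirit, the check below replicates the α = 2 experiment of Figure 3 in a few lines. It uses our reconstruction of HB2 (momentum (k−1)/(k+1) and effective step h²k/(k+1), matching the Störmer-Verlet derivation of Section 2.2; an assumption, since we restate the update from the analysis) and evaluates the modified energy in the rescaled variables y_k = hk x_k; as predicted by Prop. 4, it is constant along the trajectory, and the cross term is exactly what makes this possible:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
M = rng.standard_normal((d, d))
H = M.T @ M / d                      # f(x) = 0.5 x^T H x, x* = 0, f* = 0
L = np.linalg.eigvalsh(H).max()
h = 1.0 / np.sqrt(L)                 # step-size h^2 = 1/L <= 4/L

# Reconstructed HB2: x_{k+1} = x_k + (k-1)/(k+1) (x_k - x_{k-1}) - h^2 k/(k+1) H x_k
xs = [rng.standard_normal(d)]
xs.append(xs[0])                     # x_1 = x_0 (t_0 = 0, so x_0 plays no role)
for k in range(1, 300):
    beta, step = (k - 1) / (k + 1), h**2 * k / (k + 1)
    xs.append(xs[-1] + beta * (xs[-1] - xs[-2]) - step * (H @ xs[-1]))

# Modified energy in oscillator variables y_k = h k x_k, v_k = (y_k - y_{k-1}) / h
E = []
for k in range(1, len(xs)):
    y, y_prev = h * k * xs[k], h * (k - 1) * xs[k - 1]
    v = (y - y_prev) / h
    E.append(0.5 * v @ v + 0.5 * y @ H @ y - (h / 2) * v @ (H @ y))
print(np.ptp(E) / abs(E[0]))         # relative drift ~ round-off: V_k is constant
```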

Conclusion
In conclusion, the question of whether the Heavy-ball method is globally accelerated for non-strongly-convex quadratic problems has yet to be fully answered, and has attracted the attention of recent research [23]. Our study takes a novel approach by examining momentum through the lens of quadratic invariants of simple harmonic oscillators: by utilizing the modified Hamiltonian of Störmer-Verlet integrators, we were able to construct a Lyapunov function that demonstrates an O(1/k²) rate for Heavy-ball in the case of convex quadratic problems, where eigenvalues can vanish. This is a promising first step towards potentially proving the acceleration of Polyak's momentum method through Lyapunov function arguments.

Figure 2 :
Figure 2: The Störmer-Verlet method on a one-dimensional quadratic potential (i.e., a simplified pendulum) does not conserve the total energy. Details on this phenomenon can be found in [10, 9].

Figure 3 :
Figure 3: Dynamics of the Lyapunov function for HBα on linear regression (ill-conditioned Hessian, with condition number κ). Shown is the behavior for α = 2, 3 with step-size 1/L. For α = 2, V_k is constant, as predicted by Prop. 4. For α = 3, V_k is decreasing, as predicted by Thm. 1.

Figure 4 :
Figure 4: Same setting as the second example in Figure 3, but with a different candidate Lyapunov function (no cross term). This confirms that the cross term is necessary.