A Speed Restart Scheme for a Dynamics with Hessian-Driven Damping

In this paper, we analyze a speed restarting scheme for the inertial dynamics with Hessian-driven damping, introduced by Attouch et al. (J Differ Equ 261(10):5734–5783, 2016). We establish a linear convergence rate for the function values along the restarted trajectories. Numerical experiments suggest that the Hessian-driven damping and the restarting scheme together improve the performance of the dynamics and corresponding iterative algorithms in practice.


Introduction
In convex optimization, first-order methods are iterative algorithms that use function values and (generalized) derivatives to build minimizing sequences. Perhaps the oldest and simplest of them is the gradient method [13], which can be interpreted as a finite-difference discretization of the differential equation

ẋ(t) + ∇φ(x(t)) = 0,  (1)

describing the steepest descent dynamics. The gradient method is applicable to smooth functions, but there are more contemporary variations that can deal with nonsmooth ones, and even exploit the functions' structure to enhance the algorithm's per-iteration complexity, or overall performance. A keynote example (see [9] for further insight) is the proximal-gradient (or forward-backward) method [18, 28] (see also [22, 30]), which is in close relationship with a nonsmooth version of (1). In any case, the analysis of related differential equations or inclusions is a valuable source of insight into the dynamic behavior of these iterative algorithms.
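As a minimal illustration of this interpretation (the quadratic objective and step size below are ours, purely for the example), an explicit Euler discretization of (1) with step h yields the gradient iteration x_{k+1} = x_k − h ∇φ(x_k):

```python
import numpy as np

def gradient_method(grad, x0, step, iters):
    """Explicit Euler discretization of the steepest descent dynamics (1):
    x_{k+1} = x_k - step * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Illustrative run on phi(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = gradient_method(lambda x: x, x0=[1.0, -2.0], step=0.1, iters=100)
```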
In [29, 25], the authors introduced inertial substeps in the iterations of the gradient method, in order to accelerate its convergence. This variation improves the worst-case convergence rate from O(k⁻¹) to O(k⁻²). The continuous-time counterpart of these methods is the inertial dynamics with asymptotically vanishing damping

ẍ(t) + (α/t) ẋ(t) + ∇φ(x(t)) = 0,  (AVD)

where α > 0.
Despite the faster convergence rate guarantees, trajectories satisfying (AVD), as well as sequences generated by inertial first-order methods, exhibit a somewhat chaotic behavior, especially if the objective function is ill-conditioned. In particular, the function values tend not to decrease monotonically, but rather to display an oscillatory character.
Example 1.1. We consider the quadratic function φ : R³ → R defined by (2), whose condition number is max{ρ², ρ⁻²}. Figure 1 shows the behavior of the solution to (AVD), with x(1) = (1, 1, 1) and ẋ(1) = −∇φ(x(1)) (the direction of maximum descent).

In order to avoid this undesirable behavior, and partly inspired by a continuous version of Newton's method [2], Attouch, Peypouquet and Redont [7] proposed a Dynamic Inertial Newton system with Asymptotically Vanishing Damping, given by

ẍ(t) + (α/t) ẋ(t) + β ∇²φ(x(t)) ẋ(t) + ∇φ(x(t)) = 0,  (DIN-AVD)

where α, β > 0. In principle, this expression only makes sense when φ is twice differentiable, but the authors show that it can be transformed into an equivalent first-order system in time and space, which can be extended to a differential inclusion that is well posed whenever φ is closed and convex. They presented (DIN-AVD) as a continuous-time model for the design of new algorithms, a line of research already outlined in [7], and continued in [4]. Back to (DIN-AVD), the function values approach the minimum along the solutions, at the same rates as for (AVD). Nevertheless, in contrast with the solutions of (AVD), the oscillations are tame.
Example 1.2. In the context of Example 1.1, Figure 2 shows the behavior of the solution to (DIN-AVD) in comparison with that of (AVD), both with x(1) = (1, 1, 1) and ẋ(1) = −∇φ(x(1)).

An alternative way to avoid, or at least moderate, the oscillations exemplified in Figure 1 for the solutions of (AVD) is to stop the evolution and restart it with zero initial velocity, from time to time. The simplest option is to do so periodically, at fixed intervals. This idea is used in [26] for the accelerated gradient method, where the number of iterations between restarts depends on the strong convexity parameter of the function. See also [24, 1, 8], where the problem of estimating the appropriate restart times is addressed. An adaptive policy for the restarting of Nesterov's method was proposed by O'Donoghue and Candès in [27], where the algorithm is restarted at the first iteration k such that φ(x_{k+1}) > φ(x_k), which prevents the function values from increasing locally. This kind of restarting criterion shows a remarkable performance, although convergence rate guarantees have not been established; some partial steps in this direction have been made in [15, 17]. Moreover, the authors of [27] observe that this heuristic displays an erratic behavior when the difference φ(x_k) − φ(x_{k+1}) is small, due to the prevalence of cancellation errors. Therefore, this method must be handled with care if high accuracy is desired. A different restarting scheme, based on the speed of the trajectories, is proposed for (AVD) in [31], where convergence rates are established. The improvement can be remarkable, as shown in Figure 3.
In [31], the authors also perform numerical tests using Nesterov's inertial gradient method, with this restarting scheme as a heuristic, and observe a faster convergence to the optimal value.
The aim of this work is to analyze the impact that the speed restarting scheme has on the solutions of (DIN-AVD), in order to set the theoretical foundations to further accelerate Hessian-driven inertial algorithms, like the ones in [4], by means of a restarting policy. We provide linear convergence rates for functions with quadratic growth, and observe a noticeable improvement in the behavior of the trajectories in terms of stability and convergence speed, both in comparison with the non-restarted trajectories, and with the restarted solutions of (AVD). As a byproduct, we generalize and improve some of the results in [31].
The paper is organized as follows: In Section 2, we describe the speed restarting scheme and state the convergence rate of the corresponding trajectories, which is the main theoretical result of this paper. Section 3 contains the technical auxiliary results, especially some estimations on the restarting time, leading to the proof of our main result, which is carried out in Section 4. Finally, we present a few simple numerical examples in Section 5, in order to illustrate the improvements, in terms of convergence speed, of the restarted trajectories.

Figure 3: Values along the trajectory, with (red) and without (blue) restarting, for (AVD).
2 Restarted trajectories for (DIN-AVD)

Throughout this paper, φ : R^n → R is a twice continuously differentiable convex function, which attains its minimum value φ*, and whose gradient ∇φ is Lipschitz-continuous with constant L > 0. Consider the ordinary differential equation (DIN-AVD), with initial conditions x(0) = x₀ and ẋ(0) = 0, and parameters α > 0 and β > 0. By a solution, we mean a function x, continuously differentiable on [0, ∞) and twice continuously differentiable on (0, ∞), such that x(0) = x₀, ẋ(0) = 0 and (DIN-AVD) holds for every t > 0. Existence and uniqueness of such a solution is not straightforward, due to the singularity at t = 0, but can be established by a limiting procedure. As shown in Appendix A, we have the following:

Theorem 2.1. For every x₀ ∈ R^n, the differential equation (DIN-AVD), with initial conditions x(0) = x₀ and ẋ(0) = 0, has a unique solution.
We are concerned with the design and analysis of a restart scheme to accelerate the convergence of the solutions of (DIN-AVD) to minimizers of φ, based on the method proposed in [31].

A speed restarting scheme and the main theoretical result
Since the damping coefficient α/t vanishes as t → ∞, large values of t result in a weaker stabilization of the trajectory. The idea is thus to restart the dynamics at the point where the speed ceases to increase. More precisely, following [31], for each z ∈ R^n we denote by y_z the solution of (DIN-AVD) with initial conditions y_z(0) = z and ẏ_z(0) = 0, and define the restarting time as

T(z) = sup{ t > 0 : ‖ẏ_z(u)‖² is increasing on (0, t) }.
If z ∉ argmin(φ), then T(z) cannot be 0. In fact, we shall prove (see Corollaries 3.5 and 3.8) that there are constants 0 < τ ≤ τ̄ < ∞, independent of z, such that

τ ≤ T(z) ≤ τ̄.  (5)

Definition 2.3. Given x₀ ∈ R^n, the restarted trajectory χ_{x₀} is defined inductively as follows:

1. Set T₁ = T(x₀) and S₁ = T₁, and define χ_{x₀}(t) = y_{x₀}(t) for t ∈ [0, S₁].

2. For i ≥ 1, having defined χ_{x₀}(t) for t ∈ [0, S_i], set x_i = χ_{x₀}(S_i), and compute y_{x_i}. Then, set T_{i+1} = T(x_i) and S_{i+1} = S_i + T_{i+1}, and define χ_{x₀}(t) = y_{x_i}(t − S_i) for t ∈ (S_i, S_{i+1}].

In view of (5), S_i is defined for all i ≥ 1, inf_{i≥1}(S_{i+1} − S_i) > 0 and lim_{i→∞} S_i = ∞. Moreover, in view of Remark 2.2, the function values do not increase at the restarting points.

Our main theoretical result establishes that φ(χ_{x₀}(t)) converges linearly to φ*, provided there exists µ > 0 such that

µ (φ(z) − φ*) ≤ ½ ‖∇φ(z)‖²  (6)

for all z ∈ R^n. The Łojasiewicz inequality (6) is equivalent to quadratic growth and is implied by strong convexity (see [11]). More precisely, we have the following:

Theorem 2.5. Let φ : R^n → R be convex and twice continuously differentiable. Assume ∇φ is Lipschitz-continuous with constant L > 0, there exists µ > 0 such that (6) holds, and the minimum value φ* of φ is attained. Given α ≥ 3 and β > 0, let χ_{x₀} be the restarted trajectory defined by (DIN-AVD) from an initial point x₀ ∈ R^n. Then, there exist constants C, K > 0 such that

φ(χ_{x₀}(t)) − φ* ≤ C e^{−Kt}

for every t > 0. The rather technical proof is split into several parts and presented in the next sections.
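To illustrate Definition 2.3, here is a minimal numerical sketch, assuming (DIN-AVD) has the form ẍ + (α/t)ẋ + β∇²φ(x)ẋ + ∇φ(x) = 0 introduced above. It integrates the dynamics with scipy, starts each leg with zero velocity, offsets the initial time by a small t₀ > 0 to sidestep the singularity at t = 0 (a numerical workaround, not part of the theory), and restarts as soon as the speed ‖ẋ‖ ceases to increase; the quadratic test function and all parameter values are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp

def din_avd_rhs(t, state, hess, grad, alpha, beta):
    # First-order reformulation of (DIN-AVD) in the variables (x, v):
    # v' = -(alpha / t) v - beta * Hess(x) v - grad(x).
    n = state.size // 2
    x, v = state[:n], state[n:]
    dv = -(alpha / t) * v - beta * (hess(x) @ v) - grad(x)
    return np.concatenate([v, dv])

def restarted_trajectory(x0, hess, grad, alpha, beta, horizon, dt=1e-2, t0=1e-3):
    """Speed restart (Definition 2.3): integrate from zero velocity and restart
    as soon as the speed ||x'(t)|| ceases to increase."""
    x = np.asarray(x0, dtype=float)
    points, elapsed = [x.copy()], 0.0
    while elapsed < horizon:
        v, t, speed = np.zeros_like(x), t0, 0.0   # new leg: zero initial velocity
        while elapsed < horizon:
            sol = solve_ivp(din_avd_rhs, (t, t + dt), np.concatenate([x, v]),
                            args=(hess, grad, alpha, beta), rtol=1e-9, atol=1e-12)
            x, v = sol.y[:x.size, -1], sol.y[x.size:, -1]
            t += dt
            elapsed += dt
            points.append(x.copy())
            new_speed = np.linalg.norm(v)
            if new_speed < speed:                 # speed stopped increasing: restart
                break
            speed = new_speed
    return np.array(points)

# Illustrative ill-conditioned quadratic: grad(x) = A x, Hess(x) = A.
A = np.diag([100.0, 1.0, 0.01])
traj = restarted_trajectory([1.0, 1.0, 1.0], hess=lambda x: A, grad=lambda x: A @ x,
                            alpha=3.1, beta=0.25, horizon=5.0)
```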

Technicalities
Throughout this section, we fix z ∉ argmin(φ) and, in order to simplify the notation, we denote by x (instead of y_z) the solution of (DIN-AVD) with initial conditions x(0) = z and ẋ(0) = 0.

A few useful artifacts
We begin by defining some useful auxiliary functions and point out the main relationships between them.
To this end, we first rewrite equation (DIN-AVD) as (7). Integrating (7) over [0, t], we get (8) and (9). In order to obtain an upper bound for the speed ‖ẋ‖, the integrals will be majorized using the auxiliary function M_z, which is positive, nondecreasing and continuous.
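The displays (7)–(9) are not legible in this version. A standard way to carry out this rewriting, consistent with the surrounding references, multiplies (DIN-AVD) by t^α and uses the chain rule identity ∇²φ(x(t))ẋ(t) = (d/dt)∇φ(x(t)); the following LaTeX sketch shows the computation under that assumption:

```latex
% Assuming (7) is the rewriting below, with
% beta * Hess(phi)(x) xdot = beta * (d/dt) grad(phi)(x):
\frac{d}{dt}\bigl(t^{\alpha}\dot{x}(t)\bigr)
  = t^{\alpha}\Bigl(\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t)\Bigr)
  = -\,t^{\alpha}\Bigl(\beta\,\frac{d}{dt}\nabla\varphi(x(t)) + \nabla\varphi(x(t))\Bigr),
\qquad
\dot{x}(t)
  = -\frac{1}{t^{\alpha}}\int_{0}^{t} u^{\alpha}
    \Bigl(\beta\,\frac{d}{du}\nabla\varphi(x(u)) + \nabla\varphi(x(u))\Bigr)\,du,
```

where the integral representation follows since ẋ(0) = 0 and u^α ẋ(u) → 0 as u → 0.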
Lemma 3.1. For every t > 0, the integrals above admit upper bounds expressed in terms of M_z.

Proof. For the first estimate, we use the Lipschitz continuity of ∇φ and the fact that M_z is nondecreasing, to bound the integrand. Then, from the definition of I_z(t), we deduce the first inequality. For the second inequality, we proceed analogously.

The dependence of M_z on the initial condition z may be greatly simplified. To this end, we introduce the function H defined in (13). The function H is concave, quadratic, does not depend on z, and has exactly one positive zero, which we denote by τ₁. In particular, H decreases strictly from 1 to 0 on [0, τ₁].
Lemma 3.2. For every t ∈ (0, τ₁), M_z(t) admits an upper bound involving only ‖∇φ(z)‖ and H(t).

Proof. If 0 < u ≤ t, using (8) and (9), along with Lemma 3.1, we obtain a pointwise estimate. Since the right-hand side is nondecreasing in t, we take the supremum for u ∈ [0, t] to deduce a corresponding bound for M_z(t). Rearranging the terms, and using the definition of H given in (13), we obtain the announced bound. We conclude by observing that H is positive on (0, τ₁).
We highlight the fact that the bound above depends on z only via the factor ‖∇φ(z)‖.

Estimates for the restarting time
We begin by finding a lower bound for the restarting time, depending on the parameters α, β and L, but not on the initial condition z.
Lemma 3.4. Let z ∉ argmin(φ), and let x be the solution of (DIN-AVD) with initial conditions x(0) = z and ẋ(0) = 0. For every t ∈ (0, τ₁), the derivative (d/dt)‖ẋ(t)‖² admits a lower bound involving a function G that does not depend on z.

Proof. From (8) and (9), we have an integral representation of ẋ(t). On the other hand, we can express (d/dt)‖ẋ(t)‖² in terms of ẋ(t) and ẍ(t), and split the resulting expression into two terms. For the first term, we use the Cauchy-Schwarz inequality and Corollary 3.3. For the second term, we first use (17), and then apply Corollary 3.3 once more. Combining the two estimates, we conclude.

The function G does not depend on the initial condition z, and has a unique positive zero, which we denote by τ₃. In view of the definition of the restarting time, an immediate consequence of Lemma 3.4 is:

Corollary 3.5. Let T* = inf{ T(z) : z ∉ argmin(φ) }. Then, τ₃ ≤ T*.
Remark 3.6. If β = 0, then τ₃ has an explicit expression in terms of α and L. The case α = 3 and β = 0 was studied in [31], where the authors provided 4/(5√L) as a lower bound for the restarting time; the arguments presented here yield a higher bound. Recall that the function H given in (13) decreases from 1 to 0 on [0, τ₁]. Therefore, H(t) > 1/2 for all t ∈ [0, τ₂), where τ₂ is the unique point at which H attains the value 1/2. Evaluating the right-hand side of (18), we obtain an explicit estimate on this interval. These facts will be useful to provide an upper bound for the restarting time.
4 Function value decrease and proof of Theorem 2.5

The next result provides the ratio by which the function values have been reduced by the time the trajectory is restarted.
Remark 4.2. Moreover, in view of (21) and Corollary 3.5, we can take τ* = τ₃ to obtain a lower bound. If β = 0, we obtain an expression which is independent of L. As a consequence, the inequality in Proposition 4.1 takes a simpler form; for α = 3, this recovers the setting of [31], where a similar result was obtained for strongly convex functions. Our constant is approximately 66.37% larger than the one from [31], which implies a greater reduction in the function values each time the trajectory is restarted. On the other hand, if β > 0, we can still obtain a slightly smaller lower bound, namely

Ψ(τ₃) > ( (2α + 1) / (2α + 2) )²,

independent of β and L. The proof is technical and will be omitted.
Remark 4.3. It is possible to implement a fixed restart scheme. To this end, we modify Definition 2.3 by setting T_i ≡ τ, with any τ ∈ (0, τ₂) ∩ (0, T*], such as τ* or τ₃, for example. In theory, τ* gives the same convergence rate as the original restart scheme presented throughout this work. From a practical perspective, though, restarting the dynamics too soon may result in poorer performance. Therefore, finding larger values of τ* and τ₃ is crucial to implement a fixed restart (see Remarks 3.6 and 4.2).

Numerical illustration
In this section, we provide a very simple numerical example to illustrate how the convergence is improved by the restarting scheme.A more thorough numerical analysis will be carried out in a forthcoming paper, where implementable optimization algorithms will be analyzed.

Example 1.2 revisited
We consider the quadratic function φ : R³ → R, defined in Example 1.1 by (2), with ρ = 10. We set α = 3.1 and β = 0.25, and compute the solutions of (AVD) and (DIN-AVD), starting from x(1) = x₁ = (1, 1, 1) with zero initial velocity, with and without restarting, using the Python tool odeint from the scipy package. Figure 4 shows a comparison of the values along the trajectory with and without restarting, first for (AVD), and then for (DIN-AVD). In both cases, the restarted trajectories appear to be more stable and converge faster.

However, one can do better. As mentioned earlier, restarting schemes based on function values are effective from a practical perspective, but show an erratic behavior as the trajectory approaches a minimizer. It seems natural, as a heuristic, to use the first (or n-th) function-value restart point as a warm start, and then apply speed restarts, for which we have obtained convergence rate guarantees. Although the velocity must be set to zero after each restart, there are no constraints on the initial velocity used to compute the warm starting point. The results are shown in Figure 5, with initial velocity set to zero and to ẋ(1) = −β∇φ(x₁), respectively.

A linear regression after the first restart provides estimates for the linear convergence rate of the function values along the corresponding trajectories, when modeled as φ(χ(t)) ∼ A e^{−Bt}, with A, B > 0. The results are displayed in Table 1. The absolute value of the exponent B in the linear convergence rate is increased by 34.67% in the case ẋ(1) = 0, and by 39.86% in the case ẋ(1) = −β∇φ(x₁). The minimum values for the methods presented in Figure 5 can also be compared: the last and best function values on [1, 25] are displayed in Table 2. In all cases, the best value without restart is approximately 10⁴ times larger than the one obtained with our policy. We also observe similar final values for the restarted trajectories, despite the different initial velocities.
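The regression model φ(χ(t)) ∼ A e^{−Bt} can be fitted by least squares on the logarithm of the optimality gap; a minimal sketch (the synthetic data and the tolerance guard below are ours, for illustration only):

```python
import numpy as np

def fit_linear_rate(ts, values, phi_star=0.0):
    """Fit phi(chi(t)) - phi_star ~ A * exp(-B t) by least squares on the log."""
    gap = np.maximum(np.asarray(values, dtype=float) - phi_star, 1e-300)  # guard log(0)
    slope, intercept = np.polyfit(ts, np.log(gap), 1)
    return np.exp(intercept), -slope   # estimates of A and B

# Synthetic check: data generated with A = 3 and B = 0.8 is recovered by the fit.
ts = np.linspace(2.0, 25.0, 200)
values = 3.0 * np.exp(-0.8 * ts)
A_hat, B_hat = fit_linear_rate(ts, values)
```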

A first exploration of the algorithmic consequences
Different discretizations of (DIN-AVD) can be used to design implementable algorithms and generate minimizing sequences for φ, which hopefully share the stable behavior we observe in the solutions of (DIN-AVD). Three such algorithms were first proposed in [4], for which we implemented a speed restart scheme, analogous to the one we have used for the solutions of (DIN-AVD). Since we obtained very similar results and the numerical analysis of algorithms is not the focus of this paper, we describe only the simplest one in detail, and present the numerical results for that one. As in [31], a parameter k_min is introduced, to prevent two consecutive restarts from being too close.

Algorithm 1: Inertial Gradient Algorithm with Hessian Damping (IGAHD) -Speed Restart version
Choose the initial points x₁ = x₀ ∈ R^n, the step size h > 0, and the parameters α, β and k_min; the iteration then follows the IGAHD scheme of [4], restarting whenever the speed criterion is met (see the sketch after Example 5.1).

Example 5.1. We begin by applying Algorithm 1, as well as the variation with the warm start, to the function φ : R³ → R of Examples 1.1 and 1.2, with the parameters k_min = 10, β = h = 1/√L and α = 3.1. Figure 6 shows the evolution of the function values along the iterations. The coefficients in the approximation φ(x_k) ∼ A e^{−Bk}, with A, B > 0, obtained for each algorithm, are detailed in Table 3. As one would expect, the value of B is similar and that of A is significantly lower. Also, Table 4 shows the values obtained along 1000 iterations. The best value without restart is 10⁵ times larger than the one obtained with our policy.
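Since the display of Algorithm 1 is not fully reproduced in this version, the following Python sketch is only our reading of how it operates: it implements the IGAHD iteration as stated in [4] (with α_k = 1 − α/k, gradient correction βh(∇φ(x_k) − ∇φ(x_{k−1})), and a gradient step of size h²), restarted whenever the step length ‖x_{k+1} − x_k‖, a discrete proxy for the speed, ceases to increase after at least k_min iterations. The restart test and the proxy are assumptions modeled on the speed restart of [31], not the authors' exact pseudocode.

```python
import numpy as np

def igahd_speed_restart(grad, x0, L, alpha=3.1, k_min=10, n_iters=1000):
    """IGAHD-type iteration (after [4]) with a speed restart: restart when the
    step length ||x_{k+1} - x_k|| stops increasing, after at least k_min steps."""
    h = 1.0 / np.sqrt(L)
    beta, s = h, h ** 2
    x_prev = np.asarray(x0, dtype=float)
    x_cur = x_prev.copy()
    k, last_step = 1, 0.0
    history = [x_cur.copy()]
    for _ in range(n_iters):
        g_cur, g_prev = grad(x_cur), grad(x_prev)
        y = (x_cur + (1.0 - alpha / k) * (x_cur - x_prev)
             - beta * h * (g_cur - g_prev) - (beta * h / k) * g_prev)
        x_next = y - s * grad(y)
        step = np.linalg.norm(x_next - x_cur)
        if k >= k_min and step < last_step:
            # speed restart: drop the momentum and reset the iteration counter
            x_prev, x_cur, k, last_step = x_next.copy(), x_next, 1, 0.0
        else:
            x_prev, x_cur, k, last_step = x_cur, x_next, k + 1, step
        history.append(x_cur.copy())
    return np.array(history)
```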

Example 5.2. Given a positive definite symmetric matrix A of size n × n, and a vector b ∈ R^n, define φ : R^n → R by φ(x) = ½ xᵀAx + bᵀx. For the experiment, n = 500, A is randomly generated with eigenvalues in (0, 1), and b is also chosen at random. We first compute L, and set k_min = 10, h = 1/√L, α = 3.1 and β = h. The initial points satisfy x₀ = x₁.
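A sketch of how such an instance can be generated (the paper does not specify the sampling mechanism, so the orthogonal-conjugation construction below is one natural choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Random symmetric positive definite A with eigenvalues in (0, 1).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))      # random orthogonal factor
eigenvalues = rng.uniform(1e-3, 1.0, size=n)
A = Q @ np.diag(eigenvalues) @ Q.T
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ (A @ x) + b @ x             # objective
grad = lambda x: A @ x + b                            # its gradient
L = eigenvalues.max()                                 # Lipschitz constant of grad

h = 1.0 / np.sqrt(L)
alpha, beta, k_min = 3.1, h, 10
```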

Table 4: Function values for Example 5.1.

Table 5: Coefficients in the linear regression for Example 5.2.

Appendix A

We consider a differential equation of the general form (24), written in terms of a damping coefficient γ and mappings F and G. We assume that γ is continuous and positive, with lim_{t→0} γ(t) = +∞, and that F and G are (continuous and) sufficiently regular, so that the differential equation (24), with initial conditions x(δ) = x_δ and ẋ(δ) = v_δ, has a unique solution defined on [δ, T∞), for some T∞ ∈ (0, ∞] and all δ > 0. We have the following:

Theorem A.1. Assume there is T > 0 such that a suitable uniform estimate holds for these solutions on (0, T]. Then, the differential equation (24), with initial conditions x(0) = x₀ and ẋ(0) = v₀, has a solution.