Explicit Stabilised Gradient Descent for Faster Strongly Convex Optimisation

We introduce the explicit stabilised gradient descent method (ESGD) for strongly convex optimisation problems. This new algorithm is based on explicit stabilised integrators for stiff differential equations, a powerful class of numerical schemes to avoid the severe step size restriction faced by standard explicit integrators. For optimising quadratic and strongly convex functions, we prove that ESGD nearly achieves the optimal convergence rate of the conjugate gradient algorithm, and the suboptimality of ESGD diminishes as the condition number of the quadratic function worsens. We show that this optimal rate is obtained also for a partitioned variant of ESGD applied to perturbations of quadratic functions. In addition, numerical experiments on general strongly convex problems show that ESGD outperforms Nesterov's accelerated gradient descent.


Introduction
Optimisation is at the heart of many applied mathematical and statistical problems, while its beauty lies in the simplicity of describing the problem in question. In this work, given a function f : R^d → R, we are interested in finding a minimiser x* ∈ R^d of the program

min_{x ∈ R^d} f(x). (1)

We make the common assumption throughout that f ∈ F_{µ,L}, namely the set of µ-strongly convex differentiable functions that have L-Lipschitz continuous derivative [1]. Corresponding to f is its gradient flow, defined as

dx/dt = −∇f(x(t)), x(0) = x_0, (2)

where x_0 is its initialisation. It is easy to see that traversing the gradient flow always reduces the value of f. Indeed, for any positive h, it holds that

f(x(t + h)) − f(x(t)) = −∫_t^{t+h} ‖∇f(x(s))‖_2^2 ds. (3)
By discretising the gradient flow in (2), we can design various optimisation algorithms for Program (1). For example, substituting in (2) the finite-difference approximation

dx/dt (t_n) ≈ (x_{n+1} − x_n)/h (4)

with the gradient evaluated at x_n, we obtain gradient descent (GD), namely

x_{n+1} = x_n − h∇f(x_n), (5)

where h > 0 is the step size [1]. For this discretisation to remain stable, that is, for the GD update in (5) to remain close to the gradient flow and for the value of f to reduce in every iteration, the step size h must not be too large. Indeed, a well-known shortcoming of GD is that we must take h ≤ 2/L to ensure stability; otherwise f might increase from one iteration to the next [1]. This drawback of GD has motivated a number of research works that use different discretisations of (2) to design better optimisation algorithms [2,3,4,5,6]. For example, substituting in (2) the approximation (4) with the gradient evaluated at x_{n+1} instead of x_n, we arrive at the update

x_{n+1} = x_n − h∇f(x_{n+1}), (6)

which is known as the implicit Euler method in numerical analysis [7] because, as the name suggests, it involves solving (6) for x_{n+1}. It is not difficult to see that, unlike GD, there is no limit on the step size h for the implicit Euler method, a property known in the numerical analysis jargon as A-stability [7]. Moreover, it is easy to verify that x_{n+1} in (6) is also the unique minimiser of the program

min_{x ∈ R^d} { f(x) + ‖x − x_n‖_2^2 / (2h) }, (7)

and in fact the map from x_n to x_{n+1} is known in the optimisation literature as the proximal map of the function hf [8]. Unfortunately, even if ∇f is known explicitly, solving (6) for x_{n+1}, or equivalently computing the proximal map, is often just as hard as solving Program (1), with a few notable exceptions [8]. This setback severely limits the applicability of the proximal algorithm in (6) for solving Program (1).
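To make the stability contrast concrete, the following sketch (with illustrative parameter values, not taken from the paper) applies both discretisations to the scalar quadratic f(x) = λx²/2, for which the proximal map happens to be available in closed form:

```python
# Explicit vs. implicit Euler on f(x) = lam * x**2 / 2 (gradient flow x' = -lam * x).
# The values of lam and h below are illustrative assumptions.
lam = 100.0   # curvature; plays the role of the Lipschitz constant L
h = 0.025     # step size, deliberately above the explicit stability limit 2/L = 0.02

def explicit_euler(x, n_steps):
    # GD update (5): x <- x - h * grad f(x)
    for _ in range(n_steps):
        x = x - h * lam * x
    return x

def implicit_euler(x, n_steps):
    # proximal update (6): solve x_next = x - h * lam * x_next in closed form
    for _ in range(n_steps):
        x = x / (1.0 + h * lam)
    return x

x_gd = explicit_euler(1.0, 50)   # |1 - h*lam| = 1.5 > 1: the iterates blow up
x_pa = implicit_euler(1.0, 50)   # 1/(1 + h*lam) < 1: stable for any h > 0
```

For any h above 2/λ the explicit iterates diverge geometrically, while the implicit (proximal) iterates contract for every positive step size, which is precisely the A-stability property discussed above.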
Contributions. With this motivation, we propose the Explicit Stabilised Gradient Descent (ESGD) method for solving Program (1). ESGD offers the best of both worlds, namely the computational tractability of GD (explicit Euler method) and the stability of the proximal algorithm (implicit Euler method). Inspired by [2], ESGD uses explicit stabilised methods [9,10,11] to discretise the gradient flow (2). In numerical analysis, explicit stabilised methods provide a computationally efficient alternative to the implicit Euler method for stiff differential equations, where standard integrators face a severe step size restriction, in particular in high dimensions for spatial discretisations of diffusion PDEs, see the review [12]. Every iteration of ESGD consists of s internal stages, where each stage performs a simple GD-like update. Unlike GD, however, the value of f does not decrease monotonically along the internal stages of ESGD, which allows the method to take longer steps and travel faster along the gradient flow. After s internal stages, ESGD ensures that its new iterate is stable, namely the value of f indeed decreases after each iteration of ESGD.
The rest of this paper is organised as follows. Section 2 formally introduces ESGD, which is summarised in Algorithm 1 and accessible without reading the rest of this paper. In Section 3, we then quantify the performance of ESGD for solving strongly convex quadratic programs, while in Section 4 we introduce and study theoretically a partitioned variant of ESGD applied to perturbations of quadratic functions. Then in Section 5, we empirically compare ESGD with other first-order optimisation algorithms and conclude that ESGD improves over the state of the art in practice. This paper concludes with an overview of the remaining theoretical challenges.

Introducing Explicit Stabilised Gradient Descent
Let us start with the simple scalar program

min_{x ∈ R} f(x), f(x) = λx²/2, (8)

where λ is positive, and consider the corresponding gradient flow

dx/dt = −λx(t), x(0) = x_0, (9)

also known as the Dahlquist test equation [7]. It is obvious from (9) that lim_{t→∞} x(t) = 0, and any sensible optimisation algorithm should provide iterates {x_n}_n with a similar property, that is,

lim_{n→∞} x_n = 0. (10)

For GD, which corresponds to the explicit Euler discretisation of the gradient flow, it easily follows from (5) that

x_{n+1} = R_gd(−hλ) x_n, R_gd(z) = 1 + z, (11)

where R_gd is the stability polynomial. Hence, (10) holds if z = −hλ ∈ D_gd, where the stability domain D_gd of GD is defined as

D_gd = {z ∈ C : |R_gd(z)| < 1}. (12)

That is, (10) holds if h ∈ (0, 2/λ), which imposes a severe limit on the time step h when λ is large. Beyond this limit, namely for larger step sizes, the iterates of GD might not necessarily reduce the value of f or, put differently, the explicit Euler method might no longer be a faithful discretisation of the gradient flow. At the other extreme, for the proximal algorithm, which corresponds to the implicit Euler discretisation of the gradient flow in (9), it follows from (6) that

x_{n+1} = R_pa(−hλ) x_n, R_pa(z) = 1/(1 − z), (13)

with the stability domain D_pa = {z ∈ C : |R_pa(z)| < 1}. Therefore (10) holds for any positive step size h. This property is known as A-stability of a numerical method [7]. Unfortunately, the proximal algorithm (implicit Euler method) is often computationally intractable, particularly in higher dimensions. In numerical analysis, explicit stabilised methods for discretising the gradient flow offer the best of both worlds, as they are not only explicit and thus computationally tractable, but they also share some favourable stability properties of the implicit method. Our main contribution in this work is adapting these methods for optimisation, as detailed next.
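In code, the two stability functions and their behaviour on the negative real axis read as follows (a minimal sketch; the variable names are ours):

```python
def R_gd(z):
    # stability polynomial of explicit Euler / GD: R_gd(z) = 1 + z
    return 1.0 + z

def R_pa(z):
    # stability function of implicit Euler / proximal algorithm: R_pa(z) = 1 / (1 - z)
    return 1.0 / (1.0 - z)

# GD is stable only for z = -h*lam in the interval (-2, 0) ...
stable_gd = [z for z in (-0.5, -1.9, -2.1, -4.0) if abs(R_gd(z)) < 1]
# ... while the proximal algorithm is stable on the whole negative real axis
stable_pa = [z for z in (-0.5, -1.9, -2.1, -4.0, -1e6) if abs(R_pa(z)) < 1]
```

The list comprehensions simply test membership of a few sample points in D_gd and D_pa restricted to the real line.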
For discretising the gradient flow (9), the key idea behind explicit stabilised methods is to relax the requirement that every step of the explicit Euler method should remain stable, namely, faithful to the gradient flow. This relaxation in turn allows the explicit stabilised method to take longer steps and traverse the gradient flow faster. To be specific, applying any given explicit Runge-Kutta method with s stages (i.e. using s evaluations of ∇f) per step to (9) yields a recursion of the form

x_{n+1} = R_s(−hλ) x_n, R_s(z) = 1 + z + a_2 z² + · · · + a_s z^s, (14)

with the corresponding stability domain D_s = {z ∈ C : |R_s(z)| < 1}. We wish to choose {a_j}_{j=2}^s to maximise the step size h while ensuring that z = −hλ still remains in the stability domain D_s, namely for the update of the explicit stabilised method to remain stable. More formally, we wish to solve

max_{a_2,···,a_s} ℓ_s such that [−ℓ_s, 0] ⊂ D̄_s, (15)

where D̄_s denotes the closure of D_s. The solution to (15) is ℓ_s = 2s² and, after substituting the optimal values for {a_j}_{j=2}^s in (14), we find that the unique corresponding R_s(z) is the shifted Chebyshev polynomial R_s(z) = T_s(1 + z/s²), where T_s(cos θ) = cos(sθ) is the Chebyshev polynomial of the first kind with degree s [12]. As we can see in Figure 1, R_s(z) equi-oscillates between −1 and 1 on z ∈ [−2s², 0], see the red curve in Figure 1(b). That is, after every s internal stages, the new iterate of the explicit stabilised method remains stable and faithful to (9), while travelling the most along the gradient flow.
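The equi-oscillation property and the optimal stability interval length ℓ_s = 2s² can be checked numerically. The sketch below evaluates R_s(z) = T_s(1 + z/s²) via the three-term Chebyshev recurrence (sample points and tolerances are our choices):

```python
def cheb_T(s, x):
    # Chebyshev polynomial of the first kind via the three-term recurrence,
    # valid for any real x (unlike the cos/arccos formula, which needs |x| <= 1)
    t_prev, t_cur = 1.0, x
    if s == 0:
        return t_prev
    for _ in range(2, s + 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur

def R_s(s, z):
    # shifted Chebyshev stability polynomial of the undamped s-stage method
    return cheb_T(s, 1.0 + z / s**2)

s = 10
# |R_s| <= 1 on the whole interval [-2 s^2, 0], which has length 2 * s**2 = 200
interior_ok = all(abs(R_s(s, -200.0 * k / 1000)) <= 1.0 + 1e-9 for k in range(1001))
# just outside the stability interval the polynomial grows at once
outside = abs(R_s(s, -201.0))
```

With s = 10 internal gradient evaluations the stable interval has length 200, against length 2 for a single GD step, which is the quadratic speed-up the text refers to.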
Numerical stability is still an issue for the explicit stabilised method outlined above, particularly for the values of z = −hλ for which |R_s(z)| = 1. As seen in Figure 1(a), even the slightest numerical imperfection will land us outside of the stability domain D_s, which might make the algorithm unstable. In addition, for such values of z, the new iterate x_{n+1} is not necessarily any closer to the minimiser, here the origin. Traditionally, for a damping factor η, one therefore tightens the stability requirement to |R_s(z)| ≤ α_s(η) < 1 for every z ∈ [−ℓ_{s,η}, −δ_η]. More specifically, for positive η, we can take

R_s(z) = T_s(ω_0 + ω_1 z) / T_s(ω_0), (16)

in which case R_s(z) oscillates between −α_s(η) and α_s(η) for every z ∈ [−ℓ_{s,η}, −δ_η], where

α_s(η) = 1/T_s(ω_0), ℓ_{s,η} = (1 + ω_0)/ω_1, δ_η = (ω_0 − 1)/ω_1, (17)

and

ω_0 = 1 + η/s², ω_1 = T_s(ω_0)/T_s′(ω_0), µ_1 = ω_1/ω_0, µ_j = 2ω_1 T_{j−1}(ω_0)/T_j(ω_0), ν_j = 2ω_0 T_{j−1}(ω_0)/T_j(ω_0), j = 2, · · · , s. (18)

In fact, ℓ_{s,η} ≈ (2 − (4/3)η)s² is close to the optimal stability domain size ℓ_{s,0} = 2s² for small η, see [12]. It also follows from (14) that

|x_{n+1}| = |R_s(−hλ)| |x_n| ≤ α_s(η) |x_n| < |x_n|,

namely, the new iterate of the explicit stabilised method is indeed closer to the minimiser. In addition, as we can see in Figure 1, introducing damping also ensures that a strip around the negative real axis is included in the complex stability domain D_s, which grows in the imaginary direction as the damping parameter η increases. While a small damping η = 0.05 is usually sufficient for standard stiff problems, the benefit of large damping η was first exploited in [13] in the context of stiff stochastic differential equations and later improved in [14] using second kind Chebyshev polynomials. It is straightforward to generalise this method beyond the scalar Program (8). In particular, for the algorithmic implementation to be numerically stable, one should not evaluate the Chebyshev polynomials T_s(z) naively [15] and instead use their well-known three-term recurrence; we dub the resulting algorithm the Explicit Stabilised Gradient Descent (ESGD), see Algorithm 1.
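A compact implementation along these lines is sketched below. The stage coefficients follow the standard damped Chebyshev method (our reading of (16)-(18)); the diagonal quadratic test problem and all parameter values are illustrative assumptions:

```python
import numpy as np

def cheb_T_and_U(s, x):
    # first- and second-kind Chebyshev values T_s(x) and U_{s-1}(x) by recurrence
    t_prev, t_cur = 1.0, x
    u_prev, u_cur = 1.0, 2.0 * x
    for _ in range(2, s + 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
        u_prev, u_cur = u_cur, 2.0 * x * u_cur - u_prev
    return t_cur, u_prev   # T_s(x), U_{s-1}(x)

def esgd(grad, x0, h, s, eta, n_iter):
    # Explicit Stabilised Gradient Descent, damped Chebyshev variant (a sketch)
    w0 = 1.0 + eta / s**2
    Ts, Us1 = cheb_T_and_U(s, w0)
    w1 = Ts / (s * Us1)                # w1 = T_s(w0) / T_s'(w0), since T_s' = s U_{s-1}
    x = x0
    for _ in range(n_iter):
        x_prev, x_cur = x, x - h * (w1 / w0) * grad(x)   # first internal stage
        t_prev, t_cur = 1.0, w0                          # running T_{j-1}(w0), T_j(w0)
        for _ in range(2, s + 1):
            t_prev, t_cur = t_cur, 2.0 * w0 * t_cur - t_prev
            nu = 2.0 * w0 * t_prev / t_cur
            mu = 2.0 * w1 * t_prev / t_cur
            # three-term stage recurrence; note nu_j + kappa_j = 1
            x_prev, x_cur = x_cur, nu * x_cur + (1.0 - nu) * x_prev - h * mu * grad(x_cur)
        x = x_cur
    return x

# Illustrative strongly convex quadratic with mu = 1 and L = 100 (kappa = 100),
# represented by a diagonal Hessian with eigenvalues lams
lams = np.linspace(1.0, 100.0, 50)
grad = lambda x: lams * x            # f(x) = sum_i lams[i] * x[i]**2 / 2

eta, s = 5.0, 16                     # s chosen so that kappa <= 1 + 2 s^2 / eta
w0 = 1.0 + eta / s**2
Ts, Us1 = cheb_T_and_U(s, w0)
w1 = Ts / (s * Us1)
h = (1.0 + w0) / (w1 * lams[-1])     # one admissible step: w0 - w1*h*L = -1

x = esgd(grad, np.ones_like(lams), h, s, eta, n_iter=20)
```

One can verify by induction that the j-th internal stage has stability polynomial T_j(ω_0 + ω_1 z)/T_j(ω_0), so after s stages the iterate contracts by at most α_s(η) = 1/T_s(ω_0) per outer iteration.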

Strongly Convex Quadratic Programming
Consider Program (1) with

f(x) = ½ xᵀAx − bᵀx, (19)

where A ∈ R^{d×d} is a positive definite and symmetric matrix, and b ∈ R^d. As the next result shows, the convergence of a Runge-Kutta method (such as GD or the proximal algorithm) depends on the eigenvalues of A.

Proposition 1. For solving Program (1) with f in (19), consider an optimisation algorithm with stability function R, for example R = R_gd for GD or R = R_pa for the proximal algorithm, and let {x_n}_{n≥0} be the iterates of this algorithm. Also let µ = λ_1 ≤ · · · ≤ λ_d = L denote the eigenvalues of A in nondecreasing order. Then it holds that

‖x_n − x*‖_2 ≤ ( max_{1≤i≤d} |R(−hλ_i)| )^n ‖x_0 − x*‖_2 (20)

for every n ≥ 0.

Algorithm 1 Explicit Stabilised Gradient Descent (ESGD) for solving Program (1)
Input: The gradient ∇f of a differentiable function f : R^d → R, a step size h, a number of internal stages s, and a damping parameter η. Initialisation x_0 ∈ R^d.
Body: Until a convergence criterion is met, repeat:
• Set x_n^0 = x_n and x_n^1 = x_n^0 − hµ_1∇f(x_n).
• For j ∈ {2, · · · , s}, repeat:
x_n^j = ν_j x_n^{j−1} + (1 − ν_j) x_n^{j−2} − hµ_j∇f(x_n^{j−1}),
where the constants µ_j, ν_j are defined in (18) and T_s is the Chebyshev polynomial of the first kind with degree s.
• Set x_{n+1} = x_n^s.
Output: Estimate x̂ = x_{n+1} of a minimiser of Program (1).
Let us next apply Proposition 1 to both GD and ESGD.

Gradient descent. In the case of GD, it is not difficult to verify that

max_{1≤i≤d} |R_gd(−hλ_i)| = max( |1 − hµ|, |1 − hL| ), (21)

where we used (11). It follows from (21) that we must take h ∈ (0, 2/L) for GD to be stable, namely for GD to reduce the value of f. In addition, one obtains the best possible decay by choosing h = 2/(µ + L). More specifically, we have that

min_{h>0} max_{1≤i≤d} |R_gd(−hλ_i)| = (κ − 1)/(κ + 1), (22)

where κ := L/µ is the condition number of the matrix A. That is, the best convergence rate for GD predicted by Proposition 1 is

c_gd(κ) := ((κ − 1)/(κ + 1))², (23)

achieved with the step size h = 2/(µ + L). Remarkably, the same conclusion holds for any function in F_{µ,L}, as discussed in [16].

Explicit stabilised gradient descent. In the case of Algorithm 1, there are three different interlinked parameters that need to be chosen, namely the step size h, the number of internal stages s, and the damping factor η. For fixed positive η, let us next select h and s such that the numerator of (16), namely |T_s(ω_0 + ω_1 z)|, is bounded by one. Equivalently, we take h, s such that

−1 ≤ ω_0 − ω_1 hλ ≤ 1 for every λ ∈ [µ, L]. (24)

For an efficient algorithm, we choose the smallest step size h and the smallest number s of internal stages such that (24) holds. More specifically, (24) dictates that κ ≤ (1 + ω_0)/(ω_0 − 1) = 1 + 2s²/η, see (18). This in turn determines the parameters h, s as

h = (ω_0 − 1)/(ω_1 µ), s = ⌈ √(η(κ − 1)/2) ⌉. (25)

The resulting gap c_esgd(κ) − c_opt(κ) is plotted against the condition number κ in Figure 2. In this plot, we observe that c_esgd(κ) − c_opt(κ) decays as C_η/√κ for κ → ∞ and for fixed η, see the slope −1/2 in the plot. Numerical evaluations suggest that the constant C_η = O(1/√η) becomes arbitrarily small as η grows, which corroborates Proposition 2. For comparison, we also include the gap c_agd(κ) − c_opt(κ) for the optimal rate of Nesterov's Accelerated Gradient Descent (AGD), given by c_agd = (1 − 2/√(3κ + 1))² [16]. Comparing the two algorithms (see also Section 5), we find that ESGD outperforms AGD, namely c_esgd ≤ c_agd, for η ≥ η_0, where η_0 ≈ 1.17 is a moderate constant.
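The closed-form GD rate quoted above can be checked directly; a small sketch with illustrative values of µ and L:

```python
mu, L = 1.0, 50.0                     # illustrative strong convexity / smoothness constants
kappa = L / mu
h = 2.0 / (mu + L)                    # step size balancing the extreme eigenvalues
# the worst-case contraction |R_gd(-h*lam)| = |1 - h*lam| over lam in [mu, L]
# is attained at the endpoints lam = mu and lam = L
rate = max(abs(1.0 - h * mu), abs(1.0 - h * L))
```

With this choice of h the two endpoint contractions coincide in magnitude, and the worst-case factor equals (κ − 1)/(κ + 1), as stated by (22).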
Under (24) and using the definitions in (16) and (17), we obtain

max_{1≤i≤d} |R_s(−hλ_i)| ≤ α_s(η) = 1/T_s(ω_0). (26)

Then an immediate consequence of Proposition 1 is that

‖x_n − x*‖_2 ≤ α_s(η)^n ‖x_0 − x*‖_2.

Given that every iteration of ESGD consists of s internal stages (hence, with the same cost as s GD steps), it is natural to define the effective convergence rate of ESGD as

c_esgd(κ) := α_s(η)^{2/s}. (27)

The following result, proved in the supplementary material, evaluates the effective convergence rate of ESGD, as given by (27), in the limit η → ∞.
Above, c_opt(κ) is the optimal convergence rate of a first-order algorithm, which is achieved by the conjugate gradient (CG) algorithm for quadratic f, see [1]. Put differently, Proposition 2 states that ESGD nearly achieves the optimal convergence rate in the limit of η → ∞. Note also that, remarkably, the performance of ESGD relative to the conjugate gradient improves as the condition number of f worsens, namely as κ increases. The non-asymptotic behaviour of ESGD is numerically investigated in Figure 2, corroborating Proposition 2. As illustrated in Section 5, Algorithm 1 also applies to non-quadratic optimisation problems.
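For intuition about how these rates compare, the sketch below evaluates the closed-form expressions, assuming the squared convention of the c_agd formula quoted above, the classical CG bound c_opt(κ) = ((√κ − 1)/(√κ + 1))², and the GD rate ((κ − 1)/(κ + 1))²; these closed forms are our assumptions where the text does not spell them out:

```python
import math

def c_gd(kappa):
    # squared worst-case GD contraction factor
    return ((kappa - 1.0) / (kappa + 1.0)) ** 2

def c_agd(kappa):
    # AGD rate quoted in Section 3
    return (1.0 - 2.0 / math.sqrt(3.0 * kappa + 1.0)) ** 2

def c_opt(kappa):
    # classical conjugate-gradient bound (assumed closed form)
    r = math.sqrt(kappa)
    return ((r - 1.0) / (r + 1.0)) ** 2

# the optimal rate lower-bounds AGD, which in turn beats plain GD
ordered = all(c_opt(k) < c_agd(k) < c_gd(k) for k in (1e2, 1e4, 1e6))
# the AGD suboptimality gap shrinks as conditioning worsens
gap_shrinks = c_agd(1e6) - c_opt(1e6) < c_agd(1e4) - c_opt(1e4)
```

A rate closer to 1 means slower convergence, so the ordering c_opt < c_agd < c_gd reflects the usual hierarchy between CG, accelerated methods, and plain GD on quadratics.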

Perturbation of a Quadratic Objective Function
Proposition 2 shows us that, in the case of strongly convex quadratic problems, one can recover the optimal convergence rate of the conjugate gradient for large values of the damping parameter η. Here we discuss a modification of Algorithm 1 for a specific class of nonlinear problems for which we can prove the same convergence rate as in the quadratic case. More precisely, we consider

f(x) = ½ xᵀAx + g(x), (29)

where A is a positive definite matrix for which we know the smallest and largest eigenvalues, and g is a β-smooth convex function [1]. Inspired by [17], we consider a modification of Algorithm 1 specifically designed for the minimisation of functions of the form (29). We call this method Partitioned Explicit Stabilised Gradient Descent (PESGD) and show in Theorem 1, proved in the supplementary material, that it matches the rate given by the analysis of quadratic problems. PESGD is a numerically stable implementation of the update

x_{n+1} = R_s(−hA) x_n − h B_s(−hA) ∇g(x_n), (30)

where R_s is the stability function given by equation (16) and B_s is a related polynomial. It is worth noting that this method requires only one evaluation of ∇g per step of the algorithm, which can be advantageous when evaluations of ∇g are costly.

Algorithm 2 Partitioned Explicit Stabilised Gradient Descent (PESGD) for solving Program (29)
Input: The gradient ∇g of a differentiable function g : R^d → R, and the matrix A ∈ R^{d×d}. Initialisation x_0 ∈ R^d. Body: Until the convergence criterion is met, repeat:
• Set x_n^0 = x_n and x_n^1 = x_n^0 − µ_1 h (Ax_n^0 + ∇g(x_n^0)).
• For j ∈ {2, · · · , s}, repeat:
x_n^j = ν_j x_n^{j−1} + (1 − ν_j) x_n^{j−2} − hµ_j (Ax_n^{j−1} + ∇g(x_n^0)),
where the constants µ_j, ν_j are defined in (18).
• Set x_{n+1} = x_n^s. Output: Estimate x̂ = x_{n+1} of a minimiser of Program (1) for f given by (29).
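A sketch of a PESGD step consistent with the update (30) is given below: the internal Chebyshev stages act on the linear part A, while ∇g is evaluated once at x_n and frozen across the stages. The stage coefficients follow the standard damped Chebyshev method, and the test problem and parameter values are illustrative assumptions:

```python
import numpy as np

def cheb_T_and_U(s, x):
    # T_s(x) and U_{s-1}(x) via the three-term recurrences
    t_prev, t_cur = 1.0, x
    u_prev, u_cur = 1.0, 2.0 * x
    for _ in range(2, s + 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
        u_prev, u_cur = u_cur, 2.0 * x * u_cur - u_prev
    return t_cur, u_prev

def pesgd(A, grad_g, x0, h, s, eta, n_iter):
    # Partitioned ESGD sketch: one grad_g evaluation per outer step (frozen forcing)
    w0 = 1.0 + eta / s**2
    Ts, Us1 = cheb_T_and_U(s, w0)
    w1 = Ts / (s * Us1)               # T_s(w0) / T_s'(w0)
    x = x0
    for _ in range(n_iter):
        g0 = grad_g(x)                                  # single evaluation per step
        x_prev = x
        x_cur = x - h * (w1 / w0) * (A @ x + g0)        # first internal stage
        t_prev, t_cur = 1.0, w0
        for _ in range(2, s + 1):
            t_prev, t_cur = t_cur, 2.0 * w0 * t_cur - t_prev
            nu = 2.0 * w0 * t_prev / t_cur
            mu = 2.0 * w1 * t_prev / t_cur
            x_prev, x_cur = x_cur, (nu * x_cur + (1.0 - nu) * x_prev
                                    - h * mu * (A @ x_cur + g0))
        x = x_cur
    return x

# Illustrative problem of the form (29): quadratic part A plus a mildly convex g
d = 30
lams = np.linspace(1.0, 100.0, d)
A = np.diag(lams)
beta = 1e-3
grad_g = lambda x: beta * x          # g(x) = beta * ||x||^2 / 2 is beta-smooth

eta, s = 5.0, 16
w0 = 1.0 + eta / s**2
Ts, Us1 = cheb_T_and_U(s, w0)
w1 = Ts / (s * Us1)
h = (1.0 + w0) / (w1 * lams[-1])     # admissible step for the spectrum of A

x = pesgd(A, grad_g, np.ones(d), h, s, eta, n_iter=25)
```

Because the stages are linear in (x_n, ∇g(x_n)), one outer step indeed applies R_s(−hA) to x_n and a fixed polynomial in −hA to ∇g(x_n), matching the structure of (30).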
Theorem 1. Let f be given by (29), where A ∈ R^{d×d} is a positive definite matrix with largest and smallest eigenvalues L and µ, respectively, and condition number κ = L/µ. Let γ > 0 and let g be a β-smooth convex function with 0 < β < C(η)γµ, where C(η) is defined in (A.3), and suppose the parameters of PESGD are chosen according to conditions (25). Then the iterates of PESGD satisfy

‖x_n − x*‖_2 ≤ (1 + γ)^n α_s(η)^n ‖x_0 − x*‖_2. (31)

Roughly speaking, Theorem 1 states that the effective convergence rate of PESGD matches that of ESGD in (27) up to a factor of (1 + γ)^{2/s}. Since lim_{s→∞} (1 + γ)^{2/s} = 1, the effective rate in (31) is equivalent to c_esgd(κ) in the limit of a large condition number κ, for which s is large by (25). Indeed, the numerical evidence in Figure 2 suggests that PESGD remains efficient for moderate values of the damping parameter η. In particular, for the value η_0 = 1.17, we have C(η_0) ≈ 0.59 when κ ≫ 1.

Numerical Examples
We now illustrate the performance of ESGD (Algorithm 1) and PESGD (Algorithm 2) for solving Program (1) on the following test problems: (a) Strongly convex quadratic programming: f in (19), with A ∈ R^{d×d} a random matrix drawn from the Wishart distribution. We used d = 4800. This problem is mildly ill-conditioned with κ ≈ 10^4 and can also be solved by the conjugate gradient (CG) method.
(b) Regularised logistic regression: f(x) = Σ_{i=1}^m log(1 + exp(−y_i ξ_iᵀx)) + (τ/2)‖x‖_2², where Ξ = [ξ_1, · · · , ξ_m]ᵀ ∈ R^{m×d} is the design matrix and y ∈ {−1, 1}^m are the labels. We have f ∈ F_{µ,L} with µ = τ and L = τ + ‖Ξ‖_2²/4. As in [18], we used the Madelon UCI dataset with d = 500, m = 2000, and τ = 10². This is a very poorly conditioned problem with κ = L/µ ≈ 10^9. (c) Regression with (smoothed) elastic net regularisation: f(x) = ½‖Ax − b‖_2² + λ Σ_{i=1}^d h_τ(x_i) + (µ/2)‖x‖_2², where h_τ is the standard Huber loss function with parameter τ that smooths |t| [19]. We used A ∈ R^{m×d} a random matrix drawn from the Gaussian distribution scaled by 1/√d, d = 3000, m = 900, λ = 0.2, τ = 10^{−3}, µ = 10^{−2}. The objective function is in F_{µ,L} with L ≈ (1 + √(m/d))² + λ/τ + µ. The condition number is κ ≈ 10^4. (d) Nonlinear elliptic PDE: we consider a finite difference discretisation of a 1D integro-differential equation which describes the stationary variant of a temperature profile of air near the ground [20]. Using a finite difference approximation u(i∆x) ≈ U_i, i = 1, . . . , d, on a spatial mesh with size ∆x = 1/(d + 1), and using the trapezoidal quadrature for the integral, we obtain a problem of the form (29), where A is the usual tridiagonal discrete Laplace matrix of size d × d and the entries ∂g/∂U_i of ∇g(U) ∈ R^d collect the nonlinear and integral terms. Problems (a)-(d) are depicted in Figure 3, where we also output the intermediate values x_n^j in Algorithms 1 and 2 after every evaluation of the gradient of f. Figure 3 confirms that ESGD behaves like CG for quadratic problems for large values of the damping η, while it outperforms AGD for moderate values of damping on more general strongly convex objective functions. Finally, in the case of the nonlinear elliptic PDE (here d = 500), we see that the cheaper variant PESGD performs similarly to ESGD for the parameter sets η = 10, s = 715 and η = 1.17, s = 245, while requiring a single evaluation of the gradient ∇g(x_n) per step, which corroborates Theorem 1.

Conclusion
In this paper, using ideas from the numerical analysis of ODEs, we introduced a new class of optimisation algorithms for strongly convex functions based on explicit stabilised methods. These new methods, ESGD and PESGD, are as easy to implement as GD, and were shown to match the optimal convergence rates of first-order methods for certain subclasses of F_{µ,L}. Our numerical experiments suggest that this might be the case for all functions in F_{µ,L}, and proving this is the subject of our current research efforts. In addition, there are a number of interesting research avenues for this class of methods, including adjusting them to convex optimisation problems (µ = 0), as well as adaptively choosing the step size h using local information to optimise their performance further. Furthermore, their adaptation to stochastic optimisation problems, where one replaces the full gradient of the function f by a noisy but cheaper version of it, is another interesting but challenging direction we aim to investigate further.

A Proof of the main results
In this section we discuss the proofs of the main results in the paper.

A.1 Proof of Proposition 1
We start our proof by noticing that the gradient flow (2) in the case of the quadratic function (19) becomes

dx/dt = −(Ax(t) − b). (A.1)

In addition, since A is positive definite and symmetric, there exists an orthogonal matrix V such that A = V D Vᵀ, where D = diag(λ_1, · · · , λ_d). If we now make the change of variables y = V⁻¹x − D⁻¹V⁻¹b, we obtain the equation

dy/dt = −Dy(t).

Hence, in this coordinate system, each coordinate is independent of the others, while the objective function can be written as f(x) = ½ Σ_{i=1}^d λ_i y_i² + const. Each coordinate thus satisfies independently the simple quadratic Program (19) with d = 1, and applying a Runge-Kutta method with stability function R gives y_{n+1}^{(i)} = R(−λ_i h) y_n^{(i)}. Hence, since V is orthogonal and y_n = Vᵀ(x_n − x*),

‖x_n − x*‖_2 = ‖y_n‖_2 ≤ ( max_{1≤i≤d} |R(−λ_i h)| )^n ‖y_0‖_2 = ( max_{1≤i≤d} |R(−λ_i h)| )^n ‖x_0 − x*‖_2,

which completes the proof.
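The coordinate-wise contraction used in the proof is easy to check numerically; a small sketch with an illustrative spectrum (and b = 0, so that x* = 0):

```python
import numpy as np

rng = np.random.default_rng(0)
lams = np.array([1.0, 2.0, 5.0, 10.0, 50.0])       # eigenvalues of A
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # orthogonal eigenbasis V
A = Q @ np.diag(lams) @ Q.T                        # A = V D V^T, with b = 0

h = 0.01
x = rng.standard_normal(5)
x_next = x - h * (A @ x)           # one GD step on f(x) = x^T A x / 2

y, y_next = Q.T @ x, Q.T @ x_next  # coordinates in the eigenbasis
# each eigen-coordinate contracts independently by R_gd(-h * lam_i) = 1 - h * lam_i
```

Since Qᵀ(I − hA)Q = I − hD, the GD iteration decouples exactly into the d scalar recursions used in the proof.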

A.3 Proof of Theorem 1
Our starting point in proving Theorem 1 is equation (30). In particular, we will show that if the parameters are chosen suitably, then this scheme converges with the rate predicted by Theorem 1, and we will then show that Algorithm 2 corresponds to a numerically stable implementation of this scheme.