1 Introduction

In this paper, we consider the unconstrained convex optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^d} f(x). \end{aligned}$$

One approach to analyzing and deriving optimization methods for this problem is to model them by ordinary differential equations (ODEs). This approach has a long history, starting with the modeling of the steepest descent method by the gradient flow, and it has received renewed attention in recent years since Su–Boyd–Candès [1] derived an ODE model of Nesterov’s accelerated gradient method (NAG).

One advantage of considering ODE models is that convergence rates of, for example, the objective function can be analyzed succinctly in continuous time via Lyapunov functions that express the rate explicitly. This framework of Lyapunov analysis has been organized in [2, 3] and applied to various ODEs (e.g., [4,5,6]).

The weak discrete gradient (wDG) is a concept proposed in [7] to transfer the Lyapunov analysis of continuous-time ODEs to discrete-time optimization methods. It specifies a class of gradient approximations satisfying the conditions needed to reproduce the continuous proof in the discrete setting. Discretizing the ODEs with wDGs yields an abstract optimization method that covers a variety of optimization methods, and selecting a particular wDG yields a concrete optimization method with a convergence rate guarantee. However, [7] did not discuss which specific wDGs are suitable for optimization in practice, and only limited preliminary numerical examples were given in its appendix.

Therefore, in this paper, we examine, through more intensive numerical experiments, the properties and performance of the methods obtained by discretizing the following accelerated gradient flows with wDGs:

$$\begin{aligned}&\dot{x} = \frac{2}{t}(v-x), \quad \dot{v} = -\frac{t}{2} \nabla f(x) \quad \text {(for convex } f\text {)},\end{aligned}$$
(1)
$$\begin{aligned}&\dot{x} = \sqrt{\mu }(v -x), \quad \dot{v} = \sqrt{\mu }(x-v-\nabla f(x)/\mu ) \quad \text {(for } \mu \text {-strongly convex } f\text {)}. \end{aligned}$$
(2)

A differentiable function f is said to be \(\mu \)-strongly convex if it satisfies, for all \(x,y \in \mathbb {R}^d\),

$$\begin{aligned} \frac{\mu }{2} \Vert y-x \Vert ^2 \le f(y) - f(x) - \langle {\nabla f(x)},{y-x}\rangle , \end{aligned}$$

and f is said to be L-smooth if it satisfies, for all \(x,y \in \mathbb {R}^d\),

$$\begin{aligned} \frac{L}{2} \Vert y-x \Vert ^2 \ge f(y) - f(x) - \langle {\nabla f(x)},{y-x}\rangle . \end{aligned}$$

The above ODEs are models for Nesterov’s famous accelerated gradient method for convex functions [8] and strongly convex functions [9].

In this paper, we discuss the following three issues based on numerical experiments:

  1. Differences between wDG discretizations and other discretizations: since the wDG schemes for the accelerated gradient flows differ from NAG, we compare them. Moreover, since the wDG schemes do not coincide with any typical numerical method for ODEs, we also compare them with the partitioned Runge–Kutta method, which is a natural choice for the ODE.

  2. Practicability of implicit methods: most wDG discretizations yield implicit schemes. In optimization, implicit methods are not favored because of their high computational cost per iteration. In numerical analysis, however, implicit methods are commonly employed to solve “stiff” equations because of their high stability: whereas explicit methods can only compute numerical solutions with a small time step due to their lower stability, implicit methods allow a much larger time step, so that, despite the higher cost per step, the overall cost of reaching a given time is reduced. We point out that a similar phenomenon occurs in optimization as well: for “stiff” optimization problems, implicit methods can converge more rapidly than explicit methods in terms of overall computational cost.

  3. Efficiency of variants of the proximal gradient method derived by wDG discretizations: the sum of two wDGs is again a wDG, which allows us to produce variants of the proximal gradient method with convergence rate guarantees. We verify their performance through numerical experiments.

Issues 1 and 2 will be discussed in Sect. 3 and Issue 3 in Sect. 4.

Before proceeding, let us summarize the insights from [7, Appendix I], where simple numerical illustrations were given for both the convex and the strongly convex case. For the convex case, that paper compared NAG for convex functions with an explicit wDG scheme for convex functions, a variant of (5) below (essentially the only explicit scheme that arises naturally from the wDG framework for convex functions). Tested on a simple quartic function, the two methods performed roughly the same, with the wDG scheme slightly better asymptotically. For the strongly convex case, various explicit wDG schemes can be considered; two of them (called wDGEX_sc and wDGEX2_sc in the present paper; see below) and NAG for strongly convex functions were tested on a simple quadratic function. Again, the performance of the three methods was quite similar, while their trajectories in the solution space differed considerably. In both the convex and strongly convex cases, the observed convergence rates were much better than the theoretical estimates, which is not surprising since the theoretical estimates address worst-case functions.

The numerical experiments in the present study use the following settings. First, we mainly focus on strongly convex cases, since in the convex case the only explicit wDG scheme was already tested in [7] as noted above, and little is left to investigate (an exception appears in the last numerical example). Second, we observe the convergence of the gradient norm \(\Vert \nabla f(x)\Vert _2\) instead of f(x) itself, although the wDG framework provides convergence rate guarantees with respect to the latter; this is because we also deal with objective functions whose exact minimum values are unknown. The profiles remain informative in view of the inequality \( \Vert \nabla f(x)\Vert _2^2 \le 2L(f(x)-f^{\star })\), which holds for any L-smooth convex function. Finally, we assume the existence of the optimal solution \( x^{\star } \) and the optimal value \( f^{\star } = f (x^{\star }) \).

2 Preliminary: weak discrete gradient [7]

In this section, we briefly review the wDG.

In the Lyapunov analysis of ODEs, we derive convergence rates by showing the nonincrease of rate-revealing Lyapunov functions (by showing \((\textrm{d}/\textrm{d}t) E(t)\le 0\)). For accelerated gradient flows (1) and (2), the following functions serve as Lyapunov functions:

$$\begin{aligned} E(t)&= t^2(f(x) - f^\star ) + 2\Vert v - x^\star \Vert ^2 \quad \text {for (1), and } \end{aligned}$$
(3)
$$\begin{aligned} E(t)&= \textrm{e}^{\sqrt{\mu }t}\left( f(x) - f^\star + \frac{\mu }{2}\Vert v-x^\star \Vert ^2\right) \quad \text {for (2)}. \end{aligned}$$
(4)

Since these E’s are non-increasing and the terms involving \(\Vert v - x^\star \Vert ^2\) are nonnegative, we immediately obtain the convergence rates \(f(x(t)) - f^\star = \textrm{O} \left( {1/t^2}\right) \) and \(\textrm{O} \left( {\textrm{e}^{- \sqrt{\mu }t}}\right) \), respectively.
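As a sanity check of this continuous-time argument, the following minimal sketch (not taken from the paper) integrates the flow (1) for a simple quadratic objective and verifies numerically that the Lyapunov function (3) is non-increasing along the solution; the test function, initial data, and tolerances are our own assumptions.

```python
# Minimal numerical check that E(t) in (3) is non-increasing along the flow (1).
# The quadratic objective, initial point, and solver tolerances are illustrative choices.
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[2.0, 0.0], [0.0, 0.5]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
x_star, f_star = np.zeros(2), 0.0

def rhs(t, y):
    x, v = y[:2], y[2:]
    return np.concatenate([2.0 / t * (v - x), -t / 2.0 * grad_f(x)])

x0 = np.array([2.0, 3.0])
sol = solve_ivp(rhs, (0.1, 20.0), np.concatenate([x0, x0]),  # v(t0) = x(t0)
                dense_output=True, rtol=1e-9, atol=1e-12)

ts = np.linspace(0.1, 20.0, 400)
E = [t**2 * (f(sol.sol(t)[:2]) - f_star)
     + 2.0 * np.linalg.norm(sol.sol(t)[2:] - x_star)**2 for t in ts]
print(np.max(np.diff(E)))  # should be (numerically) non-positive
```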

wDGs represent a class of gradient approximations that, given an optimization ODE and its Lyapunov function for the rate analysis, provide a discretization of the ODE that preserves its convergence rate.

Definition 1

(Weak discrete gradient) A gradient approximation \(\overline{\nabla }f :\mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}^d\) is said to be a weak discrete gradient of f if there exist \(\alpha \ge 0\) and \(\beta , \gamma \) with \(\beta +\gamma \ge 0\) such that the following two conditions hold for all \(x,y,z \in \mathbb {R}^d\):

$$\begin{aligned} f(y) - f(x)&\le \langle {\overline{\nabla }f(y,z)},{y-x}\rangle + \alpha \Vert y-z \Vert ^2 - \beta \Vert z-x \Vert ^2 -\gamma \Vert y-x \Vert ^2, \\ \overline{\nabla }f(x,x)&= \nabla f(x). \end{aligned}$$

The following functions are wDGs.

Proposition 1

Suppose that \( f :\mathbb {R}^d \rightarrow \mathbb {R}\) is a \(\mu \)-strongly convex function. Let (L) and (SC) denote the additional assumptions: (L) f is L-smooth, and (SC) \(\mu > 0 \). Then, the following functions are wDGs:

  1. If \(\overline{\nabla }f(y,x) = \nabla f(x)\) and f satisfies (L), then it is a wDG with \((\alpha ,\beta ,\gamma ) = (L/2,\mu /2,0)\).

  2. If \(\overline{\nabla }f(y,x) = \nabla f(y)\), then it is a wDG with \((\alpha ,\beta ,\gamma ) = (0,0,\mu /2)\).

  3. If \(\overline{\nabla }f(y,x) = \nabla f(\frac{x+y}{2})\) and f satisfies (L), then it is a wDG with \((\alpha ,\beta ,\gamma ) = ((L+\mu )/8,\mu /4,\mu /4)\).

  4. If \(\overline{\nabla }f(y,x) = \nabla _{\textrm{AVF}} f (y,x)\) and f satisfies (L), then it is a wDG with \((\alpha ,\beta ,\gamma ) = (L/6+\mu /12,\mu /4,\mu /4)\).

  5. If \(\overline{\nabla }f(y,x) = \nabla _{\textrm{G}} f (y,x)\) and f satisfies (L) and (SC), then it is a wDG with \((\alpha ,\beta ,\gamma ) = ((L+\mu )/8 + {(L-\mu )^2}/{16\mu }, {\mu }/{4}, 0)\).

  6. If \(\overline{\nabla }f(y,x) = \nabla _{\textrm{IA}} f (y,x)\) and f satisfies (L) and (SC), then it is a wDG with \((\alpha ,\beta ,\gamma ) = ( {dL^2}/{\mu }-{\mu }/{4}, {\mu }/{2}, -{\mu }/{4} )\).

Here, the Gonzalez discrete gradient [10] is defined by

$$\begin{aligned} \nabla _\textrm{G} f(y,x) := \nabla f \left( \frac{y+x}{2}\right) + \frac{f(y) - f(x) - \langle {\nabla f \left( \frac{y+x}{2}\right) },{y-x}\rangle }{\Vert y-x \Vert ^2} (y-x), \end{aligned}$$

the Itoh–Abe discrete gradient [11] by

$$\begin{aligned} \nabla _\textrm{IA} f (y,x) := \begin{bmatrix} \frac{f(y_1,x_2,x_3,\dots ,x_d) - f(x_1,x_2,x_3,\dots ,x_d)}{y_1-x_1}\\ \frac{f(y_1,y_2,x_3,\dots ,x_d) - f(y_1,x_2,x_3,\dots ,x_d)}{y_2 - x_2}\\ \vdots \\ \frac{f(y_1,y_2,y_3,\dots ,y_d) - f(y_1,y_2,y_3,\dots ,x_d)}{y_d-x_d}\\ \end{bmatrix}, \end{aligned}$$

and the average vector field (AVF) gradient [12] by

$$\begin{aligned} \nabla _\textrm{AVF} f (y,x) := \int _0^1 \nabla f(\tau y + (1-\tau ) x) \textrm{d}\tau . \end{aligned}$$

Given an ODE and its rate-revealing Lyapunov function, we can derive a wDG scheme that ensures a decrease of the Lyapunov function. Below are the convergence rate theorems for the wDG discretizations of ODEs (1) and (2) with respect to the Lyapunov functions (3) and (4). In the following schemes, \(\delta ^+\) denotes the forward difference operator with time step h, e.g., \(\delta ^+ x^{\left( {k}\right) }:= (x^{\left( {k+1}\right) }-x^{\left( {k}\right) }) / h\).

Proposition 2

(Convex case, Theorem 5.4 in [7]) Let \(\overline{\nabla }f\) be a wDG of f and suppose that \(\beta \ge 0,\gamma \ge 0\). Let f be a convex function that additionally satisfies the conditions that the wDG requires. Let \( \{ (x^{\left( {k}\right) }, v^{\left( {k}\right) }) \} \) be the sequence given by

$$\begin{aligned} {\left\{ \begin{aligned} \delta ^+ x^{\left( {k}\right) }&= \frac{\delta ^+A_{k}}{A_k}\left( v^{\left( {k+1}\right) }-x^{\left( {k+1}\right) }\right) , \\ \delta ^+ v^{\left( {k}\right) }&= -\frac{\delta ^+A_k}{4} \overline{\nabla }f\left( x^{\left( {k+1}\right) },z^{\left( {k}\right) }\right) ,\\ \frac{z^{\left( {k}\right) }-x^{\left( {k}\right) }}{h}&= \frac{\delta ^+A_k}{A_{k+1}}\left( v^{\left( {k}\right) }-x^{\left( {k}\right) }\right) \end{aligned}\right. } \end{aligned}$$
(5)

with \((x^{\left( {0}\right) }, v^{\left( {0}\right) }) = (x_0,v_0)\), where \(A_k:= A(kh)\). Then, if \(A_k = (kh)^2\) and \(h\le 1/\sqrt{2\alpha }\), the sequence satisfies

$$\begin{aligned} f\left( x^{\left( {k}\right) }\right) - f^\star \le \frac{2\Vert x_0 - x^\star \Vert ^2}{A_k}. \end{aligned}$$

Proposition 3

(Strongly convex case, Theorem 5.5 in [7]) Let \(\overline{\nabla }f\) be a wDG of f and suppose that \(\beta +\gamma >0\). Let f be a strongly convex function that additionally satisfies the conditions that the wDG requires. Let \( \{ (x^{\left( {k}\right) }, v^{\left( {k}\right) }) \} \) be the sequence given by

$$\begin{aligned} \left\{ \begin{aligned} \delta ^+ x^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )}\left( v^{\left( {k+1}\right) }-x^{\left( {k+1}\right) }\right) ,\\ \delta ^+ v^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )} \left( \frac{\beta }{\beta +\gamma }z^{\left( {k}\right) } + \frac{\gamma }{\beta +\gamma }x^{\left( {k+1}\right) } -v^{\left( {k+1}\right) }-\frac{\overline{\nabla }f(x^{\left( {k+1}\right) },z^{\left( {k}\right) })}{2(\beta +\gamma )} \right) , \\ \frac{z^{\left( {k}\right) } - x^{\left( {k}\right) }}{h}&= \sqrt{2(\beta +\gamma )}\left( x^{\left( {k}\right) } + v^{\left( {k}\right) } - 2z^{\left( {k}\right) }\right) \end{aligned} \right. \end{aligned}$$
(6)

with \((x^{\left( {0}\right) }, v^{\left( {0}\right) })= (x_0,v_0)\). Then, if \(h \le \overline{h}:= \left( \sqrt{2}(\sqrt{\alpha +\gamma } - \sqrt{\beta +\gamma })\right) ^{-1}\), the sequence satisfies

$$\begin{aligned} f \left( x^{\left( {k}\right) } \right) - f^\star \le \left( 1 + \sqrt{2(\beta +\gamma )}h\right) ^{-k}\left( f(x_0) - f^\star + \beta \Vert v_0 - x^\star \Vert ^2\right) . \end{aligned}$$

In particular, the sequence satisfies

$$\begin{aligned} f \left( x^{\left( {k}\right) } \right) - f^\star \le \left( 1 - \sqrt{\frac{\beta +\gamma }{\alpha +\gamma }}\right) ^k \left( f(x_0) - f^\star + \beta \Vert v_0 - x^\star \Vert ^2\right) , \end{aligned}$$

when the optimal step size \( h = \overline{h}\) is employed.
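To make the structure of the abstract scheme (6) concrete, the following sketch performs one step for a differentiable wDG: \(z^{(k)}\) is obtained explicitly from the third line, and the remaining coupled equations in \((x^{(k+1)}, v^{(k+1)})\) are solved by a generic nonlinear solver. The use of scipy.optimize.root and the wdg callable are our own assumptions, not the authors' implementation; for specific wDGs the implicit step can often be resolved more cheaply.

```python
# One step of the abstract strongly convex wDG scheme (6), solved generically.
import numpy as np
from scipy.optimize import root

def wdg_sc_step(x, v, h, beta, gamma, wdg):
    c = np.sqrt(2.0 * (beta + gamma))
    # Third line of (6): z is explicit in x and v.
    z = (x + h * c * (x + v)) / (1.0 + 2.0 * h * c)
    d = len(x)

    def residual(w):
        x1, v1 = w[:d], w[d:]
        r1 = x1 - x - h * c * (v1 - x1)
        r2 = v1 - v - h * c * (beta / (beta + gamma) * z
                               + gamma / (beta + gamma) * x1
                               - v1 - wdg(x1, z) / (2.0 * (beta + gamma)))
        return np.concatenate([r1, r2])

    sol = root(residual, np.concatenate([x, v]), tol=1e-12)
    return sol.x[:d], sol.x[d:]
```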

3 Features of wDG schemes

3.1 Target methods

In this section, we discuss the features of wDG methods by numerically comparing them with existing optimization methods and with standard numerical methods for ODEs. We consider wDG methods derived from the accelerated gradient flow for strongly convex functions (2).

  • wDGEX_sc: An explicit method obtained by the following instantiation: we first replace \(\overline{\nabla }f(x^{\left( {k+1}\right) },z^{\left( {k}\right) })\) with \(\overline{\nabla }f(x^{\left( {k+1}\right) },x^{\left( {k}\right) })\) in the abstract scheme (6), which gives a convergent wDG scheme with a worse convergence rate [7] but is more natural as a numerical method. We then set \(\overline{\nabla }f(y,x) = \nabla f(x)\) (Item 1 of Proposition 1) to find

    $$\begin{aligned} \left\{ \begin{aligned} \delta ^+ x^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )}\left( v^{\left( {k+1}\right) }-x^{\left( {k+1}\right) }\right) ,\\ \delta ^+ v^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )} \left( \frac{\beta }{\beta +\gamma }x^{\left( {k}\right) } + \frac{\gamma }{\beta +\gamma }x^{\left( {k+1}\right) } -v^{\left( {k+1}\right) }-\frac{\nabla f(x^{\left( {k}\right) })}{2(\beta +\gamma )} \right) . \end{aligned} \right. \end{aligned}$$

    For L-smooth functions, the convergence rate is \(\displaystyle f(x^{\left( {k}\right) }) - f^\star \le \textrm{O} \left( \left( 1- \mu h\right) ^k\right) \) if \(h \le 1/L\).

  • wDGEX2_sc: An explicit method obtained by using \(\overline{\nabla }f(y,x) = \nabla f(x)\) in the abstract scheme (6). The convergence rate and the step-size restriction are shown in the proposition.

  • wDGagr_sc: An explicit method obtained by considering another Lyapunov function

    $$\begin{aligned} E = f(x) - f^\star + \frac{4}{9}\mu \left( \Vert v-x^\star \Vert ^2 - \frac{1}{2}\Vert v-x \Vert ^2\right) . \end{aligned}$$

    This function was proposed in [13] and shown to decrease faster than (4); however, this does not guarantee faster convergence of f, since \(\frac{4}{9}\mu \left( \Vert v-x^\star \Vert ^2 - \frac{1}{2}\Vert v-x \Vert ^2\right) \) is not always positive. We can derive an abstract wDG method for the ODE that ensures the corresponding non-increase of E:

    $$\begin{aligned} \left\{ \begin{aligned} \delta ^+ x^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )}\left( v^{\left( {k+1}\right) }-x^{\left( {k+1}\right) }\right) ,\\ \delta ^+ v^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )} \left( \frac{2\beta }{\beta +\gamma }z^{\left( {k}\right) } + \frac{2\gamma }{\beta +\gamma }x^{\left( {k+1}\right) } -2v^{\left( {k+1}\right) }-\frac{9\overline{\nabla }f(x^{\left( {k+1}\right) },z^{\left( {k}\right) })}{8(\beta +\gamma )} \right) , \\ \frac{z^{\left( {k}\right) } - x^{\left( {k}\right) }}{h}&= \sqrt{2(\beta +\gamma )}\left( x^{\left( {k}\right) } + v^{\left( {k}\right) } - 2z^{\left( {k}\right) }\right) . \end{aligned} \right. \end{aligned}$$

    We omit the proof of the non-increase property. We set \(\overline{\nabla }f(y,x) = \nabla f(x)\) to obtain an explicit method. For the reason mentioned above, the convergence rate of this method is not known; it is interesting to observe numerically how the fast decrease of the above E affects the actual performance.

  • wDGIE_sc: An implicit method obtained by letting \(\overline{\nabla }f(y,x) = \nabla f(y)\) (Item 2 of Proposition 1) in (6).

  • wDGAVF_sc: An implicit method obtained by letting \(\overline{\nabla }f(y,x) = \nabla _{\textrm{AVF}} f (y,x)\) (Item 4 of Proposition 1) in (6). Note that if f is quadratic, this method is the same as the implicit midpoint rule \(\overline{\nabla }f(y,x) = \nabla f ((y+x)/2)\).

  • wDGIA_sc: An implicit method obtained by letting \(\overline{\nabla }f(y,x) = \nabla _{\textrm{IA}} f (y,x)\) (Item 6 of Proposition 1) in (6).

We compare these methods to Nesterov’s accelerated gradient method for \(\mu \)-strongly convex functions whose continuous limit is also (2):

  • NAG_sc:

    $$\begin{aligned} \left\{ \begin{aligned} x^{\left( {k+1}\right) }&= y^{\left( {k}\right) } - s \nabla f(y^{\left( {k}\right) }),\\ y^{\left( {k+1}\right) }&= x^{\left( {k+1}\right) } + \frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}(x^{\left( {k+1}\right) }-x^{\left( {k}\right) }), \end{aligned} \right. \end{aligned}$$

    where s is a step size. For L-smooth functions, its convergence rate is \(\displaystyle f(x^{\left( {k}\right) }) - f^\star \le \textrm{O} \left( \left( 1- \sqrt{\mu s}\right) ^k\right) \) if \(s \le 1/L\).

Among these explicit methods, wDGEX2_sc and NAG_sc attain the same best convergence rate, at \(h=1/(\sqrt{L} - \sqrt{\mu })\) and \(s = 1/L\), respectively.
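For concreteness, the following sketch (our own illustration, with the linear couplings in \(x^{(k+1)}\) and \(v^{(k+1)}\) resolved in closed form, and \(\beta =\mu /2\), \(\gamma =0\) as in Item 1 of Proposition 1) writes one iteration of wDGEX_sc next to one iteration of NAG_sc; grad_f, mu, and the step sizes are user-supplied.

```python
# Illustrative single iterations of wDGEX_sc and NAG_sc.
import numpy as np

def wdgex_sc_step(x, v, h, mu, grad_f):
    c = np.sqrt(mu)   # sqrt(2(beta+gamma)) with beta = mu/2, gamma = 0
    v_new = (v + h * c * (x - grad_f(x) / mu)) / (1.0 + h * c)
    x_new = (x + h * c * v_new) / (1.0 + h * c)
    return x_new, v_new

def nag_sc_step(x, y, s, mu, grad_f):
    q = (1.0 - np.sqrt(mu * s)) / (1.0 + np.sqrt(mu * s))
    x_new = y - s * grad_f(y)
    y_new = x_new + q * (x_new - x)
    return x_new, y_new
```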

We also compare wDG methods to a partitioned Runge–Kutta method and its modified version.

  • pRK: An explicit partitioned Runge–Kutta method (cf. [14]) for (2)

    $$\begin{aligned} \left\{ \begin{aligned} z^{\left( {k}\right) }-x^{\left( {k}\right) }=&\frac{\sqrt{\mu }h}{2}(v^{\left( {k}\right) }-z^{\left( {k}\right) }),\\ v^{\left( {k+1}\right) }-v^{\left( {k}\right) }=&\sqrt{\mu }h \left( z^{\left( {k}\right) } - v^{\left( {k+1}\right) } - \frac{\nabla f(z^{\left( {k}\right) })}{\mu }\right) ,\\ x^{\left( {k+1}\right) }-z^{\left( {k}\right) }=&\frac{\sqrt{\mu }h}{2}(v^{\left( {k+1}\right) }-z^{\left( {k}\right) }). \end{aligned} \right. \end{aligned}$$
    (7)

    The convergence rate is \(\textrm{O} \left( \left( 1+\sqrt{\mu }h\right) ^{-k}\right) \) for \(h \le 4\sqrt{\mu }/L\) (see Appendix A.1).

  • pRK2: A modified version of (7)

    $$\begin{aligned} \left\{ \begin{aligned} z^{\left( {k}\right) }-x^{\left( {k}\right) }=&\left( 1-\frac{\sqrt{\mu }h}{2}\right) \sqrt{\mu }h (v^{\left( {k}\right) }-z^{\left( {k}\right) }),\\ v^{\left( {k+1}\right) }-v^{\left( {k}\right) }=&\sqrt{\mu }h \left( z^{\left( {k}\right) } - \frac{1}{2}v^{\left( {k}\right) } - \frac{1}{2}v^{\left( {k+1}\right) } - \frac{\nabla f(z^{\left( {k}\right) })}{\mu }\right) ,\\ x^{\left( {k+1}\right) }-z^{\left( {k}\right) }=&\frac{\mu h^2}{2}(v^{\left( {k+1}\right) }-z^{\left( {k}\right) }) + \left( 1-\frac{\sqrt{\mu }h}{2}\right) \sqrt{\mu }h^2 \frac{v^{\left( {k+1}\right) }-v^{\left( {k}\right) }}{h}. \end{aligned} \right. \end{aligned}$$
    (8)

    The convergence rate is \(\textrm{O} \left( \left( \frac{1 - \sqrt{\mu }h/2}{1+\sqrt{\mu }h/2}\right) ^k\right) \) for \(h \le \sqrt{1/L}\) (see Appendix A.2).
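For reference, the scheme (7) becomes fully explicit once the linear appearances of \(z^{(k)}\) and \(v^{(k+1)}\) are resolved; the following sketch (our own illustration, with grad_f, mu, and h user-supplied) performs one step.

```python
# One step of the explicit partitioned Runge-Kutta scheme (7).
import numpy as np

def prk_step(x, v, h, mu, grad_f):
    a = np.sqrt(mu) * h
    z = (x + 0.5 * a * v) / (1.0 + 0.5 * a)              # first line of (7), resolved for z
    v_new = (v + a * (z - grad_f(z) / mu)) / (1.0 + a)   # second line, resolved for v^(k+1)
    x_new = z + 0.5 * a * (v_new - z)                    # third line
    return x_new, v_new
```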

3.2 Numerical experiments

We compare the above methods in terms of trajectories and function decrease. The test problem is the 2-dimensional quadratic function minimization \(\min _{x\in \mathbb {R}^2} f(x)\) where

$$\begin{aligned} f(x) = \frac{1}{2}x^\top A x + b^\top x, \quad A = \begin{pmatrix} 0.101 & 0.099 \\ 0.099 & 0.101 \end{pmatrix} ,\, b = \begin{pmatrix} 0.01\\ 0.02 \end{pmatrix}. \end{aligned}$$
(9)

This function is 0.002-strongly convex and 0.2-smooth.
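These constants are simply the extreme eigenvalues of A; the quick check below (an illustration, not part of the paper's experiments) confirms them.

```python
# mu and L of the quadratic in (9) are the smallest and largest eigenvalues of A.
import numpy as np

A = np.array([[0.101, 0.099], [0.099, 0.101]])
mu, L = np.linalg.eigvalsh(A)    # returned in ascending order
print(mu, L)                     # 0.002, 0.2 (condition number L/mu = 100)
```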

First, we compare the explicit methods. Figure 1 shows the results when the baseline step size is set to \(s_0 = 1/L\) for NAG_sc and \(h_0=1/(\sqrt{L} - \sqrt{\mu })\) for the other methods, which are the theoretical maximum step sizes for NAG_sc and wDGEX2_sc, respectively. Each method is run with step sizes that are constant multiples of \(s_0\) and \(h_0\).

The trajectories can be roughly classified into those of (wDGEX_sc, pRK), (NAG_sc, wDGEX2_sc, pRK2) and (wDGagr_sc). The trajectories in the first group oscillate similarly to the original ODE trajectory, and the same level of oscillation is maintained for larger step sizes. The trajectories in the second group oscillate less and cease to oscillate as h increases. pRK2 has a smaller maximum step size for convergence than wDGEX_sc and wDGEX2_sc, which seems to explain why the solutions of pRK2 are more oscillatory near \(h_0\). wDGagr_sc shows unstable trajectories even at small step sizes, diverging at a step size of about \(h = 0.6h_0\), which is still a step size for which the Lyapunov function decreases theoretically.

Fig. 1 Trajectories and convergences of \(\Vert \nabla f(x^{\left( {k}\right) }) \Vert \) by explicit methods for Problem (9). The initial point is (2, 3)

Meanwhile, there is almost no difference in the decrease of f among the methods other than wDGagr_sc, except that the decrease by wDGEX_sc and pRK is not monotone. wDGagr_sc decreases f faster than the other methods for step sizes that do not cause divergence, but even at \(h=0.5h_0\), just before divergence, it was not much faster than the other methods at \(h = h_0\).

Next, we compare the implicit methods and NAG_sc. Figure 2 shows the results. The trajectories of NAG_sc and wDGIE_sc oscillate similarly, and the oscillation becomes smaller as h approaches \(h_0\). wDGAVF_sc shows a larger oscillation than these two. The trajectory of wDGIA_sc is quite different from the others, showing a qualitatively different behavior from that of the original ODE.

Fig. 2 Trajectories and convergences of \(\Vert \nabla f(x^{\left( {k}\right) }) \Vert \) by implicit methods for Problem (9). The initial point is (2, 3)

Implicit methods demonstrate their strength when the objective function is “stiff”, i.e., when L or \(L/\mu \) is large. For explicit methods, the strict step-size constraints caused by stiffness lead to very slow convergence. Conversely, implicit methods tend to have looser step-size constraints as long as the scheme is solvable. Hence, although the computational cost per step of an implicit method is significantly higher, for stiff objective functions this can pay off, resulting in faster overall convergence compared to explicit methods. As an example, let us consider the following optimization problem, in which \(\mu \) is very small:

$$\begin{aligned} f(x) = \frac{1}{2}x^\top Ax + \log \left( \sum _{i=1}^d \exp (0.05 x_i)\right) , \quad A = \begin{pmatrix} 1 & 1/2 & \cdots & 1/d\\ 1/2 & 1/3 & & \\ \vdots & & \ddots & \\ 1/d & & & 1/(2d-1) \end{pmatrix}, \end{aligned}$$
(10)

where \(x = (x_1,\dots , x_d)^\top \in \mathbb {R}^d\) and \(d = 10\).
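A rough idea of the stiffness can be obtained from the Hessian at the origin, which is A plus the Hessian of the log-sum-exp term; the sketch below (an illustration with our own choice of evaluation point) shows that its smallest eigenvalue is tiny compared to the largest, i.e., the local condition number is huge.

```python
# Rough stiffness estimate for (10): extreme eigenvalues of the Hessian at x = 0.
import numpy as np

d = 10
A = np.array([[1.0 / (i + j + 1) for j in range(d)] for i in range(d)])  # A_ij = 1/(i+j-1), 1-based
p = np.full(d, 1.0 / d)                           # softmax(0.05 * 0) is uniform
H = A + 0.05**2 * (np.diag(p) - np.outer(p, p))   # Hessian of the log-sum-exp term at 0
eigs = np.linalg.eigvalsh(H)
print(eigs[0], eigs[-1])                          # smallest vs. largest eigenvalue
```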

For this problem, the results of applying wDGEX_sc and wDGIE_sc are shown in Fig. 3. In terms of total computation time, the implicit method converges faster.

Fig. 3 Convergences of \(\Vert \nabla f(x^{\left( {k}\right) }) \Vert \) by wDGEX_sc and wDGIE_sc for Problem (10). The horizontal axis is the execution time; in the range shown in the figure, wDGEX_sc has 14,000 iterations and wDGIE_sc has 1,000 iterations. Experimental settings: Google Colaboratory (OS: Ubuntu 22.04.2 LTS, CPU: Intel Xeon 2.20 GHz, Memory: 12 GiB, language: Python 3.10.12)

In the next section, we consider exploiting implicit wDG methods for not necessarily stiff problems by assuming f is a sum of convex functions.

4 Variants of proximal gradient method

4.1 Our interests

In this section, we consider the following problem:

$$\begin{aligned} \min _{x\in \mathbb {R}^d} f(x) = \min _{x \in \mathbb {R}^d} \{ f_1(x) + f_2(x) \}, \end{aligned}$$
(11)

where \(f_1\) is convex and L-smooth, and \(f_2\) is convex with an easily computable proximal map

$$\begin{aligned} {{\,\textrm{Prox}\,}}_{hf_2}(x) := \underset{y \in \mathbb {R}^d}{{{\,\mathrm{arg~min}\,}}}\ \left\{ f_2(y) + \frac{1}{2h}\Vert y-x \Vert ^2\right\} . \end{aligned}$$

Examples of \(f_2\) are the \(L^1\) and \(L^2\) regularization terms \(\lambda \Vert x \Vert _1\) and \(\lambda \Vert x \Vert _2^2\), and the indicator function \(\iota _C(x)\) of a convex set \(C \subseteq \mathbb {R}^d\), i.e., \(\iota _C(x):=0\) if \(x \in C\) and \(\iota _C(x):= \infty \) if \(x \notin C\).

The proximal gradient method defined below is a popular method to solve (11):

$$\begin{aligned} x^{\left( {k+1}\right) } = {{\,\textrm{Prox}\,}}_{hf_2}(x^{\left( {k}\right) } - h\nabla f_1(x^{\left( {k}\right) })). \end{aligned}$$
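For the two regularizers used later, the proximal maps have well-known closed forms (soft-thresholding for \(\lambda \Vert x\Vert _1\) and shrinkage for \(\frac{\lambda }{2}\Vert x\Vert _2^2\)); the following sketch (standard formulas, not specific to the paper) implements them together with one proximal gradient step.

```python
# Closed-form proximal maps and one proximal gradient iteration.
import numpy as np

def prox_l1(x, h, lam):
    # Prox of h * lam * ||.||_1: soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - h * lam, 0.0)

def prox_l2sq(x, h, lam):
    # Prox of h * (lam / 2) * ||.||_2^2: shrinkage.
    return x / (1.0 + h * lam)

def prox_grad_step(x, h, grad_f1, prox_hf2):
    # x^(k+1) = Prox_{h f2}(x^(k) - h * grad f1(x^(k))); prox_hf2 already includes h.
    return prox_hf2(x - h * grad_f1(x))
```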

For differentiable \(f_2\), computing this iteration is equivalent to solving the equation

$$\begin{aligned} \frac{x^{\left( {k+1}\right) }-x^{\left( {k}\right) }}{h} = -\nabla f_1(x^{\left( {k}\right) }) -\nabla f_2(x^{\left( {k+1}\right) }), \end{aligned}$$

which is known as an IMEX scheme for the gradient flow \(\dot{x} = -\nabla f_1(x) - \nabla f_2(x)\), that is, the combination of the explicit Euler discretization of \(-\nabla f_1\) and the implicit Euler discretization of \(-\nabla f_2\). Since both types of discretization are members of the wDG family and the sum of two wDGs is again a wDG, the convergence rate of this method follows from the unified analysis of wDG schemes (the framework remains valid for non-differentiable \(f_2\) by using subdifferentials).

Now we consider using another wDG to discretize \(-\nabla f_2\):

$$\begin{aligned} \frac{x^{\left( {k+1}\right) }-x^{\left( {k}\right) }}{h} = -\nabla f_1(x^{\left( {k}\right) }) -\overline{\nabla }f_2(x^{\left( {k+1}\right) },x^{\left( {k}\right) }), \end{aligned}$$

and assume that this scheme is easily solvable for \(x^{\left( {k+1}\right) }\). As noted above, most wDG methods are implicit, and their high computational cost generally discourages their application to large-scale problems except in highly stiff cases, as discussed in the previous section. However, in the current setting, these implicit methods can be computed efficiently, similarly to the proximal gradient method with \(L^1\) or \(L^2\) regularization.

FISTA is an accelerated variant of the proximal gradient method; its scheme reads

$$\begin{aligned} \left\{ \begin{aligned} x^{\left( {k+1}\right) }&= {{\,\textrm{Prox}\,}}_{hf_2}(y^{\left( {k}\right) } - h\nabla f_1(y^{\left( {k}\right) })) \\ y^{\left( {k+1}\right) }&= x^{\left( {k+1}\right) } + \frac{k}{k+3}(x^{\left( {k+1}\right) }-x^{\left( {k}\right) }), \end{aligned} \right. \end{aligned}$$

for the convex function f, and

$$\begin{aligned} \left\{ \begin{aligned} x^{\left( {k+1}\right) }&= {{\,\textrm{Prox}\,}}_{hf_2}(y^{\left( {k}\right) } - h\nabla f_1(y^{\left( {k}\right) })) \\ y^{\left( {k+1}\right) }&= x^{\left( {k+1}\right) } + \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}} (x^{\left( {k+1}\right) } - x^{\left( {k}\right) }) \end{aligned} \right. \end{aligned}$$

for strongly convex f. In this study, we distinguish these two methods by calling them FISTA_c and FISTA_sc.
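A compact sketch of the two variants as written above is given below; grad_f1, the proximal map of \(f_2\) (with the step size already absorbed), \(\mu \), h, and the iteration count are user-supplied, and the code is our own illustration rather than the original implementation.

```python
# Illustrative implementations of FISTA_c and FISTA_sc.
import numpy as np

def fista_c(x0, h, grad_f1, prox_hf2, n_iter):
    x, y = np.array(x0, dtype=float), np.array(x0, dtype=float)
    for k in range(n_iter):
        x_new = prox_hf2(y - h * grad_f1(y))
        y = x_new + k / (k + 3.0) * (x_new - x)
        x = x_new
    return x

def fista_sc(x0, h, mu, grad_f1, prox_hf2, n_iter):
    x, y = np.array(x0, dtype=float), np.array(x0, dtype=float)
    q = (1.0 - np.sqrt(mu * h)) / (1.0 + np.sqrt(mu * h))
    for _ in range(n_iter):
        x_new = prox_hf2(y - h * grad_f1(y))
        y = x_new + q * (x_new - x)
        x = x_new
    return x
```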

The continuous limits of these methods are the accelerated gradient flows (1) and (2). In addition to the above gradient flow case, the abstract schemes (5) and (6) with the wDG defined by \( \overline{\nabla }f (y,x) = \nabla f_1 (x) + \overline{\nabla }f_2 (y,x)\) produce new optimization methods similar to FISTA:

$$\begin{aligned} \left\{ \begin{aligned} x^{\left( {k+1}\right) } - x^{\left( {k}\right) }&= \frac{2k+1}{k^2}(v^{\left( {k+1}\right) }-x^{\left( {k+1}\right) }), \\ v^{\left( {k+1}\right) } - v^{\left( {k}\right) }&= -\frac{2k+1}{4} h^2 (\nabla f_1(z^{\left( {k}\right) }) + \overline{\nabla }f_2(x^{\left( {k+1}\right) },z^{\left( {k}\right) })),\\ z^{\left( {k}\right) }-x^{\left( {k}\right) }&= \frac{2k+1}{(k+1)^2}(v^{\left( {k}\right) }-x^{\left( {k}\right) }) \end{aligned} \right. \end{aligned}$$
(12)

for convex f, and

$$\begin{aligned} \left\{ \begin{aligned} x^{\left( {k+1}\right) }-x^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )}h \left( v^{\left( {k+1}\right) }-x^{\left( {k+1}\right) }\right) ,\\ v^{\left( {k+1}\right) }-v^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )}h \left( \frac{\beta }{\beta +\gamma }z^{\left( {k}\right) } + \frac{\gamma }{\beta +\gamma }x^{\left( {k+1}\right) } -v^{\left( {k+1}\right) } \right. \\&\qquad \left. - \frac{1}{2(\beta + \gamma )}(\nabla f_1(z^{\left( {k}\right) }) + \overline{\nabla }f_2(x^{\left( {k+1}\right) },z^{\left( {k}\right) })) \right) , \\ z^{\left( {k}\right) } - x^{\left( {k}\right) }&= \sqrt{2(\beta +\gamma )}h\left( x^{\left( {k}\right) } + v^{\left( {k}\right) } - 2z^{\left( {k}\right) }\right) \end{aligned} \right. \end{aligned}$$
(13)

for strongly convex f. If \( \overline{\nabla }f_2 \) is a wDG with parameters \( ( \alpha _2, \beta _2, \gamma _2 ) \), then \( \overline{\nabla }f (y,x) = \nabla f_1 (x) + \overline{\nabla }f_2 (y,x) \) is a wDG with parameters \( (L_1/2+\alpha _2,\mu _1/2+\beta _2,\gamma _2) \), where we assume that \( f_1 \) is \(L_1\)-smooth and \(\mu _1\)-strongly convex. Therefore, Proposition 2 implies that the scheme (12) satisfies \( f(x^{\left( {k}\right) }) - f^{\star } \le \textrm{O} \left( {1/k^2}\right) \) if \( h \le 1 / \sqrt{L_1 + 2 \alpha _2} \), and Proposition 3 implies that the scheme (13) satisfies \( f(x^{\left( {k}\right) }) - f^{\star } \le \textrm{O} \left( { ( 1 + \sqrt{ \mu _1 + 2 \beta _2 + 2 \gamma _2 } h )^{-k} }\right) \) if \( h \le (\sqrt{ L_1 + 2 \alpha _2 + 2 \gamma _2 } - \sqrt{\mu _1 + 2 \beta _2 + 2 \gamma _2 })^{-1}\). In particular, by selecting \(\overline{\nabla }f_2 (x^{\left( {k+1}\right) },z^{\left( {k}\right) }) = \nabla f_2 (x^{\left( {k+1}\right) })\), these schemes have the same rates as FISTA but different iterative formulas; we call them IMEX_c/sc for (12)/(13). Using other wDGs for \(-\nabla f_2\) in place of the implicit Euler method, we can create further methods. In this study, we consider the AVF \(\overline{\nabla }f_2 (x^{\left( {k+1}\right) },z^{\left( {k}\right) }) = \int _0^1 \nabla f_2 (\tau x^{\left( {k+1}\right) }+(1-\tau )z^{\left( {k}\right) }) \textrm{d}\tau \) and call the resulting methods AVFEX_c/sc.
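The parameter composition and the resulting guarantees are simple arithmetic; the helper below (our own illustration) composes the wDG parameters of \(f = f_1 + f_2\) and returns the maximum step size and per-step contraction factor of scheme (13). In the example call, the values \(L_1 = 0.2\), \(\mu _1 = 0.002\), and \(\gamma _2 = \lambda /2\) with \(\lambda = 0.01\) (implicit Euler wDG for the \(L^2\) regularizer, Item 2 of Proposition 1) are assumptions chosen to match Problem 1 below.

```python
# Composed wDG parameters for f = f1 + f2 and the rate data of scheme (13).
import numpy as np

def composed_rate_sc(L1, mu1, alpha2, beta2, gamma2):
    alpha = L1 / 2.0 + alpha2
    beta = mu1 / 2.0 + beta2
    gamma = gamma2
    # Maximum step size of Proposition 3 and the corresponding contraction factor.
    h_max = 1.0 / (np.sqrt(2.0 * (alpha + gamma)) - np.sqrt(2.0 * (beta + gamma)))
    contraction = 1.0 / (1.0 + np.sqrt(2.0 * (beta + gamma)) * h_max)
    return (alpha, beta, gamma), h_max, contraction

# Example: IMEX_sc with f2 = (0.01/2)*||x||_2^2, i.e. (alpha2, beta2, gamma2) = (0, 0, 0.005).
print(composed_rate_sc(L1=0.2, mu1=0.002, alpha2=0.0, beta2=0.0, gamma2=0.005))
```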

Our interest is the practical behavior of these wDG methods compared to FISTA. In the following section, we apply FISTA and wDG methods to several test problems. For strongly convex problems \(\min f\), we use _sc methods, and for convex but not strongly convex problems, we use _c methods.

4.2 Numerical experiments

In this section, we consider several (strongly) convex functions as \(f_1\) and two regularization terms as \(f_2\): the \(L^2\) regularization \(f_2(x) = \frac{\lambda }{2}\Vert x \Vert _2^2\) and the \(L^1\) regularization \(f_2(x) = \lambda \Vert x \Vert _1\). For \(L^2\) regularization problems, we compare FISTA_sc, IMEX_sc and AVFEX_sc; note that adding the \(L^2\) regularization term makes any convex objective function \(f_1(x)\) strongly convex. For \(L^1\) regularization problems, since \(f_2(x)=\lambda \Vert x\Vert _1\) is not differentiable, \(\nabla _\textrm{AVF} f_2\) cannot be used as a wDG, so we exclude it. Thus, we only consider FISTA_sc and IMEX_sc when \(f_1\) is strongly convex, and FISTA_c and IMEX_c when \(f_1\) is convex but not strongly convex. The step size used is the theoretical maximum value that guarantees the convergence rate.

We consider the following four problems.

4.2.1 Problem 1

$$\begin{aligned} f_1(x) = \frac{1}{2} x^\top A x + b^\top x, \quad \text {where}\, A = \begin{pmatrix} 0.101 & 0.099 \\ 0.099 & 0.101 \end{pmatrix} ,\, b = \begin{pmatrix} 0.01 \\ 0.02 \end{pmatrix}. \end{aligned}$$

This function is 0.002-strongly convex and 0.2-smooth.

The result for the \(L^2\) regularization \(f_2(x) = \frac{\lambda }{2}\Vert x \Vert _2^2\) is presented in Fig. 4. The shapes of the trajectories are almost the same among the three methods and all \(\lambda \)'s, except that FISTA_sc overshoots at the first step. For \(\lambda =0.01\), IMEX_sc and AVFEX_sc show slightly faster convergence than FISTA_sc, and the decrease by AVFEX_sc is oscillatory. For \(\lambda =0.001\), AVFEX_sc is faster than the other two methods. For \(\lambda = 0.0001\), the three methods have the same convergence speed. Note that the computational complexities per step of the tested methods are roughly the same, so the comparison of the convergence profiles against iterations is fair; this remark applies to the similar figures in this section.

Fig. 4 Trajectories and convergences of \(\Vert \nabla f \Vert \) by FISTA_sc, IMEX_sc and AVFEX_sc for Problem 1 with the \(L^2\) regularization

The result for the \(L^1\) regularization \(f_2(x) = \lambda \Vert x \Vert _1\) is shown in Fig. 5. In this case, FISTA_sc and IMEX_sc show similar trajectories and convergences for \(\lambda = 0.01, 0.001\), and other values of \(\lambda \).

Fig. 5 Trajectories and convergences of \(\Vert \nabla f \Vert \) by FISTA_sc and IMEX_sc for Problem 1 with the \(L^1\) regularization

4.2.2 Problem 2

$$\begin{aligned} f_1 (x) = \frac{10}{4} \left( \frac{1}{2}\left[ {x_1}^2 + \sum _{i=1}^{1000} (x_i - x_{i+1})^2 + {x_{1000}}^2\right] -x_1\right) . \end{aligned}$$

This function has \(L \approx 10\) and \(\mu \approx 1.57 \times 10^{-2}\).

Figure 6 shows the convergence of \(\Vert \nabla f \Vert \). Similarly to Problem 1, in the \(L^2\) regularization setting, IMEX_sc and AVFEX_sc are faster than FISTA_sc for large \(\lambda \), and they have the same speed for small \(\lambda \). In the \(L^1\) regularization setting, there is no difference in the convergence of \(\Vert \nabla f \Vert \) between FISTA_sc and IMEX_sc for \(\lambda = 0.1\) and other values of \(\lambda \).

Fig. 6 Convergences of \(\Vert \nabla f \Vert \) for Problem 2 by FISTA_sc, IMEX_sc and AVFEX_sc for the \(L^2\) regularization (a–c) and by FISTA_sc and IMEX_sc for the \(L^1\) regularization (d)

4.2.3 Problem 3 [15]

$$\begin{aligned} f_1 (x) = \sum _{i=1}^{50} g(a_i^\top x - b_i), \quad g(x) = {\left\{ \begin{array}{ll} \frac{1}{2} x^2 \textrm{e}^{-r/x}, & x>0,\\ 0, & x \le 0, \end{array}\right. } \end{aligned}$$

where \(x, a_i \in \mathbb {R}^{1000}\), \(b_i \in \mathbb {R}\) \((i=1,\dots ,50)\), and \( r = 0.001\). The parameters \(a_i,b_i\) are chosen randomly. This is a convex but not strongly convex function. With the chosen parameters, L is approximately 1548. For this problem, we use the methods for strongly convex functions in the case of the \(L^2\) regularization, and those for convex functions in the case of the \(L^1\) regularization (as described above). This remark applies to Problem 4 as well.

The results are shown in Fig. 7. In the \(L^2\) regularization setting, AVFEX_sc has an advantage at \(\lambda =1\), but otherwise there is no significant difference in performance between the methods, as in the previous two cases. In the \(L^1\) setting, neither _c method achieved successful optimization.

Fig. 7 Convergences of \(\Vert \nabla f \Vert \) for Problem 3 by FISTA_sc, IMEX_sc and AVFEX_sc for the \(L^2\) regularization (a–c) and by FISTA_c and IMEX_c for the \(L^1\) regularization (d)

4.2.4 Problem 4

$$\begin{aligned} f_1 (x) = \sum _{i = 1}^{200} \log \left( 1 + \textrm{e}^{-y_i \xi _i^\top x}\right) , \end{aligned}$$

where \(\xi _i \in \mathbb {R}^d\) and \(y_i \in \{ 1,-1 \}\) \((i= 1,\dots ,200)\) are chosen randomly. This is a convex but not strongly convex function appearing in logistic regression. With the chosen parameters, L is approximately 1549.

Fig. 8 Convergences of \(\Vert \nabla f \Vert \) for Problem 4 by FISTA_sc, IMEX_sc and AVFEX_sc for the \(L^2\) regularization (a–c) and by FISTA_c and IMEX_c for the \(L^1\) regularization (d–f)

The results are shown in Fig. 8. The optimization algorithms exhibited different behavior on this problem compared to the previous ones. For \(\lambda =100\), IMEX_sc is the fastest, followed by AVFEX_sc and FISTA_sc. For \(\lambda =10\), however, the order is FISTA_sc, then IMEX_sc, then AVFEX_sc. For \(\lambda =0.1\), there are no differences in speed.

5 Summary and discussions

In this section, we summarize the results of the numerical experiments to answer the three issues raised in the Introduction.

5.1 Issue 1: differences between wDGs and other methods

In the preliminary numerical experiments in [7], the advantage of wDG methods was not clear for either strongly convex or convex objective functions, as mentioned in the Introduction. Below are the novel contributions of the present study in the context of strongly convex functions.

We first observed in Sect. 3.2 that the convergence profiles are roughly the same for all tested methods except wDGagr_sc. This means that wDG methods for the (strongly convex) NAG ODE perform at the same level as NAG_sc. Hence, one might think that constructing optimization methods via the wDG framework is not really advantageous. However, one important observation is that wDGEX_sc, a wDG scheme that is quite natural as a numerical method for the NAG ODE, is competitive with the other methods including NAG_sc (which is not a natural discretization). This is preferable from a numerical-analysis perspective.

Another important observation is that wDGagr_sc converges much faster than the other methods; unfortunately, however, a theoretical convergence estimate is yet to be given. Nevertheless, this demonstrates the potential of the wDG framework: if we succeed in finding a better ODE and/or Lyapunov function, we can immediately construct a method that can outperform existing methods. Although wDGagr_sc tends to be unstable for larger time steps, this does not mean it is inferior in this respect: comparing Cases (b) and (c) in Fig. 1, wDGagr_sc in (b) converges more than twice as fast as the other methods, that is, it performs better than the other methods do in (c).

Additionally, we observe that even when convergence profiles are similar, the corresponding trajectories in the solution space can be quite different (recall Problem 1). This encourages us to further explore various wDGs. A negative example in this respect is wDGIA_sc, the implicit method based on the Itoh–Abe discrete gradient, which performs very poorly. This warns us that even though the Itoh–Abe discrete gradient is now drawing attention in the context of optimization (due to its potentially high computational efficiency compared to other implicit methods), its use is not always advisable.

In Sect. 4.2, we observed that the implicit wDG methods are as efficient as the standard proximal gradient methods (with \(L^2\) or \(L^1\) regularization), and their convergence profiles are similar or even slightly better (within the experiments of the present study). In this sense, such wDG methods are worth considering in situations where FISTA is needed.

5.2 Issue 2: practicability of implicit wDG methods

This is investigated in two ways, in the case of strongly convex objective functions. First, in Sect. 3.2, it is confirmed that for some highly stiff objective functions, implicit optimization methods can indeed be advantageous. Although this fact is widely known in the numerical analysis (of ODEs) community, it has not been established in the optimization field. Next, in Sect. 4.2, it is confirmed (as mentioned above) that implicit wDG methods are of practical use when we consider proximal gradient methods. Hence, implicit wDG methods are in fact practical under some circumstances. This greatly widens the scope of new optimization methods: numerical analysis offers various implicit ODE solvers that have proven their worth and deserve consideration.

5.3 Issue 3: efficiency of proximal gradient-like wDG methods

The practicability of the proximal gradient-like methods in the case of strongly convex objective functions was discussed above. The last two problems (Problems 3 and 4) include the case of (not strongly) convex functions. The results show that the behaviors are roughly the same. Thus, there is no strong evidence to recommend the wDG methods, but at the same time there is no reason to avoid them. Further investigations with other wDGs might reveal meaningful differences.

6 Concluding remarks

In this paper, we conducted numerical experiments on methods based on the wDG framework to validate their properties and utility. The results show that in most cases wDG methods are competitive with typical optimization methods (e.g., NAG and FISTA) and can even slightly outperform them. Additionally, the results show that implicit methods deserve further consideration; this supports the potential of the wDG framework, in which the resulting methods are often implicit, and is consistent with the fact that structure-preserving numerical methods are recommended for ODEs that are difficult to solve with generic methods. We hope that this paper encourages researchers in the optimization and numerical analysis communities to build on the framework and enrich optimization research.