Abstract
We study convergence of the trajectories of the Heavy Ball dynamical system, with constant damping coefficient, in the framework of convex and non-convex smooth optimization. By using the Polyak–Łojasiewicz condition, we derive new linear convergence rates for the associated trajectory, in terms of objective function values, without assuming uniqueness of the minimizer.
1 Introduction
Let \({\mathcal {H}}\) be a Hilbert space and \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) be a differentiable function with L-Lipschitz continuous gradient, such that the set of minimizers is nonempty. We are interested in the convergence properties, as t goes to infinity, of the solution-trajectory of the following second-order dynamical system
$$\begin{aligned} \ddot{x}(t)+\alpha {\dot{x}}(t)+\nabla F(x(t))=0, \end{aligned}$$(1)
where \(\alpha >0\). System (1) is called the heavy ball system since it physically describes the motion of a material point (ball) rolling over the graph of the function F and subject to a friction proportional to the velocity (see [2, 10, 14]). The constant of proportionality \(\alpha >0\) is called the damping parameter [3, 14, 39]. In general, systems like (1) play an important role in various fields such as mechanics and physics, and in optimization (see for example [8, 12, 19, 25, 26, 42] and references therein).
Indeed the dynamical system (1) has been widely studied in the optimization literature, due to the fact that under suitable assumptions on the function F, the solution-trajectory satisfies the minimization property:
$$\begin{aligned} \lim _{t\rightarrow +\infty }F(x(t))=\min F. \end{aligned}$$
This approach has been especially fruitful to study convergence properties of discrete accelerated first order algorithms [4, 6, 12]. As shown in [39], for strongly convex functions, system (1) leads to linear rates of convergence for \(F(x(t))-\min F\), which can be faster, depending on the strong convexity parameter, than those associated to the trajectory of the first order gradient flow system \({\dot{x}}(t) +\nabla F(x(t))=0\) (see [38]). The purpose of this paper is to establish new convergence properties for the solutions of system (1). To the best of our knowledge, there are no results concerning the explicit convergence rates of the trajectories generated by the Heavy ball system (1), for objective functions satisfying the Polyak–Łojasiewicz condition without convexity or uniqueness of the minimizer. In this paper we address this issue and establish worst-case linear convergence rates for the objective function \(F(x(t))-\min F\). In addition we extend some of these rates to the case of a convex, but not necessarily differentiable, function F. The Polyak–Łojasiewicz condition is a relaxation of strong convexity, and has proven to be especially useful in obtaining linear convergence rates for several dynamical systems usually employed in optimization (see for example [21, 23, 31, 38]), as well as for training deep neural networks (see for example [1, 43]).
Systems such as (1) are closely linked with optimization algorithms, such as the heavy-ball algorithm, which are frequently used in various settings in optimization, machine learning and inverse problems. In many cases, the convergence properties of the continuous systems are inherited by their associated numerical schemes, i.e. algorithms. This observation makes the study of continuous-in-time systems such as (1) very popular, since they can provide powerful insights and tools for the analysis of inertial first order methods [4, 6, 12, 13, 42].
1.1 Previous work
The seminal paper by Polyak [39] shows that, if F is \(\mu \)-strongly convex and twice differentiable, the generated trajectory converges linearly to the unique minimizer of F as \(e^{-2\sqrt{\mu }t}\). For \(\mu <1\), this convergence rate is faster than the one obtained for the gradient flow, that is \(e^{-2\mu t}\), see e.g. [37,38,39]. Further studies of system (1) were also made in [2, 10, 14, 30], where convergence properties of the solution-trajectory of (1) were established in more general settings: F convex [2], non-convex [14] or non-differentiable [10].
Another interesting fact, motivated and pointed out in [14], is that, in contrast with the Gradient flow, the heavy-ball system (1) is a second order (in time) system which is not a descent scheme. This allows the trajectory generated by (1) to escape possible local minima of non-convex functions, by properly tuning the initial velocity \({\dot{x}}(0)\) (see for example the corresponding discussion in [14]). In general, systems like (1) with other choices of damping parameter have become very popular and constitute an active area of research thanks to their nice convergence properties (among the rich literature, one can consult [5, 11,12,13, 15, 26, 42] and the references therein).
Recently, the research directions for system (1) have focused on studying the case where the objective function \(F\) satisfies weaker geometrical assumptions in order to tackle the minimization problem of a wider family of functions. In this context in [16, 41], it was discovered that one can relax the strong convexity property and the Lipschitz character of the gradient of F and still obtain some linear convergence rates for the associated trajectory of (1). For quasi-strongly convex functions (see relation (q\({{\mathcal {S}}}{{\mathcal {C}}}\)) in Sect. 2), admitting a unique minimizer, it was shown that the quantity \(F(x(t))-{\min }F\) converges as \(e^{-\sqrt{2\mu }t}\), where \(\mu \) is the quasi-strong convexity parameter and \(\alpha =3\sqrt{{\mu }/{2}}\), see Theorem 1 in [16]. This result was recently extended in [17], where the authors derive some linear convergence rates for convex functions admitting a unique minimizer and satisfying the Quadratic growth condition (see (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) in Sect. 2), which is equivalent to the Polyak–Łojasiewicz condition.
As for the non-convex setting (i.e. the minimizing function F is not necessarily convex), less is known about the convergence properties of system (1). In particular, in [19] (see also [7, 11, 37]), the authors show that for continuously differentiable functions satisfying the Kurdyka–Łojasiewicz condition (which is a generalization of the Polyak–Łojasiewicz condition), the solution of system (1) converges linearly to a minimizer of the corresponding function. However their analysis does not provide explicit formulas for the exponents of these linear rates, which is the main subject of the current paper.
1.2 Organization
The paper is organized as follows: In Sect. 2 we recall some basic definitions and tools concerning the main geometrical assumption on the minimizing function F. In Sect. 3 we present the main results of the paper concerning the convergence rates of \(F(x(t))-\min F\). In Theorem 1, we treat the non-convex case, while slightly improved rates are given in Theorem 2, where the minimizing function is additionally supposed to be convex. In Sect. 4, we extend some of the results to the convex non-smooth setting. Some numerical experiments are reported in Sect. 5. Section 6 contains the related convergence analysis and the proofs of the two main theorems. Finally, Appendix 1 contains some basic auxiliary results.
2 Preliminaries
In this section we present some basic definitions and results which will be used in the rest of the paper.
Let \({\mathcal {H}}\) be a Hilbert space endowed with the scalar product \(\langle \cdot ,\cdot \rangle \) and the associated norm \(\left\| \cdot \right\| \). We consider the following minimization problem:
$$\begin{aligned} \min _{x\in {\mathcal {H}}}F(x), \end{aligned}$$(2)
where \({\mathcal {H}}\) is a Hilbert space and \(F:{\mathcal {H}}\longrightarrow \mathbb {R}\) is a function that satisfies the following assumptions:
- \({\mathcal {A}}.1\):
-
F is differentiable with L-Lipschitz gradient
- \({\mathcal {A}}.2\):
-
The set of minimizers \(X_{*}=\text {arg}\,\text {min} F\) is not empty.
More precisely, we are interested in the convergence properties of a solution-trajectory \(\{x(t)\}_{t\ge 0}\) of the dynamical system (1), to a minimizing solution of (2).
Under conditions \({\mathcal {A}}.1\) and \({\mathcal {A}}.2\), the existence and uniqueness of a strong global solution \(x(t)\in {\mathcal {C}}^{2}({\mathcal {H}})\) of the initial value problem (1) can be guaranteed by the Cauchy–Lipschitz theorem, see [14, Theorem 11].
Throughout the paper, we will assume that the function F satisfies the Polyak–Łojasiewicz condition.
Definition 1
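Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) be a differentiable function with \(\text {arg}\,\text {min} F \ne \emptyset \). We say that the function F satisfies the Polyak–Łojasiewicz condition, if there exists some constant \(\mu >0\), such that the following inequality holds:
$$\begin{aligned} \left\| \nabla F(x)\right\| ^{2}\ge 2\mu \left( F(x)-\min F\right) ,\qquad \forall x\in {\mathcal {H}}. \end{aligned}$$(\({{\mathcal {P}}}{{\mathcal {L}}}\))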
Condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) was introduced in the early work [34] (see also [38]) and can be identified as a particular case of the Łojasiewicz (or Kurdyka–Łojasiewicz) gradient inequality (see for example [21,22,23, 32, 34]). The main difference between our definition and the classical one (see [21, 22, 32]) is that inequality (\({{\mathcal {P}}}{{\mathcal {L}}}\)) is usually required to hold locally, namely for all points in a neighborhood of a given critical point and in a suitable sublevel set, see [21,22,23] for a thorough analysis and extensions. In this global form, this property has been introduced in [38] and is also known under the name Polyak–Łojasiewicz inequality, see e.g. [38], or [31]. The global requirement on the one hand restricts the class of considered functions, but on the other hand allows one to obtain global convergence results with explicit constants. Indeed, either for classical dynamical systems or algorithms, condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) is intimately connected with the linear convergence rates of the objective function values both in the convex and non-convex setting, see e.g. [23, 27, 28, 31, 37,38,39, 44] and their associated references. In what follows, we will make some remarks about functions satisfying property (\({{\mathcal {P}}}{{\mathcal {L}}}\)).
The Polyak–Łojasiewicz condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) is a relaxation of the strong convexity property of a function, i.e.
$$\begin{aligned} F(y)\ge F(x)+\langle \nabla F(x),y-x\rangle +\frac{\mu }{2}\left\| y-x\right\| ^{2},\qquad \forall x,y\in {\mathcal {H}}. \end{aligned}$$(3)
Indeed, from the definition of strong convexity, by minimizing both sides with respect to \(y\in {\mathcal {H}}\), one deduces the (\({{\mathcal {P}}}{{\mathcal {L}}}\)) condition.
Remark 1
In this remark we collect some basic observations about the Polyak–Łojasiewicz condition in Definition 1:
-
Condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) implies that every critical point of F is a global minimizer.
-
Differently from the notion of strong convexity, condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) does not imply uniqueness of the minimizer, even in the convex case. For instance, in \(\mathbb {R}\), consider the function \(F(x)=|\left| x\right| -1|_{+}^{2}=\bigl (\max \{\left| x\right| -1,0\}\bigr )^2\).
-
Condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) does not imply that the function is convex, see Example 1 and Figure 1.
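The second observation of Remark 1 can be checked numerically. The following Python sketch is only an illustration: the helper names `F`, `dF` and `pl_ratio` are hypothetical, and it assumes the normalization \(\left\| \nabla F(x)\right\| ^{2}\ge 2\mu (F(x)-\min F)\) for the Polyak–Łojasiewicz inequality. It evaluates the ratio \(|F'(x)|^{2}/\bigl (2(F(x)-\min F)\bigr )\) for \(F(x)=\bigl (\max \{\left| x\right| -1,0\}\bigr )^2\) at a few points outside \([-1,1]\).

```python
def F(x: float) -> float:
    """F(x) = (max(|x| - 1, 0))^2, the function mentioned in Remark 1."""
    return max(abs(x) - 1.0, 0.0) ** 2


def dF(x: float) -> float:
    """Derivative of F: 2 * max(|x| - 1, 0) * sign(x)."""
    excess = max(abs(x) - 1.0, 0.0)
    return 2.0 * excess * (1.0 if x > 0 else -1.0)


def pl_ratio(x: float) -> float:
    """|F'(x)|^2 / (2 (F(x) - min F)); here min F = 0 (only call with |x| > 1)."""
    return dF(x) ** 2 / (2.0 * F(x))


# Every point of [-1, 1] is a minimizer, so the minimizer is not unique.
minimizers = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Outside [-1, 1] the ratio is constant, which is consistent with mu = 2.
ratios = [pl_ratio(x) for x in (-7.0, -1.5, 1.01, 3.0, 40.0)]
```

Since \(|F'(x)|^{2}=4F(x)\) for every x, the inequality holds with equality for \(\mu =2\) under the assumed normalization, while the whole interval \([-1,1]\) minimizes F.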
Example 1
Let \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) be a differentiable function and consider \(F:\mathbb {R}^{d+1}\rightarrow \mathbb {R}\) by setting
Then the set of minimizers of F is the graph of f, and this set is not convex unless f is affine; thus F is not convex unless f is affine. On the other hand, condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) is satisfied with \(\mu =2\). Examples of functions in the class described by relation (4) are relevant for deep learning applications, see e.g. [33].
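Condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) is also closely connected with other geometrical notions, frequently used in the optimization literature to establish linear convergence rates for approximation methods, such as the error-bound condition or metric subregularity of the gradient at the origin, i.e., for some \(\eta >0\):
$$\begin{aligned} \left\| \nabla F(x)\right\| \ge \eta \, \text {dist}(x,X_{*}),\qquad \forall x\in {\mathcal {H}}, \end{aligned}$$(\({{\mathcal {E}}}{{\mathcal {B}}}\))
or the quadratic growth condition, i.e., for some \(\theta >0\):
$$\begin{aligned} F(x)-\min F\ge \frac{\theta }{2}\, \text {dist}(x,X_{*})^{2},\qquad \forall x\in {\mathcal {H}}, \end{aligned}$$(\({{\mathcal {Q}}}{{\mathcal {G}}}\))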
where dist\((x,C)=\inf \{\left\| x-y\right\| ~ : ~ y\in C\}\) denotes the distance of a point \(x\in {\mathcal {H}}\) from a set \(C\subset {\mathcal {H}}\).
Another notion which has also been used to establish linear convergence of first order methods is the notion of quasi-strong convexity, which was introduced in [35] as a relaxation of the strong convexity property of a function. It has also been used to study the asymptotic properties of the trajectories of system (1) in the recent work [16].
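A differentiable function \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) is called \(\beta \)-quasi-strongly convex, if there exists some constant \(\beta >0\) such that for all \(x\in {\mathcal {H}}\) and for all projections \({\bar{x}}\) of x on \(X_*\), it holds:
$$\begin{aligned} \min F\ge F(x)+\langle \nabla F(x),{\bar{x}}-x\rangle +\frac{\beta }{2}\left\| x-{\bar{x}}\right\| ^{2}. \end{aligned}$$(q\({{\mathcal {S}}}{{\mathcal {C}}}\))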
Notice however that a projection \({\bar{x}}\) of x on \(X_*\) may not exist when \({\mathcal {H}}\) is infinite-dimensional, e.g. if \(X_{*}\) is not a Chebyshev set. Without additional assumptions, such as convexity or weak closedness of \(X_{*}\), or finite dimensionality of the space \({\mathcal {H}}\), relation (q\({{\mathcal {S}}}{{\mathcal {C}}}\)) may be vacuous. In what follows, whenever employing relation (q\({{\mathcal {S}}}{{\mathcal {C}}}\)), we will refer to functions F such that the set of projections of a point onto \(X_{*}\) is nonempty. Note however that, given \(x\in {\mathcal {H}}\), there may be more than one projection of x on \(X_{*}\). It is worth mentioning that, unlike strong convexity, quasi-strong convexity implies neither convexity nor uniqueness of the minimizer.
Below we present some known results about the interplay between conditions (\({{\mathcal {P}}}{{\mathcal {L}}}\)), (\({{\mathcal {E}}}{{\mathcal {B}}}\)), (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) and (q\({{\mathcal {S}}}{{\mathcal {C}}}\)) in different settings. In the convex setting, conditions (\({{\mathcal {P}}}{{\mathcal {L}}}\)), (\({{\mathcal {E}}}{{\mathcal {B}}}\)) and (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) are all equivalent, while quasi-strong convexity is stronger than the Polyak–Łojasiewicz condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)). In particular, it can be shown that the class of quasi-strongly convex functions is a subclass of functions satisfying condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)).
Proposition 1
Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) be a continuously differentiable function with \(\text {arg}\,\text {min} F \ne \emptyset \). Then the following implications hold true:
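$$\begin{aligned} ({q{{\mathcal {S}}}{{\mathcal {C}}}}) \implies ({{{\mathcal {P}}}{{\mathcal {L}}}}) \implies ({{{\mathcal {Q}}}{{\mathcal {G}}}}) \end{aligned}$$(5) with \(\theta =\mu =\beta \).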
-
1.
If F has L-Lipschitz gradient, then:
$$\begin{aligned} ({{{\mathcal {E}}}{{\mathcal {B}}}}) \implies ({{{\mathcal {P}}}{{\mathcal {L}}}}) \end{aligned}$$(6) with \(\mu =\frac{\eta ^{2}}{L}\).
-
2.
If F is convex, then:
$$\begin{aligned} ({{{\mathcal {Q}}}{{\mathcal {G}}}}) \implies ({{{\mathcal {E}}}{{\mathcal {B}}}}) \implies ({{{\mathcal {P}}}{{\mathcal {L}}}}) \end{aligned}$$(7) with \(\eta =\frac{\theta }{2}\) and \(\mu =\frac{\theta }{4}\).
-
3.
If F is convex with L-Lipschitz gradient, then:
$$\begin{aligned} ({{{\mathcal {P}}}{{\mathcal {L}}}}) \implies ({q{{\mathcal {S}}}{{\mathcal {C}}}}) \end{aligned}$$(8) with \(\beta =\frac{\mu ^2}{L}\).
Proposition 1 collects some already known results.
Proof
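For the proof of the first implication in (5), we observe that by using the Cauchy–Schwarz inequality for the scalar product in the quasi-strong convexity condition (q\({{\mathcal {S}}}{{\mathcal {C}}}\)) and then Young’s inequality for the product, for any \(x\in {\mathcal {H}}\), any projection \({\bar{x}}\) of x onto \(X_*\) and any \(\varepsilon >0\), we have:
$$\begin{aligned} F(x)-\min F&\le \langle \nabla F(x),x-{\bar{x}}\rangle -\frac{\beta }{2}\left\| x-{\bar{x}}\right\| ^{2}\le \left\| \nabla F(x)\right\| \left\| x-{\bar{x}}\right\| -\frac{\beta }{2}\left\| x-{\bar{x}}\right\| ^{2}\\ &\le \frac{1}{2\varepsilon }\left\| \nabla F(x)\right\| ^{2}+\frac{\varepsilon -\beta }{2}\left\| x-{\bar{x}}\right\| ^{2}, \end{aligned}$$(9)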
which, by choosing \(\varepsilon =\beta \), is the (\({{\mathcal {P}}}{{\mathcal {L}}}\)) condition.
The second implication of (5) can be found in [23, Theorem 27] and [44, Theorem 1].
The implication (6) can be found in [31, Theorem 2], and the ones in (7), in [44, Theorem 1] or in [28, Proposition 3.3] (see also [22, 23, 31]).
Finally, the implication in (8) can be found in [16, Lemma 2] and we give here a brief proof for the sake of completeness.
Indeed, since F is convex with L-Lipschitz gradient satisfying condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)), then by the Baillon-Haddad Theorem (see (87) in Lemma 6 in Appendix 1) we have:
Since (\({{\mathcal {P}}}{{\mathcal {L}}}\)) implies (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) with \(\theta \)=\(\mu \), by using (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) in (10), we find:
which shows that F is \(\frac{\mu ^2}{L}\)-quasi strongly convex. \(\square \)
Remark 2
Notice that for non-convex functions the implication (8) in Proposition 1 does not hold true in general. Motivated by an example given in [31], for any \(\varepsilon >0\), one can consider the following non-convex function:
The function F is continuously differentiable with L-Lipschitz gradient with \(L\le 14\), satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with some constant \(\mu =\frac{1}{32}\). However this function is not (globally) quasi-strongly convex (see also Figure 2).
3 Main results
In this section we present the main results about convergence of the trajectories of the dynamical system (1), both in the nonconvex and convex case. In Theorems 1 and 2 we provide some upper bounds for the decay of the trajectory of (1), in terms of objective function values, for a specific choice of the damping parameter \(\alpha \). This choice of \(\alpha \) is "optimal" in the sense that it is the one that ensures the fastest decay rate of the trajectory, according to the upper bounds found in the convergence analysis (see Theorems 4 and 5 in Section 6).
Theorem 1
Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) be a differentiable function with L-Lipschitz gradient. Assume that \(\text {arg}\,\text {min} F\ne \emptyset \) and F also satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu >0\) and denote \(F_{*}=\min F\) and \(\kappa =\frac{L}{\mu }\). If \(\big (x(t)\big )_{t\ge 0}\) is the solution-trajectory of the dynamical system (1), then for all \(\varepsilon >0\), the following bounds hold true:
-
If \(\kappa \le \frac{9}{8}\) and \(\alpha =\frac{\sqrt{\mu }}{2\sqrt{2}}\left( 5+\sqrt{9-8\kappa }\right) -\frac{\varepsilon }{2}\), then
$$\begin{aligned} F(x(t))-F_*\le \left( F(x(0))-F_*\right) \left( \frac{4\varepsilon +2\sqrt{2\mu }}{\varepsilon }\right) e^{-\left( \sqrt{2\mu }-\varepsilon \right) t} \end{aligned}$$(13)
-
If \(\kappa >\frac{9}{8}\) and \(\alpha =(2\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\), then
$$\begin{aligned} F(x(t))-F_*\le \left( F(x(0))-F_*\right) \frac{4\sqrt{\kappa -1}\left( 3\sqrt{\kappa -1}+\sqrt{\kappa }\right) }{8\kappa -9}e^{-2\left( \sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu }t} \end{aligned}$$(14)
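The two regimes of Theorem 1 can be collected in a small helper. The Python sketch below simply transcribes the damping, the exponent and the constant appearing in (13) and (14); the function name `theorem1_parameters` and its default value of `eps` are hypothetical choices made for illustration.

```python
import math


def theorem1_parameters(mu: float, L: float, eps: float = 1e-2):
    """Return (alpha, exponent, prefactor) as stated in Theorem 1.

    The bound then reads
        F(x(t)) - F_* <= (F(x(0)) - F_*) * prefactor * exp(-exponent * t).
    `eps` is only used in the regime kappa <= 9/8.
    """
    if not (0.0 < mu <= L):
        raise ValueError("expected 0 < mu <= L")
    kappa = L / mu
    if kappa <= 9.0 / 8.0:
        alpha = (math.sqrt(mu) / (2.0 * math.sqrt(2.0))
                 * (5.0 + math.sqrt(9.0 - 8.0 * kappa)) - eps / 2.0)
        exponent = math.sqrt(2.0 * mu) - eps
        prefactor = (4.0 * eps + 2.0 * math.sqrt(2.0 * mu)) / eps
    else:
        root = math.sqrt(kappa - 1.0)
        alpha = (2.0 * math.sqrt(kappa) - root) * math.sqrt(mu)
        exponent = 2.0 * (math.sqrt(kappa) - root) * math.sqrt(mu)
        prefactor = 4.0 * root * (3.0 * root + math.sqrt(kappa)) / (8.0 * kappa - 9.0)
    return alpha, exponent, prefactor
```

For instance, `theorem1_parameters(0.1, 1.0)` corresponds to \(\kappa =10\). At \(\kappa =\frac{9}{8}\) the damping and the exponent of the two regimes coincide in the limit \(\varepsilon \rightarrow 0\), whereas the constant in (14) grows without bound as \(\kappa \) decreases to \(\frac{9}{8}\).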
The next Theorem shows that for convex functions, one can obtain slightly better results.
Theorem 2
Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) be a convex differentiable function with L-Lipschitz gradient. Assume that \(\text {arg}\,\text {min} F\ne \emptyset \) and F also satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu >0\) and denote \(F_{*}=\min F\) and \(\kappa =\frac{L}{\mu }\). If \(\big (x(t)\big )_{t\ge 0}\) is the solution-trajectory of the dynamical system (1), then for all \(\varepsilon >0\), the following bounds hold true:
-
If \(\kappa =1\), \(\varepsilon >0\) and \(\alpha =2\sqrt{\mu }-\varepsilon \), then
$$\begin{aligned} \left\| \nabla F(x(t))\right\| ^{2} \le 2\big (F(x(0))-F_{*}\big )\bigg (1+\frac{\sqrt{\mu }}{\varepsilon }\bigg )e^{-2\bigl (\sqrt{\mu }-\varepsilon \bigr )t} \end{aligned}$$(15)
and
$$\begin{aligned} F(x(t))-F_{*} \le \frac{\big (F(x(0))-F_{*}\big )}{\mu }\bigg (1+\frac{\sqrt{\mu }}{\varepsilon }\bigg )e^{-2\bigl (\sqrt{\mu }-\varepsilon \bigr )t} \end{aligned}$$(16)
-
If \(\kappa >1\), and \(\alpha =(2\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\), then
$$\begin{aligned} \left\| \nabla F(x(t))\right\| ^{2} \le 2\big (F(x(0))-F_{*}\big )\bigg (1+\sqrt{\frac{\kappa }{\kappa -1}}\bigg )e^{-2\bigl (\sqrt{\kappa }-\sqrt{\kappa -1}\bigr )\sqrt{\mu }t} \end{aligned}$$(17)
and
$$\begin{aligned} F(x(t))-F_{*} \le \frac{\big (F(x(0))-F_{*}\big )}{\mu }\bigg (1+\sqrt{\frac{\kappa }{\kappa -1}}\bigg )e^{-2\bigl (\sqrt{\kappa }-\sqrt{\kappa -1}\bigr )\sqrt{\mu }t} \end{aligned}$$(18)
Before presenting the convergence analysis and the corresponding proofs, let us make some comments related to Theorems 1 and 2.
Remark 3
(Localization on the sub-level sets) While Theorem 1 refers to functions satisfying the Polyak–Łojasiewicz condition globally (see Definition 1), the results of the aforementioned theorem still hold true for any function F that satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) on the sublevel set \(\varOmega =\{ x\in {\mathcal {H}}~ : ~ F(x) \le F(x(0)) \}\), which is invariant with respect to the system (1) (i.e. \(x(t)\in \varOmega \), \(\forall t\ge 0\)).
Indeed, by considering the total energy of system (1), \(U(t)=F(x(t))+\frac{1}{2}\left\| {\dot{x}}(t)\right\| ^2\), we find:
$$\begin{aligned} {\dot{U}}(t)=\langle \nabla F(x(t))+\ddot{x}(t),{\dot{x}}(t)\rangle =-\alpha \left\| {\dot{x}}(t)\right\| ^{2}\le 0, \end{aligned}$$(19)
which shows that U(t) is non-increasing. Since \(F(x(t))\le U(t)\le U(0)\), and \(U(0)=F(x(0))\) when the trajectory starts with zero velocity, we deduce that \(F(x(t))\le F(x(0))\) for all \(t\ge 0\). Therefore the trajectory \(\{x(t)\}_{t\ge 0}\) generated by (1) remains in the sublevel set \(\varOmega \), where F satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)), and the decay rates found in Theorem 1 remain valid.
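The invariance argument of Remark 3 can be observed numerically. The Python sketch below makes several assumptions that are not part of the paper: it uses the one-dimensional non-convex test function \(x^{2}+3\sin ^{2}(x)\) (a standard example in the Polyak–Łojasiewicz literature, not the function of Remark 2), a zero initial velocity, a classical fourth-order Runge–Kutta scheme, and arbitrary values for the damping and the step size.

```python
import math


def F(x: float) -> float:
    # Non-convex test function x^2 + 3 sin^2(x); an assumption, not taken from the paper.
    return x * x + 3.0 * math.sin(x) ** 2


def dF(x: float) -> float:
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def heavy_ball_rk4(x0: float, alpha: float, h: float, steps: int):
    """Integrate x'' + alpha x' + F'(x) = 0, x(0) = x0, x'(0) = 0 with classical RK4.

    Returns the lists of objective values F(x(t_k)) and of energies U(t_k).
    """
    def rhs(x, v):
        return v, -alpha * v - dF(x)

    x, v = x0, 0.0
    values, energies = [F(x)], [F(x) + 0.5 * v * v]
    for _ in range(steps):
        k1x, k1v = rhs(x, v)
        k2x, k2v = rhs(x + 0.5 * h * k1x, v + 0.5 * h * k1v)
        k3x, k3v = rhs(x + 0.5 * h * k2x, v + 0.5 * h * k2v)
        k4x, k4v = rhs(x + h * k3x, v + h * k3v)
        x += h * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        v += h * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        values.append(F(x))
        energies.append(F(x) + 0.5 * v * v)
    return values, energies


values, energies = heavy_ball_rk4(x0=3.0, alpha=0.5, h=1e-3, steps=20000)
```

Up to discretization error, the list `energies` is non-increasing and no entry of `values` exceeds `values[0]`, even though the objective values themselves need not decrease monotonically along the trajectory.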
Remark 4
In both Theorems 1 and 2, one can observe that the associated worst-case linear convergence rates are affected by the magnitude of the condition number \(\kappa =\frac{L}{\mu }\ge 1\) (equivalently the magnitude of the difference \(L-\mu \ge 0\)). In particular, while in the case \(\kappa =1\) (i.e. \(\mu =L\)) the rate in (16) is worst-case optimal (see Remark 5 below), in the case \(\kappa >1\), the exponents in (14) and (18) can become small if \(\kappa =\frac{L}{\mu }\gg 1\).
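The dependence of the exponent in (14) and (18) on \(\kappa \) can be made explicit by an elementary rationalization. The following computation is a short worked sketch and not a statement taken from the paper:
$$\begin{aligned} 2\left( \sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu }=\frac{2\sqrt{\mu }}{\sqrt{\kappa }+\sqrt{\kappa -1}},\qquad \text {hence}\qquad \frac{\mu }{\sqrt{L}}=\sqrt{\frac{\mu }{\kappa }}\le 2\left( \sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu }\le 2\sqrt{\frac{\mu }{\kappa }}=\frac{2\mu }{\sqrt{L}}, \end{aligned}$$
where the two bounds follow from \(\sqrt{\kappa }\le \sqrt{\kappa }+\sqrt{\kappa -1}\le 2\sqrt{\kappa }\). For a fixed \(\mu \), the exponent therefore decays like \(\kappa ^{-1/2}\) as the conditioning deteriorates.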
Remark 5
(Optimality of rates in the case \(\mu =L\) for convex functions) As remarked in [29], if \({\mathcal {H}}=\mathbb {R}\) and \(F(x)=\frac{\mu }{2}\left| x\right| ^2\) (quadratic function in one-dimensional case), the system (1) reduces to a linear ODE with constant coefficients whose solution x(t) has an explicit form. In that case it can be shown that for all \(\varepsilon >0\):
and \(\sup \{r(\alpha ) ~ : ~ \alpha >0 \}=2\sqrt{\mu }\). This shows that the estimate (16), in the case \(\mu =L\) of Theorem 2, recovers the worst-case optimal decay rate in the class of convex functions satisfying condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)).
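For the quadratic of Remark 5 the decay exponent can be read off the characteristic polynomial \(r^{2}+\alpha r+\mu =0\) of the linear ODE. The Python sketch below is an illustration under that reading; the function name `objective_decay_exponent` and the grid of damping values are hypothetical and it does not reproduce the exact expression of the remark.

```python
import cmath
import math


def objective_decay_exponent(alpha: float, mu: float) -> float:
    """Exponential decay exponent of F(x(t)) = (mu/2) x(t)^2 along x'' + alpha x' + mu x = 0.

    x(t) is a combination of exp(r t) for the roots r of r^2 + alpha r + mu = 0,
    so x decays like exp(max Re(r) * t) and F, being quadratic in x, twice as fast
    (up to a polynomial factor when the two roots coincide).
    """
    disc = cmath.sqrt(alpha * alpha - 4.0 * mu)
    slowest = max(((-alpha + disc) / 2.0).real, ((-alpha - disc) / 2.0).real)
    return -2.0 * slowest


mu = 0.25
alphas = [0.05 * k for k in range(1, 81)]  # damping values in (0, 4]
best_alpha = max(alphas, key=lambda a: objective_decay_exponent(a, mu))
```

On this grid the largest exponent is attained at the critical damping \(\alpha =2\sqrt{\mu }\), where it equals \(2\sqrt{\mu }\). At that value the two roots coincide and the solution carries a polynomial factor in t, which is consistent with the statement being formulated for all \(\varepsilon >0\).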
Remark 6
(Comparison with Gradient flow and the works [16, 17]) The first-order in time Gradient flow system, i.e. \({\dot{x}}(t) +\nabla F(x(t))=0\), provides linear convergence rates for the objective function values with a corresponding convergence factor equal to \(2\mu \) (see e.g. [37,38,39]) both in convex and non-convex setting. By comparing the convergence factor \(2(\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\) found in (14) and (18) of Theorems 1 and 2 (respectively), we have:
describing the regimes for parameters \(\mu \) and L, for which the worst-case decay rate of each method is faster than the other.
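The comparison between the two convergence factors can be reduced to a condition on L and \(\mu \) alone. The computation below is a sketch derived only from the two factors quoted in this remark, using \(\kappa =\frac{L}{\mu }\):
$$\begin{aligned} 2\left( \sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu }\ge 2\mu \iff \frac{1}{\sqrt{\kappa }+\sqrt{\kappa -1}}\ge \sqrt{\mu }\iff \sqrt{L}+\sqrt{L-\mu }\le 1. \end{aligned}$$
For example, with \(L=1\) and \(\mu <1\) the left-hand side of the last inequality exceeds 1, so the worst-case factor of the Gradient flow is the larger one, while for \(L=0.2\) and \(\mu =0.1\) the worst-case factor of the Heavy ball system is larger.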
It is also worth mentioning that the Lipschitz character of the gradient is not necessary to deduce the linear rates for the Gradient flow, in contrast to our setting. Nevertheless, Theorem 1 is the first result that states explicit rates in the non-convex setting for the Heavy ball dynamical system.
Concerning the convex setting, similar results are obtained in [17], where the authors study the system (1) for convex differentiable functions, with a unique minimizer, satisfying the Quadratic growth condition (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) with \(\theta >0\) (recall here that in the convex setting the Quadratic growth condition (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) is equivalent to the Polyak–Łojasiewicz condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with the same constant (\(\theta =\mu \)), see (5) in Proposition 1 of Section 2). In particular, for the aforementioned class of functions, they provide linear decay rates for the objective error \(F(x(t))-\min F\), with a worst-case optimal convergence factor of \((2-\sqrt{2})\sqrt{\mu }\), achieved for \(\alpha =(2-\frac{\sqrt{2}}{2})\sqrt{\mu }\) (see Theorem and Corollary 1 in [17]). Comparing these rates with those found in Theorem 2, one has the following:
From equation (22) one can see that if \(\kappa \le \kappa _{*}\) the rate in Theorem 2 is faster, while in the case \(\kappa >\kappa _{*}\), the rate obtained in [17] is better and is proven by exploiting a different Lyapunov function, relying though on the uniqueness of the minimizer of the function.
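The value of \(\kappa \) at which the two convergence factors discussed above coincide can be located numerically. The Python sketch below assumes that the comparison is between \(2(\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\) and \((2-\sqrt{2})\sqrt{\mu }\), both divided by \(\sqrt{\mu }\); the names `rate_theorem2`, `RATE_REF17` and `crossover_kappa` are hypothetical, and the bisection bracket is an arbitrary choice.

```python
import math


def rate_theorem2(kappa: float) -> float:
    """Exponent of Theorem 2 divided by sqrt(mu): 2 (sqrt(kappa) - sqrt(kappa - 1))."""
    return 2.0 * (math.sqrt(kappa) - math.sqrt(kappa - 1.0))


RATE_REF17 = 2.0 - math.sqrt(2.0)  # factor quoted from [17], divided by sqrt(mu)


def crossover_kappa(lo: float = 1.0, hi: float = 100.0, tol: float = 1e-12) -> float:
    """Bisection for the kappa at which both factors coincide.

    rate_theorem2 is decreasing in kappa, from 2 at kappa = 1 towards 0.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rate_theorem2(mid) > RATE_REF17:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


kappa_cross = crossover_kappa()
```

Solving \(\sqrt{\kappa }-\sqrt{\kappa -1}=1-\frac{\sqrt{2}}{2}\) by hand gives \(\sqrt{\kappa }=\frac{3}{2}+\frac{\sqrt{2}}{4}\), that is \(\kappa \approx 3.44\), which agrees with the bisection. If (22) compares the same two factors, this would be the threshold \(\kappa _{*}\).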
In the same spirit, in [16] the authors prove similar linear convergence rates, in terms of objective function values, for convex and \(\beta \)-quasi-strongly convex functions admitting a unique minimizer. Theorem 1 in [16] states that the best rate for the objective value function is \(\exp {(-\sqrt{2\beta }t)}\). Since condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu >0\) for convex functions with L-Lipschitz gradient implies quasi-strong convexity with parameter \(\beta =\frac{\mu ^2}{L}\) (see (8) in Proposition 1), we have \(\sqrt{2\beta }=\sqrt{\frac{2\mu }{\kappa }}\). As done before, a straightforward comparison of the two factors \(2(\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\) and \(\sqrt{\frac{2\mu }{\kappa }}\), shows that:
Theorem 2 in [16] states improved rates when F has Lipschitz gradient. They are given via the formula \(\exp {(-\frac{2}{3}(1+\frac{2}{3}\frac{9\mu ^{2}-2\alpha ^2}{9L+3\mu -\frac{2}{3}\alpha ^2})t)}\), for \(\alpha \le 3\sqrt{\frac{\mu }{2}}\). However, the optimal rate with respect to \(\alpha \) in the last formula cannot be explicitly computed. A straightforward comparison with our results is therefore difficult to establish.
4 Non-smooth setting
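In this section we extend some of the previous results to the case of a convex, proper and lower semi-continuous function \(F: {\mathcal {H}}\rightarrow {\overline{\mathbb {R}}}=\mathbb {R}\cup \{+\infty \}\), with \(X_{*}=\text {arg}\,\text {min} F\ne \emptyset \). In this context, we make use of the Moreau envelope and the proximal operator of F defined (respectively) for any \(\lambda >0\), as follows (see e.g. [18, Definitions 12.20 and 12.23]):
$$\begin{aligned} F_{\lambda }(x)=\min _{y\in {\mathcal {H}}}\left\{ F(y)+\frac{1}{2\lambda }\left\| x-y\right\| ^{2}\right\} ,\qquad \text {prox}_{\lambda F}(x)=\mathop {\text {arg}\,\text {min}}\limits _{y\in {\mathcal {H}}}\left\{ F(y)+\frac{1}{2\lambda }\left\| x-y\right\| ^{2}\right\} . \end{aligned}$$(24)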
One of the major advantages of the Moreau envelope \(F_{\lambda }\) is that it is a convex, continuously differentiable function with \(\frac{1}{\lambda }\)-Lipschitz gradient (see e.g. [18, Proposition 12.29]). Thus considering the Heavy ball system (1) for \(F_{\lambda }\) is still possible even if F is not differentiable.
In Lemma 1 we state the properties of \(F_{\lambda }\) and \(\text {prox}_{\lambda F}\) that are used in this work. For further details and proofs about the Moreau envelope and the proximal operator we refer the interested reader to [40] and [18].
Lemma 1
Let \(F:{\mathcal {H}}\rightarrow {\overline{\mathbb {R}}}\) be a convex, proper and lower semi-continuous function with \(\text {arg}\,\text {min} F\ne \emptyset \). Let \(\lambda >0\) and let \(F_{\lambda }\) and \(\text {prox}_{\lambda F}\) be the Moreau envelope and the proximal operator of F (respectively). The following assertions hold true for all \(x\in {\mathcal {H}}\).
-
1.
\(\underset{x\in {\mathcal {H}}}{\min } F(x) = \underset{x\in {\mathcal {H}}}{\min }F_{\lambda }(x)\) and \(\mathop {{\text {arg}\,\text {min}}}\limits _{x\in {\mathcal {H}}} F(x) = \mathop {\text {arg}\,\text {min}}\limits _{x\in {\mathcal {H}}}F_{\lambda }(x)\).
-
2.
\(F_{\lambda }(x)=F(\text {prox}_{\lambda F}(x))+\frac{1}{2\lambda }\left\| x-\text {prox}_{\lambda F}(x)\right\| ^2 \le F(x)\).
-
3.
\(\frac{1}{\lambda }\left( x-\text {prox}_{\lambda F}(x)\right) \in \partial F(\text {prox}_{\lambda F}(x))\).
-
4.
\(F_{\lambda }\) is differentiable with \(\frac{1}{\lambda }\)-Lipschitz gradient and \(\nabla F_{\lambda }(x)=\frac{1}{\lambda }\left( x-\text {prox}_{\lambda F}(x)\right) \).
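The properties of Lemma 1 can be illustrated on the one-dimensional function \(F(x)=\left| x\right| \), whose proximal operator is the soft-thresholding map. The Python sketch below is only an illustration; the choice of F, the function names `prox_abs`, `moreau_abs` and `grad_moreau_abs`, and the finite-difference step are assumptions made here.

```python
def prox_abs(x: float, lam: float) -> float:
    """prox_{lam F}(x) for F = |.|: soft-thresholding at level lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_abs(x: float, lam: float) -> float:
    """F_lam(x) = F(p) + |x - p|^2 / (2 lam) with p = prox_{lam F}(x) (property 2)."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def grad_moreau_abs(x: float, lam: float) -> float:
    """Gradient formula of property 4: (x - prox_{lam F}(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


lam = 0.5
points = [-2.0, -0.3, 0.0, 0.2, 1.7]
# Central finite differences of F_lam, to compare with the closed-form gradient.
h = 1e-6
fd = [(moreau_abs(x + h, lam) - moreau_abs(x - h, lam)) / (2.0 * h) for x in points]
```

In this example \(F_{\lambda }\) is a Huber-type function: quadratic on \([-\lambda ,\lambda ]\) and affine outside. The values `fd` agree with `grad_moreau_abs` up to discretization error, \(F_{\lambda }\le F\) at every point, and the gradient values lie in \([-1,1]\), which contains every subdifferential of \(\left| \cdot \right| \).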
Notice that in this setting the (\({{\mathcal {P}}}{{\mathcal {L}}}\)) condition on F can be generalized as follows (see e.g. [23, Paragraph 2.3]):
$$\begin{aligned} \text {dist}\big (0,\partial F(x)\big )^{2}\ge 2\mu \left( F(x)-\min F\right) ,\qquad \forall x\in {\mathcal {H}}, \end{aligned}$$(ns-\({{\mathcal {P}}}{{\mathcal {L}}}\))
where \(\partial F\) is the subdifferential of F and dist\((x,C)=\inf \{\left\| x-y\right\| ~ : ~ y\in C\}\) denotes the distance of a point \(x\in {\mathcal {H}}\) from a set \(C\subset {\mathcal {H}}\). It is also worth mentioning that condition (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)) and the growth condition (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) are still equivalent in this setting (F convex), with the same relations between the two parameters as the ones described in (5) and (7) of Proposition 1 (see e.g. [23, Theorem 27] and [44, Theorem 1]).
Proposition 2
Let \(\lambda >0\), \(F:{\mathcal {H}}\rightarrow {\overline{\mathbb {R}}}\) be a convex, proper and lower semi-continuous function and \(F_{\lambda }\) its Moreau envelope. Then the following assertions hold true:
-
1.
If F satisfies (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\mu \), then \(F_{\lambda }\) satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\frac{\mu }{\lambda \mu +1}\).
-
2.
If \(F_{\lambda }\) satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\mu \), then F satisfies (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\frac{\mu }{4}\).
Proof
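1. By using properties 1. and 2. in Lemma 1, we have:
$$\begin{aligned} F_{\lambda }(x)-\min F_{\lambda }&=F(\text {prox}_{\lambda F}(x))-\min F+\frac{1}{2\lambda }\left\| x-\text {prox}_{\lambda F}(x)\right\| ^{2}\\ &\le \frac{1}{2\mu }\text {dist}\big (0,\partial F(\text {prox}_{\lambda F}(x))\big )^{2}+\frac{1}{2\lambda }\left\| x-\text {prox}_{\lambda F}(x)\right\| ^{2}, \end{aligned}$$(26)
where in the last inequality we used the fact that F satisfies condition (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)).
Since \(\frac{1}{\lambda }(x-\text {prox}_{\lambda F}(x))\in \partial F(\text {prox}_{\lambda F}(x))\) by property 3. in Lemma 1, and \(\nabla F_{\lambda }(x)=\frac{1}{\lambda }(x-\text {prox}_{\lambda F}(x))\) by property 4., from (26) we obtain:
$$\begin{aligned} F_{\lambda }(x)-\min F_{\lambda }\le \left( \frac{1}{2\mu }+\frac{\lambda }{2}\right) \left\| \nabla F_{\lambda }(x)\right\| ^{2}=\frac{\lambda \mu +1}{2\mu }\left\| \nabla F_{\lambda }(x)\right\| ^{2}, \end{aligned}$$(27)
which allows us to conclude that \(F_{\lambda }\) satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\frac{\mu }{\lambda \mu +1}\).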
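2. From (5) of Proposition 1, \(F_{\lambda }\) satisfies (\({{\mathcal {Q}}}{{\mathcal {G}}}\)) with parameter \(\mu \), thus, by using properties 1. and 2. in Lemma 1 we obtain:
$$\begin{aligned} \frac{\mu }{2}\, \text {dist}(x,X_{*})^{2}\le F_{\lambda }(x)-\min F_{\lambda }\le F(x)-\min F. \end{aligned}$$(28)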
From (28), it follows that F satisfies (\({{\mathcal {Q}}}{{\mathcal {G}}}\)), with parameter \(\mu \), therefore F satisfies (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\frac{\mu }{4}\) (see e.g. [44, Theorem 1]). \(\square \)
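The constant in the first assertion of Proposition 2 can be tested on a simple case. The following worked example is a sketch that is not taken from the paper; it uses the quadratic \(F(x)=\frac{\mu }{2}\left\| x\right\| ^{2}\), which satisfies \(\left\| \nabla F(x)\right\| ^{2}=2\mu \left( F(x)-\min F\right) \):
$$\begin{aligned} \text {prox}_{\lambda F}(x)=\frac{x}{1+\lambda \mu },\qquad F_{\lambda }(x)=\frac{\mu }{2(1+\lambda \mu )}\left\| x\right\| ^{2},\qquad \left\| \nabla F_{\lambda }(x)\right\| ^{2}=\frac{\mu ^{2}}{(1+\lambda \mu )^{2}}\left\| x\right\| ^{2}=\frac{2\mu }{\lambda \mu +1}\left( F_{\lambda }(x)-\min F_{\lambda }\right) . \end{aligned}$$
For this function the inequality for \(F_{\lambda }\) holds with equality, so the parameter \(\frac{\mu }{\lambda \mu +1}\) cannot be improved in general.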
Theorem 3
Let \(F:{\mathcal {H}}\rightarrow {\overline{\mathbb {R}}}\) be a convex, proper and lower semi-continuous function with \(\text {arg}\,\text {min} F\ne \emptyset \) satisfying (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu >0\). Let also \(\lambda >0\), let \(F_{\lambda }\) and \(\text {prox}_{\lambda F}\) be the Moreau envelope and the proximal operator of F (respectively), and let \(x_{\lambda }(t)\) be the trajectory generated by the Heavy ball system (1) with \(F_{\lambda }\), with \(\alpha =\frac{2\lambda \mu +1+\sqrt{\lambda \mu +1}}{\sqrt{\lambda }\left( \lambda \mu +1+\sqrt{\lambda \mu +1}\right) }\). Then the following bounds hold true:
If in addition F is M-Lipschitz, then:
where \(C(\lambda ,\mu )=\sqrt{\big (F_{\lambda }(x_{\lambda }(0))-F_{*}\big )\bigg (1+\sqrt{\lambda \mu +1}\bigg )}\)
Proof of Theorem 3
On the one hand, from property 2. in Lemma 1 we have \(F(\text {prox}_{\lambda F}(x_{\lambda }(t)))-F_{*}\le F_{\lambda }(x_{\lambda }(t))-\min F_{\lambda }\). On the other hand, from Proposition 2, it follows that \(F_{\lambda }\) satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\frac{\mu }{\lambda \mu +1}\) and since it is convex with \(\frac{1}{\lambda }\)-Lipschitz gradient, we can directly apply Theorem 2 and deduce (29) from (18).
If now we assume that F is M-Lipschitz, by the characterization of \(F_{\lambda }\) and \(\text {prox}_{\lambda F}\) (see properties 1, 2 and 4 in Lemma 1), we have:
where in the second inequality we used the M-Lipschitz character of F and in the last one that \(F_{\lambda }\) satisfies (ns-\({{\mathcal {P}}}{{\mathcal {L}}}\)) with parameter \(\frac{\mu }{\lambda \mu +1}\).
Thus from (31), it follows that
Inequality (30) follows from the bounds (17) and (18) of Theorem 2 for \(\left\| \nabla F_{\lambda }(x_{\lambda }(t))\right\| \) and \(F_{\lambda }(x_{\lambda }(t))-F_{*}\) respectively. \(\square \)
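The damping parameter of Theorem 3 can be cross-checked against Theorem 2 applied to \(F_{\lambda }\), using the constants \(\frac{\mu }{\lambda \mu +1}\) and \(\frac{1}{\lambda }\) from Proposition 2 and Lemma 1. The Python sketch below performs this check numerically; the function names and the sample values of \(\lambda \) and \(\mu \) are hypothetical.

```python
import math


def alpha_theorem2(mu: float, L: float) -> float:
    """Damping of Theorem 2 for kappa = L / mu > 1."""
    kappa = L / mu
    return (2.0 * math.sqrt(kappa) - math.sqrt(kappa - 1.0)) * math.sqrt(mu)


def alpha_theorem3(lam: float, mu: float) -> float:
    """Damping stated in Theorem 3."""
    s = math.sqrt(lam * mu + 1.0)
    return (2.0 * lam * mu + 1.0 + s) / (math.sqrt(lam) * (lam * mu + 1.0 + s))


def envelope_constants(lam: float, mu: float):
    """(PL constant, Lipschitz constant) of F_lam, from Proposition 2 and Lemma 1."""
    return mu / (lam * mu + 1.0), 1.0 / lam


pairs = [(0.1, 2.0), (1.0, 1.0), (5.0, 0.3)]
gaps = [abs(alpha_theorem3(lam, mu) - alpha_theorem2(*envelope_constants(lam, mu)))
        for lam, mu in pairs]
```

The two expressions agree because, with \(s=\sqrt{\lambda \mu +1}\), both reduce to \(\frac{2s-1}{\sqrt{\lambda }\,s}\); the condition number of \(F_{\lambda }\) is \(\frac{\lambda \mu +1}{\lambda \mu }>1\), so the second case of Theorem 2 applies.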
5 Numerical experiments
In this section we present two synthetic examples to illustrate the results discussed in Theorems 1 and 2, regarding the Heavy ball system (1). In particular we test the performance of the trajectory generated by (1), in terms of objective function values \(F(x(t))-F_{*}\), with different values for \(\alpha \) and the first-order in time Gradient flow system. All the experiments were performed in the Matlab programming language, by using the ODE solver “ode45” with absolute tolerance \(\approx 10^{-13}\).
Example 1 (Convex, \(\mu \)-(\({{\mathcal {P}}}{{\mathcal {L}}}\))) In the first example we consider a quadratic function of the form \(F(x)=\langle Ax,x\rangle \), where \(A\in \mathbb {R}^{d\times d}\), \(x\in \mathbb {R}^d\) and \(d=100\). A is a random symmetric matrix, with its smallest eigenvalue equal to 0 and some given smallest (and largest) positive eigenvalue \(\mu \) (and L respectively). With these choices, F is a convex but not strongly convex function with L-Lipschitz gradient and satisfies the (\({{\mathcal {P}}}{{\mathcal {L}}}\)) condition with parameter \(\mu \). We fix \(L=1\) and make 3 different choices for \(\mu \), corresponding to \(\kappa =\frac{L}{\mu }=10\), \(\kappa =100\) and \(\kappa =200\). In this example we compare the performance of the Gradient flow and the Heavy ball scheme (1) with 4 different choices for the damping parameter \(\alpha \). One is set equal to \(\alpha _{*}=(2\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\) as Theorem 2 indicates, two are slight variations of \(\alpha _{*}\), i.e. \(\alpha _{\pm }=\alpha _{*}\pm \varepsilon \), with \(\varepsilon =0.1\), and the last one is chosen equal to \(2\sqrt{\mu }\) (see e.g. [30, 37, 39]). The results are reported in Fig. 3. While the choice \(\alpha =2\sqrt{\mu }\) has better performance in the well-conditioned case (\(\kappa =10\)), it takes more iterations to reach a smaller error than the one for the choices \(\alpha _{*}\) (or \(\alpha _{*} \pm \varepsilon \)) in the ill-conditioned cases (\(\kappa =100\) or \(\kappa =200\)). In addition, we observe that the case \(\alpha =(2\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\) leads to less oscillatory behavior for \(F(x(t))-F_{*}\), hinting at a corresponding trajectory that overshoots the set of minimizers \(X_{*}\) less.
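A miniature version of this experiment can be sketched in Python (not the Matlab code used for Fig. 3). The sketch makes several assumptions that are not in the original setup: the matrix is diagonal with eigenvalues 0, \(\mu \) and L (so \(d=3\) instead of \(d=100\)), the normalization \(F(x)=\frac{1}{2}\langle Ax,x\rangle \) is used so that \(\nabla F(x)=Ax\) is exactly L-Lipschitz, the starting point has all coordinates equal to 1 with zero initial velocity, and the ODE solver is replaced by the closed-form solution of each scalar mode. All function names are illustrative.

```python
import cmath
import math

def mode(t, lam, alpha, x0=1.0):
    """Solution of x'' + alpha*x' + lam*x = 0 with x(0) = x0, x'(0) = 0."""
    disc = alpha * alpha - 4.0 * lam
    if abs(disc) < 1e-12:                      # critically damped mode
        r = -alpha / 2.0
        return x0 * (1.0 - r * t) * math.exp(r * t)
    root = cmath.sqrt(disc)                    # complex when the mode is under-damped
    rp, rm = (-alpha + root) / 2.0, (-alpha - root) / 2.0
    x = x0 * (rp * cmath.exp(rm * t) - rm * cmath.exp(rp * t)) / (rp - rm)
    return x.real

def heavy_ball_gap(t, eigs, alpha):
    """F(x(t)) - F_* for F(x) = 0.5 * sum(lam_i * x_i^2), with x_i(0) = 1."""
    return 0.5 * sum(lam * mode(t, lam, alpha) ** 2 for lam in eigs)

def gradient_flow_gap(t, eigs):
    """Same quantity for x' = -grad F(x), where x_i(t) = exp(-lam_i * t)."""
    return 0.5 * sum(lam * math.exp(-2.0 * lam * t) for lam in eigs)

L, mu = 1.0, 0.01                              # kappa = 100
eigs = [0.0, mu, L]
# alpha_* = (2 sqrt(kappa) - sqrt(kappa - 1)) sqrt(mu) = 2 sqrt(L) - sqrt(L - mu)
alpha_star = 2.0 * math.sqrt(L) - math.sqrt(L - mu)
for t in (0.0, 50.0, 100.0, 200.0):
    print(t, gradient_flow_gap(t, eigs),
          heavy_ball_gap(t, eigs, alpha_star),
          heavy_ball_gap(t, eigs, 2.0 * math.sqrt(mu)))
```

The printed columns are the objective gaps for the Gradient flow, for Heavy ball with \(\alpha _{*}\) and for Heavy ball with \(\alpha =2\sqrt{\mu }\). The sketch makes no claim about reproducing Fig. 3, since the spectrum, the dimension and the initial point differ from the original experiment.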
Example 2 (non-convex, \(\mu \)-(\({{\mathcal {P}}}{{\mathcal {L}}}\))) In the second example we consider the function \(F:\mathbb {R}^{2}\rightarrow \mathbb {R}_{+}\), such that \(F(x,y)=\frac{\left| y-\sin (x)\right| ^2}{8}\). As discussed in the introduction (see Fig. 1), the function F is not convex, but satisfies the (\({{\mathcal {P}}}{{\mathcal {L}}}\)) condition with \(\mu =\frac{1}{4}\). Its set of minimizers is the whole curve \(X_{*}=\{(x,y)\in \mathbb {R}^{2} ~: ~ y=\sin (x)\}\). It is also worth mentioning that while \(\nabla F\) is not globally Lipschitz, it is L-Lipschitz on sub-level sets. Therefore the results stated in Theorem 1 are applicable (see also Remark 3). In this case, in Fig. 4, we can observe that the performance of Heavy ball in terms of the objective function values is similar both for \(\alpha =2\sqrt{\mu }\) and \(\alpha =(2\sqrt{\kappa }-\sqrt{\kappa -1})\sqrt{\mu }\) and better than the one of Gradient flow.
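The two structural claims about this function, namely that it satisfies (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu =\frac{1}{4}\) and that it is not convex, can be checked numerically. The following sketch samples random points in a box (the box \([-10,10]^2\) and the sample size are arbitrary choices made here) and evaluates the gap \(\left\| \nabla F\right\| ^{2}-2\mu (F-F_{*})\) with \(F_{*}=0\). A direct computation gives this gap as \(\frac{(y-\sin x)^{2}\cos ^{2}x}{16}\ge 0\), so the sampled values should be non-negative up to rounding.

```python
import math
import random

def F(x, y):
    return (y - math.sin(x)) ** 2 / 8.0

def grad_F(x, y):
    u = y - math.sin(x)
    return (-u * math.cos(x) / 4.0, u / 4.0)

def pl_gap(x, y, mu=0.25):
    """||grad F||^2 - 2*mu*(F - F_*), with F_* = 0; non-negative when (PL) holds."""
    gx, gy = grad_F(x, y)
    return gx * gx + gy * gy - 2.0 * mu * F(x, y)

random.seed(0)
points = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(10000)]
worst = min(pl_gap(x, y) for x, y in points)
print("smallest PL gap on the sample:", worst)

# Non-convexity: along y = 0 the value at the midpoint x = pi/2 exceeds
# the average of the values at the endpoints x = 0 and x = pi.
print(F(math.pi / 2, 0.0), 0.5 * (F(0.0, 0.0) + F(math.pi, 0.0)))
```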
6 Convergence analysis
Throughout this section, we consider a function \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) satisfying assumptions \({\mathcal {A}}.1\) and \({\mathcal {A}}.2\) and the Polyak–Łojasiewicz condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) and we denote \(F_{*}=\min F\).
As mentioned before, the analysis that we follow in this Section in order to prove the main results in Theorems 1 and 2, relies on Lyapunov techniques. The Lyapunov function \(V\) that we will introduce shortly was also used in previous works to study linear convergence of the trajectories of (1) or other similar systems, see for example [19, 24, 37].
Let \(x(t)\) be a solution of (1) and, for the sake of simplicity, we use the following notation for the objective error-function: \(W(t)=F(x(t))-F_{*}\).
For some parameters \(a,\delta \in \mathbb {R}\) we define the following Lyapunov function for any \(t\ge 0\):
which will play a central role in our analysis.
For systems like (1), it is possible to choose other standard energy-functionals (see for example [5, 9, 16, 42]) which can be used to deduce interesting convergence properties for x(t), which however usually require uniqueness of the minimizer. Regarding this issue, it is important to stress that the choice of V does not depend explicitly on any minimizer of the objective function \(F\). On the other hand, from the forthcoming analysis it transpires that the Lipschitz character of the gradient cannot be avoided with this choice of energy-function.
In order to improve readability of the forthcoming analysis, we briefly sketch the salient steps, which are divided into three paragraphs. In paragraph 6.1 we provide some basic estimates for the energy \(V\), as well as for W. First, in Lemma 2 we show that under simple conditions on the parameters \(a\) and \(\delta \) and the damping parameter \(\alpha \), the Lyapunov function V decreases linearly thanks to the hypothesis (\({{\mathcal {P}}}{{\mathcal {L}}}\)). In Theorem 4 in paragraph 6.2 (see also Lemma 4) we proceed by showing that the linear convergence of \(V\) implies the linear convergence of the objective function \(W\).
In the last part of paragraph 6.2 we give the full proof of Theorem 1, by finding the optimal choices for the parameters a, \(\delta \) and the damping coefficient \(\alpha \), in order to have the fastest decay rate according to the Lyapunov analysis.
Finally, in paragraph 6.3, a similar sequence of proofs is carried out under the additional hypothesis of convexity of the objective function \(F\).
6.1 Lyapunov estimates
We start by proving that the Lyapunov function V converges linearly to zero as \(t\rightarrow +\infty \).
Lemma 2
Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) satisfy \({\mathcal {A}}.1\), \({\mathcal {A}}.2\) and condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)). Let also \(V\) be the function defined in (34) and let:
Then, for any \(a,\delta \in \mathbb {R}_{+}\) satisfying
we have the following upper bound
Proof
By a direct computation of the derivative of \(V(t)\) we get
Since \(\nabla F\) is L-Lipschitz continuous and \({\dot{x}}(\cdot )\) is continuous, it follows that \(\nabla F(x(t))\) is absolutely continuous (see Remark 1 in [24]) and for almost every \(t>0\), it holds:
Thus, from (37), we obtain:
Next, by using the basic equation (1) for the solution x(t) and expressing \(\ddot{x}(t)=-\alpha {\dot{x}}(t)-\nabla F(x(t))\) in (39), we find:
Since \({\dot{W}}(t)=\langle \nabla F(x(t)), {\dot{x}}(t)\rangle \), we have:
and therefore, letting \(R=\alpha +\delta -a\) as in Equation (35), we obtain
By definition of V(t) (34), we have
hence by injecting (43) into (42), we find:
Since \(F\) satisfies condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) (i.e. \(2\mu W(t) \le \left\| \nabla F(x(t))\right\| ^{2} \)), we obtain
Conditions (H1) and (H2) yield:
By direct integration we get
and since V is continuous, (47) holds true for all \(t\ge 0\). Finally, the initial conditions \(x(0)=x_0\) and \({\dot{x}}(0)=0\) in (1) imply \(V(0)=a(F(x_0)-F_{*})\). \(\square \)
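The chain of estimates in the proof above can be condensed as follows. This is a sketch under two assumptions that are inferred from the surrounding text rather than quoted from it: that the Lyapunov function (34) has the form \(V(t)=aW(t)+\langle \nabla F(x(t)),{\dot{x}}(t)\rangle +\frac{\delta }{2}\left\| {\dot{x}}(t)\right\| ^{2}\) with \(W(t)=F(x(t))-F_{*}\), and that conditions (H1) and (H2) read \(aR\le 2\mu \) and \(L+\frac{R\delta }{2}\le \delta \alpha \) respectively. Both are consistent with \(R=\alpha +\delta -a\), with \(V(0)=a(F(x_0)-F_{*})\) when \({\dot{x}}(0)=0\), and with the equality case used in Lemma 3.

$$\begin{aligned} {\dot{V}}(t)&= a\langle \nabla F(x(t)),{\dot{x}}(t)\rangle +\Big \langle \frac{d}{dt}\nabla F(x(t)),{\dot{x}}(t)\Big \rangle +\langle \nabla F(x(t)),\ddot{x}(t)\rangle +\delta \langle {\dot{x}}(t),\ddot{x}(t)\rangle \\&\le -(\alpha +\delta -a){\dot{W}}(t)-\left\| \nabla F(x(t))\right\| ^{2}+(L-\delta \alpha )\left\| {\dot{x}}(t)\right\| ^{2} \\&= -RV(t)+aRW(t)-\left\| \nabla F(x(t))\right\| ^{2}+\Big (L+\frac{R\delta }{2}-\delta \alpha \Big )\left\| {\dot{x}}(t)\right\| ^{2} \\&\le -RV(t)+(aR-2\mu )W(t)+\Big (L+\frac{R\delta }{2}-\delta \alpha \Big )\left\| {\dot{x}}(t)\right\| ^{2} \le -RV(t) \end{aligned}$$

The first inequality uses the L-Lipschitz continuity of \(\nabla F\) and the substitution \(\ddot{x}(t)=-\alpha {\dot{x}}(t)-\nabla F(x(t))\), the equality uses \({\dot{W}}(t)=V(t)-aW(t)-\frac{\delta }{2}\left\| {\dot{x}}(t)\right\| ^{2}\), and the last line uses (\({{\mathcal {P}}}{{\mathcal {L}}}\)) followed by the two assumed sign conditions. Integrating \({\dot{V}}(t)\le -RV(t)\) gives \(V(t)\le V(0)e^{-Rt}\).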
The next Lemma provides some sufficient conditions that ensure the validity of assumptions of Lemma 2 (i.e. conditions (HR), (H1) and (H2)).
Lemma 3
Let \(\delta >0\) and consider the following conditions:
where
and
Then the conditions (HR), (H1) and (H2) of Lemma 2 are satisfied, with
Proof
Let \(\delta >0\) and \(a=\delta +\frac{2L}{\delta }-\alpha \) in the definition of V (34).
Since \(\alpha <\delta +\frac{2L}{\delta }\) we have \(a>0\). By definition of R in (35) and \(a=\delta +\frac{2L}{\delta }-\alpha \), we have \(R=\alpha +\delta -a=2\left( \alpha -\frac{L}{\delta }\right) \) and since \(\alpha >\frac{L}{\delta }\), it follows that \(R>0\) and condition (HR) is satisfied. In addition since \(R=2\left( \alpha -\frac{L}{\delta }\right) \), condition (H2) holds true as equality.
Finally since \(a=\delta +\frac{2L}{\delta }-\alpha \) and \(R=2\left( \alpha -\frac{L}{\delta }\right) \) condition (H1) is equivalent to:
The associated second order equation admits two real valued roots as defined in Eq. (49) for every \(\delta >0\), therefore the inequality (52) holds true if and only if \(\alpha \in [0,\alpha _-]\cup [\alpha _+,+\infty )\). Since the feasible set for \(\alpha \) defined in Eq. (48) is contained in \([0,\alpha _-]\cup [\alpha _+,+\infty )\), it follows that condition (H1) is satisfied, which concludes the proof of Lemma 3. \(\square \)
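The role of the two roots \(\alpha _{\pm }\) can be illustrated numerically. The sketch below assumes, as inferred from the proof above rather than quoted from the displays, that (HR) is \(R>0\), that (H1) is \(aR\le 2\mu \) and that (H2) is \(\delta (\alpha -\frac{R}{2})\ge L\). The helper names and the sample values \(L=1\), \(\mu =0.25\), \(\delta =1.3\) are chosen here for illustration only.

```python
import math

def lemma3_parameters(alpha, delta, L, mu):
    """Return (a, R, alpha_minus, alpha_plus) for the choice a = delta + 2L/delta - alpha."""
    s = delta + L / delta
    root = math.sqrt(max(s * s - 4.0 * mu, 0.0))
    a = delta + 2.0 * L / delta - alpha
    R = alpha + delta - a                      # equals 2 * (alpha - L/delta)
    base = delta + 3.0 * L / delta
    return a, R, 0.5 * (base - root), 0.5 * (base + root)

def conditions_hold(alpha, delta, L, mu, tol=1e-12):
    """Check a > 0, R > 0, a*R <= 2*mu and delta*(alpha - R/2) >= L (assumed forms)."""
    a, R, _, _ = lemma3_parameters(alpha, delta, L, mu)
    return (a > 0 and R > 0
            and a * R <= 2.0 * mu + tol
            and delta * (alpha - R / 2.0) >= L - tol)

L, mu, delta = 1.0, 0.25, 1.3
_, _, a_minus, a_plus = lemma3_parameters(1.0, delta, L, mu)
for alpha in (a_minus, 0.5 * (a_minus + a_plus), a_plus):
    print(alpha, conditions_hold(alpha, delta, L, mu))
```

With these values the conditions hold at \(\alpha =\alpha _{-}\) and \(\alpha =\alpha _{+}\), where \(aR=2\mu \), and fail at the midpoint of \((\alpha _{-},\alpha _{+})\).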
The next Lemma shows how we can take advantage of the linear rates for V, stated in Lemma 2, to deduce linear rates for the objective function values \(F(x(t))-F_{*}\).
Lemma 4
Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) satisfy \({\mathcal {A}}.1\), \({\mathcal {A}}.2\) and condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)), and let \(V\) be as defined in Eq. (34). Assume that conditions (HR), (H1) and (H2) of Lemma 2 hold true. Then, for all \(t>0\), we have:
where \(m =\min \{a,R\}\) and
Proof
By Lemma 2 and Eq. (34) we immediately get
Neglecting the non-negative term \(\frac{\delta }{2}\left\| {\dot{x}}(t)\right\| ^{2}\), we obtain:
Hence, by applying Lemma 5 (with \(s=0\)), we have:
By computing the integral in the last inequality (57) and neglecting the non-positive terms in the case \(a\ne R\), we find:
with \(m=\min \{a,R\}\). On the other hand, if \(a=R\), by computing the integral in (57), we have:
which concludes the proof of Lemma 4. \(\square \)
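As a numerical sanity check of the passage from V to the objective values, one can integrate system (1) for the non-convex function of Example 2 and monitor both quantities. The sketch below is written under stated assumptions and is not the original Matlab experiment: the Lyapunov function is taken as \(V=aW+\langle \nabla F(x),{\dot{x}}\rangle +\frac{\delta }{2}\left\| {\dot{x}}\right\| ^{2}\), the solver is a hand-written classical Runge–Kutta scheme, the starting point is \((0,1)\) with zero velocity, and \(L=\frac{3}{4}\) is used as a bound for the Hessian on the sub-level set \(\{F\le F(0,1)\}\). The objective bound that is checked, \(W(t)\le \frac{a}{a-R}W(0)e^{-Rt}\), is the one obtained here by integrating \(aW+{\dot{W}}\le V(0)e^{-Rt}\) when \(a>R\); its constant is not quoted from the statement of Lemma 4.

```python
import math

MU, L = 0.25, 0.75      # PL constant; assumed Lipschitz bound on {F <= F(0, 1)}
DELTA = math.sqrt(L)
ALPHA = 2.0 * math.sqrt(L) - math.sqrt(L - MU)
A = DELTA + 2.0 * L / DELTA - ALPHA
R = 2.0 * (ALPHA - L / DELTA)

def F(p):
    return (p[1] - math.sin(p[0])) ** 2 / 8.0

def grad_F(p):
    u = p[1] - math.sin(p[0])
    return [-u * math.cos(p[0]) / 4.0, u / 4.0]

def rhs(state):
    """First-order form of x'' + ALPHA*x' + grad F(x) = 0; state = [x, y, vx, vy]."""
    g = grad_F(state[:2])
    return [state[2], state[3], -ALPHA * state[2] - g[0], -ALPHA * state[3] - g[1]]

def rk4_step(state, h):
    def shifted(k, c):
        return [s + c * ki for s, ki in zip(state, k)]
    k1 = rhs(state)
    k2 = rhs(shifted(k1, h / 2))
    k3 = rhs(shifted(k2, h / 2))
    k4 = rhs(shifted(k3, h))
    return [s + h / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def lyapunov(state):
    """Assumed form V = A*W + <grad F, x'> + DELTA/2 * |x'|^2, with W = F and F_* = 0."""
    g, v = grad_F(state[:2]), state[2:]
    return (A * F(state[:2]) + g[0] * v[0] + g[1] * v[1]
            + DELTA / 2 * (v[0] ** 2 + v[1] ** 2))

def run(T=20.0, h=0.01):
    state, t, rows = [0.0, 1.0, 0.0, 0.0], 0.0, []
    for _ in range(int(round(T / h))):
        state = rk4_step(state, h)
        t += h
        rows.append((t, F(state[:2]), lyapunov(state)))
    return rows

rows = run()
V0 = A * F([0.0, 1.0])
# Largest violation of V(t) <= V(0) * exp(-R t) along the computed trajectory.
print(max(V - V0 * math.exp(-R * t) for t, _, V in rows))
```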
6.2 The non-convex setting
In the following Theorem we give a more explicit expression of the convergence rate of the objective function \(F(x(t))-F_{*}\), depending on the damping parameter \(\alpha \) and the auxiliary variable \(\delta \).
Theorem 4
Let \(F:{\mathcal {H}}\rightarrow \mathbb {R}\) satisfy \({\mathcal {A}}.1\), \({\mathcal {A}}.2\) and condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu >0\) and denote \(\kappa =\frac{L}{\mu }\). Let \(\delta >0\) and \(\big (x(t)\big )_{t\ge 0}\) be the solution-trajectory associated to the dynamical system (1). Then for all \(t\ge 0\) the following bound holds
with \(m=\min \left\{ \delta +2\frac{L}{\delta }-\alpha ,2\big (\alpha -\frac{L}{\delta }\big )\right\} \), \(\alpha \in \left( \frac{L}{\delta },\alpha _{-}\right] \cup \left[ \alpha _{+},\delta +\frac{2L}{\delta }\right) \) where \(\alpha _{\pm }=\frac{1}{2}\bigg (\delta +\frac{3L}{\delta }\pm \sqrt{\bigg (\delta +\frac{L}{\delta }\bigg )^2-4\mu }\bigg )\).
In particular we have the following cases:
-
\(m=2\left( \alpha -\frac{L}{\delta }\right) \) in the following cases:
$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha \in \left( \frac{L}{\delta },\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \right) ~ \text { if } ~ \kappa \le \frac{9}{8}\text { and } \delta \in (\delta _-,\delta _+) \\ \\ \alpha \in \left( \frac{L}{\delta },\alpha _-\right] \\ \quad \text { if } ~ \biggl (\kappa \le \frac{9}{8} \text { and } \delta \in (0,\delta _-)\cup (\delta _+,+\infty )\biggr ) \text { OR } \biggl (\kappa \ge \frac{9}{8} \text { and } \delta >0\biggr ) \end{array}\right. } \end{aligned}$$(61) -
\(m=\delta +2\frac{L}{\delta }-\alpha \) in the following cases:
$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha \in \left( \frac{1}{3}\left( \frac{4L}{\delta }+\delta \right) , \alpha _-\right] \cup \bigl [\alpha _+,\delta +\frac{2L}{\delta }\bigr ) ~ \text { if } ~ \kappa \le \frac{9}{8}\text { and } \delta \in (\delta _-,\delta _+)\ \\ \\ \alpha \in [\alpha _+,\delta +\frac{2L}{\delta }) \\ \quad \text { if } ~ \biggl (\kappa \le \frac{9}{8} \text { and } \delta \in (0,\delta _-)\cup (\delta _+,+\infty )\biggr ) \text { or } \biggl (\kappa \ge \frac{9}{8} \text { and } \delta >0\biggr ) \end{array}\right. } \end{aligned}$$(62)
where \(\delta _\pm = \frac{1}{2\sqrt{2}}\left( 3\sqrt{\mu }\pm \sqrt{9\mu -8L}\right) \).
Proof
Let \(\delta >0\), \(\alpha \in \left( \frac{L}{\delta },\alpha _{-}\right] \cup \left[ \alpha _{+},\delta +\frac{2L}{\delta }\right) \) and set \(a=\delta +\frac{2L}{\delta }-\alpha \) in the definition of V (34), so that Lemmas 2 and 3 hold true with \(R=2\left( \alpha -\frac{L}{\delta }\right) \).
By using Lemma 4 it follows that for all \(t>0\), it holds:
with \(m =\min \left\{ a,R\right\} \), \(a=\delta +\frac{2L}{\delta }-\alpha \) and \(R=2\big (\alpha -\frac{L}{\delta }\big )\). In addition, if we exclude the case \(a=R\) (i.e. \(\alpha =\frac{\delta +\frac{4L}{\delta }}{3}\)), Lemma 4 implies:
By substituting \(a=\delta +2\frac{L}{\delta }-\alpha \) and \(R=2\left( \alpha -\frac{L}{\delta }\right) \), we obtain
To make the convergence rates of the objective function more explicit, we need to study the value of \(m\) as a function of \(\alpha \) and \(\delta \).
Recalling that \(\alpha \in \left( \frac{L}{\delta },\alpha _-\right] \cup \left[ \alpha _+,\delta +\frac{2L}{\delta }\right) \) where
we consider the following two cases:
-
\(m = 2\left( \alpha -\frac{L}{\delta }\right) \)
This case is equivalent to the condition
$$\begin{aligned} \delta +2\frac{L}{\delta }-\alpha > 2\left( \alpha -\frac{L}{\delta }\right) \end{aligned}$$which can be rewritten as
$$\begin{aligned} \alpha < \frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \end{aligned}$$First we note that we always have \(\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \le \alpha _+\). Indeed, by a direct computation
$$\begin{aligned} \frac{1}{3}\left( \delta +\frac{4L}{\delta }\right)&\le \frac{1}{2}\left( \frac{3L}{\delta }+\delta + \sqrt{\left( \frac{L}{\delta } + \delta \right) ^2-4\mu }\right) \\ -\frac{1}{3}\left( \frac{L}{\delta }+\delta \right)&\le \sqrt{\left( \frac{L}{\delta } + \delta \right) ^2-4\mu } \end{aligned}$$which is true for every \(L\), \(\mu \) and \(\delta \).
Therefore we have \(\alpha \in \left( \frac{L}{\delta },\min \{\alpha _-, \frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \}\right) \). We start ordering the values in the minimum by studying the case \(\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) < \alpha _-\) (note that we exclude the equality since we assumed that \(\alpha \) cannot take the value \(\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \)) which is equivalent to
$$\begin{aligned} \frac{1}{3}\left( \delta +\frac{4L}{\delta }\right)&< \frac{1}{2}\left( \frac{3L}{\delta }+\delta - \sqrt{\left( \frac{L}{\delta } + \delta \right) ^2-4\mu }\right) \\ \frac{1}{3}\left( \frac{L}{\delta }+\delta \right)&> \sqrt{\left( \frac{L}{\delta } + \delta \right) ^2-4\mu } \end{aligned}$$By taking the square we obtain
$$\begin{aligned} \left( \frac{L}{\delta } + \delta \right) ^2 < \frac{9}{2}\mu \end{aligned}$$(64)or equivalently
$$\begin{aligned} \delta ^2 - \frac{3}{\sqrt{2}}\sqrt{\mu }\delta + L < 0 \end{aligned}$$which is satisfied if \(\kappa =\frac{L}{\mu }\le \frac{9}{8}\) and \(\delta \in (\delta _-,\delta _+)\), where
$$\begin{aligned} \delta _\pm = \frac{1}{2\sqrt{2}}\left( 3\sqrt{\mu }\pm \sqrt{9\mu -8L}\right) \end{aligned}$$Analogously, we find that \(\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) > \alpha _-\) can be reduced to
$$\begin{aligned} \delta ^2 - \frac{3}{\sqrt{2}}\sqrt{\mu }\delta + L > 0 \end{aligned}$$which holds true if \(\kappa =\frac{L}{\mu }\le \frac{9}{8}\) and \(\delta \in (0,\delta _-)\cup (\delta _+,+\infty )\) or if \(\kappa \ge \frac{9}{8}\) and \(\delta >0\) .
-
\(m = \delta +2\frac{L}{\delta }-\alpha \)
This case can be studied analogously to the previous one by noticing that this value of \(m\) is achieved whenever
$$\begin{aligned} \delta +2\frac{L}{\delta }-\alpha < 2\left( \alpha -\frac{L}{\delta }\right) \end{aligned}$$which is equivalent to
$$\begin{aligned} \alpha > \frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \end{aligned}$$together with \(\alpha \in \left( \frac{L}{\delta },\alpha _-\right] \cup [\alpha _+,\delta +\frac{2L}{\delta })\). The statement follows from the relation between \(\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \) and \(\alpha _\pm \) as already studied in the previous point.
\(\square \)
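The ordering between the crossing value \(\frac{1}{3}\left( \delta +\frac{4L}{\delta }\right) \) and \(\alpha _{-}\), which drives the case distinction in the proof above, can be checked numerically. The following sketch uses illustrative helper names and the sample values \(L=1\), \(\mu =0.95\) (so that \(\kappa <\frac{9}{8}\) and \(\delta _{\pm }\) are real); these are choices made here and not values from the text.

```python
import math

def alpha_minus(delta, L, mu):
    s = delta + L / delta
    return 0.5 * (delta + 3.0 * L / delta - math.sqrt(max(s * s - 4.0 * mu, 0.0)))

def crossing(delta, L):
    """Value of alpha at which a = R, i.e. (delta + 4L/delta) / 3."""
    return (delta + 4.0 * L / delta) / 3.0

def delta_pm(L, mu):
    """Roots of delta^2 - (3/sqrt(2)) sqrt(mu) delta + L = 0; real only if kappa <= 9/8."""
    disc = 9.0 * mu - 8.0 * L
    if disc < 0:
        return None
    c = 1.0 / (2.0 * math.sqrt(2.0))
    return (c * (3.0 * math.sqrt(mu) - math.sqrt(disc)),
            c * (3.0 * math.sqrt(mu) + math.sqrt(disc)))

L, mu = 1.0, 0.95                              # kappa = 1 / 0.95 < 9/8
d_lo, d_hi = delta_pm(L, mu)
for d in (d_lo, 0.5 * (d_lo + d_hi), d_hi, 2.0 * d_hi):
    # Sign of the difference: zero at delta_pm, negative inside, positive outside.
    print(d, crossing(d, L) - alpha_minus(d, L, mu))
```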
We are now ready to give the proof of Theorem 1.
Proof of Theorem 1
To prove the statement of Theorem 1 we optimize (maximize) the rate \(m=\min \left\{ \delta +2\frac{L}{\delta }-\alpha ,2\big (\alpha -\frac{L}{\delta }\big )\right\} \) over the feasible values for \(\alpha \) and \(\delta \), defined in Theorem 4. In particular, by (61) and (62), we consider two cases:
-
\(m=2\left( \alpha -\frac{L}{\delta }\right) \)
In this case by maximizing m, over \(\alpha \) and \(\delta \), if \(\varepsilon >0\), we have:
$$\begin{aligned} \begin{aligned} m^*&= 2\sup _{\alpha , \delta }\left( \alpha -\frac{L}{\delta }\right) \\&= {\left\{ \begin{array}{ll} \sup _{\delta \in [\delta _-, \delta _+]}\frac{2}{3}\left( \frac{L}{\delta }+\delta \right) -\varepsilon &{} \text {if}\ \kappa \le \frac{9}{8} ~ \text { and } ~ \alpha =\frac{\delta +\frac{4L}{\delta }}{3}-\frac{\varepsilon }{2} \\ \sup _{\delta \in (0,\delta _-]\cup [\delta _+,+\infty )}2\left( \alpha _- - \frac{L}{\delta }\right) -\varepsilon &{} \text {if}\ \kappa \le \frac{9}{8} ~ \text { and } ~ \alpha =\alpha _{-}-\frac{\varepsilon }{2} \\ \sup _{\delta>0}2\left( \alpha _- - \frac{L}{\delta }\right) &{} \text {if}\ \kappa >\frac{9}{8} ~ \text { and } ~ \alpha =\alpha _{-} \end{array}\right. } \end{aligned} \end{aligned}$$(65)where \(\alpha _{-}=\frac{1}{2}\left( \delta +\frac{3L}{\delta }-\sqrt{\left( \delta +\frac{L}{\delta }\right) ^2-4\mu }\right) \).
For the first case, note that the supremum of \(\frac{L}{\delta }+\delta \) over \([\delta _{-},\delta _{+}]\) is attained at \(\delta =\delta _{\pm }\), where \(\frac{L}{\delta }+\delta =\frac{3}{\sqrt{2}}\sqrt{\mu }\), which gives \(m^*=\sqrt{2\mu }-\varepsilon \).
For the second case, since \(2\left( \alpha _- - \frac{L}{\delta }\right) -\varepsilon =\delta +\frac{L}{\delta }-\sqrt{\left( \delta +\frac{L}{\delta }\right) ^2-4\mu }-\varepsilon \), its maximum value for \(\delta \in (0,\delta _-]\cup [\delta _+,+\infty )\) is achieved for \(\delta =\delta _\pm \) and is equal to \(m^*=\sqrt{2\mu }-\varepsilon \).
Finally, in the third case the maximum for \(2\left( \alpha _- - \frac{L}{\delta }\right) \) for \(\delta >0\), is achieved for \(\delta =\sqrt{L}\) and equals to \(m^*=2(\sqrt{L}-\sqrt{L-\mu })\).
Concluding, without loss of generality, we have the following possible rates:
$$\begin{aligned} m^* = {\left\{ \begin{array}{ll} \sqrt{2\mu }-\varepsilon &{} \text { if } ~ \kappa \le \frac{9}{8} ~ \text { and } ~ \alpha =\frac{(5\pm \sqrt{9-8\kappa })}{2\sqrt{2}}\sqrt{\mu }-\frac{\varepsilon }{2} \\ 2\left( \sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu } &{} \text { if } ~ \kappa >\frac{9}{8} ~ \text { and } ~ \alpha =\left( 2\sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu } \\ \end{array}\right. } \end{aligned}$$(66) -
\(m = \delta +2\frac{L}{\delta }-\alpha \)
Analogously, in this case the best possible convergence rate is given by
$$\begin{aligned} \begin{aligned} m^*&= \sup _{\alpha , \delta }\left( \delta +\frac{2L}{\delta }-\alpha \right) \\&= {\left\{ \begin{array}{ll} \sup _{\delta \in (\delta _-, \delta _+)}\frac{2}{3}\left[ \frac{L}{\delta } + \delta \right] &{} \text {if }\ \kappa \le \frac{9}{8} ~ \text { and } ~ \alpha =\frac{\delta +\frac{4L}{\delta }}{3}-\varepsilon \\ \sup _{\delta \in (0,\delta _-]\cup [\delta _+,+\infty )}\left( \delta +\frac{2L}{\delta }-\alpha _+\right) &{} \text {if }\ \kappa \le \frac{9}{8} ~ \text { and } ~ \alpha =\alpha _{+} \\ \sup _{\delta \in \mathbb {R}}\left( \delta +\frac{2L}{\delta }-\alpha _+\right) &{} \text { if }\ \kappa >\frac{9}{8} ~ \text { and } ~ \alpha =\alpha _{+} \end{array}\right. } \end{aligned} \end{aligned}$$(67)where \(\alpha _{+}=\frac{1}{2}\left( \delta +\frac{3L}{\delta }+\sqrt{\left( \delta +\frac{L}{\delta }\right) ^2-4\mu }\right) \) and \(\varepsilon >0\).
In the first case of (67) we have the same expression we studied in the previous case, which gives \(m^*=\sqrt{2\mu }-\varepsilon \). For the second, we have \(\delta +\frac{2L}{\delta }-\alpha _+=\frac{1}{2}\left( \delta +\frac{L}{\delta }-\sqrt{\left( \delta +\frac{L}{\delta }\right) ^2-4\mu }\right) \) and, since \(\sqrt{L}\notin (0,\delta _-]\cup [\delta _+,+\infty )\), its maximum value over \((0,\delta _-]\cup [\delta _+,+\infty )\) is attained for \(\delta =\delta _\pm \) and is \(m^*=\sqrt{\frac{\mu }{2}}\), while in the third case, the maximal value is achieved for \(\delta =\sqrt{L}\) and is equal to \(m^*=\sqrt{L}-\sqrt{L-\mu }\). Overall, in this case, we have
$$\begin{aligned} m^* = {\left\{ \begin{array}{ll} \sqrt{2\mu }-\varepsilon &{} \text { if }\ \kappa \le \frac{9}{8} \\ \left( \sqrt{\kappa }-\sqrt{\kappa -1}\right) \sqrt{\mu } &{} \text { if }\ \kappa >\frac{9}{8} \\ \end{array}\right. } \end{aligned}$$(68)
Comparing (66) and (68), the best value for \(m^{*}\) is:
Finally by computing the associated constant of relation (60) in Theorem 4, for each case, if \(\varepsilon >0\), then it holds:
which concludes the proof of Theorem 1. \(\square \)
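The optimization carried out in this proof can be cross-checked by brute force. The sketch below evaluates \(m=\min \{a,R\}\) on a grid of feasible pairs \((\alpha ,\delta )\) from Theorem 4 and compares the best grid value with the closed form \(2\big (\sqrt{L}-\sqrt{L-\mu }\big )\) for a case with \(\kappa >\frac{9}{8}\). The grid ranges, its resolution and the sample values \(L=1\), \(\mu =0.25\) are arbitrary choices made for illustration.

```python
import math

def rate_m(alpha, delta, L, mu):
    """Return m = min(a, R) if alpha is feasible for Theorem 4, else None."""
    s = delta + L / delta
    root = math.sqrt(max(s * s - 4.0 * mu, 0.0))
    a_minus = 0.5 * (delta + 3.0 * L / delta - root)
    a_plus = 0.5 * (delta + 3.0 * L / delta + root)
    lo, hi = L / delta, delta + 2.0 * L / delta
    if not ((lo < alpha <= a_minus) or (a_plus <= alpha < hi)):
        return None
    return min(delta + 2.0 * L / delta - alpha, 2.0 * (alpha - L / delta))

def grid_search(L, mu, n=300):
    best_m, best_alpha, best_delta = -1.0, None, None
    for i in range(1, n + 1):
        delta = 3.0 * math.sqrt(L) * i / n         # delta in (0, 3 sqrt(L)]
        lo, hi = L / delta, delta + 2.0 * L / delta
        for j in range(1, n):
            alpha = lo + (hi - lo) * j / n         # alpha strictly inside (lo, hi)
            m = rate_m(alpha, delta, L, mu)
            if m is not None and m > best_m:
                best_m, best_alpha, best_delta = m, alpha, delta
    return best_m, best_alpha, best_delta

L, mu = 1.0, 0.25                                  # kappa = 4 > 9/8
m_grid, alpha_grid, delta_grid = grid_search(L, mu)
m_star = 2.0 * (math.sqrt(L) - math.sqrt(L - mu))
print(m_grid, m_star, alpha_grid, delta_grid)
```

The grid value stays below the closed form and approaches it as the grid is refined, since the grid only samples \(\alpha \) near, and not exactly at, \(\alpha _{-}\).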
6.3 Convex setting
In this section we prove Theorem 2 dealing with the convex setting. In particular we will get some slightly improved rates with respect to the ones found in the previous Theorem 4. This improvement is reflected in the following Theorem.
Theorem 5
Let F be a convex function with L-Lipschitz gradient, satisfying condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) with \(\mu >0\) and \(\big (x(t)\big )_{t\ge 0}\) the solution-trajectory associated to the dynamical system (1). Let \(\delta >0\) and \(\alpha _{-}=\frac{1}{2}\bigg (\delta +\frac{3L}{\delta }-\sqrt{\bigg (\delta +\frac{L}{\delta }\bigg )^2-4\mu }\bigg )\). We consider the following cases:
-
If \(\mu =L\) and \(\delta \le \sqrt{L}\), then (71) holds true for all \(\alpha \in \big (\frac{L}{\delta },\alpha _{-}\big )\)
-
If \(\mu <L\) and \(\delta >0\) or if \(\mu =L\) and \(\delta >\sqrt{L}\), then (71) holds true for all \(\alpha \in \big (\frac{L}{\delta },\alpha _{-}\big ]\)
$$\begin{aligned} \begin{aligned} \left\| \nabla F(x(t))\right\| ^{2}&\le C_{\delta ,\alpha }\big (F(x(0))-F_{*}\big )e^{-2\left( \alpha -\frac{L}{\delta }\right) t} \\ \text {and } \qquad F(x(t))-F_{*}&\le \frac{2C_{\delta ,\alpha }}{\mu }\big (F(x(0))-F_{*}\big )e^{-2\big (\alpha -\frac{L}{\delta }\big )t} \end{aligned} \end{aligned}$$(71) with \(C_{\delta ,\alpha }=2\bigg (1+\frac{L}{\delta \big (\delta +\frac{L}{\delta }-\alpha \big )}\bigg )\).
Proof
By using Cauchy–Schwarz inequality and then Young’s inequality, for the scalar product \(\langle \nabla F(x(t)),{\dot{x}}(t)\rangle \), for all \(\eta >0\) and \(t\ge 0\), we have:
By injecting the previous inequality (72) into the definition of the energy V (34), we find:
From convexity and the L-Lipschitz character of \(\nabla F\) (see Proposition 6 in Appendix) and (73), it follows that for all \(\eta >0\) and \(t\ge 0\), it holds:
Let us choose \(\eta =\delta \), so that (74) becomes:
Therefore, from (75), if \(a>\frac{L}{\delta }\) and the conditions of Lemma 2 are satisfied, then it is immediate that for all \(t\ge 0\), it holds:
with \(R=2\bigg (\alpha -\frac{L}{\delta }\bigg )\).
Next we show that the conditions of Lemma 2 are satisfied. In fact by Lemma 3, we recall that for all \(\delta >0\), if we set \(\alpha \in (\frac{L}{\delta },\alpha _{-}]\cup [\alpha _{+},\delta +\frac{2L}{\delta })\), with \(\alpha _{\pm }=\frac{1}{2}\bigg (\delta +\frac{3L}{\delta }\pm \sqrt{\bigg (\delta +\frac{L}{\delta }\bigg )^2-4\mu }\bigg )\) and \(a=\delta +\frac{2L}{\delta }-\alpha \), then Lemma 2 holds true with \(R=2\left( \alpha -\frac{L}{\delta }\right) \). By imposing additionally that \(a=\delta +\frac{2L}{\delta }-\alpha >\frac{L}{\delta }\), a straightforward computation shows that the damping parameter \(\alpha \) should also satisfy \(\alpha <\delta +\frac{L}{\delta }\). Thus the overall conditions for \(\alpha \), \(\delta \) and a, so that (76) holds true, are (here notice that \(\delta +\frac{L}{\delta }\le \alpha _{+}\)):
Note that \(\alpha _{-}\le \delta +\frac{L}{\delta }\) always holds and is saturated (holds as an equality) if and only if \(L=\mu \) and \(0<\delta \le \sqrt{L}\).
In the case \(\mu =L\), if \(0<\delta \le \sqrt{L}\), we have \(\delta +\frac{L}{\delta }=\alpha _{-}\), thus from (77), the rate is \(R=2\bigg (\alpha -\frac{L}{\delta }\bigg )\), for all \(\alpha \in \big (\frac{L}{\delta },\delta +\frac{L}{\delta }\big )\).
If instead \(\delta >\sqrt{L}\), then \(\alpha _{-}<\delta +\frac{L}{\delta }\), thus from (77), \(R=2\bigg (\alpha -\frac{L}{\delta }\bigg )\) for all \(\alpha \in \big (\frac{L}{\delta },\alpha _{-}\big ]\).
In the case \(\mu <L\), since \(\alpha _{-}<\delta +\frac{L}{\delta }\) for all \(\delta >0\), as before we have \(R=2\bigg (\alpha -\frac{L}{\delta }\bigg )>0\), for all \(\alpha \in \big (\frac{L}{\delta },\alpha _{-}\big ]\).
Finally, for both cases from (76), with \(a=\delta +\frac{2L}{\delta }-\alpha \) and \(R=2\bigg (\alpha -\frac{L}{\delta }\bigg )\), we have:
and by using condition (\({{\mathcal {P}}}{{\mathcal {L}}}\)) in (78), we obtain:
which allows to conclude the proof of Theorem 5. \(\square \)
Remark 7
In relation (71) of Theorem 5, the rate for \(\left\| \nabla F(x(t))\right\| ^2\) and \(F(x(t))-F_{*}\) is expressed as a function of both the damping term \(\alpha \) and the auxiliary parameter \(\delta \), which also determines the choice of \(\alpha \). The presence of \(\delta \), is due to the Lyapunov analysis that we follow, where \(\delta >0\) plays the role of a Lyapunov parameter (which is classically set free for tuning). Note that given any \(\alpha \), there exists a \(\delta \), such that the conditions of Theorem 5 are always satisfied.
Indeed, from (71) of Theorem 5, \(\alpha \) can be chosen as follows:
Since \(\delta \rightarrow \alpha (\delta )\) is continuous and strictly decreasing, one could solve (80) for \(\delta \), that is
Unfortunately the (positive) root of the third-order polynomial in (81) cannot be given via an algebraic formulation, which is why the auxiliary variable \(\delta >0\) is included in the statement of the Theorem.
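Although no simple formula is given for \(\delta \), it can be computed numerically for a given damping \(\alpha \). The sketch below assumes that relation (80) is \(\alpha =\alpha _{-}(\delta )\), the largest value allowed in Theorem 5 when \(\mu <L\), and relies on the monotonicity of \(\delta \mapsto \alpha (\delta )\) stated above to apply bisection. The function names, the bracketing interval and the sample values are hypothetical choices made here.

```python
import math

def alpha_minus(delta, L, mu):
    s = delta + L / delta
    return 0.5 * (delta + 3.0 * L / delta - math.sqrt(max(s * s - 4.0 * mu, 0.0)))

def delta_for_alpha(alpha, L, mu, lo=1e-6, hi=1e6, iters=200):
    """Bisection for alpha_minus(delta) = alpha, assuming alpha_minus is decreasing in delta."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)               # geometric midpoint: delta spans many scales
        if alpha_minus(mid, L, mu) > alpha:
            lo = mid                           # value too large, so the root lies at larger delta
        else:
            hi = mid
    return math.sqrt(lo * hi)

L, mu = 1.0, 0.25
alpha = 2.0 * math.sqrt(L) - math.sqrt(L - mu)
delta = delta_for_alpha(alpha, L, mu)
print(delta, 2.0 * (alpha - L / delta))        # recovered delta and the rate R = 2(alpha - L/delta)
```

For this particular \(\alpha \) the recovered value is \(\delta =\sqrt{L}\), in agreement with the choice made in the proof of Theorem 2.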
We are now ready to give the proof of Theorem 2.
Proof of Theorem 2
From Theorem 5 and in particular relation (71), we have \(R=2\bigg (\alpha -\frac{L}{\delta }\bigg )\), hence it is best to choose \(\alpha \) as large as possible (with respect to \(\delta \)).
Without loss of generality, let us consider the two complementary cases according to Theorem 5:
-
Let \(\mu =L\) and \(\alpha \in \bigl (\frac{L}{\delta },\alpha _{-}\bigr )\), for all \(\delta >0\). According to the first observation and taking the maximal value for \(\alpha \), for any \(\varepsilon \in (0,\alpha _{-}-\frac{L}{\delta })\), we can choose \(\alpha =\alpha _{-}-\varepsilon =\frac{1}{2}\bigg (\delta +\frac{3L}{\delta }-\sqrt{\bigg (\delta -\frac{L}{\delta }\bigg )^2}\bigg )-\varepsilon =\frac{1}{2}\bigg (\delta +\frac{3L}{\delta }-\bigl |\delta -\frac{L}{\delta }\bigr |\bigg )-\varepsilon \). It follows that the convergence factor R is equal to
$$\begin{aligned} R=R(\delta )=2\bigg (\alpha _{-}-\varepsilon -\frac{L}{\delta }\bigg ) = \delta +\frac{L}{\delta } -\biggl |\delta -\frac{L}{\delta }\biggr | -2\varepsilon \qquad \forall \delta >0 \end{aligned}$$(82) Therefore by studying the expression of R in (82) as a function of \(\delta >0\), one can deduce that the optimal value for \(\delta \) that maximizes R is \(\delta ^{*}=\sqrt{L}\) and gives \(R^{*}=R(\delta ^{*})=2\big (\sqrt{L}-\sqrt{L-\mu }\big )-2\varepsilon =2\sqrt{\mu }-2\varepsilon \) and \(\alpha =2\sqrt{L}-\sqrt{L-\mu }-\varepsilon =2\sqrt{\mu }-\varepsilon \), for any \(\varepsilon \in (0,\alpha _{-}-\frac{L}{\delta })\), which concludes the first point of Theorem 2.
-
If \(\mu <L\) and \(\delta >0\), then from the conditions of Theorem 5 the maximal value that we can choose for \(\alpha \), is \(\alpha =\alpha _{-}=\frac{1}{2}\bigg (\delta +\frac{3L}{\delta }-\sqrt{\bigg (\delta +\frac{L}{\delta }\bigg )^2-4\mu }\bigg )\). It follows that the convergence factor R is equal to :
$$\begin{aligned} R=R(\delta )=2\bigg (\alpha _{-}-\frac{L}{\delta }\bigg ) = \delta +\frac{L}{\delta } -\sqrt{\bigg (\delta +\frac{L}{\delta }\bigg )^2-4\mu } \qquad \forall \delta >0 \end{aligned}$$(83)Finally, in the same way as in the previous case, one can deduce that the optimal value for \(\delta \) that maximizes R in (83), is \(\delta ^{*}=\sqrt{L}\) and gives \(R^{*}=R(\delta ^{*})=2\big (\sqrt{L}-\sqrt{L-\mu }\big )\) and \(\alpha =2\sqrt{L}-\sqrt{L-\mu }\) which concludes the second point of Theorem 2.
\(\square \)
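The maximization of \(R(\delta )\) in (83), which is only stated in the proof above, can be made explicit with a short computation. The auxiliary variable s is introduced only for this sketch.

$$\begin{aligned} s:=\delta +\frac{L}{\delta }\ge 2\sqrt{L} \quad \text {(with equality if and only if } \delta =\sqrt{L}\text {)}, \qquad R=s-\sqrt{s^{2}-4\mu }=\frac{4\mu }{s+\sqrt{s^{2}-4\mu }} \end{aligned}$$

The last expression is decreasing in s, so R is maximal when s is minimal, that is for \(\delta =\sqrt{L}\), which gives \(R=2\sqrt{L}-\sqrt{4L-4\mu }=2\big (\sqrt{L}-\sqrt{L-\mu }\big )\). In the case \(\mu =L\) the same conclusion follows for (82), since \(\delta +\frac{L}{\delta }-\bigl |\delta -\frac{L}{\delta }\bigr |=2\min \{\delta ,\frac{L}{\delta }\}\le 2\sqrt{L}\), with equality for \(\delta =\sqrt{L}\).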
References
Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. In: Chaudhuri, K., Salakhutdinov R. (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 242–252. PMLR (2019). http://proceedings.mlr.press/v97/allen-zhu19a.html
Alvarez, F.: On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM J. Control Optim. 38(4), 1102–1119 (2000)
Alvarez, F., Attouch, H., Bolte, J., Redont, P.: A second-order gradient-like dissipative dynamical system with hessian-driven damping. Application to optimization and mechanics. J. Math. Pures Appl. 81(8), 747–779 (2002)
Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward-backward algorithm beyond Nesterov’s rule. Math. Program. (2018). https://doi.org/10.1007/s10107-018-1350-9
Apidopoulos, V., Aujol, J.F., Dossal, C.: The differential inclusion modeling FISTA algorithm and optimality of convergence rate in the case b \(\le 3\). SIAM J. Optim. 28(1), 551–574 (2018)
Apidopoulos, V., Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of an inertial gradient descent algorithm under growth and flatness conditions. Math. Program. 187(1), 151–193 (2021). https://doi.org/10.1007/s10107-020-01476-3
Attouch, H., Bot, R.I., Csetnek, E.R.: Fast optimization via inertial dynamics with closed-loop damping. arXiv preprint arXiv:2008.02261 (2020)
Attouch, H., Cabot, A.: Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity. J. Differ. Equ. 263(9), 5412–5458 (2017)
Attouch, H., Cabot, A.: Convergence rates of inertial forward–backward algorithms. SIAM J. Optim. 28(1), 849–874 (2018)
Attouch, H., Cabot, A., Redont, P.: The dynamics of elastic shocks via epigraphical regularization of a differential inclusion. Barrier and penalty approximations. Adv. Math. Sci. Appl. 12(1), 273–306 (2002)
Attouch, H., Chbani, Z., Fadili, J., Riahi, H.: First-order optimization algorithms via inertial systems with hessian driven damping. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01591-1
Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168(1–2), 123–175 (2018)
Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the Nesterov accelerated gradient method in the subcritical case \(\alpha \le 3\). ESAIM: Control, Optimisation and Calculus of Variations (2019)
Attouch, H., Goudou, X., Redont, P.: The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Commun. Contemp. Math. 2(01), 1–34 (2000)
Aujol, J.F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov acceleration. arXiv preprint arXiv:1805.05719 (2018)
Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the Heavy-Ball method for quasi-strongly convex optimization (2020). https://hal.archives-ouvertes.fr/hal-02545245. Working paper or preprint
Aujol, J.F., Dossal, C., Rondepierre, A.: Convergence rates of the Heavy-Ball method with Lojasiewicz property. Research report, IMB - Institut de Mathématiques de Bordeaux; INSA Toulouse; UPS Toulouse (2020). https://hal.archives-ouvertes.fr/hal-02928958
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)
Bégout, P., Bolte, J., Jendoubi, M.A.: On damped second-order gradient systems. J. Differ. Equ. 259(7), 3115–3143 (2015)
Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48(3), 334–334 (1997). https://doi.org/10.1057/palgrave.jors.2600425
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319–3363 (2010)
Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017)
Boţ, R.I., Csetnek, E.R., László, S.C.: Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems. J. Evol. Equ. 18(3), 1291–1318 (2018)
Cabot, A., Engler, H., Gadat, S.: On the long time behavior of second order differential equations with asymptotically small dissipation. Trans. Am. Math. Soc. 361(11), 5983–6017 (2009)
Cabot, A., Engler, H., Gadat, S.: Second-order differential equations with asymptotically small dissipation and piecewise flat potentials. Electron. J. Differ. Equ. 17, 33–38 (2009)
Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)
Garrigos, G., Rosasco, L., Villa, S.: Convergence of the forward-backward algorithm: beyond the worst case with the help of geometry. arXiv preprint arXiv:1703.09477 (2017)
Ghisi, M., Gobbino, M., Haraux, A.: The remarkable effectiveness of time-dependent damping terms for second order evolution equations. SIAM J. Control Optim. 54(3), 1266–1294 (2016)
Haraux, A., Jendoubi, M.A.: Convergence of solutions of second-order gradient-like systems with analytic nonlinearities. J. Differ. Equ. 144(2), 313–320 (1998)
Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) Mach. Learn. Knowl. Discov. Databases, pp. 795–811. Springer, Cham (2016)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier 48(3), 769–783 (1998)
Liu, C., Zhu, L., Belkin, M.: Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning (2020)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. In: Les Équations aux Dérivées Partielles (Paris, 1962). Éditions du Centre National de la Recherche Scientifique (1963)
Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1–2), 69–107 (2019)
Nesterov, Y.: Introductory lectures on convex optimization: a basic course (2013)
Polyak, B., Shcherbakov, P.: Lyapunov functions: an optimization theory perspective. IFAC Papers OnLine 50(1), 7456–7461 (2017)
Polyak, B.T.: Gradient methods for the minimisation of functionals. USSR Comput. Math. Math. Phys. 3(4), 864–878 (1963)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer, Berlin (2009)
Siegel, J.W.: Accelerated first-order methods: differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671 (2019)
Su, W., Boyd, S., Candes, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)
Yuan, Z., Yan, Y., Jin, R., Yang, T.: Stagewise training accelerates convergence of testing error over SGD. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/file/fcdf25d6e191893e705819b177cddea0-Paper.pdf
Zhang, H.: New analysis of linear convergence of gradient-type methods via unifying error bound conditions. Math. Program. 180(1), 371–416 (2020)
Acknowledgements
We would like to thank the anonymous reviewers for their fruitful comments and recommendations.
Funding
Open access funding provided by Università degli Studi di Genova within the CRUI-CARE Agreement.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
V.A. acknowledges the financial support of the European Research Council (Grant SLING 819789), the AFOSR Projects FA9550-18-1-7009, FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE Project NoMADS - DLV-777826. S.V. acknowledges the support of GNAMPA 2020: “Processi evolutivi con memoria descrivibili tramite equazioni integro-differenziali”. This work has been supported by the ITN-ETN project TraDE-OPT funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska–Curie Grant Agreement No. 861137.
Original article has been corrected: Funding information has been updated.
Appendix A: General lemmas
In this appendix we collect some general auxiliary lemmas used in the main body of this work.
Lemma 5
(Grönwall’s Lemma) Let \(I\subset \mathbb {R}_{+}=[0,+\infty )\) be an interval and \(u,g,h:I\longrightarrow \mathbb {R}\), with \(h\in L^{1}_{loc}(I)\), \(g\in {\mathcal {C}}(I)\), and \(u\in {\mathcal {C}}^{1}(I)\), such that for all \(t\in I\):
\[ {\dot{u}}(t)\le g(t)u(t)+h(t). \]
Then for all \(s,t\in I\) with \(s<t\), it holds:
\[ u(t)\le u(s)e^{G(t)}+\int _{s}^{t}e^{G(t)-G(r)}h(r)\,dr, \tag{85} \]
where \(G(t)=\int _{s}^{t}g(r)dr\).
Proof
By multiplying both sides of the differential inequality above by \(e^{-G(t)}>0\) and using \({\dot{G}}(t)=g(t)\), we obtain:
\[ \frac{d}{dt}\left( u(t)e^{-G(t)}\right) \le h(t)e^{-G(t)}. \]
Hence by integrating on [s, t], we find (notice that \(G(s)=0\)):
\[ u(t)e^{-G(t)}-u(s)\le \int _{s}^{t}h(r)e^{-G(r)}\,dr, \]
which, after multiplying both sides by \(e^{G(t)}>0\), gives (85) and concludes the proof of the Lemma. \(\square \)
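As a numerical sanity check of Grönwall's Lemma, the following minimal sketch assumes the classical differential form \({\dot{u}}(t)\le g(t)u(t)+h(t)\) with the bound \(u(t)\le u(s)e^{G(t)}+\int _{s}^{t}e^{G(t)-G(r)}h(r)dr\). The helper names (`trapezoid`, `gronwall_bound`) and the test functions \(u(t)=1/(1+t)\), \(g(t)=-1/(1+t)\), \(h\equiv 0.1\) are hypothetical choices made for illustration only; the integrals are approximated with a plain trapezoidal rule.

```python
import math


def trapezoid(f, a, b, n=400):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    if a == b:
        return 0.0
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * step)
    return total * step


def gronwall_bound(u_s, g, h, s, t, n=400):
    """u(s) e^{G(t)} + int_s^t e^{G(t) - G(r)} h(r) dr, with G(r) = int_s^r g."""
    def G(r):
        return trapezoid(g, s, r, n)

    G_t = G(t)

    def integrand(r):
        return math.exp(G_t - G(r)) * h(r)

    return u_s * math.exp(G_t) + trapezoid(integrand, s, t, n)


# Hypothetical test data: u'(t) = g(t) u(t) <= g(t) u(t) + h(t), since h > 0.
def u(t):
    return 1.0 / (1.0 + t)


def g(t):
    return -1.0 / (1.0 + t)


def h(t):
    return 0.1


s = 0.0
for t in (0.5, 1.0, 2.0, 5.0):
    bound = gronwall_bound(u(s), g, h, s, t)
    print(f"t={t:.1f}  u(t)={u(t):.4f}  bound={bound:.4f}")
```

For this particular choice the right-hand side can also be computed by hand, \(\frac{1}{1+t}+\frac{0.1\,((1+t)^{2}-1)}{2(1+t)}\) for \(s=0\), which gives a simple way to check the quadrature.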
As a consequence of Lipschitz continuity of the gradient, one can obtain the following inequality (see for example [20] or [36]):
Lemma 6
Let \(F:{\mathcal {H}}\longrightarrow \mathbb {R}\) be a differentiable function with L-Lipschitz continuous gradient. Then
If in addition F is convex, then:
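To make the kind of inequality in Lemma 6 concrete, the sketch below assumes the two classical consequences of an L-Lipschitz gradient found in standard textbooks: the descent inequality \(F(y)\le F(x)+\langle \nabla F(x),y-x\rangle +\frac{L}{2}\Vert y-x\Vert ^{2}\) and, for convex F, the lower bound \(F(y)\ge F(x)+\langle \nabla F(x),y-x\rangle +\frac{1}{2L}\Vert \nabla F(y)-\nabla F(x)\Vert ^{2}\). Taking these as the intended statements is an assumption, as are the scalar test functions (\(\sin \) with \(L=1\), softplus with \(L=1/4\)) and all helper names.

```python
import math
import random


def descent_gap(F, dF, L, x, y):
    """Quadratic upper model at x minus F(y); nonnegative if dF is L-Lipschitz."""
    return F(x) + dF(x) * (y - x) + 0.5 * L * (y - x) ** 2 - F(y)


def cocoercive_gap(F, dF, L, x, y):
    """F(y) minus the strengthened lower model; nonnegative if F is also convex."""
    return F(y) - F(x) - dF(x) * (y - x) - (dF(y) - dF(x)) ** 2 / (2.0 * L)


def softplus(x):
    """Numerically stable log(1 + e^x): convex, with 1/4-Lipschitz derivative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Derivative of softplus, evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


rng = random.Random(0)  # fixed seed so the sampled pairs are reproducible
pairs = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]

# Descent inequality: holds for the non-convex sin (L = 1) and for softplus (L = 1/4).
min_descent_sin = min(descent_gap(math.sin, math.cos, 1.0, x, y) for x, y in pairs)
min_descent_softplus = min(descent_gap(softplus, sigmoid, 0.25, x, y) for x, y in pairs)

# Convex lower bound: holds for softplus, but can fail without convexity.
min_cocoercive_softplus = min(cocoercive_gap(softplus, sigmoid, 0.25, x, y) for x, y in pairs)
sin_counterexample = cocoercive_gap(math.sin, math.cos, 1.0, math.pi / 2, -math.pi / 2)

print(min_descent_sin, min_descent_softplus, min_cocoercive_softplus, sin_counterexample)
```

The last quantity is negative: starting from the maximizer \(x=\pi /2\) of \(\sin \), the lower bound is violated at \(y=-\pi /2\), which illustrates why convexity is needed for the second inequality but not for the first.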
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Apidopoulos, V., Ginatta, N. & Villa, S. Convergence rates for the heavy-ball continuous dynamics for non-convex optimization, under Polyak–Łojasiewicz condition. J Glob Optim 84, 563–589 (2022). https://doi.org/10.1007/s10898-022-01164-w
DOI: https://doi.org/10.1007/s10898-022-01164-w
Keywords
- Non-convex optimization
- Smooth optimization
- Inertial dynamics
- Heavy-ball method
- Polyak–Łojasiewicz condition
- Rates of convergence