Approaching Nonsmooth Nonconvex Optimization Problems Through First Order Dynamical Systems with Hidden Acceleration and Hessian Driven Damping Terms

In this paper we carry out an asymptotic analysis of the proximal-gradient dynamical system
$$\left\{ \begin{array}{ll} \dot x(t) +x(t) = \text{prox}_{\gamma f}\left[x(t)-\gamma\nabla{\Phi}(x(t))-ax(t)-by(t)\right],\\ \dot y(t)+ax(t)+by(t)=0, \end{array}\right. $$
where $f$ is a proper, convex and lower semicontinuous function, $\Phi$ is a possibly nonconvex smooth function and $\gamma$, $a$ and $b$ are positive real numbers. We show that the generated trajectories approach the set of critical points of $f + \Phi$, here understood as zeros of its limiting subdifferential, under the premise that a regularization of this sum function satisfies the Kurdyka-Łojasiewicz property. We also establish convergence rates for the trajectories, formulated in terms of the Łojasiewicz exponent of the considered regularization function.


Introduction
We begin with a short literature review that serves as motivation for the research conducted in this paper.
The Newton-like dynamical system
$$\ddot x(t) + \lambda \dot x(t) + \gamma \nabla^2 \Phi(x(t))(\dot x(t)) + \nabla \Phi(x(t)) = 0 \tag{1}$$
has been investigated by Alvarez, Attouch, Bolte and Redont in [3] in the context of asymptotically approaching the minimizers of the optimization problem
$$\inf_{x \in \mathbb{R}^n} \Phi(x) \tag{2}$$
for $\Phi : \mathbb{R}^n \to \mathbb{R}$ a smooth $C^2$ function and $\lambda$ and $\gamma$ positive numbers. System (1) is a second order system both in time, due to the presence of the acceleration term $\ddot x(t)$, which is associated with inertial effects, and in space, due to the presence of the Hessian $\nabla^2 \Phi(x(t))$. The trajectories generated by (1) have been proved to converge to a critical point of $\Phi$ when this function is analytic, and to a minimizer of $\Phi$ when it is convex. Dynamical systems of type (1) are of great interest, as they occur in applications in fields such as optimization, mechanics, control theory and PDE theory (see [3,4,9,14,16,17]). Let us underline that (1) arises as a natural combination of the continuous Newton method [4]
$$\nabla^2 \Phi(x(t))(\dot x(t)) + \nabla \Phi(x(t)) = 0 \tag{3}$$
with the heavy ball with friction method [6,13]
$$\ddot x(t) + \lambda \dot x(t) + \nabla \Phi(x(t)) = 0. \tag{4}$$
While in (3) the Hessian may be degenerate and the trajectories of (4) may exhibit oscillations with negative effects on numerical computations, their combination (1) in general overcomes these drawbacks. An illustrative example is described in detail in [3] in the context of the minimization of the Rosenbrock function $\Phi : \mathbb{R}^2 \to \mathbb{R}$, $\Phi(x_1, x_2) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2$, where the positive impact of the Hessian driven damping on the trajectories is emphasized. For more insights into the theoretical and numerical advantages of second order dynamical systems of type (1) we refer the reader to [3].
The authors of [3] have also pointed out the surprising fact that the dynamical system (1) can be viewed as a first order dynamical system with no occurrence of the Hessian. More precisely, it has been shown that (1) is equivalent to
$$\left\{ \begin{array}{ll} \dot x(t) + \gamma \nabla \Phi(x(t)) + ax(t) + by(t) = 0,\\ \dot y(t) + ax(t) + by(t) = 0, \end{array}\right. \tag{5}$$
where $a := \lambda - \frac{1}{\gamma}$ and $b := \frac{1}{\gamma}$. The obvious advantage of (5) comes from the fact that for its asymptotic analysis no second order information on the smooth function $\Phi$ is needed. We refer to [3,16] for applications and other arguments in favor of this reformulation of (1).
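The passage from (5) back to (1) can be checked by a short computation. The following sketch assumes that $\Phi$ is $C^2$ and that the trajectories are regular enough to differentiate the first equation of (5) once more; it only illustrates why the Hessian term is "hidden" in (5) and does not replace the argument given in [3].

```latex
\begin{align*}
&\text{From (5):} & \dot y &= -(ax + by) = \dot x + \gamma \nabla \Phi(x), \\
&\text{with } b = \tfrac{1}{\gamma}: & b\,\dot y &= \tfrac{1}{\gamma}\,\dot x + \nabla \Phi(x), \\
&\text{differentiating the first equation of (5):} & 0 &= \ddot x + \gamma \nabla^2 \Phi(x)\,\dot x + a\,\dot x + b\,\dot y \\
& & &= \ddot x + \Big(a + \tfrac{1}{\gamma}\Big)\dot x + \gamma \nabla^2 \Phi(x)\,\dot x + \nabla \Phi(x), \\
&\text{and } a + \tfrac{1}{\gamma} = \lambda \text{ gives (1).}
\end{align*}
```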
On the other hand, in order to asymptotically approach the minimizers of constrained optimization problems of the form
$$\inf_{x \in C} \Phi(x), \tag{6}$$
where $C \subseteq \mathbb{R}^n$ is a nonempty, closed, convex set, the following projection-gradient dynamical system has been considered and investigated by Antipin [6] and Bolte [21]:
$$\dot x(t) + x(t) = \text{proj}_C\left[x(t) - \gamma \nabla \Phi(x(t))\right]. \tag{7}$$
Here, $\text{proj}_C : \mathbb{R}^n \to C$ denotes the projection operator onto the set $C$. Given these, the following combination of the systems (5) and (7)
$$\left\{ \begin{array}{ll} \dot x(t) + x(t) = \text{proj}_C\left[x(t) - \gamma \nabla \Phi(x(t)) - ax(t) - by(t)\right],\\ \dot y(t) + ax(t) + by(t) = 0 \end{array}\right. \tag{8}$$
has been proposed in [3], for $a$, $b$ and $\gamma$ positive numbers, in order to asymptotically approach the minimizers of the constrained optimization problem (6) under the hypothesis that the objective function $\Phi$ is convex. Proximal-gradient dynamical systems, which are generalizations of (7), have been recently considered by Abbas and Attouch in [1, Section 5.2] in the full convex setting. Implicit dynamical systems related to both optimization problems and monotone inclusions have also been considered in the literature by Attouch and Svaiter in [18], Attouch, Abbas and Svaiter in [2] and Attouch, Alvarez and Svaiter in [7]. These investigations have been continued and extended in [19,29,30,31,32].
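To make the projection-gradient system (7) concrete, here is a minimal numerical sketch. It integrates (7) with the explicit Euler method on a one-dimensional toy problem; the test function $\Phi(x) = \frac{1}{2}(x-2)^2$, the interval $C = [0,1]$, the step size `h` and all function names are assumptions chosen for illustration and do not come from the cited works.

```python
def proj_interval(u, lo, hi):
    """Projection of the scalar u onto the interval C = [lo, hi]."""
    return min(max(u, lo), hi)


def projection_gradient_flow(x0, grad_phi, gamma, lo, hi, h=0.01, steps=2000):
    """Explicit Euler steps for x'(t) + x(t) = proj_C(x(t) - gamma * grad_phi(x(t)))."""
    x = x0
    for _ in range(steps):
        # Right-hand side of (7): proj_C(...) - x
        x += h * (proj_interval(x - gamma * grad_phi(x), lo, hi) - x)
    return x


# Hypothetical test problem: Phi(x) = 0.5 * (x - 2)^2 restricted to C = [0, 1].
# Its constrained minimizer is the boundary point x = 1.
x_end = projection_gradient_flow(0.0, lambda x: x - 2.0, gamma=0.5, lo=0.0, hi=1.0)
print(x_end)
```

In this example the unconstrained minimizer $x = 2$ lies outside $C$, so the trajectory is driven towards the boundary point $x = 1$, which is where the projection keeps it.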
In recent years there has been a growing interest in approaching the solvability of nonconvex optimization problems from both continuous and discrete perspectives (see [8,10,11,25,26,28,33,35,36,38,43]). Following this trend, we investigate in this paper the optimization problem
$$\inf_{x \in \mathbb{R}^n} \left[f(x) + \Phi(x)\right], \tag{9}$$
where $f$ is a (possibly nonsmooth) proper, convex and lower semicontinuous function and $\Phi$ is a (possibly nonconvex) smooth function. More precisely, we investigate the convergence of the trajectories generated by the proximal-gradient dynamical system
$$\left\{ \begin{array}{ll} \dot x(t) + x(t) = \text{prox}_{\gamma f}\left[x(t) - \gamma \nabla \Phi(x(t)) - ax(t) - by(t)\right],\\ \dot y(t) + ax(t) + by(t) = 0, \end{array}\right. \tag{10}$$
where $a$, $b$ and $\gamma$ are positive real numbers and $\text{prox}_{\gamma f}$ denotes the proximal point operator of $\gamma f$, to a critical point of $f + \Phi$, here understood as a zero of its limiting subdifferential. To this end we assume that a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property; in other words, it is a KL function. The convergence analysis relies on methods and concepts of real algebraic geometry introduced by Łojasiewicz [41] and Kurdyka [39] and later developed in the nonsmooth setting by Attouch, Bolte and Svaiter [11] and Bolte, Sabach and Teboulle [26].
In the convergence analysis we use three main ingredients: (1) we prove a Lyapunov-type property, expressed as a sufficient decrease of a regularization of the objective function along the trajectories, (2) we show the existence of a subgradient lower bound for the trajectories and, finally, (3) we derive convergence by making use of the Kurdyka-Łojasiewicz property of the objective function (for a similar approach in the continuous case see [3] and in the discrete setting see [11,26]). Furthermore, we obtain convergence rates for the trajectories expressed in terms of the Łojasiewicz exponent of the regularized objective function.

Preliminaries
We recall some notions and results which are needed throughout the paper. We consider on $\mathbb{R}^n$ the Euclidean scalar product and the corresponding norm, denoted by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$, respectively.
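The domain of the function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is defined by $\operatorname{dom} f = \{x \in \mathbb{R}^n : f(x) < +\infty\}$. We say that $f$ is proper if $\operatorname{dom} f \neq \emptyset$. For the following generalized subdifferential notions and their basic properties we refer to [27,42,44]. Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper and lower semicontinuous function. The Fréchet (viscosity) subdifferential of $f$ at $x \in \operatorname{dom} f$ is the set
$$\hat{\partial} f(x) = \left\{ v \in \mathbb{R}^n : \liminf_{y \to x} \frac{f(y) - f(x) - \langle v, y - x \rangle}{\|y - x\|} \geq 0 \right\}.$$
For $x \notin \operatorname{dom} f$ we set $\hat{\partial} f(x) := \emptyset$. The limiting (Mordukhovich) subdifferential is defined at $x \in \operatorname{dom} f$ by
$$\partial f(x) = \left\{ v \in \mathbb{R}^n : \exists\, x_k \to x,\ f(x_k) \to f(x) \text{ and } \exists\, v_k \in \hat{\partial} f(x_k),\ v_k \to v \text{ as } k \to +\infty \right\},$$
while for $x \notin \operatorname{dom} f$ we set $\partial f(x) := \emptyset$. Notice that $\hat{\partial} f(x) \subseteq \partial f(x)$ for each $x \in \mathbb{R}^n$. When $f$ is convex, these subdifferential notions coincide with the convex subdifferential, that is, $\hat{\partial} f(x) = \partial f(x) = \{v \in \mathbb{R}^n : f(y) \geq f(x) + \langle v, y - x \rangle \ \forall y \in \mathbb{R}^n\}$ for all $x \in \operatorname{dom} f$. The following closedness criterion of the graph of the limiting subdifferential will be used in the convergence analysis: if $(x_k)_{k \in \mathbb{N}}$ and $(v_k)_{k \in \mathbb{N}}$ are sequences in $\mathbb{R}^n$ such that $v_k \in \partial f(x_k)$ for all $k \in \mathbb{N}$, $(x_k, v_k) \to (x, v)$ and $f(x_k) \to f(x)$ as $k \to +\infty$, then $v \in \partial f(x)$. The Fermat rule reads in this nonsmooth setting as follows: if $x \in \mathbb{R}^n$ is a local minimizer of $f$, then $0 \in \partial f(x)$. We denote by $\operatorname{crit}(f) = \{x \in \mathbb{R}^n : 0 \in \partial f(x)\}$ the set of (limiting) critical points of $f$.
When $f$ is continuously differentiable around $x \in \mathbb{R}^n$ we have $\partial f(x) = \{\nabla f(x)\}$. We will also make use of the following subdifferential sum rule: if $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper and lower semicontinuous and $h : \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function, then $\partial (f + h)(x) = \partial f(x) + \nabla h(x)$ for all $x \in \mathbb{R}^n$.
A crucial role in the asymptotic analysis of the dynamical system (10) is played by the class of functions satisfying the Kurdyka-Łojasiewicz property. For $\eta \in (0, +\infty]$, we denote by $\Theta_\eta$ the class of concave and continuous functions $\varphi : [0, \eta) \to [0, +\infty)$ such that $\varphi(0) = 0$, $\varphi$ is continuously differentiable on $(0, \eta)$, continuous at $0$ and $\varphi'(s) > 0$ for all $s \in (0, \eta)$. In the following definition (see [10,26]) we also use the distance function to a set, defined for $A \subseteq \mathbb{R}^n$ as $\operatorname{dist}(x, A) = \inf_{y \in A} \|x - y\|$ for all $x \in \mathbb{R}^n$.
Definition 1 (Kurdyka-Łojasiewicz property) Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper and lower semicontinuous function. We say that $f$ satisfies the Kurdyka-Łojasiewicz (KL) property at $\bar x \in \operatorname{dom} \partial f = \{x \in \mathbb{R}^n : \partial f(x) \neq \emptyset\}$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar x$ and a function $\varphi \in \Theta_\eta$ such that for all $x$ in the intersection $U \cap \{x \in \mathbb{R}^n : f(\bar x) < f(x) < f(\bar x) + \eta\}$ the following inequality holds:
$$\varphi'(f(x) - f(\bar x)) \operatorname{dist}(0, \partial f(x)) \geq 1.$$
If $f$ satisfies the KL property at each point in $\operatorname{dom} \partial f$, then $f$ is called a KL function.
The origins of this notion go back to the pioneering work of Łojasiewicz [41], where it is proved that for a real-analytic function $f : \mathbb{R}^n \to \mathbb{R}$ and a critical point $\bar x \in \mathbb{R}^n$ (that is, $\nabla f(\bar x) = 0$), there exists $\theta \in [1/2, 1)$ such that the function $|f - f(\bar x)|^\theta \|\nabla f\|^{-1}$ is bounded around $\bar x$. This corresponds to the situation when $\varphi(s) = C s^{1-\theta}$, where $C > 0$. The result of Łojasiewicz allows the interpretation of the KL property as a re-parametrization of the function values in order to avoid flatness around the critical points. Kurdyka [39] extended this property to differentiable functions definable in o-minimal structures. Further extensions to the nonsmooth setting can be found in [10,22,23,24].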
One of the remarkable properties of the KL functions is their ubiquity in applications (see [26]). To the class of KL functions belong semi-algebraic, real sub-analytic, semiconvex, uniformly convex and convex functions satisfying a growth condition. We refer the reader to [8,10,11,22,23,24,26] and the references therein for more on KL functions and illustrating examples.
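As a small illustration of the KL inequality $\varphi'(f(x) - f(\bar x)) \operatorname{dist}(0, \partial f(x)) \geq 1$ with desingularizing functions of the form $\varphi(s) = C s^{1-\theta}$, the following worked example treats two one-dimensional functions at $\bar x = 0$. The two test functions are our own choice and are not taken from the cited references.

```latex
\begin{align*}
&f(x) = x^2,\ \varphi(s) = s^{1/2}\ (\theta = \tfrac12): &
\varphi'(f(x))\,|f'(x)| &= \frac{1}{2|x|}\cdot 2|x| = 1 \geq 1 \quad (x \neq 0),\\
&f(x) = |x|,\ \varphi(s) = s\ (\theta = 0): &
\varphi'(f(x))\,\operatorname{dist}(0, \partial f(x)) &= 1 \cdot |\operatorname{sign}(x)| = 1 \geq 1 \quad (x \neq 0).
\end{align*}
```

In both cases the inequality holds on all of $\{x : f(x) > 0\}$, so one may take $U = \mathbb{R}$ and $\eta = +\infty$; the smooth quadratic needs a genuinely concave re-parametrization, while the sharp function $|x|$ needs none.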
In the analysis below the following uniform KL property given in [26, Lemma 6] will be used.

Lemma 1
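Let $\Omega \subseteq \mathbb{R}^n$ be a compact set and let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper and lower semicontinuous function. Assume that $f$ is constant on $\Omega$ and that it satisfies the KL property at each point of $\Omega$. Then there exist $\varepsilon, \eta > 0$ and $\varphi \in \Theta_\eta$ such that for all $\bar x \in \Omega$ and all $x$ in the intersection $\{x \in \mathbb{R}^n : \operatorname{dist}(x, \Omega) < \varepsilon\} \cap \{x \in \mathbb{R}^n : f(\bar x) < f(x) < f(\bar x) + \eta\}$ the inequality $\varphi'(f(x) - f(\bar x)) \operatorname{dist}(0, \partial f(x)) \geq 1$ holds.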
In the following we recall the notion of locally absolutely continuous function and state two of its basic properties. Remark 2 (a) An absolutely continuous function is differentiable almost everywhere, its derivative coincides with its distributional derivative almost everywhere and one can recover the function from its derivative $\dot x = y$ by integration.
The following two results, which can be interpreted as continuous versions of the quasi-Fejér monotonicity for sequences, will play an important role in the asymptotic analysis of the trajectories of the dynamical system investigated in this paper. For their proofs we refer the reader to [2, Lemma 5.1] and [2, Lemma 5.2], respectively.

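Lemma 3 Suppose that $F : [0, +\infty) \to \mathbb{R}$ is locally absolutely continuous and bounded from below and that there exists $G \in L^1([0, +\infty))$ such that $\frac{d}{dt} F(t) \leq G(t)$ for almost every $t \in [0, +\infty)$. Then there exists $\lim_{t \to \infty} F(t) \in \mathbb{R}$.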
Further we recall a differentiability result that involves the composition of convex functions with absolutely continuous trajectories, which is due to Brézis ([34, Lemme 3.3, p. 73]; see also [12,Lemma 3.2]).
Lemma 5 Let f : R n → R∪{+∞} be a proper, convex and lower semicontinuous function.
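We close this section with the following characterization of the proximal point operator of a proper, convex and lower semicontinuous function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$: for every $\gamma > 0$ it holds (see for example [20])
$$p = \text{prox}_{\gamma f}(x) \ \text{ if and only if } \ x \in p + \gamma \partial f(p), \tag{13}$$
where $\partial f$ denotes the convex subdifferential of $f$.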
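The characterization of the proximal point operator can be verified numerically in a simple case. The sketch below takes $f = |\cdot|$ on $\mathbb{R}$, whose proximal operator is the well-known soft-thresholding map, and checks that $(x - p)/\gamma$ belongs to $\partial f(p)$ for $p = \text{prox}_{\gamma f}(x)$. The function names, the sample points and the tolerance are assumptions made for this illustration.

```python
def prox_abs(x, gamma):
    """Proximal point of gamma * |.| at the scalar x (soft-thresholding)."""
    if x > gamma:
        return x - gamma
    if x < -gamma:
        return x + gamma
    return 0.0


def in_subdifferential_abs(v, p, tol=1e-12):
    """Check whether v lies in the convex subdifferential of |.| at p."""
    if p > 0:
        return abs(v - 1.0) <= tol   # subdifferential is {1}
    if p < 0:
        return abs(v + 1.0) <= tol   # subdifferential is {-1}
    return -1.0 - tol <= v <= 1.0 + tol  # subdifferential is [-1, 1]


gamma = 0.5
checks = []
for x in [-2.0, -0.3, 0.0, 0.4, 1.7]:
    p = prox_abs(x, gamma)
    # p = prox(x)  <=>  x in p + gamma * subdiff(p)  <=>  (x - p) / gamma in subdiff(p)
    checks.append(in_subdifferential_abs((x - p) / gamma, p))
print(checks)
```

The same test could be repeated for any other function whose proximal operator and subdifferential are available in closed form.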

Asymptotic Analysis
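The dynamical system we investigate in this paper reads
$$\left\{ \begin{array}{ll} \dot x(t) + x(t) = \text{prox}_{\gamma f}\left[x(t) - \gamma \nabla \Phi(x(t)) - ax(t) - by(t)\right],\\ \dot y(t) + ax(t) + by(t) = 0,\\ x(0) = x_0,\ y(0) = y_0, \end{array}\right. \tag{14}$$
where $x_0, y_0 \in \mathbb{R}^n$ and $a$, $b$ and $\gamma$ are positive real numbers. We assume that $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper, convex and lower semicontinuous, while $\Phi : \mathbb{R}^n \to \mathbb{R}$ is Fréchet differentiable with $L$-Lipschitz continuous gradient for some $L > 0$, that is, $\|\nabla \Phi(x) - \nabla \Phi(y)\| \leq L \|x - y\|$ for all $x, y \in \mathbb{R}^n$. The existence and uniqueness of the trajectories generated by (14) can be proved by using the estimates from the proof of Lemma 7 below and by following a classical argument, as in [3, Theorem 7.1].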
For the asymptotic analysis, we impose on the parameters involved the following condition: (15) and notice that the first inequality is fulfilled for an arbitrary b > 0, if a ∈ (0, 2) and γ > 0 are chosen small enough, while the second one holds for a > 0 small enough.

Remark 6
The reader may notice from the analysis we will perform in the next subsection, in particular from the proof of Lemma 7(a), that the conditions in (15) play a crucial role when deriving a decrease property of the objective function, a result which is fundamental for the asymptotic analysis of the dynamical system (14). By time discretization, the dynamical system (14) gives rise to the following iterative scheme ∀n ≥ 0 : where $x_0, y_0 \in \mathbb{R}^n$ and $(\lambda_n)_{n \in \mathbb{N}}$, $(h_n)_{n \in \mathbb{N}}$ are positive real sequences. The results we will obtain in this paper on the asymptotic analysis of (14) justify and motivate the study of the iterative scheme (16) for solving the minimization problem (9). Numerical experiments on concrete problems should give more insights into the proper choice of the parameters involved in (15). We also refer the reader to [15], where the authors investigated in the convex setting the convergence properties of a numerical scheme obtained by time discretization of the dynamical system with two potentials and Hessian-driven damping proposed in [14].
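To give a feeling for how the trajectories of the proximal-gradient dynamical system behave, the following sketch integrates it with a plain explicit Euler method and a constant step size. This is an assumption made for illustration only and is not the iterative scheme (16). The example is one-dimensional with $f = \lambda |\cdot|$ (so that $\text{prox}_{\gamma f}$ is soft-thresholding with threshold $\gamma\lambda$) and a deliberately simple convex $\Phi(x) = \frac{1}{2}(x-1)^2$, for which the unique critical point of $f + \Phi$ can be computed by hand; the parameter values are hypothetical and have not been checked against condition (15).

```python
def soft_threshold(u, tau):
    """Proximal point of tau * |.| at the scalar u."""
    if u > tau:
        return u - tau
    if u < -tau:
        return u + tau
    return 0.0


def simulate(x0, y0, grad_phi, gamma, a, b, lam, h=0.01, steps=6000):
    """Explicit Euler integration of
         x' + x = prox_{gamma f}(x - gamma * grad_phi(x) - a*x - b*y),
         y' + a*x + b*y = 0,
       with f = lam * |.| (illustrative choice)."""
    x, y = x0, y0
    for _ in range(steps):
        u = x - gamma * grad_phi(x) - a * x - b * y
        dx = soft_threshold(u, gamma * lam) - x
        dy = -(a * x + b * y)
        x, y = x + h * dx, y + h * dy
    return x, y


# Phi(x) = 0.5 * (x - 1)^2 and f = 0.1 * |x|: the critical point of f + Phi
# solves 0 = 0.1 * sign(x) + x - 1, i.e. x = 0.9.
grad_phi = lambda x: x - 1.0
x_end, y_end = simulate(2.0, 0.0, grad_phi, gamma=0.5, a=0.5, b=1.0, lam=0.1)
print(x_end, y_end)
```

At a stationary point of the system one has $ax + by = 0$, so the second component settles at $-\frac{a}{b}$ times the limit of the first one, here $-0.45$. A nonconvex $\Phi$ can be plugged in through `grad_phi`, but then the limit depends on the starting point and the parameters should be chosen in accordance with (15).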

Convergence of the trajectories
We begin with the proof of a decrease property for a regularization of the objective function along the trajectories.
Lemma 7 Suppose that $f + \Phi$ is bounded from below and the parameters $a$, $b$, $\gamma$ and $L$ satisfy (15). For $x_0, y_0 \in \mathbb{R}^n$, let $(x, y) \in C^1([0, +\infty), \mathbb{R}^n) \times C^2([0, +\infty), \mathbb{R}^n)$ be the unique global solution of (14). Then the following statements are true:
Proof Define $z : [0, +\infty) \to \mathbb{R}^n$ by $z(t) = \text{prox}_{\gamma f}\left[x(t) - \gamma \nabla \Phi(x(t)) - ax(t) - by(t)\right]$. Since $\text{prox}_{\gamma f}$ is nonexpansive (that is, 1-Lipschitz continuous), in view of Remark 2(b), $z$ is locally absolutely continuous. From the Lipschitz continuity of $\nabla \Phi$ we obtain hence, for almost every $t \geq 0$, Since $\dot x(t) = z(t) - x(t)$, it follows that $\dot x$ is locally absolutely continuous, hence $\ddot x$ exists almost everywhere on $[0, +\infty)$ and for almost every $t \geq 0$ it holds We fix an arbitrary $T > 0$. From the characterization (13) of the proximal point operator we have
$$-\frac{1}{\gamma} \dot x(t) - \nabla \Phi(x(t)) - \frac{a}{\gamma} x(t) - \frac{b}{\gamma} y(t) \in \partial f(\dot x(t) + x(t)). \tag{21}$$
Due to the continuity properties of the trajectories and their derivatives on $[0, T]$, (20) and the Lipschitz continuity of $\nabla \Phi$, we have Applying Lemma 5 we obtain that the function $t \mapsto f(\dot x(t) + x(t))$ is absolutely continuous and for almost every $t \in [0, T]$. Moreover, it holds for almost every $t \in [0, T]$. Summing up the last two equalities and by taking into account (14), we obtain for almost every $t \in [0, T]$. Further, due to (14) we have d dt Substituting the term $\langle \dot x(t), \dot y(t) \rangle$ from the last relation into (22) we get for almost every $t \in [0, T]$. Noticing that
(b) By integration we get Since $f + \Phi$ is bounded from below and by taking into account that $T > 0$ has been arbitrarily chosen, we obtain $\dot x, \dot y \in L^2([0, +\infty); \mathbb{R}^n)$.
Due to (20), this further implies $\ddot x \in L^2([0, +\infty); \mathbb{R}^n)$. Furthermore, for almost every $t \in [0, +\infty)$ we have By applying Lemma 4, it follows that $\lim_{t \to +\infty} \dot x(t) = 0$. Moreover, from (14) we get that $\ddot y$ exists and $\ddot y \in L^2([0, +\infty); \mathbb{R}^n)$ due to (24). The same arguments are used in order to conclude $\lim_{t \to +\infty} \dot y(t) = 0$. (c) From (a) we get for almost every $t \geq 0$. From Lemma 3 it follows that exists and it is a real number, hence from the conclusion follows.
We define the limit set of $x$ as $\omega(x) = \{\bar x \in \mathbb{R}^n : \exists\, t_k \to +\infty \text{ such that } x(t_k) \to \bar x \text{ as } k \to +\infty\}$.
Lemma 8 Suppose that $f + \Phi$ is bounded from below and the parameters $a$, $b$, $\gamma$ and $L$ satisfy (15). For $x_0, y_0 \in \mathbb{R}^n$, let $(x, y) \in C^1([0, +\infty), \mathbb{R}^n) \times C^2([0, +\infty), \mathbb{R}^n)$ be the unique global solution of (14). Then $\omega(x) \subseteq \operatorname{crit}(f + \Phi)$.
Proof Let $\bar x \in \omega(x)$ and $t_k \to +\infty$ be such that $x(t_k) \to \bar x$ as $k \to +\infty$. From (21) we have Lemma 7(b), (14) and the Lipschitz continuity of $\nabla \Phi$ ensure that We claim that Indeed, from (28) and the lower semicontinuity of $f$ we get Further, since $\dot x$ we have the inequality Taking in the above inequality the limit as $k \to +\infty$, we derive by using again Lemma which combined with (30) implies By using (28) and the continuity of $\Phi$ we conclude that (29) is true. Altogether, from (26), (27), (28), (29) and the closedness criterion of the limiting subdifferential we obtain $0 \in \partial (f + \Phi)(\bar x)$ and the proof is complete.
Proof (H1) follows from Lemma 7. The first statement in (H2) is a consequence of (21), the equation $\dot y(t) + ax(t) + by(t) = 0$ and the fact that for all $(u, v, w) \in \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n$. The second statement in (H2) is a consequence of the Lipschitz continuity of $\nabla \Phi$. Finally, (H3) has been shown as an intermediate step in the proof of Lemma 8.

Lemma 10
Suppose that $f + \Phi$ is bounded from below and the parameters $a$, $b$, $\gamma$ and $L$ satisfy (15). For $x_0, y_0 \in \mathbb{R}^n$, let $(x, y) \in C^1([0, +\infty), \mathbb{R}^n) \times C^2([0, +\infty), \mathbb{R}^n)$ be the unique global solution of (14). Consider the function Suppose that $x$ is bounded. Then the following statements are true: Finally, (c) is a classical result from [37]. We also refer the reader to the proof of Theorem 4.1 in [3], where it is shown that the properties of $\omega(x)$ of being nonempty, compact and connected are generic for bounded trajectories fulfilling $\lim_{t \to +\infty} \dot x(t) = 0$.
Remark 11 Suppose that $a$, $b$, $\gamma$ and $L > 0$ fulfill the inequality (15) and $f + \Phi$ is coercive, in other words, $\lim_{\|u\| \to +\infty} (f + \Phi)(u) = +\infty$. For $x_0, y_0 \in \mathbb{R}^n$, let $(x, y) \in C^1([0, +\infty), \mathbb{R}^n) \times C^2([0, +\infty), \mathbb{R}^n)$ be the unique global solution of (14). Then $f + \Phi$ is bounded from below and $x$ is bounded.
Indeed, since $f + \Phi$ is a proper, lower semicontinuous and coercive function, it follows that $\inf_{u \in \mathbb{R}^n} [f(u) + \Phi(u)]$ is finite and the infimum is attained. Hence $f + \Phi$ is bounded from below. On the other hand, from (23) it follows Since $f + \Phi$ is coercive, the lower level sets of $f + \Phi$ are bounded, hence the above inequality yields that $\dot x + x$ is bounded, which combined with $\lim_{t \to +\infty} \dot x(t) = 0$ delivers the boundedness of $x$. Notice that in this case $y$ is bounded, too, due to Lemma 7(b) and the equation $\dot y(t) + ax(t) + by(t) = 0$. Now we are in the position to present the first main result of the paper, which concerns the convergence of the trajectories generated by (14).
II. For every By using Lemma 10(c) and (d) and the fact that $H$ is a KL function, by Lemma 1, there exist positive numbers $\varepsilon$ and $\eta$ and a concave function $\varphi \in \Theta_\eta$ such that for all one has Let $t_1 \geq 0$ be such that $H\left(\dot x(t) + x(t), x(t), y(t)\right) < H\left(\bar x, \bar x, -\frac{a}{b} \bar x\right) + \eta$ for all $t \geq t_1$.
Since $\lim_{t \to +\infty} \operatorname{dist}\left(\left(\dot x(t) + x(t), x(t), y(t)\right), \Omega\right) = 0$, there exists $t_2 \geq 0$ such that for all $t \geq t_2$ the inequality $\operatorname{dist}\left(\left(\dot x(t) + x(t), x(t), y(t)\right), \Omega\right) < \varepsilon$ holds. Hence for all $t \geq T := \max\{t_1, t_2\}$, $\left(\dot x(t) + x(t), x(t), y(t)\right)$ belongs to the intersection in (32). Thus, according to (33), for every $t \geq T$ we have By applying Lemma 9(H2) we obtain for almost every $t \in [T, +\infty)$ where From here, by using Lemma 9(H1), that $\varphi' > 0$ and we deduce that for almost every $t \in [T, +\infty)$ it holds (37) Let $\alpha > 0$ (which does not depend on $t$) be such that From (37) we derive the inequality which holds for almost every $t \geq T$. Since $\varphi$ is bounded from below, by integration it follows $\dot x, \dot y \in L^1([0, +\infty); \mathbb{R}^n)$. From here we obtain that $\lim_{t \to +\infty} x(t)$ exists and the conclusion follows from the results obtained in this section.
Since the class of semi-algebraic functions is closed under addition (see for example [26]) and $(u, v, w) \mapsto c \|u - v\|^2 + c' \|av + bw\|^2$ is semi-algebraic for $c, c' > 0$, we obtain the following direct consequence of the above theorem.

Convergence rates
In this subsection we investigate the convergence rates of the trajectories generated by the dynamical system (14). When solving optimization problems involving KL functions, convergence rates have been proved to depend on the so-called Łojasiewicz exponent (see [8,22,36,41]). The main result of this subsection refers to the KL functions which satisfy Definition 1 for $\varphi(s) = C s^{1-\theta}$, where $C > 0$ and $\theta \in (0, 1)$. We recall the following definition considered in [8].
Definition 3 Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a proper and lower semicontinuous function. The function $f$ is said to have the Łojasiewicz property if for every $\bar x \in \operatorname{crit} f$ there exist $C, \varepsilon > 0$ and $\theta \in (0, 1)$ such that
$$|f(x) - f(\bar x)|^\theta \leq C \|x^*\| \ \text{ for every } x \text{ fulfilling } \|x - \bar x\| < \varepsilon \text{ and every } x^* \in \partial f(x). \tag{40}$$
According to [10, Lemma 2.1 and Remark 3.2(b)], the KL property is automatically satisfied at any noncritical point, a fact which motivates the restriction to critical points in the above definition. The real number $\theta$ in the above definition is called the Łojasiewicz exponent of the function $f$ at the critical point $\bar x$.
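The way the Łojasiewicz exponent influences the speed of convergence can already be seen on the plain gradient flow $\dot x(t) = -f'(x(t))$ in dimension one. This is used here only as a simpler stand-in for the system (14), and the two test functions below are our own choice. Both satisfy $|f(x) - f(0)|^\theta \leq C |f'(x)|$ near the critical point $\bar x = 0$, with different exponents.

```latex
\begin{align*}
&f(x) = \tfrac12 x^2,\ \theta = \tfrac12: &
|f(x)|^{1/2} &= \tfrac{1}{\sqrt 2}\,|x| = \tfrac{1}{\sqrt 2}\,|f'(x)|, &
\dot x = -x \ &\Rightarrow\ x(t) = x_0 e^{-t},\\
&f(x) = \tfrac14 x^4,\ \theta = \tfrac34: &
|f(x)|^{3/4} &= 4^{-3/4}\,|x|^3 = 4^{-3/4}\,|f'(x)|, &
\dot x = -x^3 \ &\Rightarrow\ x(t) = \frac{x_0}{\sqrt{1 + 2 x_0^2 t}}.
\end{align*}
```

For $\theta = \frac{1}{2}$ the trajectory decays exponentially, while for the flatter function with $\theta = \frac{3}{4}$ it only decays like $t^{-1/2}$.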
The convergence rates obtained in the following theorem are in the spirit of [22] and [8].
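Theorem 14 Suppose that $f + \Phi$ is bounded from below and the parameters $a$, $b$, $\gamma$ and $L$ satisfy (15). For $x_0, y_0 \in \mathbb{R}^n$, let $(x, y) \in C^1([0, +\infty), \mathbb{R}^n) \times C^2([0, +\infty), \mathbb{R}^n)$ be the unique global solution of (14). Consider the function Suppose that $x$ is bounded and $H$ satisfies Definition 1 for $\varphi(s) = C s^{1-\theta}$, where $C > 0$ and $\theta \in (0, 1)$. Then there exists $\bar x \in \operatorname{crit}(f + \Phi)$ such that $\lim_{t \to +\infty} x(t) = \bar x$ and $\lim_{t \to +\infty} y(t) = -\frac{a}{b} \bar x$. Let $\theta$ be the Łojasiewicz exponent of $H$ at $\left(\bar x, \bar x, -\frac{a}{b} \bar x\right) \in \operatorname{crit} H$, according to Definition 3. Then there exist $a_1, b_1, a_2, b_2 > 0$ and $t_0 \geq 0$ such that for every $t \geq t_0$ the following statements are true: if $\theta \in (0, \frac{1}{2})$, then $x$ and $y$ converge in finite time;
Proof We define $\sigma : [0, +\infty) \to [0, +\infty)$ by (see also [22]) It is immediate that Indeed, this follows by noticing that for $T \geq t$ and by letting afterwards $T \to +\infty$.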
Remark 15 (i) In the light of (38) one can notice that in the above proof $\alpha > 0$ can be chosen such that $2\alpha \max(C_1, C_2) \leq \min(M_1, M_2)$, where $M_1, M_2$ are defined in statement (a) of Lemma 7 and $C_1, C_2$ in (36). Moreover, the parameter $N$ in the above proof can be chosen as $N := \max(C_1, C_2)$ (see Lemma 9(H2)). In this way, the impact of the parameters involved on the convergence rates can be stated exactly. More insights into the role played by these parameters should be gained through numerical experiments on the iterative scheme (16).
(ii) The computation of the Łojasiewicz exponent of $H$, which appears in the expression of the convergence rates, is not a trivial task. One exception is the situation when $H$ is semi-algebraic (which happens for instance when $f$ and $\Phi$ share this property); some progress in this direction has been recently reported in [40].
(iii) For optimization problems involving KL functions, convergence rates have been obtained in particular for the iterates (trajectories); see for instance [8,22,36,41]. In the recent contribution [25], the authors also addressed, in the same setting, the convergence of the objective function values. These investigations can open a new perspective for the study of the convergence behavior of objective function values in the context of different numerical algorithms and dynamical systems approaching optimization problems with analytic features.