A fast continuous time approach for non-smooth convex optimization using Tikhonov regularization technique

In this paper we would like to address the classical optimization problem of minimizing a proper, convex and lower semicontinuous function via the second order in time dynamics, combining viscous and Hessian-driven damping with a Tikhonov regularization term. In our analysis we heavily exploit the Moreau envelope of the objective function and its properties as well as Tikhonov regularization properties, which we extend to a nonsmooth case. We introduce the setting, which at the same time guarantees the fast convergence of the function (and Moreau envelope) values and strong convergence of the trajectories of the system to a minimal norm solution—the element of the minimal norm of all the minimizers of the objective. Moreover, we deduce the precise rates of convergence of the values for the particular choice of parameters. Various numerical examples are also included as an illustration of the theoretical results.

The main goal of this research is to provide a setting for the nonsmooth case in which we have fast convergence of the function values combined with strong convergence of the trajectories (the solutions of (1)) to the element of minimal norm from the set of all minimizers of the objective function. This analysis extends the one conducted in [3] to the case of a nonsmooth objective function. We also aim to provide the exact rates of convergence of the values for the polynomial choice of the smoothing parameter λ and the Tikhonov function ε. Finally, multiple numerical experiments were conducted to allow a better understanding of the theoretical results.

Nonsmooth optimization
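The Moreau envelope plays a significant role in nonsmooth optimization. For a proper, convex and lower semicontinuous function $\Phi : H \to \mathbb{R} \cup \{+\infty\}$ it is defined as

$$\Phi_\lambda(x) = \min_{y \in H} \left\{ \Phi(y) + \frac{1}{2\lambda}\|x - y\|^2 \right\},$$

where $\lambda > 0$. $\Phi_\lambda$ is convex and continuously differentiable with

$$\nabla\Phi_\lambda(x) = \frac{1}{\lambda}\left(x - \mathrm{prox}_{\lambda\Phi}(x)\right) \quad \forall x \in H, \qquad (2)$$

and $\nabla\Phi_\lambda$ is $\frac{1}{\lambda}$-Lipschitz continuous. Here,

$$\mathrm{prox}_{\lambda\Phi}(x) = \operatorname*{argmin}_{y \in H} \left\{ \Phi(y) + \frac{1}{2\lambda}\|x - y\|^2 \right\}$$

denotes the proximal operator of Φ of parameter λ. Moreover (see [1]),

$$\frac{d}{d\lambda}\Phi_\lambda(x) = -\frac{1}{2}\|\nabla\Phi_\lambda(x)\|^2 \quad \forall x \in H. \qquad (3)$$

The work [5] by Attouch and László serves as a starting point for many research topics in nonsmooth optimization. There, the following dynamics was considered:

$$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta \frac{d}{dt}\nabla\Phi_{\lambda(t)}(x(t)) + \nabla\Phi_{\lambda(t)}(x(t)) = 0, \qquad (4)$$

where $\alpha > 1$ and $\beta > 0$, and the term $\frac{d}{dt}\nabla\Phi_{\lambda(t)}(x(t))$ is inspired by the Hessian-driven damping term in the case of smooth functions. For this system several fundamental results were proven, such as convergence rates for the Moreau envelope values as well as for the velocity of the system,

$$\Phi_{\lambda(t)}(x(t)) - \Phi^* = o\left(\frac{1}{t^2}\right) \quad \text{and} \quad \|\dot{x}(t)\| = o\left(\frac{1}{t}\right) \quad \text{as } t \to +\infty,$$

from which convergence rates for Φ along the trajectories themselves were deduced:

$$\Phi\left(\mathrm{prox}_{\lambda(t)\Phi}(x(t))\right) - \Phi^* = o\left(\frac{1}{t^2}\right) \quad \text{and} \quad \left\|\mathrm{prox}_{\lambda(t)\Phi}(x(t)) - x(t)\right\| = o\left(\frac{\sqrt{\lambda(t)}}{t}\right) \quad \text{as } t \to +\infty.$$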
In addition, convergence rates for the gradient of the Moreau envelope of parameter λ(t) and its time derivative along x(t) were established. Moreover, the weak convergence of the trajectories x(t) to a minimizer of Φ as t → +∞ was deduced.
From here one may go in many directions in order to continue investigating the topic of second order dynamics. Time scaling, for instance, can be introduced to improve the speed of convergence of the values, as was done in [12]. Another way to proceed is to consider the so-called Tikhonov regularization technique, to which we devote the next section of our manuscript.
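The Moreau envelope and the proximal operator discussed above can be made concrete on a one-dimensional example. The following is a minimal sketch, assuming the illustrative choice Φ(x) = |x| (not an example taken from the paper), for which the proximal operator is the well-known soft-thresholding map and the envelope is the Huber function. The function names are hypothetical.

```python
import numpy as np

def prox_abs(x, lam):
    """Proximal operator of lam * |.| (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_abs(x, lam):
    """Moreau envelope of |.|, evaluated at the proximal point."""
    p = prox_abs(x, lam)
    return np.abs(p) + (x - p) ** 2 / (2.0 * lam)

def grad_moreau_env_abs(x, lam):
    """Gradient of the envelope via the identity (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam

# Illustrative grid and smoothing parameter (assumed values).
x = np.linspace(-3.0, 3.0, 13)
lam = 0.5
env = moreau_env_abs(x, lam)
grad = grad_moreau_env_abs(x, lam)

# Closed form of the envelope of |.|: the Huber function.
huber = np.where(np.abs(x) <= lam, x ** 2 / (2.0 * lam), np.abs(x) - lam / 2.0)
```

Here `env` agrees with `huber`, lies below |x|, and `grad` changes by at most 1/λ per unit of x, in line with the Lipschitz property of the gradient of the envelope.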

Tikhonov regularization
The presence of the Tikhonov term in the system equation dramatically influences the behaviour of its trajectories: under appropriate conditions, it improves the convergence of the trajectories from weak to strong. Moreover, it ensures convergence not to an arbitrary element of the set of all minimizers of the objective, but to the particular one that has the smallest norm. In the presence of the Tikhonov term it is still possible to obtain fast rates of convergence of the function values. Systems with Tikhonov regularization were studied, for instance, in [2,3,4,6,10,11,13,14].
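The selection of the minimal norm solution can be seen already in a static, smooth toy problem. The sketch below is only an illustration under assumed data (the matrix `A`, the vector `b` and the helper name are hypothetical, not from the paper): for Φ(x) = ½‖Ax − b‖² with a non-unique minimizer, the minimizer of Φ(x) + (ε/2)‖x‖² approaches the minimal norm minimizer as ε decreases to zero.

```python
import numpy as np

# Assumed toy data: argmin of 0.5 * ||A x - b||^2 is the line x1 + x2 = 1.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

def tikhonov_minimizer(eps):
    """Unique minimizer of 0.5*||A x - b||^2 + 0.5*eps*||x||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + eps * np.eye(n), A.T @ b)

# Minimal norm minimizer of the unregularized problem (pseudo-inverse solution).
x_min_norm = np.linalg.pinv(A) @ b

eps_values = [1.0, 1e-1, 1e-2, 1e-4]
errors = [np.linalg.norm(tikhonov_minimizer(e) - x_min_norm) for e in eps_values]
```

The entries of `errors` shrink as ε decreases, which mirrors, in a much simpler setting, the role the Tikhonov term plays for the trajectories.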
A notable example in the smooth setting is the system studied in [3], where Φ : H → R is twice continuously differentiable and convex, ε is nonincreasing and goes to zero as t → +∞, and p is chosen appropriately. This system inherits the fast convergence rates of the function values, of order $\frac{1}{t^2}$, and additionally provides strong convergence of the trajectories of the system in the same setting.
Concerning the nonsmooth case we refer to [11], where it was covered for more general systems governed by a maximally monotone operator, but with a different damping. The authors studied the following dynamics where α > 0, β ≥ 0, 0 < q ≤ 1 and $\lambda(t) = \lambda t^{2q}$ for λ > 0, A is a maximally monotone operator and $A_\lambda$ is its Yosida regularization of order λ. The system (5) is related to the inclusion problem 0 ∈ Ax.
The authors showed fast convergence rates for $\|\dot{x}(t)\|$, $\|A_{\lambda(t)}(x(t))\|$ and $\left\|\frac{d}{dt}A_{\lambda(t)}(x(t))\right\|$, of order $\frac{1}{t^q}$, $\frac{1}{t^{2q}}$ and $\frac{1}{t^{3q}}$, respectively. Moreover, they established the strong convergence of the trajectories of the system.

Remark 1. We would like to stress that Theorem 11 of [11] does not cover the case presented in this paper.
1. First of all, the systems (1) and (5) have different damping coefficients. The damping in (1) depends on the Tikhonov function ε, while the damping in (5) is taken in the polynomial form $\frac{1}{t^q}$. Thus, if we take $\varepsilon(t) = \frac{1}{t^{2q}}$ in (5) to mimic the relation between the damping parameter and the Tikhonov function as in (1), then one of the conditions of Theorem 11 becomes where 0 < q < 1, which is obviously not fulfilled.
2. Secondly, the smoothing parameter λ in [11] is fixed, while our analysis holds for a more general choice of λ. However, if we want to consider the polynomial case of parameters (Section 5), then we indeed arrive at a similar restriction for λ: in Section 6 we will see that for strong convergence of the trajectories and the polynomial choice of parameters, $\lambda(t) = t^l$, we have to take 0 ≤ l < 2, which is a wider range than 0 < q < 1 for $\lambda(t) = t^{2q}$.
In this paper we aim to develop the ideas presented in [3] for p = 0 to cover the nonsmooth case. Section 2 gathers some preliminary results, which we will need in our analysis. The main result of our research is presented in Section 3. In Section 4 we study the results of the previous section in more detail in order to show that they are meaningful. Section 5 provides the polynomial setting in which the results are valid and the analysis works. Section 6 establishes the actual rates of convergence of the values and the trajectories. In Section 7 we consider the interesting particular case $\varepsilon(t) = \frac{1}{t^2}$ and show fast convergence of the function values. Finally, Section 8 is devoted to numerical experiments, which illustrate the theory.

Auxiliary estimates and properties
Let us begin with two important properties, which we will later use in our analysis. The first one concerns the proximal mapping: The second one, which is known as the first order optimality condition, in our case reads as We continue with the following lemma (see [9], Proposition 12.22, for the first item of the lemma and [7], Appendix, A1, for the second one).

Lemma 1. Let Φ : H → R be a proper, convex and lower semicontinuous function, λ, µ > 0. Then

The following estimates will be used later to evaluate the derivative of our energy function.
Lemma 2. The following properties are satisfied: 2. the function $t \mapsto x_{\varepsilon(t),\lambda(t)}$ is Lipschitz continuous on the compact intervals of $(t_0, +\infty)$ and thus almost everywhere differentiable. Moreover, for almost every

Proof. By the definition of $\varphi_{\varepsilon(t),\lambda(t)}$ by item 1 of Lemma 1. Thus, . From (7) we obtain where the second equality comes from item 2 of Lemma 1. Combining the last two equalities with (2) we obtain which is the first claim.
To obtain the second claim we start with (7), noticing that for h > 0 Consider Taking the inner product of each part of this equality with $x_{\varepsilon(t+h),\lambda(t+h)} - x_{\varepsilon(t),\lambda(t)}$, we notice that by the monotonicity of $\nabla\Phi_{\lambda(t+h)}$. So, Let us divide the last inequality by $h^2$ to obtain Now notice that the mapping $t \mapsto x_{\varepsilon(t),\lambda(t)}$ is Lipschitz continuous on the compact intervals of $\mathbb{R}_+ \setminus \{0\}$ (according to [1]) and therefore almost everywhere differentiable. Letting h tend to zero we deduce for almost every , where we used the following estimate from [12] lim where we used (2), (6) and the Cauchy-Schwarz inequality. On the other hand, the Cauchy-Schwarz inequality yields Combining the last two inequalities we arrive at Replacing $\nabla\Phi_{\lambda(t)}(x_{\varepsilon(t),\lambda(t)})$ using (7) gives us the second claim.
Let us also mention two key properties of the Tikhonov regularization, which we will use later in the analysis Then the following properties of the mapping $t \mapsto x_{\varepsilon(t),\lambda(t)}$ are satisfied: and lim

Proof. Suppose that $t \ge t_0$. By the monotonicity of $\nabla\Phi_{\lambda(t)}$ we deduce By (7) we obtain Using the Cauchy-Schwarz inequality we derive This proves the first claim. For the second one consider (7) again and note that it is equivalent to by item 2 of Lemma 1. Note that by (8) we have $\lambda(t) + \frac{1}{\varepsilon(t)} \to +\infty$ and $\lambda(t)\varepsilon(t) + 1 \to 1$ as t → +∞. From now on the proof is inspired by Theorem 23.44 of [9]. Take $z \in \operatorname{argmin} \Phi = \operatorname{argmin} \Phi_\lambda$ for each λ > 0. From (11), from the fact that the resolvent of a maximally monotone operator is maximally monotone and firmly nonexpansive (see, for instance, Corollary 23.11(i) of [9]) and from the Cauchy-Schwarz inequality it follows that for all $t \ge t_0$ (note that z could be represented as which gives the boundedness of $x_{\varepsilon(t),\lambda(t)}$ for all $t \ge t_0$. Now, let y be a weak sequential cluster point of $\{x_{\varepsilon(t_n),\lambda(t_n)}\}_{n \in \mathbb{N}}$, namely, $x_{\varepsilon(t_{k_n}),\lambda(t_{k_n})} \rightharpoonup y$ as n → +∞. From (7) we deduce Using which is equivalent to The sequence lies in gra ∂Φ by (13) and converges to (y, 0) in $H^{\text{weak}} \times H^{\text{strong}}$ due to (8) and the sequence $\{x_{\varepsilon(t_{k_n}),\lambda(t_{k_n})}\}_{n \in \mathbb{N}}$ being bounded. Therefore, since gra ∂Φ is sequentially closed (see Proposition 20.38(ii) of [9]), it follows that y ∈ argmin Φ. From (12) we derive by the definition of weak convergence, thus, $x_{\varepsilon(t_{k_n}),\lambda(t_{k_n})} \to y$ as n → +∞. On the other hand, (12) leads to So, $x^*$ being the only weak sequential cluster point of the bounded sequence $\{x_{\varepsilon(t_n),\lambda(t_n)}\}_{n \in \mathbb{N}}$ means that $x_{\varepsilon(t_n),\lambda(t_n)} \rightharpoonup x^*$ as n → +∞, by Lemma 2.46 of [9]. By (12) again we deduce and so the second claim follows.

Existence and uniqueness of the solution of (1)
Our next goal is to deduce the existence and uniqueness of the solution of the dynamical system (1). Suppose β > 0. Let us integrate (1) from $t_0$ to t to obtain Let us multiply the second one by β and then, by summing it with the first line, we get rid of the gradient of the Moreau envelope in the second equation We denote now $y(t) = \beta z(t) + \left(1 - \alpha\beta\sqrt{\varepsilon(t)}\right)x(t)$, and, after simplification, we obtain the following equivalent formulation for the dynamical system In case β = 0, for every $t \ge t_0$, (1) can be equivalently written as Therefore, based on the two reformulations of the dynamical system (1) above, we provide the following existence and uniqueness result, which is a consequence of the Cauchy-Lipschitz theorem for strong global solutions. The proof follows the lines of the proofs of Theorem 1 in [5] or of Theorem 1.1 in [8] with some small adjustments.

Main result
This section is devoted to establishing some crucial estimates for the quantities $\Phi_{\lambda(t)}(x(t)) - \Phi^*$ and $\|x(t) - x_{\varepsilon(t),\lambda(t)}\|$ for all $t \ge t_0$. In order to do so we will use the ideas and methods of Lyapunov analysis. We introduce the energy function where $\frac{\alpha}{2} \le \gamma < \alpha$. The idea is to show that this energy function satisfies the following differential inequality, as it was done in [3], and g are positive functions. The next theorem provides the analysis needed to obtain the desired inequality.

Theorem 5. Let $x : [t_0, +\infty) \to H$ be a solution of (1). Assume that (8) holds and suppose that there exist a, c > 0 such that for t large enough it holds that Then there exists $t_1 \ge t_0$ such that for all $t \ge t_1$ where

Proof. We start with computing the derivative of the energy function (14). Let us denote $v(t) = \gamma\sqrt{\varepsilon(t)}\left(x(t) - x_{\varepsilon(t),\lambda(t)}\right) + \dot{x}(t) + \beta\nabla\Phi_{\lambda(t)}(x(t))$. Once again, by the classical chain rule, using item 1 of Lemma 2 and (3), we obtain for all $t \ge t_0$ Our next goal is to obtain an upper bound for $\dot{E}$. Let us calculate for all $t \ge t_0$ where above we used (1). Thus, for all $t \ge t_0$ Let us use the previous estimates to evaluate the quantity $\langle \dot{v}(t), v(t) \rangle$. Namely, by the ε(t)-strong convexity of $\varphi_{\varepsilon(t),\lambda(t)}$ for all $t \ge t_0$ and then for all $t \ge t_0$ Again, by the ε(t)-strong convexity of $\varphi_{\varepsilon(t),\lambda(t)}$, since $\dot{\varepsilon}(t) \le 0$ for all $t \ge t_0$ It is true that for all a > 0 as well as In the same spirit for all b > 0 Furthermore, Combining all the estimates above we arrive for all $t \ge t_0$ at Returning to the expression for $\dot{E}(t)$ we notice that the terms $\langle \nabla\varphi_{\varepsilon(t),\lambda(t)}(x(t)), \dot{x}(t) \rangle$ cancel each other out.
Let us now consider Therefore, using $\mu(t) = (\alpha - \gamma)\sqrt{\varepsilon(t)} - \frac{\dot{\varepsilon}(t)}{2\varepsilon(t)}$ (the terms with $\langle x(t) - x_{\varepsilon(t),\lambda(t)}, \dot{x}(t) \rangle$ disappear), we obtain for all $t \ge t_0$ Further we have ($\dot{\varepsilon}(t) \le 0$ for all $t \ge t_0$) As we have established earlier by item 2 of Lemma 2 and (9) and since there exists we deduce for all $t \ge t_1$ Choosing $b = c\gamma$ with c > 0 we obtain for all $t \ge t_1$ Let us investigate the signs of the terms in the inequality above when t is large enough to satisfy the assumptions (15)-(18). First of all, due to (15). Then, So, at the end we deduce for all $t \ge t_1$ (20) Integrating (20) from $t_1$ to t we obtain or, neglecting the positive terms, From (20) we also obtain for all $t \ge t_1$ Multiplying this with $\Gamma(t) = \exp\left(\int_{t_1}^{t} \mu(s)\,ds\right)$ and integrating again on $[t_1, t]$ we deduce .
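The last step of the proof is a standard integrating factor argument. The following worked derivation is a sketch under the assumption that the inequality obtained from (20) has the form $\dot{E}(t) + \mu(t)E(t) \le g(t)$ for $t \ge t_1$ (the displayed inequality itself is not reproduced here, so this form is an assumption consistent with the surrounding text).

```latex
% Assumption: \dot{E}(t) + \mu(t) E(t) \le g(t) for all t \ge t_1,
% with \Gamma(t) = \exp\left( \int_{t_1}^{t} \mu(s)\, ds \right), so that
% \dot{\Gamma}(t) = \mu(t)\Gamma(t) and \Gamma(t_1) = 1.
\begin{align*}
\frac{d}{dt}\bigl( \Gamma(t) E(t) \bigr)
  &= \Gamma(t)\bigl( \dot{E}(t) + \mu(t) E(t) \bigr)
   \le \Gamma(t)\, g(t), \\
\Gamma(t) E(t) - E(t_1)
  &\le \int_{t_1}^{t} \Gamma(s)\, g(s)\, ds, \\
E(t)
  &\le \frac{E(t_1)}{\Gamma(t)}
     + \frac{1}{\Gamma(t)} \int_{t_1}^{t} \Gamma(s)\, g(s)\, ds .
\end{align*}
```

The two terms on the right-hand side are exactly the quantities whose asymptotic behaviour is examined afterwards: the first decays as soon as Γ grows, while the second requires comparing the growth of the integral with that of Γ.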
Now that we have the estimate for our energy function E, we would like to proceed with the main goal of this section, namely,

Theorem 6. Let $x : [t_0, +\infty) \to H$ be a solution of (1). Then for any $t \ge t_0$ and the trajectory x(t) converges strongly to $x^*$ as soon as $\lim_{t \to +\infty}$ Using the definition of E we obtain By the definition of the proximal mapping The second result immediately follows from the ε(t)-strong convexity of $\varphi_{\varepsilon(t),\lambda(t)}$: and thus Finally, by $\lim_{t \to +\infty} \varepsilon(t) = 0$ and (10) we deduce the strong convergence of the trajectories to $x^*$ as soon as $\lim_{t \to +\infty}$

Further analysis for the general parameters choice

In this section we will show, in the most general setting, when the results of Theorem 6 make sense, namely, when all the quantities on the right-hand side converge to zero. Let us notice that since we can simplify a bit the analysis of this section, due to

The asymptotic behaviour of the function Γ
Let us start with the function $\Gamma(t) = \exp\left(\int_{t_1}^{t} \mu(s)\,ds\right)$. Since ε(t) is positive for all $t \ge t_1 \ge t_0$, the integral is nonnegative and Γ is bounded below by 1. Using the property of the Tikhonov function, namely, $\lim_{t \to +\infty} \varepsilon(t) = 0$, we deduce that Let us recall the form of the energy function where $\frac{\alpha}{2} \le \gamma < \alpha$. Let us study the behaviour of the function as t → +∞. Since $g(t)\Gamma(t) \ge 0$ for all $t \ge t_1$, so is the integral $\int_{t_1}^{t} g(s)\Gamma(s)\,ds$. If there exists a constant such that $0 \le \int_{t_1}^{t} g(s)\Gamma(s)\,ds \le \text{const}$, then E(t) goes to zero as t → +∞ due to the properties of h and Theorem 5. Otherwise, we may apply L'Hôpital's rule to obtain if the latter exists, which we are going to show now. Consider So, by (21) we deduce Again, by (15) we know that So, again using (21), we deduce that $\lim_{t \to +\infty} E(t) = 0$.

The asymptotic behaviour of the function E/ε
Let us assume additionally that In the same spirit let us analyse the asymptotic behaviour of $\frac{E(t)}{\varepsilon(t)}$ as t → +∞. From Theorem 5 we know that .
By (22) we immediately deduce that $\lim_{t \to +\infty} \frac{E(t_1)}{\varepsilon(t)\Gamma(t)} = 0$. For the first term let us use the same technique as in the previous section and apply L'Hôpital's rule to obtain Thus, we have established that $\lim_{t \to +\infty}$

The asymptotic behaviour of the function λE
In this section we need to assume the full set of conditions (22) again.We will study the behaviour of λ(t)E(t), as t → +∞.Again, from Theorem 5 we know that We immediately obtain that E(t 1 )λ(t) = +∞ by ( 8) and ( 22).

Analysis of the conditions
Let us gather in this section all the conditions that we made in our analysis and show that they all could be satisfied at the same time at least for the polynomial choice of parameters.

Polynomial choice of parameters
Let us take $\lambda(t) = t^l$ and $\varepsilon(t) = \frac{1}{t^d}$, with l ≥ 0 and d > 0. The set of the conditions in this case becomes (i) $\lim_{t \to +\infty} t^{l-d} = 0$; There exist $\frac{\alpha}{2} \le \gamma < \alpha$, a > 0 and c > 0 such that for all t large enough The conditions above are, in turn, equivalent to (i) l < d; (ii) d ≤ 2; (iii) 2γ(α − γ) < 1; (iv) d ≥ 1; and (v) is always satisfied for t large enough.
Finally, we deduce for l and d

Remark 2. Condition (iii) does not contradict the choice of γ, namely, γ could be chosen to satisfy both of them at the same time: Indeed, (iii) implies 2 is always positive and we are free to choose γ such that α and thus we take , $\Gamma(t) = \exp\left(\int_{t_1}^{t} \mu(s)\,ds\right)$ and $\mu(t) = (\alpha - \gamma)\sqrt{\varepsilon(t)} - \frac{\dot{\varepsilon}(t)}{2\varepsilon(t)}$. Now, let us deduce the actual rates of convergence of the function values and trajectories for the same polynomial choice of parameter functions $\lambda(t) = t^l$ and $\varepsilon(t) = \frac{1}{t^d}$, l ≥ 0 and d > 0.
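The exponent restrictions stated for the polynomial choice can be checked mechanically. The sketch below uses a hypothetical helper, `check_polynomial_parameters`, that only encodes the conditions on l and d spelled out in the text, namely l ≥ 0, d > 0, (i) l < d, (ii) d ≤ 2 and (iv) d ≥ 1. The conditions involving γ, a and c are deliberately not checked, since their exact form is not reproduced here.

```python
def check_polynomial_parameters(l, d):
    """Check the exponent conditions for lambda(t) = t**l, eps(t) = t**(-d).

    Only the conditions on l and d are encoded; those on gamma, a and c
    are outside the scope of this sketch.
    """
    checks = {
        "l >= 0": l >= 0,
        "d > 0": d > 0,
        "(i) l < d": l < d,
        "(ii) d <= 2": d <= 2,
        "(iv) d >= 1": d >= 1,
    }
    return all(checks.values()), checks

# Illustrative exponent pairs (assumed values).
ok_admissible, _ = check_polynomial_parameters(1.0, 1.9)
ok_border, _ = check_polynomial_parameters(0.0, 2.0)
ok_l_too_large, report = check_polynomial_parameters(2.0, 1.5)
ok_d_too_small, _ = check_polynomial_parameters(0.5, 0.8)
```

The returned dictionary makes it easy to see which requirement fails for a given pair; for (l, d) = (2, 1.5) it is condition (i).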

The functions µ and Γ
Let us consider the case when 1 ≤ d < 2. The case d = 2 will be treated separately. The function µ thus writes as follows where goes to zero exponentially as time goes to infinity, due to 1 ≤ d < 2.
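For the polynomial Tikhonov function the functions µ and Γ can be computed in closed form. The derivation below is a sketch that assumes the reading $\mu(t) = (\alpha - \gamma)\sqrt{\varepsilon(t)} - \frac{\dot{\varepsilon}(t)}{2\varepsilon(t)}$ and $\Gamma(t) = \exp\left(\int_{t_1}^{t}\mu(s)\,ds\right)$ of the definitions used in the analysis, with $\varepsilon(t) = t^{-d}$.

```latex
% Assumption: \varepsilon(t) = t^{-d}, so \sqrt{\varepsilon(t)} = t^{-d/2}
% and \dot{\varepsilon}(t)/\varepsilon(t) = -d/t.
\begin{align*}
\mu(t) &= (\alpha - \gamma)\, t^{-d/2} + \frac{d}{2t}, \\[4pt]
\text{for } 1 \le d < 2: \quad
\Gamma(t) &= \left( \frac{t}{t_1} \right)^{d/2}
  \exp\!\left( \frac{\alpha - \gamma}{1 - d/2}
  \left( t^{\,1 - d/2} - t_1^{\,1 - d/2} \right) \right), \\[4pt]
\text{for } d = 2: \quad
\mu(t) &= \frac{\alpha - \gamma + 1}{t}, \qquad
\Gamma(t) = \left( \frac{t}{t_1} \right)^{\alpha - \gamma + 1}.
\end{align*}
```

For 1 ≤ d < 2 the exponent $1 - d/2$ is positive, so $1/\Gamma(t)$ decays faster than any power of t, whereas for d = 2 the decay of $1/\Gamma(t)$ is only polynomial. This is why the borderline case d = 2 needs a separate treatment.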

The function g
First notice that , where Let us notice that the behaviour of , as t → +∞, since, as we have established earlier, 1 ≤ d < 2 and 0 ≤ l < d.

Integrating the product Γg
The technique used in this section is inspired by [3]. First of all, notice that for some δ > 0 Secondly, there exists δ such that starting from some $t_2 \ge t_1$ it holds that l t 3d 2 +1−l . Thus, where

Finalizing the estimates
Let us return to This expression converges to zero at a speed of the slowest decaying term (all the other decay exponentially): .
Thus, there exists a constant C 3 > 0 such that for all t ≥ t 2 .

The rates themselves
Now we can deduce the actual rates for the quantities in Theorem 6. For all $t \ge t_2$ Again, there exist constants $C_4, C_5 > 0$ such that for all $t \ge t_2$

The rates of convergence of the function values in case d = 2

This particular case is of great interest, as it is in a way a borderline case, in which one cannot show the strong convergence of the trajectories but can still show the fast convergence of the values. In this case the functions µ and Γ are where $C_1 = 4\gamma(2a + c\gamma)(l + 1)^2$. Thus, where . By Theorem 5 we have . We know that $\frac{\alpha}{2} \le \gamma < \alpha$ and 0 ≤ l < 2. Thus, in the brackets the term with $t^{-2}$ is dominating as t → +∞. Moreover, α − γ + 1 > 1. So, the behaviour of the entire expression depends on the value of α. There exists a constant $C_4$ such that for all $t \ge t_1$

We notice that a faster growing function λ implies faster convergence of the Moreau envelope of the objective function Φ.
Increasing the speed of decay of the Tikhonov function ε for a fixed l = 1 accelerates the convergence of the Moreau envelope values, as predicted by the theory.

As we can see, when the Tikhonov term is absent the trajectories converge to the minimizer 1 of Φ, whereas the Tikhonov term ensures convergence to the minimal norm solution 0.
Finally, for the same choice of λ and Φ let us take different Tikhonov functions to study their effect on the trajectories of (1). For this purpose we move the starting point to $x(t_0) = 100$. As we see, the faster ε decays, the slower the trajectories converge, which is consistent with the theoretical results.
To end this section let us break some of the fundamental conditions of our analysis and show that there is no convergence of the trajectories in this case.

The author is immensely grateful to Professor R.I. Boţ for valuable comments and fruitful discussions, which significantly improved the quality of this manuscript.

Appendix
Lemma 7. Let S be a non-empty subset of a real Hilbert space H and x : [0, +∞) → H a given map. Assume that 0 − y and thus $y = x^*$ by the characterization of $x^*$, namely, for $x^* \in \operatorname{argmin} \Phi$ and $\forall z \in \operatorname{argmin} \Phi$ it holds that $\langle z - x^*, 0 - x^* \rangle \le 0$.

The asymptotic behaviour of the function E

The rates of convergence of the Moreau envelope values

Let us consider the following objective function Φ : R → R, $\Phi(x) = |x| + \frac{x^2}{2}$, and plot the values of its Moreau envelope for different polynomial functions λ and ε in order to illustrate the theoretical results with some numerical examples. We set $\lambda(t) = t^l$ and $\varepsilon(t) = \frac{1}{t^d}$ with $x(t_0) = x_0 = 10$, $\dot{x}(t_0) = 0$, α = 10, β = 1 and $t_0 = 1$. Consider different Moreau envelope parameters λ with d = 1.9:
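An experiment of this kind can be sketched in a few lines. The code below is a rough illustration and not the author's implementation. It assumes that (1) has the form ẍ + α√ε(t)·ẋ + β·d/dt ∇Φ_λ(t)(x) + ∇Φ_λ(t)(x) + ε(t)·x = 0 (the system is not displayed in this text, so this form is an assumption), it sets β = 0 to avoid discretizing the Hessian-driven term, and it uses a plain semi-implicit Euler scheme with an assumed step size. It takes Φ(x) = |x| + x²/2, whose proximal operator is a scaled soft-thresholding. All function names are hypothetical.

```python
import math

def phi(x):
    """Objective Phi(x) = |x| + x**2 / 2, with minimal value Phi* = 0."""
    return abs(x) + 0.5 * x * x

def prox_phi(x, lam):
    """Proximal operator of lam * Phi: soft-thresholding scaled by 1/(1+lam)."""
    return math.copysign(max(abs(x) - lam, 0.0), x) / (1.0 + lam)

def moreau_env(x, lam):
    """Moreau envelope of Phi with parameter lam."""
    p = prox_phi(x, lam)
    return phi(p) + (x - p) ** 2 / (2.0 * lam)

def grad_moreau_env(x, lam):
    """Gradient of the Moreau envelope: (x - prox(x)) / lam."""
    return (x - prox_phi(x, lam)) / lam

def simulate(l=1.0, d=1.9, alpha=10.0, x0=10.0, v0=0.0,
             t0=1.0, t_end=50.0, h=1e-3):
    """Semi-implicit Euler for the ASSUMED system with beta = 0:
    x'' + alpha*sqrt(eps(t))*x' + grad Phi_{lam(t)}(x) + eps(t)*x = 0."""
    t, x, v = t0, x0, v0
    n_steps = int(round((t_end - t0) / h))
    for _ in range(n_steps):
        lam = t ** l          # smoothing parameter lambda(t) = t^l
        eps = t ** (-d)       # Tikhonov function eps(t) = 1 / t^d
        acc = -alpha * math.sqrt(eps) * v - grad_moreau_env(x, lam) - eps * x
        v += h * acc          # update velocity first
        x += h * v            # then position with the new velocity
        t += h
    return t, x, moreau_env(x, t ** l)

t_final, x_final, env_final = simulate()
env_initial = moreau_env(10.0, 1.0)
```

Comparing `env_final` with `env_initial` for several values of l and d gives a crude picture of how the parameters affect the decay of the Moreau envelope values. A faithful reproduction of the experiments would need β = 1 and a more careful integrator.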

Figure 3: The role of Tikhonov term.

Figure 5: l and d do not meet the requirements.

The precise rates of convergence of the values and trajectories

That leads us to the following rates for all $t \ge t_1$ As we can see, the strong convergence of the trajectories can no longer be shown. Nevertheless, for $C_5 = 2C_4 + \frac{\|x^*\|^2}{2}$ we deduce for all $t \ge t_1$ Φ λ(t) (x(t)) − Φ * ≤ C 5 t 2+ Since we are free to choose γ such that $\frac{\alpha}{2} \le \gamma < \alpha$, and since we want to have rates as fast as possible, we should take $\gamma = \frac{\alpha}{2}$. Here we have to consider several cases.

1. If 0 < α < 2, then there exists $C_6$ such that for all $t \ge t_1$ .

2. If α ≥ 2, then there exists $C_6$ such that for all $t \ge t_1$ $\Phi_{\lambda(t)}(x(t)) - \Phi^* \le \frac{C_6}{t^2}$, $\Phi\left(\mathrm{prox}_{\lambda(t)\Phi}(x(t))\right) - \Phi^* \le$

Remark 3. It is probably possible to show the weak convergence of the trajectories to a minimizer of the objective function in case d = 2.