A SUBGRADIENT METHOD WITH CONSTANT STEP-SIZE FOR ℓ1-COMPOSITE OPTIMIZATION

Abstract. Subgradient methods are the natural extension to the non-smooth case of the classical gradient descent for regular convex optimization problems. However, in general they are characterized by slow convergence rates, and they require decreasing step-sizes to converge. In this paper we propose a subgradient method with constant step-size for composite convex objectives with ℓ1-regularization. If the smooth term is strongly convex, we can establish a linear convergence result for the function values. This fact relies on an accurate choice of the element of the subdifferential used for the update, and on proper actions adopted when non-differentiability regions are crossed. Then, we propose an accelerated version of the algorithm, based on conservative inertial dynamics and on an adaptive restart strategy, that is guaranteed to achieve a linear convergence rate in the strongly convex case. Finally, we test the performance of our algorithms on some strongly and non-strongly convex examples.


Introduction
In this paper we deal with convex composite optimization, i.e., we consider objective functions f : R^n → R of the form

f(x) = g(x) + h(x),   (0.1)

where g : R^n → R is C^1-regular with Lipschitz-continuous gradient, and h : R^n → R ∪ {+∞} is a non-smooth convex function. We recall that the concept of composite function was introduced by Nesterov in [14], and it usually denotes the splitting (0.1) in the case that the non-regular term h is simple. In this framework, possible examples of simple functions include, e.g., the indicator of a closed convex set, or the supremum of a finite family of linear functions. The problem of minimizing such composite functions can be effectively addressed by means of forward-backward methods (see, e.g., [7]) and their accelerated versions [4]. In this regard, we mention the recent contribution [20], where an accelerated method is considered that achieves linear convergence when g, h in (0.1) are strongly convex.
The aim of this paper is to develop a convergent subgradient method with constant step-size for the minimization of particular instances of (0.1). The subgradient method was first introduced in [24] and, given an initial guess x_0 ∈ R^n, the algorithm produces a sequence (x_k)_{k≥0} with update rule

x_{k+1} = x_k − h_k v_k,   k ≥ 0,   (0.2)

where v_k ∈ ∂f(x_k), i.e., it is an element taken from the subdifferential of the objective at the point x_k, and h_k > 0 denotes the step-size. If we set ν_k = h_k |v_k|_2, we can equivalently rephrase (0.2) as x_{k+1} = x_k − ν_k v_k/|v_k|_2, where ν_k represents the step-length at the k-th iteration. It is possible to deduce the convergence lim_{k→∞} f(x_k) = f(x*) as soon as (ν_k)_{k≥0} satisfies lim_{k→∞} ν_k = 0 and ∑_{k=1}^∞ ν_k = ∞ (see [25, Chapter 2]). In [19, Theorem 5.2] a construction for (ν_k)_{k≥1} is proposed that achieves f(x_k) − f(x*) = o(1/√k) as k → ∞ when the value f(x*) is known a priori. We insist on the fact that, in the results mentioned above, the vector v_k can be any element of ∂f(x_k). If we now consider constant step-sizes, i.e., h_k = h > 0 for every k ≥ 0, in general we cannot expect the convergence of the iterates of (0.2) to a minimizer. For instance, given the one-dimensional function f : x ↦ |x|, for every choice h > 0, if the initial guess x_0 ∉ {rh : r ∈ Z}, then the sequence produced by (0.2) oscillates and remains well-separated from 0.
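The oscillation phenomenon described above is easy to reproduce numerically. The following sketch (our illustration, not taken from the paper) runs the constant step-size subgradient iteration on f(x) = |x|; the values h = 0.3 and x_0 = 0.8 are arbitrary choices with x_0 ∉ {rh : r ∈ Z}, and the iterates end up trapped in a 2-cycle that never approaches the minimizer 0.

```python
def subgradient_step(x, h):
    """One step x <- x - h*v, with v a subgradient of |.| at x."""
    v = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # v = sign(x); at 0 we pick v = 0
    return x - h * v

h, x = 0.3, 0.8   # x0 = 0.8 is not an integer multiple of h = 0.3
for _ in range(50):
    x = subgradient_step(x, h)
print(x)          # the iterates settle into the 2-cycle {0.2, -0.1} (approximately)
```

With x_0 an exact multiple of h the iterates would instead hit 0 and stop there, which is why the excluded set in the statement matters.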
From this example it is clear that, in order to work out a convergent subgradient method with constant step-size, it is crucial to identify the regions where the objective f is non-differentiable, and to take proper actions when the sequence (x_k)_{k≥0} crosses them. Moreover, in our analysis a role of primary importance is played by the choice of the element v_k ∈ ∂f(x_k) used for the iteration. Subgradient methods with constant step-size have already been considered in the convex optimization literature, and, typically, it is possible to prove that the iterates reach the sublevel set {x ∈ R^n : f(x) ≤ inf f + c}, where the quantity c > 0 is related to the step-size h > 0. In a similar vein, if the objective function is strongly convex, the sequence produced by the algorithm manages to reach a ball centered at the minimizer, whose radius depends on h. For a presentation of these results, we refer the reader to [3, Section 3.2]. Moreover, under suitable assumptions on the growth of f around the minimizer x*, it is possible to prove that the distance of the iterates to x* has a linear decay, up to a certain threshold (that, once again, is estimated as a function of h). For further details, see [11, Theorem 1] and [8, Theorem 4.3]. Finally, we mention the recent contribution [12], where the authors study the stability of a subgradient method with constant step-size around local minimizers, when f is non-smooth and non-convex. To the best of our knowledge, the ones presented here are the first convergence results for a subgradient method with constant step-size.
In this paper, we devote our attention to the case where the non-regular term at the right-hand side of (0.1) consists in the ℓ1-penalization, i.e., where we have h(x) = γ|x|_1 = γ ∑_{i=1}^n |x_i| with γ > 0; that is, we consider the problem of minimizing f(x) = g(x) + γ|x|_1 over R^n. This kind of problem is well-studied, since the presence of the ℓ1-norm induces sparsity in the minimizer, and for this reason such minimization tasks easily arise in real-world applications. For instance, we recall [6] for signal processing applications, [29] for imaging problems, and finally [9, 27] for the ℓ1-regularized logistic regression, which is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing.
In our approach, we take advantage of the structure of the points where the objective f is non-differentiable. We recall that, in the case of ℓ1-penalization, such points coincide with the set ∪_{i=1}^n {x ∈ R^n : x_i = 0}. Hence, at each iteration, if the current value x_k has some null component, i.e., x_k ∈ ∩_{i∈β_{x_k}} {x ∈ R^n : x_i = 0} for some β_{x_k} ⊂ {1, ..., n}, we first decide which hyperplanes {x ∈ R^n : x_i = 0}, i ∈ β_{x_k}, we move parallel to. This choice is automatically done by selecting for the update (0.2) the direction ∂⁻f(x_k), which denotes the element of ∂f(x_k) with minimal Euclidean norm. The interesting situation occurs when some components strictly change sign when moving from x_k to x_k − h∂⁻f(x_k). In that case, we have to properly decide whether to allow (some of) these changes of sign, or to set the corresponding components equal to 0. We stress the fact that this phase is fundamental in order to avoid the oscillations that characterized the one-dimensional example reported above. For this method, described in Algorithm 1, we can establish a linear convergence result as soon as the regular function g : R^n → R appearing at the right-hand side of (0.1) is strongly convex. To show that, we make use of a non-smooth version of the Polyak–Łojasiewicz inequality (see, e.g., [5, 28]). Then, in Section 3, we propose a momentum-based acceleration of Algorithm 1, inspired by the restarted-conservative algorithm introduced in [22]. In the smooth convex framework, the idea of introducing momentum to accelerate the convergence of the classical gradient method dates back to the 1960s, with the works of Polyak [17, 18]. These methods, often called heavy-ball, can be interpreted as discretizations of a second-order damped mechanical system, where the objective function plays the role of the potential energy. In [26] it was shown that also the celebrated Nesterov accelerated gradient method (see [13]) can be interpreted in this framework. This led to a renewed interest in the
interplay between discrete-time optimization algorithms and continuous-time dynamical models. In this context, in the mechanical system, the classical linear and isotropic viscous friction is often replaced by a more general dissipative term. In this regard, we recall the contributions [1, 2, 23]. From the discrete-time side, in [16] the authors empirically observed that adaptively resetting to 0 the momentum variable (i.e., the velocity) can further boost the convergence. Motivated by this fact, in [22] a conservative dynamical model was considered (i.e., without any dissipative term in the dynamics), whose convergence completely relies on a proper restart scheme. In Algorithm 2 we propose, for composite functions with ℓ1-penalization, a new version of the restarted-conservative algorithm that was heuristically outlined in [22], and in Section 3 we show that the per-iteration decay achieved by Algorithm 2 is always greater than or equal to that of Algorithm 1. Finally, in Section 4 we test our algorithms on strongly and non-strongly convex optimization problems with ℓ1-regularization.

Preliminary results
In this section we establish some auxiliary results that will be used later. Given a convex function f : R^n → R, for every x ∈ R^n we denote with ∂f(x) ⊂ R^n the subdifferential of f at the point x. We recall that ∂f(x) := {y ∈ R^n : f(z) ≥ f(x) + ⟨y, z − x⟩ for every z ∈ R^n}.

Definition 1. Let f : R^n → R be a convex function. For every x ∈ R^n, we define the vector

∂⁻f(x) := argmin { |y|_2 : y ∈ ∂f(x) }.   (1.1)

Remark 1. We observe that Definition 1 is always well-posed. Indeed, for every convex function f : R^n → R and for every x ∈ R^n, the subdifferential ∂f(x) is a non-empty, compact and convex subset of R^n. Namely, since we do not allow f to assume the value +∞, this fact descends directly from [15, Theorem 3.1.15]. Moreover, we can equivalently rephrase (1.1) as the minimization of the positive-definite quadratic form y ↦ ½|y|_2² over the convex domain ∂f(x). Hence, we deduce that ∂⁻f(x) is well-defined and that it consists of a single element. Considering this last fact, in this paper we understand ∂⁻f : R^n → R^n as a vector-valued operator, rather than a set-valued mapping.
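For the ℓ1-composite objectives studied in this paper, the operator of Definition 1 can be computed componentwise: away from zero the subdifferential is a singleton, while at a null component the minimal-norm element is the projection of 0 onto an interval, i.e., a soft-threshold of the smooth gradient. A minimal sketch (our illustration; the function name is ours):

```python
import numpy as np

def minimal_norm_subgradient(x, grad_g, gamma):
    """Minimal Euclidean-norm element of ∂f(x) for f = g + gamma*||.||_1,
    given grad_g = ∇g(x)."""
    d = grad_g + gamma * np.sign(x)        # singleton where x_i != 0
    zero = (x == 0)
    # where x_i = 0: project 0 onto the interval [∂_i g(x) - gamma, ∂_i g(x) + gamma]
    d[zero] = np.sign(grad_g[zero]) * np.maximum(np.abs(grad_g[zero]) - gamma, 0.0)
    return d

x = np.array([2.0, 0.0, 0.0])
grad_g = np.array([1.0, 0.5, -3.0])
d = minimal_norm_subgradient(x, grad_g, 1.0)
print(d)  # [ 2.  0. -2.]
```

Note that the second component returns 0: the interval [−0.5, 1.5] contains the origin, which is exactly the situation where the method moves tangentially to the hyperplane {x_2 = 0}.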
We report below a non-smooth version of the celebrated Polyak–Łojasiewicz inequality. We refer the reader to [17] and [15, Theorem 2.1.10] for the classical statement in the smooth case, and to [5, Section 2.3] and [28, Section 2.2] for the extension to non-differentiable functions.
Lemma 1.1. Let f : R^n → R be a µ-strongly convex function, and let x* be its minimizer. Then, for every x ∈ R^n and for every element of the subdifferential y ∈ ∂f(x) the following inequality holds:

f(x) − f(x*) ≤ |y|² / (2µ),   (1.2)

and, in particular, we have

f(x) − f(x*) ≤ |∂⁻f(x)|² / (2µ).

Proof. Let us introduce the auxiliary function ψ : R^n → R defined as ψ(x) := f(x) − (µ/2)|x − x*|². The fact that f is µ-strongly convex guarantees that ψ is still a convex function. Moreover, for every x ∈ R^n we have that ∂ψ(x) = ∂f(x) − µ(x − x*).
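Lemma 1.1 can be sanity-checked numerically. The snippet below (ours, purely for illustration) verifies the inequality f(x) − f(x*) ≤ |∇f(x)|²/(2µ) at random points of a smooth µ-strongly convex quadratic, where the subdifferential reduces to the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 5, 0.7
B = rng.standard_normal((n, n))
A = B.T @ B + mu * np.eye(n)        # Hessian A >= mu*I, so f is mu-strongly convex
f = lambda x: 0.5 * x @ A @ x       # minimizer x* = 0 with f(x*) = 0
grad = lambda x: A @ x

for _ in range(100):
    x = rng.standard_normal(n)
    gap = f(x)                       # equals f(x) - f(x*)
    assert gap <= grad(x) @ grad(x) / (2 * mu)
print("PL inequality verified at 100 random points")
```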
We now introduce the class of functions that will be the main object of our investigation. We consider a composite objective (see [14]) f : R^n → R of the form

f(x) = g(x) + γ|x|_1,   (1.5)

where g : R^n → R is a C^1-regular convex function with Lipschitz-continuous gradient of constant L > 0, and where γ > 0 is a positive constant. We recall that, for every x ∈ R^n,

∂f(x) = ∇g(x) + γ ∑_{i : x_i ≠ 0} sign(x_i) e_i + γ ∑_{i : x_i = 0} [−1, 1] e_i,

where e_i is the i-th element of the standard basis of R^n. If we define ∂_i f(x) := ⟨e_i, ∂f(x)⟩, we have that

∂_i f(x) = ∂_i g(x) + γ sign(x_i) if x_i ≠ 0,   ∂_i f(x) = ∂_i g(x) + γ[−1, 1] if x_i = 0,   (1.7)

for every i = 1, ..., n, where ∂_i g(x) := ∂/∂x_i g(x) denotes the usual partial derivative of the regular term g : R^n → R at the right-hand side of (1.5). From (1.7) we read that the i-th component of ∂f(x) is affected only by x_i. Therefore, in order to compute the operator ∂⁻f : R^n → R^n introduced in Definition 1, we can find separately the element of minimal absolute value of each ∂_i f(x). In particular, for every x ∈ R^n we have that

∂⁻_i f(x) = ∂_i g(x) + γ sign(x_i) if x_i ≠ 0,   ∂⁻_i f(x) = sign(∂_i g(x)) max{|∂_i g(x)| − γ, 0} if x_i = 0.   (1.8)

Moreover, we define the following partition of the components {1, ..., n} induced by the point x:

α⁺_x := {i : x_i > 0},   α⁻_x := {i : x_i < 0},   β_x := {i : x_i = 0}.   (1.9)

From now on, when making use of a partition α_1, ..., α_k of the indexes of the components {1, ..., n}, for every z = (z_1, ..., z_n) ∈ R^n we write z = (z_{α_1}, ..., z_{α_k}), where z_{α_j} ∈ R^{|α_j|} is the vector obtained by extracting from z the components that belong to α_j, i.e., z_{α_j} = (z_i)_{i∈α_j} for every j = 1, ..., k. The next technical result is the key lemma of the convergence proof of Section 2.
Lemma 1.2. Let f : R^n → R be a convex function of the form (1.5). Given x ∈ R^n, let α⁺_x, α⁻_x, β_x be the partition of {1, ..., n} corresponding to the point x and prescribed by (1.9).

Let us consider a vector v ∈ R^n satisfying the conditions (1.10). Then the inequality (1.11) holds, where L > 0 is the Lipschitz constant of the regular term at the right-hand side of (1.5).

Remark 2. We recall that, in the case of a regular convex function φ : R^n → R with L-Lipschitz continuous gradient, for every x, v ∈ R^n we have

φ(x + v) ≤ φ(x) + ⟨∇φ(x), v⟩ + (L/2)|v|²   (1.12)

(see, e.g., [15, Theorem 2.1.5]). The crucial fact for the proof of Lemma 1.2 is that, when v satisfies the conditions (1.10), the segment [x, x′] lies in a region where the restriction of the objective f is regular, where we set x′ := x + v. Lemma 1.2 will be used to prove that, along proper directions, the objective function f is decreasing.
Proof. Before proceeding, we introduce another partition of the set of indexes β_x into β⁺_x, β⁻_x and β⁰_x, according to the sign of the corresponding components of v. Let us define the auxiliary function f_aux : R^n → R as in (1.13)–(1.14), where g : R^n → R is the smooth term at the right-hand side of (1.5). We observe that the function f_aux : R^n → R is as regular as g, i.e., it is of class C^1 with L-Lipschitz continuous gradient. Indeed, the first term at the right-hand side of (1.14) is obtained as the composition ∇g ∘ Π_{ζ⁰}, where Π_{ζ⁰} : R^n → R^n is the linear (1-Lipschitz) orthogonal projection onto the subspace {z ∈ R^n | z_{ζ⁰} = 0} ⊂ R^n. Moreover, the last terms at the right-hand side of (1.14) are constant. Therefore, applying the estimate (1.12) to f_aux, we deduce the desired bound, and the thesis follows if we show that the equalities (1.15) hold. Using the partition of the components {1, ..., n} provided by the families of indexes α⁺_x, α⁻_x, β⁺_x, β⁻_x, and β⁰_x, we have the following possibilities. If i ∈ α⁺_x ∪ α⁻_x, the claim follows in virtue of (1.8) and (1.13). If i ∈ β⁺_x, then x_i = 0 and v_i > 0 and, in virtue of (1.10), we deduce that ∂⁻_i f(x) < 0; in particular, using again (1.8), this implies the claim, recalling the expression of f_aux in (1.13). The case i ∈ β⁻_x is analogous. Finally, if i ∈ β⁰_x, then v_i = 0, and the claim is immediate. This argument shows that (1.15) is true, and it concludes the proof.

Subgradient method and convergence analysis
In this section we propose a subgradient method with constant step-size for the numerical minimization of a convex function f : R^n → R with the composite structure reported in (1.5). We insist on the fact that the analysis presented here holds only when the non-smooth term at the right-hand side of (1.5) is an ℓ1-penalization. Before introducing the algorithm formally, we provide some insights that have guided us towards its construction. Let x ∈ R^n be the current guess for the minimizer of f. We want to find a suitable direction in the subdifferential v ∈ ∂f(x) such that f(x − hv) ≤ f(x), where h > 0 represents a constant step-size. In order to accomplish this, a natural choice consists in setting v = ∂⁻f(x), where ∂⁻f(x) is defined as in (1.1). To see this, we first observe that, in virtue of the particular structure of ∂f reported in (1.7), we can choose separately the components v_1, ..., v_n of the direction of the movement. If x_i ≠ 0, then ∂_i f(x) consists of a single element, hence the only possible choice is v_i = ∂_i g(x) + γ sign(x_i). If x_i = 0 and 0 ∈ ∂_i f(x), then any choice v_i ≠ 0 moves the i-th component away from 0, resulting in an increase of the objective function. For this reason, it is convenient to set v_i = ∂⁻_i f(x) = 0, and to move tangentially to the non-differentiability region {x ∈ R^n | x_i = 0}. On the other hand, if x_i = 0 and, e.g., ∂⁻_i f(x) > 0, then ∂_i f(x) ⊂ (0, +∞), and for every choice of v_i ∈ ∂_i f(x) we have that x_i − hv_i = −hv_i < 0.
However, we observe that x_i − hv_i → 0 as h → 0, so that for small step-sizes such a component lands close to the hyperplane {x ∈ R^n | x_i = 0}. Besides the selection of the direction v = ∂⁻f(x), the second crucial aspect is whether some sign changes occur in the coordinates when moving from x to x − h∂⁻f(x). If not, the situation is quite analogous to a step of the classical gradient descent in the smooth framework. On the other hand, if there is, e.g., a positive component x_i that becomes negative, then we should carefully decide whether the barrier {x ∈ R^n | x_i = 0} should be crossed or not. This is a key point in order to avoid the oscillations that characterized the simple example in the Introduction. In this case, we first set to 0 the components involved in a sign change, and for these components we re-evaluate ∂⁻f. Finally, using this additional information, we complete the step, as depicted in Figure 1.

Figure 1. A 2D example where the second component x_2 changes sign during the update: depending on the re-evaluated direction, the step is completed either by moving tangentially to the axis {x_2 = 0}, or by crossing it.

The implementation of the method is described in Algorithm 1.

Algorithm 1: subgradient method with constant step-size for ℓ1-composite optimization (full listing omitted; the text below refers to its line numbers).
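The two-phase step just described can be sketched as follows. This is our reading of the procedure in the text, not the authors' implementation; in particular, the rule that a component stays at 0 when the re-evaluated minimal-norm direction vanishes is our interpretation of lines 9-13 of Algorithm 1.

```python
import numpy as np

def soft(g, gamma):
    """Componentwise minimal-|.| element of g + gamma*[-1, 1] (soft-threshold)."""
    return np.sign(g) * np.maximum(np.abs(g) - gamma, 0.0)

def min_norm_subgrad(x, grad_g, gamma):
    d = grad_g(x) + gamma * np.sign(x)
    zero = (x == 0)
    d[zero] = soft(grad_g(x)[zero], gamma)
    return d

def algorithm1_step(x, grad_g, gamma, h):
    d = min_norm_subgrad(x, grad_g, gamma)
    x_temp = x - h * d
    crossed = x * x_temp < 0               # components that strictly flip sign
    if not crossed.any():
        return x_temp
    x_prime = x_temp.copy()
    x_prime[crossed] = 0.0                 # first stop on the hyperplane
    d2 = min_norm_subgrad(x_prime, grad_g, gamma)
    x_prime[crossed] -= h * d2[crossed]    # re-evaluated direction at x'
    return x_prime

# toy lasso-type objective g(x) = 0.5*|x - b|^2 (so L = 1), gamma = 0.5
b = np.array([1.0, -0.001])
grad_g = lambda x: x - b
gamma, h = 0.5, 1.0
f = lambda x: 0.5 * np.sum((x - b) ** 2) + gamma * np.abs(x).sum()

x0 = np.array([2.0, 0.05])
x1 = algorithm1_step(x0, grad_g, gamma, h)
print(x1, f(x1) < f(x0))
```

In this toy run the second component would cross zero, and since the re-evaluated minimal-norm direction vanishes there, the component is kept at 0 rather than being allowed to oscillate.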
We now establish the linear convergence result for Algorithm 1 in the case of a strongly convex objective.
Theorem 2.1. Let f : R^n → R be a function such that f(x) = g(x) + γ|x|_1 for every x ∈ R^n, where γ > 0 and g : R^n → R is C^1-regular. We further assume that there exist constants L > µ > 0 such that g is µ-strongly convex and ∇g is L-Lipschitz continuous. Let (x^k)_{k≥0} be the sequence generated by Algorithm 1 with the step-size h = 1/L. Then, there exists κ = κ(L, µ) ∈ (0, 1) such that

f(x^{k+1}) − f(x*) ≤ κ (f(x^k) − f(x*))   for every k ≥ 0,

where x* ∈ R^n denotes the unique minimizer of f.

Proof. We follow the procedure described in Algorithm 1, and we prove that each iteration leads to a linear decrease of the value of the objective function. The first stage of each step is based on the following update:

x^temp := x^k − h ∂⁻f(x^k),   (2.2)

where h > 0 represents the step-size of the subgradient method. We distinguish two possible scenarios, corresponding to the if-else statement at lines 5 and 7 of Algorithm 1.

Case 1. We have that

sign(x^temp_i) · sign(x^k_i) ≥ 0 for every i = 1, ..., n,   (2.3)

i.e., none of the components of x^k and of x^temp changes sign, in the sense that from strictly positive it becomes strictly negative, or vice versa. If we set v := −h∂⁻f(x^k), we observe that the hypotheses of Lemma 1.2 are met for the point x^k and the vector v. Indeed, using the partition introduced in (1.9) and induced by the point x^k, from (2.3) it follows that each v_i satisfies (1.10) by construction. Therefore, from (1.11) we deduce that

f(x^temp) ≤ f(x^k) − h(1 − Lh/2)|∂⁻f(x^k)|².

In this case, we assign x^{k+1} := x^temp and, choosing h = 1/L in order to minimize the right-hand side of the previous inequality, we get f(x^{k+1}) ≤ f(x^k) − (1/2L)|∂⁻f(x^k)|².

Case 2. Recalling the definition of x^temp in (2.2), we are in the second scenario when

∃ i ∈ {1, ..., n} : (x^k_i > 0 and x^temp_i < 0) or (x^k_i < 0 and x^temp_i > 0),   (2.5)

i.e., there is at least one component that strictly changes sign. Before proceeding, we introduce the partition (2.6) of the components, and we define the intermediate points x′ and x″ through the two-phase update (2.7)–(2.9). We observe that (2.7) corresponds to the assignments of lines 9-10 in Algorithm 1, while (2.9) incorporates lines 11-12. Finally, x″ is defined in (2.8) according to line 13. We insist on the fact that in the update (2.8) the vector v″ is computed by re-evaluating ∂⁻_{ξ⁰_{x^k}} f at the point x′. This is because ∂⁻_{ξ⁰_{x^k}} f may exhibit sudden changes when considering the points x^k and x′. In this regard, our construction guarantees that we employ the most trustworthy values for the choice of the descent direction v″. We point out that, if x^k_i = 0, then i ∈ ξ⁰_{x^k}. Moreover, we remark that if i ∈ ξ⁰_{x^k} and ∂_i f(x^k) = 0, then we have necessarily that x^k_i = 0; indeed, in this case, this follows from (2.2).

Phase (1). From (2.7), we immediately observe that x′ can be written componentwise in terms of coefficients η_i, where, for every i ∈ ξ⁰_{x^k}, the coefficient η_i is defined through (2.11). For i ∈ ξ⁰_{x^k}, recalling (2.6) and (2.2), we have x^k_i ∂⁻_i f(x^k) ≥ 0 and, as a matter of fact, η_i ≥ 0. On the other hand, in order to show that η_i ≤ 1, we assume without loss of generality that x^k_i ≠ 0; then, using again (2.11), it follows that η_i ≤ 1. Therefore, we conclude that η_i ∈ [0, 1] for every i ∈ ξ⁰_{x^k}. Finally, from (2.5) we deduce that there exists at least one index î ∈ ξ⁰_{x^k} such that η_î > 0. Using the partition α⁺_{x^k}, α⁻_{x^k}, β_{x^k} of {1, ..., n} induced by the point x^k and prescribed by (1.9), we can check that the conditions (1.10) are satisfied; in particular, if i ∈ ξ⁰_{x^k}, an analogous reasoning as before yields x′_i = 0 and, therefore, v′_i = 0. The previous argument proves that the vector v′ introduced in (2.10) satisfies the assumptions of Lemma 1.2 at the point x^k. Thus, we deduce the estimate (2.13). If we set η := max{η_i | i ∈ ξ⁰_{x^k}}, we observe that (2.13) implies that f(x′) ≤ f(x^k) whenever h ∈ (0, 2/(Lη)). We stress the fact that the condition (2.5) characterizing the present scenario guarantees that η > 0.

Phase (2). We now investigate the update described in (2.8)–(2.9). Let α⁺_{x′}, α⁻_{x′} and β_{x′} be the partition of the components {1, ..., n} induced by the point x′ and prescribed by (1.9). Recalling (2.6) and the definition of x′ in (2.7), we obtain (2.14) in virtue of (2.9). Moreover, using (2.9), (2.2) and (2.6), we deduce (2.15), while (2.16) follows from (2.9). By combining (2.15) and (2.16), we see that the hypotheses of Lemma 1.2 are met when considering the point x′ and the direction v″. Hence, the estimate (2.17) follows. On the other hand, recalling (2.9) and (2.14), we obtain the bound (2.18), where we used the Lipschitz-continuity of ∇g and (2.10). Choosing h = 1/L in (2.17) and owing to (2.18), we deduce a per-iteration decrease; moreover, by combining the last inequality with (2.13) (using again h = 1/L), we obtain (2.19), where we used (2.12) in the last passage. In virtue of Lemma 1.1, from (2.19) we deduce the contraction estimate (2.20).

We now distinguish two possibilities, corresponding to the if-else statement at lines 14 and 16 of Algorithm 1.
Remark 3. The hypothesis of the strong convexity of the smooth function g : R^n → R in Theorem 2.1 can be slightly relaxed by requiring that g : R^n → R is convex, that the objective f : R^n → R admits a minimizer x*, and that there exists a constant µ > 0 such that f satisfies the inequality (1.2) for every x ∈ R^n. Indeed, in the proof of Theorem 2.1 we only employ (1.2), and we do not use the strong convexity assumption. On the other hand, the assumption of convexity for g is needed for the notion of subgradient considered in this paper.
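The linear rate of Theorem 2.1 can be observed on a small separable example. The script below (our illustration; the problem data are arbitrary) runs the two-phase step with h = 1/L on f(x) = ½xᵀAx − bᵀx + γ|x|₁ with diagonal A, for which the minimizer is available in closed form, and records the gap f(x^k) − f(x*):

```python
import numpy as np

A = np.array([1.0, 3.0])            # diagonal Hessian of g: mu = 1, L = 3
b = np.array([2.0, -0.2])
gamma, h = 0.5, 1.0 / 3.0           # constant step-size h = 1/L

f = lambda x: 0.5 * (A * x) @ x - b @ x + gamma * np.abs(x).sum()
soft = lambda g: np.sign(g) * np.maximum(np.abs(g) - gamma, 0.0)
x_star = soft(b) / A                # closed-form minimizer of the separable problem
f_star = f(x_star)

def step(x):
    grad = A * x - b
    d = grad + gamma * np.sign(x)
    zero = (x == 0)
    d[zero] = soft(grad[zero])      # minimal-norm subgradient
    x_new = x - h * d
    crossed = x * x_new < 0
    x_new[crossed] = 0.0            # halt on the hyperplane when a sign flips
    d2 = soft((A * x_new - b)[crossed])
    x_new[crossed] -= h * d2        # re-evaluated direction at the intermediate point
    return x_new

x = np.array([4.0, 2.0])
gaps = []
for _ in range(60):
    x = step(x)
    gaps.append(f(x) - f_star)
print(gaps[0], gaps[-1])            # the gap contracts geometrically
```

On this instance the second component crosses zero at the first iteration and is then correctly kept at 0, while the first component converges linearly, consistently with a contraction factor depending on µ/L.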

Accelerated subgradient method
In this section we propose a momentum-based acceleration of Algorithm 1 for an objective function f : R^n → R with the ℓ1-composite structure introduced in (1.5). As observed in the Introduction, in the smooth-objective framework it is possible to design minimization schemes with momentum by discretizing second-order ODEs of the form

ẍ(t) + A(x(t), t) ẋ(t) + ∇V(x(t)) = 0,   (3.1)

where V : R^n → R represents the objective function, and A(x, t) ∈ R^{n×n} is a positive semidefinite matrix that tunes the generalized viscous friction. In [16] it was noticed that adaptive restart strategies can further accelerate the convergence to the minimizer, since they are capable of eliminating the oscillations typical of under-damped mechanical systems. The term adaptive restart denotes a procedure that resets to 0 the momentum/velocity variable (i.e., p = ẋ in (3.1)) as soon as a suitable condition is satisfied. In [22] a conservative dynamics was considered by dropping the viscosity term, i.e., choosing A(x, t) ≡ 0 in (3.1). Then, using the symplectic Euler scheme (see, e.g., [10]) to discretize the system, the following conservative algorithm was proposed:

p_{k+1} = p_k − h_m ∇V(x_k),   x_{k+1} = x_k + h_m p_{k+1},   (3.2)

where h_m > 0 represents the discretization step-size. In the case of a regular and convex objective V, the conservative scheme (3.2) achieves at each iteration a decrease of the function V greater than or equal to that of the classical gradient descent. This fact relies on the following restart strategy: "reset p_k = 0 whenever ⟨∇V(x_{k+1}), p_k⟩ > 0". In [22] a heuristic extension (3.3) of (3.2) to the case of a non-smooth objective f : R^n → R with ℓ1-composite structure was also investigated. In this section, taking advantage of the observations made in Section 2 for the non-accelerated subgradient method, we propose a variant of the algorithm described in [22, Algorithm 4]. The main differences concern the way we manage the changes of sign in the components, and the condition for the reset of the momentum variable. Indeed, from (3.3) we deduce that

x_{k+1} = x_k − h ∂⁻f(x_k) + √h p_k,   (3.4)

where we set h = h_m².
Therefore, it is natural to divide every step of the accelerated algorithm into two phases:

• q ← x^k − h∂⁻f(x^k) (subgradient phase). If sign changes in the components occur, we adopt the same procedures as in Algorithm 1.

• q′ ← q + √h p^k (momentum phase). Also in this phase, we take particular care of sign changes of the components. Moreover, we use the general principle that "in the momentum phase we do not modify null components". This is motivated by the fact that the momentum variable carries information about the previous values of ∂⁻f. However, since ∂⁻_i f typically undergoes sudden modifications when the i-th component of the state variable x^k vanishes or changes sign, the information contained in p^k_i could be of little use, if not misleading. For this reason, in Algorithm 2 we set p^k_i = 0 if the i-th component of the state variable is null, or if it has been involved in a sign change. See, respectively, line 10 and line 17 of the accelerated subgradient method reported in Algorithm 2.

Finally, in virtue of (3.4) and the remarks above, we observe that a natural choice for the step-size is h = 1/L, where L is the Lipschitz constant of the gradient of the regular term g : R^n → R.
Remark 4. In line 31 of Algorithm 2 we have introduced the quantity ∂̃f(q′). We recall that f(x) = g(x) + γ|x|_1, where g is convex and C^1-regular, and γ > 0. Using the same notations as in Algorithm 2, ∂̃f(q′) = (∂̃_1 f(q′), ..., ∂̃_n f(q′)) is defined componentwise for every i = 1, ..., n. We observe that ∂̃f(q′) is well-defined for every component since, by construction, sign(q_i · q′_i) ≥ 0 for every i = 1, ..., n.

Remark 5. We observe that the computation of the quantity r at line 35 requires an evaluation of the subdifferential of f at the point q′. From a computational viewpoint, the demanding part is the evaluation of the gradient of the regular term, i.e., ∇g(q′). However, if r ≤ 0, then x ← q′ (line 42), and ∇g(q′) can be stored and re-used for the construction of ∂⁻f(x) at the subsequent iteration.
We can prove the following result on the decrease of the objective function f , guaranteeing that, in any circumstance, Algorithm 2 is at least as good as Algorithm 1.
Proposition 3.1. Let f : R^n → R be a function such that f(x) = g(x) + γ|x|_1 for every x ∈ R^n, where γ > 0 and g : R^n → R is a C^1 convex function such that ∇g is L-Lipschitz continuous, with L > 0. Let us consider q_0 ∈ R^n as the initial point, let q′ be the output produced by an iteration of Algorithm 2, and let q be the output of an iteration of Algorithm 1 (see line 29 of Algorithm 2). Then, we have that f(q′) ≤ f(q).

Algorithm 2: accelerated conservative subgradient method for ℓ1-composite optimization (full listing omitted; the text refers to its line numbers).

Remark 6. Under the same assumptions as Theorem 2.1, i.e., when g : R^n → R is µ-strongly convex, from Proposition 3.1 it follows that Algorithm 2 achieves a linear convergence rate. Indeed, if we denote by (x^k)_{k≥0} the sequence generated by Algorithm 2 setting the step-size

h equal to the inverse of the Lipschitz constant of ∇g, then, applying Proposition 3.1 with q_0 = x^k, for every k ≥ 0 we have

f(x^{k+1}) − f(x*) ≤ f(q) − f(x*) ≤ κ (f(x^k) − f(x*)),

where κ ∈ (0, 1) is the constant appearing in Theorem 2.1, and q ∈ R^n is the output of a single iteration of Algorithm 1 with starting point x^k.
Proof. Using the same notations as in Algorithm 2, we have that q is obtained from q_0 with an iteration of Algorithm 1 (see line 19 and line 22 of Algorithm 2). If p = 0, then there is nothing to prove. On the other hand, owing to the if statement at lines 26-30, we have that sign(q′_i · q_i) ≥ 0 for every i = 1, ..., n. We further observe that q′ = q + √h p holds in every case (see line 25 and line 29).

Numerical experiments

Logistic regression with ℓ1-regularization. We considered a sparse logistic regression problem. We constructed x_real ∈ R^n with the following procedure: each component was zero with probability p = 0.8 and, if nonzero, its value was sampled using a standard normal N(0, 1). Then, we independently sampled the entries of b = (b_1, ..., b_m) ∈ {0, 1}^m using the distribution P(b_i = 1) = (1 + exp(⟨M_i, x_real⟩))^{-1} for every i = 1, ..., m, where M_1, ..., M_m ∈ R^n are the rows of a matrix M ∈ R^{m×n} with independent components generated with N(0, 1). Supposing to know the matrix M and the measurements b, the sparse log-likelihood maximization can be formulated as the problem of minimizing the ℓ1-regularized negative log-likelihood, where we set γ = 0.25|∇g(0)|_∞. We used n = 100 and m = 500, and we sampled the components of the initial guess with N(0, 2). This problem is convex but not strongly convex.
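The data-generation procedure for this experiment can be sketched as follows (our reconstruction from the description above; the explicit form of the negative log-likelihood g is written out by us from the stated model, and the random seed and N(0, 2) scale are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p_zero = 100, 500, 0.8

# sparse ground truth: each component null with probability 0.8
x_real = rng.standard_normal(n) * (rng.random(n) >= p_zero)
M = rng.standard_normal((m, n))
# P(b_i = 1) = (1 + exp(<M_i, x_real>))^(-1)
prob = 1.0 / (1.0 + np.exp(M @ x_real))
b = (rng.random(m) < prob).astype(float)

def g(x):
    # negative log-likelihood of the model above (our explicit form)
    t = M @ x
    return np.sum(np.logaddexp(0.0, t) - (1.0 - b) * t)

def grad_g(x):
    t = M @ x
    return M.T @ (1.0 / (1.0 + np.exp(-t)) - (1.0 - b))

gamma = 0.25 * np.max(np.abs(grad_g(np.zeros(n))))  # gamma = 0.25*|grad g(0)|_inf
f = lambda x: g(x) + gamma * np.abs(x).sum()
x0 = rng.normal(0.0, 2.0, n)                        # initial guess, scale as we read it
```

The smooth term g is convex with Lipschitz gradient, but not strongly convex when m-by-n data do not make the Hessian uniformly positive definite along the whole space of interest.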
LogSumExp with ℓ1-regularization. We considered the function f : R^n → R obtained by adding an ℓ1-penalization to the LogSumExp of the affine forms ⟨M_j, x⟩ + b_j, j = 1, ..., k, with smoothing parameter r, where M_1, ..., M_k ∈ R^n are the rows of the matrix M ∈ R^{k×n}, and b ∈ R^k. The entries of M and b were independently sampled using a Gaussian N(0, 1), as well as the components of the starting point. We set r = 5, and we used n = 200 and k = 500. This is another example of a non-strongly convex problem.

We briefly comment on the results of the experiments described above. We observe that the non-accelerated algorithms, i.e., Algorithm 1 and ISTA, always have very similar performance.
Restarted FISTA is the best-performing method in the strongly convex case, while Algorithm 2 seems to be the most efficient with non-strongly convex objectives. Compared to the restarted-conservative scheme of [22], we observe that Algorithm 2 is much faster in the early phases of the minimization process. Finally, the classical subgradient method with diminishing step-size is the least effective scheme.
The fact that the decays achieved by Algorithm 1 and ISTA are almost identical motivated us to construct an example where the difference in performance could be more apparent. We considered a two-dimensional function such that x* = (1, 0) and ∂f(x*) = {0} × [0, ε] for some ε > 0. More precisely, we defined f : R² → R as in (4.1), and we set c = 0.85 and γ = 1. In this case, the correct identification of the fact that the second component of the minimizer is null can be challenging. This is due to the identity ∂_2 f(x*) = [0, ε], or, in other words, to the fact that the vector 0 does not lie in the relative interior of ∂f(x*). In this scenario, when a crossing of the set {x ∈ R² : x_2 = 0} occurs, we expect that Algorithm 1 might better decide whether the component x_2 should be set equal to 0. We used as initial guess the point x_0 = (0.95, 0.5). We also considered a family of problems obtained by perturbing c, γ and x_0 with Gaussian noise of standard deviations equal, respectively, to 0.1, 0.1 and 0.05. The results are reported in Figure 3. We observe that Algorithm 1 achieves better performance than ISTA on the designed problem, and this advantage seems to be robust with respect to the perturbations introduced. Finally, despite using step-sizes that decay faster than in the previous experiments, the classical subgradient method exhibits evident oscillations, both in the original and in the noisy problem.
Figure 2. Experiments with ℓ1-regularization. We compared the performance of Algorithm 1 (red) and Algorithm 2 (black) with ISTA (dashed blue), restarted FISTA (magenta), the conservative-restart scheme proposed in [22] (dashed black), and the classical subgradient method with step-size h_k = 10k^{−1/4}. Each problem was solved 100 times, and the plots were obtained by taking the average. The convergence rate is measured by evaluating the gap f(x_k) − f(x*) at each iteration.

Conclusions
In this paper, we considered composite convex optimization problems with ℓ1-penalization, and we formulated a subgradient algorithm with constant step-size. In the case of strongly convex objectives, we established a linear convergence result for the method. Using dynamical-system considerations, we proposed an accelerated version of the subgradient algorithm that, at each iteration, achieves a decay of the objective always greater than or equal to the decay corresponding to a step of the non-accelerated subgradient method. We observed in numerical experiments that the inertial algorithm can effectively compete with one of the best-performing schemes for this kind of problem, i.e., FISTA combined with an adaptive restart strategy. For future work, it could be interesting to design subgradient algorithms for composite optimization involving a non-smooth term of the form x ↦ |Ax|_1. In this case, a challenging point consists in finding strategies for computing ∂⁻f (or a suitable approximation) that could be practical in high-dimensional settings.

Figure 3. Comparison of Algorithm 1 (red) with ISTA (dashed blue) and the classical subgradient method with step-size h_k = k^{−1}. On the left, we considered the function defined in (4.1), while on the right we solved a perturbed problem. For the second picture, we repeated the test 100 times, and the plots were obtained by taking the average. The convergence rate is measured by evaluating the gap f(x_k) − f(x*) at each iteration.