Solving structured nonsmooth convex optimization with complexity O(ε^{-1/2})

This paper describes an algorithm for solving structured nonsmooth convex optimization problems using the optimal subgradient algorithm (OSGA), a first-order method with the complexity O(ε^{-2}) for Lipschitz continuous nonsmooth problems and O(ε^{-1/2}) for smooth problems with Lipschitz continuous gradients. If the nonsmoothness of the problem is manifested in a structured way, we reformulate the problem so that it can be solved efficiently by a new setup of OSGA (called OSGA-O) with the complexity O(ε^{-1/2}). Further, to solve the reformulated problem, we equip OSGA-O with an appropriate prox-function for which the OSGA-O subproblem can be solved either in a closed form or by a simple iterative scheme, which decreases the computational cost of applying the algorithm to large-scale problems. We show that applying the new scheme is feasible for many problems arising in applications. Some numerical results are reported, confirming the theoretical foundations.


Introduction
Subgradient methods are a class of first-order methods developed to solve convex nonsmooth optimization problems, dating back to the 1960s; see, e.g., Polyak (1987) and Shor (1985). In general, they only need function values and subgradients; they not only inherit the basic features of general first-order methods, such as low memory requirements and a simple structure, but are also able to deal with every convex optimization problem. They are suitable for solving convex problems with a large number of variables, say several millions. Although these features make them very attractive for applications involving high-dimensional data, they usually suffer from slow convergence, which ultimately limits the attainable accuracy. In 1983, Nemirovsky and Yudin (1983) derived the worst-case complexity bound of first-order methods for several classes of problems to achieve an ε-solution, which is O(ε^{-2}) for Lipschitz continuous nonsmooth problems and O(ε^{-1/2}) for smooth problems with Lipschitz continuous gradients. The slow convergence of subgradient methods means that they often reach an ε-solution in a number of iterations close to this worst-case bound.
In Nemirovsky and Yudin (1983), it was proved that the subgradient, subgradient projection, and mirror descent methods attain the optimal complexity of first-order methods for solving Lipschitz continuous nonsmooth problems; here, the mirror descent method is a generalization of the subgradient projection method, cf. Beck and Teboulle (2003) and Beck et al. (2010). Nesterov (2006, 2011) proposed some primal-dual subgradient schemes, which attain the complexity O(ε^{-2}) for Lipschitz continuous nonsmooth problems. Juditsky and Nesterov (2014) proposed a primal-dual subgradient scheme for uniformly convex functions with an unknown convexity parameter, which attains a complexity close to the optimal bound. Nesterov (1983), and later Nesterov (2004), proposed gradient methods for solving smooth problems with Lipschitz continuous gradients that attain the complexity O(ε^{-1/2}). He also proposed, in Nesterov (2005a, b), smoothing methods for structured nonsmooth problems. Smoothing methods have also been studied by many authors; see, e.g., Beck and Teboulle (2012), Boţ and Hendrich (2013, 2015), and Devolder et al. (2012).
In many fields of applied science and engineering, such as signal and image processing, geophysics, economics, machine learning, and statistics, there are many applications that can be modeled as convex optimization problems in which the objective is a composite of a smooth function with Lipschitz continuous gradients and a nonsmooth function; see Ahookhosh (2016) and references therein. Studying this class of problems using first-order methods has dominated the convex optimization literature in recent years. Nesterov (2013, 2015) proposed some gradient methods for solving composite problems that attain the complexity O(ε^{-1/2}). For this class of problems, other first-order methods with the complexity O(ε^{-1/2}) have been developed by Auslender and Teboulle (2006), Beck and Teboulle (2012), Chen et al. (2014, 2015, 2017), Devolder et al. (2013), Gonzaga and Karas (2013), Gonzaga et al. (2013), Lan (2015), Lan et al. (2011), and others.

Preliminaries and notation
Let V be a finite-dimensional vector space endowed with the norm ‖·‖, and let V* denote its dual space, formed by all linear functionals on V, where the bilinear pairing ⟨g, x⟩ denotes the value of the functional g ∈ V* at x ∈ V. The associated dual norm of ‖·‖ is defined by ‖g‖* := sup{⟨g, x⟩ : x ∈ V, ‖x‖ ≤ 1}. For grouped variables, we write x = (x_{g_1}, ..., x_{g_m}), where the groups g_1, ..., g_m partition the components of x. For a function f : V → R ∪ {+∞}, dom f := {x ∈ V : f(x) < +∞} denotes its effective domain, and f is called proper if dom f ≠ ∅ and f(x) > −∞ for all x ∈ V. Let C be a subset of V. In particular, if C is a box, we denote it by C = [x̲, x̄], in which x̲ and x̄ are the vectors of lower and upper bounds on the components of x, respectively. The vector g ∈ V* is called a subgradient of f at x if f(x) ∈ R and

f(z) ≥ f(x) + ⟨g, z − x⟩ for all z ∈ V.

The set of all subgradients is called the subdifferential of f at x and is denoted by ∂f(x). If f : V → R is nonsmooth and convex, then Fermat's optimality condition for the nonsmooth convex optimization problem min_{x∈C} f(x) reads

0 ∈ ∂f(x) + N_C(x), (1)

where N_C(x) is the normal cone of C at x defined by

N_C(x) := {p ∈ V* : ⟨p, z − x⟩ ≤ 0 for all z ∈ C}. (2)

The proximal-like operator prox_C^{λf}(y) is the unique optimizer of the optimization problem

min_{x∈C} (1/2)‖x − y‖² + λ f(x), (3)

where λ > 0. From (1), the first-order optimality condition of (3) is given by

0 ∈ u − y + λ ∂f(u) + N_C(u), (4)

which for C = V reduces to

0 ∈ u − y + λ ∂f(u), (5)

giving the classical proximity operator. A function f is called strongly convex with the convexity parameter σ > 0 if and only if

f(z) ≥ f(x) + ⟨g, z − x⟩ + (σ/2)‖z − x‖² for all x, z ∈ V, (6)

where g denotes any subgradient of f at x, i.e., g ∈ ∂f(x).
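As a concrete illustration (ours, not from the paper), the proximity operator (3) with C = V and f = ‖·‖₁ has the well-known closed form of componentwise soft thresholding; a minimal sketch:

```python
import numpy as np

def prox_l1(y, lam):
    """prox of f(x) = ||x||_1: the unique minimizer of 0.5*||x - y||^2 + lam*||x||_1.
    The optimality condition 0 in u - y + lam*subdiff(u) gives soft thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

u = prox_l1(np.array([3.0, -0.5, 1.2]), 1.0)   # approximately [2., 0., 0.2]
```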
The subdifferential of φ(x) = ‖W x‖ is given in the next result, for an arbitrary norm ‖·‖ in R^n and a matrix W ∈ R^{m×n}. For a proof of this result, see Proposition 2.1.17 in Ahookhosh (2015).
In the next example, we show how Proposition 1 is applied to φ = ‖·‖_∞, which will be needed in Sect. 4. The subdifferentials of other norms on R^n can be computed with Proposition 1 in the same way.
Example 2 We use Proposition 1 to derive the subdifferential of φ = ‖·‖_∞ at an arbitrary point x. We first recall that the dual norm of ‖·‖_∞ is ‖·‖_1. If x = 0, we set ∂φ(0) = {g ∈ R^n : ‖g‖_1 ≤ 1}; if x ≠ 0, then ∂φ(x) = conv{sign(x_j) e_j : |x_j| = ‖x‖_∞}, where e_j denotes the j-th standard unit vector.
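A quick numerical sanity check of this subdifferential (our illustration): the sketch below returns one element of ∂‖·‖_∞ by picking a single index attaining the maximum, and the subgradient inequality can then be verified directly.

```python
import numpy as np

def subgrad_linf(x):
    """One subgradient of phi(x) = ||x||_inf. For x != 0, sign(x_j)*e_j for any
    index j attaining max |x_j| lies in the subdifferential; for x = 0 any g
    with ||g||_1 <= 1 works, so we return 0."""
    if not np.any(x):
        return np.zeros_like(x)
    g = np.zeros_like(x)
    j = int(np.argmax(np.abs(x)))
    g[j] = np.sign(x[j])
    return g

x = np.array([1.0, -3.0, 2.0])
g = subgrad_linf(x)   # one subgradient: [0., -1., 0.]
```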

A review of optimal subgradient algorithm (OSGA)
In this section, we briefly review the main idea of the optimal subgradient algorithm (OSGA) proposed by Neumaier (2016). To this end, we first consider the convex constrained minimization problem

min_{x∈C} f(x), (7)

where f : C → R is a proper and convex function defined on a nonempty, closed, and convex subset C of V. The aim is to derive a solution x ∈ C using first-order black-box information, i.e., function values and subgradients. OSGA (see Algorithm 2) is an optimal subgradient algorithm for the problem (7) that constructs a sequence of iterates whose function values converge to the minimum with the optimal complexity. The primary objective is to monotonically reduce bounds on the error term f(x_b) − f* of the function values, where f* := f(x*), x* is a minimizer of (7), and x_b is the best known point.
In detail, OSGA considers a linear relaxation of f defined by

f(z) ≥ γ + ⟨h, z⟩ for all z ∈ C, (8)

where γ ∈ R and h ∈ V*, and a continuously differentiable prox-function Q : C → R satisfying (6) with σ = 1 and

Q(z) ≥ Q_0 > 0 for all z ∈ C. (9)

Moreover, OSGA requires an efficient routine for finding a maximizer u := U(γ, h) and the optimal objective value η := E(γ, h) of the auxiliary problem

sup_{z∈C} E_{γ,h}(z), (10)

where it is known that the supremum η is positive and the function E_{γ,h} : C → R is defined by

E_{γ,h}(z) := −(γ + ⟨h, z⟩)/Q(z). (11)

In Neumaier (2016), it is shown that OSGA attains the bound on function values

0 ≤ f(x_b) − f* ≤ η Q(x*).

Hence, by decreasing the error factor η, convergence to an ε-minimizer x_b is guaranteed for some target tolerance ε > 0. In Neumaier (2016), it is shown that the number of iterations needed to achieve this is O(ε^{-1/2}) for smooth f with Lipschitz continuous gradients and O(ε^{-2}) for Lipschitz continuous nonsmooth f, which is optimal in both cases, cf. Nemirovsky and Yudin (1983). The algorithm does not need to know the global Lipschitz parameters and has a low memory requirement. Hence, if the subproblem (10) can be solved efficiently, it is appropriate for solving large-scale problems. In the next section, we show that OSGA can solve some structured nonsmooth problems with the complexity O(ε^{-1/2}). Moreover, it is shown that by selecting a suitable prox-function Q, the subproblem (10) can be solved efficiently for this class of problems.
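The auxiliary problem (10) is cheap for simple C. As a hedged illustration (our assumptions: C = V and the quadratic prox-function Q(z) = Q_0 + ½‖z‖², not the paper's general setup), maximizing E_{γ,h}(z) = −(γ + ⟨h, z⟩)/Q(z) gives u = −h/η with η the positive root of a quadratic:

```python
import numpy as np

def osga_subproblem(gamma, h, Q0=1.0):
    """Sketch: maximize E(z) = -(gamma + h@z) / (Q0 + 0.5*||z||^2) over z in R^n.
    Stationarity gives u = -h/eta, where eta solves
    Q0*eta^2 + gamma*eta - 0.5*||h||^2 = 0 (positive root)."""
    hh = float(h @ h)
    eta = (-gamma + np.sqrt(gamma**2 + 2.0 * Q0 * hh)) / (2.0 * Q0)
    return -h / eta, eta

def E(z, gamma, h, Q0=1.0):
    return -(gamma + h @ z) / (Q0 + 0.5 * z @ z)
```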
As discussed in Neumaier (2016), to update the given parameters α, h, γ, η, and u, OSGA uses the following scheme:

Algorithm 1: PUS (parameters updating scheme)

If the best function value f(x_b) is stored and updated, then each iteration of OSGA only requires the computation of two function values f(x) and f(x') (Lines 6 and 11) and one subgradient g(x) (Line 6).
Algorithm 2: OSGA (optimal subgradient algorithm)

Structured nonsmooth convex optimization

We consider structured convex optimization problems of the form

min_{x∈C} f(Ax, φ(x)), (12)

where f : U × R → R is a proper and convex function that is smooth with Lipschitz continuous gradients with respect to both arguments and monotone increasing with respect to the second argument, A : V → U is a linear operator, C ⊆ V is a simple convex domain, and φ : V → R is a simple nonsmooth, real-valued, and convex loss function. This class of convex problems generalizes the composite problem considered in Nesterov (2013, 2015). As discussed in Sect. 2, OSGA attains the complexity O(ε^{-2}) for this class of problems. Hence we aim to reformulate the problem (12) in such a way that OSGA attains the complexity O(ε^{-1/2}). We reformulate the problem (12) in the form

min f(Ax, ξ) subject to x ∈ C, φ(x) ≤ ξ, (13)

where the minimization is over (x, ξ) ∈ V × R. By the assumptions on f, the reformulated objective is smooth and has Lipschitz continuous gradients. OSGA can handle problems of the form (13) with the complexity O(ε^{-1/2}) at the price of adding a functional constraint to the feasible domain. As a prototypical application, consider the linear inverse problem

y = Ax + ν, (17)

where x ∈ R^n is the original object, y ∈ R^m is an observation, and ν ∈ R^m is an additive or impulsive noise. The objective is to recover x from y by solving (17).
In practice, this problem is typically underdetermined and ill-conditioned, and ν is unknown. Hence x is typically recovered by solving one of the minimization problems (18), (19), or (20), where (18) is a smooth least-squares model, (19) is the ℓ1-regularized least-squares problem, and (20) is the elastic net problem. These problems can be reformulated in the form (13) by choosing f and φ accordingly.
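To make the reformulation concrete, here is a small, hypothetical instance (our own illustration): the ℓ1-regularized least-squares model rewritten with an auxiliary scalar ξ bounding φ(x) = ‖x‖₁. Since the smooth objective is increasing in ξ, the constraint is active at any minimizer, so both formulations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))   # hypothetical data
y = rng.standard_normal(5)
lam = 0.1

def f_composite(x):
    # nonsmooth composite model: 0.5*||Ax - y||^2 + lam*||x||_1
    return 0.5 * np.linalg.norm(A @ x - y) ** 2 + lam * np.linalg.norm(x, 1)

def f_smooth(x, xi):
    # smooth reformulation, minimized subject to ||x||_1 <= xi
    return 0.5 * np.linalg.norm(A @ x - y) ** 2 + lam * xi

x = rng.standard_normal(8)
# on the boundary xi = ||x||_1 the two models coincide
```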

New setup of optimal subgradient algorithm (OSGA-O)
This section describes the subproblem (10) for a problem of the form (13). To this end, we introduce a suitable prox-function and employ it to derive an inexpensive solution of the subproblem. We generally assume that the domain C is simple enough that η and (u, υ) can be computed cheaply, in O(n log n) operations, say. We use a quadratic prox-function Q of the form (24) on V × R, which is strongly convex with the convexity parameter 1; since Q_0 > 0, we get Q(x, ξ) > 0, so Q is a prox-function. We now replace the linear relaxation (8) by a relaxation that is linear in both x and ξ, with coefficients γ ∈ R, h ∈ V*, and h̄ ∈ R. Using this linear relaxation and the prox-function (24), the subproblem (10) is rewritten in the form

sup E_{γ,h,h̄}(x, ξ), (26)

where the supremum is taken over the feasible set of (13). Let (u, υ) ∈ V × R be a maximizer of (26) and η = E_{γ,h,h̄}(u, υ). The next result gives a bound on the error f(x_b, ξ_b) − f*, which is important for providing the complexity analysis of OSGA-O.
Proposition 5 The maximizer (u, υ) of (26) and the associated η satisfy (29) and (30).

Proof The problem (26) and the definition (27) imply that the function defined in (27) is nonnegative and vanishes at (x, ξ) = (u, υ), i.e., the identity (29) holds. Since (u, υ) maximizes (26), the first-order optimality condition holds for all (x, ξ) ∈ C × R with φ(x) ≤ ξ, giving the results.
The next result gives a systematic way of solving the OSGA subproblem (26) for problems of the form (13).
Theorem 6 Let (u, υ) ∈ V × R be a maximizer of (26). Furthermore, η and λ can be computed by solving the two-dimensional system of equations (33).

Proof From Proposition 5, at the maximizer (u, υ), we obtain (34) and (35). We conclude the proof in two parts. First, assuming that the two inequalities (36) hold, we prove (35): from φ(x) ≤ ξ and ηυ + h̄ ≥ 0, we obtain (35). We now assume (35) and prove (36): the inequality ηυ + h̄ ≥ 0 holds, since otherwise, by selecting ξ large enough, we get a contradiction with (35); since φ(x) ≤ ξ, the second inequality of (36) also holds.
In the second part, by setting x = u and υ = φ(u), we see that u is a solution of the minimization problem (32). The first-order optimality condition (1) of this problem leads to (37). On the other hand, we write the first-order optimality condition (4) for the problem (32). By comparing (37) and (38) and setting y = −η^{-1} h and λ = υ + η^{-1} h̄, we conclude that both problems have the same minimizer u. Since υ = φ(u), we obtain (39). Using this and substituting υ = φ(u) in (34), η and λ are found by solving the system of nonlinear equations (33). This completes the proof.
In Theorem 6, if C = V, the problem (32) reduces to the classical proximity operator u = prox_{λφ}(y) defined in (3). Hence, the problem (32) is called proximal-like. The word "simple" in the definition of C therefore means that the problem (32) can be solved efficiently, either in a closed form or by an inexpensive iterative scheme.
To illustrate Theorem 6, we give the following example.
Example 7 Let us consider the ℓ1-regularized least-squares problem (19). Then the problem can be reformulated in the form (13) with φ = ‖·‖_1 (see Ahookhosh (2015)). Substituting this into (33) gives a two-dimensional system of nonsmooth equations that can be reformulated as a nonlinear least-squares problem; see, e.g., Pang and Qi (1993).
In view of Theorem 6, we now provide a systematic way of solving the OSGA-O subproblem (26), which is summarized in the next scheme.
Algorithm 3: SUS (subproblem solver for OSGA-O)
1 begin
2   solve the system of nonlinear equations (39) approximately by a nonlinear solver to find η and λ;
3   compute u by solving the proximal-like problem (32);
4 end
To implement Algorithm 3 (SUS), we need a reliable nonlinear solver to deal with the system of nonlinear equations (39) and a routine solving the proximal-like problem (32) effectively. In Sect. 4, we investigate solving the proximal-like problem (32) for some practically important loss functions φ. Algorithm 2 requires two solutions of the subproblem (26) (u in Line 6 and u' in Line 10), which are provided by Line 3 of SUS (with similar notation for u').

Convergence analysis and complexity
In this section, we establish the complexity bounds of OSGA-O for Lipschitz continuous nonsmooth problems and smooth problems with Lipschitz continuous gradients. We also show that if f is strictly convex, the sequence generated by OSGA-O converges to the minimizer.
To guarantee the existence of a minimizer for OSGA-O, we assume the following conditions.

(H1) The objective function f is proper and convex;
(H2) the upper level set N_f(x_0, ξ_0) := {(x, ξ) : f(x, ξ) ≤ f(x_0, ξ_0)} is bounded.

Since f is convex, the upper level set N_f(x_0, ξ_0) is closed, and since V × R is a finite-dimensional vector space, (H2) implies that the upper level set N_f(x_0, ξ_0) is convex and compact. It follows from the continuity and properness of the objective function f that it attains its global minimizer on the upper level set N_f(x_0, ξ_0). Therefore, there is at least one minimizer (x*, ξ*).
Since the underlying problem (13) is a special case of the problem (7) considered by Neumaier (2016), the complexity results of OSGA-O are the same as those of OSGA.
Theorem 8 Suppose that f − μQ is convex for some μ ≥ 0. Then the following complexity bounds hold.

Proof Since all assumptions of Theorems 4.1 and 4.2, Propositions 5.2 and 5.3, and Theorem 5.1 of Neumaier (2016) are satisfied, the results remain valid.
Indeed, if a nonsmooth problem can be reformulated as (13) with a nonsmooth loss function φ, then OSGA-O can solve the reformulated problem with the complexity O(ε^{-1/2}) for an arbitrary accuracy parameter ε. The next result shows that the sequence generated by OSGA-O converges to the minimizer if f is strictly convex.

Proof Since f is strictly convex, the minimizer (x*, ξ*) is unique. By (x*, ξ*) ∈ int C, there exists a small δ > 0 such that the corresponding δ-neighborhood is contained in C. Now, Theorem 8 implies that the algorithm attains an ε_δ-solution of (13) in a finite number κ of iterations. Hence, after κ_1 iterations, the best point lies in this neighborhood. To prove this statement by contradiction, we suppose the contrary; then there exists λ_0 such that the corresponding function values stay bounded away from the minimum. It follows from (41) and the strict convexity of f that this is impossible, giving the result.

Solving proximal-like subproblem
In this section, we show that the proximal-like problem (32) can be solved in a closed form for many special cases appearing in applications. To this end, we first consider unconstrained problems (C = V) and then study some problems with simple constrained domains (C ≠ V). Although finding proximal points is a mature area in convex nonsmooth optimization (cf. Combettes and Pesquet (2011); Parikh and Boyd (2013)), we here address the solution of several proximal-like problems of the form (32) appearing in applications that, to the best of our knowledge, have not been studied in the literature.

Unconstrained examples (C = V)
We here consider several interesting unconstrained proximal problems appearing in applications and explain how the associated OSGA-O auxiliary problem (32) can be solved.
In recent years, interest in regularization with weighted norms has increased with many emerging applications; see, e.g., Daubechies et al. (2010) and Rauhut and Ward (2016). Let d be a vector in R^n such that d_i ≠ 0 for i = 1, ..., n. Then we define the weight matrix D := diag(d), which is a diagonal matrix with D_{i,i} = d_i for i = 1, ..., n. It is clear that D is an invertible matrix. The next two results show how to compute a solution of the problem (32) for special cases of φ arising frequently in applications.
Proposition 11 Let D := diag(d), where d ∈ R^n and d_i ≠ 0 for i = 1, ..., n. If φ(x) = ‖Dx‖_2, the proximity operator (32) is given by prox_{λφ}(y) = 0 if ‖D^{-1}y‖_2 ≤ λ, and otherwise componentwise, for i = 1, ..., n, in terms of τ̄, the unique solution of a one-dimensional nonlinear equation ψ(τ) = 0.

Proof The optimality condition (5) shows that u = prox_{λφ}(y) if and only if 0 ∈ u − y + λ∂φ(u). We consider two cases. In the nontrivial case ‖D^{-1}y‖_2 > λ, we define the function ψ : ]0, +∞[ → R accordingly, where it is clear that ψ is decreasing. It can be deduced from ‖D^{-1}y‖_2 > λ and the mean value theorem that there exists τ̄ ∈ ]0, +∞[ such that ψ(τ̄) = 0, giving the result.
We here emphasize that if D = I (where I denotes the identity matrix), then the proximity operator for φ(·) = ‖·‖_2 is given by prox_{λφ}(y) = (1 − λ/‖y‖_2)_+ y, cf. Parikh and Boyd (2013). If one solves the equation ψ(τ) = 0 approximately, and an initial interval [a, b] is available such that ψ(a)ψ(b) < 0, then a solution can be computed to an ε-accuracy using the bisection scheme in O(log_2((b − a)/ε)) iterations; see, e.g., Neumaier (2001). However, it is preferable to use a more sophisticated zero finder such as the secant bisection scheme (Algorithm 5.2.6 in Neumaier (2001)). If an interval [a, b] with a sign change is available, one can also use the MATLAB fzero function, which combines the bisection scheme, inverse quadratic interpolation, and the secant method.
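A minimal sketch (ours) of the two ingredients just discussed: the closed-form prox for the unweighted case D = I, and a plain bisection zero finder of the kind the text recommends for ψ(τ) = 0 (shown here on a generic function with a sign-changing bracket [a, b], since that is all bisection needs).

```python
import numpy as np

def prox_l2(y, lam):
    """prox of phi = ||.||_2 with D = I: (1 - lam/||y||_2)_+ * y."""
    ny = np.linalg.norm(y)
    return np.zeros_like(y) if ny <= lam else (1.0 - lam / ny) * y

def bisect(psi, a, b, tol=1e-12):
    """Find a zero of psi in [a, b], assuming psi(a)*psi(b) < 0; this takes
    O(log2((b - a)/tol)) iterations."""
    fa = psi(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = psi(m)
        if fa * fm <= 0:
            b = m           # zero lies in [a, m]
        else:
            a, fa = m, fm   # zero lies in [m, b]
    return 0.5 * (a + b)
```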
Grouped variables typically appear in high-dimensional statistical learning problems.For example, in data mining applications, categorical features are encoded by a set of dummy variables forming a group.Another interesting example is learning sparse additive models in statistical inference, where each component function can be represented using basis expansions and thus can be treated as a group.For such problems (see (Liu et al. 2010) and references therein), it is more natural to select groups of variables instead of individual ones when a sparse model is preferred.
In the following two results, we show how the proximity operator prox_{λφ}(·) can be computed for the mixed norms φ(·) = ‖·‖_{1,2} and φ(·) = ‖·‖_{1,∞}, which are especially important in the context of sparse optimization and sparse recovery with grouped variables.
Case (ii). Let ‖y_{g_i}‖_2 > λ. Then, by Case (i), u_{g_i} ≠ 0. From Proposition 1, we obtain (45) for i = 1, ..., m with ‖y_{g_i}‖_2 > λ. Then (45) and (46) imply u_{g_i} = μ_i y_{g_i}. Substituting this into the previous identity and solving it with respect to μ_i yields μ_i = 1 − λ/‖y_{g_i}‖_2, completing the proof.
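The casewise formula above amounts to blockwise soft thresholding. A small sketch (our illustration, assuming the groups partition the coordinates):

```python
import numpy as np

def prox_group_l2(y, groups, lam):
    """prox of phi(x) = sum_i ||x_{g_i}||_2: each block g_i is shrunk by
    u_{g_i} = (1 - lam/||y_{g_i}||_2)_+ * y_{g_i} (zero when the norm is <= lam)."""
    u = np.zeros_like(y)
    for g in groups:
        block = y[g]
        n = np.linalg.norm(block)
        if n > lam:
            u[g] = (1.0 - lam / n) * block
    return u
```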
Then, the proximity operator (32) is given by (prox_{λφ}(y))_{g_i} for i = 1, ..., m, as follows. Since φ is separable with respect to the grouped variables, we fix the index i ∈ {1, ..., m}. The optimality condition (5) shows that u_{g_i} = prox_{λφ}(y_{g_i}) if and only if

0 ∈ u_{g_i} − y_{g_i} + λ ∂‖u_{g_i}‖_∞. (51)

We now consider two cases.

Case (i). Let ‖y_{g_i}‖_1 ≤ λ. Then we show that u_{g_i} = 0 satisfies (51). If u_{g_i} = 0, the subdifferential of φ derived in Example 2 is the unit ℓ1 ball; substituting this into (51) gives the claim.

Case (ii). Let ‖y_{g_i}‖_1 > λ. From Case (i), we have u_{g_i} ≠ 0. We show that the vector defined in (52), with I_{g_i} defined in (49), satisfies (51). Using the subdifferential of φ derived in Example 2, there exist coefficients β^j_{g_i}, for j ∈ I_{g_i}, such that (53) and (54) hold. Let u_{g_i} be the vector defined in (52), with coefficients chosen as in (55). We show that the choice (55) satisfies (53). We first show ‖u_{g_i}‖_∞ > 0: this follows from (48) and (50) if k_i < n, and from ‖y_{g_i}‖_1 > λ if k_i = n. Using (52) and (55), we see that (53) is satisfied componentwise. It remains to show that (54) holds. From (50), we have |y_j| ≤ ‖y_{g_i}‖_∞ for j ∈ I_{g_i}; this and (55) yield β^j_{g_i} ≥ 0 for j ∈ I_{g_i}. It can be deduced from (48) that (54) holds, giving the results.
Then, the proximity operator (32) is given accordingly.

Proof The proof is straightforward from Proposition 13 by setting m = 1, n_1 = n, y_{g_1} = y, and I_{g_1} = I.

Constrained examples (C ≠ V)
In this section, we consider the subproblem (32) and show how it can be solved for some specific φ and C. More precisely, we solve the minimization problem (32), where φ(x) is a simple convex function and C is a simple domain. We consider a few examples of this form.
Proposition 15 Let φ(x) = ‖Dx‖_1 and C = [x̲, x̄], where D is a diagonal matrix. Then the global minimizer of the subproblem (32) is given componentwise, for i = 1, ..., n, in terms of the quantity ω(y, λ).

Proof The optimality condition (4) shows that u = prox_C^{λφ}(y) if and only if

0 ∈ u − y + λ ∂φ(u) + N_C(u), (62)

where N_C(u) is the normal cone of C at u defined in (2). We show that u = 0 if and only if ω(y, λ) ≤ 0. Indeed, (62) suggests u = 0 if and only if there exists p ∈ N_C(0) ∩ (y − λ∂φ(0)). By Proposition 1, this is possible if and only if the minimum of the corresponding problem is nonpositive. The solution of this problem is p = y − λ|D1|, where 1 is the vector of all ones. Hence, the minimum of this problem is given by (61). This implies u = 0 if and only if ω(y, λ) ≤ 0. We therefore consider two cases.

Case (i). u = 0. Then we have ω(y, λ) ≤ 0.
Case (ii). u ≠ 0. Then ω(y, λ) > 0. Proposition 1 yields, by induction on the nonzero elements of u, g_i u_i = |d_i u_i| for i = 1, ..., n. This implies that g_i = |d_i| sign(u_i) if u_i ≠ 0. This and the definition of N_C(u) yield the componentwise formula for i = 1, ..., n; in Case (c), we end up with u_i = 0, completing the proof.
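Because the objective separates across coordinates, the casewise formula of Proposition 15 can be evaluated by soft thresholding followed by clipping into the box; each one-dimensional restriction is convex, so clamping its unconstrained minimizer into the interval is valid. A hedged sketch (ours):

```python
import numpy as np

def prox_box_weighted_l1(y, d, lam, lo, hi):
    """Componentwise solution of min_{lo_i <= u_i <= hi_i} 0.5*(u_i - y_i)^2
    + lam*|d_i|*|u_i|: soft-threshold, then clamp into [lo_i, hi_i]."""
    t = lam * np.abs(d)
    u = np.sign(y) * np.maximum(np.abs(y) - t, 0.0)   # unconstrained minimizer
    return np.clip(u, lo, hi)
```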
Then, the global minimizer of the subproblem (32) is given componentwise; this and the definition of N_C(u) imply the claim. Similarly, the global minimizer of the subproblem (32) in the next case is determined by (67) for i = 1, ..., n, where ω(y, λ) is defined by (61).
The optimality condition (4) shows that u = prox_C^{λφ}(y) if and only if (69) holds. By (69), we have u = 0 if and only if there exists p ∈ N_C(0) ∩ (y − λ_2 ∂φ(0)), where N_C(0) is defined by (63). By Proposition 1, this is possible if and only if the minimum of the corresponding problem is nonpositive. The solution of this problem is p = y − λ_2 |D1|, where 1 is the vector of all ones.
Let x ≥ 0 denote nonnegativity constraints. These constraints are important in many applications, especially if x describes physical quantities; see, e.g., Esser et al. (2013) and Kaufman and Neumaier (1996, 1997). Since nonnegativity constraints can be regarded as a special case of a bound-constrained domain, Propositions 15, 16, and 17 can be used to derive the corresponding results for nonnegativity constraints.

Experiment with random data
We consider solving an underdetermined system

Ax = y, (71)

where A ∈ R^{m×n} (m < n) and y ∈ R^m. Underdetermined systems of linear equations appear frequently in applications of linear inverse problems, such as those in the fields of signal and image processing, geophysics, economics, machine learning, and statistics. The objective is to recover x from the observed vector y and the matrix A by some optimization model. Due to the ill-conditioned nature of the problem, the most popular optimization models are (18), (19), and (20), where (18) is smooth and (19) and (20) are nonsmooth. In Sect. 5.1.1, we report numerical results for the ℓ1 minimization problem (19), and in Sect. 5.1.2, we give results for the elastic net minimization problem (20). We set m = 5000 and n = 10000, and the data A, y, and x_0 for problem (19) are randomly generated using rand, which generates uniformly distributed random numbers between 0 and 1; x_0 is a starting point for the algorithms.
We divide the solvers into two classes: (i) proximal-based methods (PGA, FISTA, NESCO, and NESUN) that can be directly applied to nonsmooth problems; (ii) subgradient-based methods (NSDSG, NES83, NESCS, and NES05), in which the nonsmooth first-order oracle is required, where NES83, NESCS, and NES05 are adapted to take a subgradient in place of the gradient. Here a_i (i = 1, 2, ..., n) denotes the i-th column of A. In the implementation, NESCS, NES05, PGA, and FISTA use the Lipschitz estimate 10^4 L, and NSDSG employs α_0 = 10^{-7}. Algorithm 1, for both OSGA and OSGA-O, uses the parameters δ = 0.9, α_max = 0.7, κ = κ' = 0.5, and the prox-function (24) with Q_0 = (1/2)‖x_0‖^2 + ϵ, where ϵ is the machine precision. All numerical experiments were executed on a PC with an Intel Core i7-3770 CPU (3.40 GHz) and 8 GB RAM. To solve the nonlinear system of equations (33), we first consider the nonlinear least-squares problem (40) and solve it by the MATLAB internal function fminsearch, which is a derivative-free solver handling both smooth and nonsmooth problems. In our implementation, we apply OSGA-O to the problem, stop it after 100 iterations, and save the best function value attained (f_s); we then run the other solvers until either the same function value is achieved or the number of iterations reaches 5000. In our comparison, N_i and T denote the total number of iterations and the running time, respectively.
To display the results, we use the Dolan and Moré performance profile (Dolan and Moré 2002) with the performance measures N_i and T. In this procedure, the performance of each algorithm is measured by the ratio of its computational outcome to the best outcome of all algorithms. This performance profile offers a tool to statistically compare the performance of algorithms. Let S be the set of all algorithms and P be the set of test problems. For each problem p and algorithm s, t_{p,s} denotes the computational outcome with respect to the performance index, which is used in the definition of the performance ratio

r_{p,s} := t_{p,s} / min{t_{p,s'} : s' ∈ S}. (72)

If an algorithm s fails to solve a problem p, the procedure sets r_{p,s} := r_failed, where r_failed should be strictly larger than any performance ratio (72). Let n_p be the number of problems in the experiment. For any factor τ ∈ R, the overall performance of an algorithm s is given by

ρ_s(τ) := (1/n_p) |{p ∈ P : r_{p,s} ≤ τ}|.

In Fig. 4, we can see that the worst results are obtained by NSDSG and PGA, while FISTA, NESCO, NESUN, NES83, NESCS, NES05, and OSGA behave competitively. Further, OSGA-O performs significantly better than the others.
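The procedure above can be sketched in a few lines (our illustration; entries of t_{p,s} marked NaN denote failures and receive the ratio r_failed):

```python
import numpy as np

def performance_profile(T, tau_grid, r_failed=1e6):
    """Dolan-More profile. T[p, s] is the measure (N_i or running time) of solver s
    on problem p. Returns rho[s, k] = fraction of problems with r_{p,s} <= tau_grid[k]."""
    T = np.asarray(T, dtype=float)
    best = np.nanmin(T, axis=1, keepdims=True)       # best outcome per problem
    r = np.where(np.isnan(T), r_failed, T / best)    # ratios r_{p,s}; failures huge
    return np.array([[np.mean(r[:, s] <= tau) for tau in tau_grid]
                     for s in range(T.shape[1])])
```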

Sparse recovery (compressed sensing)
In recent years, there has been increasing interest in finding sparse solutions of many problems using structured models in various areas of applied mathematics. In most cases, the problem involves high-dimensional data with a small number of available measurements, where the core of the problem is an optimization problem of the form (19) or (20). Thanks to the sparsity of solutions and the structure of the problems, these optimization problems can be solved in reasonable time even for extremely high-dimensional data sets. Sparse recovery, basis pursuit, lasso, wavelet-based deconvolution, and compressed sensing are some examples, where the latter has received much attention in recent years, cf. Candés (2006) and Donoho (2006). Let us consider a linear inverse problem of the form (71), which we solve with the minimization problems (19) and (20). We set n = 4096 and m = 1024. The problem is generated by the same procedure as in the GPSR package (Figueiredo et al. 2007). OSGA-O attains the best performance compared with the others for both ℓ1 minimization and elastic net problems.

Conclusions
This paper discusses the solution of structured nonsmooth convex optimization problems with the complexity O(ε^{-1/2}), which is optimal for smooth problems with Lipschitz continuous gradients. First, if the nonsmoothness of the problem is manifested in a structured way, the problem is reformulated so that the objective is smooth with Lipschitz continuous gradients, at the price of adding a functional constraint to the feasible domain. Then, a new setup of the optimal subgradient algorithm (OSGA-O) is developed to solve the reformulated problem with the complexity O(ε^{-1/2}).
Next, it is proved that the OSGA-O auxiliary problem is equivalent to a proximal-like problem, which is well studied due to its appearance in Nesterov-type optimal methods for composite minimization. For several problems appearing in applications, either an explicit formula or a simple iterative scheme for solving the corresponding proximal-like problems is provided.
Finally, some numerical results with random data and a sparse recovery problem are given, indicating good behavior of OSGA-O compared to some state-of-the-art first-order methods and confirming the theoretical foundations.
(i) (Nonsmooth complexity bound) If the points generated by Algorithm 2 stay in a bounded region of the interior of C, or if f is Lipschitz continuous in C, the total number of iterations needed to reach a point with f(x, ξ) ≤ f(x*, ξ*) + ε is at most O((ε^2 + με)^{-1}). Thus the asymptotic worst-case complexity is O(ε^{-2}) when μ = 0 and O(ε^{-1}) when μ > 0.
(ii) (Smooth complexity bound) If f has Lipschitz continuous gradients with Lipschitz constant L, the total number of iterations needed by Algorithm 2 to reach a point with f(x, ξ) ≤ f(x*, ξ*) + ε is at most O(ε^{-1/2}) when μ = 0.

Fig. 3 Performance profiles for the number of iterations N_i and the running time T for the elastic net problem: a, b display the results for N_i and T of PGA, FISTA, NESCO, NESUN, OSGA, and OSGA-O; c, d, respectively, illustrate the results for N_i and T of NSDSG, NES83, NESCS, NES05, OSGA, and OSGA-O. In all of these subfigures, OSGA-O attains the best results with respect to both measures N_i and T.
