A hybrid quasi-Newton projected-gradient method with application to Lasso and basis-pursuit denoising

We propose a new algorithm for the optimization of convex functions over a polyhedral set in R^n. The algorithm extends the spectral projected-gradient method with limited-memory BFGS iterates restricted to the present face whenever possible. We prove convergence of the algorithm under suitable conditions and apply the algorithm to solve the Lasso problem, and consequently, the basis-pursuit denoise problem through the root-finding framework proposed by van den Berg and Friedlander (SIAM J Sci Comput 31(2):890–912, 2008). The algorithm is especially well suited to simple domains and could also be used to solve bound-constrained problems as well as problems restricted to the simplex.


Introduction
In this paper we propose an algorithm for optimization problems of the form

  minimize_x f(x) subject to x ∈ C,  (1)

where f : R^n → R is a convex, twice continuously differentiable function, and C is a polyhedral set in R^n. The main focus of the paper is the application and specialization of the framework to the Lasso problem [1]:

  minimize_x (1/2)‖Ax − b‖₂² subject to ‖x‖₁ ≤ τ,  (LS_τ)

and its extension to the weighted one-norm. The work in this paper is motivated by the need for an efficient and accurate solver for the Lasso subproblems appearing in the spgl1 [2] solver for basis-pursuit denoise [3] problems of the form

  minimize_x ‖x‖₁ subject to ‖Ax − b‖₂ ≤ σ.  (BP_σ)

Both the (BP_σ) and (LS_τ) formulations are central to compressed sensing [4,5] as a means of recovering sparse or approximately sparse vectors x₀ from linearly compressed and often noisy observations b = Ax₀ + z. The basis-pursuit denoise formulation (BP_σ) is often a more natural choice in practice compared to the Lasso formulation, as it is parameterized in the noise level σ, rather than in τ, the one-norm of the unknown signal x₀, which may be more difficult to determine.
It was shown in [2] that basis-pursuit denoise and Lasso are connected through the Pareto curve p(τ) := ‖Ax*_τ − b‖₂, and that solving (BP_σ) can be reduced to finding the smallest τ for which the Lasso solution x*_τ satisfies ‖Ax*_τ − b‖₂ ≤ σ. Denoting by τ_σ this critical value of τ, and assuming that b lies in the range space of A, it was shown in [2] that the Pareto curve is convex and differentiable at all τ ∈ [0, τ₀) with derivative p′(τ) = −‖Aᵀr‖_∞/‖r‖₂, where r denotes the misfit Ax*_τ − b. Evaluation of both p(τ) and p′(τ) relies on the misfit r, which can be obtained by solving (LS_τ). The spgl1 solver proposed in [2] applies root finding on the Pareto curve, as illustrated in Fig. 1, to solve p(τ) = σ and thereby reduce basis-pursuit denoise to a series of Lasso problems. In spgl1 these subproblems are solved using the spectral projected-gradient (SPG) algorithm [6], which we discuss in more detail in Sect. 2.
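The root-finding step can be sketched as follows. Here `pareto_root_find` is a hypothetical helper, and the exponential stand-in for the Pareto curve (convex, decreasing, differentiable) only illustrates the Newton update; evaluating the true curve requires solving a Lasso subproblem at each τ.

```python
import math

def pareto_root_find(pareto_eval, sigma, tau0=0.0, tol=1e-10, max_iter=100):
    """Newton root finding for p(tau) = sigma, where pareto_eval(tau)
    returns the pair (p(tau), p'(tau)).  Because p is convex and
    decreasing, the Newton iterates approach tau_sigma from the left."""
    tau = tau0
    for _ in range(max_iter):
        p, dp = pareto_eval(tau)
        if abs(p - sigma) <= tol:
            break
        tau -= (p - sigma) / dp   # Newton update on the Pareto curve
    return tau

# Toy stand-in for the Pareto curve: p(tau) = exp(-tau) is convex,
# decreasing, and differentiable, just like the true curve.
toy = lambda tau: (math.exp(-tau), -math.exp(-tau))
tau_sigma = pareto_root_find(toy, sigma=0.1)
```

In spgl1 the pair (p(τ), p′(τ)) is obtained from the residual of an (approximate) Lasso solve, so each Newton step is itself an optimization problem.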
For certain problems it was observed that SPG generates long sequences of iterates that all lie on the same face of the polyhedral set of feasible points. This suggests the use of an active-set type method in which a quasi-Newton method, such as the limited-memory BFGS (L-BFGS) method [7], is used to minimize the objective restricted to the currently active face. In order to obtain an efficient solver it is important to avoid unnecessarily costly subproblem solves or complicated heuristics to determine when to switch between the solvers. In this paper we therefore propose a hybrid algorithm that switches between the two methods in a seamless and lightweight fashion.

Paper outline
In Sect. 2 we provide a concise background on the SPG and L-BFGS methods along with some of their theoretical properties. We then describe the proposed algorithm for the general problem formulation (1) in Sect. 3. In Sect. 4 we study the geometry of the constraints in the Lasso problem, and develop the tools needed for an efficient implementation of the framework for Lasso. Numerical experiments are provided in Sect. 5, followed by the conclusions in Sect. 6.

Notation and definitions
We use calligraphic capital letters for sets. Given any two sets S₁ and S₂, we write S₁ + S₂ for {x₁ + x₂ | x₁ ∈ S₁, x₂ ∈ S₂}, and likewise for S₁ − S₂. For a seeming lack of established terminology, we define the difference hull of a set S, diff hull(S), as the linear hull of differences, span{u₁ − u₂ | u₁, u₂ ∈ S}. The difference hull can be seen as the linear subspace corresponding to the affine hull of S translated to contain the origin. For any x in a polyhedral set C, we define F(x) to be the unique face F of C for which x ∈ relint(F); this may be C itself. The normal cone of C at x is given by N(x) := {d ∈ R^n | P(x + d) = x}. The normal cones N(x) are identical for all x ∈ relint(F), and we can therefore define the normal cone of a face N(F) as N(x) evaluated at any arbitrary x ∈ relint(F). Orthogonal projection of any vector v ∈ R^n onto C is defined as

  P(v) := arg min_{x ∈ C} ‖x − v‖₂.

Given any point x ∈ relint(F) we are interested in the set of directions d ∈ R^n for which there exists an ε > 0 such that the projection of x + εd lies in F. It can be verified that this set is given by N(x) + diff hull(F). Because the normal cone N(x) is the same for all x ∈ relint(F) we can define the self-projection cone of a face F as S(F) := N(F) + diff hull(F). Note that N(F) ⊥ diff hull(F), or stronger yet, that the difference hull of F is the orthogonal complement of the linear hull of N(F). For any k-face F of C, k ≥ 1, we denote by Φ(F) ∈ R^{n×k} an arbitrary but fixed orthonormal basis for diff hull(F). We will never use Φ(F) when F is a vertex and therefore leave it undefined. We denote by e_i the ith column of an identity matrix whose size is clear from the context. The proximal operator of a function f is defined as prox_f(u) := arg min_x f(x) + (1/2)‖x − u‖₂².

Algorithm 1
The nonmonotone spectral projected-gradient method (SPG), outlined in Algorithm 1, was introduced by Birgin et al. [6] for problems of the form (1), with C a closed convex set in R^n, and f : R^n → R a function with continuous partial derivatives on an open set that contains C. The SPG algorithm is based on the curvilinear projected-gradient method, which performs a line search along the curvilinear trajectory (see also [8]):

  x(α) := P(x_i − α∇f(x_i)), α ≥ 0.

In order to speed up the method, two important modifications were introduced in [6]. The first modification allows a limited level of nonmonotonicity in the objective value. Given μ, γ ∈ (0, 1), the Armijo-type line search starts with an initial step length α_i, and then finds the first nonnegative integer k such that

  f(x(μᵏα_i)) ≤ max_{0 ≤ j < min{i+1, M}} f(x_{i−j}) + γ ⟨∇f(x_i), x(μᵏα_i) − x_i⟩.

The right-hand side of this condition ensures sufficient descent, but only with respect to the maximum of up to M of the most recent objective values. In case M = 1 this reduces to the standard Armijo line-search condition. The second modification is the use of the spectral step length, as proposed by Barzilai and Borwein [9]. Given s_i := x_i − x_{i−1} and y_i := ∇f(x_i) − ∇f(x_{i−1}), the initial step length at iteration i is defined as

  α_i := min{α_max, max{α_min, (s_iᵀs_i)/(s_iᵀy_i)}},

where 0 < α_min < α_max are fixed parameters. More information on the motivation behind this particular choice of step length can be found in [6,10]. It was shown in [6] that any accumulation point x* of the sequence {x_i} lies in C and satisfies

  P(x* − ∇f(x*)) = x*.  (4)

Practical implementations of Algorithm 1 may use a relaxed version of (4) along with other conditions as a stopping criterion.
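A minimal sketch of the method, with backtracking along the projected-gradient trajectory and the spectral step; the `spg` helper and the box-constrained example are illustrative and omit the refinements of the spgl1 implementation.

```python
import numpy as np

def spg(f, grad, project, x0, M=10, mu=0.5, gamma=1e-4,
        a_min=1e-10, a_max=1e10, max_iter=500, tol=1e-8):
    """Minimal nonmonotone SPG sketch for min f(x) s.t. x in C."""
    x, g = x0, grad(x0)
    hist = [f(x0)]
    alpha = 1.0
    for _ in range(max_iter):
        # Stopping test based on the projected gradient (cf. (4)).
        if np.linalg.norm(project(x - g) - x) <= tol:
            break
        f_ref = max(hist[-M:])            # nonmonotone reference value
        a = alpha
        while True:                        # backtracking Armijo-type search
            x_new = project(x - a * g)
            d = x_new - x
            if f(x_new) <= f_ref + gamma * g.dot(d) or a < a_min:
                break
            a *= mu
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s.dot(y)                      # spectral (Barzilai-Borwein) step
        alpha = min(a_max, max(a_min, s.dot(s) / sy)) if sy > 0 else a_max
        x, g = x_new, g_new
        hist.append(f(x))
    return x

# Example: minimize ||x - z||^2 over the box [0, 1]^n.
z = np.array([2.0, -1.0, 0.5])
x = spg(lambda x: np.sum((x - z)**2), lambda x: 2*(x - z),
        lambda x: np.clip(x, 0.0, 1.0), np.zeros(3))
```

The optimum here is simply z clipped to the box, which the projected-gradient stopping test detects after a few iterations.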

Limited-memory BFGS
The L-BFGS algorithm by Liu and Nocedal [7] is a popular quasi-Newton method for unconstrained minimization of smooth functions f : R^n → R. At each iteration, the algorithm constructs a positive-definite approximation H_i of the inverse Hessian of f at x_i, based on an initial positive-definite matrix H⁰ along with m := min{i, N} of the most recent vector pairs {s_{i−j}, y_{i−j}}_{j=0}^{m−1}, where

  s_i := x_{i+1} − x_i, y_i := ∇f(x_{i+1}) − ∇f(x_i),  (5)

and N denotes the memory size for L-BFGS. The iterates in L-BFGS are of the form x_{i+1} = x_i + α_i d_i, where the search direction d_i is given by

  d_i = −H_i ∇f(x_i),  (6)

and the step size α_i is chosen to satisfy the Wolfe conditions:

  f(x_i + αd_i) ≤ f(x_i) + γ₁ α ∇f(x_i)ᵀd_i,  (7a)
  ∇f(x_i + αd_i)ᵀd_i ≥ γ₂ ∇f(x_i)ᵀd_i.  (7b)

Parameters γ₁ and γ₂ are chosen such that 0 < γ₁ < 1/2, and γ₁ < γ₂ < 1. For details on the structure of the inverse approximation H_i and efficient ways of evaluating the matrix-vector product in (6), see [7,11].
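The product H_i ∇f(x_i) is typically evaluated with the standard two-loop recursion rather than by forming H_i explicitly. The sketch below applies it to a small quadratic, using a plain backtracking Armijo search in place of a full Wolfe search for brevity.

```python
import numpy as np
from collections import deque

def lbfgs_direction(g, pairs, h0_scale=1.0):
    """L-BFGS two-loop recursion: returns d = -H g, where H is the
    inverse-Hessian approximation built from the stored (s, y) pairs."""
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):           # first loop: newest to oldest
        rho = 1.0 / y.dot(s)
        a = rho * s.dot(q)
        alphas.append((a, rho, s, y))
        q -= a * y
    r = h0_scale * q                       # apply the initial matrix H0
    for a, rho, s, y in reversed(alphas):  # second loop: oldest to newest
        b = rho * y.dot(r)
        r += (a - b) * s
    return -r

# Sketch of an L-BFGS loop on a strongly convex quadratic.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
x, pairs = np.zeros(2), deque(maxlen=5)    # memory size N = 5
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    d = lbfgs_direction(g, list(pairs))
    t = 1.0
    while f(x + t * d) > f(x) + 1e-4 * t * g.dot(d):
        t *= 0.5                           # backtracking Armijo search
    x_new = x + t * d
    pairs.append((x_new - x, grad(x_new) - g))
    x = x_new
```

For a strongly convex quadratic, y_iᵀs_i > 0 always holds, so the curvature safeguard that a practical implementation needs is not triggered here.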

Convergence results
For the analysis of the L-BFGS algorithm, Liu and Nocedal make the following assumptions [7]: Assumption 1 Given starting point x₀, we have: (1) The objective function f is twice continuously differentiable; (2) The level set D := {x ∈ R^n | f(x) ≤ f(x₀)} is convex; and (3) There exist positive constants μ₁ and μ₂ such that

  μ₁‖v‖₂² ≤ vᵀ∇²f(x)v ≤ μ₂‖v‖₂²

for all x ∈ D and v ∈ R^n.
Under these conditions, and with some simplifications, they prove the following: Theorem 2 (Liu and Nocedal [7]) The L-BFGS algorithm generates a sequence {x_i} that converges to the unique minimizer x* in D. Moreover, there exists a constant c ∈ (0, 1) such that

  f(x_i) − f(x*) ≤ cⁱ (f(x₀) − f(x*)),

that is, the sequence converges at least R-linearly.

Proposed algorithm
The proposed algorithm can be seen as a modification of the SPG method that allows the use of quasi-Newton steps over a currently active face. The basic idea is that whenever two successive iterates x_i and x_{i−1} lie on the same face, we can form or update a quadratic model of f̂, the objective function restricted to the face. Whenever a model for the current face is available, the algorithm will attempt a quasi-Newton step that is restricted to the face and satisfies the Wolfe conditions (7). If the quasi-Newton step fails, or is otherwise abandoned, the algorithm simply falls back to the SPG method and takes a regular spectral projected-gradient step. After each step, regardless of the type, we check the conditions required to update the quadratic model and initiate the quasi-Newton step:

  F(x_i) = F(x_{i−1}) and −∇f(x_i) ∈ self proj(F(x_i)).  (10)

If these conditions are not met, we discard the Hessian approximation used in the quadratic model. The self-projection criterion in (10) is essential: omitting it could cause the algorithm to take repeated quasi-Newton iterations converging to a minimum on the relative interior of the face rather than the global minimum.

Quasi-Newton over a face
One way of performing quasi-Newton over a face is by maintaining an inverse Hessian approximation using the update vectors in (5), and computing the search direction d_i using (6). However, this approach has some major disadvantages. First, we may have that d_i ∉ diff hull(F_i), which means that x_i + αd_i ∉ F_i for all nonzero α. This could be partially solved by projection onto the face, but such a projected direction is no longer guaranteed to be a descent direction [12]. This too could be addressed by modifying the Hessian, but doing so would further complicate the algorithm. A second disadvantage is that we maintain the inverse Hessian approximation for the higher-dimensional ambient space, and the approximation may therefore not be very accurate along aff(F_i).

Algorithm 2 Outline of the proposed hybrid quasi-Newton projected-gradient method
The solution of the above problems is straightforward: we simply work with a representation for the function restricted to aff(F_i). Let F_i be a k-dimensional face with k > 0. Then we can find an orthonormal basis Φ := Φ(F_i) ∈ R^{n×k} whose span coincides with diff hull(F_i). Using Φ we can write any point x ∈ F_i as x = v + Φc, where v ∈ R^n is an arbitrary but fixed point in F_i, and c ∈ R^k is a coefficient in the lower-dimensional space. The function f̂ : R^k → R, which restricts f to the current face, is then given by f̂(c) = f(v + Φc). The idea then is to form the inverse Hessian approximation over f̂, and use it to obtain a search direction d̂ ∈ R^k, which can then be mapped back to the ambient space for the actual line search. In particular, we can form the approximate inverse Hessian Ĥ_i ∈ R^{k×k} by updating an initial positive definite Ĥ using

  ŝ_i := Φᵀ(x_{i+1} − x_i), ŷ_i := Φᵀ(∇f(x_{i+1}) − ∇f(x_i)).

In order to obtain the search direction we first compute ∇f̂ by projecting the gradient ∇f(x_i) onto the lower-dimensional space. We then apply the inverse Hessian followed by back projection, giving:

  d_i = −Φ Ĥ_i Φᵀ ∇f(x_i).

The resulting vector clearly lies in diff hull(F_i) and we therefore have that x_i + αd_i ∈ F_i for all step sizes α ∈ [0, α_bnd], for some α_bnd > 0, possibly with α_bnd = +∞. Line search is equivalently done in the projected or ambient dimension, and we aim to find a step size α within the above range, such that the Wolfe conditions (7) are met.
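The mapping between the ambient space and face coordinates can be illustrated as follows; here Φ is an orthonormal basis of a random subspace standing in for diff hull(F_i), and Ĥ is the identity, which is what L-BFGS starts from before any update pairs are available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3

# Orthonormal basis Phi for the (hypothetical) difference hull of the
# current face, here taken as a random k-dimensional subspace.
Phi, _ = np.linalg.qr(rng.standard_normal((n, k)))

# Gradient of f at the current iterate (arbitrary example values).
g = rng.standard_normal(n)

# Inverse-Hessian approximation in the k-dimensional face coordinates.
H_hat = np.eye(k)

# Restricted quasi-Newton direction: project the gradient into face
# coordinates, apply H_hat, and map back to the ambient space.
d = -Phi @ (H_hat @ (Phi.T @ g))

# d lies in span(Phi) = diff hull(F), so x + alpha*d stays in aff(F).
in_span = np.allclose(Phi @ (Phi.T @ d), d)
descent = d @ g < 0   # descent unless the projected gradient vanishes
```

Because Ĥ is positive definite, dᵀ∇f = −(Φᵀ∇f)ᵀĤ(Φᵀ∇f) < 0 whenever the projected gradient is nonzero, so the restricted direction is automatically a descent direction.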
For the line search we could start with a unit step length whenever α_bnd ≥ 1, or we could try α = α_bnd first to encourage exploring lower-dimensional faces, provided of course that α_bnd < ∞. If no suitable step length can be found, or a certain maximum number of trial steps is taken, we abandon the quasi-Newton step and take a spectral projected-gradient step instead. As a final remark, note that condition (10) should never be met for vertices: since F(x_i) = F(x_{i−1}) would imply x_i = x_{i−1}, and since self proj(F) = N(F) for a vertex, membership −∇f(x_i) ∈ self proj(F(x_i)) would mean that the optimality condition given in (4) had already been satisfied at x_{i−1}.

Convergence
For the convergence analysis of Algorithm 2 we rely on the results in [6] and [7], and add a step in the algorithm that resets the objective-value history used by SPG after each series of successful quasi-Newton iterations to ensure that any subsequent iteration has a lower objective value. We use the following assumptions, which are somewhat more restrictive than those in the aforementioned two papers (see for example Assumption 1):

Assumption 3 We assume that
(1) Function f is twice continuously differentiable, and bounded below; (2) There exist constants 0 < μ₁ ≤ μ₂ < ∞ such that for all x, v ∈ R^n:

  μ₁‖v‖₂² ≤ vᵀ∇²f(x)v ≤ μ₂‖v‖₂².

Under these assumptions, which imply convexity of f, we have the following result: Theorem 4 Let f(x) satisfy Assumption 3 and let x₀ ∈ C. Then the sequence {x_t} generated by Algorithm 2 converges to the unique minimizer of (1).
Proof Assumption 3 ensures the existence of a unique minimizer x* to (1), which satisfies (4). If there exists a finite t for which x_t = x*, we are done. Suppose, therefore, that x_t ≠ x* for all t. We consider two cases. First, if there are finitely many quasi-Newton steps, there must exist a t̄ such that all iterations t > t̄ are of the projected-gradient type. In this case the result follows directly from the analysis in [6]. Second, consider the case where there are infinitely many quasi-Newton steps. For a fixed face F, minimizing f̂ is equivalent to minimizing the objective over the affine hull of the current face F:

  minimize_x f(x) subject to x ∈ aff(F).  (13)

For any successful step it follows from Theorem 2 that there exists a constant c ∈ (0, 1) such that the quasi-Newton step from x_t to x_{t+1} satisfies

  f(x_{t+1}) − f(x*_F) ≤ c (f(x_t) − f(x*_F)),  (14)

where x*_F denotes the minimizer of (13). Because the history of the M most recent objective values is reset after each successful quasi-Newton step, any intermediate projected-gradient step will not increase the objective. Based on this, Lemma 1 below shows that the number of quasi-Newton iterates on any F that does not contain x* is finite. By polyhedrality of the domain, the number of faces is bounded, and we must therefore take infinitely many iterations on at least one face that contains x*. Repeated application of (14) then shows that the objective value converges to f(x*_F) = f(x*). From Assumption 3 it then follows that {x_t} converges to x*.

Lemma 1 Let F be a face of C such that x* ∉ F. Then the number of quasi-Newton steps on F taken by Algorithm 2 is finite.
Proof The quasi-Newton method is applied to f̂, which is equivalent to f restricted to F. We therefore focus on (13). Let x*_F be the solution to (13), and denote by x_{[j]} and x_{[j]+1} the starting, respectively ending, point for the jth quasi-Newton step on F. It follows from (14) that

  f(x_{[j]+1}) − f(x*_F) ≤ c (f(x_{[j−1]+1}) − f(x*_F))  (15)

for j ≥ 2. This holds since any intermediate quasi-Newton iteration can only reduce the objective, and likewise for projected-gradient steps, as a consequence of resetting the function-value history. We consider two cases. First assume that x*_F ∉ F. Let f̄ be the minimum of f(x) over x ∈ F. Repeated application of (15) gives

  f(x_{[j]+1}) − f(x*_F) ≤ c^{j−1} (f(x_{[1]+1}) − f(x*_F)).  (16)

For sufficiently large, but finite j, the right-hand side in (16) must fall below f̄ − f(x*_F), which is strictly positive. Since every successful quasi-Newton step results in a vector x_{[j]+1} ∈ F by construction, and therefore f(x_{[j]+1}) ≥ f̄, it follows that the number of quasi-Newton iterates on F must be bounded.
For the second case, assume that x*_F ∈ F. Because optimization in (13) is done over aff(F), the gradient at the minimizer satisfies ∇f(x*_F) ⊥ diff hull(F). If, in addition, we had −∇f(x*_F) ∈ self proj(F), this would give −∇f(x*_F) ∈ N(F); but this cannot be the case, since it would imply that x*_F is a global minimizer. (The same holds when x*_F lies on a lower-dimensional subface on the boundary of F.) Since f is continuously differentiable by assumption, it follows that −∇f(x) ∉ self proj(F) for all points x ∈ F sufficiently close to x*_F. Assumption 3 then allows us to define a sufficiently close neighborhood as a level set {x ∈ F | f(x) ≤ f̄} for some f̄ > f(x*_F). Applying the same argument we used above shows that the right-hand side of (16) again falls below f̄ − f(x*_F) for sufficiently large j. Once this happens all following iterates x_t ∈ F must have f(x_t) ≤ f̄. Since the self-projection cone condition −∇f(x) ∈ self proj(F) does not hold at these points, no more quasi-Newton steps are taken on F.
A similar analysis holds when the spectral projected-gradient method is replaced by another convergent algorithm, provided that the iterates do not exceed the initial objective value.

Application to Lasso
The proposed algorithm depends on a number of operations on the constraint set. In particular, it has to determine the face in which the current iterate lies, check membership of the self-projection cone, and determine an orthonormal basis for the current face. For the algorithm to be of practical use, the constraint set must therefore be simple enough to allow efficient evaluation of these operations. As this work was motivated by improving solvers for the Lasso problem, we focus on the weighted one-norm ball (which for unit weights is also known as the cross-polytope or n-octahedron [13]):

  C_{w,1} := {x ∈ R^n | ‖x‖_{w,1} ≤ τ},

where ‖x‖_{w,1} := Σ_i w_i |x_i| with positive weights w_i. The proposed framework also applies naturally to bound- or simplex-constrained problems, but these are outside the scope of this paper. The objective function we consider throughout this section is

  f(x) = (1/2)‖Ax − b‖₂² + cᵀx + (μ/2)‖x‖₂²,  (17)

which can also be written in the form (1/2)‖Āx − b̄‖₂² + cᵀx, with Ā and b̄ appropriately redefined. The benefit of having an objective function of the form (17) is that it permits closed-form expressions for step lengths satisfying certain conditions. In the remainder of this section we discuss practical considerations for the line-search conditions and look at the specific structure and properties of the set C_{w,1}.

Line search
For most objective functions the line search is done by evaluating f (P(x + αd)) or f (x + αd) for a series of α values until all required conditions, such as Armijo and Wolfe, are satisfied. The objective function in (17) has closed-form solutions for some of the problems arising in the line search, thereby allowing us to simplify the algorithms and improve their performance.

Optimal unconstrained step size
As a start we look at the step length that minimizes the objective along the ray x + αd:

  α_opt := arg min_α f(x + αd).

Differentiating with respect to α and equating to zero leads to the following expression:

  α_opt = −(rᵀAd + cᵀd + μxᵀd) / (‖Ad‖₂² + μ‖d‖₂²), with r := Ax − b.

When μ = 0 and c = 0 this reduces to α_opt = −rᵀAd/‖Ad‖₂².
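A sketch of the closed-form step, assuming the objective f(x) = (1/2)‖Ax − b‖² + cᵀx + (μ/2)‖x‖² and the residual convention r = Ax − b; `optimal_step` is a hypothetical helper name.

```python
import numpy as np

def optimal_step(A, b, c, mu, x, d):
    """Closed-form minimizer of alpha -> f(x + alpha*d) for the
    quadratic objective (17); a sketch under the stated conventions."""
    r = A @ x - b
    Ad = A @ d
    num = r @ Ad + c @ d + mu * (x @ d)          # derivative at alpha = 0
    den = Ad @ Ad + mu * (d @ d)                 # curvature along d
    return -num / den

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
b, c = rng.standard_normal(8), rng.standard_normal(5)
x, d = rng.standard_normal(5), rng.standard_normal(5)
alpha = optimal_step(A, b, c, mu=0.1, x=x, d=d)
```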

Wolfe line search conditions
The maximum step length for which the Armijo condition (7a) is satisfied can be found by expanding the terms and simplifying; because f(x + αd) is an exact quadratic in α for the objective (17), the condition reduces to a linear inequality in α. Doing so gives the following bound:

  α ≤ 2(γ₁ − 1) ∇f(x)ᵀd / (‖Ad‖₂² + μ‖d‖₂²).

Likewise, the gradient condition (7b) reduces to

  α ≥ (γ₂ − 1) ∇f(x)ᵀd / (‖Ad‖₂² + μ‖d‖₂²).
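These closed-form bounds can be checked numerically; the interval below is derived under the assumption that f is exactly the quadratic (17), so both Wolfe conditions reduce to linear inequalities in α.

```python
import numpy as np

def wolfe_interval(A, mu, g, d, g1=1e-4, g2=0.9):
    """Closed-form step-size interval for the Wolfe conditions (7) when
    f is the quadratic (17); g is the gradient at x and d a descent
    direction.  A sketch: any alpha in [lo, hi] satisfies both (7a/7b)."""
    gd = g @ d                         # directional derivative, < 0
    Ad = A @ d
    q = Ad @ Ad + mu * (d @ d)         # curvature d'(A'A + mu*I)d > 0
    hi = 2.0 * (g1 - 1.0) * gd / q     # (7a) holds iff alpha <= hi
    lo = (g2 - 1.0) * gd / q           # (7b) holds iff alpha >= lo
    return lo, hi

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 4))
b, c, mu = rng.standard_normal(10), rng.standard_normal(4), 0.2
x = rng.standard_normal(4)
g = A.T @ (A @ x - b) + c + mu * x     # gradient of (17) at x
d = -g                                 # steepest descent, for example
lo, hi = wolfe_interval(A, mu, g, d)
```

Since 0 < γ₁ < γ₂ < 1, the interval is nonempty, so an acceptable step always exists for this objective.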

Projection arc
Line search in gradient projection methods is often done by backtracking from a single projection, p(α) = x + α(P_C(x + d) − x) with α ∈ [0, 1], or by searching over the projection arc, p(α) = P_C(x + αd) with α ≥ 0. The trajectory of the first method depends strongly on the scaling of d and is more likely to generate points in the interior of the feasible set. The second method better captures the structure of the domain, but can also be more expensive computationally. In an earlier version of the present work [14] we studied line search based on the entire projection arc (that is, over all values α ≥ 0) along with its properties. Given the high computational complexity of this line search in practice and the lengthy treatment it requires, we omit the details here and refer the interested reader to [14].

Facial structure
The weighted one-norm ball of radius τ is the convex hull of the vertices {±τ/w_i · e_i}_i. Every proper k-face F of the weighted one-norm ball C_{w,1} can be written as the convex hull

  F = conv{σ_i τ/w_i · e_i | i ∈ I}

for some index set I ⊆ {1, . . . , n} with cardinality k + 1, and signs σ_i ∈ {−1, +1}. Throughout this section we assume that τ > 0.
Given an x ∈ C we can determine F(x) as follows. First, we need to check whether ‖x‖_{w,1} < τ, in which case x lies in the interior and F(x) = C_{w,1}. Otherwise, x lies on a proper face, which can be uniquely characterized by the sign vector sgn(x) whose ith entry is given by sgn(x_i). Determining F(x) and checking equality of faces can therefore be done in O(n) time.
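A sketch of the O(n) face identification; `face_of` is a hypothetical helper that returns None for the interior (F(x) = C_{w,1}) and the characterizing sign vector otherwise.

```python
import numpy as np

def face_of(x, w, tau, tol=1e-12):
    """Identify F(x) for the weighted one-norm ball in O(n): None when
    x is interior, otherwise the sign vector of the proper face."""
    if np.sum(w * np.abs(x)) < tau - tol:
        return None                        # interior: F(x) = C_{w,1}
    return np.sign(x).astype(int)          # proper face: sign pattern

w, tau = np.array([1.0, 2.0, 1.0]), 1.0
# Two boundary points with the same sign pattern lie on the same face.
same = np.array_equal(face_of(np.array([0.5, 0.0, -0.5]), w, tau),
                      face_of(np.array([0.9, 0.0, -0.1]), w, tau))
```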

Projection onto the feasible set
Projection onto the weighted one-norm ball is discussed in [15] and is based on the solution of the prox function

  x_λ(u) := prox_{λ‖·‖_{w,1}}(u) = sgn(u) ⊙ [|u| − λw]_+,  (18)

where [·]_+ = max(0, ·), and the absolute value, sign function, and multiplication in the right-hand side are evaluated elementwise. Projection onto C_{w,1} then amounts to finding the smallest λ ≥ 0 for which ‖x_λ(u)‖_{w,1} ≤ τ. The entries in x_λ(u), and therefore ‖x_λ(u)‖_{w,1}, are continuous and piecewise linear in λ with break points occurring at λ = |u_i|/w_i. We can obtain an O(n log n) algorithm that finds the optimal λ and subsequent projection by sorting the break points [15]. This can be reduced to an expected O(n) algorithm [16] by avoiding the explicit sorting step.
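A sketch of the O(n log n) sorting-based projection; `project_w1` is a hypothetical helper, and the breakpoint bookkeeping follows the piecewise-linear description above.

```python
import numpy as np

def project_w1(u, w, tau):
    """Project u onto {x : sum_i w_i|x_i| <= tau}: find the smallest
    lambda >= 0 with ||x_lambda(u)||_{w,1} <= tau by visiting the
    breakpoints |u_i|/w_i in decreasing order (cf. [15])."""
    a = np.abs(u)
    if np.sum(w * a) <= tau:
        return u.copy()                    # already feasible: lambda = 0
    bp = a / w                             # breakpoints of lambda
    order = np.argsort(bp)[::-1]           # largest breakpoint first
    c1 = c2 = lam = 0.0
    for k, i in enumerate(order):
        c1 += w[i] * a[i]                  # running sums over active set
        c2 += w[i] * w[i]
        lam = (c1 - tau) / c2              # solves sum w_i(a_i - lam*w_i) = tau
        nxt = bp[order[k + 1]] if k + 1 < len(order) else 0.0
        if lam >= nxt:                     # threshold lies in this segment
            break
    return np.sign(u) * np.maximum(a - lam * w, 0.0)

x_proj = project_w1(np.array([3.0, 1.0, -2.0]), np.ones(3), 1.0)
```

Replacing the sort by a randomized pivoting scheme over the breakpoints gives the expected O(n) variant mentioned above.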

Self-projection cone of a face
Given x ∈ C_{w,1} and search direction d ∈ R^n, we want to know if d ∈ self proj(F(x)). When ‖x‖_{w,1} < τ it follows that x lies in the interior of C_{w,1}, meaning that F(x) = C_{w,1} and d ∈ self proj(C_{w,1}) = R^n, trivially. For the case where ‖x‖_{w,1} = τ, consider the support I := {i | x_i ≠ 0}. For the projection of x + αd it follows that the projection threshold satisfies λ(α) ≤ α‖d‖_{w,1}/min_i{w_i} =: να. At the same time, a necessary condition for an index i ∈ I to be removed from the support is that w_iλ(α) ≥ |x_i + αd_i|; combined with the bound λ(α) ≤ να this cannot hold for sufficiently small α, and the support therefore cannot shrink. For d to be in the self-projection cone we therefore need to show that (1) x + αd does not move into the interior of the polytope, and (2) the support does not increase. It can be verified that the first condition is satisfied if and only if the directional derivative of α ↦ ‖x + αd‖_{w,1} at α = 0, given by Σ_{i∈I} w_i sgn(x_i)d_i + Σ_{i∉I} w_i|d_i|, is nonnegative. For the second condition to be satisfied we require for all i ∉ I and sufficiently small α that the absolute value of the entries remains less than or equal to the threshold value, namely α|d_i| ≤ w_iλ(α). When the support remains the same we find λ(α) by solving Σ_{i∈I} w_i(|x_i + αd_i| − w_iλ(α)) = τ, which for small α gives λ(α) = αθ̄/Σ_{i∈I} w_i² with θ̄ := Σ_{i∈I} w_i sgn(x_i)d_i. A necessary (and sufficient) condition for the support to remain the same is therefore that |d_i| ≤ w_iθ̄/Σ_{j∈I} w_j² for all i ∉ I. Summarizing the above we have: d ∈ self proj(F(x)) if and only if θ̄ ≥ 0 and |d_i| ≤ w_iθ̄/Σ_{j∈I} w_j² for all i ∉ I.
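One way to test membership numerically, using the characterization S(F) = N(F) + diff hull(F) from the notation section: split d into its component along z, with z_i = w_i sgn(x_i) on the support, plus a difference-hull part, and check that the remainder is a valid normal-cone element. The helper name and tolerances are illustrative.

```python
import numpy as np

def in_self_proj(x, d, w, tau, tol=1e-12):
    """Test d in self proj(F(x)) for the weighted one-norm ball via the
    decomposition S(F) = N(F) + diff hull(F); a sketch."""
    if np.sum(w * np.abs(x)) < tau - tol:
        return True                        # interior: S(F) = R^n
    I = np.abs(x) > tol                    # support of x
    z = w[I] * np.sign(x[I])               # normal direction on the support
    theta = (z @ d[I]) / (z @ z)           # multiplier of the normal part
    if theta < -tol:
        return False                       # d points into the interior
    # Off-support entries must fit inside the normal cone at level theta.
    return bool(np.all(np.abs(d[~I]) <= theta * w[~I] + tol))

w, tau = np.ones(4), 1.0
x = np.array([0.5, 0.5, 0.0, 0.0])
along_face = in_self_proj(x, np.array([1.0, -1.0, 0.0, 0.0]), w, tau)
off_face = in_self_proj(x, np.array([0.0, 0.0, 1.0, 0.0]), w, tau)
```

The first direction is a difference of vertices and stays on the face; the second grows the support and is rejected.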

Orthogonal basis for a face
For the construction of a quadratic approximation of the objective function f over a face F, we require an orthonormal basis Φ for diff hull(F). For simplicity, consider the facet of the unit cross-polytope lying in the positive orthant in R³, that is, the unit simplex given by conv{e₁, e₂, e₃}. A first vector for the basis can then be obtained by normalizing e₂ − e₁ to have unit norm. A second vector orthogonal to the first can be obtained by connecting the point halfway on the line segment e₁–e₂ to e₃, that is, e₃ − (e₁ + e₂)/2, followed again by normalization. Extending this to general k-simplices we find that the jth basis vector, prior to normalization, is given by

  e_{j+1} − (1/j)(e₁ + · · · + e_j), j = 1, . . . , k.

It can be seen that the above procedure amounts to taking a QR factorization of the (k + 1) × k matrix [−e, I]ᵀ of differences between the first vertex and all others of the k-simplex, and discarding the last column in Q, whose entries are all equal to 1/√(k + 1). The special structure of Q allows us to evaluate matrix-vector products with Q and its transpose in O(k) time, without having to form the matrix explicitly. For the general case, let F = F(x). For the case where F = C no projection is needed and we can simply choose Φ = I. Otherwise, let I = {i | x_i ≠ 0} denote the support of x. The desired basis can then be obtained by embedding the simplex basis above into the rows indexed by I and multiplying row i by sgn(x_i). Matrix-vector products with Φ can be evaluated in O(n) time, again without forming the matrix.
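For a concrete check, the QR construction can be carried out explicitly with a dense factorization; the implicit O(k) products are what an implementation would use instead.

```python
import numpy as np

# Difference matrix for the unit k-simplex conv{e_1, ..., e_{k+1}}:
# column j is e_{j+1} - e_1.
k = 4
D = np.vstack([-np.ones(k), np.eye(k)])    # the (k+1) x k matrix [-e, I]^T
Q, _ = np.linalg.qr(D)                     # thin QR: Q is (k+1) x k

# Q spans diff hull of the simplex: its columns are orthonormal and
# each sums to zero, i.e., is orthogonal to the all-ones direction
# (the discarded last column of the full factorization).
orthonormal = np.allclose(Q.T @ Q, np.eye(k))
sums_zero = np.allclose(Q.sum(axis=0), 0.0)
```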
Generic weighted one-norm ball For the weighted one-norm ball we consider a simplicial face given by conv(w₀e₁, w₁e₂, . . . , wₙe_{n+1}), with nonzero weights w₀ to wₙ. (Throughout this paragraph it is more convenient to index the weights from zero and let them denote the vertex coordinates directly.) The orthonormal basis corresponding to the face can again be found by applying the standard QR algorithm to the matrix of differences between the vertices, and taking advantage of the special structure of the matrix. The initial setup is illustrated in Fig. 2a with v₁ = −w₀. The two operations in this process are projecting out the contributions of all subsequent columns and normalizing the columns to unit norm. We do not normalize until the very end, but do keep track of the squared two-norm of the completed columns. Given vectors a and b we obtain the component in b orthogonal to a by evaluating b − (⟨a, b⟩/⟨a, a⟩)a. In the first step of the factorization (we are interested only in Q) we orthogonalize with respect to the first column a. The inner product of each subsequent column with a is identical and equal to α₁ := w₀². Using this we also compute the squared two-norm of the first column as γ₁ = α₁ + w₁². After the sweep with the first column we are left with the matrix shown in Fig. 2b. The next step is to sweep with the updated second column. For this we compute the inner product with the remaining columns and with itself, yielding α₂ = ‖v₂‖₂² and γ₂ = α₂ + w₂², respectively. After this sweep we arrive at the matrix given in Fig. 2c. Proceeding like this we successively sweep with each of the columns until we are done. Consider now the sweep with some column k, illustrated in Fig. 2d.
from which we derive the recurrence relations γ_{k+1} = α_{k+1} + w²_{k+1} and α_{k+1} = α_k w_k²/γ_k. With α₁ and γ₁ as given above, this allows us to compute all α and γ values. Ultimately we are interested in the final orthonormal Q matrix. Defining scaling factors μ (see (23)), as well as factors u_i = −α_i/γ_i for 1 ≤ i ≤ n and u₀ := −1, it can be found based on the structure of the vectors v that Q can be written as a product of diagonal matrices and a structured central part. Multiplication with this matrix and its transpose may still seem expensive, but we now show how the structure enables O(n) algorithms for both operations. Multiplication with the diagonal matrices is trivial, so we focus only on multiplication with the central part of the matrix. Looking at a small example we can decompose this central part into a product whose key factor is an upper-triangular matrix M with entries μ_{i,k}. To evaluate y = Mv we initialize y₃ = v₃ and then work upwards. Direct evaluation gives y₂ = v₂ + μ₂,₂v₃, which can be rewritten as y₂ = v₂ + μ₂,₂y₃. A pattern emerges when looking at the computation of y₁: there μ₁,₂ = μ₁,₁μ₂,₂, or more generally μ_{i,k} = μ_{i,j}μ_{j+1,k} for i ≤ j ≤ k, follows from the definition of μ in (23). Given yₙ = vₙ, we therefore obtain the recurrence y_k = v_k + μ_{k,k}y_{k+1}, which allows us to evaluate y = Mv in linear time.
With v appropriately redefined we now look at y = M T v: Starting with y 1 = v 1 we find y 2 = μ 1,1 y 1 + v 2 and y 3 = μ 2,2 y 2 + v 3 , using μ 1,2 = μ 1,1 μ 2,2 . This gives the recurrence y k+1 = μ k,k y k + v k+1 . We summarize the initialization and multiplication with Q and Q T in Algorithm 3. Note that these algorithms use a different indexing scheme for a convenient implementation. For practical implementations we can precompute and store 1/ √ γ k instead of γ k and avoid storing α since it is not used during the evaluation of matrix-vector products. Alternatively, we can reduce the memory footprint at the cost of increased computation by storing only α and re-computing μ k , u k , and γ k whenever they are needed.

Algorithm 3 Implicit construction of Q and matrix-vector multiplication with Q and Q T
(a) Precomputing α, γ , u, and μ

Maximum step length along a face
Given a feasible search direction d it is useful to know the maximum α for which x + αd ∈ C w,1 . When x lies in the interior of C w,1 or when (20) is violated and x + αd moves into the interior, we need to compute the first intersection with the boundary. When x lies on a proper face of C w,1 and d moves along the face or onto a higher dimensional face, the maximum step length is determined by the first element to reach zero. More details can be found in [14].
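A sketch of the support-based step bound for directions that stay on the current face; `max_step_on_face` is a hypothetical helper and ignores the boundary-intersection case mentioned above.

```python
import numpy as np

def max_step_on_face(x, d, tol=1e-12):
    """Largest alpha with x + alpha*d still on the current proper face:
    the first support entry driven to zero limits the step."""
    # Entries that shrink toward zero: nonzero x_i with d_i of opposite sign.
    mask = (np.abs(x) > tol) & (np.sign(d) == -np.sign(x)) & (np.abs(d) > tol)
    if not np.any(mask):
        return np.inf                      # no support entry reaches zero
    return np.min(np.abs(x[mask] / d[mask]))

alpha_bnd = max_step_on_face(np.array([0.5, -0.25, 0.0]),
                             np.array([-1.0, 0.0, 0.0]))
```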

Stopping criteria
We now look at stopping criteria for optimizing f(x) as defined in (17) over the weighted one-norm ball. A common stopping criterion for problems of this type is the relative norm of the projected gradient,

  ‖P(x − ∇f(x)) − x‖₂ / max{1, ‖x‖₂},

which is zero if and only if x is optimal. In addition to this we can look at the relative duality gap, which we define as the difference δ between f(x) and any dual-feasible objective, divided by max{1, f(x)}. Given that the dual-feasible point will typically not be optimal, the relative duality gap merely provides an upper bound on the actual gap. Improving the estimate of the duality gap can cause the iterates to satisfy the stopping criteria earlier and thereby help reduce the runtime. For the derivation of the dual problem we follow [2,15] and rewrite the original problem as:

  minimize_{x,r} (1/2)‖r‖₂² + cᵀx + (μ/2)‖x‖₂² subject to r = b − Ax, ‖x‖_{w,1} ≤ τ.  (24)

The dual of this problem is given by

  maximize_{y, λ≥0} L(y, λ),

where the Lagrange dual function L is given by

  L(y, λ) = bᵀy − τλ + inf_r {(1/2)‖r‖₂² − yᵀr} + inf_x {cᵀx + (μ/2)‖x‖₂² − yᵀAx + λ‖x‖_{w,1}}.

Here, the infimum over r is solved by equating the gradient to zero, giving y = r and yᵀr = ‖y‖₂². For the infimum over x we consider two cases, based on the value of μ.

Dual when μ = 0. Following the derivation given in [15], with minor changes to account for the c term, it can be shown that the infimum over x is finite only if ‖Aᵀy − c‖_{1/w,∞} ≤ λ, where ‖z‖_{1/w,∞} := max_i |z_i|/w_i denotes the dual norm of ‖·‖_{w,1}. From this we then obtain the dual problem:

  maximize_y bᵀy − (1/2)‖y‖₂² − τ‖Aᵀy − c‖_{1/w,∞}.  (25)

As a dual-feasible point we can choose y = r. For any given y it can be verified that choosing λ = ‖Aᵀy − c‖_{1/w,∞} always gives the largest dual objective value. Given x and the corresponding residual r = b − Ax we therefore obtain the following duality gap:

  δ = τλ − xᵀ(Aᵀr − c), with λ = ‖Aᵀr − c‖_{1/w,∞}.

Dual when μ > 0. The simplest way of dealing with μ > 0 is to rewrite the problem as:

  minimize_x (1/2)‖Āx − b̄‖₂² + cᵀx subject to ‖x‖_{w,1} ≤ τ, with Ā := [A; √μ I], b̄ := [b; 0].

This reduces the problem to the form where μ = 0 and we can therefore directly use the dual formulation (25). Choosing y = r̄, with r̄ = [r; −√μ x], and applying the same derivation as given above, we obtain a dual objective value of

  bᵀr − (1/2)(‖r‖₂² + μ‖x‖₂²) − τλ̄, with λ̄ = ‖Aᵀr − μx − c‖_{1/w,∞},  (26)

and a corresponding duality gap of

  δ = τλ̄ − xᵀ(Aᵀr − μx − c).  (27)

Another approach is to solve the original infimum over x in (24) for the case where μ > 0.
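The μ = 0 duality gap can be sketched as follows, assuming f(x) = (1/2)‖Ax − b‖² + cᵀx, the residual convention r = b − Ax, and the dual-feasible choice y = r with λ equal to the dual-norm value; `duality_gap` is a hypothetical helper.

```python
import numpy as np

def duality_gap(A, b, c, w, tau, x):
    """Duality gap sketch for the mu = 0 case: primal objective minus
    the dual objective at the feasible pair (y, lambda) = (r, ||.||)."""
    r = b - A @ x
    v = A.T @ r - c
    lam = np.max(np.abs(v) / w)            # dual norm of the w-one-norm
    primal = 0.5 * r @ r + c @ x
    dual = b @ r - 0.5 * r @ r - tau * lam
    return primal - dual                   # algebraically tau*lam - x@v

rng = np.random.default_rng(3)
A = rng.standard_normal((12, 6))
b, c = rng.standard_normal(12), rng.standard_normal(6)
w, tau = np.ones(6), 1.0
x = np.zeros(6)                             # a feasible point
gap = duality_gap(A, b, c, w, tau, x)
```

By weak duality the gap is nonnegative for every feasible x, and it shrinks to zero as x approaches the Lasso solution.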
For a fixed y and λ we have

  inf_x {cᵀx + (μ/2)‖x‖₂² − yᵀAx + λ‖x‖_{w,1}}.  (28)

When λ = 0 it is easily seen that the infimum equals −(1/2μ)‖Aᵀy − c‖₂². For the more general case where λ > 0, we first reformulate (28) as

  μ · inf_x {h(x) + (1/2)‖x − v‖₂²} − (1/2μ)‖Aᵀy − c‖₂²,  (29)

with h(x) = (λ/μ)‖x‖_{w,1} = ‖x‖_{λw/μ,1} and v = (1/μ)(Aᵀy − c). For problems of the form (29) we have:

Theorem 6 Let h(·) be any norm. Then

  inf_x {h(x) + (1/2)‖x − v‖₂²} = (1/2)‖v‖₂² − (1/2)‖prox_h(v)‖₂².
Proof Note that the objective is coercive and therefore attains its minimum, so the infimum equals the Moreau envelope M_h(v) := h(prox_h(v)) + (1/2)‖prox_h(v) − v‖₂². We then need to show that

  M_h(v) =(a) (1/2)‖v‖₂² − M_{h*}(v) =(b) (1/2)‖v‖₂² − (1/2)‖v − prox_{h*}(v)‖₂² = (1/2)‖v‖₂² − (1/2)‖prox_h(v)‖₂²,

where (a) follows from [18, Lemma 2.10], and (b) follows from the fact that h* is the indicator function for the dual norm ball of h, which implies that prox_{h*}(v) is an element of the dual norm ball and h*(prox_{h*}(v)) = 0; the final equality uses the Moreau decomposition v = prox_h(v) + prox_{h*}(v).
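The identity is easy to verify numerically for a separable norm, whose prox is elementwise soft thresholding:

```python
import numpy as np

# Numerical check of the Theorem 6 identity for h equal to a scaled
# weighted one-norm, whose prox is the soft-thresholding map (18).
rng = np.random.default_rng(4)
v = rng.standard_normal(5)
w = rng.uniform(0.5, 2.0, 5)
lam = 0.7
h = lambda x: lam * np.sum(w * np.abs(x))
prox_v = np.sign(v) * np.maximum(np.abs(v) - lam * w, 0.0)

# Left-hand side: the minimum value, attained at the prox point.
lhs = h(prox_v) + 0.5 * np.sum((prox_v - v) ** 2)
# Right-hand side: the closed form from Theorem 6.
rhs = 0.5 * np.sum(v ** 2) - 0.5 * np.sum(prox_v ** 2)
```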
Application of Theorem 6 to (29), with the proximal operator of h given by soft thresholding [see also (18)], yields for the infimum in (28) the value −(1/2μ)‖[|Aᵀy − c| − λw]_+‖₂². The same expression holds for λ = 0, and substitution into (24) therefore gives the following dual problem:

  maximize_{y, λ≥0} bᵀy − (1/2)‖y‖₂² − τλ − (1/2μ)‖[|Aᵀy − c| − λw]_+‖₂².  (30)

Even when restricting y to the current residual r in the primal formulation, it can be seen that the value of (30) is never smaller than that of (26) and, consequently, that the duality gap never exceeds the value in (27). In particular, by choosing λ = ‖Aᵀy − μx − c‖_{1/w,∞}, it follows that for any index i we have

  |(Aᵀy − c)_i|/w_i ≤ λ + μ|x_i|/w_i.

Multiplying either side by w_i and rearranging gives

  |(Aᵀy − c)_i| − λw_i ≤ μ|x_i|.

Because the right-hand side is always nonnegative, this continues to hold when applying the [·]_+ operator on the left-hand side, and as a result we have

  (1/2μ)‖[|Aᵀy − c| − λw]_+‖₂² ≤ (μ/2)‖x‖₂²,

from which the desired result immediately follows.
Finding a dual-feasible solution It follows from Slater's condition and strong duality that, at the solution (x, r) of (25), we have y = r and, without loss of generality, λ = ‖Aᵀr − c‖_{w^{-1},∞}. When r is not optimal, we can still choose y = r and obtain a dual-feasible solution. For (30) we can also take y = r, but finding λ requires some more work. In general, given any y, we want to find a λ that maximizes the objective. Writing z = |Aᵀr − c| and ignoring constant terms, this is equivalent to solving With I(λ) := {i | zᵢ ≥ λwᵢ} we can write the objective as Discarding all terms with zᵢ = 0, this function is piecewise quadratic with breakpoints at λᵢ = zᵢ/wᵢ. We can write the sequence of breakpoints in non-decreasing order as λ_[i] for i = 0, ..., n, with λ_[0] := 0. The gradient between successive breakpoints is linear and continuously increases from In order to find the optimal point λ*, we consider two cases. In the first case we have f′(0) ≥ 0, or equivalently τ ≥ (1/μ) Σᵢ wᵢzᵢ, which means that the function is non-decreasing and we find λ* = 0. In the second case we need to find the λ for which the gradient vanishes. This can be done by traversing the breakpoints until the first breakpoint is found at which the gradient is nonnegative. The desired solution λ* is then found by linear interpolation over the last segment. Including sorting, this can be done in O(n log n) time. This problem is very similar to projection onto the one-norm ball, and can also be solved in expected O(n) time using an algorithm similar to that proposed in [16].
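The traversal can be sketched as follows. We minimize g(λ) = τλ + (1/2μ) Σᵢ ([zᵢ − λwᵢ]₊)² over λ ≥ 0, a form reconstructed from the description above and therefore to be read as an assumption; the function name is illustrative. On each segment between breakpoints the gradient is linear, so its root can be found in closed form:

```python
import numpy as np

def best_lambda(z, w, tau, mu):
    # Minimize g(lam) = tau*lam + (1/(2*mu)) * sum(max(z - lam*w, 0)^2)
    # over lam >= 0; assumes z >= 0 and w > 0 elementwise.
    keep = z > 0
    z, w = z[keep], w[keep]
    if mu * tau >= np.dot(w, z):        # g'(0) >= 0: g is non-decreasing
        return 0.0
    order = np.argsort(z / w)           # breakpoints in non-decreasing order
    z, w = z[order], w[order]
    bp = z / w
    # Suffix sums over the active set {i : z_i > lam * w_i} on each segment.
    S1 = np.cumsum((w * z)[::-1])[::-1]
    S2 = np.cumsum((w * w)[::-1])[::-1]
    lo = 0.0
    for k in range(len(bp)):
        lam = (S1[k] - mu * tau) / S2[k]   # root of the (linear) gradient
        if lo <= lam <= bp[k]:
            return lam
        lo = bp[k]
    return lo   # not reached for valid input; safety fallback
```

Suffix sums make each segment test O(1), so the cost is dominated by the O(n log n) sort, matching the complexity stated above.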
Primal-dual pairs Using the methods described above, we can compute an upper bound on the duality gap given a feasible x and the corresponding residual r. We can do at least as well, and often better, by maintaining the maximum dual objective found so far, and using this to determine the relative duality gap. This way it is possible for the primal objective at the current iterate x to attain the desired optimality tolerance even when the dual estimate generated from the current iterate is far from optimal. Within the root-finding framework this means that we cannot simply use the latest residual r to evaluate the gradient of the Pareto curve. Instead, we should maintain the value of λ corresponding to the best dual solution found at any point, and use this as the gradient approximation. In other words, we need to keep track of the best primal and dual variables separately.
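A minimal sketch of this bookkeeping for the unweighted, uncentered μ = 0 case (w = 1, c = 0; an illustrative simplification), using the dual-feasible choice y = r and λ = ‖Aᵀr‖∞ described above; the function names are hypothetical:

```python
import numpy as np

def dual_objective(A, b, r, tau):
    # mu = 0 dual objective at the feasible point y = r,
    # lam = ||A^T r||_inf (unweighted case, c = 0).
    lam = np.linalg.norm(A.T @ r, np.inf)
    return b @ r - 0.5 * (r @ r) - tau * lam

def relative_gap(A, b, x, tau, best_dual=-np.inf):
    # Relative duality gap measured against the best dual value seen so far,
    # which is carried along between calls.
    r = b - A @ x
    primal = 0.5 * (r @ r)
    best_dual = max(best_dual, dual_objective(A, b, r, tau))
    return (primal - best_dual) / max(1.0, primal), best_dual
```

Carrying `best_dual` across iterations is exactly the "best primal and dual variables separately" bookkeeping: the gap reported for the current x never exceeds the gap computed from the current residual alone.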

Convergence
In many practical situations, the m × n matrix A in (LS τ) will have more columns than rows (m < n). Due to the presence of a null space, Assumption 3 is not satisfied, and the convergence result in Theorem 4 does not apply as is. For the result to apply it suffices to ensure that condition (12) holds for the restriction of f to each of the faces of C. A sufficient condition for this is that the columns of A are in general position, such that no subset of m or fewer columns of A is rank deficient. The hybrid algorithm can be updated to check for the self-projection cone and perform quasi-Newton steps only when the number of nonzeros in the current iterate is at most m.

Numerical experiments
In this section we evaluate the performance of the hybrid approach on the Lasso problem (LS τ), both independently and within the spgl1 root-finding framework [2] described in the introduction. The spgl1 solver can be used both for stand-alone Lasso problems and for basis-pursuit denoise (BP σ) problems. For the hybrid method we are mostly concerned with the performance of the former, and we therefore changed spgl1 in two stages. First we modified the stopping criteria used in the Lasso mode, now declaring a solve successful only if the relative duality gap falls below a certain tolerance level. We then added all modifications needed for the implementation of the hybrid approach. To distinguish between the different algorithms, we use the convention that spgl1 refers only to the existing implementation provided by [2]. We refer to the version of spgl1 with the more stringent stopping criteria as the spg method, which is then extended with the techniques described in this paper to obtain the hybrid method. For the hybrid method we use an L-BFGS memory size of eight for all experiments. When used in the root-finding mode to solve (BP σ), spgl1 uses several different criteria to decide when to update τ. Each subproblem in spgl1 is considered solved when the relative change in objective is small and at least one iteration was taken within the current subproblem. The overall problem is declared solved when ‖Aᵀy‖∞, the relative difference between ‖r‖₂ and σ, or the relative duality gap is sufficiently small. For the basis-pursuit denoise experiments based on the spg and hybrid algorithms, we use a separate implementation of the root-finding framework in which each Lasso subproblem is fully solved before updating τ. The differences in stopping criteria, and especially the lack of guarantees on the duality gap for the final subproblem in spgl1, make it difficult to compare the performances directly.
We therefore focus predominantly on how the performance of the hybrid method differs from the reference spg method.

Active-set type method
Given their similarity, we first compare the performance of the hybrid method with an active-set type method. For this, we rewrite the Lasso subproblem (LS τ ) as a quadratic program (QP) by splitting x into its positive and negative components: As a solver we choose qpOASES [19], a parametric active-set method that relies on solving linear systems of equations, and therefore generally works best for small to medium scale problems.
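The splitting above can be sketched as follows: with x = u − v and u, v ≥ 0, the one-norm constraint becomes the linear constraint 1ᵀ(u + v) ≤ τ, and the objective becomes a convex QP in s = [u; v]. The helper below only assembles the QP data; names are illustrative, and any QP solver (such as qpOASES) could consume the output:

```python
import numpy as np

def lasso_as_qp(A, b, tau):
    # Rewrite min 0.5*||Ax - b||^2 s.t. ||x||_1 <= tau as a QP in s = [u; v],
    # with x = u - v, u >= 0, v >= 0, and sum(u) + sum(v) <= tau.
    m, n = A.shape
    Q = A.T @ A
    H = np.block([[Q, -Q], [-Q, Q]])          # Hessian of 0.5 * s^T H s
    g = np.concatenate([-A.T @ b, A.T @ b])   # linear term g^T s
    a_ineq = np.ones(2 * n)                   # a_ineq^T s <= tau
    return H, g, a_ineq, tau
```

Up to the constant ½‖b‖₂², the QP objective ½sᵀHs + gᵀs equals ½‖A(u − v) − b‖₂², so the two formulations share minimizers.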
For the comparison we generate 20 random m × 2m matrices A and vectors b with i.i.d. Gaussian elements, with the columns of A normalized to unit length. For each problem instance we use spgl1 to solve the basis-pursuit denoise problem with σ ∈ {0.2, 0.1, 0.05, 0.02, 0.01} to obtain the corresponding τ values. We then solve the first Lasso problem with qpOASES, as well as with spg and the hybrid method, both with relative duality gap tolerance 10⁻⁶. In preliminary runs it was found that the runtime of qpOASES is insensitive to the problem instance, and we therefore ran only the first problem instance for each τ value. For the spg and hybrid methods we ran all 20 instances. The (average) runtime for the three methods is plotted against problem size m in Fig. 3. The first thing to notice is that the runtime of qpOASES is insensitive to the value of τ, whereas the runtime of spg and the hybrid method increases with τ (corresponding to smaller σ). As expected, qpOASES performs well on small problems. However, it scales poorly with increasing problem size and becomes uncompetitive beyond m ≈ 200. Using the procedure described in Sect. 4.3 we generated dual-feasible solutions for each of the primal solutions obtained with qpOASES. This showed that the relative duality gap for the qpOASES solutions was around 10⁻¹², which is much smaller than the threshold used for spg and the hybrid method. To ensure a fair comparison we modified qpOASES to keep track of the relative duality gap at each iteration to see if early stopping was possible. This turned out not to be the case, since the one-norm of the iterates was found to substantially exceed τ for all but the last several iterations, thereby prohibiting early stopping.
The total runtime over all problem instances using qpOASES amounted to 24,725 s. Even with additional problem instances for m = 1536 and m = 2048, the totals of the (average) runtimes for spg and the hybrid method were 1492 s and 1206 s, respectively. The total runtime for the hybrid method is thus 19% less than that of spg.

Comparison with pnopt
The pnopt algorithm, introduced by Lee et al. [20], is a proximal Newton-type method for solving problems of the form where f(x) is convex and continuously differentiable, and h(x) is convex but not necessarily differentiable. By choosing f(x) = ½‖Ax − b‖₂² and setting h(x) to the indicator function of the one-norm ball of radius τ, pnopt can solve the Lasso formulation (LS τ). We configure pnopt to use an L-BFGS quadratic model for f(x) with the default memory size of 50 (using a memory size of eight, as in the hybrid method, gave similar results). Differences in the stopping criteria make it difficult to directly compare the performance of pnopt with spg and the hybrid method. We determine relative duality gap tolerance values and pnopt runtimes using the following steps:

1. We run pnopt on each of the problem instances from Sect. 5.1.1 using solver-specific tolerance values of 10⁻⁴, 10⁻⁵, and 10⁻⁶.
2. Given the solutions for each of the three settings, we determine the relative duality gap, as described in Sect. 4.3, and choose this value as the reference tolerance. The medians of the resulting tolerance values are plotted in Fig. 4a.
3. In order to ensure that pnopt did not reach the stopping criterion much earlier, we created a modified version of pnopt that evaluates the duality gap at each iteration to determine the first iteration at which the reference tolerance is attained. In most cases this number equals the number of iterations in step 1; in the remaining cases it is less, due to the choice of the threshold value.
4. Finally, we rerun the original pnopt solver with the number of iterations limited to the value found in the previous step. This is the first iteration at which the relative duality gap tolerance is attained.
5. The time reported for each problem instance and setting for pnopt is then chosen as the minimum of the runs in steps 1, 3, and 4. Note that this approach is advantageous to pnopt because it effectively excludes the time needed to evaluate the relative duality gap.

[Fig. 4 caption: Plots of (a) the median relative duality gap for ten problem instances solved with pnopt, and (b) the median runtime over the problem instances using pnopt, spg, and the hybrid method with a relative duality gap threshold corresponding to the settings in plot (a).]
For spg and the hybrid method we use the reference tolerance found in step 2 above as the stopping criterion. Figure 4b plots the median runtimes over ten problem instances for the problem sizes in Sect. 5.1.1 and the relative duality gap tolerances corresponding to the three original tolerance values (we plot the median rather than the average runtime due to the presence of problem instances that take excessive time to solve using pnopt; for spg and the hybrid method the mean and median are quite similar). It can be seen that both spg and the hybrid method are much faster than pnopt for all problem sizes. Between spg and the hybrid method, we see that the hybrid method is slightly slower than spg, especially on the small problem sizes. This can be attributed to the large tolerance values for the relative duality gap and the low number of iterations needed to reach them, which means that the quasi-Newton phase either never activates, or activates only several iterations before spg finishes. In this case, the overhead associated with maintaining the support cannot be offset by the benefits of any quasi-Newton steps.

Lasso on sparse problems
We now take a closer look at the occurrence of line-search errors in spg and the speed-up obtained using the hybrid method. For this we generate test problems following a conventional compressed-sensing scenario and choose A to be a random 1024 × 2048 matrix with i.i.d. normal entries and columns normalized to unit norm. We set b = Ax₀ for k-sparse vectors x₀ with random support and non-zero entries generated i.i.d. from (1) the normal distribution; (2) the uniform distribution over [−1, 1]; and (3) the discrete set {−1, +1} with equal probability. We set τ = 0.99 · ‖x₀‖₁ and terminate the algorithm whenever the relative duality gap, defined here as (f(x) − f_dual)/max{f(x), 10⁻³}, falls below a given tolerance level. We run 50 instances for each of the three distributions and report the results in Table 1. The first three columns for either method give the average runtime over 50 instances when the non-zero entries are sampled i.i.d. from, respectively, the discrete {−1, 1}, uniform (−1, 1), and normal distributions. The fourth column gives the median relative duality gap at the final iteration, taken over all 150 problem instances, and should be compared with the optimality tolerance, which was set to 10⁻⁶. The fifth column of each of the two blocks, indicated by the check mark, gives the percentage of runs that completed successfully, that is, without a line-search error. The right-most column gives the average of the speed-up values for the three distributions at an optimality tolerance of 10⁻⁶. In general we see that the runtime goes up considerably as we increase k. Moreover, the results show clear differences in runtime across the three distributions, with a much higher runtime for problems based on sparse vectors with ±1 entries. For sparsity levels up to around one hundred, the number of iterations in the spg method is relatively small (between 25 and 50).
For these problems the hybrid method may complete before, or soon after, the first quasi-Newton step is taken. The slight overhead of the method, and occasionally a small number of additional iterations, make the hybrid method somewhat slower on average than the spg method for these problems. For larger values of k, the number of iterations goes up, and the effect of the quasi-Newton steps in the hybrid method becomes apparent, with average speed-up values between 20 and 30%. Aside from the reduced runtime, we see from Table 1 that the hybrid method also manages to solve many more problems to the desired accuracy level than the spg method; the percentage of solved problems steadily falls to around 9% with increasing k for the spg method, but remains at 100% for all but the largest k for the hybrid method. The median relative duality gap provides further information about the level of accuracy reached before the algorithm completes or terminates with a line-search error. For the largest values of k, the spg method fails to reach a relative duality gap of even 10⁻⁵ for at least half of the problems.
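The test-problem generation described above can be sketched as follows; the function name and parameter defaults are illustrative:

```python
import numpy as np

def make_lasso_instance(m=1024, n=2048, k=20, dist="normal", seed=0):
    # A: i.i.d. normal entries, columns normalized to unit norm.
    # x0: k-sparse with random support; b = A x0; tau = 0.99 * ||x0||_1.
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)
    x0 = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    if dist == "normal":
        x0[support] = rng.standard_normal(k)
    elif dist == "uniform":
        x0[support] = rng.uniform(-1.0, 1.0, k)
    else:  # "sign": entries from {-1, +1} with equal probability
        x0[support] = rng.choice([-1.0, 1.0], size=k)
    b = A @ x0
    tau = 0.99 * np.linalg.norm(x0, 1)
    return A, b, x0, tau
```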

Root finding
Given that most of the runtimes appearing in Table 1 are of the order of seconds, it is valid to question whether these problems are too idealized and well behaved to give a good idea about the practical performance of the algorithms. In this section we therefore look at two different types of problems. First we introduce a class of random problems that better reflect conditions found in practical problems. Second we evaluate the performance on the Sparco [21] collection of test problems for sparse reconstruction.

Coherent test problem generation
In the compressed-sensing literature it is well known that a random Gaussian matrix satisfies, with high probability, the property that all sufficiently small subsets of columns form a near-orthogonal basis for the subspace they span, a property known as the restricted isometry property [22]. Another quantity used to characterize matrices is the mutual coherence, defined as the maximum absolute pairwise cosine distance between the columns. In practical applications the matrix A is often quite coherent [23], and although there are no theoretical results on how this affects the complexity of one-norm minimization, it has been observed empirically that more coherent problems are harder to solve. The construction we propose for generating such problems is by means of a random walk on the (m − 1)-sphere with a step size parameterized by γ. Starting with a unit-norm column a₁, we construct successive columns by sampling a vector vₖ with i.i.d. Gaussian entries and setting aₖ₊₁ = α₁aₖ + α₂vₖ, where α₁ and α₂ are chosen such that ‖aₖ₊₁‖₂ = 1 and ⟨aₖ, aₖ₊₁⟩ = 1 − γ. In other words, aₖ₊₁ lies on the boundary of a spherical cap with center aₖ and angle θ such that cos(θ) = 1 − γ. The mutual coherence of the resulting matrix is bounded below by 1 − γ, and an example of the distribution of the pairwise cosine distances between the columns is given in Fig. 5a. An example Gram matrix, plotted in Fig. 5b, shows that aside from the banded structure, there are regions of increased coherence wherever the random walk approaches earlier locations. From Fig. 5c we see that lowering γ while keeping a₁ and vₖ fixed leads to an increase of the top singular value σ₁ as the columns become more and more similar. Figure 5d illustrates that the maximum pairwise coherence μ does not necessarily have a relationship with the top singular value.
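One concrete way to realize the choice of α₁ and α₂ is to first orthogonalize vₖ against aₖ, which gives α₁ = 1 − γ and α₂ = √(1 − (1 − γ)²) for the normalized orthogonal component. A sketch under that assumption:

```python
import numpy as np

def coherent_matrix(m, n, gamma, seed=0):
    # Columns follow a random walk on the (m-1)-sphere: each new column
    # satisfies ||a_{k+1}|| = 1 and <a_k, a_{k+1}> = 1 - gamma.
    rng = np.random.default_rng(seed)
    A = np.empty((m, n))
    a = rng.standard_normal(m)
    A[:, 0] = a / np.linalg.norm(a)
    c = 1.0 - gamma
    for k in range(1, n):
        v = rng.standard_normal(m)
        u = v - (v @ A[:, k - 1]) * A[:, k - 1]   # component orthogonal to a_k
        u /= np.linalg.norm(u)
        A[:, k] = c * A[:, k - 1] + np.sqrt(1.0 - c * c) * u
    return A
```

By construction every column has unit norm and consecutive columns have inner product exactly 1 − γ, so the mutual coherence of the resulting matrix is at least 1 − γ.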

Highly coherent measurement matrices
We apply spg and the hybrid method to solve (BP σ) using the root-finding framework explained in Sect. 1. Each Lasso subproblem (LS τ) is optimized to a certain optimality tolerance, and the overall problem is considered solved whenever the relative misfit |σ − ‖r‖₂|/max(σ, 10⁻³) falls below 10⁻⁵. For completeness we also compare the performance with the spgl1 algorithm as provided by [2]. For the first set of experiments we use the highly coherent matrices described in Sect. 5.2.1. As before, we create a k-sparse vector x₀ with non-zero entries sampled from different distributions, and set b = Ax₀ + v, where the entries in v are zero in the noiseless case. For Table 2a–c we run ten instances for each of the three distributions and report the average runtime over all thirty runs. The percentage reduction in runtime is computed based on the total runtime and was found to closely match the percentage obtained for each of the three signal classes independently. For the root-finding columns we solve (BP σ) with σ = 0.01‖b‖₂ and optimality tolerance levels of 10⁻⁴ and 10⁻⁶. For the Lasso columns we solve (LS τ) on equivalent problems, with τ set to the value obtained using the root-finding procedure. The results in Table 2d–f apply to noisy problems where ‖v‖₂ is scaled to the given percentage of ‖Ax₀‖₂, and σ is set accordingly. For these experiments we only consider sparse x₀ with random ±1 entries. The percentage decrease in runtime relative to spg, for spgl1 and the hybrid method, for all problem instances in Table 2 is given in Fig. 6. A summary of the total runtime for the different solvers, along with the percentage of solutions with relative duality gap within given ranges, is provided in Table 3.
[Table 2 caption: The columns within the root-finding and Lasso blocks are respectively the runtime in seconds of the spg and hybrid methods, and (in bold) the reduction in runtime, in percent, of the hybrid method compared to the spg method.]

[Fig. 6 caption: Percentage runtime reduction relative to spg for solvers spgl1 and the hybrid method on a set of highly coherent problems. The hybrid method is run on both the root-finding and Lasso problems with two optimality tolerance levels each. The reductions in runtime for the successive spg-hybrid pairs are 43, 34, 24, and 20%, respectively.]

The first thing to note from the results in Table 2 is that the problems generated with lower values of γ are indeed more difficult to solve for both spg and the hybrid method. Compared to the spg method, the hybrid method reduces the average runtime for nearly all problems, and does so by a percentage that increases as the problems get harder, a trend clearly seen in Fig. 6. From Table 3 we see that the hybrid method with optimality tolerances of 10⁻⁴ and 10⁻⁶ reduces the total runtime respectively by 34% and 43% for the basis-pursuit problems, and by 20% and 24% for the Lasso problems. The larger relative reduction in runtime for basis pursuit is due to the use of warm starting in the root-finding procedure, which removes a substantial number of iterations that would otherwise be identical for the hybrid and spg methods. Despite the improvements, the hybrid method still has a larger runtime than spgl1 on most problems. However, from Table 3 we see that spgl1 does not even reach a relative duality gap of 10⁻³ for nearly 80% of the problems, as a result of the relaxed stopping criteria. Tightening these criteria, as done in what we label the spg method, increases the number of solutions that attain the desired optimality tolerance. Nevertheless, the spg method still fails to reach an optimality of 10⁻⁶ for some 60% of the problems.
Finally, we see that the hybrid method not only improves the runtime of the spg method, but also manages to reach the requested optimality on all problems from Table 2.

Sparco test problems
Sparco [21] provides a standard collection of test problems for compressed sensing and sparse recovery. The problems in Sparco are of the form b = Ax + v, where A is represented as a linear operator rather than an explicit matrix. After excluding problems that are too easy to solve or that require access to third-party software, we obtain the problem selection listed in Table 4. For some problems we scale the original b to avoid a very small objective value at the solution, which would cause the stopping criterion based on the duality gap relative to max(f(x), 1) to be satisfied too easily. The table also lists the one-norms of the solutions found when solving with σ = 0.01‖b‖₂ and σ = 0.001‖b‖₂, respectively, for the scaled b.
We run the spg and hybrid methods with optimality tolerances ranging from 10⁻² down to 10⁻⁴; beyond that, some of the problems simply took too long to finish. For spgl1 we use optimality tolerance values of 10⁻⁶ and 10⁻⁹. By comparison these may seem excessively small, and we certainly do not expect the relative duality gap to reach these levels. Instead, we choose the small values to help control the other stopping criteria, such as the relative change in the objective value, which are parameterized using the same tolerance parameter. The results of the experiments with the two choices of σ are summarized in Tables 5 and 6. The hybrid method reduces the runtime of the spg method in 42 out of the 56 settings, often considerably so. For a tolerance level of 10⁻⁴ the hybrid method consistently outperforms the spg method, with an average time reduction of 38%. The required optimality level is reached on all problems except for problem 903 with the smaller σ and optimality tolerance 10⁻⁴; for this problem the spg method stops with a relative duality gap of 2 × 10⁻⁴ following a line-search error. The runtime for spgl1 with optimality tolerance 10⁻⁶ is very low overall, but comes at the cost of a rather large relative duality gap at the solution. Lowering the tolerance to 10⁻⁹ reduces the gap, but also leads to a considerable increase in runtime. In either case the number of root-finding iterations can be very large, especially if the target value of τ is exceeded and a gradual reduction follows. The lowest relative duality gap reached by spgl1 over all problems in Tables 5 and 6 is 4 × 10⁻³. The varying optimality levels make it difficult to compare results, so of special interest are problem instances where spgl1 simultaneously attains both a lower runtime and a lower relative duality gap than either the spg or hybrid method, or vice versa. From the tables we see

Primal-dual gap
We now consider the formulation for μ > 0. In Sect. 4.3 we described two different ways of deriving a dual formulation.
In the first approach we augment A and b to account for the (μ/2)‖x‖₂² term and reduce the problem to the standard Lasso formulation. The derivation of the dual for this formulation in [2,15] provides a way of generating a dual-feasible point (ȳ, λ̄) from a primal-feasible x by choosing ȳ = Āx − b̄ and solving a trivial optimization problem for λ̄. In the second approach we deal with formulation (32) directly and obtain a dual problem parameterized in (y, λ). As before, we can choose y equal to the residual, now in terms of the original A and b, and are left with a non-trivial optimization problem for λ that is nevertheless easily solved using the algorithm described in Sect. 4.3. We refer to the two derivations as the augmented derivation and the optimized derivation. The term 'optimized' refers to the need to solve for λ, but more importantly, to the fact that the dual objective generated from any x using the optimized derivation is never smaller than that using the augmented derivation, as shown in Sect. 4.3.
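The augmented derivation can be made concrete with a small helper; the identity ½‖Āx − b̄‖₂² = ½‖Ax − b‖₂² + (μ/2)‖x‖₂², with Ā = [A; √μ I] and b̄ = [b; 0], is the standard reduction assumed here:

```python
import numpy as np

def augment(A, b, mu):
    # Abar = [A; sqrt(mu) I], bbar = [b; 0]: folds the (mu/2)||x||^2 term
    # into a standard least-squares residual.
    m, n = A.shape
    Abar = np.vstack([A, np.sqrt(mu) * np.eye(n)])
    bbar = np.concatenate([b, np.zeros(n)])
    return Abar, bbar
```

Note that the augmented residual b̄ − Āx = [r; −√μ x] is exactly the r̄ used in the dual derivation of Sect. 4.3.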
To evaluate the practical difference between the two approaches we generate a large number of randomized test problems of the form b = Ax₀ + v, where the x₀ are random vectors with sparsity levels ranging from 50 to 350 in steps of 50, and with on-support entries drawn i.i.d. from the normal distribution. The m × n measurement matrices A have entries drawn i.i.d. from N(0, 1/m), with m = 1000 and n = 2000. Finally, the additive noise vectors v are drawn from the normal distribution and scaled to have Euclidean norm equal to 0, 0.01, 0.1, and 1% of that of the clean observation Ax₀. For the problem formulation we set τ to ‖x₀‖₁ scaled by 0.7 to 1.2 in increments of 0.1, and let μ range log-linearly from 10⁻¹ to 10⁻⁴ in four steps. As a result of the additive (μ/2)‖x‖₂² term in the objective, the solutions are no longer sparse, so the hybrid method tends to coincide with the spg method; we therefore only consider the latter in these experiments.
For each of the settings we evaluate the time required by the augmented and optimized formulations to reach a relative duality gap of 10⁻⁴. Figure 7a, b plots the speed-up obtained using the optimized formulation, along with the runtime of the augmented formulation, for different problem instances at two levels of μ. Despite the slightly more expensive evaluation of the dual, we see that the optimized formulation is around 1.5-4 times faster for μ = 0.01, and up to 7 times faster for μ = 0.001. For μ = 0.1 (not shown in the plot) the speed-up ranges from 1.2 to 3, and for μ = 10⁻⁴ the speed-up exceeds 10 on many problem instances and reaches a maximum of around 30.
[Table caption: The percentage reduction in time of the hybrid method over spg is given in bold next to the problem index. Entries marked in italics indicate a solution that is not a root.]

We now take a closer look at the relative distance of the primal and dual objectives to the optimum for the two circled problems in Fig. 7c, d. The progress of the primal objective over the iterations, indicated by the gray line, is the same for both formulations. For the dual objective there is a marked difference between the two. Notably, the dual objective of the augmented formulation converges much more slowly than that of the optimized formulation, thereby preventing the stopping criterion from being satisfied for many more iterations. We now take another look at the Sparco problems from Tables 5 and 6. For each of the settings we record the optimal τ and then run the hybrid solver with a target optimality tolerance of 10⁻⁸ to obtain a best-effort optimum (for some problems the line search failed before reaching the desired tolerance). We then run the spg and hybrid solvers with a target accuracy of 10⁻⁵ and record the relative distance of the primal and dual objectives to the optimum at every iteration. The results for four representative problems are plotted in Fig. 8. From the plots we see that the iterates of the hybrid method initially coincide with, or otherwise closely follow, those of the spg method. Once the hybrid method starts using quasi-Newton iterates increasingly often, we see a sharp decrease in the relative distance to the optimum of the primal and dual iterates. The iterates of the spg method, by contrast, continue to improve only very slowly. Indeed, of the fourteen problem settings, the spg method managed to solve only two to the desired level of accuracy. Of the remaining problems, two reach the default iteration limit of ten times the number of rows in A, while all other problems fail with a line-search error. The hybrid method manages to solve all problems except for problem 401 with multiplier 10⁻³.
This problem reached the iteration limit, but could otherwise be solved successfully to a tolerance level of even 10⁻⁸. As before, we see that the dual objective converges to the optimum much more slowly than the primal, and unfortunately, there is no clear way to extend the optimized dual formulation from Sect. 4.3 to the standard Lasso formulation where μ = 0. Given that satisfaction of the optimality condition depends almost entirely on the dual objective value, it makes sense to look at the speed-up that would be attained if the optimal objective value were known and satisfaction of the optimality condition depended only on the primal objective. In Table 7 we list the attainable speed-ups for the different Sparco problems at varying optimality tolerance levels. Clearly, both the spg and hybrid methods would benefit greatly from an improved dual, although the effect is smaller for the hybrid method, due to the already fast convergence of the dual objective in its final iterations.

Conclusions
In this paper we have presented a hybrid algorithm for the minimization of quadratic functions over weighted one-norm balls. The method extends the spectral projected-gradient method with L-BFGS iterations applied to reparameterizations of the objective function over active faces of the one-norm ball. To decide on the iteration type we introduce the self-projection cone of a face, and we provide a complete characterization of this cone for weighted one-norm balls. The reparameterization uses an implicit orthonormal basis for the current face, and we provide an efficient algorithm for matrix-vector multiplication with this basis and its transpose. As part of the numerical experiments we propose a challenging class of test problems in which the columns of the m × n measurement matrix A are generated by a random walk over the (m − 1)-sphere. Based on extensive numerical experiments on these and other test problems, we showed that the hybrid method outperforms the original spectral projected-gradient method on a large fraction of the problems. Especially for medium- to high-accuracy solves and more challenging problems, the spg method was found to either take much more time to reach the desired level of accuracy, or to fail prematurely due to line-search problems. Both methods performed favorably against the parametric active-set solver qpOASES [19] and the proximal Newton-type solver pnopt [20]. The current stopping criterion used in both spg and the hybrid method relies on the generation of a dual-feasible point from the primal iterate to determine the relative optimality of the iterate.

[Table caption: We give a lower bound (indicated by the '>' sign) when the dual objective failed to reach the given optimality level, either because the maximum number of iterations was reached or because a line-search error occurred.]
From the experiments we found that the primal objective converges to the optimum much faster than the dual objective, and that satisfaction of the stopping criterion therefore depends entirely on the dual objective reaching the critical threshold. The performance of both methods could therefore be improved substantially given a better dual estimate.
In this paper we have studied the application of the hybrid method to the Lasso problem. Other important problems that may benefit from the approach but were not discussed in this paper include box-constrained optimization and minimization of quadratic functions over the simplex.