A Laplacian Approach to $\ell_1$-Norm Minimization

We propose a novel differentiable reformulation of the linearly-constrained $\ell_1$-minimization problem, also known as the basis pursuit problem. The reformulation is inspired by the Laplacian paradigm of network theory and leads to a new family of gradient-based methods for the solution of $\ell_1$-minimization problems. We analyze the iteration complexity of a natural solution approach to the reformulation, based on a multiplicative weights update scheme, as well as the iteration complexity of an accelerated gradient scheme. The results can be seen as bounds on the complexity of iteratively reweighted least squares (IRLS)-type methods for basis pursuit.


Introduction
An important primitive in the areas of signal processing and optimization is that of finding a minimum $\ell_1$-norm solution to an underdetermined system of linear equations. Specifically, for some $n \le m$, let $\hat{s} \in \mathbb{R}^m$ represent an unknown signal, $b \in \mathbb{R}^n$ a measurement vector, and $A \in \mathbb{R}^{n \times m}$ a full-rank matrix such that $A\hat{s} = b$. In some circumstances, the unknown signal $\hat{s}$ can be recovered by computing a minimum $\ell_1$-norm solution to the system $As = b$; in other words, solving the following optimization problem:

minimize $\|s\|_1$ subject to $As = b$, $s \in \mathbb{R}^m$. (BP)

This $\ell_1$-minimization problem is known as basis pursuit. It is a central problem in the theory of sparse representation and arises in several applications, such as imaging and face recognition. Through a standard reduction, it also captures the $\ell_1$-regression problem used in statistical estimation.
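As a concrete illustration (our own sketch, not part of the original development), (BP) can be solved as a linear program by splitting $s$ into nonnegative parts; the following example uses SciPy's `linprog` on a small synthetic instance.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 3, 6
A = rng.standard_normal((n, m))
s_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0, 0.0])  # a sparse signal
b = A @ s_true

# Write s = p - q with p, q >= 0, so that ||s||_1 = 1^T p + 1^T q:
# (BP) becomes  min 1^T [p; q]  s.t.  [A, -A][p; q] = b,  [p; q] >= 0.
res = linprog(np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=(0, None))
s = res.x[:m] - res.x[m:]
assert np.allclose(A @ s, b, atol=1e-6)                 # feasible for (BP)
assert np.abs(s).sum() <= np.abs(s_true).sum() + 1e-6   # at least as good as s_true
```

Since $s_{\text{true}}$ is feasible, the LP optimum can only have smaller or equal $\ell_1$-norm, which the final assertion checks.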
The convex optimization problem (BP) can be cast as a linear program and thus could be solved via an interior-point method. Another popular approach to $\ell_1$-minimization is the iteratively reweighted least squares (IRLS) method, which is based on iteratively solving a sequence of adaptively weighted $\ell_2$-minimization problems. IRLS methods are popular in practice due to their simplicity and the fact that they require neither preprocessing nor special initialization rules. Despite this popularity, theoretical guarantees for IRLS methods are not common in the literature, particularly in terms of global convergence bounds.
This work contributes to the understanding and design of IRLS-type methods for basis pursuit. We propose a novel exact reformulation of (BP) as a differentiable convex problem over the positive orthant, which we call the dissipation minimization problem. A distinguishing feature of this approach is that it entails the solution of a single differentiable convex problem. The reformulation leads naturally to a new family of IRLS-type methods for solving (BP).
We exemplify this approach by providing global convergence bounds for discrete IRLS-type algorithms for (BP). We explore two possible routes to the solution of the dissipation minimization problem, and thus of (BP), where we use the established framework of first-order optimization methods to derive two provably convergent iterative algorithms. We bound their iteration complexity as $O(m^2/\epsilon^3)$ and $O(m^2/\epsilon^2)$, respectively, where $\epsilon$ is the relative error parameter. These methods are in the IRLS family, since each iteration can be reduced to the solution of a weighted least squares problem. Both methods are very simple to implement, and the first one exhibits a geometric convergence rate in numerical experiments.
Our dissipation-based reformulation of (BP) may be of independent interest. It is rooted in the Laplacian framework of network theory: it generalizes concepts such as the Laplacian matrix and the transfer matrix, which were originally developed to express the relation between electrical quantities across different terminals of a resistive network. (Many of our formulas have simple interpretations when the constraint matrix $A$ is derived from a network matrix.)
This paper is organized as follows. In Section 2, we present the dissipation minimization reformulation of basis pursuit and some of its structural properties. In Section 3 we prove the equivalence between basis pursuit and dissipation minimization. In Section 4 we look at the continuous dynamics obtained by applying mirror descent to the dissipation minimization objective and connect them with the existing literature. In Section 5, we analyze a discretization of these dynamics that yields an iterative IRLS-type method for the solution of the dissipation minimization problem and, hence, of basis pursuit; this method can be seen as an application of the well-known multiplicative weights update scheme, and its iteration complexity is $O(m^2/\epsilon^3)$. Then, by leveraging Nesterov's accelerated gradient scheme, we present and analyze an improved IRLS-type method with iteration complexity $O(m^2/\epsilon^2)$. In Section 6, implementations of the two methods are compared against existing solvers from the l1benchmark suite [39].
Related literature. Given its central role in the areas of sparse representation and statistics, the literature on the basis pursuit problem and $\ell_1$-regression is extensive; see for example [11,16,18,24] and references therein. Several algorithms for basis pursuit are reviewed in Chapter 15 of [24]; for an experimental comparison and an application to face recognition, see [39].
Various versions of IRLS schemes have been studied for a long time [26,32] and, as already mentioned, the methods have been popular in practice due to their simplicity and experimental performance [17]. On the other hand, theoretical guarantees for IRLS-type algorithms are few and far between [7,21,35]. A recent IRLS algorithm stands out in the context of this paper, as it applies to the basis pursuit problem and comes with a worst-case guarantee: an algorithm requiring $\tilde{O}(m^{1/3}\epsilon^{-8/3})$ iterations, due to Chin et al. [19, Theorem 5.1], derived by further developing the approach of Christiano et al. [20]. In this context, our approach breaks the $\epsilon^{-8/3}$ bound for an IRLS method (at the cost of a worse dependency on $m$). We nevertheless emphasize that the goal of this work is not to establish the superiority of a specific algorithm, but rather to highlight a new approach that, already when coupled with off-the-shelf optimization methods, offers a principled way to derive IRLS-type algorithms with competitive theoretical performance. Subsequently to the first appearance of our results (on arXiv), an improved bound of $\tilde{O}(m^{1/3}\epsilon^{-2/3} + \epsilon^{-2})$ iterations for a more sophisticated IRLS-type algorithm for (BP) has been derived by Ene and Vladu [22] (again building on the ideas of [20] and [19]). While this algorithm has a rather more favorable worst-case dependency on the parameters, in practice it requires roughly $1/\epsilon$ iterations [22, Section 4]; in contrast, as we observe in Section 6, the experimental convergence rate of our approach is geometric, that is, the number of iterations required is linear in $\log(1/\epsilon)$, suggesting that a much stronger theoretical bound may hold in our setting.
Our reformulation of basis pursuit is new, though it is in part inspired by the Laplacian framework [34]. In particular, the definition of the dissipation function is based on a generalization of the Laplacian potential of a network. This reinforces the idea from Chin et al. [19] that concepts originally developed for network optimization can be fruitful in the context of $\ell_1$-regression. The dissipation-minimizing dynamics considered in Section 4 are an application of the mirror descent (or natural gradient) dynamics [2,3,29,30] to our new objective function.

Algorithm                              Iteration complexity
Chin et al. [19]                       $\tilde{O}(m^{1/3}\epsilon^{-8/3})$
Ene and Vladu [22]                     $\tilde{O}(m^{1/3}\epsilon^{-2/3} + \epsilon^{-2})$
This work (primal gradient scheme)     $O(m^2\epsilon^{-3})$
This work (accelerated gradient)       $O(m^2\epsilon^{-2})$

Table 1. Worst-case iteration complexity of recent IRLS methods for $\ell_1$-norm minimization.

In Section 5.1 we show, in particular, how the algorithmic framework of Lu, Freund and Nesterov [29] (see also [6]) can be applied to the dissipation minimization problem. The improved algorithm discussed in Section 5.2 is instead based on Nesterov's well-known accelerated gradient method [31]. The dynamics studied in Sections 4 and 5 bear some formal similarity to the so-called Physarum dynamics, studied in the context of natural computing, which are the network dynamics of a slime mold [10,14,35-37]. The fact that the Physarum dynamics are of IRLS type was first observed in [35]. In this context, our result can be seen as the derivation of a Physarum-like dynamics purely from an optimization principle: dissipation minimization following the natural gradient. A relevant difference is that the specific dynamics we study is a gradient system, while the dynamics studied in [10,35] is provably not a gradient system. This is precisely what enables us to apply the machinery of first-order convex optimization methods, and acceleration in particular.
We note that a different proof of Theorem 3.1 has been independently provided by Facca, Cardin and Putti [23] in the context of the Physarum dynamics.
Notation. For a vector $x \in \mathbb{R}^m$, we use $\operatorname{diag}(x)$ to denote the $m \times m$ diagonal matrix with the coefficients of $x$ along the diagonal. The inner product of two vectors $x, y \in \mathbb{R}^m$ is denoted by $\langle x, y\rangle = x^\top y$. The maximum (respectively, minimum) eigenvalue of a diagonalizable matrix $M$ is denoted by $\lambda_{\max}(M)$ (respectively, $\lambda_{\min}(M)$). For a vector $x \in \mathbb{R}^m$, $\|x\|_p$ denotes the $\ell_p$-norm of $x$ ($1 \le p \le \infty$), and $|x|$ denotes the vector $y$ such that $y_i = |x_i|$, $i = 1, \ldots, m$. Similarly, $x^2$ denotes the vector $y$ such that $y_i = x_i^2$, $i = 1, \ldots, m$. With a slight overloading of notation, which should nevertheless not cause any confusion, we instead reserve $x_k$ with a symbolic index $k$ to denote the vector produced by the $k$th step of an iterative algorithm.

Basis pursuit and the dissipation minimization problem
2.1. Assumptions on the basis pursuit problem. We make the following assumptions on (BP): (A.1) the matrix $A$ has full rank and $n \le m$; (A.2) the system $As = b$ has at least one solution $s$ such that $s_j \neq 0$ for each $j = 1, \ldots, m$.

Assumption (A.2) can be made without loss of generality. Given an instance $(A, b)$, consider the instance $(A', b)$ with $A' = [A \;\, A] \in \mathbb{R}^{n \times 2m}$, obtained by duplicating each column of $A$. Then:
• Let $u = A^\top (AA^\top)^{-1} b$ be the least-squares solution to $As = b$. There is at least one solution $s'$ to $A's' = b$ with $s'_j \neq 0$ for each $j = 1, \ldots, 2m$: for instance, $s'_{2j-1} = u_j/2 + \theta$ and $s'_{2j} = u_j/2 - \theta$, with $\theta$ avoiding the finitely many values for which some coordinate vanishes.
• No optimal solution $s'$ to the instance $(A', b)$ is such that $s'_{2j-1} \cdot s'_{2j} < 0$ for some $j$: if that were the case, one could form a solution of lesser cost by replacing each of $s'_{2j-1}$ and $s'_{2j}$ with their average. Thus, any optimal solution $s'$ to $(A', b)$ can be transformed back into a solution $s$ to $(A, b)$ by taking $s_j = s'_{2j-1} + s'_{2j}$ for each $j = 1, \ldots, m$. Such a solution satisfies $\|s\|_1 = \|s'\|_1$ and thus must be optimal for $(A, b)$.
Remark 2.1. A special case of (BP) is when $A$ is derived from a network matrix. Specifically, consider a connected network with $n + 1$ nodes and $m$ edges, and suppose edge $j$ connects node $u$ to node $v$. Define the $j$th column $B_j \in \mathbb{R}^{n+1}$ by $(B_j)_u = 1$, $(B_j)_v = -1$, and all other entries $0$. The matrix $B \in \mathbb{R}^{(n+1) \times m}$ is called the incidence matrix of the network. For any connected network, the incidence matrix $B$ has rank $n$ and, additionally, any row of $B$ can be expressed as a linear combination of the remaining $n$ rows, because the sum of all rows is the zero vector. Let $A$ be the submatrix of $B$ obtained by deleting an arbitrary row. Then $A$ satisfies assumption (A.1) and thus, without loss of generality, (A.2). A solution $s$ to $As = b$ can be interpreted as an assignment of flow values to each edge such that the net in-flow at every node $v = 1, \ldots, n$ matches the prescribed demand $b_v$.
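A minimal sketch of Remark 2.1 (our own illustration, on an arbitrary small network): building the incidence matrix $B$, checking the zero row sums and rank $n$, and grounding a node to obtain $A$.

```python
import numpy as np

# A connected network on n + 1 = 4 nodes with m = 5 directed edges.
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
num_nodes, m = 4, len(edges)
B = np.zeros((num_nodes, m))
for j, (u, v) in enumerate(edges):
    B[u, j], B[v, j] = 1.0, -1.0   # column j: +1 at tail u, -1 at head v

assert np.allclose(B.sum(axis=0), 0)              # rows sum to the zero vector
assert np.linalg.matrix_rank(B) == num_nodes - 1  # rank n for a connected network

A = B[:-1, :]   # delete one row: its node becomes the grounded node
assert np.linalg.matrix_rank(A) == num_nodes - 1  # A has full row rank, as in (A.1)
```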
2.2.The dissipation potential.In this section we introduce the dissipation potential, which is the function on which our reformulation of the basis pursuit problem is based.
Definition 2.1. The Laplacian-like matrix relative to a vector $x \in \mathbb{R}^m_{\ge 0}$ is the matrix $L(x) := AXA^\top$, where $X = \operatorname{diag}(x)$.
Remark 2.2. In the network setting described in Remark 2.1, a vector $x \in \mathbb{R}^m_{>0}$ can be interpreted as a set of weights, or conductances, on the edges of the network. Then the matrix $BXB^\top$ is the weighted Laplacian of the network [12,34]. The matrix $L(x) = AXA^\top$ is sometimes called the reduced Laplacian.

Proposition 2.2. If $A$ has full row rank and $x \in \mathbb{R}^m_{>0}$, then $L(x)$ is positive definite, and hence invertible.
The following function definition is central to our approach.

Definition 2.2. The function $f_0 : \mathbb{R}^m_{>0} \to \mathbb{R}$ is defined by
(1) $f_0(x) = \langle 1, x\rangle + b^\top L^{-1}(x)\, b$,
and its extension $f : \mathbb{R}^m \to [0, +\infty]$ by
(2) $f(x) = \liminf_{x' \to x,\, x' > 0} f_0(x')$.
We call $f$ the dissipation potential. An equivalent definition of $f$ is as the convex closure of $f_0$, which is the function whose epigraph in $\mathbb{R}^{m+1}$ is the closure of the epigraph of $f_0$ [33, Chapter 7]. The effective domain of $f$ is the set $\operatorname{dom} f = \{x \in \mathbb{R}^m : f(x) < +\infty\}$, a subset of $\mathbb{R}^m_{\ge 0}$ that contains $\mathbb{R}^m_{>0}$. The functions $f$ and $f_0$ differ only on the boundary of the positive orthant. We will show that $f$ always achieves a minimum on $\mathbb{R}^m_{\ge 0}$, and hence on $\mathbb{R}^m$. One of our main results (Theorem 3.1) is that this minimum equals twice the optimal value of (BP).
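For illustration (our own sketch, not from the original), $f_0$ can be evaluated directly from (1), and its convexity (Lemma 2.3 below) can be spot-checked numerically on a random instance:

```python
import numpy as np

def f0(A, b, x):
    """Dissipation potential f0(x) = <1, x> + b^T L(x)^{-1} b, L(x) = A diag(x) A^T."""
    L = (A * x) @ A.T          # equals A @ np.diag(x) @ A.T
    return x.sum() + b @ np.linalg.solve(L, b)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
x, y = rng.uniform(0.5, 2.0, 6), rng.uniform(0.5, 2.0, 6)

# Midpoint convexity: f0((x+y)/2) <= (f0(x) + f0(y)) / 2.
assert f0(A, b, (x + y) / 2) <= (f0(A, b, x) + f0(A, b, y)) / 2 + 1e-9
assert f0(A, b, x) > 0   # positivity
```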
Remark 2.3. Consider again the case where the matrix $A$ is derived from a network matrix, as in Remark 2.1. The node of the network corresponding to the row that was removed from the incidence matrix to form $A$ is called the grounded node. Now assume that $b = e_u$ for some node $u$, that is, a unit external current enters at node $u$ and exits at the grounded node. Then the Laplacian potential $b^\top L^{-1}(x)\, b$ yields the effective resistance between the grounded node and node $u$ when the conductances of the network are specified by the vector $x$. A standard result in network theory is that decreasing the conductance of any edge can only increase the effective resistance between any two nodes (see, for example, [12,25]). Thus, the minimization of the dissipation potential $f$ involves an equilibrium between two opposing tendencies: decreasing any $x_j$ decreases the linear term $\langle 1, x\rangle$, but increases the Laplacian term $b^\top L^{-1}(x)\, b$.

2.3. Basic properties of the dissipation potential. We proceed to show that the dissipation potential attains a minimum. We start with some basic properties of $f_0$.
Lemma 2.3. The function $f_0$ is positive, convex and differentiable on $\mathbb{R}^m_{>0}$.
Proof. Positivity follows from the positive-definiteness of $L^{-1}(x)$ for $x \in \mathbb{R}^m_{>0}$ (implied by Proposition 2.2). For convexity, it suffices to show that the mapping $x \mapsto b^\top (AXA^\top)^{-1} b$ is convex on $\mathbb{R}^m_{>0}$. First observe that $x \mapsto AXA^\top$ is a linear matrix-valued function, i.e., each one of the entries of $AXA^\top$ is a linear function of $x$, since multiplying $X$ on the left and right with $A$ and $A^\top$ yields linear combinations of the elements of $x$. Second, the matrix-to-scalar function $Y \mapsto b^\top Y^{-1} b$ is convex on the cone of positive definite matrices, for any $b \in \mathbb{R}^n$ (see for example [15, Section 3.1.7]). By combining the two facts above, it follows that the composition $x \mapsto b^\top (AXA^\top)^{-1} b$ is convex, and hence so is $f_0$. Finally, since the entries of $L(x)$ are linear functions of $x$, the function $f_0$ is a rational function with no poles in $\mathbb{R}^m_{>0}$, hence differentiable.
To argue that f attains a minimum, we first recall some notions from convex analysis [8,33].An extended real-valued function f : R m → [−∞, +∞] is called proper if its domain is nonempty and the function never attains the value −∞.
Corollary 2.4. The dissipation potential $f$ is a closed proper convex function, and it is nonnegative.

Proof. By construction, $f$ coincides with the closure of $f_0$ and thus it is a closed proper convex function [33, Theorem 7.4]. Its nonnegativity follows from the positivity of $f_0$ and from (2).

Corollary 2.5. The function $f$ attains its minimal value over $\mathbb{R}^m_{\ge 0}$.

Proof. Note that $\lim_{\|x\| \to \infty} f(x) = \infty$, because $b^\top (AXA^\top)^{-1} b \ge 0$ for any $x \in \operatorname{dom} f_0$, and $\langle 1, x\rangle \to \infty$ as $\|x\| \to \infty$ with $x \in \operatorname{dom} f_0$. In other words, $f$ is also a coercive function and therefore it attains a minimal value over any nonempty closed set intersecting its domain [8, Theorem 2.14]; in particular, it attains its minimal value over $\mathbb{R}^m_{\ge 0}$.
Since $f(x) = \liminf_{x' \to x} f_0(x')$, the minimum attained by $f$ over $\mathbb{R}^m_{\ge 0}$ equals $\inf_{x > 0} f_0(x)$. Note also that this minimum may be attained on the boundary of $\operatorname{dom} f$.

2.4. Gradient and Hessian. In this section we derive some formulas for the gradient and Hessian of $f$ on the interior of its domain.
Definition 2.3. The voltage vector relative to $x \in \mathbb{R}^m_{>0}$ is $d(x) := A^\top L^{-1}(x)\, b$; its $j$th coordinate is $d_j(x) = a_j^\top L^{-1}(x)\, b$, where $a_j$ denotes the $j$th column of $A$.

Remark 2.4. In the network setting described in Remark 2.1, $d_j(x)$ expresses the voltage along edge $j$ when an external current $b_u$ enters each node $u = 1, \ldots, n$ (and a balancing current $-\sum_u b_u$ enters the grounded node).
The next lemma relates the gradient ∇f (x) to the voltage vector at x.
Lemma 2.6. For any $x > 0$, $\frac{\partial f(x)}{\partial x_j} = 1 - d_j(x)^2$, where $d_j(x) = a_j^\top L^{-1}(x)\, b$ and $a_j$ stands for the $j$th column of $A$; in vector form, $\nabla f(x) = 1 - d^2(x)$.

Proof. First observe that $L(x) = AXA^\top = \sum_{j=1}^m x_j\, a_j a_j^\top$ and thus $\partial L/\partial x_j = a_j a_j^\top$. We apply the formula for the derivative of a matrix inverse:
(3) $\frac{\partial L^{-1}}{\partial x_j} = -L^{-1}\, \frac{\partial L}{\partial x_j}\, L^{-1}.$
We obtain
$\frac{\partial}{\partial x_j}\, b^\top L^{-1}(x)\, b = -\,b^\top L^{-1} a_j\, a_j^\top L^{-1} b = -\,d_j(x)^2.$
The claim follows by the definition of $f$.
To express the Hessian of $f$, in addition to the voltages we need the notion of transfer matrix.

Definition 2.4. The transfer matrix relative to a vector $x \in \mathbb{R}^m_{>0}$ is the matrix $T(x) := A^\top L^{-1}(x)\, A$, so that $T_{ij}(x) = a_i^\top L^{-1}(x)\, a_j$.
Remark 2.5. In the network setting described in Remark 2.1, the transfer matrix $T(x)$ expresses the relation between input currents and output voltages, when the conductances are given by the vector $x$. Namely, $T_{ij}(x)$ is the amount of voltage observed along edge $i$ of the network when a unit external current is applied between the endpoints of edge $j$.
Corollary 2.7. For any $x > 0$,
$\nabla^2 f(x) = 2\, \big(d(x)\, d(x)^\top\big) \circ T(x),$
where $\circ$ denotes the Schur matrix product, defined by $(M \circ N)_{ij} = M_{ij} N_{ij}$.

Proof. For any $i, j = 1, \ldots, m$, by Lemma 2.6 and applying once more (3), we get
$\frac{\partial^2 f(x)}{\partial x_i\, \partial x_j} = -2\, d_j(x)\, \frac{\partial d_j(x)}{\partial x_i} = 2\, d_i(x)\, d_j(x)\, a_i^\top L^{-1}(x)\, a_j.$
The claim follows by Definition 2.4.
2.5. Bounds on the norms of gradient and Hessian. In this section we derive some norm bounds for the gradient and Hessian of the dissipation potential $f$; they will be used crucially to derive complexity bounds for the algorithms studied in Section 5. Two matrices $M$, $M'$ are called congruent if there is a nonsingular matrix $S$ such that $M' = SMS^\top$. For the proofs in this section, the main tool we rely on is the following algebraic fact relating the eigenvalues of congruent matrices; see for example [28, Theorem 4.5.9] for a proof.

Theorem 2.8 (Ostrowski). Let $M, S \in \mathbb{R}^{m \times m}$ be two symmetric matrices, with $S$ nonsingular. For $k = 1, \ldots, m$, let $\lambda_k(M)$, $\lambda_k(SMS^\top)$ denote the $k$th largest eigenvalue of $M$ and $SMS^\top$, respectively. For each $k = 1, \ldots, m$ there is a positive real number $\theta_k$, with $\lambda_{\min}(SS^\top) \le \theta_k \le \lambda_{\max}(SS^\top)$, such that $\lambda_k(SMS^\top) = \theta_k\, \lambda_k(M)$.

Lemma 2.9. Let $x \in \mathbb{R}^m_{>0}$. Each nonzero eigenvalue of $T(x)$ is at least $(\max_{i=1,\ldots,m} x_i)^{-1}$ and at most $(\min_{i=1,\ldots,m} x_i)^{-1}$.
Proof. Consider the matrix $\Pi(x) := X^{1/2} A^\top L^{-1}(x)\, A X^{1/2} = (AX^{1/2})^\top \big((AX^{1/2})(AX^{1/2})^\top\big)^{-1} (AX^{1/2})$. Hence, $\Pi(x)$ is the orthogonal projection matrix that projects onto the range of $(AX^{1/2})^\top$. In particular, $\Pi(x)^2 = \Pi(x)$ and each eigenvalue of $\Pi(x)$ equals $0$ or $1$. Since $T(x) = X^{-1/2}\, \Pi(x)\, X^{-1/2}$, the matrices $T(x)$ and $\Pi(x)$ are congruent. By Theorem 2.8, the algebraic multiplicity of the zero eigenvalue of $T(x)$ and $\Pi(x)$ is the same, and each positive eigenvalue of $T(x)$ must lie between the smallest and the largest eigenvalue of $X^{-1}$. These are $(\max_i x_i)^{-1}$ and $(\min_i x_i)^{-1}$, respectively.
Lemma 2.10. For any $x > 0$, $\|d(x)\|_\infty \le \sqrt{c_{A,b}}\,/\min_i x_i$, where $c_{A,b} := \|u\|_2^2$ and $u = A^\top (AA^\top)^{-1} b$ is the least-squares solution to $As = b$. Additionally, if $s^*$ is an optimal solution to (BP), $c_{A,b} \le \|s^*\|_1^2$.

Proof. Since $b = Au$, we have $d(x) = A^\top L^{-1}(x)\, Au = T(x)\, u$. Since the largest eigenvalue of $T(x)$ is at most $(\min_i x_i)^{-1}$ by Lemma 2.9, we can bound $\|d(x)\|_\infty \le \|T(x)\, u\|_2 \le \|u\|_2 / \min_i x_i = \sqrt{c_{A,b}}\,/\min_i x_i$, and using the optimality of $u$ for the $\ell_2$ norm and of $s^*$ for the $\ell_1$ norm we derive $c_{A,b} = \|u\|_2^2 \le \|s^*\|_2^2 \le \|s^*\|_1^2$.

Corollary 2.11. For any $x > 0$, $\|\nabla f(x)\|_\infty \le \max\{1,\; c_{A,b}/(\min_i x_i)^2\}$.

Proof. Combine Lemma 2.10 with Lemma 2.6.
Lemma 2.12. For any $x > 0$, $\lambda_{\max}(\nabla^2 f(x)) \le 2\, (\min_i x_i)^{-3}\, c_{A,b}$.

Proof. We can use the matrix identity
$\nabla^2 f(x) = 2\, \big(d(x)\, d(x)^\top\big) \circ T(x) = 2\, D(x)\, T(x)\, D(x),$
where $D(x) := \operatorname{diag}(d(x))$. Hence, by Theorem 2.8, the largest eigenvalue of $\nabla^2 f(x)$ satisfies
(10) $\lambda_{\max}(\nabla^2 f(x)) = 2\, \theta\, \lambda_{\max}(T(x))$
for some $\theta$ lying between the smallest and largest eigenvalues of $D(x)^2$. Since by Lemma 2.10
(11) $\lambda_{\max}(D(x)^2) = \|d(x)\|_\infty^2 \le c_{A,b}/(\min_i x_i)^2,$
combining (10) and (11) with Lemma 2.9 we get $\lambda_{\max}(\nabla^2 f(x)) \le 2\, (\min_i x_i)^{-3}\, c_{A,b}$.

Equivalence between basis pursuit and dissipation minimization
In this section we prove the equivalence between basis pursuit and dissipation minimization.
Theorem 3.1. The value of the optimization problem

minimize $\|s\|_1$ subject to $As = b$, $s \in \mathbb{R}^m$ (BP)

is equal to the value of the optimization problem

minimize $\frac{1}{2}\big(\langle 1, x\rangle + b^\top L^{-1}(x)\, b\big)$ subject to $x \in \mathbb{R}^m_{>0}$. (DM)

We call (DM) the dissipation minimization problem associated to $A$ and $b$. Note that the objective in (DM) is exactly $f_0(x)/2$; hence by (2) the minimum of (DM) equals the minimum of $f(x)/2$ over $\mathbb{R}^m_{\ge 0}$; the fact that this minimum is achieved is guaranteed by Corollary 2.5.

Definition 3.1. Let $x > 0$. The solution induced by $x$ is the vector $q(x) := XA^\top L^{-1}(x)\, b$.

The term "solution" is justified by the fact that $Aq(x) = AXA^\top L^{-1} b = LL^{-1} b = b$. Induced solutions have the following simple characterization.
Lemma 3.2. The solution induced by $x$, $q(x)$, equals the unique optimal solution to the quadratic optimization problem

minimize $s^\top X^{-1} s$ subject to $As = b$, $s \in \mathbb{R}^m$. (QP_x)

Proof. This lemma is a straightforward generalization of Thomson's principle [12, Chapter 9] from electrical network theory. We adapt an existing proof [13, Lemma 3] to the notation used in this paper. Since the objective function in (QP_x) is strictly convex, the problem has a unique optimal solution. Consider any solution $s$, and let $r = s - q(x)$. Then $Ar = b - b = 0$ and hence
$s^\top X^{-1} s = (q + r)^\top X^{-1} (q + r) = q^\top X^{-1} q + 2\, r^\top X^{-1} q + r^\top X^{-1} r \ge q^\top X^{-1} q,$
since $r^\top X^{-1} r \ge 0$ and $r^\top X^{-1} q = r^\top A^\top L^{-1} b = (Ar)^\top L^{-1} b = 0$. Therefore, the objective function value of any solution $s$ to (QP_x) is at least as large as the objective function value of the solution $q(x)$.
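This characterization can be checked numerically; the sketch below (our own, on random data) compares $q(x)$ against perturbations of it inside the null space of $A$, which parametrize all feasible points of (QP_x).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 6
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.uniform(0.5, 2.0, m)

L = (A * x) @ A.T                          # L(x) = A diag(x) A^T
q = x * (A.T @ np.linalg.solve(L, b))      # induced solution q(x) = X A^T L^{-1} b
assert np.allclose(A @ q, b)               # q(x) is feasible

# Any feasible s is q(x) plus a null-space vector r (A r = 0); by Thomson's
# principle the weighted energy s^T X^{-1} s can only increase.
P_null = np.eye(m) - np.linalg.pinv(A) @ A
energy = lambda s: s @ (s / x)
for _ in range(5):
    s = q + P_null @ rng.standard_normal(m)
    assert energy(s) >= energy(q) - 1e-9
```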
Corollary 3.3. The value of (QP_x) is, in fact, the Laplacian potential $b^\top L^{-1}(x)\, b$.

Proof. We already proved that the minimum of (QP_x) is $q(x)^\top X^{-1} q(x)$. Substituting the definition of $q(x)$,
$q(x)^\top X^{-1} q(x) = b^\top L^{-1} A X\, X^{-1}\, X A^\top L^{-1} b = b^\top L^{-1}\, L\, L^{-1} b = b^\top L^{-1}(x)\, b.$

Lemma 3.4. For any $x > 0$, $q(x) \in \mathbb{R}^m$ is such that $Aq(x) = b$ and $\|q(x)\|_1 \le f(x)/2$. Thus, the value of (BP) is at most that of (DM).
Proof. For any $x \in \mathbb{R}^m_{>0}$, consider its induced solution $q(x) = XA^\top L^{-1}(x)\, b$. We already observed that $q(x)$ is feasible for (BP). Moreover, we can bound
$\|q(x)\|_1 = \sum_{j=1}^m \sqrt{x_j}\, \frac{|q_j(x)|}{\sqrt{x_j}} \le \sqrt{\langle 1, x\rangle}\, \sqrt{q(x)^\top X^{-1} q(x)} = \sqrt{\langle 1, x\rangle\; b^\top L^{-1}(x)\, b} \le \frac{\langle 1, x\rangle + b^\top L^{-1}(x)\, b}{2} = \frac{f(x)}{2}$
(using Corollary 3.3 for the middle equality), where the first upper bound follows from the Cauchy-Schwarz inequality, and the second from the arithmetic mean-geometric mean inequality.
To prove the converse of Lemma 3.4, we develop an intermediate lemma that relates the value of an optimal solution $s^*$ of (BP) to the dissipation value of a vector $x$ such that $x = |s|$, with $s$ sufficiently close to $s^*$.

Lemma 3.5. Let $s \in \mathbb{R}^m$, $\epsilon \in (0,1)$ be such that $As = b$, $s_j \neq 0$ and $(1-\epsilon)\, |s^*_j| \le |s_j| \le |s^*_j| + \epsilon/m$ for some $s^*$ such that $As^* = b$ and each $j = 1, \ldots, m$. Then for $x = |s|$,
$\frac{f(x)}{2} \le \frac{1}{2}\left(1 + \frac{1}{1-\epsilon}\right) \|s^*\|_1 + \frac{\epsilon}{2}.$

Proof. On one hand, by the assumed upper bound $|s_j| \le |s^*_j| + \epsilon/m$, trivially
(13) $\langle 1, x\rangle = \|s\|_1 \le \|s^*\|_1 + \epsilon.$
On the other hand, consider the solution $q(x)$ induced by $x$ and recall that $q(x)$ is feasible for (BP), since $Aq(x) = b$, and optimal for (QP_x). By the assumed lower bound,
(14) $b^\top L^{-1}(x)\, b \le (s^*)^\top X^{-1} s^* \le \frac{\|s^*\|_1}{1-\epsilon},$
where the first upper bound follows from the fact that $s^*$ is a solution to (QP_x), and the second from the hypothesis. Combining (13) and (14), we get the claim.

Lemma 3.6. The value of (DM) is at most that of (BP).
Proof. Consider an optimal solution $s^* \in \mathbb{R}^m$ to (BP). Let $s' \in \mathbb{R}^m$ be a solution to $As = b$ such that $s'_j \neq 0$ for all $j = 1, \ldots, m$ (such an $s'$ exists by assumption (A.2)). For any $\delta \in (0, 1]$, the point $s_\delta = (1-\delta)\, s^* + \delta\, s'$ satisfies $As_\delta = b$. For any $\epsilon \in (0,1)$ we can ensure that the hypotheses of Lemma 3.5 are satisfied by $s = s_\delta$ by choosing a small enough $\delta > 0$ (avoiding the finitely many values of $\delta$ for which some coordinate of $s_\delta$ vanishes). For such a value of $\delta$, Lemma 3.5 yields
(15) $\frac{f(|s_\delta|)}{2} \le \frac{1}{2}\left(1 + \frac{1}{1-\epsilon}\right) \|s^*\|_1 + \frac{\epsilon}{2}.$
As $\epsilon$ can be chosen arbitrarily small, and the right-hand side of (15) approaches $\|s^*\|_1$ as $\epsilon \to 0$, we obtain the claim. This concludes the proof of Theorem 3.1.

Not only are the optimal values of (BP) and (DM) the same, but one can bound the suboptimality of any feasible point of (BP) in terms of the dissipation value of a corresponding vector. For $x > 0$, let $\rho(x) := \|d(x)\|_\infty$.

Theorem 3.7. Let $s \in \mathbb{R}^m$ be a feasible point of (BP) such that $s_j \neq 0$ for all $j = 1, \ldots, m$, and let $x = |s|$. Then
$\|s\|_1 - \frac{b^\top L^{-1}(x)\, b}{\rho(x)}$
is an upper bound on the suboptimality of $s$.
Proof. Consider the following linear formulation of (BP) (left) and its dual (right):

minimize $\langle 1, x\rangle$ s.t. $x - s \ge 0$, $x + s \ge 0$, $As = b$;    maximize $b^\top \nu$ s.t. $\lambda + \mu = 1$, $\lambda - \mu + A^\top \nu = 0$, $\lambda, \mu \ge 0$.

Given any solution $s$ to (BP) such that $x = |s| > 0$, let us take
$\nu = \frac{L^{-1}(x)\, b}{\rho(x)}, \qquad \lambda = \frac{1}{2}\left(1 - \frac{d(x)}{\rho(x)}\right), \qquad \mu = \frac{1}{2}\left(1 + \frac{d(x)}{\rho(x)}\right).$
Then $\|A^\top \nu\|_\infty \le 1$ by definition of $\rho(x)$; moreover, $\lambda + \mu = 1$, $\lambda - \mu + A^\top \nu = 0$, and $\lambda, \mu \ge 0$. Thus, $(x, s)$ is a primal feasible solution, $(\lambda, \mu, \nu)$ is a dual feasible solution, and by weak duality
$\|s^*\|_1 \ge b^\top \nu = \frac{b^\top L^{-1}(x)\, b}{\rho(x)}.$

We close this section by observing that a simpler proof of Theorem 3.1 can be obtained by the following quadratic variational formulation of the $\ell_1$-norm: for any $s \in \mathbb{R}^m$,
$\|s\|_1 = \inf_{x > 0}\, \frac{1}{2}\big(\langle 1, x\rangle + s^\top X^{-1} s\big),$
so that
$\inf_{As = b} \|s\|_1 = \inf_{x > 0}\, \frac{1}{2}\Big(\langle 1, x\rangle + \inf_{As = b} s^\top X^{-1} s\Big) = \inf_{x > 0}\, \frac{1}{2}\big(\langle 1, x\rangle + b^\top L^{-1}(x)\, b\big),$
where the last identity follows from Corollary 3.3. However, the full strength of Lemma 3.4 and Lemma 3.6 is crucial to be able to constructively transform feasible points for (DM) into feasible points for (BP) and vice versa.

Continuous dynamics for dissipation minimization
Theorem 3.1 readily suggests an approach to the solution of the basis pursuit problem. Namely, the solution of the non-smooth, equality-constrained formulation (BP) is reduced to the solution of the differentiable formulation (DM) on the positive orthant.
Mirror descent dynamics. To solve (DM), it is natural to adopt methods for differentiable constrained optimization that are designed for simple constraints. Consider first the following set of ordinary differential equations, aimed at solving $\inf\{f(x) \mid x > 0\}$:
(16) $\dot{x}_j = -x_j\, \frac{\partial f(x)}{\partial x_j}, \qquad j = 1, \ldots, m,$
with initial condition $x(0) = x_0$ for some $x_0 > 0$. When $f$ is the dissipation potential, by Lemma 2.6 this yields the explicit dynamics
(17) $\dot{x}_j = x_j\, \big(d_j(x)^2 - 1\big), \qquad j = 1, \ldots, m.$
The dynamical system (16) is a nonlinear Lotka-Volterra type system of differential equations, of a kind that is common in population dynamics [27]. It is also an example of a Hessian gradient flow [1]: it can be expressed in the form
(18) $\dot{x} = -H^{-1}(x)\, \nabla f(x),$
where $H(x) = \nabla^2 h(x)$ is the Hessian of a convex function $h$; namely, here $H(x) = X^{-1}$, and $h : \mathbb{R}^m_{>0} \to \mathbb{R}$ is the negative entropy function
(19) $h(x) = \sum_{j=1}^m \big(x_j \log x_j - x_j\big).$
System (18) can also be expressed as
(20) $\frac{d}{dt} \nabla h(x) = -\nabla f(x),$
which is known as the mirror descent dynamics or natural gradient flow [2,30]. The well-posedness of (18) has been considered, for example, in [1]. A dynamics formally similar to (17) is the Physarum dynamics [10,14,35,36], namely,
(21) $\dot{x}_j = |q_j(x)| - x_j, \qquad j = 1, \ldots, m.$
Differently from (17), the dynamics (21) is not a gradient flow; that is, there is no function $f$ that allows one to write the dynamics in the form (18) or (20) (with $h$ the negative entropy).
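A forward-Euler sketch of the explicit dynamics (17) (our own illustration; the step size and positivity floor are arbitrary choices) shows the dissipation value decreasing along the discretized flow:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 5
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

def d(x):   # voltage vector d(x) = A^T L(x)^{-1} b
    return A.T @ np.linalg.solve((A * x) @ A.T, b)

def f(x):   # dissipation potential f(x) = <1, x> + b^T L(x)^{-1} b
    return x.sum() + b @ np.linalg.solve((A * x) @ A.T, b)

x = np.ones(m)
f_start = f(x)
dt = 0.01                        # heuristic step size
for _ in range(2000):
    # Euler step on (17); the multiplicative form keeps x positive,
    # the floor only guards against numerical underflow.
    x = np.maximum(x + dt * x * (d(x) ** 2 - 1), 1e-9)
f_end = f(x)
assert f_end < f_start   # f decreases along the flow
```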
Convergence of the dynamics. The fact that the solution of the mirror descent dynamics (18) converges to a minimizer of $f$ with rate $1/t$ is a well-known result; see, for example, [1,38]. We include a streamlined proof for completeness.
Lemma 4.1. Along any solution $x(t)$ of (16), $\frac{d}{dt} f(x(t)) \le 0$.

Proof. We compute
$\frac{d}{dt} f(x(t)) = \langle \nabla f(x), \dot{x}\rangle = -\sum_{j=1}^m x_j \left(\frac{\partial f(x)}{\partial x_j}\right)^2 \le 0.$

A key role in the convergence of the mirror descent dynamics is played by the Bregman divergence of the function $h$,
$D_h(x, y) := h(x) - h(y) - \langle \nabla h(y),\, x - y\rangle.$
Convexity of $h$ implies the nonnegativity of $D_h(x, y)$. When $h$ is the negative entropy, $D_h$ is the relative entropy function (also known as Kullback-Leibler divergence), for which $D_h(x, y) = 0$ if and only if $x = y$.

Theorem 4.2 ([1,38]). Let $x^* \in \mathbb{R}^m_{\ge 0}$ be a minimizer of $f$. As $t \to \infty$, the values $f(x(t))$ with $x(t)$ given by (16) converge to $f(x^*)$. In particular,
$f(x(t)) - f(x^*) \le \frac{D_h(x^*, x(0))}{t}.$

Proof. In the following, to shorten notation we often write $x$ in place of $x(t)$. Since $(d/dt)\nabla h(x) + \nabla f(x) = 0$ by (18), for any $y$ we have $\langle (d/dt)\nabla h(x) + \nabla f(x),\, x - y\rangle = 0$. This is equivalent to
(22) $\left\langle \frac{d}{dt}\nabla h(x),\, x - y\right\rangle = -\langle \nabla f(x),\, x - y\rangle.$
On the other hand, since $(d/dt)\, h(x) = \langle \nabla h(x), \dot{x}\rangle$, a simple calculation shows
(23) $\frac{d}{dt} D_h(y, x) = -\left\langle \frac{d}{dt}\nabla h(x),\, y - x\right\rangle.$
Combining (22) and (23), and plugging in $y = x^*$,
(24) $\frac{d}{dt} D_h(x^*, x) = \langle \nabla f(x),\, x^* - x\rangle.$
The proof is concluded by a potential function argument [5,38]. Consider the function
$E(t) = t\, \big(f(x(t)) - f(x^*)\big) + D_h(x^*, x(t)).$
Its time derivative is, by (24),
$\dot{E}(t) = f(x) - f(x^*) + \langle \nabla f(x),\, x^* - x\rangle + t\, \frac{d}{dt} f(x(t)),$
where the last summand is nonpositive by Lemma 4.1 and the other terms equal, by definition, $-D_f(x^*, x) \le 0$. Hence, $E(t) \le E(0)$ for all $t \ge 0$, which is equivalent to the claim.

Algorithms for dissipation minimization
We now turn to the problem of designing IRLS-type algorithms for (DM) (and thus (BP)) with provably bounded iteration complexity. Two technical obstacles in the setup of a first-order method for formulation (DM) are: 1) that the positive orthant is not a closed set, and 2) that the gradients of $f$ may not be uniformly bounded on the positive orthant. There is a way to deal with both issues at once: instead of solving $\inf_{x>0} f(x)$, for an appropriately small $\delta > 0$ one can minimize $f$ over the shifted orthant
$\Omega_\delta := \{x \in \mathbb{R}^m : x \ge \delta 1\}.$
This is established by the next lemma.

Lemma 5.1. For any $\delta > 0$, $\inf_{x>0} f(x) \;\le\; \min_{x \in \Omega_\delta} f(x) \;\le\; \inf_{x>0} f(x) + \delta m.$
Proof. The first inequality is trivial. As for the second, recall that $f(x) = \langle 1, x\rangle + b^\top L^{-1}(x)\, b$ for any $x > 0$, and that in the latter sum, the second term is non-increasing in $x$ (by Lemma 2.6). Thus, for any $x > 0$, $f(x + \delta 1) \le f(x) + \delta m$. In other words, for any $x > 0$, there is $y \ge \delta 1$ (namely, $y = x + \delta 1$) such that $f(y) \le f(x) + \delta m$.
In the following, we let $\delta = \epsilon \sqrt{c_{A,b}}/(2m)$, where $\epsilon$ is the desired error factor and $c_{A,b}$ is as defined in Lemma 2.10; this, by Lemma 2.10 and Theorem 3.1, ensures that the additional error incurred by restricting solutions to $\Omega_\delta$ is at most $(\epsilon/2)\, \|s^*\|_1 = (\epsilon/4)\, f(x^*)$.

5.1. Primal gradient scheme. Guided by (20), we might consider its forward Euler discretization
(25) $\nabla h(x_{k+1}) = \nabla h(x_k) - \eta\, \nabla f(x_k),$
where $x_k \in \Omega_\delta$ denotes the $k$th iterate, and $\eta \in \mathbb{R}_{>0}$ an appropriate step size. Indeed, the update (25) falls within a well-studied methodology for first-order convex optimization [9,29].
We adapt this framework to the solution of (DM).
The primal gradient scheme is a first-order method for minimizing a differentiable convex function $f$ over a closed convex set $Q$. This scheme, which is defined with respect to a reference function $h$, proceeds as follows [6,29]:
(1) Initialize $x_0 \in Q$. Let $\beta > 0$ be a parameter.
(2) At iteration $k = 0, 1, \ldots$, compute $\nabla f(x_k)$ and set
(26) $x_{k+1} = \operatorname{argmin}_{x \in Q}\, \big\{\langle \nabla f(x_k),\, x - x_k\rangle + \beta\, D_h(x, x_k)\big\}.$
We apply the scheme with $h$ as defined in (19) and with $Q = \Omega_\delta$. Then, the minimization in (26) can be carried out analytically; it reduces to
(27) $x_{k+1,j} = \max\left\{\delta,\; x_{k,j}\, \exp\left(-\frac{1}{\beta}\, \frac{\partial f(x_k)}{\partial x_j}\right)\right\}, \qquad j = 1, \ldots, m.$
Update (27) is straightforward to implement as long as one can compute $\nabla f(x_k)$. This computation is discussed in Section 5.3.
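In code, the entropy-prox step (27) is a multiplicative weights update followed by clipping to $\Omega_\delta$. The sketch below is our own illustration on a random instance; the values of $\beta$ and $\delta$ are heuristic choices, not the theoretical ones.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2, 5
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

def grad_f(x):
    dd = A.T @ np.linalg.solve((A * x) @ A.T, b)   # d(x) = A^T L(x)^{-1} b
    return 1.0 - dd ** 2                            # gradient of f (Lemma 2.6)

def f(x):
    return x.sum() + b @ np.linalg.solve((A * x) @ A.T, b)

delta, beta = 1e-6, 100.0    # heuristic parameters for illustration
x = np.ones(m)
f_start = f(x)
for _ in range(2000):
    # update (27): multiplicative step on each coordinate, clipped at delta
    x = np.maximum(delta, x * np.exp(-grad_f(x) / beta))
f_final = f(x)
assert f_final < f_start     # the dissipation value has decreased
```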
Convergence of the primal gradient scheme. As shown in [29], the primal gradient scheme achieves an absolute error bounded by $O(\beta/k)$ after $k$ iterations provided that the function $f$ is $\beta$-smooth relative to $h$. In our case, where both $f$ and $h$ are twice-differentiable on $Q$, relative $\beta$-smoothness is defined as
(28) $\lambda_{\max}\big(\nabla^2 f(x) - \beta\, \nabla^2 h(x)\big) \le 0$ for all $x \in Q$.

Theorem 5.2 ([29]). If $f$ is $\beta$-smooth relative to $h$, then for all $k \ge 1$, the updates (26) satisfy
$f(x_k) - f(x^*) \le \frac{\beta\, D_h(x^*, x_0)}{k}.$

To apply Theorem 5.2 in our setting, we need to bound the smoothness parameter $\beta$. We do this by leveraging the bounds derived in Section 2.5.
Lemma 5.3. On $Q = \Omega_\delta$, the function $f$ is $\beta$-smooth relative to $h$ with $\beta = 2\, c_{A,b}/\delta^2 = 8m^2/\epsilon^2$.

Proof. Condition (28) is equivalent to the condition that the largest eigenvalue of the matrix $[\nabla^2 h(x)]^{-1}\, \nabla^2 f(x) = X\, \nabla^2 f(x)$ be at most $\beta$ (see [28, Theorem 7.7.3]). The matrix $X\, \nabla^2 f(x)$ is similar to $X^{1/2}\, \nabla^2 f(x)\, X^{1/2}$, hence it suffices to bound the eigenvalues of the latter. Since
$X^{1/2}\, \nabla^2 f(x)\, X^{1/2} = 2\, X^{1/2} D(x)\, T(x)\, D(x) X^{1/2} = 2\, D(x)\, \Pi(x)\, D(x),$
where we used the fact that $X$ and $D(x)$ are diagonal, hence commute. By the proof of Lemma 2.9, the eigenvalues of $\Pi(x)$ are all $0$ or $1$. Hence, using again the relation between the eigenvalues of congruent matrices (Theorem 2.8), we conclude that the largest eigenvalue of $X^{1/2}\, \nabla^2 f(x)\, X^{1/2}$ is bounded by that of $2D(x)^2$. Since $D(x) = \operatorname{diag}(d(x))$, the latter equals $2\, \|d(x)\|_\infty^2$, which is at most $2\, c_{A,b}/\delta^2 = 8m^2/\epsilon^2$ by Lemma 2.10 and the definitions of $\Omega_\delta$ and $\delta$.
Theorem 5.4. The primal gradient scheme applied to (DM) with $Q = \Omega_\delta$ and $\beta = 8m^2/\epsilon^2$ produces a point $x_k$ with $f(x_k) \le (1+\epsilon)\, f(x^*)$ within $k = O(m^2 \log(m/\epsilon)/\epsilon^3)$ iterations.

Proof. By Theorem 5.2 and Lemma 5.3, after $k$ iterations it holds that
(29) $f(x_k) - f(x^*) \le \frac{\beta R}{k}$, where $R = D_h(x^*, x_0)$.
We complete the proof by bounding $R/f(x^*)$ in terms of $\log(m/\epsilon)$, using (6). Hence, $k = 96\, m^2 \log(m/\epsilon)/\epsilon^3$ iterations suffice to achieve relative error $\epsilon$.
5.2. Accelerated gradient scheme. The second optimization scheme that we consider is the accelerated gradient method of Nesterov [31]. This can be summarized as follows:
(1) Initialize $x_0 \in Q$. Let $\beta > 0$ be a parameter.
(2) At iteration $k = 0, 1, \ldots$, compute $\nabla f(x_k)$ and set $\alpha_k = (k+1)/2$, $\tau_k = 2/(k+3)$ and
(31) $y_k = \operatorname{argmin}_{x \in Q}\, \big\{\langle \nabla f(x_k),\, x\rangle + \tfrac{\beta}{2}\, \|x - x_k\|_2^2\big\},$
(32) $z_k = \operatorname{argmin}_{x \in Q}\, \big\{\textstyle\sum_{i=0}^{k} \alpha_i\, \langle \nabla f(x_i),\, x\rangle + \tfrac{\beta}{2}\, \|x - x_0\|_2^2\big\},$
(33) $x_{k+1} = \tau_k\, z_k + (1 - \tau_k)\, y_k.$
In our application of the scheme, $Q = \Omega_\delta$ and the minimization in (31) and (32) can be carried out analytically; explicitly, they become
(34) $y_{k,j} = \max\left\{\delta,\; x_{k,j} - \tfrac{1}{\beta}\, \tfrac{\partial f(x_k)}{\partial x_j}\right\},$
(35) $z_{k,j} = \max\left\{\delta,\; x_{0,j} - \tfrac{1}{\beta}\, \textstyle\sum_{i=0}^{k} \alpha_i\, \tfrac{\partial f(x_i)}{\partial x_j}\right\}, \qquad j = 1, \ldots, m.$
To implement (34)-(35), it is enough to be able to access the gradient $\nabla f(x_k)$ and the cumulative gradient $\sum_i \alpha_i \nabla f(x_i)$; the latter can be maintained with one additional update at each iteration.
Convergence of the accelerated gradient scheme. The well-known result by Nesterov [31] shows that the accelerated gradient scheme achieves an absolute error bounded by $O(\beta/k^2)$ after $k$ iterations provided that the gradient of the function $f$ is $\beta$-Lipschitz-continuous over $Q$. In our case, where $f$ is twice-differentiable on $Q$, this means
(36) $\lambda_{\max}(\nabla^2 f(x)) \le \beta$ for all $x \in Q$.
Theorem 5.5 ([31]). If $\nabla f$ is $\beta$-Lipschitz-continuous over $Q$, then for all $k \ge 1$, the updates (31)-(33) satisfy
$f(x_k) - f(x^*) \le \frac{4\beta R}{(k+1)^2}$, where $R = \|x_0 - x^*\|_2^2$.

Again, to apply Theorem 5.5 in our setting, we need to bound the smoothness parameter $\beta$. We do this by exploiting Lemma 2.12.

Lemma 5.6. On $Q = \Omega_\delta$, the gradient $\nabla f$ is $\beta$-Lipschitz-continuous with $\beta = 2\, c_{A,b}/\delta^3$.

Proof. Immediate from Lemma 2.12, the fact that $Q = \Omega_\delta$ and the definition of $\Omega_\delta$. Recall that $\delta = \epsilon \sqrt{c_{A,b}}/(2m)$.

Theorem 5.7. Assume $x_0 = |u|$, where $u$ is the least-squares solution to $As = b$. Then the accelerated gradient scheme applied to (DM) with $Q = \Omega_\delta$ produces a point $x_k$ with $f(x_k) \le (1+\epsilon)\, f(x^*)$ within $k = O(m^2/\epsilon^2)$ iterations.

Proof. By Theorem 5.5 and Lemma 5.6, after $k$ iterations it holds that
(37) $f(x_k) - f(x^*) \le \frac{8\, c_{A,b}\, R}{\delta^3\, (k+1)^2}.$
We complete the proof by bounding $R$. By the assumption that $x_0 = |u|$, where $u$ is the least-squares solution to $As = b$, $\|x_0\|_2 = \|u\|_2 = \sqrt{c_{A,b}}$ (recall the definition of $c_{A,b}$ in Lemma 2.10). Moreover, $\|x^*\|_2 \le \|x^*\|_1 = \|s^*\|_1 \le \|u\|_1 \le \sqrt{m}\, \|u\|_2$. Hence $(R/c_{A,b})^{1/2} \le (\|x_0\|_2 + \|x^*\|_2)/\sqrt{c_{A,b}} \le 1 + \sqrt{m} \le 3\, m^{1/2}$, and substitution in (37) yields the theorem.
5.3. Implementing the iterations. We conclude this section by commenting on a few implementation details, and in particular on how each iteration of (27) and (31)-(33) could be implemented. A notable point is that each iteration can be reduced to a series of operations that access the matrix $A$ only through the solution of a system of the form $AWA^\top p = b$, for some diagonal matrix $W$, or through matrix-vector multiplications of the form $Ax$ or $A^\top x$.
Computation of the gradient. By Lemma 2.6, computing the vector $d(x) = A^\top L^{-1}(x)\, b$ is enough to compute the gradient at $x$, since $\nabla f(x) = 1 - d^2(x)$. To compute $d(x)$, it is enough to solve the linear system $L(x)\, p = b$ for $p$, then premultiply the solution with $A^\top$. Note that since $L(x) = AXA^\top$, the system $L(x)\, p = b$ is a symmetric linear system with a positive definite coefficient matrix.
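A sketch of this computation (our own; the Cholesky-based solve is one reasonable choice for the symmetric positive definite system $L(x)p = b$), validated against central finite differences:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def grad_dissipation(A, b, x):
    """Return (grad f(x), d(x)) for f(x) = <1, x> + b^T L(x)^{-1} b."""
    L = (A * x) @ A.T                 # L(x) = A diag(x) A^T, SPD for x > 0
    p = cho_solve(cho_factor(L), b)   # solve L(x) p = b via Cholesky
    d = A.T @ p                       # d(x) = A^T L(x)^{-1} b
    return 1.0 - d ** 2, d            # grad f(x) = 1 - d^2(x)

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
x = rng.uniform(0.5, 2.0, 6)

g, d = grad_dissipation(A, b, x)
f = lambda x: x.sum() + b @ np.linalg.solve((A * x) @ A.T, b)
eps = 1e-6
for j in range(6):                    # central finite-difference check
    e = np.zeros(6); e[j] = eps
    assert abs((f(x + e) - f(x - e)) / (2 * eps) - g[j]) < 1e-4
```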
Warm start. Heuristically, the solution of the system $L(x_{k+1}) p = b$, which is required to compute the gradient at iteration $k+1$, can be expected to be close to that of the system $L(x_k) p = b$ when $x_{k+1}$ is close to $x_k$. Hence, one possibility in practice is to use the solution obtained at step $k$ to warm-start the linear equation solver at step $k+1$, with a potentially substantial reduction in the computational cost of each iteration.
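The warm-start heuristic can be demonstrated with a minimal conjugate-gradient solver that accepts an initial guess (this generic CG routine is ours, not part of the paper):

```python
import numpy as np

def cg(L, b, x0=None, tol=1e-10, maxiter=1000):
    """Conjugate gradients for an SPD matrix L; x0 enables warm starting.

    Returns the solution and the number of iterations performed.
    """
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - L @ x              # initial residual; small when x0 is a good guess
    p = r.copy()
    rs = r @ r
    for k in range(maxiter):
        if np.sqrt(rs) < tol:
            return x, k
        Lp = L @ p
        alpha = rs / (p @ Lp)
        x += alpha * p
        r -= alpha * Lp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, maxiter
```

When $x_{k+1}$ is a small perturbation of $x_k$, starting from the previous solution leaves only a tiny initial residual to reduce, so the solver needs far fewer iterations than a cold start.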
Initial point and exit criterion. We assumed in Theorem 5.7 that the starting point is the least-squares solution, but this was only to optimize the worst-case iteration bound. In fact, Theorem 5.4 and Eq. (39) always apply, and the schemes we discussed require no special initialization apart from membership in $\Omega_\delta$; hence, any point that is not too close to the boundary of the positive orthant is a suitable starting point. We can stop the schemes once the number of iterations $k$ is large enough to ensure the error guarantees of Theorems 5.4 and 5.7 (or Eq. (39)). Alternatively, a natural exit criterion in practice can be based on the duality gap provided by Theorem 3.7.
Obtaining feasible iterates for (BP). The algorithms described above produce iterates in the positive orthant, that is, iterates that are feasible for (DM); our goal, however, was to obtain feasible iterates for (BP). Using the ideas of Lemma 3.4, we can easily associate with any iterate $x_k \in \mathbb{R}^m_{>0}$ an iterate $s_k$ that is feasible for (BP) and whose cost is not larger than the dissipation cost of $x_k$: namely, take $s_k = X_k d(x_k)$. By the proof of Lemma 3.4, we know that $\|s_k\|_1 \le f(x_k)/2$. Thus, the error bounds for $f(x_k)$ translate directly into error bounds for $\|s_k\|_1$. Note that $s_k$ can be computed essentially for free, since $d(x_k)$ is a byproduct of the gradient computation at iteration $k$.
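The rounding step $s_k = X_k d(x_k)$ can be sketched as follows. Here we take $f(x) = 1^\top x + b^\top L(x)^{-1} b$, our reading of the dissipation objective, which is consistent with $\nabla f(x) = 1 - d^2(x)$ stated above; the function name is ours:

```python
import numpy as np

def feasible_iterate(A, b, x):
    """Map x > 0 (feasible for (DM)) to s = X d(x), feasible for (BP).

    Also returns f(x) = 1^T x + b^T L(x)^{-1} b (an assumption on the
    objective's closed form, consistent with grad f(x) = 1 - d(x)^2).
    """
    L = (A * x) @ A.T
    p = np.linalg.solve(L, b)
    d = A.T @ p                  # byproduct of the gradient computation
    return x * d, x.sum() + b @ p
```

One can check directly that $As_k = b$ and $\|s_k\|_1 \le f(x_k)/2$, as guaranteed by Lemma 3.4.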

Numerical comparison with other algorithms for $\ell_1$-minimization
We include in this section a numerical comparison of our schemes with other well-known algorithms for $\ell_1$-minimization. The results suggest that both the primal scheme and a slightly revised accelerated scheme converge at a geometric rate, that is, much faster than what our theoretical analysis guarantees. This suggests the open problem of improving the quality of our error bounds.
To compare our approaches with other algorithms for $\ell_1$-minimization, we implemented them in MATLAB and ran the l1benchmark suite by Yang et al. [39], which includes implementations of many other $\ell_1$-minimization solvers. A representative comparison is shown in Figure 1. The figure plots the relative error of the algorithms as a function of computation time, averaged over 20 randomly generated instances (with m = 1000, n = 800, and 20% or 30% nonzeros in the ground-truth solution). The implementations based on our approaches are:
• the Primal Gradient Scheme of Section 5.1 (PGS, with β = 3.5, δ = 10⁻¹⁵), and
• the Accelerated Gradient Scheme of Section 5.2 (AGS, with β = 3.5, δ = 10⁻¹⁵, τ_k = 2/(k + 3)).
Other algorithms measured in the experiment are the Homotopy method, the primal and dual augmented Lagrangian methods (PALM, DALM), the primal-dual interior-point method (PDIPA), the truncated Newton interior-point method (L1LS), the fast iterative soft-thresholding method (FISTA), and the approximate message passing method (AMP). We refer the reader to Yang et al. [39] for references and discussion of these other methods. We observe, incidentally, that many of these methods construct points that are only approximately feasible for (BP), since they relax the constraint As = b into the objective function, in one form or another.
In the experiments, PGS clearly exhibits a geometric convergence rate, much better than what Theorem 5.4 guarantees, strongly suggesting that an improved theoretical analysis may be possible. Over time, PGS essentially reaches the machine-precision barrier (≈ 10⁻¹⁵), which is not true of other methods in the benchmark, such as FISTA, L1LS, or the interior-point method (PDIPA).
AGS, on the other hand, appears to be rather inaccurate in practice and does not exhibit substantially better behavior than what Theorem 5.7 guarantees. This suggests that the entropic form of the updates (used in PGS but not in AGS) might have a high impact in practice. Therefore, we also benchmark a revised algorithm (AGS2), obtained by adopting an entropic (multiplicative) form of the AGS updates (34)–(35). The resulting scheme AGS2 is seen in Figure 1 to exhibit a geometric convergence rate and to be competitive with some of the best results in the benchmark, such as those of the primal augmented Lagrangian method (PALM).
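To illustrate what an entropic (multiplicative) update looks like, here is a generic mirror-descent-style sketch, not the exact AGS2 scheme: the step size `eta`, the initialization floor, and the closed form $f(x) = 1^\top x + b^\top L(x)^{-1} b$ are all our assumptions for the sake of a runnable example:

```python
import numpy as np

def f_val(A, b, x):
    # Assumed closed form of the dissipation objective (see Section 5.3)
    p = np.linalg.solve((A * x) @ A.T, b)
    return x.sum() + b @ p

def entropic_step(A, b, x, eta):
    # Multiplicative ("entropic") update: x <- x * exp(-eta * grad f(x)),
    # with grad f(x) = 1 - d(x)^2 as in the gradient computation above.
    d = A.T @ np.linalg.solve((A * x) @ A.T, b)
    return x * np.exp(-eta * (1.0 - d ** 2))

# Tiny driver on a random instance; eta = 0.05 is illustrative only.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 25))
b = rng.standard_normal(10)
x = np.maximum(np.abs(A.T @ np.linalg.solve(A @ A.T, b)), 1e-6)
f_hist = [f_val(A, b, x)]
for _ in range(15):
    x = entropic_step(A, b, x, eta=0.05)
    f_hist.append(f_val(A, b, x))
```

The multiplicative form keeps the iterates in the positive orthant automatically, which is one attraction of the entropic updates over additive ones.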

Conclusions
We proposed a novel exact reformulation of the basis pursuit problem, which leads to a new family of gradient-based, IRLS-type methods for its solution. We then analyzed the iteration complexity of a natural optimization approach to the reformulation, based on the mirror descent scheme, as well as the iteration complexity of an accelerated gradient method. The first scheme can be seen as the discretization of a Hessian gradient flow, and also as a variant of the Physarum dynamics, derived purely from optimization principles. The accelerated method, on the other hand, improves the error dependency of IRLS-type methods for basis pursuit from $\epsilon^{-8/3}$ to $\epsilon^{-2}$. The experimental convergence rate of the first scheme, as well as that of a simple variant of the second scheme, appears to be geometric. We interpret this as evidence that the dissipation minimization perspective may stimulate even more approaches to the design and analysis of efficient and practical IRLS-type methods.

Proposition 2.1.
Assumption (A.2) is without loss of generality, given (A.1).
Proof. If the basis pursuit instance (A, b) satisfies (A.1) but not (A.2), form a new instance (A′, b), where A′ is obtained from A by duplicating every column. Observe the following about the two instances:
• A′ has full rank, and n′ = n ≤ m ≤ 2m = m′.
• For any solution to (A, b), there is a solution to (A′, b) with the same cost; conversely, summing the entries over each pair of duplicated columns maps any solution to (A′, b) to a solution to (A, b) of no larger cost.
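Both directions of this reduction are easy to check numerically; the following sketch (our own, with an arbitrary seeded instance) constructs the lift and the fold explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 7
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
A2 = np.repeat(A, 2, axis=1)             # duplicate every column of A

# Lift: a solution of As = b maps to a solution of A2 s2 = b of equal cost.
s = A.T @ np.linalg.solve(A @ A.T, b)    # some solution of As = b
s2 = np.zeros(2 * m)
s2[::2] = s                              # put all mass on the first copy

# Fold: any t2 with A2 t2 = b maps to t_fold with A t_fold = b
# and l1-cost no larger, by the triangle inequality on each pair.
t2 = rng.standard_normal(2 * m)
t2 += np.linalg.lstsq(A2, b - A2 @ t2, rcond=None)[0]   # make t2 feasible
t_fold = t2[::2] + t2[1::2]
```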

Corollary 2.5.
The function $f$ attains a minimum on $\mathbb{R}^m_{\ge 0}$.

Lemma 2.10.
Let $x \in \mathbb{R}^m_{>0}$. Then $\|d(x)\|_\infty \le (\min_{i=1,\ldots,m} x_i)^{-1} \cdot \|s\|_2$, where $s$ is any solution to $As = b$. In particular, for $c_{A,b} \stackrel{\text{def}}{=} b^\top (AA^\top)^{-1} b$, we have $\|d(x)\|_\infty \le (\min_i x_i)^{-1} c_{A,b}^{1/2}$.
Proof. [The argument for the first bound is as above,] proving the first part of the claim. For the second part, consider the least-squares solution $u \stackrel{\text{def}}{=} A^\top (AA^\top)^{-1} b$. Then $\|u\|_2 = c_{A,b}^{1/2}$, and applying the first part with $s = u$ yields the claim.
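The two quantities in the lemma, and the bound relating them, can be checked numerically (a sketch of ours on a seeded random instance):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 12
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.uniform(0.5, 2.0, size=m)

c_Ab = b @ np.linalg.solve(A @ A.T, b)        # c_{A,b} = b^T (A A^T)^{-1} b
u = A.T @ np.linalg.solve(A @ A.T, b)         # least-squares solution of As = b
d = A.T @ np.linalg.solve((A * x) @ A.T, b)   # d(x) = A^T L(x)^{-1} b
```

Here $\|u\|_2 = c_{A,b}^{1/2}$ holds exactly, and $\|d(x)\|_\infty$ respects the bound $(\min_i x_i)^{-1} c_{A,b}^{1/2}$.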

Theorem 5.7.
If $x_0 = |u|$, where $u \stackrel{\text{def}}{=} A^\top (AA^\top)^{-1} b$ is the least-squares solution to $As = b$, then the accelerated gradient scheme (31)–(33) applied to the dissipation minimization problem (DM) achieves relative error at most $\epsilon$ after $24 m^2/\epsilon^2$ iterations.