Performance of first- and second-order methods for \(\ell _1\)-regularized least squares problems
Abstract
We study the performance of first- and second-order optimization methods for \(\ell _1\)-regularized sparse least-squares problems as the conditioning of the problem changes and the dimensions of the problem increase up to one trillion. A rigorously defined generator is presented which allows control of the dimensions, the conditioning and the sparsity of the problem. The generator has very low memory requirements and scales well with the dimensions of the problem.
Keywords
\(\ell _1\)-Regularised least-squares · First-order methods · Second-order methods · Sparse least squares instance generator · Ill-conditioned problems
1 Introduction

Magnetic resonance imaging (MRI): A medical imaging tool used to scan the anatomy and the physiology of a body [27].

Image inpainting: A technique for reconstructing degraded parts of an image [7].

Image deblurring: Image processing tool for removing the blurriness of a photo caused by natural phenomena, such as motion [21].

Genome-wide association study (GWA): DNA comparison between two groups of people (with/without a disease) in order to investigate factors that a disease depends on [41].

Estimation of global temperature based on historic data [22].
First-order methods have been very successful in various scientific fields, such as support vector machines [45], compressed sensing [14], image processing [12] and data fitting [22]. Several new first-order type approaches have recently been proposed for various imaging problems in the special issue edited by Bertero et al. [8]. However, even for the simple unconstrained problems that arise in these fields there exist more challenging instances. Since first-order methods do not capture sufficient second-order information, their performance might degrade unless the problems are well conditioned [16]. On the other hand, second-order methods capture the curvature of the objective function sufficiently well, but by consensus they are usually applied only to medium-scale problems or when high accuracy is required. In particular, it is frequently claimed [2, 5, 6, 20, 37] that second-order methods do not scale favourably as the dimensions of the problem increase because of their high computational complexity per iteration. Such claims are based on the assumption that full second-order information has to be used. However, there is evidence [16, 19] that for nontrivial problems, inexact second-order methods can be very efficient.
In this paper we will exhaustively study the performance of first- and second-order methods. We will perform numerical experiments for large-scale problems with sizes up to one trillion variables. We will examine conditions under which certain methods are favoured or not. We hope that by the end of this paper the reader will have a clear view about the performance of first- and second-order methods.
Another contribution of the paper is the development of a rigorously defined instance generator for problems of the form (1). The most important feature of the generator is that it scales well with the size of the problem and can inexpensively create instances where the user controls the sparsity and the conditioning of the problem. For example, see Sect. 8.9, where an instance of one trillion variables is created using the proposed generator. We believe that the flexibility of the proposed generator will cover the need for generation of various good test problems.
This paper is organised as follows. In Sect. 2 we briefly discuss the structure of first- and second-order methods. In Sect. 3 we give the details of the instance generator. In Sect. 4 we provide examples for constructing matrix A. In Sect. 5 we present some measures of the conditioning of problem (1). These measures will be used to examine the performance of the methods in the numerical experiments. In Sect. 6 we discuss how the optimal solution of the problem is selected. In Sect. 7 we briefly describe known problem generators and explain how our propositions add value to the existing approaches. In Sect. 8 we present the practical performance of first- and second-order methods as the conditioning and the size of the problems vary. Finally, in Sect. 9 we give our conclusions.
2 Brief discussion on first- and second-order methods
Loosely speaking, close to the optimal solution of problem (1), the better the approximation \(Q_\tau \) of \(f_\tau \) at any point x, the fewer iterations are required to solve (1). On the other hand, the practical performance of such methods is a trade-off between careful incorporation of the curvature of \(f_\tau \), i.e. second-order derivative information, in \(Q_\tau \) and the cost of solving subproblem (3) in GFrame.
3 Instance generator
In this section we discuss an instance generator for (1) for the cases \(m\ge n\) and \(m<n\). The generator is inspired by the one presented in Section 6 of [31]. The advantage of our modified version is that it allows control of the properties of matrix A and the optimal solution \(x^*\) of (1), for example the sparsity of matrix A, its spectral decomposition, and the sparsity and the norm of \(x^*\), since A and \(x^*\) are defined by the user.
Throughout the paper we denote the \(i\mathrm{{th}}\) component of a vector by the name of the vector with subscript i. Similarly, the \(i\mathrm{{th}}\) column of a matrix is denoted by the name of the matrix with subscript i.
3.1 Instance generator for \(m\ge n\)
3.2 Instance generator for \(m < n\)
In this subsection we extend the instance generator that was proposed in Sect. 3.1 to the case of matrix \(A\in {\mathbb {R}}^{m\times n}\) with more columns than rows, i.e. \(m<n\). Given \(\tau >0\), \(B\in {\mathbb {R}}^{m\times m}\), \(N\in {\mathbb {R}}^{m\times (n-m)}\) and \(x^*\in {\mathbb {R}}^n\) the generator returns a vector \(b\in {\mathbb {R}}^m\) and a matrix \(A\in {\mathbb {R}}^{m\times n}\) such that \(x^*:= {{\mathrm{arg\,min}}}_x f_\tau (x)\).
Similarly to IGen in Sect. 3.1, for Step 3 in IGen2 we have to perform a matrix inversion, which generally can be an expensive operation. However, in the next section we discuss how this matrix inversion can be carried out using a sequence of elementary orthogonal transformations.
4 Construction of matrix A
In this section we provide a paradigm for how matrix A can be inexpensively constructed such that its singular value decomposition is known and its sparsity is controlled. We examine the case of instance generator IGen where \(m\ge n\). The paradigm can be easily extended to the case of IGen2, where \(m<n\).
It is important to mention that other settings of matrix A in (16) could be used, for example different combinations of permutation matrices and Givens rotations. The setting chosen in (16) is flexible: it allows for an inexpensive construction of matrix A and makes the control of the singular value decomposition and the sparsity of matrices A and \(A^\intercal A\) easy.
4.1 An example using Givens rotation
4.2 Control of sparsity of matrix A and \(A^\intercal A\)
We now present examples in which we demonstrate how sparsity of matrix A can be controlled through Givens rotations.
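As a rough illustration of this idea, the following Python sketch builds a small square matrix \(A=\varSigma G^\intercal \) and counts the nonzeros of A and \(A^\intercal A\). It assumes, for simplicity, that G is a block-diagonal composition of \(2\times 2\) Givens rotations acting on disjoint coordinate pairs, which may differ from the exact composition used in (16); the helper names `givens_block_matrix`, `build_A`, `gram` and `nnz` are hypothetical.

```python
import math

def givens_block_matrix(n, theta):
    """Dense n x n matrix G = diag(R, ..., R), with R a 2x2 rotation by theta.

    Assumption for illustration: rotations act on the disjoint pairs
    (0, 1), (2, 3), ... and n is even.
    """
    c, s = math.cos(theta), math.sin(theta)
    G = [[0.0] * n for _ in range(n)]
    for k in range(0, n, 2):
        G[k][k], G[k][k + 1] = c, -s
        G[k + 1][k], G[k + 1][k + 1] = s, c
    return G

def build_A(sigmas, theta):
    """Return A = Sigma * G^T for the square matrix Sigma = diag(sigmas)."""
    n = len(sigmas)
    G = givens_block_matrix(n, theta)
    # (Sigma G^T)[i][j] = sigma_i * G[j][i]
    return [[sigmas[i] * G[j][i] for j in range(n)] for i in range(n)]

def gram(A):
    """Return A^T A as a dense list of lists."""
    n = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]

def nnz(M, tol=1e-12):
    """Number of entries of M whose magnitude exceeds tol."""
    return sum(1 for row in M for v in row if abs(v) > tol)

sigmas = [4.0, 3.0, 2.0, 1.0]
A = build_A(sigmas, 2 * math.pi / 10)
AtA = gram(A)
print("nonzeros in A:", nnz(A), "nonzeros in A^T A:", nnz(AtA))
print("trace of A^T A:", sum(AtA[i][i] for i in range(len(AtA))))
```

Under this assumption every row of A has two nonzeros regardless of the problem size, and the eigenvalues of \(A^\intercal A\) are the squared singular values, so the trace printed above equals their sum. Composing further rotations on overlapping pairs would increase the fill-in of both matrices.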
5 Conditioning of the problem
Let us now precisely define how we measure the conditioning of problem (1). For simplicity, throughout this section we assume that matrix A has more rows than columns, \(m\ge n\), and that it is full-rank. Extension to the case of matrix A with more columns than rows is easy and we briefly discuss this at the end of this section.
We denote with \(\hbox {span}(\cdot )\) the span of the columns of the input matrix. Moreover, S is defined in (12) and \(S^c\) is its complement.
Two factors are considered that affect the conditioning of the problem. First, the usual condition number of the second-order derivative of \(1/2\Vert Ax-b\Vert _2^2\) in (1), which is simply \(\kappa (A^\intercal A) = \lambda _{1}(A^\intercal A)/\lambda _{n}(A^\intercal A)\), where \(0<\lambda _n\le \lambda _{n-1} \le \cdots \le \lambda _1\) are the eigenvalues of matrix \(A^\intercal A\). It is well known that the larger \(\kappa (A^\intercal A)\) is, the more difficult problem (1) becomes.
Let us assume that there exists some \(\rho \) which satisfies \(\lambda _{n}(A^\intercal A) \le \rho \ll \lambda _{1}(A^\intercal A)\). If \(\kappa _\rho (x^*) \) is large, i.e., \(\Vert P_\rho x^*\Vert _2\) is close to zero, then the majority of the mass of \(x^*\) is “hidden” in the space spanned by eigenvectors which correspond to eigenvalues that are smaller than \(\rho \), i.e., the orthogonal complement of \(\hbox {span}(G_\rho )\). In Sect. 1 we referred to methods that do not incorporate information which corresponds to small eigenvalues of \(A^\intercal A\). Therefore, if the previous scenario holds, then we expect the performance of such methods to degrade. In Sect. 8 we empirically verify the previous arguments.
If matrix A has more columns than rows then the previous definitions of conditioning of problem (1) are incorrect and need to be adjusted. Indeed, if \(m<n\) and \(\hbox {rank}(A)=\hbox {min}(m,n)=m\), then \(A^\intercal A\) is a rank deficient matrix which has m nonzero eigenvalues and \(n-m\) zero eigenvalues. However, we can restrict the conditioning of the problem to a neighbourhood of the optimal solution \(x^*\). In particular, let us define a neighbourhood of \(x^*\) so that all points in this neighbourhood have nonzeros at the same indices as \(x^*\) and zeros elsewhere, i.e. \({\mathcal {N}}:=\{x\in {\mathbb {R}}^n \ | \ x_i\ne 0 \ \forall i\in S, \ x_i=0 \ \forall i \in S^c \}\). In this case an important feature to determine the conditioning of the problem is the ratio of the largest and the smallest nonzero eigenvalues of \(A_S^\intercal A_S\), where \(A_S\) is a submatrix of A built of columns of A which belong to set S.
6 Construction of the optimal solution
The aim of Procedure OsGen2 is to find a sparse \(x^*\) with \(\kappa _\rho (x^*)\) arbitrarily large for some \(\rho \) in the interval \(\lambda _{n}(A^\intercal A) \le \rho \ll \lambda _{1}(A^\intercal A)\). In particular, OsGen2 will return a sparse \(x^*\) which can be expressed as \(x^*=Gv\). The coefficients v are close to the inverse of the eigenvalues of matrix \(A^\intercal A\). Intuitively, this technique will create an \(x^*\) which has strong dependence on subspaces which correspond to small eigenvalues of \(A^\intercal A\). The constant \(\gamma \) is used in order to control the norm of \(x^*\).
7 Existing problem generators
So far in Sect. 3.1 we have described in detail our proposed problem generator. Moreover, in Sect. 4 we have described how to construct matrices A such that the proposed generator is scalable with respect to the number of unknown variables. We now briefly describe existing problem generators and explain how our propositions add value to the existing approaches.
Another representative example is proposed in [26]. This generator, which we discovered during the revision of our paper, proposes the same setting as in our paper. In particular, given A, \(x^*\) and \(\tau \) one can construct a vector b (or a noise vector e) such that (20) is satisfied. However, in [26] the author suggests that b can be found using a simple iterative procedure. Depending on matrix A and how ill-conditioned it is, this procedure might be slow. In this paper, we suggest that one can rely on numerical linear algebra tools, such as Givens rotations, in order to inexpensively construct b (or a noise vector e) using straightforwardly scalable operations. Additionally, we show in Sect. 8 that a simple construction of matrix A is sufficient to extensively test the performance of methods.
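To make the construction of b from A, \(x^*\) and \(\tau \) more concrete, here is a minimal Python sketch based only on the first-order optimality condition of (1), \(0\in A^\intercal (Ax^*-b)+\tau \partial \Vert x^*\Vert _1\). It is not a transcription of the IGen procedure. It assumes a square \(A=\varSigma G^\intercal \) with G a block-diagonal composition of \(2\times 2\) Givens rotations, so that solving \(A^\intercal e=\tau g\) reduces to one rotation sweep and a diagonal scaling. The names `apply_G`, `generate_b` and `optimality_residual` are hypothetical.

```python
import math

def apply_G(v, theta, transpose=False):
    """Multiply v by G (or G^T), where G = diag(R, ..., R) and R rotates by theta.

    Assumption: rotations act on disjoint pairs (0, 1), (2, 3), ...; len(v) is even.
    """
    c, s = math.cos(theta), math.sin(theta)
    if transpose:
        s = -s  # R^T is the rotation by -theta
    out = list(v)
    for k in range(0, len(v), 2):
        out[k] = c * v[k] - s * v[k + 1]
        out[k + 1] = s * v[k] + c * v[k + 1]
    return out

def generate_b(sigmas, theta, x_star, tau):
    """Return (b, g) such that x_star satisfies the optimality condition of (1)."""
    # Subgradient of the l1-norm at x*: the sign on the support, and 0
    # (one admissible value in [-1, 1]) outside the support.
    g = [math.copysign(1.0, v) if v != 0 else 0.0 for v in x_star]
    # Solve A^T e = tau * g, i.e. G Sigma e = tau * g, so e = tau * Sigma^{-1} G^T g.
    Gt_g = apply_G(g, theta, transpose=True)
    e = [tau * w / sig for w, sig in zip(Gt_g, sigmas)]
    # b = A x* + e, where A x* = Sigma (G^T x*).
    Gt_x = apply_G(x_star, theta, transpose=True)
    b = [sig * w + ei for sig, w, ei in zip(sigmas, Gt_x, e)]
    return b, g

def optimality_residual(sigmas, theta, x, b, g, tau):
    """Return A^T (A x - b) + tau * g, which vanishes at a minimizer."""
    Gt_x = apply_G(x, theta, transpose=True)
    r = [sig * w - bi for sig, w, bi in zip(sigmas, Gt_x, b)]        # A x - b
    At_r = apply_G([sig * ri for sig, ri in zip(sigmas, r)], theta)  # G Sigma r
    return [a + tau * gi for a, gi in zip(At_r, g)]

sigmas = [4.0, 3.0, 2.0, 1.0]
theta = 2 * math.pi / 10
x_star = [1.5, 0.0, 0.0, -2.0]
b, g = generate_b(sigmas, theta, x_star, tau=1.0)
print(max(abs(v) for v in optimality_residual(sigmas, theta, x_star, b, g, 1.0)))
```

In this sketch no matrix is ever formed or factorized: each step costs a number of operations proportional to n, which is the property that makes this kind of construction attractive for very large instances.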
8 Numerical experiments
In this section we study the performance of state-of-the-art first- and second-order methods as the conditioning and the dimensions of the problem increase. The scripts that reproduce the experiments in this section as well as the problem generators that are described in Sect. 3 can be downloaded from: http://www.maths.ed.ac.uk/ERGO/trillion/.
8.1 State-of-the-art methods
A number of efficient first-order [2, 13, 24, 33, 34, 37, 38, 39, 40, 43, 44] and second-order [4, 10, 16, 19, 25, 35, 36] methods have been developed for the solution of problem (1).
 FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) [2] is an optimal first-order method for problem (1), which adheres to the structure of GFrame. At a point x, FISTA builds a convex function:
$$\begin{aligned} Q_\tau (y;x) := \tau \Vert y\Vert _1 + \frac{1}{2}\Vert Ax-b\Vert ^2 + (A^\intercal (Ax-b))^\intercal (y-x) + \frac{L}{2} \Vert y-x\Vert _2^2, \end{aligned}$$
where L is an upper bound of \(\lambda _{\max }(A^\intercal A)\), and solves subproblem (3) exactly using shrinkage-thresholding [12, 27]. An efficient implementation of this algorithm can be found as part of the TFOCS (Templates for First-Order Conic Solvers) package [6] under the name N83. In this implementation the parameter L is calculated dynamically.
 PCDM (Parallel Coordinate Descent Method) [34] is a randomized parallel coordinate descent method. The parallel updates are performed asynchronously and the coordinates to be updated are chosen uniformly at random. Let \(\varpi \) be the number of processors that are employed by PCDM. Then, at a point x, PCDM builds \(\varpi \) convex approximations:
$$\begin{aligned} Q_\tau ^i(y_i;x) := \tau |y_i| + \frac{1}{2}\Vert Ax-b\Vert ^2 + (A_i^\intercal (Ax-b)) (y_i-x_i) + \frac{\beta L_i}{2} (y_i-x_i)^2, \end{aligned}$$
\(\forall i=1,2,\cdots ,\varpi \), where \(A_i\) is the ith column of matrix A, \(L_i=(A^\intercal A)_{ii}\) is the ith diagonal element of matrix \(A^\intercal A\) and \(\beta \) is a positive constant which is defined in Sect. 8.3. The \(Q_\tau ^i\) functions are minimized exactly using shrinkage-thresholding.
 PSSgb (Projected Scaled Subgradient, Gafni–Bertsekas variant) [36] is a second-order method. At each iteration of PSSgb the coordinates are separated into two sets, the working set \({\mathcal {W}}\) and the active set \({\mathcal {A}}\). The working set consists of all coordinates for which the current point x is nonzero. The active set is the complement of the working set \({\mathcal {W}}\). The following local quadratic model is built at each iteration:
$$\begin{aligned} Q_\tau (y;x) := f_\tau (x) + (\tilde{\nabla } f_\tau (x))^\intercal (y-x) + \frac{1}{2}(y-x)^\intercal H (y-x), \end{aligned}$$
where \(\tilde{\nabla } f_\tau (x)\) is a subgradient of \(f_\tau \) at point x with the minimum Euclidean norm, see Subsection 2.2.1 in [36] for details. Moreover, matrix H is defined as:
$$\begin{aligned} H = \begin{bmatrix} H_{{\mathcal {W}}} \quad&0 \\ 0 \quad&H_{{\mathcal {A}}} \end{bmatrix}, \end{aligned}$$
where \(H_{{\mathcal {W}}}\) is an L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) Hessian approximation with respect to the coordinates \({\mathcal {W}}\) and \(H_{{\mathcal {A}}}\) is a positive diagonal matrix. The diagonal matrix \(H_{{\mathcal {A}}}\) is a scaled identity matrix, where the Shanno–Phua/Barzilai–Borwein scaling is used, see Sect. 2.3.1 in [36] for details. The local model is minimized exactly since the inverse of matrix H is known due to properties of the L-BFGS Hessian approximation \(H_{{\mathcal {W}}}\).
 pdNCG (primal–dual Newton Conjugate Gradients) [16] is also a second-order method. At every point x pdNCG constructs a convex function \(Q_\tau \) exactly as described for (9). The subproblem (3) is solved inexactly by reducing it to the linear system:
$$\begin{aligned} \nabla ^2 f_{\tau }^{\mu }(x) (y-x) = -\nabla f_{\tau }^{\mu }(x), \end{aligned}$$
which is solved approximately using preconditioned Conjugate Gradients (PCG). A simple diagonal preconditioner is used for all experiments. The preconditioner is the inverse of the diagonal of matrix \({\nabla }^2 f_\tau ^\mu (x)\).
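The shrinkage-thresholding step used by FISTA and PCDM has a closed form: minimizing a model of the form \(\tau \Vert y\Vert _1 + g^\intercal (y-x) + \frac{L}{2}\Vert y-x\Vert _2^2\) over y, with \(g=A^\intercal (Ax-b)\), gives \(y=S_{\tau /L}(x-g/L)\), where \(S_t\) is component-wise soft-thresholding. The Python sketch below shows one such plain proximal-gradient step on a tiny dense example. It omits FISTA's extrapolation (momentum) sequence and the dynamic estimation of L used by TFOCS, and the helper names are hypothetical.

```python
def soft_threshold(z, t):
    """Component-wise minimizer of t*|y| + 0.5*(y - z)**2."""
    return [max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0) for v in z]

def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Return A^T r for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def prox_grad_step(A, b, x, tau, L):
    """One step: minimize tau*||y||_1 + g^T (y - x) + (L/2)*||y - x||^2 over y."""
    residual = [ri - bi for ri, bi in zip(matvec(A, x), b)]  # A x - b
    g = rmatvec(A, residual)                                 # A^T (A x - b)
    z = [xi - gi / L for xi, gi in zip(x, g)]                # gradient step
    return soft_threshold(z, tau / L)

A = [[2.0, 0.0], [0.0, 1.0]]
b = [4.0, 0.5]
x = [0.0, 0.0]
# For this diagonal A the largest eigenvalue of A^T A is 4, so L = 4 is a valid bound.
y = prox_grad_step(A, b, x, tau=1.0, L=4.0)
print(y)
```

The second component is set exactly to zero because its gradient step falls below the threshold \(\tau /L\); this is the mechanism by which these methods produce sparse iterates at the cost of a few matrix-vector products per iteration.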
8.2 Implementation details
Solvers pdNCG, FISTA and PSSgb are implemented in MATLAB, while solver PCDM is a C++ implementation. We expect that the programming language will not be an obstacle for pdNCG, FISTA and PSSgb. This is because these methods rely only on basic linear algebra operations, such as the dot product, which are implemented in C++ in MATLAB by default. The experiments in Sects. 8.4, 8.5, 8.6 were performed on a Dell PowerEdge R920 running Redhat Enterprise Linux with four Intel Xeon E7-4830 v2 2.2 GHz processors, 20 MB cache, 7.2 GT/s QPI, Turbo (4\(\times \)10 cores).
The huge scale experiments in Sect. 8.9 were performed on a Cray XC30 MPP supercomputer. This work made use of the resources provided by ARCHER (http://www.archer.ac.uk/), made available through the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). According to the most recent list of commercial supercomputers, which is published in the TOP500 list (http://www.top500.org), ARCHER is currently the 25th fastest supercomputer worldwide out of 500 supercomputers. ARCHER has a total of 118,080 cores with performance 1,642.54 TFlops/s on the LINPACK benchmark and 2,550.53 TFlops/s theoretical peak performance. The most computationally demanding experiments which are presented in Sect. 8.9 required more than half of the cores of ARCHER, i.e., 65,536 cores out of 118,080.
8.3 Parameter tuning
We describe the most important parameters for each solver; any other parameters are set to their default values. For pdNCG we set the smoothing parameter to \(\mu = 10^{-5}\); this setting allows accurate solution of the original problem with an error of order \({\mathcal {O}}(\mu )\) [30]. For pdNCG, PCG is terminated when the relative residual is less than \(10^{-1}\) and the backtracking line-search is terminated if it exceeds 50 iterations. Regarding FISTA, the most important parameter is the Lipschitz constant L, whose calculation is handled dynamically by TFOCS. For PCDM the coordinate Lipschitz constants \(L_i\) \(\forall i=1,2,\ldots ,n\) are calculated exactly and parameter \(\beta = 1 + (\omega -1)(\varpi -1)/(n-1)\), where \(\omega \) changes for every problem since it is the degree of partial separability of the fidelity function in (1), which is easily calculated (see [34]), and \(\varpi =40\) is the number of cores that are used. For PSSgb we set the number of L-BFGS corrections to 10.
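For a feeling of the magnitude of the PCDM parameter, the formula \(\beta = 1 + (\omega -1)(\varpi -1)/(n-1)\) can be evaluated directly. The short Python sketch below does so for \(\varpi =40\) and \(n=2^{22}\); the values of \(\omega \) are chosen only for illustration and are not taken from the experiments, and the function name `pcdm_beta` is hypothetical.

```python
def pcdm_beta(omega, num_procs, n):
    """Evaluate beta = 1 + (omega - 1) * (num_procs - 1) / (n - 1)."""
    return 1.0 + (omega - 1) * (num_procs - 1) / (n - 1)

n = 2**22
for omega in (1, 2, n):  # illustrative degrees of partial separability
    print(omega, pcdm_beta(omega, 40, n))
```

At the two extremes, a fully separable fidelity term (\(\omega =1\)) gives \(\beta =1\), while a fully coupled one (\(\omega =n\)) gives \(\beta =\varpi \), i.e. the coordinate steps are damped by the number of processors.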
We set the regularization parameter \(\tau =1\), unless stated otherwise. We run pdNCG for sufficient time such that the problems are adequately solved. Then, the rest of the methods are terminated when the objective function \( f_{\tau }\) in (1) is below the one obtained by pdNCG or when a predefined maximum number of iterations is reached. All comparisons are presented in figures which show the progress of the objective function against the wall clock time. This way the reader can compare the performance of the solvers for various levels of accuracy. We use logarithmic scales for the wall clock time and terminate runs which do not converge in about \(10^5\) s, i.e., approximately 27 h.
8.4 Increasing condition number of \(A^\intercal A\)
In this experiment we present the performance of FISTA, PCDM, PSSgb and pdNCG for increasing condition number of matrix \(A^\intercal A\) when Procedure OsGen is used to construct the optimal solution \(x^*\). We generate six matrices A and two instances of \(x^*\) for every matrix A; 12 instances in total.
The singular value decomposition of matrix A is \(A=\varSigma G^\intercal \), where \(\varSigma \) is the matrix of singular values and the columns of matrices \(I_m\) and G are the left and right singular vectors, respectively; see Sect. 4.1 for details about the construction of matrix G. The singular values of matrices A are chosen uniformly at random in the intervals \([0, 10^q]\), where \(q=0,1,\ldots ,5\), for each of the six matrices A. Then, all singular values are shifted by \(10^{-1}\). This resulted in a condition number of matrix \(A^\intercal A\) which varies from \(10^2\) to \(10^{12}\) with a step of times \(10^2\). The rotation angle \(\theta \) of matrix G is set to \(2{\pi }/{3}\) radians. Matrices A have \(n=2^{22}\) columns, \(m =2 n\) rows and rank n. The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for all twelve instances.
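The stated range of condition numbers can be checked with a short calculation. The Python sketch below assumes the shift is \(10^{-1}\) and that the extreme singular values are close to the ends of the shifted interval \([10^{-1}, 10^q + 10^{-1}]\), so that \(\kappa (A^\intercal A)\approx ((10^q+10^{-1})/10^{-1})^2\). The function name `kappa_gram_bound` is hypothetical, and the value it returns is an upper bound for a random draw rather than the exact condition number of any generated instance.

```python
import math

SHIFT = 1e-1  # assumed shift applied to every singular value

def kappa_gram_bound(q, shift=SHIFT):
    """Largest possible kappa(A^T A) for singular values in [shift, 10**q + shift]."""
    sigma_min = shift
    sigma_max = 10.0**q + shift
    return (sigma_max / sigma_min) ** 2

for q in range(6):
    kappa = kappa_gram_bound(q)
    print(q, f"{kappa:.3e}", "order of magnitude:", round(math.log10(kappa)))
```

The orders of magnitude obtained for \(q=0,1,\ldots ,5\) are 2, 4, 6, 8, 10 and 12, which matches the range \(10^2\) to \(10^{12}\) in steps of \(10^2\) quoted above.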
For the second set of six instances we set \(\gamma =10^3\) in Procedure OsGen, which resulted in the same \(\kappa _{0.1}(x^*)\) as before for every matrix A.
We observed that pdNCG required at most 30 iterations to converge for all experiments. For FISTA, PCDM and PSSgb the number of iterations varied between thousands and tens of thousands depending on the condition number of matrix \(A^\intercal A\); the larger the condition number, the more iterations were needed. However, the number of iterations is not a fair metric to compare solvers because every solver has a different computational cost per iteration. In particular, FISTA, PCDM and PSSgb perform few inner products per iteration, which makes every iteration inexpensive, but the number of iterations is sensitive to the condition number of matrix \(A^\intercal A\). On the other hand, for pdNCG the empirical iteration complexity is fairly stable; however, the number of inner products per iteration (mainly matrix-vector products with matrix A) may increase as the condition number of matrix \(A^\intercal A\) increases. Inner products are the major computational burden at every iteration for all solvers; therefore, the faster an algorithm converges in terms of wall-clock time, the fewer inner products it calculates. In Figs. 5 and 6 we display the objective evaluation against wall-clock time (log-scale) to facilitate the comparison of different algorithms.
8.5 Increasing condition number of \(A^\intercal A\): nontrivial construction of \(x^*\)
In this experiment we examine the performance of the methods as the condition number of matrix \(A^\intercal A\) increases, while the optimal solution \(x^*\) is generated using Procedure OsGen3 (instead of OsGen) with \(\gamma = 100\) and \(s_1 = s_2= s/2\). Two classes of instances are generated; each class consists of four instances \((A,x^*)\) with \(n=2^{22}\), \(m =2 n\) and \(s=n/2^7\). Matrix A is constructed as in Sect. 8.4. The singular values of matrices A are chosen uniformly at random in the intervals \([0, 10^q]\), where \(q=0,1,\ldots ,3\), for all generated matrices A. Then, all singular values are shifted by \(10^{-1}\). This resulted in a condition number of matrix \(A^\intercal A\) which varies from \(10^2\) to \(10^{8}\) with a step of times \(10^2\). The condition number of the generated optimal solutions was on average \(\kappa _{0.1}(x^*)\approx 40\).
The two classes of experiments are distinguished based on the rotation angle \(\theta \) that is used for the composition of Givens rotations G. In particular, for the first class of experiments the angle is \(\theta =2{\pi }/10\) radians, while for the second class of experiments the rotation angle is \(\theta =2{\pi }/10^{3}\) radians. The difference between the two classes is that the second class consists of matrices \(A^\intercal A\) for which a major part of their mass is concentrated in the diagonal. This setting is beneficial for PCDM since it uses information only from the diagonal of matrices \(A^\intercal A\). This setting is also beneficial for pdNCG since it uses a diagonal preconditioner for the inexact solution of linear systems at every iteration.
The results for the first class of experiments are presented in Fig. 7. For instances with \(\kappa (A^\intercal A) \ge 10^{6}\) PCDM was terminated after 1,000,000 iterations, which corresponded to more than 27 h of wall-clock time.
8.6 Increasing dimensions
In this experiment we present the performance of pdNCG, FISTA, PCDM and PSSgb as the number of variables n increases. We generate four instances where the number of variables n takes values \(2^{20}\), \(2^{22}\), \(2^{24}\) and \(2^{26}\), respectively. The singular value decomposition of matrix A is \(A=\varSigma G^\intercal \). The singular values in matrix \(\varSigma \) are chosen uniformly at random in the interval [0, 10] and then are shifted by \(10^{-1}\), which resulted in \(\kappa (A^\intercal A)\approx 10^{4}\). The rotation angle \(\theta \) of matrix G is set to \(2\pi /10\) radians. Moreover, matrices A have \(m= 2 n\) rows and rank n. The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for each generated instance. For the construction of the optimal solutions \(x^*\) we use Procedure OsGen3 with \(\gamma = 100\) and \(s_1 = s_2= s/2\), which resulted in \(\kappa _{0.1}(x^*) \approx 3\) on average.
8.7 Increasing density of matrix \(A^\intercal A\)
In this experiment we demonstrate the performance of pdNCG, FISTA, PCDM and PSSgb as the density of matrix \(A^\intercal A\) increases. We generate four instances \((A,x^*)\). For the first experiment we generate matrix \(A = \varSigma G^\intercal \), where \(\varSigma \) is the matrix of singular values and the columns of matrices \(I_m\) and G are the left and right singular vectors, respectively. For the second experiment we generate matrix \(A=\varSigma (G_2G)^\intercal \), where the columns of matrices \(I_m\) and \(G_2G\) are the left and right singular vectors of matrix A, respectively; \(G_2\) has been defined in Sect. 4.2. Finally, for the third and fourth experiments we have \(A=\varSigma (GG_2G)^\intercal \) and \(A=\varSigma (G_2GG_2G)^\intercal \), respectively. For each experiment the singular values of matrix A are chosen uniformly at random in the interval [0, 10] and then are shifted by \(10^{-1}\), which resulted in \(\kappa (A^\intercal A)\approx 10^4\). The rotation angle \(\theta \) of matrices G and \(G_2\) is set to \(2\pi /10\) radians. Matrices A have \(m= 2 n\) rows, rank n and \(n=2^{22}\). The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for each experiment. Moreover, Procedure OsGen3 is used with \(\gamma = 100\) and \(s_1 = s_2= s/2\) for the construction of \(x^*\) for each experiment, which resulted in \(\kappa _{0.1}(x^*) \approx 2\) on average.
8.8 Varying parameter \(\tau \)
In this experiment we present the performance of pdNCG, FISTA, PCDM and PSSgb as parameter \(\tau \) varies from \(10^{-4}\) to \(10^4\) with a step of times \(10^2\). We generate four instances \((A,x^*)\), where matrix \(A=\varSigma G^\intercal \) has \(m= 2 n\) rows, rank n and \(n=2^{22}\). The singular values of matrices A are chosen uniformly at random in the interval [0, 10] and then are shifted by \(10^{-1}\), which resulted in \(\kappa (A^\intercal A)\approx 10^4\) for each experiment. The rotation angle \(\theta \) for matrix G in A is set to \(2\pi /10\) radians. The optimal solution \(x^*\) has \(s = n/2^7\) nonzero components for all instances. Moreover, the optimal solutions are generated using Procedure OsGen3 with \(\gamma = 100\), which resulted in \(\kappa _{0.1}(x^*) \approx 3\) for all four instances.
8.9 Performance of a second-order method on huge scale problems
We now present the performance of pdNCG on synthetic huge scale (up to one trillion variables) SLS problems as the number of variables and the number of processors increase.
Table 1 Performance of pdNCG for synthetic huge scale SLS problems. All problems have been solved to a relative error of order \(10^{-4}\) of the obtained solution
n           Processors  Memory (terabytes)  Time (seconds)
\(2^{30}\)  64          0.192               1923
\(2^{32}\)  256         0.768               1968
\(2^{34}\)  1024        3.072               1986
\(2^{36}\)  4096        12.288              1970
\(2^{38}\)  16384       49.152              1990
\(2^{40}\)  65536       196.608             2006
Details of the performance of pdNCG are given in Table 1. Observe the nearly linear scaling of pdNCG with respect to the number of variables n and the number of processors. For all experiments in Table 1 pdNCG required 8 Newton steps to converge, 100 PCG iterations per Newton step on average, where every PCG iteration requires two matrix-vector products with matrix A.
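The scaling claim can be quantified from the numbers in Table 1. The following Python sketch computes the load per processor and a weak-scaling efficiency, defined here (as an assumption, not a quantity reported in the experiments) as the time of the smallest run divided by the time of each run. It also assumes decimal units, i.e. 1 terabyte = 1000 gigabytes; the function names are hypothetical.

```python
# (log2 of n, processors, memory in terabytes, time in seconds), copied from Table 1.
RUNS = [
    (30, 64, 0.192, 1923),
    (32, 256, 0.768, 1968),
    (34, 1024, 3.072, 1986),
    (36, 4096, 12.288, 1970),
    (38, 16384, 49.152, 1990),
    (40, 65536, 196.608, 2006),
]

def per_processor_load(runs):
    """Return (variables per processor, gigabytes per processor) for each run."""
    return [(2**log_n // procs, 1000.0 * mem_tb / procs)
            for log_n, procs, mem_tb, _ in runs]

def weak_scaling_efficiency(runs):
    """Time of the smallest run divided by the time of each run."""
    base = runs[0][3]
    return [base / t for _, _, _, t in runs]

# 8 Newton steps, about 100 PCG iterations each, 2 products with A per PCG iteration.
MATVECS = 8 * 100 * 2

for (variables, gigabytes), eff in zip(per_processor_load(RUNS), weak_scaling_efficiency(RUNS)):
    print(variables, round(gigabytes, 3), round(eff, 3))
print("approximate number of matrix-vector products:", MATVECS)
```

Every run assigns \(2^{24}\) variables and about 3 gigabytes to each processor, so the experiment keeps the work per processor fixed while the problem grows by a factor of \(2^{10}\). Under the definition above the efficiency stays above 0.95 throughout, with roughly 1600 matrix-vector products per solve.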
9 Conclusion
In this paper we developed an instance generator for \(\ell _1\)-regularized sparse least-squares problems. The generator is aimed at the construction of very large-scale instances. Therefore it scales well as the number of variables increases, both in terms of memory requirements and time. Additionally, the generator allows control of the conditioning and the sparsity of the problem. Examples are provided on how to exploit the previous advantages of the proposed generator. We believe that the optimization community needs such a generator to be able to perform fair assessment of new algorithms.
Using the proposed generator we constructed very large-scale sparse instances (up to one trillion variables), which vary from very well-conditioned to moderately ill-conditioned. We examined the performance of several representative first- and second-order optimization methods. The experiments revealed that regardless of the size of the problem, the performance of the methods crucially depends on the conditioning of the problem. In particular, the first-order methods PCDM and FISTA are faster for problems with small or moderate condition number, whilst the second-order method pdNCG is much more efficient for ill-conditioned problems.
Notes
Acknowledgments
This work has made use of the resources provided by ARCHER (http://www.archer.ac.uk/), made available through the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). The authors are grateful to Dr. Kenton D’Mellow for providing guidance and helpful suggestions regarding the use of ARCHER and the solution of large scale problems.
References
 1.Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsityinducing penalties. J. Found. Trends. Mach. Learn. 4(1), 1–106 (2012)CrossRefzbMATHGoogle Scholar
 2.Beck, A., Teboulle, M.: A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)MathSciNetCrossRefzbMATHGoogle Scholar
 3.Becker, S.: CoSaMP and OMP for sparse recovery. http://www.mathworks.co.uk/matlabcentral/fileexchange/32402cosampandompforsparserecovery (2012)
 4.Becker, S., Fadili, J.: A quasiNewton proximal splitting method. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2618–2626. Curran Associates Inc., Red Hook (2012)Google Scholar
 5.Becker, S.R., Bobin, J., Candès, E.J.: Nesta: A fast and accurate firstorder method for sparse recovery. SIAM J. Imaging Sci. 4(1), 1–39 (2011)MathSciNetCrossRefzbMATHGoogle Scholar
 6.Becker, S.R., Candés, E.J., Grant, M.C.: Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput. 3(3), 165–218 (2011). http://tfocs.stanford.edu
 7.Bertalmio, M., Sapiro, G., Ballester, C., Caselles, V.: Image inpainting. Proceedings of the 27th annual conference on computer graphics and interactive techniques (SIGGRAPH), pp. 417–424 (2000)Google Scholar
 8.Bertero, M., Ruggiero, V., Zanni, L.: Special issue: Imaging 2013. Comput. Optim. Appl. 54, 211–213 (2013)CrossRefGoogle Scholar
 9.Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. J. Found. Trends. Mach. Learn. 3(1), 1–122 (2011)CrossRefzbMATHGoogle Scholar
 10.Byrd, R.H., Nocedal, J., Oztoprak, F.: An inexact successive quadratic approximation method for convex l1 regularized optimization. Math. Program. Ser B. (2015). doi: 10.1007/s101070150941y
 11.Chambolle, A., Caselles, V., Cremers, D., Novaga, M., Pock, T.: An introduction to total variation for image analysis. Radon. Ser. Comp. Appl. Math 9, 263–340 (2010)MathSciNetzbMATHGoogle Scholar
 12.Chambolle, A., DeVore, R.A., Lee, N.Y., Lucier, B.J.: Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. Image Process. 7(3), 319–335 (1998)MathSciNetCrossRefzbMATHGoogle Scholar
 13.Chang, K.W., Hsieh, C.J., Lin, C.J.: Coordinate descent method for large-scale \(\ell _2\)-loss linear support vector machines. J. Mach. Learn. Res. 9, 1369–1398 (2008)
 14.Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory. 52(4), 1289–1306 (2006)
 15.Eckstein, J.: Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports (2012)
 16.Fountoulakis, K., Gondzio, J.: A second-order method for strongly convex \(\ell _1\)-regularization problems. Math. Program. (2015). doi: 10.1007/s10107-015-0875-4. http://www.maths.ed.ac.uk/ERGO/pdNCG/
 17.Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Mach. Learn. Res. 9, 627–650 (2008)
 18.Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging. Sci. 7(3), 1588–1623 (2014)
 19.Gondzio, J.: Matrix-free interior point method. Comput. Optim. Appl. 51(2), 457–480 (2012)
 20.Hale, E.T., Yin, W., Zhang, Y.: Fixed-point continuation method for \(\ell _1\)-minimization: methodology and convergence. SIAM J. Optim. 19(3), 1107–1130 (2008)
 21.Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra and Filtering. SIAM, Philadelphia (2006)
 22.Hansen, P.C., Pereyra, V., Scherer, G.: Least Squares Data Fitting with Applications. JHU Press, Baltimore (2012)
 23.He, B., Yuan, X.: On the \({\cal O}(1/t)\) convergence rate of alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)
 24.Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. Proceedings of the 25th international conference on machine learning, ICML 2008, pp. 408–415 (2008)
 25.Kim, S.J., Koh, K., Lustig, M., Boyd, S., Gorinevsky, D.: An interior-point method for large-scale \(\ell _1\)-regularized least squares. IEEE J. Sel. Top. Signal. Process. 1(4), 606–617 (2007)
 26.Lorenz, D.A.: Constructing test instances for basis pursuit denoising. IEEE Trans. Signal. Process. 61(5), 1210–1214 (2013)
 27.Lustig, M., Donoho, D., Pauly, J.M.: Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 58(6), 1182–1195 (2007)
 28.Needell, D., Tropp, J.A.: CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal. 26(3), 301–321 (2009)
 29.Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Boston (2004)
 30.Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Program. 103(1), 127–152 (2005)
 31.Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
 32.Parikh, N., Boyd, S.: Proximal algorithms. J. Found. Trends. Optim. 1(3), 123–231 (2013)
 33.Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. Ser A. 144(1), 1–38 (2014)
 34.Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. Ser A. 1–52 (2015). doi: 10.1007/s10107-015-0901-6
 35.Scheinberg, K., Tang, X.: Practical inexact proximal quasi-Newton method with global complexity analysis. Technical report, March (2014). http://www.arxiv.org/abs/1311.6547
 36.Schmidt, M.: Graphical model structure learning with l1-regularization. PhD thesis, University of British Columbia (2010)
 37.Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12(4), 1865–1892 (2011)
 38.Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory. Appl. 109(3), 475–494 (2001)
 39.Tseng, P.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22, 341–362 (2012)
 40.Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. Ser B. 117, 387–423 (2009)
 41.Vattikuti, S., Lee, J.J., Chang, C.C., Hsu, S.D., Chow, C.C.: Applying compressed sensing to genome-wide association studies. Giga. Sci. 3(10), 1–17 (2014)
 42.Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sci. 1(3), 248–272 (2008)
 43.Wright, S.J.: Accelerated block-coordinate relaxation for regularized optimization. SIAM J. Optim. 22(1), 159–186 (2012)
 44.Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2(1), 224–244 (2008)
 45.Yuan, G.X., Chang, K.W., Hsieh, C.J., Lin, C.J.: A comparison of optimization methods and software for large-scale \(\ell _1\)-regularized linear classification. J. Mach. Learn. Res. 11, 3183–3234 (2010)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.