In this section we study the performance of state-of-the-art first- and second-order methods as the conditioning and the dimensions of the problem increase. The scripts that reproduce the experiments in this section, as well as the problem generators described in Sect. 3, can be downloaded from http://www.maths.ed.ac.uk/ERGO/trillion/.
State-of-the-art methods
A number of efficient first- [2, 13, 24, 33, 34, 37–40, 43, 44] and second-order [4, 10, 16, 19, 25, 35, 36] methods have been developed for the solution of problem (1).
In this section we examine the performance of the following state-of-the-art methods. Notice that the first three methods, FISTA, PCDM and PSSgb, do not smooth the \(\ell _1\)-norm, while pdNCG does.
-
FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) [2] is an optimal first-order method for problem (1), which adheres to the structure of GFrame. At a point x, FISTA builds a convex function:
$$\begin{aligned} Q_\tau (y;x) := \tau \Vert y\Vert _1 + \frac{1}{2}\Vert Ax-b\Vert ^2 + (A^\intercal (Ax-b))^\intercal (y-x) + \frac{L}{2} \Vert y-x\Vert _2^2, \end{aligned}$$
where L is an upper bound on \(\lambda _{max}(A^\intercal A)\), and solves subproblem (3) exactly using shrinkage-thresholding [12, 27]; a minimal sketch of one such iteration is given after this list. An efficient implementation of this algorithm can be found in the TFOCS (Templates for First-Order Conic Solvers) package [6] under the name N83. In this implementation the parameter L is calculated dynamically.
-
PCDM (Parallel Coordinate Descent Method) [34] is a randomized parallel coordinate descent method. The parallel updates are performed asynchronously and the coordinates to be updated are chosen uniformly at random. Let \(\varpi \) be the number of processors that are employed by PCDM. Then, at a point x, PCDM builds \(\varpi \) convex approximations:
$$\begin{aligned} Q_\tau ^i(y_i;x) := \tau |y_i| + \frac{1}{2}\Vert Ax-b\Vert ^2 + (A_i^\intercal (Ax-b)) (y_i-x_i) + \frac{\beta L_i}{2} (y_i-x_i)^2, \end{aligned}$$
\(\forall i=1,2,\ldots ,\varpi \), where \(A_i\) is the ith column of matrix A, \(L_i=(A^\intercal A)_{ii}\) is the ith diagonal element of matrix \(A^\intercal A\), and \(\beta \) is a positive constant which is defined in Sect. 8.3. The \(Q_\tau ^i\) functions are minimized exactly using shrinkage-thresholding; see the coordinate-update sketch after this list.
-
PSSgb (Projected Scaled Subgradient, Gafni–Bertsekas variant) [36] is a second-order method. At each iteration of PSSgb the coordinates are separated into two sets, the working set \({\mathcal {W}}\) and the active set \({\mathcal {A}}\). The working set consists of all coordinates at which the current point x is nonzero. The active set is the complement of the working set \({\mathcal {W}}\). The following local quadratic model is built at each iteration:
$$\begin{aligned} Q_\tau (y;x) := f_\tau (x) + (\tilde{\nabla } f_\tau (x))^\intercal (y-x) + \frac{1}{2}(y-x)^\intercal H (y-x), \end{aligned}$$
where \(\tilde{\nabla } f_\tau (x)\) is the subgradient of \(f_\tau \) at point x with the minimum Euclidean norm, see Sect. 2.2.1 in [36] for details. Moreover, matrix H is defined as:
$$\begin{aligned} H = \begin{bmatrix} H_{{\mathcal {W}}} & 0 \\ 0 & H_{{\mathcal {A}}} \end{bmatrix}, \end{aligned}$$
where \(H_{{\mathcal {W}}}\) is an L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) Hessian approximation with respect to the coordinates in \({\mathcal {W}}\) and \(H_{{\mathcal {A}}}\) is a positive diagonal matrix. The diagonal matrix \(H_{{\mathcal {A}}}\) is a scaled identity matrix, where the Shanno–Phua/Barzilai–Borwein scaling is used, see Sect. 2.3.1 in [36] for details. The local model is minimized exactly, since the inverse of matrix H is known due to properties of the L-BFGS Hessian approximation \(H_{{\mathcal {W}}}\); see the sketch after this list.
-
pdNCG (primal–dual Newton Conjugate Gradients) [16] is also a second-order method. At every point x pdNCG constructs a convex function \(Q_\tau \) exactly as described for (9). The subproblem (3) is solved inexactly by reducing it to the linear system:
$$\begin{aligned} \nabla ^2 f_{\tau }^{\mu }(x) (y-x) = -\nabla f_{\tau }^{\mu }(x), \end{aligned}$$
which is solved approximately using preconditioned Conjugate Gradients (PCG). A simple diagonal preconditioner, namely the inverse of the diagonal of matrix \({\nabla }^2 f_\tau ^\mu (x)\), is used for all experiments; see the sketch after this list.
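The following MATLAB fragment sketches one FISTA-style iteration under the model above. It is a minimal illustration, not the TFOCS/N83 implementation (which adapts L dynamically), and it assumes A, b, tau and maxit are given.

```matlab
% Minimal FISTA sketch for min_x  tau*||x||_1 + 0.5*||A*x - b||^2.
% Illustrative only: not the TFOCS/N83 code, which adapts L dynamically.
soft = @(z, t) sign(z) .* max(abs(z) - t, 0);  % shrinkage-thresholding
L  = normest(A)^2;                     % estimate of lambda_max(A'*A)
x  = zeros(size(A,2), 1);  y = x;  th = 1;
for k = 1:maxit
    g   = A' * (A*y - b);              % gradient of the smooth part at y
    xn  = soft(y - g/L, tau/L);        % exact minimizer of Q_tau(.;y)
    thn = (1 + sqrt(1 + 4*th^2)) / 2;  % momentum parameter update
    y   = xn + ((th - 1)/thn) * (xn - x);  % extrapolation step
    x   = xn;  th = thn;
end
```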
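For PCDM, minimizing \(Q_\tau ^i\) also has a closed form. The sketch below shows one serial coordinate update with illustrative names (Lc holding the diagonal of \(A^\intercal A\)); the actual method applies \(\varpi \) such updates asynchronously in parallel.

```matlab
% One serial PCDM-style coordinate update; illustrative sketch only.
% Assumes A, b, x, tau, beta and Lc = diag(A'*A) are given.
soft = @(z, t) sign(z) .* max(abs(z) - t, 0);
i    = randi(length(x));           % coordinate chosen uniformly at random
g_i  = A(:,i)' * (A*x - b);        % ith partial derivative of the smooth part
c    = beta * Lc(i);               % curvature of Q_tau^i
x(i) = soft(x(i) - g_i/c, tau/c);  % exact minimizer of Q_tau^i
```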
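Since H is block diagonal and the inverse of \(H_{{\mathcal {W}}}\) is available, the minimizer of the PSSgb model is a scaled subgradient step. In the sketch below, subgrad_min_norm and lbfgs_hinv are hypothetical stand-ins for the minimum-norm subgradient of Sect. 2.2.1 in [36] and the L-BFGS two-loop recursion; they are not part of the PSSgb code.

```matlab
% Sketch of the PSSgb model step; subgrad_min_norm and lbfgs_hinv are
% hypothetical helpers, and sg is the Shanno-Phua/Barzilai-Borwein scale.
g  = subgrad_min_norm(x, A, b, tau);  % minimum Euclidean-norm subgradient
W  = (x ~= 0);                        % working set: nonzero coordinates
d  = zeros(size(x));
d(W)  = lbfgs_hinv(g(W));             % H_W^{-1} g_W via two-loop recursion
d(~W) = g(~W) / sg;                   % H_A = sg*I on the active set
y = x - d;                            % exact minimizer of the local model
```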
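A minimal sketch of the pdNCG inner solve using MATLAB's pcg follows. Here hessvec (the Hessian-vector product \(v \mapsto \nabla ^2 f_{\tau }^{\mu }(x)v\)), grad (\(\nabla f_{\tau }^{\mu }(x)\)), hess_diag and the iteration cap maxit are assumed given; they are not the names used in [16].

```matlab
% Inexact Newton step of pdNCG via diagonally preconditioned CG (sketch).
% MATLAB's pcg applies M\r, i.e., the inverse of the diagonal, as required.
M  = spdiags(hess_diag, 0, n, n);      % diagonal of the Hessian of f_tau^mu
[dx, flag] = pcg(hessvec, -grad, 1e-1, maxit, M);  % rel. residual tol 1e-1
y  = x + dx;                           % inexact solution of subproblem (3)
```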
Implementation details
Solvers pdNCG, FISTA and PSSgb are implemented in MATLAB, while solver PCDM is a C++ implementation. We do not expect the programming language to be an obstacle for pdNCG, FISTA and PSSgb, because these methods rely only on basic linear algebra operations, such as dot products, which MATLAB performs through compiled C++ routines by default. The experiments in Sects. 8.4, 8.5, 8.6 were performed on a Dell PowerEdge R920 running Red Hat Enterprise Linux, with four Intel Xeon E7-4830 v2 2.2 GHz processors (4\(\times \)10 cores), 20 MB cache, 7.2 GT/s QPI and Turbo.
The huge-scale experiments in Sect. 8.9 were performed on a Cray XC30 MPP supercomputer. This work made use of the resources provided by ARCHER (http://www.archer.ac.uk/), made available through the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). According to the most recent TOP500 list (http://www.top500.org), ARCHER is currently the 25th fastest of the 500 listed supercomputers worldwide. ARCHER has a total of 118,080 cores, with a performance of 1,642.54 TFlops/s on the LINPACK benchmark and a theoretical peak performance of 2,550.53 TFlops/s. The most computationally demanding experiments presented in Sect. 8.9 required more than half of the cores of ARCHER, i.e., 65,536 cores out of 118,080.
Parameter tuning
We describe the most important parameters for each solver; all other parameters are set to their default values. For pdNCG we set the smoothing parameter to \(\mu = 10^{-5}\); this setting allows an accurate solution of the original problem, with an error of order \({\mathcal {O}}(\mu )\) [30]. For pdNCG, PCG is terminated when the relative residual is less than \(10^{-1}\), and the backtracking line-search is terminated if it exceeds 50 iterations. For FISTA the most important parameter is the Lipschitz constant L, which is handled dynamically by TFOCS. For PCDM the coordinate Lipschitz constants \(L_i\) \(\forall i=1,2,\ldots ,n\) are calculated exactly, and the parameter is set to \(\beta = 1 + (\omega -1)(\varpi -1)/(n-1)\), where \(\omega \) is the degree of partial separability of the fidelity function in (1), which changes for every problem and is easily calculated (see [34]), and \(\varpi =40\) is the number of cores that are used. For example, a perfectly separable fidelity term (\(\omega =1\)) gives \(\beta =1\), while a fully coupled one (\(\omega =n\)) gives \(\beta =\varpi =40\). For PSSgb we set the number of L-BFGS corrections to 10.
We set the regularization parameter \(\tau =1\), unless stated otherwise. We run pdNCG for sufficient time such that the problems are adequately solved. The rest of the methods are then terminated when the objective function \( f_{\tau }\) in (1) falls below the value obtained by pdNCG, or when a predefined maximum number of iterations is reached. All comparisons are presented in figures which show the progress of the objective function against the wall-clock time, so that the reader can compare the performance of the solvers at various levels of accuracy. We use logarithmic scales for the wall-clock time and terminate runs which do not converge within about \(10^5\) s, i.e., approximately 27 h.
Increasing condition number of \(A^\intercal A\)
In this experiment we present the performance of FISTA, PCDM, PSSgb and pdNCG for increasing condition number of matrix \(A^\intercal A\), when Procedure OsGen is used to construct the optimal solution \(x^*\). We generate six matrices A and two instances of \(x^*\) for every matrix A; 12 instances in total.
The singular value decomposition of matrix A is \(A=\varSigma G^\intercal \), where \(\varSigma \) is the matrix of singular values and the columns of matrices \(I_m\) and G are the left and right singular vectors, respectively; see Sect. 4.1 for details about the construction of matrix G. For each of the six matrices A, the singular values are chosen uniformly at random in the interval \([0, 10^q]\), where \(q=0,1,\ldots ,5\), and are then shifted by \(10^{-1}\). This results in a condition number of matrix \(A^\intercal A\) which varies from \(10^2\) to \(10^{12}\) in multiplicative steps of \(10^2\). The rotation angle \(\theta \) of matrix G is set to \(2{\pi }/{3}\) radians. Matrices A have \(n=2^{22}\) columns, \(m =2 n\) rows and rank n. The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for all twelve instances.
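To make the link between q and the resulting condition number explicit, the fragment below sketches how such a spectrum can be drawn; the variable names are illustrative, and the Givens-rotation factor G of Sect. 4.1 is not shown.

```matlab
% Illustrative sketch of the spectrum used in this experiment.
n     = 2^22;
q     = 3;                          % one of q = 0,1,...,5
sigma = 10^q * rand(n,1) + 1e-1;    % uniform on [0,10^q], shifted by 1e-1
% kappa(A'*A) = (max(sigma)/min(sigma))^2 ~ (10^q/1e-1)^2 = 10^(2q+2),
% i.e., 10^2, 10^4, ..., 10^12 as q runs from 0 to 5.
```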
For the first set of six instances we set \(\gamma =10\) in OsGen, which resulted in \(\kappa _{0.1}(x^*) \approx 1\) for all experiments. The results are presented in Fig. 5. For these instances PCDM is clearly the fastest for \(\kappa (A^\intercal A) \le 10^4\), while for \(\kappa (A^\intercal A) \ge 10^6\) pdNCG is the most efficient.
For the second set of six instances we set \(\gamma =10^3\) in Procedure OsGen, which resulted in the same \(\kappa _{0.1}(x^*)\) as before for every matrix A.
The results are presented in Fig. 6. For these instances PCDM is the fastest for very well conditioned problems with \(\kappa (A^\intercal A) \le 10^2\), while pdNCG is the fastest for \(\kappa (A^\intercal A) \ge 10^4\).
We observed that pdNCG required at most 30 iterations to converge for all experiments. For FISTA, PCDM and PSSgb the number of iterations varied between thousands and tens of thousands of iterations, depending on the condition number of matrix \(A^\intercal A\); the larger the condition number, the more iterations were required. However, the number of iterations is not a fair metric for comparing solvers, because every solver has a different computational cost per iteration. In particular, FISTA, PCDM and PSSgb perform a few inner products per iteration, which makes every iteration inexpensive, but the number of iterations is sensitive to the condition number of matrix \(A^\intercal A\). On the other hand, for pdNCG the empirical iteration complexity is fairly stable; however, the number of inner products per iteration (mainly matrix-vector products with matrix A) may increase as the condition number of matrix \(A^\intercal A\) increases. Inner products are the major computational burden at every iteration for all solvers; therefore, the faster an algorithm converges in terms of wall-clock time, the fewer inner products it calculates. In Figs. 5 and 6 we display the objective value against wall-clock time (log-scale) to facilitate the comparison of the different algorithms.
Increasing condition number of \(A^\intercal A\): non-trivial construction of \(x^*\)
In this experiment we examine the performance of the methods as the condition number of matrix \(A^\intercal A\) increases, while the optimal solution \(x^*\) is generated using Procedure OsGen3 (instead of OsGen) with \(\gamma = 100\) and \(s_1 = s_2= s/2\). Two classes of instances are generated; each class consists of four instances \((A,x^*)\) with \(n=2^{22}\), \(m =2 n\) and \(s=n/2^7\). Matrix A is constructed as in Sect. 8.4. For all generated matrices A, the singular values are chosen uniformly at random in the interval \([0, 10^q]\), where \(q=0,1,\ldots ,3\), and are then shifted by \(10^{-1}\). This results in a condition number of matrix \(A^\intercal A\) which varies from \(10^2\) to \(10^{8}\) in multiplicative steps of \(10^2\). The condition number of the generated optimal solutions was on average \(\kappa _{0.1}(x^*)\approx 40\).
The two classes of experiments are distinguished by the rotation angle \(\theta \) that is used for the composition of Givens rotations G. In particular, for the first class of experiments the angle is \(\theta =2{\pi }/10\) radians, while for the second class the rotation angle is \(\theta =2{\pi }/10^{3}\) radians. The difference between the two classes is that the second class consists of matrices \(A^\intercal A\) for which a major part of the mass is concentrated on the diagonal. This setting is beneficial for PCDM, since it uses information only from the diagonal of matrices \(A^\intercal A\). It is also beneficial for pdNCG, since it uses a diagonal preconditioner for the inexact solution of linear systems at every iteration.
The results for the first class of experiments are presented in Fig. 7. For instances with \(\kappa (A^\intercal A) \ge 10^{6}\) PCDM was terminated after 1,000,000 iterations, which corresponded to more than 27 h of wall-clock time.
The results for the second class of experiments are presented in Fig. 8. Notice in this figure that the objective function is only slightly reduced. This does not mean that the initial solution, which was the zero vector, was nearly optimal. Rather, noise with a large norm was used in these experiments, i.e., \(\Vert Ax^* - b\Vert \) is large; therefore, changes in the optimal solution did not have a large effect on the objective function.
Increasing dimensions
In this experiment we present the performance of pdNCG, FISTA, PCDM and PSSgb as the number of variables n increases. We generate four instances where the number of variables n takes values \(2^{20}\), \(2^{22}\), \(2^{24}\) and \(2^{26}\), respectively. The singular value decomposition of matrix A is \(A=\varSigma G^\intercal \). The singular values in matrix \(\varSigma \) are chosen uniformly at random in the interval [0, 10] and then are shifted by \(10^{-1}\), which resulted in \(\kappa (A^\intercal A)\approx 10^{4}\). The rotation angle \(\theta \) of matrix G is set to \(2\pi /10\) radians. Moreover, matrices A have \(m= 2 n\) rows and rank n. The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for each generated instance. For the construction of the optimal solutions \(x^*\) we use Procedure OsGen3 with \(\gamma = 100\) and \(s_1 = s_2= s/2\), which resulted in \(\kappa _{0.1}(x^*) \approx 3\) on average.
The results of this experiment are presented in Fig. 9. Notice that all methods scale nearly linearly with respect to the size of the problem.
Increasing density of matrix \(A^\intercal A\)
In this experiment we demonstrate the performance of pdNCG, FISTA, PCDM and PSSgb as the density of matrix \(A^\intercal A\) increases. We generate four instances \((A,x^*)\). For the first experiment we generate matrix \(A = \varSigma G^\intercal \), where \(\varSigma \) is the matrix of singular values and the columns of matrices \(I_m\) and G are the left and right singular vectors, respectively. For the second experiment we generate matrix \(A=\varSigma (G_2G)^\intercal \), where the columns of matrices \(I_m\) and \(G_2G\) are the left and right singular vectors of matrix A, respectively; \(G_2\) has been defined in Sect. 4.2. Finally, for the third and fourth experiments we have \(A=\varSigma (GG_2G)^\intercal \) and \(A=\varSigma (G_2GG_2G)^\intercal \), respectively. For each experiment the singular values of matrix A are chosen uniformly at random in the interval [0, 10] and then shifted by \(10^{-1}\), which resulted in \(\kappa (A^\intercal A)\approx 10^4\). The rotation angle \(\theta \) of matrices G and \(G_2\) is set to \(2\pi /10\) radians. Matrices A have \(m= 2 n\) rows, rank n and \(n=2^{22}\). The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for each experiment. Moreover, Procedure OsGen3 is used with \(\gamma = 100\) and \(s_1 = s_2= s/2\) for the construction of \(x^*\) for each experiment, which resulted in \(\kappa _{0.1}(x^*) \approx 2\) on average.
The results of this experiment are presented in Fig. 10. Observe that all methods performed robustly with respect to the density of matrix A.
Varying parameter \(\tau \)
In this experiment we present the performance of pdNCG, FISTA, PCDM and PSSgb as parameter \(\tau \) varies from \(10^{-4}\) to \(10^4\) in multiplicative steps of \(10^2\). We generate four instances \((A,x^*)\), where matrix \(A=\varSigma G^\intercal \) has \(m= 2 n\) rows, rank n and \(n=2^{22}\). The singular values of matrices A are chosen uniformly at random in the interval [0, 10] and then shifted by \(10^{-1}\), which resulted in \(\kappa (A^\intercal A)\approx 10^4\) for each experiment. The rotation angle \(\theta \) of matrix G in A is set to \(2\pi /10\) radians. The optimal solutions \(x^*\) have \(s = n/2^7\) nonzero components for all instances. Moreover, the optimal solutions are generated using Procedure OsGen3 with \(\gamma = 100\), which resulted in \(\kappa _{0.1}(x^*) \approx 3\) for all four instances.
The performance of the methods is presented in Fig. 11. Notice in Fig. 11d that for pdNCG the objective function \(f_\tau \) does not always decrease monotonically. A possible explanation is that the backtracking line-search of pdNCG, which guarantees a monotonic decrease of the objective function [16], terminates when 50 backtracking iterations are exceeded, regardless of whether its termination criteria are satisfied.
Performance of a second-order method on huge scale problems
We now present the performance of pdNCG on synthetic huge-scale (up to one trillion variables) S-LS problems as the number of variables and the number of processors increase.
We generate six instances \((A,x^*)\), where the number of variables n takes the values \(2^{30}\), \(2^{32}\), \(2^{34}\), \(2^{36}\), \(2^{38}\) and \(2^{40}\). Matrices \(A=\varSigma G^\intercal \) have \(m= 2 n\) rows and rank n. The singular values \(\sigma _i\), \(i=1,2,\ldots ,n\), of matrices A are set to \(10^{-1}\) for odd i and \(10^2\) for even i. The rotation angle \(\theta \) of matrix G is set to \(2\pi /3\) radians. The optimal solutions \(x^*\) have \(s = n/2^{10}\) nonzero components for each experiment. In order to simplify the practical generation of this problem, the optimal solutions \(x^*\) are set to have s/2 components equal to \(-10^4\), with the remaining nonzero components set equal to \(10^{-1}\).
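A serial MATLAB sketch of this construction is given below. It is illustrative only: the huge-scale runs assemble these vectors in a distributed fashion, and the random placement of the support of \(x^*\) is an assumption of the sketch.

```matlab
% Illustrative serial sketch of the huge-scale instance generation.
sigma          = 1e2 * ones(n,1);   % 1e2 on even indices i
sigma(1:2:end) = 1e-1;              % 1e-1 on odd indices i
s     = n / 2^10;                   % number of nonzeros in x*
xstar = zeros(n,1);
idx   = randperm(n, s);             % support of x*; placement assumed here
xstar(idx(1:s/2))     = -1e4;       % half of the nonzero components
xstar(idx(s/2+1:end)) =  1e-1;      % remaining nonzero components
```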
Table 1 Performance of pdNCG for synthetic huge-scale S-LS problems. All problems have been solved to a relative error of order \(10^{-4}\) of the obtained solution
Details of the performance of pdNCG are given in Table 1. Observe the nearly linear scaling of pdNCG with respect to the number of variables n and the number of processors. For all experiments in Table 1, pdNCG required 8 Newton steps to converge and, on average, 100 PCG iterations per Newton step, where every PCG iteration requires two matrix-vector products with matrix A.