Fast and scalable Lasso via stochastic Frank–Wolfe methods with a convergence guarantee
Abstract
Frank–Wolfe (FW) algorithms have often been proposed over the last few years as efficient solvers for a variety of optimization problems arising in the field of machine learning. The ability to work with cheap projection-free iterations and the incremental nature of the method make FW a very effective choice for many large-scale problems where computing a sparse model is desirable. In this paper, we present a high-performance implementation of the FW method tailored to solve large-scale Lasso regression problems, based on a randomized iteration, and prove that the convergence guarantees of the standard FW method are preserved in the stochastic setting. We show experimentally that our algorithm outperforms several existing state-of-the-art methods, including the Coordinate Descent algorithm by Friedman et al. (one of the fastest known Lasso solvers), on several benchmark datasets with a very large number of features, without sacrificing the accuracy of the model. Our results illustrate that the algorithm is able to generate the complete regularization path on problems of size up to four million variables in less than one minute.
Keywords
Frank–Wolfe algorithm, Lasso, Large-scale regression
1 Introduction
Many machine learning and data mining tasks can be formulated, at some stage, in the form of an optimization problem. As constantly growing amounts of high-dimensional data are becoming available in the Big Data era, a fundamental thread in research is the development of high-performance implementations of algorithms tailored to solving these problems in a very large-scale setting. One of the most popular and powerful techniques for high-dimensional data analysis is the Lasso (Tibshirani 1996). In the last decade there has been intense interest in this method, and several papers describe generalizations and variants of the Lasso (Tibshirani 2011). In the context of supervised learning, it was recently proved that the Lasso problem can be reduced to an equivalent SVM formulation, which potentially allows one to leverage a wide range of efficient algorithms devised for the latter problem (Jaggi 2014). For unsupervised learning, the idea of Lasso regression has been used in Lee et al. (2010) for biclustering in biological research.
From an optimization point of view, the Lasso can be formulated as an \(\ell _1\)-regularized least squares problem, and large-scale instances must usually be tackled by means of an efficient first-order algorithm. Several such methods have already been discussed in the literature. Variants of Nesterov’s Accelerated Gradient Descent, for example, guarantee an optimal convergence rate among first-order methods (Nesterov 2013). Stochastic algorithms such as Stochastic Gradient Descent and Stochastic Mirror Descent have also been proposed for the Lasso problem (Langford et al. 2009; Shalev-Shwartz and Tewari 2011). More recently, Coordinate Descent (CD) algorithms (Friedman et al. 2007, 2010), along with their stochastic variants (Shalev-Shwartz and Tewari 2011; Richtárik and Takáč 2014), are gaining popularity due to their efficiency on structured large-scale problems. In particular, the CD implementation of Friedman et al. mentioned above is specifically tailored for Lasso problems, and is currently recognized as one of the best solvers for this class of problems.

We propose a high-performance stochastic implementation of the classical Frank–Wolfe (FW) algorithm to solve the Lasso problem. We show experimentally how the proposed method is able to efficiently scale up to problems with a very large number of features, improving on the performance of other state-of-the-art methods such as the Coordinate Descent algorithm in Friedman et al. (2010).

We include an analysis of the complexity of our algorithm, and prove a novel convergence result that yields an \({\mathcal {O}}(1/k)\) convergence rate analogous (in terms of expected value) to the one holding for the standard FW method.

We highlight how the properties of the FW method make it possible to obtain solutions that are significantly sparser, in terms of the number of features, than those from various competing methods, while retaining the same optimization accuracy.
Structure of the paper
In Sect. 2, we provide an overview of the Lasso problem and its formulations, and review some of the related literature. Then, in Sect. 3, we discuss FW optimization and specialize the algorithm for the Lasso problem. The randomized algorithm used in our implementation is discussed in Sect. 4, and its convergence properties are analyzed. In Sect. 5 we show several experiments on benchmark datasets and discuss the obtained results. Finally, Sect. 6 closes the paper with some concluding remarks.
2 The Lasso problem
2.1 Formulation
2.2 Relevance and applications
The Lasso is part of a powerful family of regularized linear regression methods for high-dimensional data analysis, which also includes ridge regression (RR) (Hoerl and Kennard 1970; Hastie et al. 2009), the Elastic-Net (Zou and Hastie 2005), and several recent extensions thereof (Zou 2006; Zou and Zhang 2009; Tibshirani et al. 2005). From a statistical point of view, they can be viewed as methods for trading off the bias and variance of the coefficient estimates in order to find a model with better predictive performance. From a machine learning perspective, they allow one to adaptively control the capacity of the model space in order to prevent overfitting. In contrast to RR, which is obtained by substituting the \(\ell _1\) norm in (2) by the squared \(\ell _2\) norm \(\sum _i \alpha _i^2\), it is well known that the Lasso not only reduces the variance of coefficient estimates but is also able to perform variable selection by shrinking many of these coefficients to zero. Elastic-net regularization trades off \(\ell _1\) and \(\ell _2\) norms using a “mixed” penalty \({\varOmega }(\alpha ) = \gamma \Vert \alpha \Vert _{1} + (1 - \gamma ) \Vert \alpha \Vert _{2}\) which requires tuning the additional parameter \(\gamma \) (Zou and Hastie 2005). \(\ell _p\) norms with \(p \in [0,1)\) can enforce a more aggressive variable selection, but lead to computationally challenging nonconvex optimization problems. For instance, \(p=0\), which corresponds to “direct” variable selection, leads to an NP-hard problem (Weston et al. 2003).
Thanks to its ability to perform variable selection and model estimation simultaneously, the Lasso (and \(\ell _1\)-regularization in general) is widely used in applications involving a huge number of candidate features or predictors. Examples of these applications include biomolecular data analysis, text regression, functional magnetic resonance imaging (fMRI) and sensor networks. In all these cases, the number of dimensions or attributes p can far exceed the number of data instances m.
2.3 Related work
Problem (1) is a quadratic programming problem with a convex constraint, which in principle may be solved using standard techniques such as interior-point methods, guaranteeing convergence in a few iterations. However, the computational work required per iteration as well as the memory demanded by these approaches make them practical only for small and medium-sized problems. A faster specialized interior-point method for the Lasso was proposed in Kim et al. (2007), which however compares unfavorably with the baseline used in this paper (Friedman et al. 2010).
One of the first efficient algorithms proposed in the literature for finding a solution of (2) is the Least Angle Regression (LARS) by Efron et al. (2004). As its main advantage, LARS allows one to generate the entire Lasso regularization path with the same computational cost as standard least-squares via QR decomposition, i.e. \({\mathcal {O}}(mp^2)\), assuming \(m < p\) (Hastie et al. 2009). At each step, it identifies the variable most correlated with the residuals of the model obtained so far and includes it in the set of “active” variables, moving the current iterate in a direction equiangular with the rest of the active predictors. It turns out that the algorithm we propose makes the same choice but updates the solution differently using cheaper computations. A similar homotopy algorithm to calculate the regularization path has been proposed in Turlach (2005), which differs slightly from LARS in the choice of the search direction.
More recently, it has been shown by Friedman et al. that a careful implementation of the Coordinate Descent method (CD) provides an extremely efficient solver (Friedman et al. 2007, 2010; Friedman 2012), which also applies to more general models such as the Elastic-Net proposed by Zou and Hastie (2005). In contrast to LARS, this method cyclically chooses one variable at a time and performs a simple analytical update. The full regularization path is built by defining a sensible range of values for the regularization parameter and taking the solution for a given value as a warm-start for the next. This algorithm has been implemented in the Glmnet package and can be considered the current standard for solving this class of problems. Recent works have also advocated the use of Stochastic Coordinate Descent (SCD) (Shalev-Shwartz and Tewari 2011), where the order of variable updates is chosen randomly instead of cyclically. This strategy can prevent the adverse effects caused by possibly unfavorable orderings of the coordinates, and makes it possible to prove stronger theoretical guarantees compared to the plain CD (Richtárik and Takáč 2014).
Other methods for \(\ell _1\)-regularized regression may be considered. For instance, Zhou et al. recently proposed a geometrical approach where the Lasso is reformulated as a nearest point problem and solved using an algorithm inspired by the classical Wolfe method (Zhou et al. 2015). However, the popularity and proven efficiency of Glmnet on high-dimensional problems make it the chosen baseline in this work.
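The cyclic update and the warm-started path described above can be sketched as follows. This is a minimal illustration, not the Glmnet implementation: it assumes the penalized objective \(\frac{1}{2}\Vert X\alpha - y\Vert _2^2 + \lambda \Vert \alpha \Vert _1\), uses the standard soft-thresholding update, and the function names `soft_threshold`, `cd_lasso` and `cd_path` are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink the scalar x towards zero by t."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def cd_lasso(X, y, lam, alpha=None, n_sweeps=100, tol=1e-10):
    """Cyclic coordinate descent for 0.5*||X a - y||^2 + lam*||a||_1 (assumed form)."""
    p = X.shape[1]
    alpha = np.zeros(p) if alpha is None else alpha.copy()
    col_sq = (X ** 2).sum(axis=0)          # squared norms of the predictors
    resid = y - X @ alpha
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):                 # one variable at a time, in cyclic order
            if col_sq[j] == 0.0:
                continue
            # correlation of predictor j with the residual that excludes its own term
            rho = X[:, j] @ resid + col_sq[j] * alpha[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            change = new - alpha[j]
            if change != 0.0:
                resid -= change * X[:, j]  # keep the residual up to date
                alpha[j] = new
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return alpha

def cd_path(X, y, lambdas):
    """Solve for each lambda in turn, warm-starting from the previous solution."""
    alpha = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        alpha = cd_lasso(X, y, lam, alpha)
        path.append(alpha.copy())
    return path
```

With an orthonormal design the update reduces to soft-thresholding the response, which gives a quick way to check the sketch on a toy problem.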
3 Frank–Wolfe optimization
One of the earliest constrained optimization approaches, the Frank–Wolfe algorithm (Frank and Wolfe 1956) has recently seen a sudden resurgence in interest from the optimization community (Clarkson 2010; Jaggi 2013), and several authors have pointed out how FW methods can be used as a principled and efficient alternative to solve several large-scale problems arising in machine learning, statistics, bioinformatics and related fields (Ñanculef et al. 2014; Frandi et al. 2015; Harchaoui et al. 2014; Signoretto et al. 2014). As argued in Sect. 3.2, the FW algorithm enjoys several properties that make it very attractive for this type of problem. Overall, though, the number of works showing experimental results for FW on practical applications is limited compared to that of the theoretical studies appearing in the literature. In the context of problems with \(\ell _1\)-regularization or sparsity constraints, the use of FW has been discussed in Shalev-Shwartz et al. (2010), but no experiments are provided. A closely related algorithm has been proposed in Zhou et al. (2015), but its implementation has a high computational cost in terms of time and memory requirements, and is not suitable for solving large problems on a standard desktop or laptop machine. As such, the current literature does not provide many examples of efficient FW-based software for large-scale Lasso or \(\ell _1\)-regularized optimization. We aim to fill this gap by showing how a properly implemented stochastic FW method can improve on the performance of the current state-of-the-art solvers on Lasso problems with a very large number of features.
3.1 The standard Frank–Wolfe algorithm
Another distinctive feature of the algorithm is the fact that the solution at a given iteration K can be expressed as a convex combination of the vertices \(u^{(k)}\), \(k = 0,\ldots ,K-1\). Due to the incremental nature of the FW iteration, at most one new extreme point of \({\varSigma }\) is discovered at each iteration, implying that at most k of such points are active at iteration k. Furthermore, this sparsity bound holds for the entire run of the algorithm, effectively allowing one to control the sparsity of the solution as it is being computed. This fact carries a particular relevance in the context of sparse approximation, and generally in all applications where it is crucial to find models with a small number of features. It also represents, as we will show in our experiments in Sect. 5, one of the major differences between incremental, forward approximation schemes and more general solvers for \(\ell _1\)-regularized optimization, which in general cannot guarantee to find sparse solutions along the regularization path.
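The sparsity bound can be observed on a small deterministic sketch. It assumes that problem (1) has the constrained form \(\min \frac{1}{2}\Vert X\alpha - y\Vert _2^2\) subject to \(\Vert \alpha \Vert _1 \le \delta \), whose vertices are \(\pm \delta e_i\), and it uses an exact line search; the function name `fw_lasso` and the starting point \(\alpha = 0\) are illustrative choices, not taken from the paper.

```python
import numpy as np

def fw_lasso(X, y, delta, n_iters):
    """Frank-Wolfe for min 0.5*||X a - y||^2 s.t. ||a||_1 <= delta (assumed form)."""
    p = X.shape[1]
    alpha = np.zeros(p)
    resid = y.astype(float)                  # residual y - X @ alpha for alpha = 0
    for _ in range(n_iters):
        grad = -(X.T @ resid)                # gradient of the least-squares term
        i = int(np.argmax(np.abs(grad)))     # predictor most correlated with the residual
        coef = -delta * np.sign(grad[i])     # FW vertex u = coef * e_i
        Xd = coef * X[:, i] - (y - resid)    # X @ (u - alpha)
        gap = grad @ alpha - coef * grad[i]  # -grad^T (u - alpha), nonnegative
        denom = Xd @ Xd
        if gap <= 0.0 or denom == 0.0:
            break                            # no vertex gives a descent direction
        step = min(1.0, gap / denom)         # exact line search restricted to [0, 1]
        alpha *= 1.0 - step                  # convex combination of old iterate ...
        alpha[i] += step * coef              # ... and the new vertex
        resid -= step * Xd
    return alpha
```

Each iteration touches a single coordinate besides rescaling the existing ones, so after k iterations the iterate has at most k nonzero entries and stays inside the \(\ell _1\) ball.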
3.2 Theoretical properties
We summarize here some well-known theoretical results for the FW algorithm which are instrumental in understanding the behaviour of the method. We refer the reader to Jaggi (2013) for the proof of the following proposition. To prove the result, it is sufficient to assume that f has bounded curvature, which, as explained in Jaggi (2013), is roughly equivalent to the Lipschitz continuity of \(\nabla f\).
Proposition 1
An immediate consequence of Proposition 1 is an upper bound on the iteration complexity: given a tolerance \(\varepsilon > 0\), the FW algorithm finds an \(\varepsilon \)-approximate solution, i.e. an iterate \(\alpha ^{(k)}\) such that \(f(\alpha ^{(k)}) - f(\alpha ^{*}) \le \varepsilon \), after \({\mathcal {O}}(1/\varepsilon )\) iterations. Besides giving an estimate on the total number of iterations which has been shown experimentally to be quite tight in practice (Frandi et al. 2014, 2015), this fact tells us that the trade-off between sparsity and accuracy can be partly controlled by appropriately setting the tolerance parameter. Recently, Garber and Hazan showed that under certain conditions the FW algorithm can obtain a convergence rate of \({\mathcal {O}}(1/k^2)\), comparable to that of first-order algorithms such as Nesterov’s method (Garber and Hazan 2015). However, their results require strong convexity of the objective function and of the feasible set, a set of hypotheses which is not satisfied for several important ML problems such as the Lasso or the Matrix Recovery problem with trace norm regularization.
Another possibility is to employ a Fully-Corrective variant of the algorithm, where at each step the solution is updated by solving the problem restricted to the currently active vertices. The algorithm described in Zhou et al. (2015), where the authors solve the Lasso problem via a nearest point solver based on Wolfe’s method, operates with a very similar philosophy. A similar case can be made for the LARS algorithm of Efron et al. (2004), which however updates the solution in a different way. The Fully-Corrective FW also bears a resemblance to the Orthogonal Matching Pursuit algorithms used in the Signal Processing literature (Tropp 2004), a similarity which has already been discussed in Clarkson (2010) and Jaggi (2013). However, as mentioned in Clarkson (2010), the increase in computational cost is not paid off by a corresponding improvement in terms of complexity bounds. In fact, the work in Lan (2014) shows that the result in Proposition 1 cannot be improved for any first-order method based on solving linear subproblems without strengthening the assumptions. Greedy approximation techniques based on both the vanilla and the Fully-Corrective FW have also been proposed in the context of approximate risk minimization with an \(\ell _0\) constraint by Shalev-Shwartz et al., who proved several strong runtime bounds for the sparse approximations of arbitrary target solutions (Shalev-Shwartz et al. 2010).
Finally, it is worth mentioning that the result of Proposition 1 can indeed be improved by using variants of FW that employ additional search directions, and allow under suitable hypotheses to obtain a linear convergence rate (Ñanculef et al. 2014; Lacoste-Julien and Jaggi 2014). It should be mentioned, however, that such rates only hold in the vicinity of the solution and that, as shown in Frandi et al. (2015), a large number of iterations might be required to gain substantial advantages. For this reason, we choose not to pursue this strategy in the present paper.
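As a worked illustration of how the iteration bound follows, suppose the guarantee of Proposition 1 has the usual form \(f(\alpha ^{(k)}) - f(\alpha ^{*}) \le C/(k+2)\), where the constant \(C\) depends on the curvature of f. The statement is not reproduced here, so this form and the symbol \(C\) are assumptions. Then

\[
\frac{C}{k+2} \le \varepsilon \quad \Longleftrightarrow \quad k \ge \frac{C}{\varepsilon } - 2 ,
\]

so \(\lceil C/\varepsilon \rceil \) iterations suffice. Since the iterate at step k has at most k active vertices, the same count bounds the number of nonzero coefficients of the \(\varepsilon \)-approximate solution, which is the sparsity and accuracy trade-off mentioned above.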
4 Randomized Frank–Wolfe for Lasso problems
4.1 Randomized Frank–Wolfe iterations
Although the FW method is generally very efficient for structured problems with a sparse solution, it also has a number of practical drawbacks. For example, it is well known that the total number of iterations required by a FW algorithm can be large, thus making the optimization prohibitive on very large problems. Even when (4) has an analytical solution due to the problem structure, the resulting complexity depends on the problem size (Ñanculef et al. 2014), and can thus be impractical in cases where handling large-scale datasets is required. For example, in the specialization of the algorithm to problem (1), the main bottleneck is the computation of the FW vertex \(i_{*}^{(k)}\) in (6) which corresponds to examining all the p candidate predictors and choosing the one most correlated with the current residuals (assuming the design matrix has been standardized so that the predictors have unit norm). Coincidentally, this is the same strategy underlying well-known methods for variable selection such as LARS and Forward Stepwise Regression (see Sect. 1).
Note that, although in this work we only discuss the basic Lasso problem, extending the proposed implementation to the more general Elastic-Net model of Zou and Hastie (2005) is straightforward. The derivation of the necessary analytical formulae is analogous to the one shown above. Furthermore, an extension of the algorithm to solve \(\ell _1\)-regularized logistic regression problems, another relevant tool in high-dimensional data analysis, can be easily obtained following the guidelines in Friedman et al. (2010).
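A randomized iteration of this kind can be sketched as below. This is a minimal illustration rather than the paper's Algorithm 2: it assumes the constrained form \(\min \frac{1}{2}\Vert X\alpha - y\Vert _2^2\) subject to \(\Vert \alpha \Vert _1 \le \delta \), uniform sampling without replacement, and an exact line search; the names `stochastic_fw_lasso`, `kappa` and `seed` are hypothetical.

```python
import numpy as np

def stochastic_fw_lasso(X, y, delta, kappa, n_iters, seed=0):
    """Randomized Frank-Wolfe sketch: search the FW vertex over kappa sampled predictors."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    alpha = np.zeros(p)
    resid = y.astype(float)                        # residual y - X @ alpha
    for _ in range(n_iters):
        S = rng.choice(p, size=kappa, replace=False)   # random subset of predictors
        grad_S = -(X[:, S].T @ resid)              # gradient coordinates on S only
        j = int(np.argmax(np.abs(grad_S)))
        i = S[j]                                   # sampled predictor most correlated with residual
        coef = -delta * np.sign(grad_S[j])         # candidate vertex u = coef * e_i
        fitted = y - resid                         # X @ alpha
        Xd = coef * X[:, i] - fitted               # X @ (u - alpha)
        # -grad^T (u - alpha), using grad^T alpha = -resid^T (X alpha)
        gap = -(resid @ fitted) - coef * grad_S[j]
        denom = Xd @ Xd
        if gap > 0.0 and denom > 0.0:              # skip samples with no descent vertex
            step = min(1.0, gap / denom)           # exact line search on [0, 1]
            alpha *= 1.0 - step
            alpha[i] += step * coef
            resid -= step * Xd
    return alpha
```

Only `kappa` dot products with the residual are needed per iteration instead of `p`, which is where the saving over the deterministic search comes from; setting `kappa = p` recovers the full search.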
4.2 Complexity and implementation details
In Algorithm 2, we compute the coordinates of the gradient using the method of residuals given by Eq. (7). Due to the randomization, this method becomes very advantageous with respect to the use of the alternative method based on the active covariates, even for very large p. Indeed, if we denote by s the cost of performing a dot product between a predictor \(z_i\) and another vector in \({\mathbb {R}}^m\), the overall cost of picking out the FW vertex in step 1 of our algorithm is \({\mathcal {O}}(s|{\mathcal {S}}|)\). Using the method of the active covariates would instead give an overall cost of \({\mathcal {O}}(s|{\mathcal {S}}| \Vert {\alpha }^{(k)}\Vert _0)\), which is always worse. Note however that this method may be better than the method of the residuals in a deterministic implementation by using caching tricks as proposed in Friedman et al. (2007, 2010). For instance, caching the dot products between all the predictors and the active ones and keeping all the coordinates of the gradient updated would cost \({\mathcal {O}}(p)\) except when new active variables appear in the solution, in which case the cost becomes \({\mathcal {O}}(ps)\). However, this would make it possible to find the FW vertex in \({\mathcal {O}}(p)\) operations. In this scenario, the fixed \({\mathcal {O}}(sp)\) cost of the method of residuals may be worse if the Lasso solution is very sparse. It is worth noting that the dot product cost s is proportional to the number of nonzero components in the predictors, which in typical high-dimensional problems is significantly lower than m.
In the current implementation, the term \(\sigma _i {:}{=}z_i^Ty\) is precomputed for every \(i=1,2,\ldots ,p\) before starting the iterations of the algorithm. This allows us to write (7) as \(z_i^TR^{(k)} = \sigma _i + z_i^TX{\alpha }^{(k)}\). Equation (10) for updating residuals can therefore be replaced by an equation to update \(p^{(k)}=X{\alpha }^{(k)}\), eliminating the dependency on m.
4.3 Relation to SVM algorithms and sparsity certificates
The previous implementation suggests that the FW algorithm will be particularly suited to the case \(p \gg m\) where a regression problem has a very large number of features but not so many training points. It is interesting to compare this scenario to the situation in SVM problems. In the SVM case, the FW vertices correspond to training points, and the standard FW algorithm is able to quickly discover the relevant “atoms” (the support vectors), but has no particular advantage when handling lots of features. In contrast, in Lasso applications, where we are using the \(z_i\)’s as training points, the situation is somewhat inverted: the algorithm should discover the critical features in at most \({\mathcal {O}}(1/\epsilon )\) iterations and guarantee that at most \({\mathcal {O}}(1/\epsilon )\) attributes will be used to perform predictions. This is, indeed, the scenario in which Lasso is used for several applications of practical interest, as problems with a very large number of features arise frequently in specific domains like bioinformatics, web and text analysis and sensor networks.
In the context of SVMs, the randomized FW algorithm has been already discussed in Frandi et al. (2014). However, the results in the mentioned paper were experimental in nature, and did not contain a proof of convergence, which is instead provided in this work. Note that, although we have presented the randomized search for the specific case of problem (1), the technique applies more generally to the case where \({\varSigma }\) is a polytope (or has a separable structure with every block being a polytope, as in Lacoste-Julien et al. (2013)). We do not feel this hypothesis to be restrictive, as basically every practical application of the FW algorithm proposed in the literature falls indeed into this setting.
4.4 Convergence analysis
We show that the stochastic FW converges (in the sense of expected value) with the same rate as the standard algorithm. First, we need the following technical result.
Lemma 1
Proof
Note that selecting a random subset \({\mathcal {S}}\) of size \(\kappa \) to solve (9) is equivalent to (i) building a random matrix \(A_{\mathcal {S}}\) as in Lemma 1, (ii) computing the restricted gradient \(\tilde{\nabla } f = \frac{p}{\kappa } A_{\mathcal {S}} \nabla f({\alpha }^{(k)})\) and then (iii) solving the linear subproblem (6) substituting \(\nabla f({\alpha }^{(k)})\) by \(\tilde{\nabla } f\). In other words, the proposed randomization can be viewed as approximating \(\nabla f({\alpha }^{(k)})\) by \(\tilde{\nabla } f\). Lemma 1 implies that \({\mathbb {E}}[\tilde{\nabla } f] = \nabla f({\alpha }^{(k)})\), which is the key to proving our main result.
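The unbiasedness of the restricted gradient can be checked exactly on a toy vector by enumerating every subset of size \(\kappa \). The sketch assumes that \(A_{\mathcal {S}}\) is the diagonal 0/1 matrix that keeps the coordinates in \({\mathcal {S}}\) and that subsets are drawn uniformly; both are assumptions, since Lemma 1 is not reproduced here, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def masked_gradient(grad, subset, kappa):
    """Return (p / kappa) * A_S * grad, where A_S zeroes the entries outside S."""
    p = len(grad)
    scale = Fraction(p, kappa)
    return [scale * g if i in subset else Fraction(0) for i, g in enumerate(grad)]

def expected_masked_gradient(grad, kappa):
    """Exact expectation over all equally likely subsets of size kappa."""
    p = len(grad)
    subsets = list(combinations(range(p), kappa))
    total = [Fraction(0)] * p
    for s in subsets:
        estimate = masked_gradient(grad, set(s), kappa)
        total = [t + e for t, e in zip(total, estimate)]
    return [t / len(subsets) for t in total]

grad = [Fraction(3), Fraction(-1), Fraction(4), Fraction(-5), Fraction(2)]
print(expected_masked_gradient(grad, 2) == grad)
```

Each coordinate is kept in a fraction \(\kappa /p\) of the subsets, so the factor \(p/\kappa \) restores the original value on average.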
Proposition 2
This result has a similar flavor to that in Lacoste-Julien et al. (2013), and the analysis is similar to the one presented in Wang and Qian (2014). However, in contrast to the above works, we do not assume any structure in the optimization problem or in the sampling. A detailed proof can be found in the “Appendix”. As in the deterministic case, Proposition 2 implies a complexity bound of \({\mathcal {O}}(1/\varepsilon )\) iterations to reach an approximate solution \(\alpha ^{(k)}\) such that \({\mathbb {E}}_{{\mathcal {S}}^{(k)}}[f(\alpha ^{(k)})] - f(\alpha ^{*}) \le \varepsilon \).
4.5 Choosing the sampling size
When using a randomized FW iteration it is important to choose the sampling size in a sensible way. Indeed, some recent works showed how this choice entails a trade-off between accuracy (in the sense of premature stopping) and complexity, and hence CPU time (Frandi et al. 2014). This kind of approximation is motivated by the following result, which suggests that it is reasonable to pick \(|{\mathcal {S}}| \ll p\).
Theorem 1
[(Schölkopf and Smola 2001), Theorem 6.33] Let \({\mathcal {D}} \subset {\mathbb {R}}\) s.t. \(|{\mathcal {D}}| = p\) and let \({\mathcal {D}}^{\prime } \subset {\mathcal {D}}\) be a random subset of size \(\kappa \). Then, the probability that the largest element in \({\mathcal {D}}^{\prime }\) is greater than or equal to \(\tilde{p}\) elements of \({\mathcal {D}}\) is at least \(1-(\frac{\tilde{p}}{p})^{\kappa }\).
The value of this result lies in the ability to obtain probabilistic bounds for the quality of the sampling independently of the problem size p. For example, in the case of the Lasso problem, where \({\mathcal {D}} = \{|\nabla f(\alpha ^{(k)})_1|,\ldots ,|\nabla f(\alpha ^{(k)})_p|\}\) and \({\mathcal {D}}^{\prime } = \{ |\nabla f(\alpha ^{(k)})_i| \text{ s.t. } i \in {\mathcal {S}} \}\), it is easy to see that it suffices to take \(|{\mathcal {S}}| \approx 194\) to guarantee that, with probability at least 0.98, \(|\nabla f(\alpha ^{(k)})_{i^{(k)}_{\mathcal {S}}}|\) lies among the \(2\%\) largest gradient components (in absolute value), independently of p. This kind of sampling has been discussed for example in Frandi et al. (2015).
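The figure of 194 can be reproduced directly from the bound of Theorem 1 by solving \(1-(\tilde{p}/p)^{\kappa } \ge 0.98\) with \(\tilde{p}/p = 0.98\). The short sketch below does this; the function name `sample_size` and its argument names are illustrative.

```python
import math

def sample_size(top_fraction, confidence):
    """Smallest kappa such that 1 - (1 - top_fraction)**kappa >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

# Top 2% of the gradient components with probability at least 0.98.
print(sample_size(0.02, 0.98))  # 194
```

The result does not involve p at all, which is the point made above: the sampling size needed for a given quality level stays fixed as the number of features grows.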
The issue with this strategy is that, for problems with very sparse solutions (which is the case for strong levels of regularization), even a large confidence interval does not guarantee that the algorithm can sample a good vertex in most of the iterations. Intuitively, the sampling strategy should allow the algorithm to detect the set of vertices active at the optimum, which correspond, at various stages of the optimization process, to descent directions for the objective function. In sparse approximation problems, extracting a sampling set without relevant features may imply adding “spurious” components to the solution, reducing the sparseness of the model we want to find.
A more involved strategy, which exploits the incremental structure of the FW algorithm, would be using a large \(\kappa \) at early iterations and smaller values of \(\kappa \) as the solution gets more dense. The idea here is that if the optimal solution is very sparse the algorithm requires few expensive iterations to converge, while in contrast, when the solution is dense, it will require more, but faster, iterations (e.g. for a confidence \(1-\rho = 0.98\) and \(s/p=0.02\) the already mentioned \(\kappa =194\) suffices).
5 Numerical experiments
List of the benchmark datasets used in the experiments

| Dataset | m | t | p | Type |
|---|---|---|---|---|
| Synthetic-10000 | 200 | 200 | 10,000 | Synthetic |
| Synthetic-50000 | 200 | 200 | 50,000 | Synthetic |
| Pyrim | 74 | \({-}{-}\) | 201,376 | Regression |
| Triazines | 186 | \({-}{-}\) | 635,376 | Regression |
| E2006-tfidf | 16,087 | 3,308 | 150,360 | Regression |
| E2006-log1p | 16,087 | 3,308 | 4,272,227 | Regression |
| Dorothea | 800 | 150 | 100,000 | Classification |
| URL-reputation | 2,172,130 | 220,000 | 3,231,961 | Classification |
| KDD2010-algebra | 8,407,752 | 510,302 | 20,216,830 | Classification |

The well-known CD algorithm by Friedman et al., as implemented in the Glmnet package (Friedman et al. 2010). This method is highly optimized for the Lasso and is widely considered one of the most popular and efficient solvers in this context.

The SCD algorithm as described in Shalev-Shwartz and Tewari (2011), which is significant both for being a stochastic method and for having better theoretical guarantees than the standard cyclic CD.

The Accelerated Gradient Descent with projections for both the regularized and the constrained Lasso, as this algorithm guarantees an optimal complexity bound. We choose as a reference the implementation in the SLEP package by Liu et al. (2009).
Methods proposed for scaling the Lasso and their complexities

| Approach | Form | Number of iterations | Complexity per iteration | Sparse Its. |
|---|---|---|---|---|
| Accelerated Gradient + Proj. (Liu and Ye 2009) | (1) | \({\mathcal {O}}(1/\sqrt{\epsilon })\) | \({\mathcal {O}}(mp+p)\ \dagger _1\) | No |
| Accelerated Gradient + Reg. Proj. (Liu and Ye 2010) | (2) | \({\mathcal {O}}(1/\sqrt{\epsilon })\) | \({\mathcal {O}}(mp+p)\ \dagger _1\) | No |
| Cyclic Coordinate Descent (CD) (Friedman et al. 2007, 2010) | (2) | Unknown | \({\mathcal {O}}(mp)\ \dagger _2\) | Yes |
| Stochastic Gradient Descent (SGD) (Langford et al. 2009) | (2) | \({\mathcal {O}}(1/\epsilon ^2)\) | \({\mathcal {O}}(p)\) | No |
| Stochastic Mirror Descent (Shalev-Shwartz and Tewari 2011) | (2) | \({\mathcal {O}}(\log (p)/\epsilon ^2)\) | \({\mathcal {O}}(p)\) | No |
| GeoLasso (Zhou et al. 2015) | (1) | \({\mathcal {O}}(1/\epsilon )\) | \({\mathcal {O}}(mp + a^2)\) | Yes |
| Frank–Wolfe (FW) (Jaggi 2013) | (1) | \({\mathcal {O}}(1/\epsilon )\) | \({\mathcal {O}}(mp)\) | Yes |
| Stochastic Coord. Descent (SCD) (Richtárik and Takáč 2014) | (2) | \({\mathcal {O}}(p/\epsilon )\) | \({\mathcal {O}}(m)\ \dagger _3\) | Yes |
| Stochastic Frank–Wolfe | (1) | \({\mathcal {O}}(1/\epsilon )\) | \({\mathcal {O}}(m\vert {\mathcal {S}}\vert )\) | Yes |
Since an appropriate level of regularization needs to be automatically selected in practice, the algorithms are compared by computing the entire regularization path on a range of values of the regularization parameters \(\lambda \) and \(\delta \) (depending on whether the method solves the penalized or the constrained formulation). Specifically, we first estimate two intervals \([\lambda _{\min },\lambda _{\max }]\) and \([\delta _{\min },\delta _{\max }]\), and then solve problems (2) and (1) on a 100-point parameter grid in logarithmic scale. For the penalized Lasso, we use \(\lambda _{\min } = \lambda _{\max }/100\), where \(\lambda _{\max }\) is determined as in the Glmnet code. Then, to make the comparison fair (i.e. to ensure that all the methods solve the same problems according to the equivalence in Sect. 2), we choose for the constrained Lasso \(\delta _{\max } = \Vert \alpha _{\min }\Vert _1\) and \( \delta _{\min } = \delta _{\max }/100\), where \(\alpha _{\min }\) is the solution obtained by Glmnet with the smallest regularization parameter \(\lambda _{\min }\) and a high precision (\(\varepsilon = 10^{-8}\)). The idea is to give the solvers the same “sparsity budget” to find a solution of the regression problem.
Warm-start strategy
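The 100-point logarithmic grid can be generated as in the sketch below, which applies equally to \([\lambda _{\min },\lambda _{\max }]\) and \([\delta _{\min },\delta _{\max }]\). The function name `log_grid` and the decreasing ordering of the values are assumptions made for illustration; how \(\lambda _{\max }\) itself is obtained is left to the Glmnet code and is not reproduced here.

```python
def log_grid(v_max, ratio=100.0, n_points=100):
    """n_points values from v_max down to v_max / ratio, evenly spaced in log scale."""
    return [v_max * ratio ** (-i / (n_points - 1)) for i in range(n_points)]

grid = log_grid(5.0)
print(len(grid), grid[0], grid[-1])
```

Consecutive grid values differ by a constant factor, so neighbouring problems along the path are similarly close to each other at both ends of the interval.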
5.1 “Sanity Check” on the synthetic datasets
The aim of these experiments is not to measure the performance of the algorithms (which will be assessed below on seven real-life datasets of large and very large size), but rather to compare their ability to capture the evolution of the most relevant features of the models, and discuss how this relates to their behaviour from an optimization point of view. To do this, we monitor the values of the 10 most relevant features along the path, as computed by both the CD and FW algorithms, and plot the results in Figs. 1 and 2. To determine the relevant variables, we use as a reference the regularization path obtained by Glmnet with \(\varepsilon = 10^{-8}\) (which is assumed to be a reasonable approximation of the exact solution), and compute the 10 variables having, on average, the highest absolute value along the path. As this experiment is intended mainly as a sanity check to verify that our solver reconstructs the solution correctly, we do not include other algorithms in the comparison. In order to implement the random sampling strategy in the FW algorithm, we first calculated the average number of active features along the path, rounded up to the nearest integer, as an empirical estimate of the sparsity level. Then, we chose \(|{\mathcal {S}}|\) based on the probability \(\rho \) of capturing at least one of the relevant features at each sampling, according to (13). A confidence level of \(99\%\) was used for this experiment, leading to sampling sizes of 372 and 324 points for the two problems of size 10000, and of 1616 and 1572 points for those of size 50000.
5.2 Results on large-scale datasets
In this section, we report the results on the problems Pyrim, Triazines, E2006-tfidf, E2006-log1p, Dorothea, URL-reputation and KDD2010-algebra. These datasets represent actual large-scale problems of practical interest. The Pyrim and Triazines datasets stem from two quantitative structure-activity relationship (QSAR) problems, where the biological responses of a set of molecules are measured and statistically related to the molecular structure on their surface. We expanded the original feature set by means of product features, i.e. modeling the response variable y as a linear combination of polynomial basis functions, as suggested in Huang et al. (2010). For this experiment, we used product features of order 5 and 4 respectively, which leads to large-scale regression problems with \(p=201,376\) and \(p=635,376\). Problems E2006-tfidf and E2006-log1p stem instead from the real-life NLP task of predicting the risk of investment (i.e. the stock return volatility) in a company based on available financial reports (Kogan et al. 2009). Finally, the three classification problems correspond to tasks related to molecular bioactivity prediction (Dorothea), malicious URL detection (URL-reputation), and educational data mining (KDD2010-algebra). For benchmarking purposes, these tasks were cast as Lasso problems by treating the binary responses as real continuous variables.
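The reported problem sizes are consistent with counting all monomials of degree at most d in n variables, of which there are \(\binom{n+d}{d}\). The sketch below assumes 27 raw features for Pyrim and 60 for Triazines and includes the constant term; these figures are not stated in the text and are assumptions that happen to reproduce the reported values of p. The function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def n_product_features(n_features, order):
    """Number of monomials of degree <= order in n_features variables."""
    return comb(n_features + order, order)

def product_features(x, order):
    """All monomials of degree 0..order in the entries of x (constant term included)."""
    feats = []
    for degree in range(order + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))
    return feats

print(n_product_features(27, 5))  # 201376
print(n_product_features(60, 4))  # 635376
```

For example, `product_features([2.0, 3.0], 2)` lists the six monomials 1, x1, x2, x1^2, x1*x2 and x2^2 evaluated at (2, 3). The explicit expansion is only practical for small inputs; at the sizes above the count is what matters.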
Sizes of the sampling set \({\mathcal {S}}\) for the large-scale datasets

| Dataset | 1% of p | 2% of p | 3% of p |
|---|---|---|---|
| Pyrim | 2,014 | 4,028 | 6,042 |
| Triazines | 6,354 | 12,708 | 19,062 |
| E2006tfidf | 1,504 | 3,008 | 4,511 |
| E2006log1p | 42,723 | 85,445 | 128,167 |
| Dorothea | 1,000 | 2,000 | 3,000 |
| URLreputation | 32,320 | 64,640 | 96,959 |
| KDD2010algebra | 202,169 | 404,337 | 606,505 |
Results for the baseline solvers on the large-scale regression problems Pyrim, Triazines, E2006tfidf and E2006log1p

| | CD | SCD | SLEP Reg. | SLEP Const. |
|---|---|---|---|---|
| Pyrim | | | | |
| Time (s) | 6.22e+00 | 1.59e+01 | 5.43e+00 | 6.86e+00 |
| Iterations | 2.54e+02 | 1.44e+02 | 1.00e+02 | 1.12e+02 |
| Dot products | 2.08e+07 | 2.90e+07 | 8.56e+07 | 1.29e+08 |
| Active features | 68.4 | 116.6 | 3,349 | 13,030 |
| Triazines | | | | |
| Time (s) | 2.75e+01 | 8.42e+01 | 4.27e+01 | 5.93e+01 |
| Iterations | 2.62e+02 | 1.59e+02 | 1.01e+02 | 1.11e+02 |
| Dot products | 6.80e+07 | 1.01e+08 | 2.87e+08 | 4.67e+08 |
| Active features | 150.0 | 330.8 | 29,104 | 130,303 |
| E2006tfidf | | | | |
| Time (s) | 9.10e+00 | 2.19e+01 | 1.24e+01 | 2.27e+01 |
| Iterations | 3.48e+02 | 2.01e+02 | 1.06e+02 | 2.50e+02 |
| Dot products | 2.04e+07 | 3.03e+07 | 5.97e+07 | 1.37e+08 |
| Active features | 149.5 | 275.3 | 444.8 | 724.3 |
| E2006log1p | | | | |
| Time (s) | 1.60e+02 | 4.92e+02 | 1.00e+02 | 1.42e+02 |
| Iterations | 3.55e+02 | 1.99e+02 | 1.11e+02 | 1.18e+02 |
| Dot products | 5.73e+08 | 8.50e+08 | 1.78e+09 | 2.85e+09 |
| Active features | 281.3 | 1,158.2 | 12,806 | 54,704 |
Results for the baseline solvers on the large-scale classification problems Dorothea, URLreputation and KDD2010algebra

| | CD | SCD | SLEP Reg. | SLEP Const. |
|---|---|---|---|---|
| Dorothea | | | | |
| Time (s) | 3.34e+00 | 1.17e+01 | 4.10e+00 | 6.19e+00 |
| Iterations | 4.45e+02 | 2.99e+02 | 2.30e+02 | 3.01e+02 |
| Dot products | 1.22e+07 | 2.99e+07 | 6.92e+07 | 1.02e+08 |
| Active features | 134.8 | 153.3 | 211.1 | 731.5 |
| URLreputation | | | | |
| Time (s) | 2.55e+02 | 7.86e+02 | 1.66e+03 | 5.65e+03 |
| Iterations | 4.44e+02 | 3.01e+02 | 1.77e+02 | 5.92e+02 |
| Dot products | 4.53e+08 | 9.73e+08 | 1.98e+09 | 4.89e+09 |
| Active features | 53.4 | 77.9 | 126.8 | 52.44 |
| KDD2010 | | | | |
| Time (s) | 6.15e+02 | 2.33e+03 | 8.86e+02 | 4.12e+03 |
| Iterations | 2.27e+02 | 1.59e+02 | 1.04e+02 | 2.22e+02 |
| Dot products | 2.08e+09 | 3.21e+09 | 8.92e+09 | 1.88e+10 |
| Active features | 906.0 | 1,444.1 | 1,825.4 | 1,978.5 |
Performance metrics for stochastic FW on the large-scale regression problems Pyrim, Triazines, E2006tfidf and E2006log1p

| | FW \(1\%\) | FW \(2\%\) | FW \(3\%\) |
|---|---|---|---|
| Pyrim | | | |
| Time (s) | 2.28e-01 | 4.47e-01 | 6.60e-01 |
| Speedup | \(27.3\times\) | \(13.9\times\) | \(9.4\times\) |
| Iterations | 2.77e+02 | 2.80e+02 | 2.77e+02 |
| Dot products | 7.61e+05 | 1.53e+06 | 2.28e+06 |
| Active features | 27.6 | 28.1 | 27.9 |
| Triazines | | | |
| Time (s) | 2.61e+00 | 5.31e+00 | 8.19e+00 |
| Speedup | \(10.5\times\) | \(5.2\times\) | \(3.4\times\) |
| Iterations | 7.15e+02 | 7.29e+02 | 7.43e+02 |
| Dot products | 5.18e+06 | 1.06e+07 | 1.61e+07 |
| Active features | 120.6 | 117.5 | 118.7 |
| E2006tfidf | | | |
| Time (s) | 8.83e-01 | 1.76e+00 | 2.74e+00 |
| Speedup | \(10.3\times\) | \(5.2\times\) | \(3.3\times\) |
| Iterations | 1.27e+03 | 1.35e+03 | 1.41e+03 |
| Dot products | 1.97e+06 | 4.35e+06 | 6.84e+06 |
| Active features | 123.7 | 125.8 | 127.1 |
| E2006log1p | | | |
| Time (s) | 1.93e+01 | 4.14e+01 | 6.59e+01 |
| Speedup | \(8.3\times\) | \(3.9\times\) | \(2.4\times\) |
| Iterations | 1.75e+03 | 1.91e+03 | 1.99e+03 |
| Dot products | 7.90e+07 | 1.71e+08 | 2.68e+08 |
| Active features | 196.9 | 199.8 | 203.7 |
As a measure of the performance of the considered algorithms, we report the CPU time in seconds, the total number of iterations and the number of requested dot products (which account for most of the required running time for all the algorithms)^{1} along the entire regularization path. Note that, in assessing the number of iterations, we consider one complete cycle of CD to be roughly equivalent to a full deterministic iteration of FW (since in both cases the complexity is determined by a full pass through all the coordinates) and to p random coordinate explorations in SCD. Finally, in order to evaluate the sparsity of the solutions, we report the average number of active features along the path. Results are displayed in Tables 4, 5 (baseline methods) and Tables 6, 7 (stochastic FW). In the latter, the speedups with respect to the CD algorithm are also reported. For all the choices of the sampling size, the FW algorithm yields a substantial improvement in computational performance, as confirmed by both the CPU times and the machine-independent number of requested dot products (which are roughly proportional to the running times). The plain SCD algorithm performs somewhat worse than CD, something we attribute mainly to the fact that the Glmnet implementation of CD is a highly optimized one, using a number of ad hoc tricks tailored to the Lasso problem that we decided to preserve in our comparison. If we used a plain implementation of CD, we would expect to obtain results very close to those exhibited by SCD.
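The speedup figures can be recomputed directly from the reported CPU times, as the ratio of the CD time to the FW time on the same problem. The sketch below does this for the FW \(1\%\) configuration; the times are copied from the tables in this section, while the dictionary and function names are hypothetical.

```python
# CPU times in seconds, copied from the tables in this section.
cd_time = {
    "Pyrim": 6.22,
    "Triazines": 27.5,
    "E2006tfidf": 9.10,
    "E2006log1p": 160.0,
    "Dorothea": 3.34,
    "URLreputation": 255.0,
    "KDD2010": 615.0,
}
fw_time_1pct = {
    "Pyrim": 0.228,
    "Triazines": 2.61,
    "E2006tfidf": 0.883,
    "E2006log1p": 19.3,
    "Dorothea": 0.131,
    "URLreputation": 15.0,
    "KDD2010": 168.0,
}


def speedup(baseline_seconds: float, solver_seconds: float) -> float:
    """How many times faster the solver is than the baseline."""
    return baseline_seconds / solver_seconds


speedups = {
    name: round(speedup(cd_time[name], fw_time_1pct[name]), 1)
    for name in cd_time
}
for name, value in speedups.items():
    print(f"{name}: {value}x")
```

The same ratio can be applied to the dot-product counts to obtain a machine-independent comparison.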
Performance metrics for stochastic FW on the large-scale classification problems Dorothea, URLreputation and KDD2010algebra

| | FW \(1\%\) | FW \(2\%\) | FW \(3\%\) |
|---|---|---|---|
| Dorothea | | | |
| Time (s) | 1.31e-01 | 2.48e-01 | 3.78e-01 |
| Speedup | \(25.5\times\) | \(13.5\times\) | \(8.84\times\) |
| Iterations | 8.04e+02 | 8.09e+02 | 8.46e+02 |
| Dot products | 9.17e+05 | 1.83e+06 | 2.84e+06 |
| Active features | 50.9 | 52.5 | 54.8 |
| URLreputation | | | |
| Time (s) | 1.50e+01 | 2.27e+01 | 3.13e+01 |
| Speedup | \(17.0\times\) | \(11.2\times\) | \(8.15\times\) |
| Iterations | 5.33e+02 | 5.50e+02 | 5.66e+02 |
| Dot products | 2.04e+07 | 4.20e+07 | 6.45e+07 |
| Active features | 25.2 | 26.5 | 28.1 |
| KDD2010 | | | |
| Time (s) | 1.68e+02 | 3.42e+02 | 5.21e+02 |
| Speedup | \(3.66\times\) | \(1.80\times\) | \(1.18\times\) |
| Iterations | 2.70e+03 | 2.78e+03 | 2.83e+03 |
| Dot products | 4.53e+08 | 1.16e+09 | 1.77e+09 |
| Active features | 423.6 | 428.0 | 433.6 |
In order to evaluate the accuracy of the obtained models, we plot in Figs. 5–7 the mean squared error (MSE) against the \(\ell_1\)-norm of the solution along the regularization path, computed both on the original training set (curves 5a–d, 6a,c and 7a,c) and on the test set (curves 6b,d and 7b,d). Figure 5 reports only the training error, as the corresponding problems did not come with a test set. Note that the value of the objective function in problem (1) coincides with the MSE on the training set; therefore, the training error plots effectively depict the convergence of the FW algorithms. For the sake of completeness, we also report in Fig. 8 the training and test error rates for one of the classification problems, Dorothea. First of all, the decrease in the objective value is essentially identical in all cases, which indicates that with our sampling choices the use of a randomized algorithm does not affect the optimization accuracy. Second, the test error curves show that the predictive capability of all the FW models is competitive with that of the models found by the CD algorithm (particularly in the case of the larger problem E2006log1p). Looking at Figs. 6 and 7, it is also important to note that in all cases the best model, corresponding to the minimum of the test error curves, is found for a relatively low value of the constraint parameter. This indicates that sparse solutions are preferable and that solutions involving more variables tend to cause overfitting, which is yet another incentive to use algorithms that can naturally induce sparsity. Again, the minima of all the curves coincide, indicating that all the algorithms are able to correctly identify the best compromise between sparsity and training error.
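The error curves described above pair each solution on the regularization path with its \(\ell_1\)-norm and its MSE. A minimal sketch of that computation follows, in plain Python for self-containment. It assumes the MSE is the unscaled mean of squared residuals (any constant factor used in problem (1) is not shown in this section), and the helper names are hypothetical.

```python
def l1_norm(alpha):
    """Sum of absolute values of the coefficients."""
    return sum(abs(a) for a in alpha)


def mse(X, y, alpha):
    """Mean squared error of the linear model X @ alpha against y.

    X is a list of rows; alpha is the coefficient vector.
    """
    residuals = [
        sum(x_ij * a_j for x_ij, a_j in zip(row, alpha)) - y_i
        for row, y_i in zip(X, y)
    ]
    return sum(r * r for r in residuals) / len(y)


def error_curve(X, y, path):
    """One (l1-norm, MSE) point per solution on the path."""
    return [(l1_norm(alpha), mse(X, y, alpha)) for alpha in path]


# Toy data, for illustration only.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
path = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
curve = error_curve(X, y, path)
print(curve)
```

Calling `error_curve` once with the training data and once with held-out data gives the two families of curves; the minimum of the held-out curve identifies the preferred value of the constraint parameter.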
The fact that we are able to attain the same models obtained by a highly efficient algorithm (tailored for the Lasso) such as Glmnet using a sampling size as small as \(3\%\) of the total number of features is particularly noteworthy. Combined with the consistent advantages in CPU time over other competing solvers and its attractive sparsity properties, it shows how the randomized FW represents a solid, high-performance option for solving high-dimensional Lasso problems. Finally, we note that even on the classification problem Dorothea, FW is able to obtain more accurate models than CD, particularly towards the end of the path. We remark, though, that this experiment has mainly an illustrative purpose, and that solving classification tasks is not among the aims of the algorithm presented here.
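For readers who want a concrete picture of the kind of randomized iteration evaluated in this section, the following is a minimal sketch, not the optimized implementation used in the experiments. It assumes the objective \(\frac{1}{2}\Vert X\alpha - y\Vert ^2\) subject to \(\Vert \alpha \Vert _1 \le \delta \) (the scaling of problem (1) may differ), uniform sampling without replacement, and an exact line search clipped to \([0, 1]\). The function name and arguments are hypothetical.

```python
import random


def stochastic_fw_lasso(columns, y, delta, sample_size, n_iters, seed=0):
    """Randomized Frank-Wolfe sketch for the l1-constrained least squares.

    columns[j] holds the j-th feature (column of X) as a list.
    """
    rng = random.Random(seed)
    p = len(columns)
    alpha = [0.0] * p
    residual = [-v for v in y]  # X @ alpha - y at alpha = 0
    for _ in range(n_iters):
        sampled = rng.sample(range(p), sample_size)
        # Gradient coordinates restricted to the sampled features.
        grads = {
            j: sum(c * r for c, r in zip(columns[j], residual))
            for j in sampled
        }
        j_star = max(grads, key=lambda j: abs(grads[j]))
        g = grads[j_star]
        if g == 0.0:
            continue  # no descent direction within this sample
        sign = -1.0 if g > 0 else 1.0
        # Vertex u = delta * sign * e_{j_star}; X @ (u - alpha) is:
        xd = [
            delta * sign * c - (r + y_i)
            for c, r, y_i in zip(columns[j_star], residual, y)
        ]
        denom = sum(v * v for v in xd)
        if denom == 0.0:
            continue
        # Exact line search for the quadratic, clipped to [0, 1].
        lam = -sum(r * v for r, v in zip(residual, xd)) / denom
        lam = min(1.0, max(0.0, lam))
        alpha = [(1.0 - lam) * a for a in alpha]
        alpha[j_star] += lam * delta * sign
        residual = [r + lam * v for r, v in zip(residual, xd)]
    return alpha


# Toy problem: X is the 2x2 identity, y = (1, 0), delta = 1.
toy_alpha = stochastic_fw_lasso([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0],
                                delta=1.0, sample_size=2, n_iters=5)
print(toy_alpha)
```

Each iterate is a convex combination of the origin and vertices of the \(\ell_1\)-ball, so it stays feasible, and at most one new coordinate becomes active per iteration, which is the source of the sparsity discussed above. Only `sample_size` dot products are needed per iteration instead of \(p\).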
6 Conclusions and perspectives
In this paper, we have studied the practical advantages of using a randomized Frank–Wolfe algorithm to solve the constrained formulation of the Lasso regression problem on high-dimensional datasets involving a number of variables ranging from hundreds of thousands to a few million. We have presented a theoretical proof of convergence based on the expected value of the objective function. Our experiments show that we are able to obtain results that outperform those of other state-of-the-art solvers such as the Glmnet algorithm, a standard among practitioners, without sacrificing the accuracy of the model in a significant way. Importantly, our solutions are consistently sparser than those found using several popular first-order methods, demonstrating the advantage of using an incremental, greedy optimization scheme in this context.
In future work, we intend to address the issue of whether it is possible to find suitable sampling conditions which can lead to a stronger stochastic convergence result, i.e. to certifiable probability bounds for approximate solutions. A more detailed convergence analysis taking into account higher order moments beyond the expected value would also be, in our view, a valuable contribution. Finally, we remark that the proposed approach can be readily extended to other similar problems such as ElasticNet or more general \(\ell_2\)-regularized problems such as logistic regression, or to related applications such as the sparsification of SVM models. Another possibility to tackle various Lasso formulations is to exploit an equivalent formulation in terms of SVMs, an area where FW methods have already shown promising results. Together, all these elements strengthen the conclusions of our previous research work, showing that FW algorithms can provide a complete and flexible framework to efficiently solve a wide variety of large-scale machine learning and data mining problems.
Footnotes
 1. Note that the SLEP code uses highly optimized libraries for matrix multiplication, therefore matrix-vector computations can be faster than naive C++ implementations.
Notes
Acknowledgments
The authors wish to thank three anonymous reviewers for their valuable comments. The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC AdG A-DATADRIVE-B (290923). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; Flemish Government: FWO: Projects: G.0377.12 (Structured systems), G.088114N (Tensor based data similarity); Ph.D./Postdoc grants; iMinds Medical Information Technologies SBO 2014; IWT: POM II SBO 100031; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, control and optimization, 2012–2017). The second author received funding from CONICYT Chile through FONDECYT Project 1130122 and DGIP-UTFSM 24.14.84. The first author thanks the colleagues from the Department of Computer Science and Engineering, University of Bologna, for their hospitality during the period in which this research was conceived.
References
 Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. https://www.csie.ntu.edu.tw/~cjlin/libsvm.
 Clarkson, K. (2010). Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm. ACM Transactions on Algorithms, 6(4), 63:1–63:30.
 Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
 Frandi, E., Ñanculef, R., & Suykens, J. A. K. (2014). Complexity issues and randomization strategies in Frank–Wolfe algorithms for machine learning. In 7th NIPS workshop on optimization for machine learning.
 Frandi, E., Ñanculef, R., & Suykens, J. A. K. (2015). A PARTAN-accelerated Frank–Wolfe algorithm for large scale SVM classification. In Proceedings of the international joint conference on neural networks 2015.
 Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1, 95–110.
 Friedman, J. (2012). Fast sparse regression and classification. International Journal of Forecasting, 28, 722–738.
 Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
 Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
 Garber, D., & Hazan, E. (2015). Faster rates for the Frank–Wolfe method over strongly-convex sets. In Proceedings of the 32nd ICML.
 Harchaoui, Z., Juditsky, A., & Nemirovski, A. (2014). Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming Series A, 13(1), 1–38.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York: Springer New York Inc.
 Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
 Huang, L., Jia, J., Yu, B., Chun, B. G., Maniatis, P., & Naik, M. (2010). Predicting execution time of computer programs using sparse polynomial regression. In Advances in neural information processing systems (pp. 883–891).
 Jaggi, M. (2013). Revisiting Frank–Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th international conference on machine learning.
 Jaggi, M. (2014). An equivalence between the Lasso and support vector machines. In J. A. K. Suykens, M. Signoretto, & A. Argyriou (Eds.), Regularization, optimization, kernels, and support vector machines, chap. 1 (pp. 1–26). Boca Raton: Chapman & Hall/CRC.
 Kim, S. J., Koh, K., Lustig, M., Boyd, S., & Gorinevsky, D. (2007). An interior-point method for large-scale \(\ell_1\)-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4), 606–617.
 Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., & Smith, N. A. (2009). Predicting risk from financial reports with regression. In Proceedings of the NAACL ’09 (pp. 272–280).
 Lacoste-Julien, S., & Jaggi, M. (2014). An affine invariant linear convergence analysis for Frank–Wolfe algorithms. arXiv:1312.7864v2.
 Lacoste-Julien, S., Jaggi, M., Schmidt, M., & Pletscher, P. (2013). Block-coordinate Frank–Wolfe optimization for structural SVMs. In Proceedings of the 30th international conference on machine learning.
 Lan, G. (2014). The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2.
 Langford, J., Li, L., & Zhang, T. (2009). Sparse online learning via truncated gradient. In Advances in neural information processing systems (pp. 905–912).
 Lee, M., Shen, H., Huang, J. Z., & Marron, J. S. (2010). Biclustering via sparse singular value decomposition. Biometrics, 66(4), 1087–1095.
 Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml.
 Liu, J., & Ye, J. (2009). Efficient euclidean projections in linear time. In Proceedings of the 26th international conference on machine learning (pp. 657–664). New York: ACM.
 Liu, J., & Ye, J. (2010). Efficient \(\ell _1/\ell _q\) norm regularization. arXiv:1009.4766.
 Liu, J., Ji, S., & Ye, J. (2009). SLEP: Sparse learning with efficient projections. http://www.yelab.net/software/SLEP/. Arizona State University.
 Ñanculef, R., Frandi, E., Sartori, C., & Allende, H. (2014). A novel Frank–Wolfe algorithm: Analysis and applications to large-scale SVM training. Information Sciences, 285, 66–99.
 Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming Series B, 140(1), 125–161.
 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
 Richtárik, P., & Takáč, M. (2014). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming Series A, 144(1), 1–38.
 Schölkopf, B., & Smola, A. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
 Shalev-Shwartz, S., & Tewari, A. (2011). Stochastic methods for \(\ell_1\)-regularized loss minimization. Journal of Machine Learning Research, 12, 1865–1892.
 Shalev-Shwartz, S., Srebro, N., & Zhang, T. (2010). Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6), 2807–2832.
 Signoretto, M., Frandi, E., Karevan, Z., & Suykens, J. A. K. (2014). High level high performance computing for multitask learning of time-varying models. In IEEE CIBD 2014.
 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1), 267–288.
 Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society Series B, 73(3), 273–282.
 Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B, 67(1), 91–108.
 Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10), 2231–2242.
 Turlach, B. A. (2005). On algorithms for solving least squares problems under an \(l_1\) penalty or an \(l_1\) constraint. In Proceedings of the American Statistical Association, Statistical Computing Section (pp. 2572–2577).
 Wang, Y., & Qian, X. (2014). Stochastic coordinate descent Frank–Wolfe algorithm for large-scale biological network alignment. In GlobalSIP14 Workshop on genomic signal processing and statistics.
 Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M. (2003). Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3, 1439–1461.
 Zhou, Q., Song, S., Huang, G., & Wu, C. (2015). Efficient Lasso training from a geometrical perspective. Neurocomputing, 168, 234–239.
 Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
 Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67, 301–320.
 Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics, 37(4), 1733.