A fast unified algorithm for solving group-lasso penalized learning problems

Abstract

This paper concerns a class of group-lasso learning problems where the objective function is the sum of an empirical loss and the group-lasso penalty. For a class of loss functions satisfying a quadratic majorization condition, we derive a unified algorithm called groupwise-majorization-descent (GMD) for efficiently computing the solution paths of the corresponding group-lasso penalized learning problem. GMD allows for general design matrices, without requiring the predictors to be group-wise orthonormal. As illustrative examples, we develop concrete algorithms for solving the group-lasso penalized least squares problem and several group-lasso penalized large margin classifiers. These group-lasso models have been implemented in the R package gglasso, publicly available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/gglasso. On simulated and real data, gglasso consistently outperforms existing software for computing the group lasso that implements either the classical groupwise descent algorithm or Nesterov's method.
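
As a quick illustration of what the package computes, the following is a minimal R sketch on simulated data (not an excerpt from the paper; the argument names follow the gglasso CRAN documentation and may differ across package versions):

    # Minimal sketch: group-lasso penalized least squares path with gglasso.
    # install.packages("gglasso")    # package available from CRAN
    library(gglasso)

    set.seed(1)
    n <- 100; p <- 20
    X    <- matrix(rnorm(n * p), n, p)    # general design matrix (no group-wise orthonormality needed)
    grp  <- rep(1:5, each = 4)            # five groups of four predictors
    beta <- c(rep(1, 4), rep(0, p - 4))   # only the first group is truly active
    y    <- drop(X %*% beta) + rnorm(n)

    fit <- gglasso(x = X, y = y, group = grp, loss = "ls")   # full solution path over lambda
    plot(fit)                                                # group-wise coefficient paths
    # cv.gglasso(x = X, y = y, group = grp, loss = "ls") can be used to select lambda.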

References

  1. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96(12), 6745 (1999)

  2. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57, 1413–1457 (2004)

  3. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–451 (2004)

  4. Friedman, J., Hastie, T., Tibshirani, R.: Regularized paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)

  5. Fu, W.: Penalized regressions: the bridge versus the lasso. J. Comput. Gr. Stat. 7(3), 397–416 (1998)

  6. Genkin, A., Lewis, D., Madigan, D.: Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3), 291–304 (2007)

  7. Gorman, R., Sejnowski, T.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1(1), 75–89 (1988)

  8. Graham, K., de Las Morenas, A., Tripathi, A., King, C., Kavanah, M., Mendez, J., Stone, M., Slama, J., Miller, M., Antoine, G., et al.: Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. Br. J. Cancer 102(8), 1284–1293 (2010)

  9. Hunter, D., Lange, K.: A tutorial on MM algorithms. Am. Stat. 58(1), 30–37 (2004)

  10. Lange, K., Hunter, D., Yang, I.: Optimization transfer using surrogate objective functions (with discussion). J. Comput. Gr. Stat. 9, 1–20 (2000)

  11. Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections, Arizona State University. URL: http://www.public.asu.edu/jye02/Software/SLEP (2009)

  12. Meier, L., van de Geer, S., Bühlmann, P.: The group lasso for logistic regression. J. R. Stat. Soc. Ser. B 70, 53–71 (2008)

  13. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers (2004)

  14. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2007)

  15. Osborne, M., Presnell, B., Turlach, B.: A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20(3), 389–403 (2000)

  16. Quinlan, J.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243 (1993)

  17. Sæbø, S., Almøy, T., Aarøe, J., Aastveit, A.: ST-PLS: a multi-directional nearest shrunken centroid type classifier via PLS. J. Chemom. 22(1), 54–62 (2008)

  18. Scheetz, T., Kim, K., Swiderski, R., Philp, A., Braun, T., Knudtson, K., Dorrance, A., DiBona, G., Huang, J., Casavant, T., et al.: Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Nat. Acad. Sci. 103(39), 14429–14434 (2006)

  19. Segal, M., Dahlquist, K., Conklin, B.: Regression approaches for microarray data analysis. J. Comput. Biol. 10(6), 961–980 (2003)

  20. Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico, A., Richie, J., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2), 203–209 (2002)

  21. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)

  22. Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R.: Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B 74, 245–266 (2012)

  23. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)

  24. Vogt, J., Roth, V.: A complete analysis of the \(l_{1, p}\) group-lasso. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML 2012, Omnipress, pp 185–192 (2012)

  25. Wang, H., Leng, C.: A note on adaptive group lasso. Comput. Stat. Data Anal. 52, 5277–5286 (2008)

  26. Wang, L., Zhu, J., Zou, H.: Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24, 412–419 (2008)

  27. Wu, T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2, 224–244 (2008)

  28. Wu, T., Lange, K.: The MM alternative to EM. Stat. Sci. 4, 492–505 (2010)

  29. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–67 (2006)

  30. Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 32, 56–85 (2004)

  31. Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)

Acknowledgments

The authors thank the editor, an associate editor and two referees for their helpful comments and suggestions. This work is supported in part by NSF Grant DMS-08-46068.

Author information

Correspondence to Hui Zou.

Appendix: Proofs

Proof of Lemma 1

Part (1). For any \(\varvec{\beta }\) and \(\varvec{\beta }^*\), write \(\varvec{\beta }-\varvec{\beta }^*=V\) and define \(g(t)=L(\varvec{\beta }^*+tV \mid \mathbf{D})\) so that

$$\begin{aligned} g(0)=L(\varvec{\beta }^* \mid \mathbf{D}), \quad g(1)=L(\varvec{\beta }\mid \mathbf{D}). \end{aligned}$$

By the mean value theorem, \(\exists \ a \in (0,1)\) such that

$$\begin{aligned} g(1)=g(0)+g^{\prime }(a)=g(0)+g^{\prime }(0)+[g^{\prime }(a)-g^{\prime }(0)]. \end{aligned}$$
(25)

Write \(\Phi _f^{\prime }=\frac{\partial {\Phi (y,f)}}{\partial {f}}\). Note that

$$\begin{aligned} g^{\prime }(t)=\frac{1}{n}\sum ^n_{i=1}\tau _i\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+tV))(\mathbf{x}^{{ \mathsf T}}_i V). \end{aligned}$$
(26)

Thus \(g^{\prime }(0)=(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D}).\) Moreover, from (26) we have

$$\begin{aligned} |g^{\prime }(a)-g^{\prime }(0)|&= \vert \frac{1}{n}\sum ^n_{i=1}\tau _i[\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+aV))\nonumber \\&-\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i \varvec{\beta }^*)](\mathbf{x}^{{ \mathsf T}}_i V) \vert \nonumber \\&\le \frac{1}{n}\sum ^n_{i=1} \tau _i|\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+aV))\nonumber \\&-\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i \varvec{\beta }^*) | |\mathbf{x}^{{ \mathsf T}}_i V| \nonumber \\&\le \frac{1}{n}\sum ^n_{i=1} C\tau _i|\mathbf{x}^{{ \mathsf T}}_i aV | |\mathbf{x}^{{ \mathsf T}}_i V| \end{aligned}$$
(27)
$$\begin{aligned}&\le \frac{1}{n}\sum ^n_{i=1} C \tau _i \Vert \mathbf{x}^{{ \mathsf T}}_i V \Vert ^2_2 \nonumber \\&= \frac{C}{n}V^{{ \mathsf T}} [ \mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}] V, \end{aligned}$$
(28)

where in (27) we have used the inequality \(| \Phi _f^{\prime }(y,f_1)-\Phi _f^{\prime }(y,f_2)| \le C |f_1-f_2|\). Plugging the bound (28) into (25) we have

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\varvec{\beta }^* \mid \mathbf{D})+(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D})\\&+\frac{1}{2}(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}} \mathbf{H}( \varvec{\beta }- \varvec{\beta }^*), \end{aligned}$$

with \( \mathbf{H}=\frac{2C}{n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}. \)
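
As a concrete instance of Part (1) (our illustration; the constant is computed directly from the Lipschitz condition used in (27), not quoted from the paper), consider the squared hinge loss

$$\begin{aligned} \Phi (y,f)=\left[ (1-yf)_+\right] ^2, \qquad \Phi _f^{\prime }(y,f)=-2y(1-yf)_+ . \end{aligned}$$

For \(y \in \{-1,1\}\) the map \(f \mapsto \Phi _f^{\prime }(y,f)\) is continuous and piecewise linear with slopes in \(\{0,2\}\), so \(| \Phi _f^{\prime }(y,f_1)-\Phi _f^{\prime }(y,f_2)| \le 2 |f_1-f_2|\), i.e. \(C=2\), and Part (1) gives the quadratic majorization with \( \mathbf{H}=\frac{4}{n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}. \)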

Part (2). Write \(\Phi _f^{\prime \prime }=\frac{\partial ^2{\Phi (y,f)}}{\partial {f}^2}\). By Taylor’s expansion, \(\exists \ b \in (0,1)\) such that

$$\begin{aligned} g(1)=g(0)+g^{\prime }(0)+\frac{1}{2}g^{\prime \prime }(b). \end{aligned}$$
(29)

Note that

$$\begin{aligned} g^{\prime \prime }(b)&= \frac{1}{n}\sum ^n_{i=1}\tau _i \Phi _f^{\prime \prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+bV))(\mathbf{x}^{{ \mathsf T}}_i V)^2\nonumber \\&\le \frac{1}{n}\sum ^n_{i=1}C_2\tau _i (\mathbf{x}^{{ \mathsf T}}_i V)^2, \end{aligned}$$
(30)

where we have used the inequality \(\Phi _f^{\prime \prime } \le C_2\). Plugging (30) into (29) we have

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\varvec{\beta }^* \mid \mathbf{D})+(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D})\\&+\frac{1}{2}(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}} \mathbf{H}( \varvec{\beta }- \varvec{\beta }^*), \end{aligned}$$

with \( \mathbf{H}=\frac{C_2}{n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}. \) \(\square \)
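
For a worked instance of Part (2) (again our illustration rather than text from the paper), take the logistic loss:

$$\begin{aligned} \Phi (y,f)=\log \left( 1+e^{-yf}\right) , \qquad \Phi _f^{\prime \prime }(y,f)=\frac{e^{yf}}{(1+e^{yf})^2} \le \frac{1}{4} \end{aligned}$$

for \(y \in \{-1,1\}\), so one may take \(C_2=\tfrac{1}{4}\), and Part (2) yields the quadratic majorization with \( \mathbf{H}=\frac{1}{4n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}. \)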

Cite this article

Yang, Y., Zou, H. A fast unified algorithm for solving group-lasso penalized learning problems. Stat Comput 25, 1129–1141 (2015). https://doi.org/10.1007/s11222-014-9498-5

Keywords

  • Groupwise descent
  • Group lasso
  • grplasso
  • Large margin classifiers
  • MM principle
  • SLEP