Group linear algorithm with sparse principal decomposition: a variable selection and clustering method for generalized linear models

  • Regular Article
Statistical Papers

Abstract

This paper introduces the Group Linear Algorithm with Sparse Principal decomposition, an algorithm for supervised variable selection and clustering. Our approach extends the Sparse Group Lasso regularization so that the clusters are computed as part of the model fit. Unlike Sparse Group Lasso, it therefore does not require the groups of variables to be specified in advance. To determine the clusters, we solve a particular case of sparse Singular Value Decomposition, with a regularization term that follows naturally from the Group Lasso penalty. Moreover, this paper proposes a unified implementation that covers, among other models, linear regression, logistic regression, and proportional hazards models with right-censoring. The methodology is evaluated on both biological and simulated data, and details of the R implementation and the hyperparameter search are discussed.
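
For orientation, the Sparse Group Lasso penalty that the method builds on can be sketched as follows. This is a minimal Python illustration of the penalty in its standard form, with the groups supplied by hand. It is not the authors' R implementation, and the function name `sparse_group_lasso_penalty` and the parameter names `lam` and `alpha` are assumptions made for this example.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Standard sparse group lasso penalty (illustrative sketch only).

    beta   : list of coefficients
    groups : list of group labels, one per coefficient (hypothetical input;
             the paper's method estimates this grouping instead)
    lam    : overall regularization strength (assumed name)
    alpha  : mixing weight in [0, 1] between the lasso and group lasso terms
    """
    # Collect the coefficients belonging to each group.
    by_group = {}
    for b, g in zip(beta, groups):
        by_group.setdefault(g, []).append(b)

    # Group lasso term: Euclidean norm of each group, weighted by sqrt(group size).
    group_term = sum(
        math.sqrt(len(v)) * math.sqrt(sum(b * b for b in v))
        for v in by_group.values()
    )

    # Lasso term: sum of absolute values of all coefficients.
    l1_term = sum(abs(b) for b in beta)

    return (1.0 - alpha) * lam * group_term + alpha * lam * l1_term


if __name__ == "__main__":
    beta = [3.0, 4.0, 0.0, 0.0]
    groups = [0, 0, 1, 1]
    print(sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5))
```

With `alpha = 1` the penalty reduces to the lasso, and with `alpha = 0` to the group lasso. The method described in the abstract differs from this sketch in that the assignment passed here as `groups` is estimated during the model fit rather than given.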

Notes

  1. https://github.com/jlaria/glasp.

  2. https://www.genecards.org/cgi-bin/carddisp.pl?gene=BCL2.

  3. https://www.genecards.org/cgi-bin/carddisp.pl?gene=CASP10.

  4. https://www.genecards.org/cgi-bin/carddisp.pl?gene=BMP6&keywords=BMP6.

  5. https://www.genecards.org/cgi-bin/carddisp.pl?gene=SRP72&keywords=SRP72.

Acknowledgements

We gratefully acknowledge the help provided by Prof. Daniela Witten, who gave us access to the source code of CEN and the simulation set-ups compared in Sect. 4. We also acknowledge the constructive comments of the anonymous referees, which have helped to improve the contents of this paper.

Author information

Corresponding author

Correspondence to Juan C. Laria.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 173 KB)

About this article

Cite this article

Laria, J.C., Aguilera-Morillo, M.C. & Lillo, R.E. Group linear algorithm with sparse principal decomposition: a variable selection and clustering method for generalized linear models. Stat Papers 64, 227–253 (2023). https://doi.org/10.1007/s00362-022-01313-z

  • DOI: https://doi.org/10.1007/s00362-022-01313-z
