Abstract
Variable selection is important in high-dimensional data analysis. Lasso regression is attractive because it produces sparse solutions through a soft-thresholding rule and is computationally efficient. However, because the Lasso penalized likelihood contains a nondifferentiable term, standard optimization tools cannot be applied directly. Many algorithms have been proposed to optimize the Lasso penalized likelihood in high-dimensional settings, among them the coordinate descent (CD) algorithm, majorization-minimization (MM) using local quadratic approximation, the fast iterative shrinkage-thresholding algorithm (FISTA), and the alternating direction method of multipliers (ADMM). In this paper, we undertake a comparative study of the relative merits of these algorithms. We are especially concerned with numerical sensitivity to the correlation between the covariates. We conduct a simulation study that varies factors affecting the condition number of the covariance matrix of the covariates, as well as the level of penalization. We also apply the algorithms to cancer biomarker discovery and compare their convergence speed and stability.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, pp 267–281
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B 57(1):289–300
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
Cagle PT, Allen TC, Olsen RJ (2013) Lung cancer biomarkers: present status and future developments. Arch Pathol Labor Med 137(9):1191–1198
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
El-Telbany A, Ma PC (2012) Cancer genes in lung cancer: racial disparities: are there any? Genes Cancer 3:467–480
Friedman J, Hastie T, Hofling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):307–332
Gemmeke JF, Hamme HV, Cranen B, Boves L (2010) Compressive sensing for missing data imputation in noise robust speech recognition. IEEE J Sel Topics Signal Process 4(2):272–287
Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57
Hunter DR, Lange K (2000) Quantile regression via an MM algorithm. J Comput Graph Stat 9(1):60–77
Hunter DR, Li R (2005) Variable selection using MM algorithms. Ann Stat 33(4):1617–1642
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31(4):e15
Jemal A, Siegel R, Xu J, Ward E (2010) Cancer statistics, 2010. CA Cancer J Clin 60(5):277–300
Kati C, Alacam H, Duran L, Guzel A, Akdemir HU, Sisman B, Sahin C, Yavuz Y, Altintas N, Murat N, Okuyucu A (2014) The effectiveness of the serum surfactant protein D (SP-D) level to indicate lung injury in pulmonary embolism. Clin Lab 60(9):1457–1464
Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231
Peng J, Wang P, Zhou N, Zhu J (2009) Partial correlation estimation by joint sparse regression models. J Am Stat Assoc 104(486):735–746
Pounds S, Morris SW (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10):1236–1242
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG (2008) Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 14(8):822–827
Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing pain. Carnegie Mellon University, Pittsburgh, PA
Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, Wistuba II, Minna JD, Xie Y (2013) A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clin Cancer Res 19(6):1577–1586
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58:267–288
Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J Royal Stat Soc Ser B (Stat Methodol) 74(2):245–266
Woenckhaus M, Klein-Hitpass L, Grepmeier U, Merk J, Pfeifer M, Wild P, Bettstetter M, Wuensch P, Blaszyk H, Hartmann A, et al. (2006) Smoking and cancer-related gene expression in bronchial epithelium and non-small-cell lung cancers. J Pathol 210(2):192–204
Wright J, Yang AY, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
Wu TT, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244
Yang AY, Zhou Z, Ganesh A, Shankar SS, Ma Y (2013) Fast l1-minimization algorithms for robust face recognition. IEEE Trans Image Process 22(8):3234–3246
Yu D, Son W, Lim J, Xiao G (2015) Statistical completion of partially identified graph with application to estimation of gene regulatory network. Biostatistics 16(4):670–685
Acknowledgments
Donghyeon Yu was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP, No. 2015R1C1A1A02036312). Joong-Ho Won was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP, Nos. 2013R1A1A1057949 and 2014R1A4A1007895).
Appendix
1.1 A preconditioned conjugate gradient (PCG) method
The conjugate gradient (CG) method solves positive definite linear systems \(Ax = b\); it is typically applied to sparse systems that are too large to solve with the Cholesky decomposition. Instead of solving the linear system directly, the method minimizes the convex quadratic
$$f(x) = \frac{1}{2}x^{T}Ax - b^{T}x,$$
whose unique minimizer satisfies \(\nabla f(x) = Ax - b = 0\).
For a positive definite \(A\), two nonzero vectors \(u\) and \(v\) are said to be conjugate with respect to \(A\) if they satisfy
$$u^{T}Av = 0.$$
If \(P\) is defined as
$$P = \{p_{1}, \ldots, p_{n}\},$$
a set of \(n\) mutually conjugate direction vectors, then \(P\) is a basis of \(\mathbb{R}^{n}\) and the solution \(x\) can be represented in the form
$$x = \sum\limits_{k=1}^{n} \alpha_{k} p_{k}.$$
Multiplying both sides by the matrix \(A\), \(b\) decomposes as
$$b = Ax = \sum\limits_{k=1}^{n} \alpha_{k} A p_{k}.$$
Taking the inner product with an arbitrary direction vector \(p_{k} \in P\) and using conjugacy,
$$p_{k}^{T} b = \sum\limits_{i=1}^{n} \alpha_{i}\, p_{k}^{T} A p_{i} = \alpha_{k}\, p_{k}^{T} A p_{k}.$$
Accordingly, the explicit form of \(\alpha_{k}\) is
$$\alpha_{k} = \frac{p_{k}^{T} b}{p_{k}^{T} A p_{k}}.$$
If mutually conjugate direction vectors are not given in advance, the conjugate gradient method constructs them iteratively. Set \(x_{0}\) as an initial value and define the \(k\)-th residual as \(r_{k} = b - Ax_{k}\). The residual is the negative gradient of the convex function \(f\) at \(x = x_{k}\),
$$\nabla f(x_{k}) = Ax_{k} - b = -r_{k},$$
which means that the conjugate gradient method moves in the direction of \(r_{k}\). Since all direction vectors must be mutually conjugate with respect to \(A\), the \(k\)-th direction \(p_{k}\) is obtained by conjugating the residual against the previous directions:
$$p_{k} = r_{k} - \sum\limits_{i<k} \frac{p_{i}^{T} A r_{k}}{p_{i}^{T} A p_{i}}\, p_{i}.$$
Following this direction, the next iterate is
$$x_{k+1} = x_{k} + \alpha_{k} p_{k}, \quad \text{where} \quad \alpha_{k} = \frac{p_{k}^{T} r_{k}}{p_{k}^{T} A p_{k}}.$$
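As a concrete illustration, the CG iteration described above can be sketched in a few lines of NumPy. This is a minimal teaching sketch, not code from the paper; the function name and interface are our own.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by plain CG."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x               # residual r_k = b - A x_k (negative gradient of f)
    p = r.copy()                # first direction is the steepest-descent direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # step size alpha_k = r_k^T r_k / p_k^T A p_k
        x += alpha * p
        r -= alpha * Ap                  # update residual without recomputing b - A x
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:        # stop when the residual is small enough
            break
        p = r + (rs_new / rs_old) * p    # new direction, A-conjugate to previous ones
        rs_old = rs_new
    return x
```

In exact arithmetic the iteration terminates in at most \(n\) steps, since at most \(n\) mutually conjugate directions exist in \(\mathbb{R}^{n}\); in floating point it is used as an iterative method with a residual tolerance.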
The convergence rate of the conjugate gradient method depends on the condition number of \(A\) and, more precisely, on the distribution of the eigenvalues of \(A\) [21]. Accordingly, the problem \(Ax = b\) can be replaced by an equivalent linear system obtained by multiplying by the inverse of a preconditioner \(M\),
$$M^{-1}Ax = M^{-1}b.$$
An appropriate preconditioner \(M\) should satisfy the following conditions:
- \(M\) is a symmetric and positive definite matrix.
- \(M^{-1}A\) is well conditioned and has few or no extreme eigenvalues.
- \(Mx = b\) is easy to solve.
Widely used preconditioners satisfying these conditions include the following:

1) Diagonal (Jacobi): \(M = \text{diag}(A_{11}, \ldots, A_{nn})\), so that \(M^{-1} = \text{diag}(1/A_{11}, \ldots, 1/A_{nn})\);
2) Incomplete (approximate) Cholesky factorization: \(M = \hat{L}\hat{L}^{T}\), where \(\hat{L}\hat{L}^{T} \approx A\).
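The preconditioned CG iteration differs from plain CG only in that each residual is passed through \(M^{-1}\) before being conjugated. The sketch below (our own illustration, using the diagonal Jacobi preconditioner from item 1; the function name and interface are assumptions) shows the change:

```python
import numpy as np

def pcg(A, b, M_diag, tol=1e-10, max_iter=None):
    """Preconditioned CG with a diagonal (Jacobi) preconditioner M = diag(A).

    M_diag holds the diagonal entries of A; applying M^{-1} is an
    elementwise division, so the condition "Mx = b is easy to solve" holds.
    """
    n = b.shape[0]
    x = np.zeros(n)
    r = b - A @ x
    z = r / M_diag          # z_k = M^{-1} r_k: the preconditioned residual
    p = z.copy()
    rz_old = r @ z
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / M_diag
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p   # conjugate the preconditioned residual
        rz_old = rz_new
    return x

# usage: x = pcg(A, b, np.diag(A)) for a symmetric positive definite A
```

With \(M = I\) this reduces exactly to plain CG; the benefit of a nontrivial \(M\) appears when \(A\) is ill conditioned, as with highly correlated covariates.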
Kim, B., Yu, D. & Won, JH. Comparative study of computational algorithms for the Lasso with high-dimensional, highly correlated data. Appl Intell 48, 1933–1952 (2018). https://doi.org/10.1007/s10489-016-0850-7