Abstract
Variable selection is important in high-dimensional data analysis. Lasso regression is attractive because it produces sparse solutions through a soft-thresholding rule and is computationally efficient. However, because the Lasso penalized likelihood contains a nondifferentiable term, standard optimization tools cannot be applied directly. Many algorithms have been proposed to optimize the Lasso penalized likelihood in high-dimensional settings, among them the coordinate descent (CD) algorithm, majorization-minimization (MM) using local quadratic approximation, the fast iterative shrinkage-thresholding algorithm (FISTA), and the alternating direction method of multipliers (ADMM). In this paper, we undertake a comparative study of the relative merits of these algorithms. We are especially concerned with numerical sensitivity to the correlation between the covariates. We conduct a simulation study that varies factors affecting the condition number of the covariance matrix of the covariates, as well as the level of penalization. We also apply the algorithms to cancer biomarker discovery and compare their convergence speed and stability.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, pp 267–281
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B 57(1):289–300
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
Cagle PT, Allen TC, Olsen RJ (2013) Lung cancer biomarkers: present status and future developments. Arch Pathol Labor Med 137(9):1191–1198
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
El-Telbany A, Ma PC (2012) Cancer genes in lung cancer: racial disparities: are there any? Genes Cancer 3:467–480
Friedman J, Hastie T, Hofling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):307–332
Gemmeke JF, Hamme HV, Cranen B, Boves L (2010) Compressive sensing for missing data imputation in noise robust speech recognition. IEEE J Sel Topics Signal Process 4(2):272–287
Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57
Hunter DR, Lange K (2000) Quantile regression via an MM algorithm. J Comput Graph Stat 9(1):60–77
Hunter DR, Li R (2005) Variable selection using MM algorithms. Ann Stat 33(4):1617–1642
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31(4):e15
Jemal A, Siegel R, Xu J, Ward E (2010) Cancer statistics, 2010. CA Cancer J Clin 60(5):277–300
Kati C, Alacam H, Duran L, Guzel A, Akdemir HU, Sisman B, Sahin C, Yavuz Y, Altintas N, Murat N, Okuyucu A (2014) The effectiveness of the serum surfactant protein D (SP-D) level to indicate lung injury in pulmonary embolism. Clin Lab 60(9):1457–1464
Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231
Peng J, Wang P, Zhou N, Zhu J (2009) Partial correlation estimation by joint sparse regression models. J Am Stat Assoc 104(486):735–746
Pounds S, Morris SW (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10):1236–1242
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG (2008) Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 14(8):822–827
Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing pain. Carnegie Mellon University, Pittsburgh, PA
Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, Wistuba II, Minna JD, Xie Y (2013) A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clin Cancer Res 19(6):1577–1586
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58:267–288
Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J Royal Stat Soc Ser B (Stat Methodol) 74(2):245–266
Woenckhaus M, Klein-Hitpass L, Grepmeier U, Merk J, Pfeifer M, Wild P, Bettstetter M, Wuensch P, Blaszyk H, Hartmann A, et al. (2006) Smoking and cancer-related gene expression in bronchial epithelium and non-small-cell lung cancers. J Pathol 210(2):192–204
Wright J, Yang AY, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
Wu TT, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244
Yang AY, Zhou Z, Ganesh A, Shankar SS, Ma Y (2013) Fast l1-minimization algorithms for robust face recognition. IEEE Trans Image Process 22(8):3234–3246
Yu D, Son W, Lim J, Xiao G (2015) Statistical completion of partially identified graph with application to estimation of gene regulatory network. Biostatistics 16(4):670–685
Acknowledgments
Donghyeon Yu was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP, No. 2015R1C1A1A02036312). Joong-Ho Won was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP, Nos. 2013R1A1A1057949 and 2014R1A4A1007895).
Appendix
1.1 A preconditioned conjugate gradient (PCG) method
The conjugate gradient (CG) method solves positive definite linear systems \(Ax = b\); it is typically applied to sparse systems that are too large to solve with the Cholesky decomposition. Instead of solving the linear system directly, the method minimizes the convex quadratic
$$f(x) = \frac{1}{2}x^{T}Ax - b^{T}x,$$
whose unique minimizer satisfies \(\nabla f(x) = Ax - b = 0\).
For a positive definite \(A\), two nonzero vectors \(u\) and \(v\) are said to be conjugate with respect to \(A\) if they satisfy
$$u^{T}Av = 0.$$
If \(P\) is defined as
$$P = \{p_{1}, \ldots, p_{n}\},$$
a set of \(n\) mutually conjugate direction vectors, then \(P\) is a basis of \(\mathbb{R}^{n}\) and the solution \(x\) can be represented in the form
$$x = \sum\limits_{k=1}^{n} \alpha_{k} p_{k}.$$
Multiplying both sides by the matrix \(A\), \(b\) decomposes as
$$b = Ax = \sum\limits_{k=1}^{n} \alpha_{k} A p_{k}.$$
Taking the inner product with an arbitrary direction vector \(p_{k} \in P\) and using conjugacy,
$$p_{k}^{T} b = \sum\limits_{i=1}^{n} \alpha_{i}\, p_{k}^{T} A p_{i} = \alpha_{k}\, p_{k}^{T} A p_{k}.$$
Accordingly, the explicit form of \(\alpha_{k}\) is
$$\alpha_{k} = \frac{p_{k}^{T} b}{p_{k}^{T} A p_{k}}.$$
If mutually conjugate direction vectors are not given in advance, the conjugate gradient method constructs them iteratively. Set \(x_{0}\) as an initial value and define the \(k\)-th residual as \(r_{k} = b - Ax_{k}\). The residual is the negative gradient of the convex function \(f\) at \(x = x_{k}\),
$$\nabla f(x_{k}) = Ax_{k} - b = -r_{k},$$
which means that the conjugate gradient method moves in the direction of \(r_{k}\). Since all direction vectors must be mutually conjugate with respect to \(A\), the \(k\)-th direction \(p_{k}\) is obtained by conjugating the residual against the previous directions:
$$p_{k} = r_{k} - \sum\limits_{i<k} \frac{p_{i}^{T} A r_{k}}{p_{i}^{T} A p_{i}}\, p_{i}.$$
Following this direction, the next iterate is
$$x_{k+1} = x_{k} + \alpha_{k} p_{k}, \quad \text{where} \quad \alpha_{k} = \frac{p_{k}^{T} r_{k}}{p_{k}^{T} A p_{k}}.$$
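As a concrete illustration, the CG iteration described above can be sketched in a few lines of NumPy. This is a minimal teaching sketch, not code from the paper; the function name and interface are our own.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A by plain CG."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x               # residual r_k = b - A x_k (negative gradient of f)
    p = r.copy()                # first direction is the steepest-descent direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # step size alpha_k = r_k^T r_k / p_k^T A p_k
        x += alpha * p
        r -= alpha * Ap                  # update residual without recomputing b - A x
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:        # stop when the residual is small enough
            break
        p = r + (rs_new / rs_old) * p    # new direction, A-conjugate to previous ones
        rs_old = rs_new
    return x
```

In exact arithmetic the iteration terminates in at most \(n\) steps, since at most \(n\) mutually conjugate directions exist in \(\mathbb{R}^{n}\); in floating point it is used as an iterative method with a residual tolerance.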
The convergence rate of the conjugate gradient method depends on the condition number of \(A\) and, more precisely, on the distribution of the eigenvalues of \(A\) [21]. Accordingly, the problem \(Ax = b\) can be replaced by an equivalent linear system obtained by multiplying by the inverse of a preconditioner \(M\),
$$M^{-1}Ax = M^{-1}b.$$
An appropriate preconditioner \(M\) should satisfy the following conditions:
- \(M\) is a symmetric and positive definite matrix.
- \(M^{-1}A\) is well conditioned and has few or no extreme eigenvalues.
- \(Mx = b\) is easy to solve.
Widely used preconditioners satisfying these conditions include the following:

1) Diagonal (Jacobi): \(M = \text{diag}(A_{11}, \ldots, A_{nn})\), so that \(M^{-1} = \text{diag}(1/A_{11}, \ldots, 1/A_{nn})\);
2) Incomplete (approximate) Cholesky factorization: \(M = \hat{L}\hat{L}^{T}\), where \(\hat{L}\hat{L}^{T} \approx A\).
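The preconditioned CG iteration differs from plain CG only in that each residual is passed through \(M^{-1}\) before being conjugated. The sketch below (our own illustration, using the diagonal Jacobi preconditioner from item 1; the function name and interface are assumptions) shows the change:

```python
import numpy as np

def pcg(A, b, M_diag, tol=1e-10, max_iter=None):
    """Preconditioned CG with a diagonal (Jacobi) preconditioner M = diag(A).

    M_diag holds the diagonal entries of A; applying M^{-1} is an
    elementwise division, so the condition "Mx = b is easy to solve" holds.
    """
    n = b.shape[0]
    x = np.zeros(n)
    r = b - A @ x
    z = r / M_diag          # z_k = M^{-1} r_k: the preconditioned residual
    p = z.copy()
    rz_old = r @ z
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / M_diag
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p   # conjugate the preconditioned residual
        rz_old = rz_new
    return x

# usage: x = pcg(A, b, np.diag(A)) for a symmetric positive definite A
```

With \(M = I\) this reduces exactly to plain CG; the benefit of a nontrivial \(M\) appears when \(A\) is ill conditioned, as with highly correlated covariates.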
Kim, B., Yu, D. & Won, JH. Comparative study of computational algorithms for the Lasso with high-dimensional, highly correlated data. Appl Intell 48, 1933–1952 (2018). https://doi.org/10.1007/s10489-016-0850-7