Preselection in Lasso-Type Analysis for Ultra-High Dimensional Genomic Exploration
We address the issue of variable preselection in high-dimensional penalized regression, such as the lasso, a commonly used approach to variable selection and prediction in genomics. Preselection, i.e., starting with a manageable set of covariates, is increasingly necessary to enable advanced analyses of the very large data sets created by high-throughput technologies. Preselecting the features to include in multivariate analyses on the basis of simple univariate rankings is a natural strategy that has often been implemented despite its potential bias. We demonstrate this bias and propose a way to correct it. Starting with a sequential implementation of the lasso with increasing lists of predictors, we exploit a property of the set of corresponding cross-validation curves, a pattern that we call "freezing". The ranking of the predictors to be included sequentially is based on simple measures of association with the outcome, which can be pre-computed efficiently for ultra-high-dimensional data sets, externally to the penalized regression implementation. We demonstrate by simulation that our sequential approach leads, in the vast majority of cases, to a safe and efficient way of focusing the lasso analysis on a smaller and manageable number of predictors. In situations where the lasso performs well, we typically need fewer than 20 % of the variables to recover the same solution as when using the full set of variables. We illustrate the applicability of our strategy in the context of a genome-wide association study and on microarray genomic data, where we need just 2.5 % and 13 % of the variables respectively. Finally, we include an example where 260 million gene-gene interactions are ranked and we are able to recover the lasso solution using only 1 % of these. Freezing offers great potential for extending the applicability of penalized regression to current and upcoming ultra-high-dimensional problems in bioinformatics.
The approach is not limited to the standard lasso: freezing is a generic property of many penalized approaches.
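The sequential strategy summarized above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the stopping rule (comparing the selected support between two consecutive prefix sizes) is a simple proxy for the stabilized cross-validation curves that the paper calls freezing, and all function names and parameters are assumptions for illustration.

```python
# Illustrative sketch of sequential lasso with univariate preselection and a
# "freezing"-style stopping rule. Assumes scikit-learn's LassoCV; the exact
# stopping criterion here (unchanged support) is a simplification.
import numpy as np
from sklearn.linear_model import LassoCV

def univariate_ranking(X, y):
    """Rank columns of X by absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))

def sequential_lasso(X, y, fractions=(0.05, 0.1, 0.2, 0.4, 1.0), seed=0):
    """Fit the lasso on growing prefixes of the univariate ranking.

    Stop when the cross-validated lasso selects the same support on two
    consecutive prefix sizes -- a crude stand-in for the frozen
    cross-validation curves described in the text.
    """
    order = univariate_ranking(X, y)
    p = X.shape[1]
    prev_support = None
    for frac in fractions:
        k = max(1, int(frac * p))
        cols = order[:k]
        model = LassoCV(cv=5, random_state=seed).fit(X[:, cols], y)
        support = frozenset(int(j) for j in cols[np.nonzero(model.coef_)[0]])
        if support == prev_support:
            return cols, model  # solution has "frozen"; stop early
        prev_support = support
    return cols, model
```

On simulated data with a few strong predictors, the procedure typically stops after fitting the lasso on a small fraction of the ranked variables, which is the computational saving the abstract quantifies.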
This research was supported by grant number 204664 from the Norwegian Research Council (NRC) and by Statistics for Innovation (sfi)2, a centre for research-based innovation funded by NRC. SR and LCB spent a research period in Paris at Inserm UMRS937, and SR has an adjunct position at (sfi)2. IA was funded by a grant from the Agence Nationale de la Recherche (ANR Maladies neurologiques et maladies psychiatriques) as part of a project on the relation between Parkinson's disease and genes involved in the metabolism and transport of xenobiotics (PI: Alexis Elbaz, Inserm), for which access to GWAS data was obtained through dbGaP; this work utilized in part data from the NINDS dbGaP database from the CIDR:NGRC PARKINSONS DISEASE STUDY (Accession: phs000196.v2.p1). Sjur Reppe at Ullevaal University Hospital provided the bone biopsy data.