# Preselection in Lasso-Type Analysis for Ultra-High Dimensional Genomic Exploration

## Abstract

We address the issue of variable preselection in high-dimensional penalized regression, such as the lasso, a commonly used approach to variable selection and prediction in genomics. Preselection—to start with a manageable set of covariates—is becoming increasingly necessary for enabling advanced analysis tasks to be carried out on data sets of huge size created by high throughput technologies. Preselection of features to be included in multivariate analyses based on simple univariate ranking is a natural strategy that has often been implemented despite its potential bias. We demonstrate this bias and propose a way to correct it. Starting with a sequential implementation of the lasso with increasing lists of predictors, we exploit a property of the set of corresponding cross-validation curves, a pattern that we call “freezing”. The ranking of the predictors to be included sequentially is based on simple measures of associations with the outcome, which can be pre-computed in an efficient way for ultra high dimensional data sets externally to the penalized regression implementation. We demonstrate by simulation that our sequential approach leads in a vast majority of cases to a safe and efficient way of focusing the lasso analysis on a smaller and manageable number of predictors. In situations where the lasso performs well, we need typically less than 20 *%* of the variables to recover the same solution as if using the full set of variables. We illustrate the applicability of our strategy in the context of a genome-wide association study and on microarray genomic data where we need just 2. 5 *%* and 13 *%* of the variables respectively. Finally we include an example where 260 million gene-gene interactions are ranked and we are able to recover the lasso solution using only 1 *%* of these. Freezing offers great potential for extending the applicability of penalized regressions to current and upcoming ultra high dimensional problems in bioinformatics. Its applicability is not limited to the standard lasso but is a generic property of many penalized approaches.

## Notes

### Acknowledgements

This research was supported by grant number 204664 from the Norwegian Research Council (NRC) and by Statistics for Innovation (sfi)^{2}, a centre for research based innovation funded by NRC. SR and LCB spent a research period in Paris at Inserm UMRS937, and SR has an adjunct position at (sfi)^{2}. IA was funded by a grant from the Agence Nationale de la Recherche (ANR Maladies neurologiques et maladies psychiatriques) as part of a project on the relation between Parkinson’s disease and genes involved in the metabolism and transport of xenobiotics (PI: Alexis Elbaz, Inserm) for which access to GWAS data was obtained through dbGAP; this work utilized in part data from the NINDS DbGaP database from the CIDR:NGRC PARKINSONS DISEASE STUDY (Accession: phs000196.v2.p1). Sjur Reppe at Ullevaal University Hospital provided the Bone biopsy data.

### References

- 1.Ambroise, C., McLachlan, G.J.: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci.
**99**(10), 6562–6566 (2002)CrossRefMATHGoogle Scholar - 2.Bair, E., Hastie, T., Paul, D., Tibshirani, R.: Prediction by supervised principal components. J. Am. Stat. Assoc.
**101**(473), 119–137 (2006)CrossRefMathSciNetMATHGoogle Scholar - 3.Bien, J., Taylor, J., Tibshirani, R.: A lasso for hierarchical testing of interactions. Ann. Stat.
**41**(3), 1111–1141 (2013)CrossRefMathSciNetMATHGoogle Scholar - 4.Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Berlin (2011)CrossRefGoogle Scholar
- 5.Cantor, R.M., Lange, K., Sinsheimer, J.S.: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet.
**86**(1), 6–22 (2010)CrossRefGoogle Scholar - 6.Cho, S., Kim, K., Kim, Y.J., Lee, J.-K., Cho, Y.S., Lee, J.-Y., Han, B.-G., Kim, H., Ott, J., Park, T.: Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Ann. Hum. Genet.
**74**(5), 416–428 (2010)CrossRefGoogle Scholar - 7.Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat.
**32**, 407–499 (2004)CrossRefMathSciNetMATHGoogle Scholar - 8.El Ghaoui, L., Viallon, V., Rabbani, T.: Safe feature elimination for the lasso and sparse supervised learning problems. ArXiv e-prints 1009.4219 (2011)Google Scholar
- 9.Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B (Stat. Methodol.)
**70**(5), 849–911 (2008)Google Scholar - 10.Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw.
**33**(1), 1–22 (2010)CrossRefGoogle Scholar - 11.Genovese, C.R., Jin, J., Wasserman, L., Yao, Z.: A comparison of the lasso and marginal regression. J. Mach. Learn. Res.
**13**(1), 2107–2143 (2012)MathSciNetMATHGoogle Scholar - 12.Hamza, T.H., Zabetian, C.P., Tenesa, A., Laederach, A., Montimurro, J., Yearout, D., Kay, D.M., Doheny, K.F., Paschall, J., Pugh, E., Kusel, V.I., Collura, R., Roberts, J., Griffith, A., Samii, A., Scott, W.K., Nutt, J., Factor, S.A., Payami, H.: Common genetic variation in the HLA region is associated with late-onset sporadic parkinsons disease. Nat. Genet.
**42**(9), 781–785 (2010)CrossRefGoogle Scholar - 13.Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer Series in Statistics. Springer, New York (2009)CrossRefGoogle Scholar
- 14.Meinshausen, N.: Relaxed lasso. Comput. Stat. Data Anal.
**52**(1), 374–393 (2007)CrossRefMathSciNetMATHGoogle Scholar - 15.Reppe, S., Refvem, H., Gautvik, V.T., Olstad, O.K., Høvring, P.I., Reinholt, F.P., Holden, M., Frigessi, A., Jemtland, R., Gautvik, K.M.: Eight genes are highly associated with BMD variation in postmenopausal caucasian women. Bone
**46**(3), 604–612 (2010)CrossRefGoogle Scholar - 16.Simon, R.M., Korn, E.L., McShane, L.M., Radmacher, M.D., Wright, G.W., Zhao, Y.: Design and analysis of DNS microarray investigations. In: Statistics for Biology and Health. Springer, New York (2004)Google Scholar
- 17.Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc.
**58**, 267–288 (1996)MathSciNetMATHGoogle Scholar - 18.Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R.J.: Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B (Stat. Methodol.)
**74**(2), 245–266 (2012)Google Scholar - 19.van de Geer, S., Bühlmann, P., Zhou, S.: The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electron. J. Stat.
**5**, 688–749 (2011)CrossRefMathSciNetMATHGoogle Scholar - 20.Waldmann, P., Mészáros, G., Gredler, B., Fuerst, C., Sölkner, J.: Evaluation of the lasso and the elastic net in genome-wide association studies. Frontiers in Genetics,
**4**, 270. http://doi.org/10.3389/fgene.2013.00270 (2013) - 21.Waldron, L., Pintilie, M., Tsao, M.-S., Shepherd, F.A., Huttenhower, C., Jurisica, I.: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics
**27**(24), 3399–3406 (2011)CrossRefGoogle Scholar - 22.Yang, C., Wan, X., Yang, Q., Xue, H., Yu, W.: Identifying main effects and epistatic interactions from large-scale snp data via adaptive group lasso. BMC Bioinf.
**11**(Suppl. 1), S18 (2010)CrossRefGoogle Scholar - 23.Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B
**68**(1), 49–67 (2006)CrossRefMathSciNetMATHGoogle Scholar - 24.Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc.
**101**(476), 1418–1429 (2006)CrossRefMATHGoogle Scholar - 25.Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.)
**67**(2), 301–320 (2005)Google Scholar