Preselection in Lasso-Type Analysis for Ultra-High Dimensional Genomic Exploration

  • Linn Cecilie Bergersen
  • Ismaïl Ahmed
  • Arnoldo Frigessi
  • Ingrid K. Glad
  • Sylvia Richardson

Conference paper
Part of the Abel Symposia book series (ABEL, volume 11)

Abstract

We address the issue of variable preselection in high-dimensional penalized regression, such as the lasso, a commonly used approach to variable selection and prediction in genomics. Preselection, that is, starting from a manageable set of covariates, is becoming increasingly necessary to enable advanced analyses of the huge data sets created by high-throughput technologies. Preselecting the features to include in a multivariate analysis on the basis of a simple univariate ranking is a natural strategy that has often been implemented despite its potential bias. We demonstrate this bias and propose a way to correct it. Starting from a sequential implementation of the lasso with increasing lists of predictors, we exploit a property of the corresponding set of cross-validation curves, a pattern that we call “freezing”. The ranking of the predictors to be included sequentially is based on simple measures of association with the outcome, which can be precomputed efficiently for ultra-high dimensional data sets, externally to the penalized regression implementation. We demonstrate by simulation that our sequential approach leads, in the vast majority of cases, to a safe and efficient way of focusing the lasso analysis on a smaller, manageable number of predictors. In situations where the lasso performs well, we typically need less than 20 % of the variables to recover the same solution as with the full set of variables. We illustrate the applicability of our strategy on a genome-wide association study and on microarray genomic data, where we need just 2.5 % and 13 % of the variables, respectively. Finally, we include an example in which 260 million gene-gene interactions are ranked and the lasso solution is recovered using only 1 % of them. Freezing offers great potential for extending the applicability of penalized regression to current and upcoming ultra-high dimensional problems in bioinformatics; it is not limited to the standard lasso but is a generic property of many penalized approaches.
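
To make the procedure concrete, the following is a minimal sketch, in Python with scikit-learn, of one way such a sequential preselection loop could be organized. The function names, the list sizes, and the stopping rule (stability of the selected support across successive nested fits, used as a simple stand-in for the paper’s freezing criterion on cross-validation curves) are illustrative assumptions, not the authors’ implementation.

    import numpy as np
    from sklearn.linear_model import LassoCV

    def rank_by_association(X, y):
        """Rank predictors by absolute marginal correlation with the outcome;
        these univariate scores can be precomputed cheaply, outside the
        penalized regression itself."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
        return np.argsort(-np.abs(corr))   # strongest association first

    def sequential_lasso(X, y, sizes=(500, 1000, 2000, 4000), cv=10):
        """Fit the lasso on nested, growing lists of top-ranked predictors,
        stopping once the cross-validated solution no longer changes."""
        order = rank_by_association(X, y)
        prev_support, idx, fit = None, None, None
        for q in sizes:
            idx = order[:min(q, X.shape[1])]
            fit = LassoCV(cv=cv).fit(X[:, idx], y)
            support = frozenset(idx[np.flatnonzero(fit.coef_)].tolist())
            if support == prev_support:    # solution has frozen: stop early
                return idx, fit
            prev_support = support
        return idx, fit                    # no freezing seen: enlarge the lists

Note that the paper diagnoses freezing from the cross-validation curves of the successive nested fits rather than from the selected support; the support check above is only a convenient proxy for illustration.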

Acknowledgements

This research was supported by grant number 204664 from the Norwegian Research Council (NRC) and by Statistics for Innovation (sfi)², a centre for research-based innovation funded by the NRC. SR and LCB spent a research period in Paris at Inserm UMRS937, and SR has an adjunct position at (sfi)². IA was funded by a grant from the Agence Nationale de la Recherche (ANR Maladies neurologiques et maladies psychiatriques) as part of a project on the relation between Parkinson’s disease and genes involved in the metabolism and transport of xenobiotics (PI: Alexis Elbaz, Inserm), for which access to GWAS data was obtained through dbGaP; this work used in part data from the NINDS dbGaP database, from the CIDR:NGRC PARKINSONS DISEASE STUDY (Accession: phs000196.v2.p1). Sjur Reppe at Ullevaal University Hospital provided the bone biopsy data.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Linn Cecilie Bergersen¹
  • Ismaïl Ahmed²
  • Arnoldo Frigessi³
  • Ingrid K. Glad¹
  • Sylvia Richardson⁴

  1. Department of Mathematics, University of Oslo, Oslo, Norway
  2. INSERM, CESP Center for Research in Epidemiology and Population Health, Paris, France
  3. Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway
  4. MRC Biostatistics Unit, Cambridge Institute of Public Health, University of Cambridge, Cambridge, UK
