Statistical learning approaches in the genetic epidemiology of complex diseases

  • Anne-Laure Boulesteix
  • Marvin N. Wright
  • Sabine Hoffmann
  • Inke R. König
Original Investigation
Part of the following topical collections:
  1. Genetic epidemiology of complex diseases


In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. We focus on regression models and machine learning algorithms that take genetic variables as input and return a classification or a prediction for the target variable of interest, for example, the present or future disease status or the future course of a disease. After briefly explaining the basic motivation and principle of these methods, we review different procedures that can be used to evaluate the accuracy of the obtained models and discuss common flaws that may lead to over-optimistic conclusions about their prediction performance and usefulness.


Regression, Validation, Cross-validation, Omics data, High-dimensional data, Prognostic model



We thank Jenny Lee for proofreading the manuscript.

Supplementary material

439_2019_1996_MOESM1_ESM.docx (25 kb)
Supplementary material 1 (docx 24 KB)



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University, Munich, Germany
  2. Leibniz Institute for Prevention Research and Epidemiology-BIPS, Bremen, Germany
  3. Section of Biostatistics, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
  4. Institute of Medical Biometry and Statistics, University of Lübeck, Lübeck, Germany
