
Statistical Methods for Disease Risk Prediction with Genotype Data

  • Protocol
Statistical Genomics

Part of the book series: Methods in Molecular Biology (MIMB, volume 2629)

Abstract

The single-nucleotide polymorphism (SNP) is the basic unit for understanding the heritability of complex traits. One attractive application of susceptibility SNPs is the construction of prediction models for assessing disease risk. Here, we introduce methods for predicting human traits from SNP data, including the polygenic risk score (PRS), linear mixed models (LMMs), penalized regressions, and methods for controlling population stratification.
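As a rough illustration of the first method named above, a PRS is commonly computed as a weighted sum of an individual's risk-allele counts across SNPs, with weights taken from association effect sizes. The sketch below assumes that standard additive form; the function name `polygenic_risk_score`, the weights, and the genotypes are hypothetical values made up for illustration and are not taken from the chapter.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of risk-allele counts (0, 1, or 2) across SNPs."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


# Hypothetical per-SNP effect sizes (for example, log odds ratios); invented values.
weights = [0.5, -0.25, 1.0]

# Hypothetical genotype matrix: one row per individual, one column per SNP.
genotypes = [
    [0, 1, 2],
    [2, 0, 0],
    [1, 2, 1],
]

# One score per individual.
scores = [polygenic_risk_score(row, weights) for row in genotypes]
print(scores)
```

In practice the scores would typically be standardized and evaluated against observed phenotypes in an independent sample; that step is omitted here.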



Author information

Corresponding author

Correspondence to Maggie Haitian Wang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature

About this protocol


Cite this protocol

Xia, X., Zhang, Y., Wei, Y., Wang, M.H. (2023). Statistical Methods for Disease Risk Prediction with Genotype Data. In: Fridley, B., Wang, X. (eds) Statistical Genomics. Methods in Molecular Biology, vol 2629. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2986-4_15


  • DOI: https://doi.org/10.1007/978-1-0716-2986-4_15


  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-2985-7

  • Online ISBN: 978-1-0716-2986-4

  • eBook Packages: Springer Protocols
