Statistical Challenges in Sequence-Based Association Studies with Population- and Family-Based Designs

Abstract

Over the past few years, association analysis has become the primary tool for finding genes that underlie complex traits. Both population-based and family-based designs are commonly used in genetic association studies. Recent technological advances in exome and whole-genome sequencing have made possible the next generation of sequence-based association studies. We review here recent developments in statistical methodology, and the remaining challenges, for sequence-based association studies with both population-based and family-based designs.


Acknowledgements

This research was partially supported by NSF Grant DMS-1100279 and NIH Grants 1R03HG005908 and R01MH095797 (to I.I.-L.).

Author information

Corresponding author

Correspondence to Iuliana Ionita-Laza.

About this article

Cite this article

Ionita-Laza, I., Cho, M.H. & Laird, N.M. Statistical Challenges in Sequence-Based Association Studies with Population- and Family-Based Designs. Stat Biosci 5, 54–70 (2013). https://doi.org/10.1007/s12561-012-9062-9

  • DOI: https://doi.org/10.1007/s12561-012-9062-9

Keywords

  • Association study
  • Next-generation sequencing
  • Population- and family-based designs