Human Genetics, Volume 128, Issue 6, pp 597–608

Using public control genotype data to increase power and decrease cost of case–control genetic association studies

  • Lindsey A. Ho
  • Ethan M. Lange
Original Investigation


Genome-wide association (GWA) studies are a powerful approach for identifying novel genetic risk factors associated with human disease. A GWA study typically requires thousands of samples to have sufficient statistical power to detect single nucleotide polymorphisms that are associated with only modest increases in risk of disease, given the heavy burden of the multiple test correction that is necessary to maintain valid statistical tests. Low statistical power and the high financial cost of performing a GWA study remain prohibitive for many scientific investigators anxious to perform such a study using their own samples. A number of remedies have been suggested to increase statistical power and decrease cost, including the use of free publicly available genotype data and multi-stage genotyping designs. Herein, we compare the statistical power and relative costs of association study designs that use cases and screened controls with those of study designs that are based only on, or additionally include, free public control genotype data. We describe a novel replication-based two-stage study design that uses free public control genotype data in the first stage and follow-up genotype data on case-matched controls in the second stage, and that preserves many of the advantages inherent in using only an epidemiologically matched set of controls. Specifically, we show that our proposed two-stage design can substantially increase statistical power and decrease the cost of performing a GWA study while controlling the type-I error rate, which can be inflated when using public controls because of differences in ancestry and batch genotype effects.
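The power argument in the abstract can be illustrated with a rough calculation. The sketch below is not the authors' method. It uses a simple normal approximation for a two-sided test comparing risk-allele frequencies between cases and controls, with a Bonferroni-corrected significance level. All numbers (allele frequency, odds ratio, sample sizes, number of SNPs) and all function names are hypothetical, and the calculation assumes the added controls are free of the ancestry and batch effects that the abstract warns about.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def case_allele_freq(p_control, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)


def allelic_test_power(p_control, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided z-test on allele frequencies.

    Assumes each subject contributes two independent alleles and that
    cases and controls differ only in disease status (no stratification,
    no differential genotyping error).
    """
    p_case = case_allele_freq(p_control, odds_ratio)
    variance = (p_case * (1.0 - p_case) / (2 * n_cases)
                + p_control * (1.0 - p_control) / (2 * n_controls))
    shift = (p_case - p_control) / sqrt(variance)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical scenario: 500,000 SNPs tested, Bonferroni correction.
    alpha = 0.05 / 500_000
    for n_controls in (1000, 2000, 4000):
        power = allelic_test_power(0.30, 1.3, 1000, n_controls, alpha)
        print(f"1000 cases, {n_controls} controls: power = {power:.3f}")
```

Under these assumptions, enlarging the control group while holding the number of cases fixed raises power, which is the motivation for adding public controls. The calculation says nothing about the type-I error inflation that the proposed two-stage design is meant to control.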


Keywords: Susceptibility allele; Batch effect; Wellcome Trust Case Control Consortium; Public control; POP1; Ancestry



This work was supported by National Institutes of Health grants CA120082 and CA1363621. We would like to express our appreciation to Yunfei Wang for his assistance in programming and to three anonymous reviewers for their helpful suggestions.

Supplementary material

439_2010_880_MOESM1_ESM.doc (1.1 mb)
Supplementary material 1 (DOC 1,177 kb)



Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  2. Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, USA
