Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy

Johnson, Eric O.; Hancock, Dana B.; Levy, Joshua L.; Gaddis, Nathan C.; Saccone, Nancy L.; Bierut, Laura J.; Page, Grier P.

doi:10.1007/s00439-013-1266-7

Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy

Original Investigation
Published: 22 January 2013

Volume 132, pages 509–522, (2013)
Cite this article

Human Genetics Aims and scope Submit manuscript

Eric O. Johnson¹,
Dana B. Hancock¹,
Joshua L. Levy²,
Nathan C. Gaddis²,
Nancy L. Saccone³,
Laura J. Bierut⁴ &
…
Grier P. Page⁵

1135 Accesses
35 Citations
1 Altmetric
Explore all metrics

Abstract

A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Benefits of Accurate Imputations in GWAS

GAWMerge expands GWAS sample size and diversity by combining array-based genotyping and whole-genome sequencing

Article Open access 11 August 2022

Efficient phasing and imputation of low-coverage sequencing data using large reference panels

Article 07 January 2021

References

Almeida MA, Oliveira PS, Pereira TV, Krieger JE, Pereira AC (2011) An empirical evaluation of imputation accuracy for association statistics reveals increased type-I error rates in genome-wide associations. BMC Genet 12:10. doi:10.1186/1471-2156-12-10
Article PubMed Google Scholar
Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PI, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker P, Chang K, Hawes A, Lewis LR, Ren Y, Wheeler D, Muzny DM, Barnes C, Darvishi K, Hurles M, Korn JM, Kristiansson K, Lee C, McCarrol SA, Nemesh J, Keinan A, Montgomery SB, Pollack S, Price AL, Soranzo N, Gonzaga-Jauregui C, Anttila V, Brodeur W, Daly MJ, Leslie S, McVean G, Moutsianas L, Nguyen H, Zhang Q, Ghori MJ, McGinnis R, McLaren W, Takeuchi F, Grossman SR, Shlyakhter I, Hostetter EB, Sabeti PC, Adebamowo CA, Foster MW, Gordon DR, Licinio J, Manca MC, Marshall PA, Matsuda I, Ngare D, Wang VO, Reddy D, Rotimi CN, Royal CD, Sharp RR, Zeng C, Brooks LD, McEwen JE (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467(7311):52–58. doi:10.1038/nature09298
Article PubMed CAS Google Scholar
Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, Fuchs CS, Petersen GM, Arslan AA, Bueno-de-Mesquita HB, Gross M, Helzlsouer K, Jacobs EJ, LaCroix A, Zheng W, Albanes D, Bamlet W, Berg CD, Berrino F, Bingham S, Buring JE, Bracci PM, Canzian F, Clavel-Chapelon F, Clipp S, Cotterchio M, de Andrade M, Duell EJ, Fox JW Jr, Gallinger S, Gaziano JM, Giovannucci EL, Goggins M, Gonzalez CA, Hallmans G, Hankinson SE, Hassan M, Holly EA, Hunter DJ, Hutchinson A, Jackson R, Jacobs KB, Jenab M, Kaaks R, Klein AP, Kooperberg C, Kurtz RC, Li D, Lynch SM, Mandelson M, McWilliams RR, Mendelsohn JB, Michaud DS, Olson SH, Overvad K, Patel AV, Peeters PH, Rajkovic A, Riboli E, Risch HA, Shu XO, Thomas G, Tobias GS, Trichopoulos D, Van Den Eeden SK, Virtamo J, Wactawski-Wende J, Wolpin BM, Yu H, Yu K, Zeleniuch-Jacquotte A, Chanock SJ, Hartge P, Hoover RN (2009) Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet 41(9):986–990. doi:10.1038/ng.429
Article PubMed CAS Google Scholar
Beecham GW, Martin ER, Gilbert JR, Haines JL, Pericak-Vance MA (2010) APOE is not associated with Alzheimer disease: a cautionary tale of genotype imputation. Ann Hum Genet 74(3):189–194. doi:10.1111/j.1469-1809.2010.00573.x
Article PubMed CAS Google Scholar
Bierut LJ, Agrawal A, Bucholz KK, Doheny KF, Laurie C, Pugh E, Fisher S, Fox L, Howells W, Bertelsen S, Hinrichs AL, Almasy L, Breslau N, Culverhouse RC, Dick DM, Edenberg HJ, Foroud T, Grucza RA, Hatsukami D, Hesselbrock V, Johnson EO, Kramer J, Krueger RF, Kuperman S, Lynskey M, Mann K, Neuman RJ, Nothen MM, Nurnberger JI Jr, Porjesz B, Ridinger M, Saccone NL, Saccone SF, Schuckit MA, Tischfield JA, Wang JC, Rietschel M, Goate AM, Rice JP (2010) A genome-wide association study of alcohol dependence. Proc Natl Acad Sci USA 107(11):5082–5087. doi:10.1073/pnas.0911109107
Article PubMed CAS Google Scholar
Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Gibbs RA, Hurles ME, McVean GA (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319):1061–1073. doi:10.1038/nature09534
Article PubMed CAS Google Scholar
Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C, Castagna A, Cossarizza A, Cozzi-Lepri A, De Luca A, Easterbrook P, Francioli P, Mallal S, Martinez-Picado J, Miro JM, Obel N, Smith JP, Wyniger J, Descombes P, Antonarakis SE, Letvin NL, McMichael AJ, Haynes BF, Telenti A, Goldstein DB (2007) A whole-genome association study of major determinants for host control of HIV-1. Science 317(5840):944–947. doi:10.1126/science.1143767
Article PubMed CAS Google Scholar
Hancock DB, Levy JL, Gaddis NC, Bierut LJ, Saccone NL, Page GP, Johnson EO (2012) Assessment of genotype imputation performance using 1000 Genomes in African American studies. PLoS ONE 7(11):e50610. doi:10.1371/journal.pone.0050610
Hartz SM, Johnson EO, Saccone NL, Hatsukami D, Breslau N, Bierut LJ (2011) Inclusion of African Americans in genetic studies: what is the barrier? Am J Epidemiol 174(3):336–344. doi:10.1093/aje/kwr084
Article PubMed Google Scholar
Ho LA, Lange EM (2010) Using public control genotype data to increase power and decrease cost of case-control genetic association studies. Hum Genet 128(6):597–608. doi:10.1007/s00439-010-0880-x
Article PubMed Google Scholar
Howie B, Marchini J, Stephens M (2011) Genotype imputation with thousands of genomes. G3 (Bethesda) 1(6):457–470. doi:10.1534/g3.111.001198
Article Google Scholar
Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang Z, Welch R, Hutchinson A, Wang J, Yu K, Chatterjee N, Orr N, Willett WC, Colditz GA, Ziegler RG, Berg CD, Buys SS, McCarty CA, Feigelson HS, Calle EE, Thun MJ, Hayes RB, Tucker M, Gerhard DS, Fraumeni JF Jr, Hoover RN, Thomas G, Chanock SJ (2007) A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 39(7):870–874. doi:10.1038/ng2075
Article PubMed CAS Google Scholar
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34(8):816–834. doi:10.1002/gepi.20533
Article PubMed Google Scholar
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867–2873. doi:10.1093/bioinformatics/btq559
Article PubMed CAS Google Scholar
Manolio TA, Rodriguez LL, Brooks L, Abecasis G, Ballinger D, Daly M, Donnelly P, Faraone SV, Frazer K, Gabriel S, Gejman P, Guttmacher A, Harris EL, Insel T, Kelsoe JR, Lander E, McCowin N, Mailman MD, Nabel E, Ostell J, Pugh E, Sherry S, Sullivan PF, Thompson JF, Warram J, Wholley D, Milos PM, Collins FS (2007) New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet 39(9):1045–1051. doi:10.1038/ng2127
Article PubMed CAS Google Scholar
Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet 11(7):499–511. doi:10.1038/nrg2796
Article PubMed CAS Google Scholar
Mukherjee S, Simon J, Bayuga S, Ludwig E, Yoo S, Orlow I, Viale A, Offit K, Kurtz RC, Olson SH, Klein RJ (2011) Including additional controls from public databases improves the power of a genome-wide association study. Hum Hered 72(1):21–34. doi:10.1159/000330149
Article PubMed CAS Google Scholar
Pasaniuc B, Rohland N et al (2012) Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet 44(6):631–635
Google Scholar
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909. doi:10.1038/ng1847
Article PubMed CAS Google Scholar
Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155(2):945–959
PubMed CAS Google Scholar
Pritchard JK, Przeworski M (2001) Linkage disequilibrium in humans: models and data. Am J Hum Genet 69(1):1–14
Google Scholar
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575. doi:10.1086/519795
Article PubMed CAS Google Scholar
Shriner D, Adeyemo A, Chen G, Rotimi CN (2010) Practical considerations for imputation of untyped markers in admixed populations. Genet Epidemiol 34(3):258–265. doi:10.1002/gepi.20457
PubMed Google Scholar
Sinnott JA, Kraft P (2012) Artifact due to differential error when cases and controls are imputed from different platforms. Hum Genet 131(1):111–119. doi:10.1007/s00439-011-1054-1
Article PubMed Google Scholar
Southam L, Panoutsopoulou K, Rayner NW, Chapman K, Durrant C, Ferreira T, Arden N, Carr A, Deloukas P, Doherty M, Loughlin J, McCaskie A, Ollier WE, Ralston S, Spector TD, Valdes AM, Wallis GA, Wilkinson JM, Marchini J, Zeggini E (2011) The effect of genome-wide association scan quality control on imputation outcome for common variants. Eur J Hum Genet 19(5):610–614. doi:10.1038/ejhg.2010.242
Article PubMed CAS Google Scholar
Spencer CC, Su Z, Donnelly P, Marchini J (2009) Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 5(5):e1000477. doi:10.1371/journal.pgen.1000477
Article PubMed Google Scholar
Tiwari HK, Birkner T et al (2011) Accurate and flexible power calculations on the spot: applications to genomic research. Stat Interface 4(3):353–358
Google Scholar
Uh HW, Deelen J, Beekman M, Helmer Q, Rivadeneira F, Hottenga JJ, Boomsma DI, Hofman A, Uitterlinden AG, Slagboom PE, Bohringer S, Houwing-Duistermaat JJ (2012) How to deal with the early GWAS data when imputing and combining different arrays is necessary. Eur J Hum Genet 20(5):572–576. doi:10.1038/ejhg.2011.231
Article PubMed CAS Google Scholar
Zheng J, Li Y et al (2011) A comparison of approaches to account for uncertainty in analysis of imputed genotypes. Genet Epidemiol 35(2):102–110
Google Scholar
Zhuang JJ, Zondervan K, Nyberg F, Harbron C, Jawaid A, Cardon LR, Barratt BJ, Morris AP (2010) Optimizing the power of genome-wide association studies by using publicly available reference samples to expand the control group. Genet Epidemiol 34(4):319–326
Article PubMed Google Scholar

Download references

Acknowledgments

This work was supported by National Institute of Drug Abuse grant nos. R33DA027486 and R01DA026141 (E.O. Johnson PI), as well as R01DA025888 (L.J. Bierut & E.O. Johnson Co-PIs). Funding support for SAGE was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01 HG004422). SAGE is one of the GWAS funded as part of the Gene Environment Association Studies (GENEVA) under GEI. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Support for collection of datasets and samples was provided by the Collaborative Study on the Genetics of Alcoholism (COGA; U10 AA008401), the Collaborative Genetic Study of Nicotine Dependence (COGEND; P01 CA089392), and the Family Study of Cocaine Dependence (FSCD; R01 DA013423). Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01HG004438), the National Institute on Alcohol Abuse and Alcoholism, the National Institute on Drug Abuse, and the NIH contract “High throughput genotyping for studying the genetic contributions to human disease” (HHSN268200782096C). The SAGE dataset used for the analyses described in this manuscript was obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/, through accession number phs000092.v1.p1. The CGEMS (http://cgems.cancer.gov/) PanScan study was derived from 12 cohorts, as outlined by Amundadottir et al. (2009). The PanScan dataset used for the analyses described in this manuscript was obtained from dbGaP through accession number phs000206.v3.p2. The CGEMS breast cancer GWAS was derived from the Nurses’ Health Study, which was supported by NIH grants CA65725, CA87969, CA49449, CA67262, CA50385, and 5UO1CA098233. The CGEMS dataset used for the analyses described in this manuscript was obtained from dbGaP through accession number phs000147.v1.p1. Funding support for the GWAS of Schizophrenia was provided by the National Institute of Mental Health (R01 MH67257, R01 MH59588, R01 MH59571, R01 MH59565, R01 MH59587, R01 MH60870, R01 MH59566, R01 MH59586, R01 MH61675, R01 MH60879, R01 MH81800, U01 MH46276, U01 MH46289 U01 MH46318, U01 MH79469, and U01 MH79470), and the genotyping of samples was provided through GAIN. The datasets used for the analyses described in this manuscript were obtained from the dbGaP through accession number phs000021.v3.p2. Samples and associated phenotype data for the GWAS of Schizophrenia were provided by the Molecular Genetics of Schizophrenia Collaboration (PI: Pablo V. Gejman, Evanston Northwestern Healthcare (ENH) and Northwestern University, Evanston, IL, USA).

Author information

Authors and Affiliations

Behavioral Health Epidemiology Program, RTI International, 3040 Cornwallis Road, PO Box 12194, Research Triangle Park, NC, 27709-12194, USA
Eric O. Johnson & Dana B. Hancock
Research Computing Division, RTI International, Research Triangle Park, NC, 27709, USA
Joshua L. Levy & Nathan C. Gaddis
Department of Genetics, Washington University, St. Louis, MO, 63110, USA
Nancy L. Saccone
Department of Psychiatry, Washington University, St. Louis, MO, 63110, USA
Laura J. Bierut
Genomics, Statistical Genetics, and Environmental Research Program, RTI International, Atlanta, GA, 30341, USA
Grier P. Page

Authors

Eric O. Johnson
View author publications
You can also search for this author in PubMed Google Scholar
Dana B. Hancock
View author publications
You can also search for this author in PubMed Google Scholar
Joshua L. Levy
View author publications
You can also search for this author in PubMed Google Scholar
Nathan C. Gaddis
View author publications
You can also search for this author in PubMed Google Scholar
Nancy L. Saccone
View author publications
You can also search for this author in PubMed Google Scholar
Laura J. Bierut
View author publications
You can also search for this author in PubMed Google Scholar
Grier P. Page
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eric O. Johnson.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 1099 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Johnson, E.O., Hancock, D.B., Levy, J.L. et al. Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy. Hum Genet 132, 509–522 (2013). https://doi.org/10.1007/s00439-013-1266-7

Download citation

Received: 18 July 2012
Accepted: 07 January 2013
Published: 22 January 2013
Issue Date: May 2013
DOI: https://doi.org/10.1007/s00439-013-1266-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy

Abstract

Access this article

Similar content being viewed by others

Benefits of Accurate Imputations in GWAS

GAWMerge expands GWAS sample size and diversity by combining array-based genotyping and whole-genome sequencing

Efficient phasing and imputation of low-coverage sequencing data using large reference panels

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (DOC 1099 kb)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy

Abstract

Access this article

Similar content being viewed by others

Benefits of Accurate Imputations in GWAS

GAWMerge expands GWAS sample size and diversity by combining array-based genotyping and whole-genome sequencing

Efficient phasing and imputation of low-coverage sequencing data using large reference panels

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (DOC 1099 kb)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation