Abstract
Although increasingly large reference panels for genome-wide imputation have recently become available, the degree to which imputation accuracy can be enhanced by population-specific reference panels remains an open question. Here, we sequenced at full depth (≥ 30×), across two platforms (Illumina X Ten and Complete Genomics, Inc.), a moderately large (n = 738) cohort of samples drawn from the Ashkenazi Jewish population. We developed a series of quality control (QC) steps to optimize sensitivity, specificity, and comprehensiveness of variant calls in the reference panel, and then tested the accuracy of imputation against target cohorts drawn from the same population. We identified QC thresholds for the Illumina X Ten platform that permitted highly accurate calling of single nucleotide variants across 94% of the genome. QC procedures also identified numerous regions that are poorly mapped using current reference or alternate assemblies. After stringent QC, the population-specific reference panel produced more accurate and comprehensive imputation results than publicly available, large cosmopolitan reference panels, especially in the range of rare variants that may be most critical to further progress in mapping of complex phenotypes. The population-specific reference panel also permitted enhanced filtering of clinically irrelevant variants from personal genomes.
Acknowledgements
The authors are extremely grateful to Soren Germer, Ph.D. and his team at the New York Genome Center for performing the Illumina sequencing. We acknowledge financial support from the Human Frontier Science Program (SC); NIH research grants AG042188 (GA), DK62429, DK062422, DK092235 (JHC), NS050487, NS060113 (LNC), AG021654, AG027734 (NB), MH089964, MH095458, MH084098 (TL), and CA121852 (computational infrastructure, IPe’er); NSF research grants 08929882 and 0845677 (IPe’er); Rachel and Lewis Rudin Foundation (HE); Northwell Health Foundation (TL); Brain & Behavior Foundation (TL); US-Israel Binational Science Foundation (TL, AD); LUNGevity Foundation (ZHG); New York Crohn’s Disease Foundation (IPeter); Edwin & Caroline Levy and Joseph & Carol Reich (SB); the Parkinson’s Disease Foundation (LNC); the Sharon Levine Corzine Cancer Research Fund (KO); and the Andrew Sabin Family Research Fund (KO).
Author information
Contributions
TL and IP led the analysis and the writing of the manuscript. JY, CP, and SC conducted the primary analyses. TL led the funding of the study. TL, AD, GA, DB, NB, and LNC provided samples and conducted lab work. TL, IP, NB, SB, AD, JHC, LNC, ZHG, VJ, RK, SL, KO, HO, LJO, IP, and GA initiated and designed the study, and provided funding.
Ethics declarations
Conflict of interest
The authors declare no competing financial interests.
Accession codes
Whole genome sequence data have been deposited at the European Genome-phenome Archive (EGA, http://www.ebi.ac.uk/ega/), which is hosted by the EBI, under accession code EGAS00001000664. Genotype data for target samples are available at the database of Genotypes and Phenotypes (dbGaP, https://www.ncbi.nlm.nih.gov/gap) under accession number phs000448.v1.p1.
About this article
Cite this article
Lencz, T., Yu, J., Palmer, C. et al. High-depth whole genome sequencing of an Ashkenazi Jewish reference panel: enhancing sensitivity, accuracy, and imputation. Hum Genet 137, 343–355 (2018). https://doi.org/10.1007/s00439-018-1886-z