Abstract
It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.
Accession codes
Acknowledgements
We thank J. Pool and C. Langley for helpful comments on our inferred Drosophila demography. We also thank H. Li for providing us with the Ust'-Ishim genome sequence. This research is supported in part by NIH grants R01 GM094402 and R01 GM108805 and by a Packard Fellowship for Science and Engineering (Y.S.S.).
Author information
Authors and Affiliations
Contributions
J.T., J.A.K. and Y.S.S. conceived the study, developed the theoretical model and wrote the manuscript. J.T. developed software implementing the method and performed data analysis. J.A.K. contributed benchmarks of ∂a∂i.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Results of demographic inference when ρ is not known.
Each step plot represents inference on a single simulated data set with sample size n = 50. The color of each estimated size history indicates the ratio of recombination to mutation rates used in that simulation, which was not known to SMC++ during model fitting. The ratio ranged from 1:10 (black) to 10:1 (light blue). The true demography used for simulation is shown in bold black. The inset scatterplot compares the true and estimated ratios of recombination to mutation rates. The mutation rate θ/2 was assumed to be known. SMC++ estimates the recombination rate fairly accurately across two orders of magnitude relative to the mutation rate and is most accurate when the two rates are approximately equal.
Supplementary Figure 5 Sensitivity analysis for human demographic inference.
Blue lines are reproduced from Figure 5. Red lines represent the result of randomly downsampling the data to contain 90% of the original set of chromosomes and rerunning the analysis.
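The downsampling step described in this caption can be illustrated with a minimal sketch. It is not the authors' procedure: the function name, the seed handling and the sample labels are hypothetical, and the caption does not say how the 90% subset was drawn beyond it being random. The sketch treats each chromosome as an opaque label and keeps a random 90% of them in their original order.

```python
import random

def downsample_chromosomes(chromosome_ids, fraction=0.9, seed=0):
    """Keep a random `fraction` of the labels, preserving input order.

    Hypothetical helper for illustration; not part of SMC++.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_keep = round(fraction * len(chromosome_ids))
    kept_positions = set(rng.sample(range(len(chromosome_ids)), n_keep))
    return [c for i, c in enumerate(chromosome_ids) if i in kept_positions]

# Usage with made-up labels: 20 chromosomes in, 18 out.
labels = [f"hap{i}" for i in range(20)]
subset = downsample_chromosomes(labels, fraction=0.9, seed=1)
```

Rerunning the analysis on `subset` and comparing the result with the full-data estimate is the comparison the red and blue lines represent.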
Supplementary Figure 7 Schematic of the differences between PSMC, MSMC and SMC++.
The HMM used in PSMC tracks the hidden TMRCA of a pair of haploid lineages and emits binary symbols based on the heterozygosity of this pair at each block of sites. MSMC tracks the hidden time to first coalescence among several haploid lineages, as well as the identity (denoted by the bolded bars) of the two lineages that coalesce first. It considers as emissions the allelic state of all lineages in the sample. SMC++, like PSMC, tracks the TMRCA in only a pair of individuals and emits 2-tuples whose distribution is given by the conditioned SFS (Section S1).
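The PSMC-style structure in this caption (a hidden pairwise TMRCA with binary heterozygosity emissions) can be made concrete with a toy sketch. It is not the PSMC or SMC++ implementation, and it rests on several assumptions: the TMRCA is discretized into a few fixed values, emissions are per site rather than per block, a site is heterozygous with probability 1 − exp(−θ·t), and the transition matrix is an arbitrary placeholder supplied by the caller rather than one derived from a coalescent model with recombination. All names are hypothetical.

```python
import math

def emission_probs(tmrca, theta):
    """Return (P(homozygous), P(heterozygous)) at one site given the TMRCA.

    Assumes mutations hit the pair as a Poisson process with rate
    theta per unit of TMRCA (a parameterization chosen for this sketch).
    """
    p_het = 1.0 - math.exp(-theta * tmrca)
    return (1.0 - p_het, p_het)

def forward_loglik(obs, times, init, trans, theta):
    """Log-likelihood of obs (0 = hom, 1 = het) via the scaled forward algorithm.

    times: discretized TMRCA value for each hidden state
    init:  initial state probabilities
    trans: trans[i][j] = P(next state j | current state i); placeholder values
    """
    emis = [emission_probs(t, theta) for t in times]
    k = len(times)
    alpha = [init[i] * emis[i][obs[0]] for i in range(k)]
    loglik = 0.0
    for x in obs[1:]:
        scale = sum(alpha)
        loglik += math.log(scale)          # accumulate the normalizer
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(k)) * emis[j][x]
            for j in range(k)
        ]
    return loglik + math.log(sum(alpha))

# Usage with made-up numbers: two TMRCA states, a short observation sequence.
ll = forward_loglik([0, 0, 1, 0], times=[0.5, 2.0], init=[0.6, 0.4],
                    trans=[[0.9, 0.1], [0.2, 0.8]], theta=0.3)
```

In the schematic's terms, MSMC and SMC++ differ from this toy in what the hidden state and the emissions represent, not in the forward recursion itself.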
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–7, Supplementary Tables 1–3 and Supplementary Note (PDF 1875 kb)
About this article
Cite this article
Terhorst, J., Kamm, J. & Song, Y. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet 49, 303–309 (2017). https://doi.org/10.1038/ng.3748
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/ng.3748
- Springer Nature America, Inc.