Abstract
Evaluating the likelihood function of parameters in highly structured population genetic models from extant deoxyribonucleic acid (DNA) sequences is computationally prohibitive. In such cases, one may approximately infer the parameters from summary statistics of the data, such as the site frequency spectrum (SFS) or its linear combinations. Such methods are known as approximate likelihood or approximate Bayesian computations. Using a controlled lumped Markov chain and computational commutative algebraic methods, we compute the exact likelihood of the SFS, and of many classical linear combinations of it, at a non-recombining locus that is evolving neutrally under the infinitely-many-sites mutation model. Using a partially ordered graph of coalescent experiments around the SFS, we provide a decision-theoretic framework for approximate sufficiency. We also extend a family of classical SFS-based hypothesis tests of standard neutrality at a non-recombining locus to a more powerful version that conditions on the topological information provided by the SFS.
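To make the summary statistics concrete, the following minimal Python sketch computes the unfolded SFS from a small sample and three classical linear combinations of it: the number of segregating sites, the mean number of pairwise differences, and Watterson's estimator. It illustrates the definitions only and is not the exact-likelihood computation described above. The function names and the toy data are hypothetical, and the sketch assumes binary data with 0 for the ancestral and 1 for the derived allele.

```python
from math import comb


def site_frequency_spectrum(haplotypes):
    """Unfolded SFS of n aligned binary sequences (assumed 0 = ancestral, 1 = derived).

    Entry i-1 of the result counts the sites at which exactly i of the
    n sequences carry the derived allele, for i = 1, ..., n-1.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for column in zip(*haplotypes):   # iterate over sites
        i = sum(column)               # number of derived alleles at this site
        if 0 < i < n:                 # skip sites that do not segregate
            sfs[i - 1] += 1
    return sfs


def segregating_sites(sfs):
    """Number of segregating sites: the sum of all SFS entries."""
    return sum(sfs)


def mean_pairwise_differences(sfs):
    """Average number of differences between two sequences,
    written as a linear combination of the SFS entries."""
    n = len(sfs) + 1
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)


def watterson_theta(sfs):
    """Watterson's estimator: segregating sites divided by the (n-1)-th harmonic number."""
    n = len(sfs) + 1
    a_n = sum(1 / k for k in range(1, n))
    return sum(sfs) / a_n


# Hypothetical toy sample: 4 sequences, 5 sites.
haplotypes = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
sfs = site_frequency_spectrum(haplotypes)
print(sfs)                             # [2, 2, 0]
print(segregating_sites(sfs))          # 4
print(mean_pairwise_differences(sfs))  # 14/6
print(watterson_theta(sfs))            # 24/11
```

In this toy sample two sites are singletons and two are doubletons, so the SFS is (2, 2, 0). Each of the three statistics is a fixed weighted sum of these entries, which is why they carry no more information than the SFS itself.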
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Sainudiin, R., Thornton, K., Harlow, J. et al. Experiments with the Site Frequency Spectrum. Bull Math Biol 73, 829–872 (2011). https://doi.org/10.1007/s11538-010-9605-5
DOI: https://doi.org/10.1007/s11538-010-9605-5