Bulletin of Mathematical Biology, Volume 73, Issue 4, pp 829–872

Experiments with the Site Frequency Spectrum

  • Raazesh Sainudiin
  • Kevin Thornton
  • Jennifer Harlow
  • James Booth
  • Michael Stillman
  • Ruriko Yoshida
  • Robert Griffiths
  • Gil McVean
  • Peter Donnelly
Open Access
Original Article

Abstract

Evaluating the likelihood function of parameters in highly structured population genetic models from extant deoxyribonucleic acid (DNA) sequences is computationally prohibitive. In such cases, one may approximately infer the parameters from summary statistics of the data, such as the site frequency spectrum (SFS) or its linear combinations. Such methods are known as approximate likelihood or approximate Bayesian computations. Using a controlled lumped Markov chain and computational commutative algebraic methods, we compute the exact likelihood of the SFS and of many classical linear combinations of it at a non-recombining locus that is evolving neutrally under the infinitely-many-sites mutation model. Using a partially ordered graph of coalescent experiments around the SFS, we provide a decision-theoretic framework for approximate sufficiency. We also extend a family of classical SFS-based hypothesis tests of standard neutrality at a non-recombining locus to a more powerful version that conditions on the topological information provided by the SFS.
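
As a concrete illustration of the summary statistics named in the abstract, the following minimal Python sketch computes the (unfolded) SFS from a small binary alignment, together with three classical linear combinations of it: the number of segregating sites, the average pairwise diversity, and Watterson's estimator. It is not the authors' software. The function names, the 0/1 encoding (0 for the ancestral state, 1 for the derived state) and the toy data are assumptions made only for this example.

```python
from typing import List, Sequence


def site_frequency_spectrum(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Unfolded SFS: entry i-1 counts sites where exactly i of the n
    sampled sequences carry the derived state (coded 1)."""
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for column in zip(*haplotypes):  # iterate over sites
        derived = sum(column)
        if 0 < derived < n:  # keep segregating sites only
            sfs[derived - 1] += 1
    return sfs


def segregating_sites(sfs: Sequence[int]) -> int:
    """Number of segregating sites S: the sum of the SFS entries."""
    return sum(sfs)


def pairwise_diversity(sfs: Sequence[int]) -> float:
    """Average number of pairwise differences: a site with i derived
    copies separates i * (n - i) of the n * (n - 1) / 2 pairs."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / pairs


def watterson_theta(sfs: Sequence[int]) -> float:
    """Watterson's estimator: S divided by the harmonic sum 1 + 1/2 + ... + 1/(n-1)."""
    n = len(sfs) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


# Hypothetical toy sample: 4 sequences, 5 sites (rows are sequences).
example = [
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
example_sfs = site_frequency_spectrum(example)
print(example_sfs)                      # [3, 0, 1]
print(segregating_sites(example_sfs))   # 4
print(pairwise_diversity(example_sfs))  # 2.0
print(watterson_theta(example_sfs))     # 24/11, about 2.18
```

In the toy sample, three sites carry a single derived copy, one site carries three, and the monomorphic fourth site is ignored. Each statistic is a weighted sum of the SFS entries, which is what makes it a linear combination of the SFS.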

Keywords

Controlled lumped coalescent; Population genetic Markov bases

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Raazesh Sainudiin (1, 2, 4)
  • Kevin Thornton (3)
  • Jennifer Harlow (4)
  • James Booth (5)
  • Michael Stillman (6)
  • Ruriko Yoshida (7)
  • Robert Griffiths (8)
  • Gil McVean (8)
  • Peter Donnelly (8)
  1. Biomathematics Research Centre, Christchurch, New Zealand
  2. Chennai Mathematical Institute, Siruseri, India
  3. Department of Ecology and Evolutionary Biology, University of California, Irvine, USA
  4. Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
  5. Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA
  6. Department of Mathematics, Cornell University, Ithaca, USA
  7. Department of Statistics, University of Kentucky, Lexington, USA
  8. Department of Statistics, University of Oxford, Oxford, UK
