Abstract
Phylogenetic mixture models are statistical models of character evolution allowing for heterogeneity. Each of the classes in some unknown partition of the characters may evolve by different processes, or even along different trees. Such models are of increasing interest for data analysis, as they can capture the variety of evolutionary processes that may be occurring across long sequences of DNA or proteins. The fundamental question of whether parameters of such a model are identifiable is difficult to address, due to the complexity of the parameterization. Identifiability is, however, essential to their use for statistical inference.
We analyze mixture models on large trees, with many mixture components, showing that both numerical and tree parameters are indeed identifiable in these models when all trees are the same. This provides a theoretical justification for some current empirical studies, and indicates that extensions to even more mixture components should be theoretically well behaved. We also extend our results to certain mixtures on different trees, using the same algebraic techniques.
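As an illustrative sketch of the setting (the notation here is assumed for exposition and is not taken from the abstract), a single-tree mixture with $r$ classes and the identifiability question can be written as follows. The symbols $T$, $n$, $\kappa$, $\pi_i$, $\theta_i$, and $p_{T,\theta_i}$ are hypothetical names for the tree, the number of taxa, the number of character states, the class weights, the numerical parameters of class $i$, and the resulting site-pattern distribution.

```latex
% Assumed notation: T is a tree on n taxa, characters take \kappa states,
% and class i has numerical parameters \theta_i and weight \pi_i.
\[
  p \;=\; \sum_{i=1}^{r} \pi_i \, p_{T,\theta_i},
  \qquad \pi_i > 0, \quad \sum_{i=1}^{r} \pi_i = 1,
\]
% where each p_{T,\theta_i} is a probability distribution on the
% \kappa^n possible site patterns at the leaves of T.
%
% Identifiability asks whether p determines its parameters:
\[
  \sum_{i=1}^{r} \pi_i \, p_{T,\theta_i}
  \;=\;
  \sum_{i=1}^{r} \pi'_i \, p_{T',\theta'_i}
  \;\Longrightarrow\;
  T = T' \ \text{and} \ (\pi_i,\theta_i) = (\pi'_{\sigma(i)},\theta'_{\sigma(i)})
  \ \text{for some permutation } \sigma .
\]
```

Results of this kind are usually stated with qualifications, for example holding only for generic parameter choices or only up to relabeling of hidden states at internal nodes; the precise statement depends on the model and is given in the article itself.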
Cite this article
Rhodes, J.A., Sullivant, S. Identifiability of Large Phylogenetic Mixture Models. Bull Math Biol 74, 212–231 (2012). https://doi.org/10.1007/s11538-011-9672-2
DOI: https://doi.org/10.1007/s11538-011-9672-2