Abstract
The accurate reconstruction of phylogenies from short molecular sequences is an important problem in computational biology. Recent work has highlighted deep connections between the sequence-length requirement for high-probability phylogeny reconstruction and the related problem of estimating ancestral sequences. In Daskalakis et al. (Probab. Theory Relat. Fields, 2010), building on the work of Mossel (Trans. Am. Math. Soc. 356(6):2379–2404, 2004), a tight sequence-length requirement was obtained for the simple CFN model of substitution, that is, the case of a two-state symmetric rate matrix Q. In particular, the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(log n) to poly(n), where n is the number of leaves) at the “critical” branch length g_ML(Q) (if it exists) of the ancestral reconstruction problem, defined roughly as follows: below g_ML(Q), the sequence at the root can be accurately estimated from the sequences at the leaves on deep trees, whereas above g_ML(Q), information decays exponentially quickly down the tree.
Here, we consider a more general evolutionary model, the GTR model, where the q×q rate matrix Q is reversible with q≥2. For this model, recent results of Roch (Preprint, 2009) show that the tree can be accurately reconstructed with sequences of length O(log n) when the branch lengths are below g_Lin(Q), known as the Kesten–Stigum (KS) bound, up to which ancestral sequences can be accurately estimated using simple linear estimators. Although for the CFN model g_ML(Q)=g_Lin(Q) (in other words, linear ancestral estimators are in some sense best possible), it is known that for more general GTR models one has g_ML(Q)≥g_Lin(Q), with strict inequality in many cases. First, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm that recovers the tree from O(log n)-length sequences for some branch lengths in the range (g_Lin(Q), g_ML(Q)). Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence length for branch lengths above g_ML(Q).
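To make the Kesten–Stigum (KS) bound concrete, here is a minimal Python sketch of how the threshold branch length can be computed. It rests on assumptions that the abstract does not state: the tree is binary, every branch has the same length g, and the KS condition takes its standard form b·θ(g)² = 1 with θ(g) = exp(−λg), where b is the branching number and λ is the spectral gap of Q. The function names and the rate normalization are hypothetical and chosen only for illustration; the paper may normalize Q differently.

```python
import math


def symmetric_spectral_gap(q: int, rate: float = 1.0) -> float:
    """Spectral gap of a q-state model whose off-diagonal rates all equal `rate`.

    Such a Q equals rate * (J - q*I), with J the all-ones matrix, so its
    eigenvalues are 0 (once) and -q*rate (q - 1 times).
    """
    if q < 2 or rate <= 0:
        raise ValueError("need q >= 2 and rate > 0")
    return q * rate


def channel_eigenvalue(spectral_gap: float, g: float) -> float:
    """Second eigenvalue theta(g) = exp(-gap * g) of the transition matrix exp(g*Q)."""
    return math.exp(-spectral_gap * g)


def ks_threshold(spectral_gap: float, branching: int = 2) -> float:
    """Branch length g solving branching * theta(g)**2 == 1 (assumed KS condition)."""
    if spectral_gap <= 0 or branching < 2:
        raise ValueError("need spectral_gap > 0 and branching >= 2")
    return math.log(branching) / (2.0 * spectral_gap)


if __name__ == "__main__":
    # q = 2 with unit rates is a CFN-type model under this (assumed) normalization.
    for q in (2, 4):
        gap = symmetric_spectral_gap(q)
        g = ks_threshold(gap)
        theta = channel_eigenvalue(gap, g)
        print(f"q={q}: gap={gap:.3f}, threshold g={g:.5f}, 2*theta^2={2 * theta ** 2:.5f}")
```

Under these assumptions, the two-state case with unit rates has spectral gap 2, so the threshold is ln(2)/4. For models with unequal rates, the spectral gap would have to come from an eigenvalue computation on Q rather than from the closed form used here.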
References
Bleher, P. M., Ruiz, J., & Zagrebnov, V. A. (1995). On the purity of the limiting Gibbs state for the Ising model on the Bethe lattice. J. Stat. Phys., 79(1–2), 473–482.
Borgs, C., Chayes, J. T., Mossel, E., & Roch, S. (2006). The Kesten–Stigum reconstruction bound is tight for roughly symmetric binary channels. In FOCS (pp. 518–530).
Daskalakis, C., Mossel, E., & Roch, S. (2010). Evolutionary trees and the Ising model on the Bethe lattice: a proof of Steel’s conjecture. Probab. Theory Relat. Fields. doi:10.1007/s00440-009-0246-2
Erdős, P. L., Steel, M. A., Székely, L. A., & Warnow, T. A. (1999). A few logs suffice to build (almost) all trees (part 1). Random Struct. Algorithms, 14(2), 153–184.
Evans, W. S., Kenyon, C., Peres, Y., & Schulman, L. J. (2000). Broadcasting on trees and the Ising model. Ann. Appl. Probab., 10(2), 410–433.
Felsenstein, J. (2004). Inferring phylogenies. Sunderland: Sinauer.
Ioffe, D. (1996). On the extremality of the disordered state for the Ising model on the Bethe lattice. Lett. Math. Phys., 37(2), 137–143.
Janson, S., & Mossel, E. (2004). Robust reconstruction on trees is determined by the second eigenvalue. Ann. Probab., 32, 2630–2649.
Kesten, H., & Stigum, B. P. (1967). Limit theorems for decomposable multi-dimensional Galton–Watson processes. J. Math. Anal. Appl., 17, 309–338.
Mossel, E. (1998). Recursive reconstruction on periodic trees. Random Struct. Algorithms, 13(1), 81–97.
Mossel, E. (2001). Reconstruction on trees: beating the second eigenvalue. Ann. Appl. Probab., 11(1), 285–300.
Mossel, E. (2003). On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol., 10(5), 669–678.
Mossel, E. (2004). Phase transitions in phylogeny. Trans. Am. Math. Soc., 356(6), 2379–2404.
Mossel, E., & Peres, Y. (2003). Information flow on trees. Ann. Appl. Probab., 13(3), 817–844.
Mossel, E., & Steel, M. (2005). How much can evolved characters tell us about the tree that generated them? In O. Gascuel (Ed.), Mathematics of evolution and phylogeny (pp. 384–412). Oxford: Oxford University Press.
Peres, Y., & Roch, S. (2009). Reconstruction on trees: Exponential moment bounds for linear estimators. Preprint.
Roch, S. (2008). Sequence-length requirement for distance-based phylogeny reconstruction: Breaking the polynomial barrier. In FOCS (pp. 729–738).
Roch, S. (2009). Phase transition in distance-based phylogeny reconstruction. doi:10.1126/science.1182300.
Semple, C., & Steel, M. (2003). Mathematics and its applications series: Vol. 22. Phylogenetics. Oxford: Oxford University Press.
Sly, A. (2009). Reconstruction for the Potts model. In M. Mitzenmacher (Ed.), STOC (pp. 581–590). New York: ACM.
Steel, M. (2001). My favourite conjecture. Preprint.
Steel, M. A., & Székely, L. A. (2002). Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discrete Math., 15(4), 562–575 (electronic).
Additional information
E. Mossel was supported by NSF Career Award (DMS 054829), ONR award N00014-07-1-0506, ISF grant 1300/08, and Marie Curie grant PIRG04-GA-2008-239317.
S. Roch was supported by NSF grant DMS-1007144.
Rights and permissions
Open Access. This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Mossel, E., Roch, S. & Sly, A. On the Inference of Large Phylogenies with Long Branches: How Long Is Too Long?. Bull Math Biol 73, 1627–1644 (2011). https://doi.org/10.1007/s11538-010-9584-6
DOI: https://doi.org/10.1007/s11538-010-9584-6