Volume 66, Issue 2, pp 419–449

Fast Phylogeny Reconstruction Through Learning of Ancestral Sequences

  • Radu Mihaescu
  • Cameron Hill
  • Satish Rao


Given natural limitations on the length of DNA sequences, designing phylogenetic reconstruction methods that are reliable under limited information is a crucial endeavor. There have been two approaches to this problem: reconstructing partial but reliable information about the tree (Mossel in IEEE Comput. Biol. Bioinform. 4:108–116, 2007; Daskalakis et al. in SIAM J. Discrete Math. 25:872–893, 2011; Daskalakis et al. in Proc. of RECOMB 2006, pp. 281–295, 2006; Gronau et al. in Proc. of the 19th Annual SODA 2008, pp. 379–388, 2008), and reaching “deeper” in the tree through reconstruction of ancestral sequences. In the latter category, Daskalakis et al. (Proc. of the 38th Annual STOC, pp. 159–168, 2006) settled an important conjecture of M. Steel (My favourite conjecture. Preprint, 2001), showing that, under the CFN model of evolution, all trees on n leaves with edge lengths bounded by the Ising model phase transition can be recovered with high probability from genomes of length O(log n) with a polynomial-time algorithm. Their methods had a running time of O(n^10).
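For readers unfamiliar with the CFN model, the following Python sketch may help fix ideas. It assumes the standard formulation: two-state sequences over {+1, -1}, each site flipping independently along an edge with probability p < 1/2, and additive edge length d = -(1/2) ln(1 - 2p). Under that convention the Ising (Kesten-Stigum) phase transition sits at 2(1 - 2p)^2 = 1, that is, d* = ln(2)/4. The function names (`cfn_distance`, `evolve`, `estimate_distance`) are illustrative and do not come from the paper.

```python
import math
import random

def cfn_distance(p):
    """Additive CFN edge length for a per-site flip probability p in [0, 1/2)."""
    return -0.5 * math.log(1.0 - 2.0 * p)

# Phase-transition threshold, assuming the convention 2 * (1 - 2p)^2 = 1.
P_STAR = (1.0 - 1.0 / math.sqrt(2.0)) / 2.0   # critical flip probability
D_STAR = math.log(2.0) / 4.0                  # critical edge length

def evolve(seq, p, rng):
    """Copy a +/-1 sequence along one edge, flipping each site with probability p."""
    return [-s if rng.random() < p else s for s in seq]

def estimate_distance(a, b):
    """Estimate the CFN distance between two sequences from their empirical correlation."""
    corr = sum(x * y for x, y in zip(a, b)) / len(a)
    if corr <= 0.0:
        return math.inf   # sequences look uncorrelated; the distance is unreliable
    return -0.5 * math.log(corr)

if __name__ == "__main__":
    rng = random.Random(0)
    root = [rng.choice((-1, 1)) for _ in range(20000)]
    child = evolve(root, 0.1, rng)
    print("true distance     :", cfn_distance(0.1))
    print("estimated distance:", estimate_distance(root, child))
    print("critical length   :", D_STAR)
```

The estimator degrades as the correlation approaches zero, which is one way to see why long paths in the tree cannot be measured reliably from short sequences.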

Here we enhance our methods from Daskalakis et al. (Proc. of RECOMB 2006, pp. 281–295, 2006) with the learning of ancestral sequences and provide an algorithm for reconstructing a sub-forest of the tree which is reliable given available data, without requiring a priori known bounds on the edge lengths of the tree. Our methods are based on an intuitive minimum spanning tree approach and run in O(n^3) time. For the case of full reconstruction of trees with edges under the phase transition, we maintain the same asymptotic sequence length requirements as in Daskalakis et al. (Proc. of the 38th Annual STOC, pp. 159–168, 2006), despite the considerably faster running time.
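The abstract does not spell out the reconstruction procedure, so the following is only a generic illustration of the minimum spanning tree primitive it mentions, not the authors' algorithm. It runs Prim's algorithm on a dense matrix of pairwise distance estimates in O(n^2) time. The function name `minimum_spanning_tree` and the toy distance matrix are assumptions made for this sketch.

```python
import math

def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric dense distance matrix.

    Returns a list of (parent, child) index pairs. Nodes reachable only
    through infinite distances are left unattached.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n    # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax the connection cost of the remaining nodes through u.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

if __name__ == "__main__":
    # Toy additive metric: four taxa at positions 0, 1, 3, 6 on a line.
    pos = [0, 1, 3, 6]
    dist = [[abs(a - b) for b in pos] for a in pos]
    print(minimum_spanning_tree(dist))
```

In a phylogenetic setting the matrix entries would be distance estimates between sequences, and entries judged unreliable could be set to infinity so that the corresponding taxa stay in separate components.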


Phylogenetic reconstruction; Ising model; Phase transitions; Phylogenetic forests; Information flow; Ancestral sequence reconstruction



We thank Elchanan Mossel for invaluable discussions regarding reconstruction of ancestral sequences. Thanks to Costis Daskalakis, Elchanan Mossel and Sebastien Roch for pointing out errors in a preliminary version of the paper.

We also thank an anonymous reviewer of an earlier version for pointing out most of the analysis in Appendix C.

Radu Mihaescu was supported by a National Science Foundation Graduate Fellowship, by the Fannie and John Hertz Foundation graduate fellowship and by the CIPRES project. All other authors were supported by CIPRES.


  1. Cavender, J.: Taxonomy with confidence. Math. Biosci. 40, 271–280 (1978)
  2. Daskalakis, C., Hill, C., Jaffe, A., Mihaescu, R., Mossel, E., Rao, S.: Maximal accurate forest from distance matrices. In: Proceedings of RECOMB 2006, vol. 3909, pp. 281–295. Springer, Berlin (2006)
  3. Daskalakis, C., Mossel, E., Roch, S.: Optimal phylogenetic reconstruction. In: Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing (STOC 2006), pp. 159–168 (2006)
  4. Daskalakis, C., Mossel, E., Roch, S.: Phylogenies without branch bounds: contracting the short, pruning the deep. SIAM J. Discrete Math. 25(2), 872–893 (2011)
  5. Erdos, P.L., Steel, M., Szekely, L., Warnow, T.: A few logs suffice to build (almost) all trees (I). Random Struct. Algorithms 14, 153–184 (1997)
  6. Erdos, P.L., Steel, M.A., Szekely, L.A., Warnow, T.J.: A few logs suffice to build (almost) all trees (II). Theor. Comput. Sci. 221(1–2), 77–118 (1999)
  7. Farris, J.S.: A probability model for inferring evolutionary trees. Syst. Zool. 22, 250–256 (1973)
  8. Fischer, M., Steel, M.: Sequence length bounds for resolving a deep phylogenetic divergence. J. Theor. Biol. 256, 247–252 (2008)
  9. Gronau, I., Moran, S., Snir, S.: Fast and reliable reconstruction of phylogenetic trees with very short edges. In: Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2008), pp. 379–388 (2008)
  10. Kimura, M.: Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. 78(1), 454–458 (1981)
  11. Mossel, E.: On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 10(5), 669–678 (2003)
  12. Mossel, E.: Phase transitions in phylogeny. Trans. Am. Math. Soc. 356(6), 2379–2404 (2004)
  13. Mossel, E.: Distorted metrics on trees and phylogenetic forests. IEEE Comput. Biol. Bioinform. 4, 108–116 (2007)
  14. Roch, S.: Sequence length requirement of distance-based phylogeny reconstruction: breaking the polynomial barrier. In: Proceedings of the 49th IEEE Symposium on Foundations of Computer Science (FOCS 2008), pp. 729–738 (2008)
  15. Semple, C., Steel, M.: Phylogenetics. Mathematics and Its Applications, vol. 22. Oxford University Press, Oxford (2003)
  16. Steel, M.: My favourite conjecture. Preprint (2001)

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Dept. of Computer Science, UC Berkeley, Berkeley, USA
  2. Dept. of Mathematics, UC Berkeley, Berkeley, USA
