Probability Theory and Related Fields

, Volume 169, Issue 1–2, pp 3–62 | Cite as

Phase transition in the sample complexity of likelihood-based phylogeny inference

  • Sebastien RochEmail author
  • Allan Sly


Reconstructing evolutionary trees from molecular sequence data is a fundamental problem in computational biology. Stochastic models of sequence evolution are closely related to spin systems that have been extensively studied in statistical physics and that connection has led to important insights on the theoretical properties of phylogenetic reconstruction algorithms as well as the development of new inference methods. Here, we study maximum likelihood, a classical statistical technique which is perhaps the most widely used in phylogenetic practice because of its superior empirical accuracy. At the theoretical level, except for its consistency, that is, the guarantee of eventual correct reconstruction as the size of the input data grows, much remains to be understood about the statistical properties of maximum likelihood in this context. In particular, the best bounds on the sample complexity or sequence-length requirement of maximum likelihood, that is, the amount of data required for correct reconstruction, are exponential in the number, n, of tips—far from known lower bounds based on information-theoretic arguments. Here we close the gap by proving a new upper bound on the sequence-length requirement of maximum likelihood that matches up to constants the known lower bound for some standard models of evolution. More specifically, for the r-state symmetric model of sequence evolution on a binary phylogeny with bounded edge lengths, we show that the sequence-length requirement behaves logarithmically in n when the expected amount of mutation per edge is below what is known as the Kesten-Stigum threshold. In general, the sequence-length requirement is polynomial in n. Our results imply moreover that the maximum likelihood estimator can be computed efficiently on randomly generated data provided sequences are as above. Our main technical contribution, which may be of independent interest, relates the total variation distance between the leaf state distributions of two trees with a notion of combinatorial distance between the trees. In words we show in a precise quantitative manner that the more different two evolutionary trees are, the easier it is to distinguish their output.


Phylogenetic reconstruction Maximum likelihood Sequence-length requirement 

Mathematics Subject Classification

60J20 92D15 



We thank the anonymous reviewers of a previous version for helpful comments.


  1. 1.
    Allen, B.L., Steel, M.: Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 1, 1–15 (2001)MathSciNetCrossRefzbMATHGoogle Scholar
  2. 2.
    Andoni, A., Daskalakis, C., Hassidim, A., Roch, S.: Global alignment of molecular sequences via ancestral state reconstruction. Stoch. Process. Appl. 122(12), 3852–3874 (2012)Google Scholar
  3. 3.
    Borgs, C., Chayes, J., Mossel, E., Roch, S.: The Kesten-Stigum reconstruction bound is tight for roughly symmetric binary channels. In: FOCS, pp. 518–530 (2006)Google Scholar
  4. 4.
    Brown, D.G., Truszkowski, J.: Fast phylogenetic tree reconstruction using locality-sensitive hashing. In: Algorithms in Bioinformatics, pp 14–29. Springer (2012)Google Scholar
  5. 5.
    Cavender, J.A.: Taxonomy with confidence. Math. Biosci. 40(3–4), 271–280 (1978)MathSciNetCrossRefzbMATHGoogle Scholar
  6. 6.
    Cryan, M., Goldberg, L.A., Goldberg, P.W.: Evolutionary trees can be learned in polynomial time. SIAM J. Comput. 31(2), 375–397 (2002). Short version In: Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS 98), pp. 436–445 (1998)Google Scholar
  7. 7.
    Chang, J.T.: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137(1), 51–73 (1996)MathSciNetCrossRefzbMATHGoogle Scholar
  8. 8.
    Chor, B., Tuller, T.: Finding a maximum likelihood tree is hard. J. ACM 53(5), 722–744 (2006)MathSciNetCrossRefzbMATHGoogle Scholar
  9. 9.
    Choi, M.J., Tan, V.Y., Anandkumar, A., Willsky, A.S.: Learning latent tree graphical models. J. Mach. Learn. Res. 12, 1771–1812 (2011)MathSciNetzbMATHGoogle Scholar
  10. 10.
    Daskalakis, C., Mossel, E., Roch, S.: Evolutionary trees and the ising model on the Bethe lattice: a proof of Steel’s conjecture. Probab. Theory Relat. Fields 149, 149–189 (2011). doi: 10.1007/s00440-009-0246-2 MathSciNetCrossRefzbMATHGoogle Scholar
  11. 11.
    Daskalakis, C., Mossel, E., Roch, S.: Phylogenies without branch bounds: contracting the short, pruning the deep. SIAM J. Discret. Math. 25(2), 872–893 (2011)MathSciNetCrossRefzbMATHGoogle Scholar
  12. 12.
    Daskalakis, C., Roch, S.: Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann. Appl. Probab. 23(2), 693–721 (2013)MathSciNetCrossRefzbMATHGoogle Scholar
  13. 13.
    Deonier, R.C., Tavaré, S., Waterman, M.S.: Computational Genome Analysis: An Introduction. Springer, New York (2005)zbMATHGoogle Scholar
  14. 14.
    Evans, W.S., Kenyon, C., Peres, Y., Schulman, L.J.: Broadcasting on trees and the Ising model. Ann. Appl. Probab. 10(2), 410–433 (2000)MathSciNetCrossRefzbMATHGoogle Scholar
  15. 15.
    Erdös, P.L., Steel, M.A., Székely, L.A., Warnow, T.A.: A few logs suffice to build (almost) all trees (part 1). Random Struct. Algorithms 14(2), 153–184 (1999)CrossRefzbMATHGoogle Scholar
  16. 16.
    Erdös, P.L., Steel, M.A., Székely, L.A., Warnow, T.A.: A few logs suffice to build (almost) all trees (part 2). Theor. Comput. Sci. 221, 77–118 (1999)CrossRefzbMATHGoogle Scholar
  17. 17.
    Farris, J.S.: A probability model for inferring evolutionary trees. Syst. Zool. 22(4), 250–256 (1973)MathSciNetCrossRefGoogle Scholar
  18. 18.
    Felsenstein, J.: Evolutionary trees from dna sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)CrossRefGoogle Scholar
  19. 19.
    Felsenstein, J.: Inferring Phylogenies. Sinauer, Sunderland (2004)Google Scholar
  20. 20.
    Georgii, H.O.: Gibbs Measures and Phase Transitions, Volume 9 of de Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin (1988)CrossRefGoogle Scholar
  21. 21.
    Guindon, S., Lethiec, F., Duroux, P., Gascuel, O.: PHYML online web server for fast maximum likelihood-based phylogenetic inference. Nucl. Acids Res. 33(suppl 2), W557–W559 (2005)CrossRefGoogle Scholar
  22. 22.
    Gronau, I., Moran, S., Snir, S.: Fast and reliable reconstruction of phylogenetic trees with indistinguishable edges. Random Struct. Algorithms 40(3), 350–384 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  23. 23.
    Grimmett, G.: The Random-Cluster Model, Volume 333 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Berlin (2006)Google Scholar
  24. 24.
    Huson, D.H., Nettles, S.H., Warnow, T.J.: Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol. 6(3–4), 369–386 (1999)CrossRefGoogle Scholar
  25. 25.
    Ioffe, D.: On the extremality of the disordered state for the Ising model on the Bethe lattice. Lett. Math. Phys. 37(2), 137–143 (1996)MathSciNetCrossRefzbMATHGoogle Scholar
  26. 26.
    Jukes, T.H., Cantor, C.: Mammalian protein metabolism. In: Munro, H.N. (ed.) Evolution of Protein Molecules, pp. 21–132. Academic Press, Cambridge (1969)Google Scholar
  27. 27.
    Janson, S., Mossel, E.: Robust reconstruction on trees is determined by the second eigenvalue. Ann. Probab. 32, 2630–2649 (2004)MathSciNetCrossRefzbMATHGoogle Scholar
  28. 28.
    Kesten, H., Stigum, B.P.: Additional limit theorems for indecomposable multidimensional Galton-Watson processes. Ann. Math. Stat. 37, 1463–1481 (1966)MathSciNetCrossRefzbMATHGoogle Scholar
  29. 29.
    Lacey, M.R., Chang, J.T.: A signal-to-noise analysis of phylogeny estimation by neighbor-joining: insufficiency of polynomial length sequences. Math. Biosci. 199(2), 188–215 (2006)MathSciNetCrossRefzbMATHGoogle Scholar
  30. 30.
    Liggett, T.M.: Interacting Particle Systems, Volume 276 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, New York (1985)Google Scholar
  31. 31.
    Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses (Springer Texts in Statistics), 3rd edn. Springer, New York (2005)Google Scholar
  32. 32.
    Mihaescu, R., Hill, C., Rao, S.: Fast phylogeny reconstruction through learning of ancestral sequences. Algorithmica 66(2), 419–449 (2013)MathSciNetCrossRefzbMATHGoogle Scholar
  33. 33.
    Mossel, E.: Reconstruction on trees: beating the second eigenvalue. Ann. Appl. Probab. 11(1), 285–300 (2001)MathSciNetCrossRefzbMATHGoogle Scholar
  34. 34.
    Mossel, E.: On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 10(5), 669–678 (2003)CrossRefGoogle Scholar
  35. 35.
    Mossel, E.: Phase transitions in phylogeny. Trans. Am. Math. Soc. 356(6), 2379–2404 (2004)MathSciNetCrossRefzbMATHGoogle Scholar
  36. 36.
    Mossel, E.: Survey: information flow on trees. In: Nestril, J., Winkler, P. (eds.) Graphs, Morphisms and Statistical Physics, pp. 155–170. American Mathematical Society, Providence (2004)CrossRefGoogle Scholar
  37. 37.
    Mossel, E.: Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Biol. Bioinform. 4(1), 108–116 (2007)CrossRefGoogle Scholar
  38. 38.
    Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995)CrossRefzbMATHGoogle Scholar
  39. 39.
    Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 16(2), 583–614 (2006)MathSciNetCrossRefzbMATHGoogle Scholar
  40. 40.
    Mossel, E., Roch, S.: Phylogenetic mixtures: concentration of measure in the large-tree limit. Ann. Appl. Probab. 22(6), 2429–2459 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
  41. 41.
    Mossel, E., Roch, S.: Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies. J. Math. Biol. 67(4), 767–797 (2013)MathSciNetCrossRefzbMATHGoogle Scholar
  42. 42.
    Mossel, E., Roch, S., Sly, A.: On the inference of large phylogenies with long branches: How long is too long? Bull. Math. Biol. 73, 1627–1644 (2011). doi: 10.1007/s11538-010-9584-6 MathSciNetCrossRefzbMATHGoogle Scholar
  43. 43.
    Neyman, J.: Molecular studies of evolution: a source of novel statistical problems. In: Gupta, S.S., Yackel, J. (eds.) Statistical Desicion Theory and Related Topics, pp. 1–27. Academic Press, New York (1971)Google Scholar
  44. 44.
    Peres, Y.: Probability on trees: an introductory climb. In: Lectures on Probability Theory and Statistics (Saint-Flour, 1997). Lecture Notes in Math, vol. 1717, pp. 193–280. Springer, Berlin (1999)Google Scholar
  45. 45.
    Roch, S.: A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinform. 3(1), 92–94 (2006)CrossRefGoogle Scholar
  46. 46.
    Roch, S.: Sequence length requirement of distance-based phylogeny reconstruction: breaking the polynomial barrier. In: FOCS, pp. 729–738 (2008)Google Scholar
  47. 47.
    Roch, S.: Toward extracting all phylogenetic information from matrices of evolutionary distances. Science 327(5971), 1376–1379 (2010)MathSciNetCrossRefzbMATHGoogle Scholar
  48. 48.
    Sly, A.: Reconstruction for the potts model. In: STOC, pp. 581–590 (2009)Google Scholar
  49. 49.
    Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)Google Scholar
  50. 50.
    Steel, M.A., Székely, L.A.: Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discret. Math. 15(4), 562–575 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
  51. 51.
    Semple, C., Steel, M.: Phylogenetics, Volume 22 of Mathematics and Its Applications Series. Oxford University Press, Oxford (2003)zbMATHGoogle Scholar
  52. 52.
    Steel, M.A., Székely, L.A.: On the variational distance of two trees. Ann. Appl. Probab. 16(3), 1563–1575 (2006)MathSciNetCrossRefzbMATHGoogle Scholar
  53. 53.
    Smith, S.A., Stamatakis, A.: Inferring and postprocessing huge phylogenies. In: Elloumi, M., Zomaya, A.Y. (eds.) Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data. Wiley, Hoboken (2013). doi: 10.1002/9781118617151.ch46
  54. 54.
    Stamatakis, A.: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21), 2688–2690 (2006)CrossRefGoogle Scholar
  55. 55.
    Steel, M.: Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7(2), 19–23 (1994)MathSciNetCrossRefzbMATHGoogle Scholar
  56. 56.
    Steel, M.: My Favourite Conjecture (2001) (unpublished)Google Scholar
  57. 57.
    Steel, M.: Phylogeny—Discrete and Random Processes in Evolution, Volume 89 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2016)Google Scholar
  58. 58.
    Tan, V.Y.F., Anandkumar, A., Tong, L., Willsky, A.S.: A large-deviation analysis of the maximum-likelihood learning of Markov tree structures. IEEE Trans. Inform. Theory 57(3), 1714–1735 (2011)MathSciNetCrossRefzbMATHGoogle Scholar
  59. 59.
    Tan, V.Y.F., Anandkumar, A., Willsky, A.S.: Learning high-dimensional markov forest distributions. J. Mach. Learn. Res. 12, 1617–1653 (2011)MathSciNetzbMATHGoogle Scholar
  60. 60.
    Wald, A.: Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20, 595–601 (1949)MathSciNetCrossRefzbMATHGoogle Scholar
  61. 61.
    Warnow, T.: Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. To be published by Cambridge University Press, Cambridge (2017)Google Scholar

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. 1.Department of MathematicsUW–MadisonMadisonUSA
  2. 2.Department of MathematicsPrinceton University PrincetonUSA

Personalised recommendations