Phase transition in the sample complexity of likelihood-based phylogeny inference

Probability Theory and Related Fields

Abstract

Reconstructing evolutionary trees from molecular sequence data is a fundamental problem in computational biology. Stochastic models of sequence evolution are closely related to spin systems that have been extensively studied in statistical physics and that connection has led to important insights on the theoretical properties of phylogenetic reconstruction algorithms as well as the development of new inference methods. Here, we study maximum likelihood, a classical statistical technique which is perhaps the most widely used in phylogenetic practice because of its superior empirical accuracy. At the theoretical level, except for its consistency, that is, the guarantee of eventual correct reconstruction as the size of the input data grows, much remains to be understood about the statistical properties of maximum likelihood in this context. In particular, the best bounds on the sample complexity or sequence-length requirement of maximum likelihood, that is, the amount of data required for correct reconstruction, are exponential in the number, n, of tips—far from known lower bounds based on information-theoretic arguments. Here we close the gap by proving a new upper bound on the sequence-length requirement of maximum likelihood that matches up to constants the known lower bound for some standard models of evolution. More specifically, for the r-state symmetric model of sequence evolution on a binary phylogeny with bounded edge lengths, we show that the sequence-length requirement behaves logarithmically in n when the expected amount of mutation per edge is below what is known as the Kesten-Stigum threshold. In general, the sequence-length requirement is polynomial in n. Our results imply moreover that the maximum likelihood estimator can be computed efficiently on randomly generated data provided sequences are as above. 
Our main technical contribution, which may be of independent interest, relates the total variation distance between the leaf state distributions of two trees with a notion of combinatorial distance between the trees. In words, we show, in a precise quantitative manner, that the more different two evolutionary trees are, the easier it is to distinguish their output.

References

  1. Allen, B.L., Steel, M.: Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 1, 1–15 (2001)

  2. Andoni, A., Daskalakis, C., Hassidim, A., Roch, S.: Global alignment of molecular sequences via ancestral state reconstruction. Stoch. Process. Appl. 122(12), 3852–3874 (2012)

  3. Borgs, C., Chayes, J., Mossel, E., Roch, S.: The Kesten-Stigum reconstruction bound is tight for roughly symmetric binary channels. In: FOCS, pp. 518–530 (2006)

  4. Brown, D.G., Truszkowski, J.: Fast phylogenetic tree reconstruction using locality-sensitive hashing. In: Algorithms in Bioinformatics, pp. 14–29. Springer (2012)

  5. Cavender, J.A.: Taxonomy with confidence. Math. Biosci. 40(3–4), 271–280 (1978)

  6. Cryan, M., Goldberg, L.A., Goldberg, P.W.: Evolutionary trees can be learned in polynomial time. SIAM J. Comput. 31(2), 375–397 (2002). Short version in: Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS 98), pp. 436–445 (1998)

  7. Chang, J.T.: Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137(1), 51–73 (1996)

  8. Chor, B., Tuller, T.: Finding a maximum likelihood tree is hard. J. ACM 53(5), 722–744 (2006)

  9. Choi, M.J., Tan, V.Y., Anandkumar, A., Willsky, A.S.: Learning latent tree graphical models. J. Mach. Learn. Res. 12, 1771–1812 (2011)

  10. Daskalakis, C., Mossel, E., Roch, S.: Evolutionary trees and the Ising model on the Bethe lattice: a proof of Steel’s conjecture. Probab. Theory Relat. Fields 149, 149–189 (2011). doi:10.1007/s00440-009-0246-2

  11. Daskalakis, C., Mossel, E., Roch, S.: Phylogenies without branch bounds: contracting the short, pruning the deep. SIAM J. Discret. Math. 25(2), 872–893 (2011)

  12. Daskalakis, C., Roch, S.: Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann. Appl. Probab. 23(2), 693–721 (2013)

  13. Deonier, R.C., Tavaré, S., Waterman, M.S.: Computational Genome Analysis: An Introduction. Springer, New York (2005)

  14. Evans, W.S., Kenyon, C., Peres, Y., Schulman, L.J.: Broadcasting on trees and the Ising model. Ann. Appl. Probab. 10(2), 410–433 (2000)

  15. Erdős, P.L., Steel, M.A., Székely, L.A., Warnow, T.A.: A few logs suffice to build (almost) all trees (part 1). Random Struct. Algorithms 14(2), 153–184 (1999)

  16. Erdős, P.L., Steel, M.A., Székely, L.A., Warnow, T.A.: A few logs suffice to build (almost) all trees (part 2). Theor. Comput. Sci. 221, 77–118 (1999)

  17. Farris, J.S.: A probability model for inferring evolutionary trees. Syst. Zool. 22(4), 250–256 (1973)

  18. Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)

  19. Felsenstein, J.: Inferring Phylogenies. Sinauer, Sunderland (2004)

  20. Georgii, H.O.: Gibbs Measures and Phase Transitions, Volume 9 of de Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin (1988)

  21. Guindon, S., Lethiec, F., Duroux, P., Gascuel, O.: PHYML online web server for fast maximum likelihood-based phylogenetic inference. Nucl. Acids Res. 33(suppl 2), W557–W559 (2005)

  22. Gronau, I., Moran, S., Snir, S.: Fast and reliable reconstruction of phylogenetic trees with indistinguishable edges. Random Struct. Algorithms 40(3), 350–384 (2012)

  23. Grimmett, G.: The Random-Cluster Model, Volume 333 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Berlin (2006)

  24. Huson, D.H., Nettles, S.H., Warnow, T.J.: Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol. 6(3–4), 369–386 (1999)

  25. Ioffe, D.: On the extremality of the disordered state for the Ising model on the Bethe lattice. Lett. Math. Phys. 37(2), 137–143 (1996)

  26. Jukes, T.H., Cantor, C.: Evolution of protein molecules. In: Munro, H.N. (ed.) Mammalian Protein Metabolism, pp. 21–132. Academic Press, Cambridge (1969)

  27. Janson, S., Mossel, E.: Robust reconstruction on trees is determined by the second eigenvalue. Ann. Probab. 32, 2630–2649 (2004)

  28. Kesten, H., Stigum, B.P.: Additional limit theorems for indecomposable multidimensional Galton-Watson processes. Ann. Math. Stat. 37, 1463–1481 (1966)

  29. Lacey, M.R., Chang, J.T.: A signal-to-noise analysis of phylogeny estimation by neighbor-joining: insufficiency of polynomial length sequences. Math. Biosci. 199(2), 188–215 (2006)

  30. Liggett, T.M.: Interacting Particle Systems, Volume 276 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, New York (1985)

  31. Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses (Springer Texts in Statistics), 3rd edn. Springer, New York (2005)

  32. Mihaescu, R., Hill, C., Rao, S.: Fast phylogeny reconstruction through learning of ancestral sequences. Algorithmica 66(2), 419–449 (2013)

  33. Mossel, E.: Reconstruction on trees: beating the second eigenvalue. Ann. Appl. Probab. 11(1), 285–300 (2001)

  34. Mossel, E.: On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 10(5), 669–678 (2003)

  35. Mossel, E.: Phase transitions in phylogeny. Trans. Am. Math. Soc. 356(6), 2379–2404 (2004)

  36. Mossel, E.: Survey: information flow on trees. In: Nešetřil, J., Winkler, P. (eds.) Graphs, Morphisms and Statistical Physics, pp. 155–170. American Mathematical Society, Providence (2004)

  37. Mossel, E.: Distorted metrics on trees and phylogenetic forests. IEEE/ACM Trans. Comput. Biol. Bioinform. 4(1), 108–116 (2007)

  38. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995)

  39. Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 16(2), 583–614 (2006)

  40. Mossel, E., Roch, S.: Phylogenetic mixtures: concentration of measure in the large-tree limit. Ann. Appl. Probab. 22(6), 2429–2459 (2012)

  41. Mossel, E., Roch, S.: Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies. J. Math. Biol. 67(4), 767–797 (2013)

  42. Mossel, E., Roch, S., Sly, A.: On the inference of large phylogenies with long branches: How long is too long? Bull. Math. Biol. 73, 1627–1644 (2011). doi:10.1007/s11538-010-9584-6

  43. Neyman, J.: Molecular studies of evolution: a source of novel statistical problems. In: Gupta, S.S., Yackel, J. (eds.) Statistical Decision Theory and Related Topics, pp. 1–27. Academic Press, New York (1971)

  44. Peres, Y.: Probability on trees: an introductory climb. In: Lectures on Probability Theory and Statistics (Saint-Flour, 1997). Lecture Notes in Math, vol. 1717, pp. 193–280. Springer, Berlin (1999)

  45. Roch, S.: A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinform. 3(1), 92–94 (2006)

  46. Roch, S.: Sequence length requirement of distance-based phylogeny reconstruction: breaking the polynomial barrier. In: FOCS, pp. 729–738 (2008)

  47. Roch, S.: Toward extracting all phylogenetic information from matrices of evolutionary distances. Science 327(5971), 1376–1379 (2010)

  48. Sly, A.: Reconstruction for the Potts model. In: STOC, pp. 581–590 (2009)

  49. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)

  50. Steel, M.A., Székely, L.A.: Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discret. Math. 15(4), 562–575 (2002)

  51. Semple, C., Steel, M.: Phylogenetics, Volume 22 of Mathematics and Its Applications Series. Oxford University Press, Oxford (2003)

  52. Steel, M.A., Székely, L.A.: On the variational distance of two trees. Ann. Appl. Probab. 16(3), 1563–1575 (2006)

  53. Smith, S.A., Stamatakis, A.: Inferring and postprocessing huge phylogenies. In: Elloumi, M., Zomaya, A.Y. (eds.) Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data. Wiley, Hoboken (2013). doi:10.1002/9781118617151.ch46

  54. Stamatakis, A.: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21), 2688–2690 (2006)

  55. Steel, M.: Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7(2), 19–23 (1994)

  56. Steel, M.: My Favourite Conjecture (2001) (unpublished)

  57. Steel, M.: Phylogeny: Discrete and Random Processes in Evolution, Volume 89 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2016)

  58. Tan, V.Y.F., Anandkumar, A., Tong, L., Willsky, A.S.: A large-deviation analysis of the maximum-likelihood learning of Markov tree structures. IEEE Trans. Inform. Theory 57(3), 1714–1735 (2011)

  59. Tan, V.Y.F., Anandkumar, A., Willsky, A.S.: Learning high-dimensional Markov forest distributions. J. Mach. Learn. Res. 12, 1617–1653 (2011)

  60. Wald, A.: Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20, 595–601 (1949)

  61. Warnow, T.: Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. To be published by Cambridge University Press, Cambridge (2017)

Acknowledgements

We thank the anonymous reviewers of a previous version for helpful comments.

Author information

Corresponding author

Correspondence to Sebastien Roch.

Additional information

2016 Wolfgang Doeblin Prize Article.

Sebastien Roch: Work partly done at Microsoft Research, UCLA, IPAM and the Simons Institute for the Theory of Computing. Work supported by NSF Grants DMS-1007144 and DMS-1149312 (CAREER), and an Alfred P. Sloan Research Fellowship.

Allan Sly: Work partly done at Microsoft Research. Work supported by NSF Grants DMS-1208339 and DMS-1352013 and an Alfred P. Sloan Research Fellowship.

Preliminary lemmas

In this section, we collect a few useful lemmas.

1.1 Ancestral reconstruction

An important part of our construction involves reconstructing ancestral states. We will use the following lemma from [14], which we typically apply to a rooted subtree. Let \(T = (V,E;\phi ;w) \in \mathbb {Y}\) be rooted at \(\rho \). Let \(e = (x,y) \in E\) and assume that x is the endpoint closer to \(\rho \) (in topological distance). We define \(\mathrm {P}(\rho ,e) = \mathrm {P}(\rho ,y)\), \(|e|_\rho = |\mathrm {P}(\rho ,e)|\), and

$$\begin{aligned} R_\rho (e) = \left( 1 - \theta _e^2\right) \Theta _{\rho ,y}^{-2}, \end{aligned}$$
(60)

where \(\Theta _{\rho ,y} = e^{-\mathrm {d}_T(\rho ,y)}\) and \(\theta _e = e^{-w_e}\).

Lemma 2

(Ancestral reconstruction [14]) For any unit flow \(\Psi \) from \(\rho \) to [n],

$$\begin{aligned} \mathbb {E}_T\left| \mathbb {P}_T[\sigma _\rho = +1|\sigma _X] - \mathbb {P}_T[\sigma _\rho = -1|\sigma _X] \right| \ge \frac{1}{1+ \sum _{e \in E} R_\rho (e) \Psi (e)^2}, \end{aligned}$$
(61)

where the left-hand side is the difference between the probabilities of correct and incorrect reconstruction of \(\sigma _\rho \) by maximum likelihood. (See [14, Equation (14), Lemma 5.1 and Theorem 1.2'].)
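To illustrate how the right-hand side of (61) is evaluated, the following minimal sketch computes it for a complete binary tree of a given depth with all edge weights equal to w, using the unit flow that splits evenly at every vertex. The tree shape, the choice of flow, and the function name `flow_lower_bound` are illustrative assumptions, not part of the lemma or of [14].

```python
import math

def flow_lower_bound(depth, w):
    """Right-hand side of (61) for a complete binary tree of the given depth,
    all edge weights equal to w, and the unit flow that splits evenly at
    every vertex (an illustrative choice, not prescribed by the lemma)."""
    theta = math.exp(-w)                  # theta_e = e^{-w_e}
    total = 0.0
    for k in range(1, depth + 1):         # edges whose lower endpoint y is at depth k
        n_edges = 2 ** k                  # number of such edges
        flow = 2.0 ** (-k)                # Psi(e) under even splitting
        big_theta = math.exp(-w * k)      # Theta_{rho,y} = e^{-d_T(rho,y)}
        r = (1.0 - theta ** 2) * big_theta ** (-2)   # R_rho(e), Eq. (60)
        total += n_edges * r * flow ** 2
    return 1.0 / (1.0 + total)

for depth in (1, 5, 20, 100):
    print(depth, flow_lower_bound(depth, 0.1), flow_lower_bound(depth, 1.0))
```

For this particular flow the sum in the denominator is the geometric series \((1-\theta ^2)\sum _{k=1}^{h} (2\theta ^2)^{-k}\), so the bound stays bounded away from zero uniformly in the depth h exactly when \(2\theta ^2 > 1\), and otherwise decays to zero as h grows.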

1.2 Random cluster representation

We use a convenient percolation-based representation of the CFN model known as the random cluster model (see e.g. [23]). Let \(T = (V,E;\phi ;w) \in \mathbb {Y}\) with corresponding \((\delta _e)_{e\in E}\).

Lemma 3

(Random cluster representation) Run a percolation process on T where each edge e is open independently with probability \(1 - 2 \delta _e\). Then assign to each open connected component, independently, a state drawn uniformly from \(\{+1,-1\}\). The resulting state vector \((\sigma _v)_{v\in V}\) on the vertices has the same distribution as under the corresponding CFN model.
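The equality in distribution stated in Lemma 3 can be checked by exhaustive enumeration on a small tree. The sketch below assumes that \(\delta _e\) is the probability that the state flips across edge e and that the root state is uniform; the encoding of the tree as a `parent` list (with parents listed before children), the five-vertex example, and both function names are hypothetical choices made for illustration.

```python
from itertools import product

def cfn_distribution(parent, delta):
    """Exact joint law of (sigma_v) under the CFN model, assuming a uniform
    root state (vertex 0) and a flip across edge (parent[v], v) with
    probability delta[v]."""
    n = len(parent)
    dist = {}
    for sigma in product((+1, -1), repeat=n):
        p = 0.5                                   # uniform state at the root
        for v in range(1, n):
            flipped = sigma[v] != sigma[parent[v]]
            p *= delta[v] if flipped else 1.0 - delta[v]
        dist[sigma] = p
    return dist

def random_cluster_distribution(parent, delta):
    """Exact joint law produced by the percolation construction of Lemma 3."""
    n = len(parent)
    dist = {sigma: 0.0 for sigma in product((+1, -1), repeat=n)}
    for open_edges in product((True, False), repeat=n - 1):
        # Probability of this open/closed configuration of the edges.
        p_conf = 1.0
        for v in range(1, n):
            p_open = 1.0 - 2.0 * delta[v]
            p_conf *= p_open if open_edges[v - 1] else 1.0 - p_open
        # Label each vertex by the topmost vertex of its open cluster.
        cluster = [0] * n
        for v in range(1, n):                     # parents precede children
            cluster[v] = cluster[parent[v]] if open_edges[v - 1] else v
        roots = sorted(set(cluster))
        # Each cluster receives an independent uniform state.
        for states in product((+1, -1), repeat=len(roots)):
            state_of = dict(zip(roots, states))
            sigma = tuple(state_of[cluster[v]] for v in range(n))
            dist[sigma] += p_conf * 0.5 ** len(roots)
    return dist

# Hypothetical five-vertex tree: vertex 0 is the root, entry v gives the parent of v.
parent = [None, 0, 0, 1, 1]
delta = [None, 0.1, 0.2, 0.05, 0.3]               # delta[v] belongs to edge (parent[v], v)
a = cfn_distribution(parent, delta)
b = random_cluster_distribution(parent, delta)
print(max(abs(a[s] - b[s]) for s in a))           # largest discrepancy between the two laws
```

The agreement reflects a one-edge computation: an edge is closed with probability \(2\delta _e\), and two distinct clusters disagree with probability 1/2, so the endpoints of e differ with probability \(\delta _e\).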

1.3 Concentration inequalities

Recall the following standard concentration inequality (see e.g. [38]):

Lemma 4

(Azuma-Hoeffding Inequality) Suppose \(\mathbf{Z}=(Z_1,\ldots ,Z_m)\) are independent random variables taking values in a set S, and \(h:S^m \rightarrow \mathbb {R}\) is any t-Lipschitz function: \(|h(\mathbf{z}) - h(\mathbf{z'})|\le t\) whenever \(\mathbf{z}, \mathbf{z'} \in S^m\) differ in exactly one coordinate. Then, for all \(\zeta > 0\),

$$\begin{aligned} \mathbb {P}\left[ |h(\mathbf{Z}) - \mathbb {E}[h(\mathbf{Z})]| \ge \zeta \right] \le 2\exp \left( -\frac{\zeta ^2 }{2 t^2 m}\right) . \end{aligned}$$
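As a sanity check on the form of the bound in Lemma 4, the sketch below compares it with an exact tail probability in the simplest setting: \(h(\mathbf{Z}) = Z_1 + \cdots + Z_m\) with independent fair \(\{0,1\}\)-valued coordinates, so that h is 1-Lipschitz and \(\mathbb {E}[h(\mathbf{Z})] = m/2\). The example and the function names are illustrative assumptions; the code requires Python 3.8 or later for `math.comb`.

```python
import math

def exact_two_sided_tail(m, zeta):
    """P[|S - m/2| >= zeta] for S ~ Binomial(m, 1/2), computed exactly."""
    hits = sum(math.comb(m, k) for k in range(m + 1) if abs(k - m / 2) >= zeta)
    return hits / 2 ** m

def lemma4_bound(m, zeta, t=1.0):
    """Right-hand side of Lemma 4 for a t-Lipschitz function of m variables."""
    return 2.0 * math.exp(-zeta ** 2 / (2.0 * t ** 2 * m))

for m, zeta in [(10, 3), (50, 10), (200, 30)]:
    print(m, zeta, exact_two_sided_tail(m, zeta), lemma4_bound(m, zeta))
```

In each case the exact probability should fall below the bound; note that deviations \(\zeta \) of order \(t\sqrt{m}\) or smaller make the right-hand side of order one, so the inequality is informative only for larger deviations.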

Cite this article

Roch, S., Sly, A. Phase transition in the sample complexity of likelihood-based phylogeny inference. Probab. Theory Relat. Fields 169, 3–62 (2017). https://doi.org/10.1007/s00440-017-0793-x

  • DOI: https://doi.org/10.1007/s00440-017-0793-x
