On the Number of Non-equivalent Ancestral Configurations for Matching Gene Trees and Species Trees

Abstract

An ancestral configuration is one of the combinatorially distinct sets of gene lineages that, for a given gene tree, can reach a given node of a specified species tree. Ancestral configurations have appeared in recursive algebraic computations of the conditional probability that a gene tree topology is produced under the multispecies coalescent model for a given species tree. For matching gene trees and species trees, we study the number of ancestral configurations, considered up to an equivalence relation introduced by Wu (Evolution 66:763–775, 2012) to reduce the complexity of the recursive probability computation. We examine the largest number of non-equivalent ancestral configurations possible for a given tree size n. Whereas the smallest number of non-equivalent ancestral configurations increases polynomially with n, we show that the largest number increases with \(k^n\), where k is a constant that satisfies \(\root 3 \of {3}\,\le \,k\,<\,1.503\). Under a uniform distribution on the set of binary labeled trees with a given size n, the mean number of non-equivalent ancestral configurations grows exponentially with n. The results refine an earlier analysis of the number of ancestral configurations considered without applying the equivalence relation, showing that use of the equivalence relation does not alter the exponential nature of the increase with tree size.

References

  • Aho AV, Sloane NJA (1973) Some doubly exponential sequences. Fibonacci Q 11:429–437
  • Allman ES, Degnan JH, Rhodes JA (2011) Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol 62:833–862
  • Degnan JH, Salter LA (2005) Gene tree distributions under the coalescent process. Evolution 59:24–37
  • Disanto F, Rosenberg NA (2015) Coalescent histories for lodgepole species trees. J Comput Biol 22:918–929
  • Disanto F, Rosenberg NA (2016) Asymptotic properties of the number of matching coalescent histories for caterpillar-like families of species trees. IEEE/ACM Trans Comput Biol Bioinf 13:913–925
  • Disanto F, Rosenberg NA (2017) Enumeration of ancestral configurations for matching gene trees and species trees. J Comput Biol 24:831–850
  • Felsenstein J (1978) The number of evolutionary trees. Syst Zool 27:27–33
  • Felsenstein J (2004) Inferring phylogenies. Sinauer, Sunderland, MA
  • Flajolet P, Sedgewick R (2009) Analytic combinatorics. Cambridge University Press, Cambridge
  • Harding EF (1971) The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Prob 3:44–77
  • Rosenberg NA (2006) The mean and variance of the numbers of \(r\)-pronged nodes and \(r\)-caterpillars in Yule-generated genealogical trees. Ann Comb 10:129–146
  • Rosenberg NA (2007) Counting coalescent histories. J Comput Biol 14:360–377
  • Rosenberg NA (2013) Coalescent histories for caterpillar-like families. IEEE/ACM Trans Comput Biol Bioinf 10:1253–1262
  • Rosenberg NA, Degnan JH (2010) Coalescent histories for discordant gene trees and species trees. Theor Pop Biol 77:145–151
  • Sedgewick R, Flajolet P (1996) An introduction to the analysis of algorithms. Addison-Wesley, Boston
  • Than C, Ruths D, Innan H, Nakhleh L (2007) Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. J Comput Biol 14:517–535
  • Wu Y (2012) Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66:763–775

Acknowledgements

We thank Elizabeth Allman, James Degnan, and John Rhodes for discussions, and two reviewers for comments. Support was provided by National Institutes of Health grant R01 GM117590 and by a 2014 Rita Levi Montalcini grant to FD from the Ministero dell’Istruzione, dell’Università e della Ricerca.

Author information

Corresponding author

Correspondence to Filippo Disanto.

Appendices

Appendix 1: Proof of (9)

Let \(C^*(r_S) = \{\gamma _{S,1}, \ldots , \gamma _{S,q} \}\) with \(c^*(r_S)=q\), and let \(C^*(r_L) = \{\gamma _{L,1}, \ldots ,\gamma _{L,Q} \}\), with \(c^*(r_L) = Q\). Because condition (8) is satisfied, the entire tree \(t_{r_S}\) can be displayed in \(t_{r_L}\), each configuration \(\gamma _{S,i} \in C^*(r_S)\) has exactly one corresponding configuration \(\gamma _{L,i} \in C^*(r_L)\) such that \(t_{r_S}(\gamma _{S,i}) \cong t_{r_L}(\gamma _{L,i})\), and \(Q\,\ge \,q\).

From (6), we obtain

$$\begin{aligned} \tilde{C}(r)=\{ \{ r_{S},r_L \} \} \cup \big [ C^*(r_{S}) \otimes \{ \{r_L \} \} \big ] \cup \big [ \{ \{ r_{S}\} \} \otimes C^*(r_L) \big ] \cup \big [ C^*(r_{S}) \otimes C^*(r_L) \big ], \end{aligned}$$

which can be further decomposed as

$$\begin{aligned}
\tilde{C}(r) ={}& \{ \{ r_{S},r_L \} \} \cup \big[ \{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \{ \{r_L \} \} \big] \cup \big[ \{ \{ r_{S}\} \} \otimes \big[ \{\gamma _{L,1}, \ldots ,\gamma _{L,q} \} \cup \{\gamma _{L,q+1}, \ldots ,\gamma _{L,Q} \} \big] \big] \\
& \cup \big[ \{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \big[ \{\gamma _{L,1}, \ldots ,\gamma _{L,q} \} \cup \{\gamma _{L,q+1}, \ldots ,\gamma _{L,Q} \} \big] \big] \\
={}& \{ \{ r_{S},r_L \} \} && \text{(28)} \\
& \cup \big[ \{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \{ \{r_L \} \} \big] \cup \big[ \{ \{ r_{S}\} \} \otimes \{\gamma _{L,1}, \ldots ,\gamma _{L,q} \} \big] && \text{(29)} \\
& \cup \big[ \{ \{ r_{S}\} \} \otimes \{\gamma _{L,q+1}, \ldots ,\gamma _{L,Q} \} \big] && \text{(30)} \\
& \cup \big[ \{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \{\gamma _{L,1}, \ldots ,\gamma _{L,q} \} \big] && \text{(31)} \\
& \cup \big[ \{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \{\gamma _{L,q+1}, \ldots ,\gamma _{L,Q} \} \big]. && \text{(32)}
\end{aligned}$$

We merge equivalent configurations to obtain \(C^*(r)\) from \(\tilde{C}(r)\). From (29), we remove those in \(\{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \{ \{r_L \} \} \), as they are equivalent to those in \(\{ \{ r_{S}\} \} \otimes \{\gamma _{L,1}, \ldots ,\gamma _{L,q} \}\). Thus, we take only q among the 2q configurations in (29). Moreover, due to the equivalence \(\gamma _{S,i} \cup \gamma _{L,j} \sim _r \gamma _{S,j} \cup \gamma _{L,i}\), we take only those configurations of the form \(\gamma _{S,i} \cup \gamma _{L,j}\) with \(i\,\le \,j\) among those in \(\{\gamma _{S,1}, \ldots ,\gamma _{S,q} \} \otimes \{\gamma _{L,1}, \ldots ,\gamma _{L,q} \}\). Thus, among the \(q^2\) configurations in (31)—those with \(1\,\le \,i, j\,\le \,q\)—we take only \(q(q+1)/2\) non-equivalent ones. No equivalences are possible among configurations in (28), (30), and (32), and all are retained in \(C^*(r)\). From (28)–(32), we then have

$$\begin{aligned} c^*(r) = |C^*(r)| &= 1 + q + (Q-q) + \frac{q(q+1)}{2} + q(Q-q) \\ &= 1 + q + Q + qQ - \frac{q(q+1)}{2}. \end{aligned}$$

Replacing q by \(c^*(r_S)\) and Q by \(c^*(r_L)\) gives (9).
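As an informal check of the counting argument, the following Python sketch encodes each element of \(\tilde{C}(r)\) as a pair of indices and merges the pairs that the proof identifies as equivalent. The encoding (index 0 for \(\{r_S\}\) or \(\{r_L\}\), index i for \(\gamma _{S,i}\) or \(\gamma _{L,i}\)) and the function names are illustrative assumptions, not part of the original proof.

```python
def count_by_enumeration(q, Q):
    """Count equivalence classes in a toy encoding of C~(r), assuming q <= Q.

    A pair (a, b) stands for the union of a configuration of the smaller
    subtree (a = 0 means {r_S}, a = i means gamma_{S,i}) and one of the
    larger subtree (b = 0 means {r_L}, b = j means gamma_{L,j}).
    """
    classes = set()
    for a in range(q + 1):
        for b in range(Q + 1):
            if b <= q:
                # (a, b) and (b, a) are equivalent when both indices are <= q
                classes.add((min(a, b), max(a, b)))
            else:
                # gamma_{L,q+1}, ..., gamma_{L,Q} have no counterpart in C*(r_S)
                classes.add((a, b))
    return len(classes)


def count_by_formula(q, Q):
    """Closed form obtained at the end of the proof."""
    return 1 + q + Q + q * Q - q * (q + 1) // 2


# Compare the two counts on a small grid of values with q <= Q.
for q in range(6):
    for Q in range(q, 12):
        assert count_by_enumeration(q, Q) == count_by_formula(q, Q)
```

For instance, with \(q=Q=1\) both functions return 3: the classes are represented by the pairs (0, 0), (0, 1), and (1, 1).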

Appendix 2: Proof of (12)

The proof follows the approach of Aho and Sloane (1973, Sect. 3) for solving certain recurrences. From (11), we have \(x_{h+1} = x_h^2 [1 + 1/(2x_h) + 1/(2x_h^2) ]\). Taking the logarithm \(y_h = \log x_h\) yields \(y_{h+1} = 2y_h + \alpha _h\), where \(\alpha _h = \log [1+ {1}/{(2x_h)} + {1}/{(2x_h^2)}]\). Following Aho and Sloane (1973), \(y_h\) has solution

$$\begin{aligned} y_h = 2^h y_0 + \sum _{i=0}^{\infty } 2^{h-i-1}\alpha _i - \sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i = 2^{h}\bigg (y_0 + \sum _{i=0}^{\infty } 2^{-i-1}\alpha _i \bigg ) - \sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i. \end{aligned}$$
(33)

Converting back to \(x_h = \exp (y_h)\), from (33) we have

$$\begin{aligned} x_h= & {} \bigg [ x_0 \exp \bigg (\sum _{i=0}^{\infty } 2^{-i-1}\alpha _i \bigg ) \bigg ]^{(2^h)} \exp \bigg ( - \sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i \bigg ) \\= & {} (k_0^*)^{(2^h)} \exp \bigg ( - \sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i \bigg ), \end{aligned}$$

where the last step uses the fact that \(x_0=1/2\).

We then have

$$\begin{aligned} \frac{x_h}{(k_0^*)^{(2^h)}}= \exp \bigg ( - \sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i \bigg ). \end{aligned}$$

As \(h \rightarrow \infty \), the sum \(\sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i\) converges to zero. Indeed, \(x_h\) increases with h, so that \(0 \le \alpha _i \le \alpha _h\) for \(i \ge h\), and the sum can be bounded by \(0 \le \sum _{i=h}^{\infty } 2^{h-i-1}\alpha _i\,\le \,\alpha _h \sum _{i=h}^{\infty } 2^{h-i-1} = \alpha _h\). Because \(x_h \rightarrow \infty \) as \(h \rightarrow \infty \), we have \(\alpha _h \rightarrow 0\). It follows that \(x_h/(k_0^*)^{(2^h)}\) converges to 1, producing (12).
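The convergence can also be observed numerically. The sketch below iterates the recurrence for \(x_h\) stated at the start of this appendix, starting from \(x_0=1/2\), approximates \(k_0^*\) by truncating the series in its definition, and evaluates the ratio \(x_h/(k_0^*)^{(2^h)}\) in log space. The truncation at ten steps and all function and variable names are assumptions made for illustration; the result is a floating-point approximation, not an exact value.

```python
import math


def x_sequence(steps, x0=0.5):
    """Iterate x_{h+1} = x_h^2 * (1 + 1/(2 x_h) + 1/(2 x_h^2)) from x_0."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        # Expanded form of the recurrence: x^2 + x/2 + 1/2
        xs.append(x * x + x / 2 + 0.5)
    return xs


def alpha(x):
    """alpha_h = log(1 + 1/(2 x_h) + 1/(2 x_h^2))."""
    return math.log1p(1 / (2 * x) + 1 / (2 * x * x))


# Ten steps keep every x_h (and its square) within double-precision range.
xs = x_sequence(10)

# Truncated series for log k_0^*; the omitted tail is far below machine precision.
log_k = math.log(xs[0]) + sum(2.0 ** (-i - 1) * alpha(x) for i, x in enumerate(xs))
k0_star = math.exp(log_k)

# Ratio x_h / (k_0^*)^(2^h), computed in log space to avoid overflow.
ratios = [math.exp(math.log(x) - 2 ** h * log_k) for h, x in enumerate(xs)]

for h, ratio in enumerate(ratios):
    print(h, ratio)
```

The printed ratios increase toward 1 as h grows, in agreement with the limit established above.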

Appendix 3: Properties of \(w'(n)\)

We prove that for each \(n\ge 2\), \(w'(n)\,\le \,n/2\), with equality only for \(n=2\), 4, or 6. For \(2\,\le \,n\,\le \,7\), the result is verified by direct computation of \(w'(n)\). For \(n\,\ge \,8\), by definition, \(w'(n)=\lfloor x \rfloor \), where x satisfies \(2^{x-2}+x=n-1\). Seeking a contradiction, suppose \(\lfloor x \rfloor = w'(n)\,\ge \,n/2\). Because \(x\,\ge \,\lfloor x \rfloor \), we would have \(x\,\ge \,n/2\), and therefore \(n-1=2^{x-2}+x\,\ge \,2^{n/2-2} + n/2 \ge 2(n/2 - 2) + n/2 = 3n/2-4\), noting that \(2^u\,\ge \,2u\) for \(u \ge 2\) and applying this bound with \(u = n/2-2 \ge 2\). The inequality \(n-1\,\ge \,3n/2-4\) is equivalent to \(n\,\le \,6\), and it therefore cannot hold if \(n\,\ge \,8\). Hence, when \(n\,\ge \,8\), we must have \(w'(n) < n/2\).
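The strict inequality for \(n \ge 8\) can be spot-checked numerically. The sketch below is a hypothetical helper, not the authors' code: it uses the definition \(w'(n)=\lfloor x \rfloor \) with \(2^{x-2}+x=n-1\) and relies on the fact that \(2^{x-2}+x\) is increasing in x, so that \(\lfloor x \rfloor \) is the largest integer w with \(2^{w-2}+w \le n-1\). Values of \(w'(n)\) for \(n < 8\) are not covered because their definition is not restated in this appendix.

```python
def w_prime(n):
    """w'(n) for n >= 8: the floor of the solution x of 2^(x-2) + x = n - 1.

    Because 2^(x-2) + x is increasing in x, the floor equals the largest
    integer w with 2^(w-2) + w <= n - 1, so integer arithmetic suffices.
    """
    if n < 8:
        raise ValueError("this sketch covers only n >= 8")
    w = 2  # 2^(2-2) + 2 = 3 <= n - 1 holds for every n >= 8
    # Advance while the next integer w + 1 still satisfies the inequality.
    while 2 ** (w - 1) + (w + 1) <= n - 1:
        w += 1
    return w


# Spot check of the strict inequality w'(n) < n/2 on a finite range.
assert all(2 * w_prime(n) < n for n in range(8, 5000))
```

A finite check of this kind does not replace the proof; it only illustrates the statement on the tested range.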

Appendix 4: Proof that Trees in \(T_{n,w}\) Satisfy (8) for \(w\,\ge \,2\)

We first prove that given any \(w\ge 2\), a caterpillar tree \(t_1\) of size \(|t_1| = w\) can be displayed in any tree \(t_2\) of size \(|t_2| \ge 2^{w-2}+1\) through a root configuration \(\gamma \) of \(t_2\), that is, \(t_1 \cong t_2(\gamma )\). The proof is by induction on w.

For \(w=2\), we have \(|t_2|\,\ge \,2\) and the result follows by taking the root configuration \(\gamma \) determined by the left and right descendants of the root in \(t_2\). For the inductive step, because \(|t_2|\,\ge \,2^{w-2}+1\), the larger root subtree of \(t_2\) has size at least \(\lceil |t_2|/2 \rceil \,\ge \,\lceil 2^{w-3}+1/2 \rceil = 2^{w-3} + 1 \). By the inductive hypothesis, the larger root subtree of \(t_2\) can display a caterpillar of size \(w-1\) through a root configuration \(\gamma '\). Taking the root configuration \(\gamma \) of \(t_2\) obtained as \(\gamma = \gamma ' \cup \{ \rho \}\), where \(\rho \) is the root of the smaller root subtree of \(t_2\), we have \(t_1 \cong t_2(\gamma )\) as desired.

Now suppose we are given a tree \(t \in T_{n,w}\), with \(2\,\le \,w \le w'(n)\). The smaller root subtree \(t_{r_S}\) of t is by definition a caterpillar of size \(w\,\ge \,2\), and the larger root subtree \(t_{r_L}\) has size \(|t_{r_L}| = n-w\). By definition, \(w\,\le \,w'(n) = \lfloor x \rfloor \,\le \,x\), where \(x = n - 2^{x-2} -1\), and therefore, \(w\,\le \,n - 2^{w-2} - 1\). In particular, \(|t_{r_L}| = n-w \ge 2^{w-2}+1\). From what we have shown above, a root configuration \(\gamma \) of \(t_{r_L}\) exists such that \(t_{r_S} \cong t_{r_L}(\gamma )\).
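The inductive construction at the start of this appendix can be sketched in Python. The representation is an assumption made for illustration: a tree is a nested 2-tuple, any non-tuple value is a leaf, and a root configuration is returned as the list of subtrees whose roots belong to \(\gamma \). The helper `balanced` is likewise hypothetical and serves only to build example trees.

```python
def size(t):
    """Number of leaves of a tree given as nested 2-tuples."""
    return 1 if not isinstance(t, tuple) else size(t[0]) + size(t[1])


def caterpillar_configuration(t, w):
    """Return w subtrees of t whose roots form the configuration gamma.

    Follows the induction: for w = 2 take the two children of the root;
    otherwise recurse into the larger root subtree with w - 1 and add the
    root of the smaller root subtree.
    """
    if not isinstance(t, tuple):
        raise ValueError("tree too small for the requested caterpillar size")
    left, right = t
    if w == 2:
        return [left, right]
    small, large = sorted((left, right), key=size)
    return caterpillar_configuration(large, w - 1) + [small]


def balanced(lo, hi):
    """Example tree on leaves lo, ..., hi - 1, split as evenly as possible."""
    if hi - lo == 1:
        return lo
    mid = (lo + hi) // 2
    return (balanced(lo, mid), balanced(mid, hi))


# Trees of the minimal size 2^(w-2) + 1 allowed by the statement.
for w in range(2, 8):
    tree = balanced(0, 2 ** (w - 2) + 1)
    gamma = caterpillar_configuration(tree, w)
    assert len(gamma) == w
    assert sum(size(s) for s in gamma) == size(tree)
```

In this sketch each recursive call places exactly one child in the configuration and expands the other, so the displayed tree is a caterpillar with w leaves. The size condition \(|t_2| \ge 2^{w-2}+1\) ensures that the recursion reaches \(w=2\) at an internal node.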

Appendix 5: Proof of (18)

Recall that for each tree \(t \in T_{n,w}\), the smaller root subtree \(t_{r_S}\) is a caterpillar of size \(w \in [1,w']\) and the larger root subtree \(t_{r_L}\) has size \(n-w\). Because we assume \(w < n/2\), \(t_{r_S}\) and \(t_{r_L}\) have different sizes and different unlabeled topologies. Given a tree \(\overline{t} \in T_{n-w}\), the number of trees in \(T_{n,w}\) such that \(t_{r_L} = \overline{t}\) (after rescaling labels for the taxa) is \({{n}\atopwithdelims (){w}} \gamma _w\), where \(\gamma _w\) is the number of caterpillar labeled topologies of size w. Dividing by \(|T_{n,w}| = {{n}\atopwithdelims (){w}} \gamma _w |T_{n-w}|\) yields the probability \(\mathbb {P}[t_{r_L}=\overline{t}|t \in T_{n,w}] = 1/|T_{n-w}|\) as desired.

About this article

Cite this article

Disanto, F., Rosenberg, N.A. On the Number of Non-equivalent Ancestral Configurations for Matching Gene Trees and Species Trees. Bull Math Biol 81, 384–407 (2019). https://doi.org/10.1007/s11538-017-0342-x

  • DOI: https://doi.org/10.1007/s11538-017-0342-x

Keywords

  • Ancestral configurations
  • Combinatorics
  • Gene trees and species trees
  • Phylogenetics