Abstract
Using topological summaries of gene trees as a basis for species tree inference is a promising approach to obtain acceptable speed on genomic-scale datasets, and to avoid some undesirable modeling assumptions. Here we study the probabilities of splits on gene trees under the multispecies coalescent model, and how their features might inform species tree inference. After investigating the behavior of split consensus methods, we investigate split invariants, that is, polynomial relationships between split probabilities. These invariants are then used to show that, even though a split is an unrooted notion, split probabilities retain enough information to identify the rooted species tree topology for trees of 5 or more taxa, with one possible 6-taxon exception.
References
Alanzi ARA, Degnan JH (2017) Inferring rooted species trees from unrooted gene trees using approximate Bayesian computation. Mol Phylogenet Evol. https://doi.org/10.1016/j.ympev.2017.07.017
Allman ES, Degnan JH, Rhodes JA (2011a) Determining species tree topologies from clade probabilities under the coalescent. J Theor Biol 289:96–106
Allman ES, Degnan JH, Rhodes JA (2011b) Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol 62(6):833–862
Allman ES, Degnan JH, Rhodes JA (2013) Species tree inference by the STAR method, and generalizations. J Comput Biol 20(1):50–61
Allman ES, Degnan JH, Rhodes JA (2016) Species tree inference from gene splits by unrooted STAR methods. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2016.2604812
Ané C (2016) Personal communication
Chifman J, Kubatko L (2015) Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol 374:35–47
Decker W, Greuel GM, Pfister G, Schönemann H (2016) Singular 4-1-0: a computer algebra system for polynomial computations. http://www.singular.uni-kl.de
Degnan JH (2013) Anomalous unrooted gene trees. Syst Biol 62:574–590
Degnan JH, Salter LA (2005) Gene tree distributions under the coalescent process. Evolution 59(1):24–37
Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA (2009) Properties of consensus methods for inferring species trees from gene trees. Syst Biol 58(1):35–54
Ewing GB, Ebersberger I, Schmidt HA, von Haeseler A (2008) Rooted triple consensus and anomalous gene trees. BMC Evol Biol 8:118
Heled J, Drummond AJ (2010) Bayesian inference of species trees from multilocus data. Mol Biol Evol 27:570–580
Kubatko LS, Carstens BC, Knowles LL (2009) STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 25(7):971–973
Larget BR, Kotha SK, Dewey CN, Ané C (2010) BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 26:2910–2911
Liu L, Pearl DK (2007) Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol 56:504–514
Liu L, Yu L (2011) Estimating species trees from unrooted gene trees. Syst Biol 60:661–667
Liu L, Yu L, Pearl DK, Edwards SV (2009) Estimating species phylogenies using coalescence times among sequences. Syst Biol 58:468–477
Liu L, Yu L, Edwards SV (2010) A maximum pseudolikelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 10:302
Long C, Kubatko L (2017) Identifiability and reconstructibility of species phylogenies under a modified coalescent. arXiv:1701.06871
Mirarab S, Warnow T (2015) ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31:i44–i52
Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D (2011) Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9(3):e1000602
Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61(3):539–542
Semple C, Steel M (2003) Phylogenetics. Oxford lecture series in mathematics and its applications, vol 24. Oxford University Press, Oxford
Vachaspati P, Warnow T (2015) ASTRID: accurate species trees from internode distances. BMC Genom 16(Suppl 10):S3
Wu Y (2012) Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3):763–775
Acknowledgements
This work was begun while ESA and JAR were Short-term Visitors and JHD was a Sabbatical Fellow at the National Institute for Mathematical and Biological Synthesis, an institute sponsored by the National Science Foundation, the US Department of Homeland Security, and the US Department of Agriculture through NSF Award #EF-0832858, with additional support from the University of Tennessee, Knoxville. It was further supported by the National Institutes of Health Grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.
Appendices
A Greedy Split Consensus on 5-Taxon Trees: Proofs
Here we prove Propositions 2 and 3 from Sect. 3.
With \(\mathcal {X}=\{a,b,c,d,e\}\), there are 10 nontrivial splits, each with blocks of size 2 and 3. We use notation for split probabilities given in Example 1. Computations, assisted by the software COAL (Degnan and Salter 2005), produce the formulas in Table 2 for these split probabilities on the 3 species tree shapes, \(\sigma _{\mathrm{bal}}\), \(\sigma _{pc}\), and \(\sigma _{\mathrm{cat}}\).
Proof of Proposition 2
Given the equalities of split probabilities in Table 2, we need only show that \(s_{ab}, s_{de} \ge s_{ac}, s_{ad}, s_{cd}\) for each of the trees \(\sigma _{\mathrm{bal}}\) and \(\sigma _{pc}\), for a total of 12 inequalities. Note that positive branch lengths imply \(0<X,Y,Z<1.\)
While each of the inequalities can be checked without machine assistance, as an example we use the software Maple to verify one of them: \(s_{de}>s_{cd}\) for \(\sigma _{\mathrm{bal}}\). Note
The Maple command
produces as output
verifying the claim. \(\square \)
Proof of Proposition 3
That \(s_{ab}> s_{xy}\) for \(xy\ne de\) can be verified as in the preceding proof.
Suppose now that \(s_{de} \ge s_{ab}\). Then since \(s_{ab}\) is larger than all the remaining split probabilities by the above calculations, the true nontrivial splits on the species tree have the highest probability, and greedy consensus for gene tree splits is consistent.
Now assume instead that \(s_{ab}>s_{de}\), so \(s_{ab}\) is the strict maximum of the split probabilities. Under the greedy consensus algorithm, splits incompatible with Sp(ab) are discarded and only the splits \(s_{cd}\), \(s_{ce}\), and \(s_{de}\) remain as candidate splits for acceptance by the algorithm.
It can be verified that \(s_{de} > s_{ce}\) for all \(0<X,Y,Z<1\). However,
can have either sign. If \(F(X,Y,Z)>0\), then greedy consensus will return the correct species tree. If \(F(X,Y,Z)<0\), it will return the tree ((a, b), e, (c, d)). \(\square \)
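The greedy step used in this proof can be illustrated with a short Python sketch. The split probabilities below are made-up values, chosen only so that \(s_{ab}\) is the strict maximum and \(s_{cd}>s_{de}\); they are not computed from Table 2, and the function names are hypothetical.

```python
from itertools import combinations

TAXA = frozenset("abcde")


def compatible(block1, block2, taxa=TAXA):
    """Two splits are compatible iff some pair of their blocks is disjoint."""
    sides1 = (block1, taxa - block1)
    sides2 = (block2, taxa - block2)
    return any(not (u & v) for u in sides1 for v in sides2)


def greedy_consensus(split_probs, taxa=TAXA):
    """Accept splits in order of decreasing probability, skipping any split
    that is incompatible with one already accepted."""
    accepted = []
    ranked = sorted(split_probs, key=split_probs.get, reverse=True)
    for block in ranked:
        if all(compatible(block, other, taxa) for other in accepted):
            accepted.append(block)
    return accepted


# Hypothetical probabilities for the 10 nontrivial splits (illustration only).
# Each split is stored by its block of size 2.
example_probs = {frozenset(pair): 0.10 for pair in combinations("abcde", 2)}
example_probs[frozenset("ab")] = 0.60
example_probs[frozenset("cd")] = 0.30
example_probs[frozenset("de")] = 0.25
example_probs[frozenset("ce")] = 0.15

tree_splits = greedy_consensus(example_probs)
print(["".join(sorted(s)) for s in tree_splits])  # prints ['ab', 'cd']
```

With these values, once Sp(ab) is accepted only the splits with blocks cd, ce, and de remain compatible, and the larger value for cd leads to the tree ((a, b), e, (c, d)), as in the case \(F(X,Y,Z)<0\).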
B Identifiability of Trees: Additional Proofs
The proofs we give of Propositions 6, 7, and 8 depend on a careful analysis of probabilities under the coalescent. That of Proposition 6 is the simplest, and serves as a model for the others.
B.1 Proof of Proposition 6
The proof of Proposition 6 depends on several lemmas.
We begin with a definition. Consider a nonbinary rooted species tree \(((x_1,x_2,\ldots x_k)\text {:}L,y)\) formed by attaching a single outgroup taxon y to a claw tree with k taxa \(x_i\), with edge length \(L>0\). Under the multispecies coalescent model we will be interested in the case where the gene lineages, one for each \(x_i\), have coalesced \(\ell \) times, from k to \(k-\ell \) lineages, by the time they reach the root of the tree, and then further coalescences occur with the y lineage in the root population, until a single tree is formed. For \(\mathcal {A}\subset \mathcal {X}= \{x_1,x_2,\ldots x_k,y\}\), we denote the probability that a resulting gene tree displays a split \(Sp(\mathcal {A}_g)\) as \(p(\mathcal {A}\mid k,\ell )\).
Note that this probability does not depend on branch lengths in the species tree, since \(L>0\) and we have conditioned on \(\ell \). Furthermore, since the \(x_i\) lineages are exchangeable under the coalescent model on this tree, \(p(\mathcal {A}\mid k,\ell )\) actually depends on \(\mathcal {A}\) only through the number of \(x_i\in \mathcal {A}\) and whether \(y\in \mathcal {A}\), but not on the particular \(x_i\in \mathcal {A}\).
By an m-split, we mean a split of taxa where one block of the partition has size m. We now give recursions and base cases for the probability of various 2-splits for the above species tree.
Lemma 3

1.
\(p(x_1x_2\mid k,0)=p(x_1y\mid k, 0)\) for \(k\ge 2\),

2.
\(p(x_1x_2\mid 3,\ell )=p(x_1y\mid 3, \ell )=\frac{1}{3}\) for \(\ell =0,1,2\),

3.
\(p(x_1y\mid k, 0)=\frac{1}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }p(x_1y\mid k-1, 0)\) for \(k\ge 3\),

4.
\(p(x_1x_2\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2\mid k-1, \ell -1)\), for \(k\ge 4\), \(k> \ell \ge 1\),

5.
\(p(x_1y\mid k, \ell )=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } p(x_1y\mid k-1, \ell -1)\) for \(k\ge 3\), \(k>\ell \ge 1\).
Proof
These all follow directly from properties of the coalescent model. We give reasoning for several, leaving the rest to the reader.
For claim (1), observe no coalescent events occur below the root of the tree, so exchangeability of lineages at the root implies the statement.
For claim (4), note that for the split \(Sp(X_1X_2)\) to form, the first coalescent event must either be between the \(x_1\) and \(x_2\) lineages, which occurs with probability \(1/\left( {\begin{array}{c}k\\ 2\end{array}}\right) \), or be between \(x_i\) lineages with \(i\ne 1,2\), which occurs with probability \(\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) /\left( {\begin{array}{c}k\\ 2\end{array}}\right) \), with the split forming subsequently. \(\square \)
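As a worked instance of these recursions (an illustration using only the statements of Lemma 3), take \(k=4\). Claim (3) together with the base case in claim (2) gives the \(\ell =0\) value, and claims (4) and (5) give the \(\ell =1\) values:

```latex
\begin{align*}
p(x_1y\mid 4,0) &= \frac{1}{\binom{5}{2}} + \frac{\binom{3}{2}}{\binom{5}{2}}\,p(x_1y\mid 3,0)
  = \frac{1}{10} + \frac{3}{10}\cdot\frac{1}{3} = \frac{1}{5},\\
p(x_1x_2\mid 4,1) &= \frac{1}{\binom{4}{2}} + \frac{\binom{2}{2}}{\binom{4}{2}}\,p(x_1x_2\mid 3,0)
  = \frac{1}{6} + \frac{1}{6}\cdot\frac{1}{3} = \frac{2}{9},\\
p(x_1y\mid 4,1) &= \frac{\binom{3}{2}}{\binom{4}{2}}\,p(x_1y\mid 3,0)
  = \frac{1}{2}\cdot\frac{1}{3} = \frac{1}{6}.
\end{align*}
```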
We next establish some probability bounds.
Lemma 4
For \(k\ge 4\), \(k > \ell \ge 0\), \(p(x_1y \mid k,\ell ) < \frac{1}{k}\).
Proof
Lemma 3 (3) and (2) imply \(p(x_1y\mid 4,0)= \frac{1}{5} < \frac{1}{4}\). For \(k>4\), \(\ell = 0\), Lemma 3 (3) and an inductive hypothesis then show
For \(\ell \ge 1\), first consider the case that \(k-\ell =1\), 2, or 3. Then using Lemma 3 (5) repeatedly and Lemma 3 (2) shows
If instead \(k-\ell \ge 4\), Lemma 3 (5) and what has already been established imply
\(\square \)
Next, we obtain a key inequality.
Lemma 5
For \(k=3\), \(\ell =0,1,2\) and for \(k > 3\), \(\ell =0\),
For \(k\ge 4\) and \(k>\ell \ge 1\),
Proof
For \(k=3\), \(\ell =0,1,2\) and for \(k>3\), \(\ell =0\), the claimed equalities follow from Lemma 3 (2) and (1), respectively.
For the inequality when \(k\ge 4\), \(k>\ell \ge 1\), by Lemma 3 (4) and (5),
Using Lemma 3 (2) in Eq. (13) shows
for \(\ell =1,2,3\), establishing the \(k=4\) case of the inequality.
For \(k > 4\), by Lemma 3 (1) and Lemma 4, Eq. (13) yields
This shows the inequality holds for \(\ell =1\), and provides base cases for an inductive proof for \(\ell \ge 1\).
Finally, Eq. (13), an inductive hypothesis, and Lemma 4 show that for \(\ell \ge 2\)
\(\square \)
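The recursions of Lemma 3 are easy to evaluate exactly, which gives a numerical sanity check on the bound of Lemma 4 and on the sign of the difference \(p(x_1x_2\mid k,\ell )-p(x_1y\mid k,\ell )\) studied in the proof of Lemma 5. The following is a minimal Python sketch under the assumption that the recursions are as stated in Lemma 3; the function names are hypothetical, and a finite check of this kind does not replace the inductive proofs.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p_x1y(k, ell):
    """p(x1 y | k, ell) via Lemma 3 (2), (3), (5); needs k >= 3, k > ell >= 0."""
    if k == 3:
        return Fraction(1, 3)                                  # Lemma 3 (2)
    if ell == 0:                                               # Lemma 3 (3)
        return (Fraction(1, comb(k + 1, 2))
                + Fraction(comb(k - 1, 2), comb(k + 1, 2)) * p_x1y(k - 1, 0))
    # Lemma 3 (5): the first coalescence below the root must avoid x1.
    return Fraction(comb(k - 1, 2), comb(k, 2)) * p_x1y(k - 1, ell - 1)


@lru_cache(maxsize=None)
def p_x1x2(k, ell):
    """p(x1 x2 | k, ell) via Lemma 3 (1), (2), (4); needs k >= 3, k > ell >= 0."""
    if k == 3:
        return Fraction(1, 3)                                  # Lemma 3 (2)
    if ell == 0:
        return p_x1y(k, 0)                                     # Lemma 3 (1)
    # Lemma 3 (4): x1, x2 coalesce first, or the event avoids both.
    return (Fraction(1, comb(k, 2))
            + Fraction(comb(k - 2, 2), comb(k, 2)) * p_x1x2(k - 1, ell - 1))


# Finite check for 4 <= k <= 12 (the range is an arbitrary choice).
for k in range(4, 13):
    for ell in range(k):
        assert p_x1y(k, ell) < Fraction(1, k)                  # bound of Lemma 4
        diff = p_x1x2(k, ell) - p_x1y(k, ell)
        assert (diff > 0) if ell >= 1 else (diff == 0)

print(p_x1y(4, 0), p_x1x2(4, 1) - p_x1y(4, 1))  # prints 1/5 1/18
```

Exact rational arithmetic is used so that the equalities at \(\ell =0\) are tested exactly rather than up to floating-point error.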
Proof of Proposition 6
That the equality holds for species tree (T, (a, b)) is an instance of Theorem 2. It is enough to establish the inequality for ((T, a), b), since that for ((T, b), a) will follow by interchanging taxon names.
On the species tree ((T, a), b), let v denote the MRCA of T and a. Observe that for the splits Sp(AC) or Sp(BC) to form, it is necessary that the c lineage not coalesce with any other below v. In any such realization of the coalescent process below v, lineages from taxa on T will have coalesced to \(k-1\) lineages by v, where \(k\ge 3\). There the lineage from a enters, and \(\ell \) coalescent events, \(k>\ell \ge 0\), occur on the edge immediately ancestral to v.
To establish the inequality, we consider it conditioned on a number of disjoint and exhaustive events: For each possible \(k, \ell \), let \(\mathcal C = \mathcal C (k, \ell )\) denote the event that \(k-1\) agglomerated lineages from T reach v, one of which is the lineage from c alone, and that \(\ell \) coalescent events occur in the population immediately ancestral to v. Fixing \(\mathcal C = \mathcal C (k, \ell )\), with \(y=b\), \(x_1=c\), \(x_2=a\) we have
Lemma 5 thus shows \(\mathbb {P}(Sp(ac)\mid \mathcal C)-\mathbb {P}(Sp(bc)\mid \mathcal C)\) is positive for \(k\ge 4\), \(k>\ell \ge 1\), and zero for other relevant cases. Multiplying by the probabilities of each \(\mathcal C = \mathcal C (k, \ell )\) and summing, we obtain the desired unconditioned expression \(\mathbb {P}(Sp(ac))-\mathbb {P}(Sp(bc))\). Because T has at least three taxa, there are some positive summands from \(k \ge 4\), \(\ell \ge 1\), so the desired inequality holds. \(\square \)
B.2 Proof of Proposition 7
While the proof of Proposition 7 follows the same line of reasoning as that of Proposition 6, there are further technical details. We first extend some of the results from the previous section to splits of size 3. These will be applied in arguments for the species tree \(((T,(b_1,b_2)),a)\).
Lemma 6

1.
\(p(\mathcal {A}\mid k,0)=p(\mathcal {B}\mid k,0)\) for \(|\mathcal {A}|=|\mathcal {B}|,\)

2.
\(p(x_1x_2x_3\mid k,0)=\frac{3}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } p(x_1x_2\mid k-1,0)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }p(x_1x_2x_3\mid k-1, 0)\) for \(k\ge 4\),

3.
\(p(x_1x_2x_3\mid 3,\ell )=p(x_1x_2y\mid 3,\ell )=1\), for \(\ell =0,1,2\),

4.
\(p(x_1x_2x_3\mid 4,\ell )=\frac{1}{2}p(x_1x_2\mid 3,\ell -1)\) for \(\ell = 1, 2, 3\),

5.
\(p(x_1x_2x_3\mid k,\ell )=\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2\mid k-1,\ell -1)+\frac{\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2x_3\mid k-1,\ell -1)\) for \(k\ge 5\), \(k>\ell \ge 1\),

6.
\(p(x_1x_2y\mid k,\ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1y\mid k-1,\ell -1)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2y\mid k-1,\ell -1)\) for \(k\ge 4\), \(k>\ell \ge 1.\)
Proof
For claim (1), it suffices to note that \(k+1\) lineages enter the population above the root, with no coalescent events having occurred below, so the probabilities of any two m-splits are the same by exchangeability of lineages under the coalescent model.
For claim (2), again \(k+1\) lineages enter the root population, with no previous coalescence. For \(Sp(X_1X_2X_3)\) to form, the first coalescent event above the root must be between a pair of lineages chosen from \(x_1, x_2, x_3\), or disjoint from them. It is between a pair chosen from them with probability \(\frac{3}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }\). Then, for \(Sp(X_1 X_2 X_3)\) to form, this pair’s lineage must join with the remaining \(x_i\) lineage. By claim (1), this has probability \(p(x_1 x_2\mid k-1,0)\). Multiplying these probabilities, we obtain the first summand. The first coalescent event not involving any of the \(x_1,x_2,x_3\) lineages, and then the desired split forming with 1 less lineage present gives the second summand.
The remaining verifications are left to the reader. \(\square \)
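As an illustration of how the remaining claims are used, the following worked values for \(k=4\) follow from Lemma 6 (2), (3), (4), and (6) combined with Lemma 3 (2); they are a check on the recursions, not part of the original argument.

```latex
\begin{align*}
p(x_1x_2x_3\mid 4,0) &= \frac{3}{\binom{5}{2}}\,p(x_1x_2\mid 3,0)
  + \frac{\binom{2}{2}}{\binom{5}{2}}\,p(x_1x_2x_3\mid 3,0)
  = \frac{3}{10}\cdot\frac{1}{3} + \frac{1}{10}\cdot 1 = \frac{1}{5},\\
p(x_1x_2x_3\mid 4,1) &= \frac{1}{2}\,p(x_1x_2\mid 3,0) = \frac{1}{6},\\
p(x_1x_2y\mid 4,1) &= \frac{1}{\binom{4}{2}}\,p(x_1y\mid 3,0)
  + \frac{\binom{2}{2}}{\binom{4}{2}}\,p(x_1x_2y\mid 3,0)
  = \frac{1}{6}\cdot\frac{1}{3} + \frac{1}{6}\cdot 1 = \frac{2}{9}.
\end{align*}
```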
Lemma 7
For \(k\ge 4\), \(k>\ell \ge 0\),
Proof
We first show the inequality for \(\ell =0\), by induction on k. From Lemmas 3 and 6, \(p(x_1y\mid 4,0)+p(x_1x_2y\mid 4,0)=\frac{2}{5},\) establishing the base case of \(k=4\).
If \(k\ge 5\), an inductive hypothesis, Lemma 3 (3), Lemma 6 (1) and (2), and Lemma 4 yield
Next observe that for \(\ell =1,2,3\), Lemmas 3 and 6 imply
With the \(k=4\), \(\ell =1,2,3\) cases and the \(k\ge 4\), \(\ell =0\) cases already established, we now proceed by induction on \(\ell \). For \(k\ge 5\), \(k>\ell \ge 1\) by Lemma 3 (5), Lemma 6 (6), Lemma 4, and an inductive hypothesis,
\(\square \)
Lemma 8
Let
Then for \(k=4\), \(\ell =0,1,2,3\), and for \(k\ge 5\), \(\ell =0\), \(P(k,\ell )=0.\) For \(k\ge 5\), \(k>\ell \ge 1\), \(P(k,\ell )>0.\)
Proof
Note that for \(k=4\), the event \(Sp(x_1x_2)\) is the same as \(Sp(x_3x_4y)\), and \(Sp(x_1x_2x_3)\) is the same as \(Sp(x_4y)\), so using exchangeability of the \(x_i\) lineages we have
Thus \(P(4,\ell )=0\) for \(\ell = 0, 1, 2, 3\). For \(k\ge 5\), Lemma 6 (1) implies \(P(k,0)=0\).
For \(k\ge 5\), \(\ell \ge 1\), by Lemmas 3 (4), (5) and 6 (5), (6), we find
Using Lemmas 5 and 7, the nonnegativity of probabilities, and an inductive hypothesis that \(P(k-1,\ell -1)\ge 0\), it follows that
\(\square \)
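The coincidences of events for \(k=4\) noted at the start of this proof give a convenient consistency check on the recursions of Lemmas 3 and 6. The Python sketch below evaluates the recursions exactly and confirms, for \(\ell =0,1,2,3\), that \(p(x_1x_2\mid 4,\ell )=p(x_1x_2y\mid 4,\ell )\) and \(p(x_1x_2x_3\mid 4,\ell )=p(x_1y\mid 4,\ell )\), as well as the base-case sum \(\frac{2}{5}\) from the proof of Lemma 7. Function names are hypothetical, and the code assumes the recursions exactly as stated in the lemmas.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


def c2(n):
    """Number of unordered pairs among n lineages."""
    return comb(n, 2)


@lru_cache(maxsize=None)
def p_x1y(k, ell):                       # Lemma 3 (2), (3), (5)
    if k == 3:
        return Fraction(1, 3)
    if ell == 0:
        return Fraction(1, c2(k + 1)) + Fraction(c2(k - 1), c2(k + 1)) * p_x1y(k - 1, 0)
    return Fraction(c2(k - 1), c2(k)) * p_x1y(k - 1, ell - 1)


@lru_cache(maxsize=None)
def p_x1x2(k, ell):                      # Lemma 3 (1), (2), (4)
    if k == 3:
        return Fraction(1, 3)
    if ell == 0:
        return p_x1y(k, 0)
    return Fraction(1, c2(k)) + Fraction(c2(k - 2), c2(k)) * p_x1x2(k - 1, ell - 1)


@lru_cache(maxsize=None)
def p_x1x2x3(k, ell):                    # Lemma 6 (2), (3), (4), (5)
    if k == 3:
        return Fraction(1)
    if ell == 0:
        return (Fraction(3, c2(k + 1)) * p_x1x2(k - 1, 0)
                + Fraction(c2(k - 2), c2(k + 1)) * p_x1x2x3(k - 1, 0))
    # For k = 4 the second coefficient is c2(1) = 0, which recovers claim (4).
    return (Fraction(3, c2(k)) * p_x1x2(k - 1, ell - 1)
            + Fraction(c2(k - 3), c2(k)) * p_x1x2x3(k - 1, ell - 1))


@lru_cache(maxsize=None)
def p_x1x2y(k, ell):                     # Lemma 6 (1), (3), (6)
    if k == 3:
        return Fraction(1)
    if ell == 0:
        return p_x1x2x3(k, 0)
    return (Fraction(1, c2(k)) * p_x1y(k - 1, ell - 1)
            + Fraction(c2(k - 2), c2(k)) * p_x1x2y(k - 1, ell - 1))


for ell in range(4):
    assert p_x1x2(4, ell) == p_x1x2y(4, ell)
    assert p_x1x2x3(4, ell) == p_x1y(4, ell)

print(p_x1y(4, 0) + p_x1x2y(4, 0))  # prints 2/5
```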
Lemma 9
Consider a species tree with topology \(((T,(b_1,b_2)),a)\), where T is a subtree on at least three taxa, one of which is c. Suppose the edge above \((T,(b_1,b_2))\) has positive length. Then under the multispecies coalescent model,
Proof
Let v denote the MRCA on the species tree of the taxa on T and the \(b_i\).
To establish the claimed inequality, it is enough to show it holds when conditioned on whether \(b_1\) and \(b_2\) lineages have coalesced before reaching v or not. If they have coalesced before v to form a single lineage, then the events \(Sp(ab_2c)\) and \(Sp(b_1c)\) have probability zero. Thus using b for \(b_1b_2\), we wish to show
This follows immediately from Proposition 6.
We henceforth condition on the two \(b_i\) lineages being distinct at v. Noticing that all four probabilities in the expression of interest are 0 if the c lineage coalesces with any lineage below v, we further condition on the c lineage being distinct at v, where there are thus \(k\ge 4\) lineages entering the population above v, and \(\ell \) coalescent events occurring between v and the root.
Then, with \(\mathcal C=\mathcal C(k,\ell )\) denoting the events we condition on,
From Lemma 8 we find conditioned on \(\mathcal C\) that the expression is strictly negative for \(k\ge 5\), \(k > \ell \ge 1\), and zero for \(k\ge 5\), \(\ell =0\) and \(k=4\), \(\ell =0,1,2,3\). Thus weighting the conditioned expressions by the probabilities of the events \(\mathcal C\) and summing, we see the full expression is negative, as long as \(k \ge 5\) and \(\ell \ge 1\) is possible. Since T has at least 3 taxa, this only requires that the edge above v has positive length. \(\square \)
To handle the species tree \(((T,a),(b_1,b_2))\) we proceed analogously, but consider a rooted species tree \(((x_1,x_2,\ldots x_k)\text {:}L,y_1,y_2)\) formed by attaching a trifurcating root to two outgroups \(y_1,y_2\) and a claw tree with k taxa, with a positive edge length L. We will be interested in the case where the gene lineages, one for each \(x_i\), have coalesced \(\ell \) times, from k to \(k-\ell \) lineages, by the time they reach the root of the tree, and then further coalescence occurs in the root population until a single tree is formed. With \(\mathcal {X}=\{x_1,\ldots x_k,y_1,y_2\}\) and \(\mathcal {A}\subset \mathcal {X}\), let
for this species tree. By exchangeability of lineages in the coalescent model, \(r(\mathcal {A}\mid k,\ell )\) depends on \(\mathcal {A}\) only up to the number of \(x_i\) and the number of \(y_i\) it contains.
The reader who has followed previous arguments should be able to verify the following.
Lemma 10

1.
\(r(\mathcal {A} \mid k,0)=r(\mathcal {B}\mid k, 0)\) for \(|\mathcal {A}|=|\mathcal {B}|\),

2.
\(r(x_1 x_2 \mid 3, 0) = \frac{1}{5}\), \(r(x_1x_2\mid 3,\ell )=\frac{1}{3}\) for \(\ell =1,2\),

3.
\(r(x_1x_2\mid k, 0)=\frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1x_2\mid k-1, 0)\) for \(k\ge 3\),

4.
\(r(x_1x_2\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2\mid k-1, \ell -1)\), for \(k\ge 4\), \(k> \ell \ge 1\),

5.
\(r(x_1y_1 \mid 2,0)=\frac{1}{3}\), \(r(x_1y_1 \mid 2,1)=0\),

6.
\(r(x_1y_1\mid k, \ell )=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } r(x_1y_1\mid k-1, \ell -1)\) for \(k\ge 3\), \(k>\ell \ge 1\),

7.
\(r(x_1x_2x_3 \mid 4,\ell )= \frac{1}{2} r(x_1x_2\mid 3, \ell -1)\) for \(\ell = 1, 2, 3\),

8.
\(r(x_1x_2x_3 \mid k,\ell )=\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } r(x_1x_2\mid k-1, \ell -1)+\frac{\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } r(x_1x_2x_3\mid k-1, \ell -1)\) for \(k\ge 5\), \(k>\ell \ge 1\),

9.
\(r(x_1x_2y_1\mid 2,0)=r(x_1x_2y_1\mid 2,1) =1\), \(r(x_1x_2y_1\mid 3,1)=\frac{1}{9}\), \(r(x_1x_2y_1\mid 3,2)=0\),

10.
\(r(x_1x_2y_1\mid k, 0)=\frac{3}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1x_2\mid k-1,0)+\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } r(x_1x_2y_1\mid k-1, 0)\) for \(k\ge 3\),

11.
\(r(x_1x_2y_1\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1y_1\mid k-1,\ell -1)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2y_1\mid k-1, \ell -1)\) for \(k\ge 4\), \(k>\ell \ge 1\),

12.
\(r(x_1 y_1 y_2 \mid 2, 0) = r(x_1y_1y_2\mid 2, 1)=1\),

13.
\(r(x_1y_1y_2\mid k, \ell )=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1y_1y_2\mid k-1, \ell -1)\) for \(k\ge 3\), \(k>\ell \ge 1\).
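As a consistency check of these claims (an illustration, not part of the original argument), the recursion in claim (3) at \(k=3\) reproduces the value in claim (2), using \(r(x_1x_2\mid 2,0)=r(x_1y_1\mid 2,0)=\frac{1}{3}\) from claims (1) and (5). Claim (10) at \(k=3\), with the base case from claim (9), gives the corresponding 3-split value:

```latex
\begin{align*}
r(x_1x_2\mid 3,0) &= \frac{1}{\binom{5}{2}} + \frac{\binom{3}{2}}{\binom{5}{2}}\,r(x_1x_2\mid 2,0)
  = \frac{1}{10} + \frac{3}{10}\cdot\frac{1}{3} = \frac{1}{5},\\
r(x_1x_2y_1\mid 3,0) &= \frac{3}{\binom{5}{2}}\,r(x_1x_2\mid 2,0)
  + \frac{\binom{2}{2}}{\binom{5}{2}}\,r(x_1x_2y_1\mid 2,0)
  = \frac{3}{10}\cdot\frac{1}{3} + \frac{1}{10}\cdot 1 = \frac{1}{5}.
\end{align*}
```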
Lemma 11

1.
\(r(x_1y_1 \mid k,0)\le \frac{1}{k+2}\) for \(k\ge 3\),

2.
\(r(x_1y_1 \mid k,\ell )+r(x_1y_1y_2 \mid k,\ell )< \frac{1}{k-1}\) if \(k \ge 3\) and \(\ell = 0\), or if \(k\ge 4\) and \(k> \ell \ge 1\).
Proof
For claim (1) first note that Lemma 10 (1) and (2) establish the \(k=3\) case. Then using Lemma 10 (3) one sees inductively that for \(k>3\),
For claim (2) when \(\ell = 0\), note that by Lemma 10 (1), (2), (5), (9), and (10),
so the base case of \(k=3\) holds. Then for \(k> 3\), using Lemma 10 (1), (3), and (10) we have
Using an inductive hypothesis and claim (1) of this lemma yields
Assume now \(k \ge 4\) and \(k > \ell \ge 1\), and consider first the case that \(k-\ell =1\) or 2. Applying Lemma 10 (6) and (13) repeatedly, we have
From Lemma 10
so for \(k\ge 4\),
If \(k-\ell \ge 3\), then applying Lemma 10 (6) and (13) repeatedly gives
Using what we proved above, this shows
\(\square \)
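Both bounds of Lemma 11 can be checked numerically from the recursions of Lemma 10. The Python sketch below does this in exact arithmetic for \(3\le k\le 12\); the function names and the range of k are illustrative choices, and the check complements rather than replaces the induction above.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


def c2(n):
    """Number of unordered pairs among n lineages."""
    return comb(n, 2)


@lru_cache(maxsize=None)
def r2_root(k):
    """r(A | k, 0) for any 2-split A: Lemma 10 (1), (3), (5)."""
    if k == 2:
        return Fraction(1, 3)
    return Fraction(1, c2(k + 2)) + Fraction(c2(k), c2(k + 2)) * r2_root(k - 1)


@lru_cache(maxsize=None)
def r3_root(k):
    """r(A | k, 0) for any 3-split A: Lemma 10 (1), (9), (10)."""
    if k == 2:
        return Fraction(1)
    return (Fraction(3, c2(k + 2)) * r2_root(k - 1)
            + Fraction(c2(k - 1), c2(k + 2)) * r3_root(k - 1))


@lru_cache(maxsize=None)
def r_x1y1(k, ell):
    """r(x1 y1 | k, ell): Lemma 10 (1), (5), (6)."""
    if ell == 0:
        return r2_root(k)
    if k == 2:
        return Fraction(0)               # r(x1 y1 | 2, 1) = 0
    return Fraction(c2(k - 1), c2(k)) * r_x1y1(k - 1, ell - 1)


@lru_cache(maxsize=None)
def r_x1y1y2(k, ell):
    """r(x1 y1 y2 | k, ell): Lemma 10 (1), (12), (13)."""
    if k == 2:
        return Fraction(1)
    if ell == 0:
        return r3_root(k)
    return Fraction(c2(k - 1), c2(k)) * r_x1y1y2(k - 1, ell - 1)


for k in range(3, 13):
    assert r_x1y1(k, 0) <= Fraction(1, k + 2)                  # claim (1)
    ells = range(k) if k >= 4 else [0]                         # cases of claim (2)
    for ell in ells:
        assert r_x1y1(k, ell) + r_x1y1y2(k, ell) < Fraction(1, k - 1)

print(r_x1y1(3, 0), r_x1y1(3, 0) + r_x1y1y2(3, 0))  # prints 1/5 2/5
```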
Lemma 12
Let
Then for \(k=3\), \(\ell =0,1,2\), and for \(k\ge 4\), \(\ell =0\), \(R(k,\ell )=0.\) For \(k\ge 4\) and \(k>\ell \ge 1\), \(R( k,\ell )>0.\)
Proof
For \(k=3\), the events \(Sp(x_1x_2)\) and \(Sp(x_3y_1y_2)\) are the same, as are \(Sp(x_1x_2y_1)\) and \(Sp(x_3y_2)\), so using exchangeability of the \(x_i\) and of the \(y_i\) lineages
Thus \(R(3,\ell )=0\) for \(\ell =0,1,2\). For \(k\ge 3\), Lemma 10 (1) implies \(R(k,0)=0\).
Now consider \(k\ge 4\), \(k>\ell \ge 1\). By Lemma 10 (4), (6), (11), and (13) we find
An inductive hypothesis that \(R(k-1,\ell -1)\ge 0\), Lemma 11, and the positivity of \(r(x_1y_1 \mid k-1,\ell -1)\) then show
\(\square \)
Lemma 13
Consider a species tree with topology \(((T,a),(b_1,b_2))\), where T is a subtree on at least three taxa, one of which is c. Suppose the edge above (T, a) has positive length. Then under the multispecies coalescent model,
Proof
Let \(\rho \) denote the root of the species tree, and v the MRCA of the taxa on T and a.
To establish the claimed inequality, it is enough to show it holds when conditioned on whether the \(b_1\) and \(b_2\) lineages have coalesced before reaching \(\rho \) or not. If they have coalesced below \(\rho \) to form a single lineage, then the events \(Sp(ab_2c)\) and \(Sp(b_1c)\) have probability zero. Thus using b for \(b_1b_2\), we wish to show
This follows immediately from Proposition 6.
We henceforth condition on the event that the lineages from \(b_1\) and \(b_2\) are distinct at \(\rho \). Noticing that all four probabilities in the expression of interest are 0 if the c lineage coalesces with any lineage below v, we further condition on the event that the c lineage is distinct at v, so there are \(k\ge 3\) distinct lineages at v, and that \(\ell \) coalescent events occur on the edge above v. Calling this event \(\mathcal C = \mathcal C(k, \ell )\),
From Lemma 12 we find that conditioned on \(\mathcal C\) the expression of interest is strictly positive for \(k\ge 4\), \(k > \ell \ge 1\), and zero for \(k=3\), \(\ell =0,1,2\) and \(k\ge 4\), \(\ell =0\). Weighting the conditioned expressions by the probabilities of the \(\mathcal C\) and summing we get the unconditioned expression. Since T has at least 3 taxa and the edge above v has positive length, some of the summands correspond to events \(\mathcal C(k,\ell )\) with \(k\ge 4\), \(k>\ell \ge 1\); thus the full expression is positive. \(\square \)
Finally, Proposition 7 follows from Theorem 2, Lemmas 9 and 13.
B.3 Proof of Proposition 8
To establish Proposition 8, we first extend the results of Lemma 10, and those that follow it, to splits of size 4.
A proof of the following is left to the reader.
Lemma 14

1.
\(r(x_1x_2x_3y_1\mid 3, 0)=1\),

2.
\(r(x_1x_2x_3y_1\mid k, 0)=\frac{6}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } r(x_1x_2y_1\mid k-1, 0)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1x_2x_3y_1\mid k-1,0)\) for \(k\ge 4\),

3.
\(r(x_1x_2x_3y_1\mid 4,\ell )=\frac{1}{2}r(x_1x_2y_1\mid 3,\ell -1)\) for \(\ell = 1, 2, 3\),

4.
\(r(x_1x_2x_3y_1\mid k,\ell )=\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2y_1\mid k-1,\ell -1) +\frac{\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2x_3y_1\mid k-1,\ell -1)\) for \(k\ge 5\), \(k>\ell \ge 1\),

5.
\(r(x_1x_2y_1y_2\mid 3, \ell )=1\) for \(\ell =0,1,2\),

6.
\(r(x_1x_2y_1y_2\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1y_1y_2\mid k-1,\ell -1) +\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2y_1y_2\mid k-1,\ell -1)\) for \(k\ge 4\), \(k>\ell \ge 1\).
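One way to check claim (2), offered here as an illustration, is to evaluate it at \(k=4\) and compare with a 2-split probability. With six lineages at the root, a split with a block of size 4 has a complementary block of size 2, so its probability should equal \(r(x_1x_2\mid 4,0)\) by Lemma 10 (1). Using \(r(x_1x_2y_1\mid 3,0)=\frac{1}{5}\), which follows from Lemma 10 (10), and \(r(x_1x_2\mid 3,0)=\frac{1}{5}\) from Lemma 10 (2):

```latex
\begin{align*}
r(x_1x_2x_3y_1\mid 4,0) &= \frac{6}{\binom{6}{2}}\,r(x_1x_2y_1\mid 3,0)
  + \frac{\binom{2}{2}}{\binom{6}{2}}\,r(x_1x_2x_3y_1\mid 3,0)
  = \frac{6}{15}\cdot\frac{1}{5} + \frac{1}{15}\cdot 1 = \frac{11}{75},\\
r(x_1x_2\mid 4,0) &= \frac{1}{\binom{6}{2}} + \frac{\binom{4}{2}}{\binom{6}{2}}\,r(x_1x_2\mid 3,0)
  = \frac{1}{15} + \frac{6}{15}\cdot\frac{1}{5} = \frac{11}{75}.
\end{align*}
```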
Lemma 15
Let \(U(k,\ell )=\)
Then \(U(k,\ell )< \frac{1}{k-2}\) for \(k\ge 4\), \(k>\ell \ge 0\).
Proof
We first take up the case that \(\ell = 0\), and observe by Lemmas 10 and 14 that for \(k\ge 4\),
Since
we see
establishing the \(k=4, \ell = 0\) case. Proceeding inductively for \(k\ge 5\), and using Lemma 11 (2), we have
For \(\ell > 0\), if \(k\ge 4\), \(k>\ell \ge 1\), Lemmas 10 and 14 show
In particular, since
then
providing, along with the cases with \(\ell = 0\), the base cases for induction. Now for \(k\ge 5\), \(k>\ell \ge 1\), we see from Eq. (14), Lemma 11 (2), and an inductive hypothesis that
\(\square \)
Lemma 16
Let
Then for \(k=4\), \(\ell =0, 1,2,3\) and for \(k\ge 5\), \(\ell =0\), \(S(k,\ell )= 0\). For \(k\ge 5\), \(k>\ell \ge 1\), \(S(k,\ell )>0\).
Proof
For \(k=4\), the following events coincide: \(Sp(x_1x_2)=Sp(x_3x_4y_1y_2)\), \(Sp(x_1x_2x_3)=Sp(x_4y_1y_2)\), and \(Sp(x_1x_2x_3y_1)=Sp(x_4y_2)\). Using exchangeability of the \(x_i\) and of the \(y_i\) lineages, we have
so \(S(4,\ell )=0\) for \(\ell =0,1,2,3\). For \(k\ge 5\), Lemma 10 (1) implies \(S(k,0)=0\).
For \(k\ge 5\), \(k>\ell \ge 1\), using Lemmas 10 and 14 we find
Using an inductive hypothesis, Lemmas 15 and 12, and nonnegativity of probabilities, this implies
\(\square \)
Proof of Proposition 8
On the species tree \(((T,(a_1,a_2)),(b_1,b_2))\) let \(\rho \) denote the root, v the MRCA of the taxa on T and the \(a_i\), and let c be a taxon on T. We first show that since the edge above v has positive length, then
To establish this, it is enough to show it holds when conditioned on whether or not the \(a_1\) and \(a_2\) lineages have coalesced before reaching v, and whether or not the \(b_1\) and \(b_2\) lineages have coalesced before reaching \(\rho \). If both pairs have coalesced in this way, then the events \(Sp(a_1c)\), \(Sp(a_1b_2c)\), \(Sp(a_1a_2b_2c)\), \(Sp(b_1c)\), \(Sp(b_1a_2 c)\), and \(Sp(b_1 a_2 b_2c)\) all have probability zero. Using a for \(a_1a_2\) and b for \(b_1b_2\), we need only show
This follows immediately from Proposition 6. Similarly, the cases in which exactly one of the pairs of \(a_1, a_2\) lineages or \(b_1, b_2\) lineages has coalesced in the population immediately ancestral to their respective MRCAs follow from Proposition 7.
We henceforth condition on the event that the \(a_i\) lineages are distinct at v and the \(b_i\) lineages are distinct at \(\rho \). Noticing that all eight probabilities in the expression of interest are 0 if the c lineage coalesces with any lineage below v, we further condition on the c lineage being distinct at v (so there are \(k\ge 4\) lineages in total entering the population above v) and \(\ell \) coalescent events occur between v and \(\rho \).
Then, with \(\mathcal C = \mathcal C (k,\ell )\) denoting the event that these conditioning requirements are met,
After substituting these into the expression in (15), from Lemma 16 we see that when conditioned on \(\mathcal C\) it is strictly positive for \(k\ge 5\), \(k> \ell \ge 1\) and zero for \(k=4\), \(\ell =0,1,2,3\) and for \(k\ge 5\), \(\ell =0\). Thus weighting the conditioned expressions by the probabilities of the \(\mathcal C\) and summing over all relevant k and \(\ell \), we see the unconditioned inequality (15) holds, since T has at least 3 taxa and so summands with \(k\ge 5\), \(\ell \ge 1\) are present.
Interchanging the \(a_i\) and \(b_i\) in inequality (15) shows the negativity of the expression on the tree \(((T,(b_1,b_2)),(a_1,a_2))\). Since its vanishing on the tree \((T,((a_1,a_2),(b_1,b_2)))\) was shown in Theorem 2, the proof is complete. \(\square \)
Cite this article
Allman, E.S., Degnan, J.H. & Rhodes, J.A. Split Probabilities and Species Tree Inference Under the Multispecies Coalescent Model. Bull Math Biol 80, 64–103 (2018). https://doi.org/10.1007/s11538-017-0363-5
Keywords
 Multispecies coalescent model
 Split probability
 Species tree identifiability
Mathematics Subject Classification
 92D15