Split Probabilities and Species Tree Inference Under the Multispecies Coalescent Model

Abstract

Using topological summaries of gene trees as a basis for species tree inference is a promising approach to obtain acceptable speed on genomic-scale datasets, and to avoid some undesirable modeling assumptions. Here we study the probabilities of splits on gene trees under the multispecies coalescent model, and how their features might inform species tree inference. After investigating the behavior of split consensus methods, we investigate split invariants—that is, polynomial relationships between split probabilities. These invariants are then used to show that, even though a split is an unrooted notion, split probabilities retain enough information to identify the rooted species tree topology for trees of 5 or more taxa, with one possible 6-taxon exception.

References

  1. Alanzi ARA, Degnan JH (2017) Inferring rooted species trees from unrooted gene trees using approximate Bayesian computation. Mol Phylogenet Evol. https://doi.org/10.1016/j.ympev.2017.07.017

  2. Allman ES, Degnan JH, Rhodes JA (2011a) Determining species tree topologies from clade probabilities under the coalescent. J Theor Biol 289:96–106

  3. Allman ES, Degnan JH, Rhodes JA (2011b) Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol 62(6):833–862

  4. Allman ES, Degnan JH, Rhodes JA (2013) Species tree inference by the STAR method, and generalizations. J Comput Biol 20(1):50–61

  5. Allman ES, Degnan JH, Rhodes JA (2016) Species tree inference from gene splits by unrooted STAR methods. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2016.2604812

  6. Ané C (2016) Personal communication

  7. Chifman J, Kubatko L (2015) Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol 374:35–47

  8. Decker W, Greuel G-M, Pfister G, Schönemann H (2016) Singular 4–1–0—a computer algebra system for polynomial computations. http://www.singular.uni-kl.de

  9. Degnan JH (2013) Anomalous unrooted gene trees. Syst Biol 62:574–590

  10. Degnan JH, Salter LA (2005) Gene tree distributions under the coalescent process. Evolution 59(1):24–37

  11. Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA (2009) Properties of consensus methods for inferring species trees from gene trees. Syst Biol 58(1):35–54

  12. Ewing GB, Ebersberger I, Schmidt HA, von Haeseler A (2008) Rooted triple consensus and anomalous gene trees. BMC Evol Biol 8:118

  13. Heled J, Drummond AJ (2010) Bayesian inference of species trees from multilocus data. Mol Biol Evol 27:570–580

  14. Kubatko LS, Carstens BC, Knowles LL (2009) STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics 25(7):971–973

  15. Larget BR, Kotha SK, Dewey CN, Ané C (2010) BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 26:2910–2911

  16. Liu L, Pearl DK (2007) Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol 56:504–514

  17. Liu L, Yu L (2011) Estimating species trees from unrooted gene trees. Syst Biol 60:661–667

  18. Liu L, Yu L, Pearl DK, Edwards SV (2009) Estimating species phylogenies using coalescence times among sequences. Syst Biol 58:468–477

  19. Liu L, Yu L, Edwards SV (2010) A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 10:302

  20. Long C, Kubatko L (2017) Identifiability and reconstructibility of species phylogenies under a modified coalescent. arXiv:1701.06871

  21. Mirarab S, Warnow T (2015) ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31:i44–i52

  22. Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D (2011) Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9(3):e1000602

  23. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61(3):539–542

  24. Semple C, Steel M (2003) Phylogenetics. Oxford lecture series in mathematics and its applications, vol 24. Oxford University Press, Oxford

  25. Vachaspati P, Warnow T (2015) ASTRID: accurate species trees from internode distances. BMC Genom 16(Suppl 10):S3

  26. Wu Y (2012) Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3):763–775

Acknowledgements

This work was begun while ESA and JAR were Short-term Visitors and JHD was a Sabbatical Fellow at the National Institute for Mathematical and Biological Synthesis, an institute sponsored by the National Science Foundation, the US Department of Homeland Security, and the US Department of Agriculture through NSF Award #EF-0832858, with additional support from the University of Tennessee, Knoxville. It was further supported by the National Institutes of Health Grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

Author information

Corresponding author

Correspondence to John A. Rhodes.

Appendices

A Greedy Split Consensus on 5-Taxon Trees: Proofs

Here we prove Propositions 2 and 3 from Sect. 3.

With \(\mathcal {X}=\{a,b,c,d,e\}\), there are 10 non-trivial splits, each with blocks of size 2 and 3. We use notation for split probabilities given in Example 1. Computations, assisted by the software COAL (Degnan and Salter 2005), produce the formulas in Table 2 for these split probabilities on the 3 species tree shapes, \(\sigma _{\mathrm{bal}}\), \(\sigma _{pc}\), and \(\sigma _{\mathrm{cat}}\).

Table 2 Split probabilities for gene trees arising on the 5-taxon species trees under the multispecies coalescent model

Proof of Proposition 2

Given the equalities of split probabilities in Table 2, we need only show that \(s_{ab}, s_{de} \ge s_{ac}, s_{ad}, s_{cd}\) for each of the trees \(\sigma _{\mathrm{bal}}\) and \(\sigma _{\mathrm{ps}}\), for a total of 12 inequalities. Note that positive branch lengths imply \(0<X,Y,Z<1.\)

While each of the inequalities can be checked without machine assistance, as an example we use the software Maple to verify one of them: \(s_{de}>s_{cd}\) for \(\sigma _{\mathrm{bal}}\). Note

$$\begin{aligned} s_{de}-s_{cd}= 1 + \frac{1}{6} XY^3Z- \frac{1}{6} XYZ - YZ. \end{aligned}$$

The Maple command

[Maple command not recovered in this extraction]

produces as output

[Maple output not recovered in this extraction]

verifying the claim. \(\square \)
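As a numerical sanity check only (not a replacement for the symbolic verification), the displayed difference can be evaluated at interior grid points using exact rational arithmetic. The following Python sketch is an illustration: the function name `diff_bal` and the grid resolution `n` are assumptions introduced here, and the formula is copied from the display above.

```python
from fractions import Fraction
from itertools import product


def diff_bal(X, Y, Z):
    """s_de - s_cd on the balanced tree, copied from the displayed formula."""
    sixth = Fraction(1, 6)
    return 1 + sixth * X * Y**3 * Z - sixth * X * Y * Z - Y * Z


# Interior grid points i/n for i = 1..n-1, so that 0 < X, Y, Z < 1.
# The resolution n is an arbitrary choice for this illustration.
n = 20
grid = [Fraction(i, n) for i in range(1, n)]
min_diff = min(diff_bal(X, Y, Z) for X, Y, Z in product(grid, repeat=3))

print("smallest grid value is positive:", min_diff > 0)
print("value at X = Y = Z = 1:", diff_bal(1, 1, 1))
```

The difference vanishes at the corner \(X=Y=Z=1\), which is excluded by the assumption of positive branch lengths, so a grid check of this kind can only suggest, not prove, strict positivity on the open cube.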

Proof of Proposition 3

That \(s_{ab}> s_{xy}\) for \(xy\ne de\) can be verified as in the preceding proof.

Suppose now that \(s_{de} \ge s_{ab}\). Then since \(s_{ab}\) is larger than all the remaining split probabilities by the above calculations, the true non-trivial splits on the species tree have the highest probability, and greedy consensus for gene tree splits is consistent.

Now assume instead that \(s_{ab}>s_{de}\), so \(s_{ab}\) is the strict maximum of the split probabilities. Under the greedy consensus algorithm, splits incompatible with \(Sp(ab)\) are discarded, and only the splits with probabilities \(s_{cd}\), \(s_{ce}\), and \(s_{de}\) remain as candidates for acceptance by the algorithm.

It can be verified that \(s_{de} > s_{ce}\) for all \(0<X,Y,Z<1\). However,

$$\begin{aligned} s_{de} - s_{cd} = 1 + \frac{1}{18}XY^3Z^6 + \frac{1}{9}XY^3 - \frac{1}{6}XY - Y = F(X,Y,Z) \end{aligned}$$

can have either sign. If \(F(X,Y,Z)>0\), then greedy consensus will return the correct species tree. If \(F(X,Y,Z)<0\), it will return the tree ((ab), e, (cd)). \(\square \)
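That \(F\) takes both signs on the open unit cube can be illustrated by evaluating it at two points. The Python sketch below uses exact rational arithmetic; the two parameter choices are hypothetical values picked for illustration, and the formula for \(F\) is copied from the display above.

```python
from fractions import Fraction


def F(X, Y, Z):
    """F(X, Y, Z) = s_de - s_cd, copied from the displayed formula."""
    return (1
            + Fraction(1, 18) * X * Y**3 * Z**6
            + Fraction(1, 9) * X * Y**3
            - Fraction(1, 6) * X * Y
            - Y)


# Two hypothetical parameter choices with 0 < X, Y, Z < 1.
positive_case = F(Fraction(1, 2), Fraction(1, 2), Fraction(1, 2))
negative_case = F(Fraction(9, 10), Fraction(99, 100), Fraction(1, 10))

print("F(1/2, 1/2, 1/2) > 0:", positive_case > 0)
print("F(9/10, 99/100, 1/10) < 0:", negative_case < 0)
```

In this illustration the negative value arises with \(X\) and \(Y\) close to 1 and \(Z\) small.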

B Identifiability of Trees: Additional Proofs

The proofs we give of Propositions 6, 7, and 8 depend on a careful analysis of probabilities under the coalescent. That of Proposition 6 is the simplest, and serves as a model for the others.

B.1 Proof of Proposition 6

The proof of Proposition 6 depends on several lemmas.

We begin with a definition. Consider a non-binary rooted species tree \(((x_1,x_2,\ldots x_k)\text {:}L,y)\) formed by attaching a single outgroup taxon y to a claw tree with k taxa \(x_i\), with edge length \(L>0\). Under the multispecies coalescent model we will be interested in the case where the gene lineages, one for each \(x_i\), have coalesced \(\ell \) times, from k to \(k-\ell \) lineages, by the time they reach the root of the tree, and then further coalescences occur with the y lineage in the root population, until a single tree is formed. For \(\mathcal {A}\subset \mathcal {X}= \{x_1,x_2,\ldots x_k,y\}\), we denote the probability that a resulting gene tree displays a split \(Sp(\mathcal {A}_g)\) as

$$\begin{aligned} p(\mathcal {A}\mid k,\ell ). \end{aligned}$$

Note that this probability does not depend on branch lengths in the species tree, since \(L>0\) and we have conditioned on \(\ell \). Furthermore, since the \(x_i\) lineages are exchangeable under the coalescent model on this tree, \(p(\mathcal {A}\mid k,\ell )\) actually depends on \(\mathcal {A}\) only through the number of \(x_i\in \mathcal {A}\) and whether \(y\in \mathcal {A}\), but not on the particular \(x_i\in \mathcal {A}\).

By an m-split, we mean a split of taxa where one block of the partition has size m. We now give recursions and base cases for the probability of various 2-splits for the above species tree.

Lemma 3

  1. \(p(x_1x_2\mid k,0)=p(x_1y\mid k, 0)\) for \(k\ge 2\),

  2. \(p(x_1x_2\mid 3,\ell )=p(x_1y\mid 3, \ell )=\frac{1}{3}\) for \(\ell =0,1,2\),

  3. \(p(x_1y\mid k, 0)=\frac{1}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }p(x_1y\mid k-1, 0)\) for \(k\ge 3\),

  4. \(p(x_1x_2\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2\mid k-1, \ell -1)\), for \(k\ge 4\), \(k> \ell \ge 1\),

  5. \(p(x_1y\mid k, \ell )=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } p(x_1y\mid k-1, \ell -1)\) for \(k\ge 3\), \(k>\ell \ge 1\).

Proof

These all follow directly from properties of the coalescent model. We give reasoning for several, leaving the rest to the reader.

For claim (1), observe no coalescent events occur below the root of the tree, so exchangeability of lineages at the root implies the statement.

For claim (4), note that for the split \(Sp(x_1x_2)\) to form, the first coalescent event must either be between the \(x_1\) and \(x_2\) lineages, which occurs with probability \(1/\left( {\begin{array}{c}k\\ 2\end{array}}\right) \), or be between \(x_i\) lineages with \(i\ne 1,2\), which occurs with probability \(\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) /\left( {\begin{array}{c}k\\ 2\end{array}}\right) \), with the split forming subsequently. \(\square \)

We next establish some probability bounds.

Lemma 4

For \(k\ge 4\), \(k > \ell \ge 0\), \(p(x_1y \mid k,\ell ) < \frac{1}{k}\).

Proof

Lemma 3 (3) and (2) imply \(p(x_1y\mid 4,0)= \frac{1}{5} < \frac{1}{4}\). For \(k>4\), \(\ell = 0\), Lemma 3 (3) and an inductive hypothesis then show

$$\begin{aligned} p(x_1y\mid k,0)<\frac{1}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } +\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }\frac{1}{k-1}= \frac{1}{k+1} <\frac{1}{k}. \end{aligned}$$

For \(\ell \ge 1\), first consider the case that \(k-\ell =1\), 2, or 3. Then using Lemma 3 (5) repeatedly and Lemma 3 (2) shows

$$\begin{aligned} p(x_1y\mid k, \ell )&=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }\ldots \frac{\left( {\begin{array}{c}3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}4\\ 2\end{array}}\right) }p(x_1y \mid 3, \ell -k+3)\\&=\frac{6}{k(k-1)} \cdot \frac{1}{3}=\frac{2}{k(k-1)}< \frac{1}{k}. \end{aligned}$$

If instead \(k-\ell \ge 4\), Lemma 3 (5) and what has already been established imply

$$\begin{aligned} p(x_1y\mid k, \ell )&=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }\ldots \frac{\left( {\begin{array}{c}k-\ell \\ 2\end{array}}\right) }{\left( {\begin{array}{c}k-\ell +1\\ 2\end{array}}\right) } p(x_1y \mid k-\ell , 0)\\&\le \frac{(k-\ell )(k-\ell -1)}{k(k-1)}\frac{1}{k-\ell }<\frac{1}{k}. \end{aligned}$$

\(\square \)

Next, we obtain a key inequality.

Lemma 5

For \(k=3\), \(\ell =0,1,2\) and for \(k > 3\), \(\ell =0\),

$$\begin{aligned} p(x_1x_2\mid k,\ell )-p(x_1y\mid k,\ell )=0. \end{aligned}$$

For \(k\ge 4\) and \(k>\ell \ge 1\),

$$\begin{aligned} p(x_1x_2\mid k,\ell )-p(x_1y\mid k,\ell )>0. \end{aligned}$$

Proof

For \(k=3\), \(\ell =0,1,2\) and for \(k>3\), \(\ell =0\), the claimed equalities follow from Lemma 3 (2) and (1), respectively.

For the inequality when \(k\ge 4\), \(k>\ell \ge 1\), by Lemma 3 (4) and (5),

$$\begin{aligned}&p(x_1 x_2\mid k,\ell )-p(x_1y\mid k,\ell )\\&=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( 1+\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) p(x_1x_2\mid k-1,\ell -1) -\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) p(x_1y\mid k-1,\ell -1)\right) \\&=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\Bigg ( 1+\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) \big ( \, p(x_1x_2 \mid k-1,\ell -1)-p(x_1y\mid k-1,\ell -1)\, \big ) \\&\quad -(k-2)p(x_1y\mid k-1,\ell -1) \Bigg ). \end{aligned} \tag{13}$$

Using Lemma 3 (2) in Eq. (13) shows

$$\begin{aligned} p(x_1x_2\mid 4,\ell )-p(x_1y\mid 4,\ell ) = \frac{1}{18} > 0 \end{aligned}$$

for \(\ell =1,2,3\), establishing the \(k=4\) case of the inequality.
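As a cross-check, the same value can be read off directly from Lemma 3 (2), (4), and (5) with \(k=4\) and \(\ell =1,2,3\):

$$\begin{aligned} p(x_1x_2\mid 4,\ell )&=\frac{1}{6}+\frac{1}{6}\cdot \frac{1}{3}=\frac{2}{9},\\ p(x_1y\mid 4,\ell )&=\frac{3}{6}\cdot \frac{1}{3}=\frac{1}{6}, \end{aligned}$$

so that the difference is \(\frac{2}{9}-\frac{1}{6}=\frac{1}{18}\).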

For \(k > 4\), by Lemma 3 (1) and Lemma 4, Eq. (13) yields

$$\begin{aligned} p(x_1x_2\mid k,1)-p(x_1y\mid k,1)>\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( 1+\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) 0-(k-2)\frac{1}{k-1} \right) >0. \end{aligned}$$

This shows the inequality holds for \(\ell =1\), and provides base cases for an inductive proof for \(\ell \ge 1\).

Finally, Eq. (13), an inductive hypothesis, and Lemma 4 show that for \(\ell \ge 2\)

$$\begin{aligned} p(x_1x_2\mid k,\ell )-p(x_1y\mid k,\ell )>\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( 1+\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) 0 -(k-2)\frac{1}{k-1} \right) >0. \end{aligned}$$

\(\square \)
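The recursions of Lemma 3 are simple enough to evaluate exactly, which gives an independent numerical check of Lemmas 4 and 5 for small \(k\). The following Python sketch is an illustration under stated assumptions: the function names `p_x1y` and `p_x1x2` and the cutoff `K_MAX` are introduced here, and the \(k=3\) values from Lemma 3 (2) are used as base cases.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p_x1y(k, ell):
    """p(x1 y | k, ell) for k >= 3 and 0 <= ell < k."""
    if k == 3:
        return Fraction(1, 3)  # Lemma 3 (2)
    if ell == 0:
        # Lemma 3 (3)
        return (1 + comb(k - 1, 2) * p_x1y(k - 1, 0)) / comb(k + 1, 2)
    # Lemma 3 (5)
    return comb(k - 1, 2) * p_x1y(k - 1, ell - 1) / comb(k, 2)


@lru_cache(maxsize=None)
def p_x1x2(k, ell):
    """p(x1 x2 | k, ell) for k >= 3 and 0 <= ell < k."""
    if k == 3:
        return Fraction(1, 3)  # Lemma 3 (2)
    if ell == 0:
        return p_x1y(k, 0)  # Lemma 3 (1)
    # Lemma 3 (4)
    return (1 + comb(k - 2, 2) * p_x1x2(k - 1, ell - 1)) / comb(k, 2)


K_MAX = 12  # arbitrary cutoff for this check

# Lemma 4: p(x1 y | k, ell) < 1/k for k >= 4, k > ell >= 0.
lemma4_holds = all(p_x1y(k, ell) < Fraction(1, k)
                   for k in range(4, K_MAX + 1) for ell in range(k))

# Lemma 5: strict inequality for k >= 4, k > ell >= 1.
lemma5_holds = all(p_x1x2(k, ell) > p_x1y(k, ell)
                   for k in range(4, K_MAX + 1) for ell in range(1, k))

print(p_x1y(4, 0))
print(p_x1x2(4, 1) - p_x1y(4, 1))
print(lemma4_holds, lemma5_holds)
```

The first two printed values reproduce \(\frac{1}{5}\) and \(\frac{1}{18}\) from the proofs of Lemmas 4 and 5. A finite check of this kind supports, but does not replace, the inductive arguments.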

Proof of Proposition 6

That the equality holds for species tree (T, (ab)) is an instance of Theorem 2. It is enough to establish the inequality for ((Ta), b), since that for ((Tb), a) will follow by interchanging taxon names.

On the species tree ((Ta), b), let v denote the MRCA of T and a. Observe that for the splits \(Sp(ac)\) or \(Sp(bc)\) to form, it is necessary that the c lineage not coalesce with any other below v. In any such realization of the coalescent process below v, lineages from taxa on T will have coalesced to \(k-1\) lineages by v, where \(k\ge 3\). There the lineage from a enters, and \(\ell \) coalescent events, with \(k>\ell \ge 0\), occur on the edge immediately ancestral to v.

To establish the inequality, we consider it conditioned on a number of disjoint and exhaustive events: For each possible \(k, \ell \), let \(\mathcal C = \mathcal C (k, \ell )\) denote the event that \(k-1\) agglomerated lineages from T reach v, one of which is the lineage from c alone, and that \(\ell \) coalescent events occur in the population immediately ancestral to v. Fixing \(\mathcal C = \mathcal C (k, \ell )\), with \(y=b\), \(x_1=c\), \(x_2=a\) we have

$$\begin{aligned} \mathbb {P}(Sp(ac)\mid \mathcal C)&= p( x_1x_2 \mid k,\ell ),\\ \mathbb {P}(Sp(bc)\mid \mathcal C)&=p(x_1y \mid k, \ell ). \end{aligned}$$

Lemma 5 thus shows \(\mathbb {P}(Sp(ac)\mid \mathcal C)-\mathbb {P}(Sp(bc)\mid \mathcal C)\) is positive for \(k\ge 4\), \(k>\ell \ge 1\), and zero for other relevant cases. Multiplying by the probabilities of each \(\mathcal C = \mathcal C (k, \ell )\) and summing, we obtain the desired unconditioned expression \(\mathbb {P}(Sp(ac))-\mathbb {P}(Sp(bc))\). Because T has at least three taxa, there are some positive summands from \(k \ge 4\), \(\ell \ge 1\), so the desired inequality holds. \(\square \)

B.2 Proof of Proposition 7

While the proof of Proposition 7 follows the same line of reasoning as that of Proposition 6, there are further technical details. We first extend some of the results from the previous section to splits of size 3. These will be applied in arguments for the species tree \(((T,(b_1,b_2)),a)\).

Lemma 6

  1. \(p(\mathcal {A}\mid k,0)=p(\mathcal {B}\mid k,0)\) for \(|\mathcal {A}|=|\mathcal {B}|,\)

  2. \(p(x_1x_2x_3\mid k,0)=\frac{3}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } p(x_1x_2\mid k-1,0)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }p(x_1x_2x_3\mid k-1, 0)\) for \(k\ge 4\),

  3. \(p(x_1x_2x_3\mid 3,\ell )=p(x_1x_2y\mid 3,\ell )=1\), for \(\ell =0,1,2\),

  4. \(p(x_1x_2x_3\mid 4,\ell )=\frac{1}{2}p(x_1x_2\mid 3,\ell -1)\) for \(\ell = 1, 2, 3\),

  5. \(p(x_1x_2x_3\mid k,\ell )=\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2\mid k-1,\ell -1)+\frac{\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2x_3\mid k-1,\ell -1)\) for \(k\ge 5\), \(k>\ell \ge 1\),

  6. \(p(x_1x_2y\mid k,\ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1y\mid k-1,\ell -1)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2y\mid k-1,\ell -1)\) for \(k\ge 4\), \(k>\ell \ge 1.\)

Proof

For claim (1), it suffices to note that \(k+1\) lineages enter the population above the root, with no coalescent events having occurred below, so the probabilities of any two m-splits are the same by exchangeability of lineages under the coalescent model.

For claim (2), again \(k+1\) lineages enter the root population, with no previous coalescence. For \(Sp(x_1x_2x_3)\) to form, the first coalescent event above the root must be either between a pair of lineages chosen from \(x_1, x_2, x_3\), or between a pair disjoint from them. It is between a pair chosen from them with probability \(\frac{3}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }\). Then, for \(Sp(x_1 x_2 x_3)\) to form, this pair’s lineage must join with the remaining \(x_i\) lineage. By claim (1), this has probability \(p(x_1 x_2\mid k-1,0)\). Multiplying these probabilities, we obtain the first summand. If instead the first coalescent event involves none of the \(x_1,x_2,x_3\) lineages, the desired split must then form with one fewer lineage present, which gives the second summand.

The remaining verifications are left to the reader. \(\square \)

Lemma 7

For \(k\ge 4\), \(k>\ell \ge 0\),

$$\begin{aligned} p(x_1y\mid k,\ell )+p(x_1x_2y\mid k,\ell )<\frac{1}{k-2}. \end{aligned}$$

Proof

We first show the inequality for \(\ell =0\), by induction on k. From Lemmas 3 and 6, \(p(x_1y\mid 4,0)+p(x_1x_2y\mid 4,0)=\frac{2}{5},\) establishing the base case of \(k=4\).

If \(k\ge 5\), an inductive hypothesis, Lemma 3 (3), Lemma 6 (1) and (2), and Lemma 4 yield

$$\begin{aligned} p(x_1&y| k,0)+p(x_1x_2y| k,0)\\&=\frac{1}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }\Bigg ( 1+ (k+1)p(x_1y|k-1,0)\\&\quad +\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) \left( p(x_1y|k-1,0)+p(x_1x_2y|k-1,0)\right) \Bigg )\\&< \frac{1}{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }\left( 1+\frac{k+1}{k-1}+ \left( {\begin{array}{c}k-2\\ 2\end{array}}\right) \frac{1}{k-3} \right) =\frac{k^2+k+2}{(k+1)k(k-1)} <\frac{1}{k-2}. \end{aligned}$$

Next observe that for \(\ell =1,2,3\), Lemmas 3 and 6 imply

$$\begin{aligned} p(x_1y\mid 4, \ell )+p(x_1x_2y\mid 4,\ell )=\frac{7}{18}<\frac{1}{4-2}. \end{aligned}$$
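In detail, Lemma 3 (2), (5) and Lemma 6 (3), (6) give, for \(\ell =1,2,3\),

$$\begin{aligned} p(x_1y\mid 4,\ell )&=\frac{3}{6}\cdot \frac{1}{3}=\frac{1}{6},\\ p(x_1x_2y\mid 4,\ell )&=\frac{1}{6}\cdot \frac{1}{3}+\frac{1}{6}\cdot 1=\frac{2}{9}, \end{aligned}$$

and \(\frac{1}{6}+\frac{2}{9}=\frac{7}{18}\).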

With the \(k=4\), \(\ell =1,2,3\) cases and the \(k\ge 4\), \(\ell =0\) cases already established, we now proceed by induction on \(\ell \). For \(k\ge 5\), \(k>\ell \ge 1\) by Lemma 3 (5), Lemma 6 (6), Lemma 4, and an inductive hypothesis,

$$\begin{aligned} p(x_1&y\mid k,\ell )+p(x_1x_2y\mid k,\ell )\\&=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) +1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1y\mid k-1,\ell -1) +\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1x_2y\mid k-1,\ell -1)\\&=\frac{k-1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }p(x_1y\mid k-1,\ell -1) +\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( p(x_1y\mid k-1, \ell -1)\right. \\&\quad \left. +p(x_1x_2y\mid k-1,\ell -1)\right) \\&<\frac{k-1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\frac{1}{k-1} +\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \frac{1}{k-3}=\frac{1}{k-1}<\frac{1}{k-2}. \end{aligned}$$

\(\square \)

Lemma 8

Let

$$\begin{aligned} P(k,\ell )=p(x_1x_2\mid k,\ell )+p(x_1x_2x_3\mid k,\ell )-p(x_1y\mid k,\ell )-p(x_1x_2y\mid k,\ell ). \end{aligned}$$

Then for \(k=4\), \(\ell =0,1,2,3\), and for \(k\ge 5\), \(\ell =0\), \(P(k,\ell )=0.\) For \(k\ge 5\), \(k>\ell \ge 1\), \(P(k,\ell )>0.\)

Proof

Note that for \(k=4\), the event \(Sp(x_1x_2)\) is the same as \(Sp(x_3x_4y)\), and \(Sp(x_1x_2x_3)\) is the same as \(Sp(x_4y)\), so using exchangeability of the \(x_i\) lineages we have

$$\begin{aligned} p(x_1x_2\mid 4,\ell )&=p(x_1x_2y\mid 4,\ell ),\\ p(x_1x_2x_3\mid 4,\ell )&=p(x_1y\mid 4,\ell ). \end{aligned}$$

Thus \(P(4,\ell )=0\) for \(\ell = 0, 1, 2, 3\). For \(k\ge 5\), Lemma 6 (1) implies \(P(k,0)=0\).

For \(k\ge 5\), \(\ell \ge 1\), by Lemmas 3 (4), (5) and 6 (5), (6), we find

$$\begin{aligned} P(k,\ell )&=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\bigg [ 1+(k-2) \left( p(x_1x_2\mid k-1,\ell -1) -p(x_1y\mid k-1,\ell -1)\right) \\&\quad -(k-2) \left( p(x_1y\mid k-1,\ell -1)+p(x_1x_2y\mid k-1,\ell -1) \right) \\&\quad +2p(x_1x_2\mid k-1,\ell -1) +p(x_1x_2y\mid k-1,\ell -1)\\&\quad +\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) P(k-1,\ell -1)\bigg ]. \end{aligned}$$

Using Lemmas 5 and 7, the nonnegativity of probabilities, and an inductive hypothesis that \(P(k-1,\ell -1)\ge 0\), it follows that

$$\begin{aligned} P(k,\ell )>\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( 1+(k-2)\cdot 0 -(k-2) \frac{1}{k-2}+2\cdot 0 +0+\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) 0 \right) =0. \end{aligned}$$

\(\square \)
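The sign pattern of \(P(k,\ell )\) can also be explored numerically by evaluating the recursions of Lemmas 3 and 6 exactly. The Python sketch below is an illustration only: the function `p`, its string labels for splits, and the cutoff `K_MAX` are assumptions introduced here. The \(\ell =0\) equalities of Lemma 3 (1) and Lemma 6 (1) are built into the code, so only the \(\ell \ge 1\) cases are a genuine check.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p(split, k, ell):
    """p(split | k, ell) for k >= 3, 0 <= ell < k.

    split is one of "x1y", "x1x2", "x1x2x3", "x1x2y".
    """
    ck, ck1 = comb(k, 2), comb(k + 1, 2)
    if split == "x1y":
        if k == 3:
            return Fraction(1, 3)                                    # Lemma 3 (2)
        if ell == 0:
            return (1 + comb(k - 1, 2) * p("x1y", k - 1, 0)) / ck1   # Lemma 3 (3)
        return comb(k - 1, 2) * p("x1y", k - 1, ell - 1) / ck        # Lemma 3 (5)
    if split == "x1x2":
        if k == 3:
            return Fraction(1, 3)                                    # Lemma 3 (2)
        if ell == 0:
            return p("x1y", k, 0)                                    # Lemma 3 (1)
        return (1 + comb(k - 2, 2) * p("x1x2", k - 1, ell - 1)) / ck  # Lemma 3 (4)
    if split == "x1x2x3":
        if k == 3:
            return Fraction(1)                                       # Lemma 6 (3)
        if ell == 0:
            return (3 * p("x1x2", k - 1, 0)
                    + comb(k - 2, 2) * p("x1x2x3", k - 1, 0)) / ck1  # Lemma 6 (2)
        # For k = 4 this reduces to Lemma 6 (4), since comb(1, 2) == 0.
        return (3 * p("x1x2", k - 1, ell - 1)
                + comb(k - 3, 2) * p("x1x2x3", k - 1, ell - 1)) / ck  # Lemma 6 (5)
    if split == "x1x2y":
        if k == 3:
            return Fraction(1)                                       # Lemma 6 (3)
        if ell == 0:
            return p("x1x2x3", k, 0)                                 # Lemma 6 (1)
        return (p("x1y", k - 1, ell - 1)
                + comb(k - 2, 2) * p("x1x2y", k - 1, ell - 1)) / ck  # Lemma 6 (6)
    raise ValueError(f"unknown split label: {split}")


def P(k, ell):
    """P(k, ell) as defined in Lemma 8."""
    return (p("x1x2", k, ell) + p("x1x2x3", k, ell)
            - p("x1y", k, ell) - p("x1x2y", k, ell))


K_MAX = 10  # arbitrary cutoff for this check
zero_for_k4 = all(P(4, ell) == 0 for ell in range(4))
positive_for_k5_up = all(P(k, ell) > 0
                         for k in range(5, K_MAX + 1) for ell in range(1, k))

print(zero_for_k4, positive_for_k5_up)
print(P(5, 1))
```

For instance, the recursions give \(P(5,1)=\frac{1}{25}\), the smallest case in which the strict inequality of Lemma 8 applies.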

Lemma 9

Consider a species tree with topology \(((T,(b_1,b_2)),a)\), where T is a subtree on at least three taxa, one of which is c. Suppose the edge above \((T,(b_1,b_2))\) has positive length. Then under the multispecies coalescent model,

$$\begin{aligned} \mathbb {P}(Sp(ac))+\mathbb {P}(Sp(ab_2c))-\mathbb {P}(Sp(b_1c))- \mathbb {P}(Sp(b_1b_2c))<0. \end{aligned}$$

Proof

Let v denote the MRCA on the species tree of the taxa on T and the \(b_i\).

To establish the claimed inequality, it is enough to show it holds when conditioned on whether \(b_1\) and \(b_2\) lineages have coalesced before reaching v or not. If they have coalesced before v to form a single lineage, then the events \(Sp(ab_2c)\) and \(Sp(b_1c)\) have probability zero. Thus using b for \(b_1b_2\), we wish to show

$$\begin{aligned} \mathbb {P}(Sp(ac))-\mathbb {P}(Sp(bc))<0. \end{aligned}$$

This follows immediately from Proposition 6.

We henceforth condition on the two \(b_i\) lineages being distinct at v. Noticing that all four probabilities in the expression of interest are 0 if the c lineage coalesces with any lineage below v, we further condition on the c lineage being distinct at v, where there are thus \(k\ge 4\) lineages entering the population above v, and \(\ell \) coalescent events occurring between v and the root.

Then, with \(\mathcal C=\mathcal C(k,\ell )\) denoting the events we condition on,

$$\begin{aligned} \mathbb {P}(Sp(ac)\mid \mathcal C)&=p(x_1y\mid k,\ell ),\\ \mathbb {P}(Sp(ab_2c)\mid \mathcal C)&=p(x_1x_2y\mid k, \ell ),\\ \mathbb {P}(Sp(b_1c)\mid \mathcal C)&=p(x_1x_2\mid k,\ell ),\\ \mathbb {P}(Sp(b_1b_2c) \mid \mathcal C)&=p(x_1x_2x_3\mid k,\ell ). \end{aligned}$$

From Lemma 8 we find conditioned on \(\mathcal C\) that the expression is strictly negative for \(k\ge 5\), \(k > \ell \ge 1\), and zero for \(k\ge 5\), \(\ell =0\) and \(k=4\), \(\ell =0,1,2,3\). Thus weighting the conditioned expressions by the probabilities of the events \(\mathcal C\) and summing, we see the full expression is negative, as long as \(k \ge 5\) and \(\ell \ge 1\) is possible. Since T has at least 3 taxa, this only requires that the edge above v has positive length. \(\square \)

To handle the species tree \(((T,a),(b_1,b_2))\) we proceed analogously, but consider a rooted species tree \(((x_1,x_2,\ldots x_k)\text {:}L,y_1,y_2)\) formed by attaching a trifurcating root to two outgroups \(y_1,y_2\) and a claw tree with k taxa, with a positive edge length L. We will be interested in the case where the gene lineages, one for each \(x_i\), have coalesced \(\ell \) times, from k to \(k-\ell \) lineages, by the time they reach the root of the tree, and then further coalescence occurs in the root population until a single tree is formed. With \(\mathcal {X}=\{x_1,\ldots x_k,y_1,y_2\}\) and \(\mathcal {A}\subset \mathcal {X}\), let

$$\begin{aligned} r(\mathcal {A}\mid k,\ell )=\mathbb {P}(Sp(\mathcal {A} )\mid k,\ell ) \end{aligned}$$

for this species tree. By exchangeability of lineages in the coalescent model, \(r(\mathcal {A}\mid k,\ell )\) depends on \(\mathcal {A}\) only up to the number of \(x_i\) and the number of \(y_i\) it contains.

The reader who has followed previous arguments should be able to verify the following.

Lemma 10

  1. \(r(\mathcal {A} \mid k,0)=r(\mathcal {B}\mid k, 0)\) for \(|\mathcal {A}|=|\mathcal {B}|\),

  2. \(r(x_1 x_2 \mid 3, 0) = \frac{1}{5}\), \(r(x_1x_2\mid 3,\ell )=\frac{1}{3}\) for \(\ell =1,2\),

  3. \(r(x_1x_2\mid k, 0)=\frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1x_2\mid k-1, 0)\) for \(k\ge 3\),

  4. \(r(x_1x_2\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2\mid k-1, \ell -1)\), for \(k\ge 4\), \(k> \ell \ge 1\),

  5. \(r(x_1y_1 \mid 2,0)=\frac{1}{3}\), \(r(x_1y_1 \mid 2,1)=0\),

  6. \(r(x_1y_1\mid k, \ell )=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } r(x_1y_1\mid k-1, \ell -1)\) for \(k\ge 3\), \(k>\ell \ge 1\),

  7. \(r(x_1x_2x_3 \mid 4,\ell )= \frac{1}{2} r(x_1x_2\mid 3, \ell -1)\) for \(\ell = 1, 2, 3\),

  8. \(r(x_1x_2x_3 \mid k,\ell )=\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } r(x_1x_2\mid k-1, \ell -1)+\frac{\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } r(x_1x_2x_3\mid k-1, \ell -1)\) for \(k\ge 5\), \(k>\ell \ge 1\),

  9. \(r(x_1x_2y_1\mid 2,0)=r(x_1x_2y_1\mid 2,1) =1\), \(r(x_1x_2y_1\mid 3,1)=\frac{1}{9}\), \(r(x_1x_2y_1\mid 3,2)=0\),

  10. \(r(x_1x_2y_1\mid k, 0)=\frac{3}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1x_2\mid k-1,0)+\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } r(x_1x_2y_1\mid k-1, 0)\) for \(k\ge 3\),

  11. \(r(x_1x_2y_1\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1y_1\mid k-1,\ell -1)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2y_1\mid k-1, \ell -1)\) for \(k\ge 4\), \(k>\ell \ge 1\),

  12. \(r(x_1 y_1 y_2 \mid 2, 0) = r(x_1y_1y_2\mid 2, 1)=1\),

  13. \(r(x_1y_1y_2\mid k, \ell )=\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1y_1y_2\mid k-1, \ell -1)\) for \(k\ge 3\), \(k>\ell \ge 1\).

Lemma 11

  1. \(r(x_1y_1 \mid k,0)\le \frac{1}{k+2}\) for \(k\ge 3\),

  2. \(r(x_1y_1 \mid k,\ell )+r(x_1y_1y_2 \mid k,\ell )< \frac{1}{k-1}\) if \(k \ge 3\) and \(\ell = 0\), or if \(k\ge 4\) and \(k> \ell \ge 1\).

Proof

For claim (1) first note that Lemma 10 (1) and (2) establish the \(k=3\) case. Then using Lemma 10 (3) one sees inductively that for \(k>3\),

$$\begin{aligned} r(x_1y_1\mid k, 0)&\le \frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } + \frac{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }\frac{1}{k+1} =\frac{k^2+k+2}{(k+2)(k+1)^2}\\&\le \frac{(k+1)^2}{(k+2)(k+1)^2}=\frac{1}{k+2}. \end{aligned}$$

For claim (2) when \(\ell = 0\), note that by Lemma 10 (1), (2), (5), (9), and (10),

$$\begin{aligned} r(x_1y_1 \mid 3,0)+r(x_1y_1y_2\mid 3, 0)=\frac{1}{5} +\frac{3}{10}\cdot \frac{1}{3} +\frac{1}{10}\cdot 1=\frac{2}{5}<\frac{1}{2}, \end{aligned}$$

so the base case of \(k=3\) holds. Then for \(k> 3\), using Lemma 10 (1), (3), and (10) we have

$$\begin{aligned}&r(x_1y_1\mid k, 0)+r(x_1y_1y_2\mid k, 0)\\&=\frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } +\frac{\left( {\begin{array}{c}k\\ 2\end{array}}\right) +3}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } r(x_1y_1\mid k-1, 0) +\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1y_1y_2\mid k-1, 0)\\&=\frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } +\frac{k+2}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } r(x_1y_1\mid k-1, 0) \\&\quad +\frac{\left( {\begin{array}{c}k-1\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } \bigg ( r(x_1y_1\mid k-1, 0) +r(x_1y_1y_2\mid k-1, 0)\bigg ). \end{aligned}$$

Using an inductive hypothesis and claim (1) of this lemma yields

$$\begin{aligned}&r(x_1y_1\mid k, 0)+r(x_1y_1y_2\mid k, 0)\\&<\frac{2}{(k+2)(k+1)} +\frac{2}{k+1} \cdot \frac{1}{k+1} +\frac{(k-1)(k-2)}{(k+2)(k+1)} \cdot \frac{1}{k-2}\\&=\frac{1}{k+2} +\frac{2}{(k+1)^2}<\frac{1}{k+1} +\frac{2}{(k+1)^2} <\frac{1}{k-1}. \end{aligned}$$

Assume now \(k \ge 4\) and \(k > \ell \ge 1\), and consider first the case that \(k-\ell =1\) or 2. Applying Lemma 10 (6) and (13) repeatedly, we have

$$\begin{aligned}&r(x_1y_1\mid k, \ell )+r(x_1y_1y_2\mid k, \ell )\\&\quad =\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\big (r(x_1y_1\mid 3, \ell -k+3)+r(x_1y_1y_2\mid 3, \ell -k+3)\big ). \end{aligned}$$

From Lemma 10

$$\begin{aligned} r(x_1y_1\mid 3,1)+r(x_1y_1y_2\mid 3,1)&=\frac{1}{9}+ \frac{1}{3}=\frac{4}{9},\\ r(x_1y_1\mid 3,2)+r(x_1y_1y_2\mid 3,2)&=0+\frac{1}{3}=\frac{1}{3}, \end{aligned}$$

so for \(k\ge 4\),

$$\begin{aligned} r(x_1y_1\mid k, \ell )+r(x_1y_1y_2\mid k, \ell )\le \frac{6}{k(k-1)} \cdot \frac{4}{9}< \frac{1}{k-1}. \end{aligned}$$

If \(k-\ell \ge 3\), then applying Lemma 10 (6) and (13) repeatedly gives

$$\begin{aligned} r(x_1y_1\mid k, \ell )+r(x_1y_1y_2\mid k, \ell )=\frac{\left( {\begin{array}{c}k-\ell \\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( r(x_1y_1\mid k-\ell , 0)+r(x_1y_1y_2\mid k-\ell ,0)\right) . \end{aligned}$$

Using what we proved above, this shows

$$\begin{aligned} r(x_1y_1\mid k, \ell )+r(x_1y_1y_2\mid k, \ell )<\frac{\left( {\begin{array}{c}k-\ell \\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\cdot \frac{1}{k-\ell -1} <\frac{1}{k-1}. \end{aligned}$$
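The final inequality here can be checked by simplifying the binomial ratio, a routine step spelled out for convenience:

$$\begin{aligned} \frac{\left( {\begin{array}{c}k-\ell \\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\cdot \frac{1}{k-\ell -1} =\frac{(k-\ell )(k-\ell -1)}{k(k-1)}\cdot \frac{1}{k-\ell -1} =\frac{k-\ell }{k(k-1)}<\frac{1}{k-1}, \end{aligned}$$

where the last step uses \(\ell \ge 1\), so that \(k-\ell <k\).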

\(\square \)

Lemma 12

Let

$$\begin{aligned} R(k, \ell ) = r(x_1x_2\mid k,\ell )+r(x_1x_2y_1\mid k,\ell )-\,r(x_1y_1\mid k,\ell )-r(x_1y_1y_2\mid k,\ell ). \end{aligned}$$

Then for \(k=3\), \(\ell =0,1,2\), and for \(k\ge 4\), \(\ell =0\), \(R(k,\ell )=0.\) For \(k\ge 4\) and \(k>\ell \ge 1\), \(R( k,\ell )>0.\)

Proof

For \(k=3\), the events \(Sp(x_1x_2)\) and \(Sp(x_3y_1y_2)\) are the same, as are \(Sp(x_1x_2y_1)\) and \(Sp(x_3y_2)\), so using exchangeability of the \(x_i\) and of the \(y_i\) lineages

$$\begin{aligned} r(x_1x_2\mid 3, \ell )&=r(x_1y_1y_2\mid 3, \ell ),\\ r(x_1x_2y_1\mid 3, \ell )&=r( x_1y_1 \mid 3, \ell ). \end{aligned}$$

Thus \(R(3,\ell )=0\) for \(\ell =0,1,2\). For \(k\ge 3\), Lemma 10 (1) implies \(R(k,0)=0\).

Now consider \(k\ge 4\), \(k>\ell \ge 1\). By Lemma 10 (4), (6), (11), and (13) we find

$$\begin{aligned} R(k,\ell )= & {} \frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\Big ( 1 +r(x_1y_1 \mid k-1,\ell -1) +\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) R(k-1,\ell -1) \\&\quad -(k-2)\big ( r(x_1y_1\mid k-1, \ell -1) +r(x_1y_1y_2\mid k-1,\ell -1) \big ) \Big ). \end{aligned}$$

An inductive hypothesis that \(R(k-1,\ell -1)\ge 0\), Lemma 11, and the positivity of \(r(x_1y_1 \mid k-1,\ell -1)\) then show

$$\begin{aligned} R(k,\ell ) > \frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\left( 1 + 0 +\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) \cdot 0 -(k-2)\frac{1}{k-2}\right) = 0. \end{aligned}$$

\(\square \)
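As an independent sanity check on Lemma 12, the quantities \(r(\cdot \mid k,\ell )\) can be computed by brute-force enumeration for small k. The sketch below is not the authors' method (it does not use the recursions of Lemma 10). It assumes that \(r(A\mid k,\ell )\) is the probability that the unrooted gene tree displays the split separating A from the remaining lineages, when k lineages \(x_1,\dots ,x_k\) enter an edge on which exactly \(\ell \) coalescences occur, after which \(y_1,y_2\) join and all lineages coalesce freely. The function and helper names are hypothetical. Under this assumption the code reproduces the base values quoted in the proofs, such as \(r(x_1y_1\mid 3,0)=\frac{1}{5}\) and \(r(x_1y_1\mid 3,1)=\frac{1}{9}\).

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations


def r(subset, k, ell):
    """Exact probability that the gene tree displays subset | complement.

    Assumed setup: lineages x1..xk enter an edge on which exactly `ell`
    coalescences occur; y1 and y2 then join, and everything coalesces freely.
    """
    assert 0 <= ell < k
    xs = frozenset(f"x{i}" for i in range(1, k + 1))
    ys = frozenset({"y1", "y2"})
    side = frozenset(subset)
    other = (xs | ys) - side

    def displayed(lineages):
        # The split is displayed once either of its sides is a single lineage.
        return side in lineages or other in lineages

    def merges(lineages):
        # Each unordered pair of current lineages is equally likely to merge.
        pairs = list(combinations(lineages, 2))
        for a, b in pairs:
            yield (lineages - {a, b}) | {a | b}, Fraction(1, len(pairs))

    @lru_cache(maxsize=None)
    def pooled(lineages):
        # Unrestricted coalescence after y1 and y2 have joined.
        if displayed(lineages):
            return Fraction(1)
        return sum((w * pooled(nxt) for nxt, w in merges(lineages)), Fraction(0))

    @lru_cache(maxsize=None)
    def on_edge(lineages, left):
        # `left` coalescences among x-lineages remain on the edge.
        if displayed(lineages):
            return Fraction(1)
        if left == 0:
            return pooled(lineages | {frozenset({y}) for y in ys})
        return sum((w * on_edge(nxt, left - 1) for nxt, w in merges(lineages)),
                   Fraction(0))

    return on_edge(frozenset(frozenset({x}) for x in xs), ell)


def R(k, ell):
    """The combination R(k, l) defined in Lemma 12."""
    return (r({"x1", "x2"}, k, ell) + r({"x1", "x2", "y1"}, k, ell)
            - r({"x1", "y1"}, k, ell) - r({"x1", "y1", "y2"}, k, ell))


print(r({"x1", "y1"}, 3, 0), r({"x1", "y1"}, 3, 1))  # 1/5 1/9
print(*[R(3, l) for l in range(3)], R(4, 0))         # 0 0 0 0
print(*[R(4, l) for l in (1, 2, 3)])                 # 1/15 1/27 1/18
```

The enumeration grows quickly with k, so it is only meant for spot checks at small sizes; the recursions in the lemmas remain the tool for general k.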

Lemma 13

Consider a species tree with topology \(((T,a),(b_1,b_2))\), where T is a subtree on at least three taxa, one of which is c. Suppose the edge above \((T,a)\) has positive length. Then under the multispecies coalescent model,

$$\begin{aligned} \mathbb {P}(Sp(ac))+\mathbb {P}(Sp(ab_2c))-\mathbb {P}(Sp(b_1c))-\mathbb {P}(Sp(b_1b_2c))>0. \end{aligned}$$

Proof

Let \(\rho \) denote the root of the species tree, and v the MRCA of the taxa on T and a.

To establish the claimed inequality, it is enough to show it holds when conditioned on whether the \(b_1\) and \(b_2\) lineages have coalesced before reaching \(\rho \) or not. If they have coalesced below \(\rho \) to form a single lineage, then the events \(Sp(ab_2c)\) and \(Sp(b_1c)\) have probability zero. Thus using b for \(b_1b_2\), we wish to show

$$\begin{aligned} \mathbb {P}(Sp(ac))-\mathbb {P}(Sp(bc))>0. \end{aligned}$$

This follows immediately from Proposition 6.

We henceforth condition on the event that the lineages from \(b_1\) and \(b_2\) are distinct at \(\rho \). Noticing that all four probabilities in the expression of interest are 0 if the c lineage coalesces with any lineage below v, we further condition on the event that the c lineage is distinct at v, so there are \(k\ge 3\) distinct lineages at v, and that \(\ell \) coalescent events occur on the edge above v. Calling this event \(\mathcal C = \mathcal C(k, \ell )\),

$$\begin{aligned} \mathbb {P}(Sp(ac)\mid \mathcal C)&=r(x_1x_2\mid k,\ell ),\\ \mathbb {P}(Sp(ab_2c)\mid \mathcal C)&=r(x_1x_2y_1\mid k, \ell ),\\ \mathbb {P}(Sp(b_1c)\mid \mathcal C)&=r(x_1y_1\mid k,\ell ),\\ \mathbb {P}(Sp(b_1b_2c)\mid \mathcal C)&=r(x_1y_1y_2\mid k,\ell ). \end{aligned}$$

From Lemma 12 we find that, conditioned on \(\mathcal C\), the expression of interest is strictly positive for \(k\ge 4\), \(k > \ell \ge 1\), and zero for \(k=3\), \(\ell =0,1,2\) and for \(k\ge 4\), \(\ell =0\). Weighting the conditioned expressions by the probabilities of the events \(\mathcal C(k,\ell )\) and summing, we get the unconditioned expression. Since T has at least 3 taxa and the edge above v has positive length, at least one summand corresponds to an event \(\mathcal C(k,\ell )\) with \(k\ge 4\), \(k>\ell \ge 1\); thus the full expression is positive. \(\square \)

Finally, Proposition 7 follows from Theorem 2 together with Lemmas 9 and 13.

B.3 Proof of Proposition 8

To establish Proposition 8, we first extend the results of Lemma 10, and those that follow it, to splits of size 4.

A proof of the following is left to the reader.

Lemma 14

  1. \(r(x_1x_2x_3y_1\mid 3, 0)=1\),

  2. \(r(x_1x_2x_3y_1\mid k, 0)=\frac{6}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } r(x_1x_2y_1\mid k-1, 0)+\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) }r(x_1x_2x_3y_1\mid k-1,0)\) for \(k\ge 4\),

  3. \(r(x_1x_2x_3y_1\mid 4,\ell )=\frac{1}{2}r(x_1x_2y_1\mid 3,\ell -1)\) for \(\ell = 1, 2, 3\),

  4. \(r(x_1x_2x_3y_1\mid k,\ell )=\frac{3}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2y_1\mid k-1,\ell -1) +\frac{\left( {\begin{array}{c}k-3\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2x_3y_1\mid k-1,\ell -1)\) for \(k\ge 5\), \(k>\ell \ge 1\),

  5. \(r(x_1x_2y_1y_2\mid 3, \ell )=1\) for \(\ell =0,1,2\),

  6. \(r(x_1x_2y_1y_2\mid k, \ell )=\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1y_1y_2\mid k-1,\ell -1) +\frac{\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) }{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }r(x_1x_2y_1y_2\mid k-1,\ell -1)\) for \(k\ge 4\), \(k>\ell \ge 1\).

Lemma 15

Let \(U(k,\ell )=\)

$$\begin{aligned} r(x_1y_1\mid k, \ell )+r(x_1x_2y_1\mid k, \ell ) +r(x_1y_1y_2\mid k, \ell )+r(x_1x_2y_1y_2\mid k,\ell ). \end{aligned}$$

Then \(U(k,\ell )< \frac{1}{k-2}\) for \(k\ge 4\), \(k>\ell \ge 0\).

Proof

We first take up the case that \(\ell = 0\), and observe by Lemmas 10 and 14 that for \(k\ge 4\),

$$\begin{aligned} U(k,0)&=\frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } \Bigg ( 1+(2k+3) \left( r(x_1y_1\mid k-1,0) +r(x_1y_1y_2 \mid k-1,0)\right) \\&\quad - r(x_1y_1y_2 \mid k-1,0)+\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) U(k-1,0) \Bigg ). \end{aligned}$$

Since

$$\begin{aligned} U(3,0)=\frac{1}{5} +\frac{1}{5}+\frac{1}{5}+1=\frac{8}{5}, \end{aligned}$$

we see

$$\begin{aligned} U(4,0)=\frac{1}{15}\left( 1+11\left( \frac{1}{5} +\frac{1}{5}\right) -\frac{1}{5}+1\cdot \frac{8}{5} \right) =\frac{34}{75}<\frac{1}{4-2}, \end{aligned}$$

establishing the \(k=4, \ell = 0\) case. Proceeding inductively for \(k\ge 5\), and using Lemma 11 (2), we have

$$\begin{aligned} U(k,0)&<\frac{1}{\left( {\begin{array}{c}k+2\\ 2\end{array}}\right) } \left( 1+(2k+3)\frac{1}{k-2} -0 +\left( {\begin{array}{c}k-2\\ 2\end{array}}\right) \frac{1}{k-3} \right) \\&= \frac{k^2 + 2k + 6}{k^2+3k+2} \cdot \frac{1}{k-2}<\frac{1}{k-2}. \end{aligned}$$

For \(\ell > 0\), if \(k\ge 4\), \(k>\ell \ge 1\), Lemmas 10 and 14 show

$$\begin{aligned} U(k,\ell )= & {} \frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\Bigg ( \left( {\begin{array}{c}k-2\\ 2\end{array}}\right) U(k-1,\ell -1)\nonumber \\&\quad +(k-1) \left( r(x_1y_1\mid k-1,\ell -1)+r(x_1y_1y_2\mid k-1,\ell -1)\right) \Bigg ).\qquad \end{aligned}$$
(14)

In particular, since

$$\begin{aligned} U(3,1)&=\frac{1}{9}+\frac{1}{9} +\frac{1}{3}+1=\frac{14}{9},\\ U(3,2)&=0+ 0+\frac{1}{3}+1=\frac{4}{3}, \end{aligned}$$

then

$$\begin{aligned} U(4,1)&=\frac{1}{6}\left( \frac{8}{5}+3\left( \frac{1}{5}+\frac{1}{5}\right) \right) =\frac{7}{15}<\frac{1}{4-2},\\ U(4,2)&=\frac{1}{6}\left( \frac{14}{9}+3\left( \frac{1}{9}+\frac{1}{3}\right) \right) = \frac{13}{27}<\frac{1}{4-2},\\ U(4,3)&=\frac{1}{6}\left( \frac{4}{3}+3\left( 0+\frac{1}{3}\right) \right) =\frac{7}{18}<\frac{1}{4-2}, \end{aligned}$$

providing, along with the cases with \(\ell = 0\), the base cases for induction. Now for \(k\ge 5\), \(k>\ell \ge 1\), we see from Eq. (14), Lemma 11 (2), and an inductive hypothesis that

$$\begin{aligned} U(k,\ell )&<\frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\Bigg ( \left( {\begin{array}{c}k-2\\ 2\end{array}}\right) \frac{1}{k-3}+(k-1) \frac{1}{k-2} \Bigg ) \\&=\frac{k^2-2k+2}{k(k-1)(k-2)}<\frac{1}{k-2}. \end{aligned}$$

\(\square \)
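The arithmetic in the proof of Lemma 15 can be rechecked with exact rational numbers. The sketch below is only a verification aid under stated assumptions: it takes the \(k=3\) values quoted in the proofs as given, applies the right-hand side of Eq. (14) to recover \(U(4,\ell )\) for \(\ell =1,2,3\), and confirms the two closed forms used in the inductive steps over a finite range of k. The helper names (`u_step`, `bound_l0`, `bound_l_pos`) are hypothetical.

```python
from fractions import Fraction as F
from math import comb


def u_step(k, u_prev, r_pair, r_triple):
    """Right-hand side of Eq. (14), given the values at (k-1, l-1)."""
    return F(1, comb(k, 2)) * (comb(k - 2, 2) * u_prev
                               + (k - 1) * (r_pair + r_triple))


# Values at k = 3 quoted in the proofs, keyed by l:
# (U(3, l), r(x1y1 | 3, l), r(x1y1y2 | 3, l)).
base = {
    0: (F(8, 5), F(1, 5), F(1, 5)),
    1: (F(14, 9), F(1, 9), F(1, 3)),
    2: (F(4, 3), F(0), F(1, 3)),
}
u4 = {l + 1: u_step(4, *base[l]) for l in base}


def bound_l0(k):
    """Upper bound on U(k, 0) used in the inductive step (k >= 5)."""
    return F(1, comb(k + 2, 2)) * (1 + F(2 * k + 3, k - 2)
                                   + comb(k - 2, 2) * F(1, k - 3))


def bound_l_pos(k):
    """Upper bound on U(k, l), l >= 1, used in the inductive step (k >= 5)."""
    return F(1, comb(k, 2)) * (comb(k - 2, 2) * F(1, k - 3) + F(k - 1, k - 2))


# Check the stated closed forms and the final inequalities for a range of k.
for k in range(5, 200):
    assert bound_l0(k) == F(k * k + 2 * k + 6, (k * k + 3 * k + 2) * (k - 2))
    assert bound_l_pos(k) == F(k * k - 2 * k + 2, k * (k - 1) * (k - 2))
    assert bound_l0(k) < F(1, k - 2) and bound_l_pos(k) < F(1, k - 2)

print(u4)  # {1: Fraction(7, 15), 2: Fraction(13, 27), 3: Fraction(7, 18)}
```

A finite range of k does not replace the algebra in the proof; it only guards against slips in the simplifications.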

Lemma 16

Let

$$\begin{aligned} S(k,\ell )= & {} r(x_1x_2\mid k,\ell )+r(x_1x_2x_3\mid k,\ell )+r(x_1x_2x_3y_1 \mid k, \ell )\\&\quad -\,r(x_1y_1\mid k,\ell )-r(x_1y_1y_2\mid k,\ell )-r(x_1x_2y_1y_2 \mid k,\ell ). \end{aligned}$$

Then for \(k=4\), \(\ell =0, 1,2,3\) and for \(k\ge 5\), \(\ell =0\), \(S(k,\ell )= 0\). For \(k\ge 5\), \(k>\ell \ge 1\), \(S(k,\ell )>0\).

Proof

For \(k=4\), the events satisfy \(Sp(x_1x_2)=Sp(x_3x_4y_1y_2)\), \(Sp(x_1x_2x_3)=Sp(x_4y_1y_2)\), and \(Sp(x_1x_2x_3y_1)=Sp(x_4y_2)\), so using exchangeability of the \(x_i\) and of the \(y_i\) lineages we have

$$\begin{aligned} r(x_1x_2\mid 4,\ell )&=r(x_1x_2y_1y_2\mid 4,\ell ),\\ r(x_1x_2x_3\mid 4,\ell )&=r(x_1y_1y_2\mid 4,\ell ),\\ r(x_1x_2x_3y_1\mid 4,\ell )&=r(x_1y_1\mid 4,\ell ), \end{aligned}$$

so \(S(4,\ell )=0\) for \(\ell =0,1,2,3\). For \(k\ge 5\), Lemma 10 (1) implies \(S(k,0)=0\).

For \(k\ge 5\), \(k>\ell \ge 1\), using Lemmas 10 and 14 we find

$$\begin{aligned} S(k,\ell )&= \frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\Bigg ( \left( {\begin{array}{c}k-3\\ 2\end{array}}\right) S(k-1,\ell -1) -(k-3)U(k-1,\ell -1) +kR(k-1,\ell -1)\\&\quad + 1+ 2r(x_1y_1\mid k-1,\ell -1)+r(x_1y_1y_2 \mid k-1,\ell -1) \Bigg ). \end{aligned}$$

Using an inductive hypothesis, Lemmas 12 and 15, and nonnegativity of probabilities, this implies

$$\begin{aligned} S(k,\ell )> \frac{1}{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \left( \left( {\begin{array}{c}k-3\\ 2\end{array}}\right) \cdot 0-(k-3)\frac{1}{k-3}+k\cdot 0+1+ 2 \cdot 0+0\right) =0. \end{aligned}$$

\(\square \)

Proof of Proposition 8

On the species tree \(((T,(a_1,a_2)),(b_1,b_2))\) let \(\rho \) denote the root, v the MRCA of the taxa on T and the \(a_i\), and let c be a taxon on T. We first show that, since the edge above v has positive length,

$$\begin{aligned}&\mathbb {P}(Sp(a_1c))+\mathbb {P}(Sp(a_1a_2c))+\mathbb {P}(Sp(a_1b_2c))+\mathbb {P}(Sp(a_1a_2b_2c))\nonumber \\&\quad -\mathbb {P}(Sp(b_1c))- \mathbb {P}(Sp(b_1a_2 c))- \mathbb {P}(Sp(b_1b_2c)) - \mathbb {P}(Sp( b_1a_2 b_2c))>0.\qquad \quad \end{aligned}$$
(15)

To establish this, it is enough to show it holds when conditioned on whether or not the \(a_1\) and \(a_2\) lineages have coalesced before reaching v, and whether or not the \(b_1\) and \(b_2\) lineages have coalesced before reaching \(\rho \). If both pairs have coalesced in this way, then the events \(Sp(a_1c)\), \(Sp(a_1b_2c)\), \(Sp(a_1a_2b_2c)\), \(Sp(b_1c)\), \(Sp(b_1a_2 c)\), and \(Sp(b_1 a_2 b_2c)\) all have probability zero. Using a for \(a_1a_2\) and b for \(b_1b_2\), we need only show

$$\begin{aligned} \mathbb {P}(Sp(ac))-\mathbb {P}(Sp(bc))>0. \end{aligned}$$

This follows immediately from Proposition 6. Similarly, the cases in which exactly one of the pairs of \(a_1, a_2\) lineages or \(b_1, b_2\) lineages has coalesced in the population immediately ancestral to its MRCA follow from Proposition 7.

We henceforth condition on the event that the \(a_i\) lineages are distinct at v and the \(b_i\) lineages are distinct at \(\rho \). Noticing that all eight probabilities in the expression of interest are 0 if the c lineage coalesces with any lineage below v, we further condition on the c lineage being distinct at v (so there are \(k\ge 4\) lineages in total entering the population above v) and \(\ell \) coalescent events occur between v and \(\rho \).

Then, with \(\mathcal C = \mathcal C (k,\ell )\) denoting the event that these conditioning requirements are met,

$$\begin{aligned} \mathbb {P}(Sp(a_1c)\mid \mathcal C)= & {} r(x_1x_2\mid k,\ell ),\\ \mathbb {P}(Sp(a_1a_2c)\mid \mathcal C)= & {} r(x_1x_2x_3\mid k,\ell ),\\ \mathbb {P}(Sp(a_1b_2c)\mid \mathcal C)= & {} r(x_1x_2y_1\mid k,\ell ),\\ \mathbb {P}(Sp(a_1a_2b_2c)\mid \mathcal C)= & {} r(x_1x_2x_3y_1\mid k,\ell ),\\ \mathbb {P}(Sp(b_1c)\mid \mathcal C)= & {} r(x_1y_1\mid k, \ell ),\\ \mathbb {P}(Sp(b_1a_2 c)\mid \mathcal C)= & {} r(x_1x_2y_1\mid k,\ell ),\\ \mathbb {P}(Sp(b_1b_2c)\mid \mathcal C)= & {} r(x_1y_1y_2\mid k,\ell ),\\ \mathbb {P}(Sp(b_1a_2 b_2c)\mid \mathcal C)= & {} r(x_1x_2y_1y_2\mid k,\ell ).\\ \end{aligned}$$

After substituting these into the expression in (15), Lemma 16 shows that, conditioned on \(\mathcal C\), it is strictly positive for \(k\ge 5\), \(k> \ell \ge 1\), and zero for \(k=4\), \(\ell =0,1,2,3\) and for \(k\ge 5\), \(\ell =0\). Thus, weighting the conditioned expressions by the probabilities of the events \(\mathcal C(k,\ell )\) and summing over all relevant k and \(\ell \), we see that the unconditioned inequality (15) holds: since T has at least 3 taxa, summands with \(k\ge 5\), \(\ell \ge 1\) are present.

Interchanging the \(a_i\) and \(b_i\) in inequality (15) shows the negativity of the expression on the tree \(((T,(b_1,b_2)),(a_1,a_2))\). Since its vanishing on the tree \((T,((a_1,a_2),(b_1,b_2)))\) was shown in Theorem 2, the proof is complete. \(\square \)

Cite this article

Allman, E.S., Degnan, J.H. & Rhodes, J.A. Split Probabilities and Species Tree Inference Under the Multispecies Coalescent Model. Bull Math Biol 80, 64–103 (2018). https://doi.org/10.1007/s11538-017-0363-5

Keywords

  • Multispecies coalescent model
  • Split probability
  • Species tree identifiability

Mathematics Subject Classification

  • 92D15