
A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements

Journal of Mathematical Biology

Abstract

Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene—the gene trees—often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard—but unsatisfactory—assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error—or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of \(1/[f^{2} \sqrt{k}]\). That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. 
As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with \(n \ge 3\) species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.


Notes

  1. Technically, we must allow each edge of S to be a sequence of populations with varying population sizes and mutation rates (corresponding to a path within a larger phylogeny). The proofs of Propositions 1 and 2 are unaffected by this extension.

  2. Actually, the quantiles are estimated from part of the gene set (\(\mathcal {M}_{\mathrm{R}1}\)) to avoid unwanted correlations. The rest of the analysis is done on the other part. See Algorithm 1.

  3. The p-distances in (9) are actually estimated over half the gene length to once again avoid unwanted correlations. That is, we use \(\widehat{p}_{xy}^{i\downarrow }\) defined in Algorithm 1 to compute I. We use the other half to estimate the differences below.

References

  • Allman ES, Degnan JH, Rhodes JA (2011) Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol 62(6):833–862


  • Allman ES, Degnan JH, Rhodes JA (2018) Species tree inference from gene splits by unrooted star methods. IEEE/ACM Trans Comput Biol Bioinform 15(1):337–342

  • Allman ES, Long C, Rhodes JA (2019) Species tree inference from genomic sequences using the log-det distance. SIAM J Appl Algebra Geom 3(1):107–127

  • Boucheron S, Lugosi G, Massart P (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford


  • Bayzid MS, Mirarab S, Boussau B, Warnow T (2015) Weighted statistical binning: enabling statistically consistent genome-scale phylogenetic analyses. PLoS ONE 10(6):e0129183

  • Bayzid MS, Warnow T (2013) Naive binning improves phylogenomic analyses. Bioinformatics 29(18):2277–2284

  • Chifman J, Kubatko L (2014) Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324


  • Chifman J, Kubatko L (2015) Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol 374:35–47


  • DeGiorgio M, Degnan JH (2010) Fast and consistent estimation of species trees using supermatrix rooted triples. Mol Biol Evol 27(3):552–69


  • DeGiorgio M, Degnan JH (2014) Robustness to divergence time underestimation when inferring species trees from estimated gene trees. Syst Biol 63(1):66


  • Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA (2009) Properties of consensus methods for inferring species trees from gene trees. Syst Biol 58(1):35–54


  • Dasarathy G, Nowak R, Roch S (2015) Data requirement for phylogenetic inference from multiple loci: a new distance method. IEEE/ACM Trans Comput Biol Bioinform 12(2):422–432

  • Degnan JH, Rosenberg NA (2006) Discordance of species trees with their most likely gene trees. PLoS Genet 2(5)

  • Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol 24(6):332–340


  • Durrett R (1996) Probability: theory and examples, 2nd edn. Duxbury Press, Belmont, CA


  • Erdős PL, Steel MA, Székely LA, Warnow TJ (1999) A few logs suffice to build (almost) all trees (I). Random Struct Algorithms 14(2):153–184

  • Kubatko LS, Degnan JH (2007) Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol 56(1):17–24


  • Kapli P, Yang Z, Telford MJ (2020) Phylogenetic tree building in the genomic age. Nature Rev Gene

  • Long C, Kubatko L (2017) Identifiability and reconstructibility of species phylogenies under a modified coalescent. Arxiv publication arXiv:1701.06871

  • Long C, Kubatko L (2019) Identifiability and reconstructibility of species phylogenies under a modified coalescent. Bull Math Biol 81(2):408–430


  • Liu L, Yu L, Pearl DK (2010) Maximum tree: a consistent estimator of the species tree. J Math Biol 60(1):95–106


  • Maddison WP (1997) Gene trees in species trees. Syst Biol 46(3):523–536


  • Mirarab S, Bayzid MS, Boussau B, Warnow T (2014) Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science 346(6215)

  • Mirarab S, Bayzid MS, Warnow T (2016) Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Syst Biol 65(3):366

  • Mossel E, Roch S (2010) Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Trans Comput Biol Bioinform 7(1):166–171


  • Mossel E, Roch S (2017) Distance-based species tree estimation under the coalescent: information-theoretic trade-off between number of loci and sequence length. Ann Appl Probab 27(5):2926–2955


  • Matsen FA, Steel M (2007) Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst Biol 56(5):767–775


  • Nakhleh L (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol Evol. doi: https://doi.org/10.1016/j.tree.2013.09.004

  • Rhodes JA (2019) Topological metrizations of trees, and new quartet methods of tree inference. IEEE/ACM Trans Comput Biol Bioinform

  • Rusinko J, McPartlon M (2017) Species tree estimation using neighbor joining. J Theor Biol 414:5–7


  • Roch S, Nute M, Warnow T (2019) Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods. Syst Biol 68(2):281–297

  • Roch S (2013) An analytical comparison of multilocus methods under the multispecies coalescent: the three-taxon case. In: Biocomputing 2013: proceedings of the pacific symposium, Kohala Coast, Hawaii, USA, January 3-7, 2013, pp 297–306

  • Roch S (2018) On the variance of internode distance under the multispecies coalescent. In: Comparative genomics - 16th international conference, RECOMB-CG 2018, Magog-Orford, QC, Canada, October 9-12, 2018, Proceedings, pp 196–206

  • Roos B (2001) Binomial approximation to the poisson binomial distribution: the Krawtchouk expansion. Theory Prob Its Appl 45(2):258–272


  • Roch S, Steel M (2015) Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol 100:56–62


  • Roch S, Warnow T (2015) On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods. Syst Biol 64(4):663–676


  • Rannala B, Yang Z (2003) Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645–1656


  • Scornavacca C, Delsuc F, Galtier N (2020) Phylogenetics in the Genomic Era. No commercial publisher | Authors open access book

  • Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425


  • Shekhar S, Roch S, Mirarab S (2017) Species tree estimation using ASTRAL: how many genes are enough? In: RECOMB’17—proceedings of the 21st annual international conference on research in computational molecular biology, pp 393–395

  • Steel MA, Székely LA (2002) Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J Discrete Math 15(4):562–575

  • Semple C, Steel M (2003) Phylogenetics, vol 22. Mathematics and its Applications Series. Oxford University Press, Oxford


  • Semple C, Steel MA (2003) Phylogenetics, vol 24. Oxford University Press, Oxford


  • Steel M (2009) A basic limitation on inferring phylogenies by pairwise sequence comparisons. J Theor Biol 256(3):467–472


  • Steel M (2016) Phylogeny—discrete and random processes in evolution, vol 89 CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA

  • Warnow T (2017) Computational Phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge University Press, Cambridge



Author information


Corresponding author

Correspondence to Sebastien Roch.

Additional information


G. Dasarathy: Supported by the NSF grant CCF-2048223 (CAREER) and the NIH grant 1R01GM140468-01. E. Mossel: Supported by Simons-NSF Grant DMS-2031883 and by a Vannevar Bush Faculty Fellowship ONR-N00014-20-1-2826. S. Roch: Supported by NSF Grants DMS-1149312 (CAREER), DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892 and DMS-2023239 (TRIPODS Phase II), as well as a Simons Fellowship and a Vilas Associates Award.

A Data requirement tradeoff: detailed proofs


In this section we analyze Algorithm 2, which performs a quantile test on the output of Algorithm 1.


Proof

(Theorem 2) Let \(S_\mathcal {X}\) be the species tree S restricted to \(\mathcal {X} = [3]\) and let \(r'\) denote its root, i.e., the most recent common ancestor of \(\mathcal {X}\).

Define a partition of the set of genes \([m] = \mathcal {M}_{\mathrm{R}1}\sqcup \mathcal {M}_{\mathrm{R}2} \sqcup \mathcal {M}_{\mathrm{Q}1} \sqcup \mathcal {M}_{\mathrm{Q}2}\) such that the following conditions hold:

$$\begin{aligned}&\left| \mathcal {M}_{\mathrm{R}1} \right| = c_{1} \log \varepsilon ^{-1},\ \left| \mathcal {M}_{\mathrm{R}2} \right| = c_{1}\left( 1\vee \frac{\log k}{k f^2} \right) \log \varepsilon ^{-1},\nonumber \\&\left| \mathcal {M}_{\mathrm{Q}1} \right| \ge c_{1} \alpha ^{-1} \log \varepsilon ^{-1},\ \left| \mathcal {M}_{\mathrm{Q}2} \right| \ge c_{1} f^{-2}\left( \alpha + f \right) \log \varepsilon ^{-1}, \end{aligned}$$
(26)

for a constant \(c_1 > 0\) to be determined later. The reconstruction algorithm on \(\mathcal {X}\) has two steps:

Ultrametric reduction: In this step, we invoke Algorithm 1 with sequence data \(\{\xi _{x}^{ij}: x\in \mathcal {X}, i \in [m], j\in [k]\}\). The algorithm outputs new sequences \(\{\xi _{x,N}^{ij}:x\in \mathcal {X}, i \in \mathcal {M}_{\mathrm{Q}1} \sqcup \mathcal {M}_{\mathrm{Q}2}, j\in [k]\}\), and Theorem 1 guarantees that these new sequences have the same distribution as a multispecies coalescent process on \(S_{\mathcal {X}}'\) with species metric \((\hat{\dot{\mu }}_{xy})\), where \(S_{\mathcal {X}}'\) and S have the same rooted topology, and \((\hat{\dot{\mu }}_{xy})\) and \(({\dot{\mu }}_{xy})\) are \(\mathcal {O}(f /\sqrt{\log k})\)-close with probability at least \( 1- \varepsilon \).

Quantile test: Now, we invoke Algorithm 2 with the sequence data \(\{\xi _{x,N}^{ij}: i \in \mathcal {M}_{\mathrm{Q}1} \sqcup \mathcal {M}_{\mathrm{Q}2}, j\in [k]\}\) output by Step 1. By Propositions 5 and 7 in Sect. A.1, it follows that, with probability at least \( 1- \varepsilon \), Algorithm 2 returns the right topology.\(\square \)

A.1 Quantile test: robustness analysis

Robustness of quantile test: Algorithm 1 produces a new sequence dataset \(\left\{ \xi _{x,N}^{ij}: x\in \mathcal {X} \right\} \) that appears close to being distributed according to an ultrametric species phylogeny. The next step is to perform the triplet test of Mossel and Roch (2017), as detailed in Algorithm 2. Roughly speaking, this test is based on comparing an appropriately chosen quantile of the gene metrics. Because we do not have direct access to the latter, we use a sequence-based surrogate, the empirical p-distances \(\widehat{q}^i_{xy}\), for each gene \(i \in \mathcal {M}_{\mathrm{Q}}\) in the output of the reduction, whose expectation is a monotone transformation of the corresponding gene metrics.

The idea of Algorithm 2 is to use these p-distances to define a “similarity measure” \(\widehat{s}_{xy}\) between each pair of leaves \(x,y\in \mathcal {X}\) that reveals the underlying species tree topology on \(\mathcal {X}\). It works as follows. The set of genes \(\mathcal {M}_{\mathrm{Q}}\) is divided into two disjoint subsets \(\mathcal {M}_{\mathrm{Q}1}, \mathcal {M}_{\mathrm{Q}2}\) so that \(\left| \mathcal {M}_{\mathrm{Q}1} \right| , \left| \mathcal {M}_{\mathrm{Q}2} \right| \) satisfy the conditions in (26); the split avoids unwanted correlations. The set \(\mathcal {M}_{\mathrm{Q}1}\) is used to compute the \(c_3 \alpha \)-quantile \(\widehat{q}^{(c_{3} \alpha )}_{xy}\) of \(\{\widehat{q}^i_{xy}\,:\,i\in \mathcal {M}_{\mathrm{Q}1}\}\), where \(c_3 > 0\) is a constant determined in the proofs and \(\alpha = \max \left\{ \frac{\log m}{m}, \sqrt{\frac{\log k}{k}} \right\} .\) Let \(\widehat{q}_*\) denote the maximum among \(\left\{ \widehat{q}^{(c_{3} \alpha )}_{xy}: x,y\in \mathcal {X} \right\} \). We then use the genes in \(\mathcal {M}_{\mathrm{Q}2}\) to define the similarity measure \(\widehat{s}_{xy} = \frac{1}{\left| \mathcal {M}_{\mathrm{Q}2} \right| }\left| \left\{ i\in \mathcal {M}_{\mathrm{Q}2}: \widehat{q}^i_{xy} \le \widehat{q}_*\right\} \right| .\) Whichever pair \(x, y \in \mathcal {X}\) produces the largest value of \(\widehat{s}_{xy}\) is declared the closest, i.e., the output is xy|z where z is the remaining leaf in \(\mathcal {X}\).
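The quantile test described above can be sketched in a few lines of Python. This is an illustration only, not the authors' Algorithm 2: the function names, the dictionary layout of the p-distances, the order-statistic convention for the empirical quantile, and the default value of `c3` (a constant that the proofs only show to exist) are all assumptions made for the example.

```python
import math

def empirical_quantile(values, level):
    """Order statistic of rank ceil(level * n), clamped to the range 1..n.

    This is one common convention for an empirical quantile; it is an
    assumption of this sketch, not something fixed by the text.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(level * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]

def quantile_triplet_test(q1, q2, m, k, c3=1.0):
    """Return (closest pair, similarity per pair).

    q1, q2: dicts mapping a leaf pair (x, y) to the list of empirical
            p-distances of the genes in M_Q1 and M_Q2 respectively.
    m, k:   number of genes and sequence length, used only through alpha.
    c3:     hypothetical stand-in for the constant fixed in the proofs.
    """
    alpha = max(math.log(m) / m, math.sqrt(math.log(k) / k))
    # Step 1: c3*alpha-quantile of each pair on M_Q1, then the maximum q_*.
    q_star = max(empirical_quantile(q1[pair], c3 * alpha) for pair in q1)
    # Step 2: fraction of genes in M_Q2 whose p-distance is at most q_*.
    similarity = {
        pair: sum(q <= q_star for q in q2[pair]) / len(q2[pair]) for pair in q2
    }
    # Step 3: the pair with the largest similarity is declared the cherry.
    return max(similarity, key=similarity.get), similarity

# Toy input (made-up numbers): leaves 1 and 2 are often close, as under 12|3.
toy_q1 = {(1, 2): [0.10, 0.12, 0.30, 0.31],
          (1, 3): [0.30, 0.31, 0.32, 0.33],
          (2, 3): [0.30, 0.31, 0.32, 0.34]}
toy_q2 = {(1, 2): [0.11, 0.13, 0.29, 0.35],
          (1, 3): [0.31, 0.33, 0.34, 0.36],
          (2, 3): [0.32, 0.33, 0.35, 0.36]}
best, similarity = quantile_triplet_test(toy_q1, toy_q2, m=8, k=100)
```

On this toy input the threshold is set by the pairs involving leaf 3, so only the pair (1, 2) has a sizeable fraction of genes below it and the test outputs 12|3.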

As stated in Theorem 1, the output of the ultrametric reduction is almost, but not perfectly, ultrametric. To account for this extra error, we perform a delicate robustness analysis of the quantile-based triplet test. At a high level, the proof follows Mossel and Roch (2017). After (1) controlling the deviation of the quantiles, we (2) establish that the test works in expectation and then (3) finish off with concentration inequalities. All these steps must be updated to account for the error introduced in the reduction step. Step (2) is particularly involved and requires a delicate analysis of the CDF of a mixture of binomials.

For the rest of the proof, we assume that the rooted topology of \(S_\mathcal {X}\) is 12|3. The other cases are similar.

Control of empirical quantiles: In the following proposition, we show that the empirical quantiles are well-behaved, and provide a good estimate of the \(\alpha \)-quantile of the underlying MSC random variables. We define the random variables \(q_{xy}^i\) and \(r_{xy}^i\) associated to a gene tree i:

$$\begin{aligned} q_{xy}^i = p (\delta _{xy}^i + \widehat{\Delta }_{1x} + \widehat{\Delta }_{1y}),\qquad r_{xy}^i = p (\delta _{xy}^i + {\Delta }_{1x} + {\Delta }_{1y}). \end{aligned}$$

Also, we need the 0th quantile of these random variables, specifically

$$\begin{aligned} q_{xy}^{(0)} =\! p\left( \delta _{xy}^{(0)} \!+ \widehat{\Delta }_{1x} \!+ \widehat{\Delta }_{1y} \right) \!=\! p (\mu _{xy} + \widehat{\Delta }_{1x} + \widehat{\Delta }_{1y}), \quad r_{xy}^{(0)} = p (\mu _{xy} + {\Delta }_{1x} \!+\! {\Delta }_{1y}). \end{aligned}$$

We first show that \(\widehat{q}_*= \max \{ \widehat{q}^{(c_{3} \alpha )}_{xy}: x,y\in \mathcal {X} \}\) is close to \(\max \{{r}_{xy}^{(0)} : x,y\in \mathcal {X} \}\).

Proposition 5

(Quantile behaviour). Let \(\alpha = \max \left\{ m^{-1}\log m, k^{-0.5}\sqrt{\log k} \right\} \) and let \(\phi \) be as in Theorem 2. Then, there are \(c_{3}, c_{4}, c_{5} >0\) depending on the parameters \(\mu _U\), \(g'\) and g such that, for each pair of leaves \(x,y \in \mathcal {X}\), the \(c_{3}\alpha \)-quantile satisfies the following with probability at least \(1 - 6\exp \left( -c_{4} \left| \mathcal {M}_{\mathrm{Q}1} \right| \alpha \right) \), provided Theorem 1 holds,

$$\begin{aligned} \widehat{q}_{xy}^{(c_{3}\alpha )}\in \left[ {q}_{xy}^{(0)}, {q}_{xy}^{(0)}+ c_{5} \alpha \right] \subset \left[ {r}_{xy}^{(0)}-c_{5}\phi , {r}_{xy}^{(0)} +c_{5}\phi + c_{5}\alpha \right] . \end{aligned}$$

Proof

Proposition 4 guarantees that \(\widehat{\Delta }_{xy}\) and \(\Delta _{xy}\) are close. Therefore, using the fact that \(p(\cdot )\) is a Lipschitz function, we know that there exists a constant \(c_{5}'>0\) such that \(\left| r_{xy}^{(0)} - q_{xy}^{(0)} \right| \le c_{5}' \phi .\) The second containment in the statement of the proposition follows from this after adjusting the constant. The first part is proved as in (Mossel and Roch 2017, Claim 12), except for a small change: we use Bernstein’s inequality (see e.g. Boucheron et al. 2013) in place of Chebyshev’s inequality to obtain an exponential dependence on \(\left| \mathcal {M}_{\mathrm{Q}1} \right| \) (ultimately resulting in a logarithmic dependence in \(\varepsilon \) in the condition on \(\left| \mathcal {M}_{\mathrm{Q}1} \right| \) in (26)).\(\square \)
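To see why swapping Chebyshev for Bernstein changes the dependence on \(\varepsilon \) from polynomial to logarithmic, the following sketch compares the sample sizes each inequality requires for an empirical mean of independent variables to deviate by at least t with probability at most \(\varepsilon \). It uses the standard textbook forms of the two inequalities, \(\sigma ^2/(n t^2)\) for Chebyshev and \(\exp (-n t^2/[2(\sigma ^2 + b t/3)])\) for Bernstein with variables centered and bounded by b. The function names and the numerical values are hypothetical and are not taken from the paper.

```python
import math

def chebyshev_sample_size(variance, t, eps):
    """Smallest n with variance / (n * t^2) <= eps."""
    return math.ceil(variance / (t * t * eps))

def bernstein_sample_size(variance, t, eps, bound=1.0):
    """Smallest n with exp(-n t^2 / (2 (variance + bound * t / 3))) <= eps."""
    return math.ceil(2.0 * (variance + bound * t / 3.0) * math.log(1.0 / eps) / (t * t))

# Illustrative setting: indicators of an event of probability alpha,
# with a deviation t of the same order as alpha.
alpha = 0.01
variance = alpha * (1.0 - alpha)
t = alpha / 2.0

n_cheb = chebyshev_sample_size(variance, t, eps=1e-6)
n_bern = bernstein_sample_size(variance, t, eps=1e-6)
n_cheb_small = chebyshev_sample_size(variance, t, eps=1e-12)
n_bern_small = bernstein_sample_size(variance, t, eps=1e-12)
```

In this regime the Bernstein sample size scales like \(\alpha ^{-1}\log \varepsilon ^{-1}\), which is the form of the condition on \(\left| \mathcal {M}_{\mathrm{Q}1} \right| \) in (26), whereas the Chebyshev sample size scales like \(\alpha ^{-1}\varepsilon ^{-1}\).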

Expected version of quantile test: We will use \(\bar{\mathbb {P}}\), \(\bar{\mathbb {E}}\), and \(\overline{\mathrm {Var}}\) to denote probabilities, expectations and variances conditioned on the event that Theorem 1 and Proposition 5 hold. We use the genes in \(\mathcal {M}_{\mathrm{Q}2}\), which are not affected by the conditioning under \(\bar{\mathbb {P}}\), to define a similarity measure among pairs of leaves in \(\mathcal {X}\), \(\widehat{s}_{xy}\), as above. We next show that this similarity measure has the right behavior in expectation. That is, defining \(s_{xy} := \bar{\mathbb {E}}\left[ \widehat{s}_{xy} \right] \), we show that \(s_{12} > \max \{s_{13}, s_{23}\}\) (proof in Sect. A.3).

Proposition 6

(Expected version of quantile test). For any \(C_2 >0\), there exist constants \(c_{6}, c_{7}>0\) such that \(s_{12} - \max \{s_{13}, s_{23}\} \ge c_{6} p(3f/4) > 0\) provided

$$\begin{aligned} m \ge c_{7}\frac{1}{p(3f/4)} \log \left( \frac{1}{p(3f/4)} \right) ,\qquad k \ge c_{7}\left( \frac{\sqrt{\log k}}{p(3f/4)} \right) ^{1/C_2},\qquad \phi \in \mathcal {O}(p(3f/4)/\sqrt{\log k}). \end{aligned}$$

Sample version of quantile test: We finish the proof by showing that the empirical similarity measures are consistent with the underlying species tree with high probability (proof in Sect. A.4).

Proposition 7

(Sample version of quantile test). There is a \(c_{8}>0\) such that, provided Proposition 6 holds, the \(\bar{\mathbb {P}}\)-probability that Algorithm 2 fails is less than \(4 \exp \left( - \frac{\left| \mathcal {M}_{\mathrm{Q}2} \right| p(3f/4)^2}{c_{8} \left( p(3f/4) + \alpha \right) } \right) .\)
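The bound in Proposition 7 can be inverted to read off how many genes in \(\mathcal {M}_{\mathrm{Q}2}\) suffice for a target failure probability. The sketch below does this arithmetic. The value of `c8` is a hypothetical placeholder (the proposition only asserts that such a constant exists), and the function names and numerical inputs are likewise assumptions made for illustration.

```python
import math

def jc_p(x):
    """p(x) = 3/4 (1 - exp(-4x/3)), the expected p-distance used in the text."""
    return 0.75 * (1.0 - math.exp(-4.0 * x / 3.0))

def failure_bound(m_q2, f, alpha, c8):
    """Right-hand side of Proposition 7: 4 exp(-|M_Q2| p^2 / (c8 (p + alpha)))."""
    p = jc_p(0.75 * f)
    return 4.0 * math.exp(-m_q2 * p * p / (c8 * (p + alpha)))

def genes_needed(f, alpha, eps, c8):
    """Smallest |M_Q2| for which the bound above is at most eps."""
    p = jc_p(0.75 * f)
    return math.ceil(c8 * (p + alpha) * math.log(4.0 / eps) / (p * p))

# Hypothetical inputs: branch length f, quantile level alpha, target eps.
c8 = 2.0  # placeholder constant, not a value from the paper
n_long = genes_needed(f=0.10, alpha=0.05, eps=0.01, c8=c8)
n_short = genes_needed(f=0.05, alpha=0.05, eps=0.01, c8=c8)
bound_at_n_long = failure_bound(n_long, f=0.10, alpha=0.05, c8=c8)
```

Since \(p(3f/4)\) is of order f for small f, the resulting requirement has the same shape as the condition \(\left| \mathcal {M}_{\mathrm{Q}2} \right| \ge c_{1} f^{-2}\left( \alpha + f \right) \log \varepsilon ^{-1}\) in (26), and shorter internal branches need more genes.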

A.2 Auxiliary results

We will need a few technical auxiliary claims.

Quantiles: We will need an observation about the quantiles of gene tree pairwise distances. For a pair of leaves \(a,b \in L\), \(\delta _{ab}\) is the distance between a and b on the random gene tree under the MSC. Notice that, by definition, \(\delta _{ab}\ge \mu _{ab}\). We will let \(f_{ab}\left( \cdot \right) \) and \(F_{ab}\left( \cdot \right) \) denote respectively the density and the cumulative distribution function of the random variable \(Z_{ab}\triangleq \frac{\delta _{ab}-\mu _{ab}}{2}\). Because of the memoryless property of the exponential, it is natural to think of the distribution of \(Z_{ab}\) as a mixture of distributions whose supports are disjoint, corresponding to the different branches that the lineages go through before coalescing. We state this more generally as follows. Suppose that \(U_1,U_2,\ldots , U_r\) are subsets of \(\mathbb {R}\) that satisfy \(\sup (U_i)\le \inf (U_{i+1})\) for \(i =1,2,\ldots , r-1\). Suppose that f is a probability density function such that \(f(x) = \sum _{i=1}^r \omega _i f_i(x)\), where \(\omega _1,\omega _2,\ldots ,\omega _r\in (0,1)\) are such that \(\sum _{i=1}^r \omega _i = 1\), and the density \(f_i\) is supported on \(U_i\) for \(i=1,\ldots ,r\). Then, the quantile function of f is given as follows

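$$\begin{aligned} Q(\alpha ) = F_i^{-1}\left( \frac{\alpha - \sum _{j=0}^{i-1}\omega _j}{\omega _i} \right) \quad \text {if } \alpha \in \left[ \sum _{j=0}^{i-1}\omega _j, \sum _{j=0}^{i}\omega _j \right) , \end{aligned}$$

where \(F_i\) denotes the cumulative distribution function of \(f_i\) and we set \(\omega _0 = 0\). Specializing to \(f_{ab}\) under the MSC, it follows that there exists a finite sequence of constants \(\mu _1,\ldots ,\mu _r \in [\mu _L,\mu _U]\) and \(h_0,\ldots ,h_{r-1} \in [f',g'+ng]\) such that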

$$\begin{aligned} \omega _i = e^{-\sum _{j=1}^{i-1}\mu _{j}^{-1}(h_j - h_{j-1})} - e^{-\sum _{j=1}^{i}\mu _{j}^{-1}(h_j - h_{j-1})},\qquad f_i(x) = \frac{\mu _i^{-1} e^{-\mu _i^{-1}(x - h_{i-1})}}{1 - e^{-\mu _i^{-1}(h_i - h_{i-1})}}. \end{aligned}$$

Cancellations lead to \(f_{ab}(x) = \sum _{i=1}^r e^{-\sum _{j=1}^{i-1}\mu _{j}^{-1}(h_j - h_{j-1})}\mu _i^{-1} e^{-\mu _i^{-1} (x - h_{i-1})}\). This formula implies that the density is bounded between positive constants. We will need the following implication. For any \(\alpha \in [0,1)\), we let \(\delta _{ab}^{(\alpha )}\) and \(p_{ab}^{(\alpha )}\) denote the \(\alpha \)-quantile of the \(\delta _{ab}\) and \(p_{ab}\) respectively. Since by definition \(p_{ab} = p(\delta _{ab})\), we have that \(p^{(\alpha )}_{ab} = p(\delta _{ab}^{(\alpha )})\), where \(p(x) = \frac{3}{4}\left( 1 - e^{-4x/3} \right) \). Then, for any \(0< \beta '< \beta < 1\), there are constants \(0< c'< c'' <+\infty \) (depending on \(\mu _L, \mu _U, g, g', n, \beta ', \beta \)) such that for any \(\xi \in (0,1-\beta ')\), we have

$$\begin{aligned} c' \xi \le \delta _{ab}^{(\beta +\xi )} - \delta _{ab}^{(\beta )} \le c'' \xi ,\qquad c' \xi \le p_{ab}^{(\beta +\xi )} - p_{ab}^{(\beta )} \le c'' \xi . \end{aligned}$$
(27)
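As a sanity check on the mixture representation, the sketch below evaluates the weights \(\omega _i\) for a piecewise-exponential law, computes its quantile function piece by piece, and verifies the result against the CDF. The breakpoints and scales are made-up numbers, the list `hs` is assumed to carry a final entry \(h_r = +\infty \), and all function names are hypothetical.

```python
import math

def piece_weights(mus, hs):
    """omega_i = exp(-sum_{j<i} (h_j - h_{j-1}) / mu_j) * (1 - exp(-(h_i - h_{i-1}) / mu_i)).

    mus = [mu_1, ..., mu_r]; hs = [h_0, ..., h_r] with h_r = math.inf.
    """
    weights, log_survival = [], 0.0
    for i, mu in enumerate(mus):
        width = hs[i + 1] - hs[i]
        # exp(-inf) evaluates to 0.0, so the last piece keeps all remaining mass.
        weights.append(math.exp(log_survival) * (1.0 - math.exp(-width / mu)))
        log_survival -= width / mu
    return weights

def mixture_cdf(x, mus, hs):
    """CDF of the piecewise-exponential law with rate 1 / mu_i on [h_{i-1}, h_i)."""
    if x <= hs[0]:
        return 0.0
    log_survival = 0.0
    for i, mu in enumerate(mus):
        if x < hs[i + 1]:
            return 1.0 - math.exp(log_survival - (x - hs[i]) / mu)
        log_survival -= (hs[i + 1] - hs[i]) / mu
    return 1.0

def mixture_quantile(level, mus, hs):
    """Locate the piece containing `level`, then invert that piece's truncated CDF."""
    weights = piece_weights(mus, hs)
    cumulative = 0.0
    for i, (mu, w) in enumerate(zip(mus, weights)):
        if level < cumulative + w or i == len(mus) - 1:
            u = (level - cumulative) / w  # level rescaled within piece i
            mass = 1.0 - math.exp(-(hs[i + 1] - hs[i]) / mu)
            return hs[i] - mu * math.log(1.0 - u * mass)
        cumulative += w

# Made-up parameters: three pieces, the last one unbounded.
mus = [1.0, 0.5, 2.0]
hs = [0.0, 0.3, 1.0, math.inf]
weights = piece_weights(mus, hs)
levels = (0.1, 0.5, 0.9)
round_trip = [mixture_cdf(mixture_quantile(a, mus, hs), mus, hs) for a in levels]
```

The weights sum to one and the round trip recovers each level. Because \(p(\cdot )\) is increasing, applying it to a quantile of \(\delta _{ab}\) gives the corresponding quantile of \(p_{ab}\), as used in the text.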

CDF lemma: Finally, we restate a key bound about the cumulative distribution function from Mossel and Roch (2017).

Lemma 6

(CDF behavior; see Claim 6 in Mossel and Roch (2017)). There exists a constant \(c_{3}'>0\) depending on the parameters \(\mu _U\), \(g'\) and g such that \(\mathbb {P}\left[ \widehat{q}_{xy} \le q^{(0)}_{xy} \right] \le \frac{c_{3}'}{\sqrt{k}}\le c_{3}' \alpha .\)

A.3 Proof of Proposition 6

Proof

(Proposition 6) In this proof, we are concerned with behavior in expectation. Hence, we fix an \(i\in \mathcal {M}_{\mathrm{Q}2}\) and drop the i in \(\widehat{q}^i_{xy}\), etc. Let \(\mathcal {E}_{12|3}\) be the event that there is a coalescence in the internal branch of the species phylogeny and observe that \(s_{12}\) can be decomposed as follows

$$\begin{aligned} s_{12}&= \bar{\mathbb {E}}[\widehat{s}_{12}] = \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*\right] = \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}\right] \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}\right] \nonumber \\&\quad + \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}^c \right] \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}^c\right] \end{aligned}$$
(28)

In the proof in Mossel and Roch (2017), instead of \(\widehat{q}_{12}\) one deals with \(\widehat{r}_{12}\), and it follows from the symmetries of the MSC that \(\bar{\mathbb {P}}\left[ \widehat{r}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}^c\right] = \bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}_*\right] \). This turns out to suffice to establish the expected version of the quantile test in Mossel and Roch (2017). In our setting, we must control quantitatively the difference between these two probabilities due to the slack added by the reduction step. The following lemma is proved below.\(\square \)

Lemma 7

(Closeness to symmetry). For \(\widehat{q}_*\) as defined in Algorithm 2 and any constant \(C_2>0\), there exist constants \(c_{7}', c_{7}'' >0\) such that

$$\begin{aligned} \left| \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le \widehat{q}_*\right] - \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}^c\right] \right| \le \phi _2 := c_{7}' \phi \sqrt{\log k} \end{aligned}$$

provided

$$\begin{aligned} m \ge c_{7}''\frac{1}{\phi \sqrt{\log k}} \log \left( \frac{1}{\phi \sqrt{\log k}} \right) , \qquad k \ge \left( \frac{1}{\phi } \right) ^{1/C_2}. \end{aligned}$$

Using the above lemma in (28), we can now bound \(s_{12}\) from below as follows

$$\begin{aligned} s_{12}\ge & {} \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}\right] \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}\right] + \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}^c \right] \bar{\mathbb {P}}\left[ {\widehat{q}_{13}} \le \widehat{q}_*\right] - \phi _2\nonumber \\= & {} \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}\right] \left( \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}\right] - s_{13}\right) + s_{13} - \phi _2. \end{aligned}$$
(29)

This implies that

$$\begin{aligned} s_{12} - s_{13} \ge \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}\right] \left( \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}\right] - s_{13}\right) - \phi _2. \end{aligned}$$

The expected version of the quantile test succeeds provided the latter quantity is bounded from below by 0. We establish something a bit stronger, which will be useful in the analysis of the sample version of the quantile test. The following lemma is proved below.

Lemma 8

(Tails) There are \(c_{6}', c_{6}'' > 0\) such that \(\bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}\right] \ge c_{6}'\) and \(s_{13} = \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le \widehat{q}_*\right] \le c_{6}'' \alpha \).

The first inequality captures the intuition that, conditioned on coalescence in the internal branch of the species phylogeny, the probability of \(\widehat{q}_{12}\) being small is high. The second inequality captures the intuition that since \(\widehat{q}_*\) behaves roughly like \(q_{13}^{(\alpha )} = p(\delta _{13}^{(\alpha )} + \widehat{\Delta }_{13})\), the event that \(\widehat{q}_{13}\le \widehat{q}_*\) is dominated by the event that the underlying MSC random variable satisfies the same inequality, with the deviations of the JC contribution on top of this being of order \(k^{-0.5}\).

Notice that, if we use Lemma 8 in (29), there is a constant \(c_{6}>0\) such that \(s_{12} - s_{13} \ge c_{6}\, \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] \) provided \(\phi _2 \le c_{6} p(3f/4)\) and \(c_6\) is small enough, where we used that \(\bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] \) is lower bounded by \(p(3f/4)\). This, along with a similar argument for \(s_{23}\), concludes the proof of Proposition 6.

Proof

(Lemma 7) First observe that \(\bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}_*| \mathcal {E}_{12|3}^c \right] = \bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}_*\right] \) since \(\widehat{r}_{13}\) is unaffected by whether coalescence occurs in the internal branch of the species phylogeny. We prove Lemma 7 by arguing that \(\bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}^c \right] \) is close to \(\bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}_*| \mathcal {E}_{12|3}^c \right] \), and that \(\bar{\mathbb {P}}\left[ \widehat{q}_{13} \le \widehat{q}_*\right] \) is close to \(\bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}_*\right] \).

We do this by analyzing the effect on the cumulative distribution function (CDF) of a perturbation of the mean. In our case, the distribution of interest is a mixture of binomials. Indeed, notice that in the latter case, conditioned on the value of \(\delta _{13}\) and \(q_{13} = p (\delta _{13} + \widehat{\Delta }_{13})\), we are seeking to compare two binomial random variables \(k \,\widehat{q}_{13} \sim \mathrm{Bin}(k, {q_{13}})\) and \(k \,\widehat{r}_{13} \sim \mathrm{Bin}(k,{q_{13}} + \beta )\), for some small \(\beta \) (and similarly for the other inequality). The desired result will be implied by the following bound proved at the end of this subsection.\(\square \)

Lemma 9

(Mixture of binomials: CDF perturbation). Let \(\delta _{xy}\) be the distance between x and y on a random gene tree drawn according to the MSC, and let \(p_{xy} = p(\delta _{xy})\) denote the corresponding expected p-distance. Suppose that we have two binomial random variables \(J_1\sim \mathrm{Bin}(k, p_{xy})\) and \(J_2\sim \mathrm{Bin}(k,p_{xy} + \beta )\), for some fixed \(\beta \in (0, 1-p_{xy})\) with \(\beta = \Theta (\phi )\), and that we are given constants \(c_{14},\gamma >0\) such that \(\gamma < p_{xy}^{(c_{14} [\alpha \vee \phi ])}\). Then, for any constant \(C_2>0\) there exist constants \(c_{13},c_{13}' >0\) such that \(\left| \bar{\mathbb {P}}\left[ J_1 \le k \gamma \right] - \bar{\mathbb {P}}\left[ J_2 \le k \gamma \right] \right| \le c_{13}' \beta \sqrt{\log k}\) holds provided

$$\begin{aligned} m \ge c_{13}\frac{1}{\beta \sqrt{\log k}} \log \left( \frac{1}{\beta \sqrt{\log k}} \right) ,\qquad k \ge \left( \frac{1}{\beta } \right) ^{1/C_2}. \end{aligned}$$

Observe that, although \(J_1\) and \(J_2\) above do not depend on m, the quantity \(\gamma \)—through \(\alpha \)—does.
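The role of the mixture in Lemma 9 can be illustrated numerically. For a single fixed success probability equal to the threshold, shifting it by \(\beta \) can move the binomial CDF by much more than \(\beta \); averaging over a random success probability with a bounded density smooths this out. The sketch below compares the two under a deliberately simplified model that is an assumption of this example and not the setting of the lemma: a single ancestral population, \(\delta = \mu + 2Z\) with Z a standard exponential, and the threshold \(\gamma \) set to \(p(\mu )\). All names and numerical values are hypothetical.

```python
import math

def jc_p(x):
    """p(x) = 3/4 (1 - exp(-4x/3)), the expected p-distance used in the text."""
    return 0.75 * (1.0 - math.exp(-4.0 * x / 3.0))

def binom_cdf(t, k, p):
    """P[Bin(k, p) <= t] for 0 < p < 1, summed term by term in log-space."""
    if t < 0:
        return 0.0
    if t >= k:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(t + 1):
        log_term = (math.lgamma(k + 1) - math.lgamma(j + 1) - math.lgamma(k - j + 1)
                    + j * log_p + (k - j) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

def single_gap(k, gamma, beta, p):
    """P[J1 <= k gamma] - P[J2 <= k gamma] for fixed p, J1 ~ Bin(k, p), J2 ~ Bin(k, p + beta)."""
    t = math.floor(k * gamma)
    return binom_cdf(t, k, p) - binom_cdf(t, k, p + beta)

def mixture_gap(k, gamma, beta, mu, grid=2000):
    """Same gap averaged over p = jc_p(mu + 2 Z), Z ~ Exp(1), by a midpoint rule in the CDF variable."""
    total = 0.0
    for i in range(grid):
        z = -math.log(1.0 - (i + 0.5) / grid)  # exponential quantile at a midpoint
        total += single_gap(k, gamma, beta, jc_p(mu + 2.0 * z))
    return total / grid

k, beta, mu = 400, 0.01, 0.1
gamma = jc_p(mu)
worst_case = single_gap(k, gamma, beta, gamma)  # fixed p sitting exactly at the threshold
averaged = mixture_gap(k, gamma, beta, mu)      # p drawn from the simplified mixture
```

In this toy setting the averaged gap stays below \(\beta \) while the fixed-p gap is far larger, which is the qualitative behavior that the bound \(c_{13}' \beta \sqrt{\log k}\) captures in the general case.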

Notice that this result implies that there exists a constant \(c_{7}'>0\) such that

$$\begin{aligned}&\left| \bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}^*\right] - \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le \widehat{q}^*\right] \right| \le \frac{c_{7}'}{2}\phi \sqrt{\log k},\qquad \nonumber \\&\left| \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}^*| \mathcal {E}_{12|3}^c \right] - \bar{\mathbb {P}}\left[ \widehat{r}_{13} \le \widehat{q}^*| \mathcal {E}_{12|3}^c \right] \right| \le \frac{c_{7}'}{2}\phi \sqrt{\log k}\nonumber . \end{aligned}$$

To see why this is true, first observe that \(\widehat{q}^*\le q^{(0)}_{13} + c_{5} \alpha \le q_{13}^{(c_{5}' \alpha )}\), which follows from Proposition 5 and (27). The bound on \(\widehat{q}^*\) also holds in terms of r-quantiles up to an additive term \(O(\phi )\) per Proposition 5. So we can take \(\gamma = \widehat{q}^*\) in Lemma 9. Using this, and taking \(\beta \) to be \(\Theta (\phi )\), we get the above two inequalities. Note that, for both inequalities, we apply Lemma 9 to \(\widehat{r}_{13}\), \(\widehat{q}_{13}\) or \(\widehat{q}_{12}\) so as to keep \(\beta \) positive as required.

Proof

(Lemma 8) We start with the second inequality in the statement of the lemma. Notice that, from Proposition 5 (on which \(\bar{\mathbb {P}}\) is conditioning) and (27), we know that \(\widehat{q}^*\le q^{(0)}_{13} + c_{5} \alpha \le q_{13}^{(c_{5}' \alpha )}\).

Hence,

$$\begin{aligned} \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le \widehat{q}_*\right]&\le \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le q_{13}^{(c_{5}' \alpha )} \right] \\&= \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le q_{13}^{(c_{5}' \alpha )} | {q}_{13} \le q_{13}^{(c_{5}' \alpha )}\right] \bar{\mathbb {P}}\left[ {q}_{13} \le q_{13}^{(c_{5}' \alpha )}\right] \\&\qquad + \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le q_{13}^{(c_{5}' \alpha )} | {q}_{13}> q_{13}^{(c_{5}' \alpha )}\right] \bar{\mathbb {P}}\left[ {q}_{13}> q_{13}^{(c_{5}' \alpha )}\right] \\&\le c_{5}' \alpha + \bar{\mathbb {P}}\left[ \widehat{q}_{13} \le q_{13}^{(c_{5}' \alpha )} | {q}_{13} > q_{13}^{(c_{5}' \alpha )}\right] . \end{aligned}$$

The conditional probability on the last line is bounded above by \(c_3' \alpha \) by Lemma 6 and the memoryless property of the exponential. This implies the second inequality of the lemma.

To derive the first one, we make a few observations:

  1. from Proposition 5, \(\widehat{q}_*\ge \widehat{q}_{13}^{(c_{3} \alpha )}\ge {r}_{13}^{(0)}-c_{5}\phi \);

  2. by definition \(q_{12} = p(\delta _{12} + \widehat{\Delta }_{12})\) and, conditioned on \(\delta _{12}\), \(k \,\widehat{q}_{12}\) is distributed as Bin\((k,q_{12})\);

  3. given that \(q_{12} = p(\delta _{12} + \widehat{\Delta }_{12})\) and \(r_{12} = p(\delta _{12} + \Delta _{12})\) and by Proposition 4, it follows that there is \(c_{16} > 0\) such that the event \(\{r_{12} \le {r}_{13}^{(0)} - c_{16} \phi \}\) implies the event \(\{q_{12} \le {r}_{13}^{(0)}-c_{5}\phi \}\) under \( \bar{\mathbb {P}}\);

  4. the event \(\mathcal {E}_{12|3}\) is equivalent to the condition that \(r_{12} = p(\delta _{12} + \Delta _{12}) \le p(\mu _{13} + \Delta _{13}) = r_{13}^{(0)}\).

We use these facts to bound the desired quantity \(\bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*| \mathcal {E}_{12|3}\right] \) from below by

$$\begin{aligned}&{\mathop {\ge }\limits ^{(a)}} \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le {r}_{13}^{(0)}-c_{5}\phi | \mathcal {E}_{12|3}\right] \nonumber \\&{\mathop {\ge }\limits ^{(b)}} \bar{\mathbb {P}}\left[ \mathrm{Bin}(k, q_{12}) \le {k [{r}_{13}^{(0)}-c_{5}\phi ]} | q_{12} \le {r}_{13}^{(0)}-c_{5}\phi , \mathcal {E}_{12|3} \right] \ \bar{\mathbb {P}}\left[ q_{12} \le {r}_{13}^{(0)}-c_{5}\phi | \mathcal {E}_{12|3}\right] \nonumber \\&{\mathop {\ge }\limits ^{(c)}} \bar{\mathbb {P}}\left[ \mathrm{Bin}(k, q_{12}) \le {k[{r}_{13}^{(0)}-c_{5}\phi ]} | q_{12} \le {r}_{13}^{(0)}-c_{5}\phi , \mathcal {E}_{12|3} \right] \ \bar{\mathbb {P}}\left[ r_{12} \le {r}_{13}^{(0)} - c_{16} \phi | r_{12} \le r_{13}^{(0)} \right] \end{aligned}$$

where (a) follows from Observation 1 above, (b) follows after conditioning on the event that \(q_{12} \le {r}_{13}^{(0)}-c_{5}\phi \), and (c) follows from Observations 3 and 4. Finally, the last line is bounded below by some constant \(C'_j > 0\): the Berry-Esséen theorem (see e.g. Durrett 1996) gives a constant lower bound on the first factor, while (27), together with the assumption that \(\phi \ll p(3f/4)\) and the fact that the probability of \(\mathcal {E}_{12|3}\) is at least p(3f/4), gives a constant lower bound on the second factor.\(\square \)
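As a rough numerical illustration of the first factor (not part of the original argument), the sketch below computes the exact probability that a binomial count falls at or below a threshold equal to its mean. The helper name `binom_cdf` and all parameter values are hypothetical choices made for illustration; `q` loosely plays the role of \(q_{12}\) and the threshold that of \(k[{r}_{13}^{(0)}-c_{5}\phi ]\) in the extreme case where the two coincide.

```python
from math import exp, lgamma, log, log1p


def binom_cdf(k: int, p: float, j: int) -> float:
    """Exact P[Bin(k, p) <= j] for 0 < p < 1; pmf terms are computed in log-space."""
    if j < 0:
        return 0.0
    total = 0.0
    for i in range(min(j, k) + 1):
        log_pmf = (lgamma(k + 1) - lgamma(i + 1) - lgamma(k - i + 1)
                   + i * log(p) + (k - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


# Illustrative values only (assumptions, not taken from the paper):
# the threshold count equals the mean k*q, the hardest case allowed by q <= threshold/k.
for k, q, threshold in [(100, 0.3, 30), (400, 0.25, 100), (1000, 0.1, 100)]:
    print(k, q, round(binom_cdf(k, q, threshold), 4))
```

In each case the probability stays bounded away from zero as \(k\) grows, which is the qualitative behaviour the Berry-Esséen step relies on.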

Proof

(Lemma 9) We need an auxiliary lemma which characterizes the difference between two binomial distributions in terms of the difference of the underlying probabilities. This follows from Roos (2001).\(\square \)

Lemma 10

(Binomial: CDF perturbation). For \(J_1\) and \(J_2\) as above and any \(\gamma \in (0,1)\), we have

$$\begin{aligned} \left| \bar{\mathbb {P}}\left[ J_1 \le k \gamma | p_{xy}\right] - \bar{\mathbb {P}}\left[ J_2 \le k \gamma | p_{xy}\right] \right| \le \frac{2\sqrt{2 e} \sqrt{k+2}}{\sqrt{ (p_{xy}+\beta ) (1 -p_{xy}-\beta )}} \beta . \end{aligned}$$
(30)

Proof

It holds that, if \(\theta (p_{xy}+\beta ) = \frac{\beta ^2 (k+2)}{2 (p_{xy}+\beta ) (1 - p_{xy}-\beta )} < 1\),

$$\begin{aligned} \left| \bar{\mathbb {P}}\left[ J_1 \le k \gamma | p_{xy}\right] - \bar{\mathbb {P}}\left[ J_2 \le k \gamma | p_{xy}\right] \right|&\le \left\| \mathrm{Bin}(k,p_{xy}) - \mathrm{Bin}(k,p_{xy}+\beta ) \right\| _1\\&\le \frac{\sqrt{e \theta (p_{xy}+\beta )}}{\left( 1 - \sqrt{\theta (p_{xy}+\beta )} \right) ^2}, \end{aligned}$$

where the above inequality comes from (Roos 2001, (15)), by setting \(s = 0\) there, and choosing the Poisson-Binomial distribution to simply be the binomial distribution Bin\((k,p_{xy})\). If \( \beta \le \sqrt{\frac{(p_{xy}+\beta ) (1 -p_{xy}-\beta )}{2 (k+2)}}\), then \(1 - \sqrt{\theta (p_{xy}+\beta )} \ge 0.5\). In this case, we have (30). On the other hand, if \(\beta > \sqrt{\frac{(p_{xy}+\beta ) (1 - p_{xy}-\beta )}{2 (k+2)}}\), since the difference between two probabilities is upper bounded by 2, the upper bound (30) holds trivially.\(\square \)
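For readers who want the constant in (30) spelled out, the following worked step is added as a check; it writes \(\theta \) for \(\theta (p_{xy}+\beta )\) and \(A\) for \((p_{xy}+\beta )(1-p_{xy}-\beta )\), a shorthand introduced only here. In the first case, \(\beta \le \sqrt{A/(2(k+2))}\) gives \(\theta \le 1/4\), hence \((1-\sqrt{\theta })^2 \ge 1/4\) and

$$\begin{aligned} \frac{\sqrt{e \theta }}{\left( 1 - \sqrt{\theta } \right) ^2} \le 4 \sqrt{e} \sqrt{\theta } = \frac{4 \sqrt{e}\, \beta \sqrt{k+2}}{\sqrt{2 A}} = \frac{2\sqrt{2 e} \sqrt{k+2}}{\sqrt{A}} \beta . \end{aligned}$$

In the second case, \(\beta > \sqrt{A/(2(k+2))}\) gives

$$\begin{aligned} \frac{2\sqrt{2 e} \sqrt{k+2}}{\sqrt{A}} \beta > \frac{2\sqrt{2 e} \sqrt{k+2}}{\sqrt{2 (k+2)}} = 2\sqrt{e} > 2, \end{aligned}$$

so the right-hand side of (30) exceeds the trivial bound.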

We cannot directly apply Lemma 10 to prove Lemma 9 since the \(\sqrt{k+2}\) factor in the upper bound is too loose for our purposes. Instead, we employ a more careful argument that splits the domain of \(p_{xy}\).
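To make the looseness concrete, here is a minimal numerical sketch (an illustration, not the authors' code) that compares the exact CDF gap between \(\mathrm{Bin}(k,p_{xy})\) and \(\mathrm{Bin}(k,p_{xy}+\beta )\) with the right-hand side of (30) for a few values of \(k\). The function names and the values of `P`, `BETA` and `GAMMA` are hypothetical.

```python
from math import e, exp, floor, lgamma, log, log1p, sqrt


def binom_cdf(k: int, p: float, j: int) -> float:
    """Exact P[Bin(k, p) <= j] for 0 < p < 1; pmf terms are computed in log-space."""
    if j < 0:
        return 0.0
    total = 0.0
    for i in range(min(j, k) + 1):
        log_pmf = (lgamma(k + 1) - lgamma(i + 1) - lgamma(k - i + 1)
                   + i * log(p) + (k - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def lemma10_bound(k: int, p: float, beta: float) -> float:
    """Right-hand side of (30) for success probabilities p and p + beta."""
    q = p + beta
    return 2.0 * sqrt(2.0 * e) * sqrt(k + 2) * beta / sqrt(q * (1.0 - q))


def cdf_gap(k: int, p: float, beta: float, gamma: float) -> float:
    """Exact |P[J1 <= k*gamma] - P[J2 <= k*gamma]| for the two binomials."""
    j = floor(k * gamma)
    return abs(binom_cdf(k, p, j) - binom_cdf(k, p + beta, j))


# Arbitrary illustrative values (assumptions, not from the paper).
P, BETA, GAMMA = 0.2, 0.01, 0.2
rows = [(k, cdf_gap(k, P, BETA, GAMMA), lemma10_bound(k, P, BETA))
        for k in (50, 200, 800)]
for k, gap, bound in rows:
    print(f"k={k:4d}  exact gap={gap:.4f}  bound (30)={bound:.4f}")
```

For fixed \(p_{xy}\) and \(\beta \), the bound grows like \(\beta \sqrt{k}\), whereas Lemma 9 targets \(\beta \sqrt{\log k}\); this is the gap that the regime-splitting argument closes.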

\({p_{xy}\in I_1 = [ p_{xy}^{(0)}, p_{xy}^{(2c_{14} \sqrt{\frac{\log k}{k}})}]}\) (low substitution regime for small k): In this case, we use the fact that Lemma 10 guarantees that the binomial distributions are \(\mathcal {O}(\beta \sqrt{k})\) apart. That is, there exists a constant \(c_{15}' > 0\) such that

$$\begin{aligned}&\bar{\mathbb {P}}\left[ p_{xy}\in I_1 \right] \bar{\mathbb {E}}\left[ \left| \bar{\mathbb {P}}\left[ J_1 \le k \gamma \right] - \bar{\mathbb {P}}\left[ J_2 \le k \gamma \right] \right| | p_{xy}\in I_1\right] \\&\quad \le \bar{\mathbb {P}}\left[ p_{xy}\in I_1 \right] c_{15}' \beta \sqrt{k+2} \le c_{15}' \beta \sqrt{\log k}, \end{aligned}$$

where the last step follows from the quantile definition, after appropriately increasing the constant \(c_{15}'\).

\(p_{xy} \in I_2 = [p_{xy}^{(2c_{14} \sqrt{\frac{\log k}{k}})}, p_{xy}^{({2c_{14} [\alpha \vee \phi ]})}]\) (low substitution regime for large k):

If this interval is nonempty, there exists a constant \(c_{15}'' >0\) such that

$$\begin{aligned} \bar{\mathbb {P}}\left[ p_{xy}\in I_2 \right] \mathbb {E}\left[ \left| \bar{\mathbb {P}}\left[ J_1 \le k\gamma \right] - \bar{\mathbb {P}}\left[ J_2 \le k\gamma \right] \right| | p_{xy}\in I_2\right]&\le \bar{\mathbb {P}}[p_{xy}\in I_2]\nonumber \\&{\mathop {\le }\limits ^{(a)}} c_{15}'' {[\alpha \vee \phi ]}\nonumber \\&{\mathop {\le }\limits ^{(b)}} c_{15}''\frac{\log m}{m} {\vee \phi },\nonumber \end{aligned}$$

where (a) follows from (27), and (b) follows from the definition of \(\alpha \).

\({p_{xy}\in I_3= [ p_{xy}^{(2c_{14} {[\alpha \vee \phi ]})},0.5]}\) (high substitution regime): In this case observe that, since \(\gamma < p_{xy}^{(c_{14} {[\alpha \vee \phi ]})}\) (i.e., we are looking at a left tail below the mean), we can apply Chernoff’s bound (see e.g.  Boucheron et al. 2013) on each of the two terms in the difference individually. For any constant \(C_2 > 0\), we can choose \(c_{14}>0\) so that the following inequality holds for some \(c_{15}'''>0\)

$$\begin{aligned}&\bar{\mathbb {P}}\left[ p_{xy}\in I_3 \right] \mathbb {E}\left[ \left| \bar{\mathbb {P}}\left[ J_1 \le k \gamma \right] - \bar{\mathbb {P}}\left[ J_2 \le k \gamma \right] \right| | p_{xy}\in I_3\right] \\&\quad \le {2 \exp (- k (p_{xy}^{(2c_{14} [\alpha \vee \phi ])} - p_{xy}^{(c_{14} [\alpha \vee \phi ])})^2/2)}\nonumber \\&\quad \le c_{15}''' k^{-C_2},\nonumber \end{aligned}$$

where we used that \(p_{xy}^{(2c_{14} [\alpha \vee \phi ])} - p_{xy}^{(c_{14} [\alpha \vee \phi ])} = \Omega (\sqrt{\frac{\log k}{k}})\) by (27) and the definition of \(\alpha \).
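The tail estimate in the high substitution regime can be sanity-checked numerically. The sketch below, which is an illustration under assumed parameter values rather than part of the proof, compares the exact left tail of a binomial whose mean sits a distance `D` above the threshold with the bound \(\exp (-k D^2/2)\). Hoeffding's inequality gives the stronger \(\exp (-2kD^2)\), so the displayed form holds in particular. The names `left_tail_bound`, `K`, `GAMMA` and `C` are hypothetical.

```python
from math import exp, floor, lgamma, log, log1p, sqrt


def binom_cdf(k: int, p: float, j: int) -> float:
    """Exact P[Bin(k, p) <= j] for 0 < p < 1; pmf terms are computed in log-space."""
    if j < 0:
        return 0.0
    total = 0.0
    for i in range(min(j, k) + 1):
        log_pmf = (lgamma(k + 1) - lgamma(i + 1) - lgamma(k - i + 1)
                   + i * log(p) + (k - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def left_tail_bound(k: int, d: float) -> float:
    """exp(-k d^2 / 2), where d is the gap between the mean and the threshold."""
    return exp(-k * d * d / 2.0)


# Arbitrary illustrative values (assumptions, not from the paper).
K, GAMMA, C = 500, 0.1, 2.0
D = C * sqrt(log(K) / K)      # separation of order sqrt(log k / k)
P_MEAN = GAMMA + D            # success probability sitting D above the threshold
exact = binom_cdf(K, P_MEAN, floor(K * GAMMA))
bound = left_tail_bound(K, D)
print(exact, bound, K ** (-C * C / 2.0))
```

With a separation of \(C\sqrt{\log k / k}\), the bound equals \(k^{-C^2/2}\), which is how a polynomial rate \(k^{-C_2}\) arises once the constant is chosen large enough.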

Putting the three regimes together, there is a constant \(c_{7}'>0\) (not depending on \(f\), \(m\), \(k\)) such that

$$\begin{aligned} \left| \bar{\mathbb {P}}\left[ J_1 \le k \gamma \right] - \bar{\mathbb {P}}\left[ J_2 \le k \gamma \right] \right| \le \frac{c_{7}'}{3} \left( \beta \sqrt{\log k} + \frac{\log m}{m} {\vee \phi } + k^{-C_2} \right) . \end{aligned}$$

There is a \(c_{13}\) such that the conditions of the lemma imply \(m^{-1}\log m \le \beta \sqrt{\log k}\) and \(k^{-C_2} \le \beta \sqrt{\log k}\).
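One possible way to verify these two implications is sketched here; the explicit constants are illustrative assumptions and are not taken from the original. Write \(x = \beta \sqrt{\log k}\) and suppose \(k \ge e\) and \(x \le 1/e\). The condition on k gives directly

$$\begin{aligned} k \ge \left( \frac{1}{\beta } \right) ^{1/C_2} \ \Longrightarrow \ k^{-C_2} \le \beta \le \beta \sqrt{\log k}. \end{aligned}$$

For the condition on m, set \(m_0 = c_{13}\, x^{-1} \log (x^{-1})\) with \(c_{13} \ge 4\). Since \(t \mapsto t^{-1}\log t\) is decreasing on \([e,\infty )\) and \(m \ge m_0 \ge e\),

$$\begin{aligned} \frac{\log m}{m} \le \frac{\log m_0}{m_0} = x\, \frac{\log c_{13} + \log (x^{-1}) + \log \log (x^{-1})}{c_{13} \log (x^{-1})} \le x\, \frac{\log c_{13} + 2}{c_{13}} \le x, \end{aligned}$$

using \(\log \log (x^{-1}) \le \log (x^{-1})\), \(\log (x^{-1}) \ge 1\), and \(\log c + 2 \le c\) for \(c \ge 4\). The remaining term \(\phi \) is \(O(\beta )\) because \(\beta = \Theta (\phi )\), so all three contributions are \(O(\beta \sqrt{\log k})\).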

A.4 Proof of Proposition 7

Recall that \(\mathcal {E}_{12|3}\) is the event that there is a coalescence in the internal branch of the species phylogeny. First we prove:

Lemma 11

(Variance bound). There is a \(c_{8}'>0\) such that, \(\forall x,y\), \(\overline{\mathrm{Var}}(\widehat{s}_{xy}) \le \frac{c_{8}'}{\left| \mathcal {M}_{\mathrm{Q}2} \right| } \left( \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] + \phi + \alpha \right) .\)

Proof

Note that the variance of \(\widehat{s}_{12}\) is bounded from above by \(\frac{1}{\left| \mathcal {M}_{\mathrm{Q}2} \right| }\bar{\mathbb {P}}\left[ \widehat{q}_{12}\le \widehat{q}_*\right] \).

Observe further that, by Proposition 5, \(\widehat{q}_*\le r_{13}^{(0)} + c_{5}\phi + c_{5} \alpha \). Moreover, conditioned on the random distance \(\delta _{12}\), \(k\,\widehat{q}_{12}\) is distributed according to Bin\((k, q_{12})\), where \(q_{12} = p\left( \delta _{12} + \widehat{\Delta }_{12} \right) \). Finally, from Lemma 6 and the memoryless property of the exponential, we have that

$$\begin{aligned} \bar{\mathbb {P}}\left[ \widehat{q}_{12} \le r_{13}^{(0)} + c_{5}\phi + c_{5} \alpha | q_{12} > r_{13}^{(0)} + c_{5}\phi + c_{5}\alpha \right] \le c_{3}' \alpha . \end{aligned}$$

Therefore, arguing as in the proof of Lemma 8,

$$\begin{aligned} \bar{\mathbb {P}}\left[ \widehat{q}_{12}\le \widehat{q}_*\right]&\le \bar{\mathbb {P}}\left[ q_{12} \le r_{13}^{(0)} + c_{5}\phi + c_{5} \alpha \right] + c_{3}' \alpha . \end{aligned}$$
(31)

From (27), it follows that there is a constant \(c_{8}''>0\) such that

$$\begin{aligned} \bar{\mathbb {P}}\left[ q_{12} \le r_{13}^{(0)} + c_{5}\phi + c_{5} \alpha \right] \le c_{8}'' \left( r_{13}^{(0)} + c_{5}\phi + c_{5} \alpha - q_{12}^{(0)}\right) . \end{aligned}$$
(32)

Moreover,

$$\begin{aligned} r_{13}^{(0)} - q_{12}^{(0)} \le p\left( \delta _{13}^{(0)} + \Delta _{13} \right) - p\left( \delta _{12}^{(0)} + {\Delta }_{12} \right) + c_{8}''' \phi \le c_{8}''' \left( \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] + \phi \right) . \end{aligned}$$

The last inequality holds because \(\bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] \) is of the order of the length of the internal branch of the species phylogeny, and so is the difference in the middle expression. Together with (31) and (32), this implies that there is a constant \(c_{8}'>0\) (after adjusting it appropriately) such that \(\bar{\mathbb {P}}\left[ \widehat{q}_{12} \le \widehat{q}_*\right] \le c_{8}'\left( \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] + \phi + \alpha \right) \).\(\square \)

Proof

(Proposition 7) An error in the quantile test implies that either \(\widehat{s}_{13} > \widehat{s}_{12}\) or \(\widehat{s}_{23} > \widehat{s}_{12}\). Therefore, the probability that the algorithm makes an error is at most

$$\begin{aligned}&\bar{\mathbb {P}}\left[ \widehat{s}_{13} \ge \widehat{s}_{12} \right] + \bar{\mathbb {P}}\left[ \widehat{s}_{23} \ge \widehat{s}_{12} \right] \nonumber \\&\le \bar{\mathbb {P}}\left[ \widehat{s}_{13} - s_{13} \ge \frac{s_{12} - s_{13}}{2} \right] + \bar{\mathbb {P}}\left[ s_{12} - \widehat{s}_{12} \ge \frac{s_{12} - s_{13}}{2} \right] \nonumber \\&\qquad + \bar{\mathbb {P}}\left[ \widehat{s}_{23} - s_{23} \ge \frac{s_{12} - s_{23}}{2} \right] + \bar{\mathbb {P}}\left[ s_{12} - \widehat{s}_{12} \ge \frac{s_{12} - s_{23}}{2} \right] . \end{aligned}$$
(33)

Consider the second term on the right-hand side of (33); the other terms are handled similarly. We need two ingredients to invoke Bernstein’s inequality: (1) a lower bound on the “gap” \(\frac{s_{12} - s_{13}}{2}\) (Proposition 6), and (2) an upper bound on the variance of \(\widehat{s}_{12}\) (Lemma 11). We get that \(\bar{\mathbb {P}}\left[ s_{12} - \widehat{s}_{12} \ge \frac{s_{12} - s_{13}}{2} \right] \) is at most

$$\begin{aligned}&\exp \left( - \frac{0.5 \left( \frac{s_{12}-s_{13}}{2} \right) ^2}{\mathrm{Var}(\widehat{s}_{12}) + \frac{1}{6}(s_{12}-s_{13})} \right) \nonumber \\&\quad \le \exp \left( - \frac{\left| \mathcal {M}_{\mathrm{Q}2} \right| \left( c_{6} \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3}\right] \right) ^2}{c_{8}'\left( \alpha + \phi + \bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] \right) + \frac{c_{6}}{6}\bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] } \right) \nonumber \\&\quad \le \exp \left( - \frac{\left| \mathcal {M}_{\mathrm{Q}2} \right| p(3f/4)^2}{c_{8}\left( p(3f/4) + \alpha \right) } \right) , \end{aligned}$$
(34)

where the last line follows from the fact that \(\bar{\mathbb {P}}\left[ \mathcal {E}_{12|3} \right] \) is bounded from below by p(3f/4) and the assumption that \(\phi \ll p(3f/4)\).\(\square \)
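The scaling in (34) can be explored with a small sketch of Bernstein's inequality for an average of independent variables. It assumes, as a simplification, that \(\widehat{s}_{12}\) is an average of \(n = \left| \mathcal {M}_{\mathrm{Q}2} \right| \) independent summands taking values in \([0,1]\) with per-summand variance at most `var`. The function names and numerical values are hypothetical and chosen only for illustration.

```python
from math import ceil, exp, log


def bernstein_tail(n: int, var: float, t: float) -> float:
    """Bernstein bound exp(-n t^2 / (2 (var + t/3))) on a deviation of size t
    for the mean of n independent [0, 1]-valued variables with variance <= var."""
    return exp(-n * t * t / (2.0 * (var + t / 3.0)))


def loci_needed(var: float, t: float, eps: float) -> int:
    """Smallest n for which the Bernstein bound above is at most eps."""
    return ceil(2.0 * (var + t / 3.0) * log(1.0 / eps) / (t * t))


# Hypothetical values: the variance and the gap are both proportional to a small
# probability prob, mimicking Lemma 11 and Proposition 6 with phi and alpha ignored.
for prob in (0.1, 0.05, 0.025):
    var, gap = prob, 0.2 * prob
    n = loci_needed(var, gap, eps=0.01)
    print(f"prob={prob:.3f}  loci needed={n}  bound={bernstein_tail(n, var, gap):.4f}")
```

Because the variance and the gap shrink together, halving `prob` roughly doubles the number of loci required rather than quadrupling it, consistent with the exponent \(\left| \mathcal {M}_{\mathrm{Q}2} \right| p(3f/4)^2 / [c_{8}(p(3f/4) + \alpha )]\) in (34).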


Cite this article

Dasarathy, G., Mossel, E., Nowak, R. et al. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. J. Math. Biol. 84, 36 (2022). https://doi.org/10.1007/s00285-022-01731-5


  • DOI: https://doi.org/10.1007/s00285-022-01731-5
