1 Introduction

Inferring rooted species trees is important for many downstream applications of phylogenetics, such as comparative genomics [7, 11] and dating [25]. These estimations use different loci from across the genomes of the selected species, and so are referred to as multi-locus analyses. If rooted gene trees can be accurately inferred, then the rooted species tree can be estimated from them [14]; however, this is not a reliable assumption [29]. Hence, the standard approach is to first estimate an unrooted species tree using multi-locus datasets, and then root that estimated tree.

The estimation of the unrooted species tree is challenged by biological processes, such as incomplete lineage sorting (ILS) or gene duplication and loss (GDL), that can result in different parts of the genome having different evolutionary trees. When ILS or GDL occur, statistically consistent estimation of the unrooted species tree requires techniques that take the source of heterogeneity into consideration [15, 22]. The case of ILS, as modeled by the multispecies coalescent (MSC) model [10], is the most well-studied, and there are several methods for estimating unrooted species trees that have been proven statistically consistent under the MSC (see [22] for a survey of such methods).

The general problem of rooting a species tree (or indeed even a gene tree) is of independent interest, but presents many challenges. A common approach is the use of an outgroup taxon (i.e., the inclusion of a species that is outside the smallest clade containing the remaining species), so that the resultant tree is rooted on the edge leading to the outgroup [16]. However, outgroup selection has its own difficulties: if the outgroup is too distant, then it may be attached fairly randomly to the tree containing the remaining species, and if it is too close, it may even be an ingroup taxon [5, 6, 9, 13]. Other approaches use branch lengths estimated on the tree to find the root based on specific optimization criteria; however, these approaches tend to degrade in accuracy unless the strict molecular clock holds (i.e., the assumption that substitutions accumulate at a constant rate across all lineages) [8, 18, 31].

Quintet Rooting (QR) [30] is a recently introduced method that is designed to root a given species tree using the unrooted gene tree topologies, under the assumption that the gene trees can differ from the species tree due to ILS. QR is based on mathematical theory established by Allman, Degnan, and Rhodes [2], which showed that the rooted species tree topology is identifiable from the unrooted gene tree topologies whenever the number of species is at least five. In [30], QR was shown to provide good accuracy for rooting both estimated and true species trees in the presence of ILS, compared to alternative methods.

However, QR was not proven to be statistically consistent for locating the root. Thus, we do not have a proof that the root location selected by QR, when given the true species tree topology, will converge to the correct location as the number of gene trees in the input increases. Much attention has been paid to establishing statistical consistency for unrooted species tree estimation methods, and many methods, such as ASTRAL [19], SVDQuartets [32] and BUCKy [12], have been proven to be statistically consistent estimators of the unrooted species tree topology under the MSC. However, to the best of our knowledge, no prior study has addressed the statistical consistency properties of methods for rooting species trees.

In this paper, we argue that QR is not guaranteed to be statistically consistent under the MSC, but we also present a modification to QR, which we call QR-STAR, that we prove statistically consistent. Moreover, QR-STAR, like QR, runs in polynomial time. We also provide results of a simulation study comparing QR to QR-STAR. Due to space limitations, most of the proofs and results from the simulation study are presented in the full version of the paper available at https://doi.org/10.1101/2022.10.26.513897.

2 Background

We present the theory from [2] first, which establishes identifiability of the rooted species tree from unrooted quintet trees, and then we describe Quintet Rooting (QR), our earlier method for rooting species trees. Together these form the basis for deriving our new method, QR-STAR, which we present in the next section.

2.1 Allman, Degnan, and Rhodes (ADR) Theory

Allman, Degnan, and Rhodes (ADR) [2] established that the unrooted topology of the species tree is identifiable from four-leaf unrooted gene trees under the MSC, a result that is well known and used in several “quartet-based” methods for estimating species trees under the MSC [12, 17, 19, 24]. ADR also proved that the rooted species tree topology is identifiable from unrooted five-leaf gene tree topologies; this result is much less well known, but was recently used in the development of QR for rooting species trees.

Fig. 1. ADR invariants and inequalities for different rooted topological shapes. The invariants and inequalities found by ADR define, for each rooted 5-taxon model tree topology, a partial order on the probabilities of the 15 unrooted 5-leaf gene trees; importantly, the partial order depends only on the “rooted shape” of the 5-taxon model species trees (i.e., caterpillar, balanced and pseudo-caterpillar). Thus, the topology of any 5-leaf rooted binary species tree is uniquely determined by the partial order, and so can be determined from the true distribution on unrooted 5-leaf gene trees (i.e., it is identifiable, as established by ADR).

ADR have described the probability distribution of unrooted gene tree topologies under each 5-taxon MSC model species tree. On a given set of five taxa, there exist 105 different rooted binary trees, labeled \(R_1, \dots , R_{105}\), that can be categorized into three groups based on their (unlabeled) rooted shapes: caterpillar, balanced and pseudo-caterpillar [27]. An example of a tree from each category is shown in Fig. 1. Each 5-taxon model species tree defines a specific probability distribution over the 15 different unrooted gene tree topologies on the same leafset, denoted by \(T_1, \dots , T_{15}\). Theorem 9 in [2] states that this distribution uniquely determines the rooted tree topology and its internal branch lengths for trees with at least five taxa.

To prove this identifiability result, the ADR theory specifies a set of linear invariants (i.e., equalities) and inequalities that must hold between the probabilities of unrooted 5-taxon gene trees, for any choice of the parameters of the model species tree. These linear invariants and inequalities define a partial order on the probabilities of the topologies of the different 5-taxon unrooted gene trees. In other words, two gene tree probabilities \(u_i = \mathbb {P}(T_i)\) and \(u_j = \mathbb {P}(T_j)\) can have one of four possible relationships: \(u_i > u_j\), \(u_j > u_i\), \(u_i = u_j\), or \(u_i\) and \(u_j\) are not comparable.

Figure 1 shows examples of these partial orders with a Hasse diagram for a particular leaf labeling of trees from each rooted shape. Note that some probabilities are members of the same set (e.g., for \(R_1\), set \(c_4\) contains both \(u_4\) and \(u_{13}\), indicating that \(u_4 = u_{13}\)), and so we refer to the sets \(c_i\) as equivalence classes on these probabilities. Furthermore, we will denote the set of equivalence classes associated with a 5-taxon rooted tree R with \(C_R\). As can be seen in Fig. 1, the number of equivalence classes depends on the shape of the rooted species tree, with caterpillar, balanced and pseudo-caterpillar trees having 7, 5 and 5 equivalence classes, respectively.

Each directed edge between two equivalence classes in these Hasse diagrams defines an inequality, so that all gene tree probabilities in class \(c_a\) at the source of an edge are greater than all gene tree probabilities in class \(c_b\) at the target, and we denote this by \(c_a > c_b\). The exact values of the unrooted gene tree probabilities depend on the internal branch lengths of the model tree, and ADR provide a set of formulas that relate the model tree parameters to the probability distribution of the unrooted gene trees in Appendix B of [2], which will be used in our proofs.

2.2 Quintet Rooting

The input to QR is an unrooted species tree T with n leaves and a set \(\mathcal {G}\) of k single-copy unrooted gene trees where the gene trees draw their leaves from the leafset of T, denoted by \(\mathcal {L}(T)\). Given this input, QR searches over all possible rootings of T and returns a tree most consistent with the distribution of quintets (i.e., 5-taxon trees) in the input gene trees.

QR approaches this problem by selecting a set Q of quintets of taxa from \(\mathcal {L}(T)\) (called the “quintet sampling” step; refer to Supplementary Materials Sec. A for details), and scoring all rooted versions of T based on their induced trees on these quintets. The subtree \(T_{|q}\), T restricted to taxa in quintet set q, can be rooted on any of its seven edges. In a preprocessing step, QR computes a score for each of these seven different rootings for all trees induced on the quintets in set Q, based on a cost function. This results in \(7 \times |Q|\) computations, and therefore the preprocessing step takes \(O(k(|Q|+n))\). Next, for every rooted version of T, QR sums up the costs of all its induced rooted trees on quintets in Q using the scores computed in the preprocessing step, and returns the rooting with the minimum overall cost. Since T can be rooted on any of its \(2n-3\) edges, the scoring step takes \(O(n + |Q|)\) time. Therefore, the overall runtime of QR when using an O(n) sampling of quintets is O(nk). Figure 2 shows the pipeline of QR and its individual steps.

Thus, QR provides an exact solution to the optimization problem with the following input and output:

  • Input: An unrooted tree topology T, a set of k unrooted gene tree topologies \(\mathcal {G} = \{g_1, g_2, \dots , g_k\}\), a set Q containing quintets of taxa from leafset \(\mathcal {L}(T)\) and a cost function \(\textrm{Cost}(r, \vec {u})\).

  • Output: Rooted tree R with topology T such that \(\sum _{q \in Q}\textrm{Cost}(R|_{q}, \vec {\hat{u}}_q)\) is minimized, where \(\vec {\hat{u}}_q\) is the distribution of unrooted gene tree quintets in \(\mathcal {G}|_{q} = \{{g_1}|_{q}, {g_2}|_{q}, \dots , {g_k}|_{q}\}\).
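The optimization problem above can be illustrated with a naive sketch that evaluates every candidate rooting directly. The function name `best_rooting` and the two lookup tables are hypothetical and chosen only for illustration; QR's own scoring step may organize this computation differently.

```python
from typing import Dict, Hashable, Sequence


def best_rooting(pre_cost: Sequence[Sequence[float]],
                 induced: Dict[Hashable, Sequence[int]]) -> Hashable:
    """Return the candidate root edge of T with minimum total cost.

    pre_cost[i][j] -- cost of the j-th rooting (0..6) of the i-th sampled
                      quintet, as produced by a preprocessing step.
    induced[e][i]  -- index of the rooting that placing the root on edge e
                      of T induces on the i-th sampled quintet.
    Both tables are assumptions of this sketch, not part of QR's interface.
    """
    def total(edge: Hashable) -> float:
        # Sum the precomputed quintet costs selected by this rooting.
        return sum(pre_cost[i][j] for i, j in enumerate(induced[edge]))

    return min(induced, key=total)
```

Here the quintet costs are computed once and only looked up during the search, which mirrors the split between the preprocessing and scoring steps described above.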

Cost Function. The cost function \(\textrm{Cost}(R|_{q}, \vec {\hat{u}}_q)\) measures the fitness of the rooted quintet tree \(R|_{q}\) with the distribution of the unrooted gene trees restricted to q (i.e., \(\vec {\hat{u}}_q\)), according to the linear invariants and inequalities derived from the ADR theory. In particular, this cost function is designed to penalize a rooted tree \(R|_{q}\) if the estimated quintet distribution \(\vec {\hat{u}}_q\) violates some of the inequalities or invariants in its partial order. To this end, a penalty term was considered for each invariant and inequality in the partial order of a 5-taxon rooted tree that is violated in a quintet distribution. The cost function was defined based on a linear combination of these penalty terms, and had the following form, where r is a 5-taxon rooted tree and \(\vec {\hat{u}}\) is an estimated quintet distribution:

$$\begin{aligned} \begin{aligned} \textrm{Cost}(r, \vec {\hat{u}}) = \underbrace{\sum _{c \in C_r}{\frac{1}{|c|}\sum _{u_a, u_b \in c}{|\hat{u}_a - \hat{u}_b|}}}_{\text {Invariants Penalty}} +&\underbrace{\sum _{c > c^\prime \in C_r}{\frac{1}{|c^\prime |}\sum _{u_a \in c, u_b \in c^\prime }{\max (0, \hat{u}_b - \hat{u}_a)}}}_{\text {Inequalities Penalty}}. \end{aligned} \end{aligned}$$
(1)

The normalization factors \(\frac{1}{|c|}\) and \(\frac{1}{|c^\prime |}\) were used to reduce a topological bias that arose from differences in the sizes of the equivalence classes for each tree shape.
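A minimal sketch of Eq. 1 follows, under two assumptions that the text does not settle: the inner sum of the invariants penalty is taken over unordered pairs within a class, and the caller supplies the list of class pairs \(c > c^\prime \) to be penalized. The function name `qr_cost` and the data layout are hypothetical.

```python
from itertools import combinations


def qr_cost(classes, greater, u_hat):
    """Evaluate a cost of the form of Eq. 1 for one rooted quintet tree.

    classes -- list of equivalence classes, each a list of 0-based indices
               into u_hat (assumed layout).
    greater -- list of (i, j) pairs meaning classes[i] > classes[j].
    u_hat   -- estimated probabilities of the unrooted quintet gene trees.
    """
    invariants = 0.0
    for c in classes:
        # Members of one class should be equal; penalize their spread,
        # normalized by the class size |c|.
        spread = sum(abs(u_hat[a] - u_hat[b]) for a, b in combinations(c, 2))
        invariants += spread / len(c)

    inequalities = 0.0
    for i, j in greater:
        c, c_prime = classes[i], classes[j]
        # Penalize every member of the lower class that exceeds a member
        # of the higher class, normalized by |c'|.
        excess = sum(max(0.0, u_hat[b] - u_hat[a]) for a in c for b in c_prime)
        inequalities += excess / len(c_prime)

    return invariants + inequalities
```

For a toy partial order with classes `[[0], [1, 2]]` and the single relation `(0, 1)`, a vector that satisfies every constraint, such as `[0.5, 0.25, 0.25]`, has cost zero. This toy order is not one of the ADR partial orders.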

3 QR-STAR

QR-STAR is an extension to QR that has an additional step for determining the rooted shape (i.e., the rooted topology without the leaf labels) of a quintet tree, as well as an associated penalty term in its cost function. This penalty term compares the rooted shape of the 5-taxon tree, denoted by S(r), with the rooted shape inferred by QR-STAR from the given quintet distribution, denoted by \(\hat{S}(\hat{u})\). The motivation for this additional preprocessing step is that, as we argue in Supplementary Materials Sec. C, the cost function of QR does not guarantee statistical consistency. The cost function of QR-STAR takes the following general form

$$\begin{aligned} \begin{aligned} \textrm{Cost}^*(r, \vec {\hat{u}}) =&\underbrace{\sum _{c \in C_r}{\sum _{u_a, u_b \in c}{\alpha _{a, b}{|\hat{u}_a - \hat{u}_b|}}}}_{\text {Invariants Penalty}} + \underbrace{\sum _{\begin{array}{c} c > c^\prime \in C_r \end{array}}{\sum _{u_a \in c, u_b \in c^\prime }{\beta _{a, b}\max (0, \hat{u}_b - \hat{u}_a)}}}_{\text {Inequalities Penalty}} \\&+ \underbrace{C\mathbbm {1}|S(r) \ne \hat{S}(\hat{u})|}_{\text {Shape Penalty}} \end{aligned} \end{aligned}$$
(2)

where \(\alpha _{a, b} \ge 0\) and \(\beta _{a, b}, C > 0\) are constant real numbers for all \(a, b\). Let \(\alpha _{\max } = \max _{a, b}(\alpha _{a, b})\) and \(\beta _{\min } = \min _{a, b}(\beta _{a, b})\), where \((a, b)\) ranges over all pairs of indices used in the penalty terms in Eq. 2.
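The structure of Eq. 2 can be sketched as follows. To keep the example short, it assumes a single coefficient `alpha` for all invariant terms and a single coefficient `beta` for all inequality terms, whereas Eq. 2 allows one coefficient per pair. It also assumes unordered pairs within a class, and it takes the estimated shape as an argument. The function name, data layout and default values are illustrative only.

```python
from itertools import combinations


def qr_star_cost(classes, greater, shape, est_shape, u_hat,
                 alpha=0.0, beta=1.0, C=1e-2):
    """Evaluate a cost of the form of Eq. 2 with uniform coefficients.

    classes   -- equivalence classes of r as lists of 0-based indices.
    greater   -- (i, j) pairs meaning classes[i] > classes[j].
    shape     -- rooted shape S(r) of the candidate tree.
    est_shape -- rooted shape estimated from the quintet distribution.
    u_hat     -- estimated probabilities of the unrooted gene trees.
    """
    invariants = alpha * sum(
        abs(u_hat[a] - u_hat[b])
        for c in classes
        for a, b in combinations(c, 2)
    )
    inequalities = beta * sum(
        max(0.0, u_hat[b] - u_hat[a])
        for i, j in greater
        for a in classes[i]
        for b in classes[j]
    )
    # The indicator in Eq. 2: a fixed penalty C when the shapes disagree.
    shape_penalty = C if shape != est_shape else 0.0
    return invariants + inequalities + shape_penalty
```

With `alpha=0.0` the invariant terms drop out, so a candidate whose inequalities all hold and whose shape matches the estimate receives cost exactly zero.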

Fig. 2. Pipeline of QR and QR-STAR. The input is an unrooted species tree T and set of unrooted gene trees \(\mathcal {G}\) on the same leafset. a) The sampling step selects a set Q of quintets from the leafset of T (shown is the linear encoding sampling). b) An additional step in QR-STAR that determines the rooted shape for each selected quintet. c) The preprocessing step computes a cost for each of the seven possible rootings of each selected quintet. d) The scoring step computes a score for each rooted tree in the search space based on the costs computed in the preprocessing step, and returns a rooting of T with minimum score.

Each of the 105 rooted binary trees on a given set of 5 leaves has a unique set of inequalities and invariants that can be derived from the ADR theory. The cost function in Eq. 2 considers a penalty term for these inequalities and invariants as well as the shape of the tree, so that \(\textrm{Cost}^*(r, \vec {\hat{u}})\) is minimized for a rooted 5-taxon tree r that best describes the given estimated quintet distribution.

3.1 Determining the Rooted Shape

Model 5-taxon species trees with different rooted shapes (i.e., caterpillar, balanced, pseudo-caterpillar) define equivalence classes with different class sizes on the unrooted gene tree probability distribution. These class sizes can be used to determine the unlabeled shape of a rooted tree, when given the true gene tree probability distribution. For example, the size of the equivalence class with the smallest gene tree probabilities is 8 for the pseudo-caterpillar trees and 6 for balanced or caterpillar trees. Therefore, the size of the equivalence class corresponding to the minimal element in the partial order can differentiate a pseudo-caterpillar tree from other tree shapes. Moreover, both caterpillar and balanced trees have a unique class with the second smallest probability, which is of size 2 for caterpillar trees and 4 for balanced trees and this can be used to differentiate a caterpillar tree from a balanced tree. This approach is used in Theorem 9 in [2] for establishing the identifiability of rooted 5-taxon trees from unrooted gene trees.

However, given an estimated gene tree distribution, it is likely that none of the invariants derived from the ADR theory exactly hold, and so the class sizes cannot be directly determined and the approach above cannot be used as is to infer the shape of a rooted quintet. Here we propose a simple modification for determining the rooted shape of a tree from the estimated distribution of unrooted gene trees, by looking for significant gaps between quintet gene tree probabilities.

Let T be the unrooted species tree with \(n \ge 5\) leaves given to QR-STAR and q be a quintet of taxa from \(\mathcal {L}(T)\). Let \(\vec {\hat{u}}\) be the quintet distribution estimated from input gene trees induced on taxa in set q. QR-STAR first sorts \(\vec {\hat{u}}\) in ascending order to get \(\hat{u}_{\sigma _1} \le \hat{u}_{\sigma _2} \le \dots \le \hat{u}_{\sigma _{15}}\). Let \(A(k) = \sqrt{\frac{2}{k}\ln (30|Q|k)}\) (refer to Lemma 4 for the derivation of A(k)), where k is the number of input gene trees and |Q| is the size of the set of sampled quintets in QR-STAR (this depends on the number n of taxa and is assumed fixed), and note that \(\lim _{k\rightarrow \infty } A(k) = 0\). The first step of QR-STAR computes an estimate of the rooted shape of a quintet q, denoted by \(\hat{S}(\hat{u})\) in Eq. 2, as follows:

  • estimate the rooted shape \(\hat{S}(\hat{u})\) as pseudo-caterpillar if \(\hat{u}_{\sigma _7} - \hat{u}_{\sigma _6} < A(k)\);

  • estimate the rooted shape \(\hat{S}(\hat{u})\) as balanced if \(\hat{u}_{\sigma _7} - \hat{u}_{\sigma _6} \ge A(k)\) and \(\hat{u}_{\sigma _9} - \hat{u}_{\sigma _8} < A(k)\);

  • estimate the rooted shape \(\hat{S}(\hat{u})\) as caterpillar if \(\hat{u}_{\sigma _7} - \hat{u}_{\sigma _6} \ge A(k)\) and \(\hat{u}_{\sigma _9} - \hat{u}_{\sigma _8} \ge A(k)\).
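The three rules above can be sketched directly in code. The function names `gap_threshold` and `estimate_shape` are hypothetical; the threshold is the A(k) defined in the text, and the sorted positions \(\sigma _6, \dots , \sigma _9\) become 0-based list indices 5 to 8.

```python
import math


def gap_threshold(k, num_quintets=1):
    """A(k) = sqrt((2 / k) * ln(30 * |Q| * k)) for k gene trees."""
    return math.sqrt(2.0 / k * math.log(30 * num_quintets * k))


def estimate_shape(u_hat, k, num_quintets=1):
    """Estimate the rooted shape of a quintet from its 15 estimated
    unrooted gene tree probabilities, following the three rules above."""
    if len(u_hat) != 15:
        raise ValueError("expected 15 quintet gene tree probabilities")
    s = sorted(u_hat)  # ascending, so s[5] is the 6th smallest value
    a = gap_threshold(k, num_quintets)
    if s[6] - s[5] < a:
        # No significant gap after the 6 smallest values.
        return "pseudo-caterpillar"
    if s[8] - s[7] < a:
        # Gap after 6 values, but none after the next 2.
        return "balanced"
    return "caterpillar"
```

Because A(k) shrinks slowly, the rule is conservative for small inputs: with k = 100 gene trees and a single quintet, A(k) is about 0.40, and a gap that large between the 6th and 7th smallest of 15 probabilities is impossible, so the estimate is always pseudo-caterpillar.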

The runtime of QR-STAR is the same as QR, as determining the topological shape for each quintet is done in constant time, so that the overall runtime remains O(nk) when a linear sampling of quintets is used.

4 Theoretical Results

In this section, we provide the main theoretical results, starting with a series of lemmas and theorems that will be used in the proof of statistical consistency of QR-STAR in Theorem 2. Throughout this paper, we assume that discordance between species trees and gene trees is solely due to ILS. In establishing statistical consistency, we assume that input gene trees are true gene trees and, thus, have no gene tree estimation error. If not otherwise specified, all trees are assumed to be fully resolved (i.e., binary). Due to space constraints, the proofs are provided in Supplementary Materials Section B. We begin with some definitions and key observations.

Definition 1 (Path length parameter)

Let R be an MSC model species tree. Let f(R) be the length of the shortest internal branch of R and g(R) be the length of the longest internal path (i.e., a path formed from only the internal branches) of R. We define the path length parameter of R as

$$\begin{aligned} h(R) = \frac{1}{18}e^{-3g(R)}(1 - e^{-f(R)})^2 \end{aligned}$$
(3)

Note that \(h(R) \in (0, \frac{1}{18})\) since \(\exp (-x) \in (0, 1)\) for all \(x > 0\) and the branch lengths have positive values. The formula in Eq. 3 is derived from the proof of Lemma 2 in Supplementary Materials Sec. B.
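As a small numerical illustration of Eq. 3, the sketch below evaluates h(R) from the two quantities in Definition 1. The function name is hypothetical, and the inputs are assumed to be branch lengths in the units used by the model tree.

```python
import math


def path_length_parameter(shortest_internal_branch, longest_internal_path):
    """h(R) = (1/18) * exp(-3 g(R)) * (1 - exp(-f(R)))^2, as in Eq. 3."""
    f, g = shortest_internal_branch, longest_internal_path
    if f <= 0 or g < f:
        # An internal path contains at least one internal branch, so g >= f.
        raise ValueError("need 0 < f(R) <= g(R)")
    return math.exp(-3.0 * g) * (1.0 - math.exp(-f)) ** 2 / 18.0


# Example: f(R) = 0.5 and g(R) = 1.5 give h(R) of roughly 9.6e-5.
h = path_length_parameter(0.5, 1.5)
```

The value falls quickly as the longest internal path grows and as the shortest internal branch shrinks, which is the regime in which separating the gene tree probabilities requires more data.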

Lemma 1

Let R be an MSC model species tree with \(n \ge 5\) leaves and q be an arbitrary set of 5 leaves from \(\mathcal {L}(R)\). Then \(h(R|_q) \ge h(R)\) where \(R|_q\) is the rooted tree R restricted to taxa in set q.

Lemma 2

Let R be an MSC model species tree with 5 leaves and internal branch lengths x, y, and z. Let \(\vec {u}\) be the probability distribution that R defines on the unrooted 5-taxon gene tree topologies. If \(\vec {\hat{u}}\) is an estimate of \(\vec {u}\) such that given \(\epsilon > 0\), we have \(|\hat{u}_i - u_i| < \epsilon \) for all \(1 \le i \le 15\), then the following inequality holds:

$$\begin{aligned} \forall _{c> c^\prime \in C_R}\forall _{u_a \in c, u_b \in c^\prime }: \hat{u}_a - \hat{u}_b > h(R) - 2\epsilon . \end{aligned}$$
(4)

Definition 2

For a 5-taxon rooted tree R, we define \(I_R\) as the set of ordered pairs \((i, j)\), \(1 \le i \ne j \le 15\), corresponding to inequalities in the form \(u_i > u_j\) defined according to the partial order of R. The inequalities that are a result of transitivity (i.e., \(u_i > u_j\) and \(u_j > u_k\) implies \(u_i > u_k\)) are not included in \(I_R\).

Definition 3

Let \(V(R, R^\prime )\) be the set of violated inequalities of two rooted 5-taxon trees R and \(R^\prime \), i.e., all pairs \(\{i, j\}\) such that \((i, j) \in I_{R}\) and \((j, i) \in I_{R^\prime }\).
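Definition 3 translates into a short set computation. In this sketch the inequality sets are represented as Python sets of ordered index pairs; the function name and the toy inputs in the usage note are illustrative and are not the actual ADR inequality sets.

```python
def violated_pairs(I_R, I_R_prime):
    """Return V(R, R'): unordered pairs {i, j} with (i, j) in I_R
    and the reversed pair (j, i) in I_R'."""
    other = set(I_R_prime)
    return {frozenset((i, j)) for (i, j) in I_R if (j, i) in other}
```

For example, `violated_pairs({(1, 2), (3, 4)}, {(2, 1), (3, 4)})` contains only the pair `{1, 2}`, since the two trees agree on the order of \(u_3\) and \(u_4\). Swapping the arguments gives the same set, because the result holds unordered pairs.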

Figure 3a shows an example of \(V(R, R^\prime )\) computed for caterpillar trees and Fig. 3b is a heatmap showing the function \(|V(R, R^\prime )|\) computed for the seven possible rootings of an unrooted quintet tree. The set \(V(R, R^\prime )\) can be easily computed from \(I_R\) and \(I_{R^\prime }\) for all pairs of rooted 5-taxon trees, and \(I_R\) is derived from the ADR theory for all 105 5-taxon rooted trees in the Supplementary Materials, Sec. S2 in [30].

Fig. 3. Conflicting inequality penalty terms between rooted 5-taxon species trees. a) Set of violated inequality penalty terms in the partial orders of \(R_1\) and \(R_7\) with respect to \(R_4\), which are all caterpillar trees. The red edges show violations of inequalities in tree \(R_4\), highlighted in blue. b) Heatmap showing the number of pairwise violated penalty terms (function \(|V(R,R^\prime )|\)) of seven possible rooted trees having unrooted topology with bipartitions ab|cde and abc|de. The dark colors indicate more violations, and the lightest color corresponds to no violations (\(|V(R,R^\prime )| = 0\)). (Color figure online)

Lemma 3

(a) For distinct 5-taxon binary rooted trees R and \(R^\prime \) with the same rooted shape, the set \(V(R, R^\prime )\) is always non-empty. (b) For each balanced tree B, there exist two caterpillar trees \(C_1\) and \(C_2\) such that \(V(B, C_i) = \emptyset \) for \(i \in \{1, 2\}\).

4.1 Statistical Consistency

In this section, we establish statistical consistency for QR-STAR under the MSC and provide the sufficient condition for a set of sampled quintets to lead to consistency. That is, we prove that as the number of input true gene trees increases, the probability that QR-STAR and its variants correctly root the given unrooted species tree converges to 1. We first prove statistical consistency for QR-STAR when the model tree has only five taxa in Theorem 1 and then extend the proofs to trees with arbitrary numbers of taxa in Theorem 2. The main idea of the proof of consistency for 5-taxon trees is to show that, as the number of input gene trees increases, the cost of the true rooted tree becomes arbitrarily close to zero, while the cost of any other rooted tree is bounded away from zero, where the bound depends on the path length parameter of the model tree, h(R).

Lemma 4

Let R be an MSC model species tree with \(n \ge 5\) leaves and Q be a set of quintets of taxa from \(\mathcal {L}(R)\). Given \(\delta > 0\) and \(k > 0\) unrooted gene tree topologies, the following inequality holds, where \(A_{\delta }(k) = \sqrt{\frac{2}{k}\ln (\frac{30|Q|}{\delta })}\)

$$\begin{aligned} \mathbb {P}\left( \forall _{q \in Q} \forall _{1 \le i \le 15} |(\hat{u}_q)_i - (u_q)_i| < \frac{A_{\delta }(k)}{2}\right) \ge 1 - \delta . \end{aligned}$$
(5)

Setting \(\delta = \frac{1}{k}\) in Eq. 5, we get \(A(k) = \sqrt{\frac{2}{k}\ln (30|Q|k)}\), which is the bound that is used for determining the rooted shape of each quintet in the first step of QR-STAR as well as the proofs of statistical consistency. When R has only five taxa, A(k) becomes \(\sqrt{\frac{2}{k}\ln (30k)}\), as Q can only contain one quintet.
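One way to see where the constants in \(A_{\delta }(k)\) come from is the following sketch, which assumes Hoeffding's inequality and a union bound; the proof in the Supplementary Materials may proceed differently. Each \((\hat{u}_q)_i\) is the fraction of the k gene trees that induce topology \(T_i\) on q, i.e., the mean of k i.i.d. indicator variables with expectation \((u_q)_i\), so that for any \(t > 0\)

$$\begin{aligned} \mathbb {P}\left( |(\hat{u}_q)_i - (u_q)_i| \ge t\right) \le 2e^{-2kt^2}. \end{aligned}$$

A union bound over the \(15|Q|\) pairs \((q, i)\) bounds the probability that any estimate deviates by at least t by \(30|Q|e^{-2kt^2}\). Setting this equal to \(\delta \) and solving for t gives

$$\begin{aligned} t = \sqrt{\frac{1}{2k}\ln \left( \frac{30|Q|}{\delta }\right) } = \frac{1}{2}\sqrt{\frac{2}{k}\ln \left( \frac{30|Q|}{\delta }\right) } = \frac{A_{\delta }(k)}{2}, \end{aligned}$$

which matches the deviation used in Eq. 5. The union bound does not require independence across quintets, which matters here because the quintet trees for different q are induced by the same gene trees.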

Lemma 5 (Correct determination of rooted shape)

Let R be a 5-taxon model species tree and \(\vec {u}\) be the probability distribution that it defines on the unrooted 5-taxon gene tree topologies. There is an integer \(k > 0\) such that if we are given at least k unrooted gene trees drawn i.i.d. from the distribution \(\vec {u}\), the first step of QR-STAR will correctly determine the rooted shape of R with probability at least \(1 - \frac{1}{k}\).

Lemma 6 (Upper bound on the cost of the model tree)

Let R be a 5-taxon model species tree and \(\vec {u}\) be the probability distribution that it defines on the unrooted 5-taxon gene tree topologies. There is an integer \(k > 0\) such that if we are given at least k unrooted gene trees drawn i.i.d. from distribution \(\vec {u}\), then \(\textrm{Cost}^*(R, \vec {\hat{u}})\) is less than \(31\alpha _{\max }A(k)\) with probability at least \(1 - \frac{1}{k}\).

Theorem 1 (Statistical Consistency of QR-STAR for 5-taxon trees)

Let R be a 5-taxon model species tree and \(\vec {u}\) be the distribution that it defines on the unrooted 5-taxon gene tree topologies. Given a set \(\mathcal {G}\) of unrooted true quintet gene trees drawn i.i.d. from \(\vec {u}\), QR-STAR is a statistically consistent estimator of R under the MSC.

Remark 1

Note that when \(\alpha _{\max } = 0\), meaning that the invariant penalty terms are removed from the cost function, the cost of the true tree would become exactly zero according to the proof of Lemma 6, and the cost of any other tree would be positive when k is large enough so that the conditions of Theorem 1 hold. Hence in this case, the condition in Eq. 7 (see full version of the paper) will reduce to \(A(k) < \frac{1}{2} h(R)\).
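To give a sense of scale for the reduced condition \(A(k) < \frac{1}{2} h(R)\), the sketch below searches for the smallest k that satisfies it. It covers only the \(\alpha _{\max } = 0\) case described in this remark, treats the condition as a worst-case sufficient bound rather than a prediction of practical accuracy, and uses hypothetical function names.

```python
import math


def gap_threshold(k, num_quintets=1):
    """A(k) = sqrt((2 / k) * ln(30 * |Q| * k))."""
    return math.sqrt(2.0 / k * math.log(30 * num_quintets * k))


def min_genes(h, num_quintets=1):
    """Smallest integer k with A(k) < h / 2, for h in (0, 1/18)."""
    if not 0.0 < h < 1.0 / 18.0:
        raise ValueError("h(R) must lie in (0, 1/18)")
    target = h / 2.0
    # A(k) is decreasing for k >= 1, so double until the condition holds,
    # then binary-search between the last failing and first passing value.
    hi = 1
    while gap_threshold(hi, num_quintets) >= target:
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if gap_threshold(mid, num_quintets) < target:
            hi = mid
        else:
            lo = mid
    return hi
```

For h(R) = 0.01 and a single quintet, the search returns a value above one million, which illustrates how conservative a bound based on A(k) is compared with the numbers of genes used in the experiments.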

Remark 2

Note that Lemma 3(a) holds for all pairs of 5-taxon rooted trees with the same rooted shape and with different permutations of the leaf-labeling, regardless of whether they have the same unrooted topology or not. Due to this property, it is possible to differentiate all pairs of 5-taxon rooted trees in a statistically consistent manner with the cost function of QR-STAR, without prior knowledge about the unrooted tree topology, and hence Theorem 1 does not assume that the unrooted topology is given as input.

The next lemma and theorem extend the proof of statistical consistency to trees with \(n > 5\) taxa. The linear encoding of a tree T by quintets is defined in Supplementary Materials Section A.

Lemma 7 (Identifiability of the root from the linear encoding)

Let R and \(R^\prime \) be rooted trees with unrooted topology T and distinct roots. Let \(Q_{LE}(T)\) be the set of quintets of leaves in a linear encoding of T. There is at least one quintet of taxa \(q \in Q_{LE}(T)\) so that \(R_{|q}\) and \(R^\prime _{|q}\) have different rooted topologies.

Lemma 7 states that no two distinct rooted trees with topology T induce the same set of rooted quintet trees on quintets of taxa in set \(Q_{LE}(T)\). Clearly, the same is true for any superset Q such that \(Q_{LE}(T) \subseteq Q\), including the set \(Q_5\) of all quintets of taxa on the leafset of T. There might also be other quintet sets that are not a superset of \(Q_{LE}(T)\), but have the property that no two rooted versions of T define the same set of rooted quintets on their elements. We generalize the proof of consistency to all sets of sampled quintets with this property.

Definition 4

Let T be an unrooted tree and Q be a set of quintets of taxa from \(\mathcal {L}(T)\). We say Q is “root-identifying” if every rooted tree R with topology T is identifiable from T and the set of rooted quintet trees in \(\{R|_{q}: q \in Q\}\), i.e., no two rooted trees with topology T induce the same set of rooted quintet trees on Q.
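Definition 4 can be checked mechanically once the rooted quintet tree induced by each rooting is known. The sketch below assumes a hypothetical table `induced` that maps each candidate root edge of T to the sequence of rooted quintet trees it induces, one entry per quintet in Q in a fixed order, with each tree represented by any hashable label.

```python
def is_root_identifying(induced):
    """Return True if no two rootings of T induce the same rooted
    quintet trees on every quintet in Q.

    induced[e] -- sequence of hashable labels, one per quintet in Q,
                  naming the rooted quintet tree induced by rooting T
                  on edge e (assumed layout).
    """
    signatures = [tuple(trees) for trees in induced.values()]
    # Q is root-identifying exactly when all signatures are distinct.
    return len(set(signatures)) == len(signatures)
```

For instance, `is_root_identifying({"e1": ["r1", "r4"], "e2": ["r1", "r5"]})` is true because the two rootings differ on the second quintet, whereas two edges with identical sequences would make the check fail.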

Theorem 2 (Statistical Consistency of QR-STAR)

Let R be an MSC model species tree with \(n \ge 5\) leaves and let T denote its unrooted topology. Given T and a set \(\mathcal {G}\) of unrooted true gene trees on the leafset \(\mathcal {L}(T)\), QR-STAR is a statistically consistent estimator of the rooted version of T under the MSC, if the set of sampled quintets Q is root-identifying.

Fig. 4. Rooting the model species tree with estimated gene trees on S200 datasets. Comparison between QR and QR-STAR in terms of rooting error (nCD) for rooting the true unrooted species tree topology using estimated gene trees (GTEE levels vary from 0.22 (for low ILS) to 0.49 (for high ILS)) on the 201-taxon datasets from [20] with 50 replicates in each model condition. The columns show tree height (500K for high ILS, 2M for moderate ILS, and 10M for low ILS), and the rows show speciation rate (1e−06 or 1e−07).

5 Experimental Study

We performed an experimental study on simulated datasets to explore the parameter space of QR-STAR on a training dataset, and then compared its accuracy to QR on a test dataset. We used the 101-taxon simulated datasets from [34] as our training data, which had model conditions characterized by four levels of gene tree estimation error (GTEE) ranging from 0.23 to 0.55 (measured in terms of normalized Robinson-Foulds (RF) [26] distance between true and estimated gene trees) for 1000 genes. The normalized RF distance between the model species tree and true gene trees (denoted average distance, or AD) in this dataset was 0.46, which indicates moderate ILS. For the test data (see Table D1 in the Supplementary Materials for empirical statistics), we used a set of 201-taxon simulated datasets from [20]; these are characterized by two different speciation rates and three tree heights (500K for high ILS, 2M for moderate ILS, and 10M for low ILS) and three numbers of genes for each combination of speciation rate and tree height. GTEE levels on the test data varied from 0.22 (for low ILS) to 0.49 (for high ILS). The AD levels ranged from 0.09 (for the 10M, 1e−07 condition) to 0.69 (for the 500K, 1e−06 condition). The number of replicates for each model condition for both the training and test datasets was 50.

We measured the error in the rooted species tree in terms of average normalized clade distance (nCD) [30], which is an extension of RF error for rooted trees. For our training experiment, we only rooted the true species tree topology to directly observe the rooting error. In our test experiments, we rooted both the model species tree and estimated species tree, as produced by ASTRAL, using both true and estimated gene trees (which were estimated using FastTree [23]). Additional information about the simulation study, datasets, and software commands is provided in Supplementary Materials Section D.

In our training experiments, we explored the impact of the shape coefficient C and the ratio \(\frac{\alpha _{max}}{\beta _{min}}\) (that describes the relative impact of invariants and inequalities) on the accuracy of QR-STAR. Results for the training experiments (provided in Supplementary Materials Sec. E1) show that there are wide ranges of settings for the algorithmic parameters that provide the best accuracy. We used these training results and theoretical considerations related to sample complexity of QR-STAR to set the algorithmic parameters to C = 1e−02 and \(\frac{\alpha _{max}}{\beta _{min}} = 0\).

Figure 4 shows a comparison between QR and QR-STAR in terms of rooting error for rooting the model species tree topology using estimated gene trees on the test datasets. Increasing the ILS level (by reducing tree height) decreases the rooting error, and increasing the number of genes also generally reduces rooting error (although much less under the lowest ILS level where tree height is 10M). To understand the impact of ILS in Fig. 4, note that the true species tree is being rooted and so ILS will not impact species tree estimation accuracy. However, the level of ILS impacts information about rooting location, which comes from the distribution of gene tree topologies. Thus, with lower ILS, it is likely that many gene trees that have low probability of appearing will not appear in the input. In this case, some estimates of quintet probabilities would become zero, and it may not be possible to differentiate some of the rooted quintets using the inequalities and invariants derived from the ADR theory. In the extreme case, when there is no discordance, there will be only one quintet gene tree with non-zero probability, and the identifiability theorem in [2] would not hold and it becomes impossible to find the root. This trend contrasts with the impact of ILS level on the problem of estimating the unrooted topology of the species tree, where increases in ILS generally lead to increases in error [19,20,21].

A comparison between QR and QR-STAR shows that QR-STAR generally matched or improved on QR; the only exception was for the high ILS conditions, where the two methods were very close but with perhaps a small advantage to QR. On these high ILS conditions, however, GTEE is also large, and QR-STAR is more accurate than QR when used with true gene trees, even under high ILS (Supplementary Materials Sec. E). Hence, the issue is likely to be high GTEE rather than high ILS, suggesting that QR-STAR is slightly more affected by GTEE than QR is.

6 Conclusion

In this work we presented QR-STAR, a polynomial-time, statistically consistent method for rooting species trees under the multispecies coalescent model. QR-STAR is an extension of QR, a method for rooting species trees introduced in [30]. QR-STAR differs from QR in that it has an additional step for determining the topological shape of each unrooted quintet selected in the QR algorithm, and it incorporates the knowledge of this shape into its cost function, alongside the invariants and inequalities previously used in QR. We also showed that the statistical consistency of QR-STAR holds for a larger family of optimization problems based on cost functions and sampling methods.

To the best of our knowledge, this is the first work to establish the statistical consistency of any method for rooting species trees under a model that incorporates gene tree heterogeneity. It remains to be investigated whether other rooting methods can also be proven statistically consistent under models of gene evolution inside species trees, such as the MSC or models of GDL. For example, STRIDE [4] and DISCO+QR [33] are methods that have been developed for rooting species trees from gene family trees, where genes evolve under gene duplication and loss (GDL); however, it is not known whether these methods are statistically consistent under any GDL model.

Our simulation study also showed that QR-STAR generally improved on QR across a wide range of model conditions. Given that QR itself improved on other methods for rooting species trees (as shown in [30]), this experimental study suggests that QR-STAR may be a useful tool for rooting species trees when gene tree discordance due to ILS is present.

This study suggests several directions for future research. For example, we proved statistical consistency for one class of cost functions, namely linear combinations of the invariant, inequality and shape penalty terms; however, cost functions of other forms could also be explored and proven statistically consistent. The proof of Theorem 1 suggests that the sample complexity of QR-STAR depends on the function h(R), which is based on both the length of the shortest branch and the length of the longest path in the model tree. This suggests that very short and very long branches can both confound rooting under ILS, as also suggested by previous studies [1, 2]. This is unlike what is known for species tree estimation methods such as ASTRAL, where the sample complexity is only affected by the shortest branch of the model tree [3, 28], and trees with long branches are easier to estimate.

Another theoretical direction is the construction of the rooted species tree directly from the unrooted gene trees. As explained in Remark 2, the proof of consistency of QR-STAR for 5-taxon trees does not depend upon knowledge of the unrooted tree topology; this suggests that it is possible to estimate the rooted topology of the species tree in a statistically consistent manner directly from unrooted gene tree topologies. Future work could focus on developing statistically consistent methods for this problem, which is significantly harder than the problem of rooting a given tree.

There are also directions for improving empirical results. An important consideration in designing a good cost function is its empirical performance, as many cost functions can lead to statistical consistency but may not provide accurate estimations of the rooted tree in practice (see Figures E1 and E2 in the Supplementary Materials). One potential direction is to incorporate estimated branch lengths, whether of the gene trees or the unrooted species tree, into the rooting procedure.