1 Introduction

Gene families are collections of genes represented by DNA or protein sequences that share a common ancestral gene and perform similar functions across different organisms. In modern computational biology, the relationships between genes within a given family are often visualized using a gene family tree, which is inferred from multiple alignments of molecular sequences. Although evolution is a directed process, gene family trees are usually unrooted due to the models embedded into the phylogenetic inference packages. However, many practical applications require correctly rooted phylogenetic trees. Therefore, several methods have been developed for inferring the root location in an unrooted gene family tree (Boykin et al. 2010; Górecki et al. 2013; Kinene et al. 2016).

Wade et al. (2020) proposed four categories of gene tree rooting methods. Outgroup rooting, first introduced by Maddison et al. (1984), is perhaps the most popular method. Here, the root is placed on the branch connecting a known outgroup sequence with the rest of the tree. However, determining a suitable outgroup sequence can be challenging or non-unique, making this method infeasible in some cases. Furthermore, the placement of the outgroup in the tree may be influenced by macroevolutionary events such as gene duplication, lateral gene transfer, or hybridization.

The second category of rooting methods is based on the branch lengths of the gene tree. This category includes several methods, such as mid-point rooting (Farris 1972), which places the root at the mid-point of the longest path in the tree, minimal ancestor deviation rooting (Tria et al. 2017), and minimum variance rooting (Mai et al. 2017).

Another category consists of methods for rooted phylogenetic inference, such as maximum likelihood with non-reversible or non-stationary evolutionary models and models with molecular clock assumptions. Here, the root is inferred directly with the phylogenetic tree (Lepage et al. 2007; Williams et al. 2015).

The last category is based on reconciliation methods, which reconcile the unrooted gene tree with its known rooted species tree using various cost functions, such as duplication, duplication-loss, duplication-loss-transfer (DLT), or Robinson-Foulds distance (Chen et al. 2000; Górecki and Eulenstein 2012; Górecki et al. 2013). The optimal rooting edge is the one that minimizes the cost of reconciling the gene tree with its species tree under the chosen tree comparison function. Some of these cost models assume horizontal gene transfers (DLT) (Kundu and Bansal 2018). However, given a gene tree and its species tree, the transfers are postulated as gene tree edges in the process of calculating the DLT cost. Therefore, the inferred horizontal gene transfer scenarios may vary depending on the input gene tree. In addition, the DLT transfer scenarios may not always be biologically consistent, which can affect the accuracy and reliability of the inferred rooting. Nonetheless, these methods are still valuable in many cases and have been shown to provide accurate results in certain scenarios (Wade et al. 2020).

For a broader comparison and evaluation of various tree rooting methods, please refer to Górecki et al. (2013); Kundu and Bansal (2018); Mykowiecka and Górecki (2019); Wade et al. (2020).

Phylogenetic trees are commonly used to represent the evolutionary relationships between biological entities, but they only consider vertical evolutionary relationships. In cases where complex reticulate evolutionary events such as hybridization or recombination have occurred, a phylogenetic network is a more appropriate model. Phylogenetic networks can accommodate multiple evolutionary scenarios, making them useful for studying the evolutionary histories of organisms that have undergone such events (Bapteste et al. 2013; Huson et al. 2010).

Most research on gene tree rooting problems has focused on the assumption of a known gene tree topology or species tree in some variants, which only admits vertical relationships between biological entities. However, a phylogenetic network is often a more suitable model for capturing the evolutionary relationships in the presence of reticulate events. To address this issue, we propose several novel approaches to root a gene tree using a known phylogenetic network, a problem that, to our knowledge, has not been studied previously. These approaches can potentially improve our understanding of the evolutionary history of genes and their species.

Our proposed approach addresses a unique and novel aspect of the gene tree rooting problem: inferring the root of a gene tree using a given phylogenetic network of the species present in the gene tree. To determine the optimal rooting edge in an unrooted gene tree, we employ unrooted reconciliation multiple times, jointly reconciling the unrooted gene tree with a set of splits inferred from the given network. We derive an exact and computationally efficient mathematical formula for the rooting problem, assuming a structural condition called the rootability condition on gene trees and their networks. However, our simulations showed that nearly 22% of input gene tree-network pairs do not meet this condition. To address this, we propose a general rooting algorithm that recursively decomposes a given network into a collection of networks that satisfy the rootability condition.

We also present a variant of the general rooting algorithm, where a phylogenetic network is contracted to the set of species in the input gene tree. While both algorithms have an exponential worst-case time complexity, our simulations show that the latter algorithm significantly improves runtime performance and can operate on the generic class of binary phylogenetic networks while maintaining congruence with the results. To verify the quality of our algorithm, we simulate gene trees and networks and demonstrate that our approach can accurately infer rootings or provide a solution with a small error.

2 Basic definitions

A network on a set of species X is a directed acyclic graph \(N=(V(N),E(N))\) with a single root such that: (1) the set of leaves of N, i.e., nodes of indegree 1 and outdegree 0, is X, and (2) there is a directed path from the root to any other vertex. Nodes of a binary network N are divided into three groups: leaves, tree nodes (in-degree at most 1 and out-degree 2), and reticulation nodes (in-degree 2, out-degree 1). A network N is said to be fixed-root if every child of the root node is a tree node or a leaf. N is said to be tree-child if every tree node of N has at least one of its children being a tree node or a leaf. In this article, the class of fixed-root tree-child binary networks is denoted as \(\rho \)TC. A species tree on X is any network on X without reticulation nodes. See Fig. 1 for examples of networks.
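The definitions above can be illustrated with a minimal Python sketch. It assumes a network is given as a list of directed `(parent, child)` edges; the function names `classify_nodes`, `is_fixed_root` and `is_tree_child` are hypothetical and not part of the original method. The tree-child test follows the definition exactly as worded above, i.e., only tree nodes are inspected.

```python
def classify_nodes(edges):
    """Return (root, kinds, children) for a binary network given as (parent, child) edges."""
    children, indeg, nodes = {}, {}, set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        indeg[child] = indeg.get(child, 0) + 1
        nodes.update((parent, child))
    roots = [v for v in nodes if indeg.get(v, 0) == 0]
    if len(roots) != 1:
        raise ValueError("a network must have a single root")
    kinds = {}
    for v in nodes:
        i, o = indeg.get(v, 0), len(children.get(v, []))
        if i == 1 and o == 0:
            kinds[v] = "leaf"
        elif i <= 1 and o == 2:
            kinds[v] = "tree"          # the root also falls into this group
        elif i == 2 and o == 1:
            kinds[v] = "reticulation"
        else:
            raise ValueError(f"node {v!r} is not allowed in a binary network")
    return roots[0], kinds, children


def is_fixed_root(edges):
    """Every child of the root is a tree node or a leaf."""
    root, kinds, children = classify_nodes(edges)
    return all(kinds[c] in ("tree", "leaf") for c in children[root])


def is_tree_child(edges):
    """Every tree node has at least one child that is a tree node or a leaf."""
    _, kinds, children = classify_nodes(edges)
    return all(
        any(kinds[c] in ("tree", "leaf") for c in children[v])
        for v in kinds
        if kinds[v] == "tree"
    )


# A small illustrative network with one reticulation h (not taken from Fig. 1).
example = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "b"), ("h", "c")]
print(is_fixed_root(example), is_tree_child(example))
```

A network belongs to the class \(\rho \)TC exactly when both predicates hold.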

Fig. 1

N is a fixed-root tree-child network. \(N'\) is a tree-child network that is not fixed-root. \(N''\) is a fixed-root network but not tree-child. Reticulation nodes are marked in red

A node v of indegree one and outdegree one in a rooted directed graph can be contracted as follows: (1) remove v, (2) remove both edges incident with v, and (3) insert a new directed edge connecting the unique parent of v with the only child of v. Similarly, if v is a root and has outdegree one we remove v, and the child of v becomes a new root. If a directed graph \(G'\) is obtained from a graph G by a sequence of contract operations, then G is called a subdivision of \(G'\). Given a network N on X, we say that a tree T on X is displayed by N, if N contains a subgraph \(T'\) that is a subdivision of T (Steel 2016).
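The contraction operation can be sketched in Python as follows. This is an illustrative sketch only: the graph is assumed to be a set of `(parent, child)` edges without parallel edges (which holds for the tree-shaped subgraphs considered here), and the name `contract` is hypothetical.

```python
def contract(edges):
    """Contract all nodes of indegree 1 and outdegree 1; drop a root of outdegree 1."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        parents, children = {}, {}
        for p, c in edges:
            children.setdefault(p, []).append(c)
            parents.setdefault(c, []).append(p)
        for v in set(parents) | set(children):
            ins, outs = parents.get(v, []), children.get(v, [])
            if len(ins) == 1 and len(outs) == 1:
                # Steps (1)-(3): remove v with both incident edges, reconnect parent to child.
                edges -= {(ins[0], v), (v, outs[0])}
                edges.add((ins[0], outs[0]))
                changed = True
                break
            if len(ins) == 0 and len(outs) == 1:
                # v is a root with a single child: the child becomes the new root.
                edges.discard((v, outs[0]))
                changed = True
                break
    return edges


# Subgraph of a small network after deleting the reticulation edge (v, h).
sub = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"), ("v", "b"), ("h", "c")]
print(sorted(contract(sub)))
```

In this example the input subgraph is a subdivision of the tree returned by `contract`, which is therefore displayed by any network containing that subgraph.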

For a network N and a node v of N, the set of all species reachable in N from v is denoted \(L^{N}_{v}\).

Below, we present some properties of tree-child and \(\rho \)TC-networks.

Lemma 1

Let N be a \(\rho \)TC-network on X. Let v be a child of the root of N. Then, \(X-L^{N}_{v}\ne \emptyset \).

Proof

Let u be the other child of the root of N. Since N is fixed-root, it follows that both v and u are non-reticulation nodes. Since N is tree-child, at least one of the children of u is a tree node or a leaf. Inductively we construct a path from u to a leaf, say a, that traverses only tree nodes of N. Since this path does not traverse any reticulation node, it follows that \(a\in X-L^{N}_{v}\). \(\square \)

Observe that the assumption that N is \(\rho \)TC is essential for Lemma 1. For the network \(N'\) from Fig. 1 we have \(\{a,b\}-L^{N'}_{v}=\emptyset \). Similarly, for the network \(N''\) from Fig. 1, we have \(\{a,b\}-L^{N''}_{v}=\{a,b\}-L^{N''}_{u}=\emptyset \).

Given a binary species tree S on X, we call the unordered pair \(L^S_{r'}|L^S_{r''}\) the split of S, where \(r'\) and \(r''\) are the children of the root. Note that any display tree of a \(\rho \)TC-network N on X is a species tree on X. The next result immediately follows from the proof of Lemma 1.

Lemma 2

Let N be a \(\rho \)TC-network on X. Let v be a child of the root of N. For any display tree T of N, if A|B is the split of T, then \(X-L^{N}_{v}\) is contained in A or in B.

Proposition 1

Let N be a \(\rho \)TC-network on X. Let v be a child of the root of N. There exists a display tree T obtained from N, whose split at the root is \((L^{N}_{v},X-L^{N}_{v})\).

Proof

It follows from Lemma 1 that \(X-L^{N}_{v}\ne \emptyset \). Let u be the other child of the root of N. Note that the subgraph \(N'\) of N, obtained from N by removing exactly one edge whose bottom node is a reticulation r (we call such edges reticulation edges), for each reticulation \(r \in R(N)\), is a tree on X if N is a tree-child network. Thus, performing contract operation on \(N'\) yields a display tree on X. Now, we show how to eliminate reticulation edges to obtain a display tree T with the desired property. Let \(r_1,r_2,\dots ,r_k\) be the sequence of all reticulation nodes of N in the reverse topological order. Let \(N_0=N\), and let \(N_i\) be the subgraph of N obtained from \(N_{i-1}\) by removing the edge \(\langle p,r_i\rangle \) such that either (1) there is no path from v to p in \(N_{i-1}\), or (2) p is one arbitrary parent of \(r_i\) if both parents of \(r_i\) are reachable from v. Now, for each \(0 \le i\le k\), we have the following properties:

  (A) Contracting all nodes of indegree one and outdegree one in \(N_{i}\) yields a \(\rho \)TC-network on X with \(k-i\) reticulations.

  (B) \(L_v^{N_i} = L_v^{N}\).

For (A), the property holds for any removal of a single reticulation edge from a tree-child network, since the child and both parents of any reticulation cannot be reticulations. Thus, removing a single reticulation edge does not change the tree-child property of other reticulation nodes, and in consequence, \(N_i\) is a subdivision of a tree-child network. For (B), the proof follows by induction. For \(i=0\) the property holds trivially. For \(i>0\), we have \(L_v^{N_{i-1}} = L_v^{N}\) by the inductive hypothesis. Let \(\langle p,r_i\rangle \) be the edge removed from \(N_{i-1}\). If there is no path from v to p in \(N_{i-1}\) (case (1)), then any leaf reachable from v via \(r_i\) in \(N_{i-1}\) must be reachable via the second parent of \(r_i\). Thus, the leaf is still reachable in \(N_{i}\) from v. A similar property holds in the second case. This completes the proof of (B).

Finally, \(N_k\) is a subgraph of N being a subdivision of a species tree T on X by (A), thus T is a display tree of N. By (B), the set of leaves reachable from v in \(N_k\) is \(L_v^{N}\). Since \(N_k\) is a tree, the set of leaves reachable from the second child of the root is \(X-L_v^{N}\). \(\square \)

3 Support of a network

In this Section, we introduce the notion of network support.

We say that two species a and b are glued together in a tree T on X, if a and b are descendants of the same child of the root of T. Consider the following two decision problems.

\(\varDelta \):

Given a network N on a set of species X and two species \(a,b\in X\). Is it the case that all display trees of N glue together a and b?

\(\varGamma \):

Given a network N on a set of species X and two species \(a,b\in X\). Is there a display tree of N that glues together a and b?

Theorem 1

Problems \(\varDelta \) and \(\varGamma \) are decidable in polynomial time.

Proof

Let \(a, b \in X\). For nodes x and y, let \(d(x,y)\) be a predicate that is true if and only if there are node-disjoint paths from x to a and from y to b. Then, \(d(x,y)\) can be computed in polynomial time using dynamic programming:

$$\begin{aligned} d(x,y)={\left\{ \begin{array}{ll} False &{} \text {if } x = y,\\ True &{} x=a, y=b \text { and } a \ne b, \\ d(x',y) &{} x \text { is a reticulation with a child } x',\\ d(x,y') &{} y \text { is a reticulation with a child } y',\\ d(x',y) \vee d(x'',y) &{} x \text { is a tree node with children } x' \text { and } x'',\\ d(x,y') \vee d(x,y'') &{} y \text { is a tree node with children } y' \text { and } y'',\\ False &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Let \(\delta (x,y)=d(x,y) \vee d(y,x)\).

Problem \(\varDelta \) can be solved by computing the value of \(\lnot \delta (u,v)\), where u and v are the children of the root of the network. Problem \(\varGamma \) can be solved similarly by checking if there is a tree node t with children x and y located strictly below u or v such that \(\delta (x,y)\) is True. In such a case, a and b can be glued together. \(\square \)

We conclude from the proof that both problems can be solved in \(O(n^2)\) time and space, where n is the number of nodes in N.
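The recurrence for \(d(x,y)\) can be transcribed into a memoized Python function. The sketch below is only an illustration under stated assumptions: the network is a list of `(parent, child)` edges, the cases of the recurrence are evaluated in the order in which they are listed (first match wins), and the names `make_delta` and `all_display_trees_glue` are hypothetical. Only the test for problem \(\varDelta \) is shown.

```python
from functools import lru_cache


def make_delta(edges, a, b):
    """Build delta(x, y) = d(x, y) or d(y, x) for fixed target leaves a and b."""
    children, indeg = {}, {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
        indeg[c] = indeg.get(c, 0) + 1

    def kids(v):
        return children.get(v, [])

    def is_reticulation(v):
        return indeg.get(v, 0) == 2 and len(kids(v)) == 1

    def is_tree_node(v):
        return indeg.get(v, 0) <= 1 and len(kids(v)) == 2

    @lru_cache(maxsize=None)           # memoization: each pair (x, y) is solved once
    def d(x, y):
        if x == y:
            return False
        if x == a and y == b:          # here a != b because x != y
            return True
        if is_reticulation(x):
            return d(kids(x)[0], y)
        if is_reticulation(y):
            return d(x, kids(y)[0])
        if is_tree_node(x):
            return d(kids(x)[0], y) or d(kids(x)[1], y)
        if is_tree_node(y):
            return d(x, kids(y)[0]) or d(x, kids(y)[1])
        return False

    return lambda x, y: d(x, y) or d(y, x)


def all_display_trees_glue(edges, a, b):
    """Problem Delta: evaluate 'not delta(u, v)' for the two children u, v of the root."""
    has_parent = {c for _, c in edges}
    root = next(p for p, _ in edges if p not in has_parent)
    u, v = [c for p, c in edges if p == root]
    return not make_delta(edges, a, b)(u, v)


net = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
       ("v", "h"), ("v", "b"), ("h", "c")]
print(all_display_trees_glue(net, "a", "c"), all_display_trees_glue(net, "a", "b"))
```

The memoization table has at most one entry per ordered pair of nodes, in line with the quadratic bound stated above.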

Let N be a network on X. A pair of subsets \((A_1,A_2)\) of X is called a support of N, if for every display tree T of N, if \(C_1|C_2\) is the split of T, then \(A_1\subseteq C_1\) and \(A_2\subseteq C_2\). Clearly, \((\emptyset ,\emptyset )\) is a support of any network.
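To make the notion of support concrete, the following Python sketch computes the largest support directly from the definition, by brute force. It is not an efficient method: it is exponential in the number of reticulations. It assumes a \(\rho \)TC-network given as `(parent, child)` edges, so that deleting one incoming edge per reticulation always yields a subdivision of a display tree, and it orders the pair by the two children of the root. The function name is hypothetical.

```python
from itertools import product


def largest_support_bruteforce(edges):
    """Intersect the root splits of all display trees of a rhoTC-network."""
    children, parents = {}, {}
    for p, c in edges:
        children.setdefault(p, []).append(c)
        parents.setdefault(c, []).append(p)
    nodes = set(children) | set(parents)
    root = next(x for x in nodes if x not in parents)
    u, v = children[root]
    reticulations = [r for r in nodes if len(parents.get(r, [])) == 2]

    def leaves_below(start, removed):
        seen, stack, found = set(), [start], set()
        while stack:
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            if x not in children:      # no outgoing edges in N: x is a leaf
                found.add(x)
            stack.extend(c for c in children.get(x, []) if (x, c) not in removed)
        return found

    a1 = a2 = None
    # One choice = for every reticulation, the parent whose edge is deleted.
    for choice in product(*(parents[r] for r in reticulations)):
        removed = set(zip(choice, reticulations))
        below_u, below_v = leaves_below(u, removed), leaves_below(v, removed)
        a1 = below_u if a1 is None else a1 & below_u
        a2 = below_v if a2 is None else a2 & below_v
    return a1, a2


net = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
       ("v", "h"), ("v", "b"), ("h", "c")]
print(largest_support_bruteforce(net))
```

In the example, leaf c hangs below the reticulation and can end up on either side of the root split, so it belongs to neither component of the support.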

Theorem 2

Let N be any network.

  (a) N has the largest support \((A_1^*,A_2^*)\).

  (b) If N is a \(\rho \)TC-network, then

    (b1) The sets \((A_1^*,A_2^*)\) of the largest support are non-empty and they are (different) equivalence classes of the equivalence relation \(\varDelta \).

    (b2) Finding equivalence classes that form the largest support of N can be done in polynomial time.

Proof

(a) follows immediately from the observation that if \((A_1,A_2)\) and \((B_1,B_2)\) are supports of N, then their set-union \((A_1\cup B_1,A_2\cup B_2)\) is a support too. Moreover, every network has at least one support \((\emptyset ,\emptyset )\).

For the proof of (b), observe that if N is a \(\rho \)TC-network and u, v are the children of the root of N (both non-reticulation nodes), then by the tree-child property of N we can find paths that lead from u to a leaf \(a_1\) and from v to a leaf \(a_2\) that do not pass through any reticulation node of N. Clearly \(a_1\in A_1^*\) and \(a_2\in A_2^*\). Thus these sets are nonempty. Moreover, for \(i=1,2\), clearly if \(x\in A_i^*\) and if \((x,y)\in \varDelta \) for some leaf y, then \(y\in A_i^*\). Thus each \(A_i^*\) is a union of equivalence classes of \(\varDelta \). Conversely, if \(x,y\in A_i^*\) but \((x,y)\not \in \varDelta \), then for some display tree T of N, T separates x and y. Thus they cannot be contained in one block of the split determined by T. The obtained contradiction proves that \((x,y)\in \varDelta \), and therefore \(A_i^*\) is a single equivalence class of \(\varDelta \). This proves (b1). Part (b2) follows immediately from the previous part by taking the equivalence class \(A_i^*\) of \(\varDelta \) that contains \(a_i\) mentioned above. \(\square \)

Note that finding the largest support of a \(\rho \)TC-network can be done by a single traversal of the network in linear time without computing \(\delta \)-classes, by traversing through non-reticulation edges only. We omit the easy details.

4 Unrooted tree reconciliation

Here, we summarize the main results on unrooted reconciliation, that we will use to model rooting of a gene tree based on a network. We start with the basic terms. A rooted gene tree over X is a rooted binary tree, where each non-leaf node has exactly two children, whose leaves are labeled by the elements of X (not necessarily uniquely). Similarly, an unrooted gene tree G over X is a binary unrooted tree, whose leaves are labeled by elements of X and such that any internal node of G has degree 3. An unrooted gene tree G can be rooted by choosing an edge e where the root will be placed. Such a rooted gene tree is denoted \(G_e\). For a gene tree over X, by \(L(G) \subseteq X\) we denote the set of all species present in a gene tree G. Let \(X'|X''\) be the split of a fixed species tree S on X. Without loss of generality, we assume \(|L(G)| \ge 3\), \(L(G) \cap X' \ne \emptyset \) and \(L(G) \cap X'' \ne \emptyset \), for any gene tree G. Any internal node g of an unrooted gene tree G over X determines a star being an unordered triple A|B|C, where A, B and C are the sets of all leaf-labels of three subtrees obtained from G by removing g. Note that \(A \cup B \cup C = L(G) \subseteq X\). Let \(\bar{A}=B \cup C\), \(\bar{B}=C \cup A\) and \(\bar{C}=A \cup B\). Let \(\zeta (Z)\) be a predicate that is true if and only if \(Z \subseteq X'\) or \(Z \subseteq X''\). Then, it follows from Górecki and Tiuryn (2007), that we have five disjoint types of stars (see Fig. 2). Note that a suitable renaming of A, B and C may be required:

(S1): if \(\lnot \zeta (A) \wedge \zeta (\bar{A})\),

(S2): if \(\zeta (A) \wedge \zeta (\bar{A})\),

(S3): if \(\lnot \zeta (A) \wedge \lnot \zeta (\bar{A}) \wedge \zeta (B) \wedge \zeta (C)\),

(S4): if \(\lnot \zeta (A) \wedge \lnot \zeta (B) \wedge \lnot \zeta (C)\),

(S5): if \(\zeta (A) \wedge \lnot \zeta (B) \wedge \lnot \zeta (C)\).
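The five star types can be checked mechanically. The Python sketch below is illustrative only: it assumes the star and the split are given as Python sets, handles the "suitable renaming" by trying every ordering of the three sets, and uses the hypothetical names `zeta` and `star_type`.

```python
from itertools import permutations


def zeta(z, x1, x2):
    """True iff Z is contained in one side of the split X'|X''."""
    return z <= x1 or z <= x2


def star_type(a, b, c, x1, x2):
    """Classify the star A|B|C with respect to the split x1|x2."""
    for A, B, C in permutations((frozenset(a), frozenset(b), frozenset(c))):
        zA, zB, zC = zeta(A, x1, x2), zeta(B, x1, x2), zeta(C, x1, x2)
        zA_bar = zeta(B | C, x1, x2)   # complement of A within L(G)
        if not zA and zA_bar:
            return "S1"
        if zA and zA_bar:
            return "S2"
        if not zA and not zA_bar and zB and zC:
            return "S3"
        if not zA and not zB and not zC:
            return "S4"
        if zA and not zB and not zC:
            return "S5"
    raise ValueError("no star type matched; check that L(G) meets both sides of the split")


left, right = {"a", "b"}, {"c", "d"}
print(star_type({"a"}, {"b", "c"}, {"a", "d"}, left, right))
```

Because the five types are disjoint, returning the first match over all orderings is sufficient.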

Fig. 2

Stars S1–S5

Next, \(\zeta \) induces arrows on the edges of G as follows. If \(e=(v,w)\), then we say that

  • e is no-arrow if \(\zeta (L^{G_e}_v) \wedge \zeta (L^{G_e}_w)\),

  • e is double-arrow if \(\lnot \zeta (L^{G_e}_v) \wedge \lnot \zeta (L^{G_e}_w)\),

  • and e is one-arrow (from v to w) if \(\lnot \zeta (L^{G_e}_v) \wedge \zeta (L^{G_e}_w)\).

Double-arrow and no-arrow edges are called symmetric.
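The arrow types can be expressed in a few lines of Python. This is a minimal sketch assuming that the species sets on the two sides of the edge, \(L^{G_e}_v\) and \(L^{G_e}_w\), are already known; the function name `edge_arrow` and the returned strings are hypothetical.

```python
def edge_arrow(leaves_v, leaves_w, x1, x2):
    """Classify edge e=(v,w) from the species sets on its two sides and the split x1|x2."""
    def zeta(z):
        return set(z) <= x1 or set(z) <= x2

    zv, zw = zeta(leaves_v), zeta(leaves_w)
    if zv and zw:
        return "no-arrow"
    if not zv and not zw:
        return "double-arrow"
    # Exactly one side is contained in a single part of the split.
    return "one-arrow v->w" if not zv else "one-arrow w->v"


left, right = {"a", "b"}, {"c", "d"}
print(edge_arrow({"a", "c"}, {"b"}, left, right))
```

Symmetric edges are exactly those for which the function returns "no-arrow" or "double-arrow".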

Given a rooted gene tree and a species tree, one can determine the lowest number of gene duplication and loss events required to reconcile both trees, called duplication-loss cost (\({{\,\mathrm{\text {DL}}\,}}\)) (Page 1998; Górecki et al. 2013). The above measure is often used to determine the rooting of an unrooted gene tree by choosing the edge e that minimizes the \({{\,\mathrm{\text {DL}}\,}}\) cost between \(G_e\) and the species tree. Such edges can be identified using the following theorem.

Theorem 3

(Adapted from Górecki et al. (2013); Górecki and Tiuryn (2007)) Given a species tree S, an unrooted gene tree G, and an edge e from G. The \({{\,\mathrm{\text {DL}}\,}}\) cost of \(G_e\) and S is minimal among all rootings of G if and only if e is one-arrow in a star S5 or symmetric in any star.

The set of edges satisfying the conditions from Theorem 3 induces a connected subtree called a plateau of G determined by S and denoted \(P(G,S)\). All plateau edges can be identified in linear time and space (Górecki et al. 2013). Examples are depicted in Fig. 3.

Lemma 3

(Rooting Lemma) For any species trees S and \(S'\) on X with the same split, \(P(G,S)=P(G,S')\).

Proof

It follows immediately from the fact that every star in G is determined by the top-split of a species tree. \(\square \)

Thus, we will use \(P(G,A|B):=P(G,S)\), if A|B is a split of some species tree S.

Fig. 3

Unrooted reconciliation example. A species tree S and three gene trees with red edges denoting the plateau. The rooting cost of red edges is minimal (see Theorem 3), i.e., they are the best candidates to place the root given the species tree S

5 Rooting a gene tree using phylogenetic network

Here, given a network N on a set X of species and an unrooted gene tree G over X, we propose an algorithm to find an optimal place for the root of G. One natural approach is to choose the best rooting by identifying the edge that belongs to the maximal number of plateaus \(P(G,T)\), where T ranges over all display trees of N. While the complexity of the problem remains open, we know only a naïve way to solve it: enumerating all possible display trees of N and then applying the unrooted reconciliation algorithm to each. Such an approach is exponential in the number of reticulations in N.

To overcome this complexity, based on Rooting Lemma 3, we propose in Sect. 5.2 an algorithm that determines rooting candidates by joint analysis of all possible set-theoretic splits from a given network under a constraint based on largest supports of the input network (see the rootability condition below). In Sects. 5.5 and 5.6 we propose two general rooting algorithms without any assumptions.

5.1 When can a gene tree be directly rooted?

We start with a notion used in the next sections.

Definition 1

(Rootability Condition) Given a binary network N on X and an unrooted gene tree G over X, we say that G is N-rootable if the largest support \((A^*_1,A^*_2)\) of N satisfies \(L(G) \cap A^*_1 \ne \emptyset \) and \(L(G) \cap A^*_2 \ne \emptyset \).

A crucial property of the above condition is below.

Lemma 4

If G is N-rootable, then

  • N is fixed-root,

  • and, for any display tree T of N, \(X' \cap L(G) \ne \emptyset \) and \(X'' \cap L(G) \ne \emptyset \), where \(X'|X''\) is the split of T.

Proof

Note that N has at least two leaves. Let \((A^*_1,A^*_2)\) be the largest support of N. If N is not fixed-root, then a child, say r, of the root, is a reticulation. Then, no leaf is accessible from r through a path composed of non-reticulation nodes only. Thus, at least one of \(\{A^*_1,A^*_2\}\) is empty and the rootability condition is not satisfied. The second condition follows immediately from the definition of support. \(\square \)

5.2 Basic rooting algorithm

The algorithm proceeds in three steps. See also an example in Fig. 4.

  • Input An unrooted gene tree G and a \(\rho \)TC-network N such that G is N-rootable.

  • Step 1 Collapsing on the largest support of N. Let \((A_1^*, A_2^*)\) be the largest support of N as described by Theorem 2. Let \(X^*= X-(A_1^*\cup A_2^*)\). Let a, b be new names of species not occurring in X and let \(Y=\{a,b\}\cup X^*\) be a new set of species names. We transform G into a gene tree \(G^*\) over Y by replacing every species name of G that belongs to \(A_1^*\) by a, and similarly replacing every species name of G that belongs to \(A_2^*\) by b. Thus we arrive at a pointed gene tree \((G^*,a,b)\) over Y, with two distinguished species: a and b.

  • Step 2 Computing weights of edges in a pointed gene tree. For every \(Z \subseteq X^*\), we will consider splits \(Z \cup \{a\} | (X^*-Z) \cup \{b\}\) of Y induced by Z. Then, the weight of an edge e of \(G^*\) is the number of sets Z under which e belongs to the plateau determined by the split induced by Z. Note that each edge has weight at most \(2^{|X^*|}\). Each edge with the maximum weight belongs to the intersection of all plateaus, as Z ranges over all subsets of \(X^*\), and therefore it also belongs to the intersection of all plateaus \(P(G,T)\) with T ranging over all display trees of N. See Sect. 5.3 with Lemmas 5 and 6 for efficient methods to compute weights.

  • Step 3 Output Report all edges of G with the largest weight. These are the potential places for the root to be placed.
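Step 1 and the family of splits used in Step 2 can be sketched in Python as follows. The sketch is illustrative only: a gene tree is reduced to its list of leaf labels, the fresh names `"a*"` and `"b*"` and the function names are hypothetical, and the plateau computation itself is not included. The support \((\{c\},\{f,g\})\) is the one mentioned in the caption of Fig. 4, while the leaf labels are made up for the example.

```python
from itertools import combinations


def collapse_labels(leaf_labels, a1, a2, a="a*", b="b*"):
    """Step 1: rename species from A1* to a and species from A2* to b."""
    return [a if s in a1 else b if s in a2 else s for s in leaf_labels]


def induced_splits(x_star, a="a*", b="b*"):
    """Step 2: enumerate the splits Z+{a} | (X*-Z)+{b} for all subsets Z of X*."""
    xs = sorted(x_star)
    for k in range(len(xs) + 1):
        for z in combinations(xs, k):
            z = set(z)
            yield z | {a}, (set(xs) - z) | {b}


# Hypothetical leaf labels of a gene tree over X = {c, d, e, f, g}.
labels = ["c", "d", "f", "e", "g", "c"]
print(collapse_labels(labels, {"c"}, {"f", "g"}))
for left, right in induced_splits({"d", "e"}):
    print(sorted(left), "|", sorted(right))
```

The number of enumerated splits is \(2^{|X^*|}\), which is the upper bound on the weight of an edge noted in Step 2.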

Fig. 4

Left: A gene tree G and the pointed gene tree \((G^*,a,b)\) obtained using the network N (see Fig. 1) with the largest support \((\{c\},\{f,g\})\). Middle: Plateaus for every split induced by sets \(Z \subseteq \{d,e\}\). Right: Weights of edges. The best rooting edge has the weight of 3

5.3 Computing weights of edges of unrooted gene trees

In this Section, we show how to compute the weight of an edge of an unrooted gene tree with two distinguished species under the following assumption.

Preliminary Assumption: We assume that an unrooted gene tree G over a set Y of species has two distinguished species a, b, and we denote the set of the remaining species by X. So \(Y=X\cup \{a,b\}\) and a, b do not belong to X. Since the top-split is required to determine optimal rootings, we need to assume that

$$\begin{aligned} a,b \in L(G). \end{aligned}$$
(1)

For any set \(A \subseteq X\), by \(\overline{A}\) we denote \(X {\setminus } A\).

Definition 2

(The weight of an edge) Under the preliminary assumption, the weight of an edge e of G, denoted W(e), is the number of sets \(Z\subseteq X\) such that e belongs to \(P(G, Z_a|Z_b)\), where \(Z_a = Z \cup \{a\}\) and \(Z_b=\overline{Z} \cup \{b\}\).

We also need to distinguish two cases depending on the type of edge. If an edge is incident to a leaf, we call it external; otherwise, the edge is internal. Figure 5 establishes additional notation.

The next Lemma shows formulas for the weight of edges depending on their type.

Lemma 5

(Internal edge weight) Under the notation of Fig. 5, the number of sets \(Z\subseteq X\) such that e belongs to \(P(G, Z_a|Z_b)\) is given:

  • If e is no-arrow, then \(\varphi _1(C,D)= 2^{|\overline{C\cup D}|}\) if \([(D \not \ni a \in C \wedge C \not \ni b \in D) \vee (D \not \ni b \in C \wedge C \not \ni a \in D)] \wedge C\cap D=\emptyset \), and \(\varphi _1(C,D)=0\), otherwise.

  • If e is one-arrow: \(\varphi _2(C,D_1,D_2)= \varphi ^a_2(C,D_1,D_2)+\varphi ^b_2(C,D_1,D_2)\), where

    $$\begin{aligned}{} & {} \varphi _2^a (C,D_1,D_2)= I(b\not \in C) (2^{| \overline{C}|}-f_a(C,D_1,D_2)), \\{} & {} \varphi _2^b (C,D_1,D_2) = I(a\not \in C) (2^{|\overline{C}|}-f_b(C,D_1,D_2)),\\{} & {} f_a (C,D_1,D_2) = 2^{|\overline{D_{1,2,C}}|} \Big ( I(a\not \in D_{1,2} \wedge X\cap C\cap D_{1,2}=\emptyset )\\{} & {} \qquad \qquad \qquad \qquad + \sum _{i=1}^2 I(a\not \in D_i \wedge b\not \in D_{3-i} \wedge X\cap D_{3-i,C}\cap D_i=\emptyset ) \Big ) \\{} & {} \qquad \qquad \qquad \qquad +\sum _{i=1}^2 2^{| \overline{D_{i,C}}|} \big ( I(a\not \in D_i \wedge (C \setminus \{a\})\subseteq X \setminus D_i) + I(b\not \in D_i) \big ),\\{} & {} f_b (C,D_1,D_2)=2^{| \overline{D_{1,2,C}}|} \Big ( I(b\not \in D_{1,2} \wedge D_{1,2}\cap X\cap C=\emptyset )\\{} & {} \qquad \qquad \qquad \qquad + \sum _{i=1}^2 I(a\not \in D_i \wedge b\not \in D_{3-i} \wedge D_{3-i}\cap X\cap D_{i,C}=\emptyset ) \Big )\\{} & {} \qquad \qquad \qquad \qquad +\sum _{i=1}^2 2^{|\overline{D_{i,C}}|} \big ( I(a\not \in D_i) + I(b\not \in D_i \wedge D_i\cap X\cap C=\emptyset ) \big ),\\ \end{aligned}$$

    where \(D_{1,2}=D_1 \cup D_2\), \(D_{1,2,C}=D_1 \cup D_2 \cup C\), and \(D_{i,C} = D_i \cup C\).

  • If e is double-arrow \(\varphi _3(C,D)= \varphi ^b_2(\emptyset ,C,D)\).

Fig. 5

An external (a) and an internal (b) edge e in a gene tree G. Here, c is a species (\(c \in Y\)). \(D_i \subseteq L(G)\) and \(C_i \subseteq L(G)\) are the sets of all species present in the corresponding subtrees of G. Note that, \(L(G)=C \cup D\)

For better readability, the proof is divided into parts presented in Sects. 5.3.1, 5.3.2, 5.3.3, 5.3.4 and 5.3.5, covering the no-arrow, one-arrow and double-arrow cases in turn. We start with several definitions needed in the proofs.

A subset \(C\subseteq Y\) is called: a-set, if \(a\in C\) and \(b\not \in C\); b-set, if \(b\in C\) and \(a\not \in C\); ab-set, if \(a,b\in C\); and 0-set, if \(a,b\not \in C\). We say that a subset \(Z\subseteq X\) is called a-safe for \(D\subseteq Y\), if \(D\cap Z_a \ne \emptyset \). Similarly, Z will be called b-safe for D, if \(D\cap Z_b \ne \emptyset \). Finally, Z will be said to be ab-safe for D, if it is both a-safe and b-safe for D. Clearly, if D is an a-set, then every subset \(Z\subseteq X\) is a-safe for D. Similar observation holds when D is a b-set.

Before proceeding further, let us make the following two observations for any set Z and a non-empty set D:

$$\begin{aligned} Z \text { is not } a \text {-safe for } D\quad \Leftrightarrow \quad a\not \in D \wedge Z\subseteq \overline{D}, \end{aligned}$$
(2)

and

$$\begin{aligned} Z \text { is not } b\text {-safe for } D\quad \Leftrightarrow \quad b\not \in D \wedge D\cap X\subseteq Z. \end{aligned}$$
(3)

Equivalence (2) follows immediately from definition. Also (3) is immediate: we have \(D\cap Z_b=\emptyset \) iff \(b\not \in D\) and \(D\cap \overline{Z}=\emptyset \). The latter is equivalent to \(D\cap X\subseteq Z\).

Set Z is said to be a-covering a set \(C\subseteq Y\), if \(C\subseteq Z_a\). Similarly, Z will be called b-covering C, if \(C\subseteq Z_b\). Finally we will say that Z is a/b-covering C, if Z is a-covering C or b-covering C.
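The safety and covering predicates, together with equivalences (2) and (3), can be checked exhaustively on a small instance. The Python sketch below is an illustration under the assumption that the distinguished species are represented by the strings `"a"` and `"b"` and that X contains neither of them; all function names are hypothetical.

```python
from itertools import chain, combinations


def subsets(s):
    """All subsets of s, as Python sets."""
    s = sorted(s)
    return [set(c) for c in
            chain.from_iterable(combinations(s, k) for k in range(len(s) + 1))]


def z_a(z):
    return set(z) | {"a"}


def z_b(z, x):
    return (set(x) - set(z)) | {"b"}    # complement of Z in X, plus b


def a_safe(z, d):
    return bool(set(d) & z_a(z))


def b_safe(z, d, x):
    return bool(set(d) & z_b(z, x))


def a_covering(z, c):
    return set(c) <= z_a(z)


def b_covering(z, c, x):
    return set(c) <= z_b(z, x)


def check_equivalences(x):
    """Verify (2) and (3) for every D within X+{a,b} and every Z within X."""
    x = set(x)
    for d in subsets(x | {"a", "b"}):
        for z in subsets(x):
            rhs2 = "a" not in d and z <= (x - d)
            rhs3 = "b" not in d and (d & x) <= z
            if (not a_safe(z, d)) != rhs2 or (not b_safe(z, d, x)) != rhs3:
                return False
    return True


print(check_equivalences({"x", "y", "z"}))
```

Such an exhaustive check is feasible only for very small X, but it is a convenient sanity test when implementing the counting formulas.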

5.3.1 The weight of internal no-arrow edge

Below we present the proof of Lemma 5 for the case of no-arrow edge.

Proof

The edge e belongs to the plateau determined by Z when we have the following situation

$$\begin{aligned} Z \text { is } a/b\text {-covering both } C_1\cup C_2 \text { and } D_1\cup D_2. \end{aligned}$$
(4)

Let \(\varphi _1(C,D)\) count the number of sets Z that are a/b-covering both C and D. Since we assumed that \(a,b\in L(G)\), we have \(a,b\in C\cup D\). Below we discuss several possibilities for the sets C and D.

When C or D is an ab-set, then clearly we have \(\varphi _1(C,D)=0\). The same formula holds when C or D is a 0-set.

Now assume that neither C nor D is an ab-set. It follows from our assumption that either C is an a-set and D is a b-set, or conversely. Then it must be the case that \((C\setminus \{a\})\subseteq Z\) and \((D {\setminus } \{b\})\subseteq \overline{Z}\). Clearly when \(C\cap D\ne \emptyset \), then there is no such Z. On the other hand, when \(C\cap D=\emptyset \), then we obtain \(\varphi _1(C,D)= 2^{|\overline{ (C{\setminus }\{a\})\cup (D {\setminus } \{b\})}|}=2^{| \overline{C\cup D}|}\). So \(\varphi _1(C,D)\) is \(2^{|\overline{C\cup D}|}\) if [(C is a-set and D is b-set) or (C is b-set and D is a-set)] and \(C\cap D=\emptyset \), and \(\varphi _1(C,D)=0\), otherwise.

This completes the proof of the no-arrow case. \(\square \)
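The closed form for \(\varphi _1\) can be compared against direct enumeration of all sets Z satisfying (4). The Python sketch below is a sanity check on a small instance, not part of the original method; it assumes the distinguished species are the strings `"a"` and `"b"`, that X contains neither, and that \(a,b\in C\cup D\) as required by (1). The function names are hypothetical.

```python
from itertools import chain, combinations


def subsets(s):
    s = sorted(s)
    return [set(c) for c in
            chain.from_iterable(combinations(s, k) for k in range(len(s) + 1))]


def phi1(c, d, x):
    """Closed form from Lemma 5 for a no-arrow edge."""
    c, d, x = set(c), set(d), set(x)
    c_is_a_d_is_b = "a" in c and "a" not in d and "b" in d and "b" not in c
    c_is_b_d_is_a = "b" in c and "b" not in d and "a" in d and "a" not in c
    if (c_is_a_d_is_b or c_is_b_d_is_a) and not (c & d):
        return 2 ** len(x - (c | d))   # size of the complement of C u D in X
    return 0


def phi1_bruteforce(c, d, x):
    """Count sets Z within X that are a/b-covering both C and D."""
    c, d, x = set(c), set(d), set(x)
    count = 0
    for z in subsets(x):
        za, zb = z | {"a"}, (x - z) | {"b"}
        if (c <= za or c <= zb) and (d <= za or d <= zb):
            count += 1
    return count


X = {"x", "y"}
pairs = [(c, d) for c in subsets(X | {"a", "b"}) for d in subsets(X | {"a", "b"})
         if {"a", "b"} <= (c | d)]
print(all(phi1(c, d, X) == phi1_bruteforce(c, d, X) for c, d in pairs))
```

The filter on `pairs` encodes assumption (1); without it the closed form and the enumeration need not agree.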

5.3.2 The weight of an internal one-arrow edge

Below we present the proof of Lemma 5 for the case of one-arrow edge.

In this case the edge e belongs to the plateau determined by Z when we have the following situation

$$\begin{aligned} Z \text { is } a/b\text {-covering } C_1\cup C_2 \text { and } Z \text { is } ab\text {-safe for both } D_1 \text { and } D_2. \end{aligned}$$
(5)

The symmetric case when \(C_i\) and \(D_i\) are interchanged is omitted. Let \(\varphi _2(C,D_1,D_2)\) denote the number of sets \(Z\subseteq X\) that satisfy (5), where \(C,D_1, D_2\) are non-empty subsets of X. The collection of sets Z is naturally split into two disjoint classes.

$$\begin{aligned} V^a=\{Z\subseteq X\ |\ C\subseteq Z_a\}, \end{aligned}$$

and

$$\begin{aligned} V^b=\{Z\subseteq X\ |\ C\subseteq Z_b\}. \end{aligned}$$

Each of these classes will be considered separately, in Subsections 5.3.3 and 5.3.4, yielding formulas \(\varphi ^a_2(C,D_1,D_2)\) and \(\varphi ^b_2(C,D_1,D_2)\), so that finally we will have

$$\begin{aligned} \varphi _2(C,D_1,D_2)= \varphi ^a_2(C,D_1,D_2)+\varphi ^b_2(C,D_1,D_2). \end{aligned}$$

5.3.3 The weight of an internal one-arrow edge: a-covering sets Z

Proof

Observe that in that case we have, depending on the status of C, the following possibilities

$$\begin{aligned} V^a={\left\{ \begin{array}{ll} \{(C \setminus \{a\})\cup U\ |\ U\subseteq \overline{C}\}, &{} \hbox { if}\ b\not \in C\\ \emptyset , &{}\text { if } b\in C. \end{array}\right. } \end{aligned}$$

Assume now that \(b\not \in C\) and let us count the number of sets \(Z\in V^a\) that are ab-safe for \(D_1\) and \(D_2\). Let \(Z\in V^a\). Hence Z has the form \(Z=(C \setminus \{a\})\cup U\), for some \(U\subseteq \overline{C}\). It follows from (2) that Z is not a-safe for D iff \(a\not \in D\) and \((C {\setminus } \{a\}) \cup U\subseteq \overline{D}\). So this is equivalent to \(a\not \in D\), \(C \setminus \{a\} \subseteq \overline{D}\) and \(U\subseteq \overline{D}\).

Let \(i\in \{1,2\}\). Let \(S_{a}(C,D_{i})\) be the collection of all sets \(Z\in V^a\) that are not a-safe for \(D_i\). It follows from (2) and the above remarks that \(|S_{a}(C,D_{i})|=I(a\not \in D_i)\cdot I(C{\setminus } \{a\} \subseteq \overline{D_i}) \cdot 2^{| \overline{C\cup D_i}|}\).

Similarly, we consider \(S_{b}(C,D_{i})\), the collection of all sets \(Z\in V^a\) that are not b-safe for \(D_i\). Let \(Z\in V^a\). Hence \(C {\setminus } \{a\}\subseteq Z\). Since \(b\not \in C\), it follows that \(C {\setminus } \{a\}=C\cap X\). It follows from (3) that \(Z\in V^a\) is not b-safe for D iff \(b\not \in D\) and \((C\cap X)\cup (D\cap X)\subseteq Z\). The latter is equivalent to \((C\cup D)\cap X\subseteq Z\). Hence \(S_b(C,D)=\{Z\subseteq X\ |\ b\not \in D \wedge (C\cup D)\cap X\subseteq Z\}\). Therefore, for \(i=1,2\) we have \(|S_{b}(C,D_{i})|=I(b\not \in D_i)\cdot 2^{|\overline{C\cup D_i}|}\).

In order to compute \(\varphi _2^a(C,D_1,D_2)\) we first compute \(f_a(C,D_1,D_2)\), the cardinality of the set \(S_{a}(C,D_{1})\cup S_{a}(C,D_{2})\cup S_{b}(C,D_{1})\cup S_{b}(C,D_{2})\), and then subtract it from \(2^{|\overline{C}|}\). Since the sets \(S_a\) and \(S_b\) are, in general, not disjoint, we have to apply the inclusion–exclusion principle: \(|X_1\cup X_2\cup X_3\cup X_4| = \sum _{i=1}^4 |X_i|-\sum _{1\le i<j\le 4} |X_i\cap X_j| + \sum _{1\le i<j<p\le 4} |X_i\cap X_j\cap X_p|-|X_1\cap X_2\cap X_3\cap X_4|\).

Since \(b\not \in C\), it follows from (1) that \(b\in D_1\cup D_2\). Hence at least one of the sets \(S_{b}(C,D_{i})\) has to be empty, i.e.

$$\begin{aligned} S_{b}(C,D_{1})\cap S_{b}(C,D_{2}) =\emptyset . \end{aligned}$$
(6)

We also have for \(i=1,2\)

$$\begin{aligned} S_{a}(C,D_{i}) \cap S_{b}(C,D_{i}) =\emptyset . \end{aligned}$$
(7)

Indeed if \(Z\in S_{a}(C,D_{i}) \cap S_{b}(C,D_{i})\), then \(a,b\not \in D_i\) and \(D_i\cap X\subseteq Z \subseteq \overline{D_i}\). Since \(D_i\cap X\subseteq \overline{D_i}\) is equivalent to \(D_i\cap X=\emptyset \), it follows from \(a,b\not \in D_i\) that \(D_i\cap X\subseteq \overline{D_i}\) is equivalent to \(D_i=\emptyset \). The obtained contradiction proves (7).

It follows from (6) and (7) that all triple intersections are empty. Obviously, the quadruple intersection is empty as well.

Of the double intersections we are left with only three: \(S_{a}(C,D_{1}) \cap S_{a}(C,D_{2})\), and \(S_{a}(C,D_{i}) \cap S_{b}(C,D_{j})\), for \(i\ne j\).

It follows from (6), (7) and the above remarks that the inclusion–exclusion principle in our case reduces to \(f_a(C,D_1,D_2)=|S_{a}(C,D_{1})| +|S_{a}(C,D_{2})| +|S_{b}(C,D_{1})| +|S_{b}(C,D_{2})|- [|S_{a}(C,D_{1}) \cap S_{a}(C,D_{2})| + |S_{a}(C,D_{1}) \cap S_{b}(C,D_{2})| + |S_{a}(C,D_{2}) \cap S_{b}(C,D_{1})|]\). Formulas for the second part of \(f_a\) are given below.

$$\begin{aligned} |S_{a}(C,D_{1}) \cap S_{a}(C,D_{2})|=I(a\not \in D_{1,2}) I(X\cap C\cap D_{1,2}=\emptyset )\cdot 2^{|\overline{D_{1,2,C}}|}, \end{aligned}$$
(8)

and for \(i\ne j\) (\(i,j\in \{1,2\}\)) we have

$$\begin{aligned} |S_{a}(C,D_{i}) \cap S_{b}(C,D_{j})|= I(a\not \in D_i) I(b\not \in D_j) I(X\cap D_{j,C}\cap D_i=\emptyset ) 2^{|\overline{ D_{1,2,C}}|}. \end{aligned}$$
(9)

For the proof of (8) observe that if \(Z\in S_{a}(C,D_{1})\cap S_{a}(C,D_{2})\), then \(a\not \in D_{1,2}\) and \(Z=(C \setminus \{a\})\cup U\), for some \(U\subseteq \overline{D_1} \cap \overline{D_2}=\overline{ D_{1,2}}\). Since we work under the assumption \(b\not \in C\), it follows that \(C \setminus \{a\}=X\cap C\). Since for any sets \(A\subseteq X\) and B, we have \(A\subseteq \overline{B}\) iff \(A\cap B=\emptyset \), formula (8) follows.

For the proof of (9) let \(i\ne j\) and take any \(Z\in S_{a}(C,D_{i}) \cap S_{b}(C,D_{j})\). We have \(a\not \in D_i\), \(b\not \in D_j\), \(C {\setminus } \{a\}\subseteq Z\subseteq \overline{D_i}\), and \(X\cap D_j\subseteq Z\). Thus \((X\cap C)\cup (X\cap D_j)=X\cap D_{j,C} \subseteq Z\subseteq \overline{D_i}\). This is equivalent to the existence of a set \(U\subseteq \overline{D_{j,C}}\cap \overline{D_i}\) such that \(Z=(X\cap D_{j,C}) \cup U\) and \(X\cap D_{j,C}\subseteq \overline{D_i}\). This, in turn, is equivalent to \(U\subseteq \overline{D_{i,j,C}}\) and \(X\cap D_{j,C}\cap D_i=\emptyset \). This completes the proof of (9).

Finally we can write \(\varphi _2^a(C,D_1,D_2)= I(b\not \in C) (2^{|\overline{C}|}-f_a(C,D_1,D_2))\). \(\square \)
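The count for a-covering sets can be cross-checked numerically on small instances. The following Python sketch is an illustration only and is not taken from the netroot package. It assumes that the label set is \(X\cup \{a,b\}\) with \(a,b\notin X\), that \(\overline{S}\) denotes \(X\setminus S\), and that "not a-safe" and "not b-safe" are characterized exactly as stated in the proof above. All function names are hypothetical.

```python
from itertools import combinations

def subsets(s):
    """Yield every subset of s as a frozenset."""
    items = sorted(s)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def not_a_safe(Z, D, X):
    # Characterization used in the proof: a is not in D and Z avoids D.
    return "a" not in D and Z <= X - D

def not_b_safe(Z, D, X):
    # Characterization used in the proof: b is not in D and D ∩ X lies in Z.
    return "b" not in D and (D & X) <= Z

def phi2a_brute(X, C, D1, D2):
    """Enumerate Z = (C - {a}) ∪ U with U ⊆ X - C; keep Z that is ab-safe for D1, D2."""
    if "b" in C:
        return 0
    base = C - {"a"}
    return sum(
        1
        for U in subsets(X - C)
        if not any(bad(base | U, D, X)
                   for D in (D1, D2) for bad in (not_a_safe, not_b_safe))
    )

def phi2a_formula(X, C, D1, D2):
    """Closed form I(b not in C) * (2^|X - C| - f_a), with f_a built from (8) and (9)."""
    if "b" in C:
        return 0
    free = lambda S: 2 ** len(X - S)      # 2 to the size of the complement in X
    core = C - {"a"}                      # equals X ∩ C because b is not in C
    size_sa = lambda D: int("a" not in D) * int(core <= X - D) * free(C | D)
    size_sb = lambda D: int("b" not in D) * free(C | D)
    both = D1 | D2
    sa_sa = int("a" not in both) * int(not (X & C & both)) * free(C | both)  # (8)

    def sa_sb(Di, Dj):                                                       # (9)
        return (int("a" not in Di) * int("b" not in Dj)
                * int(not (X & (Dj | C) & Di)) * free(C | both))

    f_a = (size_sa(D1) + size_sa(D2) + size_sb(D1) + size_sb(D2)
           - (sa_sa + sa_sb(D1, D2) + sa_sb(D2, D1)))
    return free(C) - f_a

X = frozenset({"x", "y", "z"})
C, D1, D2 = frozenset({"x"}), frozenset({"a", "b"}), frozenset({"y", "z"})
print(phi2a_brute(X, C, D1, D2), phi2a_formula(X, C, D1, D2))  # 2 2
```

The closed form relies on the side conditions used in the proof: \(D_1\) and \(D_2\) are non-empty and \(b\in D_1\cup D_2\) whenever \(b\notin C\). Outside these conditions the two functions are not expected to agree.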

5.3.4 The weight of an internal one-arrow edge: b-covering sets Z

Proof

Observe that in that case, we have, depending on the status of C, the following possibilities

$$\begin{aligned} V^b={\left\{ \begin{array}{ll} 2^{\overline{C}}, &{} \hbox { if}\ a\not \in C\\ \emptyset , &{}\text { if } a\in C. \end{array}\right. } \end{aligned}$$

Derivation of the formula \(\varphi _2^b\) is very similar to the derivation in the previous part. Assume now \(a\not \in C\) and let \(Z\in V^b\). Hence \(Z\subseteq \overline{C}\) and it follows from (2) that Z is not a-safe for D iff \(a\not \in D\) and \(Z\subseteq \overline{D}\). In a similar way we observe that Z is not b-safe for D iff \(b\not \in D\) and \(D\cap X\subseteq Z\).

For \(i\in \{1,2\}\) let \(R_{a}(C,D_{i})\) be the collection of sets \(Z\in V^b\) that are not a-safe for \(D_i\). We have

$$\begin{aligned} R_a(C,D) =\{Z\subseteq \overline{C}\ |\ a\not \in D \wedge Z\subseteq \overline{D}\} = \{Z\subseteq X\ |\ a\not \in D\wedge Z\subseteq \overline{C\cup D}\}. \end{aligned}$$
(10)

Therefore we have \(|R_{a}(C,D_{i}) |=I(a\not \in D_i) 2^{| \overline{D_{i,C}}|}\).

Let \(R_{b}(C,D_{i})\) be the collection of all sets \(Z\in V^b\) that are not b-safe for \(D_i\). We have

$$\begin{aligned} \begin{aligned} R_{b}(C,D_{i})=&\{Z\in V^b\ |\ b\not \in D_i\wedge X\cap D_i\subseteq Z\}\\ =&\{Z\subseteq X\ |\ b\not \in D_i\wedge D_i\cap X\subseteq Z\subseteq \overline{C}\}\\=&\{Z\subseteq X\ |\ b\not \in D_i \wedge D_i\cap X\subseteq \overline{C} \wedge D_i\cap X\subseteq Z\subseteq \overline{C}\} \\ =&\{Z\subseteq X\ |\ b\not \in D_i \wedge D_i\cap X\cap C=\emptyset \wedge D_i\cap X\subseteq Z\subseteq \overline{C}\}. \end{aligned} \end{aligned}$$
(11)

Hence \(Z\in R_{b}(C,D_{i})\) iff \(b\not \in D_i\) and \(D_i\cap X\cap C=\emptyset \) and Z is of the form \((D_i\cap X)\cup U\), for some \(U\subseteq \overline{C}\) such that \(U\cap D_i\cap X=\emptyset \). Thus \(|R_{b}(C,D_{i}) |= I(b\not \in D_i) I(D_i\cap X\cap C=\emptyset ) 2^{|\overline{D_{i,C}}|}\).

Similarly to the proof from Sect. 5.3.3, we apply the inclusion–exclusion principle to derive \(\varphi _2^b(C,D_1,D_2)\). Again, since \(a\not \in C\), it follows from (1) that \(a\in D_{1,2}\). Hence we have \(R_{a}(C,D_{1}) \cap R_{a}(C,D_{2})=\emptyset \). The proof that \(R_{a}(C,D_{i})\cap R_{b}(C,D_{i}) = \emptyset \) is the same as that for (7).

Similarly to the previous proof, all triple intersections and the quadruple intersection are empty. We are left with only three double intersections: \(R_{b}(C,D_{1})\cap R_{b}(C,D_{2})\), \(R_{a}(C,D_{1})\cap R_{b}(C,D_{2})\), and \(R_{a}(C,D_{2})\cap R_{b}(C,D_{1})\). Thus, \(f_b(C,D_1,D_2) = |R_{a}(C,D_{1})| + |R_{a}(C,D_{2})| +|R_{b}(C,D_{1})| + |R_{b}(C,D_{2})| - [|R_{b}(C,D_{1}) \cap R_{b}(C,D_{2})| + |R_{a}(C,D_{1}) \cap R_{b}(C,D_{2})| + |R_{a}(C,D_{2}) \cap R_{b}(C,D_{1})|].\)

The proof that \(|R_{b}(C,D_{1}) \cap R_{b}(C,D_{2})| = I(b\not \in D_{1,2}) I(D_1\cap X\cap C=\emptyset ) I(D_2\cap X\cap C=\emptyset ) 2^{|\overline{D_{1,2,C}}|}\) follows immediately from (11).

For \(i,j\in \{1,2\}\) with \(i\ne j\) we have \(|R_{a}(C,D_{i}) \cap R_{b}(C,D_{j}) | = I(a \notin D_i) I(b \notin D_j) I(D_j\cap X\cap D_{i,C}=\emptyset )2^{|\overline{D_{1,2,C}}|}.\) Again, the proof of this equation follows immediately from (10) and (11). Finally, we set \(\varphi _2^b(C,D_1,D_2)= I(a\not \in C) (2^{|\overline{C}|}-f_b(C,D_1,D_2))\). \(\square \)

5.3.5 The weight of an internal double arrow edge

Below we present the proof of Lemma 5 for the case of a double-arrow edge.

Proof

An edge is double-arrow when Z is ab-safe for both \(C_1\cup C_2\) and \(D_1\cup D_2\). Let \(\varphi _3(C,D)\) count the number of sets Z that are ab-safe for both C and D. It follows from the above discussion that \(\varphi _3(C,D)= \varphi ^b_2(\emptyset ,C,D)\). \(\square \)
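The formulas for b-covering sets and for the double-arrow case can be checked in the same spirit. The Python sketch below is illustrative and not part of the netroot package. It assumes labels \(X\cup \{a,b\}\) with \(a,b\notin X\), complements taken within X, and the characterizations of "not a-safe" and "not b-safe" used in Sect. 5.3.4. The names `phi2b_formula`, `phi2b_brute` and `phi3` are hypothetical.

```python
from itertools import combinations

def subsets(s):
    """Yield every subset of s as a frozenset."""
    items = sorted(s)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def phi2b_formula(X, C, D1, D2):
    """Closed form I(a not in C) * (2^|X - C| - f_b) from Sect. 5.3.4."""
    if "a" in C:
        return 0
    free = lambda S: 2 ** len(X - S)      # 2 to the size of the complement in X
    size_ra = lambda D: int("a" not in D) * free(C | D)
    size_rb = lambda D: int("b" not in D) * int(not (D & X & C)) * free(C | D)
    both = D1 | D2
    rb_rb = (int("b" not in both) * int(not (D1 & X & C))
             * int(not (D2 & X & C)) * free(C | both))

    def ra_rb(Di, Dj):
        return (int("a" not in Di) * int("b" not in Dj)
                * int(not (Dj & X & (Di | C))) * free(C | both))

    f_b = (size_ra(D1) + size_ra(D2) + size_rb(D1) + size_rb(D2)
           - (rb_rb + ra_rb(D1, D2) + ra_rb(D2, D1)))
    return free(C) - f_b

def phi2b_brute(X, C, D1, D2):
    """Enumerate Z ⊆ X - C and keep those that are ab-safe for D1 and D2."""
    if "a" in C:
        return 0

    def safe(Z, D):
        a_safe = "a" in D or not Z <= X - D
        b_safe = "b" in D or not (D & X) <= Z
        return a_safe and b_safe

    return sum(1 for Z in subsets(X - C) if safe(Z, D1) and safe(Z, D2))

def phi3(X, C, D):
    """Double-arrow count, obtained as phi2b with an empty first argument."""
    return phi2b_formula(X, frozenset(), C, D)

X = frozenset({"x", "y", "z"})
C, D = frozenset({"a", "x"}), frozenset({"b", "y", "z"})
print(phi3(X, C, D), phi2b_brute(X, frozenset(), C, D))  # 3 3
```

In this toy instance the three admissible sets Z are the non-empty subsets of \(\{y,z\}\). As in the proof, the closed form presumes that \(D_1\) and \(D_2\) are non-empty and that \(a\in D_1\cup D_2\) when \(a\notin C\).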

5.3.6 Weight of any edge of a gene tree

Below, we present the main formula for computing weights.

Lemma 6

(Computing weights of edges) Under the notation from Lemma 5, the weight of an internal edge e is \(W(e)=\varphi _1(C_1\cup C_2, D_1\cup D_2)+\varphi _2(C_1\cup C_2, D_1,D_2)+ \varphi _2(D_1\cup D_2, C_1,C_2)+\varphi _3(C_1\cup C_2,D_1\cup D_2)\), while the weight of an external edge e is \(W(e)=\varphi _1(\{c\},D_1\cup D_2)+\varphi _2(\{c\},D_1,D_2)\).

Proof

We use the notation of Fig. 5. The weight of an internal edge e is \(W(e)=\varphi _1(C_1\cup C_2, D_1\cup D_2)+\varphi _2(C_1\cup C_2, D_1,D_2)+ \varphi _2(D_1\cup D_2, C_1,C_2)+\varphi _3(C_1\cup C_2,D_1\cup D_2)\), where for the one-arrow edge we need two applications of \(\varphi _2\), one for each direction of the arrow.

Using the notation of Fig. 5, the weight of an external edge e is \(W(e)=\varphi _1(\{c\},D_1\cup D_2)+\varphi _2(\{c\},D_1,D_2)\). This completes the proof. \(\square \)

5.4 Time and space complexity of the basic rooting algorithm

Let m be the number of leaves in G, and n be the number of leaves in N. As already shown, computing the largest support of N can be completed in O(n) time. Also, collapsing supports of the gene tree can be done in \(O(\max (m,n))\) time. Since a and b are fixed, all indicators of the form \(I(a \in D_i)\) or \(I(a \notin D_i)\) (and the same for b) can be computed for each edge after one traversal of the gene tree without computing the sets \(D_i\). The same holds for the sizes of power-sets, e.g., \(2^{|X \setminus C|}\), and so on. Therefore, computing these values requires O(m) steps in total. For the remaining components of the \(\varphi \) formulas from Lemma 5, we need to compute \(I(C{\setminus } \{a\} \subseteq X {\setminus } D_i)\) and \(I(X \cap C \cap D_i = \emptyset )\) for \(i=1,2\), as well as \(I(X \cap D_1 \cap D_2 = \emptyset )\). It is an open question whether these values can be computed without direct computation of all needed subsets of Y; therefore, we assume the classic bit-vector implementation with O(m) time and space complexity of set operations. We have 4 subsets of Y for every internal edge and 2 subsets for every external edge. This gives the whole algorithm’s total time and space complexity of O(nm).
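As a minimal sketch of the bit-vector idea, the set-based indicators can be evaluated with integer masks. The code below is an illustration under the assumption that labels are \(X\cup \{a,b\}\) with \(a,b\notin X\); the helper names and the dictionary keys are hypothetical and do not come from the netroot package.

```python
def to_mask(labels, index):
    """Encode a set of labels as an integer bit-vector."""
    mask = 0
    for label in labels:
        mask |= 1 << index[label]
    return mask

def set_indicators(C, D1, D2, X):
    """Evaluate the set-based indicators needed by the phi formulas."""
    index = {label: i for i, label in enumerate(sorted(set(X) | {"a", "b"}))}
    x, c, d1, d2 = (to_mask(s, index) for s in (X, C, D1, D2))
    a_bit = 1 << index["a"]
    result = {}
    for name, d in (("D1", d1), ("D2", d2)):
        # C - {a} is a subset of X - D iff no bit of C - {a} falls outside X - D.
        result[f"C-a subset of X-{name}"] = ((c & ~a_bit) & ~(x & ~d)) == 0
        result[f"X,C,{name} disjoint"] = (x & c & d) == 0
    result["X,D1,D2 disjoint"] = (x & d1 & d2) == 0
    return result

flags = set_indicators({"a", "x"}, {"b", "y"}, {"z"}, {"x", "y", "z"})
print(all(flags.values()))  # True
```

Python integers have arbitrary precision, so a single integer plays the role of a bit-vector of any length; in a lower-level language a fixed-width array of machine words would be used instead.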

5.5 General rooting algorithm (Algorithm A1)

The algorithm from Sect. 5.2 operates under the assumption that a gene tree is rootable using the input network. However, nearly \(22\%\) of the gene tree-network pairs from our simulation study do not satisfy this assumption (see the conference version of our article, Tiuryn et al. 2022). Here, we propose a general approach to overcome this limitation. There is only one assumption on the gene tree: it must have at least two distinct labels, which is the minimal requirement to infer a root based on the structural properties of networks and gene trees.

The idea of a general algorithm is to infer a collection of networks \(N'\) from N, such that G is \(N'\)-rootable and such that the collection reflects display trees from the original network. Having this, we can collect weights from multiple applications of the basic rooting algorithm.

We start with the definition of a unique reticulation \(r^*\) from a \(\rho \)TC-network N that will be used to decompose N in our algorithm. Let us assume that the species in X are linearly ordered (e.g., lexicographically, which is a typical case for biological data). For a given \(\rho \)TC-network N on X we say that a species/leaf s is tree-reachable from a reticulation r if there is a path from r to s in which each internal node is a tree node. Note that each leaf is tree-reachable from at most one reticulation in a tree-child network. If N is not a tree, then there is a unique reticulation \(r^*\) in N assigned to the minimal tree-reachable leaf among all tree-reachable leaves from N.
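The selection of \(r^*\) can be sketched as follows. This Python illustration is not the netroot implementation; it assumes that a network is given as a dictionary mapping each internal node to the list of its children, that reticulations are exactly the nodes with more than one parent, and that leaves are ordered lexicographically by their string labels. The function name and the example network are hypothetical.

```python
def unique_reticulation(children):
    """Return the reticulation owning the minimal tree-reachable leaf, or None."""
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    reticulations = {v for v, ps in parents.items() if len(ps) > 1}

    def tree_reachable_leaves(r):
        # Walk down from r, never passing through another reticulation.
        leaves, stack = set(), list(children.get(r, []))
        while stack:
            v = stack.pop()
            if v in reticulations:
                continue
            kids = children.get(v, [])
            if kids:
                stack.extend(kids)
            else:
                leaves.add(v)
        return leaves

    best = None  # pair (leaf, reticulation) with the minimal leaf seen so far
    for r in reticulations:
        for leaf in tree_reachable_leaves(r):
            if best is None or leaf < best[0]:
                best = (leaf, r)
    return None if best is None else best[1]

# Toy tree-child network with two reticulations r1 and r2.
network = {
    "root": ["p", "q"], "p": ["a", "r1"], "q": ["r1", "s"], "r1": ["d"],
    "s": ["t", "u"], "t": ["b", "r2"], "u": ["r2", "e"], "r2": ["c"],
}
print(unique_reticulation(network))  # r2
```

In the toy network, leaf d is tree-reachable from r1 and leaf c from r2, so the minimal tree-reachable leaf is c and the sketch returns r2.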

For a \(\rho \)TC-network N on X and an unrooted gene tree G such that \(L(G) \subseteq X\) and \(|L(G)|>1\), we define \(\varPhi \) as a function that decomposes a given network into a set of networks \(N'\) such that G is \(N'\)-rootable. The definition of \(\varPhi \) is detailed below.

  1. (R1)

    \(\varPhi _G(N):= \{ N \}\), if G is N-rootable.

  2. (R2)

    Otherwise, if u and v are the children of the root of N such that \(L^{N}_{u} \cap L^{N}_{v} = \emptyset \) (i.e., the children of the root of N are the roots of disjoint subnetworks) and \(L(G) \cap L^{N}_{u} = \emptyset \), then \(\varPhi _G(N):= \varPhi _G(N_v)\), where \(N_v\) is the subnetwork of N rooted at v.

  3. (R3)

    Otherwise, \(\varPhi _G(N):= \varPhi _G(N') \cup \varPhi _G(N'')\), where \(N'\) and \(N''\) are the networks obtained from N by removing the reticulation edge \(\langle p,r^*\rangle \) from N (plus needed contractions) for each parent p of \(r^*\), respectively.

Let \(W_{G,N}(e)\) be the weight of an edge from G given N from the basic rooting algorithm (see formulae in Lemma 6). Then, the weight of e in the general case is given by:

$$\begin{aligned} W^*(e) = \sum _{ N' \in \varPhi _G(N) } 2^{|R(N')|-|R(N)|} W_{G,N'}(e), \end{aligned}$$
(12)

where R(N) is the set of all reticulation nodes in a network N. Note that the contribution of \(N' \in \varPhi _G(N)\) in (12) is weighted by \(|D_{N'}|/|D_{N}|\), where \(D_N\) is the set of all display trees of N and \(|D_N|=2^{|R(N)|}\) if N is a tree-child network. See also the last property in the next lemma for further justification of the above definition.
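The weighting in (12) can be illustrated with a short Python sketch. It is not the netroot implementation: the function name is hypothetical, each decomposed network \(N'\) is represented only by its number of reticulations and a dictionary of edge weights \(W_{G,N'}(e)\), and the numeric weights in the example are made up for illustration.

```python
def general_weights(num_ret_N, parts):
    """Combine per-network edge weights as in (12).

    num_ret_N: |R(N)| for the input network N.
    parts: list of pairs (|R(N')|, {edge: W_{G,N'}(edge)}), one per N' in Phi_G(N).
    """
    total = {}
    for num_ret, weights in parts:
        factor = 2.0 ** (num_ret - num_ret_N)   # equals |D_{N'}| / |D_N| for tree-child networks
        for edge, w in weights.items():
            total[edge] = total.get(edge, 0.0) + factor * w
    return total

# N has two reticulations; (R3) split it into N' and N'' with one reticulation each.
parts = [(1, {"e1": 4, "e2": 2}), (1, {"e1": 2, "e2": 6})]
print(general_weights(2, parts))  # {'e1': 3.0, 'e2': 4.0}
```

Here both factors are \(2^{-1}\), so each edge weight is the average of its weights in the two decomposed networks.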

Lemma 7

(Correctness) We have the following properties.

  • If \(|\varPhi _G(N)|=1\), then \(\varPhi _G(N)=\{N\}\).

  • For each \(N' \in \varPhi _G(N)\), G is \(N'\)-rootable.

  • There is a one-to-one correspondence between \(D_N\) and \(\bigcup _{N' \in \varPhi _G(N)} D_{N'}\) such that for each \(T \in D_N\), there is a node v in T, a network \(N'\) in \(\varPhi _G(N)\) and a tree \(T'\) in \(D_{N'}\), such that the subtree of T rooted at v is \(T'\).

  • \(\sum _{N' \in \varPhi _G(N)} |D_{N'}|/|D_{N}| = 1 \).

Proof

The first two conditions are obvious. The termination of the procedure follows from the fact that each recursive call takes a network of a smaller size. Note also that the network reaching (R3) has at least one reticulation. Otherwise, assume that N is a tree and let v be the least common ancestor of L(G) in N. If v is the root of N then G is N-rootable. Otherwise, v belongs to a subtree rooted at one child of the root of N and the conditions from (R2) are satisfied.

For the third condition, assume that networks are replaced by their display tree sets in all calls of \(\varPhi \)’s. Then, in (R2), the set of display trees \(D_N\) is modified by removing the same subtree of the child of the root from each display tree, which gives \(D_{N_v}\), while in (R3), \(D_N\) is partitioned into two equally-sized disjoint sets of display trees: \(D_{N'}\) and \(D_{N''}\). Formalizing this observation leads to the desired one-to-one correspondence.

The last condition follows from the third one. \(\square \)

It follows from Lemma 7 that if G is N-rootable, the edge weights from the general rooting algorithm (see (12)) equal the weights inferred by the basic rooting algorithm.

5.6 Rooting algorithm with network reduction (Algorithm A2)

Here, we present an improved variant of the general rooting algorithm. In this variant, we reduce the initial network by removing leaves of N based on the set of species present in G and by removing some redundant reticulations. Below, we detail the procedure by introducing two rules for transforming the input network and presenting their properties.

For a binary network N and a leaf l from N, we define a leaf-removal rule as follows. Let \(P_l\) be the set of all nodes v from N such that l is reachable from v by a path composed of non-tree nodes only (Footnote 2). Then,

  • Remove from N all nodes of \(P_l\) and all edges incident to nodes from \(P_l\),

  • and contract all nodes whose indegree and outdegree equal one until the resulting network has no such nodes.

By \(N|_{F}\) we denote the network obtained from N by a sequence of leaf-removal applications with leaves from \(L(N) \setminus F\). If N is a \(\rho \)TC-network, then the network obtained by the leaf-removal rules from N may be a non-\(\rho \)TC-network. For example, for the networks from Fig. 1, \(N|_{\{c,e,f,g\}}\) is not tree-child, and \(N|_{\{d,e,f,g\}}\) is not fixed-root. Note also that multiedges are not allowed in E(N); therefore, \(N'|_{\{b\}}\) is a single-node tree labeled b. See also the reduced network from Fig. 9.

Now we define the second operation that reduces the size of a network but preserves the set of its display trees. A reticulation r is redundant if one parent of r is a child of the second parent of r. For instance, the reticulation u in \(N'\) from Fig. 1 is redundant. Such a redundant reticulation r can be removed from N by applying the reticulation-removal rule, which transforms N by removing exactly one edge connecting r with its parent and contracting the parent. By \(N^*\), we denote the network obtained from N by iterative applications of the reticulation-removal rule until there is no redundant reticulation. For example, \(N'\) after the removal of u becomes a tree (ab). Again, it should be clear that the resulting network does not depend on the order of rule applications.
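Detecting redundant reticulations is a purely local test on parent-child relations. The Python sketch below only detects them and does not perform the removal or the contraction. It assumes a binary network given as a dictionary from internal nodes to lists of children; the function name and the toy networks are hypothetical.

```python
def redundant_reticulations(children):
    """Return the sorted list of reticulations whose one parent is a child of the other."""
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    found = []
    for r, ps in parents.items():
        if len(ps) == 2:                      # r is a reticulation in a binary network
            p, q = ps
            if q in children.get(p, []) or p in children.get(q, []):
                found.append(r)
    return sorted(found)

# p2 is a child of p1 and both are parents of r, so r is redundant.
toy = {"p1": ["p2", "r"], "p2": ["a", "r"], "r": ["b"]}
print(redundant_reticulations(toy))  # ['r']
```

In the toy network, deleting either edge entering r and contracting the resulting degree-two node leaves a tree on the leaves a and b, which matches the remark that the rule preserves the set of display trees.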

To compute the support in reduced networks, we need to formalize additional properties of the largest support for the class of binary networks. The largest supports of a binary phylogenetic network have all the properties described in Theorem 2 except that \(A^*_1\) or \(A^*_2\) may be empty (e.g., the network \(N'\) from Fig. 1 has the largest supports \((\{a\},\emptyset )\)). Thus, instead of property (b1) in Theorem 2, the following two conditions hold:

  1. (b1.1)

    \(A_1^*\cap A_2^*=\emptyset \),

  2. (b1.2)

    and, for \(i=1,2\), if \(A_i^*\ne \emptyset \), then it is an equivalence class of \(\varDelta \).

The largest support for a binary network can also be computed using equivalence classes of \(\varDelta \) by locating, for each child c of the root, the class having a leaf reachable by tree nodes from c. If such a leaf does not exist then the corresponding support is empty.

We conclude that if G is N-rootable then the basic rooting algorithm can be applied to compute weights. Finally, if G is not N-rootable, we can apply the general rooting algorithm (A1), where the input network is decomposed into a collection of networks \(N'\) such that G is \(N'\)-rootable. We omit technical details to avoid repetitions.

Now, we are ready to formulate the improved algorithm, denoted here Algorithm A2:

  • Given a binary network N and a gene tree G.

  • Compute weights of all edges of G by applying Algorithm A1 with the network \((N|_{L(G)})^*\).

Note that we do not assume here that N is a \(\rho \)TC-network. Based on the largest support properties, the algorithm is able to infer gene tree rootings by decomposing any binary phylogenetic network.

Both general algorithms have an exponential time complexity \(O(2^rmn)\) in the worst case, where r is the number of reticulations in N, m is the size of G and n is the size of the network. However, the motivation to introduce A2 follows from the property that rooting a small gene tree using a network with many reticulations potentially requires many recursive calls to obtain a collection of desired networks. Replacing the initial network with a reduced one helps to achieve significantly better performance while preserving the results of the rooting inference. In addition, A2 is able to infer rootings using any binary network. See the experimental evaluation and discussion in Sect. 6.5.

6 Evaluation

We evaluate the rooting algorithms on simulated data. We start by simulating networks. Then, we use the structure of each network to simulate a corresponding collection of unrooted gene trees. Each simulated unrooted gene tree includes information indicating the true rooting edge, called here the original root. Finally, we use the simulated data to verify the performance of our rooting algorithms.

All simulations and tests were performed on an Ubuntu 20.04 computing server with 80 cores (Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz processors) and 500GB of RAM.

6.1 Simulating networks

To simulate a random \(\rho TC\)-network, we start by simulating a random species tree and then modify the tree to introduce reticulations.

We simulate random species trees using TreeSim 2.4 (Hartmann et al. 2010) with the number of leaves \(n \in \{20, 40, 60\}\) and other parameters from Molloy and Warnow (2020). The trees generated by TreeSim are species trees on \(\{t_1,..., t_n\}\) with each edge e labeled by a float value, called a branch length and denoted e.length. The generated trees are ultrametric, meaning that the length of each root-leaf path in a given tree is the same. The length of a given path is calculated as the sum of branch lengths of its edges.

We transform a species tree to a \(\rho TC\)-network with \(r \in \{4,8\}\) reticulations by iteratively adding a reticulation until the number of reticulations is equal to r. Adding a reticulation is performed as follows.

Let N be an ultrametric network with branch lengths and k reticulations (if N is a species tree then \(k=0\)). By v.time we denote the length of the paths from the root to v. Note that since N is ultrametric, v.time is well-defined for each \(v \in V(N)\). We generate a network \(N'\) with \(k+1\) reticulations as follows. First, we choose a random pair of non-reticulation edges \((u_1, w_1), (u_2,w_2) \in E(N)\). Then for \(i \in \{1,2\}\) we randomly choose a number \(\theta _i\) from the uniform distribution over \([0, (u_i, w_i).length]\). We denote the difference \((u_i, w_i).length - \theta _i\) as \(\psi _i\). Then we subdivide the edge \((u_i, w_i)\) by deleting it from N, adding a new vertex \(v_i\) and adding edges \((u_i, v_i)\), \((v_i, w_i)\). We set \((u_i, v_i).length\) to \(\theta _i\) and \((v_i, w_i).length\) to \(\psi _i\). If \(v_1.time \le v_2.time\) then we add an edge \((v_1, v_2)\) with branch length \(v_2.time - v_1.time\), so that \(v_2\) becomes a reticulation. Otherwise, if \(v_2.time < v_1.time\), we add an edge \((v_2, v_1)\) with branch length \(v_1.time - v_2.time\), so that \(v_1\) becomes a reticulation. Setting the lengths in this way ensures that the ultrametric property is preserved in the resulting network \(N'\). Finally, if \(N'\) is a \(\rho TC\)-network, we return \(N'\) as the new network with \(k+1\) reticulations. Otherwise, we delete \(N'\) and try to add a reticulation to N once again, repeating the preceding procedure with a different random pair of edges.
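A single reticulation-adding step can be sketched in Python as follows. This is an illustration only, not the simulation code used in the study: the edge pair and the values \(\theta _i\) are passed in explicitly instead of being drawn at random, the checks that the chosen edges are non-reticulation edges and that the result is a \(\rho TC\)-network are omitted, and all names are hypothetical. Edges are directed pairs (parent, child) stored in a dictionary of branch lengths.

```python
def add_reticulation(lengths, time, e1, e2, theta1, theta2, names=("v1", "v2")):
    """Subdivide edges e1 and e2 and connect the two new nodes by a time-consistent edge."""
    lengths, time = dict(lengths), dict(time)   # work on copies
    for (u, w), theta, v in zip((e1, e2), (theta1, theta2), names):
        full = lengths.pop((u, w))
        lengths[(u, v)] = theta                 # theta_i
        lengths[(v, w)] = full - theta          # psi_i
        time[v] = time[u] + theta
    v1, v2 = names
    # The new edge goes from the earlier node to the later one;
    # its head receives a second parent and becomes a reticulation.
    if time[v1] <= time[v2]:
        lengths[(v1, v2)] = time[v2] - time[v1]
    else:
        lengths[(v2, v1)] = time[v1] - time[v2]
    return lengths, time

def root_leaf_lengths(lengths, root):
    """Return the set of (rounded) lengths of all root-leaf paths."""
    children = {}
    for (u, w), length in lengths.items():
        children.setdefault(u, []).append((w, length))
    found, stack = set(), [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        if node not in children:
            found.add(round(dist, 9))
        for w, length in children.get(node, []):
            stack.append((w, dist + length))
    return found

tree = {("root", "u"): 1.0, ("u", "a"): 1.0, ("u", "b"): 1.0, ("root", "c"): 2.0}
times = {"root": 0.0, "u": 1.0, "a": 2.0, "b": 2.0, "c": 2.0}
net, net_times = add_reticulation(tree, times, ("u", "a"), ("root", "c"), 0.5, 0.5)
print(root_leaf_lengths(net, "root"))  # {2.0}
```

In the example, the node inserted on the edge (root, c) is older than the node inserted on (u, a), so the new edge points to the latter, and every root-leaf path still has length 2.0.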

Let N be a simulated ultrametric \(\rho TC\) network with r reticulations and let h be the height of N, i.e. the length of its root-leaf paths. We can obtain a corresponding network with outgroup \(N_o\) from N by firstly adding a new vertex \(\rho \) and a new leaf o, called outgroup and labelled by a distinguishable outgroup symbol. Then we add an edge \((\rho , o)\) with length 1.5h and an edge from \(\rho \) to the root of N of length 0.5h. Adding an outgroup is needed to be able to infer the true rooting edges in the gene trees. Therefore, each network N is used in the final evaluation on the simulated data, whereas each network \(N_o\) is used to simulate gene trees as described below.

6.2 Simulating gene trees

For each network with an outgroup \(N_o\), we randomly choose 16 of its displayed trees. If an edge e of a displayed tree was added through a contraction of a node v which had a single parent u and a single child w, then the length of e is set to the sum of lengths of \((u, v), (v,w) \in E(N)\). Otherwise, the length of e in a displayed tree is the same as the length of e in its corresponding network.

Following Wawerka et al. (2022), we simulate one gene tree per displayed tree using SimPhy 1.0.2 (Mallo et al. 2015) and 6 sets of biological parameters (Molloy and Warnow 2020; Rasmussen and Kellis 2012). The sets of biological parameters contain three values of the duplication-loss rate \(DL \in \{10^{-10}, 2\cdot 10^{-10}, 5\cdot 10^{-10}\}\) and two values of the effective population size \(ILS \in \{10^{7}, 5\cdot 10^{7}\}\). As DL and ILS values increase, a greater divergence between the displayed tree and the gene tree structure can be expected.

To mimic the properties of biological datasets, we then use a standard procedure of introducing noise to the simulated gene trees. We simulate a multiple sequence alignment for each gene tree using INDELible v1.03 (Fletcher and Yang 2009) and parameters from Molloy and Warnow (2020). We infer an unrooted gene tree from each multiple sequence alignment using the neighbour-joining method and the Kimura 2-parameter correction implemented in NINJA 1.2.2 (Wheeler 2009). Note that it is possible for the neighbour-joining method to infer gene trees with negative branch lengths. Such trees are an artifact of the method and are impossible to interpret biologically; therefore, we delete them from the simulation dataset. Finally, we root each of the inferred trees, placing the new root on the branch connecting the single leaf labeled by the outgroup symbol with the rest of the tree. Let \(G_o\) be the obtained rooted gene tree. We generate the final gene tree G without an outgroup by deleting the root of \(G_o\) and deleting the leaf labeled by the outgroup. The edge where the root of G is attached is stored with the unrooted variant of G as the original root position. Original rooting edges were used to verify the correctness of our rooting algorithms.

6.3 Simulation outcome

Finally, for each reticulation number, leaf-set size and set of biological parameters, we simulate 100 networks and 1600 gene trees. Thus, we obtained \(57600 = 3 \cdot 2 \cdot 3 \cdot 2 \cdot 16 \cdot 100\) triplets of the form \((N,G,o_G)\), where N is a \(\rho \)TC-network without an outgroup, G is a gene tree simulated using N, and \(o_G \in E(G)\) is the original rooting edge of G.

6.4 Rooting gene trees by Algorithm A1

We tested Algorithm A1 on the set of simulated triplets. The edges with the largest weight we call here optimal. Among 57,600 triplets, our algorithm indicated

  • 11,417 gene trees in which the original rooting edge is the unique optimal edge,

  • 13,180 gene trees in which the original rooting edge is optimal, but the algorithm indicated more than one optimal edge,

  • and 32,927 gene trees where the original root is not optimal (false positives).

Since the algorithm requires gene trees to have at least two leaves with distinct labels, we rejected the remaining 76 gene trees where all leaves were labeled by the same species.

Fig. 6

Histograms of minimal and average errors for the set of all 32,927 false-positive triplets for three DL rate parameters. Here, DL rates are \(L=10^{-10}\) (low), \(M=2\cdot 10^{-10}\) (medium), and \(H=5\cdot 10^{-10}\) (high). The bin size was set to 1.0. Bars of a fixed color represent false-positive triplets \((N,G,o_G)\) from two parameter sets determined by the corresponding DL rate, n, r, and both ILS parameter values jointly

We used the classical path distance as a measure, where the distance between two edges is the number of nodes on the shortest path connecting these edges: the smaller the distance, the better the performance of the rooting algorithm. For example, the distance between two different adjacent edges is 1. Since the set of optimal edges may contain more than one element, we define the minimal error of \((N,G,o_G)\) as the minimal distance between \(o_G\) and an optimal edge. Similarly, we define the average error as the average distance between \(o_G\) and the optimal edges.
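The two error measures can be sketched in Python as follows. The code is an illustration and not the evaluation script from the netroot package. It assumes that an unrooted gene tree is given as a list of undirected edges (pairs of node names) and computes the edge distance by breadth-first search over edges that share a node; function names are hypothetical.

```python
from collections import deque

def edge_distance(edges, e, f):
    """Number of nodes on the shortest path connecting edges e and f (0 if e == f)."""
    e, f = frozenset(e), frozenset(f)
    all_edges = [frozenset(x) for x in edges]
    seen, queue = {e}, deque([(e, 0)])
    while queue:
        current, dist = queue.popleft()
        if current == f:
            return dist
        for nxt in all_edges:
            if nxt not in seen and current & nxt:   # edges sharing a node are adjacent
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("edges are not connected")

def rooting_errors(edges, original, optimal):
    """Return (minimal error, average error) of the optimal edges w.r.t. the original root."""
    dists = [edge_distance(edges, original, o) for o in optimal]
    return min(dists), sum(dists) / len(dists)

# Unrooted quartet tree: leaves a, b attached to u; leaves c, d attached to v.
quartet = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
print(rooting_errors(quartet, ("a", "u"), [("b", "u"), ("c", "v")]))  # (1, 1.5)
```

Each step of the search moves to an edge that shares one node with the current edge, so in a tree the number of steps equals the number of nodes on the path between the two edges.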

Fig. 7

The number of optimal edges to the size of E(G) in false-positive triplets \((N,G,o_G)\) for low, medium and high DL rates

Figure 6 illustrates the results for all false positive triplets. We combined the results for various ILS parameters in the diagrams as they were similar. The main observation is that, in most cases, the minimum error is 1, indicating that the original rooting edge is typically adjacent to some optimal edge. This pattern is also reflected in the histograms of average error, which are generally flatter than the histograms for minimal error. This can be partially attributed to the fact that the number of optimal edges in false-positive cases is generally low, as shown in Fig. 7.

6.5 Rooting Algorithm A1 versus A2

In terms of rooting inference, i.e., the sets of optimal rooting edges, A1 and A2 yielded the same results on 55,906 simulation triplets, which is more than 97% of the whole dataset, despite 11,151 non-tree-child reduced networks in A2. Given the high congruence between the results, we skip the analysis of the quality of the rooting inference from A2.

Fig. 8

Frequency histograms of sizes of \(\varPhi _G(N)\) for Algorithm A1 (on the left) and A2 (on the right) and \(r=4\) (top diagrams) and \(r=8\) (bottom diagrams). The diagrams contain counts for triplets \((N,G,o_G)\) with \(|\varPhi _G(N)| \ge 2\)

The main difference between these algorithms is in the number of recursive calls, that is, the sizes of \(\varPhi \)’s. Since reduced networks in A2 are often significantly smaller than input networks in A1, the sizes of \(\varPhi \) sets are also expected to be smaller in A2. The pattern is clearly visible in Fig. 8, where the counts are high for large sizes of \(\varPhi \)’s in A1 (histograms on the left), e.g., close to \(16=2^4\) for networks with \(r=4\) reticulations, and close to \(256=2^8\) for networks with \(r=8\) reticulations, while in A2 (histograms on the right) we see the opposite situation, where the maximal counts are at low sizes of \(\varPhi \) sets.

Our implementation of A1 required approximately 75 min, while A2 needed approximately 30 min, to evaluate the set of 57,600 simulated triplets.

In summary, both algorithms have almost the same results on our simulation dataset. Based on the runtime of experimental evaluation, we conclude that A2 is more suitable than A1 for the networks with a large number of reticulations. A disadvantage of A2 is the potential appearance of non-tree-child networks in the evaluation, even if the input network belongs to the class of tree-child networks.

6.6 Data and software package

The software package with the implementation of the rooting algorithms (python + bash script), all datasets, and results is freely available from https://bitbucket.org/pgor17/netroot. The package partially depends on other software packages: urec (Górecki and Tiuryn 2006) for unrooted reconciliation, and embretnet (Wawerka et al. 2022) for the phylogenetic networks, where we implemented the computation of \(\varPhi \).

An example of gene tree rooting inference with Algorithm A2 (option -a2) is shown below.

figure a

Here, the output is an NHX representation (-f) of the input gene tree that contains tags with the best supports only (-EB).

figure b

See also the corresponding visualizations using graphviz, dot2tex and pdflatex in Fig. 9.

Fig. 9

Exemplary visualizations of networks and gene trees from our netroot package. Left: the input network N. Middle: the reduced network \(N|_{\{a,b,c\}}\). Right: the resulting gene tree with the location of the original root position (OR) and support values on the edges. Here, the best rooting edge has the support of 0.5 and it is the original rooting edge (OR). The pictures were generated by using additional parameters to netroot.sh: -R (to show the original root if it is provided in the input gene tree), -Gfig -p.5 for generating pdf files using graphviz, dot2tex and pdflatex

7 Conclusions and future outlook

We have introduced a novel method for inferring the root of an unrooted gene tree by leveraging a known phylogenetic network containing the species represented in the gene tree. Our approach utilizes the unrooted reconciliation model iteratively to reconcile the unrooted gene tree with a set of splits inferred from the network, determining the best rooting edge. We have developed an exact and computationally efficient mathematical formula for this task and introduced two general rooting algorithms to address scenarios where the input network does not initially meet the required criteria. Additionally, we have made our algorithms and the outcome visualizations publicly accessible as a software package. This ensures broader accessibility and ease of use for researchers and practitioners. In experimental simulations integrating gene trees and phylogenetic networks, we have demonstrated the robust performance of our algorithms. These simulations revealed that our methods consistently achieve accurate gene tree root inferences across a wide range of scenarios, with minimal margin for error.

Phylogenetic inference software commonly includes branch lengths in gene trees and networks, and future endeavors will concentrate on extending the rooting method to incorporate these branch lengths. This enhancement will enable a more precise and unique determination of the root position within gene trees.

Additionally, we aim to compare our approach with existing gene tree rooting methods, which are primarily based on gene trees alone (e.g., outgroup, median node rooting) or incorporate some species tree information. However, the conclusions drawn from such a comparative study may be constrained by the fundamental distinction between trees and species networks.

For example, phylogenetic trees provide a representation of the evolutionary history of a group of species or genes, illustrating their relationships through common ancestors and branching events over time. Conversely, phylogenetic networks offer a broader framework, depicting complex evolutionary scenarios that include events such as hybridization, horizontal gene transfer, or incomplete lineage sorting, which cannot be fully captured within a simple tree structure. Thus, phylogenetic networks offer a more general platform, allowing for a richer depiction of evolutionary relationships than traditional tree-based representations. Consequently, the performance of traditional rooting methods based solely on trees may not be an adequate benchmark for our rooting methods, which take into account the complexities inherent in species networks.