Keywords

1 Introduction

In computational biology, gene trees inferred from multiple alignments of molecular sequences are usually unrooted. However, many practical applications require phylogenetic trees to be correctly rooted. Therefore, several methods have been developed for the inference of the root location in an unrooted gene tree [2, 7, 11]. There are several categories of tree rooting methods [22]. Perhaps, the most popular is outgroup rooting [13]. Another widely used type is based on branch lengths of the gene tree, which includes the mid-point rooting [4], where the root is placed at the mid-point of the longest path in the tree, minimal ancestor deviation rooting [21], and minimum variance rooting [14]. The next category represents methods of rooted phylogenetic inference; for instance, the maximum likelihood with non-reversible or non-stationary evolutionary models of evolution and models with molecular clock assumption [12, 25]. The last category is based on reconciliation methods, where the unrooted gene tree is reconciled with its known rooted species tree using various cost functions, e.g., duplication-loss, duplication-loss-transfer, or Robinson-Foulds distance [3, 6, 7]. For a broader comparison of tree rooting methods, please refer to [17, 22].

A natural extension of a species tree is a phylogenetic network, which is a mathematical model of evolutionary processes with reticulate events such as hybridization of species, recombination, or horizontal gene transfers [1, 10]. According to our knowledge, the gene tree rooting problem has never been studied under the assumption of a known phylogenetic network. Therefore, the rooting problem becomes challenging when horizontal events shaped the history of genes. Here, we propose a novel approach where the gene tree is rooted using a known phylogenetic network.

Our Contribution: We address a novel variant of the rooting problem, where the gene tree root is inferred using a given phylogenetic network of the species present in the gene tree. To obtain the best rooting edge, we apply unrooted reconciliation multiple times, where the unrooted gene tree is jointly reconciled with a set of splits inferred from the given network. We derive an exact and computationally efficient mathematical formula for the rooting problem with the algorithm that runs in square time and space. Finally, we verify the algorithm’s quality is based on simulated gene trees and networks.

2 Basic Definitions

A network on a set of species X is a directed acyclic graph \(N=(V(N),E(N))\) with a single root such that: (1) the set of leaves of N is X, i.e., nodes of indegree 1 and outdegree 0, and (2) there is a directed path from the root to any other vertex. Nodes of a binary network N are divided into three groups: leaves, tree nodes (in-degree at most 1 and out-degree 2), and reticulation nodes (in-degree 2, out-degree 1). A network N is said to be fixed-root, if both children of the root node are tree nodes. N is said to be tree-child, if every tree node of N has at least one of its children being a tree node too. In this article, the class of fixed-root tree-child networks is denoted as \(\rho \)TC. A species tree on X is any network on X without reticulations nodes. See Fig. 1 for examples of networks.

Fig. 1.
figure 1

N is a fixed-root tree-child network. \(N'\) is a tree-child network that is not fixed-root. \(N''\) is a fixed-root network but not tree-child. Reticulation nodes are marked in red. (Color figure online)

A node v of indegree one and outdegree one in a rooted directed graph can be contracted as follows: (1) remove v, (2) remove both edges incident with v, and (3) insert a new directed edge connecting the unique parent of v with the only child of v. Similarly, if v is a root and has outdegree one we remove v, and the child of v becomes a new root. If a directed graph \(G'\) is obtained from a graph G by a sequence of contract operations, then G is called a subdivision of \(G'\). Given a network N on X, we say that a tree T on X is displayed by N, if N contains a subgraph \(T'\) that is a subdivision of T [20].

For a network N and a node v of N, the set of all species reachable in N from v is denoted \(L^{N}_{v}\). Below, we present some properties of \(\rho \)TC-networks.

Lemma 1

Let N be a \(\rho \)TC-network. Let v be a child of the root of N. Let X be the set of all species/leaves of N. Then, \(X-L^{N}_{v}\ne \emptyset \).

Observe that the assumption that N is \(\rho \)TC is essential for Lemma 1. For the network \(N'\) from Fig. 1 we have \(\{a,b\}-L^{N'}_{v}=\emptyset \). Similarly, for the network \(N''\) from Fig. 1, we have \(\{a,b\}-L^{N''}_{v}=\{a,b\}-L^{N''}_{u}=\emptyset \). The next result immediately follows from the proof of Lemma 1.

Given a species tree S on X, we call split of S the unordered pair \(L^S_{r'}|L^S_{r''}\), where \(r'\) and \(r''\) are the children of the root. Note that, any display tree of a \(\rho \)TC-network N on X is a species tree on X.

Lemma 2

Let N be a \(\rho \)TC-network. Let v be a child of the root of N. For any display tree T of N, if A|B is the split of T, then \(X-L^{N}_{v}\) is contained in A or in B.

3 Supports of a Network

In this Section, we introduce the notion of network support.

We say that two species a and b are glued together in a tree T on X, if a and b are descendants of the same child of the root of T. Consider the following two decision problems.

\(\Delta \):

Given a network N on a set of species X and two species \(a,b\in X\). Is it the case that all display trees of N glue together a and b?

\(\Gamma \):

Given a network N on a set of species X and two species \(a,b\in X\). Is there a display tree of N that glues together a and b?

Theorem 1

Problems \(\Delta \) and \(\Gamma \) are decidable in polynomial time.

Proof

Let \(a, b \in X\). For nodes x and y, let d(xy) be a predicate that is true if and only if there are node-disjoint paths from x to a and from y to b. Then, d(xy) can be computed in polynomial time using the dynamic programming:

$$ d(x,y)={\left\{ \begin{array}{ll} False &{} \text {if } x\,=\,y,\\ True &{} x=a \text { and } y=b,\\ d(x',y) &{} x \text { is a reticulation with a child } x',\\ d(x,y') &{} y \text { is a reticulation with a child } y',\\ d(x',y) \vee d(x'',y) &{} x \text { is a tree node with children } x' \text { and } x'',\\ d(x,y') \vee d(x,y'') &{} y \text { is a tree node with children } y' \text { and } y'',\\ False &{} \text {otherwise.} \end{array}\right. } $$

Let \(\delta (x,y)=d(x,y) \vee d(y,x)\).

Problem \(\Delta \) can be solved by computing the value of \(\lnot \delta (u,v)\), where u and v are the children of the root of the network. Problem \(\Gamma \) can be solved similarly by checking if there is a tree node t with children x and y located strictly below u or v such that \(\delta (x,y)\) is True. In such case a and b can be glued together.    \(\square \)

We conclude from the proof that both problems can be solved in \(O(n^2)\) time and space, where n is the number of leaves in N.

Let N be a network on X. A pair of subsets \((A_1,A_2)\) of X is called a support of N, if for every display T of N, if \(C_1|C_2\) is the split of T, then \(A_1\subseteq C_1\) and \(A_2\subseteq C_2\).

Theorem 2

Let N be any network.

  1. (a)

    N has the largest support \((A_1^*,A_2^*)\).

  2. (b)

    If N is a \(\rho \)TC-network, then

    1. (b1)

      The sets \((A_1^*,A_2^*)\) of the largest support are non-empty and they are (different) equivalence classes of the equivalence relation \(\Delta \).

    2. (b2)

      Finding equivalence classes that form the largest support of N can be done in polynomial time.

Note that finding the largest support of a \(\rho \)TC-network can be done by a single traversal of a network in linear time without computation of \(\delta \)-classes, by traversing through non-reticulation edges only. We omit easy details.

4 Unrooted Tree Reconciliation

Here, we summarize the main results on unrooted reconciliation, that we will use to model rooting of a gene tree based on a network. We start with the basic terms. A rooted gene tree over X is a rooted binary tree, where each non-leaf node has exactly two children, whose leaves are labeled by the elements of X (not necessarily uniquely). Similarly, an unrooted gene tree G over X is a binary unrooted tree, whose leaves are labeled by elements of X and such that any internal node of G has degree 3. An unrooted gene tree G can be rooted by choosing an edge e where the root will be placed. Such a rooted gene tree is denoted \(G_e\). By L(G) we denote the set of all species present in a gene tree G. Let \(X'|X''\) be the split of a fixed species tree S on X. Without loss of generality, we assume \(|L(G)| \ge 3\), \(L(G) \cap X' \ne \emptyset \) and \(L(G) \cap X'' \ne \emptyset \), for any gene tree G. Any internal node g of an unrooted gene tree G over X determines a star being an unordered triple A|B|C, where A, B and C are the sets of all leaf-labels of three subtrees obtained from G by removing g. Note that \(A \cup B \cup C = L(G) \subseteq X\). Let \(\bar{A}=B \cup C\), \(\bar{B}=C \cup A\) and \(\bar{C}=A \cup B\). Let \(\zeta (Z)\) be a predicate that is true iff \(Z \subseteq X'\) or \(Z \subseteq X''\). Then, it follows from [8], that we have five disjoint types of stars (see Fig. 2). Note that reordering of A, B and C may be required: (S1) if \(\lnot \zeta (A) \wedge \zeta (\bar{A})\), (S2) if \(\zeta (A) \wedge \zeta (\bar{A})\), (S3) if \(\lnot \zeta (A) \wedge \lnot \zeta (\bar{A}) \wedge \zeta (B) \wedge \zeta (C)\), (S4) if \(\lnot \zeta (A) \wedge \lnot \zeta (B) \wedge \lnot \zeta (C)\), and (S5) if \(\zeta (A) \wedge \lnot \zeta (B) \wedge \lnot \zeta (C)\). Next, \(\zeta \) induces arrows on the edges of G as follows. If \(e=(v,w)\), then we say that e is no-arrow if \(\zeta (L^{G_e}_v) \wedge \zeta (L^{G_e}_w)\), e is double-arrow if \(\lnot \zeta (L^{G_e}_v) \wedge \lnot \zeta (L^{G_e}_w)\), and e is one-arrow (from v to w) if \(\lnot \zeta (L^{G_e}_v) \wedge \zeta (L^{G_e}_w)\). Double-arrow and no-arrow edges are called symmetric.

Fig. 2.
figure 2

Top: Stars S1-S5. Bottom: Examples of unrooted gene trees with stars and the species tree \(S=(a,(b,c))\) having the split a|bc. Red edges are elements of the plateau, i.e., their rooting cost is minimal (see Theorem 3).

Given a rooted gene tree and a species tree, one can determine the lowest number of gene duplication and loss events required to reconcile both trees, called duplication-loss cost (\({{\,\mathrm{\text {DL}}\,}}\)) [7, 18]. Due to space constraints, we omit here the formal definition of the cost. The above measure is often used to determine the rooting of an unrooted gene tree by choosing the edge e that minimizes the \({{\,\mathrm{\text {DL}}\,}}\) cost between \(G_e\) and the species tree. Such edges can be identified using the following theorem.

Theorem 3

(Adopted from [7, 8]). Given a species tree S, an unrooted gene tree G, and an edge e from G. The \({{\,\mathrm{\text {DL}}\,}}\) cost of \(G_e\) and S is minimal among all rootings of G if and only if e is one-arrow in a star S5 or symmetric in any star.

The set of edges satisfying the conditions from Theorem 3 induces a connected subtree called a plateau of G determined by S and denoted P(GS). All plateau edges can be identified in linear time and space [7]. Examples are depicted in Fig. 2.

Lemma 3 (Rooting Lemma)

For any species trees S and \(S'\) on X with the same split, \(P(G,S)=P(G,S')\).

Thus, we will use \(P(G,A|B):=P(G,S)\), if A|B is a split of some species tree S.

5 Rooting a Gene Tree

Here, given a network N on a set X of species and an unrooted gene tree G over X, we propose an algorithm to find an optimal place for the root of G. One natural approach is to choose the best rooting by identifying the edge that belongs to the maximal number of plateaus P(GT), where T ranges over all display trees of N. While the complexity of the problem remains open, we know only a naïve way to solve it by enumerating all possible display trees of N, and, then by applying the unrooted reconciliation algorithm. Such an approach is exponential in the number of reticulations in N. To overcome this complexity, based on Rooting Lemma 3, we propose an algorithm that determines rooting candidates by joint analysis of all possible set-theoretic splits from a given network under some structural assumptions based on display tree properties. The algorithm will be done in three steps. See also an example in Fig. 3

Step 1: Collapsing on the largest support of \(\boldsymbol{N}\). Let \((A_1^*, A_2^*)\) be the largest support of N as described by Theorem 2. Let \(X^*= X-(A_1^*\cup A_2^*)\). Let ab be new names of species not occurring in X and let \(Y=\{a,b\}\cup X^*\) be a new set of species names. We transform G into a gene tree \(G^*\) over Y by replacing every species name of G that belongs to \(A_1^*\) by a, and similarly replacing every species name of G that belongs to \(A_2^*\) by b. Thus we arrive at a pointed gene tree \((G^*,a,b)\) over Y, with two distinguished species: a and b.

Step 2: Computing weights of edges in a pointed gene tree. For every \(Z \subseteq X^*\), we will consider splits \(Z \cup \{a\} | (X^*-Z) \cup \{b\}\) of Y induced by Z. Then, the weight of an edge e of \(G^*\) is the number of sets Z under which e belongs to the plateau determined by the split induced by Z. Note, that each edge has weight at most \(2^{|X^*|}\). Each edge with the maximum weight belongs to the intersection of all plateaus, as Z ranges over all subsets of \(X^*\), and therefore it also belongs to the intersection of all plateaus P(GT) with T ranging over all display trees of N. See Sect. 5.1 for efficient methods to compute weights.

Step 3. Final step. Report all edges of G with the largest weight. These are the potential places for the root to be placed.

Fig. 3.
figure 3

Left: A gene tree G and the pointed gene tree \((G^*,a,b)\) obtained using the network N (see Fig. 1) with the largest support \((\{c\},\{f,g\})\). Middle: Plateaus for every split induced by sets \(Z \subseteq \{d,e\}\). Right: Weights of edges. The best rooting edge has the weight of 3.

5.1 Computing Weights of Edges of Unrooted Gene Trees

In this Section we show how to compute the weight of an edge of an unrooted gene tree with two distinguished species.

Preliminary Assumptions: Here, we assume that an unrooted gene tree G over a set Y of species has two distinguished species ab and the rest of the species we will denote here by X. So \(Y=X\cup \{a,b\}\) and ab do not belong to X. Since, the top-split is required to determine optimal rootings, we need to assume that

$$\begin{aligned} a,b \in L(G). \end{aligned}$$
(1)

Definition 1 (The weight of an edge)

Under the preliminary assumptions, the weight of an edge e of G, denoted W(e), is the number of sets \(Z\subseteq X\) such that e belongs to \(P(G, Z_a|Z_b)\), where \(Z_a = Z \cup \{a\}\) and \(Z_b=(X \setminus Z)\cup \{b\}\).

We also need to distinguish two cases depending on the type of edge. If an edge is incident to a leaf, we call it external; otherwise, the edge is internal. Figure 4 establishes additional notation.

Fig. 4.
figure 4

An external (a) and an internal (b) edge e in a gene tree G. Here, c is a species (\(c \in Y\)). \(D_i \subseteq L(G)\) and \(C_i \subseteq L(G)\) are the sets of all species present in the corresponding subtrees of G. Note that, \(L(G)=C \cup D\).

The next Lemma shows formulas for the weight of edges depending on their type.

Lemma 4 (Internal edge weight)

Under the notation of Fig. 4, the number of sets \(Z\subseteq X\) such that e belongs to \(P(G, Z_a|Z_b)\) is given:

  • If e is no-arrow,

    $$ \varphi _1(C,D)= {\left\{ \begin{array}{ll} 2^{|X \setminus (C\cup D)|} &{} \text { if [(}D \not \ni a \in C \text { and }C \not \ni b \in D\text {) or (}D \not \ni b \in C \text { and } \\ {} &{} C \not \ni a \in D\text {)] and } C\cap D=\emptyset \text {, }\\ 0,&{} \text {otherwise}. \end{array}\right. } $$
  • If e is one-arrow: \(\varphi _2(C,D_1,D_2)= \varphi ^a_2(C,D_1,D_2)+\varphi ^b_2(C,D_1,D_2)\), whereFootnote 1

    $$ \begin{aligned} \varphi _2^a(C,D_1,D_2)&= I(b\not \in C)\cdot (2^{|X-C|}-f_a(C,D_1,D_2)),\\ \varphi _2^b(C,D_1,D_2)&= I(a\not \in C)\cdot (2^{|X-C|}-f_b(C,D_1,D_2)),\\ f_a(C,D_1,D_2)&= \sum _{i=1}^2 2^{|X \setminus (C\cup D_i)|} \cdot \big ( I(a\not \in D_i)\cdot I((C \setminus \{a\})\subseteq X-D_i) +I(b\not \in D_i) \big )\\&+ 2^{|X-(D_1\cup D_2\cup C)|} \cdot \Big ( I(a\not \in D_1\cup D_2)\cdot I(X\cap C\cap (D_1\cup D_2)=\emptyset ) + \\&+ \sum _{i=1}^2 I(a\not \in D_i)\cdot I(b\not \in D_{3-i})\cdot I(X\cap (C\cup D_{3-i})\cap D_i=\emptyset ) \Big ), \end{aligned} $$
    $$ \begin{aligned} f_b(C,D_1,D_2)&=\sum _{i=1}^2 2^{|X \setminus (C\cup D_i)|} \cdot \big ( I(a\not \in D_i) + I(b\not \in D_i)\cdot I(D_i\cap X\cap C=\emptyset ) \big )+\\&+ 2^{|X-(D_1\cup D_2\cup C)|} \cdot \Big ( I(b\not \in D_1\cup D_2)\cdot I(D_1\cap X\cap C=\emptyset )\cdot I(D_2\cap X\cap C=\emptyset ) + \\&+ \sum _{i=1}^2 I(a\not \in D_i)\cdot I(b\not \in D_{3-i})\cdot I(D_{3-i}\cap X\cap (C\cup D_i)=\emptyset ) \Big ). \end{aligned} $$
  • If e is double-arrow \(\varphi _3(C,D)= \varphi ^b_2(\emptyset ,C,D)\).

Below, we present the main formula for computing weights.

Lemma 5 (Computing weights of edges)

Under the notation from Lemma 4, the weight of an internal edge e is \(W(e)=\varphi _1(C_1\cup C_2, D_1\cup D_2)+\varphi _2(C_1\cup C_2, D_1,D_2)+ \varphi _2(D_1\cup D_2, C_1,C_2)+\varphi _3(C_1\cup C_2,D_1\cup D_2)\), while the weight of an external edge e is \(W(e)=\varphi _1(\{c\},D_1\cup D_2)+\varphi _2(\{c\},D_1,D_2)\).

5.2 Time and Space Complexity of the Rooting Algorithm

Let m be the number of leaves in G, and n be the number of leaves in N. As already shown, computing the largest support of N can be completed in O(m) time. Also, collapsing supports of the gene tree can be done in \(O(\max (m,n))\) time. Since a and b are fixed, all indicators of the form \(I(a \in D_i)\) or \(I(a \notin D_i)\) (and the same for b), can be computed for each edge after one traversal of the gene tree without computing sets \(D_i\)’s. The same holds for the sizes of power-sets, e.g., \(2^{|X \setminus C|}\), and so on. Therefore, computing these values requires O(m) steps in total. For the remaining components of \(\varphi \) formulas from Lemma 4, we need to compute: \(I(C\setminus \{a\} \subseteq X \setminus D_i)\), \(I(X \cap C \cap D_i = \emptyset )\) and \(I(X \cap D_1 \cap D_2)\) for \(i=1,2\). It is an open question, whether these values can be computed without direct computation of all needed subsets of Y; therefore, we assume the classic bit-vector implementation with O(m) time and space complexity of set operations. We have 4 subsets of Y for every internal edge and 2 subsets for every external edge. This gives the whole algorithm’s total time and space complexity of O(nm).

6 Evaluation

In our evaluation, we first simulate a collection of networks with unrooted gene trees with additional information on the rooting edge. Then, we use the simulated data to verify the performance of our rooting algorithm.

6.1 Simulating Networks and Unrooted Gene Trees

Simulating Species Networks We start by simulating a collection of random species trees with \(n = \{20, 40, 60\}\) leaves using TreeSim 2.4 [9] with parameters from [16]. Then, we randomly choose \(r=\{4, 8\}\) pairs of edges from each tree. For each pair, we then subdivide its two edges and add a new reticulation edge between the vertices introduced by the subdivision. Following that procedure, we obtain a new species network with r reticulations from each of the generated trees. Additionally, we constrain the network to be tree-child and fixed-root. Finally, we add an extra leaf to each network N, called an outgroup, and another extra node \(\rho \). We add an edge from \(\rho \) to the root of N and set its length to 1/2 of the height of N and another edge from \(\rho \) to the outgroup, setting its length to 3/2 of the height of N, so that \(\rho \) becomes the root of the new network.

Simulating Gene Trees. For each network, we randomly generate 16 of its displayed trees. Following [23], we then simulate one gene tree per displayed tree with SimPhy 1.0.2 [15] using 6 sets of biological parameters [16, 19]. The sets of biological parameters contain three values of the duplication-loss rate \(DL=\{10^{-10}, 2\cdot 10^{-10}, 5\cdot 10^{-10}\}\) and two values of the effective population size \(ILS=\{10^{7}, 5\cdot 10^{7}\}\). Next, we simulate a multiple sequence alignment for each gene tree using INDELible v1.03 [5] and parameters from [16]. We infer a gene tree from each multiple sequence alignment using neighbour-joining method with Kimura 2-parameter correction implemented in NINJA 1.2.2 [24]. We delete the samples for which the method infers trees with negative branch lengths. Finally, we reroot each of the inferred trees, placing the new root on the branch connecting the single leaf labelled by the outgroup with the rest of the tree. After rerooting, we delete the root node and the outgroup node, obtaining the final rooted gene tree. The edge where the root is attached is stored with the unrooted variant of the gene tree as the original root position. Original rooting edges were used to verify the correctness of our rooting algorithm.

Finally, for each reticulation number, leaf-set size and set of biological parameters, we simulate 100 networks and 1600 corresponding gene trees. Thus, we obtained \(57600 = 3 \cdot 2 \cdot 3 \cdot 2 \cdot 16 \cdot 100\) triplets of the form \((N,G,o_G)\), where N is a \(\rho \)TC-network, G is a gene tree simulated using N, and \(o_G \in E(G)\) is the original rooting edge of G.

Fig. 5.
figure 5

Histograms of minimal and average errors for the set of all 26325 false-positives triples for three DL rate parameters. Here, DL rates are \(L=10^{-10}\) (low), \(M=2\cdot 10^{-10}\) (medium), and \(H=5\cdot 10^{-10}\) (high). The bin size was set to 1.0. Bars of a fixed color represent false-positive triplets \((N,G,o_G)\) from two parameter sets determined by the corresponding DL rate, n, r and both ILS parameter values jointly.

6.2 Rooting Algorithm Results

We tested our algorithm on the set of simulated triplets. The edges with the largest weight we call here optimal. Among 57600 triplets, our algorithm indicated

  • 8624 gene trees in which the original rooting edge is the unique optimal edge,

  • 10286 gene trees in which the original rooting edge is optimal, but the algorithm indicated more than one optimal edge,

  • and 26325 gene trees where the original root is not optimal.

Since the algorithm requires gene trees with species present in both elements of the largest support, we rejected the remaining 12365 triplets with insufficient species representation in gene trees. We will address this problem in future approaches by contracting networks.

We used the classical path distance as a measure, where the distance between two edges is the number of nodes on the shortest path connecting these edges: the smaller distance, the better performance of the rooting algorithm. For example, the distance between two different adjacent edges is 1. Since the set of optimal edges may contain more than one element, we define a minimal error of \((N,G,o_G)\) as the minimal distance between \(o_G\) and an optimal edge. Similarly, we define an average error as the average distance.

The results for all false positive triplets are depicted in Fig. 5. The results for various ILS parameters were similar; therefore, we merged them in the diagrams. The main observation is that the minimum error is usually 1 in most cases, meaning that the original rooting edge is typically adjacent to some optimal edge. We observe a similar pattern for the average error, which is partially a consequence of the property that the number of optimal edges in false-positive cases is generally low, as indicated in Fig. 6. However, generally, the histograms of average error are flatter.

Fig. 6.
figure 6

The number of optimal edges to the size of E(G) in false-positive triplets \((N,G,o_G)\) for low and medium DL rates.

Software Package. The software package with data is freely available from https://bitbucket.org/pgor17/netroot.

7 Conclusions and Future Outlook

We addressed a novel variant of the gene tree rooting problem, where the gene tree root is inferred using a known phylogenetic network of the species present in the gene tree. To obtain the best rooting edge, we apply unrooted reconciliation multiple times, where the unrooted gene tree is jointly reconciled with a set of splits inferred from the given network. We derived an exact and computationally efficient mathematical formula for such a formulation of the rooting problem. Our experimental study based on simulated gene trees and networks showed that the rooting algorithm infers the rooting correctly or with a small error in most cases.

In the future, we plan to compare the novel method with known gene tree rooting approaches. However, conclusions from such a comparative study may be limited since no method assumes a known phylogenetic network. Also, we plan to address the problem of species representation in gene trees, where our method cannot be applied when the largest support sets are disjoint with the collection of species present in the gene tree.