Introduction

A fundamental task in evolutionary biology is to combine a collection of rooted trees on partial, possibly overlapping, sets of data, into a single rooted tree on the full set of data. This is the goal of supertree methods, mainly designed and used for the purpose of reconstructing a species supertree from a set of species trees (see overviews of early methods in [46], and more recent methods in [2, 10, 21, 24, 25, 28, 29]).

Ideally, the combined supertree should "display" each of the input trees, in the sense that by restricting the supertree to the leaf set of an input tree, we obtain the same input tree. However, this is not always possible, as the input trees may contain conflicting phylogenetic information. Note that considering a set of input trees that are not all compatible leads to the questions of correcting input gene trees or finding a subset of compatible input trees or subtrees [26]. Here, we leave these questions open and study the more direct formulation of the supertree problem, which is to consider a set of compatible input trees and find a supertree displaying them all. The BUILD algorithm by Aho et al. [1] can be used to test, in polynomial time, whether a collection of rooted trees is compatible and, if so, to construct a compatible supertree, not necessarily fully resolved. This algorithm has been generalized in [9, 20] to output all compatible supertrees, and adapted in [27] to output all minimally resolved compatible supertrees.

Although supertree methods are classically applied to the construction of species trees, they can be used as well for the purpose of constructing gene trees. Several gene tree databases are available (see for example Ensembl Compara [30], Hogenom [22], Phog [11], MetaPHOrs [23], PhylomeDB [14], Panther[19]). For a gene family of interest, many different gene trees can therefore be available, and finding one single supertree displaying them all leads to a supertree question. On the other hand, given a gene of interest, a homology-based search tool is usually used to output all homologs in a set of genomes. The resulting gene family may be very large, involving distant gene sequences that may be hard to align, leading to weakly supported trees - or even worse, highly supported gene trees that are in fact incorrect. A standard way of reducing such errors is then to use a clustering algorithm based on sequence similarity, such as OrthoMCL [18], InParanoid [3], Proteinortho [17] or many others (see Quest for Orthologs links at http://questfororthologs.org/), to group genes into smaller sets of orthologs or inparalogs (paralogs that arose after a given speciation). Trees obtained for such partial gene families can then be combined by using a supertree method.

Considering input trees as parts of gene trees rather than as parts of species trees does not make any difference regarding the compatibility test procedure. However, for reconstructing a compatible "super gene tree", if a species tree is known for the taxa of interest, then it can be used as additional information to choose among all possible supertrees displaying the input partial gene trees. Indeed, a natural optimization criterion is to minimize the reconciliation cost, i.e. either the duplication or the duplication plus loss cost, induced by the output tree. We call the problem of finding a compatible supertree minimizing a reconciliation cost the supergenetree problem.

In this paper, we first show how the exact methods developed for the supertree problem can be adapted to the supergenetree problem. As for the original algorithms, all the extensions also have exponential worst-case time complexity. We then exhibit a heuristic, which can be seen as a greedy approach classically used for supertree problems, that consists in progressively constructing the tree from its root to its leaves. The main module of this heuristic is to infer the minimum number of duplications preceding the first speciation, which we call the Minimum pre-Speciation Duplication problem. We show that the supergenetree problem for the duplication cost, and even its restricted version the Minimum pre-Speciation Duplication problem, are NP-hard to approximate within an $n^{1-\epsilon}$ factor, for any $0 < \epsilon < 1$ ($n$ being the number of genes). Moreover, these inapproximability results hold even for instances in which there is only one gene per species in the input trees. Finally, we consider the supergenetree problem with restrictions on input trees that are relevant to many biological applications. Namely, we require each gene to appear in at most one tree, and genes of any tree to be related through orthology only. This is for example the case of gene trees obtained for OrthoMCL clusters called orthogroups [18]. We show that even under this restriction, the supergenetree problem remains NP-hard for the duplication cost.

The following section introduces preliminary notations that will be required in the rest of the paper.

Preliminaries

Notations on trees

Given a set $L$, a tree $T$ for $L$ is a rooted tree whose leafset $\mathcal{L}(T)$ is in bijection with $L$. We denote by $V(T)$ the set of nodes and by $r(T)$ the root of $T$. Given an internal node $x$ of $T$, the subtree of $T$ rooted at $x$ is denoted $T_x$. The degree of an internal node $x$ of $T$ is the number of children of $x$. If $T$ is binary, we arbitrarily set one of the two children of $x$ as the left child $x_l$ and the other as the right child $x_r$. We call $(\mathcal{L}(T_{x_l}), \mathcal{L}(T_{x_r}))$ the bipartition of a node $x$ of degree 2. (The term 'bipartition' is sometimes used, in the context of unrooted trees, to denote the nodes or leaves of the two components obtained after removing a given edge; this is not what we mean here.)

A node $x$ is an ancestor of $y$ if $x$ is on the (inclusive) path between $y$ and the root, and we then call $y$ a descendant of $x$. Two nodes $x$ and $y$ are separated in $T$ if neither is an ancestor of the other. The lowest common ancestor (lca) of a subset $L'$ of $\mathcal{L}(T)$, denoted $lca_T(L')$, is the ancestor common to all nodes in $L'$ that is the most distant from the root. The restriction $T|_{L'}$ of $T$ to $L'$ is the tree with leafset $L'$ obtained from the subtree of $T$ rooted at $lca_T(L')$ by removing all leaves that are not in $L'$, and contracting all internal nodes of degree 2, except the root. We generalize this notation to a set of trees: for a set $\mathcal{T}$ of trees on $L$, $\mathcal{T}|_{L'} = \{T|_{L'} : T \in \mathcal{T}\}$. Let $T'$ be a tree such that $\mathcal{L}(T') = L' \subseteq \mathcal{L}(T)$. We say that $T$ displays $T'$ iff $T|_{L'}$ is the same tree as $T'$.
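The restriction and display relations defined above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the paper: trees are encoded as nested tuples whose leaves are strings, and the helper names (`leaves`, `restrict`, `canonical`, `displays`) are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Return the restriction of tree to the leaves in keep, or None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (restrict(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    # A node left with a single child is contracted into that child.
    return kept[0] if len(kept) == 1 else tuple(kept)


def canonical(tree):
    """Order-independent form, so that (a, b) and (b, a) compare equal."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)


def displays(big, small):
    """True iff restricting big to the leaves of small yields small."""
    if not leaves(small) <= leaves(big):
        return False
    return canonical(restrict(big, leaves(small))) == canonical(small)


T = ((("a", "b"), "c"), "d")
print(displays(T, (("b", "a"), "c")))   # the triplet ab|c
print(displays(T, (("a", "c"), "b")))   # the triplet ac|b
```

Comparing trees through `canonical` is a design choice of this sketch: it ignores the left/right order of children, which matches the unordered trees used in the text.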

A triplet is a binary tree on a set $L$ with $|L| = 3$. For $L = \{x, y, z\}$, we denote by $xy|z$ the unique triplet $t$ on $L$ with root $r(t)$ for which $lca_t(x, y) \neq r(t)$ holds.

A polytomy (or star tree) over a set L is a tree for L with a single internal node, which is of degree |L|.

A resolution $B(T)$ of a non-binary tree $T$ is a binary tree respecting all the ancestral relations given by $T$. More precisely, $B(T)$ is a binary tree such that $\mathcal{L}(B(T)) = \mathcal{L}(T)$, and for any $u, v \in V(T)$, if $u$ is an ancestor of $v$ in $T$, then $lca_{B(T)}(\mathcal{L}(T_u))$ is an ancestor of $lca_{B(T)}(\mathcal{L}(T_v))$.

Gene and species trees

Figure 1 is an illustration of the notations defined in this section.

Figure 1

A gene tree $T$ for the gene family $\Gamma = \{a_1, a_2, b_1, c_1, c_2, c_3\}$ and a species tree $S$ for the set of species $\Sigma = \{a, b, c\}$, where, for any $x \in \Sigma$ and any $i$, $x_i$ is a gene in genome $x$. The label of an internal node $x$ of $T$ corresponds to $s(x)$. Speciation nodes are represented by circles and duplication nodes by squares. The pre-speciation duplication nodes (here only one node) are grey-colored. The dotted lines represent losses that are inferred by a most parsimonious reconciliation algorithm. The duplication cost of $T$ is 3 and its reconciliation cost is 5.

A species tree $S$ for a set $\Sigma = \{\sigma_1, \ldots, \sigma_t\}$ of species represents an ordered set of speciation events that have led to $\Sigma$: an internal node is an ancestral species at the moment of a speciation event, and its children are the new descendant species. Inside the species' genomes, genes undergo speciation when the species to which they belong do, but also duplications and losses (other events such as transfers can happen, but we ignore them here). A gene family is a set of genes $\Gamma$ accompanied by a mapping function $s : \Gamma \to \Sigma$ mapping each gene to its corresponding species.

Consider a gene family $\Gamma$ where each gene $x \in \Gamma$ belongs to a species $s(x)$ of $\Sigma$. The evolutionary history of $\Gamma$ can be represented as a gene tree $T$ for $\Gamma$, which is a rooted binary tree with its leafset in bijection with $\Gamma$, where each internal node refers to an ancestral gene at the moment of an event (either speciation or duplication). The mapping function $s$ is generalized as follows: if $x$ is an internal node of $T$, then $s(x) = lca_S(\{s(x') : x' \in \mathcal{L}(T_x)\})$.

An internal node $x$ of $T$ is called a speciation node if $s(x_l)$ and $s(x_r)$ are separated in $S$. Otherwise, $x$ is a duplication node preceding the speciation event $lca_S(s(x_l), s(x_r))$ if $lca_S(s(x_l), s(x_r))$ is an internal node of $S$; otherwise it is a duplication inside the extant species $lca_S(s(x_l), s(x_r))$. A duplication node $x$ such that $s(x) = r(S)$ is called a pre-speciation duplication node. A gene tree $T$ with all internal nodes being speciation nodes is called a speciation tree. Two genes $x, y$ of $\mathcal{L}(T)$ are orthologs in $T$ if $lca_T(x, y)$ is a speciation node.

The duplication cost of T is the number of duplication nodes of T. It reflects the minimum number of duplications required to explain the evolution of the gene family inside the species tree S according to T. A well-known reconciliation approach [7, 8] further makes it possible to recover, in linear time, the minimum number of losses underlying such an evolutionary history. We refer to the minimum number of duplications and losses required to explain T with respect to S as the reconciliation cost of T with respect to S, or simply the reconciliation cost if there is no ambiguity on the considered trees.
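To make the lca mapping and the speciation/duplication labeling concrete, here is a minimal Python sketch of the duplication cost. It is not the linear-time reconciliation algorithm of [7, 8]: it is a simple, slower illustration. The nested-tuple tree encoding, the representation of a node of S by its clade (set of descendant leaves), and all function names are assumptions made for this example. It relies on the fact that two nodes of a tree are separated exactly when their clades are disjoint.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def lca_clade(species_tree, targets):
    """Clade (leaf set) of the lowest node of species_tree above all targets."""
    node = species_tree
    while not isinstance(node, str):
        below = [child for child in node if targets <= leaves(child)]
        if not below:
            break          # no child contains every target: node is the lca
        node = below[0]
    return frozenset(leaves(node))


def duplication_cost(gene_tree, species_tree, s):
    """Count the duplication nodes of a binary gene tree under the lca mapping."""
    count = 0

    def visit(node):
        nonlocal count
        if isinstance(node, str):
            return lca_clade(species_tree, {s[node]})
        left, right = (visit(child) for child in node)
        # s(x_l) and s(x_r) are separated in S iff their clades are disjoint.
        if left & right:
            count += 1
        return lca_clade(species_tree, left | right)

    visit(gene_tree)
    return count


# Hypothetical toy instance (not the trees of Figure 1).
S = (("a", "b"), "c")
s = {"a1": "a", "a2": "a", "b1": "b", "c1": "c", "c2": "c"}
T = ((("a1", "b1"), "c1"), ("a2", "c2"))
print(duplication_cost(T, S, s))   # only the root is a duplication here
```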

Supergenetree problem statement

A set G of gene trees is said to be consistent if there is a tree T, called a supergenetree for G, displaying each tree of G, and inconsistent otherwise. A supergenetree T for G is said to be compatible with G. For example, the four triplets in Figure 2 are consistent, and the gene tree T of Figure 1 is compatible with them. However, adding the dotted tree to the set of triplets makes the gene tree set inconsistent. Consistency of a set of trees can be tested in polynomial time [1]. For a consistent set of trees, the problem considered here is to find a compatible gene tree of minimum reconciliation cost with respect to a given species tree. A formal statement of the general problem follows.

Figure 2

Gene trees (left and middle) and their corresponding triplet graphs (right). Plain edges of the graph correspond to the four triplet trees, while dotted edges correspond to the triplets of the four-leaf tree.

MINIMUM SUPERGENETREE PROBLEM (MINSGT PROBLEM):

Input: A species set $\Sigma$ and a binary species tree $S$ for $\Sigma$; a gene family $\Gamma$, a set $\Gamma_i$, $1 \le i \le k$, of subsets of $\Gamma$, and a set $\mathcal{G} = \{G_1, G_2, \ldots, G_k\}$ of consistent gene trees where, for each $1 \le i \le k$, $G_i$ is a tree for $\Gamma_i$.

Output: Among all gene trees for Γ compatible with G, one tree T of minimum reconciliation cost.

When the considered cost is the duplication cost, the problem is called the Minimum Duplication SuperGeneTree Problem (MinDupSGT problem).

From the SuperTree to the SuperGeneTree Problem

The classical supertree problem is to state whether or not a set of partial trees are consistent, and if so construct a tree containing them all. Here, we introduce the classical methods for solving this problem, and explore natural generalizations to the supergenetree problem.

Let $\Gamma$ be a set of $n$ taxa (usually species in the case of the supertree problem, and genes in the case of the supergenetree problem), let $\Gamma_i$, $1 \le i \le k$, be possibly overlapping subsets of $\Gamma$, and let $\mathcal{G} = \{G_1, G_2, \ldots, G_k\}$ be a set of trees where, for each $1 \le i \le k$, $G_i$ is a tree for $\Gamma_i$. Let $tr(\mathcal{G})$ be the set of triplets of $\mathcal{G}$, defined as $tr(\mathcal{G}) = \{xy|z : \exists\, 1 \le i \le k \text{ such that } G_i|_{\{x,y,z\}} = xy|z\}$. Let $\mathcal{T}_{\Gamma,E}$ be the triplet graph with vertex set $\Gamma$ and edge set $E = \{xy : \exists z \in \Gamma \text{ such that } xy|z \in tr(\mathcal{G})\}$ (see Figure 2 for an example).
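The triplet set and the edge set of the triplet graph can be computed directly from the definitions. The following Python sketch assumes a nested-tuple encoding of trees with string leaves; the function names and the representation of a triplet xy|z as a pair (frozenset({x, y}), z) are hypothetical choices made for illustration. It uses the observation that a tree displays xy|z exactly when some node has x and y, but not z, among its descendant leaves.

```python
from itertools import combinations


def clusters(tree):
    """Leaf sets of all nodes of a nested-tuple tree; the root's set comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    below = [clusters(child) for child in tree]
    # The last cluster of each child list is that child's full leaf set.
    top = frozenset().union(*(lst[-1] for lst in below))
    return [c for lst in below for c in lst] + [top]


def triplets(trees):
    """Return the triplets displayed by the input trees as (frozenset({x, y}), z)."""
    found = set()
    for tree in trees:
        cl = clusters(tree)
        for trio in combinations(sorted(cl[-1]), 3):
            for z in trio:
                pair = frozenset(trio) - {z}
                # xy|z is displayed iff some node is above x and y but not z.
                if any(pair <= c and z not in c for c in cl):
                    found.add((pair, z))
    return found


def triplet_graph_edges(trees):
    """Edge set E: one edge xy for every triplet xy|z."""
    return {pair for pair, _ in triplets(trees)}


four_leaf_tree = (("a", "b"), ("c", "d"))
print(sorted(sorted(e) for e in triplet_graph_edges([four_leaf_tree])))
```

This enumeration takes time cubic in the number of leaves of each tree (times the number of nodes), which is enough for a small illustration but is not meant as an efficient implementation.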

The classical BUILD algorithm [1] determines, in polynomial time, whether a set of triplets is consistent and, if so, constructs a tree $T$, possibly non-binary, compatible with them. The algorithm takes as input the graph $\mathcal{T} = \mathcal{T}_{\Gamma,E}$. Let $C_{\mathcal{T}} = \{C_1, \ldots, C_m\}$ be the set of connected components of $\mathcal{T}$. If $\mathcal{T}$ has at least three vertices and $|C_{\mathcal{T}}| = 1$, then $\mathcal{G}$ is inconsistent, and the algorithm terminates. For example, the set of five gene trees of Figure 2 is inconsistent, as the corresponding triplet graph (including dotted lines) is connected. Otherwise, if $|V(\mathcal{T})| \ge 3$, a polytomy is created over $C_{\mathcal{T}}$, the internal node of the polytomy being the root $r(T)$ of the compatible tree $T$ under construction and its children being $m$ subtrees with leafsets $V(C_1), \ldots, V(C_m)$, with their topology yet to be determined (where $V(C_i) \subseteq \Gamma$ denotes the set of taxa appearing in $C_i$). The algorithm then recurses into each connected component, i.e. the subtree for $V(C_i)$ is determined recursively from the graph $\mathcal{T}_{V(C_i), E|_{C_i}}$ defined by $E|_{C_i} = \{xy : \exists z \in \Gamma \text{ such that } xy|z \in tr(\mathcal{G}|_{V(C_i)})\}$. If, at any step, the considered graph has a single component containing more than two vertices, then $\mathcal{G}$ is reported as an inconsistent set of trees and the algorithm terminates. Otherwise, recursion terminates when the graph has at most two vertices, eventually returning a supertree $T$. See Figure 3 for an example.
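The recursion just described can be sketched as follows. This is a simplified illustration of BUILD, not an optimized implementation: triplets are given as tuples (x, y, z) meaning xy|z, the output tree is a nested tuple, `None` signals inconsistency, and the function name `build` is chosen for this example only.

```python
def build(taxa, triplets):
    """Return a tree (nested tuples) compatible with the triplets, or None.

    taxa: iterable of leaf labels; triplets: iterable of (x, y, z) meaning xy|z.
    """
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]
    if len(taxa) == 2:
        return tuple(taxa)
    present = set(taxa)
    # Keep only the triplets whose three leaves all lie in the current taxa.
    inside = [t for t in triplets if set(t) <= present]
    # Connected components of the triplet graph (edge xy for each xy|z).
    comp = {x: {x} for x in taxa}
    for x, y, _ in inside:
        if comp[x] is not comp[y]:
            merged = comp[x] | comp[y]
            for v in merged:
                comp[v] = merged
    components = []
    for x in taxa:
        if comp[x] not in components:
            components.append(comp[x])
    if len(components) == 1:
        return None        # connected graph on >= 3 vertices: inconsistent
    subtrees = [build(c, inside) for c in components]
    if any(sub is None for sub in subtrees):
        return None
    return tuple(subtrees)


print(build({"a", "b", "c", "d"}, [("a", "b", "c"), ("a", "b", "d")]))
print(build({"a", "b", "c"}, [("a", "b", "c"), ("a", "c", "b")]))
```

In the first call the components are {a, b}, {c} and {d}, so the root is a polytomy with three children; in the second call the triplet graph is connected, so the set is reported as inconsistent.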

Figure 3

Left: Execution of the BUILD algorithm on the set of given four triplets. This example requires two iterations of the algorithm (delimited by a dotted line). At the first iteration, the triplet graph contains three components, leading to a polytomy with three leaves. The algorithm then iterates on the component {a1, a2, c1}, which terminates the supertree reconstruction procedure. Notice that the gene tree of Figure 1, which is compatible with the four triplets, is not a resolution of this non-binary tree; Right: A variant of the BUILD algorithm, with the triplet graph components grouped into bipartitions - in this case leading to a fully resolved tree. This tree is actually the gene tree T of Figure 1.

The BUILD algorithm has been generalized into an algorithm called AllTrees [20] that outputs all supertrees compatible with a set of triplets when consistency holds. Instead of taking each element of $C_{\mathcal{T}}$ as a separate child of $r(T)$, all possible groupings, in other words all partitions of $C_{\mathcal{T}}$, are considered (see Figure 3, right, for a choice of bipartitions). For each partition $P_{C_{\mathcal{T}}}$ of $C_{\mathcal{T}}$, a polytomy is created over $P_{C_{\mathcal{T}}}$. The algorithm then iterates by considering each possible partition of each subgraph induced by each element of $P_{C_{\mathcal{T}}}$. The algorithm is polynomial in the size of the output, which may be exponential in the size of the input.

A tree $T$ compatible with $\mathcal{G}$ such that no internal edge of $T$ can be contracted so that the resulting tree is also compatible with $\mathcal{G}$ is called a minimally resolved supertree. Minimally resolved supertrees contain all the information about all supertrees compatible with $\mathcal{G}$ but in a "compressed" format. By exhibiting some properties of graph components, Semple shows in [27] how some partitions of the triplet graph components can be avoided without loss of generality. The resulting algorithm, named AllMinTrees [27], outputs a minimally resolved tree in polynomial time. However, it was shown in [15] that the cardinality of the solution space can be exponential in $n = |\Gamma|$, leading to an exponential time algorithm with $\Omega\left(\left(\frac{n}{2}\right)^{\frac{n}{2}}\right)$.

Notice that, in general, the trees output by all these methods are non-binary trees.

Extensions to the SuperGeneTree problem

Natural exact solutions for the supertree problem can be extended to the supergenetree problem as follows:

(1) Use AllMinTrees to output all minimally resolved supertrees, and for each one, which is non-binary in general, find in linear time a resolution minimizing the reconciliation [16, 32] or duplication [31] cost. Among all optimally resolved trees, select one of minimum cost. Clearly this approach has the same complexity as the AllMinTrees algorithm, multiplied by a factor of $n$ to resolve each tree, which is $\Omega\left(n \left(\frac{n}{2}\right)^{\frac{n}{2}}\right)$.

(2) As we are seeking a binary tree, each created node $x$ of the supergenetree $T$ under construction should determine a bipartition $(\mathcal{L}(T_{x_l}), \mathcal{L}(T_{x_r}))$. Therefore, the AllTrees algorithm can be simplified by considering, instead of all partitions of $C_{\mathcal{T}}$, only all bipartitions of the triplet graph components set. See an example in Figure 3, right. Notice that this simplification approach is not applicable to the AllMinTrees algorithm, as by imposing bipartitions, the minimum resolution condition cannot be guaranteed.

A branch-and-bound approach

The tree space which is explored by the two exact methods described above can be reduced by using a branch-and-bound approach. Consider for example method (1) using the AllMinTrees algorithm. At each iteration of computing one minimally resolved tree, resolve the intermediate non-binary tree obtained at this step, using for example the linear-time algorithm presented in [16]. If its reconciliation cost is greater than the cost of a full tree already obtained at a previous stage of the AllMinTrees algorithm, then stop expanding this tree as this can only increase the reconciliation cost.

A dynamic programming approach

The recursive top-down method (2) can instead be handled by a dynamic programming approach computing the minimum reconciliation cost of a tree on a subset of Γ according to the reconciliation costs of trees on smaller subsets, similarly to the work done in [13].

More precisely, let $P$ be an arbitrary subset of $\Gamma$, and denote by $R(P)$ the minimum duplication cost of a tree $T|_P$ having leafset $P$ and compatible with the set $\mathcal{G}|_P = \{G_1|_P, G_2|_P, \ldots, G_k|_P\}$. Let $\mathcal{T}_{P, E|_P}$ be the BUILD graph restricted to $P$ and $\mathcal{G}|_P$, and $C_{\mathcal{T}} = \{C_1, \ldots, C_m\}$ the set of its connected components. If $C \subset C_{\mathcal{T}}$, by $V(C)$ we mean $\bigcup_{C_i \in C} V(C_i)$. Denote the complement of $C$ by $\bar{C} = C_{\mathcal{T}} \setminus C$. Finally, set $d(V(C), V(\bar{C}))$ to 0 if $s(V(C))$ and $s(V(\bar{C}))$ are separated in $S$, in which case $(V(C), V(\bar{C}))$ is the bipartition of a speciation node, and to 1 otherwise, i.e. if $(V(C), V(\bar{C}))$ is the bipartition of a duplication node. Then:

$$R(P) = \min_{C \subset C_{\mathcal{T}}} \left[ R(V(C)) + R(V(\bar{C})) + d(V(C), V(\bar{C})) \right]$$

the value of interest being $R(\Gamma)$. First note that, assuming constant-time lca queries over $S$, $d(V(C), V(\bar{C}))$ can be computed in constant time if $s(V(C))$ and $s(V(\bar{C}))$ can be accessed in constant time, since it suffices to check that the lca of $s(V(C))$ and $s(V(\bar{C}))$ differs from both. To achieve this, we precompute $s(X)$ for every subset $X$ of $\Gamma$ of size $1, 2, \ldots, n$ in increasing order. Noting that if $|X| > 1$, then for any $x \in X$, $s(X) = lca_S(s(X \setminus \{x\}), s(x))$, $s(X)$ can be computed in constant time assuming that $s(X \setminus \{x\})$ was computed previously and assuming constant-time lca queries. As there are $2^n$ subsets of $\Gamma$, each computed in constant time, this preprocessing step takes time $O(2^n)$.

As for $R(\Gamma)$, we can simply ensure that each $R(P)$ is computed at most once by storing its value in a table for subsequent accesses (i.e. when $R(P)$ is needed, we use its value if it has been computed, or compute it and store it otherwise). In this manner, each subset $P$ takes time, not counting the recursive calls, proportional to $|P||\mathcal{G}| + |P| + 2^{|C_{\mathcal{T}}|}$ to construct $\mathcal{T}_{P, E|_P}$, find $C_{\mathcal{T}}$, and evaluate each bipartition of $C_{\mathcal{T}}$. We will simply use the fact that $|P||\mathcal{G}| + |P| + 2^{|C_{\mathcal{T}}|}$ is in $O(2^n)$. As this has to be done for, at worst, each of the $2^n$ subsets of $\Gamma$, we get a total time of $O(2^n + 2^n \cdot 2^n) = O(4^n)$. Note that this analysis probably overestimates the actual complexity of the algorithm, as we are assuming that each subset $P$ and each component set $C_{\mathcal{T}}$ are both always of size $n$. It is also worth mentioning that the $R(P)$ recurrence can easily be adapted to the mutation cost (duplications + losses).
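A memoized version of the R(P) recurrence can be sketched in Python as follows. Several simplifications are assumptions of this example rather than part of the paper: the input trees are represented only through their triplets (x, y, z) meaning xy|z, the species tree is a nested tuple, s(X) is recomputed on demand instead of being precomputed for all subsets, and an infinite cost is returned when the restricted triplet set is inconsistent. All function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def lca_clade(species_tree, targets):
    """Clade (leaf set) of the lowest node of species_tree above all targets."""
    node = species_tree
    while not isinstance(node, str):
        below = [child for child in node if targets <= leaves(child)]
        if not below:
            break
        node = below[0]
    return frozenset(leaves(node))


def min_duplication_cost(genes, triplets, species_tree, s):
    """Evaluate R(genes) with the bipartition recurrence and memoization."""
    triplets = list(triplets)

    def components(P):
        # Connected components of the BUILD graph restricted to P.
        comp = {x: frozenset([x]) for x in P}
        for x, y, z in triplets:
            if x in P and y in P and z in P and comp[x] != comp[y]:
                merged = comp[x] | comp[y]
                for v in merged:
                    comp[v] = merged
        return sorted(set(comp.values()), key=sorted)

    def clade(P):
        return lca_clade(species_tree, {s[g] for g in P})

    @lru_cache(maxsize=None)
    def R(P):
        if len(P) == 1:
            return 0
        comps = components(P)
        if len(comps) == 1:
            return float("inf")      # inconsistent restriction
        first, rest = comps[0], comps[1:]
        best = float("inf")
        # Keep comps[0] on the left so that each bipartition is tried once.
        for r in range(len(rest)):
            for chosen in combinations(rest, r):
                left = first.union(*chosen)
                right = P - left
                # d = 0 iff the two sides map to separated nodes of S.
                d = 0 if clade(left).isdisjoint(clade(right)) else 1
                best = min(best, R(left) + R(right) + d)
        return best

    return R(frozenset(genes))


# Hypothetical toy instance: two genes in species a force one duplication.
S = (("a", "b"), "c")
s = {"a1": "a", "a2": "a", "b1": "b", "c1": "c"}
print(min_duplication_cost(s.keys(), [("a1", "b1", "c1")], S, s))
```

Like the recurrence itself, this sketch is exponential in the number of genes and is only meant for very small instances.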

A greedy heuristic for the duplication cost

Instead of trying all partitions of the triplet graph components set at each step of the AllTrees or AllMinTrees algorithms, if the goal is to minimize the duplication cost, then a natural greedy approach would be to choose the best partition at each iteration, namely the one that minimizes the number of duplications preceding each speciation event. Such an approach would result in pushing duplications down the tree. It leads to the following restricted version of the supergenetree problem.

MINIMUM PRE-SPECIATION DUPLICATION PROBLEM (MINPRESPEDUP PROBLEM):

Input: A species set $\Sigma$ and a binary species tree $S$ for $\Sigma$; a gene family $\Gamma$, a set $\Gamma_i$, $1 \le i \le k$, of subsets of $\Gamma$, and a set $\mathcal{G} = \{G_1, G_2, \ldots, G_k\}$ of consistent gene trees where, for each $1 \le i \le k$, $G_i$ is a tree on $\Gamma_i$.

Output: Among all gene trees for Γ compatible with G, one tree T with a minimum number of pre-speciation duplication nodes.

We will show in the following section that even this restricted version of the supergenetree problem is hard. Here, we give the intuition of a natural way of solving this problem, which reduces to repeated applications of the Max-Cut problem. Although Max-Cut is known to be NP-hard, efficient approximation algorithms exist (achieving an approximation factor of 0.878 [12]) that can be used for our purpose.

For the supertree problem, the triplet graph $\mathcal{T} = \mathcal{T}_{\Gamma,E}$ represents all triplets of the input trees that have to be combined. In the case of the supergenetree problem, another tree is available, the species tree $S$. A triplet $xy|z$ found in the input trees $\mathcal{G}$ can be reconciled with $S$, and if $r(xy|z)$ is a duplication, then any tree compatible with $\mathcal{G}$ must contain this duplication. Say that $r(xy|z)$ is a required duplication mapped to $r(S)$ if $s(r(xy|z)) = r(S)$ and $r(xy|z)$ is a duplication. Let us include this information in $\mathcal{T}$. More precisely, let $C = C_{\mathcal{T}}$ denote the set of connected components of $\mathcal{T}$, and let $\mathcal{T}_C$ be the graph whose vertex set is $C$, and in which $C_1, C_2 \in C$ share an edge if $C_1$ has vertices $x, y$ and $C_2$ has a vertex $z$ such that $xy|z$ is a triplet in $\mathcal{G}$ with $r(xy|z)$ being a required duplication mapped to $r(S)$. If there are, say, $d$ distinct such triplets, one can set a weight of $d$ on the $C_1C_2$ edge. See Figure 4 for an example.

Figure 4

Example of how Max-Cut can be applied to the MinPreSpeDup problem. $S$ is a species tree, $\mathcal{G} = \{G_1, G_2, G_3\}$ and $\mathcal{T}$ is the BUILD graph (solid edges). Its connected components are enclosed in circles, and the dotted edges represent required duplications mapped to $r(S)$. The edge of weight 2 is explained by the $a_1d_1|c_1$ and $d_1b_1|e_1$ triplets, whereas the edge of weight 1 is explained by the $c_1e_1|f_1$ triplet. A Max-Cut creates the bipartition $(\{a_1, b_1, d_1, f_1\}, \{c_1, e_1\})$, leading to the $T_1$ tree which merges all required duplications at its root. The tree $T_2$ is obtained from the suboptimal bipartition $(\{a_1, b_1, d_1\}, \{c_1, e_1, f_1\})$ and has 2 duplications.

Consider the problem of clustering the components of $\mathcal{T}_C$ into two parts $B_1, B_2$ of a bipartition in a way that minimizes the number of duplications preceding the speciation event $r(S)$. For each $C_1 \in B_1$ and $C_2 \in B_2$ such that $C_1C_2$ is an edge of $\mathcal{T}_C$, a tree $T$ rooted at the bipartition $(B_1, B_2)$ contains the required duplications mapped to $r(S)$ represented by the $C_1C_2$ edge. If there are $k$ such edges between $B_1$ and $B_2$ totaling a weight of $w$, the single duplication at the root of $T$ contains those $w$ required duplications. In other words, we have "merged" $w$ required duplications into one. It then becomes natural to find the bipartition of $\mathcal{T}_C$ that merges a maximum number of duplications, i.e. that contains a set of edges of maximum weight crossing between the two parts. This is the well-known Max-Cut problem. For instance in Figure 4, the Max-Cut has a weight of 3 and leads to the optimal tree $T_1$. Any other bipartition sends a required duplication to a lower level and is hence suboptimal. The $T_2$ tree is obtained by first taking the suboptimal $(\{a_1, b_1, d_1\}, \{c_1, e_1, f_1\})$ bipartition, which creates a duplication at the root and defers the $c_1e_1|f_1$ required duplication for later.

Note however that the components of T may contain required duplications themselves, which are not represented by the edges of T C . Thus, a Max-Cut must then be applied recursively on both parts of the chosen bipartition. Therefore, this method does not benefit directly from the efficient approximation factor known for the Max-Cut problem, as the approximation error stacks with each application. In the next section, we show that, unlike Max-Cut, the MinPreSpeDup problem cannot admit a constant factor approximation (unless P = NP).
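A single Max-Cut step on the component graph can be illustrated with an exhaustive search. The sketch below is exponential in the number of components and stands in for the approximation algorithm of [12], which it does not implement. The example weights are inferred from the caption of Figure 4 (an edge of weight 2 between {a1, b1, d1} and {c1, e1}, and an edge of weight 1 between {c1, e1} and {f1}); the component labels "ABD", "CE", "F" and the function name are hypothetical.

```python
from itertools import combinations


def max_cut(vertices, weights):
    """Exhaustive Max-Cut.

    vertices: list of component labels (at least two).
    weights: dict mapping frozenset({u, v}) to the weight of edge uv.
    Returns (cut weight, one side, other side).
    """
    vertices = list(vertices)
    if len(vertices) < 2:
        raise ValueError("a bipartition needs at least two components")
    first, rest = vertices[0], vertices[1:]
    best_weight, best_side = -1, None
    # Keep the first vertex on one side so each bipartition is tried once.
    for r in range(len(rest)):
        for chosen in combinations(rest, r):
            side = {first, *chosen}
            # An edge crosses the cut iff exactly one endpoint lies in side.
            weight = sum(w for edge, w in weights.items() if len(edge & side) == 1)
            if weight > best_weight:
                best_weight, best_side = weight, side
    return best_weight, best_side, set(vertices) - best_side


weights = {frozenset({"ABD", "CE"}): 2, frozenset({"CE", "F"}): 1}
weight, side, other = max_cut(["ABD", "CE", "F"], weights)
print(weight, sorted(side), sorted(other))
```

On this instance the best cut isolates the component {c1, e1}, in line with the bipartition described in the caption of Figure 4.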

Inapproximability of the MinDupSGT and MinPreSpeDup problems

Throughout the rest of this section, we denote by $n = |\Gamma|$ the size of the considered gene family. We show that both the MinDupSGT problem and its restriction, the MinPreSpeDup problem, are NP-hard to approximate.

Theorem 1 The MinDupSGT and MinPreSpeDup problems are both NP-hard to approximate within a factor of $n^{1-\epsilon}$ for any constant $0 < \epsilon < 1$. Moreover, this result holds for both problems even when restricted to instances having at most one gene per species in $\Gamma$.

Proof We use a reduction from the minimum $k$-colorability problem. Recall that a graph $H = (V, E)$ is $k$-colorable if there is a partition $\{V_1, V_2, \ldots, V_k\}$ of $V$ into independent sets (i.e. if $x, y \in V_i$ for some $1 \le i \le k$, then $xy \notin E$). It is now well-known [33] that the smallest $k$ for which $H$ is $k$-colorable cannot be approximated within a factor of $|V|^{1-\epsilon}$ unless P = NP.

Now, given a graph H = (V, E), we construct a gene set Γ, a set of rooted triplet gene trees G and a species tree S such that H is k-colorable if and only if G is compatible with some gene tree T having at most k − 1 duplications when reconciled with S. Using the same construction, we also show that H is k-colorable if and only if G is compatible with some gene tree T having at most k − 1 pre-speciation duplications when reconciled with S. In both cases, the gene-species mapping s is bijective, proving the second part of the theorem statement.

Let $\Gamma = \{v_1, v_2 : v \in V\}$ and for each edge $vw \in E$, add the triplets $v_1v_2|w_1$, $v_1v_2|w_2$, $w_1w_2|v_1$ and $w_1w_2|v_2$ to $\mathcal{G}$. Observe that this forces any tree $T$ that displays $\mathcal{G}$ to display the tree $((v_1, v_2), (w_1, w_2))$. Add one species to $\Sigma$ for each gene of $\Gamma$ so that the gene-species mapping $s$ is bijective. As for $S$, first let $S_1$ be any binary tree with one leaf for each member of $\{s(v_1) : v \in V\}$, and in the same manner let $S_2$ be any binary tree with one leaf for each member of $\{s(v_2) : v \in V\}$. Obtain $S$ by connecting the root of $S_1$ and the root of $S_2$ under a common parent $r(S)$. Thus $s(v_1)$ and $s(v_2)$ are separated by $r(S)$ for any $v \in V$. Clearly, $\mathcal{G}$ and $S$ can be constructed in polynomial time.
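The construction above is mechanical and can be written out as a short Python sketch. The naming scheme for genes and species (`v_1`, `sp_v_1`), the choice of caterpillar trees for $S_1$ and $S_2$ (the text allows any binary tree), and the function names are assumptions of this example.

```python
def caterpillar(labels):
    """One arbitrary binary tree on the given leaves, as nested tuples."""
    tree = labels[0]
    for label in labels[1:]:
        tree = (tree, label)
    return tree


def coloring_to_supergenetree_instance(vertices, edges):
    """Build (genes, triplets, species tree, gene-species map) from a graph.

    A triplet (x, y, z) stands for xy|z.
    """
    vertices = sorted(vertices)
    genes = [f"{v}_{i}" for v in vertices for i in (1, 2)]
    triplets = []
    for v, w in edges:
        v1, v2, w1, w2 = f"{v}_1", f"{v}_2", f"{w}_1", f"{w}_2"
        # These four triplets force the tree ((v1, v2), (w1, w2)).
        triplets += [(v1, v2, w1), (v1, v2, w2), (w1, w2, v1), (w1, w2, v2)]
    # One species per gene, so the gene-species mapping is bijective.
    s = {g: "sp_" + g for g in genes}
    S1 = caterpillar([s[f"{v}_1"] for v in vertices])
    S2 = caterpillar([s[f"{v}_2"] for v in vertices])
    return genes, triplets, (S1, S2), s


genes, triplets, S, s = coloring_to_supergenetree_instance(["u", "v"], [("u", "v")])
print(genes)
print(triplets)
print(S)
```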

Claim 1 : if H is k-colorable, then we can find a tree T compatible with G having at most k − 1 duplications. Moreover each such duplication x is a pre-speciation duplication (i.e. s(x) = r(S)).

Let $\{V_1, V_2, \ldots, V_k\}$ be a $k$-coloring of $H$. For each $1 \le i \le k$, let $T_i$ be the tree with leafset $V'_i = \{v_1, v_2 : v \in V_i\}$ that has only speciations, i.e. $T_i$ is $S|_{s(V'_i)}$ (because all genes in $V'_i$ belong to different species). Notice that $s(r(T_i)) = r(S)$, since $r(S)$ separates $v_1$ from $v_2$ for all $v \in V$. Obtain $T$ by taking any binary tree on $k$ leaves (and hence $k - 1$ internal nodes), then replacing each leaf by a distinct $T_i$. In this manner, $T$ has $k - 1$ duplications since only the internal nodes of $T$ that do not belong to any $T_i$ need to be duplications. Moreover, each duplication node $x$ has $s(x) = r(S)$. It remains to show that $T$ is compatible with $\mathcal{G}$. It suffices to observe that all triplets of $\mathcal{G}$ are of the form $v_1v_2|w_h$ with $h \in \{1, 2\}$, and that such a triplet being in $\mathcal{G}$ implies that $vw \in E$. For such a triplet, we must then have $v \in V_i$ and $w \in V_j$ with $i \neq j$, implying $v_1, v_2 \in V'_i$ and $w_h \in V'_j$. By the construction of $T$, $v_1v_2|w_h$ must be a triplet of $T$, as desired.

Claim 2 : if there is a tree T compatible with G having k − 1 duplications, then H is k-colorable. Moreover if T has k −1 duplications such that each duplication x has s(x) = r(S), then H is k-colorable.

Let $T$ be a tree compatible with $\mathcal{G}$ having $k - 1$ duplications. Call a node $x$ of $T$ $S$-maximal if $x$ is not a duplication node mapped to $r(S)$ but every proper ancestor of $x$ is a duplication mapped to $r(S)$. Let $X = \{x_1, x_2, \ldots, x_m\}$ be the set of $S$-maximal nodes of $T$. Note that if $y \neq r(T)$ is a duplication mapped to $r(S)$, then so is the parent of $y$. This implies that every leaf $\ell$ of $T$ has at least one ancestor $x_i$ in $X$, since $x_i$ is the highest (i.e. closest to the root) ancestor of $\ell$ that is not a duplication mapped to $r(S)$ (such an $x_i$ always exists, since $\ell$ is itself one such node). Moreover, $x_i$ is unique, as no other $x_j \in X$ can be an ancestor of $x_i$. Therefore, $\{\mathcal{L}(T_{x_1}), \ldots, \mathcal{L}(T_{x_m})\}$ is a partition of $\mathcal{L}(T)$. We next show that $m \le k$. Let $T'$ be the tree obtained by removing all proper descendants of $x_i$ in $T$, for all $1 \le i \le m$. Then $T'$ is a binary tree with $m$ leaves, and all its $m - 1$ internal nodes are duplications mapped to $r(S)$. Since $T$ has no more than $k - 1$ duplications (in either case of the claim), $T'$ has at most $k - 1$ internal nodes and therefore at most $k$ leaves. We deduce that $m \le k$.

Observe that if $vw \in E$, then $\alpha = lca(v_1, v_2, w_1, w_2)$ must be a duplication such that $s(\alpha) = r(S)$. Indeed, $\alpha$ separates $lca(v_1, v_2)$ from $lca(w_1, w_2)$ since $T$ displays $((v_1, v_2), (w_1, w_2))$. But since $s(lca(v_1, v_2)) = s(lca(w_1, w_2)) = r(S)$ by the construction of $S$, $s(\alpha)$ can only be $r(S)$ as well, and so $\alpha$ must be a duplication.

Now, let $V_i = \{v : v_1 \text{ is a descendant of } x_i\}$ for each $1 \le i \le m$. Take $v, w \in V_i$ for some $i$. We show that $vw \notin E$, and thus that $\{V_1, \ldots, V_m\}$ forms a coloring of $H$ with at most $k$ colors. The argument applies whether each duplication maps to $r(S)$ or not, proving both parts of the claim. Suppose for the sake of contradiction that $vw \in E$, but $v, w \in V_i$. In $T$, $lca(v_1, w_1)$ must be a descendant of $x_i$, since $x_i$ is a common ancestor of $v_1$ and $w_1$ by the definition of $V_i$. Moreover, $lca(v_1, w_1) \neq x_i$ since $lca(v_1, w_1) = lca(v_1, v_2, w_1, w_2)$ is a duplication mapped to $r(S)$, as shown above, while $x_i$ is not such a duplication, by its definition. Therefore, $lca(v_1, w_1)$ is a proper descendant of $x_i$. But $s(lca(v_1, w_1)) = r(S) = s(x_i)$ implies that $x_i$ is a duplication mapped to $r(S)$, a contradiction. We conclude that $\{V_1, \ldots, V_m\}$ with $m \le k$ is a proper coloring of $H$.

This reduction, together with the fact that the $k$-coloring problem is NP-hard to approximate within an $n^{1-\epsilon}$ factor, proves the Theorem.   □

Independent Speciation trees

We now consider the MinDupSGT problem in the special case where the input gene trees are independent speciation trees, meaning: (1) each gene of $\Gamma$ appears in at most one gene tree leafset, and (2) the gene trees of $\mathcal{G} = \{G_1, G_2, \ldots, G_k\}$ are all speciation trees with respect to the species tree $S$. Our objective is to find a gene tree $T$ compatible with $\mathcal{G}$ minimizing duplications that also maintains the orthology relationships specified by $\mathcal{G}$. In other words, we require that for every $G_i \in \mathcal{G}$, $T|_{\mathcal{L}(G_i)}$ has only speciations. We say that a gene tree $T$ that satisfies this property preserves the speciations of $\mathcal{G}$. Note that if $T$ preserves the speciations of $\mathcal{G}$, then it is necessarily compatible with $\mathcal{G}$. We call $T|_{\mathcal{L}(G_i)}$ the copy of $G_i$ in $T$.

MINIMUM SPECIATION SUPERGENETREE PROBLEM (MINSPECSGT PROBLEM):

Input: A species set $\Sigma$ and a binary species tree $S$ for $\Sigma$; a gene family $\Gamma$, a set $\Gamma_i$, $1 \le i \le k$, of disjoint subsets of $\Gamma$, and a set $\mathcal{G} = \{G_1, G_2, \ldots, G_k\}$ of consistent independent speciation trees such that, for each $1 \le i \le k$, $G_i$ is a tree for $\Gamma_i$.

Output: Among all gene trees for Γ that preserve the speciations of G, one tree T of minimum duplication cost.

Notice that, since no gene of Γ appears more than once in the set of input trees, 𝒢 always admits a solution. Indeed, taking any binary tree on k leaves and replacing each leaf by a distinct G_i achieves the desired result. However, although this restricted problem appears easier, we show that finding such a gene tree T minimizing the number of duplications is still hard.
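The trivial solution described above can be sketched in a few lines. This is a minimal illustration, assuming the input trees are given as Newick strings without the trailing semicolon; the function name and the gene names are hypothetical, and the caterpillar shape is an arbitrary choice among all binary trees on k leaves.

```python
def join_under_binary_tree(subtrees):
    """Join Newick subtrees (given without the trailing ';') under a caterpillar.

    Any binary tree on k leaves would do; the caterpillar is an arbitrary choice.
    Only the k - 1 joining nodes can be duplications, since each input tree
    is kept intact as a subtree.
    """
    if not subtrees:
        raise ValueError("at least one input tree is required")
    tree = subtrees[0]
    for sub in subtrees[1:]:
        tree = f"({tree},{sub})"
    return tree + ";"


# Hypothetical input: three speciation trees on disjoint gene sets.
trees = ["(g1,g2)", "((g3,g4),g5)", "g6"]
supertree = join_under_binary_tree(trees)
print(supertree)  # (((g1,g2),((g3,g4),g5)),g6);
```

The resulting tree preserves every input tree but makes no attempt to minimize duplications, which is exactly the part of the problem shown to be hard.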

Theorem 2 The decision version of the MinSpecSGT problem is NP-Complete, i.e. it is NP-Complete to decide if a species tree S and a set of independent speciation trees 𝒢 admit a supertree T that preserves the speciations of 𝒢 with at most k duplications.

Proof The problem is easily seen to be in NP, as it is easy to verify in polynomial time that a given gene tree T is compatible with 𝒢, preserves its speciations and has at most k duplications. For NP-hardness, we turn to the decision version of the k-colorability problem: for a given k ≥ 3, deciding if a graph H = (V, E) is k-colorable is NP-hard. We create from H a species tree S and a set of independent speciation trees 𝒢 such that H is k-colorable if and only if S and 𝒢 admit a supertree T with at most k − 1 duplications.

Let n = |V|, and denote V = {v_1, ..., v_n}. To create S, start with any binary tree S' on n(n − 1)/2 leaves (one per unordered pair of vertices). Denote this leafset by W = {w_{i,j} : 1 ≤ i < j ≤ n}, so that there is a one-to-one correspondence between W and the unordered pairs of V. Then, add a special leaf a by joining it with the root of S' under a common parent p, and finally obtain S by adding another special leaf b, joining it with p under a common parent. Therefore, the species set is Σ = L(S) = W ∪ {a, b}.

For the construction of each gene tree G ∈ 𝒢, we ease notation by labeling each leaf g of G by s(g) directly (e.g. if we say that G is of the form (a, b), we mean that G has two leaves g_a, g_b such that s(g_a) = a and s(g_b) = b). In this manner, since all trees of 𝒢 contain only speciations, each tree G ∈ 𝒢 must be a subtree of S (or it is obtained from such a subtree by contracting edges). Also recall that we are assuming that each gene appears in at most one gene tree of 𝒢, and so the genes from two distinct trees must also be distinct (even if they share the same label).

In 𝒢, we first add k trees of the form (a, b), plus one tree G_i for each vertex v_i in H. The tree G_i corresponding to v_i ∈ V is a copy of S from which we remove every leaf except those w_{j,l} for which j = i or l = i, and v_j v_l ∈ E (i.e. we keep the leaves of W that correspond to an edge incident to v_i). We also contract the degree 2 nodes of G_i. Notice that if v_i v_j ∈ E and i < j, then both G_i and G_j contain a gene in the w_{i,j} species. Also, if v_i v_j ∉ E, then G_i and G_j have no genes from a common species.
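The construction of S and of the input gene trees can be sketched as follows. This is only an illustration under stated assumptions: trees are represented as nested Python tuples, species are named by strings such as "w1,2", leaves of gene trees are labeled directly by their species (as in the convention above), S' is arbitrarily taken to be a caterpillar, and all function names are hypothetical. An isolated vertex yields an empty tree, represented here by None.

```python
from itertools import combinations


def caterpillar(leaves):
    """Return an arbitrary binary tree (a caterpillar) over the given leaves."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = (tree, leaf)
    return tree


def restrict(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`.

    Nodes left with a single child are contracted; None means no leaf survives.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def species(i, j):
    """Name of the species w_{i,j} for the unordered pair {v_i, v_j}."""
    return f"w{min(i, j)},{max(i, j)}"


def build_instance(n, edges, k):
    """Build (S, ab_trees, vertex_trees) from a graph on vertices 1..n (n >= 2)."""
    W = [species(i, j) for i, j in combinations(range(1, n + 1), 2)]
    S = ((caterpillar(W), "a"), "b")           # S' joined with a, then with b
    ab_trees = [("a", "b") for _ in range(k)]  # k trees of the form (a, b)
    vertex_trees = {}
    for v in range(1, n + 1):
        incident = {species(i, j) for i, j in edges if v in (i, j)}
        vertex_trees[v] = restrict(S, incident)  # None for an isolated vertex
    return S, ab_trees, vertex_trees


# Hypothetical example: the path v1 - v2 - v3 with k = 2 colors.
S, ab_trees, vertex_trees = build_instance(3, [(1, 2), (2, 3)], 2)
print(S)
print(vertex_trees)
```

In this example the trees for v1 and v2 both contain a leaf in species w1,2 because v1 v2 is an edge, whereas the trees for v1 and v3 share no species.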

Claim 1: if H is k-colorable, then S and 𝒢 admit a supertree T having at most k − 1 duplications.

Let {V_1, ..., V_k} be a k-partition of V into independent sets. Take any h such that 1 ≤ h ≤ k. Recall that if v_i, v_j ∈ V_h, then G_i and G_j share no gene from a common species (since v_i v_j ∉ E). Thus the trees in 𝒢_h = {G_i : v_i ∈ V_h} are all disjoint in terms of species. Let Σ_h be the set of species that appear in some tree of 𝒢_h. Then, the tree S|Σ_h contains a copy of each tree in 𝒢_h, and none of these copies overlap. Obtain T_h by joining a gene labeled a to r(S|Σ_h) under a common parent p, then joining a gene labeled b to p under a new common parent. Now, T_h contains a copy of each tree in 𝒢_h and a copy of one of the (a, b) trees. By taking a binary tree with k leaves (where, at worst, each of the k − 1 internal nodes is a duplication), and replacing each leaf by one of the speciation trees T_1, ..., T_k, we obtain a gene supertree T, which preserves the speciations of 𝒢 and has at most k − 1 duplications.
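The construction used in Claim 1 can be checked on a small example. The sketch below assumes the usual LCA-mapping rule (a node is a duplication when it maps to the same node of S as one of its children), represents trees as nested tuples with gene leaves labeled by their species, and identifies each node of S with its set of descendant leaves. All function names, the species naming scheme and the handling of a color class without edges are illustrative assumptions.

```python
def restrict(tree, keep):
    """Restrict a nested-tuple tree to leaves in `keep`, contracting unary nodes."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def leaf_set(tree):
    """Set of species labels below a node (leaves are labeled by species)."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree):
    """Leaf sets of all nodes of a tree; each one stands for a node of S."""
    out = [leaf_set(tree)]
    if isinstance(tree, tuple):
        for child in tree:
            out.extend(clusters(child))
    return out


def lca_map(node, s_clusters):
    """s(node): the smallest cluster of S containing the species below node."""
    return min((c for c in s_clusters if leaf_set(node) <= c), key=len)


def count_duplications(gene_tree, s_clusters):
    """Count internal nodes mapped to the same node of S as one of their children."""
    if not isinstance(gene_tree, tuple):
        return 0
    s_x = lca_map(gene_tree, s_clusters)
    is_dup = any(lca_map(child, s_clusters) == s_x for child in gene_tree)
    return int(is_dup) + sum(count_duplications(c, s_clusters) for c in gene_tree)


def supertree_from_coloring(S, edges, color_classes):
    """Build T_h = ((S|Sigma_h, a), b) per color class, then join the T_h."""
    parts = []
    for V_h in color_classes:
        sigma_h = {f"w{min(i, j)},{max(i, j)}"
                   for i, j in edges if i in V_h or j in V_h}
        core = restrict(S, sigma_h)
        # Assumption: a class with no incident edge contributes only (a, b).
        parts.append(("a", "b") if core is None else ((core, "a"), "b"))
    T = parts[0]
    for T_h in parts[1:]:
        T = (T, T_h)  # each joining node may be a duplication
    return T


# Hypothetical example: path v1 - v2 - v3, colored with classes {v1, v3} and {v2}.
S = (((("w1,2", "w1,3"), "w2,3"), "a"), "b")
T = supertree_from_coloring(S, [(1, 2), (2, 3)], [{1, 3}, {2}])
print(count_duplications(T, clusters(S)))  # 1, i.e. k - 1 for k = 2
```

Only the node joining the two T_h subtrees is counted as a duplication here, in line with the k − 1 bound of the claim.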

Claim 2: if S and 𝒢 admit a supertree T having at most k − 1 duplications, then H is k-colorable.

We first show that if T has at most k − 1 duplications, then it must have exactly k speciations mapped to r(S). It cannot have more, as there would then be more than k − 1 duplications. Suppose instead that there are k' < k such speciations, and denote them x_1, ..., x_{k'}. Note that there must be at least k' − 1 duplications among the ancestors of the x_i. Now, for 1 ≤ i ≤ k', T_{x_i} must contain a certain number of copies of a and b. Let m_i(a) and m_i(b) denote, respectively, the number of copies of a and b contained in T_{x_i}, noting that in total, there are k copies of each since there are k subtrees of the form (a, b) in 𝒢. Since x_i is a speciation mapped to r(S), it separates the a copies from the b copies, thus the T_{x_i} subtree must contain at least m_i(a) − 1 + m_i(b) − 1 duplications. Denote by d(T) the number of duplications in T. It follows that

$$d(T) \geq k' - 1 + \sum_{i=1}^{k'} \bigl(m_i(a) + m_i(b) - 2\bigr) = k' - 1 - 2k' + \sum_{i=1}^{k'} m_i(a) + \sum_{i=1}^{k'} m_i(b) = -k' - 1 + k + k = 2k - k' - 1 > k - 1$$

when k' < k, a contradiction.
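As a quick sanity check of this counting argument, consider the hypothetical case k = 3 and k' = 2 (values chosen purely for illustration). However the three copies of a and the three copies of b are distributed between the two subtrees rooted at the speciations mapped to r(S), the bound evaluates to:

```latex
d(T) \;\geq\; (k' - 1) + \sum_{i=1}^{k'} \bigl(m_i(a) + m_i(b) - 2\bigr)
     \;=\; 1 + (3 + 3 - 2 \cdot 2) \;=\; 3 \;>\; 2 \;=\; k - 1 .
```

So a tree with only two speciations mapped to r(S) would already need at least three duplications, exceeding the allowed k − 1 = 2.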

Now, we can let x_1, ..., x_k be the k speciation nodes of T mapped to r(S). The k − 1 duplications of T must then all be ancestors of the x_i, and they are all mapped to r(S). Therefore the subtrees T_{x_1}, ..., T_{x_k} each contain only speciations. For any G_i ∈ 𝒢 corresponding to v_i, one of the T_{x_h} must contain the copy of G_i (for otherwise, the root of the copy of G_i in T would be a duplication, while it should be a speciation). Take any h such that 1 ≤ h ≤ k. We claim that V_h = {v_i : T_{x_h} contains the copy of G_i} forms an independent set. Since T_{x_h} contains only speciations, it cannot contain two genes from the same species. Thus for any G_i, G_j contained in T_{x_h} we must have v_i v_j ∉ E, as otherwise G_i and G_j would share a gene from the same species. Therefore V_h is an independent set. Thus {V_1, ..., V_k} forms a k-coloring of H, and the proof is completed.   □

It is interesting to note that this does not show the NP-hardness of the special case in which the input trees are only triplets. Indeed, a tree G_i created in this reduction has as many leaves as the number of neighbors of its corresponding vertex v_i. Therefore, if H is a cubic graph (i.e. 3-regular), one can generate an input with only triplets. However, deciding if a cubic graph is k-colorable can be done in linear time, and thus the triplets case cannot be shown NP-hard through this reduction. The 3-colorability problem is NP-hard on 4-regular graphs, though, which shows the NP-hardness of the problem on input trees having at most 4 leaves.

Conclusion

We introduce the supergenetree problem which aims at constructing a supertree that displays a set of input gene trees while minimizing the reconciliation cost with respect to an input species tree. This problem is a natural formulation of the question of combining a set of gene trees obtained for subsets of a gene family into a full gene tree for the whole gene family.

The supergenetree problem is an extension of the classical supertree problem on a set of input leaf-labeled trees, where the input trees are gene trees and a species tree is used in order to evaluate the reconciliation/duplication cost of a supergenetree. We show how existing exact and greedy heuristic algorithms for the supertree problem can be used to devise approaches for solving the supergenetree problem. Like the original supertree algorithms, the resulting approaches have exponential worst-case time complexity.

We show that the supergenetree problem for the duplication cost is NP-hard to approximate within a factor essentially better than n, and this complexity remains the same even when the problem is restricted, in a greedy approach, to finding a supertree with a minimum number of duplications before each speciation of the species tree. We also consider a restriction of the supergenetree problem relevant to many biological applications where subsets of orthologs are studied separately and then amalgamated into a single tree. Even this restriction is shown to be NP-Complete. The reconciliation cost remains to be studied, although we conjecture that all of the above-mentioned problems are hard in this case also.

These negative complexity results are not surprising, though, as they extend an already large set of problems related to supertrees that are known to be NP-hard. We think that appropriate heuristics for various classes of input trees are worth considering in future projects. Removing the assumption that the input gene trees are compatible would also lead to new interesting problems. A promising avenue would be to consider constructive FPT algorithms that can be integrated in greedy heuristics or dynamic programming algorithms. Other restrictions on the input gene trees can also be explored, hopefully leading to polynomially solvable problems. Constructing gene trees by amalgamating smaller trees for subsets of orthologous genes is a natural way of constructing large trees that would benefit from a thorough theoretical and algorithmic analysis.