Background

Genes collectively form the organism’s genomes and can be viewed as “atomic” units whose evolutionary history forms a tree. The history of species, which is also a tree, and the history of their genes is intimately linked, since the gene trees evolve along the species tree. A detailed evolutionary scenario, therefore, consists of a gene tree, a species tree and a reconciliation map \(\mu\) that describes how the gene tree is embedded into the species tree.

A reconciliation map assigns vertices of the gene tree to the vertices or edges in the species in such a way that (partial) ancestor relations given by the genes are preserved by the map \(\mu\). This gives rise to three important events that may act on the genes through evolution: speciation, duplication, and horizontal gene transfer (HGT) [1, 2]. Inner vertices of the species tree represent speciation events. Hence, vertices of the gene tree that are mapped to inner vertices in the species tree underlay a speciation event and are transmitted from the parent species into the daughter species. If two copies from a single ancestral gene are formed and reside in the same species, then a duplication event happened. Contrary, if one of the copies of a gene “jumps” into a different branch of the species tree, then a HGT event happened. The latter can be annotated in the gene tree by associating a label to the edge that points from the horizontal transfer event to the transferred copy [3,4,5,6,7]. Since both HGT and duplication events occur in between different speciation events, such vertices of the gene trees are usually mapped to the edges of the species tree. The events speciation, duplication, and HGT classify pairs of genes as orthologs, paralogs and xenologs, respectively [2].

To some extent, these relations can be estimated directly from sequence data using a variety of algorithmic approaches that are based on the pairwise best match criterion [8,9,10,11] and hence do not require any a priori knowledge of the topology of either the gene tree or the species tree. Practical workflows for orthology assignment directly use pairwise best matches as an initial estimate of orthologous gene pairs. Many of the commonly used methods for orthology-identification, such as OrthoMCL [12], ProteinOrtho [13, 14], OMA [15], or eggNOG [16], belong to this class, see also [17, 18]. While best match heuristics have been very successful as approximations of the orthology relation [19, 20], no comparable approach to extract the xenology relations directly from (dis)similarity data has been devised to-date. Nevertheless, there are several methods to detect xenologs in a genome that use sequence features rather than phylogenetic reconstructions, see e.g. [21,22,23,24,25,26].

Provided that such event-relations are available, one can infer the history of event-labeled gene trees without HGT [27,28,29,30,31,32] or with HGT [5, 7, 33]. Moreover, depending on the quality and biological feasibility of the reconstructed event-labeled gene trees, a species trees can also be reconstructed [3, 34, 35]. This line of research, in particular, has been very successful for the reconstruction of event-labeled gene trees and species trees based solely on the information of orthologous and paralogous gene pairs [36]. Note that in practice, inferred event-relations are likely to contain errors. However, characterizing relations that correspond to a valid history is a crucial step in devising error-correction algorithms, as this may lead to practical heuristic approaches to modify noisy event-relations into valid ones.

In this paper, we assume that the gene tree T and the types of evolutionary events on T are known. For an event-labeled gene tree to be biologically feasible there must be a putative “true” history that can explain the inferred gene tree. However, in practice it is not possible to observe the entire evolutionary history as e.g. gene losses eradicate the entire information on parts of the history. Therefore, the problem of determining whether an event-labeled gene tree is biologically feasible is reduced to the problem of finding a valid reconciliation map, also known as DTL-scenario [37,38,39]. The aim is then to find the unknown species tree S and reconciliation map between T and S, if one exists. Not all event-labeled gene trees T, however, are biologically feasible in the sense that that there exists a species tree S such that T can be reconciled with S. In the absence of HGT, biologically feasibility can be characterized in terms of “informative” triplets (rooted binary trees on three leaves) that are displayed by the gene trees [35]. In the presence of HGT, such triplets give at least necessary conditions for a gene tree being biologically feasible [3].

A particular difficulty that occurs in the presence of HGT is that gene trees with HGT must be mapped to species trees only in such a way that genes do not travel back in time. To be more precise, the ancestor ordering of the vertices in a species tree give rise to a relative timing information of the species within the species trees. Within this context, speciation and duplication events can be considered as a vertical evolution, that is, the genetic material is transferred “forward in time”. In contrast, HGT literally yield horizontal evolution, that is, genetic material is transferred such that a gene and its transferred copy coexist. Nøjgaard et al. [4] introduced an axiomatic framework for time-consistent reconciliation maps and characterize for given event-labeled gene trees T and a given species tree S whether there exists a time-consistent reconciliation map or not. This characterization resulted in an \(O(|V|\log |W|)\)-time algorithm to construct a time-consistent reconciliation map if one exists, where V and W are the vertex sets of T and S, respectively.

However, one of the crucial open questions that were left open within this context is as follows: For a given event-labeled gene tree that contains speciation and duplication vertices and HGT edges, does there exist a polynomial-time algorithm to reconstruct the unknown species tree together with a time-consistent reconciliation map, if one exists?

In this contribution, we show that the answer to this problem is affirmative and provide an \(O(n^3)\) time algorithm, with n being the number of leaves of T, that allows us to verify whether there is a time-consistent species S for the event-labeled gene tree and, in the affirmative case, to construct S.

We note in passing that there could be exponentially many species trees, for each of them there may be a time-consistent reconciliation map or not for a given event-labeled gene tree. Moreover, many types of reconciliation problems become NP-hard as soon as HGT and time-consistency are involved, see e.g. [37, 40,41,42,43,44,45,46]. In contrast, we show that the problem of finding a time-consistent species tree for a given event-labeled gene tree can be solved in polynomial-time.

This paper is organized as follows. We first provide a short survey of preliminary results that have been establised so far and therein, provide the concepts and basic notation we need including important results on gene and species tree, reconciliation maps and time-consistency. We then proceed in Section "Gene tree consistency" (GTC) to formally introduce the problem of finding a time-consistent species for a given event-labeled gene tree. As a main result, we will see that it suffices to start with a fully unresolved species tree that can then be stepwisely extended to a binary species tree to obtain a solution to the GTC problem, provided a solution exists. In Section "An algorithm for the GTC problem", we provide a solution to the GTC problem. For the design of this algorithm, we will utilize an auxiliary directed graph \(A({T},{S})\) based on a given event-labeled gene tree T and a given species tree S. This type of graph was established in [4]. The authors showed that there is time-consistent map between T and S if and only if \(A({T},{S})\) is acyclic. Our algorithm either reconstructs a species tree S based on the informative triplets that are displayed by the gene trees and that makes this graph \(A({T},{S})\) eventually acyclic or that returns that no solution exists. The strategy of our algorithm is to construct \(A({T},{S})\) starting with S being a fully unresolved species tree and stepwisely resolve this tree in a way that it “agrees” with the informative triplets and reduces the cycles in \(A({T},{S})\).

Since the material is rather extensive (many of the proofs use elementary graph theory but are very technical) we subdivided the presentation in a main narrative text explaining the main results and a second technical part (see Appendix: “Proof of results”) collecting the proofs of the main results as well as additional technical results.

Short survey of existing results

Notation and basic definitions

Unless stated otherwise, all graphs in this work are assumed to be directed without explicit mention. For a graph G, the subgraph induced by \(X \subseteq V(G)\) is denoted G[X]. For a subset \(Q \subseteq V(G)\), we write \(G - Q = G[V(G) \setminus Q]\). We will write (ab) and ab for the edges that link \(a,b\in V(G)\) of directed, resp., undirected graphs.

All trees in this work are rooted and edges are directed away from the root. Given a tree T, a vertex \(v \in V(T)\) is a leaf if v has out-degree 0, and an internal vertex otherwise. We write \(L(T)\) to denote the set of leaves of T. A star tree is a tree with only one internal vertex that is adjacent to the leaves.

We write \(x \preceq _{T} y\) if y lies on the unique path from the root to x, in which case y is called an ancestor of x and x is called a descendant of y. We may also write \(y \succeq _{T} x\) instead of \(x \preceq _{T} y\). We use \(x \prec _T y\) for \(x \preceq _{T} y\) and \(x \ne y\). In the latter case, y is a strict ancestor of x. If \(x \preceq _{T} y\) or \(y \preceq _{T} x\) the vertices x and y are comparable and, otherwise, incomparable. If (xy) is an edge in T, and thus, \(y \prec _{T} x\), then x is the parent of y and y the child of x. We denote with \(\mathrm {ch}(x)\) the set of all children of x. It will be convenient for the discussion below to extend the ancestor relation \(\preceq _T\) on V to the union of the edge and vertex sets of T. More precisely, for a vertex \(x\in V(T)\) and an edge \(e=(u,v)\in E(T)\) we put \(x \prec _T e\) if and only if \(x\preceq _T v\) and \(e \prec _T x\) if and only if \(u\preceq _T x\). For edges \(e=(u,v)\) and \(f=(a,b)\) in T we put \(e\preceq _T f\) if and only if \(v \preceq _T b\).

For a subset \(X \subseteq V(T)\), the lowest common ancestor \({\text {lca}}_{T}(X)\) is the unique \(\preceq _T\)-minimal vertex that is an ancestor of all vertices in X in T. For simplicity, we often write \({\text {lca}}_T(x,y)\) instead of \({\text {lca}}_T(\{x,y\})\).

A vertex is binary if it has 2 children, and T is binary if all its internal vertices are binary. A cherry is an internal vertex whose children are all leaves (note that a cherry may have more than two children). A tree T is almost binary if its only non-binary vertices are cherries. For \(v \in V(T)\), we write T(v) to denote the subtree of T rooted at v (i.e. the tree induced by v and its descendants).

A rooted triplet, or triplet for short, is a binary tree with three leaves. We write ab|c to denote the unique triplet on leaf set \(\{a,b,c\}\) in which the root is \({\text {lca}}(a,c) = {\text {lca}}(b,c)\). We say that a tree T displays a triplet ab|c if \(a,b,c \in L(T)\) and \({\text {lca}}_T(a, b) \prec {\text {lca}}_T(a,c) = {\text {lca}}_T(b,c)\). We write rt(T) to denote the set of rooted triplets that T displays. Given a set of triplets R, we say that T displays R if \(R \subseteq rt(T)\). A set of triplets R is compatible, if there is a tree that displays R. We also say that T agrees with R if, for every \(ab|c \in R\), \(ac|b \notin rt(T)\) and \(bc|a \notin rt(T)\).

Remark 1

The term “agree” is more general than the term “display” and “compatible”, i.e., if T displays R (and thus, R is compatible), then T must agree with R. The converse, however, is not always true. To see this, consider the star tree T , i.e., \(rt(T) = \emptyset\), and let \(R=\{ab|c,bc|a\}\). It is easy to verify that R is incompatible since there cannot be any tree that displays both triplets in R. However, the set R agrees with T.

We will consider rooted trees \(T=(V,E)\) from which particular edges are removed. Let \(\mathcal {E}_T \subseteq E\) and consider the forest \(T_{\mathcal {\overline{E}}}{:}{=}(V,E\setminus \mathcal {E}_T)\). We can preserve the order \(\preceq _T\) for all vertices within one connected component of \(T_{\mathcal {\overline{E}}}\) and define \(\preceq _{T_{\mathcal {\overline{E}}}}\) as follows: \(x\preceq _{T_{\mathcal {\overline{E}}}}y\) iff \(x\preceq _{T}y\) and xy are in same connected component of \(T_{\mathcal {\overline{E}}}\). Since each connected component \(T'\) of \(T_{\mathcal {\overline{E}}}\) is a tree, the ordering \(\preceq _{T_{\mathcal {\overline{E}}}}\) also implies a root \(\rho _{T'}\) for each \(T'\), that is, \(x\preceq _{T_{\mathcal {\overline{E}}}} \rho _{T'}\) for all \(x\in V(T')\). If \(L(T_{\mathcal {\overline{E}}})\) is the leaf set of \(T_{\mathcal {\overline{E}}}\), we define \(L_{T_{\mathcal {\overline{E}}}}(x) = \{y\in L(T_{\mathcal {\overline{E}}}) \mid y\preceq _{T_{\mathcal {\overline{E}}}} x\}\) as the set of leaves in \(T_{\mathcal {\overline{E}}}\) that are reachable from x. Hence, all \(y\in L_{T_{\mathcal {\overline{E}}}}(x)\) must be contained in the same connected component of \(T_{\mathcal {\overline{E}}}\). We say that the forest \(T_{\mathcal {\overline{E}}}\) displays a triplet r, if r is displayed by one of its connected components. Moreover, \(rt(T_{\mathcal {\overline{E}}})\) denotes the set of all triplets that are displayed by the forest \(T_{\mathcal {\overline{E}}}\).

The restriction \(T|_X\) of T to \(X\subseteq L(T)\), is the tree with leaf set X that is obtained from T by first taking the minimal subtree of T with leaf set X and then suppressing all vertices of degree two with the exception of the root of \(T|_X\).

Gene and species trees

Let \(\Gamma\) and \(\Sigma\) be a set of genes and a set of species, respectively. Moreover, we assume we know the gene-species association, i.e., a surjective map \(\sigma : \Gamma \rightarrow \Sigma\). A species tree is a tree S such that \(L(S) \subseteq \Sigma\). A gene tree is a tree T such that \(L(T) \subseteq \Gamma\). Note that \(\sigma (l)\) is defined for every leaf \(l \in L(T)\). We extend \(\sigma\) to vertices of T and put \(\sigma _T(v) = \{\sigma (l) :l \in L(T(v))\}\). We may drop the T subscript whenever there is no risk of confusion. For \(\mathcal {E}_T \subseteq E\), the \(\sigma\) notation also extends to \(\sigma _{T_{\mathcal {\overline{E}}}}\), i.e. we write \(\sigma _{T_{\mathcal {\overline{E}}}}(u){:}{=}\{ \sigma (l) :{:} l \in L_{T_{\mathcal {\overline{E}}}}(u) \}\). We emphasize that species and gene trees need not to be binary, which are used here to model incomplete knowledge of the exact gene phylogenies.

Given a gene tree T, we assume knowledge of a labeling function \({t : V(T) \cup E(T) \rightarrow \{\odot , \mathfrak {s}, \mathfrak {d}, \mathfrak {t}\} \cup \{0, 1\}}\). We require that \(t(v) \in \{\odot , \mathfrak {s}, \mathfrak {d}, \mathfrak {t}\}\) for all \({v \in V(T)}\) and \(t(e) \in \{0,1\}\) for all \(e \in E(T)\). Each symbol represents a different vertex type: \(\odot\) are leaves, \(\mathfrak {s}\) are speciations, \(\mathfrak {d}\) are duplications and \(\mathfrak {t}\) indicates vertices from which a horizontal gene transfer started. Edges labeled by 1 represent horizontal transfers and edges labeled by 0 represent vertical descent. Here, we always assume that only edges (xy) for which \(t(x) = \mathfrak {t}\) might be labeled as transfer edge; \(t(x,y)=1\). We let \(\mathcal {E}_T = \{e \in E(T) :t(e) = 1\}\) be the set of transfer edges. We also require that \(t(u) = \odot\) if and only if \(u \in L(T)\).

We write \((T; t, \sigma )\) to denote a gene tree T labeled by t having gene-species mapping \(\sigma\).

In what follows we will only consider labeled gene trees \((T;t,\sigma )\) that satisfy the following three axioms:

(O1):

Every internal vertex v has out-degree at least 2.

(O2):

Every transfer vertex x has at least one transfer edge \(e=(x,v)\) labeled \(t(e)=1\), and at least one non-transfer edge \(f=(x,w)\) labeled \(t(f)=0\);

(O3):

(a) If \(x\in V(T)\) is a speciation vertex with children \(v_1,\dots ,v_k\), \(k\ge 2\), then \(\sigma _{T_{\mathcal {\overline{E}}}}(v_i) \cap \sigma _{T_{\mathcal {\overline{E}}}}(v_j) =\emptyset\), \(1\le i<j\le k\);

(b) If \((x,y) \in \mathcal {E}_T\), then \(\sigma _{T_{\mathcal {\overline{E}}}}(x)\cap \sigma _{T_{\mathcal {\overline{E}}}}(y) = \emptyset .\)

These conditions are also called “observability-axioms” and are exhaustively discussed in [3] and [4]. We repeat here shortly the arguments to justify Condition (O1–O3). Usually the considered labeled gene trees are obtained from genomic sequence data. Condition (O1) ensures that every inner vertex leaves a historical trace in the sense that there are at least two children that have survived. If this were not the case, we would have no evidence that vertex v ever exist. Condition (O2) ensures that for an HGT event a historical trace remains of both the transferred and the non-transferred copy. Furthermore, there is no clear evidence for a speciation vertex v if it does not “separate” lineages, which is ensured by Condition (O3.a). Finally (O3.b) is a simple consequence of the fact that if a transfer edge (xy) in the gene tree occurred, then the species X and Y that contain x and y, respectively, cannot be ancestors of each other, as otherwise, the species X and Y would not coexist (cf. [4, Prop. 1]).

We emphasize that Lemma 1 in [4] states that the leaf set \(L_1,\dots ,L_k\) of the connected components \(T_1,\dots ,T_k\) of \(T_{\mathcal {\overline{E}}}\) forms a partition of L(T), which directly implies that \(\sigma _{T_{\mathcal {\overline{E}}}}(x) \ne \emptyset\) for all \(x\in V(T)\).

Remark 2

In what follows we always assume that a species tree satisfies (O1). This condition is used to ensure that every species tree is a so-called phylogenetic tree [47].

However, as it is possible that gene duplications and losses predate the first speciation event, we may model the species tree S as a planted tree, that is, there is an additional vertex \(x\succ \rho _S = {\text {lca}}(L(S))\) with unique child \(\rho _S\). However, for our technical results below, this planted root is not of further importance.

By slight abuse of notation and to keep the upcoming proofs simple, we still call \(\rho _S = {\text {lca}}(L(S))\) the root of S and the parent of \(\rho _S\) in S, the planted root of S.

Reconciliation maps and speciation triplets

The “embedding” of the gene tree into the species tree is formalized by a reconciliation map from \((T;t,\sigma )\)to S, that is, a map \(\mu : V(T) \rightarrow V(S)\cup E(S)\) that satisfies the following constraints for all \(x\in V(T)\):

(M1):

Leaf constraint. If \(x\in \Gamma\), then \(\mu (x)=\sigma (x)\).

(M2):

Event constraint.

  1. (i)

    If \(t(x)=\mathfrak {s}\), then \(\mu (x) = {\text {lca}}_S(\sigma _{T_{\mathcal {\overline{E}}}}(x))\).

  2. (ii)

    If \(t(x) \in \{\mathfrak {d}, \mathfrak {t}\}\), then \(\mu (x)\in E(S)\).

  3. (iii)

    If \(t(x)=\mathfrak {t}\) and \((x,y)\in \mathcal {E}_T\), then \(\mu (x)\) and \(\mu (y)\) are incomparable in S.

  4. (iv)

    If \(t(x)=\mathfrak {s}\), then \(\mu (u)\) and \(\mu (v)\) are incomparable in S for all distinct \(u,v\in \mathrm {ch}(x)\).

(M3):

Ancestor constraint. Let \(x,y\in V(T)\) with \(x\prec _{T_{\mathcal {\overline{E}}}} y\). Note, the latter implies that the path connecting x and y in T does not contain transfer edges. We distinguish two cases:

  1. (i)

    If \(t(x),t(y)\in \{\mathfrak {d}, \mathfrak {t}\}\), then \(\mu (x)\preceq _S \mu (y)\),

  2. (ii)

    otherwise, i.e., at least one of t(x) and t(y) is a speciation \(\mathfrak {s}\), \(\mu (x)\prec _S\mu (y)\).

We call \(\mu\) the reconciliation map from \((T;t,\sigma )\) to S. The provided definition of a reconciliation map coincides with the one as given in [3, 4, 48] and is a natural generalization of the maps as in [35, 37, 38, 49] for the case where no HGT took place.

The question arises when for a given gene tree \((T;t,\sigma )\) a species tree S together with a reconciliation map \(\mu\) from \((T;t,\sigma )\) to S exists. An answer to this question is provided by

Definition 1

(Informative triplets) Let \((T;t,\sigma )\) be an event-labeled gene tree. The set \(\mathcal {R}(T;t,\sigma )\) is the set of triplets \(\sigma (a)\sigma (b)|\sigma (c)\) where \(\sigma (a),\sigma (b),\sigma (c)\) are pairwise distinct and either

  1. 1

    ab|c is a triplet displayed by \(T_{\mathcal {\overline{E}}}\) and \(t({\text {lca}}_{T_{\mathcal {\overline{E}}}} (a, b, c)) = \mathfrak {s}\) or

  2. 2

    \(a, b \in L(T_{\mathcal {\overline{E}}}(x))\) and \(c \in L(T_{\mathcal {\overline{E}}}(y))\) for some transfer edge (xy) or (yx) in \(\mathcal {E}_T\)

Theorem 1

([3]) Let \((T;t,\sigma )\)be a labeled gene tree. Then, there is a species tree S together with a reconciliation map \(\mu\)from \((T;t,\sigma )\)to S if and only if \(\mathcal {R}(T;t,\sigma )\)is compatible. In this case, every species tree S that displays \(\mathcal {R}(T;t,\sigma )\)can be reconciled with \((T;t,\sigma )\).

Moreover, there is a polynomial-time algorithm that returns a species tree S for \((T;t, \sigma )\)together with a reconciliation map \(\mu\)in polynomial time, if one exists and otherwise, returns that there is no species tree for \((T;t, \sigma )\).

It has been shown in [3], that if there is any reconciliation map from \((T;t,\sigma )\) to S, then there is always a reconciliation map \(\mu\) that additionally satisfies for all \(u\in V(T)\) with \(t(u)\in \{\mathfrak {d},\mathfrak {t}\}\):

$$\begin{aligned} \mu (u) = (v,{\text {lca}}_S(\sigma _{T_{\mathcal {\overline{E}}}}(u)))\in E(S) \end{aligned}$$

where v denotes the unique parent of \({\text {lca}}_S(\sigma _{T_{\mathcal {\overline{E}}}}(u))\) in S. As a consequence, we consider the following simplification.

Definition 2

The LCA-map \(\widehat{\mu }_{T,S}:V(T) \rightarrow V(S)\) associates every vertex \(v \in V(T)\) to the lowest common ancestor of \(\sigma _{T_{\mathcal {\overline{E}}}}(v)\), i.e., \(\widehat{\mu }_{T, S}(v) {:}\,{=}\,lca_{S}(\sigma _{T_{\mathcal {\overline{E}}}}(v))\)

Remark 3

Note that if v is a leaf of T, we have \(\widehat{\mu }_{T, S}(v) = \sigma (v)\). Moreover, the LCA-map \(\widehat{\mu }_{T, S}\) always exists and is uniquely defined, although there might be no reconciliation map from \((T;t,\sigma )\) to S.

We may write \(\widehat{\mu }, \widehat{\mu }_{T}\) or \(\widehat{\mu }_{S}\) if T and/or S are clear from the context.

Compatibility of \(\mathcal {R}(T; t,\sigma )\) provides a necessary condition for the existence of biologically feasible reconciliation, i.e., maps that are additionally time-consistent. To be more precise:

Definition 3

(Time map) The map \(\tau _T:V(T) \rightarrow {\mathbb {R}}\) is called a time map for the rooted tree T if \(x\prec _T y\) implies \(\tau _T(x)>\tau _T(y)\) for all \(x,y\in V(T)\).

Definition 4

(Time-consistent) A reconciliation map \(\mu\) from \((T;t,\sigma )\) to S is time-consistent if there are time maps \(\tau _T\) for T and \(\tau _S\) for S satisfying the following conditions for all \(u\in V(T)\):

(B1):

If \(t(u) \in \{\mathfrak {s}, \odot \}\), then \(\tau _T(u) = \tau _S(\mu (u))\).

(B2):

If \(t(u)\in \{\mathfrak {d},\mathfrak {t}\}\) and, thus \(\mu (u)=(x,y)\in E(S)\), then \(\tau _S(y)>\tau _T(u)>\tau _S(x)\).

If a time-consistent reconciliation map from \((T;t,\sigma )\) to S exists, we also say that S is a time-consistent species tree for \((T;t,\sigma )\).

Figure 1 gives an example for two different species trees that both display \(\mathcal {R}(T;t,\sigma )\) for which only one admits a time-consistent reconciliation map. Further examples can be found in [3, 4].

Fig. 1
figure 1

Taken from [3, Fig. 4]. From the binary gene tree \((T;t,\sigma )\) (right) we obtain the species triplets \(\mathcal {R}(T;t,\sigma ) = \{AB|D,AC|D\}\) . Note, vertices v of T with \(t(v)=\mathfrak {s}\) and \(t(v)=\mathfrak {t}\) are highlighted by “\(\bullet\)” and “\(\triangle\)”, respectively. Transfer edges are marked with an “arrow”. Shown are two (tube-like) species trees (left and middle) where planted roots are omitted that display \(\mathcal {R}(T;t,\sigma )\). Thus, Theorem 1 implies that for both trees a reconciliation map from \((T;t,\sigma )\) exists. The respective reconciliation maps for \((T;t,\sigma )\) and the species tree are given implicitly by drawing \((T;t,\sigma )\) within the species tree. The left species tree S is least resolved for \(\mathcal {R}(T;t,\sigma )\). The reconciliation map from \((T;t,\sigma )\) to S is unique, however, not time-consistent. Thus, no time-consistent reconciliation between T and S exists at all. On the other hand, for T and the middle species tree (that is a refinement of S) there is a time-consistent reconciliation map

Auxiliary graph construction

When the species tree is known, one can efficiently determine whether a time-consistent map for a given gene G and species tree S exists. We will use an auxiliary graph as defined in [4], and will investigate the structure of this graph in the remaining part of this section. Intuitively, this graph exhibits timing information between genes and species. That is, the gene tree events allow us to determine constraints of the form “x must have existed before y”, and each such constraint is represented by an arc from x to y. As it turns out, the absence of cycles in the resulting graph is necessary and sufficient to determine the existence of a time-consistent map for a given event-labeled gene tree and a given species tree.

Let \((T;t,\sigma )\) be a labeled gene tree and S be a species tree. Let \(A({T},{S})\) be the graph with vertex set \(V(A({T},{S})) = V(T) \cup V(S)\), and edge set \(E(A({T},{S}))\) constructed from four sets as follows:

(A1):

For each \((u, v) \in E(T)\), we have \((u', v') \in E(A({T},{S}))\), where

$$\begin{aligned} u' = {\left\{ \begin{array}{ll} \widehat{\mu }(u) &{}\hbox { if}\ t(u) \in \{\odot , \mathfrak {s}\} \\ u &{}\text{ otherwise } \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} v' = {\left\{ \begin{array}{ll} \widehat{\mu }(v) &{}\hbox { if}\ t(v) \in \{\odot , \mathfrak {s}\} \\ v &{}\text{ otherwise } \end{array}\right. } \end{aligned}$$
(A2):

For each \((x, y) \in E(S)\), we have \((x, y) \in E(A({T},{S}))\)

(A3):

For each u with \(t(u) \in \{\mathfrak {d}, \mathfrak {t}\}\), we have \((u, \widehat{\mu }(u)) \in E(A({T},{S}))\)

(A4):

for each \((u, v) \in \mathcal {E}_{T}\), we have \((lca_S(\widehat{\mu }(u), \widehat{\mu }(v)), u) \in E(A({T},{S}))\)

We are aware of the fact that the graph \(A({T},{S})\) heavily depends on the event-labeling t and the species assignment \(\sigma\) of the gene tree \((T;t,\sigma )\). However, to keep the notation simple we will write, by slight abuse of notation, \(A({T},{S})\) instead of the more correct notation \(A({(T;t,\sigma )},{S})\). The \(A({T},{S})\) graph has four types of edges, and we shall refer to them as the A1, A2, A3-and A4-edges, respectively. We note for later reference that if (xy) is an A1-edge such that \(x, y \in V(S)\), then we must have \(y \preceq _S x\) which follows from the definition of \(\widehat{\mu }_{T,S}\) and the fact that \(\sigma _{T_{\mathcal {\overline{E}}}}(y) \subseteq \sigma _{T_{\mathcal {\overline{E}}}}(x)\).

We emphasize, that the definition of \(A({T},{S})\) slightly differs from the one provided in [4]. While Properties (A2), (A3) and (A4) are identical, (A1) was defined in terms of a reconciliation map \(\mu\) from \((T;t,\sigma )\) to S in [4]. To be more precise, in [4] it is stated \(u' = \mu (u)\) and \(v' = \mu (v)\) for speciation vertices or leaves u and v instead of \(u' = \widehat{\mu }(u)\) and \(v' = \widehat{\mu }(v)\), respectively. However, Condition (M1) and (M2.i) imply that \(\mu (u)=\widehat{\mu }(u)\) and \(\mu (v)=\widehat{\mu }(v)\) provided \(\mu\) exists. In other words, the definition of \(A({T},{S})\) here and in [4] are identical, in case a reconciliation map \(\mu\) exists.

Since we do not want to restrict ourselves to the existence of a reconciliation map (a necessary condition is provided by Theorem 1) we generalized the definition of \(A({T},{S})\) in terms of \(\widehat{\mu }\) instead.

For later reference, we summarize the latter observations in the following remark.

Remark 4

The graph \(A({T},{S})\) does not explicitly depend on a reconciliation map. That is, even if there is no reconciliation map at all, \(A({T},{S})\) is always well-defined.

The graph \(A({T},{S})\) will be utilized to characterize gene-species tree pairs that admit a time-consistent reconciliation map. For a given gene tree \((T;t,\sigma )\) and a given species tree S, the existence of a time-consistent reconciliation map can easily be verified in polynomial time.

Theorem 2

([3, 4]) Let \((T;t,\sigma )\)be a labeled gene tree and S be a species tree. Then T admits a time-consistent reconciliation map with S if and only if S displays every triplet of \(\mathcal {R}(T; t,\sigma )\)and \(A({T},{S})\)is acyclic.

Recognition and reconstruction of a time-consistent reconciliation map can then be done in \(O(|V(T)|\log (|V(S)|))\) time.

In Appendix: "Lemma 3 and Proof of Lemma 1", we provide Lemma 3 that is rather technical, but essentially implies the following

Remark 5

If S has a leaf that forms a self-loop in \(A({T},{S})\), then we may immediately discard S as it cannot have a solution (since any refinement will have this self-loop). For the rest of the section, we will therefore assume that no leaf of S belongs to a self-loop.

Gene tree consistency

The main question of interest of this work is to determine whether a species tree S exists at all for a labeled gene tree T. Here, we solve a slightly more general problem: the one of refining a given almost binary species tree S so that T can be reconciled with it. The idea of our approach is to refine S in a step-wise manner using extension moves, as defined below.

Definition 5

(Extension) Let x be a vertex of a tree T with \(\mathrm {ch}(x) = \{x_1, \ldots , x_k\}\), \(k\ge 3\) and suppose that \(X' \subset \mathrm {ch}(x)\) is a strict subset of \(\mathrm {ch}(x)\).

Then, the \((x, X')\)extension modifies T to the tree \(T_{x,X'}\) as follows: If \(|X'| \le 1\), then put \(T_{x,X'}=T\). Otherwise, remove the edges \((x, x')\) for each \(x' \in X'\) from T and add a new vertex y together with the edges (xy) and \((y,x')\) for all \(x'\in X'\) to obtain the tree \(T_{x,X'}\).

Given two trees T and \(T'\), we say that \(T'\) is a refinement of T if there exists a sequence of extensions that transforms T into \(T'\).

The gene tree consistency (GTC) problem:

Given: A labeled gene tree \((T;t,\sigma )\) and an almost binary species tree S.

Question: Does there exist a refinement \(S^*\) of S that displays \(\mathcal {R}(T; t,\sigma )\) and such that \(A({T},{S^*})\) is acyclic?

It is easy to see that the problem of determining the existence of a species tree S that displays \(\mathcal {R}(T; t,\sigma )\) and such that \(A({T},{S})\) is acyclic is a special case of this problem. Indeed, it suffices to provide a star tree S as input to the GTC problem, since every species tree is then a refinement of S.

Definition 6

A species tree \(S^*\)is a solution to a given GTC instance \(((T;t,\sigma ), S)\) if \(S^*\) is a refinement of S, \(S^*\) displays \(\mathcal {R}(T; t,\sigma )\) and \(A({T},{S^*})\) is acyclic.

We first show that, as a particular case of the following lemma, one can restrict the search to binary species trees (even if T is non-binary).

Lemma 1

Let \(((T;t,\sigma ), S)\)be a GTC instance and assume that a species tree \(\widehat{S}\)is a solution to this instance. Then any refinement \(S^*\)of \(\widehat{S}\)is also a solution to \(((T;t,\sigma ), S)\).

Proof

See Appendix: "Lemma 3 and Proof of Lemma 1". \(\square\)

This shows that we can restrict our search to binary species trees, and we may only require that it agrees with \(\mathcal {R}(T;t,\sigma )\).

Proposition 1

An instance \(((T;t,\sigma ), S)\) of the GTC problem admits a solution if and only if there exists a binary refinement \(S^*\) of S that agrees with and, therefore, displays \(\mathcal {R}(T;t,\sigma )\) such that \(A({T},{S^*})\) is acyclic.

Proof

Assume that \(((T;t,\sigma ), S)\) admits a solution \(\widehat{S}\). By Lemma 1, any binary refinement \(S^*\) of \(\widehat{S}\) displays \(\mathcal {R}(T;t,\sigma )\) (and hence agrees with it) and \(A({T},{S^*})\) is acyclic.

Conversely, suppose that there is a binary species tree \(S^*\) that is a refinement of S and agrees with \(\mathcal {R}(T;t,\sigma )\) such that \(A({T},{S^*})\) is acyclic. Since \(A({T},{S^*})\) is acyclic, we only need to show that \(S^*\) displays \(\mathcal {R}(T;t,\sigma )\). Let \(ab|c \in \mathcal {R}(T;t,\sigma )\). Because \(S^*\) is binary, we must have one of ab|c, ac|b or bc|a in \(rt(S^*)\). Since \(S^*\) agrees with \(\mathcal {R}(T;t,\sigma )\), \(ab|c \in rt(S^*)\), and it follows that \(\mathcal {R}(T;t,\sigma )\subseteq rt(S^*)\). Hence, \(S^*\) displays \(\mathcal {R}(T;t,\sigma )\). Taking the latter arguments together, \(S^*\) is a solution to the instance \(((T;t,\sigma ), S)\) of the GTC problem, which completes the proof. \(\square\)

An algorithm for the GTC problem

We need to introduce a few more concepts before describing our algorithm. For a sequence \(Q=(v_1,\ldots ,v_k)\) we denote \(\mathcal {M}(Q) = \{v_1,\ldots ,v_k\}\).

Given a graph G, a partial topological sort of G is a sequence of distinct vertices \(Q = (v_1, v_2, \ldots , v_k)\) such that for each \(i \in \{1,\dots ,k\}\), vertex \(v_i\) has in-degree 0 in \(G - \{v_1, \ldots , v_{i-1}\}\). If, for any \(v \in V(G)\), there is no partial topological sort \(Q'\) satisfying \(\mathcal {M}(Q') = \mathcal {M}(Q) \cup \{v\}\) then Q is called a maximal topological sort. Note that the set of vertices in a maximal topological sort of G is unique, in the sense that for two distinct maximal topological sorts \(Q,Q'\) of G we always have \(\mathcal {M}(Q) = \mathcal {M}(Q')\) (cf. Lemma 4 in Appendix: "Properties of maximal topological sort and Lemma 4").

The motivation to consider partial and maximal topological sorts is as follows: To find a solution \(S^*\) for a given GTC instance, we must ensure that the graph \(A({T},{S^*})\) is acyclic which holds precisely when there is a maximal topological sort that contains every vertex (cf. Lemma 4 in Appendix: "Properties of maximal topological sort and Lemma 4"). If this is not the case, our goal is to find a refinement \(S'\) of \(S^*\) such that the updated graph \(A({T},{S'})\) admits the same maximal topological sort as \(A({T},{S^*})\) together with (at least one) additional vertex appended to it. We thus attempt to extend the maximal topological sorts in a step-wise manner until it contains every vertex of \(A({T},{S^*})\), or until no refinement on \(S^*\) allows us to augment it.

Our algorithm will make use of what we call a good split refinement. To this end, we provide first.

Definition 7

(Split refinement)

Let S be an almost binary tree and let x be a cherry of S. We say that a refinement \(S'\) of S is a split refinement (of S at x) if \(S'\) can be obtained from S by partitioning the set \(\mathrm {ch}(x)\) of children of x into two non-empty subsets \(X_1, X_2=\mathrm {ch}(x)\setminus X_1\), and applying the extensions \((x, X_1)\) and then \((x, X_2)\).

In other words, we split the children set of x into non-empty subsets \(X_1\) and \(X_2\), and add a new parent vertex above each subset of size 2 or more and connect x with the newly created parent(s) or directly with \(x'\) whenever \(X_i=\{x'\}\).

We note that the two \((x, X_1)\) and \((x, X_2)\) extensions yield a valid refinement of S since the set \(X_2\) is a strict subset of the children of x in \(S_{x, X_1}\). Also observe that a split refinement transforms an almost binary tree into another almost binary tree that has one additional binary internal vertex.

Definition 8

(Good split refinement) Let \(((T;t,\sigma ), S)\) be a GTC instance. Let Q be a maximal topological sort of \(A({T},{S})\), and let \(S'\) be a split refinement of S at some cherry vertex x. Then \(S'\) is a good split refinement if the two following conditions are satisfied:

  • \(S'\) agrees with \(\mathcal {R}(T;t,\sigma )\);

  • All the in-neighbors of x in \(A({T},{S'})\) belong to \(\mathcal {M}(Q)\).

The intuition behind a good split refinement is that it refines S by creating an additional binary vertex. Moreover, this refinement maintains agreement with \(\mathcal {R}(T;t,\sigma )\) and, more importantly, creates a new vertex of in-degree 0 in the auxiliary graph that can be used to extend the current maximal topological sort. Ultimately, our goal is to repeat this procedure until Q contains every vertex, at which point we will have attained an acyclic graph.

As an example consider Fig. 2. The species tree \(S_1\) corresponds to the star tree. Clearly \(S_1\) agrees with \(\mathcal {R}(T;t,\sigma )\) since \(rt(S_1)=\emptyset\). However, \(A({T},{S})\) contains cycles. For the maximal topological sort \(Q_1\) of \(A({T},{S_1})\) we have \(\mathcal {M}(Q_1) = L(T)\cup \{1,2,5\}\). Now, \(S_2\) is a good split refinement of \(S_1\), since \(S_2\) agrees with \(\mathcal {R}(T;t,\sigma )\) (in fact, \(S_2\) displays \(\mathcal {R}(T;t,\sigma )\)) and since \(x=1'\) has no in-neighbors in \(A({T},{S_2})\) which trivially implies that all in neighbors of \(x=1'\) in \(A({T},{S_2})\) are already contained \(\mathcal {M}(Q_1)\). For the maximal topological sort \(Q_2\) of \(A({T},{S_2})\) we have \(\mathcal {M}(Q_2) = \mathcal {M}(Q_1)\cup \{1'\}\). Still, \(A({T},{S_2})\) is not acyclic. The tree \(S_3\) is a good split refinement of \(S_2\), since \(S_3\) agrees with \(\mathcal {R}(T;t,\sigma )\) and the unique in-neighbor \(1'\) of \(x=2'\) in \(A({T},{S_3})\) is already contained \(\mathcal {M}(Q_2)\). Since \(A({T},{S_3})\) is acyclic, there is a time-consistent reconciliation map from \((T;t,\sigma )\) to \(S_3\), which is shown in Fig. 1. Furthermore, \(S_4\) is not a good split refinement of \(S_2\). Although \(S_4\) is a split refinement of \(S_2\) and agrees with \(\mathcal {R}(T;t,\sigma )\), the in-neighbor 4 of \(x=2'\) is not contained in \(\mathcal {M}(Q_2)\).

Fig. 2
figure 2

Top left: the gene tree \((T;t,\sigma )\) from Fig. 1 from which we obtain the species triplets \(\mathcal {R}(T;t,\sigma ) = \{AB|D,AC|D\}\). Moreover, the sequence of species trees \(S_1, S_2\) and \(S_3\) is obtained by stepwise application of good split refinements. The species tree \(S_4\) is an example of a split refinement of \(S_2\) that is not good. The corresponding graphs \(A({T},{S})\) are drawn right to the respective species tree S. For clarity, we have omitted to draw all vertices of \(A({T},{S})\) that have degree 0. Moreover, in the respective species trees the planted roots are omitted. See text for further discussion

We will discuss later the question of finding a good split refinement efficiently, if one exists. For now, assume that this can be done in polynomial time. The pseudocode for a high-level algorithm for solving the GTC problem is provided in Alg. 1. We note in passing that this algorithm serves mainly as a scaffold to provide the correctness proofs that are needed for the main Alg. 2.

figure a

The following result shows that if we reach a situation where there is no good split refinement for an GTC instance, then no solution exits at all.

Proposition 2

Let \(((T;t,\sigma ), S)\) be a GTC instance such that S is not binary and does not admit a good split refinement. Then, \(((T;t,\sigma ), S)\) does not admit a solution.

Proof

See Appendix: "Proof of proposition". \(\square\)

We next show that if we are able to find a good split refinement \(S'\) of S, the \(((T; t, \sigma ), S')\) instance is equivalent in the sense that \(((T; t, \sigma ), S)\) admits a solution if and only if \(((T; t, \sigma ), S')\) also admits a solution.

Theorem 3

Let \(((T; t, \sigma ), S)\)be a GTC instance, and suppose that S admits a good split refinement \(S'\). Then \(((T;t, \sigma ), S)\)admits a solution if and only if \(((T; t, \sigma ), S')\)admits a solution. Moreover, any solution for \(((T;t, \sigma ), S')\), if any, is also a solution for \(((T;t,\sigma ), S)\).

Proof

See Appendix: "Proof of Theorem". \(\square\)

Theorem 4

Algorithm 1 determines whether a given GTC instance \(((T; t, \sigma ), S)\)admits a solution or not and, in the affirmative case, constructs a solution \(S^*\) of \(((T; t, \sigma ), S)\).

Proof

Let \(((T; t, \sigma ), S)\) be GTC instance. First it is tested in Line 2 whether S is binary or not. If S is binary, then S is already its binary refinement and Prop. 1 implies that S is a solution to \(((T; t, \sigma ), S)\) if and only if S agrees with \(\mathcal {R}(T;t,\sigma )\) and \(A({T},{S})\) is acyclic. The latter is tested in Line 3. In accordance with Prop. 1, the tree S is returned whenever the latter conditions are satisfied and, otherwise, “there is no solution” is returned.

Assume that S is not binary. If S admits no good split refinement, then Alg. 1 (Line 10) returns “there is no solution”, which is in accordance with Prop. 2. Contrary, if S admits a good split refinement \(S'\), then we can apply Theorem 3 to conclude that \(((T; t, \sigma ), S)\) admits a solution if and only if \(((T; t, \sigma ), S')\) admits a solution at all.

Now, we recurse on \(((T; t, \sigma ), S')\) as new input of Alg. 1 in Line 8. The correctness of Alg. 1 is finally ensured by Theorem 3 which states that if \(((T; t, \sigma ), S')\) admits a solution and thus, by Prop. 1, a binary refinement \(S^*\) which is obtained by a series of good split refinements starting with S, is a solution for \(((T; t, \sigma ), S)\). \(\square\)

Finding a good split refinement

To find a good split refinement, if any, we can loop through each cherry x and ask “is there a good split refinement at x”? Clearly, every partition \(X_1,X_2\) of \(\mathrm {ch}(x)\) may provide a good split refinement and thus there might be \(O(2^{|\mathrm {ch}(x)|})\) cases to be tested for each cherry x. To circumvent this issue, we define a second auxiliary graph that is an extension of the well-known Aho-graph to determine whether a set of triplets is compatible or not [47, 50, 51]. For a given set R of triplets, the Aho-graph has vertex set V and (undirected) edges ab for all triplets \(ab|c\in R\) with \(a,b,c\in V\). Essentially we will use this Aho-graph and add additional edges to it. The connected components of this extended graph eventually guide us to the process of finding good split refinements.

We define now the new auxiliary graph to determine whether the cherry x of S admits a good split refinement or not.

Definition 9

(Good-split-graph) Let \((T;t,\sigma )\) be a gene tree, S be a species tree and x be a cherry of S. Moreover, let Q be a maximal topological sort of \(A({T},{S})\).

We define \(G((T; t, \sigma ), S, x) = (V, E)\) as the undirected graph with vertex set \(V = L(S(x))\). Moreover, an (undirected) edge ab is contained in E if and only if \(a,b \in L(S(x))\) and ab are distinct and satisfy at least one of the following conditions:

(C1):

There exists \(c \in L(S(x))\) such that \(ab|c \in \mathcal {R}(T; t,\sigma )\);

(C2):

There exists an edge \((u, v) \in E(T)\) such that \(t(u) \in \{\mathfrak {d}, \mathfrak {t}\}\), \(u \notin \mathcal {M}(Q)\), \(t(v) = \mathfrak {s}\), and \(\{a, b\} \subseteq \sigma _{T_{\mathcal {\overline{E}}}}(v)\);

(C3):

There exists an edge \((u, v) \in E(T)\) such that \(t(u) = t(v) = \mathfrak {s}\), \(\widehat{\mu }_{S}(u) = x\) and \(\{a, b\} \subseteq \sigma _{T_{\mathcal {\overline{E}}}}(v)\);

(C4):

There exists a vertex \(u \in V(T) \setminus \mathcal {M}(Q)\) such that \(t(u) \in \{\mathfrak {d}, \mathfrak {t}\}\) and \(\{a, b\} \subseteq \sigma _{T_{\mathcal {\overline{E}}}}(u)\).

Intuitively, edges represent pairs of species that must belong to the same part of a split refinement at x. That is, (C1) links species that would contradict a triplet of \(\mathcal {R}(T; t, \sigma )\) if they were separated (as in the classical BUILD algorithm [47, 50, 51]); (C2) links species that would yield an A1-edge from a vertex not in Q into x if they were separated; (C3) links species that would create a self-loop on x if they were separated; and (C4) links species that would create an A3-edge from a vertex not in Q into x if separated. We want the graph \(G((T; t, \sigma ), S, x)\) to be disconnected which would allow us to split the children of x while avoiding all the situations in which we create a separation of two children where we cannot ensure that this separation yields a good split refinement at x. Considering only such pairs of children turns out to be necessary and sufficient, and Theorem 5 below formalizes this idea.

Definition 10

Given an undirected graph H, we say that (AB) is a disconnected bipartition of H if \(A \cup B = V(H)\), \(A \cap B = \emptyset\) and for each \(a \in A, b \in B\), \(ab \notin E(H)\).

We are now in the position to state how good split refinements can be identified. Note, we may assume w.l.o.g. that S agrees with \(\mathcal {R}\), as otherwise there can be no good split refinement at all.

Theorem 5

Let \(((T; t, \sigma ), S)\)be a GTC instance, and assume that S agrees with \(\mathcal {R}(T; t, \sigma )\). Let Q be a maximal topological sort of \(A({T},{S})\). Then there exists a good split refinement of S if and only if there exists a cherry x of S such that every strict ancestor of x in S is in Q, and such that \(G((T; t, \sigma ), S, x)\)is disconnected.

In particular, for any disconnected bipartition (AB) of G, the split refinement that partitions the children of x into A and B is a good split refinement.

Proof

See Appendix: "Proof of Theorem". \(\square\)

The GTC algorithm

We can finally describe the detailed algorithm for the GTC problem, see also Fig. 3.

Fig. 3
figure 3

Top right: the gene tree \((T;t,\sigma )\) from Fig. 1 from which we obtain the species triplets \(\mathcal {R}(T;t,\sigma ) = \{AB|D,AC|D\}\). We start with the star tree \(S_1\) (top left) and obtain \(G((T; t, \sigma ), S_1, 1')\), which is shown right to \(S_1\). \(G((T; t, \sigma ), S_1, 1')\) has four vertices ABCD and two edges. The edge labels indicate which of the conditions in Def. 9 yield the respective edge. In \(G((T; t, \sigma ), S_1, 1')\), there is only one non-trivial connected component which implies the good split that results in the tree \(S_2\) (lower left). There is only one cherry \(2'\) in \(S_2\) and the corresponding graph \(G((T; t, \sigma ), S_2, 2')\) is drawn right to \(S_2\). Again, the connected components give a good split that results in the binary tree \(S_3\). The tree \(S_3\) is precisely the species tree (planted root omitted) as shown in the middle of Fig. 1

figure b

A pseudocode to compute a time-consistent species for a given event-labeled gene tree \((T;t,\sigma )\), if one exists, is provided in Alg. 2. The general idea of Alg. 2 is as follows. With \((T;t,\sigma )\) as input, we start with a star tree S and stepwisely refine S by searching for good split refinements. If in each step a good split refinement exists and S is binary (in which case we cannot further refine S), then we found a time-consistent species tree S for \((T;t,\sigma )\). In every other case, the algorithm returns “No time-consistent species tree exists”. Observe that at any point during the execution of the algorithm, the graph \(G((T;t,\sigma ), S, x)\) is stored in memory for each cherry x of S. Since S is initialized as a star tree, only \(G((T;t,\sigma ), S, r)\) needs to be computed initially, with r being the root of S. After applying a binary refinement at some non-binary cherry x, two cherries \(x_1\) and \(x_2\) are created, in which case we compute their auxiliary graph. Note that at this point, the graph \(G((T;t,\sigma ), S, x)\) is not needed since x is not a cherry anymore, and thus it could be discarded. The correctness proof as well as further explanations on the running time are provided in the proof of Theorem 6.

To show that this algorithm runs in \(O(n^3)\) time, we need first the following.

Lemma 2

\(\mathcal {R}(T; t, \sigma )\)can be computed in worst-case time \(\Theta (n^3)\), where \(n = |L(T)|\).

Proof

See Appendix: "Proof of Lemma". \(\square\)

We note that it is not too difficult to show that Algorithm 2 can be implemented to take time \(O(n^4)\). Indeed, each line of the algorithm can be verified to take time \(O(n^3)\), including the construction of \(A({T},{S})\) (which takes time \(O(n \log n)\), as shown in [4, Thm. 6]) and the construction of the \(G((T;t,\sigma ), S, x)\) graphs (by checking every triplet of \(\mathcal {R}(T; t, \sigma )\) for (C1) edges, and for every pair ab of vertices, checking every member of \(V(T) \cup E(T)\) for (C2), (C3) or (C4) edges). Since the main while loop is executed O(n) times, this yields complexity \(O(n^4)\). However, with a little more work, this can be improved to cubic time algorithm.

As stated in Lemma 2, we may have \(\mathcal {R}(T;t,\sigma ) \in \Theta (n^3)\). Thus, any hope of achieving a better running time would require a strategy to reconstruct a species tree S without reconstructing the full triplet set \(\mathcal {R}(T;t,\sigma )\) that S needs to display. It may be possible that such an algorithm exists, however, this would be a quite surprising result and may require a completely different approach.

Theorem 6

Algorithm 2 correctly computes a time-consistent binary species tree for \((T;t,\sigma )\), if one exists, and can be implemented to run in time \(O(n^3)\), where \(n = |L(T)|\). Its worst-case runtime is \(\Theta (n^3)\).

Proof

See Appendix: Proof of Theorem. \(\square\)

Summary and outlook

Here, we considered event-labeled gene trees \((T;t,\sigma )\) that contain speciation, duplication and HGT. We solved the Gene Tree Consistency (GTC) problem, that is, we have shown how to decide whether a time-consistent species tree S for a given gene tree \((T;t,\sigma )\) exists and, in the affirmative case, how to construct such a binary species tree in cubic-time. Since our algorithm is based on the informative species triplets \(\mathcal {R}(T;t,\sigma )\), for which \(\mathcal {R}(T;t,\sigma )\in \Theta (n^3)\) may possible, there is no non-trivial way to improve the runtime of our algorithm. Our algorithm heavily relies on the structure of an auxiliary graph \(A({T},{S})\) to ensure time-consistency and good split refinements to additionally ensure that the final tree S displays \(\mathcal {R}(T;t,\sigma )\).

This approach may have further consequence in phylogenomics. Since event-labeled gene trees \((T;t,\sigma )\) can to some extent directly be inferred from genomic sequence data, our method allows to test whether \((T;t,\sigma )\) is “biologically feasible”, that is, there exists a time-consistent species tree for \((T;t,\sigma )\). Moreover, our method also shows that all information about the putative history of the species is entirely contained within the gene trees \((T;t,\sigma )\) and thus, in the underlying sequence data to obtain \((T;t,\sigma )\).

We note that the constructed binary species tree is one of possibly exponentially many other time-consistent species trees for \((T;t,\sigma )\). In particular, there are many different ways to choose a good split refinement, each choice may lead to a different species tree. Moreover, the reconstructed species trees here are binary. This condition might be relaxed and one may obtain further species tree by “contracting” edges so that the resulting non-binary tree is still a time-consistent species tree for \((T;t,\sigma )\). This eventually may yield so-called “least-resolved” time-consistent species trees and thus, trees that make no further assumption on the evolutionary history than actually supported by the data.

As part of further work, it may be of interest to understand in more detail, if our approach can be used to efficiently list all valid (possibly exponentially many) solutions, that is, all time-consistent species trees for \((T;t,\sigma )\).