1 Introduction

Horizontal gene transfer (HGT) laterally introduces foreign genetic material into a genome. The phenomenon is particularly frequent in prokaryotes (Soucy et al. 2015; Nelson-Sathi et al. 2015) but also contributed to shaping eukaryotic genomes (Keeling and Palmer 2008; Husnik and McCutcheon 2018; Acuña et al. 2012; Li et al. 2014; Moran and Jarvik 2010; Schönknecht et al. 2013). HGT may be additive, in which case its effect is similar to gene duplications, or lead to the replacement of a vertically inherited homolog. From a phylogenetic perspective, HGT leads to an incongruence of gene trees and species trees, thus complicating the analysis of gene family histories.

A broad spectrum of computational methods have been developed to identify horizontally transferred genes and/or HGT events, recently reviewed by Ravenhall et al. (2015). Parametric methods use genomic signatures, i.e., sequence features specific to a (group of) species identify horizontally inserted material. Genomic signatures include e.g. GC content, k-mer distributions, sequence autocorrelation, or DNA deformability (Dufraigne et al. 2005; Becq et al. 2010). Direct (or “explicit”) phylogenetic methods start from a given gene tree T and species tree S and compute a reconciliation, i.e., a mapping of the gene tree into the species tree. This problem first arose in the context of host/parasite assemblages (Page 1994; Charleston 1998) considering the equivalent problem of mapping a parasite tree T to a host phylogeny S such that the number of events such as host-switches, i.e., horizontal transfers, is minimized. For a review of the early literature we refer to Charleston and Perkins (2006). A major difficulty is to enforce time consistency in the presence of multiple horizontal transfer events, which renders the problem of finding optimal reconciliations NP-hard (Hallett and Lagergren 2001; Ovadia et al. 2011; Tofigh et al. 2011; Hasić and Tannier 2019). Nevertheless several practical approaches have become available, see e.g. Tofigh et al. (2011), Chen et al. (2012) and Ma et al. (2018).

Indirect (or “implicit”) phylogenetic methods forego the reconstruction of trees and start from sequence similarity or evolutionary distances and use unexpectedly small or large distances between genes as indicators of HGT. While indirect methods have been used successfully in the past, reviewed by Ravenhall et al. (2015), they have received very little attention from a more formal point of view. In this contribution, we focus on a particular type of implicit phylogenetic information, following the ideas of Novichkov et al. (2004). The basic idea is that the evolutionary distance between orthologous genes is approximately proportional to the distances between their species. Xenologous gene pairs as well as duplicate genes thus appear as outliers (Lawrence and Hartl 1992; Clarke et al. 2002; Novichkov et al. 2004; Dessimoz et al. 2008). More precisely, consider a family of homologous genes in a set of species and plot the phylogenetic distance of pairs of most similar homologs as a function of the phylogenetic distances between the species in which they reside. Since distances between orthologous genes can be expected to be approximately proportional to the distances between the species, orthologous pairs fall onto a regression line that defines equal divergence time for the last common ancestor of corresponding gene and species pairs. The gene pairs with “later divergence times”, i.e., those that are more closely related than expected from their species, fall below the regression line (Novichkov et al. 2004). Kanhere and Vingron (2009) complemented this idea with a statistical test based on the Cook distance to identify xenologous pairs in a statistically sound manner. For the mathematical analysis we assume that we can perfectly identify all pairs of genes a and b that are more closely related than expected from the phylogenetic distance of their respective genomes. Naturally, this defines a graph \((G,\sigma )\), whose vertices x (the genes) are colored by the species \(\sigma (x)\) in which they appear. Here, we are interested in two questions:

  1. (1)

    What are the mathematical properties that characterize these “later-divergence-time” (LDT) graphs?

  2. (2)

    What kind of information about HGT events, the gene and species tree, and the reconciliation map between them is contained implicitly in an LDT graph?

In Sect. 6 we will briefly consider the situation that later-divergence-time information is fraught with experimental errors.

These questions are motivated by a series of recent publications that characterized the mathematical structure of orthology (Hellmuth et al. 2013; Lafond and El-Mabrouk 2014), the xenology relation sensu Fitch (Geiß et al. 2018; Hellmuth et al. 2018; Hellmuth and Seemann 2019), and the (reciprocal) best match relation (Geiß et al. 2019, 2020b; Schaller et al. 2021a, b). Each of these relations satisfies stringent mathematical conditions that—at least in principle—can be used to correct empirical estimates and thus serve as a potential means of noise reduction (Hellmuth et al. 2015; Stadler et al. 2020). This approach has also lead to efficient algorithms to extract gene trees, species trees, and reconciliations from the relation data. Although the resulting representations of gene family histories are usually not fully resolved, they can provide important constraints for subsequent refinements. The advantage of the relation-based approach is primarily robustness. While the inference of phylogenetic trees relies on detailed probability models or the additivity of distance metrics, our approach starts from yes/no answers to simple, pairwise comparisons. These data can therefore be represented as edges in a graph, possibly augmented by a measure of confidence. Noise and inaccuracies in the initial estimates then translate into violations of the required mathematical properties of the graphs in question. Graph editing approaches can therefore be harnessed as a means of noise reduction (Hellmuth et al. 2015; Dondi et al. 2017; Lafond and El-Mabrouk 2014; Lafond et al. 2016; Hellmuth et al. 2020b, a; Schaller et al. 2021c).

Previous work following this paradigm has largely been confined to duplication-loss (DL) scenarios, excluding horizontal transfer. As shown in Hellmuth (2017), it is possible to partition a gene set into HGT-free classes separated by HGTs. Within each class, the reconstruction problems then simplify to the much easier DL scenarios. It is of utmost interest, therefore, to find robust methods to infer this partition directly from (dis)similarity data. Here, we explore the usefulness and limitations of LDT graphs for this purpose.

This contribution is organized as follows. After introducing the necessary notation, we introduce relaxed scenarios, a very general framework to describe evolutionary scenarios that emphasizes time consistency of reconciliation rather than particular types of evolutionary events. In Sect. 4, LDT graphs are defined formally and characterized as those properly colored cographs for which a set of accompanying rooted triples is consistent (Theorem 3). The proof is constructive and provides a method (Algorithm 1) to compute a relaxed scenario for a given LDT graph. Section 5 defines HGT events, shows that every edge in a LDT graph corresponds to an HGT event, and characterizes those LDT graphs that already capture all HGT events. In addition, we provide a characterization of “rs-Fitch graphs” (general vertex-colored graphs that capture all HGT events) in terms of their coloring. These properties can be verified in polynomial time. Since LDT graphs do not usually capture all HGT events, we discuss in “Appendix C” several ways to obtain a plausible set of HGT candidates from LDT graphs. In Sect. 7, we address the question “how much information about all HGT events is contained in LDT graphs” with the help of simulations of evolutionary scenarios with a wide range of duplication, loss, and HGT events. We find that LDT graphs cover roughly a third of xenologous pairs, while a simple greedy graph editing scheme can more than double the recall at moderate false positive rates. This greedy approach already yields a median accuracy of \(89 \%\), and in \(99.8 \%\) of the cases produces biologically feasible solutions in the sense that the inferred graphs are rs-Fitch graphs. We close with a discussion of several open problems and directions for future research in Sect. 8.

The material of this contribution is extensive and contains several lengthy, very technical proofs. We therefore divided the presentation into a Narrative Part that contains only those mathematical results that contribute to our main conclusions, and a Technical Part providing additional results and all proofs. To facilitate cross-referencing between the two parts, the same numbering of Definitions, Lemmas, Theorems, etc., is used. Appendices AB, and C contain the technical material corresponding to Sects. 4, 5, and 6, respectively.

2 Notation

Graphs We consider undirected graphs \(G=(V,E)\) with vertex set \(V(G):=V\) and edge set \(E(G):=E\), and denote edges connecting vertices \(x,y\in V\) by xy. The graphs \(K_1\) and \(K_2\) denote the complete graphs on one and two vertices, respectively. The graph \(K_2+K_1\) is the disjoint union of a \(K_2\) and a \(K_1\).

The join \(G\triangledown H\) of two graphs \(G=(V,E)\) and \(H=(W,F)\) is the graph with vertex set and edge set . We write \(H\subseteq G\) if \(V(H)\subseteq V(G)\) and \(E(H)\subseteq E(G)\), in which case H is called a subgraph of G. Given a graph \(G=(V,E)\), we write G[W] for the graph induced by \(W\subseteq V\). A connected component C of G is an inclusion-maximal vertex set such that G[C] is connected. A (maximal) clique C in an undirected graph G is an (inclusion-maximal) vertex set such that, for all vertices \(x,y\in C\), it holds that \(xy\in E(G)\), i.e., G[C] is complete. A subset \(W\subseteq V\) is a (maximal) independent set if G[W] is edgeless (and W is maximal w.r.t. inclusion). A graph \(G = (V,E)\) is complete multipartite if V consists of \(k\ge 1\) pairwise disjoint independent sets \(I_1,\dots , I_k\) and \(xy\in E\) if and only if \(x\in I_i\) and \(y\in I_j\) with \(i\ne j\).

A graph G together with a vertex coloring \(\sigma \), denoted by \((G,\sigma )\), is properly colored if \(uv \in E(G)\) implies \(\sigma (u)\ne \sigma (v)\). For a coloring \(\sigma :V\rightarrow M\) and a subset \(W\subseteq V\), we write \(\sigma (W) :=\{\sigma (w)\mid w\in W\}\) for the set of colors that appear on the vertices in W. Throughout, we will need restrictions of the coloring map \(\sigma \).

Definition 1

Let \(\sigma :L\rightarrow M\) be a map, \(L'\subseteq L\) and \(\sigma (L') \subseteq M' \subseteq M\). Then, the map \(\sigma _{|L',M'}:L'\rightarrow M'\) is defined by putting \(\sigma _{|L',M'}(v) = \sigma (v)\) for all \(v\in L'\). If we only restrict the domain of \(\sigma \), we just write \(\sigma _{|L'}\) instead of \(\sigma _{|L',M}\).

We do neither assume that \(\sigma \) nor that its restriction \(\sigma _{|L',M'}\) is surjective.

Rooted trees All trees appearing in this contribution are rooted in one of their vertices. We write \(x \preceq _{T} y\) if y lies on the unique path from the root to x, in which case y is called an ancestor of x, and x is called a descendant of y. We may also write \(y \succeq _{T} x\) instead of \(x \preceq _{T} y\). We use \(x \prec _T y\) for \(x \preceq _{T} y\) and \(x \ne y\). In the latter case, y is a strict ancestor of x. If \(x \preceq _{T} y\) or \(y \preceq _{T} x\), the vertices x and y are comparable and, otherwise, incomparable. We write L(T) for the set of leaves of the tree T, i.e., the \(\preceq _T\)-minimal vertices and say that T is a tree on L(T). We write T(u) for the subtree of T rooted in u. The last common ancestor of a vertex set \(W\subseteq V(T)\) is the \(\preceq _T\)-minimal vertex \(u:={{\,\mathrm{lca}\,}}_T(W)\) for which \(w\preceq _T u\) for all \(w\in W\). For brevity we write \({{\,\mathrm{lca}\,}}_T(x,y)={{\,\mathrm{lca}\,}}_T(\{x,y\})\).

We employ the convention that edges (xy) in a tree are always written such that \(y \preceq _{T} x\) is satisfied. If (xy) is an edge in T, then \({{\,\mathrm{par}\,}}(y):=x\) is the parent of y, and y the child of x. We denote with \({{\,\mathrm{child}\,}}_T(x)\) the set of all children of x in T. It will be convenient for the discussion below to extend the ancestor relation \(\preceq _T\) on V to the union of the edge and vertex sets of T. More precisely, for a vertex \(x\in V(T)\) and an edge \(e=(u,v)\in E(T)\) we put \(x \prec _T e\) if and only if \(x\preceq _T v\); and \(e \prec _T x\) if and only if \(u\preceq _T x\). In addition, for edges \(e=(u,v)\) and \(f=(a,b)\) in T we put \(e\preceq _T f\) if and only if \(v \preceq _T b\).

A rooted tree is phylogenetic if all vertices that are adjacent to at least two vertices have at least two children. A rooted tree T is planted if its root has degree 1. In this case, we denote the “planted root” by \(0_T\). In planted phylogenetic trees there is a unique “planted edge” \((0_T,\rho _T)\) where \(\rho _T:={{\,\mathrm{lca}\,}}_T(L(T))\). Note that by definition \(0_T\notin L(T)\).

Throughout, we will assume that all trees are rooted and phylogenetic unless explicitly stated otherwise. Whenever there is no danger of confusion, we will refer also to planted phylogenetic trees simply as trees.

The set of inner vertices is given by \(V^0(T):=V(T){\setminus } (L(T)\cup \{0_T\})\). An edge (uv) is an inner edge if both vertices u and v are inner vertices and, otherwise, an outer edge. The restriction of T to a subset \(L'\subseteq L(T)\) of leaves, denoted by \(T_{|L'}\) is obtained by identifying the (unique) minimal subtree of T that connects all leaves in \(L'\), and suppressing all vertices with degree two except possibly the root \(\rho _{T_{L'}}={{\,\mathrm{lca}\,}}_T(L')\). T displays a tree \(T'\), in symbols \(T'\le T\), if \(T'\) can be obtained from a restriction \(T_{|L'}\) of T by a series of inner edge contractions (Bryant and Steel 1995). If, in addition, \(L(T)=L(T')\), then T is a refinement of \(T'\). Throughout this contribution, we will consider leaf-colored trees \((T,\sigma )\) with \(\sigma \) being defined for L(T) only.

Rooted triples A rooted triple is a tree T on three leaves and two internal vertices. We write ab|c for the triple with \({{\,\mathrm{lca}\,}}_T(a,b)\prec {{\,\mathrm{lca}\,}}_T(a,c)={{\,\mathrm{lca}\,}}_T(b,c)\). For a set \({\mathcal {R}}\) of triples we write \(L({\mathcal {R}}):=\bigcup _{{\mathsf {t}}\in {\mathcal {R}}}L({\mathsf {t}})\). The set \({\mathcal {R}}\) is compatible if there is a tree T with \(L({\mathcal {R}}) \subseteq L(T)\) that displays every triple \({\mathsf {t}}\in {\mathcal {R}}\). The construction of such a tree T from a triple set \({\mathcal {R}}\) on L makes use of an auxiliary graph that will play a prominent role in this contribution.

Definition 2

(Aho et al. 1981) Let \({\mathcal {R}}\) be a set of rooted triples on the vertex set L. The Aho graph \([{\mathcal {R}},L]\) has vertex set L and edge set \(\{ xy \mid \exists z\in L:\, xy|z \in {\mathcal {R}}\}\).

The algorithm BUILD (Aho et al. 1981) uses Aho graphs in a top-down recursion starting from a given set of triples \({\mathcal {R}}\) and returns for compatible triple sets \({\mathcal {R}}\) on L an unambiguously defined tree \({{\,\mathrm{Aho}\,}}({\mathcal {R}}, L)\) on L, which is known as the Aho tree. BUILD runs in polynomial time. The key property of the Aho graph that ensures the correctness of BUILD can be stated as follows:

Proposition 1

(Aho et al. 1981; Bryant and Steel 1995) A set of triples \({\mathcal {R}}\) is compatible if and only if for each subset \(L\subseteq L({\mathcal {R}})\) with \(|L|>1\) the graph \([{\mathcal {R}},L]\) is disconnected.

Cographs are recursively defined as undirected graphs that can be generated as joins or disjoint unions of cographs, starting from single-vertex graphs \(K_1\). The recursive construction defines a rooted tree (Tt), called cotree, whose leaves are the vertices of the cograph G, i.e., the \(K_1\)s, while each of its inner vertices u of T represent the join or disjoint union operations, labeled as \(t(u)=1\) and \(t(u)=0\), respectively. Hence, for a given cograph G and its cotree (Tt), we have \(xy\in E(G)\) if and only if \(t({{\,\mathrm{lca}\,}}_T(x,y))=1\). Contraction of all tree edges \((u,v)\in E(T)\) with \(t(u)=t(v)\) results in the discriminating cotree \((T_G,{{\hat{t}}})\) of G with cotree-labeling \({{\hat{t}}}\) such that \({{\hat{t}}}(u)\ne {{\hat{t}}}(v)\) for any two adjacent interior vertices of \(T_G\). The discriminating cotree \((T_G,{{\hat{t}}})\) is uniquely determined by G (Corneil et al. 1981a). Cographs have a large number of equivalent characterizations. In this contribution, we will need the following classical results:

Proposition 2

(Corneil et al. 1981a) Given an undirected graph G, the following statements are equivalent:

  1. 1.

    G is a cograph.

  2. 2.

    G does not contain a \(P_4\), i.e., a path on four vertices, as an induced subgraph.

  3. 3.

    \(\mathrm {diam}(H) \le 2\) for all connected induced subgraphs H of G.

  4. 4.

    Every induced subgraph H of G is a cograph.

3 Relaxed reconciliation maps and relaxed scenarios

Tofigh et al. (2011) and Bansal et al. (2012) define “Duplication-Transfer-Loss” (DTL) scenarios in terms of a vertex-only map \(\gamma :V(T)\rightarrow V(S)\). The H-trees introduced by Górecki (2010) and Górecki and Tiuryn (2012) formalize the same concept in a very different manner. A definition of a DTL-like class of scenarios in terms of a reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) was analyzed by Nøjgaard et al. (2018). For binary trees, the two definitions are equivalent; for non-binary trees, however, the DTL-scenarios are a proper subset, see Nøjgaard et al. (2018, Fig. 1) for an example. Several other mathematical frameworks have been used in the literature to specify evolutionary scenarios. Examples include the DLS-trees of Górecki and Tiuryn (2006), which can be seen as event-labeled gene trees with leaves denoting both surviving genes and loss-events, maps \(g:V(S')\rightarrow 2^{V(T)}\) from a suitable subdivision \(S'\) of the species tree S to the gene tree as used by Hallett and Lagergren (2001), and associations of edges, i.e., subsets of \(E(T)\times E(S)\) (Wieseke et al. 2013).

In the presence of HGT, the relationships of gene trees and species are not only constrained by local conditions corresponding to the admissible local evolutionary events (duplication, speciation, gene loss, and HGT) but also by the global condition that the HGT events within each lineage admit a temporal order (Merkle and Middendorf 2005; Gorbunov and Lyubetsky 2009; Tofigh et al. 2011). In order to capture time consistency from the outset and to establish the mathematical framework, we consider here trees with explicit timing information (Merkle and Middendorf 2005).

Definition 3

(Time Map) The map \(\tau _{T}: V(T) \rightarrow {\mathbb {R}}\) is a time map for a tree T if \(x\prec _T y\) implies \(\tau _{T}(x)<\tau _{T}(y)\) for all \(x,y\in V(T)\).

It is important to note that only qualitative, relative timing information will be used in practice, i.e., we will never need the actual value of time maps but only information on whether an event pre-dates, post-dates, or is concurrent with another. Definition 3 ensures that the ancestor relation \(\preceq _T\) and the timing of the vertices are not in conflict. For later reference, we provide the following simple result.

Lemma 1

Given a tree T, a time map \(\tau _{T}\) for T satisfying \(\tau _{T}(x)=\tau _0(x)\) with arbitrary choices of \(\tau _0(x)\) for all \(x\in L(T)\) can be constructed in linear time.

Proof

We traverse T in postorder. If x is a leaf, we set \(\tau _{T}(x)=\tau _0(x)\), and otherwise compute \(t:=\max _{u\in {{\,\mathrm{child}\,}}(x)} \tau _{T}(u)\) and set \(\tau _{T}(x)=t'\) with an arbitrary value \(t'> t\). Clearly the total effort is \(O(|V(T)|+|E(T)|)\), and thus also linear in the number of leaves L(T). \(\square \)

Lemma 1 will be useful for the construction of time maps as it, in particular, allows us to put \(\tau _{T}(x)=\tau _{T}(y)\) for all \(x,y\in L(T)\).

Definition 4

(Time consistency) Let T and S be two trees. A map \(\mu :V(T) \rightarrow V(S) \cup E(S)\) is called time-consistent if there are time maps \(\tau _{T}\) for T and \(\tau _{S}\) for S satisfying the following conditions for all \(u \in V(T)\):

  1. (C1)

    If \(\mu (u) \in V(S)\), then \(\tau _{T}(u) = \tau _{S}(\mu (u))\).

  2. (C2)

    Else, if \(\mu (u) = (x,y) \in E(S)\), then \(\tau _{S}(y)<\tau _{T}(u)<\tau _{S}(x)\).

Conditions (C1) and (C2) ensure that the reconciliation map \(\mu \) preserves time in the following sense: If vertex u of the gene tree is mapped to a vertex \(\mu (u)=v\) in the species tree, then u and v receive the same time stamp by Condition (C1). If u is mapped to an edge \(\mu (u) = (x,y)\), then the time stamp of u falls within the time range \([\tau _{S}(x),\tau _{S}(y)]\) of the edge xy in the species tree. The following definition of reconciliation is designed (1) to be general enough to encompass the notions of reconciliation that have been studied in the literature, and (2) to separate the mapping between gene tree and species tree from specific types of events. Event types such as duplication or horizontal transfer therefore are considered here as a matter of interpreting scenarios, not as part of their definition.

Definition 5

(Relaxed reconciliation map) Let T and S be two planted trees with leaf sets L(T) and L(S), respectively and let \(\sigma :L(T)\rightarrow L(S)\) be a map. A map \(\mu :V(T)\rightarrow V(S)\cup E(S)\) is a relaxed reconciliation map for \((T,S,\sigma )\) if the following conditions are satisfied:

  1. (G0)

    Root Constraint. \(\mu (x) = 0_{S}\) if and only if \(x = 0_{T}\)

  2. (G1)

    Leaf Constraint. \(\mu (x)=\sigma (x)\) if and only if \(x\in L(T)\).

  3. (G2)

    Time Consistency Constraint. The map \(\mu \) is time-consistent for some time maps \(\tau _{T}\) for T and \(\tau _{S}\) for S.

Condition (G0) is used to map the respective planted roots. (G1) ensures that genes are mapped to the species in which they reside. (G2) enforces time consistency. The reconciliation maps most commonly used in the literature, see e.g. (Tofigh et al. 2011; Bansal et al. 2012), usually not only satisfy (G0)–(G2) but also impose additional conditions. We therefore call the map \(\mu \) defined here “relaxed”.

Definition 6

(relaxed Scenario) The 6-tuple \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) is a relaxed scenario if \(\mu \) is a relaxed reconciliation map for \((T,S,\sigma )\) that satisfies (G2) w.r.t. the time maps \(\tau _{T}\) and \(\tau _{S}\).

By definition, relaxed reconciliation maps are time-consistent. Moreover, \(\tau _{T}(x)=\tau _{S}(\sigma (x))\) for all \(x \in L(T)\) by Definitions 4(C1) and 5(G1,G2). In the following we will refer to the map \(\sigma :L(T)\rightarrow L(S)\) as the coloring of \({\mathcal {S}}\).

4 Later-divergence-time graphs

4.1 LDT graphs and \(\mu \)-free scenarios

In the absence of horizontal gene transfer, the last common ancestor of two species A and B should mark the latest possible time point at which two genes a and b residing in \(\sigma (a)=A\) and \(\sigma (b)=B\), respectively, may have diverged. Situations in which this constraint is violated are therefore indicative of HGT. To address this issue in some more detail, we next define “\(\mu \)-free scenarios” that eventually will lead us to the class of “LDT graphs” that contain all information about genes that diverged after the species in which they reside.

Definition 7

(\({\mu }\)-free scenario) Let T and S be planted trees, \(\sigma :L(T)\rightarrow L(S)\) be a map, and \(\tau _{T}\) and \(\tau _{S}\) be time maps of T and S, respectively, such that \(\tau _{T}(x) = \tau _{S}(\sigma (x))\) for all \(x\in L(T)\). Then, \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\) is called a \(\mu \)-free scenario.

This definition of a scenario without a reconciliation map \(\mu \) is mainly a technical convenience that simplifies the arguments in various proofs by avoiding the construction of a reconciliation map. It is motivated by the observation that the “later-divergence-time” of two genes in comparison with their species is independent from any such \(\mu \). Every relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) implies an underlying \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\). Statements proved for \(\mu \)-free scenarios therefore also hold for relaxed scenarios. Note that, by Lemma 1, given the time map \(\tau _{S}\), one can easily construct a time map \(\tau _{T}\) such that \(\tau _{T}(x) = \tau _{S}(\sigma (x))\) for all \(x\in L(T)\). In particular, when constructing relaxed scenarios explicitly, we may simply choose \(\tau _{T}(u)=0\) and \(\tau _{S}(x)=0\) as common time for all leaves \(u\in L(T)\) and \(x\in L(S)\). Although not all \(\mu \)-free scenarios admit a reconciliation map and thus can be turned into relaxed scenarios, Lemma 2 below implies that for every \(\mu \)-free scenario \({\mathcal {T}}\) there is a relaxed scenario with possibly slightly distorted time maps that encodes the same LDT graph as \({\mathcal {T}}\).

Definition 8

(LDT graph) For a \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\), we define \(G_{_{<}}({\mathcal {T}}) = G_{_{<}}(T,S,\sigma ,\tau _{T},\tau _{S}) = (V,E)\) as the graph with vertex set \(V:=L(T)\) and edge set

$$\begin{aligned} E :=\{ab\mid a,b\in L(T), \tau _{T}({{\,\mathrm{lca}\,}}_T(a,b))<\tau _{S}({{\,\mathrm{lca}\,}}_S(\sigma (a),\sigma (b))). \} \end{aligned}$$

A vertex-colored graph \((G,\sigma )\) is a later-divergence-time graph (LDT graph), if there is a \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\) such that \(G=G_{_{<}}({\mathcal {T}})\). In this case, we say that \({\mathcal {T}}\) explains \((G,\sigma )\).

It is easy to see that the edge set of \(G_{_{<}}({\mathcal {T}})\) defines an undirected graph and that two genes a and b form an edge if the divergence time of a and b is strictly less than the divergence time of the underlying species \(\sigma (a)\) and \(\sigma (b)\). Moreover, there are no edges of the form aa, since \(\tau _{T}({{\,\mathrm{lca}\,}}_T(a,a)) = \tau _{T}(a) = \tau _{S}(\sigma (a)) =\tau _{S}({{\,\mathrm{lca}\,}}_S(\sigma (a),\sigma (a)))\). Hence \(G_{_{<}}({\mathcal {T}})\) is a simple graph.

By definition, every relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) satisfies \(\tau _{T}(x)=\tau _{S}(\sigma (x))\) all \(x \in L(T)\). Therefore, removing \(\mu \) from \({\mathcal {S}}\) yields a \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\). Thus, we will use the following simplified notation.

Definition 9

We put \(G_{_{<}}({\mathcal {S}}) :=G_{_{<}}(T,S,\sigma ,\tau _{T},\tau _{S})\) for a given relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) and the underlying \(\mu \)-free scenario \((T,S,\sigma ,\tau _{T},\tau _{S})\) and say, by slight abuse of notation, that \({\mathcal {S}}\) explains \((G_{_{<}}({\mathcal {S}}),\sigma )\).

The next two results show that the existence of a reconciliation map \(\mu \) does not impose additional constraints on LDT graphs.

Lemma 2

For every \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\), there is a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,{\widetilde{\tau _{T}}},{\widetilde{\tau _{S}}})\) for TS and \(\sigma \) such that \((G_{_{<}}({\mathcal {T}}),\sigma ) = (G_{_{<}}({\mathcal {S}}), \sigma )\).

Theorem 1

\((G,\sigma )\) is an LDT graph if and only if there is a relaxed scenario \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) such that \((G,\sigma ) = (G_{_{<}}({\mathcal {S}}),\sigma )\).

Remark 1

From here on, we omit the explicit reference to Lemma 2 and Theorem 1 and assume that the reader is aware of the fact that every LDT graph is explained by some relaxed scenario \({\mathcal {S}}\) and that for every \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\), there is a relaxed scenario \({\mathcal {S}}\) for TS and \(\sigma \) such that \((G_{_{<}}({\mathcal {T}}),\sigma ) = (G_{_{<}}({\mathcal {S}}), \sigma )\).

Fig. 1
figure 1

Top row: A relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) (left) with its LDT graph \((G_{_{<}}({\mathcal {S}}),\sigma )\) (right). The reconciliation map \(\mu \) is shown implicitly by the embedding of the gene tree T into the species tree S. The times \(\tau _{T}\) and \(\tau _{S}\) are indicated by the position on the vertical axis, i.e., if a vertex x is drawn higher than a vertex y, this implies \(\tau _{T}(y)<\tau _{T}(x)\). In subsequent figures we will not show the time maps explicitly. Bottom row: Another relaxed scenario \({\mathcal {S}}' =(T',S',\sigma ',\mu ',\tau _{T}',\tau _{S}')\) with a connected LDT graph \((G_{_{<}}({\mathcal {S}}'),\sigma ')\). As we shall see, connectedness of an LDT graph depends on the relative timing of the roots of the gene and species tree (cf. Lemma 11)

4.2 Properties of LDT graphs

We continue by deriving several interesting characteristics LDT graphs.

Proposition 3

Every LDT graph \((G,\sigma )\) is properly colored.

As we shall see below, LDT graphs \((G,\sigma )\) contain detailed information about both the underlying gene trees T and species trees S for all \(\mu \)-scenarios that explain \((G,\sigma )\), and thus by Lemma 2 and Theorem 1 also about every relaxed scenario \({\mathcal {S}}\) satisfying \(G=G_{_{<}}({\mathcal {S}})\). This information is encoded in the form of certain rooted triples that can be retrieved directly from local features in the colored graphs \((G,\sigma )\).

Definition 10

For a graph \(G=(L,E)\), we define the set of triples on L as

$$\begin{aligned} {\mathfrak {T}}(G) :=\{xy|z \; :x,y,z\in L \text { are pairwise distinct, } xy\in E,\; xz,yz\notin E\} \,. \end{aligned}$$

If G is endowed with a coloring \(\sigma :L\rightarrow M\) we also define a set of color triples

$$\begin{aligned} {\mathfrak {S}}(G,\sigma ) :=\{\sigma (x)\sigma (y)|\sigma (z)\; :&x,y,z\in L,\, \sigma (x),\sigma (y),\sigma (z) \text { are pairwise distinct},\\&xz, yz\in E,\; xy\notin E\}. \end{aligned}$$

Lemma 6

If a graph \((G,\sigma )\) is an LDT graph, then \({\mathfrak {S}}(G,\sigma )\) is compatible and S displays \({\mathfrak {S}}(G,\sigma )\) for every \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\) that explains \((G,\sigma )\).

The next lemma shows that induced \(K_2+K_1\) subgraphs in LDT graphs imply triples that must be displayed by the gene tree T.

Lemma 7

If \((G,\sigma )\) is an LDT graph, then \({\mathfrak {T}}(G)\) is compatible and T displays \({\mathfrak {T}}(G)\) for every \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\) that explains \((G,\sigma )\).

The next results shows that LDT graphs cannot contain induced \(P_4\)s.

Lemma 8

Every LDT graph \((G,\sigma )\) is a properly colored cograph.

The converse of Lemma 8 is not true is in general. To see this, consider the properly-colored cograph \((G,\sigma )\) with vertex \(V(G)=\{a,a',b,b',c,c'\}\), edges \(ab,bc, a'b',a'c' \) and coloring \(\sigma (a)=\sigma (a')=A\), \(\sigma (b)=\sigma (b')=B\), and \(\sigma (c)=\sigma (c')=C\) with ABC being pairwise distinct. In this case, \({\mathfrak {S}}(G,\sigma )\) contains the triples AC|B and BC|A. By Lemma 6, the tree S in every \(\mu \)-free scenario \({\mathcal {T}}=(T,S,\sigma ,\tau _{T},\tau _{S})\) or relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) explaining \((G,\sigma )\) displays AC|B and BC|A. Since no such scenario can exist, \((G,\sigma )\) is not an LDT graph.

4.3 Recognition and characterization of LDT graphs

In order to design an algorithm for the recognition of LDT graphs, we will consider partitions of the vertex set of a given input graph \((G=(L,E),\sigma )\). To construct suitable partitions, we start with the connected components of G. The coloring \(\sigma :L\rightarrow M\) imposes additional constraints. We capture these with the help of binary relations that are defined in terms of partitions \({\mathscr {C}}\) of the color set M and employ them to further refine the partition of G.

Definition 12

Let \((G=(L,E),\sigma )\) be a graph with coloring \(\sigma :L\rightarrow M\). Let \({\mathscr {C}}\) be a partition of M, and \({\mathscr {C}}'\) be the set of connected components of G. We define the following binary relation \({\mathfrak {R}}(G, \sigma , {\mathscr {C}})\) by setting

$$\begin{aligned} (x,y)\in {\mathfrak {R}}(G, \sigma , {\mathscr {C}}) \iff x,y\in L,\; \sigma (x), \sigma (y)&\in C \text { for some } C\in {\mathscr {C}}, \text { and } \\ x,y&\in C' \text { for some } C'\in {\mathscr {C}}'. \end{aligned}$$

By construction, two vertices \(x,y\in L\) are in relation \({\mathfrak {R}}(G, \sigma , {\mathscr {C}})\) whenever they are in the same connected component of G and their colors \(\sigma (x), \sigma (y)\) are contained in the same set of the partition of M. As shown in Lemma 9 in the Technical Part, the relation \({\mathfrak {R}}:={\mathfrak {R}}(G, \sigma , {\mathscr {C}})\) is an equivalence relation and every equivalence class of \({\mathfrak {R}}\) is contained in some connected component of G. In particular, each connected component of G is the disjoint union of \({\mathfrak {R}}\)-classes.

The following partition of the leaf sets of subtrees of a tree S rooted at some vertex \(u\in V(S)\) will be useful:

$$\begin{aligned}&\text {If } u \text { is not a leaf, then }&{\mathscr {C}}_{S}(u)&:=\{L(S(v)) \mid v\in {{\,\mathrm{child}\,}}_S(u)\} \\&\text {and, otherwise, }&{\mathscr {C}}_{S}(u)&:=\{\{u\}\}. \end{aligned}$$

One easily verifies that, in both cases, \({\mathscr {C}}_{S}(u)\) yields a valid partition of the leaf set L(S(u)). Recall that \(\sigma _{|L',M'}:L'\rightarrow M'\) was defined as the “submap” of \(\sigma \) with \(L'\subseteq L\) and \(\sigma (L') \subseteq M' \subseteq M\).

Lemma 10

Let \((G=(L,E),\sigma )\) be a properly colored cograph. Suppose that the triple set \({\mathfrak {S}}(G,\sigma )\) is compatible and let S be a tree on M that displays \({\mathfrak {S}}(G,\sigma )\). Moreover, let \(L'\subseteq L\) and \(u\in V(S)\) such that \(\sigma (L') \subseteq L(S(u))\). Finally, set \({\mathfrak {R}}:={\mathfrak {R}}(G[L'],\sigma _{|L',L(S(u))},{\mathscr {C}}_{S}(u))\).

Then, for all distinct \({\mathfrak {R}}\)-classes K and \(K'\), either \(xy\in E\) for all \(x\in K\) and \(y\in K'\), or \(xy\notin E\) for all \(x\in K\) and \(y\in K'\). In particular, for \(x\in K\) and \(y\in K'\), it holds that

$$\begin{aligned} xy\in E \iff K, K' \text { are contained in the same connected component of } G[L']. \end{aligned}$$
Fig. 2
figure 2

Visualization of Algorithm 1. A The case \(u_S\) is a leaf (cf. Line 8). BE The case \(u_S\) is an inner vertex (cf. Line 12). B The subgraph of \((G,\sigma )\) induced by \(L'\). C The local topology of the species tree S yields \({\mathscr {C}}_{S}(u_S)=\{\{A,B,\dots \},\{C,D,\dots \}\}\). Note that \(L(S(u_S))\) may contain colors that are not present in \(\sigma (L')\) (not shown). D The equivalence classes of \({\mathfrak {R}}:={\mathfrak {R}}(G[L'], \sigma _{|L',L(S(u))}, {\mathscr {C}}_{S}(u_S))\). E The vertex \(u_T\) and the vertices \(v_T\) are created in this recursion step. The vertices \(w_K\) corresponding to the \({\mathfrak {R}}\)-classes K are created in the next-deeper steps. Note that some vertices have only a single child, and thus get suppressed in Line 25

Lemma 10 suggests a recursive strategy to construct a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) for a given properly-colored cograph \((G,\sigma )\), which is illustrated in Fig. 2. The starting point is a species tree S displaying all the triples in \({\mathfrak {S}}(G,\sigma )\) that are required by Lemma 6. We show below that there are no further constraints on S and thus we may choose \(S={{\,\mathrm{Aho}\,}}({\mathfrak {S}}(G,\sigma ),L)\) and endow it with an arbitrary time map \(\tau _{S}\). Given \((S,\tau _{S})\), we construct \((T,\tau _{T})\) in top-down order. In order to reduce the complexity of the presentation and to make the algorithm more compact and readable, we will not distinguish the cases in which \((G,\sigma )\) is connected or disconnected, nor whether a connected component is a superset of one or more \({\mathfrak {R}}\)-classes. The tree T therefore will not be phylogenetic in general. We shall see, however, that this issue can be alleviated by simply suppressing all inner vertices with a single child.

figure a

The root \(u_T\) is placed above \(\rho _S\) to ensure that no two vertices from distinct connected components of G will be connected by an edge in \(G_{_{<}}({\mathcal {S}})\). The vertices \(v_T\) representing the connected components C of G are each placed within an edge of S below \(\rho _S\). W.l.o.g., the edges \((\rho _S,v_S)\) are chosen such that the colors of the corresponding connected component C and the colors in \(L(S(v_S))\) overlap. Next we compute the relation \({\mathfrak {R}}:={\mathfrak {R}}(G,\sigma ,{\mathscr {C}}_{S}(\rho _S))\) and determine, for each connected component C, the \({\mathfrak {R}}\)-classes K that are a subset of C. For each of them, a child \(w_K\) is appended to the tree vertex \(v_T\). The subtree \(T(w_K)\) will have leaf set \(L(T(w_K))=K\). Since \({\mathfrak {R}}\) is defined on \({\mathscr {C}}_{S}(\rho _S)\) in this first step, \(G({\mathcal {S}})\) will have all edges between vertices that are in the same connected component C but in distinct \({\mathfrak {R}}\)-classes (cf. Lemma 10). The definition of \({\mathfrak {R}}\) also implies that we always find a vertex \(v_S\in {{\,\mathrm{child}\,}}_S(\rho _S)\) such that \(\sigma (K)\subseteq L(S(v_S))\) (more detailed arguments for this are given in the proof of Claim 4 in the proof of Theorem 2 below). Thus we can place \(w_K\) into this edge \((\rho _S,v_S)\), and proceed recursively on the \({\mathfrak {R}}\)-classes \(L':=K\), the induced subgraphs \(G[L']\) and their corresponding vertices \(v_S\in V(S)\), which then serve as the root of the species trees. More precisely, we identify \(w_K\) with the root \(u'_T\) created in the “next-deeper” recursion step. Since we alternate between vertices \(u_T\) for which no edges between vertices of distinct subtrees exist, and vertices \(v_T\) for which all such edges exist, we can label the vertices \(u_T\) with “0” and the vertices \(v_T\) with “1” and obtain a cotree for the cograph G.

This recursive procedure is described more formally in Algorithm 1 which also describes the constructions of an appropriate time map \(\tau _{T}\) for T and a reconciliation map \(\mu \). We note that we find it convenient to use as trivial case in the recursion the situation in which the current root \(u_S\) of the species tree is a leaf rather than the condition \(|L'|=1\). In this manner we avoid the distinction between the cases \(u_S\in L(S)\) and \(u_S\notin L(S)\) in the else-condition starting in Line 12. This results in a shorter presentation at the expense of more inner vertices that need to be suppressed at the end in order to obtain the final tree T. We proceed by proving the correctness of Algorithm 1.

Theorem 2

Let \((G,\sigma )\) be a properly colored cograph, and assume that the triple set \({\mathfrak {S}}(M,G)\) is compatible. Then Algorithm 1 returns a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) such that \(G_{_{<}}({\mathcal {S}})=G\) in polynomial time.

As a consequence of Lemma 6 and 8, and the fact that Algorithm 1 returns a relaxed scenario \({\mathcal {S}}\) for a given properly colored cograph with compatible triple set \({\mathfrak {S}}(G,\sigma )\), we obtain

Theorem 3

A graph \((G,\sigma )\) is an LDT graph if and only if it is a properly colored cograph and \({\mathfrak {S}}(G,\sigma )\) is compatible.

Theorem 3 has two consequences that are of immediate interest:

Corollary 2

LDT graphs can be recognized in polynomial time.

Corollary 3

The property of being an LDT graph is hereditary, that is, if \((G,\sigma )\) is an LDT graph then each of its vertex induced subgraphs is an LDT graph.

The relaxed scenarios \({\mathcal {S}}\) explaining an LDT graph \((G,\sigma )\) are far from being unique. In fact, we can choose from a large set of trees \((S,\tau _{S})\) that is determined only by the triple set \({\mathfrak {S}}(G,\sigma )\):

Corollary 4

If \((G=(L,E),\sigma )\) is an LDT graph with coloring \(\sigma :L\rightarrow M\), then for all planted trees S on M that display \({\mathfrak {S}}(G,\sigma )\) there is a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) that contains \(\sigma \) and S and that explains \((G,\sigma )\).

As shown in the Technical Part, for every LDT graph \((G,\sigma )\) there is a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) explaining \((G,\sigma )\) such that T displays the discriminating cotree \(T_{G}\) of G (cf. Corollary 5 in the Technical Part). However, this property is not satisfied by all relaxed scenarios that explain an \((G,\sigma )\). Nevertheless, the latter results enable us to relate connectedness of LDT graphs to properties of the relaxed scenarios by which it can be explained (cf. Lemma 11 in Technical Part).

4.4 Least resolved trees for LDT graphs

As we have seen e.g. in Corollary 4, there are in general many trees S and T forming relaxed scenarios \({\mathcal {S}}\) that explain a given LDT graph \((G,\sigma )\). This begs the question to what extent these trees are determined by “representatives”. For S, we have seen that S always displays \({\mathfrak {S}}(G,\sigma )\), suggesting to consider the role of \(S={{\,\mathrm{Aho}\,}}({\mathfrak {S}}(G,\sigma ),M)\), where M is the codomain of \(\sigma \). This tree is least resolved in the sense that there is no relaxed scenario explaining the LDT graph \((G,\sigma )\) with a tree \(S'\) that is obtained from S by edge-contractions. The latter is due to the fact that any edge contraction in \({{\,\mathrm{Aho}\,}}({\mathfrak {S}}(G,\sigma ),M)\) yields a tree \(S'\) that does not display \({\mathfrak {S}}(G,\sigma )\) any more (Jansson et al. 2012). By Proposition 6, none of the relaxed scenarios containing \(S'\) explain the LDT graph \((G,\sigma )\).

Definition 13

Let \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) be a relaxed scenario explaining the LDT graph \((G,\sigma )\). The planted tree T is least resolved for \((G,\sigma )\) if no relaxed scenario \((T',S',\sigma ',\mu ',\tau _{T}',\tau _{S}')\) with \(T'<T\) explains \((G,\sigma )\).

In other words, T is least resolved for \((G,\sigma )\) if no relaxed scenario with a gene tree \(T'\) obtained from T by a series of edge contractions explains \((G,\sigma )\).

Fig. 3
figure 3

Examples of LDT graphs \((G,\sigma )\) with multiple least resolved trees. Top row: No unique least resolved gene tree. For both trees, contraction of the single inner edge leads to a loss of the gene triple \(ab|c\in {\mathfrak {T}}(G)\) (cf. Lemma 7). The species tree is also least resolved since contraction of its single inner edge leads to loss of the species triples \(\sigma (a)\sigma (c)|\sigma (d), \sigma (b)\sigma (c)|\sigma (d)\in {\mathfrak {S}}(G,\sigma )\) (cf. Lemma 6). Bottom row: No unique least resolved species tree. Both trees display the two necessary triples \(AB|E,CD|E\in {\mathfrak {S}}(G,\sigma )\), and are again least resolved w.r.t. these triples. The gene trees are also least resolved since contraction of either of its two inner edges leads e.g. to loss of one of the triples \(ae|c, ce'|a\in {\mathfrak {T}}(G)\)

Fig. 4
figure 4

Example of an LDT graph \((G,\sigma )\) in B that is explained by the relaxed scenario shown in A. Here, \((G,\sigma )\) cannot be explained by a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu , \tau _{T},\tau _{S})\) such that T is the unique discriminating cotree (shown in C) for the cograph G, see D and the text for further explanations

The examples in Fig. 3 show that LDT graphs are in general not accompanied by unique least resolved trees. In the top row, relaxed scenarios with different least resolved gene trees T and the same least resolved species tree S explain the LDT graph \((G,\sigma )\). In the example below, two distinct least resolved species trees exist for a given least-resolved gene tree.

The example in Fig. 4 shows, furthermore, that the unique discriminating cotree \(T_G\) of an LDT graph \((G,\sigma )\) is not always “sufficiently resolved”. To see this, assume that the graph \((G,\sigma )\) in the example can be explained by a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu , \tau _{T},\tau _{S})\) such that \(T=T_G\). First consider the connected component consisting of abcd. Since \({{\,\mathrm{lca}\,}}_T(a,b)\succ _T {{\,\mathrm{lca}\,}}_T(c,d)\), \(ab\in E(G)\) and \(cd\notin E(G)\), we have \(\tau _{S}({{\,\mathrm{lca}\,}}_S(\sigma (a),\sigma (b)))> \tau _{T}({{\,\mathrm{lca}\,}}_T(a,b))> \tau _{T}({{\,\mathrm{lca}\,}}_T(c,d))\ge \tau _{S}({{\,\mathrm{lca}\,}}_S(\sigma (c),\sigma (d)))\). By similar arguments, the second connected component implies \(\tau _{S}({{\,\mathrm{lca}\,}}_S(\sigma (c),\sigma (d))) > \tau _{S}({{\,\mathrm{lca}\,}}_S(\sigma (a),\sigma (b)))\); a contradiction. These examples emphasize that LDT graphs constrain the relaxed scenarios, but are far from determining them.

5 Horizontal gene transfer and fitch graphs

5.1 HGT-labeled trees and rs-Fitch graphs

As alluded to in the introduction, the LDT graphs are intimately related with horizontal gene transfer. To formalize this connection we first define transfer edges. These will then be used to encode Walter Fitch’s concept of xenologous gene pairs (Fitch 2000; Darby et al. 2017) as a binary relation, and thus, the edge set of a graph.

Definition 14

Let \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) be a relaxed scenario. An edge (uv) in T is a transfer edge if \(\mu (u)\) and \(\mu (v)\) are incomparable in S. The HGT-labeling of T in \({\mathcal {S}}\) is the edge labeling \(\lambda _{{\mathcal {S}}}: E(T)\rightarrow \{0,1\}\) with \(\lambda (e)=1\) if and only if e is a transfer edge.

The vertex u in T thus corresponds to an HGT event, with v denoting the subsequent event, which now takes place in the “recipient” branch of the species tree. Note that \(\lambda _{{\mathcal {S}}}\) is completely determined by \({\mathcal {S}}\). In general, for a given a gene tree T, HGT events correspond to a labeling or coloring of the edges of T.

Definition 15

(Fitch graph) Let \((T,\lambda )\) be a tree T together with a map \(\lambda :E(T)\rightarrow \{0,1\}\). The Fitch graph \(\digamma (T,\lambda ) = (V,E)\) has vertex set \(V:=L(T)\) and edge set

$$\begin{aligned} E :=\{xy \mid x,y\in L,&\text { the unique path connecting } x \text { and } y \text { in } T \\&\text { contains an edge } e \text { with } \lambda (e)=1. \} \end{aligned}$$

By definition, Fitch graphs of 0/1-edge-labeled trees are loopless and undirected. We call edges e of \((T,\lambda )\) with label \(\lambda (e)=1\) also 1-edges and, otherwise, 0-edges.

Remark 2

Fitch graphs as defined here have been termed undirected Fitch graphs (Hellmuth et al. 2018), in contrast to the notion of the directed Fitch graphs of 0/1-edge-labeled trees studied e.g. in Geiß et al. (2018) and Hellmuth and Seemann (2019).

Proposition 5

(Hellmuth et al. 2018; Zverovich 1999) The following statements are equivalent.

  1. 1.

    G is the Fitch graph of a 0/1-edge-labeled tree.

  2. 2.

    G is a complete multipartite graph.

  3. 3.

    G does not contain \(K_2+K_1\) as an induced subgraph.

Definition 16

(rs-Fitch graph) Let \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) be a relaxed scenario with HGT-labeling \(\lambda _{{\mathcal {S}}}\). We call the vertex colored graph \((\digamma ({\mathcal {S}}),\sigma ) :=(\digamma (T,\lambda _{{\mathcal {S}}}),\sigma )\) the Fitch graph of the scenario \({\mathcal {S}}\).

A vertex colored graph \((G,\sigma )\) is a relaxed scenario Fitch graph (rs-Fitch graph) if there is a relaxed scenario \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) such that \(G = \digamma ({\mathcal {S}})\).

Fig. 5
figure 5

A The relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) as already shown in Fig. 1. B A 0/1-edge-labeled tree \((T,\lambda )\) satisfying \(\lambda =\lambda _{{\mathcal {S}}}\). C The corresponding Fitch graph \(\digamma (T,\lambda )\) drawn in a layout that emphasizes the property that \(\digamma (T,\lambda )\) is a complete multipartite graph. Independent sets are circled. D An alternative layout as in Fig. 1 (top row) that emphasizes the relationship \(G_{_{<}}({\mathcal {S}})\subseteq \digamma ({\mathcal {S}})=\digamma (T,\lambda )\) (cf. Theorem 4 below). Edges that are not present in \(G_{_{<}}({\mathcal {S}})\) are drawn as dashed lines

Figure 5 shows that rs-Fitch graphs are not necessarily properly colored. A subtle difficulty arises from the fact that Fitch graphs of 0/1-edge-labeled trees are defined without a reference to the vertex coloring \(\sigma \), while the rs-Fitch graph is vertex colored. This together with Proposition 5 implies

Observation 1

If \((G,\sigma )\) is an rs-Fitch graph then G is a complete multipartite graph.

The “converse” of Observation 1 is not true in general, as we shall see in Theorem 6 below. If, however, the coloring \(\sigma \) can be chosen arbitrarily, then every complete multipartite graph G can be turned into an rs-Fitch graph \((G,\sigma )\) as shown in Proposition 6.

Proposition 6

If G is a complete multipartite graph, then there exists a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) such that \((G,\sigma )\) is an rs-Fitch graph.

Although every complete multipartite graph can be colored in such a way that it becomes an rs-Fitch graph (cf. Proposition 6), there are colored, complete multipartite graphs \((G,\sigma )\) that are not rs-Fitch graphs, i.e., that do not derive from a relaxed scenario (cf. Theorem 6). We summarize this discussion in the following

Observation 2

There are (planted) 0/1-edge labeled trees \((T,\lambda )\) and colorings \(\sigma :L(T)\rightarrow M\) such that there is no relaxed scenario \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) with \(\lambda =\lambda _{{\mathcal {S}}}\).

A subtle—but important—observation is that trees \((T,\lambda )\) with coloring \(\sigma \) for which Observation 2 applies may still encode an rs-Fitch graph \((\digamma (T,\lambda ),\sigma )\), see Example 1 and Fig. 6. The latter is due to the fact that \(\digamma (T,\lambda ) = \digamma (T',\lambda ')\) may be possible for a different tree \((T',\lambda ')\) for which there is a relaxed scenario \({\mathcal {S}}' = (T',S,\sigma ,\mu ,\tau _{T},\tau _{S})\) with \(\lambda ' = \lambda _{{\mathcal {S}}}\). In this case, \((\digamma (T,\lambda ),\sigma ) = (\digamma ({\mathcal {S}}'),\sigma )\) is an rs-Fitch graph. We shall briefly return to these issues in the discussion Sect. 8.

Example 1

Consider the planted edge-labeled tree \((T,\lambda )\) shown in Fig. 6 with leaf set \(L=\{a,b,b',c,d\}\), together with a coloring \(\sigma \) where \(\sigma (b)=\sigma (b')\) and \(\sigma (a), \sigma (b), \sigma (c), \sigma (d)\) are pairwise distinct.

Assume, for contradiction, that there is a relaxed scenario \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) with \((T,\lambda ) = (T,\lambda _{{\mathcal {S}}})\). Hence, \(\mu (v)\) and \(\mu (b)=\sigma (b)\) as well as \(\mu (u)\) and \(\mu (b')=\sigma (b)\) must be comparable in S. Therefore, \(\mu (u)\) and \(\mu (v)\) must both be comparable to \(\sigma (b)\) and thus, they are located on the path from \(\rho _S\) to \(\sigma (b)\). But this implies that \(\mu (u)\) and \(\mu (v)\) are comparable in S; a contradiction, since then \(\lambda _{{\mathcal {S}}}(u,v) = 0\ne \lambda (u,v) = 1\).

Fig. 6
figure 6

0/1-edge-labeled tree \((T,\lambda )\) for which no relaxed scenario exists such that \((T,\lambda ) = (T,\lambda _{{\mathcal {S}}})\) (see Example 1). Red edges indicates 1-labeled edges. Nevertheless for \(\digamma :=\digamma (T,\lambda )\) there is an alternative tree \((T',\lambda ')\) for which a relaxed scenario \({\mathcal {S}}= (T',S,\sigma ,\mu ,\tau _{T},\tau _{S})\) exists (right) such that \(\digamma = \digamma (T',\lambda ') = \digamma ({\mathcal {S}})\)

5.2 LDT graphs and rs-Fitch graphs

We proceed to investigate to what extent an LDT graph provides information about an rs-Fitch graph. As we shall see in Theorem 5 there is indeed a close connection between rs-Fitch graphs and LDT graphs. We start with a useful relation between the edges of rs-Fitch graphs and the reconciliation maps \(\mu \) of their scenarios.

Lemma 13

Let \(\digamma ({\mathcal {S}})\) be an rs-Fitch graph for some relaxed scenario \({\mathcal {S}}\). Then, \(ab\notin E(\digamma ({\mathcal {S}}))\) implies that \({{\,\mathrm{lca}\,}}_S(\sigma (a),\sigma (b)) \preceq _S \mu ({{\,\mathrm{lca}\,}}_T(a,b))\).

The next result shows that a subset of transfer edges can be inferred immediately from LDT graphs:

Theorem 4

If \((G,\sigma )\) is an LDT graph, then \(G\subseteq \digamma ({\mathcal {S}})\) for all relaxed scenarios \({\mathcal {S}}\) that explain \((G,\sigma )\).

Since we only have that xy is an edge in \(\digamma ({\mathcal {S}})\) if the path connecting x and y in the tree T of \({\mathcal {S}}\) contains a transfer edge, Theorem 4 immediately implies

Corollary 6

For every relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) without transfer edges, it holds that \(E(G_{_{<}}({\mathcal {S}})) = \emptyset \).

Fig. 7
figure 7

Two relaxed scenarios \({\mathcal {S}}_1\) and \({\mathcal {S}}_2\) with the same rs-Fitch graph \(\digamma = \digamma ({\mathcal {S}}_1)=\digamma ({\mathcal {S}}_2)\) (right) and different LDT graphs \(G_{_{<}}({\mathcal {S}}_1)\ne \digamma \) and \(G_{_{<}}({\mathcal {S}}_2)=\digamma \)

Theorem 4 provides the formal justification for indirect phylogenetic approaches to HGT inference that are based on the work of Lawrence and Hartl (1992), Clarke et al. (2002), and Novichkov et al. (2004) by showing that \((x,y)\in E(G_{_{<}}({\mathcal {S}}))\) can be explained only by HGT, irrespective of how complex the true biological scenario might have been. However, it does not cover all HGT events. Figure 7 shows that there are relaxed scenarios \({\mathcal {S}}\) for which \(G_{_{<}}({\mathcal {S}}) \ne \digamma ({\mathcal {S}})\) even though \(\digamma ({\mathcal {S}})\) is properly colored. Moreover, it is possible that an rs-Fitch graph \((G,\sigma )\) contains edges \(xy\in E(G)\) with \(\sigma (x)=\sigma (y)\). In particular, therefore, an rs-Fitch graph is not always an LDT graph.

It is natural, therefore, to ask whether for every properly colored Fitch graph there is a relaxed scenario \({\mathcal {S}}\) such that \(G_{_{<}}({\mathcal {S}}) = \digamma ({\mathcal {S}})\). An affirmative answer is provided by

Theorem 5

The following statements are equivalent.

  1. 1.

    \((G,\sigma )\) is a properly colored complete multipartite graph.

  2. 2.

    There is a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) with coloring \(\sigma \) such that \(G=G_{_{<}}({\mathcal {S}}) = \digamma ({\mathcal {S}})\).

  3. 3.

    \((G,\sigma )\) is complete multipartite and an LDT graph.

  4. 4.

    \((G,\sigma )\) is properly colored and an rs-Fitch graph.

In particular, for every properly colored complete multipartite graph \((G,\sigma )\) the triple set \({\mathfrak {S}}(G,\sigma )\) is compatible.

relaxed scenarios for which \((\digamma ({\mathcal {S}}),\sigma )\) is properly colored do not admit two members of the same gene family that are separated by a HGT event. While restrictive, such models are not altogether unrealistic. Proper coloring of \((\digamma ({\mathcal {S}}),\sigma )\) is, in particular, the case if every horizontal transfer is replacing, i.e., if the original copy is effectively overwritten by homologous recombination (Thomas and Nielsen 2005), see also (Choi et al. 2012) for a detailed case study in Streptococcus. As a consequence of Theorem 5, LDT graphs are sufficient to describe replacing HGT. However, the incidence rate of replacing HGT decreases exponentially with phylogenetic distance between source and target (Williams et al. 2012), and additive HGT becomes the dominant mechanism between phylogenetically distant organisms. Still, replacing HGTs may also be the result of additive HGT followed by a loss of the (functionally redundant) vertically inherited gene.

5.3 rs-Fitch graphs with general colorings

In scenarios with additive HGT, the rs-Fitch graph is no longer properly colored and no-longer coincides with the LDT graph. Since not every vertex-colored complete multipartite graph \((G,\sigma )\) is an rs-Fitch graph (cf. Theorem 6), we ask whether an LDT \((G,\sigma )\) that is not itself already an rs-Fitch graph imposes constraints on the rs-Fitch graphs \((\digamma ({\mathcal {S}}),\sigma )\) that derive from relaxed scenarios \({\mathcal {S}}\) that explain \((G,\sigma )\). As a first step towards this goal, we aim to characterize rs-Fitch graphs, i.e., to understand the conditions imposed by the existence of an underlying scenario \({\mathcal {S}}\) on the compatibility of the collection of independent sets \({\mathscr {I}}\) of G and the coloring \(\sigma \). As we shall see, these conditions can be explained in terms of an auxiliary graph that we introduce in a very general setting:

Definition 17

Let L be a set, \(\sigma :L\rightarrow M\) a map and \({\mathscr {I}}=\{I_1,\dots , I_k\}\) a set of subsets of L. Then the graph \({\mathcal {A}}_{\digamma }(\sigma ,{\mathscr {I}})\) has vertex set M and edges xy if and only if \(x\ne y\) and \(x,y\in \sigma (I')\) for some \(I'\in {\mathscr {I}}\).

By construction \({\mathcal {A}}_{\digamma }(\sigma ,\mathscr {I'})\) is a subgraph of \({\mathcal {A}}_{\digamma }(\sigma ,{\mathscr {I}})\) whenever \(\mathscr {I'}\subseteq {\mathscr {I}}\). An extended version of Definition 17 that contains also an edge-labeling of \({\mathcal {A}}_{\digamma }(\sigma ,{\mathscr {I}})\) can be found in the Technical Part—this technical detail is not needed here. As it turns out, rs-Fitch graphs are characterized by the structure of their auxiliary graphs \({\mathcal {A}}_{\digamma }\) as shown in the next

Theorem 6

A graph \((G,\sigma )\) is an rs-Fitch graph if and only if (i) it is complete multipartite with independent sets \({\mathscr {I}}=\{I_1,\dots , I_k\}\), and (ii) if \(k>1\), there is an independent set \(I'\in {\mathscr {I}}\) such that \({\mathcal {A}}_{\digamma }(\sigma ,{\mathscr {I}}{\setminus }\{I'\})\) is disconnected.

As a consequence of Theorem 6, we obtain

Corollary 9

rs-Fitch graphs can be recognized in polynomial time.

As for LDT graphs, the property of being an rs-Fitch graph is hereditary.

Corollary 14

If \((G=(L,E),\sigma )\) is an rs-Fitch graph, then the colored vertex induced subgaph \((G[W],\sigma _{|W})\) is an rs-Fitch graph for all non-empty subsets \(W\subseteq L\).

Fig. 8
figure 8

Shown are three distinct relaxed scenarios \({\mathcal {S}}\), \({\mathcal {S}}'\) and \({\mathcal {S}}''\) with corresponding rs-Fitch graphs. Here \(\sigma ' = \sigma _{|\{a,a'\}}\) and \(\sigma '' = \sigma _{|\{a,a'\},\{A\}}\) (cf. Definition 1). Putting \((G,\sigma ) = (\digamma ({\mathcal {S}}),\sigma )\), one can observe that \((G[\{a,a'\}], \sigma ') = (\digamma ({\mathcal {S}}'),\sigma ')\) is an rs-Fitch graph. In contrast, \(\sigma ''\) is restricted to the “observable” part of species (consisting of A alone), and \((G[\{a,a'\}], \sigma '')\) is not an rs-Fitch graph, see text for further details

Note, however, that Corollary 14 is not satisfied if we restrict the codomain of \(\sigma \) to the observable part of colors, i.e., if we consider \(\sigma _{|W,\sigma (W)}:W \rightarrow \sigma (W)\) instead of \(\sigma _{|W}:W\rightarrow M\), even if \(\sigma \) is surjective. To see this consider the vertex colored graph \((G,\sigma )\) with \(V(G)=\{a,a',b\}\), \(E(G) = \{aa',ab,a'b\}\) and \(\sigma :V(G)\rightarrow M = \{A,B\}\) where \(\sigma (a) = \sigma (a')=A \ne \sigma (b)=B\). A possible relaxed scenario \({\mathcal {S}}\) for \((G,\sigma )\) is shown in Fig. 8A. The deletion of b yields \(W=V(G){\setminus } \{b\} = \{a,a'\}\) and the graph \((G[W],\sigma _{|W})\) for which \({\mathcal {S}}'\) with HGT-labeling \(\lambda _{{\mathcal {S}}'}\) as in Fig. 8B is a relaxed scenario that satisfies \(G[W] = \digamma (T,\lambda _{{\mathcal {S}}'})\). However, if we restrict the codomain of \(\sigma \) to obtain \(\sigma _{|W,\{A\}}:\{a,a'\} \rightarrow \sigma (W) =\{A\}\), then there is no relaxed scenario \({\mathcal {S}}\) for which \(G[W] = \digamma (T,\lambda _{{\mathcal {S}}})\), since there is only a single species tree S on \(L(S)=\{A\}\) (Fig. 8C) that consists of the single edge \((0_T,A)\) and thus, \(\mu (v)\) and \(\mu (a)\) as well as \(\mu (v)\) and \(\mu (a')\) must be comparable in this scenario.

5.4 Least resolved trees for Fitch graphs

It is important to note that the characterization of rs-Fitch graphs in Theorem 6 does not provide us with a characterization of rs-Fitch graphs that share a common relaxed scenario with a given LDT graph. As a potential avenue to address this problem we investigate the structure of least-resolved trees for Fitch graphs as possible source of additional constraints.

Definition 18

The edge-labeled tree \((T,\lambda )\) is Fitch-least-resolved w.r.t. \(\digamma (T,\lambda )\), if for all trees \(T'\ne T\) that are displayed by T and every labeling \(\lambda '\) of \(T'\) it holds that \(\digamma (T,\lambda )\ne \digamma (T',\lambda ')\).

As shown in the Technical Part (Theorem 7), Fitch-least-resolved trees can be characterized in terms of their edge-labeling, a result that is very similar to the results for “directed” Fitch graphs of 0/1-edge-labeled trees in Geiß et al. (2018). As a consequence of this characterization, Fitch-least-resolved trees can be constructed in polynomial time. However, Fitch-least-resolved trees are far from being unique. In particular, Fitch-least-resolved trees are only of very limited use for the construction of relaxed scenarios \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) from an underlying Fitch graph. In fact, even though \((G,\sigma )\) is an rs-Fitch graph, Example 3 in the Technical Part shows that it is possible that there is no relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) with HGT-labeling \(\lambda _{{\mathcal {S}}}\) such that \((T,\lambda ) = (T,\lambda _{{\mathcal {S}}})\) for any of its Fitch-least-resolved trees \((T,\lambda )\).

6 Editing problems

6.1 Editing colored graphs to LDT graphs and Fitch graphs

Empirical estimates of LDT graphs from sequence data are expected to suffer from noise and hence to violate the conditions of Theorem 3. It is of interest, therefore, to consider the problem of correcting an empirical estimate \((G,\sigma )\) to the closest LDT graph. We therefore briefly investigate the usual three edge modification problems for graphs: completion only considers the insertion of edges, for deletion edges may only be removed, while solutions to the editing problem allow both insertions and deletions, see e.g. Burzyn et al. (2006).

Problem 1

(LDT-Graph-Modification (LDT-M))

Input::

A colored graph \((G =(V,E),\sigma )\) and an integer k.

Question::

Is there a subset \(F\subseteq E\) such that \(|F|\le k\) and \((G'=(V,E\star F),\sigma )\) is an LDT graph where \(\star \in \{{\setminus }, \cup , \varDelta \}\)?

We write LDT-E, LDT-C, LDT-D for the editing, completion, and deletion version of LDT-M. By virtue of Theorem 3, the LDT-M is closely related to the problem of finding a compatible subset \({\mathcal {R}}\subseteq {\mathfrak {S}}(G_{\mathcal {R}},\sigma )\) with maximum cardinality. The corresponding decision problem, MaxRTC, is known to be NP-complete (Jansson 2001, Thm. 1). In the technical part we prove

Theorem 9

LDT-M is NP-complete.

Even through at present it remains unclear whether rs-Fitch graphs can be estimated directly, the corresponding graph modification problems are at least of theoretical interest.

Problem 2

(rs-Fitch Graph-Modification (rsF-M))

Input::

A colored graph \((G =(V,E),\sigma )\) and an integer k.

Question::

Is there a subset \(F\subseteq E\) such that \(|F|\le k\) and \((G'=(V,E\star F),\sigma )\) is an rs-Fitch graph where \(\star \in \{{\setminus }, \cup , \varDelta \}\)?

As above, we write rsF-E, rsF-C, rsF-D for the editing, completion, and deletion version of rsF-M. Since rs-Fitch graphs are complete multipartite, their complements are disjoint unions of complete graphs. The problems rsF-M are thus closely related the cluster graph modification problems. Both Cluster Deletion and Cluster Editing are NP-complete, while Cluster Completion is polynomial (by completing each connected component to a clique, i.e., computing the transitive closure) (Shamir et al. 2004). We obtain

Theorem 10

rsF-C and rsF-E are NP-complete.

rsF-D remains open since the complement of the transitive closure of the complement of a colored graph \((G,\sigma )\) is not necessarily an rs-Fitch graph. This is in particular the case if \((G,\sigma )\) is complete multipartite but not an rs-Fitch graph.

6.2 Editing LDT graphs to Fitch graphs

Putative LDT graphs \((G,\sigma )\) can be estimated directly from sequence (dis)similarity data. The most direct approach was introduced by Novichkov et al. (2004), where, for (reciprocally) most similar genes x and y from two distinct species \(\sigma (x)=A\) and \(\sigma (x)=B\), dissimilarities \(\delta (x,y)\) between genes and dissimilarities \(\varDelta (A,B)\) of the underlying species are compared under the assumption of a (gene family specific) clock-rate r, i.e., the expectation that orthologous gene pairs satisfy \(\delta (x,y)\approx r \varDelta (A,B)\). In this setting, \(xy\in E(G)\) if \(\delta (x,y)< r \varDelta (A,B)\) at some level of statistical significance. The rate assumption can be relaxed to consider rank-order statistics. For fixed x, differences in the orders of \(\delta (x,y)\) and \(\varDelta (\sigma (x),\sigma (y))\) assessed by rank-order correlation measures have been used to identify x as HGT candidate e.g. Lawrence and Hartl (1992); Clarke et al. (2002). An interesting variation on the theme is described by Sevillya et al. (2020), who use relative synteny rather than sequence similarity for the same purpose. A more detailed account on estimating \((G,\sigma )\) will be given elsewhere.

In contrast, it seems much more difficult to infer a Fitch graph \((\digamma ,\sigma )\) directly from data. To our knowledge, no method for this purpose has been proposed in the literature. However, \((\digamma ,\sigma )\) is of much more direct practical interest because the independent sets of \(\digamma \) determine the maximal HGT-free subsets of genes, which could be analyzed separately by better-understood techniques. In this section, we therefore focus on the aspects of \((\digamma ,\sigma )\) that are not captured by LDT graphs \((G,\sigma )\). In the light of the previous section, these are in particular non-replacing HGTs, i.e., HGTs that result in genes x and y in the same species \(\sigma (x)=\sigma (y)\). In this case, \((\digamma ,\sigma )\) is no longer properly colored and thus \(G\ne \digamma \). To get a better intuition on this case consider three genes a, \(a'\), and b with \(\sigma (a)=\sigma (a')\ne \sigma (b)\) with \(ab\notin E(G)\) and \(a'b\in E(G)\). By Lemma 7, the gene tree T of any explaining relaxed scenario displays the triple \(a'b|a\). Fig. 9 shows two relaxed scenarios with a single HGT that explain this situation: In the first, we have \(aa'\in E(\digamma )\), while the other implies \(aa'\notin E(\digamma )\). Neither scenario is a priori less plausible than the other. Although the frequency of true homologous replacement via crossover decreases exponentially with the phylogenetic distance of donor and acceptor species (Williams et al. 2012), additive HGT with subsequent loss of one copy is an entirely plausible scenario.

Fig. 9
figure 9

Two relaxed scenarios with T displaying the triple \(a'b|a\) and explaining the same graph \((G,\sigma )\)

A pragmatic approach to approximate \((\digamma ,\sigma )\) is therefore to consider the step from an LDT graph \((G,\sigma )\) to \((\digamma ,\sigma )\) as a graph modification problem. First we note that Algorithm 1 explicitly produces a relaxed scenario \({\mathcal {S}}\) and thus implies a corresponding gene tree \(T_{{\mathcal {S}}}\) with HGT-labeling \(\lambda _{{\mathcal {S}}}\), and thus an rs-Fitch graph \((\digamma ({\mathcal {S}}),\sigma )\). However, Algorithm 1 was designed primarily as proof device. It produces neither a unique relaxed scenario nor necessarily the most plausible or a most parsimonious one. Furthermore, both the LDT graph \((G,\sigma )\) and the desired rs-Fitch graph \((\digamma ,\sigma )\) are consistent with a potentially very large number of scenarios. It thus appears preferable to altogether avoid the explicit construction of scenarios at this stage.

Since every LDT graph \((G,\sigma )\) is explained by some \({\mathcal {S}}\), it is also a spanning subgraph of the corresponding rs-Fitch graph \((\digamma ({\mathcal {S}}),\sigma )\). The step from an LDT graph \((G,\sigma )\) to an rs-Fitch graph \((\digamma ,\sigma )\) can therefore be viewed as an edge-completion problem. The simplest variation of the problem is

Problem 3

(Fitch graph completion) Given an LDT graph \((G,\sigma )\), find a minimum cardinality set Q of possible edges such that \(((V(G),E(G)\cup Q),\sigma )\) is a complete multipartite graph.

A close inspection of Problem 3 shows that the coloring is irrelevant in this version, and the actual problem to be solved is the problem Complete Multipartite Graph Completion with a cograph as input. We next show that this task can be performed in linear time. The key idea is to consider the complementary problem, i.e., the problem of deleting a minimum set of edges from the complementary cograph \({\overline{G}}\) such that the end result is a disjoint union of complete graphs. This is known as Cluster Deletion problem (Shamir et al. 2004), and is known to have a greedy solution for cographs (Gao et al. 2013).

Lemma 18

There is a linear-time algorithm to solve Problem 3 for every cograph G.

All maximum clique partitions of a cograph G have the same sequence of cluster sizes (Gao et al. 2013, Thm. 1). However, they are not unique as partitions of the vertex set V(G). Thus the minimal editing set Q that needs to be inserted into a cograph to reach a complete multipartite graphs will not be unique in general. In the Technical Part, we briefly sketch a recursive algorithm operating on the cotree of \({\overline{G}}\).

However, an optimal solution to Problem 3 with input \((G,\sigma )\) does not necessarily yield an rs-Fitch graph or an rs-Fitch graph \((\digamma ({\mathcal {S}}),\sigma )\) such that \(G=G_{_{<}}({\mathcal {S}})\), see Fig. 10. In particular, there are LDT graphs \((G,\sigma )\) for which more edges need to be added to obtain an rs-Fitch graph than the minimum required to obtain a complete multipartite graph, see Fig. 11.

Fig. 10
figure 10

Upper panel: A relaxed scenario \({\mathcal {S}}\) with LDT graph \((G_{_{<}}({\mathcal {S}}),\sigma )\) and rs-Fitch graph \((\digamma ({\mathcal {S}}),\sigma )\). There are two minimum edge completion sets that yield the complete multipartite graphs \((\digamma _1,\sigma )\) and \((\digamma _2,\sigma )\) (lower part). By Theorem 6, \((\digamma _2,\sigma )\) is not an rs-Fitch graph. The graph \((\digamma _1,\sigma )\) is an rs-Fitch graph for the relaxed scenario \({\mathcal {S}}'\). However, \(G_{_{<}}({\mathcal {S}})\ne G_{_{<}}({\mathcal {S}}')\) for all scenarios \({\mathcal {S}}'\) with \((\digamma ({\mathcal {S}}'),\sigma ) = (\digamma _1,\sigma )\). To see this, note that the gene tree \(T=((a,b),(a',b'))\) in \({\mathcal {S}}\) is uniquely determined by application of Lemma 5 and 7. Assume that there is any edge-labeling \(\lambda \) such that \(\digamma (T,\lambda ) = \digamma _1\). The none-edges in \(\digamma _1\) imply that along the two paths from a to \(a'\) and b to \(b'\) there is no transfer edge, that is, there cannot be any transfer edge in T; a contradiction

Fig. 11
figure 11

The LDT graph \((G_{_{<}}({\mathcal {S}}),\sigma )\) for the relaxed scenario \({\mathcal {S}}\) has a unique minimum edge completion set (as determined by full enumeration), resulting in the complete multipartite graph \((\digamma _1,\sigma )\). However, Theorem 6 implies that \((\digamma _1,\sigma )\) is not rs-Fitch graph. An edge completion set with more edges must be used to obtain an rs-Fitch graph, for instance \((\digamma _2,\sigma )\), which is explained by the scenario \({\mathcal {S}}'\)

A more relevant problems for our purposes, therefore is

Problem 4

(rs-Fitch graph completion) Given an LDT graph \((G,\sigma )\) find a minimum cardinality set Q of possible edges such that \(((V(G),E(G)\cup Q),\sigma )\) is an rs-Fitch graph.

The following, stronger version is what we ideally would like to solve:

Problem 5

(strong rs-Fitch graph completion) Given an LDT graph \((G,\sigma )\) find a minimum cardinality set Q of possible edges such that \(\digamma = ((V(G),E(G)\cup Q),\sigma )\) is an rs-Fitch graph and there is a common relaxed scenario \({\mathcal {S}}\), that is, \({\mathcal {S}}\) satisfies \(G = G_{_{<}}({\mathcal {S}})\) and \(\digamma = \digamma ({\mathcal {S}})\).

The computational complexity of Problems 4 and 5 is unknown. We conjecture, however, that both are NP-hard. In contrast to the application of graph modification problems to correct possible errors in the originally estimated data, the minimization of inserted edges into an LDT graph lacks a direct biological interpretation. Instead, most-parsimonious solutions in terms of evolutionary events are usually of interest in biology. In our framework, this translates to

Problem 6

(Min transfer completion) Let \((G,\sigma )\) be an LDT graph and \({\mathbb {S}}\) be the set of all relaxed scenarios \({\mathcal {S}}\) with \(G=G_{_{<}}({\mathcal {S}})\). Find a relaxed scenario \({\mathcal {S}}'\in {\mathbb {S}}\) that has a minimal number of transfer edges among all elements in \({\mathbb {S}}\) and the corresponding rs-Fitch graph \(\digamma ({\mathcal {S}}')\).

One way to address this problem might be as follows: Find edge-completion sets for the given LDT graph \((G,\sigma )\) that minimize the number of independent sets in the resulting rs-Fitch graph \(\digamma = ((V(G),E(G)\cup Q),\sigma )\). The intuition behind this idea is that, in this case, the number of pairs within the individual independent sets is maximized and thus, we get a maximized set of gene pairs without transfer along their connecting path in the gene tree. It remains an open question whether this idea always yields a solution for Problem 6.

7 Simulation results

Evolutionary scenarios covering a wide range of HGT frequencies were generated with the simulation library AsymmeTree (Stadler et al. 2020). The tool generates a planted species tree S with time map \(\tau _{S}\). A constant-rate birth-death process then generates a gene tree \(({{\widetilde{T}}},{\widetilde{\tau _{T}}})\) with additional branching events producing copies at inner vertex u of S propagating to each descendant lineage of u. To model HGT events, a recipient branch of S is selected at random. The simulation is event-based in the sense that each node of the “true” gene tree other than the planted root is one of speciation, gene duplication, horizontal gene transfer, gene loss, or a surviving gene. Here, the lost as well as the surviving genes form the leaf set of \({{\widetilde{T}}}\).

We used the following parameter settings for AsymmeTree: Planted species trees with a number of leaves between 10 and 50 (randomly drawn in each scenario) were generated using the Innovation Model (Keller-Schmidt and Klemm 2012) and equipped with a time map as described in Stadler et al. (2020). Multifurcations were introduced into the species tree by contraction of inner edges with a common probability \(p=0.2\) per edge to simulate. Gene trees therefore are also not binary in general. We used multifurcations to model the effects of limited phylogenetic resolution. Duplication and HGT events, however, always result in bifurcations in the gene tree \({{\widetilde{T}}}\). We considered different combinations of duplication, loss, and HGT event rates (indicated on the horizontal axis in Figs. 12, 13 and 14). For each combination of event rates, we simulated 1000 scenarios per event rate combination. Figure 12 summarizes basic statistics of the simulated data sets.

Fig. 12
figure 12

Top panel: Distribution of the numbers of species (i.e. species tree leaves), species thereof that contain at least one surviving genes, surviving genes in total (non-loss leaves in the gene trees), loss events (loss leaves), and horizontal transfer events (inner vertices that are HGT events). Bottom panel: Mean and standard deviation of these quantities. The numbers in the legend indicate the mean and standard deviation taken over all event rate combinations. The tuples on the horizontal axis give the rates for duplication, loss, and horizontal transfer

The simulation also determines the set of surviving genes \(L\subseteq L({\widetilde{T}})\), the reconciliation map \({\widetilde{\mu }}:V({\widetilde{T}})\rightarrow V(S)\cup E(S)\) and the coloring \(\sigma :L\rightarrow L(S)\) representing the species in which each surviving gene resides. From the true tree \({{\widetilde{T}}}\), the observable gene tree \(T={\widetilde{T}}_{|L}\) is obtained by recursively removing leaves that correspond to loss events, i.e. \(L({\widetilde{T}}){\setminus } L\), and suppressing inner vertices with a single child and setting \(\tau _{T}(x)={\widetilde{\tau _{T}}}(x)\) and \(\mu (x)={\widetilde{\mu }}(x)\) for all \(x\in V(T)\). This defines a relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\). From the scenario \({\mathcal {S}}\), we can immediately determine the associated HGT map \(\lambda _{{\mathcal {S}}}\), the Fitch graph \(\digamma ({\mathcal {S}})\), and the LDT graph \(G_{_{<}}({\mathcal {S}})\). We also consider \({\widetilde{{\mathcal {S}}}}=({{\widetilde{T}}}, S,\sigma ,{\widetilde{\mu }},{\widetilde{\tau _{T}}},\tau _{S})\) which, from a formal point of view, is not a relaxed scenario, see Fig. 13. In this example, the gene-species association \(\sigma :L \rightarrow L(S)\) is not a map for the entire leaf set \(L({{\widetilde{T}}})\). Still, we can define the true LDT graph \(G_{_{<}}({\widetilde{{\mathcal {S}}}})\) and the true Fitch graph \(\digamma ({\widetilde{{\mathcal {S}}}})\) of \({\widetilde{{\mathcal {S}}}}\) in the same way as LDT graphs using Definitions 89, and 16, respectively. Note that this does not guarantee that every true Fitch graph is also an rs-Fitch graph. The example in Fig. 13 shows, furthermore, that \(\digamma ({\widetilde{{\mathcal {S}}}})[L] \ne \digamma ({\mathcal {S}})\) is possible. For the LDT graphs, on the other hand, we have \(G_{_{<}}({\mathcal {S}}) = G_{_{<}}({\widetilde{{\mathcal {S}}}})\) because \({\widetilde{{\mathcal {S}}}}\) and \({\mathcal {S}}\) are based on the same time maps.

Fig. 13
figure 13

Left: Fraction of “visible” transfer edges among the “true” transfer edges in T in the simulated scenarios, i.e., the edges that correspond to a path in \({{\widetilde{T}}}\) containing at least one transfer edge w.r.t. \(\widetilde{{\mathcal {S}}}\) (see also the explanation in the text). The tuples on the horizontal axis give the rates for duplication, loss, and horizontal transfer. Since \(E:=E(\digamma ({\mathcal {S}})) \subseteq {\widetilde{E}} :=E(\digamma ({\widetilde{{\mathcal {S}}}})[L(T)])\), we also show the ratio \(|E|/|{{\widetilde{E}}}|\). Right: A relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) with an “invisible” transfer edge \((u,a')\) (as determined by the knowledge of \({\widetilde{{\mathcal {S}}}}=(\widetilde{T},S,\sigma ,{\widetilde{\mu }},{\widetilde{\tau _{T}}},\tau _{S})\)). In this example we have \(\digamma ({\widetilde{{\mathcal {S}}}})[L(T)=\{a,a'\}] \ne \digamma ({\mathcal {S}})\)

The distinction between the true graph \(\digamma ({\widetilde{{\mathcal {S}}}})[L]\) and the rs-Fitch graph \(\digamma ({\mathcal {S}})\) is closely related to the definition of transfer edges. So far, we only took into account transfer edges (uv) in the (observable) gene trees T, for which u and v are mapped to incomparable vertices or edges of the species trees S (cf. Definition 14). Thus, given the knowledge of the relaxed scenario \({\mathcal {S}}=(T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\), these transfer edges are in that sense “visible”. However, given \({\widetilde{{\mathcal {S}}}}=({{\widetilde{T}}}, S,\sigma ,{\widetilde{\mu }},{\widetilde{\tau _{T}}},\tau _{S})\), which still contains all loss branches, it is possible that a non-transfer edge in T corresponds to a path in \({{\widetilde{T}}}\) which contains a transfer edge w.r.t. \({\widetilde{{\mathcal {S}}}}\), i.e., some edge \((u,v)\in E({\widetilde{T}})\) such that \(\widetilde{\mu }(u)\) and \(\widetilde{\mu }(v)\) are incomparable in S. In particular, this is the case whenever a gene is transferred into some recipient branch followed by a back-transfer into the original branch and a loss in the recipient branch (see Fig. 13, right). Figure 13 shows that, in the majority of the simulated scenarios, the HGT information is preserved in the observable data. In fact, \(\digamma ({\mathcal {S}})=\digamma ({\widetilde{{\mathcal {S}}}})\) in \(86.7\%\) of simulated scenarios. Occasionally, however, we also encounter scenarios in which large fractions of the xenologous pairs are hidden from inference by the LDT-based approach.

In the following, we will only be concerned with estimating a Fitch graph \(\digamma ({\mathcal {S}})\), i.e., the graph resulting from the “visible” transfer edges. These were edgeless in about \(17.7\%\) of the observable scenarios \({\mathcal {S}}\) (all parameter combinations taken into account). In these cases the LDT and thus also the inferred Fitch graphs are edgeless. These scenarios were excluded from further analysis.

Fig. 14
figure 14

Xenologs inferred from LDT graphs. Only observable scenarios \({\mathcal {S}}\) whose LDT graph \((G_{_{<}}({\mathcal {S}}),\sigma )\) contains at least one edge are included (82.3% of all scenarios). The tuples on the horizontal axis give the rates for duplication, loss, and horizontal transfer. Top panel: Recall. Fraction of edges in \(\digamma ({\mathcal {S}})\) represented in \(G_{_{<}}({\mathcal {S}})\) (light blue). As an alternative, the fraction of edges in a “minimum edge completion” (m.e.c.) to the “closest” complete multipartite graph is shown in dark blue. We observe a substantial increase in the fraction of inferred edges. The Fitch graph \(\digamma ({\mathcal {S}}')\) obtained from the scenario \({\mathcal {S}}'\) produced by Algorithm 1 with input \((G_{_{<}}({\mathcal {S}}),\sigma )\) yields an even better recall (light green). Second panel: Increase in the number of correctly inferred edges relative to the LDT graph \(G_{_{<}}({\mathcal {S}})\). Third panel: Precision. In contrast to LDT graphs, which by Theorem 4 cannot contain false positive edges, this is not the case for the estimated Fitch graphs obtained as m.e.c. and by Algorithm 1. While false positive edges are typically rare, occasionally very poor estimates are observed. Bottom panel: Accuracy

We first ask how well the LDT graph \(G_{_{<}}({\mathcal {S}})\) approximates the Fitch graph \(\digamma ({\mathcal {S}})\). As shown in Fig. 14, the recall is limited. Over a broad range of parameters, the LDT graph contains about a third of the xenologous pairs. This begs the question whether the solution of the editing Problem 3, obtained using the exact recursive algorithm detailed in Sect. C in the Technical Part, leads to a substantial improvement. We find that recall indeed increases substantially, at very moderate levels of false positives. The editing approach achieves a median precision of well above 90% in most cases and a median recall of at least 60%, it provides results that are at the very least encouraging. We find that minimal edge completion (Problem 3) already yields an rs-Fitch graph in the vast majority of cases (99.8%, scenarios of all parameter combinations taken into account), even if we restrict the color set to \(M':=\sigma (L)\) (instead of L(S)) and thus force surjectivity of the coloring \(\sigma \). We note that the original LDT graph and the minimal edge completion may not always be explained by a common scenario. This suggests that it will be worthwhile to consider the more difficult editing problems for rs-Fitch graphs with a relaxed scenario \({\mathcal {S}}\) that at the same time explains the LDT graph.

Algorithm 1 provides a means to obtain an rs-Fitch graph satisfying the latter constraint but without giving any guarantees for optimality in terms of a minimal edge completion. An implementation is available in the current release of the AsymmeTree package. For the rs-Fitch graphs \(\digamma ({\mathcal {S}}')\) of the scenarios \({\mathcal {S}}'\) constructed by Algorithm 1 with \((G_{_{<}}({\mathcal {S}}),\sigma )\) as input, we observe another moderate increase of recall when compared with the minimal edge completion results. This comes, however, at the expense of a loss in precision. This is not surprising, since \(\digamma ({\mathcal {S}}')\) by construction contains at least as many edges as any minimal edge completion of \(G_{_{<}}({\mathcal {S}})\). Therefore, the number of both true positive and false positive edges in \(\digamma ({\mathcal {S}}')\) can be expected to be higher, resulting in a higher recall and lower precision, respectively.

The recall is given by \(TP / (TP + FN)\), and \(|E(\digamma ({\mathcal {S}}))|= TP + FN\) in terms of true positives TP and false negatives FN. Moreover, \(G_{_{<}}({\mathcal {S}})\) is a subgraph of the Fitch graphs \(\digamma _{\text {m.e.c.}}\) and \(\digamma ({\mathcal {S}}')\) inferred with editing or with Algorithm 1, respectively. The ratio \(|E(\digamma ({\mathcal {S}})) \cap E(\digamma ^*)| / |E(\digamma ({\mathcal {S}}) \cap E(G_{_{<}}({\mathcal {S}})))|\) with \(\digamma ^*\in \{\digamma _{\text {m.e.c.}}, \digamma ({\mathcal {S}}') \}\) therefore directly measures the increase in the number of correctly predicted xenologous pairs relative to the LDT. It is equivalent to the ratio of the respective recalls. By construction, the ratio is always \(\ge 1\). This is summarized as the second panel in Fig. 14.

8 Discussion and future directions

In this contribution, we have introduced later-divergence-time (LDT) graphs as a model capturing the subset of horizontal transfer detectable through the pairs of genes that have diverged later than their respective species. Within the setting of relaxed scenarios, LDT graphs \((G,\sigma )\) are exactly the properly colored cographs with a consistent triple set \({\mathfrak {S}}(G,\sigma )\). We further showed that LDT graphs describe a sufficient set of HGT events if and only if they are complete multipartite graphs. This corresponds to scenarios in which all HGT events are replacing. Otherwise, additional HGT events exist that separate genes from the same species. To better understand these, we investigated scenario-derived rs-Fitch graphs and characterized them as those complete multipartite graphs that satisfy an additional constraint on the coloring (expressed in terms of an auxiliary graph). Although the information contained in LDT graphs is not sufficient to unambiguously determine the missing HGT edges, we arrive at an efficiently solvable graph editing problem from which a “best guess” can be obtained. To our knowledge, this is the first detailed mathematical investigation into the power and limitation of an implicit phylogenetic method for HGT inference.

From a data analysis point of view, LDT graphs appear to be an attractive avenue to infer HGT in practice. While existing methods to estimate them from (dis)similarity data certainly can be improved, it is possible to use their cograph structure to correct the initial estimate in the same way as orthology data (Hellmuth et al. 2015). Although the LDT modification problems are NP-complete (Theorem 9), it does not appear too difficult to modify efficient cograph editing heuristics (Crespelle 2019; Hellmuth et al. 2020a) to accommodate the additional coloring constraints.

LDT graphs by themselves clearly do no contain sufficient information to completely determine a relaxed scenario. Additional information, e.g. a best match graph (Geiß et al. 2019, 2020a) will certainly be required. The most direct practical use of LDT information is to infer the Fitch graph, whose independent sets correspond to maximal HGT-free subsets of genes. These subsets can be analyzed separately (Hellmuth 2017) using recent results to infer gene family histories, including orthology relations from best match data (Geiß et al. 2020a; Schaller et al. 2021b). The main remaining unresolved question is whether the resulting HGT-free subtrees can be combined into a complete scenario using only relational information such as best match data. One way to attack this is to employ the techniques used by Lafond and Hellmuth (2020) to characterize the conditions under which a fully event-labled gene tree can be reconciled with unknown species trees. These not only resulted in an polynomial-time algorithm but also establishes additional constraints on the HGT-free subtrees. An alternative, albeit mathematically less appealing approach is to adapt classical phylogenetic methods to accommodate the HGT-free subtrees as constraints. We suspect that best match data can supply further, stringent constraints for this task. We will pursue this avenue elsewhere.

Several alternative routes can be followed to obtain Fitch graphs from LDT graphs. The most straightforward approach is to elaborate on the editing problems briefly discussed in Sect. 6. A natural question arising in this context is whether there are non-LDT edges that are shared by all minimal completion sets Q, and whether these “obligatory Fitch-edges” can be determined efficiently. A natural alternative is to modify Algorithm 1 to incorporate some form of cost function to favor the construction of biologically plausible scenarios. In a very different approach, one might also consider to use LDT graphs as constraints in probabilistic models to reconstruct scenarios, see e.g. Sjöstrand et al. (2014) and Khan et al. (2016).

Although we have obtained characterizations of both LDT graphs and rs-Fitch graphs, many open questions and avenues for future research remain.

Reconciliation maps The notion of relaxed reconciliation maps used here appears to be at least as general as alternatives that have been explored in the literature. It avoids the concurrent definition of event types and thus allows situations that may be excluded in a more restrictive setting. For example, relaxed scenarios may have two or more vertically inherited genes x and y in the same species with \(u:={{\,\mathrm{lca}\,}}_T(x,y)\) mapping to a vertex of the species trees. In the usual interpretation, u correspond to a speciation event (by virtue of \(\mu (u)\in V^0(S)\)); on the other hand, the descendants x and y constitute paralogs in most interpretations. Such scenarios are explicitly excluded e.g. in Stadler et al. (2020). Lemma 3 suggests that relaxed scenarios are sufficiently flexible to make it possible to replace a scenario \({\mathcal {S}}\) that is “forbidden” in response to such inconsistent interpretations of events by an “allowed” scenario \({\mathcal {S}}'\) with the same \(\sigma \) such that \(G_{_{<}}({\mathcal {S}})=G_{_{<}}({\mathcal {S}}')\). Whether this is indeed true, or whether a more restrictive definition of reconciliation imposes additional constraints of LDT graphs will of course need to be checked in each case.

The restriction of a \(\mu \)-free scenario to a subset \(L'\) of leaves of T and to a subset \(M'\) of leaves of S is well defined as long as \(\sigma (L')\subseteq M'\). One can also define a corresponding restriction of the reconciliation map \(\mu \). Most importantly, the deletion of some leaves of T may leave inner vertices in T with only a single child, which are then suppressed to recover a phylogenetic tree. This replaces paths in T by single edges and thus affects the definition of the HGT map \(\lambda _{{\mathcal {S}}}\) since a path in T that contains two adjacent vertices \(u_1\), \(u_2\) with incomparable images \(\mu (u_1)\) and \(\mu (u_2)\) may be replaced by an edge with comparable end points in the restricted scenario \({\mathcal {S}}'\). This means that HGT events may become invisible, and thus \(\digamma ({\mathcal {S}}')\) is not necessarily an induced subgraph of \(\digamma ({\mathcal {S}})\), but a subgraph that may lack additional edges. Note that this is in contrast to the assumptions made in the analysis of (directed) Fitch graphs of 0/1-edge-labeled graphs (Geiß et al. 2018; Hellmuth and Seemann 2019), where the information on horizontal transfers is inherited upon restriction of \((T,\lambda )\).

Observability The latter issue is a special case of the more general problem with observability of events. Conceptually, we assume that evolution followed a true scenario comprising discrete events (speciations, duplications, horizontal transfer, gene losses, and possibly other events such as hybridization which are not considered here). In computer simulations, of course we know this true scenario, as well as all event types. Gene loss not only renders some leaves invisible but also erases the evidence of all subtrees without surviving leaves. Removal of these vertices in general results in a non-phylogenetic gene tree that contains inner vertices with a single child. In the absence of horizontal transfer, this causes little problems and the unobservable vertices can be be removed as described in the previous paragraph, see e.g. Hernández-Rosales et al. (2012). The situation is more complicated with HGT. In Nøjgaard et al. (2018), an HGT-vertex is deemed observable if it has both a horizontally and a vertically inherited descendant. In our present setting, the scenario retains an HGT-edge by virtue of consecutive vertices in T with incomparable \(\mu \)-images, irrespective of whether an HGT-vertex is retained. This type of “vertex-centered” notion of xenology is explored further in Hellmuth et al. (2017). We suspect that these different points of view can be unified only when gene losses are represented explicitly or when gene and species tree trees are not required to be phylogenetic (with single-child vertices implicating losses). Either extension of the theory, however, requires a more systematic understanding of which losses need to be represented and what evidence can be acquired to “observe” them.

Impact of orthology Pragmatically, one would define two genes x and y to be orthologs if \(\mu ({{\,\mathrm{lca}\,}}_T(x,y))\in V^0(S)\), i.e., if x and y are the product of a speciation event. Lemma 3 implies that there is always a scenario without any orthologs that explains a given LDT graph \((G,\sigma )\). In particular, therefore, \((G,\sigma )\) makes no implications on orthology. Conversely, however, orthology information is available and additional information on HGT might become available. In a situation akin to Fig. 9 (with the ancestral duplication moved down to the speciation), knowing that a and b are orthologs in the more restrictive sense that \(\mu ({{\,\mathrm{lca}\,}}_T(a,b))={{\,\mathrm{lca}\,}}_S(\sigma (a),\sigma (b))\) excludes the r.h.s. scenario and implies that \(a'\) is the horizontally inherited child, and therefore also that a and \(a'\) are xenologs. This connection of orthology and xenology will be explored elsewhere.

Other types of implicit phylogenetic information LDT graphs are not the only conceivable type of accessible xenology information. A large class of methods is designed to assess whether a single gene is a xenolog, i.e., whether there is evidence that it has been horizontally inserted into the genome of the recipient species. The main subclasses evaluate nucleotide composition patterns, the phyletic distribution of best-matching genes, or combination thereof. A recent overview can be found e.g. in Sánchez-Soto et al. (2020). It remains an open question how this information can be utilized in conjunction with other types of HGT information, such as LDT graphs. It seems reasonable to expect that it can provide not only additional constraints to infer rs-Fitch graphs but also provides directional information that may help to infer the directed Fitch graphs studied by Geiß et al. (2018) and Hellmuth and Seemann (2019)). Complementarily, we may ask whether it is possible to gain direct information on HGT edges between pairs of genes in the same genome, and if so, what needs to be measured to extract this information efficiently.

We also have to leave open several mathematical questions. Regarding 0/1-edge labeled trees \((T,\lambda )\), it would be of interest to know whether there is always a relaxed scenario \({\mathcal {S}}= (T,S,\sigma ,\mu ,\tau _{T},\tau _{S})\) such that \((T,\lambda ) = (T,\lambda _{{\mathcal {S}}})\) for a suitable choice of \(\sigma \). Elaborating on Theorem 5, it would be interesting to characterize the leaf colorings \(\sigma \) for \((T,\lambda )\) such that there is a relaxed scenario \({\mathcal {S}}\) with \(\digamma (T,\lambda ) = \digamma ({\mathcal {S}})\).