Best match graphs

Best match graphs arise naturally as the first processing intermediate in algorithms for orthology detection. Let T be a phylogenetic (gene) tree and σ an assignment of leaves of T to species. The best match graph (G,σ) is a digraph that contains an arc from x to y if the genes x and y reside in different species and y is one of possibly many (evolutionary) closest relatives of x compared to all other genes contained in the species σ(y). Here, we characterize best match graphs and show that it can be decided in cubic time and quadratic space whether (G,σ) is derived from a tree in this manner. If the answer is affirmative, there is a unique least resolved tree that explains (G,σ), which can also be constructed in cubic time.


Introduction
Symmetric best matches [43], also known as bidirectional best hits (BBH) [35], reciprocal best hits (RBH) [5], or reciprocal smallest distance (RSD) [45], are the most commonly employed method for inferring orthologs [3,4]. Practical applications typically produce, for each gene from species A, a list of genes found in species B, ranked in the order of decreasing sequence similarity. From these lists, reciprocal best hits are readily obtained. Some software tools, such as ProteinOrtho [30,31], explicitly construct a digraph whose arcs are the (approximately) co-optimal best matches. Empirically, the pairs of genes that are identified as reciprocal best hits depend on the details of the computational method for quantifying sequence similarity. Most commonly, blast or blat scores are used. Sometimes exact pairwise alignment algorithms are used to obtain a more accurate estimate of the evolutionary distance, see [33] for a detailed investigation. Independent of the computational details, however, reciprocal best matches are of interest because they approximate the concept of pairs of reciprocal evolutionarily most closely related genes. It is this notion that links best matches directly to orthology: Given a gene x in species a (and disregarding horizontal gene transfer), all its co-orthologous genes y in species b are by definition closest relatives of x.
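As a minimal sketch of how reciprocal best hits can be read off from pairwise similarity scores, consider the following Python fragment. The dictionaries `score` and `species` and both function names are hypothetical and chosen for illustration only; they do not correspond to the interface of any of the tools cited above. Ties are kept, so a gene may have several co-optimal best hits in the same species.

```python
from collections import defaultdict

def best_hits(score, species):
    """Arcs (x, y) such that y attains the maximal score of x among all
    genes of the species of y. Input names are illustrative assumptions."""
    # best[(x, s)] = highest score of x against any gene of species s
    best = defaultdict(lambda: float("-inf"))
    for (x, y), val in score.items():
        if species[x] != species[y]:
            key = (x, species[y])
            best[key] = max(best[key], val)
    return {(x, y) for (x, y), val in score.items()
            if species[x] != species[y] and val == best[(x, species[y])]}

def reciprocal_best_hits(score, species):
    """Unordered pairs {x, y} with x -> y and y -> x among the best hits."""
    arcs = best_hits(score, species)
    return {frozenset((x, y)) for (x, y) in arcs if (y, x) in arcs}

# Toy data: two genes in species A, one gene in species B.
species = {"a1": "A", "a2": "A", "b1": "B"}
score = {("a1", "b1"): 90, ("a2", "b1"): 70,
         ("b1", "a1"): 90, ("b1", "a2"): 70}
rbh = reciprocal_best_hits(score, species)
```

In this toy input, a2 has b1 as its best hit, but b1 prefers a1, so only the pair {a1, b1} is reciprocal.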
Evolutionary relatedness is a phylogenetic property and thus is defined relative to the phylogenetic tree T of the genes under consideration. More precisely, we consider a set of genes L (the leaves of the phylogenetic tree T), a set of species S, and a map σ assigning to each gene x ∈ L the species σ(x) ∈ S within which it resides. A gene x is more closely related to gene y than to gene z if lca(x, y) ≺ lca(x, z). As usual, lca denotes the last common ancestor, and p ≺ q denotes the fact that q is located above p along the path connecting p with the root of T. The partial order ⪯ (which also allows equality) is called the ancestor order on T. We can now make the notion of a best match precise:
Definition 1. Consider a tree T with leaf set L and a surjective map σ : L → S. Then y ∈ L is a best match of x ∈ L, in symbols x → y, if and only if lca(x, y) ⪯ lca(x, y′) holds for all leaves y′ from species σ(y′) = σ(y).
Figure 1: An evolutionary scenario (left) consists of a gene tree whose inner vertices are marked by the event type (• for speciations, for gene duplications, and × for gene loss) together with its embedding into a species tree (drawn as tube-like outline). All events are placed on a time axis. The middle panel shows the observable part of the gene tree (T, σ); it is obtained from the gene tree in the full evolutionary scenario by removing all leaves marked as loss events and suppression of all resulting degree two vertices [25,19]. The r.h.s. panel shows the colored best match graph (G, σ) that is explained by (T, σ). Directed arcs indicate the best match relation →. Bi-directional best matches (x → y and y → x) are drawn as solid lines without arrow heads instead of pairs of arrows. Dotted circles collect sets of leaves that have the same in- and out-neighborhood. The corresponding arcs are shown only once.
In order to understand how best matches (in the sense of Def. 1) are approximated by best hits computed by means of sequence similarity we first observe that best matches can be expressed in terms of the evolutionary time. Denote by t(x, y) the temporal distance along the evolutionary tree, as in Fig. 1. By definition t(x, y) is twice the time elapsed between lca(x, y) and x (or y), assuming that all leaves of T live in the present. Instead of Def. 1 we can then use "x → y holds if and only if t(x, y) ≤ t(x, y′) for all y′ with σ(y′) = σ(y) ≠ σ(x)." Mathematically, this is equivalent to Def. 1 whenever t is an ultrametric distance on T. For the temporal distance t this is the case. Best match heuristics therefore assume (often tacitly) that the molecular clock hypothesis [48,27] is at least a reasonable approximation.
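The distance-based formulation can be sketched in a few lines of Python. The dictionary `t`, keyed by unordered leaf pairs, and the function name are assumptions made for this illustration; the toy distances are ultrametric and correspond to a tree in which a1 and b1 are siblings and b2 branches off at the root.

```python
def best_matches_from_distance(t, sigma):
    """Arcs x -> y with t(x, y) minimal among all y' of the species of y.
    t maps frozenset({x, y}) to a distance; sigma maps leaves to species."""
    arcs = set()
    for x in sigma:
        for s in set(sigma.values()) - {sigma[x]}:
            candidates = [y for y in sigma if sigma[y] == s]
            d_min = min(t[frozenset((x, y))] for y in candidates)
            # keep all co-optimal candidates, as in the definition
            arcs |= {(x, y) for y in candidates
                     if t[frozenset((x, y))] == d_min}
    return arcs

sigma = {"a1": "A", "b1": "B", "b2": "B"}
t = {frozenset(("a1", "b1")): 2.0,
     frozenset(("a1", "b2")): 4.0,
     frozenset(("b1", "b2")): 4.0}
arcs = best_matches_from_distance(t, sigma)
```

Because only comparisons between distances enter, any strictly increasing transformation of `t` yields the same arc set.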
While this strong condition is violated more often than not, best match heuristics still perform surprisingly well on real-life data, in particular in the context of orthology prediction [46]. Despite practical problems, in particular in applications to Eukaryotic genes [9], reciprocal best heuristics perform at least as well for this task as methods that first estimate the gene phylogeny [4,41]. One reason for their resilience is that the identification of best matches only requires inequalities between sequence similarities. In particular, therefore, they are invariant under monotonic transformations and, in contrast e.g. to distance based phylogenetic methods, do not require additivity. Even more generally, it suffices that the evolutionary rates of the different members of a gene family are roughly the same within each lineage.
Best match methods are far from perfect, however. Large differences in evolutionary rates between paralogs, as predicted by the DDC model [13], for example, may lead to false negatives among co-orthologs and false positive best matches between members of slower subfamilies. Recent orthology detection methods recognize the sources of error and complement sequence similarity by additional sources of information. Most notably, synteny is often used to support or reject reciprocal best matches [31,26]. Another class of approaches combines the information of small sets of pairwise matches to improve orthology prediction [47,44]. In the Concluding Remarks we briefly sketch a simple quartet-based approach to identify incorrect best match assignments.
Figure 2: Not every graph with non-empty out-neighborhoods is a colored best match graph. The 4-vertex graph (G, σ) shown here is the smallest connected counterexample: there is no leaf-colored tree (T, σ) that explains (G, σ).
Extending the information used for the correction of initial reciprocal best hits to a global scale, it is possible to improve orthology prediction by enforcing the global cograph of the orthology relation [24,28]. This work originated from an analogous question: Can empirical reciprocal best match data be improved just by using the fact that ideally a best match relation should derive from a tree T according to Def. 1? To answer this question we need to understand the structure of best match relations.
The best match relation is conveniently represented as a colored digraph.
Definition 2. Given a tree T and a map σ : L → S, the colored best match graph (cBMG) G(T, σ) has vertex set L and arcs xy ∈ E(G) if x ≠ y and x → y. Each vertex x ∈ L obtains the color σ(x).
The rooted tree T explains the vertex-colored graph (G, σ) if (G, σ) is isomorphic to the cBMG G(T, σ).
To emphasize the number of colors used in G(T, σ), that is, the number of species in S, we will write |S|-cBMG.
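The definitions of best matches and of the cBMG translate directly into a brute-force computation. The sketch below assumes that the tree is given as a hypothetical `parent` dictionary (child to parent, with the root mapped to `None`) and the coloring as a dictionary `sigma` on the leaves; it uses the fact that all vertices lca(x, y′) lie on the path from x to the root, so that the lowest one in the ancestor order is simply the deepest one. It is meant to illustrate the definition, not as an efficient algorithm.

```python
def ancestors(parent, v):
    """Path from v up to the root, v itself included."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def lca_depth(parent, x, y):
    """Depth (number of edges below the root) of lca(x, y)."""
    anc_y = set(ancestors(parent, y))
    lca = next(v for v in ancestors(parent, x) if v in anc_y)
    return len(ancestors(parent, lca)) - 1

def best_match_graph(parent, sigma):
    """Arc set of G(T, sigma): x -> y iff lca(x, y) is lowest among
    all lca(x, y') with y' of the same color as y."""
    arcs = set()
    for x in sigma:
        for s in set(sigma.values()) - {sigma[x]}:
            candidates = [y for y in sigma if sigma[y] == s]
            deepest = max(lca_depth(parent, x, y) for y in candidates)
            arcs |= {(x, y) for y in candidates
                     if lca_depth(parent, x, y) == deepest}
    return arcs

# Toy tree: root r with children u and b2; u has children a1 and b1.
parent = {"r": None, "u": "r", "a1": "u", "b1": "u", "b2": "r"}
sigma = {"a1": "A", "b1": "B", "b2": "B"}
arcs = best_match_graph(parent, sigma)
```

In the toy tree, a1 has the single best match b1, whereas both b1 and b2 have a1 as their best match because a1 is the only leaf of color A.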
The purpose of this contribution is to establish a characterization of cBMGs as an indispensable prerequisite for any method that attempts to directly correct empirical best match data. After settling the notation we establish a few simple properties of cBMGs and show that key problems can be broken down to the connected components of 2-colored BMGs. These are considered in detail in Section 3. The characterization of 2-cBMGs is not a trivial task. Although the existence of at least one out-neighbor for each vertex is an obvious necessary condition, the example in Fig. 2 shows that it is not sufficient. In Section 3 we prove our main results on 2-cBMGs: the existence of a unique least resolved tree that explains any given 2-cBMG (Thm. 4), a characterization in terms of informative triples that can be extracted directly from the input graph (Thm. 8), and a characterization in terms of three simple conditions on the out-neighborhoods (Thm. 6). In Section 4 we provide a complete characterization of a general cBMG: It is necessary and sufficient that the subgraph induced by each pair of colors is a 2-cBMG and that the union of the triple sets of their least resolved tree representations is consistent. After a brief discussion of algorithmic considerations we close with a short introduction to questions for future research.

Notation
Given a rooted tree T = (V, E) with root ρ, we say that a vertex v ∈ V is an ancestor of u ∈ V, in symbols u ⪯ v, if v lies on the path from ρ to u. For an edge e = uv in the rooted tree T we assume that u is closer to the root of T than v. In this case, we call v a child of u, and u the parent of v and denote with child(u) the set of children of u. Moreover, e = uv is an outer edge if v ∈ L(T) and an inner edge otherwise. We write T(v) for the subtree of T rooted at v, L(T′) for the leaf set of some subtree T′ and σ(L′) = {σ(x) | x ∈ L′}. To avoid dealing with trivial cases we will assume that σ(L) = S contains at least two distinct colors. Furthermore, for |S| = 1, the edge-less graphs are explained by any tree. Hence, we will assume |S| ≥ 2 in the following. Without losing generality we may assume throughout this contribution that all trees are phylogenetic, i.e., all inner vertices of T (except possibly the root) have at least two children. A tree is binary if each inner vertex has exactly two children.
We follow the notation used e.g. in [40] and say that T′ is displayed by T, in symbols T′ ≤ T, if the tree T′ can be obtained from a subtree of T by contraction of edges. In addition, we will consider trees T with a coloring map σ : L(T) → S of its leaves, in short (T, σ). We say that (T, σ) displays or is a refinement of (T′, σ′), whenever T′ ≤ T and σ(v) = σ′(v) for all v ∈ L(T′).
We write T_{|L′} for the restriction of T to a subset L′ ⊆ L. We denote by lca(A) the last common ancestor of all elements of any set A of vertices in T. For later reference we note that lca(A ∪ B) = lca(lca(A), lca(B)). We sometimes write lca_T instead of lca to avoid ambiguities. We will often write A ⪯ x in case that lca(A) ⪯ x and therefore, that x is an ancestor of all a ∈ A.
A binary tree on three leaves is called a triple. In particular, we write xy|z for the triple on the leaves x, y and z if the path from x to y does not intersect the path from z to the root. We write r(T) for the set of all triples that are displayed by the tree T. In particular, we call a triple set R consistent if there exists a tree T that displays R, i.e., R ⊆ r(T). A rooted triple xy|z ∈ r(T) distinguishes an edge (u, v) in T if and only if x, y and z are descendants of u, v is an ancestor of x and y but not of z, and there is no descendant v′ of v for which x and y are both descendants. In other words, the edge (u, v) is distinguished by xy|z ∈ r(T) if lca(x, y) = v and lca(x, y, z) = u.
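The triple set r(T) of a small tree can be enumerated by brute force. The following sketch again assumes a hypothetical `parent` dictionary encoding of the tree; it uses the observation that xy|z is displayed exactly when lca(x, y) lies strictly below lca(x, z), and both vertices are on the path from x to the root, so comparing depths suffices. A triple xy|z is returned as the tuple (x, y, z).

```python
from itertools import combinations

def ancestors(parent, v):
    """Path from v up to the root, v itself included."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def lca_depth(parent, x, y):
    """Depth (number of edges below the root) of lca(x, y)."""
    anc_y = set(ancestors(parent, y))
    lca = next(v for v in ancestors(parent, x) if v in anc_y)
    return len(ancestors(parent, lca)) - 1

def triples(parent, leaves):
    """All triples ab|c displayed by the tree, encoded as (a, b, c)."""
    displayed = set()
    for x, y, z in combinations(sorted(leaves), 3):
        # try each of the three possible cherries on {x, y, z}
        for a, b, c in ((x, y, z), (x, z, y), (y, z, x)):
            if lca_depth(parent, a, b) > lca_depth(parent, a, c):
                displayed.add((a, b, c))
    return displayed

parent = {"r": None, "u": "r", "a1": "u", "b1": "u", "b2": "r"}
r_T = triples(parent, ["a1", "b1", "b2"])
```

A star tree on the same three leaves displays no triple at all, which the function reports as the empty set.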
By a slight abuse of notation we will retain the symbol σ also for the restriction of σ to a subset L′ ⊆ L. We write L[s] = {x ∈ L | σ(x) = s} for the color classes on the leaves of (T, σ) and denote by σ̄(x) = S \ {σ(x)} the set of colors different from the color of the leaf x.
All (di-)graphs considered here do not contain loops, i.e., there are no arcs of the form xx. For a given (di-)graph G = (V, E) and a subset W ⊆ V, we write G[W] for the induced subgraph of G that has vertex set W and contains all edges xy of G for which x, y ∈ W.

Basic Properties of Best Match Relations
The best match relation → is reflexive because lca(x, x) = x ≺ lca(x, y) for all genes y with σ(x) = σ(y). For any pair of distinct genes x and y with σ(x) = σ(y) we have lca(x, y) ∉ {x, y}, hence the relation → has off-diagonal pairs only between genes from different species. There is still a 1-1 correspondence between cBMGs (Def. 2) and best match relations (Def. 1): In the cBMG the reflexive loops are omitted, in the relation → they are added.
The tree (T, σ) and the corresponding cBMG G(T, σ) employ the same coloring map σ : L → S, i.e., our notion of isomorphy requires the preservation of colors. The usual definition of isomorphisms of colored graphs also allows an arbitrary bijection between the color sets. This is not relevant for our discussion: if (G′, σ′) and G(T, σ) are isomorphic in the usual sense then there is, by definition, a bijective relabeling of the colors in (G′, σ′) that makes them coincide with the vertex coloring of G(T, σ). In other words, if ϕ is an isomorphism from (G′, σ′) to G(T, σ) we assume w.l.o.g. that σ′(x) = σ(ϕ(x)), i.e., each vertex x ∈ V(G′) has the same color as the vertex ϕ(x) ∈ V(G).

Thinness
In undirected graphs, equivalence classes of vertices that share the same neighborhood are considered in the context of thinness of the graph [32,42,7]. The concept naturally extends to digraphs [21]. For our purposes the following variation on the theme is most useful: Definition 3. Two vertices x, y ∈ L are in relation ∼• if N(x) = N(y) and N⁻(x) = N⁻(y). For each ∼• class α we have N(x) = N(α) and N⁻(x) = N⁻(α) for all x ∈ α. It is obvious, therefore, that ∼• is an equivalence relation on the vertex set of G. Moreover, since we consider loop-free graphs, one can easily see that G[α] is always edge-less. We write N for the corresponding partition, i.e., the set of ∼• classes of G. Individual ∼• classes will be denoted by lowercase Greek letters. Moreover, we write N_s(x) = {z | z ∈ N(x) and σ(z) = s} and N⁻_s(x) = {z | z ∈ N⁻(x) and σ(z) = s} for the out- and in-neighborhoods of x restricted to a color s ∈ S. For the graphs considered here, we always have N_{σ(x)}(x) = N⁻_{σ(x)}(x) = ∅. When considering sets N_s(x) and N⁻_s(x) we always assume that s ≠ σ(x). Furthermore, N_s denotes the set of ∼• classes with color s.
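The ∼• classes, i.e., the sets of vertices that share both their out-neighborhood and their in-neighborhood, can be obtained by grouping vertices by the pair (N(x), N⁻(x)). The following Python sketch does exactly this; the function name and the representation of the digraph as a set of ordered pairs are illustrative assumptions.

```python
from collections import defaultdict

def thinness_classes(vertices, arcs):
    """Partition the vertices into classes with identical out- and
    in-neighborhoods. arcs is an iterable of ordered pairs (x, y)."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for x, y in arcs:
        out_nb[x].add(y)
        in_nb[y].add(x)
    classes = defaultdict(set)
    for v in vertices:
        # the pair of neighborhoods serves as the key of the class
        classes[(frozenset(out_nb[v]), frozenset(in_nb[v]))].add(v)
    return list(classes.values())

vertices = ["a1", "b1", "b2", "b3"]
arcs = {("a1", "b1"), ("b1", "a1"), ("b2", "a1"), ("b3", "a1")}
partition = thinness_classes(vertices, arcs)
```

In this example b2 and b3 both point to a1 and have no in-neighbors, so they fall into the same class, whereas b1 is separated from them by its in-arc from a1.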
By construction, the function N : V (G) → P(V (G)), where P(V (G)) is the power set of V (G), is isotonic, i.e., A ⊆ B implies N (A) ⊆ N (B). In particular, therefore, we have for α, β ∈ N : These observations will be useful in the proofs below.
By construction every vertex in a cBMG has at least one out-neighbor of every color except its own, i.e., |N(x)| ≥ |S| − 1 holds for all x. In contrast, N⁻(x) = ∅ is possible.
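This necessary condition is easy to test on a given colored digraph. The sketch below uses a hypothetical function name and the same set-of-pairs encoding of arcs as an assumption; since it compares color sets for equality, it also rejects graphs that contain arcs between vertices of the same color. Passing the test does not imply that the graph is a cBMG.

```python
def has_out_neighbor_of_every_other_color(sigma, arcs):
    """True iff every vertex has out-neighbors of exactly the colors
    different from its own (a necessary condition only)."""
    colors = set(sigma.values())
    seen = {x: set() for x in sigma}
    for x, y in arcs:
        seen[x].add(sigma[y])
    return all(seen[x] == colors - {sigma[x]} for x in sigma)

sigma = {"a1": "A", "b1": "B", "b2": "B"}
ok = has_out_neighbor_of_every_other_color(
    sigma, {("a1", "b1"), ("b1", "a1"), ("b2", "a1")})
bad = has_out_neighbor_of_every_other_color(
    sigma, {("a1", "b1"), ("b1", "a1")})  # b2 has no out-neighbor
```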

Some Simple Observations
The color classes L[s] on the leaves of T are independent sets in G(T, σ) since arcs in G(T, σ) connect only vertices with different colors. For any pair of colors s, t ∈ S, therefore, the induced subgraph G[L[s] ∪ L[t]] of G(T, σ) is bipartite. Since the definition of x → y does not depend on the presence or absence of vertices u with σ(u) ∉ {σ(x), σ(y)}, we have G[L[s] ∪ L[t]] = G(T_{|L[s]∪L[t]}, σ_{|L[s]∪L[t]}), where T_{|L[s]∪L[t]} is the restriction of T to the colors s and t. Furthermore, G is the edge-disjoint union of bipartite subgraphs corresponding to color pairs, i.e., E(G) = ⋃_{s,t ∈ S, s ≠ t} E(G[L[s] ∪ L[t]]). In order to understand when arbitrary graphs (G, σ) are cBMGs, it is sufficient, therefore, to characterize 2-cBMGs. A formal proof will be given later on in Section 4.
Note that the condition "T explains (G, σ)" does not imply that (T_{|L′}, σ) explains (G[L′], σ) for arbitrary subsets L′ ⊆ L. Fig. 3 shows that, indeed, not every induced subgraph of a cBMG is necessarily a cBMG. However, we have the following, weaker property: Lemma 1. Let (G, σ) be the cBMG explained by (T, σ), let T′ = T_{|L′} and let (G′, σ) be the cBMG explained by (T′, σ). Then u, v ∈ L′ and uv ∈ E(G) implies uv ∈ E(G′). In other words, (G[L′], σ) is always a subgraph of (G′[L′], σ).
, and thus the inequality

Connectedness
We briefly present some results concerning the connectedness of cBMGs. In particular, it turns out that connected cBMGs have a simple characterization in terms of their representing trees.
Theorem 2. Let (T, σ) be a leaf-labeled tree and G(T, σ) its cBMG. Then G(T, σ) is connected if and only if there is a child v of the root ρ such that σ(L(T(v))) ≠ S. Furthermore, if G(T, σ) is not connected, then for every connected component C of G(T, σ) there is a child v of the root ρ such that V(C) ⊆ L(T(v)).
Proof. For convenience we write L_v := L(T(v)). Suppose σ(L_v) = S holds for all children v of the root. Then for any pair of colors s, t ∈ S we find for a leaf x ∈ L_v with σ(x) = s a leaf y ∈ L_v with σ(y) = t within T(v); thus lca(x, y) is in T(v) and thus lca(x, y) ≺ ρ. Hence, all best matching pairs are confined to the subtrees below the children of the root. The corresponding leaf sets are thus mutually disconnected in G(T, σ).
Conversely, suppose that one of the children v of the root ρ satisfies σ(L_v) ≠ S. Therefore, there is a color t ∈ S with t ∉ σ(L_v). Then for every x ∈ L_v there is an arc x → z for all z ∈ L[t] since for all such z we have lca(x, z) = ρ. If L[t] = L \ L_v, we can conclude that G(T, σ) is a connected digraph. Otherwise, every leaf y ∈ L \ L_v with a color σ(y) ≠ t has an out-arc y → z to some z ∈ L[t] and thus there is a path y → z ← x connecting y ∈ L \ L_v to every x ∈ L_v. Finally, for any two vertices y, y′ ∈ L \ (L_v ∪ L[t]) there are vertices z, z′ ∈ L[t] such that arcs exist that form a path y → z ← x → z′ ← y′ connecting y with y′ and both to any x ∈ L_v. In summary, therefore, G(T, σ) is a connected digraph.
For the last statement, we argue as above and conclude that if σ(L_v) = S for all children v of the root (or, equivalently, if G(T, σ) is not connected), then all best matching pairs are confined to the subtrees below the children of the root ρ. Thus, the vertices of every connected component of G(T, σ) must be leaves of a subtree T(v) for some child v of the root ρ.
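The criterion established in the proof (the cBMG is disconnected exactly when every child of the root has leaves of all colors below it) can be checked directly on the tree and compared with the weak connectedness of a given arc set. The sketch below assumes a hypothetical `parent` dictionary encoding of the tree with at least two leaves; the arc sets of the two toy trees were worked out by hand from Def. 1.

```python
def root_child_colors(parent, sigma):
    """Map each child v of the root to the color set of the leaves below v."""
    colors = {}
    for leaf, col in sigma.items():
        v = leaf
        while parent[parent[v]] is not None:  # climb to a child of the root
            v = parent[v]
        colors.setdefault(v, set()).add(col)
    return colors

def connected_by_tree_criterion(parent, sigma):
    """True iff some child of the root misses at least one color."""
    all_colors = set(sigma.values())
    return any(c != all_colors
               for c in root_child_colors(parent, sigma).values())

def weakly_connected(vertices, arcs):
    """Connectedness of the underlying undirected graph."""
    vertices = list(vertices)
    adj = {v: set() for v in vertices}
    for x, y in arcs:
        adj[x].add(y)
        adj[y].add(x)
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(vertices)

# Tree 1: the leaf b2 is a child of the root and carries only color B.
p1 = {"r": None, "u": "r", "a1": "u", "b1": "u", "b2": "r"}
s1 = {"a1": "A", "b1": "B", "b2": "B"}
g1 = {("a1", "b1"), ("b1", "a1"), ("b2", "a1")}
# Tree 2: both children of the root contain both colors.
p2 = {"r": None, "u": "r", "w": "r",
      "a1": "u", "b1": "u", "a2": "w", "b2": "w"}
s2 = {"a1": "A", "b1": "B", "a2": "A", "b2": "B"}
g2 = {("a1", "b1"), ("b1", "a1"), ("a2", "b2"), ("b2", "a2")}
```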
The following result shows that cBMGs can be characterized by their connected components: the disjoint union of vertex disjoint cBMGs is again a cBMG if and only if they all share the same color set. It suffices therefore, to consider each connected component separately.
Proof. The statement is trivially fulfilled for k = 1. For k ≥ 2, the disjoint union (G, σ) is not connected. Assume first that all color sets coincide and let (T_i, σ_i) be a tree that explains (G_i, σ_i) for 1 ≤ i ≤ k. We construct a tree (T, σ) as follows: Let ρ be the root of (T, σ) with children r_1, ..., r_k. Then we identify r_i with the root of T_i and retain all leaf colors. In order to show that (T, σ) explains (G, σ) we recall from Thm. 2 that all best matching pairs are confined to the subtrees below the children of the root and hence, each connected component of (G, σ) forms a subset of one of the leaf sets L_i. Since each (T_i, σ_i) explains (G_i, σ_i), we conclude that the cBMG explained by (T, σ) is indeed the disjoint union of the (G_i, σ_i), i.e., (G, σ). Thus (G, σ) is a cBMG.
Conversely, assume that (G, σ) is a cBMG but σ_i(L_i) ≠ σ_k(L_k) for some k ≠ i. By construction, σ(L_i) = σ_i(L_i) and σ(L_k) = σ_k(L_k). In particular, for every color t ∉ σ(L_i) and every vertex x ∈ L_i, there is a j ≠ i with t ∈ σ(L_j) such that there exists an outgoing arc from x to some vertex y ∈ L_j with color σ(y) = t. Thus (x, y) is an arc connecting L_i with some L_j, j ≠ i, contradicting the assumption that each L_i forms a connected component of (G, σ). Hence, the color sets cannot differ between connected components.
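The tree construction used in the first part of the proof, in which the roots of the trees T_i become the children of a new root ρ, is a one-liner on a `parent` dictionary encoding. The encoding, the function name, and the label `"rho"` are assumptions for illustration; the vertex names of the input trees are assumed to be pairwise distinct.

```python
def join_under_new_root(parent_maps, root="rho"):
    """Attach the roots of the given trees as children of a new root.
    Each tree is a dict child -> parent with the root mapped to None."""
    joined = {root: None}
    for parent in parent_maps:
        for v, p in parent.items():
            # the old root r_i becomes a child of the new root
            joined[v] = root if p is None else p
    return joined

t1 = {"r1": None, "a1": "r1", "b1": "r1"}
t2 = {"r2": None, "a2": "r2", "b2": "r2"}
joined = join_under_new_root([t1, t2])
```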
The example (G(T {u,v,w} ), σ) in Fig. 3 already shows however that G(T, σ) is not necessarily strongly connected.

Two-Colored Best Match Graphs (2-cBMGs)
Throughout this section we assume that σ(L) = {s, t} contains exactly two colors.

Thinness Classes
A connected 2-cBMG contains at least two ∼• classes, since all in- and out-neighbors y of x by construction have a color σ(y) different from σ(x). Consequently, a 2-cBMG is bipartite. Furthermore, if σ(x) ≠ σ(y) then N(x) ∩ N(y) = ∅. Since N(x) ≠ ∅ and all members of N(x) have the same color, we observe that N(x) = N(y) implies σ(x) = σ(y). By a slight abuse of notation we will often write σ(x) = σ(α) for an element x of some ∼• class α. Two leaves x and y of the same color that have the same last common ancestor with all other leaves in T, i.e., that satisfy lca(x, u) = lca(y, u) for all u ∈ L \ {x, y}, by construction have the same in-neighbors and the same out-neighbors in G(T, σ); hence x ∼• y.
Observation 3. Let (G, σ) be a connected 2-cBMG and α ∈ N be a ∼ • class. Then, σ(x) = σ(y) for any x, y ∈ α. The following result shows that the out-neighborhood of any ∼ • class is a disjoint union of ∼ • classes.
Lemma 2. Let (G, σ) be a connected 2-cBMG. Then any two ∼• classes α, β ∈ N satisfy (N0) β ⊆ N(α) or β ∩ N(α) = ∅. Proof. For any y ∈ β, the definition of ∼• classes implies that y ∈ N(α) if and only if β ⊆ N(α). Hence, either all or none of the elements of β are contained in N(α).
The connection between the ∼• classes of G(T, σ) and the tree (T, σ) is captured by identifying an internal node in T that is, as we shall see, in a certain sense characteristic for a given equivalence class. Definition 4. The root ρ_α of the ∼• class α is ρ_α = max_{x∈α, y∈N(α)} lca(x, y). Corollary 1. Let ρ_α be the root of a ∼• class α. Then, for any y ∈ N(α), it holds that ρ_α = max_{x∈α} lca(x, y).
Proof. For any y ∈ N(α) it holds by definition of N(α) that lca(x, y) ⪯ lca(x, z) for x ∈ α and any z with σ(z) = σ(y). This together with Observation 3 implies that lca(x, y) = lca(x, z) for any two y, z ∈ N(α) and x ∈ α.
The following lemma collects some simple properties of the roots of ∼• classes that will be useful for the proofs of the main results. Lemma 3. Let (G, σ) be a connected 2-cBMG explained by (T, σ) and let α, β be ∼• classes with roots ρ_α and ρ_β, respectively. Then the following statements hold: (i) ρ_α ⪯ lca(α, β) and ρ_β ⪯ lca(α, β); equality holds for at least one of them if and only if ρ_α, ρ_β are comparable, i.e., ρ_α ⪯ ρ_β or ρ_β ⪯ ρ_α.
(ii) The subtree T (ρ α ) contains leaves of both colors.
(ii) As argued above, N(x) ≠ ∅ for all vertices x. Let x ∈ α and y ∈ N(x) such that ρ_α = lca(x, y). By definition, σ(x) ≠ σ(y). Since ρ_α is an ancestor of both x and y, the statement follows.
(iv) is a direct consequence of (i) and (iii).
(N0) implies that there are four distinct ways in which two ∼• classes α and β with distinct colors can be related to each other. These cases distinguish the relative location of their roots ρ_α and ρ_β: Lemma 4. Let (G, σ) be a connected 2-cBMG and let α, β be ∼• classes with σ(α) ≠ σ(β). Then exactly one of the following four cases is true: In this case ρ_α and ρ_β are not ⪯-comparable.

Least resolved Trees
In general, there are many trees that explain the same 2-cBMG. We next show that there is a unique "smallest" tree among them, which we will call the least resolved tree for (G, σ). Later on, we will derive a hierarchy of leaf sets from (G, σ) whose tree representation coincides with this least resolved tree. We start by introducing a systematic way of simplifying trees. Let e be an interior edge of (T, σ). Then the tree T_e obtained by contracting the edge e = uv is derived by identifying u and v. Analogously, we write T_A for the tree obtained by contracting all edges in A.
Definition 5. Let (G, σ) be a cBMG and let (T, σ) be a tree explaining (G, σ). An interior edge e in (T, σ) is redundant if (T e , σ) also explains (G, σ). Edges that are not redundant are called relevant.
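Redundancy of an interior edge can be tested by brute force straight from this definition: contract the edge and compare the best match graphs before and after. The sketch below assumes a hypothetical `parent` dictionary encoding of the tree and identifies the edge uv by its lower endpoint v. Since contraction leaves the leaf labels untouched, the two arc sets are compared directly. This is an illustration of the definition rather than an efficient procedure.

```python
def ancestors(parent, v):
    """Path from v up to the root, v itself included."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def lca_depth(parent, x, y):
    """Depth (number of edges below the root) of lca(x, y)."""
    anc_y = set(ancestors(parent, y))
    lca = next(v for v in ancestors(parent, x) if v in anc_y)
    return len(ancestors(parent, lca)) - 1

def best_match_arcs(parent, sigma):
    """Arc set of the cBMG explained by the leaf-colored tree."""
    arcs = set()
    for x in sigma:
        for s in set(sigma.values()) - {sigma[x]}:
            cands = [y for y in sigma if sigma[y] == s]
            deepest = max(lca_depth(parent, x, y) for y in cands)
            arcs |= {(x, y) for y in cands
                     if lca_depth(parent, x, y) == deepest}
    return arcs

def contract(parent, v):
    """Contract the edge parent[v] -> v: children of v move up to parent[v]."""
    u = parent[v]
    return {w: (u if p == v else p) for w, p in parent.items() if w != v}

def is_redundant(parent, sigma, v):
    """Test whether the interior edge ending in v is redundant."""
    if v in sigma or parent[v] is None:
        return False  # outer edges and the root are not candidates
    return best_match_arcs(contract(parent, v), sigma) == \
        best_match_arcs(parent, sigma)

# Toy tree: r -> (u -> (v -> (a1, a2), b1), b2)
parent = {"r": None, "u": "r", "v": "u",
          "a1": "v", "a2": "v", "b1": "u", "b2": "r"}
sigma = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
```

In the toy tree the edge uv is redundant because the subtree below v contains only leaves of color A, whereas contracting the edge ru would give a1 the additional best match b2.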
The next two results characterize redundant edges and show that such edges can be contracted in an arbitrary order.
Lemma 5. Let (T, σ) be a tree that explains a connected 2-cBMG (G, σ). Then, the edge e = uv is redundant if and only if e is an inner edge and there exists no ∼ • class α such that v = ρ α .
Proof. First we note that e = uv must be an inner edge. Otherwise, i.e., if e is an outer edge, then v ∉ L(T_e) and thus, (T_e, σ) does not explain (G, σ). Now suppose that e is an inner edge, which in particular implies L(T_e) = L(T), and that e is redundant. Assume for contradiction that there is a . But then, contraction of e implies y ∈ T(ρ_α) and therefore y ∈ N(α), thus (T_e, σ) does not explain (G, σ).
Conversely, assume that e is an inner edge and there is no In the first and second case, contraction of e implies either v ρ α or v ρ α . Thus, since L(T (w)) = L(T e (w)) is clearly satisfied if w and v are incomparable, we have L(T (w)) = L(T e (w)) for every w = v. Moreover, N (α) = {y | y ∈ L(T (ρ α )), σ(y) = σ(α)} by Lemma 3(vi). Together these facts imply for every ∼ • class α with ρ α = v that N (α) remains unchanged in (T e , σ) after contraction of e. Since the out-neighborhoods of all ∼ • classes are unaffected by contraction of e, all in-neighborhoods also remain the same in (T e , σ). Therefore, (T, σ) and (T e , σ) explain the same graph (G, σ). Proof. Let e = uv be a redundant edge in (T, σ). Then, for any vertex w = u, v in (T, σ) it is true that w is the root of a ∼ • class α in (T e , σ) if and only if w is the root of α in (T, σ). In particular, the vertex uv in (T e , σ) is the root of a ∼ • class α if and only if u = ρ α in (T, σ). Consequently, f is redundant in (T, σ) if and only if f is redundant in (T e , σ).
As an immediate consequence, contraction of edges is commutative, i.e., the order of the contractions is irrelevant. We can therefore write T A for the tree obtained by contracting all edges in A in arbitrary order: Corollary 2. Let (T, σ) be a tree that explains a 2-cBMG (G, σ) and let A be a set of redundant edges of (T, σ). Then, (T A , σ) explains (G, σ). In particular, ((T A ) B , σ) explains (G, σ) if and only if B is a set of redundant edges of (T, σ). Definition 6. Let (G, σ) be a cBMG explained by (T, σ). We say that (T, σ) is least resolved if (T A , σ) does not explain (G, σ) for any non-empty set A of interior edges of (T, σ).
We are now in the position to formulate the main result of this section: Theorem 4. For any connected 2-cBMG (G, σ), there exists a unique least resolved tree (T*, σ) that explains (G, σ). (T*, σ) is obtained by contraction of all redundant edges in an arbitrary tree (T, σ) that explains (G, σ). The set of all redundant edges in (T, σ) is given by E_T = {e = uv | v ∉ L(T), v ≠ ρ_α for all α ∈ N}. Moreover, (T*, σ) is displayed by (T, σ).
Proof. Any edge in a least resolved tree (T , σ) is non-redundant and therefore, as a consequence of Cor. 2, (T , σ) is obtained from (T, σ) by contraction of all redundant edges of (T, σ). According to Lemma 5, the set of redundant edges is exactly E T . Since the order of contracting the edges in E T is arbitrary, there is a least resolved tree for every given tree (T, σ). Now assume for contradiction that there exist colored digraphs that are explained by two distinct least resolved trees. Let (G, σ) be a minimal graph (w.r.t. the number of vertices) that is explained by two distinct least resolved trees (T 1 , σ) and (T 2 , σ) and let v ∈ L with σ(v) = s. By construction, the two trees (T 1 , σ ) and (T 2 , σ ) with T 1 := T 1|L\{v} , T 2 := T 2|L\{v} and leaf labeling σ := σ |L\{v} , each explain a unique graph, which we denote by (G 1 , σ ) and (G 2 , σ ), respectively. Lemma 1 implies that (G , σ ) : We next show that (G 1 , σ ) and (G 2 , σ ) are equal by characterizing the additional edges that are inserted in both graphs compared to (G , σ ). Assume that there is an additional edge uy in one of the graphs, say (G 1 , σ). Since uy is not an edge in (G, σ), we have lca T (u, y) T lca T (u, y ) for some y ∈ L(T ) with σ(y) = σ(y ). However, , which implies that y = v and, in particular, uv ∈ E(G) and N (u) = {v}.
In particular, we have σ(u) = t ≠ s. In this case, u has no out-neighbors in (G′, σ′) but it has outgoing arcs in (G′_1, σ′) and (G′_2, σ′). In order to determine these outgoing arcs explicitly, we will reconstruct the local structure of (T_1, σ) and (T_2, σ) in the vicinity of the leaf v. The following argumentation is illustrated in Fig. 5.
Figure 5: Illustration of the proof of Theorem 4, showing the local subtrees of (T_1, σ) and (T_2, σ), immediately above α = {v}. The relevant portion extends to the root ρ_γ of the ∼• class γ that is located immediately above α and has the same color as α, here red. Clearly, the deletion of α can affect only pairs of vertices x, y with lca(x, y) below ρ_γ. Triangles denote the subtree that consists of all leaves of the corresponding class which are attached to the root of the class by an outer edge. Dashed triangles and nodes denote subtrees which may or may not be present in (T_1, σ) and (T_2, σ).
Finally, we consider a few simple properties of least resolved trees that will be useful in the following sections.
Corollary 3. Let (G, σ) be a connected 2-cBMG that is explained by a least resolved tree (T, σ). Then all elements of α ∈ N are attached to ρ α , i.e., ρ α a ∈ E(T ) for all a ∈ α.
Proof. Assume that ρ_α a ∉ E(T). Since by definition α ≺ ρ_α, there exists an inner node v with ρ_α v ∈ E(T) such that v lies on the unique path from ρ_α to a. In particular v ≠ a. Theorem 4 implies that each inner vertex (except possibly the root) of the least resolved tree (T, σ) must be the root of some ∼• class of (G, σ). Hence, there is a ∼• class β ∈ N with ρ_β = v. According to Lemma 3(ii), the subtree T(v) contains leaves of both colors, i.e., there exists some leaf c ∈ L(T(v)) with σ(c) ≠ σ(a). It follows that lca(a, c) ≺ ρ_α, which contradicts the definition of ρ_α.
This result remains true also for 2-cBMGs that are not connected.

Characterization of 2-cBMGs
We will first establish necessary conditions for a colored digraph to be a 2-cBMG. The key construction for this purpose is the reachable set of a ∼• class, that is, the set of all leaves that can be reached from this class via a path of directed edges in (G, σ). Not unexpectedly, the reachable sets should form a hierarchical structure. However, this hierarchy does not quite determine a tree that explains (G, σ). We shall see, however, that the definition of reachable sets can be modified in such a way that the resulting hierarchy defines the unique least resolved tree w.r.t. (G, σ).
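A reachable set in this sense can be computed by a simple graph search. The sketch below makes one explicit assumption: only paths of length at least one are counted, so a vertex of the starting class belongs to the result only if a directed path leads back to it. The function name and the encoding of arcs as ordered pairs are illustrative.

```python
def reachable_set(start, arcs):
    """Vertices reachable from the set `start` via a directed path with
    at least one arc (assumption made for this sketch)."""
    out_nb = {}
    for x, y in arcs:
        out_nb.setdefault(x, set()).add(y)
    reached, frontier = set(), set(start)
    while frontier:
        # one more step along out-arcs, ignoring vertices already found
        step = set().union(*(out_nb.get(v, set()) for v in frontier))
        frontier = step - reached
        reached |= frontier
    return reached

arcs = {("a1", "b1"), ("b1", "a1"), ("b2", "a1")}
from_b2 = reachable_set({"b2"}, arcs)
from_a1 = reachable_set({"a1"}, arcs)
```

In the example, b2 has no in-arcs and is therefore not contained in any reachable set, while a1 and b1 reach each other.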

Necessary Conditions
We start by deriving some graph properties of 2-cBMGs. We shall see later that these are in fact sufficient to characterize 2-cBMGs.
Repeating the argument yields N(N(N(α))) ⪯ ρ_α and thus there cannot be a pair of leaves x ∈ α and q ∈ N(N(N(α))) with lca(x, q) ≻ ρ_α.
As we shall see below, technical difficulties arise for distinct ∼• classes that share the same set of in-neighbors. Hence, we briefly consider the classes in W. An example is shown in Fig. 6. Lemma 7. Let G(T, σ) be a connected 2-cBMG explained by a tree (T, σ). Then all ∼• classes in W have the same color and the cardinality of W distinguishes three types of roots as follows: Proof. By Thm. 2 there is at least one child v of the root ρ_T of T that itself is the root of a subtree with a single leaf color, i.e., σ(L(T(v))) = {s}. Assume for contradiction that there are two ∼• classes α, β ∈ W with s = σ(α) ≠ σ(β) = t. Then by definition lca(v, x) = ρ_T for all x ∈ β, and furthermore, ux ∈ E(G) for all u ∈ L(T(v)). Since x ∈ β has an in-arc, β ∉ W, a contradiction. All leaves in W therefore have the same color.
For the remainder of the proof we fix such a child v of the root ρ T . By construction all leaves below it belong to the same ∼ • class, which we denote by ω = L(T (v)). W.l.o.g. we assume σ(v) = s.
(iii) From the proof of (ii), we know that if |W| = 1, then the unique member of W is ω. We already know that ρ ω = ρ T .

Sufficient Conditions
We now turn to showing that the properties obtained in Theorem 5 are already sufficient for the characterization of 2-cBMGs. For this we show that the extended reachable sets form a hierarchy whenever (G, σ) satisfies the properties (N1), (N2), and (N3).
Note that while R(α) is unique for a given ∼• class α, there may exist more than one ∼• class with the same reachable set (see for instance α_2 and β_2 in Fig. 7(C)). In particular, there may even be ∼• classes with different colors giving rise to the same element of H. More generally, we have R(α) = R(β) for α ≠ β if and only if α ∈ R(β) and β ∈ R(α).
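The notions of reachable set and hierarchy can be made concrete with a small sketch. It assumes that the digraph is given as a dictionary of out-neighbor lists, that a reachable set collects the vertices reached by directed paths with at least one arc (so a vertex belongs to its own reachable set only if it lies on a directed cycle), and that a hierarchy on a ground set is a family containing the ground set in which any two members are nested or disjoint. The function names are hypothetical and not taken from the paper.

```python
from collections import deque


def reachable_set(out_nbrs, sources):
    """Vertices reachable from `sources` by a directed path with >= 1 arc."""
    seen = set()
    queue = deque(sources)
    while queue:
        x = queue.popleft()
        for y in out_nbrs.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return frozenset(seen)


def is_hierarchy(sets, ground):
    """True if `ground` is a member and any two members are nested or disjoint."""
    sets = {frozenset(s) for s in sets}
    if frozenset(ground) not in sets:
        return False
    return all(a <= b or b <= a or not (a & b) for a in sets for b in sets)


def hasse_parents(sets):
    """Map each member of a hierarchy to its smallest strict superset.

    In a hierarchy the strict supersets of a member form a chain, so the
    smallest one is unique; the maximal member is mapped to None.
    """
    sets = {frozenset(s) for s in sets}
    parents = {}
    for s in sets:
        above = [t for t in sets if s < t]
        parents[s] = min(above, key=len) if above else None
    return parents


# Toy digraph with arcs a1 -> b1, b1 -> a1, a2 -> b1.
out_nbrs = {"a1": ["b1"], "b1": ["a1"], "a2": ["b1"]}
print(sorted(reachable_set(out_nbrs, ["a1"])))  # ['a1', 'b1']
```

The parent map returned by hasse_parents encodes the tree obtained from a hierarchy by inclusion; it is only meaningful when is_hierarchy holds for the input family.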
A hierarchy H corresponds to a unique tree T(H) defined as the Hasse diagram of H, i.e., the vertices of T(H) are the sets of H, and R_2 is a child of R_1 iff R_2 ⊂ R_1 and there is no R_3 such that R_2 ⊂ R_3 ⊂ R_1. In particular, two ∼• classes α and β belong to the same interior vertex if R(α) = R(β). It is tempting to use this tree to construct a tree T explaining (G, σ) by attaching the elements of α as leaves to the node R(α) in T(H). The example in Fig. 7(A) and (B) shows, however, that this simply does not work. The key issue arises from groups of distinct ∼• classes that share the same in-neighborhood because they will in general be attached to the same node in T(H), i.e., they are indistinguishable. We therefore need a modification of the definition of reachable sets that properly distinguishes such ∼• classes in order to construct a hierarchy with the appropriate resolution for the least resolved tree specified in Theorem 4. To this end we define for every ∼• class α the auxiliary leaf set Q(α). Note that α ⊆ Q(α). For later reference we list several simple properties of Q.
Proof. (i) follows directly from the definition.
It remains to show that L ∈ H′. Similar arguments as in the proof of Lemma 9 can be applied in order to show that there is a unique element R′(α*) that is maximal w.r.t. inclusion in H′. Since for any α ∈ N it is true that α ∈ R′(α), every ∼• class of G is contained in at least one element of H′. Moreover, any vertex of G is contained in exactly one ∼• class. Hence, L = R′(α*) ∈ H′.
Since H′ is a hierarchy, its Hasse diagram is a tree T(H′). Its vertices are by construction exactly the extended reachable sets R′(α) of (G, σ). Starting from T(H′), we construct the tree T*(H′) by attaching the vertices x ∈ α to the vertex R′(α) of T(H′). The tree T*(H′) has leaf set L. Since |R′(α)| > 1 as noted below Eq. (4), T*(H′) is a phylogenetic tree.
Proof. The "only if" direction is an immediate consequence of Lemma 2 and Theorem 5. For the "if" direction we employ Lemma 11 and show that the tree T*(H′) constructed from the hierarchy H′ explains (G, σ).
Let x ∈ L and let α be the ∼• class of (G, σ) to which x belongs. Denote by Ñ(x) the out-neighbors of x in the graph explained by T*(H′). Then y ∈ Ñ(x) if and only if σ(y) ≠ σ(x) and lca_{T*(H′)}(x, y) is the interior node to which x is attached in T(H′), i.e., R′(α). Therefore, y ∈ Ñ(x) if and only if σ(y) ≠ σ(x) and y ∈ R′(α). By (N2) this is the case if and only if y ∈ N(x). Thus Ñ(x) = N(x). Since two digraphs are identical whenever all their out-neighborhoods are the same, the tree T*(H′) indeed explains (G, σ).
By construction and Theorem 4, (T*(H′), σ) is a least resolved tree.

Informative Triples
An inspection of induced three-vertex subgraphs of a 2-cBMG (G, σ) shows that several local configurations derive only from specific types of trees. More precisely, certain induced subgraphs on three vertices are associated with uniquely defined triples displayed by the least resolved tree (T, σ) introduced in the previous section. Other induced subgraphs on three vertices, however, may derive from two or three distinct triples. The importance of triples derives from the fact that a phylogenetic tree can be reconstructed from the triples that it displays by a polynomial time algorithm traditionally referred to as BUILD [40]. BUILD makes use of a simple graph representation of certain subsets of triples: Given a triple set R and a subset of leaves L′ ⊆ L, the Aho-graph [R, L′] has vertex set L′ and there is an edge between two vertices x, y ∈ L′ if and only if there exists a triple xy|z ∈ R with z ∈ L′ [2]. It is well known that R is consistent if and only if [R, L′] is disconnected for every subset L′ ⊆ L with |L′| > 1 [6]. BUILD uses Aho-graphs in a top-down recursion: First, [R, L] is computed and a tree T consisting only of the root ρ_T is initialized. If [R, L] is connected and |L| > 1, then BUILD terminates and returns "R is not consistent". Otherwise, BUILD adds the connected components C_1, ..., C_k of [R, L] as vertices to T and inserts the edges (ρ_T, C_i), 1 ≤ i ≤ k. BUILD recurses on the Aho-graphs [R, C_i] (where vertex C_i in T plays the role of ρ_T) until it arrives at single-vertex components. BUILD either returns the tree T or identifies the triple set R as "not consistent". Since the Aho-graphs [R, L′] and their connected components are uniquely defined in each step of BUILD, the tree T is uniquely defined by R whenever it exists. T is known as the Aho tree and will be denoted by Aho(R).
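The top-down recursion of BUILD can be sketched in a few lines. This is a minimal illustration rather than an optimized implementation: a triple xy|z is assumed to be encoded as the tuple (x, y, z), leaves are assumed to be mutually comparable values such as strings (so that components can be listed in a fixed order), and the tree is returned as nested tuples. The names aho_components and build are hypothetical.

```python
def aho_components(triples, leaves):
    """Connected components of the Aho graph [R, L'] for L' = leaves."""
    leaves = set(leaves)
    parent = {x: x for x in leaves}  # union-find forest over L'

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Edge {x, y} for every triple xy|z with x, y, z all contained in L'.
    for x, y, z in triples:
        if x in leaves and y in leaves and z in leaves:
            parent[find(x)] = find(y)

    comps = {}
    for x in leaves:
        comps.setdefault(find(x), set()).add(x)
    return sorted(comps.values(), key=min)  # fixed order for reproducibility


def build(triples, leaves):
    """Return the Aho tree as nested tuples, or None if R is not consistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    comps = aho_components(triples, leaves)
    if len(comps) == 1:  # connected Aho graph on more than one leaf
        return None
    subtrees = []
    for comp in comps:
        sub = build(triples, comp)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


print(build([("a", "b", "c")], "abc"))                   # (('a', 'b'), 'c')
print(build([("a", "b", "c"), ("a", "c", "b")], "abc"))  # None
```

The second call fails because the triples ab|c and ac|b make the Aho graph on {a, b, c} connected.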
It is natural to ask whether the triples that can be inferred directly from (G, σ) are sufficient to (a) characterize 2-cBMGs and (b) completely determine the least resolved tree (T, σ) explaining (G, σ).
Definition 8. Let (G, σ) be a two-colored digraph. We say that a triple ab|c is informative (for (G, σ)) if the three distinct vertices a, b, c ∈ L induce a colored subgraph G[a, b, c] isomorphic (in the usual sense, i.e., with recoloring) to one of the graphs X_1, X_2, X_3, or X_4 shown in Fig. 8. The set of informative triples is denoted by R(G, σ).
Proof. Let (T, σ) be a tree that explains (G, σ). Assume that there is an induced subgraph X_1 in (G, σ). W.l.o.g. let σ(c) = σ(b). Since there is no arc (a, c) but an arc (a, b), we have lca(a, b) ≺ lca(a, c), which implies that T must display the triple ab|c. By the same arguments, if X_2, X_3 or X_4 is an induced subgraph in (G, σ), then T must display the triple ab|c.
Figure 9: The four-vertex graph (G, σ) on the l.h.s. cannot be a 2-cBMG because there is no out-arc from a′. The four induced subgraphs are of type X_1, X_2, X_3 (with red and blue exchanged) and arc-less, respectively, resulting in the set R(G, σ) = {ab|b′, ab|a′, ab′|a′} of informative triples. This set is consistent and displayed by the Aho tree T shown in the middle. It is not difficult to check that every edge of T is distinguished by one informative triple. Therefore R(G, σ) identifies the leaf-colored tree (T, σ) [16]. However, the graph G(T, σ) explained by the tree (T, σ) is not isomorphic to the graph (G, σ) from which the triples were inferred.
In particular, if (G, σ) is a 2-cBMG, then R(G, σ) is consistent. It is tempting to conjecture that consistency of the set R(G, σ) of informative triples is already sufficient to characterize a 2-cBMG. The example in Fig. 9 shows, however, that this is not the case.
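Informative triples can be enumerated directly from the arc set. The sketch below rests on an assumption about Fig. 8, which is not reproduced here: it takes the graphs X_1 to X_4 to be exactly the three-vertex induced subgraphs in which σ(b) = σ(c) ≠ σ(a), the arc ab is present, and the arc ac is absent, while the arcs ba and ca may or may not be present. This reading agrees with the way the triples are derived in the proofs of this section, but it should be checked against the figure. A triple ab|c is encoded as the pair (frozenset({a, b}), c); the function name is hypothetical.

```python
def informative_triples(arcs, color):
    """Triples ab|c with sigma(b) = sigma(c) != sigma(a), arc ab, and no arc ac.

    `arcs` is an iterable of ordered pairs, `color` maps every vertex to its
    color. ASSUMPTION: this condition summarizes the graphs X1..X4 of Fig. 8.
    """
    arcs = set(arcs)
    triples = set()
    for a, b in arcs:
        for c in color:
            if (c != b and color[c] == color[b]
                    and color[a] != color[b] and (a, c) not in arcs):
                triples.add((frozenset((a, b)), c))
    return triples


# Toy example: arcs a1 -> b1, b1 -> a1, a2 -> b1 with a1, a2 red and b1 blue.
color = {"a1": "red", "a2": "red", "b1": "blue"}
arcs = [("a1", "b1"), ("b1", "a1"), ("a2", "b1")]
print(informative_triples(arcs, color))  # a single triple: a1 b1 | a2
```

In the toy example the only informative triple stems from the arc b1 → a1 together with the missing arc b1 → a2.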
Lemma 13. Let (T, σ) be a least resolved tree explaining a connected 2-cBMG (G, σ). Then every inner edge of T is distinguished by at least one triple in R(G, σ).
Proof. Let (T, σ) be a least resolved tree w.r.t. (G, σ) and e = uv be an inner edge of T. Since (T, σ) is least resolved for (G, σ), Thm. 4 implies that the edge e is relevant, and hence, there exists an α ∈ N such that v = ρ_α. By Cor. 3, we have a ∈ child(v) for any a ∈ α. Lemma 3(ii) implies that T(v) contains a ∼• class β with σ(α) ≠ σ(β) and b ∈ β.
Case A: Suppose that ρ_β = ρ_α and therefore, ab, ba ∈ E(G). If u is the root of some ∼• class γ with c ∈ γ, then Lemma 3(vi) implies ca ∈ E(G), cb ∉ E(G) for σ(c) = σ(b) and cb ∈ E(G), ca ∉ E(G) for σ(c) = σ(a). In all cases, we have neither bc ∈ E(G) nor ac ∈ E(G), since ab, ba ∈ E(G). Therefore, we always obtain a 3-vertex induced subgraph that is isomorphic to X_2 (see Fig. 8) and ab|c ∈ R(G, σ). On the other hand, if there is no ∼• class γ such that u = ρ_γ, then u is the root of (T, σ) by Cor. 3. Since (T, σ) is phylogenetic and u is not the root of any ∼• class, there must be an inner vertex w ∈ child(u) \ {v} such that w = ρ_γ for some γ ∈ N. Since T(ρ_γ) contains leaves of both colors by Lemma 3(ii), for any leaf c ∈ L(T(ρ_γ)) there is no edge between c and b as well as between c and a. Taken together, we obtain the induced subgraph X_1 and the triple ab|c.
Case B: Now assume ρ_β ≺ ρ_α and there is no other β′ ∈ N with σ(β′) = σ(β) and ρ_α = ρ_β′. By definition of ρ_β, we have lca(b, a′) ≺ lca(b, a) for some a′ with σ(a) = σ(a′), i.e., ba ∉ E(G). Moreover, Lemma 3(vi) implies b ∈ N(a), thus ab ∈ E(G). Similar to Case A, first suppose that u is the root of some ∼• class of (G, σ). Since e is relevant, there is a γ ∈ N with u = ρ_γ and σ(γ) ≠ σ(α). Otherwise, if σ(γ) = σ(α) and there is no other γ′ ∈ N with u = ρ_γ′, Lemma 3(vi) implies N(α) = N(γ) and N−(α) = N−(γ), i.e., α and γ belong to the same ∼• class with root u. Hence, v is not the root of any ∼• class; a contradiction. Consequently, we have σ(γ) ≠ σ(α), thus ca ∈ E(G) by Lemma 3(vi) but ac ∉ E(G). This yields the triple ab|c that is derived from the subgraph X_4. If u is not the root of any ∼• class, analogous arguments as in Case A show that there is an inner vertex w ∈ child(u) \ {v} such that the tree T(w) contains leaves of both colors. In particular, there exists a leaf c ∈ L(T(w)) and since u is not the root of α, β or the ∼• class that c belongs to, there is no arc between c and a or b in (G, σ). Hence, we again obtain the triple ab|c which in this case is derived from X_3.
In every case we have v = lca(a, b) ≺ lca(a, c) = u, i.e., the triple ab|c distinguishes uv.
Lemma 13 suggests that the leaf-colored Aho tree (Aho(R(G, σ)), σ) of the set of informative triples R(G, σ) explains a given 2-cBMG (G, σ). The following result shows that this is indeed the case and sets the stage for the main result of this section, a characterization of 2-cBMGs in terms of informative triples. Theorem 7. Let (G, σ) be a connected 2-cBMG. Then (G, σ) is explained by the Aho tree of the set of informative triples, i.e., (G, σ) = G(Aho (R(G, σ)), σ).
First consider the case L = {x, y}. Since (G, σ) is a connected 2-cBMG, we have σ(x) ≠ σ(y) and xy, yx ∈ E(G). It is easy to see that both the least resolved tree w.r.t. (G, σ) and Aho(R(G, σ)) correspond to the path x − ρ_T − y with end points x and y. Thus (G, σ) = G(Aho(R(G, σ)), σ). Now let |L| > 2 and assume that the statement of the proposition is false. Then there is a minimal graph (G, σ) such that (G, σ) ≠ G(T, σ), i.e., (G′, σ′) = G(T′, σ′) holds for every choice of v ∈ V(G). Since (G, σ) is connected, Theorem 2 implies that there is a ∼• class α of (G, σ) such that ρ α = ρT . We fix a vertex v in this class α and proceed to show that (G, σ) = G(T, σ), a contradiction. Let σ(α) = s and let (T − v, σ ) be the tree that is obtained by removing the leaf v and its incident edge from (T , σ). Clearly, the out-neighborhood of every leaf of color s is still the same in (T − v, σ ) compared to (T , σ). Moreover, Lemma 3(vi) implies that N(x) remains unchanged in (T − v, σ ) for any x ∈ L[t] \ {v} that belongs to a ∼• class β with ρ β ≠ ρT . If ρ β = ρT , then N(x) = L[s] in (T , σ) by Lemma 3(vi) and thus N(x) = L[s] \ {v} in (T − v, σ ). We can therefore conclude that (T − v, σ ) explains the induced subgraph (G′, σ′) of (G, σ). Now, we distinguish two cases: Hence, the root of (T − v, σ ) has at least two children and, in particular, G(T − v, σ ) is connected by Theorem 2. Since (T , σ) is least resolved, Theorem 4 implies that any inner edge of (T − v, σ ) is non-redundant, and hence (T , σ ) = (T − v, σ ). Consequently, we can recover (T , σ) from (T , σ ) by inserting the edge ρT v.
Hence, any informative triple that contains v is induced by X_2 or X_4, and is thus of the form xy|v with σ(x) ≠ σ(y). This implies v ∈ child(ρ_T). On the other hand, if there is a β ∈ N with σ(β) = t and ρ β = ρT , we have vu ∈ E(G) and uv ∈ E(G) with u ∈ L[t] if and only if u ∈ β by Lemma 4(i). Then, there is no 3-vertex induced subgraph of (G, σ) of the form X_1, X_2, X_3, or X_4 that contains both u and v, and any informative triple that contains either u or v is again of the form xy|u or xy|v, respectively. As before, this implies v ∈ child(ρ_T). Hence, (T, σ) is obtained from (T′, σ′) by insertion of the edge ρ_T v. Since (G′, σ′) = G(T′, σ′), we conclude that (T, σ) explains (G, σ), and arrive at the desired contradiction.
Case B: If |child(ρT ) ∩ L| = 1, then (T − v, σ ) is not least resolved since either (a) the root is of degree 1 or (b) there exists no u ∈ child(ρT ) \ {v} such that σ(u) = {s, t} (see Theorem 2). In the latter case, the graph (G′, σ′) is not connected. To convert (T − v, σ ) into the least resolved tree (T , σ ), we need to contract all edges ρT u with u ∈ child(ρ T ) \ {v}. Clearly, we can recover (G, σ) from (G′, σ′) by reverting the prescribed steps. Analogous arguments as in Case A show that again any informative triple in R(G, σ) that contains v is of the form xy|v with σ(x) ≠ σ(y). If (G′, σ′) is connected, then any triple in R(G, σ) \ R(G′, σ′) is of this form and hence, as above, we conclude that v ∈ child(ρ_T) and (G, σ) = G(T, σ). If (G′, σ′) is not connected, then R(G, σ) \ R(G′, σ′) also contains all triples xy|z induced by X_1 and X_3 that emerged from connecting all components of (G′, σ′) by insertion of v. However, since lca(x, y, z) = ρT , we conclude that v ∈ child(ρ_T) and thus (G, σ) = G(T, σ) again yields the desired contradiction.
We finally arrive at the main result of this section.
Theorem 8. A connected two-colored digraph (G, σ) is a 2-cBMG if and only if (G, σ) = G(Aho(R(G, σ)), σ).
If (G, σ) is not connected, then the informative triples of Definition 8 are not sufficient by themselves to infer a tree that explains (G, σ). However, it follows from Theorems 2 and 8 that the desired tree (T, λ) can be obtained by attaching the Aho trees of the connected components as children of the root of (T, λ). It can be understood as the Aho tree of the triple set
R(G, σ) := ⋃_i R(G_i, σ_i) ∪ R_C(G, σ),
where the R(G_i, σ_i) are the sets of informative triples of the connected components and R_C(G, σ) consists of all triples of the form xy|z with x, y ∈ L(G_i) and z ∈ L(G_j) for all pairs i ≠ j. The triple set R_C(G, σ) simply specifies the connected components of (G, σ). Note that with this augmented definition of R, Thm. 8 remains true also for 2-cBMGs that are not connected.
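The component-separating triples R_C(G, σ) are easy to enumerate explicitly. The sketch below assumes that "connected" refers to weak connectivity of the digraph, i.e., connectivity after ignoring arc directions, and again encodes a triple xy|z as the pair (frozenset({x, y}), z). Both function names are hypothetical.

```python
from itertools import combinations


def weak_components(vertices, arcs):
    """Connected components of the digraph after ignoring arc directions."""
    adj = {v: set() for v in vertices}
    for x, y in arcs:
        adj[x].add(y)
        adj[y].add(x)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]
        seen.add(v)
        while stack:  # depth-first search from v
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def component_triples(vertices, arcs):
    """All triples xy|z with x, y in one component and z in a different one."""
    comps = weak_components(vertices, arcs)
    triples = set()
    for i, ci in enumerate(comps):
        for j, cj in enumerate(comps):
            if i == j:
                continue
            for x, y in combinations(sorted(ci), 2):
                for z in cj:
                    triples.add((frozenset((x, y)), z))
    return triples


# Two components {a1, b1} and {a2, b2}, each consisting of one bidirectional arc.
V = ["a1", "b1", "a2", "b2"]
E = [("a1", "b1"), ("b1", "a1"), ("a2", "b2"), ("b2", "a2")]
print(len(component_triples(V, E)))  # 4
```

The four triples in the example are a1b1|a2, a1b1|b2, a2b2|a1 and a2b2|b1, which together force the two components into separate subtrees below the root.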

n-colored Best Match Graphs
In this section we generalize the results about 2-cBMGs to an arbitrary number of colors. As in the two-color case, we write x ∼ We can therefore think of the relation ∼ • as the common refinement of the relations ∼ • st based on the induced 2-cBMGs for all colors s, t. In particular, therefore, all elements of a ∼ • class of an n-cBMG appear as sibling leaves in the different least resolved trees, each explaining one of the induced 2-cBMGs. Next we generalize the notion of roots. lca(x, y).
Observation 10. Consider an n-cBMG (G, σ) that is explained by a tree (T, σ). By Observation 1, the subgraph (G_st, σ_st) induced by any two distinct colors s, t ∈ S is a 2-cBMG and thus explained by a corresponding least resolved tree (T_st, σ_st). Uniqueness of this least resolved tree implies that the tree (T, σ) must display (T_st, σ_st). In other words, (T, σ) is a refinement of (T_st, σ_st).
Observation 11. Let (G, σ) be an n-cBMG that is explained by a tree (T, σ), and a, b, c ∈ L leaves of three distinct colors. Then the 3-cBMG (G(T {a,b,c} ), σ) is the complete graph on {a, b, c} with bidirectional edges.
Therefore, no further refinement can be obtained from triples of three different colors. Thus, the two-colored triples inferred from the induced 2-cBMGs for all color pairs may already be sufficient to construct (T, σ). This suggests, furthermore, that every n-cBMG is explained by a unique least resolved tree. An important tool for addressing this conjecture is the following generalization of condition (vi) of Lemma 3.
Proof. The definition of ρ_{α,s} implies N_s(α) ⊆ L(T(ρ_{α,s})) ∩ L[s]. In particular, there is a leaf y ∈ N_s(α) such that lca(y, α) = ρ_{α,s}. Now consider an arbitrary leaf x ∈ L(T(ρ_{α,s})) ∩ L[s] \ N_s(α). By construction we have lca(x, α) ⪯ ρ_{α,s} = lca(y, α) and therefore x ∈ N_s(α).
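The out-neighborhoods discussed here are determined by last common ancestors alone, so small cases can be checked by brute force. The sketch below assumes that a rooted tree is given as a child-to-parent dictionary (the root maps to None) and uses the definition of best matches from the introduction: y is a best match of x if σ(y) ≠ σ(x) and lca(x, y) is at least as deep as lca(x, y′) for every y′ with σ(y′) = σ(y). Since all last common ancestors of a fixed x lie on the path from x to the root, depths suffice to compare them. The quadratic-time implementation and the name best_match_graph are illustrative only.

```python
def best_match_graph(parent, color):
    """Arc set of G(T, sigma) for a tree given as a child -> parent map.

    `color` maps each leaf to its color; inner vertices do not appear in it.
    """
    def ancestors(v):
        """v followed by its ancestors up to the root."""
        path = [v]
        while parent.get(path[-1]) is not None:
            path.append(parent[path[-1]])
        return path

    def lca_depth(x, y):
        """Depth (root = 0) of the last common ancestor of x and y."""
        on_path = set(ancestors(x))
        for v in ancestors(y):
            if v in on_path:
                return len(ancestors(v)) - 1
        raise ValueError("x and y are not in the same tree")

    arcs = set()
    for x in color:
        depth = {y: lca_depth(x, y) for y in color if color[y] != color[x]}
        best = {}  # color -> depth of the lowest lca with a leaf of that color
        for y, d in depth.items():
            best[color[y]] = max(best.get(color[y], -1), d)
        arcs |= {(x, y) for y, d in depth.items() if d == best[color[y]]}
    return arcs


# Three leaves of three distinct colors: every ordered pair is an arc.
parent = {"r": None, "v": "r", "a": "v", "b": "v", "c": "r"}
color = {"a": "red", "b": "blue", "c": "green"}
print(len(best_match_graph(parent, color)))  # 6
```

The example reproduces Observation 11 for one particular tree: with a single leaf per color, every leaf is the unique and hence best match of every other leaf.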
We are now in the position to characterize the redundant edges.
Proof. Let (T_e, σ) be the tree that is obtained from (T, σ) by contraction of the edge e = uv and assume that (T_e, σ) explains (G, σ). First we note that e is an inner edge and thus, in particular, L(T_e) = L(T). Otherwise, i.e., if e is an outer edge, then v ∉ L(T_e); hence (T_e, σ) does not explain (G, σ). Now consider an inner edge e. Since (T, σ) is phylogenetic, there exists a leaf y ∈ L(T(u) \ T(v)) of some color s ∈ σ(L(T(u) \ T(v))). Assume that there is a ∼• class α of G such that v = ρ_{α,s}. Note that s ≠ σ(α) by definition of ρ_{α,s}. Lemma 14 implies that y ∉ N(α) in (G, σ). After contraction of e, we have lca(α, y) = ρ_{α,s}, thus y ∈ N(α) by Lemma 14. Hence, (T_e, σ) does not explain G; a contradiction.
Conversely, assume that e is an inner edge and for every s ∈ σ(L(T(u) \ T(v))), there is no α ∈ N such that v = ρ_{α,s}, i.e., for every α ∈ N and every color s ≠ σ(α) we either have (i) v ≻ ρ_{α,s}, (ii) v ≺ ρ_{α,s}, or (iii) v and ρ_{α,s} are incomparable. In the first two cases, contraction of e implies v ⪰ ρ_{α,s} or v ⪯ ρ_{α,s} in (T_e, σ), respectively. Therefore, since L(T(w)) = L(T_e(w)) for any w incomparable to v, we have L(T(w)) = L(T_e(w)) for any node w ≠ v. Moreover, it follows from Lemma 14 that N_s(α) = {y | y ∈ L(T(ρ_{α,s})), σ(y) = s}. This implies that the set N_s(α) remains unchanged after contraction of e for all ∼• classes α and all colors s ∈ S. In other words, the in- and out-neighborhoods of any leaf remain the same in (T_e, σ). Hence, we conclude that (T, σ) and (T_e, σ) explain the same graph (G, σ).
Before we consider the general case, we show that 3-cBMGs like 2-cBMGs are explained by unique least resolved trees.
Proof. This proof uses arguments very similar to those in the proof of the uniqueness result for 2-cBMGs. In particular, as in the proof of Theorem 4, we assume for contradiction that there exist 3-colored digraphs that are explained by two distinct least resolved trees. Let (G, σ) be a minimal graph (w.r.t. the number of vertices) that is explained by the two distinct least resolved trees (T_1, σ) and (T_2, σ). W.l.o.g. we can choose a vertex v and assume that its color is r ∈ S, i.e., v ∈ L[r]. Using the same notation as in the proof of Theorem 4, we write (T′_1, σ′) and (T′_2, σ′) for the trees that are obtained by deleting v from (T_1, σ) and (T_2, σ), respectively. These trees explain the uniquely defined graphs (G′_1, σ′) and (G′_2, σ′), respectively. Again, Lemma 1 implies that (G′, σ′) := (G[L \ {v}], σ′) is a subgraph of both (G′_1, σ′) and (G′_2, σ′). Similar to the case of 2-cBMGs, we characterize the additional edges that are inserted into (G′_1, σ′) and (G′_2, σ′) compared to (G′, σ′) in order to show that (G′_1, σ′) = (G′_2, σ′). Assume that uy is an edge in (G′_1, σ′) but not in (G′, σ′). By analogous arguments as in the proof of Theorem 4, we find that uv ∈ E(G) and in particular N_r(u) = {v}, i.e., u has no out-neighbors of color r in (G′, σ′).
Moreover, we have u ∈ L[s], where s ∈ S \ {r}. Similar to the 2-color case, we now determine the outgoing arcs of u in (G′_1, σ′) and (G′_2, σ′) by reconstructing the local structure of (T_1, σ) and (T_2, σ) in the vicinity of v.
If (G, σ) is not connected, we can build a least resolved tree (T, σ) analogously to the case of 2-cBMGs: we first construct the unique least resolved tree (T_i, σ_i) for each component (G_i, σ_i). Using Theorem 2 we then insert an additional root for (T, σ) to which the roots of the (T_i, σ_i) are attached as children. We proceed by showing that this construction corresponds to the unique least resolved tree.
Proof. Denote by (G i , σ i ) the connected components of (G, σ). By Theorem 4 and Lemma 16 there is a unique least resolved tree (T i , σ i ) that explains (G i , σ i ). Hence, if (G, σ) is connected, we are done.
Now assume that there are at least two connected components. Let (T, σ) be a least resolved tree that explains (G, σ). Theorem 2 implies that there is a vertex u ∈ child(ρ T ) such that L(G i ) ⊆ L(T (u)) for each connected component (G i , σ i ). Hence, the subtree (T (u), σ L(T (u)) ) displays the least resolved tree (T i , σ i ) explaining (G i , σ i ). Moreover, since (T, σ) is least resolved, ρ T u is a relevant edge, i.e., there must be a color s ∈ σ(L(T \ T (u))) and a ∼ • class α such that u = ρ α,s by Lemma 15. This implies in particular that there exists a leaf x ∈ L(T (u)) ∩ L[s]. Lemma 14 now implies that the elements of α are connected to any element of color s in the subtree (T (u), σ L(T (u)) ). Furthermore, any leaf y ∈ L(T (u)) has at least one out-neighbor of color s in L(T (u)). Hence, we can conclude that the graph G(T (u), σ L(T (u)) ) induced by the subtree (T (u), σ L(T (u)) ) is connected.
As a consequence, any least resolved tree (T, σ) that explains (G, σ) must be composed of the disjoint trees (T i , σ i ) that are linked to the root via the relevant edge ρ T ρ Ti . Since every (T i , σ i ) and the construction of the edges ρ T ρ Ti is unique, (T, σ) is unique.
The characterization of redundant edges in trees explaining 2-cBMGs together with the uniqueness of the least resolved trees for 3-cBMGs can be used to characterize redundant edges in the general case, thereby establishing the existence of a unique least resolved tree for n-cBMGs.
Proof. Using arguments analogous to the 2-color case one shows that there is a least resolved tree (T′, σ) that can be obtained from (T, σ) by contraction of all redundant edges. The set of redundant edges is given by E_T by Lemma 15. By construction, (T′, σ) is displayed by (T, σ). It remains to show that (T′, σ) is unique. Observation 1 implies that for any pair of distinct colors s and t the corresponding unique least resolved tree (T_st, σ_st) is displayed by (T′, σ). The same is true for the least resolved tree (T_rst, σ_rst) for any three distinct colors r, s, t ∈ S. Since for any 2-cBMG as well as for any 3-cBMG, the corresponding least resolved tree is unique (see Theorem 4 and Lemma 16), it follows for any three distinct leaves x, y, z ∈ L[r] ∪ L[s] ∪ L[t] that there is either a unique triple that is displayed by (T_rst, σ_rst) or the least resolved tree (T_rst, σ_rst) contains no triple on x, y, z. Note that we do not require that the colors r, s, t are pairwise distinct. Instead, we use the notation (T_rst, σ_rst) to also include the trees explaining the induced 2-cBMGs. Observation 1 then implies that R* := ⋃_{r,s,t∈S} r(T_rst) ⊆ r(T′). Now assume that there are two distinct least resolved trees (T_1, σ) and (T_2, σ) that explain (G, σ). In the following we show that any triple displayed by T_1 must be displayed by T_2 and thus, r(T_1) = r(T_2). Fig. 10 shows that there may be triples xy|z ∈ r(T_1) \ R*. Assume, for contradiction, that xy|z ∉ r(T_2) \ R*. Fix the notation such that z ∈ α, σ(x) = r, σ(y) = s, and σ(z) = t. We do not assume here that r, s, t are necessarily pairwise distinct.
In the remainder of the proof, we will make frequent use of the following observation: If the tree T is a refinement of T′, then we have u ⪯_{T′} v if and only if u ⪯_T v for all u, v ∈ V(T′). In particular, u ≺_{T′} v (i.e., u ⪯_{T′} v and u ≠ v) implies u ≺_T v. The converse of the latter statement is still true if u is a leaf in T′ but not necessarily for arbitrary inner vertices u and v. Let u = lca_{T_1}(x, y, z). The assumption xy|z ∈ r(T_1) implies that there is a vertex v ∈ child(u) such that v ⪰ lca_{T_1}(x, y). Since (T_1, σ) is least resolved, the characterization of relevant edges ensures that there is a color p ∈ σ(L(T_1(u) \ T_1(v))) and a ∼• class β with σ(β) = q such that v = ρ_{β,p}. In particular, there must be leaves a ∈ L(T_1(v)) and a* ∈ L(T_1(u) \ T_1(v)) with σ(a) = σ(a*) = p. As a consequence we know that a* ∉ N_p(b) for any b ∈ β. We continue to show that the edge uv must also be contained in the least resolved tree (T_pq, σ_pq) that explains the (not necessarily connected) graph (G_pq, σ_pq). By Thm. 12, (T_pq, σ_pq) is unique. Assume, for contradiction, that uv is not an edge in T_pq. Recalling the arguments in Observation 10, the tree (T_1, σ) must display (T_pq, σ_pq). Thus, if uv is not an edge in T_pq, then v* := u = v in T_pq. By construction, we therefore have v* = ρ_{β,p} in (T_pq, σ_pq). Since (T_pq, σ_pq) is least resolved, it follows from Cor. 3 that b ∈ child(v*) for all b ∈ β in (T_pq, σ_pq). The latter, together with a, a* ⪯_{T_pq} v*, implies that lca_{T_pq}(a, β) = lca_{T_pq}(a*, β) = v*. However, this implies a* ∈ N_p(β), a contradiction.
To summarize, the edge uv must be contained in the least resolved tree (T pq , σ pq ). Moreover, by Observation 10, (T pqo , σ pqo ) is a refinement of (T pq , σ pq ) for every color o ∈ S. Hence, we have v ≺ Tpqo u, which is in particular true for the color o ∈ {r, s, t}. Moreover, we know that x ≺ Tpqr v and y ≺ Tpqs v because (T 1 , σ) is a refinement of both (T pqr , σ pqr ) and (T pqs , σ pqs ).
Since (T_2, σ) is also a refinement of both (T_pqr, σ_pqr) and (T_pqs, σ_pqs), we have x, y ≺_{T_2} v ≺_{T_2} u. Furthermore, v ≺_{T_1} lca_{T_1}(v, z) = u and z ⋠_{T_1} v imply that z ≺_{T_pqt} u and z ⋠_{T_pqt} v. Therefore, z ≺_{T_2} u and z ⋠_{T_2} v. Combining these facts about the partial order of the vertices v, u, x, y and z in T_2, we obtain xy|z ∈ r(T_2); a contradiction.
Corollary 4. Every n-cBMG (G, σ) is explained by the unique least resolved tree (T, σ) consisting of the least resolved trees (T_i, σ_i) explaining the connected components (G_i, σ_i) and an additional root ρ_T to which the roots of the (T_i, σ_i) are attached as children.
Proof. It is clear from the construction that (T, σ) explains (G, σ). The proof that this is the only least resolved tree parallels the arguments in the proof of Theorem 12 for 2-cBMGs and 3-cBMGs.
Since a tree is determined by all its triples, it is clear now that the construction of a tree that explains a connected n-cBMG is essentially a supertree problem: it suffices to find a tree, if it exists, that displays the least resolved trees explaining the induced subgraphs on 3 colors. In the following, we write R := ⋃_{s,t∈S} r(T*_st) for the union of all triples in the least resolved trees (T*_st, σ_st) explaining the 2-colored subgraphs (G_st, σ_st) of (G, σ). In contrast, the set of all informative triples of (G, σ), as specified in Def. 8, is denoted by R(G, σ). As an immediate consequence of Lemma 12 we have
Proof. Let (G, σ) be an n-cBMG that is explained by a tree (T, σ). Moreover, let s and t be two distinct colors of G and let L′ := L[s] ∪ L[t] be the subset of vertices with color s or t. Observation 1 states that the induced subgraph (G[L′], σ′) is a 2-cBMG that is explained by (T_{L′}, σ′).
In particular, the least resolved tree (T*_{L′}, σ′) of (T_{L′}, σ′) also explains (G[L′], σ′) and T*_{L′} ≤ T_{L′} ≤ T by Theorem 13, i.e., r(T*_{L′}) ⊆ r(T). Since this holds for all pairs of two distinct colors, the union R of the triples obtained from the set of all least resolved 2-cBMG trees is displayed by T. In particular, therefore, R is consistent.
Conversely, suppose that (G[L′], σ′) is a 2-cBMG for any two distinct colors s, t and R is consistent. Let Aho(R) be the tree that is constructed by BUILD for the input set R. This tree displays R and is a least resolved tree [2] in the sense that we cannot contract any edge in Aho(R) without losing a triple from R. By construction, any triple that is displayed by (T_st, σ_st) is also displayed by Aho(R), i.e., (T_st, σ_st) ≤ Aho(R). Hence, for any α ∈ N and any color s ≠ σ(α) the out-neighborhood N_s(α) is the same w.r.t. (T_st, σ_st) and w.r.t. Aho(R). Since this is true for any ∼• class of G, also all in-neighborhoods are the same in Aho(R) and the corresponding (T_st, σ_st). Therefore, we conclude that Aho(R) explains (G, σ), i.e., (G, σ) is an n-cBMG.
In order to see that Aho(R) is a least resolved tree explaining (G, σ), we recall that the contraction of an edge leaves at least one triple unexplained, see [39, Prop. 4.1]. Since R consists of all the triples r(T_st) that in turn uniquely identify the structure of (T_st, σ_st) (cf. [40, Thm. 6.4.1]), none of these triples is dispensable. The contraction of an edge in Aho(R) therefore yields a tree that no longer displays (T_st, σ_st) for some pair of colors s, t and thus no longer explains (G, σ). Thus, Aho(R) contains no redundant edges and we can apply Theorem 13 to conclude that Aho(R) is the unique least resolved tree that explains (G, σ).
Fig. 11 summarizes the construction of the least resolved tree from the 3-colored digraph (G, σ) shown in Fig. 11(B). For simplicity we assume that we already know that (G, σ) is indeed a 3-cBMG. For each of the three colors the example has four genes. In addition to singletons there are three non-trivial ∼• classes α = {a_2, a_3, a_4}, β = {b_3, b_4} and γ = {c_3, c_4}. Following Theorem 14, we extract for each of the three pairs of colors the induced subgraphs (G_st, σ_st) and construct the least resolved trees that explain them (Fig. 11(C)). Extracting all triples from these least resolved trees on two colors yields the triple set R, which in this case is consistent. Theorem 14 implies that the tree Aho(R) (shown in the lower right corner) explains (G, σ) and is in particular the unique least resolved tree w.r.t. (G, σ).
We close this section by showing that in fact the informative triples of all (G_st, σ_st) are already sufficient to decide whether (G, σ) is an n-cBMG or not. More precisely, we show
Figure 11: Construction of the least resolved tree explaining the colored best match graph. Panel (A) recalls the event-labeled gene tree of the evolutionary scenario shown in Fig. 1. There are three ∼• classes with more than one element: α = {a_2, a_3, a_4}, β = {b_3, b_4} and γ = {c_3, c_4} in the 3-cBMG (G, σ) shown in panel (B). For simplicity of presentation, the ∼• classes are already collapsed into single vertices. Panel (C) lists the three induced subgraphs of (G, σ) on two colors together with their least resolved trees. By construction, (G, σ) is the union of the three subgraphs on two colors. Panel (D) shows the Aho tree for the set of all triples obtained from the least resolved trees shown in (C). This tree explains the graph (G, σ) and is the unique least resolved tree w.r.t. (G, σ).
R |Lv ⊆ R |Lu , the set L vi must also be connected in [R |Lu , L u ] for every v i ∈ child(v) (cf. Prop. 8 in [6]). It remains to show that all L vi are connected in [R |Lu , L u ].
Since (T, σ) is least resolved w.r.t. (G, σ), it follows from Theorem 13 that v = ρ_{α,s} for some color s ∈ σ(L(T(u) \ T(v))) and a ∼• class α with σ(α) ≠ s. In particular, therefore, s ∉ σ(L_{v_i}) if α ∈ L_{v_i} (say i = 1). By definition of s, there must be a v_j ∈ child(v) \ {v_1} (say j = 2) such that s ∈ σ(L_{v_2}). Let y ∈ L_{v_2} ∩ L[s]. Lemma 14 implies y ∈ N_s(α), i.e., αy ∈ E(G). Moreover, by definition of s, there must be a leaf y′ ∈ L(T(u) \ T(v)) ∩ L[s]. Since lca(α, y) ≺_T lca(α, y′), we have αy′ ∉ E(G), whereas y′α may or may not be contained in (G, σ). Therefore, the induced subgraph on {α, y, y′} is of the form X_1, X_2, X_3, or X_4 and thus provides the informative triple αy|y′. It follows that L_{v_1} and L_{v_2} are connected in [R|L_u, L_u]. In particular, this implies that any L_{v_j} with σ(L_{v_j}) ⊆ σ(L_v) containing s is connected to any L_{v_i} that does not contain s. Since (G, σ) is connected, such a set L_{v_i} always exists by Theorem 2. Now let L_1 := {L_{v_j} | v_j ∈ child(v), s ∈ σ(L_{v_j})} and L_2 := {L_{v_i} | v_i ∈ child(v), s ∉ σ(L_{v_i})}. It then follows from the arguments above that L_1 and L_2 form a complete bipartite graph, hence [R|L_u, L_u] is connected.
As an immediate consequence, Theorem 14 can be rephrased as: (G, σ) is an n-cBMG if and only if (i) all induced subgraphs (G_st, σ_st) on two colors are 2-cBMGs and (ii) the union R of informative triples R(G_st, σ_st) obtained from the induced subgraphs (G_st, σ_st) forms a consistent set. In particular, Aho(R) is the unique least resolved tree that explains (G, σ).
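As an illustration of how the informative triples might be collected in practice, the following minimal Python sketch reads them off directly from the arc set. It assumes, following the argument in the proof above, that a triple xy|y' is informative whenever σ(y) = σ(y') ≠ σ(x), xy is an arc, and xy' is not. The function name `informative_triples` and the input encoding (a set of arc pairs plus a color dictionary) are hypothetical choices made for this example only.

```python
from itertools import permutations


def informative_triples(arcs, color):
    """Return triples (x, y, y2), read as xy|y2.

    arcs  : iterable of (tail, head) pairs of the colored digraph
    color : dict mapping every vertex to its color (assumed encoding)

    Assumption: xy|y2 is informative iff y and y2 share a color that
    differs from the color of x, x -> y is an arc, and x -> y2 is not.
    """
    arcs = set(arcs)
    triples = set()
    for x in color:
        for y, y2 in permutations(color, 2):
            same_color_pair = color[y] == color[y2] != color[x]
            if same_color_pair and (x, y) in arcs and (x, y2) not in arcs:
                triples.add((x, y, y2))
    return triples


# Toy example: gene tree ((a1, b1), b2) with species A = {a1}, B = {b1, b2}.
example_arcs = {("a1", "b1"), ("b1", "a1"), ("b2", "a1")}
example_color = {"a1": "A", "b1": "B", "b2": "B"}
example_triples = informative_triples(example_arcs, example_color)
```

This brute-force enumeration is cubic in the number of vertices and is meant only to make the definition concrete.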

Algorithmic Considerations
The material in the previous two sections can be translated into practical algorithms that decide for a given colored graph (G, σ) whether it is an n-cBMG and, if this is the case, compute the unique least resolved tree that explains (G, σ). The correctness of Algorithm 1 follows directly from Theorem 14 (for a single connected component) and Theorem 2 regarding the composition of connected components. It depends on the construction of the unique least resolved tree for the connected components of the induced 2-cBMGs, called LRTfrom2cBMG() in the pseudocode of Algorithm 1. There are two distinct ways of computing these trees: either by constructing the hierarchy T (H) from the extended reachable sets R (Algorithm 2) or via constructing the Aho tree from the set of informative triples (Algorithm 3). While the latter approach seems simpler, we shall see below that it is in general slightly less efficient. Furthermore, we use a function BuildST() to construct the supertree from a collection of input trees. Together with the computation of Aho() from a set of triples, it will be briefly discussed later in this section.
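To make the role of Aho() concrete, here is a minimal Python sketch of the classical BUILD procedure for rooted triples. It is not the pseudocode of Algorithm 3. The function name `aho`, the encoding of a triple ab|c as a tuple `(a, b, c)`, and the representation of inner vertices as frozensets of their children are assumptions made for this example.

```python
def aho(triples, leaves):
    """BUILD-style recursion: return a tree displaying all triples,
    or None if the triple set is inconsistent.

    triples : collection of (a, b, c), read as ab|c
    leaves  : non-empty set of leaf labels
    A leaf is returned as its label, an inner vertex as a frozenset
    of its children (hypothetical representation).
    """
    triples = list(triples)
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Aho graph: join a and b for every triple ab|c restricted to `leaves`.
    adj = {v: set() for v in leaves}
    for a, b, c in triples:
        if a in leaves and b in leaves and c in leaves:
            adj[a].add(b)
            adj[b].add(a)

    # Connected components by depth-first search.
    components, seen = [], set()
    for start in leaves:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)

    if len(components) == 1:
        return None  # connected Aho graph on >= 2 leaves: inconsistent

    children = []
    for comp in components:
        subtree = aho(triples, comp)
        if subtree is None:
            return None
        children.append(subtree)
    return frozenset(children)


consistent_tree = aho([("a1", "b1", "b2")], {"a1", "b1", "b2"})
inconsistent_tree = aho([("a", "b", "c"), ("a", "c", "b")], {"a", "b", "c"})
```

The sketch recomputes the Aho graph at every recursion level, which keeps it short but makes it slower than more careful implementations.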
Algorithm 1 (pseudocode excerpt):
if there is xy ∈ E with σ(x) = σ(y) then exit("not a BMG")
determine connected components

Let us now turn to analyzing the computational complexity of Algorithms 1, 2, and 3. We start with the building blocks necessary to process the 2-cBMG and consider performance bounds on individual tasks.
From (T, σ) to (G, σ). Given a leaf-labeled tree (T, σ), we first consider the construction of the corresponding cBMG. The necessary lowest common ancestor queries can be answered in constant time after linear-time preprocessing, see e.g. [17,38]. The lca() function can also be used to express the partial order among vertices since we have x ⪯ y if and only if lca(x, y) = y. In particular, therefore, lca(x, y) ⪯ lca(x, y') is true if and only if lca(lca(x, y), lca(x, y')) = lca(lca(x, y), y') = lca(x, y'). Thus (G, σ) can be constructed from (T, σ) by computing lca(x, y) in constant time for each leaf x and each y ∈ L[s]. Since the last common ancestors for fixed x are comparable, their

Algorithm 2: Unique least resolved tree of a connected 2-cBMG. Require: Two-colored connected bipartite digraph (G(L, E), σ).
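The construction of (G, σ) from (T, σ) can be sketched in a few lines of Python. This is a simplified illustration, not the constant-time scheme cited above: it answers each lca query by walking up the tree. The function name `best_match_graph` and the encoding of the tree as a child-to-parent dictionary (with the root mapped to `None`) are assumptions made for this example. Because all lca(x, y) for fixed x lie on the path from x to the root, the lowest one is simply the deepest one.

```python
def best_match_graph(parent, color):
    """Return the arc set of the best match graph of a leaf-colored tree.

    parent : dict child -> parent, root -> None (assumed encoding)
    color  : dict leaf -> color
    """
    def path_to_root(v):
        path = [v]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    def lca_depth(x, y):
        # Naive lca: first vertex on y's root path that is an ancestor of x.
        ancestors_of_x = set(path_to_root(x))
        for v in path_to_root(y):
            if v in ancestors_of_x:
                return len(path_to_root(v)) - 1  # depth of lca(x, y)

    arcs = set()
    for x in color:
        candidates = {}  # color s -> list of (depth of lca(x, y), y)
        for y in color:
            if color[y] != color[x]:
                candidates.setdefault(color[y], []).append((lca_depth(x, y), y))
        for scored in candidates.values():
            deepest = max(d for d, _ in scored)
            # Every y whose lca with x is lowest is a best match of x.
            arcs.update((x, y) for d, y in scored if d == deepest)
    return arcs


# Toy tree ((a1, b1)u, b2)r with species A = {a1}, B = {b1, b2}.
toy_parent = {"a1": "u", "b1": "u", "u": "r", "b2": "r", "r": None}
toy_color = {"a1": "A", "b1": "B", "b2": "B"}
toy_arcs = best_match_graph(toy_parent, toy_color)
```

In the toy tree, a1 has b1 as its only best match in species B, while both b1 and b2 point to a1, the only gene of species A.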
In order to compute the thinness classes, we observe that the symmetric part of P F corresponds to equal sets. The classes of equal sets can be obtained as connected components by breadth-first search on the symmetric part of P F with an effort of O(n²). This procedure is applied separately to the in- and out-neighborhoods of the cBMG. Using an auxiliary graph in which x, y ∈ L are connected if they are in the same component for both the in- and out-neighbors, the thinness classes can now be obtained by another breadth-first search in O(n²). With n = |L| and m = |E|, the sets of vertices with equal in- and out-neighborhoods can thus be identified in O(|L| |E|) total time.
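For illustration, the thinness classes can also be obtained by grouping vertices on their pair of in- and out-neighborhoods. The Python sketch below uses hashing of frozensets rather than the breadth-first search procedure described above, so it is a swapped-in technique and not the authors' method. The function name `thinness_classes` and the input encoding are hypothetical.

```python
from collections import defaultdict


def thinness_classes(vertices, arcs):
    """Group vertices that have identical in- and out-neighborhoods.

    vertices : iterable of vertex labels
    arcs     : iterable of (tail, head) pairs
    Returns a list of sets, one per class.
    """
    vertices = list(vertices)
    out_nb = {v: set() for v in vertices}
    in_nb = {v: set() for v in vertices}
    for x, y in arcs:
        out_nb[x].add(y)
        in_nb[y].add(x)

    classes = defaultdict(set)
    for v in vertices:
        # Two vertices land in the same bucket iff both neighborhoods agree.
        key = (frozenset(in_nb[v]), frozenset(out_nb[v]))
        classes[key].add(v)
    return list(classes.values())


demo_classes = thinness_classes(
    ["a1", "b1", "b2", "b3"],
    [("a1", "b1"), ("b1", "a1"), ("b2", "a1"), ("b3", "a1")],
)
```

Here b2 and b3 have the same (empty) in-neighborhood and the same out-neighborhood {a1}, so they fall into one class.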
Recognizing 2-cBMGs. Since (N0) holds for all graphs, it will be useful to construct the table X with entries X_{α,β} = 1 if α ⊆ N(β) and X_{α,β} = 0 otherwise. This table can be constructed in O(|E|) time by iterating over all edges and retrieving (in constant time) the ∼• classes to which their endpoints belong. The N(N(α)) can now be obtained in O(|E| |L|) by iterating over all edges αβ and adding the classes in N(β) to N(N(α)). We store this information in a table with entries Q_{α,β} = 1 if α ∈ N(N(β)) and Q_{α,β} = 0 otherwise, in order to be able to decide membership in constant time later on.
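A minimal Python sketch of the two membership tables follows. Instead of 0/1 matrices it stores X and Q as sets of pairs, which also gives constant-time membership tests on average. The function name `neighborhood_tables` and the dictionary `cls` mapping each vertex to an identifier of its ∼• class are assumptions made for this example.

```python
def neighborhood_tables(arcs, cls):
    """Build set-based versions of the tables X and Q.

    arcs : iterable of (tail, head) pairs
    cls  : dict vertex -> class identifier (assumed precomputed)
    Returns (X, Q) where
      (alpha, beta) in X  iff  alpha is contained in N(beta),
      (alpha, beta) in Q  iff  alpha is contained in N(N(beta)).
    """
    # Out-neighborhoods on the level of classes.
    N = {c: set() for c in set(cls.values())}
    for x, y in arcs:
        N[cls[x]].add(cls[y])

    X = {(alpha, beta) for beta, nbrs in N.items() for alpha in nbrs}

    Q = set()
    for beta, nbrs in N.items():
        for gamma in nbrs:           # arc beta -> gamma
            for alpha in N[gamma]:   # add N(gamma) to N(N(beta))
                Q.add((alpha, beta))
    return X, Q


demo_cls = {"a1": "a1", "b1": "b1", "b2": "b2"}  # every vertex its own class
demo_X, demo_Q = neighborhood_tables(
    [("a1", "b1"), ("b1", "a1"), ("b2", "a1")], demo_cls
)
```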
A table Y_{αβ} with Y_{αβ} = 0 if N(α) ∩ N(N(β)) = ∅ and Y_{αβ} = 1 if there is an overlap between N(α) and N(N(β)) can be computed in O(|L|³) time from the membership tables X and Q for neighborhoods N( . ) and next-nearest neighborhoods N(N( . )), respectively. From the membership table for N(N(α)) and N(γ) we obtain N(N(N(α))) in O(|E| |L|) time, making use of the fact that Σ_α |N(α)| = |E|. For fixed α, β ∈ N it only takes constant time to check the conditions in (N1) and (N3) since all set inclusions and intersections can be tested in constant time using the auxiliary data derived above. The inclusion (N2) can be tested directly in O(|L|) time for each α. We can summarize the considerations above as

Summarizing the discussion so far, and using the fact that the vertices x ∈ α can be attached to the corresponding vertices R(α) in total time O(|L|), we obtain

Reciprocal Best Match Graphs
Software tools implementing methods for tree-free orthology assignment are typically based on reciprocal best matches, i.e., the symmetric part of a cBMG, which we will refer to as colored Reciprocal Best Match Graph (cRBMG). Orthology is well known to have a cograph structure [20,18,23]. The example in Fig. 12 shows, however, that cRBMGs are in general not cographs. It is therefore of interest to better understand this class of colored graphs and their relationships with cographs.
Definition 10. A vertex-colored undirected graph G(V, E, σ) with σ : V → S is a colored reciprocal best match graph (cRBMG) if there is a tree T with leaf set V such that xy ∈ E if and only if lca(x, y) ⪯ lca(x, y') for all y' ∈ V with σ(y') = σ(y) and lca(x, y) ⪯ lca(x', y) for all x' ∈ V with σ(x') = σ(x).

Corollary 6. Every 2-cRBMG is the disjoint union of complete bipartite graphs.
Proof. By Lemma 4 there are arcs (x, y) and (y, x) if and only if x ∈ α ⊆ N(β) and y ∈ β ⊆ N(α). In this case ρ_α = ρ_β. By Lemma 3(v) then σ(α) ≠ σ(β). The same result also implies that in a 2-cRBMG there are at most two ∼• classes with the same root. Thus the connected components of a 2-cRBMG are the complete bipartite graphs formed by pairs of ∼• classes with a common root, as well as isolated vertices corresponding to all other leaves of T.
The converse, however, is not true, as shown by the counterexample in Figure 13. The complete characterization of cRBMGs does not seem to follow in a straightforward manner from the properties of the underlying cBMGs. It will therefore be addressed elsewhere.
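Since a cRBMG is the symmetric part of a cBMG, it can be extracted from a given arc set with a single pass. The following Python sketch illustrates this. The function name `reciprocal_best_matches` and the encoding of an undirected edge as a two-element frozenset are assumptions made for this example.

```python
def reciprocal_best_matches(arcs):
    """Return the undirected edges {x, y} for which both x -> y and
    y -> x are arcs, i.e. the symmetric part of the digraph.

    arcs : iterable of (tail, head) pairs
    """
    arcs = set(arcs)
    return {frozenset((x, y)) for x, y in arcs if (y, x) in arcs}


demo_edges = reciprocal_best_matches(
    [("a1", "b1"), ("b1", "a1"), ("b2", "a1")]
)
```

In this small example only a1 and b1 are reciprocal best matches; b2 remains an isolated vertex of the cRBMG.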

Concluding Remarks
The main result of this contribution is a complete characterization of colored best match graphs (cBMGs), a class of digraphs that arises naturally at the first stage of many of the widely used computational methods for orthology assignment. A cBMG (G, σ) is explained by a unique least resolved tree (T, σ), which is displayed by the true underlying tree. We have shown here that cBMGs can be recognized in cubic time (in the number of genes) and that the unique least resolved tree (T, σ) can be reconstructed with the same complexity. Related graph classes, for instance directed cographs [8], which appear in generalizations of orthology relations [22], or the Fitch graphs associated with horizontal gene transfer [14], have characterizations in terms of forbidden induced subgraphs. We suspect that this is not the case for best match graphs because they are not hereditary.
Reciprocal best match graphs, i.e., the symmetric subgraph of (G, σ), form the link between cBMGs and orthology relations. The characterization of cRBMGs, somewhat surprisingly, does not seem to be a simple consequence of the results on cBMGs presented here. We will address this issue in future work.
Several other questions seem to be appealing for future work. Most importantly, what if the vertex coloring is not known a priori? What are the properties of BMGs in general? For connected 2-cBMGs the question is simple, since the bipartition is easily found by a breadth-first search. In general, however, we suspect that, similar to many other coloring problems, it is difficult to decide whether a digraph G admits a coloring σ with n = |S| colors such that (G, σ) is an n-cBMG.
In the same vein, we may ask for the smallest number n of colors, if it exists, such that G can be colored as an n-cBMG.
As discussed in the introduction, usually sequence similarities are computed. In the presence of large differences in evolutionary rates between paralogous groups, maximal sequence similarity does not guarantee maximal evolutionary relatedness. It is often possible, however, to identify such problematic cases. Suppose the three species a, b, and c form a triple ab|c that is trustworthy due to independent phylogenetic information. Now consider a gene x in a, two candidate best matches y' and y'' in b, and a candidate best match z in c. To decide whether lca(x, y') ≺ lca(x, y'') or not, we can use the support for the three possible unrooted quadruples formed by the sequences {x, y', y'', z}, which can be readily computed as the likelihoods of the three quadruples or using quartet-mapping [34]. If the best supported quadruple is (xy'|y''z) or (xy''|y'z), it is very likely that lca(x, y') ≺ lca(x, y'') or lca(x, y'') ≺ lca(x, y'), respectively, while (xz|y'y'') typically indicates lca(x, y') = lca(x, y''). This inference is correct as long as z is correctly identified as outgroup to x, y', y'', which is very likely since all three of y', y'', z are candidate best matches of x in the first place. Aggregating evidence over different choices of z could thus be used to increase the confidence. An empirical evaluation of this approach to improve blast-based best hit data is the subject of ongoing research.
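The decision rule described above can be written down as a small Python function. This is only a sketch of the rule as stated in the text. The function name `compare_candidates`, the three numeric support values (for example quartet likelihoods), the returned labels, and the choice to report ties as "undecided" are all assumptions made for this example.

```python
def compare_candidates(support_xy1, support_xy2, support_xz):
    """Compare lca(x, y') and lca(x, y'') from quadruple supports.

    support_xy1 : support for (x y' | y'' z)
    support_xy2 : support for (x y'' | y' z)
    support_xz  : support for (x z | y' y'')
    Returns "y1_closer" if lca(x, y') is likely below lca(x, y''),
    "y2_closer" for the opposite order, "equal" if both lcas likely
    coincide, and "undecided" if the best support is not unique
    (the tie handling is an assumption of this sketch).
    """
    supports = {
        "y1_closer": support_xy1,
        "y2_closer": support_xy2,
        "equal": support_xz,
    }
    best = max(supports.values())
    winners = [label for label, value in supports.items() if value == best]
    return winners[0] if len(winners) == 1 else "undecided"


verdict = compare_candidates(0.8, 0.1, 0.1)
```

Aggregation over several outgroup genes z could then, for instance, be a majority vote over the returned labels, although the text does not prescribe a particular aggregation scheme.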
From a data analysis point of view, finally, it is of interest to ask whether an n-colored digraph (G, σ) that is not a cBMG can be edited into an n-cBMG by adding and removing arcs. This idea has been used successfully to obtain orthologs from noisy, empirical reciprocal best hit data, see e.g. [20,29,24,28,11]. We propose that a step-wise approach could further improve the accuracy of orthology detection. In the first step, empirical (reciprocal) best hit data obtained with ProteinOrtho or a similar tool would be edited to conform to a cBMG or a cRBMG. These improved data would then be edited in a second step to the cograph structure of an orthology relation. Details on cRBMGs and their connections with orthology will be discussed in forthcoming work.