## Abstract

Orthology detection from sequence similarity remains a difficult and computationally expensive problem for gene families with large numbers of gene duplications and losses. REvolutionH-tl implements a new graph-based approach that first identifies orthogroups, orthology, and paralogy relationships, and then uses this information in a second step to infer event-labeled gene trees and their reconciliation with an inferred species tree. It requires neither gene trees nor species trees as input and settles for a maximal subtree reconciliation in cases where noise or horizontal gene transfer precludes a global reconciliation. The accuracy of the tool is comparable to that of competing tools at substantially reduced computational cost. REvolutionH-tl is freely available at https://pypi.org/project/revolutionhtl/.

## References

1. Aho, A.V., Sagiv, Y., Szymanski, T.G., Ullman, J.D.: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. **10**(3), 405–421 (1981). https://doi.org/10.1137/0210030
2. Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. **25**(17), 3389–3402 (1997). https://doi.org/10.1093/nar/25.17.3389
3. Bininda-Emonds, O.: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Computational Biology, Springer, Dordrecht (2004). https://doi.org/10.1007/978-1-4020-2330-9
4. Buchfink, B., Reuter, K., Drost, H.G.: Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods **18**, 366–368 (2021). https://doi.org/10.1038/s41592-021-01101-x
5. Dress, A., Huber, K.T., Koolen, J., Moulton, V., Spillner, A.: Basic Phylogenetic Combinatorics. Cambridge University Press, Cambridge (2011). https://doi.org/10.1017/CBO9781139019767
6. Emms, D.M., Kelly, S.: OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. **20**, 1–14 (2019)
7. Fitch, W.: Homology: a personal view on some of the problems. Trends Genet. **16**, 227–231 (2000). https://doi.org/10.1016/S0168-9525(00)02005-9
8. Fuentes, D., Molina, M., Chorostecki, U., Capella-Gutiérrez, S., Marcet-Houben, M., Gabaldón, T.: PhylomeDB V5: an expanding repository for genome-wide catalogues of annotated gene phylogenies. Nucleic Acids Res. **50**(D1), D1062–D1068 (2021). https://doi.org/10.1093/nar/gkab966
9. Gabaldón, T., Koonin, E.V.: Functional and evolutionary implications of gene orthology. Nat. Rev. Genet. **14**(5), 360–366 (2013)
10. Geiß, M., et al.: Best match graphs. J. Math. Biol. **78**(7), 2015–2057 (2019). https://doi.org/10.1007/s00285-019-01332-9
11. Geiß, M.: Best match graphs and reconciliation of gene trees with species trees. J. Math. Biol. **80**(5), 1459–1495 (2020)
12. Hellmuth, M.: Biologically feasible gene trees, reconciliation maps and informative triples. Algorithms Mol. Biol. **12**(1), 23 (2017). https://doi.org/10.1186/s13015-017-0114-z
13. Hellmuth, M., Stadler, P.F.: The theory of gene family histories. arXiv preprint arXiv:2304.11826 (2023)
14. Hellmuth, M., Wieseke, N., Lechner, M., Lenhof, H.P., Middendorf, M., Stadler, P.F.: Phylogenomics with paralogs. Proc. Natl. Acad. Sci. U.S.A. **112**, 2058–2063 (2015). https://doi.org/10.2307/2412448
15. Hernandez-Rosales, M., Hellmuth, M., Wieseke, N., Huber, K.T., Moulton, V., Stadler, P.F.: From event-labeled gene trees to species trees. BMC Bioinform. **13**(19), S6 (2012). https://doi.org/10.1186/1471-2105-13-S19-S6
16. Huerta-Cepas, J., Dopazo, H., Dopazo, J., Gabaldón, T.: The human phylome. Genome Biol. **8**, R109 (2007)
17. Kerfeld, C.A., Scott, K.M.: Using BLAST to teach “E-value-tionary” concepts. PLoS Biol. **9**(2), e1001014 (2011). https://doi.org/10.1371/journal.pbio.1001014
18. Klemm, P., Stadler, P.F., Lechner, M.: Proteinortho6: pseudo-reciprocal best alignment heuristic for graph-based detection of (co-) orthologs. Front. Bioinform. **3**, 1322477 (2023)
19. Kristensen, D., Wolf, Y., Mushegian, A., Koonin, E.: Computational methods for gene orthology inference. Brief. Bioinform. **5**(12), 399–420 (2019)
20. Kundu, S., Bansal, M.S.: SaGePhy: an improved phylogenetic simulation framework for gene and subgene evolution. Bioinformatics **35**(18), 3496–3498 (2019). https://doi.org/10.1093/bioinformatics/btz081
21. Le, S.Q., Gascuel, O.: An improved general amino acid replacement matrix. Mol. Biol. Evol. **25**, 1307–1320 (2008). https://doi.org/10.1093/molbev/msn067
22. Lechner, M., Findeiß, S., Steiner, L., Marz, M., Stadler, P.F., Prohaska, S.J.: Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. **12**(1), 124 (2011). https://doi.org/10.1186/1471-2105-12-124
23. Python Software Foundation: Python language reference (2023). http://www.python.org
24. Schaller, D., et al.: Corrigendum to “Best match graphs”. J. Math. Biol. **82**(6), 47 (2021). https://doi.org/10.1007/s00285-021-01601-6
25. Schaller, D., Geiß, M., Hellmuth, M., Stadler, P.F.: Best match graphs with binary trees. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds.) AlCoB 2021. LNCS, vol. 12715, pp. 82–93. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-74432-8_6
26. Schaller, D., Geiß, M., Hellmuth, M., Stadler, P.F.: Heuristic algorithms for best match graph editing. Algorithms Mol. Biol. **16**(1), 19 (2021). https://doi.org/10.1186/s13015-021-00196-3
27. Schaller, D., Geiß, M., Stadler, P.F., Hellmuth, M.: Complete characterization of incorrect orthology assignments in best match graphs. J. Math. Biol. **82**(3), 20 (2021). https://doi.org/10.1007/s00285-021-01564-8
28. Semple, C., Steel, M., Steel, B.: Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications, Oxford University Press, Oxford (2003)
29. Stadler, P.F., et al.: From pairs of most similar sequences to phylogenetic best matches. Algorithms Mol. Biol. **15**(1), 1–20 (2020). https://doi.org/10.1186/s13015-020-00165-2
30. Wu, B.Y.: Constructing the maximum consensus tree from rooted triples. J. Comb. Optim. **8**(1), 29–39 (2004). https://doi.org/10.1023/B:JOCO.0000021936.04215.68
31. Zhang, C., Mirarab, S.: ASTRAL-Pro 2: ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics **38**(21), 4949–4950 (2022)
32. Zmasek, C.M., Eddy, S.R.: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics **17**(9), 821–828 (2001)

## Acknowledgments

This work was supported in part by the CINVESTAV-University of California (UC Alianza MX) joint project and by the German Research Foundation (DFG, STA 850/49-1). KAP (CVU:227919) and JARR (CVU:1147711) received financial support from CONAHCyT. We express our gratitude to Marisol Navarro Miranda, Erika Viridiana Cruz Bonilla, and Luis Fernando Flores Lopez for their valuable contributions to the design of the methodology figure for REvolutionH-tl.

## Ethics declarations

### Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

## A Appendix

### A.1 Notation

A *graph* \(G=(V(G),E(G))\) consists of two sets: a non-empty set of objects *V*(*G*), called *nodes*, and a set *E*(*G*) of *edges*. Each edge, denoted \(e=uv\), connects a pair of nodes \(u,v\in V(G)\). The edge is called an *arrow* when this connection has a direction from *u* to *v*. In such cases, *v* is an *out-neighbor* of *u*. The number of connections to a node *v* is the *degree* of the node, denoted \(\deg _G(v)\). Furthermore, the *out-degree* \(\text {deg}^+_G(v)\) of a node *v* is the number of its out-neighbors. Graphs are divided into two main families depending on the nature of their connections: those with edges, known as *undirected graphs*, and those with arrows, known as *directed graphs*. A *subgraph H* of *G* is a graph with \(V(H)\subseteq V(G)\) and \(E(H)\subseteq E(G)\). Moreover, the *subgraph of* *G* *induced by* \(V' \subseteq V(G)\), denoted \(G[V']\), is the subgraph *H* with \(V(H)= V'\) whose set of edges consists of all edges in *E*(*G*) that connect nodes in \(V'\).

In a graph *G*, a *path* from node *u* to node *v* is a sequence of nodes starting at *u* and ending at *v*, with consecutive nodes connected by edges. A graph is termed *connected* if there is a path linking every pair of its nodes.

A *tree* *T* is a connected undirected graph that becomes disconnected by removing any edge. In this context, every tree is *rooted*, meaning it has a designated root node \(\rho _T\), with the structure visualized such that all other nodes fall hierarchically beneath the root (refer to Fig. 3(A)). The *leaves* of the tree, *L*(*T*), are nodes with zero out-degree. The *inner nodes*, \(V^0(T)\), are those nodes that are neither leaves nor the root of *T*.

Although rooted trees are considered undirected, the convention \(uv \in E(T)\) indicates *u* as the unique *parent* of *v* and *v* as a *child* of *u*, with \(\text {ch}_T(u)\) denoting the set of children of *u*. Also, *u* is an *ancestor* of *v*, and *v* a *descendant* of *u*, if *u* lies on the unique path from *v* to \(\rho _T\). We express this as \(v \preceq _T u\), or more strictly as \(v \prec _T u\) if \(v \ne u\). Nodes \(u, v \in V(T)\) are *non-comparable*, denoted \(u \parallel _T v\), if neither is an ancestor or descendant of the other; they are *comparable* otherwise. The *last common ancestor* of a set \(X \subseteq V(T)\), \(\text {lca}_T(X)\), is the node *u* most distant from \(\rho _T\) that is an ancestor of all nodes in *X*. For individual nodes \(x,y \in V(T)\), we write \(\text {lca}_T(x,y)\) for their last common ancestor. Furthermore, in [15] the \(\preceq _T\) relationship has been extended to consider edges within *T*; for edges \(e_0=uv, e_1=xy\) and a node *z*, \(e_0 \preceq _T e_1\) if \(v \preceq _T y\), \(z \prec _T e_0\) if \(z \preceq _T v\), and \(e_0 \prec _T z\) if \(u \preceq _T z\).
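The ancestor relation and the last common ancestor can be made concrete with a short sketch. The child-to-parent dictionary, the function names, and the toy tree are illustrative assumptions and are not taken from REvolutionH-tl.

```python
def ancestors(parent, v):
    """Path from v up to the root, v included (every u returned satisfies v ⪯ u)."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    """Last common ancestor: the common ancestor farthest from the root."""
    nodes = list(nodes)
    common = set(ancestors(parent, nodes[0]))
    for v in nodes[1:]:
        common &= set(ancestors(parent, v))
    # Walking up from the first node, the first common ancestor met is the lowest.
    return next(u for u in ancestors(parent, nodes[0]) if u in common)


def comparable(parent, u, v):
    """True if one node is an ancestor of the other (or u == v)."""
    return u in ancestors(parent, v) or v in ancestors(parent, u)


# Hypothetical toy tree: root r with children a and p; p has children b and c.
parent = {"a": "r", "p": "r", "b": "p", "c": "p"}
print(lca(parent, ["b", "c"]), lca(parent, ["a", "b"]))
```

A child-to-parent map is used here only because it makes the walk towards the root trivial; any rooted tree structure would serve.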

For any node *v* in *V*(*T*), the expression *T*(*v*) denotes the *subtree* rooted at *v*, encompassing all descendants of *v*. The restriction \(T_{|L'}\) of *T* to a leaf subset \(L' \subseteq L(T)\) is its minimal subtree connecting all leaves in \(L'\), excluding degree-two inner nodes. A tree *T* *displays* another tree \(T'\) with leaves \(L'\), denoted \(T' \le T\), if \(T'\) arises from contracting inner edges of \(T_{|L'}\). If \(L(T) = L(T')\), *T* is a *refinement* of \(T'\). The *cluster* \(C_T(v)\) includes all leaves in the subtree *T*(*v*).

All trees in this paper are *phylogenetic*, meaning each inner node \(v \in V^0(T)\) has an out-degree \(\text {deg}^+_T(v) > 1\), except the root. In some cases, like in Fig. 3(C), we examine *planted trees* formed by adding a new node \(0_T\) and edge \(0_T \rho _{T'}\) to a phylogenetic, rooted tree \(T'\).

A *triple* *xy*|*z* is a tree on three leaves *x*, *y* and *z* in which *x* and *y* share a closer common ancestor than either does with *z*. Triples are pivotal for supertree construction [3, 5, 28]. Each tree *T* displays a set of rooted triples *R*(*T*). A triple set *R* is *consistent* if \(R\subseteq R(T)\) for some tree *T*; such a tree is said to display *R*. The BUILD algorithm [1, 28] checks this, returning a supertree for consistent *R* or reporting inconsistency. It uses the *Aho-graph* \([R, L']\), whose nodes are the leaves \(L'\) (initially \(L' = \bigcup _{r\in R} L(r)\)) and which has an edge *xy* for each triple \(xy|z\in R\) with \(x,y,z\in L'\). BUILD recurses on the connected components of this graph; *R* is consistent if and only if the Aho-graph is disconnected at every recursion step with \(|L'|\ge 2\).
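The recursion behind BUILD can be sketched in a few lines. The encoding of a triple *xy*|*z* as a tuple `(x, y, z)`, the nested-tuple output, and the function name `build` are assumptions made for illustration; this is a minimal sketch rather than the implementation used by the tool.

```python
def build(triples, leaves):
    """Return a tree (nested tuples) displaying all triples, or None if inconsistent.

    triples: iterable of (x, y, z), each meaning the rooted triple xy|z.
    leaves:  the leaf set L' of the current recursion step.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Aho graph: one edge xy per triple xy|z whose three leaves are all in L'.
    adj = {v: set() for v in leaves}
    for x, y, z in triples:
        if x in leaves and y in leaves and z in leaves:
            adj[x].add(y)
            adj[y].add(x)
    # Connected components by depth-first search.
    components, seen = [], set()
    for start in sorted(leaves):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    if len(components) == 1:
        return None  # connected Aho graph on >= 2 leaves: triples are inconsistent
    subtrees = []
    for comp in components:
        sub = build(triples, comp)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


consistent = build({("a", "b", "c")}, {"a", "b", "c"})
inconsistent = build({("a", "b", "c"), ("a", "c", "b")}, {"a", "b", "c"})
```

In this example the single triple *ab*|*c* yields the tree `(("a", "b"), "c")`, while adding the contradictory triple *ac*|*b* connects the Aho-graph and the function returns `None`.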

### A.2 Evolutionary Scenarios

In a species tree *S*, leaves symbolize extant species, and inner nodes indicate speciations. Conversely, a gene tree \((T,t,\sigma )\) depicts genes as leaves *L*(*T*). The function \(\sigma : L(T) \rightarrow L(S)\) maps each gene to its residing species. The function \(t: V(T) \rightarrow \{ \bullet , \square , \odot , \times \}\) classifies nodes in the gene tree based on evolutionary processes: \(t(x)= \bullet \) for speciation, \(t(x)= \square \) for duplication, \(t(x)= \odot \) for extant genes, and \(t(x)= \times \) for gene loss, as detailed in [15] and illustrated in Fig. 3B.

An *evolutionary scenario* \((S,T,t,\sigma )\) merges a gene tree \((T,t,\sigma )\) with a species tree *S* via the reconciliation map \(\mu \), as introduced in [11, 15] and exemplified in Fig. 3. Detailed mathematical constraints of such scenarios are elaborated in Appendix A.7.

Constructing an evolutionary scenario requires consistency between gene and species trees, assessed using color triples. Let \(\mathfrak {R}(T)=\{ r\in R(T) \mid t(\text {lca}_T(L(r)))=\bullet \text { and } |\sigma (L(r))|=3 \}\) be the set of speciation triples of the gene tree. Given a triple \(ab|c\in R(T)\), the corresponding *color triple* is \(\sigma (ab|c)= \sigma (a)\sigma (b)|\sigma (c)\). Finally, let \(\mathfrak {R}_\sigma (T)=\{ \sigma (r) \text { for all } r\in \mathfrak {R}(T) \}\) be the set of *color triples* of the gene tree. Here, the gene tree \((T,t,\sigma )\) and a species tree *S* are consistent whenever \(\mathfrak {R}_\sigma (T)\subseteq R(S)\) [12, 15]. Consistency is required to ensure that a reconciliation between \((T,t,\sigma )\) and *S* exists.
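The consistency test \(\mathfrak {R}_\sigma (T)\subseteq R(S)\) can be illustrated as follows. This is a brute-force sketch over all leaf triples; the child-to-parent dictionaries, the string event labels, the encoding of a triple *xy*|*z* as `(frozenset({x, y}), z)`, and the toy trees are all assumptions made for the example.

```python
from itertools import combinations


def ancestors(parent, v):
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    nodes = list(nodes)
    common = set(ancestors(parent, nodes[0]))
    for v in nodes[1:]:
        common &= set(ancestors(parent, v))
    return next(u for u in ancestors(parent, nodes[0]) if u in common)


def triples(parent, leaves):
    """R(T): every triple xy|z displayed by the tree, as (frozenset({x, y}), z)."""
    out = set()
    for trio in combinations(sorted(leaves), 3):
        top = lca(parent, trio)
        for z in trio:
            x, y = [v for v in trio if v != z]
            if lca(parent, [x, y]) != top:  # x and y meet strictly below the trio's lca
                out.add((frozenset((x, y)), z))
    return out


def color_triples(parent, sigma, t):
    """Color triples of speciation triples whose leaves lie in three distinct species."""
    out = set()
    for pair, z in triples(parent, sigma):
        x, y = sorted(pair)
        root = lca(parent, [x, y, z])
        if t.get(root) == "speciation" and len({sigma[x], sigma[y], sigma[z]}) == 3:
            out.add((frozenset((sigma[x], sigma[y])), sigma[z]))
    return out


# Hypothetical gene tree ((a1, b1), c1) with two speciation nodes u and r.
g_parent = {"a1": "u", "b1": "u", "u": "r", "c1": "r"}
t = {"u": "speciation", "r": "speciation"}
sigma = {"a1": "A", "b1": "B", "c1": "C"}
# Hypothetical species tree ((A, B), C).
s_parent = {"A": "p", "B": "p", "p": "rho", "C": "rho"}

colors = color_triples(g_parent, sigma, t)
is_consistent = colors <= triples(s_parent, {"A", "B", "C"})
```

Enumerating all triples is cubic in the number of leaves, which is acceptable for a demonstration but not a statement about how the tool computes them.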

### A.3 Best Match Graphs

The concept of *best match graphs* (BMGs) [10, 11, 24, 25, 27] is based on the following definition: a gene *y* is a *best match* for *x* if *x* and *y* reside in distinct species and \(\text {lca}_T(x,y)\preceq _T \text {lca}_T(x,y')\) for all genes \(y'\) in the species \(\sigma (y)\), i.e., *y* is one of the genes in \(\sigma (y)\) that is evolutionarily most closely related to *x*.

The best match graph \(G(T,\sigma )\), a directed graph, represents these relationships, with an arrow *xy* indicating *y* is the best match of *x*. The tree \((T,t,\sigma )\) *explains* \(G(T,\sigma )\).

For any directed graph *G* and a node-coloring map \(\sigma :V(G)\rightarrow M\), the set of *informative triples* \(\mathcal {R}(G,\sigma )\) is used to decide whether *G* is a BMG. For \(x,y,y'\in V(G)\) with \(y\ne y'\) and \(\sigma (x)\not =\sigma (y)=\sigma (y')\), the set \(\mathcal {R}(G,\sigma )\) contains the triple \(xy|y'\) whenever \(xy\in E(G)\) and \(xy'\not \in E(G)\); if *T* is required to be binary, it additionally contains \(yy'|x\) whenever both \(xy, xy'\in E(G)\). \((G,\sigma )\) is a BMG if and only if \((G,\sigma )=G(\text {aho}(\mathcal {R}(G,\sigma )), \sigma )\), where \(\text {aho}(\cdot )\) denotes the tree returned by BUILD [10, 25]. Figure 3(A, D–G) depicts the interplay between gene trees, best match graphs, and informative triples.
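Both definitions in this subsection can be tried out on a toy example. The sketch below derives the best match graph of a small gene tree and then extracts the informative triples of the non-binary case; the data structures, function names, and example tree are assumptions for illustration only.

```python
def ancestors(parent, v):
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    nodes = list(nodes)
    common = set(ancestors(parent, nodes[0]))
    for v in nodes[1:]:
        common &= set(ancestors(parent, v))
    return next(u for u in ancestors(parent, nodes[0]) if u in common)


def best_match_graph(parent, sigma):
    """Arrows (x, y) such that y is a best match of x in the species sigma[y]."""
    def depth(v):
        return len(ancestors(parent, v))

    arrows = set()
    for x in sigma:
        for y in sigma:
            if sigma[x] == sigma[y]:
                continue
            rivals = [g for g in sigma if sigma[g] == sigma[y]]
            # lca(x, y) and lca(x, y') both lie on the path from x to the root,
            # so lca(x, y) ⪯ lca(x, y') is equivalent to a depth comparison.
            d = depth(lca(parent, [x, y]))
            if all(d >= depth(lca(parent, [x, g])) for g in rivals):
                arrows.add((x, y))
    return arrows


def informative_triples(arrows, sigma):
    """Triples xy|y' with xy an arrow, xy' not an arrow, sigma(x) != sigma(y) = sigma(y')."""
    out = set()
    for x in sigma:
        for y in sigma:
            for y2 in sigma:
                same_color = sigma[y] == sigma[y2] and sigma[x] != sigma[y]
                if y != y2 and same_color and (x, y) in arrows and (x, y2) not in arrows:
                    out.add((frozenset((x, y)), y2))
    return out


# Hypothetical gene tree ((a1, b1), b2) with genes from species A and B.
parent = {"a1": "u", "b1": "u", "u": "r", "b2": "r"}
sigma = {"a1": "A", "b1": "B", "b2": "B"}
arrows = best_match_graph(parent, sigma)
info = informative_triples(arrows, sigma)
```

Here `a1` has `b1` as its only best match in species B, so the graph yields the single informative triple *a1 b1*|*b2*, which is exactly the triple displayed by the example tree.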

### A.4 Selection of Best Hits

Each alignment hit \(\overrightarrow{xy}\) is associated with a *bit score* \(\omega (\overrightarrow{xy})\). We estimate the evolutionary relatedness between two genes *x* and *y* as the *normalized bit-score* \(\omega _{xy}= ( \omega (\overrightarrow{xy})/\text {length}(y) + \omega (\overrightarrow{yx})/\text {length}(x) )/2\).

For each gene *x*, we identify the most closely related genes in a different species \(Y \ne \sigma (x)\). A gene *y* from species \(\sigma (y) = Y\) is considered a *best hit* of *x* if its normalized bit-score \(\omega _{xy}\) meets or exceeds an *adaptive threshold* defined as \(f \cdot \omega _{x|Y}\), where \(\omega _{x|Y} = \max \{ \omega _{xy} \mid \sigma (y) = Y \}\). Here, *f* is a factor between zero and one. This threshold, aimed at identifying paralogous best hits, was introduced in [22], and we set \(f=0.95\).
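The selection rule can be sketched as follows. The dictionaries `bit`, `length`, and `sigma`, the function names, and the numbers are hypothetical; treating a missing reverse hit as a bit score of zero is an assumption of this sketch, since the text does not specify that case.

```python
def normalized_scores(bit, length):
    """omega_xy = (bit(x->y) / length(y) + bit(y->x) / length(x)) / 2."""
    omega = {}
    for (x, y) in bit:
        forward = bit.get((x, y), 0.0) / length[y]
        backward = bit.get((y, x), 0.0) / length[x]  # assumption: missing hit scores 0
        omega[(x, y)] = omega[(y, x)] = (forward + backward) / 2
    return omega


def best_hits(omega, sigma, f=0.95):
    """Map each gene x to the genes y with omega_xy >= f * max score of x in sigma[y]."""
    top = {}  # (x, Y) -> omega_{x|Y}
    for (x, y), w in omega.items():
        if sigma[x] != sigma[y]:
            key = (x, sigma[y])
            top[key] = max(top.get(key, 0.0), w)
    hits = {}
    for (x, y), w in omega.items():
        if sigma[x] != sigma[y] and w >= f * top[(x, sigma[y])]:
            hits.setdefault(x, set()).add(y)
    return hits


# Hypothetical bit scores between one gene of species A and three genes of species B.
bit = {
    ("a1", "b1"): 200, ("b1", "a1"): 200,
    ("a1", "b2"): 196, ("b2", "a1"): 196,
    ("a1", "b3"): 100, ("b3", "a1"): 100,
}
length = {"a1": 100, "b1": 100, "b2": 100, "b3": 100}
sigma = {"a1": "A", "b1": "B", "b2": "B", "b3": "B"}

hits = best_hits(normalized_scores(bit, length), sigma)
```

With \(f=0.95\) the threshold for `a1` in species B is \(0.95 \cdot 2.0 = 1.9\), so `b1` and `b2` are both retained as best hits while `b3` is discarded.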

### A.5 From Best Hits to Gene Trees

We start by constructing a *best hit graph* \((G,\sigma )\), which is a directed graph whose nodes are the genes of the orthogroup and which has an arrow *xy* if *y* is a best hit of *x*. We then find a least resolved gene tree \((T^*,\sigma )\) that maximizes the similarity between the best hit graph \((G,\sigma )\) and the best match graph \(G(T^*,\sigma )\). To do so, we use the heuristic introduced in [26], which consists of finding a maximum consistent subset of the informative triples \(\mathcal {R}(G,\sigma )\).

The tree \((T^*,\sigma )\) is further refined into an *augmented tree* \((T,\sigma )\), which allows us to assign evolutionary events in such a way that duplication events are minimized while maintaining the same best match graph, that is, \(G(T^*,\sigma )=G(T,\sigma )\) [27].

Now, we create the evolutionary events map \(t:V(T)\rightarrow \{\bullet ,\square ,\odot \}\) as follows. If a node \(v\in V(T)\) is a leaf, we set \(t(v)=\odot \). Otherwise, we set \(t(v)=\bullet \) if \(\sigma (C_T(v')) \cap \sigma (C_T(v'')) = \emptyset \) for every pair of distinct children \(v',v''\in \text {ch}_T(v)\), and \(t(v)=\square \) if not.

Finally, given the event-labeled gene tree \((T,t,\sigma )\), we compute the orthology relation underlying this tree as the relation that comprises all pairs (*x*, *y*) and (*y*, *x*) of genes *x* and *y* for which \(t(\text {lca}_T(x,y))=\bullet \).
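The event labeling and the resulting orthology relation can be sketched as below. The parent-to-children dictionary, the string labels standing in for \(\bullet \), \(\square \), and \(\odot \), and the example tree are illustrative assumptions.

```python
from itertools import combinations


def clusters(children, root):
    """C_T(v) for every node v: the set of leaves in the subtree rooted at v."""
    out = {}

    def visit(v):
        kids = children.get(v, [])
        out[v] = {v} if not kids else set().union(*(visit(k) for k in kids))
        return out[v]

    visit(root)
    return out


def label_events(children, root, sigma):
    """Leaf -> 'extant'; inner node -> 'speciation' if the species sets of its
    children are pairwise disjoint, otherwise 'duplication'."""
    C = clusters(children, root)
    t = {}
    for v in C:
        kids = children.get(v, [])
        if not kids:
            t[v] = "extant"
            continue
        species = [{sigma[g] for g in C[k]} for k in kids]
        disjoint = all(a.isdisjoint(b) for a, b in combinations(species, 2))
        t[v] = "speciation" if disjoint else "duplication"
    return t


def orthologs(children, root, sigma):
    """All ordered gene pairs whose last common ancestor is a speciation node."""
    C = clusters(children, root)
    t = label_events(children, root, sigma)
    pairs = set()
    for v, kids in children.items():
        if t[v] != "speciation":
            continue
        # Genes in different child subtrees of v have v as their lca.
        for k1, k2 in combinations(kids, 2):
            for x in C[k1]:
                for y in C[k2]:
                    pairs.add((x, y))
                    pairs.add((y, x))
    return pairs


# Hypothetical gene tree ((a1, b1), b2): species B occurs on both sides of the root.
children = {"r": ["u", "b2"], "u": ["a1", "b1"]}
sigma = {"a1": "A", "b1": "B", "b2": "B"}
events = label_events(children, "r", sigma)
pairs = orthologs(children, "r", sigma)
```

In the example the root is labeled as a duplication because species B appears below both of its children, so only `a1` and `b1` are reported as orthologs.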

### A.6 Consistency of Triple Sets

To reconcile a gene tree \((T,t,\sigma )\) that is inconsistent with the species tree *S*, we modify \((T,t,\sigma )\) into a consistent tree \((T',t,\sigma )\). We differentiate between consistent triples \(R_C = \{r \in \mathfrak {R}(T) : \sigma (r) \in R(S)\}\) and inconsistent triples \(R_I = \mathfrak {R}(T) {\setminus } R_C\). The aim is to eliminate the triples in \(R_I\) while retaining those in \(R_C\). Removing a leaf \(a \in L(T)\) also removes all triples \(r \in R(T)\) with \(a \in L(r)\). Using this observation, we select a subset of inconsistent leaves \(L_I \subseteq L(R_I)\), set \(L' = L(T) {\setminus } L_I\), and construct a consistent tree \(T' = T_{|L'}\). The steps for this tree editing are outlined in Algorithm 1.
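One simple way to choose \(L_I\) is a greedy rule that repeatedly discards the leaf contained in the largest number of remaining inconsistent triples. The sketch below illustrates that idea only; it is an assumed heuristic and is not claimed to reproduce Algorithm 1. The function name and the example triples are hypothetical, and the rule does not try to preserve triples in \(R_C\).

```python
from collections import Counter


def leaves_to_remove(inconsistent):
    """Greedy choice of L_I: drop the leaf that occurs in most remaining triples.

    inconsistent: iterable of triples, each given as the collection of its
    three leaves (the internal topology of the triple is irrelevant here).
    """
    remaining = [set(r) for r in inconsistent]
    removed = set()
    while remaining:
        counts = Counter(leaf for r in remaining for leaf in r)
        # Ties are broken alphabetically so that the result is deterministic.
        leaf = max(sorted(counts), key=counts.get)
        removed.add(leaf)
        remaining = [r for r in remaining if leaf not in r]
    return removed


# Hypothetical inconsistent speciation triples of a gene tree.
R_I = [("a1", "b1", "c1"), ("a1", "b2", "c1"), ("a1", "b1", "c2")]
L_I = leaves_to_remove(R_I)
all_leaves = {"a1", "b1", "b2", "c1", "c2"}
L_prime = all_leaves - L_I
```

Because `a1` takes part in every inconsistent triple of the example, removing it alone suffices and the other four leaves are kept.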

### A.7 Tree Reconciliation

Once we have ensured consistency between the gene tree \((T,t,\sigma )\) and the species tree *S*, we compute a reconciliation map as follows.

Let \(x,y\in V(T)\). The reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) from the gene tree \((T,t,\sigma )\) to the species tree *S* satisfies:

The reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) is computed in linear time [15]. Given a node \(v\in V(T)\) such that \(t(v)\not =\square \), it is straightforward to determine which element of \(V(S)\cup E(S)\) corresponds to \(\mu (v)\) by inspecting constraints (U0)–(U2). When \(t(v)=\square \), we set \(\mu (v)=xy\in E(S)\) such that \(y=\text {lca}_S(\sigma (C_T(v)))\); this assignment minimizes the number of gene-loss events.
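A possible way to compute such a map is sketched below. Since the constraints are not restated here, the sketch assumes the common convention that leaves are mapped to their species and speciation nodes to the last common ancestor of the species below them; duplications are placed on the edge above that ancestor, as described in the text. The data structures, the planted root `0_S`, and the function names are hypothetical, and the simple traversal is not optimized for linear time.

```python
def clusters(children, root):
    """C_T(v) for every node v of the gene tree."""
    out = {}

    def visit(v):
        kids = children.get(v, [])
        out[v] = {v} if not kids else set().union(*(visit(k) for k in kids))
        return out[v]

    visit(root)
    return out


def lca(parent, nodes):
    """Last common ancestor in a tree given as a child-to-parent dictionary."""
    def path(v):
        p = [v]
        while v in parent:
            v = parent[v]
            p.append(v)
        return p

    nodes = list(nodes)
    common = set(path(nodes[0]))
    for v in nodes[1:]:
        common &= set(path(v))
    return next(u for u in path(nodes[0]) if u in common)


def reconcile(children, root, t, sigma, s_parent):
    """Map gene-tree nodes to nodes (str) or edges (parent, child) of the species tree."""
    mu = {}
    for v, genes in clusters(children, root).items():
        if t[v] == "extant":
            mu[v] = sigma[v]  # assumed constraint: leaves map to their species
            continue
        y = lca(s_parent, {sigma[g] for g in genes})
        # Speciations sit on the node y, duplications on the edge entering y.
        mu[v] = y if t[v] == "speciation" else (s_parent[y], y)
    return mu


# Hypothetical gene tree ((a1, b1), b2) with a duplication at the root.
children = {"r": ["u", "b2"], "u": ["a1", "b1"]}
t = {"r": "duplication", "u": "speciation",
     "a1": "extant", "b1": "extant", "b2": "extant"}
sigma = {"a1": "A", "b1": "B", "b2": "B"}
# Hypothetical planted species tree: 0_S above the root p, which has children A and B.
s_parent = {"A": "p", "B": "p", "p": "0_S"}

mu = reconcile(children, "r", t, sigma, s_parent)
```

The planted root is included so that a duplication predating the root of the species tree still has an edge to be mapped to, here `("0_S", "p")`.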

### A.8 Resolving Speciation Nodes

To refine a node \(x \in V(T)\) with more than two children via a map \(f: \text {ch}_T(x) \rightarrow \{y_0, y_1\}\), perform: (i) add nodes \(y_0, y_1\) to the tree, (ii) remove edges *xy* for each \(y \in \text {ch}_T(x)\), and (iii) add edges \(xy_0\), \(xy_1\), and *f*(*y*)*y* for each \(y \in \text {ch}_T(x)\). When *u* is a speciation node, use the reconciliation map \(\mu \) and map \(f: \text {ch}_T(u) \rightarrow \text {ch}_S(\mu (u))\) to resolve *u*. For \(v \in \text {ch}_T(u)\) and \(v' \in \text {ch}_S(\mu (u))\), set \(f(v) = v'\) iff \(\mu (v) \preceq _S v'\).

### A.9 Inferring Gene Loss

For a speciation node \(x \in V(T)\) in the gene tree, the reconciliation map \(\mu \) helps detect gene losses by mapping *x* to a node \(y = \mu (x)\) in the species tree. If we find a node \(y' \in \text {ch}_S(y)\) for which all nodes \(x' \in V(T)\) satisfying \(x' \prec x\) also fulfill \(\mu (x') \parallel _S y'\), a gene loss at \(y'\) is inferred.

## Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

## About this paper

### Cite this paper

Ramírez-Rafael, J.A. *et al.* (2024). REvolutionH-tl: __R__econstruction of __Evolution__ary __H__istories __t__oo__l__. In: Scornavacca, C., Hernández-Rosales, M. (eds) Comparative Genomics. RECOMB-CG 2024. Lecture Notes in Computer Science, vol 14616. Springer, Cham. https://doi.org/10.1007/978-3-031-58072-7_5

DOI: https://doi.org/10.1007/978-3-031-58072-7_5

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-58071-0

Online ISBN: 978-3-031-58072-7

eBook Packages: Mathematics and Statistics (R0)