REvolutionH-tl: Reconstruction of Evolutionary Histories tool

  • Conference paper
Comparative Genomics (RECOMB-CG 2024)

Abstract

Orthology detection from sequence similarity remains a difficult and computationally expensive problem for gene families with large numbers of gene duplications and losses. REvolutionH-tl implements a new graph-based approach that first identifies orthogroups and orthology and paralogy relationships, and then uses this information in a second step to infer event-labeled gene trees and their reconciliation with an inferred species tree. It requires neither gene trees nor species trees as input and settles for a maximal subtree reconciliation in cases where noise or horizontal gene transfer precludes a global reconciliation. The accuracy of the tool is comparable to that of competing tools at substantially reduced computational cost. REvolutionH-tl is freely available at https://pypi.org/project/revolutionhtl/.

References

  1. Aho, A.V., Sagiv, Y., Szymanski, T.G., Ullman, J.D.: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10(3), 405–421 (1981). https://doi.org/10.1137/0210030

  2. Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997). https://doi.org/10.1093/nar/25.17.3389

  3. Bininda-Emonds, O.: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Computational Biology, Springer, Dordrecht (2004). https://doi.org/10.1007/978-1-4020-2330-9

  4. Buchfink, B., Reuter, K., Drost, H.G.: Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021). https://doi.org/10.1038/s41592-021-01101-x

  5. Dress, A., Huber, K.T., Koolen, J., Moulton, V., Spillner, A.: Basic Phylogenetic Combinatorics. Cambridge University Press, Cambridge (2011). https://doi.org/10.1017/CBO9781139019767

  6. Emms, D.M., Kelly, S.: OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 1–14 (2019)

  7. Fitch, W.: Homology: a personal view on some of the problems. Trends Genet. 16, 227–231 (2000). https://doi.org/10.1016/S0168-9525(00)02005-9

  8. Fuentes, D., Molina, M., Chorostecki, U., Capella-Gutiérrez, S., Marcet-Houben, M., Gabaldón, T.: PhylomeDB V5: an expanding repository for genome-wide catalogues of annotated gene phylogenies. Nucleic Acids Res. 50(D1), D1062–D1068 (2021). https://doi.org/10.1093/nar/gkab966

  9. Gabaldón, T., Koonin, E.V.: Functional and evolutionary implications of gene orthology. Nat. Rev. Genet. 14(5), 360–366 (2013)

  10. Geiß, M., et al.: Best match graphs. J. Math. Biol. 78(7), 2015–2057 (2019). https://doi.org/10.1007/s00285-019-01332-9

  11. Geiß, M.: Best match graphs and reconciliation of gene trees with species trees. J. Math. Biol. 80(5), 1459–1495 (2020)

  12. Hellmuth, M.: Biologically feasible gene trees, reconciliation maps and informative triples. Algorithms Mol. Biol. 12(1), 23 (2017). https://doi.org/10.1186/s13015-017-0114-z

  13. Hellmuth, M., Stadler, P.F.: The theory of gene family histories. arXiv preprint arXiv:2304.11826 (2023)

  14. Hellmuth, M., Wieseke, N., Lechner, M., Lenhof, H.P., Middendorf, M., Stadler, P.F.: Phylogenomics with paralogs. Proc. Natl. Acad. Sci. U.S.A. 112, 2058–2063 (2015). https://doi.org/10.2307/2412448

  15. Hernandez-Rosales, M., Hellmuth, M., Wieseke, N., Huber, K.T., Moulton, V., Stadler, P.F.: From event-labeled gene trees to species trees. BMC Bioinform. 13(19), S6 (2012). https://doi.org/10.1186/1471-2105-13-S19-S6

  16. Huerta-Cepas, J., Dopazo, H., Dopazo, J., Gabaldón, T.: The human phylome. Genome Biol. 8, R109 (2007)

  17. Kerfeld, C.A., Scott, K.M.: Using BLAST to teach “E-value-tionary” concepts. PLoS Biol. 9(2), e1001014 (2011). https://doi.org/10.1371/journal.pbio.1001014

  18. Klemm, P., Stadler, P.F., Lechner, M.: Proteinortho6: pseudo-reciprocal best alignment heuristic for graph-based detection of (co-) orthologs. Front. Bioinform. 3, 1322477 (2023)

  19. Kristensen, D., Wolf, Y., Mushegian, A., Koonin, E.: Computational methods for gene orthology inference. Brief. Bioinform. 5(12), 399–420 (2019)

  20. Kundu, S., Bansal, M.S.: SaGePhy: an improved phylogenetic simulation framework for gene and subgene evolution. Bioinformatics 35(18), 3496–3498 (2019). https://doi.org/10.1093/bioinformatics/btz081

  21. Le, S.Q., Gascuel, O.: An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008). https://doi.org/10.1093/molbev/msn067

  22. Lechner, M., Findeiß, S., Steiner, L., Marz, M., Stadler, P.F., Prohaska, S.J.: Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. 12(1), 124 (2011). https://doi.org/10.1186/1471-2105-12-124

  23. Python Software Foundation: Python language reference (2023). http://www.python.org

  24. Schaller, D., et al.: Corrigendum to “Best match graphs”. J. Math. Biol. 82(6), 47 (2021). https://doi.org/10.1007/s00285-021-01601-6

  25. Schaller, D., Geiß, M., Hellmuth, M., Stadler, P.F.: Best match graphs with binary trees. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds.) AlCoB 2021. LNCS, vol. 12715, pp. 82–93. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-74432-8_6

  26. Schaller, D., Geiß, M., Hellmuth, M., Stadler, P.F.: Heuristic algorithms for best match graph editing. Algorithms Mol. Biol. 16(1), 19 (2021). https://doi.org/10.1186/s13015-021-00196-3

  27. Schaller, D., Geiß, M., Stadler, P.F., Hellmuth, M.: Complete characterization of incorrect orthology assignments in best match graphs. J. Math. Biol. 82(3), 20 (2021). https://doi.org/10.1007/s00285-021-01564-8

  28. Semple, C., Steel, M., Steel, B.: Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications, Oxford University Press, Oxford (2003)

  29. Stadler, P.F., et al.: From pairs of most similar sequences to phylogenetic best matches. Algorithms Mol. Biol. 15(1), 1–20 (2020). https://doi.org/10.1186/s13015-020-00165-2

  30. Wu, B.Y.: Constructing the maximum consensus tree from rooted triples. J. Comb. Optim. 8(1), 29–39 (2004). https://doi.org/10.1023/B:JOCO.0000021936.04215.68

  31. Zhang, C., Mirarab, S.: ASTRAL-Pro 2: ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics 38(21), 4949–4950 (2022)

  32. Zmasek, C.M., Eddy, S.R.: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17(9), 821–828 (2001)


Acknowledgments

This work was supported in part by CINVESTAV-University of California (UC Alianza MX) joint project and by the German Research Foundation (DFG, STA 850/49-1). KAP (CVU:227919) and JARR (CVU:1147711) received financial support from CONAHCyT. We express our gratitude to Marisol Navarro Miranda, Erika Viridiana Cruz Bonilla, and Luis Fernando Flores Lopez for their valuable contributions to the design of the methodology figure for REvolutionH-tl.

Author information

Corresponding author

Correspondence to Maribel Hernández-Rosales.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

A Appendix

A.1 Notation

A graph \(G=(V(G),E(G))\) consists of two sets: a non-empty set V(G) of objects called nodes, and a set E(G) of edges. Each edge, denoted \(e=uv\), connects a pair of nodes \(u,v\in V(G)\). An edge is called an arrow when this connection has a direction from u to v; in that case, v is an out-neighbor of u. The degree of a node v, denoted \(\deg _G(v)\), is the number of connections incident to v, and the out-degree \(\text {deg}^+_G(v)\) is the number of its out-neighbors. Graphs are accordingly divided into two main families: those with edges, known as undirected graphs, and those with arrows, known as directed graphs. A subgraph H of G is a graph with \(V(H)\subseteq V(G)\) and \(E(H)\subseteq E(G)\). The subgraph of G induced by \(V' \subseteq V(G)\), denoted \(G[V']\), is the subgraph with node set \(V'\) whose edges are all edges in E(G) that connect nodes in \(V'\).

In a graph G, a path from node u to node v is a sequence of nodes starting at u and ending at v, with consecutive nodes connected by edges. A graph is termed connected if there is a path linking every pair of its nodes.

A tree T is a connected undirected graph that becomes disconnected by removing any edge. In this context, every tree is rooted, meaning it has a designated root node \(\rho _T\), with the structure visualized such that all other nodes fall hierarchically beneath the root (refer to Fig. 3(A)). The leaves of the tree, L(T), are nodes with zero out-degree. The inner nodes, \(V^0(T)\), are those nodes that are neither leaves nor the root of T.

Although rooted trees are considered undirected, the convention \(uv \in E(T)\) indicates that u is the unique parent of v and v is a child of u, with \(\text {ch}_T(u)\) denoting the set of children of u. A node u is an ancestor of v, and v a descendant of u, if u lies on the unique path from v to \(\rho _T\). We express this as \(v \preceq _T u\), or \(v \prec _T u\) if additionally \(v \ne u\). Nodes \(u, v \in V(T)\) are non-comparable, denoted \(u \parallel _T v\), if neither is an ancestor of the other; they are comparable otherwise. The last common ancestor of a set \(X \subseteq V(T)\), \(\text {lca}_T(X)\), is the node farthest from \(\rho _T\) that is an ancestor of all nodes in X. For individual nodes \(x,y \in V(T)\), we write \(\text {lca}_T(x,y)\) for their last common ancestor. Furthermore, in [15] the relation \(\preceq _T\) has been extended to the edges of T: for edges \(e_0=uv\), \(e_1=xy\) and a node z, \(e_0 \preceq _T e_1\) if \(v \preceq _T y\), \(z \prec _T e_0\) if \(z \preceq _T v\), and \(e_0 \prec _T z\) if \(u \preceq _T z\).

For any node v in V(T), T(v) denotes the subtree rooted at v, encompassing all descendants of v. The restriction \(T_{|L'}\) of T to a leaf subset \(L' \subseteq L(T)\) is the minimal subtree connecting all leaves in \(L'\), with degree-two inner nodes suppressed. A tree T displays another tree \(T'\) with leaves \(L'\), denoted \(T' \le T\), if \(T'\) can be obtained from \(T_{|L'}\) by contracting inner edges. If, in addition, \(L(T) = L(T')\), then T is a refinement of \(T'\). The cluster \(C_T(v)\) is the set of all leaves in the subtree T(v).

All trees in this paper are phylogenetic, meaning each inner node \(v \in V^0(T)\) has an out-degree \(\text {deg}^+_T(v) > 1\), except the root. In some cases, like in Fig. 3(C), we examine planted trees formed by adding a new node \(0_T\) and edge \(0_T \rho _{T'}\) to a phylogenetic, rooted tree \(T'\).

A triple xy|z is a tree on three leaves x, y, and z in which x and y share a more recent common ancestor than either does with z. Triples are pivotal for supertree construction [3, 5, 28]. Each tree T displays a set of rooted triples R(T). A triple set R is consistent if \(R \subseteq R(T)\) for some tree T, that is, if some tree displays every triple in R. The BUILD algorithm [1, 28] checks this, returning a supertree for a consistent R or reporting inconsistency. It uses the Aho graph \([R, L']\), with node set \(L'\) (initially \(L' = \bigcup _{r\in R} L(r)\)) and an edge xy for each triple \(xy|z\in R\) whose leaves all lie in \(L'\). BUILD recurses on the connected components of this graph; R is consistent if and only if the Aho graph is disconnected at every recursion step with \(|L'| > 1\).
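The recursion behind BUILD can be illustrated with a minimal Python sketch. It is not taken from the REvolutionH-tl code base: the function name `build`, the encoding of a triple xy|z as a tuple `(x, y, z)`, and the nested-tuple output are assumptions made for illustration only.

```python
def build(triples, leaves):
    """Sketch of BUILD: return a nested-tuple tree displaying all triples, or None.

    triples: list of (x, y, z) tuples, each encoding the rooted triple xy|z.
    leaves:  set of leaf labels on which the tree is built.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    if len(leaves) == 2:
        return tuple(sorted(leaves))
    # Aho graph: one node per leaf, an edge xy for every triple xy|z inside `leaves`.
    adj = {v: set() for v in leaves}
    for x, y, z in triples:
        if x in leaves and y in leaves and z in leaves:
            adj[x].add(y)
            adj[y].add(x)
    # Connected components by depth-first search (sorted for a deterministic result).
    components, seen = [], set()
    for start in sorted(leaves):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    if len(components) == 1:
        return None  # connected Aho graph: the triple set is inconsistent
    subtrees = []
    for comp in components:
        sub = build(triples, comp)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


consistent = build([("a", "b", "c"), ("c", "d", "a")], {"a", "b", "c", "d"})
inconsistent = build([("a", "b", "c"), ("a", "c", "b")], {"a", "b", "c"})
```

In this example the first call yields a tree with the two cherries (a, b) and (c, d), while the second returns `None` because the triples ab|c and ac|b contradict each other.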

A.2 Evolutionary Scenarios

In a species tree S, leaves symbolize extant species, and inner nodes indicate speciations. Conversely, a gene tree \((T,t,\sigma )\) depicts genes as leaves L(T). The function \(\sigma : L(T) \rightarrow L(S)\) maps each gene to its residing species. The function \(t: V(T) \rightarrow \{ \bullet , \square , \odot , \times \}\) classifies nodes in the gene tree based on evolutionary processes: \(t(x)= \bullet \) for speciation, \(t(x)= \square \) for duplication, \(t(x)= \odot \) for extant genes, and \(t(x)= \times \) for gene loss, as detailed in [15] and illustrated in Fig. 3B.

An evolutionary scenario \((S,T,t,\sigma )\) merges a gene tree \((T,t,\sigma )\) with a species tree S via the reconciliation map \(\mu \), as introduced in [11, 15] and exemplified in Fig. 3. Detailed mathematical constraints of such scenarios are elaborated in Appendix A.7.

Constructing an evolutionary scenario requires consistency between gene and species trees, assessed using color triples. Let \(\mathfrak {R}(T)=\{ r\in R(T) \mid t(\text {lca}_T(L(r)))=\bullet \text { and } |\sigma (L(r))|=3 \}\) be the set of speciation triples of the gene tree. Given a triple \(ab|c\in R(T)\), the corresponding color triple is \(\sigma (ab|c)= \sigma (a)\sigma (b)|\sigma (c)\). Finally, let \(\mathfrak {R}_\sigma (T)=\{ \sigma (r) \text { for all } r\in \mathfrak {R}(T) \}\) be the set of color triples of the gene tree. Here, the gene tree \((T,t,\sigma )\) and a species tree S are consistent whenever \(\mathfrak {R}_\sigma (T)\subseteq R(S)\) [12, 15]. Consistency is required to ensure that a reconciliation between \((T,t,\sigma )\) and S exists.
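As an illustration of the consistency test \(\mathfrak {R}_\sigma (T)\subseteq R(S)\), the following Python sketch enumerates color triples by brute force. The tree encoding (inner nodes as tuples `(label, child, child, ...)` with label `"S"` for speciation and `"D"` for duplication, leaves as strings) and all function names are assumptions chosen for this example, not part of the tool.

```python
from itertools import combinations


def leaves(tree):
    """Leaves of a nested-tuple tree whose inner nodes are (label, child, ...)."""
    if isinstance(tree, str):
        return [tree]
    return [x for child in tree[1:] for x in leaves(child)]


def color_triples(tree, sigma):
    """Color triples of all triples ab|c whose root is labeled 'S' and has 3 colors."""
    if isinstance(tree, str):
        return set()
    found = set()
    child_leaves = [leaves(child) for child in tree[1:]]
    for i, child in enumerate(tree[1:]):
        found |= color_triples(child, sigma)
        if tree[0] != "S":
            continue  # only triples rooted at speciation nodes are used
        others = [c for j, ls in enumerate(child_leaves) if j != i for c in ls]
        # a and b lie below the same child, c below another: the tree displays ab|c.
        for a, b in combinations(child_leaves[i], 2):
            for c in others:
                colors = (sigma[a], sigma[b], sigma[c])
                if len(set(colors)) == 3:
                    found.add((frozenset(colors[:2]), colors[2]))
    return found


def is_consistent(gene_tree, sigma, species_tree):
    """True if every color triple of the gene tree is displayed by the species tree."""
    identity = {s: s for s in leaves(species_tree)}
    return color_triples(gene_tree, sigma) <= color_triples(species_tree, identity)


species_tree = ("S", ("S", "A", "B"), "C")
sigma = {"a1": "A", "b1": "B", "c1": "C"}
good = is_consistent(("S", ("S", "a1", "b1"), "c1"), sigma, species_tree)
bad = is_consistent(("S", ("S", "a1", "c1"), "b1"), sigma, species_tree)
```

Here the first gene tree contributes the color triple AB|C, which the species tree displays, whereas the second contributes AC|B, which it does not.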

A.3 Best Match Graphs

The concept of best match graphs (BMGs) [10, 11, 24, 25, 27] is based on the following definition: a gene y is a best match for x if x and y reside in distinct species and \(\text {lca}_T(x,y)\preceq _T \text {lca}_T(x,y')\) for all genes \(y'\) in the species \(\sigma (y)\), i.e., y is one of the genes in \(\sigma (y)\) that is evolutionarily most closely related to x.

The best match graph \(G(T,\sigma )\) is a directed graph that represents these relationships, with an arrow xy indicating that y is a best match of x. The tree \((T,t,\sigma )\) is then said to explain \(G(T,\sigma )\).
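The definition can be made concrete with a small Python sketch that derives the best match graph from a given tree. It relies on the observation that all nodes \(\text {lca}_T(x,y')\) lie on the path from x to the root, so the best matches of x in a species are the genes whose last common ancestor with x is deepest. The nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf to its root-to-leaf path of child indices."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def lca_depth(p, q):
    """Depth of the last common ancestor of two leaves given their paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def best_match_graph(tree, sigma):
    """Set of arrows (x, y) such that y is a best match of x in the tree."""
    paths = leaf_paths(tree)
    arrows = set()
    for x in paths:
        deepest = {}  # species -> depth of the deepest lca with x
        for y in paths:
            if sigma[y] != sigma[x]:
                d = lca_depth(paths[x], paths[y])
                deepest[sigma[y]] = max(deepest.get(sigma[y], -1), d)
        for y in paths:
            if sigma[y] != sigma[x] and lca_depth(paths[x], paths[y]) == deepest[sigma[y]]:
                arrows.add((x, y))
    return arrows


tree = (("a1", "b1"), ("b2", "c1"))
sigma = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
bmg = best_match_graph(tree, sigma)
```

In this example gene a1 has the best match b1 in species B, but not b2, because \(\text {lca}_T(a1,b1)\) lies strictly below \(\text {lca}_T(a1,b2)\).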

For any directed graph G and a node-coloring map \(\sigma :V(G)\rightarrow M\), the set of informative triples \(\mathcal {R}(G,\sigma )\) is used to decide whether G is a BMG. For distinct \(x,y,y'\in V(G)\) with \(\sigma (x)\not =\sigma (y)=\sigma (y')\), the set \(\mathcal {R}(G,\sigma )\) contains the triple \(xy|y'\) if \(xy\in E(G)\) and \(xy'\not \in E(G)\) and, if T is required to be binary, also the triple \(yy'|x\) if both \(xy, xy'\in E(G)\). \((G,\sigma )\) is a BMG if and only if \((G,\sigma )=G(\text {aho}(\mathcal {R}(G,\sigma )), \sigma )\) [10, 25], where \(\text {aho}(\mathcal {R})\) denotes the tree returned by BUILD for the triple set \(\mathcal {R}\). Figure 3A, D-G depicts the interplay between gene trees, best match graphs, and informative triples.
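A minimal sketch of how informative triples might be extracted from a colored directed graph is given below. The function name, the `binary` flag, and the encoding of a triple ab|c as `(frozenset({a, b}), c)` are assumptions for illustration; the arrow set is the best match graph of the tree ((a1, b1), (b2, c1)).

```python
def informative_triples(arrows, sigma, binary=False):
    """Informative triples of a colored digraph given as a set of arrows (x, y).

    With binary=True, the triples yy'|x for xy, xy' in E(G) are added as well.
    """
    triples = set()
    genes = list(sigma)
    for x in genes:
        for y in genes:
            for y2 in genes:
                # y and y2 must be distinct genes of one color that differs from x's.
                if y == y2 or sigma[y] != sigma[y2] or sigma[x] == sigma[y]:
                    continue
                if (x, y) in arrows and (x, y2) not in arrows:
                    triples.add((frozenset((x, y)), y2))
                elif binary and (x, y) in arrows and (x, y2) in arrows:
                    triples.add((frozenset((y, y2)), x))
    return triples


sigma = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
arrows = {
    ("a1", "b1"), ("a1", "c1"), ("b1", "a1"), ("b1", "c1"),
    ("b2", "a1"), ("b2", "c1"), ("c1", "a1"), ("c1", "b2"),
}
triples = informative_triples(arrows, sigma)
```

For this graph the sketch returns the two triples a1b1|b2 and c1b2|b1, both of which are displayed by the tree that explains the graph.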

A.4 Selection of Best Hits

Each alignment hit \(\overrightarrow{xy}\) is associated with a bit score \(\omega (\overrightarrow{xy})\). We estimate the evolutionary relatedness between two genes x and y as the normalized bit score \(\omega _{xy}= ( \omega (\overrightarrow{xy})/\text {length}(y) + \omega (\overrightarrow{yx})/\text {length}(x) )/2\).

For each gene x, we identify the most closely related genes in a different species \(Y \ne \sigma (x)\). A gene y from species \(\sigma (y) = Y\) is considered a best hit of x if its normalized score \(\omega _{xy}\) meets or exceeds an adaptive threshold \(f \cdot \omega _{x|Y}\), where \(\omega _{x|Y} = \max \{ \omega _{xy'} \mid \sigma (y') = Y \}\). Here, f is a factor between zero and one. This threshold, aimed at identifying paralogous best hits, was introduced in [22]; we set \(f=0.95\).
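The selection rule can be sketched in a few lines of Python. The input format (a dictionary of directed bit scores and a dictionary of sequence lengths) and the function name `best_hits` are hypothetical; the sketch also assumes that a pair is scored only when both directed hits are present, which the text does not specify.

```python
def best_hits(bit_scores, lengths, sigma, f=0.95):
    """Return the set of pairs (x, y) such that y is a best hit of x.

    bit_scores: dict mapping a directed hit (x, y) to its bit score.
    lengths:    dict mapping each gene to its sequence length.
    sigma:      dict mapping each gene to its species.
    """
    # Normalized symmetric score; assumption: both directed hits must exist.
    omega = {}
    for (x, y), score in bit_scores.items():
        if (y, x) in bit_scores and sigma[x] != sigma[y]:
            omega[(x, y)] = (score / lengths[y] + bit_scores[(y, x)] / lengths[x]) / 2
    # Best normalized score of x against each other species Y.
    best = {}
    for (x, y), w in omega.items():
        key = (x, sigma[y])
        best[key] = max(best.get(key, 0.0), w)
    # Keep every hit that reaches the adaptive threshold f * omega_{x|Y}.
    return {(x, y) for (x, y), w in omega.items() if w >= f * best[(x, sigma[y])]}


sigma = {"a": "A", "b1": "B", "b2": "B", "b3": "B"}
lengths = {gene: 100 for gene in sigma}
bit_scores = {
    ("a", "b1"): 200, ("b1", "a"): 200,
    ("a", "b2"): 192, ("b2", "a"): 192,
    ("a", "b3"): 100, ("b3", "a"): 100,
}
hits = best_hits(bit_scores, lengths, sigma)
```

With \(f=0.95\), gene a keeps both b1 (normalized score 2.0) and b2 (1.92, above the threshold 1.9) as best hits in species B, while b3 (1.0) is discarded.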

A.5 From Best Hits to Gene Trees

We start by constructing a best hit graph \((G,\sigma )\), which is a directed graph whose nodes are the genes of the orthogroup and in which there is an arrow xy if y is a best hit of x. We then search for a least resolved gene tree \((T^*,\sigma )\) that maximizes the similarity between the best hit graph \((G,\sigma )\) and the best match graph \(G(T^*,\sigma )\). To do so, we use the heuristic introduced in [26], which consists of finding a maximum consistent subset of the informative triples \(\mathcal {R}(G,\sigma )\).

The tree \((T^*,\sigma )\) is further refined into an augmented tree \((T,\sigma )\), which allows us to assign evolutionary events in such a way that duplication events are minimized while maintaining the same best match graph, that is, \(G(T^*,\sigma )=G(T,\sigma )\) [27].

Next, we define the event map \(t:V(T)\rightarrow \{\bullet ,\square ,\odot \}\) as follows. If \(v\in V(T)\) is a leaf, we set \(t(v)=\odot \). Otherwise, we set \(t(v)=\bullet \) if \(\sigma (C_T(v')) \cap \sigma (C_T(v'')) = \emptyset \) for any two distinct children \(v', v''\in \text {ch}_T(v)\), and \(t(v)=\square \) otherwise.

Finally, given the event-labeled gene tree \((T,t,\sigma )\), we compute the orthology relation underlying this tree as the relation that comprises all pairs (x, y) and (y, x) of genes x and y for which \(t(\text {lca}_T(x,y))=\bullet \).
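The labeling rule and the resulting orthology relation can be sketched together in Python. The nested-tuple tree encoding, the labels `"S"` (speciation) and `"D"` (duplication), and the function name are assumptions for this example; orthologous pairs are stored as unordered `frozenset` objects rather than as the two ordered pairs (x, y) and (y, x).

```python
from itertools import combinations


def label_events(tree, sigma, orthologs):
    """Label inner nodes as 'S' or 'D' and collect orthologous pairs.

    Returns (list of leaves, labeled tree); `orthologs` is filled in place.
    """
    if isinstance(tree, str):
        return [tree], tree
    child_leaves, labeled = [], []
    for child in tree:
        ls, lab = label_events(child, sigma, orthologs)
        child_leaves.append(ls)
        labeled.append(lab)
    color_sets = [{sigma[x] for x in ls} for ls in child_leaves]
    # Speciation if the species sets below any two children are disjoint.
    is_speciation = all(a.isdisjoint(b) for a, b in combinations(color_sets, 2))
    if is_speciation:
        # Leaves below different children have this node as last common ancestor.
        for ls1, ls2 in combinations(child_leaves, 2):
            orthologs.update(frozenset((x, y)) for x in ls1 for y in ls2)
    event = "S" if is_speciation else "D"
    return [x for ls in child_leaves for x in ls], (event, *labeled)


sigma = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B"}
tree = ((("a1", "b1"), "c1"), ("a2", "b2"))
orthologs = set()
_, labeled = label_events(tree, sigma, orthologs)
```

In this example the root is labeled as a duplication because species A and B occur below both of its children, so no gene of the left subtree is orthologous to a gene of the right subtree.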

A.6 Consistency of Triple Sets

To reconcile a gene tree \((T,t,\sigma )\) that is inconsistent with the species tree S, we modify \((T,t,\sigma )\) into a consistent tree \((T',t,\sigma )\). We differentiate between consistent triples \(R_C = \{r \in \mathfrak {R}(T) : \sigma (r) \in R(S)\}\) and inconsistent triples \(R_I = \mathfrak {R}(T) {\setminus } R_C\). The aim is to eliminate the triples in \(R_I\) while retaining those in \(R_C\). Removing a leaf \(a \in L(T)\) also removes all triples \(r \in R(T)\) with \(a \in L(r)\). Utilizing this, we can select a subset of inconsistent leaves \(L_I \subseteq L(R_I)\), set \(L' = L(T) {\setminus } L_I\), and construct a consistent tree \(T' = T_{|L'}\). The steps for this tree editing are outlined in Algorithm 1.

[Algorithm 1: tree editing procedure; the pseudocode figure is not reproduced in this text.]

A.7 Tree Reconciliation

Once consistency between the gene tree \((T,t,\sigma )\) and the species tree S is ensured, we compute a reconciliation map as follows.

Let \(x,y\in V(T)\). The reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) from the gene tree \((T,t,\sigma )\) to the species tree S satisfies:

$$\begin{aligned} \text {(U0)} & \textit{Root Constraint}. \mu (0_T)=0_S \\ \text {(U1)} & \textit{Gene Constraint}.\text { If } t(x)=\odot \text {, then } \mu (x)=\sigma (x) \in L(S) \\ \text {(U2)} & \textit{Speciation Constraint}.\text { If } t(x)=\bullet \text {, then } \mu (x) = \text {lca}_S(\sigma (C_T(x))) \in V^0(S) \text {, and} \\ & \mu (y_0) \parallel _S \mu (y_1) \text { for any two distinct children } y_0, y_1 \in \text {ch}_T(x) \\ \text {(U3)} & \textit{Duplication Constraint}.\text { If } t(x)=\square \text {, then } \mu (x)=e\in E(S) \text { and } \text {lca}_S(\sigma (C_T(x))) \prec e \\ \text {(U4)} & \textit{Ancestor Constraint}.\text { If } x\prec _T y \text {, then } \mu (x) \preceq _S \mu (y) \end{aligned}$$

The reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) is computed in linear time [15]. For a node \(v\in V(T)\) with \(t(v)\not =\square \), the image \(\mu (v)\) in \(V(S)\cup E(S)\) is determined directly by constraints (U0)-(U2). When \(t(v)=\square \), we set \(\mu (v)=xy\in E(S)\) such that \(y=\text {lca}_S(\sigma (C_T(v)))\); this assignment minimizes the number of gene-loss events.
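The assignment described above can be sketched as follows. This is an illustrative quadratic-time version, not the linear-time procedure of [15]. The encodings are assumptions: the gene tree uses tuples `(label, child, ...)` with labels `"S"` and `"D"`, the species tree is a child-to-parent dictionary whose root hangs below a planted root named `"0_S"`, and the map is stored in a dictionary keyed by gene-tree subtrees.

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the planted root of the species tree."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lca(nodes, parent):
    """Last common ancestor of a non-empty collection of species-tree nodes."""
    nodes = list(nodes)
    common = set(path_to_root(nodes[0], parent))
    for n in nodes[1:]:
        common &= set(path_to_root(n, parent))
    return next(a for a in path_to_root(nodes[0], parent) if a in common)


def reconcile(tree, sigma, parent, mu):
    """Fill `mu` for every node of the gene tree; return the species below `tree`."""
    if isinstance(tree, str):
        mu[tree] = sigma[tree]  # (U1): extant genes map to their species
        return {sigma[tree]}
    colors = set()
    for child in tree[1:]:
        colors |= reconcile(child, sigma, parent, mu)
    low = lca(colors, parent)
    # (U2): speciations map to the lca; duplications map to the edge just above it.
    mu[tree] = low if tree[0] == "S" else (parent[low], low)
    return colors


parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": "0_S"}
sigma = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B"}
left = ("S", ("S", "a1", "b1"), "c1")
right = ("S", "a2", "b2")
gene_tree = ("D", left, right)
mu = {}
reconcile(gene_tree, sigma, parent, mu)
```

In this example the duplication at the root of the gene tree is placed on the edge between the planted root and the species-tree root, and the two speciation subtrees map to the nodes ABC and AB.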

A.8 Resolving Speciation Nodes

To refine a node \(x \in V(T)\) with more than two children via a map \(f: \text {ch}_T(x) \rightarrow \{y_0, y_1\}\), perform the following steps: (i) add nodes \(y_0, y_1\) to the tree, (ii) remove the edge xy for each \(y \in \text {ch}_T(x)\), and (iii) add the edges \(xy_0\), \(xy_1\), and f(y)y for each \(y \in \text {ch}_T(x)\). When u is a speciation node, use the reconciliation map \(\mu \) and the map \(f: \text {ch}_T(u) \rightarrow \text {ch}_S(\mu (u))\) to resolve u. For \(v \in \text {ch}_T(u)\) and \(v' \in \text {ch}_S(\mu (u))\), set \(f(v) = v'\) iff \(\mu (v) \preceq _S v'\).

A.9 Inferring Gene Loss

For a speciation node \(x \in V(T)\) in the gene tree, the reconciliation map \(\mu \) helps detect gene losses by mapping x to a node \(y = \mu (x)\) in the species tree. If we find a node \(y' \in \text {ch}_S(y)\) for which all nodes \(x' \in V(T)\) satisfying \(x' \prec x\) also fulfill \(\mu (x') \parallel _S y'\), a gene loss at \(y'\) is inferred.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Ramírez-Rafael, J.A. et al. (2024). REvolutionH-tl: Reconstruction of Evolutionary Histories tool. In: Scornavacca, C., Hernández-Rosales, M. (eds) Comparative Genomics. RECOMB-CG 2024. Lecture Notes in Computer Science, vol 14616. Springer, Cham. https://doi.org/10.1007/978-3-031-58072-7_5
