Inferring Clusters of Orthologous and Paralogous Transcripts

Ouedraogo, Wend Yam Donald Davy; Ouangraoua, Aida

doi:10.1007/978-3-031-36911-7_2

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13883)

Included in the following conference series:

RECOMB International Workshop on Comparative Genomics

Abstract

The alternative processing of eukaryote genes makes it possible to produce multiple distinct transcripts from a single gene, thereby contributing to the transcriptome diversity. Recent studies suggest that more than 90% of human genes undergo alternative processing, and that the resulting transcripts are highly conserved between orthologous genes.

In this paper, we first present a model to define orthology and paralogy relationships at the transcriptome level, then we present an algorithm to infer clusters of orthologous and paralogous transcripts. Gene-level homology relationships are used to define different types of homology relationships between transcripts and a Reciprocal Best Hits approach is used to infer clusters of isoorthologous and recent paralogous transcripts.

We applied the method to transcripts of gene families from the Ensembl-Compara database. The results agree with those from previous studies comparing orthologous gene transcripts. The results also provide evidence that searching for conserved transcripts beyond orthologous genes will likely yield valuable information. The results obtained on the Ensembl-Compara gene families are available at https://github.com/UdeS-CoBIUS/TranscriptOrthology. Supplementary material can be found at https://doi.org/10.5281/zenodo.7750949.

1 Introduction

Alternative splicing is one of the most important mechanisms whose extent was revealed in the post-genomic era [8]. It allows distinct transcripts to be produced from the same gene. Over the past decade, the number of alternatively spliced genes and alternative transcripts annotated in eukaryote organisms has increased dramatically [21]. It has now been established that alternative splicing was likely a feature of the eukaryotes’ common ancestor.

Over the past decade, several methods have been developed to study the conservation of sets of alternative transcripts annotated in orthologous genes. They identify splicing orthologous transcripts between genes, defined as alternative transcripts of orthologous genes composed of orthologous exons [3, 7, 9, 16, 20]. Other studies have proposed various models of transcript evolution with associated algorithms to reconstruct transcript phylogenies using parsimony-based tree search methods [1, 5, 6] or supertree methods [11, 12]. However, several questions about the evolution of sets of alternative transcripts in a gene family remain open [10]. For example, are alternative transcripts more conserved between orthologous genes than paralogous genes? How do new alternative transcripts arise during evolution? Is an alternative transcript preserved between multiple homologous genes and species? Where was it gained or lost during evolution? Moreover, beyond identifying orthologous transcripts between orthologous genes, no method exists to compare all transcripts of a gene family to provide measures of similarity between the transcripts of all the genes. Furthermore, no frameworks, such as those developed for classifying gene homology types, currently exist for classifying transcript homology types. Beyond allowing a better understanding of alternative transcript evolution, the prediction of orthologous and paralogous transcripts has other important potential applications. It can be useful for gene orthology inference and gene tree correction, as well as gene function prediction.

In this paper, we present a model to define orthology and paralogy relations between transcripts of a gene family. This model is mainly inspired by the reconciliation model between gene trees and species trees, which allows defining orthologs, paralogs and isoorthologs at the gene level. Isoorthologous genes are the least divergent orthologs that have retained the function of their lowest common ancestor [15, 19]. We present an algorithm associated with our model to infer groups of isoorthologous and paralogous transcripts. The algorithm uses transcript pairwise similarity scores as conservation measures to identify pairs of recent paralogs and isoorthologs through a Reciprocal Best Hit (RBH) approach. These relations are then used to infer ortholog groups in which the pairs of transcripts are isoorthologs, recent paralogs or related through a path of isoorthology and recent paralogy relations. The paper is organized as follows. Section 2 provides the definitions and notations required for the remainder of the paper. Section 3 describes our graph-based algorithm to infer ortholog groups. Section 4 contains the results of the application on gene families and sets of transcripts from the Ensembl-Compara database [21].

2 Preliminaries: Phylogenetic Trees, Reconciliation, Orthology, Paralogy

$\mathbb {S}$ denotes a set of species. $\mathbb {G}$ denotes a set of homologous genes from a gene family. $\mathbb {T}$ denotes a set of transcripts descending from the same ancestral transcript. The three sets are related by two functions $s : \mathbb {G} \rightarrow \mathbb {S}$ that maps each gene to its corresponding species, and $g : \mathbb {T} \rightarrow \mathbb {G}$ that maps each transcript to its corresponding gene such that $\{g(\texttt {t}) : \texttt {t} \in \mathbb {T}\} = \mathbb {G} $ and $\{s(\texttt {g}) : \texttt {g} \in \mathbb {G}\} = \mathbb {S}$. The induced set function $g^{-1}$ associates each gene to its set of corresponding transcripts.

All trees are considered rooted and binary. Given a tree P, v(P) denotes the set of nodes of P, and l(P) its leafset. Given a node x of P, P[x] denotes the subtree of P rooted in x. A node x is an ancestor of a node y if y is a node of P[x]. If x is an internal node, $x_l$ and $x_r$ denote its two children. Given a subset $L'$ of l(P), $lca_P(L')$ denotes the lowest common ancestor (LCA) in P of $L'$, defined as the ancestor common to all the nodes in $L'$ that is the most distant from the root.

A tree on a set $\varSigma $ is a tree whose leafset is $\varSigma $. S denotes a species tree on $\mathbb {S}$ whose internal nodes represent a partially ordered set of speciation events that have led to $\mathbb {S}$. G denotes a gene tree on $\mathbb {G}$ whose internal nodes represent speciation and gene duplication events that have led to $\mathbb {G}$. T denotes a transcript tree on $\mathbb {T}$ whose internal nodes represent speciation, gene duplication, and transcript creation events that have led to $\mathbb {T}$. We extend the mapping functions s from v(G) to v(S), and g from v(T) to v(G) as follows: for any node $\texttt {g}$ in v(G), $s(\texttt {g})=lca_S(\{s(\texttt {g}') : \texttt {g}' \in l(G[\texttt {g}])\})$, and for any node $\texttt {t}$ in v(T), $g(\texttt {t})=lca_G(\{g(\texttt {t}') : \texttt {t}' \in l(T[\texttt {t}])\})$.

In the Duplication-Loss (DL) model of gene evolution, genes undergo speciation events when their corresponding species do, but also duplication events when a new gene copy is created and loss events when a gene is lost in a species. Likewise, in the Creation-Loss (CL) model of transcript evolution, transcripts undergo speciation and duplication events when their corresponding genes do, but also creation events when a new transcript is created and loss events when a transcript is lost in a gene. The labeling of the internal nodes of a gene tree as speciation or gene duplication events is obtained through the reconciliation with a species tree. The labeling of the internal nodes of a transcript tree is obtained through the reconciliation with a gene tree. Several definitions of reconciliation between a gene tree and a species tree exist. Here, we consider the LCA-reconciliation based on the LCA mapping.

Definition 1 (LCA-reconciliation at the gene and transcript levels)

The LCA-reconciliation of G and S is a function $rec_G : v(G)-\mathbb {G} \rightarrow \{Spe,Dup\}$ that labels any internal node $\texttt {g}$ of G as a duplication (Dup) if $s(\texttt {g}) = s(\texttt {g}_l)$ or $s(\texttt {g}) = s(\texttt {g}_r)$, and as a speciation (Spe) otherwise.

Similarly, the LCA-reconciliation of T and G is a function $rec_T : v(T)-\mathbb {T} \rightarrow \{Spe,Dup,Cre\}$ that labels any internal node $\texttt {t}$ of T as a creation (Cre) if $g(\texttt {t}) = g(\texttt {t}_l)$ or $g(\texttt {t}) = g(\texttt {t}_r)$, otherwise as a duplication (Dup) if $rec_G(g(\texttt {t})) = Dup$, and as a speciation (Spe) if $rec_G(g(\texttt {t})) = Spe$.

Figure 1 shows an illustration for Definition 1. $rec_G$ provides a reconciliation between G and S that minimizes the number of gene duplications and losses [4]. $rec_T$ also provides a reconciliation between T and G that minimizes the number of transcript creations and losses [12].
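The gene-level part of Definition 1 can be illustrated with a minimal Python sketch. It assumes a toy encoding that is not part of the paper: a tree is either a leaf name or a pair of subtrees, and the names `leaves`, `lca` and `reconcile` are hypothetical helpers. The example trees are invented and are not those of Fig. 1.

```python
from typing import Dict, FrozenSet, Tuple, Union

# Toy encoding (an assumption): a tree is a leaf name or a (left, right) pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the leafset of a (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def lca(tree: Tree, targets: FrozenSet[str]) -> Tree:
    """Return the lowest subtree of `tree` whose leafset contains `targets`."""
    if not isinstance(tree, str):
        for child in tree:
            if targets <= leaves(child):
                return lca(child, targets)
    return tree


def reconcile(gene_tree: Tree, species_tree: Tree, s: Dict[str, str]) -> Dict[Tree, str]:
    """Label each internal node of the gene tree as 'Spe' or 'Dup'."""
    labels: Dict[Tree, str] = {}

    def mapped(node: Tree) -> Tree:
        # Extended mapping s(node): LCA in S of the species of the leaves below node.
        return lca(species_tree, frozenset(s[g] for g in leaves(node)))

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        left, right = node
        here = mapped(node)
        # Duplication if the node maps to the same species node as one of its children.
        labels[node] = "Dup" if here in (mapped(left), mapped(right)) else "Spe"
        visit(left)
        visit(right)

    visit(gene_tree)
    return labels


# Invented example: two species A and B, genes a1 and a2 in A, gene b1 in B.
species_tree = ("A", "B")
gene_tree = (("a1", "b1"), "a2")
s = {"a1": "A", "a2": "A", "b1": "B"}
labels = reconcile(gene_tree, species_tree, s)
```

In this toy input the root is labeled as a duplication because it maps to the same species node as its child (a1, b1), whereas that child is labeled as a speciation. The same scheme would apply to $rec_T$ by replacing the species tree with the gene tree and adding the Cre case.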

Orthology and paralogy are relations defined over pairs of homologous genes based on the gene tree - species tree reconciliation. Orthology is the relation between two genes for which any two distinct ancestors taken at a point of time always appear in distinct ancestral species (i.e. apart, orthogonal). Paralogy is the relation between two genes having two distinct ancestors at a point of time that are in the same ancestral species (i.e. beside, parallel). Based on the co-occurrence of transcript ancestors in the same ancestral genes, we can extend orthology and paralogy definitions over pairs of homologous transcripts.

Definition 2 (Orthology, paralogy at the gene and transcript levels)

Two distinct genes $\texttt {g}_1$ and $\texttt {g}_2$ of $\mathbb {G}$ are:

orthologs if their LCA in G is a speciation, i.e. $rec_G(lca_G(\{\texttt {g}_1,\texttt {g}_2\})) = Spe$;
recent paralogs if $rec_G(lca_G(\{\texttt {g}_1,\texttt {g}_2\})) = Dup$ and $s(lca_G(\{\texttt {g}_1,\texttt {g}_2\}))=s(\texttt {g}_1)$;
ancient paralogs otherwise.

Likewise, two distinct transcripts $\texttt {t}_1$ and $\texttt {t}_2$ of $\mathbb {T}$ are [12]:

ortho-orthologs if their LCA in T is a speciation, i.e. $rec_T(lca_T(\{\texttt {t}_1,\texttt {t}_2\})) = Spe$;
para-orthologs if $rec_T(lca_T(\{\texttt {t}_1,\texttt {t}_2\})) = Dup$;
recent paralogs if $rec_T(lca_T(\{\texttt {t}_1,\texttt {t}_2\})) = Cre$ and $g(lca_T(\{\texttt {t}_1,\texttt {t}_2\}))=g(\texttt {t}_1)$;
ancient paralogs otherwise.

For example in Fig. 1, $a_{11}$ and $b_{12}$ are ortho-orthologs, $a_{11}$ and $b_{21}$ are para-orthologs, $b_{21}$ and $b_{22}$ are recent paralogs, and $a_{11}$ and $a_{22}$ are ancient paralogs. Notice that if all gene pairs are orthologs then all pairs of orthologous transcripts are ortho-orthologs.

Lemma 1 (Link between homology relationships at the gene and transcript levels)

If two transcripts $\texttt {t}_1$ and $\texttt {t}_2$ are ortho-orthologs then the genes $g(\texttt {t}_1)$ and $g(\texttt {t}_2)$ are orthologs, and if $\texttt {t}_1$ and $\texttt {t}_2$ are para-orthologs then $g(\texttt {t}_1)$ and $g(\texttt {t}_2)$ are paralogs. If $\texttt {t}_1$ and $\texttt {t}_2$ are recent paralogs then $g(\texttt {t}_1) = g(\texttt {t}_2)$. None of the converse statements are true.

Proof

Let $\texttt {t}=lca_T(\{\texttt {t}_1,\texttt {t}_2\})$. If $\texttt {t}_1$ and $\texttt {t}_2$ are ortho-orthologs or para-orthologs then $g(\texttt {t}) \ne g(\texttt {t}_l)$ and $g(\texttt {t}) \ne g(\texttt {t}_r)$. Therefore, $lca_G(\{g(\texttt {t}_1),g(\texttt {t}_2)\}) = g(\texttt {t})$, and then $rec_G(lca_G(\{g(\texttt {t}_1),g(\texttt {t}_2)\})) = rec_G(g(\texttt {t}))$ which is a speciation if $\texttt {t}_1$ and $\texttt {t}_2$ are ortho-orthologs, and a duplication otherwise. If $\texttt {t}_1$ and $\texttt {t}_2$ are recent paralogs then any leaf $\texttt {t}'$ of the subtree $T[\texttt {t}]$ must satisfy $g(\texttt {t}') = g(lca_T(\{\texttt {t}_1,\texttt {t}_2\}))$. Therefore, $g(\texttt {t}_1) = g(\texttt {t}_2) = g(lca_T(\{\texttt {t}_1,\texttt {t}_2\}))$.

Figure 1 shows an example where $a_3$ and $b_2$ (resp. $a_1$ and $a_2$) are orthologous (resp. paralogous) genes but their transcripts $a_{31}$ and $b_{21}$ (resp. $a_{11}$ and $a_{22}$) are not ortho-orthologs (resp. para-orthologs). Likewise, $g(a_{21})=g(a_{22}) = a_2$ but $a_{21}$ and $a_{22}$ are not recent paralogs.

Lemma 2 (Link between recent paralogy and orthology)

(1) If three transcripts $\texttt {t}_1$, $\texttt {t}_2$ and $\texttt {t}_3$ are such that $\texttt {t}_1$ and $\texttt {t}_2$ are recent paralogs, and $\texttt {t}_1$ and $\texttt {t}_3$ are ortho-orthologs (resp. para-orthologs), then $\texttt {t}_2$ and $\texttt {t}_3$ are ortho-orthologs (resp. para-orthologs). (2) If $\texttt {t}_1$ and $\texttt {t}_3$ are recent paralogs, and $\texttt {t}_2$ and $\texttt {t}_3$ are recent paralogs, then $\texttt {t}_1$ and $\texttt {t}_2$ are also recent paralogs.

Proof

If $\texttt {t}_1$, $\texttt {t}_2$ and $\texttt {t}_3$ are in the configuration of (1) then they must satisfy $lca_T(\{\texttt {t}_2,\texttt {t}_3\}) = lca_T(\{\texttt {t}_1,\texttt {t}_3\})$. Therefore, $\texttt {t}_2$ and $\texttt {t}_3$ have the same relation as $\texttt {t}_1$ and $\texttt {t}_3$. (2) is trivial.

The key assumption of our method is that after a transcript creation event, the newly created transcript tends to diverge from the original transcript from which it was modified, whereas the original transcript tends to remain conserved. The same assumption underlies the inference of gene ortholog groups using graph-based methods in which isoorthologous gene pairs are considered [2, 13, 15, 19]. It should be noted that the conservation/divergence between two transcripts is based on the comparison of their content in exons, and not of their nucleotide sequences. Therefore, given a creation node t in the LCA-reconciliation of T with G, one of its edges descending to its children, say $(t,t_l)$ without loss of generality, corresponds to the original transcript conserved, whereas the other edge $(t,t_r)$ corresponds to the newly created divergent transcript. In this case, we call $(t,t_l)$ a conservation edge, whereas $(t,t_r)$ is called a divergence edge. For example, in Fig. 1, the divergence edges after creation nodes appear in dashed lines. Distinguishing conservation edges and divergence edges after a creation node makes it possible to define a particular type of orthology relation between transcripts.

Definition 3 (Isoorthology at the transcript level)

Two ortho- (resp. para-) orthologous transcripts $\texttt {t}_1$ and $\texttt {t}_2$ of $\mathbb {T}$ are ortho- (resp. para-) isoorthologs if there are no divergence edges on the path between $\texttt {t}_1$ and $\texttt {t}_2$ in T.
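The path condition of Definition 3 can be sketched in Python as follows. The encoding is an assumption made for illustration: the transcript tree is given as a child-to-parent dictionary, divergence edges are given as a set of (parent, child) pairs, and the function names are hypothetical. The sketch only checks the absence of divergence edges; whether the pair is ortho- or para-orthologous must be established separately.

```python
def ancestors(parent, node):
    """Return the chain [node, parent(node), ..., root]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def path_edges(parent, a, b):
    """Return the (parent, child) edges on the path between nodes a and b."""
    up_a, up_b = ancestors(parent, a), ancestors(parent, b)
    in_b = set(up_b)
    meet = next(x for x in up_a if x in in_b)  # lowest common ancestor of a and b
    edges = []
    for chain in (up_a, up_b):
        for node in chain:
            if node == meet:
                break
            edges.append((parent[node], node))
    return edges


def has_no_divergence_edge(parent, divergence, a, b):
    """True if no divergence edge lies on the path between a and b."""
    return all(edge not in divergence for edge in path_edges(parent, a, b))


# Invented toy tree: root r has children x and t3; creation node x has
# children t1 (conservation edge) and t2 (divergence edge).
parent = {"t1": "x", "t2": "x", "x": "r", "t3": "r"}
divergence = {("x", "t2")}
```

With this toy input, the path between t1 and t3 contains no divergence edge, whereas the path between t2 and t3 goes through the divergence edge (x, t2).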

Notice that the isoorthology relation is transitive, which allows $\mathbb {T}$ to be partitioned into ortholog groups.

Definition 4 (Ortholog groups at the transcript level)

An ortholog group $\mathbb {O}$ of $\mathbb {T}$ is a subset of $\mathbb {T}$ such that any two distinct transcripts $\texttt {t}_1$ and $\texttt {t}_2$ belonging to $\mathbb {O}$ are isoorthologs (i.e. ortho- or para-isoorthologs), recent paralogs, or there exist two transcripts $\texttt {t}_1'$ and $\texttt {t}_2'$ in $\mathbb {O}$ such that $\texttt {t}_1=\texttt {t}_1'$ or $\texttt {t}_1$ and $\texttt {t}_1'$ are recent paralogs, $\texttt {t}_2=\texttt {t}_2'$ or $\texttt {t}_2$ and $\texttt {t}_2'$ are recent paralogs, and $\texttt {t}_1'$ and $\texttt {t}_2'$ are isoorthologs.

For example, in Fig. 1, $\{a_{11},a_{21},b_{11},a_{31}\}$, $\{a_{22}\}$, $\{b_{12}\}$, $\{a_{32},b_{21}\}$, $\{b_{22}\}$ are the inclusion-wise maximal ortholog groups of $\mathbb {T}$.

3 A Graph-Based Algorithm to Infer Isoorthology and Recent Paralogy Relations Between Transcripts

In this section, we present a graph-based method to infer isoorthology and recent paralogy relations in a set of homologous transcripts $\mathbb {T}$. The method relies on a pairwise similarity measure between transcripts to infer ortholog groups.

3.1 Pairwise Similarity Score Between Transcripts

A gene $\texttt {g} \in \mathbb {G}$ is a DNA sequence on the alphabet of nucleotides $\varSigma = \{A,C,G,T\}$. A transcript $\texttt {t}$ of $\texttt {g}$ (i.e., $\texttt {t} \in g^{-1}(\texttt {g})$) is a subsequence of $\texttt {g}$ obtained by concatenating an ordered set of substrings of $\texttt {g}$ such that each substring is an exon of $\texttt {g}$ that is present in $\texttt {t}$. The transcribed subsequence of a gene $\texttt {g} \in \mathbb {G}$, denoted by $\hat{\texttt{g}}$, is the subsequence of $\texttt {g}$ obtained by deleting from $\texttt {g}$ any nucleotide that is absent from all transcripts of $\texttt {g}$. $\hat{\mathbb {G}}$ denotes the set of transcribed subsequences of all genes in $\mathbb {G}$. Note that $|~\hat{\mathbb {G}}~| = |~\mathbb {G}~|$. Figure 2 shows an illustration.

A multiple sequence alignment $\mathbb {A}$ of all the transcribed subsequences in $\hat{\mathbb {G}}$ and all the transcripts in $\mathbb {T}$ is obtained by first computing a multiple sequence alignment M of the transcribed subsequences in $\hat{\mathbb {G}}$, and then mapping each transcript $\texttt {t} \in \mathbb {T}$ on its corresponding transcribed subsequence within M to obtain the resulting alignment $\mathbb {A}$.

In the sequel, $\mathbb {A}$ denotes a multiple sequence alignment of all the transcribed subsequences in $\hat{\mathbb {G}}$ and all the transcript sequences in $\mathbb {T}$, represented as an $n\times m$ matrix such that $n = |~\mathbb {T}~| + |~\hat{\mathbb {G}}~|$ and m is the number of columns of the alignment.

Following the block-based model used in [16] to represent transcripts, a multiple sequence alignment $\mathbb {A}$ of $\mathbb {T}$ and $\hat{\mathbb {G}}$ is partitioned into a set of non-overlapping blocks of columns as follows:

Definition 5 (Decomposition of multiple sequence alignment)

Let $\mathbb {A}$ be a multiple sequence alignment of $\mathbb {T}$ and $\hat{\mathbb {G}}$. $\mathbb {A}_{b}$ denotes the binary matrix of the same dimension as $\mathbb {A}$ such that each nucleotide A, C, G, or T in $\mathbb {A}$ is replaced by 1 in $\mathbb {A}_{b}$, and each gap character ’-’ is replaced by 0. A block of $\mathbb {A}$ is a set of consecutive columns of $\mathbb {A}$ which corresponds to an inclusion-wise maximal set of consecutive columns of $\mathbb {A}_{b}$ that are identical.

For any block B of $\mathbb {A}$, $\alpha (B)$ denotes a positive number representing the weight of the block B.

For example, in Fig. 2, the alignment is decomposed into 9 blocks.
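The decomposition of Definition 5 can be sketched in a few lines of Python. This is an illustrative sketch, assuming that the alignment is given as a list of equal-length strings with '-' as the gap character; the function names and the toy alignment are invented and do not reproduce Fig. 2.

```python
def gap_pattern(alignment, column):
    """Return the column of the binary matrix A_b as a tuple of booleans."""
    return tuple(row[column] != "-" for row in alignment)


def decompose(alignment):
    """Split the columns into maximal runs with identical gap patterns.

    Each block is returned as a half-open column interval (start, end).
    """
    n_columns = len(alignment[0])
    blocks, start = [], 0
    for j in range(1, n_columns + 1):
        # Close the current block at the last column or when the pattern changes.
        if j == n_columns or gap_pattern(alignment, j) != gap_pattern(alignment, j - 1):
            blocks.append((start, j))
            start = j
    return blocks


# Invented toy alignment with three aligned sequences and seven columns.
toy_alignment = ["ACG--TT", "AC---TT", "ACGGGTT"]
toy_blocks = decompose(toy_alignment)
```

On this toy alignment the gap pattern changes after columns 2, 3 and 5, which yields four blocks. The sketch makes a single pass over the columns and compares n entries per column, in line with an $O(n\times m)$ decomposition.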

Lemma 3 (Aligned sequences in blocks)

For any aligned sequence $\texttt {t}'$ in $\mathbb {A}$ and any block B of $\mathbb {A}$, $\texttt {t}'$ contains either only nucleotides, or only gaps in B.

Proof

Trivial, by definition of blocks.

Definition 6 (Block-based representation of transcripts and genes)

Given the ordered set of blocks defined by the partition of $\mathbb {A}$, for each transcript $\texttt {t} \in \mathbb {T}$, $\mathbb {B}(\texttt {t})$ denotes the ordered subset of blocks in which the aligned sequence $\texttt {t}'$ corresponding to $\texttt {t}$ contains nucleotides.

Likewise, for each gene $\texttt {g} \in \mathbb {G}$, $\mathbb {B}(\texttt {g})$ denotes the ordered subset of blocks in which the aligned sequence $\texttt {g}'$ corresponding to $\hat{\texttt {g}}$ contains nucleotides.
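Given a block decomposition, the block-based representation of Definition 6 follows directly from Lemma 3: it suffices to inspect one column per block. The sketch below assumes blocks encoded as half-open column intervals (start, end) and an aligned sequence given as a string with '-' for gaps; `block_set` is a hypothetical name and the input is invented.

```python
def block_set(aligned_sequence, blocks):
    """Return the ordered list of block indices in which the sequence has nucleotides.

    By Lemma 3 a sequence has only nucleotides or only gaps in a block,
    so testing the first column of each block is sufficient.
    """
    return [
        index
        for index, (start, _end) in enumerate(blocks)
        if aligned_sequence[start] != "-"
    ]


# Invented toy input: four blocks over seven alignment columns.
example_blocks = [(0, 2), (2, 3), (3, 5), (5, 7)]
example_representation = block_set("ACG--TT", example_blocks)
```

Here the sequence has nucleotides in the blocks of index 0, 1 and 3 and only gaps in the block of index 2. The same function applies to the aligned transcribed subsequence of a gene.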

Lemma 4 (Link between representations of transcripts and genes)

For any gene $\texttt {g} \in \mathbb {G}$, $\mathbb {B}(\texttt {g})$ contains all blocks in which at least one aligned sequence $\texttt {t}'$ corresponding to a transcript $\texttt {t}$ of $\texttt {g}$ contains nucleotides.

Proof

Let B be a block contained in $\mathbb {B}(\texttt {g})$. Then, the transcribed subsequence $\hat{\texttt {g}}$ contains a segment of nucleotides in B. Therefore, there exists at least one transcript $\texttt {t}$ of $\texttt {g}$ which contains this segment, and then the corresponding aligned sequence $\texttt {t}'$ contains nucleotides in B. Conversely, any block containing nucleotides from a transcript of $\texttt {g}$ belongs necessarily to $\mathbb {B}(\texttt {g})$.

We are now ready to give the definition of the transcript similarity measure.

Definition 7 (Pairwise transcript similarity)

Let $t_1$ and $t_2$ be two distinct transcripts in $\mathbb {T}$. Consider the set of blocks shared by $t_1$ and $t_2$, $\mathbb{B}\mathbb{I}(t_1,t_2) = \mathbb {B}(t_1) ~\cap ~\mathbb {B}(t_2)$, the set of blocks present in at least one of them, $\mathbb {B}\mathbb {U}(t_1,t_2) = \mathbb {B}(t_1) ~ \cup ~ \mathbb {B}(t_2) $, and the set $\mathbb {B}\mathbb {U}_{+}(t_1,t_2) = \mathbb {B}\mathbb {U}(t_1,t_2) \cap ~ (\mathbb {B}(g(t_1)) ~ \cap ~ \mathbb {B}(g(t_2)))$, which contains the blocks of $t_1$ shared with $g(t_2)$ and the blocks of $t_2$ shared with $g(t_1)$. The similarity score between $t_1$ and $t_2$ equals:

$$\begin{aligned} \textrm{tsm}(t_1,t_2) =\frac{\sum _{B \in \mathbb{B}\mathbb{I}(t_1,t_2)}^{} \alpha (B)}{\sum _{B \in \mathbb {B}\mathbb {U}(t_1,t_2)}^{} \alpha (B)} \end{aligned}$$

The corrected similarity score between $t_1$ and $t_2$ equals:

$$\begin{aligned} \textrm{tsm}_{+}(t_1,t_2) = \frac{\sum _{B \in \mathbb{B}\mathbb{I}(t_1,t_2)}^{} \alpha (B)}{\sum _{B \in \mathbb {B}\mathbb {U}_{+}(t_1,t_2)}^{} \alpha (B)} \end{aligned}$$

If the weights associated with the blocks are unitary, the similarity score between two transcripts $t_1$ and $t_2$ is the ratio between the number of blocks shared by the two transcripts and the number of blocks in at least one of the two transcripts. However, for the corrected similarity score, the blocks which are contained in the symmetric difference of $\mathbb {B}(t_1)$ and $\mathbb {B}(t_2)$ are only counted in the denominator if they belong to both $\mathbb {B}(g(t_1))$ and $\mathbb {B}(g(t_2))$. This correction makes it possible to account for differences at the transcript level only, and to ignore those at the gene level. For instance, in the example provided in Fig. 2, $\textrm{tsm}_{+}(t_{11},t_{21}) = \mid \{2,3,4,6,8\}\mid / \mid \{2,3,4,6,8\}\mid = 1$ even if they do not share the block 7. If the weight associated with a block is its length, the similarity score corresponds to the case where each column of the multiple sequence alignment $\mathbb {A}$ is considered as a block.
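The two scores of Definition 7 can be computed directly from block sets. The following Python sketch is only illustrative: block sets are plain sets of block indices, `weight` plays the role of $\alpha$, the function names are hypothetical, and returning 0 when the denominator is empty is a convention assumed here, not one stated in the paper. The example block sets are invented to mimic the situation described above, where one transcript contains a block that is absent from the other gene.

```python
def unit_weight(block):
    """Unitary block weight, i.e. alpha(B) = 1 for every block."""
    return 1.0


def tsm(blocks_t1, blocks_t2, weight=unit_weight):
    """Uncorrected score: weight of shared blocks over weight of the union."""
    shared = blocks_t1 & blocks_t2
    union = blocks_t1 | blocks_t2
    total = sum(weight(b) for b in union)
    return sum(weight(b) for b in shared) / total if total else 0.0


def tsm_plus(blocks_t1, blocks_t2, blocks_g1, blocks_g2, weight=unit_weight):
    """Corrected score: the union is restricted to blocks present in both genes."""
    shared = blocks_t1 & blocks_t2
    union_plus = (blocks_t1 | blocks_t2) & (blocks_g1 & blocks_g2)
    total = sum(weight(b) for b in union_plus)
    return sum(weight(b) for b in shared) / total if total else 0.0


# Invented block sets: block 7 belongs to t1 but is absent from the gene of t2.
b_t1 = {2, 3, 4, 6, 7, 8}
b_t2 = {2, 3, 4, 6, 8}
b_g1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
b_g2 = {1, 2, 3, 4, 5, 6, 8, 9}

uncorrected = tsm(b_t1, b_t2)
corrected = tsm_plus(b_t1, b_t2, b_g1, b_g2)
```

With these sets the uncorrected score is 5/6 because block 7 counts in the union, whereas the corrected score is 1 because block 7 is a gene-level difference and is excluded from the denominator. Passing a `weight` function that returns block lengths gives the length-weighted variant.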

3.2 Orthology Graph Construction and Ortholog Groups Inference

Using the pairwise similarity scores between transcripts, the method identifies pairs of recent paralogs and putative pairs of isoorthologs through an RBH approach. The key idea is that between two homologous genes $\texttt {g}_1$ and $\texttt {g}_2$, the pairs of isoorthologous transcripts (i.e., para- or ortho-isoorthologs) should be the most conserved. Moreover, two recent paralogous transcripts within a gene $\texttt {g}_1$ should be more similar to each other than to any transcript in another gene $\texttt {g}_2$, because their LCA, which is a creation node, is lower than the node from which any pair of transcripts from $\texttt {g}_1$ and $\texttt {g}_2$ diverged.

Definition 8 (Inferred recent paralogs)

Two distinct transcripts $\texttt {t}_1$ and $\texttt {t}_2$ of $\mathbb {T}$ such that $g(\texttt {t}_1) = g(\texttt {t}_2)$ are inferred as recent paralogs if:

$tsm(\texttt {t}_1,\texttt {t}_2) > max\{tsm(\texttt {t}_1,\texttt {t}) : \texttt {t} \in \mathbb {T} - g^{-1}(g(\texttt {t}_2))\}$ and $tsm(\texttt {t}_1,\texttt {t}_2) > max\{tsm(\texttt {t}_2,\texttt {t}) : \texttt {t} \in \mathbb {T} - g^{-1}(g(\texttt {t}_1))\}$;
or there exists a third transcript $\texttt {t}_3$ of $\mathbb {T}$ such that $\texttt {t}_1$ and $\texttt {t}_3$ are recent paralogs and, $\texttt {t}_2$ and $\texttt {t}_3$ are also recent paralogs.
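Definition 8 can be sketched in Python as a threshold test followed by a transitive closure. The sketch assumes a dictionary `gene_of` mapping each transcript to its gene and a symmetric function `score` returning the similarity of two transcripts; both names, the helper functions and the toy scores are hypothetical. When a family contains a single gene, the maximum over an empty set is taken as minus infinity, which is a convention assumed here.

```python
from itertools import combinations


def inferred_recent_paralogs(gene_of, score):
    """Return the set of transcript pairs inferred as recent paralogs."""
    transcripts = sorted(gene_of)

    def best_outside(t):
        # Best score of t against transcripts of other genes (-inf if none).
        return max(
            (score(t, u) for u in transcripts if gene_of[u] != gene_of[t]),
            default=float("-inf"),
        )

    root = {t: t for t in transcripts}  # simple union-find for the closure

    def find(t):
        while root[t] != t:
            t = root[t]
        return t

    for t1, t2 in combinations(transcripts, 2):
        if gene_of[t1] == gene_of[t2]:
            s12 = score(t1, t2)
            if s12 > best_outside(t1) and s12 > best_outside(t2):
                root[find(t1)] = find(t2)

    # Second item of the definition: pairs linked through a third transcript.
    return {
        (t1, t2) for t1, t2 in combinations(transcripts, 2) if find(t1) == find(t2)
    }


# Invented toy scores for three transcripts of gene A and one of gene B.
toy_gene_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
toy_scores = {
    frozenset(("a1", "a2")): 0.9, frozenset(("a2", "a3")): 0.85,
    frozenset(("a1", "a3")): 0.3, frozenset(("a1", "b1")): 0.8,
    frozenset(("a2", "b1")): 0.5, frozenset(("a3", "b1")): 0.6,
}
toy_paralogs = inferred_recent_paralogs(
    toy_gene_of, lambda x, y: toy_scores[frozenset((x, y))]
)
```

In this toy input, (a1, a2) and (a2, a3) pass the threshold test, and (a1, a3) is inferred through the third transcript a2 even though its own score is low.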

Definition 9 (Putative isoorthologs)

Two transcripts $\texttt {t}_1$ and $\texttt {t}_2$ of $\mathbb {T}$ of two distinct genes are inferred as putative isoorthologs if:

$tsm(\texttt {t}_1,\texttt {t}_2) = max\{tsm(\texttt {t}_1,\texttt {t}) : \texttt {t} \in g^{-1}(g(\texttt {t}_2))\} = max\{tsm(\texttt {t}_2,\texttt {t}) : \texttt {t} \in g^{-1}(g(\texttt {t}_1))\}$.
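The reciprocal best hit condition of Definition 9 can be sketched as follows. As before, `gene_of` and `score` are hypothetical inputs, and the toy scores are invented. The pairs are returned by decreasing similarity, which is the order in which Algorithm 1 is described as consuming them.

```python
from itertools import combinations


def putative_isoorthologs(gene_of, score):
    """Return reciprocal best hits between genes, by decreasing similarity."""
    transcripts = sorted(gene_of)
    by_gene = {}
    for t in transcripts:
        by_gene.setdefault(gene_of[t], []).append(t)

    def best_in(t, gene):
        # Best score of t against the transcripts of the given gene.
        return max(score(t, u) for u in by_gene[gene])

    hits = []
    for t1, t2 in combinations(transcripts, 2):
        if gene_of[t1] != gene_of[t2]:
            s12 = score(t1, t2)
            if s12 == best_in(t1, gene_of[t2]) == best_in(t2, gene_of[t1]):
                hits.append((s12, t1, t2))
    hits.sort(key=lambda hit: -hit[0])  # stable sort, ties keep name order
    return [(t1, t2) for _score, t1, t2 in hits]


# Invented toy scores between two transcripts of gene A and two of gene B.
rbh_gene_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
rbh_scores = {
    frozenset(("a1", "b1")): 0.9, frozenset(("a1", "b2")): 0.4,
    frozenset(("a2", "b1")): 0.5, frozenset(("a2", "b2")): 0.7,
}
rbh_pairs = putative_isoorthologs(
    rbh_gene_of, lambda x, y: rbh_scores[frozenset((x, y))]
)
```

In this toy input, (a1, b1) and (a2, b2) are reciprocal best hits, while (a1, b2) and (a2, b1) are not, since each of their members has a better hit in the other gene.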

Using RBH to define putative isoorthologs makes the method more robust to transcript loss or incomplete transcript annotation in some genes. The next step is to define an orthology graph whose set of vertices is $\mathbb {T}$, edges represent inferred recent paralogs or putative isoorthologs, and connected components define ortholog groups.

Definition 10 (Orthology graph)

An orthology graph for $\mathbb {T}$ is a graph $G=(V,E)$ whose set of vertices $V=\mathbb {T}$ and for any two distinct transcripts $\texttt {t}_1$ and $\texttt {t}_2$ of $\mathbb {T}$:

(1) if $(\texttt {t}_1,\texttt {t}_2)\in E$ then $\texttt {t}_1$ and $\texttt {t}_2$ are either inferred recent paralogs or putative isoorthologs, and;
(2) if $g(\texttt {t}_1)=g(\texttt {t}_2)$, then $\texttt {t}_1$ and $\texttt {t}_2$ belong to the same connected component of G if and only if $\texttt {t}_1$ and $\texttt {t}_2$ are recent paralogs.

The objective is to construct an orthology graph for $\mathbb {T}$ that contains a minimum number of connected components. Given a graph (V, E), CC(V, E) denotes the set of all connected components of the graph. We define the function $cc_{(V,E)} : V \rightarrow CC(V,E)$ which associates each vertex $x\in V$ to the connected component to which it belongs. Algorithm 1 uses a progressive heuristic approach to construct an orthology graph for $\mathbb {T}$. It takes as input the set $\mathbb{R}\mathbb{P}$ of all the pairs of inferred recent paralogs, and the ordered set $\mathbb{P}\mathbb{O}$ of all the pairs of putative isoorthologs ordered by decreasing similarity. It starts with an empty set of edges, then edges corresponding to recent paralogs are added. Next, edges that correspond to putative isoorthologs are considered progressively and added if their addition preserves the property that the graph is an orthology graph.

Lemma 5 (Correctness of Algorithm 1)

Given $\mathbb{R}\mathbb{P}$ the set of all the pairs of inferred recent paralogs, and $\mathbb{P}\mathbb{O}$ the ordered set of all the pairs of putative isoorthologs, Algorithm 1 computes an orthology graph for $\mathbb {T}$.

Proof

In Algorithm 1, after the first $\texttt {for}$ loop, the set of edges contains only recent paralogy edges. Therefore, at this point the graph is an orthology graph. In the remainder of the algorithm, an isoorthology edge is added if and only if it preserves the property that the graph is an orthology graph.
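The progressive construction described for Algorithm 1 can be sketched in Python with a union-find structure. This is one reading of the textual description, not a transcription of the algorithm itself. It assumes that the recent paralogy pairs are transitively closed, so that merging two distinct components violates condition (2) of Definition 10 exactly when the two components contain transcripts of a common gene. All names and the toy input are hypothetical.

```python
def build_orthology_graph(gene_of, recent_paralogs, putative_isoorthologs):
    """Return (edges, groups) for a progressively built orthology graph.

    recent_paralogs: pairs assumed to be transitively closed.
    putative_isoorthologs: pairs assumed sorted by decreasing similarity.
    """
    root = {t: t for t in gene_of}
    genes = {t: {gene_of[t]} for t in gene_of}  # genes present in each component

    def find(t):
        while root[t] != t:
            root[t] = root[root[t]]  # path halving
            t = root[t]
        return t

    def union(t1, t2):
        r1, r2 = find(t1), find(t2)
        if r1 != r2:
            root[r1] = r2
            genes[r2] |= genes[r1]

    edges = []
    for t1, t2 in recent_paralogs:
        edges.append((t1, t2))
        union(t1, t2)
    for t1, t2 in putative_isoorthologs:
        r1, r2 = find(t1), find(t2)
        # Accept the edge only if no gene would end up with two transcripts
        # that are not recent paralogs in the same connected component.
        if r1 == r2 or genes[r1].isdisjoint(genes[r2]):
            edges.append((t1, t2))
            union(t1, t2)

    groups = {}
    for t in gene_of:
        groups.setdefault(find(t), set()).add(t)
    return edges, sorted(sorted(group) for group in groups.values())


# Invented toy input with three genes A, B and C.
graph_gene_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
graph_edges, graph_groups = build_orthology_graph(
    graph_gene_of,
    recent_paralogs=[("a1", "a2")],
    putative_isoorthologs=[("a1", "b1"), ("b1", "c1"), ("a3", "c1")],
)
```

In this toy run the pair (a3, c1) is rejected: accepting it would place a3 in the same component as a1 and a2, although a3 is not a recent paralog of either. With path halving, each union or find operation takes near-constant amortized time.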

Given a multiple sequence alignment $\mathbb {A}$ of dimension $n\times m$ such that $n = |~\mathbb {T}~| + |~\hat{\mathbb {G}}~|$ and m is the number of columns, the decomposition of the multiple sequence alignment $\mathbb {A}$ into blocks is computed in $O(n\times m)$ time complexity. The pairwise transcript similarity scores are computed in $O(n^2 \times b)$ time complexity where b is the number of blocks in the decomposition of $\mathbb {A}$ and $b \ll m$. The pairs of inferred recent paralogs and putative isoorthologs are computed in $O(n^2)$ time complexity. Finally, Algorithm 1 runs in $O(n^2)$ time complexity. Therefore, given $\mathbb {A}$, the whole method runs in $O(n\times m + n^2 \times b)$ time complexity.

4 Results and Discussion

4.1 Comparison with Ortholog Groups Predicted in Human, Mouse and Dog One-to-one Orthologous Genes

Since this paper introduces the notion of orthology and paralogy at the transcript level, no previous work exists on the computation of transcript ortholog groups. However, many previous works have studied the problem of identifying conserved transcripts between one-to-one orthologous genes. Most of this work is limited to comparing two species. An exception is the work of Guillaudeux et al. [7] who proposed a formal definition of the splicing structure orthology and an algorithm that was used to predict transcript orthologs in human, mouse and dog. A comparison between the approach of Guillaudeux et al. and that of our method is provided in supplementary material. Their dataset is publicly available and includes 253 triplets of one-to-one orthologous genes from 236 gene families in the Ensembl-Compara database. Their method predicted 879 transcript ortholog groups for a total of 1896 transcripts.

We compared the 879 orthologous groups of Guillaudeux et al. to orthologous groups obtained using 12 different settings of our method using: (1) MACSE [17] or Kalign [14] to compute multiple sequence alignments; (2) unitary weights for alignment blocks, or weights corresponding to their lengths, or the mean of the two similarity scores; (3) a corrected or an uncorrected transcript similarity measure. For each version of our method, we used our results as the prediction and the 879 orthologous groups of Guillaudeux et al. as ground truth. We considered 3 performance measures: 1) the homogeneity score which computes the ratio of the pairs of transcripts which are predicted in the same group and are truly in the same group to the pairs of transcripts which are predicted in the same group; 2) the completeness score which computes the ratio of the pairs of transcripts which are predicted in the same group and are truly in the same group to the pairs of transcripts which are truly in the same group; 3) the V-measure which is the harmonic mean of the homogeneity and completeness scores. Figure 3 provides the details of the score distributions for all 253 gene triplets. The high scores obtained show that our results agree with those of Guillaudeux et al. In particular, the best scores are achieved with the setting that uses MACSE, the unitary weights for alignment blocks, and the uncorrected transcript similarity measure. However, for all versions, our method tends to cluster transcripts more (i.e., the homogeneity score is always less than the completeness score).
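The three pair-based measures described above can be sketched in Python. The sketch assumes that predicted and true groups are given as collections of sets of transcript identifiers; the function names are hypothetical, and returning 1 when a denominator is empty is a convention assumed here. The toy groups are invented.

```python
from itertools import combinations


def co_clustered_pairs(groups):
    """Return all unordered pairs of transcripts placed in the same group."""
    return {
        frozenset(pair)
        for group in groups
        for pair in combinations(sorted(group), 2)
    }


def pair_scores(predicted, truth):
    """Return (homogeneity, completeness, V-measure) computed on pairs."""
    predicted_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    both = len(predicted_pairs & true_pairs)
    homogeneity = both / len(predicted_pairs) if predicted_pairs else 1.0
    completeness = both / len(true_pairs) if true_pairs else 1.0
    total = homogeneity + completeness
    v_measure = 2 * homogeneity * completeness / total if total else 0.0
    return homogeneity, completeness, v_measure


# Invented example in which the prediction merges one group too many.
predicted_groups = [{"t1", "t2", "t3"}, {"t4"}]
true_groups = [{"t1", "t2"}, {"t3"}, {"t4"}]
h, c, v = pair_scores(predicted_groups, true_groups)
```

In this toy case the prediction clusters more than the ground truth, so the completeness is 1 while the homogeneity drops to 1/3, giving a V-measure of 0.5. This is the same direction of imbalance as the one reported for all settings of the method.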

The detailed comparison of the groups obtained by the Guillaudeux et al. method and the best performing setting of our method shows that 495 clusters of 166 families are exactly the same for the two methods. Our method predicts 1896 ortho-isoorthologs and 466 recent paralogs, whereas the method of Guillaudeux et al. predicts 1,408 ortho-isoorthologs and 395 recent paralogs. Our method also finds all recent paralogs inferred by Guillaudeux et al. Figure 4 shows the number of recent paralogs per species and the number of ortho-isoorthologs per species pair predicted by the two methods.

4.2 Comparison of the Proportions of Ortho-orthologs, Para-orthologs and Recent Paralogs Predicted

Table 1. Description of the dataset of 20 gene families. The total number of genes, the total number of transcripts and the numbers per species are given.

Table 2. The number of ortholog groups predicted, the ratio of isoortholog pairs to all transcript pairs, the ratio of recent paralogs pairs to all transcript pairs, the ratio of ortho-isoorthologs pairs to all isoortholog pairs divided by the ratio of gene ortholog pairs to all gene pairs, and the ratio of para-isoortholog pairs to all isoortholog pairs divided by the ratio of gene paralog pairs to all gene pairs.

We randomly selected 20 gene families composed of genes from 6 species: human, mouse, dog, dingo, cow, chicken. Table 1 describes the dataset. We used the best performing setting of our method using MACSE to compute the multiple sequence alignment, the unitary-weights similarity scores, and the uncorrected transcript similarity measure. From a total of 1,402 transcripts, we identified 236 ortholog groups. Table 2 shows the ratio of isoorthologs and recent paralogs found between transcript pairs. In this experiment, we could identify para-isoorthologs because the dataset contains paralogous genes. Table 2 also shows the ratio of ortho-isoorthology relations normalized by the ratio of gene orthology relations and the ratio of para-isoorthology relations normalized by the ratio of gene paralogy relations. When the normalized ratio of ortho-isoorthology (resp. para-isoorthology) is greater than 1, it means more relations were predicted than expected given the ratio of gene orthology (resp. paralogy) relations to all gene pairs. In 14 out of 20 families, the normalized para-isoorthology ratio is greater than 1, and greater than the normalized ortho-isoorthology ratio. Therefore, it seems that isoorthologous transcripts tend to be more present between paralogous genes than between orthologous genes. This is consistent with previous studies that have found evidence against the ortholog conjecture in the context of gene function prediction by transferring annotations between homologous genes [18]. The ortholog conjecture proposes that orthologous genes should be preferred when making such predictions because they evolve functions more slowly than paralogous genes. Our results support considering both orthologs and paralogs to provide higher prediction accuracy. However, this interpretation should also be taken with caution as there may be errors in the orthology/paralogy relationships between genes.
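The normalized ratios of Table 2 can be written explicitly. The symbols below are introduced here for illustration only and do not appear in the paper: $I_{o}$ and $I_{p}$ denote the numbers of ortho- and para-isoortholog pairs, $I = I_{o} + I_{p}$, $P_{o}$ and $P_{p}$ denote the numbers of orthologous and paralogous gene pairs, and $P$ denotes the number of all gene pairs.

$$\begin{aligned} \textrm{nr}_{o} = \frac{I_{o} / I}{P_{o} / P}, \qquad \textrm{nr}_{p} = \frac{I_{p} / I}{P_{p} / P} \end{aligned}$$

As a purely hypothetical illustration with made-up counts, a family with $I = 40$, $I_{p} = 30$, $P = 45$ and $P_{p} = 20$ would give $\textrm{nr}_{p} = (30/40)/(20/45) = 1.6875$ and $\textrm{nr}_{o} = (10/40)/(25/45) = 0.45$, i.e. more para-isoorthology relations than expected from the proportion of paralogous gene pairs.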

5 Conclusion

The ability to classify homology relationships between homologous transcripts in gene families is a fundamental step in studying the evolution of alternative transcripts. Identifying groups of transcripts that are isoorthologs helps to identify and study the function of transcripts conserved across multiple genes and species. It also provides a framework for gene tree correction and for studying the impact that the evolution of transcripts, through creation and loss events, has on the evolution of gene functions.

In this work, we revisit the notions of orthology, paralogy and isoorthology at the transcript level, and we propose an associated algorithm to infer ortholog groups composed of transcripts that are isoorthologs, recent paralogs, or related through a path of isoorthology and recent paralogy relations. The method provides results that are consistent with those of methods that identify conserved transcripts between orthologous genes. The results also show that the proportion of conserved transcripts between paralogous genes is not negligible compared to that between orthologous genes, which justifies further study of the relationship between the evolution of transcripts and genes in entire gene families.
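
As an illustration of groups "related through a path" of relations, one can read them as the connected components of a graph whose vertices are transcripts and whose edges are inferred isoorthology or recent paralogy relations. The sketch below computes such components with a union-find structure. It is a simplified reading under that assumption, not the authors' implementation; the function name `ortholog_groups`, the transcript identifiers, and the edges are hypothetical.

```python
from collections import defaultdict


def ortholog_groups(transcripts, relations):
    """Group transcripts connected by a path of pairwise relations.

    transcripts: iterable of transcript identifiers.
    relations:   iterable of (a, b) pairs; both a and b must appear
                 in `transcripts`, otherwise a KeyError is raised.
    Returns a sorted list of sorted groups (lists of identifiers).
    """
    transcripts = list(transcripts)
    parent = {t: t for t in transcripts}

    def find(t):
        # Follow parent links to the representative, halving paths on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in relations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = defaultdict(set)
    for t in transcripts:
        groups[find(t)].add(t)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example: h*, m*, d*, c* stand for human, mouse, dog, cow transcripts.
transcripts = ["h1", "h2", "m1", "m2", "d1", "c1"]
relations = [("h1", "m1"), ("m1", "d1"), ("h2", "m2")]
groups = ortholog_groups(transcripts, relations)
print(groups)
```

Here `h1` and `d1` end up in the same group although no direct relation links them, because both are related to `m1`; `c1` has no relation and forms a group of its own.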

The method offers many possibilities for improvement and extension. First, the quality of inference strongly depends on the quality of the multiple sequence alignment and on the definition of the transcript similarity measure. Future work will explore different ways to compare transcripts. We will also explore alternative algorithms for computing ortholog groups given putative pairs of isoorthologs. Finally, the method can be extended to infer all pairwise relations between homologous transcripts of a gene family by using the ortholog groups to compute complete transcript trees, which can then be reconciled with the gene tree. The predictions can then be evaluated using the annotated functions of the proteins corresponding to the transcripts.

References

1. Ait-Hamlat, A., Zea, D.J., Labeeuw, A., Polit, L., Richard, H., Laine, E.: Transcripts’ evolutionary history and structural dynamics give mechanistic insights into the functional diversity of the JNK family. J. Molecular Biol. 432(7), 2121–2140 (2020)
2. Altenhoff, A.M., Gil, M., Gonnet, G.H., Dessimoz, C.: Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS ONE 8(1), e53786 (2013)
3. Blanquart, S., Varré, J.-S., Guertin, P., Perrin, A., Bergeron, A., Swenson, K.M.: Assisted transcriptome reconstruction and splicing orthology. BMC Genom. 17(10), 157 (2016)
4. Chauve, C., El-Mabrouk, N.: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. In: Batzoglou, S. (ed.) RECOMB 2009. LNCS, vol. 5541, pp. 46–58. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02008-7_4
5. Christinat, Y., Moret, B.M.E.: Inferring transcript phylogenies. BMC Bioinform. 13(9), S1 (2012)
6. Christinat, Y., Moret, B.M.E.: A transcript perspective on evolution. IEEE/ACM Trans. Comput. Biol. Bioinf. 10(6), 1403–1411 (2013)
7. Guillaudeux, N., Belleannée, C., Blanquart, S.: Identifying genes with conserved splicing structure and orthologous isoforms in human, mouse and dog. BMC Genom. 23(1), 1–14 (2022)
8. Harrow, J., et al.: GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 22(9), 1760–1774 (2012)
9. Jammali, S., Aguilar, J.-D., Kuitche, E., Ouangraoua, A.: SplicedFamAlign: CDS-to-gene spliced alignment and identification of transcript orthology groups. BMC Bioinform. 20(3), 133 (2019)
10. Keren, H., Lev-Maor, G., Ast, G.: Alternative splicing and evolution: diversification, exon definition and function. Nat. Rev. Genet. 11(5), 345–355 (2010)
11. Kuitche, E., Jammali, S., Ouangraoua, A.: SimSpliceEvol: alternative splicing-aware simulation of biological sequence evolution. BMC Bioinform. 20(20), 640 (2019)
12. Kuitche, E., Lafond, M., Ouangraoua, A.: Reconstructing protein and gene phylogenies using reconciliation and soft-clustering. J. Bioinform. Comput. Biol. 15(06), 1740007 (2017)
13. Lafond, M., Miardan, M.M., Sankoff, D.: Accurate prediction of orthologs in the presence of divergence after duplication. Bioinformatics 34(13), i366–i375 (2018)
14. Lassmann, T., Sonnhammer, E.L.L.: Kalign - an accurate and fast multiple sequence alignment algorithm. BMC Bioinform. 6(1), 1–9 (2005)
15. Li, L., Stoeckert, C.J., Roos, D.S.: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13(9), 2178–2189 (2003)
16. Ouangraoua, A., Swenson, K.M., Bergeron, A.: On the comparison of sets of alternative transcripts. In: Bleris, L., Măndoiu, I., Schwartz, R., Wang, J. (eds.) ISBRA 2012. LNCS, vol. 7292, pp. 201–212. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30191-9_19
17. Ranwez, V., Douzery, E.J.P., Cambon, C., Chantret, N., Delsuc, F.: MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Molecular Biol. Evolut. 35(10), 2582–2584 (2018)
18. Stamboulian, M., Guerrero, R.F., Hahn, M.W., Radivojac, P.: The ortholog conjecture revisited: the value of orthologs and paralogs in function prediction. Bioinformatics 36(Supplement_1), i219–i226 (2020)
19. Swenson, K.M., El-Mabrouk, N.: Gene trees and species trees: irreconcilable differences. BMC Bioinform. 13, 1–9. BioMed Central (2012)
20. Zambelli, F., Pavesi, G., Gissi, C., Horner, D.S., Pesole, G.: Assessment of orthologous splicing isoforms in human and mouse orthologous genes. BMC Genom. 11(1), 1 (2010)
21. Zerbino, D.R., et al.: Ensembl 2018. Nucleic Acids Res. 46(D1), D754–D761 (2018)

Author information

Authors and Affiliations

Université de Sherbrooke, Sherbrooke, QC, J1K2R1, Canada
Wend Yam Donald Davy Ouedraogo & Aida Ouangraoua

Corresponding author

Correspondence to Aida Ouangraoua.

Editor information

Editors and Affiliations

Freie Universität Berlin, Berlin, Germany
Katharina Jahn
Comenius University, Bratislava, Slovakia
Tomáš Vinař

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Ouedraogo, W.Y.D.D., Ouangraoua, A. (2023). Inferring Clusters of Orthologous and Paralogous Transcripts. In: Jahn, K., Vinař, T. (eds) Comparative Genomics. RECOMB-CG 2023. Lecture Notes in Computer Science, vol 13883. Springer, Cham. https://doi.org/10.1007/978-3-031-36911-7_2

DOI: https://doi.org/10.1007/978-3-031-36911-7_2
Published: 13 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36910-0
Online ISBN: 978-3-031-36911-7
eBook Packages: Computer Science, Computer Science (R0)
