1 Introduction

Alternative splicing is one of the most important mechanisms whose extent was revealed in the post-genomic era [8]. It allows distinct transcripts to be produced from the same gene. Over the past decade, the number of alternatively spliced genes and alternative transcripts annotated in eukaryote organisms has increased dramatically [21]. It has now been established that alternative splicing was likely a feature of the eukaryotes’ common ancestor.

Over the past decade, several methods have been developed to study the conservation of sets of alternative transcripts annotated in orthologous genes. They identify splicing orthologous transcripts between genes, defined as alternative transcripts of orthologous genes composed of orthologous exons [3, 7, 9, 16, 20]. Other studies have proposed various models of transcript evolution with associated algorithms to reconstruct transcript phylogenies using parsimony-based tree search methods [1, 5, 6] or supertree methods [11, 12]. However, several questions about the evolution of sets of alternative transcripts in a gene family remain open [10]. For example, are alternative transcripts more conserved between orthologous genes than between paralogous genes? How do new alternative transcripts arise during evolution? Is an alternative transcript preserved between multiple homologous genes and species? Where was it gained or lost during evolution? Moreover, beyond identifying orthologous transcripts between orthologous genes, no method exists to compare all transcripts of a gene family and provide measures of similarity between the transcripts of all the genes. Furthermore, no framework, such as those developed for classifying gene homology types, currently exists for classifying transcript homology types. Beyond allowing a better understanding of alternative transcript evolution, the prediction of orthologous and paralogous transcripts has other important potential applications. It can be useful for gene orthology inference and gene tree correction, as well as gene function prediction.

In this paper, we present a model to define orthology and paralogy relations between transcripts of a gene family. This model is mainly inspired by the reconciliation model between gene trees and species trees, which allows defining orthologs, paralogs and isoorthologs at the gene level. Isoorthologous genes are the least divergent orthologs that have retained the function of their lowest common ancestor [15, 19]. We present an algorithm associated with our model to infer groups of isoorthologous and paralogous transcripts. The algorithm uses transcript pairwise similarity scores as conservation measures to identify pairs of recent paralogs and isoorthologs through a Reciprocal Best Hit (RBH) approach. These relations are then used to infer ortholog groups in which the pairs of transcripts are isoorthologs, recent paralogs or related through a path of isoorthology and recent paralogy relations. The paper is organized as follows. Section 2 provides the definitions and notations required for the remainder of the paper. Section 3 describes our graph-based algorithm to infer ortholog groups. Section 4 contains the results of the application on gene families and sets of transcripts from the Ensembl-Compara database [21].

2 Preliminaries: Phylogenetic Trees, Reconciliation, Orthology, Paralogy

\(\mathbb {S}\) denotes a set of species. \(\mathbb {G}\) denotes a set of homologous genes from a gene family. \(\mathbb {T}\) denotes a set of transcripts descending from the same ancestral transcript. The three sets are related by two functions \(s : \mathbb {G} \rightarrow \mathbb {S}\) that maps each gene to its corresponding species, and \(g : \mathbb {T} \rightarrow \mathbb {G}\) that maps each transcript to its corresponding gene such that \(\{g(\texttt {t}) : \texttt {t} \in \mathbb {T}\} = \mathbb {G} \) and \(\{s(\texttt {g}) : \texttt {g} \in \mathbb {G}\} = \mathbb {S}\). The induced set function \(g^{-1}\) associates each gene to its set of corresponding transcripts.

All trees are considered rooted and binary. Given a tree P, v(P) denotes the set of nodes of P, and l(P) denotes its leafset. Given a node x of P, P[x] denotes the subtree of P rooted in x. A node x is an ancestor of a node y if y is a node of P[x]. If x is an internal node, \(x_l\) and \(x_r\) denote its two children. Given a subset \(L'\) of l(P), \(lca_P(L')\) denotes the lowest common ancestor (LCA) in P of \(L'\), defined as the ancestor common to all the nodes in \(L'\) that is the most distant from the root.

A tree on a set \(\varSigma \) is a tree whose leafset is \(\varSigma \). S denotes a species tree on \(\mathbb {S}\) whose internal nodes represent a partially ordered set of speciation events that have led to \(\mathbb {S}\). G denotes a gene tree on \(\mathbb {G}\) whose internal nodes represent speciation and gene duplication events that have led to \(\mathbb {G}\). T denotes a transcript tree on \(\mathbb {T}\) whose internal nodes represent speciation, gene duplication, and transcript creation events that have led to \(\mathbb {T}\). We extend the mapping functions s from v(G) to v(S), and g from v(T) to v(G) as follows: for any node \(\texttt {g}\) in v(G), \(s(\texttt {g})=lca_S(\{s(\texttt {g}') : \texttt {g}' \in l(G[\texttt {g}])\})\), and for any node \(\texttt {t}\) in v(T), \(g(\texttt {t})=lca_G(\{g(\texttt {t}') : \texttt {t}' \in l(T[\texttt {t}])\})\).

In the Duplication-Loss (DL) model of gene evolution, genes undergo speciation events when their corresponding species do, but also duplication events when a new gene copy is created and loss events when a gene is lost in a species. Likewise, in the Creation-Loss (CL) model of transcript evolution, transcripts undergo speciation and duplication events when their corresponding genes do, but also creation events when a new transcript is created and loss events when a transcript is lost in a gene. The labeling of the internal nodes of a gene tree as speciation or gene duplication events is obtained through the reconciliation with a species tree. The labeling of the internal nodes of a transcript tree is obtained through the reconciliation with a gene tree. Several definitions of reconciliation between a gene tree and a species tree exist. Here, we consider the LCA-reconciliation based on the LCA mapping.

Definition 1 (LCA-reconciliation at the gene and transcript levels)

The LCA-reconciliation of G and S is a function \(rec_G : v(G)-\mathbb {G} \rightarrow \{Spe,Dup\}\) that labels any internal node \(\texttt {g}\) of G as a duplication (Dup) if \(s(\texttt {g}) = s(\texttt {g}_l)\) or \(s(\texttt {g}) = s(\texttt {g}_r)\), and as a speciation (Spe) otherwise.

Similarly, the LCA-reconciliation of T and G is a function \(rec_T : v(T)-\mathbb {T} \rightarrow \{Spe,Dup,Cre\}\) that labels any internal node \(\texttt {t}\) of T as a creation (Cre) if \(g(\texttt {t}) = g(\texttt {t}_l)\) or \(g(\texttt {t}) = g(\texttt {t}_r)\), otherwise as a duplication (Dup) if \(rec_G(g(\texttt {t})) = Dup\), and as a speciation (Spe) if \(rec_G(g(\texttt {t})) = Spe\).

Figure 1 shows an illustration for Definition 1. \(rec_G\) provides a reconciliation between G and S that minimizes the number of gene duplications and losses [4]. \(rec_T\) also provides a reconciliation between T and G that minimizes the number of transcript creations and losses [12].

Orthology and paralogy are relations defined over pairs of homologous genes based on the gene tree - species tree reconciliation. Orthology is the relation between two genes for which any two distinct ancestors taken at a point of time always appear in distinct ancestral species (i.e. apart, orthogonal). Paralogy is the relation between two genes having two distinct ancestors at a point of time that are in the same ancestral species (i.e. beside, parallel). Based on the co-occurrence of transcript ancestors in the same ancestral genes, we can extend orthology and paralogy definitions over pairs of homologous transcripts.

Fig. 1. A species tree S on \(\mathbb {S}=\{a,b\}\), a gene tree G on \(\mathbb {G}=\{a_1,a_2,a_3,b_1,b_2\}\), and a transcript tree T on \(\mathbb {T}=\{a_{11},a_{21},a_{22},a_{31},a_{32},b_{11},b_{12},b_{21},b_{22}\}\) such that for any species \(x\in \mathbb {S}\), gene \(x_i \in \mathbb {G}\), and transcript \(x_{ij} \in \mathbb {T}\), \(s(x_i) = x\) and \(g(x_{ij})= x_i\). Round nodes represent speciations, square nodes gene duplications, and triangle nodes transcript creations in the LCA-reconciliation of G and S, and the LCA-reconciliation of T and G. Divergence edges after creation nodes are represented as dashed lines. The isoortholog groups of \(\mathbb {T}\) are displayed using different colors.

Definition 2 (Orthology, paralogy at the gene and transcript levels)

Two distinct genes \(\texttt {g}_1\) and \(\texttt {g}_2\) of \(\mathbb {G}\) are:

  • orthologs if their LCA in G is a speciation, i.e. \(rec_G(lca_G(\{\texttt {g}_1,\texttt {g}_2\})) = Spe\);

  • recent paralogs if \(rec_G(lca_G(\{\texttt {g}_1,\texttt {g}_2\})) = Dup\) and \(s(lca_G(\{\texttt {g}_1,\texttt {g}_2\}))=s(\texttt {g}_1)\);

  • ancient paralogs otherwise.

Likewise, two distinct transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) of \(\mathbb {T}\) are [12]:

  • ortho-orthologs if their LCA in T is a speciation, i.e. \(rec_T(lca_T(\{\texttt {t}_1,\texttt {t}_2\})) = Spe\);

  • para-orthologs if \(rec_T(lca_T(\{\texttt {t}_1,\texttt {t}_2\})) = Dup\);

  • recent paralogs if \(rec_T(lca_T(\{\texttt {t}_1,\texttt {t}_2\})) = Cre\) and \(g(lca_T(\{\texttt {t}_1,\texttt {t}_2\}))=g(\texttt {t}_1)\);

  • ancient paralogs otherwise.

For example in Fig. 1, \(a_{11}\) and \(b_{12}\) are ortho-orthologs, \(a_{11}\) and \(b_{21}\) are para-orthologs, \(b_{21}\) and \(b_{22}\) are recent paralogs, and \(a_{11}\) and \(a_{22}\) are ancient paralogs. Notice that if all gene pairs are orthologs then all pairs of orthologous transcripts are ortho-orthologs.
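Definitions 1 and 2 can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes trees encoded as nested tuples, and the toy trees `S`, `G`, `T` and all function names are hypothetical (they do not reproduce Fig. 1).

```python
# Trees are nested tuples: a leaf is a string, an internal node is (left, right).
def leaves(node):
    """Set of leaf labels below a node."""
    return {node} if isinstance(node, str) else leaves(node[0]) | leaves(node[1])

def lca(tree, targets):
    """Lowest node of `tree` whose leafset contains every label in `targets`."""
    node = tree
    while not isinstance(node, str):
        below = [child for child in node if targets <= leaves(child)]
        if not below:
            break
        node = below[0]
    return node

def image(node, target_tree, leaf_map):
    """Extended mapping (s or g): LCA in target_tree of the images of the leaves."""
    return lca(target_tree, {leaf_map[x] for x in leaves(node)})

def rec_gene(g_node, S, species_of):
    """rec_G: Dup if the node maps to the same species node as a child, else Spe."""
    here = image(g_node, S, species_of)
    return "Dup" if here in [image(c, S, species_of) for c in g_node] else "Spe"

def rec_transcript(t_node, G, S, gene_of, species_of):
    """rec_T: Cre if the node maps to the same gene node as a child, else rec_G."""
    here = image(t_node, G, gene_of)
    if here in [image(c, G, gene_of) for c in t_node]:
        return "Cre"
    return rec_gene(here, S, species_of)

def relation(t1, t2, T, G, S, gene_of, species_of):
    """Homology type of two distinct transcripts, following Definition 2."""
    anc = lca(T, {t1, t2})
    label = rec_transcript(anc, G, S, gene_of, species_of)
    if label == "Spe":
        return "ortho-orthologs"
    if label == "Dup":
        return "para-orthologs"
    same_gene = image(anc, G, gene_of) == gene_of[t1]
    return "recent paralogs" if same_gene else "ancient paralogs"

# Hypothetical toy family: two species, four genes, six transcripts.
S = ("x", "y")
G = (("x1", "y1"), ("x2", "y2"))
T = ((("x11", "y11"), "x12"), (("x21", "x22"), "y21"))
species_of = {"x1": "x", "x2": "x", "y1": "y", "y2": "y"}
gene_of = {"x11": "x1", "x12": "x1", "y11": "y1",
           "x21": "x2", "x22": "x2", "y21": "y2"}
```

In this toy family, `x11` and `x12` belong to the same gene but their LCA maps to an internal gene node, so they come out as ancient paralogs, which illustrates why sharing a gene is not sufficient for recent paralogy.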

Lemma 1 (Link between homology relationships at the gene and transcript levels)

If two transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) are ortho-orthologs then the genes \(g(\texttt {t}_1)\) and \(g(\texttt {t}_2)\) are orthologs, and if \(\texttt {t}_1\) and \(\texttt {t}_2\) are para-orthologs then \(g(\texttt {t}_1)\) and \(g(\texttt {t}_2)\) are paralogs. If \(\texttt {t}_1\) and \(\texttt {t}_2\) are recent paralogs then \(g(\texttt {t}_1) = g(\texttt {t}_2)\). None of the converse statements are true.

Proof

Let \(\texttt {t}=lca_T(\{\texttt {t}_1,\texttt {t}_2\})\). If \(\texttt {t}_1\) and \(\texttt {t}_2\) are ortho-orthologs or para-orthologs then \(g(\texttt {t}) \ne g(\texttt {t}_l)\) and \(g(\texttt {t}) \ne g(\texttt {t}_r)\). Therefore, \(lca_G(\{g(\texttt {t}_1),g(\texttt {t}_2)\}) = g(\texttt {t})\), and then \(rec_G(lca_G(\{g(\texttt {t}_1),g(\texttt {t}_2)\})) = rec_G(g(\texttt {t}))\) which is a speciation if \(\texttt {t}_1\) and \(\texttt {t}_2\) are ortho-orthologs, and a duplication otherwise. If \(\texttt {t}_1\) and \(\texttt {t}_2\) are recent paralogs then any leaf \(\texttt {t}'\) of the subtree \(T[\texttt {t}]\) must satisfy \(g(\texttt {t}') = g(lca_T(\{\texttt {t}_1,\texttt {t}_2\}))\). Therefore, \(g(\texttt {t}_1) = g(\texttt {t}_2) = g(lca_T(\{\texttt {t}_1,\texttt {t}_2\}))\).

Figure 1 shows an example where \(a_3\) and \(b_2\) (resp. \(a_1\) and \(a_2\)) are orthologous (resp. paralogous) genes but their transcripts \(a_{31}\) and \(b_{21}\) (resp. \(a_{11}\) and \(a_{22}\)) are not ortho-orthologs (resp. para-orthologs). Likewise, \(g(a_{21})=g(a_{22}) = a_2\) but \(a_{21}\) and \(a_{22}\) are not recent paralogs.

Lemma 2 (Link between recent paralogy and orthology)

(1) If three transcripts \(\texttt {t}_1\), \(\texttt {t}_2\) and \(\texttt {t}_3\) are such that \(\texttt {t}_1\) and \(\texttt {t}_2\) are recent paralogs, and \(\texttt {t}_1\) and \(\texttt {t}_3\) are ortho-orthologs (resp. para-orthologs), then \(\texttt {t}_2\) and \(\texttt {t}_3\) are ortho-orthologs (resp. para-orthologs). (2) If \(\texttt {t}_1\) and \(\texttt {t}_3\) are recent paralogs, and \(\texttt {t}_2\) and \(\texttt {t}_3\) are recent paralogs, then \(\texttt {t}_1\) and \(\texttt {t}_2\) are also recent paralogs.

Proof

If \(\texttt {t}_1\), \(\texttt {t}_2\) and \(\texttt {t}_3\) are in the configuration of (1) then they must satisfy \(lca_T(\{\texttt {t}_2,\texttt {t}_3\}) = lca_T(\{\texttt {t}_1,\texttt {t}_3\})\). Therefore, \(\texttt {t}_2\) and \(\texttt {t}_3\) have the same relation as \(\texttt {t}_1\) and \(\texttt {t}_3\). (2) is trivial.

The key assumption of our method is that after a transcript creation event, the newly created transcript tends to diverge from the original transcript from which it was modified, whereas the original transcript tends to remain conserved. This mirrors the inference of gene ortholog groups using graph-based methods, where isoorthologous gene pairs are considered [2, 13, 15, 19]. It should be noted that the conservation/divergence between two transcripts is based on the comparison of their exon content, and not of their nucleotide sequences. Therefore, given a creation node t in the LCA-reconciliation of T with G, one of its edges descending to its children, say \((t,t_l)\) without loss of generality, corresponds to the conserved original transcript, whereas the other edge \((t,t_r)\) corresponds to the newly created divergent transcript. In this case, we call \((t,t_l)\) a conservation edge, whereas \((t,t_r)\) is called a divergence edge. For example, in Fig. 1, the divergence edges after creation nodes appear as dashed lines. Distinguishing conservation edges and divergence edges after a creation node allows us to define a particular type of orthology relation between transcripts.

Definition 3 (Isoorthology at the transcript level)

Two ortho- (resp. para-) orthologous transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) of \(\mathbb {T}\) are ortho- (resp. para-) isoorthologs if there are no divergence edges on the path between \(\texttt {t}_1\) and \(\texttt {t}_2\) in T.

Notice that the isoorthology relation is transitive, which allows \(\mathbb {T}\) to be partitioned into ortholog groups.

Definition 4 (Ortholog groups at the transcript level)

An ortholog group \(\mathbb {O}\) of \(\mathbb {T}\) is a subset of \(\mathbb {T}\) such that any two distinct transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) belonging to \(\mathbb {O}\) are isoorthologs (i.e. ortho- or para-isoorthologs), recent paralogs, or there exist two transcripts \(\texttt {t}_1'\) and \(\texttt {t}_2'\) in \(\mathbb {O}\) such that \(\texttt {t}_1=\texttt {t}_1'\) or \(\texttt {t}_1\) and \(\texttt {t}_1'\) are recent paralogs, \(\texttt {t}_2=\texttt {t}_2'\) or \(\texttt {t}_2\) and \(\texttt {t}_2'\) are recent paralogs, and \(\texttt {t}_1'\) and \(\texttt {t}_2'\) are isoorthologs.

For example, in Fig. 1, \(\{a_{11},a_{21},b_{11},a_{31}\}\), \(\{a_{22}\}\), \(\{b_{12}\}\), \(\{a_{32},b_{21}\}\), \(\{b_{22}\}\) are the inclusion-wise maximal ortholog groups of \(\mathbb {T}\).

3 A Graph-Based Algorithm to Infer Isoorthology and Recent Paralogy Relations Between Transcripts

In this section, we present a graph-based method to infer isoorthology and recent paralogy relations in a set of homologous transcripts \(\mathbb {T}\). The method relies on a pairwise similarity measure between transcripts to infer ortholog groups.

3.1 Pairwise Similarity Score Between Transcripts

A gene \(\texttt {g} \in \mathbb {G}\) is a DNA sequence on the alphabet of nucleotides \(\varSigma = \{A,C,G,T\}\). A transcript \(\texttt {t}\) of \(\texttt {g}\) (i.e. \(\texttt {t} \in g^{-1}(\texttt {g})\)) is a subsequence of \(\texttt {g}\) obtained by concatenating an ordered set of substrings of \(\texttt {g}\) such that each substring is an exon of \(\texttt {g}\) that is present in \(\texttt {t}\). The transcribed subsequence of a gene \(\texttt {g} \in \mathbb {G}\), denoted by \(\hat{\texttt{g}}\), is the subsequence of \(\texttt {g}\) obtained by deleting from \(\texttt {g}\) any nucleotide that is absent from all transcripts of \(\texttt {g}\). \(\hat{\mathbb {G}}\) denotes the set of transcribed subsequences of all genes in \(\mathbb {G}\). Note that \(|~\hat{\mathbb {G}}~| = |~\mathbb {G}~|\). Figure 2 shows an illustration.

A multiple sequence alignment \(\mathbb {A}\) of all the transcribed subsequences in \(\hat{\mathbb {G}}\) and all the transcripts in \(\mathbb {T}\) is obtained by first computing a multiple sequence alignment M of the transcribed subsequences in \(\hat{\mathbb {G}}\), and then mapping each transcript \(\texttt {t} \in \mathbb {T}\) on its corresponding transcribed subsequence within M to obtain the resulting alignment \(\mathbb {A}\).

In the sequel, \(\mathbb {A}\) denotes a multiple sequence alignment of all the transcribed subsequences in \(\hat{\mathbb {G}}\) and all the transcript sequences in \(\mathbb {T}\), represented as an \(n\times m\) matrix such that \(n = |~\mathbb {T}~| + |~\hat{\mathbb {G}}~|\) and m is the number of columns of the alignment.

Fig. 2. A multiple sequence alignment of the transcribed subsequences of two genes \(\texttt {g1}\) and \(\texttt {g2}\) with their transcripts, decomposed into 9 blocks. Non-coding nucleotides in the gene sequence (i.e. introns, untranscribed and untranslated regions) are represented with the character ’*’.

Following the block-based model used in [16] to represent transcripts, a multiple sequence alignment \(\mathbb {A}\) of \(\mathbb {T}\) and \(\hat{\mathbb {G}}\) is partitioned into a set of non-overlapping blocks of columns as follows:

Definition 5 (Decomposition of multiple sequence alignment)

Let \(\mathbb {A}\) be a multiple sequence alignment of \(\mathbb {T}\) and \(\hat{\mathbb {G}}\). \(\mathbb {A}_{b}\) denotes the binary matrix of the same dimension as \(\mathbb {A}\) such that each nucleotide A, C, G, or T in \(\mathbb {A}\) is replaced by 1 in \(\mathbb {A}_{b}\), and each gap character ’-’ is replaced by 0. A block of \(\mathbb {A}\) is a set of consecutive columns of \(\mathbb {A}\) which corresponds to an inclusion-wise maximal set of consecutive columns of \(\mathbb {A}_{b}\) which are equal.

For any block B of \(\mathbb {A}\), \(\alpha (B)\) denotes a positive number representing the weight of the block B.

For example, in Fig. 2, the alignment is decomposed into 9 blocks.
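The decomposition of Definition 5 amounts to grouping consecutive columns that have the same gap/nucleotide pattern. The following Python sketch illustrates this under simplifying assumptions: the alignment is a list of equal-length strings, every non-gap character counts as a nucleotide, and the toy alignment and function names are hypothetical (they are not the alignment of Fig. 2).

```python
def decompose(alignment):
    """Split the columns of an alignment into blocks (half-open column ranges).

    alignment: list of equal-length aligned strings, gaps written as '-'.
    """
    n_cols = len(alignment[0])
    # Column j of the binary matrix A_b: True for a nucleotide, False for a gap.
    pattern = [tuple(row[j] != "-" for row in alignment) for j in range(n_cols)]
    blocks, start = [], 0
    for j in range(1, n_cols + 1):
        # Close the current block at the end or when the column pattern changes.
        if j == n_cols or pattern[j] != pattern[start]:
            blocks.append((start, j))
            start = j
    return blocks

def block_sets(alignment, blocks):
    """For each aligned sequence, the indices of the blocks where it has nucleotides."""
    return [{k for k, (s, _) in enumerate(blocks) if row[s] != "-"}
            for row in alignment]

# Hypothetical toy alignment: transcribed subsequences and transcripts of two genes.
names = ["g1", "t11", "t12", "g2", "t21"]
alignment = [
    "ACGTACGT--",  # transcribed subsequence of g1
    "ACG---GT--",  # t11
    "ACGTAC----",  # t12
    "--GTACGTAA",  # transcribed subsequence of g2
    "--GTACGTAA",  # t21
]
blocks = decompose(alignment)
B = dict(zip(names, block_sets(alignment, blocks)))
```

Testing only the first column of each block in `block_sets` is sufficient because, within a block, every aligned sequence contains either only nucleotides or only gaps.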

Lemma 3 (Aligned sequences in blocks)

For any aligned sequence \(\texttt {t}'\) in \(\mathbb {A}\) and any block B of \(\mathbb {A}\), \(\texttt {t}'\) contains either only nucleotides, or only gaps in B.

Proof

Trivial, by definition of blocks.

Definition 6 (Block-based representation of transcripts and genes)

Given the ordered set of blocks defined by the partition of \(\mathbb {A}\), for each transcript \(\texttt {t} \in \mathbb {T}\), \(\mathbb {B}(\texttt {t})\) denotes the ordered subset of blocks in which the aligned sequence \(\texttt {t}'\) corresponding to \(\texttt {t}\) contains nucleotides.

Likewise, for each gene \(\texttt {g} \in \mathbb {G}\), \(\mathbb {B}(\texttt {g})\) denotes the ordered subset of blocks in which the aligned sequence \(\texttt {g}'\) corresponding to \(\hat{\texttt {g}}\) contains nucleotides.

Lemma 4 (Link between representations of transcripts and genes)

For any gene \(\texttt {g} \in \mathbb {G}\), \(\mathbb {B}(\texttt {g})\) contains all blocks in which at least one aligned sequence \(\texttt {t}'\) corresponding to a transcript \(\texttt {t}\) of \(\texttt {g}\) contains nucleotides.

Proof

Let B be a block contained in \(\mathbb {B}(\texttt {g})\). Then, the transcribed subsequence \(\hat{\texttt {g}}\) contains a segment of nucleotides in B. Therefore, there exists at least one transcript \(\texttt {t}\) of \(\texttt {g}\) which contains this segment, and then the corresponding aligned sequence \(\texttt {t}'\) contains nucleotides in B. Conversely, any block containing nucleotides from a transcript of \(\texttt {g}\) belongs necessarily to \(\mathbb {B}(\texttt {g})\).

We are now ready to give the definition of the transcript similarity measure.

Definition 7 (Pairwise transcript similarity)

Let \(t_1\) and \(t_2\) be two distinct transcripts in \(\mathbb {T}\). Consider the set of blocks shared by \(t_1\) and \(t_2\), \(\mathbb{B}\mathbb{I}(t_1,t_2) = \mathbb {B}(t_1) ~\cap ~\mathbb {B}(t_2)\), the set of blocks \(\mathbb {B}\mathbb {U}(t_1,t_2) = \mathbb {B}(t_1) ~ \cup ~ \mathbb {B}(t_2) \), and the set \(\mathbb {B}\mathbb {U}_{+}(t_1,t_2) = \mathbb {B}\mathbb {U}(t_1,t_2) \cap ~ (\mathbb {B}(g(t_1)) ~ \cap ~ \mathbb {B}(g(t_2)))\), which contains the blocks of \(t_1\) shared with \(g(t_2)\) and the blocks of \(t_2\) shared with \(g(t_1)\). The similarity score between \(t_1\) and \(t_2\) equals:

$$\begin{aligned} \textrm{tsm}(t_1,t_2) =\frac{\sum _{B \in \mathbb{B}\mathbb{I}(t_1,t_2)}^{} \alpha (B)}{\sum _{B \in \mathbb {B}\mathbb {U}(t_1,t_2)}^{} \alpha (B)} \end{aligned}$$

The corrected similarity score between \(t_1\) and \(t_2\) equals:

$$\begin{aligned} \textrm{tsm}_{+}(t_1,t_2) = \frac{\sum _{B \in \mathbb{B}\mathbb{I}(t_1,t_2)}^{} \alpha (B)}{\sum _{B \in \mathbb {B}\mathbb {U}_{+}(t_1,t_2)}^{} \alpha (B)} \end{aligned}$$

If the weights associated with the blocks are unitary, the similarity score between two transcripts \(t_1\) and \(t_2\) is the ratio between the number of blocks shared by the two transcripts and the number of blocks present in at least one of the two transcripts. However, for the corrected similarity score, the blocks which are contained in the symmetric difference of \(\mathbb {B}(t_1)\) and \(\mathbb {B}(t_2)\) are only counted in the denominator if they belong to both \(\mathbb {B}(g(t_1))\) and \(\mathbb {B}(g(t_2))\). This correction makes it possible to account for differences at the transcript level only, and to ignore those at the gene level. For instance, in the example provided in Fig. 2, \(tsm_+(t11,t21) =\) \(\mid \{2,3,4,6,8\}\mid / \mid \{2,3,4,6,8\}\mid = 1\) even though they do not share block 7. If the weight associated with a block is its length, the similarity score corresponds to the case where each column of the multiple sequence alignment \(\mathbb {A}\) is considered as a block of unit weight.
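Both scores of Definition 7 reduce to set operations on block sets. The Python sketch below illustrates them; the function names, the default unitary weight, and the convention of returning 0 for an empty denominator are assumptions made for illustration. The gene block sets in the example are hypothetical, chosen only to be consistent with the values quoted above for \(t11\) and \(t21\), since the content of Fig. 2 is not reproduced here.

```python
def total(blocks, weight):
    """Sum of the weights alpha(B) over a set of blocks."""
    return sum(weight(b) for b in blocks)

def tsm(bt1, bt2, weight=lambda b: 1):
    """Uncorrected score: weight of shared blocks over weight of the union."""
    denom = total(bt1 | bt2, weight)
    return total(bt1 & bt2, weight) / denom if denom else 0.0

def tsm_plus(bt1, bt2, bg1, bg2, weight=lambda b: 1):
    """Corrected score: the union is restricted to blocks present in both genes."""
    denom = total((bt1 | bt2) & bg1 & bg2, weight)
    return total(bt1 & bt2, weight) / denom if denom else 0.0

# Hypothetical block sets (block 7 exists only in the second gene).
b_t11 = {2, 3, 4, 6, 8}
b_t21 = {2, 3, 4, 6, 7, 8}
b_g1 = {1, 2, 3, 4, 5, 6, 8}
b_g2 = {2, 3, 4, 6, 7, 8, 9}

uncorrected = tsm(b_t11, b_t21)                 # 5 shared blocks out of 6
corrected = tsm_plus(b_t11, b_t21, b_g1, b_g2)  # block 7 is ignored
```

Passing a different `weight` function, for instance one that returns the number of columns of a block, gives the length-weighted variant.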

3.2 Orthology Graph Construction and Ortholog Groups Inference

Using the pairwise similarity scores between transcripts, the method identifies pairs of recent paralogs and putative pairs of isoorthologs through a Reciprocal Best Hit (RBH) approach. The key idea is that between two homologous genes \(\texttt {g}_1\) and \(\texttt {g}_2\), the pairs of isoorthologous transcripts (i.e., para- or ortho-isoorthologs) should be the most conserved. Moreover, two recent paralogous transcripts within a gene \(\texttt {g}_1\) should be more similar to each other than to any transcript in another gene \(\texttt {g}_2\), because their LCA, which is a creation node, is lower than the node from which any pair of transcripts from \(\texttt {g}_1\) and \(\texttt {g}_2\) diverged.

Definition 8 (Inferred recent paralogs)

Two distinct transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) of \(\mathbb {T}\) such that \(g(\texttt {t}_1) = g(\texttt {t}_2)\) are inferred as recent paralogs if:

  • \(tsm(\texttt {t}_1,\texttt {t}_2) > max\{tsm(\texttt {t}_1,\texttt {t}) : \texttt {t} \in \mathbb {T} - g^{-1}(g(\texttt {t}_2))\}\) and \(tsm(\texttt {t}_1,\texttt {t}_2) > max\{tsm(\texttt {t}_2,\texttt {t}) : \texttt {t} \in \mathbb {T} - g^{-1}(g(\texttt {t}_1))\}\);

  • or there exists a third transcript \(\texttt {t}_3\) of \(\mathbb {T}\) such that \(\texttt {t}_1\) and \(\texttt {t}_3\) are recent paralogs, and \(\texttt {t}_2\) and \(\texttt {t}_3\) are also recent paralogs.
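The two conditions of Definition 8 can be sketched as a direct test followed by a transitive closure. The Python sketch below is a minimal illustration, not the authors' implementation: the function names, the union-find closure, and the convention that a transcript with no transcript in another gene has a best outside score of minus infinity are all assumptions.

```python
from itertools import combinations

def inferred_recent_paralogs(transcripts, gene_of, sim):
    """Pairs of same-gene transcripts inferred as recent paralogs.

    sim(t, u) returns the similarity score of two distinct transcripts.
    """
    def best_outside(t):
        # Highest similarity of t to any transcript of another gene.
        return max((sim(t, u) for u in transcripts if gene_of[u] != gene_of[t]),
                   default=float("-inf"))

    parent = {t: t for t in transcripts}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # First condition: the pair is more similar than either member is to
    # any transcript outside the gene.
    for t1, t2 in combinations(transcripts, 2):
        if gene_of[t1] == gene_of[t2] and \
                sim(t1, t2) > max(best_outside(t1), best_outside(t2)):
            parent[find(t1)] = find(t2)

    # Second condition: transitive closure of the direct pairs.
    return {frozenset((t1, t2)) for t1, t2 in combinations(transcripts, 2)
            if find(t1) == find(t2)}

# Hypothetical scores: a1-a3 fails the direct test but is linked through a2.
scores = {frozenset(p): v for p, v in {
    ("a1", "a2"): 0.9, ("a2", "a3"): 0.9, ("a1", "a3"): 0.5,
    ("a1", "b1"): 0.6, ("a2", "b1"): 0.6, ("a3", "b1"): 0.6}.items()}
gene_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
pairs = inferred_recent_paralogs(list(gene_of), gene_of,
                                 lambda t, u: scores[frozenset((t, u))])
```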

Definition 9 (Putative isoorthologs)

Two transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) of \(\mathbb {T}\) of two distinct genes are inferred as putative isoorthologs if:

\(tsm(\texttt {t}_1,\texttt {t}_2) = max\{tsm(\texttt {t}_1,\texttt {t}) : \texttt {t} \in g^{-1}(g(\texttt {t}_2))\} = max\{tsm(\texttt {t}_2,\texttt {t}) : \texttt {t} \in g^{-1}(g(\texttt {t}_1))\}\).
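The reciprocal best hit condition of Definition 9 can be sketched as follows. This Python sketch is illustrative: the function names and the toy scores are hypothetical, and pairs are returned sorted by decreasing similarity as a convenience for later processing.

```python
from itertools import combinations

def putative_isoorthologs(transcripts, gene_of, sim):
    """Reciprocal best hits between transcripts of distinct genes.

    Returns (score, t1, t2) triples sorted by decreasing similarity.
    """
    def best_in(t, gene):
        # Highest similarity of t to any transcript of the given gene.
        return max(sim(t, u) for u in transcripts if gene_of[u] == gene)

    hits = []
    for t1, t2 in combinations(transcripts, 2):
        if gene_of[t1] == gene_of[t2]:
            continue
        score = sim(t1, t2)
        # t2 must be a best hit of t1 in g(t2), and t1 a best hit of t2 in g(t1).
        if score == best_in(t1, gene_of[t2]) == best_in(t2, gene_of[t1]):
            hits.append((score, t1, t2))
    return sorted(hits, key=lambda h: -h[0])

# Hypothetical scores between the transcripts of two genes A and B.
scores = {frozenset(p): v for p, v in {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.4,
    ("a2", "b1"): 0.7, ("a2", "b2"): 0.3}.items()}
gene_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
hits = putative_isoorthologs(list(gene_of), gene_of,
                             lambda t, u: scores[frozenset((t, u))])
```

Here `a2` has `b1` as its best hit in gene B, but `b1` prefers `a1`, so only the pair (`a1`, `b1`) is retained.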

Using RBH to define putative isoorthologs makes the method more robust to transcript loss or incomplete transcript annotation in some genes. The next step is to define an orthology graph whose set of vertices is \(\mathbb {T}\), edges represent inferred recent paralogs or putative isoorthologs, and connected components define ortholog groups.

Algorithm 1

Definition 10 (Orthology graph)

An orthology graph for \(\mathbb {T}\) is a graph \(G=(V,E)\) whose set of vertices \(V=\mathbb {T}\) and for any two distinct transcripts \(\texttt {t}_1\) and \(\texttt {t}_2\) of \(\mathbb {T}\):

  • (1) if \((\texttt {t}_1,\texttt {t}_2)\in E\) then \(\texttt {t}_1\) and \(\texttt {t}_2\) are either inferred recent paralogs or putative isoorthologs, and;

  • (2) if \(g(\texttt {t}_1)=g(\texttt {t}_2)\), then \(\texttt {t}_1\) and \(\texttt {t}_2\) belong to the same connected component of G if and only if \(\texttt {t}_1\) and \(\texttt {t}_2\) are recent paralogs.

The objective is to construct an orthology graph for \(\mathbb {T}\) that contains a minimum number of connected components. Given a graph (V, E), CC(V, E) denotes the set of all connected components of the graph. We define the function \(cc_{(V,E)} : V \rightarrow CC(V,E)\) which associates each vertex \(x\in V\) to the connected component to which it belongs. Algorithm 1 uses a progressive heuristic approach to construct an orthology graph for \(\mathbb {T}\). It takes as input the set \(\mathbb{R}\mathbb{P}\) of all the pairs of inferred recent paralogs, and the ordered set \(\mathbb{P}\mathbb{O}\) of all the pairs of putative isoorthologs ordered by decreasing similarity. It starts with an empty set of edges, then edges corresponding to recent paralogs are added. Next, edges that correspond to putative isoorthologs are considered progressively and added if their addition preserves the property that the graph is an orthology graph.
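The progressive construction described above can be sketched with a union-find structure. This Python sketch is a reconstruction from the prose description only, not the authors' Algorithm 1; the function names and toy input are hypothetical. It assumes that the set of recent paralog pairs is transitively closed, as Definition 8 guarantees. Under that assumption, merging two distinct components preserves condition (2) of Definition 10 exactly when the two components contain no gene in common.

```python
def build_orthology_graph(transcripts, gene_of, recent_paralogs, putative_isoorthologs):
    """Greedy construction of an orthology graph.

    recent_paralogs: iterable of (t1, t2) pairs, assumed transitively closed.
    putative_isoorthologs: iterable of (t1, t2) pairs, sorted by decreasing similarity.
    Returns the list of accepted edges and the list of connected components.
    """
    parent = {t: t for t in transcripts}
    genes = {t: {gene_of[t]} for t in transcripts}  # genes per component root

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            genes[rb] |= genes.pop(ra)

    edges = []
    for t1, t2 in recent_paralogs:
        union(t1, t2)
        edges.append((t1, t2))
    for t1, t2 in putative_isoorthologs:
        r1, r2 = find(t1), find(t2)
        # Accept the edge if it does not bring together two transcripts of
        # the same gene that are not recent paralogs.
        if r1 == r2 or genes[r1].isdisjoint(genes[r2]):
            union(t1, t2)
            edges.append((t1, t2))

    groups = {}
    for t in transcripts:
        groups.setdefault(find(t), set()).add(t)
    return edges, list(groups.values())

# Hypothetical input: the last pair would join a1 and a2, which are not recent paralogs.
gene_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
po = [("a1", "b1"), ("b1", "c1"), ("a2", "c1")]
edges, groups = build_orthology_graph(list(gene_of), gene_of, [], po)
```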

Fig. 3. Homogeneity, Completeness and V-measure scores of our predictions for the 253 triplets of orthologous genes from Guillaudeux et al. [7]

Lemma 5 (Correctness of Algorithm 1)

Given \(\mathbb{R}\mathbb{P}\) the set of all the pairs of inferred recent paralogs, and \(\mathbb{P}\mathbb{O}\) the ordered set of all the pairs of putative isoorthologs, Algorithm 1 computes an orthology graph for \(\mathbb {T}\).

Proof

In Algorithm 1, after the first \(\texttt {for}\) loop, the set of edges contains only recent paralogy edges. Therefore, at this point the graph is an orthology graph. In the remainder of the algorithm, an isoorthology edge is added if and only if it preserves the property that the graph is an orthology graph.

Given a multiple sequence alignment \(\mathbb {A}\) of dimension \(n\times m\) such that \(n = |~\mathbb {T}~| + |~\hat{\mathbb {G}}~|\) and m is the number of columns, the decomposition of the multiple sequence alignment \(\mathbb {A}\) into blocks is computed in \(O(n\times m)\) time. The pairwise transcript similarity scores are computed in \(O(n^2 \times b)\) time, where b is the number of blocks in the decomposition of \(\mathbb {A}\) and \(b \ll m\). The pairs of inferred recent paralogs and putative isoorthologs are computed in \(O(n^2)\) time. Finally, Algorithm 1 runs in \(O(n^2)\) time. Therefore, given \(\mathbb {A}\), the whole method runs in \(O(n\times m + n^2 \times b)\) time.

Fig. 4. Comparison of the number of recent paralogs and the number of ortho-isoorthologs inferred by the methods.

4 Results and Discussion

4.1 Comparison with Ortholog Groups Predicted in Human, Mouse and Dog One-to-one Orthologous Genes

Since this paper introduces the notion of orthology and paralogy at the transcript level, no previous work exists on the computation of transcript ortholog groups. However, many previous works have studied the problem of identifying conserved transcripts between one-to-one orthologous genes. Most of this work is limited to comparing two species. An exception is the work of Guillaudeux et al. [7] who proposed a formal definition of the splicing structure orthology and an algorithm that was used to predict transcript orthologs in human, mouse and dog. A comparison between the approach of Guillaudeux et al. and that of our method is provided in supplementary material. Their dataset is publicly available and includes 253 triplets of one-to-one orthologous genes from 236 gene families in the Ensembl-Compara database. Their method predicted 879 transcript ortholog groups for a total of 1896 transcripts.

We compared the 879 orthologous groups of Guillaudeux et al. to orthologous groups obtained using 12 different settings of our method, combining: (1) MACSE [17] or Kalign [14] to compute multiple sequence alignments; (2) unitary weights for alignment blocks, or weights corresponding to their lengths, or the mean of the two similarity scores; (3) a corrected or an uncorrected transcript similarity measure. For each version of our method, we used our results as the prediction and the 879 orthologous groups of Guillaudeux et al. as ground truth. We considered 3 performance measures: 1) the homogeneity score, which computes the ratio of the pairs of transcripts which are predicted in the same group and are truly in the same group to the pairs of transcripts which are predicted in the same group; 2) the completeness score, which computes the ratio of the pairs of transcripts which are predicted in the same group and are truly in the same group to the pairs of transcripts which are truly in the same group; 3) the V-measure, which is the harmonic mean of the homogeneity and completeness scores. Figure 3 provides the details of the score distributions for all 253 gene triplets. The high scores obtained show that our results agree with those of Guillaudeux et al. In particular, the best scores are achieved with the setting that uses MACSE, unitary weights for alignment blocks, and the uncorrected transcript similarity measure. However, for all versions, our method tends to cluster transcripts more (i.e., the homogeneity score is always less than the completeness score).
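The three measures, as defined in the paragraph above in terms of transcript pairs, can be sketched in Python as follows. The function names and toy groups are hypothetical, and returning 1 when a denominator is empty is an assumed convention. Note that these pair-based definitions differ from the entropy-based scores of the same names found in some clustering libraries.

```python
from itertools import combinations

def co_grouped_pairs(groups):
    """All unordered pairs of transcripts placed in the same group."""
    return {frozenset(p) for g in groups for p in combinations(sorted(g), 2)}

def pair_scores(predicted, truth):
    """Pair-based homogeneity, completeness and their harmonic mean."""
    pred, true = co_grouped_pairs(predicted), co_grouped_pairs(truth)
    both = len(pred & true)
    homogeneity = both / len(pred) if pred else 1.0
    completeness = both / len(true) if true else 1.0
    total = homogeneity + completeness
    v_measure = 2 * homogeneity * completeness / total if total else 0.0
    return homogeneity, completeness, v_measure

# Hypothetical example: the prediction merges transcript c into the group {a, b}.
predicted = [{"a", "b", "c"}, {"d"}]
truth = [{"a", "b"}, {"c"}, {"d"}]
h, c, v = pair_scores(predicted, truth)
```

In this example the prediction over-clusters, so homogeneity (1/3) is lower than completeness (1), which is the same pattern reported above for all settings.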

The detailed comparison of the groups obtained by the Guillaudeux et al. method and the best performing setting of our method shows that 495 clusters from 166 families are exactly the same for the two methods. Our method predicts 1896 ortho-isoorthologs and 466 recent paralogs, compared to Guillaudeux et al., who predict 1,408 ortho-isoorthologs and 395 recent paralogs. Our method also finds all recent paralogs inferred by Guillaudeux et al. Figure 4 shows the number of recent paralogs per species and the number of ortho-isoorthologs per species pair predicted by the two methods.

4.2 Comparison of the Proportions of Ortho-orthologs, Para-orthologs and Recent Paralogs Predicted

Table 1. Description of the dataset of 20 gene families. The total number of genes, the total number of transcripts and the numbers per species are given.
Table 2. The number of ortholog groups predicted, the ratio of isoortholog pairs to all transcript pairs, the ratio of recent paralogs pairs to all transcript pairs, the ratio of ortho-isoorthologs pairs to all isoortholog pairs divided by the ratio of gene ortholog pairs to all gene pairs, and the ratio of para-isoortholog pairs to all isoortholog pairs divided by the ratio of gene paralog pairs to all gene pairs.

We randomly selected 20 gene families composed of genes from 6 species: human, mouse, dog, dingo, cow, chicken. Table 1 describes the dataset. We used the best performing setting of our method, with MACSE to compute the multiple sequence alignment, the unitary-weights similarity scores, and the uncorrected transcript similarity measure. From a total of 1,402 transcripts, we identified 236 ortholog groups. Table 2 shows the ratio of isoorthologs and recent paralogs found between transcript pairs. In this experiment, we could identify para-isoorthologs because the dataset contains paralogous genes. Table 2 also shows the ratio of ortho-isoorthology relations normalized by the ratio of gene orthology relations and the ratio of para-isoorthology relations normalized by the ratio of gene paralogy relations. When the normalized ratio of ortho-isoorthology (resp. para-isoorthology) is greater than 1, it means more relations were predicted than expected given the ratio of gene orthology (resp. paralogy) relations to all gene pairs. In 14 out of 20 families, the normalized para-isoorthology ratio is greater than 1, and greater than the normalized ortho-isoorthology ratio. Therefore, it seems that isoorthologous transcripts tend to be more present between paralogous genes than between orthologous genes. This is consistent with previous studies that have found evidence against the ortholog conjecture in the context of gene function prediction by transferring annotations between homologous genes [18]. The ortholog conjecture proposes that orthologous genes should be preferred when making such predictions because they evolve functions more slowly than paralogous genes. Our results support considering both orthologs and paralogs to achieve higher prediction accuracy. However, this interpretation should be taken with caution, as there may be errors in the orthology/paralogy relationships between genes.

5 Conclusion

The ability to classify homology relationships between homologous transcripts in gene families is a fundamental step in studying the evolution of alternative transcripts. Identifying groups of transcripts that are isoorthologs helps to identify and study the function of transcripts conserved across multiple genes and species. It also provides a framework for gene tree correction, and for studying the impact that the evolution of transcripts through creation and loss events has on the evolution of gene functions.

In this work, we revisit the notions of orthology, paralogy and isoorthology at the transcript-level with an associated algorithm to infer ortholog groups composed of transcripts that are isoorthologs, recent paralogs, or related through a path of isoorthology and recent paralogy relations. The method provides results that are consistent with results from methods that identify conserved transcripts between orthologous genes. The results also show that the proportion of conserved transcripts between paralogous genes is not negligible compared to conserved transcripts between orthologous genes, thus justifying the relevance of further studying the relationship between the evolution of transcripts and genes in entire gene families.

The method offers many possibilities for improvement and extension. First, the quality of inference strongly depends on the quality of the multiple sequence alignment and the definition of the transcript similarity measure. Future work will explore different ways to compare transcripts. We will also explore alternative algorithms for computing ortholog groups given putative pairs of isoorthologs. Finally, the method can be extended to infer all pairwise relations between homologous transcripts of a gene family by using the ortholog groups to compute complete transcript trees, which can then be reconciled with the gene tree. The predictions can then be evaluated using the annotated functions of the proteins corresponding to transcripts.