1 Introduction

From the perspective of intrinsic sequence properties, exons carry the genetic information for protein coding and the signals for mRNA localization; exons are therefore clearly non-random. In contrast, introns are removed during pre-mRNA maturation and are not involved in translation, which is why they are sometimes treated as junk DNA. This biological evidence strongly suggests that distinguishing properties exist between exons and introns, and many computational analyses and biological studies have shown sequence disparities between them. The dichotomous splicing signals in the flanking regions of splice sites have been known for years (Zhang et al. 2005); the sequence complexity of exons has been observed to be higher than that of introns (Orlov and Potapov 2004); and significant fluctuation of nucleotide composition near exon–intron junction sites has also been found (Louie et al. 2003).

The main information available to the splicing machinery is the nucleotide composition of the pre-mRNA sequence being processed. Profiling the differences between exons and their immediate introns using only the information in a single sequence would therefore bring valuable insights to splicing research. To clarify how the splicing machinery transforms sequence information into a binary decision of intron removal and exon ligation, profiling the intrinsic differences between exons and introns is essential. The intrinsic differences between adjacent exonic and intronic flanks of splice sites have been revealed by investigating the distribution of tri-nucleotides in sequences (Liou and Huang 2008); this work linked the compositional heterogeneity between the flanks to tri-nucleotide repeats, which then served as single sequence-based features for exon/intron discrimination (Liou and Huang 2009). As many splicing regulatory cis-elements have been identified, a natural next step is to integrate the available information into a predictive framework that simulates how the splicing machinery identifies exons and introns (Wang and Burge 2008). Knowledge-driven methodologies, which focus on profiling or modeling the related biological evidence, are better suited to identifying biologically meaningful biomarkers (Chen et al. 2009), and some efforts have been made toward this goal. For example, SpliceIT encodes sequence features relevant to splicing events in vivo to identify splicing signals (Malousi et al. 2010), and about 200 sequence features were combined to decipher the code for alternative splicing (Barash et al. 2010).

In this paper, the landscapes of the exonic and intronic regions flanking splice sites are carefully surveyed, and the pieces of biological evidence are integrated into an exon/intron disparity framework (EIDF). A simple binary classifier is implemented according to the proposed EIDF; it discriminates exons from their immediate introns using only the information in a single sequence. Nine data sets, comprising sequences manually collected from genes of six model organisms and sequences from three well-known databases, were prepared to validate the usefulness of the proposed EIDF. The results demonstrate the power of the proposed EIDF for discriminating exons from introns and also provide some valuable insights into the structure of genes.

2 Related works

The identified heterogeneity between exons and introns comes mainly from biological evidence and computational analysis. The biological evidence provides hints for discriminating between them; for example, the basic differences between exons and introns in terms of tri-nucleotides can be explained by the circular code theory (Arquèsa and Michel 1996) and by the sharp transition at the flanking regions of splice sites (Zhan 1998). Many exon/intron discrimination methods are based on the analysis of sequence composition, such as consensus sequences (Weir and Rice 2004), oligo-nucleotide frequencies (Claverie and Bougueleret 1986; Claverie et al. 1990; Solovyev et al. 1994; Louie et al. 2003), base/codon/triplet usage (Zhan 1998), the sequence determinants of splice sites (Mengeritsky and Smith 1989) and a multi-source recognition method that combines consensus features with statistical differences in base usage (Nakata et al. 1985). However, none of the above methods is single sequence-based, and hence the vision of simulating the processes of the splicing machinery has yet to be fulfilled.

3 Methods

3.1 The exon/intron disparity framework (EIDF)

Each splice site sequence consists of an exon (coding sequence) and an intron (noncoding sequence). A meta-analysis of these sequences is performed, which examines the related biological evidence from different perspectives, including the genetic code, circular code theory, the reading frames of coding sequences and constitutive/alternative splicing. The blueprint of the meta-analysis is shown in Fig. 1. The most significant features characterizing exons are the genetic codes, which are responsible for the surveillance of protein synthesis; the most significant features characterizing introns are the splicing signals, which help the surveillance of pre-mRNA splicing. The properties of coding sequences and genetic codes are closely related to codon usage bias, RNY preference and circular code theory, which are in turn related to reading frames and intron phases.

Fig. 1 The disparities between exons and introns

The blueprint shown in Fig. 1 is systematically surveyed through a bottom-up analysis, which leads to the exon/intron disparity framework (EIDF) depicted in Fig. 2. The construction of the EIDF starts from a basic computational question: what is the dominant base composition in terms of mono-/di-/tri-nucleotides? The dominant base composition is then identified from the related biological evidence and promoted to abstract concepts. These concepts are systematically quantified via micro-view profiling of sequences. Finally, the disparity profiles are formulated as equations, which form the concrete implementation of the exon/intron disparities. The EIDF represents a way to discriminate an exon from its immediate intron in terms of mono-/di-/tri-nucleotide composition.

Fig. 2 The exon/intron disparity framework (EIDF)

3.2 Meta-analysis of mono-nucleotide composition

In general, any genomic sequence can be uniquely described by two independent RNA languages: the strong/weak (S/W, where S = G/C and W = A/U) bases and the purine/pyrimidine (R/Y, where R = A/G and Y = C/U) bases. The two distributions are invariant under the transforms of the RNA group in some sense, which indicates that they are inherent to the sequences (Zhan 1997). Moreover, coding sequences prefer the S/W bases, whereas noncoding regions (e.g., introns, 5′-UTR, 3′-UTR) are often rich in R/Y bases (Luo and Ji 1997). The variances of the S/W codes and the R/Y codes, denoted by \(M_{\rm SW}\) and \(M_{\rm RY}\) (M stands for mono-nucleotide), are defined in Eqs. 1 and 2, where \(\mu\) is the average number of A/C/G/U in the specific single sequence and \(N_{x}\) is the number of base x (i.e., A, C, G or U) in the sequence. The disparity in terms of mono-nucleotide composition, \(D_{1}\), is the mean of the variances \(M_{\rm SW}\) and \(M_{\rm RY}\), as defined in Eq. 3.

$$ M_{\rm SW}=\frac{1}{2}\left(\left((N_{\rm C}+N_{\rm G})-\mu\right)^{2}+\left((N_{\rm A}+N_{\rm U})-\mu\right)^{2}\right) $$
(1)
$$ M_{\rm RY}=\frac{1}{2}\left(\left((N_{\rm A}+N_{\rm G})-\mu\right)^{2}+\left((N_{\rm C}+N_{\rm U})-\mu\right)^{2}\right) $$
(2)
$$ D_{1}=\frac{\left(M_{\rm SW}+M_{\rm RY}\right)}{2} $$
(3)
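Eqs. 1 to 3 can be illustrated with a minimal Python sketch; it is not the authors' implementation. The function name `mono_disparity` and the optional `mu` argument are hypothetical. By default the sketch assumes that \(\mu\) is the mean count of the four bases (sequence length divided by 4), which is one literal reading of the definition above; a different \(\mu\) can be passed explicitly.

```python
from collections import Counter


def mono_disparity(seq, mu=None):
    """Return (M_SW, M_RY, D_1) for a single sequence, following Eqs. 1-3."""
    # Treat DNA input as RNA so that T is counted as U.
    counts = Counter(seq.upper().replace("T", "U"))
    n = {base: counts.get(base, 0) for base in "ACGU"}
    if mu is None:
        # Assumption: mu is the mean count of the four bases (L / 4).
        mu = sum(n.values()) / 4
    # Eq. 1: strong (C, G) versus weak (A, U) bases.
    m_sw = 0.5 * ((n["C"] + n["G"] - mu) ** 2 + (n["A"] + n["U"] - mu) ** 2)
    # Eq. 2: purine (A, G) versus pyrimidine (C, U) bases.
    m_ry = 0.5 * ((n["A"] + n["G"] - mu) ** 2 + (n["C"] + n["U"] - mu) ** 2)
    # Eq. 3: mean of the two variances.
    return m_sw, m_ry, (m_sw + m_ry) / 2
```

For example, `mono_disparity("GGGG")` uses \(\mu = 1\) under the default assumption, whereas `mono_disparity("GGGG", mu=2)` evaluates the same equations with the mean of the two group counts.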

3.3 Meta-analysis of di-nucleotide composition

Many significant di-nucleotides in exonic and intronic regions can be derived from the properties of coding and noncoding sequences. The RNY preference (Nikolaou and Almirantis 2004) in coding sequences suggests that the YR doublet (Y stands for C/U and R for A/G) will be abundant in exons (i.e., RNYRNYRNYRNY...), and thus CA/UA/CG/UG are candidate frequent di-nucleotides. In DNA sequences, the frequency of the di-nucleotide CpG varies markedly among genes (Bulmer 1987). Introns are similar to intergenic regions, and thus the frequencies of CpG islands in exons (coding sequences) and introns (noncoding sequences) are expected to be very different. Therefore, CpG is added to the candidate set of distinguishing di-nucleotides. From the perspective of constitutive splicing, natural selection is expected to decrease the frequency of GpU and ApG near the 5′ and 3′ ends of exons to prevent the appearance of cryptic splice sites; this is the so-called site avoidance (in exons) (Eskesen et al. 2004). From the perspective of alternative splicing, a clear preference of isoforms was identified (Bortfeldt et al. 2008), which showed the potential tandem repeats of GpU/ApG di-nucleotides in introns. Thus, GpU and ApG are added to the di-nucleotide candidate set. The intronic region near the acceptor site is C/U rich, which implies that CpC/CpU/UpC/UpU di-nucleotides will occur more frequently there than in other regions, and thus all four di-nucleotides are added to the candidate set. A sensitivity analysis then filters out some of the above candidates, retaining only the di-nucleotides with significant discrimination power. The significant exonic dimers, denoted by \(D_{2}^{E}\) in Eq. 4, are {CA, CG} for donor sites and {CG, GA} for acceptor sites, whereas the significant intronic dimers, denoted by \(D_{2}^{I}\) in Eq. 5, are {GU, UA} for donor sites and {AG, UA, UC, CU, UU} for acceptor sites. The dimer disparity \(D_{2}^{*}\) is defined as the difference between the exonic and intronic dimer counts, normalized by the number of di-nucleotide positions, as shown in Eq. 6.

$$ D_{2}^{E}=\left\{\begin{array}{ll} N_{\rm CA}+N_{\rm CG}, & {\rm if}\;{\rm Site}={\rm Donor}\\ N_{\rm CG}+N_{\rm GA}, & {\rm if}\;{\rm Site}={\rm Acceptor} \end{array}\right. $$
(4)
$$ D_{2}^{I}=\left\{\begin{array}{ll} N_{\rm GU}+N_{\rm UA}, & {\rm if}\;{\rm Site}={\rm Donor}\\ N_{\rm AG}+N_{\rm UA}+N_{\rm UC}+N_{\rm CU}+N_{\rm UU}, &{\rm if}\;{\rm Site}={\rm Acceptor} \end{array}\right. $$
(5)
$$ D_{2}^{*}=(D_{2}^{E}-D_{2}^{I})/(L-1),\quad {\rm where}\;L\;{\rm is}\;{\rm sequence\;length}. $$
(6)
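A minimal Python sketch of Eqs. 4 to 6 follows; it is an illustration under stated assumptions rather than the authors' code. The names `EXONIC`, `INTRONIC`, `count_dimer` and `dimer_disparity` are hypothetical. The sketch assumes that di-nucleotides are counted at every overlapping position, which matches the L - 1 positions in the denominator of Eq. 6.

```python
# Significant dimers per splice-site type, taken from Eqs. 4 and 5.
EXONIC = {"donor": ("CA", "CG"), "acceptor": ("CG", "GA")}
INTRONIC = {"donor": ("GU", "UA"), "acceptor": ("AG", "UA", "UC", "CU", "UU")}


def count_dimer(seq, dimer):
    """Count occurrences of a dimer over all overlapping positions (assumed)."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dimer)


def dimer_disparity(seq, site):
    """Return D_2^* of Eq. 6 for a sequence flanking a 'donor' or 'acceptor' site."""
    if site not in EXONIC:
        raise ValueError("site must be 'donor' or 'acceptor'")
    seq = seq.upper().replace("T", "U")  # treat DNA T as RNA U
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    d_exon = sum(count_dimer(seq, d) for d in EXONIC[site])      # Eq. 4
    d_intron = sum(count_dimer(seq, d) for d in INTRONIC[site])  # Eq. 5
    return (d_exon - d_intron) / (len(seq) - 1)                  # Eq. 6
```

A positive value indicates that the exonic dimers outnumber the intronic ones in the sequence, and a negative value indicates the opposite.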

3.4 Meta-analysis of tri-nucleotide composition

Tri-nucleotide preference has important implications for analyzing pre-mRNA sequences and characterizing splice sites (Staden and McLachlan 1982; Nikolaou and Almirantis 2004; Willie and Majewski 2004). Tri-nucleotide repeats, which usually occur as tandem repeats around splice sites (Zhuo et al. 2007), have also been shown to be closely related to splicing regulation (Parmley and Hurst 2007); mutually symmetric and complementary triplets provide hints for distinguishing coding sequences from noncoding ones (Nikolaou and Almirantis 2003). The tri-nucleotide disparity \(D_{3}\) is derived via a circular code-based analysis, described in Algorithm 1. In lines 1 to 3, the function codoncount finds the frequencies of all 64 codons in each of the three reading frames of the sequence. After the frequencies are sorted (line 4), the codons that do not appear in any of the three reading frames are excluded by the function DeleteEntry (line 5). CircularCodes is the set of codons that appear in the same reading frame (line 6). If such a reading frame preference is observed, the tri-nucleotide is rewarded: its frequency is increased by half of the original frequency, which is the meaning of the factor \(\frac{3}{2}\) in line 7. \(D_{3}\) is the weighted sum of the frequencies of the tri-nucleotides in each of the reading frames (line 8): the weight \(+\frac{1}{2}\) indicates that appearing in a second reading frame is acceptable but less important, whereas the weight \(-\frac{1}{4}\) treats appearance in the third reading frame as unacceptable and penalizes it.

Algorithm 1 Circular-code based tri-nucleotide disparity

1. \(F\left(:,1\right)=codoncount\left(Seq,'Reading Frame 1'\right);\)

2. \(F\left(:,2\right)=codoncount\left(Seq,'Reading Frame 2'\right);\)

3. \(F\left(:,3\right)=codoncount\left(Seq,'Reading Frame 3'\right);\)

4. \(F_{sorted}=sort\left(F\right);\)

5. \(F_{sorted}^{\prime}=DeleteEntry(find(F(:,1)==0 \&\& F(:,2)==0 \&\& F(:,3)==0));\)

6. Let \(CircularCodes=find(F_{sorted}^{\prime}(:,2)==0);\)

7. \(CircularCodes(i,1)=CircularCodes(i,1)\times\frac{3}{2};\)

8. \(D_{3}=sum\left(F_{sorted}(:,1)+\frac{1}{2}F_{sorted}(:,2)-\frac{1}{4}F_{sorted}(:,3)\right);\)
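One possible reading of Algorithm 1 is sketched below in Python; it is an interpretation, not the authors' implementation. Three points are assumptions: the sort in line 4 is taken to order the three frame counts of each codon from largest to smallest; the factor \(\frac{3}{2}\) is applied before the weighted sum; and the sum runs over the retained codons only. The helper `codon_count` is a hypothetical stand-in for the codoncount call in lines 1 to 3 and uses 0-based frame offsets.

```python
from itertools import product

# All 64 codons over the RNA alphabet.
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]


def codon_count(seq, frame):
    """Count non-overlapping triplets read from the 0-based offset `frame`."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip triplets containing ambiguous symbols
            counts[codon] += 1
    return counts


def trinucleotide_disparity(seq):
    """Return D_3 under the assumptions stated in the lead-in."""
    seq = seq.upper().replace("T", "U")
    frames = [codon_count(seq, k) for k in range(3)]        # lines 1-3
    d3 = 0.0
    for codon in CODONS:
        # Line 4 (assumed): order this codon's frame counts, largest first.
        first, second, third = sorted((f[codon] for f in frames), reverse=True)
        if first == 0:                                      # line 5: codon absent
            continue
        if second == 0:                                     # lines 6-7: one frame only
            first *= 1.5
        d3 += first + 0.5 * second - 0.25 * third           # line 8
    return d3
```

Under this reading, a codon confined to one reading frame contributes 1.5 times its count, whereas a codon spread evenly over all three frames contributes only 1.25 times its per-frame count.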

3.5 The simple EIDF classifier

The quantities \(D_{3}\) and \(D_{2}\) defined above are exon-dominated factors, and \(D_{1}\) is an intron-dominated factor. After the disparities of mono-/di-/tri-nucleotide composition are formulated, a simple EIDF classifier is implemented to perform a bipartite comparison between an exon and its immediate intron within a single sequence. The discrimination function aims to maximize \(EIDF_{exon}\) (i.e., to minimize \(EIDF_{intron}\)); therefore, \(D_{3}\) and \(D_{2}\) are treated as positive factors and \(D_{1}\) as a negative factor. Accordingly, the EIDF classifier is defined by the simple additive model shown in Eq. 7. The classifier performs the bipartite comparison between exons and their immediate introns according to Eq. 8. In general, the value of \(EIDF_{exon}^{i}\) (for the ith exon) is expected to be larger than that of \(EIDF_{intron}^{i}\) (for the immediate intron of the ith exon). The result of each comparison is either 1 or 0; thus, for a data set comprising n exon/intron pairs, the support of the tests is the fraction of positive results, as defined in Eq. 9.

$$ EIDF=D_{3}+D_{2}-D_{1} $$
(7)
$$Support_{i}=\left\{\begin{array}{ll} 1, & {\rm if}\;EIDF_{exon}^{i}>EIDF_{intron}^{i}\\ 0,& {\rm else}. \end{array}\right.$$
(8)
$$ Support=\frac{\sum_{i=1}^{n}Support_{i}}{n} $$
(9)
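Eqs. 7 to 9 can be sketched in a few lines of Python. The function names `eidf` and `support` are hypothetical, and the three disparities are taken as precomputed inputs; the sketch assumes that the \(D_{2}\) of Eq. 7 is the normalized dimer disparity of Eq. 6.

```python
def eidf(d1, d2, d3):
    """Eq. 7: additive score with D_3 and D_2 positive and D_1 negative."""
    return d3 + d2 - d1


def support(pairs):
    """Eqs. 8-9: fraction of pairs whose exon score exceeds the intron score.

    `pairs` is an iterable of (EIDF_exon, EIDF_intron) tuples, one per exon
    and its immediate intron.
    """
    results = [1 if exon > intron else 0 for exon, intron in pairs]  # Eq. 8
    if not results:
        raise ValueError("at least one exon/intron pair is required")
    return sum(results) / len(results)                               # Eq. 9
```

Because Eq. 8 uses a strict inequality, a tie between the exonic and intronic scores counts as 0 in this sketch.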

4 Results

4.1 Data sets

Nine different data sets were used to test the performance of the EIDF classifier. First, three data sets were collected from the UCI repository (Arthur and Newman 2007), SpliceDB (Burset et al. 2001) and HS3D (Pollastro and Rampone 2002). The numbers of experimental sequences in these three data sets are listed in Table 1; they were used to validate the basic performance of the proposed EIDF classifier. The other six data sets were manually extracted from the Xpro database (Gopalan et al. 2004) and cover higher eukaryotes (human, mouse, rat) and lower eukaryotes (C. elegans, Drosophila, Arabidopsis); they were prepared to investigate the regional disparities between exons and introns. The numbers of experimental sequences in these six data sets are listed in Table 2.

Table 1 Experimental sequences collected from three public databases
Table 2 Experimental sequences manually extracted from the Xpro database

4.2 Basic performance of EIDF classifier

The EIDF classifier was first tested on the UCI, SpliceDB and HS3D data sets, and the results are listed in Table 3. The average supports reached 0.89 and 0.92, which indicates a significantly heterogeneous base composition between the exonic and intronic flanking regions of splice sites. The strongly negative Spearman rank correlations between the intronic EIDF values and the differences between exonic and intronic EIDF values provide another insight into exon/intron discrimination. Taking the HS3D experiments as an example, the EIDF values of true exons and true introns are shown in the left part of Fig. 3.

Table 3 EIDF classifier on three public data sets
Fig. 3 The EIDF values of true Exon–Donor–Intron sequences (top) and the EIDF values of true Intron–Acceptor–Exon sequences (bottom)

4.3 The regional disparity

To identify the regions that most significantly discriminate exons from introns, the sequences were investigated under restricted lengths. The discrimination analysis was performed on restricted exonic/intronic regions flanking the splice sites of six model organisms (the data sets are listed in Table 2), with restricted lengths ranging from 11 to 68 bp (to accommodate the calculation of the tri-nucleotide disparity, the sequence length is set to 3k+2 base pairs). As shown in Fig. 4, the degree of disparity for the donor site increases with sequence length, whereas the exons and introns flanking the acceptor site show significant disparity within a range of about 30 bases. The regional disparities reveal that the exonic regions and their immediate intronic regions have different sequence properties.

Fig. 4 The regional disparity between exonic and intronic regions flanking splice sites. The y axis is the support of the EIDF classifier and the x axis is the restricted sequence length starting from the splice sites
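The restricted-length setup of this analysis can be sketched as follows. This is an illustrative Python fragment, not the authors' procedure: the function names and the `boundary` indexing convention are hypothetical. A length of 3k+2 has the convenient property that each of the three reading frames contains exactly k complete triplets.

```python
def restricted_lengths(shortest=11, longest=68):
    """Lengths of the form 3k+2 between the two bounds, inclusive."""
    return [n for n in range(shortest, longest + 1) if n % 3 == 2]


def restricted_flanks(seq, boundary, length, site):
    """Return (exonic, intronic) flanks of `length` bases around a splice site.

    `boundary` is the 0-based index of the first base downstream of the
    splice site (a convention assumed for this sketch).
    """
    upstream = seq[max(0, boundary - length):boundary]
    downstream = seq[boundary:boundary + length]
    if site == "donor":       # exon lies upstream, intron downstream
        return upstream, downstream
    if site == "acceptor":    # intron lies upstream, exon downstream
        return downstream, upstream
    raise ValueError("site must be 'donor' or 'acceptor'")
```

Scoring the two flanks at each restricted length and recording the resulting support would give one point per length on a curve like those described for Fig. 4.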

4.4 Robustness of EIDF classifier

To validate the robustness of the proposed EIDF classifier, an extended data set was prepared that consists of genes with large numbers of exons. The complicated exon/intron organization of these genes, listed in Table 4, provides a chance to test the robustness of the EIDF classifier. In Table 4, GID is the gi number of the selected gene in NCBI; nExon is the number of exons in the gene; Seq_len is the sequence length; and Window_Size is the size of the sliding window (in base pairs), set to two-thirds of the length of the shortest intron in that gene. The supports are shown in Table 5; the high ratios of correct discriminations demonstrate the robustness of the EIDF classifier.

Table 4 Example genes with complicated exon/intron organization
Table 5 Supports of EIDF classifier on complicated genes

5 Discussion and conclusions

The precise processes of intron removal and exon ligation during pre-mRNA splicing are still far from well understood; most related work has been devoted to discovering sequence features or identifying discriminative motifs from the results of sequence analysis. In this paper, an EIDF is proposed that draws on the diverse base disparities between exons and introns. The EIDF is implemented as a simple binary classifier, and the experimental results show that it is effective and reliable in distinguishing exons from introns using only the information in a single sequence. The results suggest that the proposed EIDF is a valuable profile of splice site sequences and can serve as a new basis for investigating them. The proposed EIDF demonstrates a new methodological paradigm for exon/intron discrimination, which provides new insights into pre-mRNA splicing events and reveals the possibility of simulating the processes of the splicing machinery in silico.