Background

Introns separate eukaryotic genes into exons [1, 2]. After their likely origin as selfish elements [3], introns subsequently evolved into beneficial components in eukaryotic genomes [4,5,6]. Historical debates concerning the evolutionary history of introns led to the “introns-first-hypothesis” which proposes that introns were already present in the last common ancestor of all eukaryotes [3, 7]. Although this putative ancestral genome is inferred to be intron-rich, several plant genomes accumulated more introns during their evolution, generating highly fragmented gene structures with average intron numbers between six and seven [8]. Introner elements (IEs) [9], which behave similarly to transposable elements, are one possible mechanism for the amplification of introns [10]. Early introns probably originated from self-splicing class II introns [3, 11] and evolved into passive elements that require removal by eukaryote-specific molecular machineries [11]. No class II introns were identified in the nuclear genomes of sequenced extant eukaryotes [11] except for mitochondrial DNA (mtDNA) insertions [12, 13].

The removal of these introns during pre-mRNA processing is a complex and expensive step, which involves 5 snRNAs and over 150 proteins building the spliceosome [14]. In fact, a major U2 [15] and a minor U12 spliceosome [16] remove different intron types from eukaryotic pre-mRNAs [17]. The major U2 spliceosome mostly recognises canonical GT-AG introns, but is additionally reported to remove AT-AC class I introns [18]. Non-canonical AT-AC class II introns are spliced by the minor U12 spliceosome, which is also capable of removing some GT-AG introns [18, 19]. Highly conserved cis-regulatory sequences are required for the correct spliceosome recruitment to designated splice sites [20,21,22]. Although these sequences pose potential for deleterious mutations [4], some intron positions are conserved between very distant eukaryotic species like Homo sapiens and Arabidopsis thaliana [23].

Among the most important recognition sequences of spliceosomes are dinucleotides at both ends of spliceosomal introns which show almost no variation from GT at the 5′ end and AG at the 3′ end, respectively [24]. Different types of alternative splicing generate diversity at the transcript level by combining exons in different combinations [25]. This process results in a substantially increased diversity of peptide sequences [2, 26]. Special splicing cases, e.g. utilizing a single nucleotide within an intron for recursive splicing [27] or generating circular RNAs [28], are called non-canonical splicing events [25] and build an additional layer of RNA and proteomic diversity. If this process is based on splice sites differing from GT-AG, those splice sites are called non-canonical. Non-canonical splice sites were first identified before genome sequences became available on a massive scale (reviewed in [29]). GC-AG and AT-AC are classified as major non-canonical splice site combinations, while all deviations from these sequences are deemed to be minor non-canonical splice sites. More recently, advances in sequencing technologies and the development of novel sequence alignment tools have enabled a systematic investigation of non-canonical splicing events [25, 30]. Comprehensive genome sequence assemblies and large RNA-Seq data sets are publicly available. Dedicated split-read aligners like STAR [31, 32] are able to detect non-canonical splice sites during the alignment of RNA-Seq reads to genomic sequences. Numerous differences in annotated non-canonical splice sites even between accessions of the same species [30] as well as the extremely low frequency of all non-canonical splice sites indicate that sequencing, assembly, and annotation are potential major sources of erroneously inferred splice sites [29, 30, 33]. Distinguishing functional splice sites from degraded sequences such as in pseudogenes is also still an unsolved issue. Nonetheless, the combined number of currently inferred minor non-canonical splice site combinations is even higher than the number of the major non-canonical AT-AC splice site combinations [30, 34].

Here, we analysed 121 whole genome sequences from across the entire plant kingdom to harness the power of a very large sample size and genomic variation accumulated over extensive periods of evolutionary time, to better understand splice site combinations. Although only a small proportion of splice sites are considered non-canonical, the potential number in 121 species is large. Furthermore, conservation of sequences between these species over a long evolutionary time scale may also serve as a strong indication for their functional relevance. We incorporated RNA-Seq data to differentiate between artifacts and bona fide cases of active non-canonical splice sites. Active splice sites are revealed by an RNA-Seq read alignment allowing quantification of splice site activity. We then identified homologous non-canonical splice sites across species and subjected the genes containing these splice sites to phylogenetic analyses. Conservation over a long evolutionary time, expression of the affected gene, and RNA-Seq reads spanning the predicted intron served as evidence to identify bona fide functional non-canonical splice site combinations.

Methods

Collection of data sets and quality control

Genome sequences (FASTA) and the corresponding annotation (GFF3) of 121 plant species (Additional file 1) were retrieved from the NCBI. Since all annotations were generated by GNOMON [35], these data sets should have an equal quality and thus allow comparisons between them. BUSCO v3 [36] was deployed to assess the completeness and duplication level of all sets of representative peptide sequences using the reference data set ‘embryophyta odb9’.

Classification of annotated splice sites

Genome sequences and their annotation were processed by a Python script to identify the representative transcript per gene defined as the transcript that encodes the longest polypeptide sequence [30, 37]. Like all custom Python scripts relevant for this work, it is available with additional instructions at https://github.com/bpucker/ncss2018. Genes with putative annotation errors or inconsistencies were filtered out as done before in similar analyses [38]. Focusing on the longest peptide is essential to avoid biases caused by different numbers of annotated isoforms in different species. Splice sites within the coding sequence of the longest transcripts were analyzed by extracting dinucleotides at the borders of all introns. Untranslated regions (UTRs) were avoided due to their more challenging and thus less reliable prediction [30, 39]. Splice sites and other sequences will be described based on their encoding DNA sequence (e.g. GT instead of GU for the conserved dinucleotide at the donor splice site). Based on terminal dinucleotides in introns, splice site combinations were classified as canonical (GT-AG) or non-canonical if they diverged from the canonical motif. A more detailed classification into major non-canonical splice site combinations (GC-AG, AT-AC) and all remaining minor non-canonical splice site combinations was applied. All following analyses were focused on introns and intron-like sequences of at least 20 bp.
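The classification described above can be illustrated with a minimal sketch. It is not taken from the authors' scripts: the function and constant names are hypothetical, the intron is assumed to be given as a DNA string on the transcript strand, and the handling of ambiguity characters follows the masking described for the diversity analysis below.

```python
# Illustrative sketch only; names are hypothetical, not from the published scripts.
CANONICAL = {"GT-AG"}
MAJOR_NON_CANONICAL = {"GC-AG", "AT-AC"}
MIN_INTRON_LENGTH = 20  # introns shorter than 20 bp are not analyzed


def splice_site_combination(intron_seq):
    """Return the terminal dinucleotides of an intron, e.g. 'GT-AG'."""
    seq = intron_seq.upper()
    return seq[:2] + "-" + seq[-2:]


def classify_intron(intron_seq):
    """Classify an intron by its splice site combination."""
    if len(intron_seq) < MIN_INTRON_LENGTH:
        return "too_short"
    combo = splice_site_combination(intron_seq)
    # Any character other than A, C, G, T marks an ambiguous combination.
    if set(combo) - set("ACGT-"):
        return "ambiguous"
    if combo in CANONICAL:
        return "canonical"
    if combo in MAJOR_NON_CANONICAL:
        return "major_non_canonical"
    return "minor_non_canonical"
```

For example, an intron starting with CA and ending with GG would be reported as a minor non-canonical splice site combination.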

Investigation of splice site diversity

A Python script was applied to summarize all annotated combinations of splice sites that were detected in a representative transcript. The specific profile comprising frequency and diversity of splice site combinations in individual species was analyzed. Splice site combinations containing ambiguity characters were masked from this analysis as they are most likely caused by sequencing or annotation errors. Spearman correlation coefficients were computed pairwise between the splice site profiles of two species to measure their similarity. Flanking sequences of CA-GG and GC-AG splice sites in rice were investigated, because CA-GG splice sites seemed to be the result of an erroneous alignment. The conservation of flanking sequences was illustrated based on sequence web logos constructed at https://weblogo.berkeley.edu/logo.cgi.
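The pairwise comparison of splice site profiles can be sketched as follows. This is an illustration using only the standard library, not the authors' implementation: profiles are assumed to be dictionaries mapping a splice site combination to its count, and combinations missing from one species are assumed to count as zero.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def profile_similarity(profile_a, profile_b):
    """Spearman correlation between two {combination: count} profiles."""
    # Assumption: a combination absent from one profile has count 0.
    combos = sorted(set(profile_a) | set(profile_b))
    a = [profile_a.get(c, 0) for c in combos]
    b = [profile_b.get(c, 0) for c in combos]
    return pearson(average_ranks(a), average_ranks(b))
```

Because the Spearman coefficient is the Pearson correlation of the ranks, two species with the same ordering of combinations by frequency receive a coefficient of 1 even if their absolute counts differ.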

Analysis of splice site conservation

Selected protein encoding transcript sequences with non-canonical splice sites were subjected to a search via BLASTn v2.2.28+ [40] to identify homologues in other species to investigate the conservation of splice sites across plant species. As proof of concept, one previously validated non-canonical splice site containing gene [30], At1g79350 (rna15125), was investigated in more depth. Homologous transcripts were compared based on their annotation to investigate the conservation of non-canonical splice sites across species. Exon-intron structures of selected transcripts were plotted by a Python script using matplotlib [41] to facilitate manual inspection.

Validation of annotated splice sites

Publicly available RNA-Seq data sets of different species (Additional file 2) were retrieved from the Sequence Read Archive [42]. Whenever possible, samples from different tissues and conditions were included. The selection was restricted to paired-end data sets to provide a high accuracy during the read mapping. Only species with multiple available data sets were considered for this analysis. All reads were mapped via STAR v2.5.1b [31] in 2-pass mode to the corresponding genome sequence using previously described cutoff values [43]. A Python script utilizing BEDTools v2.25.0 [44] was deployed to convert the resulting BAM files into customized coverage files. Next, the read coverage depth at all exon-intron borders was calculated based on the terminal nucleotides of an intron and the flanking exons. Splice sites were considered as supported by RNA-Seq if the read coverage depth dropped by at least 20% when moving from an exon into an intron (Additional file 3).
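The support criterion can be written down as a short sketch. The function names and the exact positions at which coverage is measured are assumptions made for illustration; only the 20% threshold is taken from the description above.

```python
def coverage_drop(exon_coverage, intron_coverage):
    """Relative drop in read coverage depth when moving from exon into intron."""
    if exon_coverage <= 0:
        return 0.0  # no expression signal, hence no measurable drop
    return (exon_coverage - intron_coverage) / exon_coverage


def is_supported(exon_coverage, intron_coverage, min_drop=0.2):
    """True if the coverage drops by at least min_drop (20% by default)."""
    return coverage_drop(exon_coverage, intron_coverage) >= min_drop
```

With an exon coverage of 100 and an intron coverage of 80, the drop is exactly 20% and the splice site would count as supported, while an intron coverage of 81 would not be sufficient.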

Phylogenetic tree construction

RbcL (large RuBisCO subunit) sequences of almost all investigated species were retrieved from the NCBI for the construction of a phylogenetic tree. MAFFT v.7 [45] was deployed to generate an alignment which was trimmed to a minimal occupancy of 60% in each alignment column and finally subjected to FastTree v.2.1.10 [46] for tree construction. Species without an available RbcL sequence were integrated manually by constructing subtrees based on scientific names via phyloT (https://phylot.biobyte.de/). Due to these manual adjustments, the branch lengths in the resulting tree are not accurate and only the topology (Additional file 4) was considered for further analyses.
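The occupancy-based trimming step can be illustrated with a small sketch. It is not the tool that was actually used; the alignment is assumed to be a dictionary of equally long gapped sequences, and '-' is assumed to be the only gap character.

```python
def trim_alignment(aligned, min_occupancy=0.6, gap_chars="-"):
    """Keep alignment columns in which at least min_occupancy of all
    sequences carry a non-gap character."""
    names = list(aligned)
    seqs = [aligned[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col for col in range(length)
        if sum(seq[col] not in gap_chars for seq in seqs) / len(seqs) >= min_occupancy
    ]
    return {name: "".join(aligned[name][col] for col in keep) for name in names}
```

Columns that are mostly gaps are dropped for all sequences at once, so the trimmed sequences remain aligned to each other.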

Intron length analyses

Stress-related gene IDs of A. thaliana were retrieved from the literature [47] and corresponding genes in the NCBI annotations were identified through reciprocal best BLAST hits as previously described [48]. Lengths of introns in these stress genes were compared against an equal number of randomly selected intron lengths from all remaining genes using the Wilcoxon test as implemented in the Python module scipy. Average values of the stress gene intron lengths as well as the randomly selected intron lengths were compared. This random selection and the following comparison were repeated 100 times to correct for random effects.
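The repeated random selection can be sketched as below. This illustration covers only the resampling and the comparison of average values; the Wilcoxon test itself is left out, and the function name, arguments, and the optional seed are hypothetical.

```python
import random
from statistics import mean


def repeated_mean_comparison(stress_lengths, other_lengths, repeats=100, seed=None):
    """Compare the mean stress gene intron length against the mean of an
    equally sized random sample of other intron lengths, repeated several times.

    Returns one difference (stress mean minus sample mean) per repetition.
    """
    rng = random.Random(seed)
    stress_mean = mean(stress_lengths)
    differences = []
    for _ in range(repeats):
        # Sample without replacement, as many introns as in the stress set.
        sample = rng.sample(other_lengths, len(stress_lengths))
        differences.append(stress_mean - mean(sample))
    return differences
```

Differences that scatter around zero across the repetitions would indicate that introns in stress-related genes are not systematically longer than introns in the remaining genes.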

Minor non-canonical splice site combinations without ambiguous bases in introns longer than 5 kb were counted and compared against their frequency in shorter introns. After ranking all splice site combinations by this ratio, the frequency of the four bases A, C, G, and T was analyzed in correlation to their position in this list.

Comparison of non-canonical splice sites to overall sequence variation

A previously generated variant data set [48] was used to identify the general pattern of mutation and variant fixation between the two A. thaliana accessions Columbia-0 and Niederzenz-1. All homozygous SNPs in a given VCF file were considered for the calculation of nucleotide substitution rates. Corresponding substitution rates were calculated for all minor non-canonical splice sites by assuming they originated from the closest sequence among GT-AG, GC-AG, and AT-AC. General substitution rates in a species were compared against the observed substitution in minor non-canonical splice sites via Chi2 test.
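Assigning each minor non-canonical splice site combination to its closest reference can be sketched as follows. This is an illustration, not the published script: closeness is assumed to mean the number of mismatching positions, and ties are resolved in favour of the reference listed first, which is an arbitrary choice made here.

```python
from collections import Counter

REFERENCE_COMBINATIONS = ("GT-AG", "GC-AG", "AT-AC")


def mismatches(a, b):
    """Number of positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_reference(combo):
    """Reference combination with the fewest mismatches (ties: first listed)."""
    return min(REFERENCE_COMBINATIONS, key=lambda ref: mismatches(combo, ref))


def inferred_substitutions(combos):
    """Count the (original base, observed base) substitutions implied by
    deriving each combination from its closest reference."""
    counts = Counter()
    for combo in combos:
        ref = closest_reference(combo)
        for ref_base, observed_base in zip(ref, combo):
            if ref_base != observed_base:
                counts[(ref_base, observed_base)] += 1
    return counts
```

The resulting substitution counts can then be compared against the genome-wide substitution pattern, and the mismatch count doubles as a simple measure of divergence from the canonical or major non-canonical combinations.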

Results

Genomic properties of plants and diversity of non-canonical splice sites

Comparison of all genomic data sets revealed an average GC content of 36.3%, an average percentage of 7.8% of protein encoding sequence, and on average 95.7% of complete BUSCO genes (Additional file 5). Across all 121 genomes, a genome contains on average 27,232 genes with 4.5 introns per gene. The number of introns per gene was only slightly reduced to 4.15 when only introns enclosed by coding exons were considered for this analysis.

Our investigation of these 121 plant genome sequences revealed a huge variety of different non-canonical splice site combinations (Additional files 6 and 7). Nevertheless, the vast majority of annotated introns display the canonical GT-AG dinucleotides at their borders. Despite the presence of a large number of non-canonical splice sites in almost all plant genomes, the types present and their frequencies vary widely between species (Additional file 8). A phylogenetic signal in this data set is weak if it is present at all. The total number of splice site combinations ranged between 1505 (Bathycoccus prasinos) and 372,164 (Brassica napus). Algae displayed a very low number of minor non-canonical splice site combinations, but some genome annotations of land plants, e.g. Medicago truncatula, also did not contain any minor non-canonical splice site combinations without ambiguity characters. Camelina sativa displayed the highest number of minor non-canonical splice site combinations (2902). There is a clear positive correlation between the number of non-canonical splice site combinations and the total number of splice sites (Spearman correlation coefficient = 0.53, p-value = 5.5*10^-10). However, there is almost no correlation between the number of splice sites and the genome size (Additional file 9).

Non-canonical splice sites are likely to be similar to canonical splice sites

There is a negative correlation between the frequency of non-canonical splice site combinations and their divergence from canonical sequences (r = −0.4297, p-value = 7*10^-13; Fig. 1; Additional file 7). Splice sites with one difference to a canonical splice site are more frequent than more diverged splice sites. A similar trend can be observed around the major non-canonical splice sites AT-AC (Fig. 2) and the canonical GT-AG. Comparison of the overall nucleotide substitution rate in the plant genome and the divergence of minor non-canonical splice sites from canonical or major non-canonical splice sites revealed significant differences (p-value = 0, Chi2 test). For example, the substitutions of A by C and A by G were observed with a similar frequency at splice sites, while the substitution of A by G is almost three times as likely as the A by C substitution between the A. thaliana accessions Col-0 and Nd-1.

Fig. 1

Correlation between splice site sequence divergence and frequency. Spearman correlation coefficient between the splice site combination divergence from the canonical GT-AG and their frequency is r = −0.4297 (p-value = 7*10^-13)

Fig. 2

Splice site combination frequency. The frequencies of selected splice site combinations across 121 plant species are displayed. Splice site combinations with high similarity to the canonical GT-AG or the major non-canonical GC-AG/AT-AC are more frequent than other splice site combinations

The genome-wide distribution of genes with non-canonical splice sites did not reveal striking patterns. When looking at the chromosome-level genome sequences of A. thaliana, B. vulgaris, and V. vinifera (Additional files 10, 11 and 12), there were slightly fewer genes with non-canonical splice sites close to the centromeres. However, the total number of genes was reduced in these regions as well, so this reduction likely reflects the lower gene content.

One interesting species-specific property was the high frequency of non-canonical CA-GG splice site combinations in Oryza sativa which is accompanied by a low frequency of the major non-canonical GC-AG splice sites. In total, 233 CA-GG splice site combinations were identified. However, the transcript sequences can be aligned in a different way to support GC-AG sites close to and even overlapping with the annotated CA-GG splice sites. RNA-Seq reads supported 224 of these CA-GG splice sites. Flanking sequences of CA-GG and GC-AG splice sites were extracted and aligned to investigate the reason for these erroneous transcript alignments (Additional file 13). An additional G directly downstream of the 3′ AG splice site was only present when this splice site was predicted as GG. Cases where the GC-AG was predicted lack this G thus preventing the annotation of a CA-GG splice site combination.

Non-canonical splice sites in single copy genes

To assess the impact of gene copy number on the presence of non-canonical splice sites, we compared a group of presumably single copy genes against all other genes. The average percentage of genes with non-canonical splice sites among single copy BUSCO genes was 11.4%. The average percentage among all genes was only 10.4%. This uncorrected difference between both groups is statistically significant (p-value = 0.04, Mann-Whitney U test), but species-specific effects were obvious. While the percentage in some species is almost the same, other species show a much higher percentage of genes with non-canonical splice sites among BUSCO genes (Additional file 14). A couple of species displayed an inverted situation, having fewer genes with non-canonical splice sites among the BUSCO genes than the genome-wide average.

Intron analysis

Length distributions of introns with canonical and non-canonical splice site combinations are similar in most regions (Fig. 3). However, there are three striking differences between both distributions: i) the higher abundance of very short introns with non-canonical splice sites, ii) the lower peak at the most frequent intron length (around 200 bp), and iii) the high percentage of introns with non-canonical splice sites that are longer than 5 kb. These distributions indicate that non-canonical splice sites are more frequent in introns that deviate from the average length. Although the total number of introns longer than 5 kb is much higher for canonical splice sites, the proportion of introns longer than 5 kb is on average at least twice as high among introns with non-canonical splice sites as among introns with canonical splice site combinations. These differences between both distributions are significant (Wilcoxon test, p-value = 0.02). Although differences in the frequency of non-canonical splice site combinations in introns longer than 5 kb exist, no clear pattern of preferred motifs was detected. However, it seems that G might be underrepresented in frequent splice site combinations in these long introns.

Fig. 3

Intron length distribution. Length distribution of introns with canonical (green) and non-canonical (red) splice site combinations are displayed. Values of all species are combined in this plot resulting in a consensus curve. Most striking differences are (1) at the intron length peak around 200 bp where non-canonical splice site combinations are less likely and (2) at very long intron lengths where introns with non-canonical splice sites are more likely

Stress-related genes were checked for increased intron sizes, because non-canonical splice site combinations might be associated with stress-response. Comparison of stress-related genes in A. thaliana, Beta vulgaris, Brassica oleracea, B. napus, B. rapa, and Vitis vinifera did not reveal a substantially increased intron size in these genes.

The likelihood of having a non-canonical splice site in a gene is almost perfectly correlated with the number of introns (Additional file 15). Analyzing this correlation across all plant species resulted in a sufficiently large sample size to see this effect even in genes with about 40 introns. Insufficient sample sizes kept us from investigating it for genes with even more introns.

Conservation of non-canonical splice sites

Non-canonical splice site combinations detected in A. thaliana Col-0 were compared to single nucleotide polymorphisms of 1135 accessions which were studied as part of the 1001 genomes project. Of 1296 non-canonical splice site combinations, 109 overlapped with listed variant positions. At 21 of those positions, the majority of all accessions displayed the Col-0 allele, while the remaining 88 positions were dominated by other alleles.

To differentiate between randomly occurring non-canonical splice sites (e.g. sequencing errors) and true biological variation, the conservation of non-canonical splice sites across multiple species can be analyzed. This approach was demonstrated for the selected candidate At1g79350 (rna15125). Manual inspection revealed that non-canonical splice sites were conserved in three positions in many putative homologous genes across various species (Additional file 16).

RNA-Seq-based validation of annotated splice sites

RNA-Seq reads of 35 different species (Additional file 2) were mapped to the respective genome sequence to allow the validation of splice sites based on changes in the read coverage depth (Additional files 3 and 17). Validation ratios of all splice sites ranged from 75.5% in Medicago truncatula to 96.4% in Musa acuminata. A moderate correlation (r = 0.46) between the amount of RNA-Seq reads and the ratio of validated splice sites was observed (Additional file 18). When only considering non-canonical splice sites, the validation ranged from 15.2 to 91.3% displaying a similar correlation with the amount of sequencing reads. Based on validated splice sites, the proportion of different splice site combinations was analyzed across all species (Fig. 4). The average percentages are approximately 98.7% for GT-AG, 1.2% for GC-AG, 0.06% for AT-AC, and 0.09% for all other minor splice site combinations. Medicago truncatula, Oryza sativa, Populus trichocarpa, Monoraphidium neglectum, and Morus notabilis displayed substantially lower validation values for the major non-canonical splice sites.

Fig. 4

Splice site frequency. Occurrences of the canonical GT-AG, the major non-canonical GC-AG and AT-AC as well as the combined occurrences of all minor non-canonical splice sites (others) are displayed. The proportion of GT-AG is about 98.7%. There is some variation, but most species show GC-AG at about 1.2% and AT-AC at 0.06%. All others combined usually account for about 0.09%

Quantification of splice site usage

Based on mapped RNA-Seq reads, the usage of different splice sites was quantified (Fig. 5; [49]). Canonical GT-AG splice site combinations displayed the strongest RNA-Seq read coverage drop when moving from an exon into an intron (Additional file 3). There was a substantial difference in average splice site usage between 5′ and the 3′ ends of GT-AG introns. The same trend holds true for major non-canonical GC-AG splice site combinations, while the total splice site usage is lower. Major non-canonical AT-AC and minor non-canonical splice sites did not show a difference between 5′ and 3′ end. However, the total usage values of AT-AC are even lower than the values of GC-AG splice sites.

Fig. 5

Usage of splice sites. Usage of splice sites was calculated based on the number of RNA-Seq reads supporting the exon next to a splice site and the number of reads supporting the intron containing the splice site. There is a substantial difference between the usage of 5′ and 3′ splice sites in favor of the 5′ splice sites. Canonical GT-AG splice site combinations are used more often than major or minor non-canonical splice site combinations. Sample size (n) and median (m) of the usage values are given for all splice sites

There is a significant correlation between the usage of a 5′ splice site and the corresponding 3′ splice site. However, the Spearman correlation coefficient varies between all four groups of splice sites ranging from 0.42 in minor non-canonical splice site combinations to 0.82 in major non-canonical AT-AC splice site combinations.

In order to provide an example for the usage of minor non-canonical splice sites under stress conditions, four individual RNA-Seq data sets of B. vulgaris were processed separately. They represent two comparisons: control vs. salt and control vs. high light [50]. The number of RNA-Seq supported minor non-canonical splice site combinations increased between control and stress conditions from 17 to 19 and from 21 to 24, respectively. GT-TA and AA-TA were only supported by RNA-Seq reads derived from samples under stress conditions.

Discussion

This inspection of non-canonical splice sites annotated in plant genome sequences was performed to capture the diversity and to assess the validity of these annotations, because previous studies indicate that annotations of non-canonical splice sites are a mixture of artifacts and bona fide splice sites [29, 34, 51]. Our results update and expand previous systematic analyses of non-canonical splice sites in smaller data sets [29, 30, 33, 34]. An extended knowledge about non-canonical splice sites in plants could benefit gene predictions [30, 52], as novel genome sequences are often annotated by lifting an existing annotation.

Confirmation of bona fide splicing from minor non-canonical combinations

Our analyses supported a variety of different non-canonical splice sites matching previous reports of bona fide non-canonical splice sites [29, 30, 34, 51]. Frequencies of different minor non-canonical splice site combinations are not random and vary between different combinations. Those combinations similar to the canonical combination or the major non-canonical splice site combinations are more frequent. Furthermore, our RNA-Seq analyses demonstrate the actual use of non-canonical splice sites, revealing a huge variety of different transcripts derived from non-canonical splice sites, which may be evolutionarily significant. Although some non-canonical splice sites may be located in pseudogenes, the transcriptional activity and accurate splicing at most non-canonical splice sites indicates functional relevance e.g. by contributing to functional diversity as previously postulated [2, 25, 26]. These findings are consistent with published reports that have demonstrated functional RNAs generated from non-canonical splice sites [30, 53].

In general, the pattern of non-canonical splice sites is very similar between species with major non-canonical splice sites accounting for most cases of non-canonical splicing. While the average across plants of 98.7% GT-AG canonical splice sites is in agreement with recent reports for A. thaliana [30], it is slightly lower than 99.2% predicted for mammals [33] or 99.3% as previously reported for Arabidopsis based on cDNAs [54]. In contrast, the frequency of major non-canonical GC-AG splice sites in plants is almost twice the value reported for mammals [33]. Most importantly the proportion of 0.09% minor non-canonical splice site combinations in plants is substantially higher than the estimation of 0.02% initially reported for mammals [33]. Taking these findings together, both major and minor non-canonical splice sites could be a more significant phenomenon of splicing in plants than in animals. This hypothesis would be consistent with the notion that splicing in plants is a more complex and diverse process than that occurring in metazoan lineages [55,56,57]. An in-depth investigation of non-canonical splice sites in animals and fungi would be needed to validate this hypothesis.

Species-specific differences in minor non-canonical splice site combinations

As previous studies on non-canonical splice sites were often focused on one species [54] or a few model organisms [33, 34, 38], the observed variation among the plant genomes investigated here updates the current knowledge and reveals potential species-specific differences. However, small numbers of non-canonical splice sites in some species might prevent the detection of phylogenetic patterns in the genome-wide analysis. Nevertheless, conserved non-canonical splice site positions exist as presented on the gene level for At1g79350. Differences in the availability of hints in the gene prediction process and variation in the assembly quality might contribute to the observed differences in the number of non-canonical splice sites between closely related species.

The group of minor non-canonical splice sites displayed the largest variation between species, and a frequent non-canonical splice site combination (CA-GG) which appeared peculiar to O. sativa is probably due to an alignment error. In other words, the predicted CA-GG splice site combinations in rice can be conceived as major non-canonical GC-AG events by just splitting the transcript sequence in a different way during the alignment over the intron. An additional downstream G at the 3′ splice site seems to be responsible for leading to this annotation, because cases where GC-AG was correctly annotated do not display this G in the respective position. Dedicated alignment tools are needed to bioinformatically distinguish these events [58], otherwise manual inspection must be used to correctly resolve these situations.

Despite all artifacts described here and elsewhere [29, 33, 59], non-canonical splice sites seem to have conserved functions as indicated by conservation over long evolutionary periods displayed as presence in homologous sequences in multiple species [23, 29]. Our own analyses across multiple accessions of A. thaliana support this conjecture and suggest that some non-canonical splice sites are conserved in homologous loci at the intra-specific level. At the same time, there is intra-specific variability [30] that might be attributed to the accumulation of mutations prior to purifying selection. Assessing the variability within a species could be an additional approach to distinguish bona fide splice sites from artifacts or recent mutations.

Putative mechanisms for processing of minor non-canonical splice sites

We sought to understand possible correlations with minor non-canonical splice site combinations in order to understand the mechanisms driving their occurrence. Therefore, we explored the impact of genomic position relative to centromeres, the effect of gene copy number, and the impact of intron length. The occurrence of non-canonical splice sites is reduced with proximity to the centromere, but this is likely due to reduced gene content in centromeric regions. Averaged across all species, there is a significantly higher proportion of non-canonical sites in single copy genes, but species-specific differences contradict this observation, suggesting that gene copy number is not an important determinant. However, non-canonical splice sites may be more important in splicing very long introns, because they appear in introns above 5 kb with a higher relative likelihood than canonical splice sites. Further investigations are needed to validate the observed lack of G in these splice site combinations and to identify an underlying pattern if it exists. When looking for an association of long introns with stress-related genes, no significant increase in their intron sizes was observed. However, it is still possible that these long introns belong to genes which were not previously described in relation to stress.

Previous studies postulated different non-spliceosomal removal mechanisms for such introns including the IRE1 / tRNA ligase system [60, 61] and short direct repeats leading to transcriptional slippage [62, 63]. It should be mentioned that many sequence variants of snRNAs are encoded in plant genomes [64]. The presence of multiple spliceosome types in addition to the canonical U2 and the non-canonical U12 spliceosome could be another explanation [38].

Another hypothesis suggests parasitic splice sites using neighbouring recognition sites for the splicing machinery to enable their processing [33]. The mere presence of GT close to the 5′ non-canonical splice site and AG close to the 3′ non-canonical splice site might be sufficient for this process to take place. These non-canonical splice sites are expected to be in frame with the associated GT-AG signals which could be responsible for recruiting the splicing machinery [33]. This hypothesis is supported by the observation that splice sites seem to be missed sometimes thus leading to the use of the next splice site which is usually in frame with the original one [54]. Further investigation might connect neighbouring sequences to the processing of minor non-canonical splice sites.
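A possible first test of this hypothesis is to search for canonical dinucleotides at in-frame distances from a non-canonical splice site. The following minimal sketch assumes 0-based coordinates, with `site` pointing at the first base of the splice site dinucleotide, and interprets "in frame" as an offset that is a multiple of three. The function name and the window size of 12 nt are illustrative assumptions.

```python
from typing import List


def in_frame_motif_offsets(seq: str, site: int, motif: str, window: int = 12) -> List[int]:
    """Return offsets relative to `site` (multiples of 3, excluding 0, within
    +/- `window`) at which `motif` occurs in `seq`."""
    seq = seq.upper()
    motif = motif.upper()
    hits = []
    for offset in range(-window, window + 1):
        # Skip the splice site itself and all out-of-frame positions
        if offset == 0 or offset % 3 != 0:
            continue
        pos = site + offset
        # Skip positions where the motif would run past the sequence ends
        if pos < 0 or pos + len(motif) > len(seq):
            continue
        if seq[pos:pos + len(motif)] == motif:
            hits.append(offset)
    return hits


# Toy sequences (invented): a non-canonical CT donor at index 10
in_frame_seq = "A" * 10 + "CT" + "A" * 4 + "GT" + "A" * 10      # GT at offset +6
out_of_frame_seq = "A" * 10 + "CT" + "A" * 3 + "GT" + "A" * 10  # GT at offset +5

print(in_frame_motif_offsets(in_frame_seq, 10, "GT"))
print(in_frame_motif_offsets(out_of_frame_seq, 10, "GT"))
```

The same function could be applied to the 3′ splice site with the motif AG. An excess of in-frame hits over out-of-frame hits across many minor non-canonical splice sites would be compatible with the parasitic splice site hypothesis.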

There is no evidence yet that RNA editing modifies splice sites, but previous studies found that modifications of mRNAs are necessary to enable proper splicing in some cases [65]. Even though such a system is probably not in place for all minor non-canonical splice sites, a modification of nucleotides in the transcript would be another way to regulate gene expression at the post-transcriptional level.

Although these hypotheses could be additional or alternative explanations for the situation observed in O. sativa, considering the CA-GG cases as annotation and alignment errors seems more likely due to their unique presence in this species.

Usage of non-canonical splice sites

Our results could provide a strong foundation for further analyses of the splicing process by providing detailed information about the frequency at which splicing occurred at a certain splice site. The results indicate that the usage of different splice site types could vary substantially. A possible explanation for these observed differences is the mixture of RNA-Seq data sets, which contains samples from various tissues and different environmental or physiological conditions. Sequencing reads reflect the splicing events occurring under these specific conditions. As previously indicated by several reports, non-canonical splice sites might be more frequently used under stress conditions [25, 51, 63]. As most plants are unable to escape environmental conditions by movement, a higher frequency of non-canonical splice sites in sessile plant species compared to other taxonomic groups should be assessed in the future to explore whether there may be a link between non-canonical splice frequency and life habit.

The observation of a stronger usage of the donor splice site over the acceptor splice site in GT-AG and GC-AG splice site combinations matches previous reports where one donor splice site can be associated with multiple acceptor splice sites [54, 66]. The absence of this effect at minor non-canonical splice site combinations might hint towards a different splicing mechanism, which is restricted to precisely one combination of donor and acceptor splice site.
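The association of one donor with multiple acceptors can be quantified from a list of splice junctions. The minimal sketch below assumes that junctions are given as (chromosome, donor position, acceptor position) tuples, for example as extracted from split read alignments. The function names and the toy coordinates are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Junction = Tuple[str, int, int]  # (chromosome, donor position, acceptor position)
SiteMap = Dict[Tuple[str, int], Set[int]]


def partners_per_site(junctions: Iterable[Junction]) -> Tuple[SiteMap, SiteMap]:
    """Map each donor to its set of acceptors and each acceptor to its set of donors."""
    donors: SiteMap = defaultdict(set)
    acceptors: SiteMap = defaultdict(set)
    for chrom, donor_pos, acceptor_pos in junctions:
        donors[(chrom, donor_pos)].add(acceptor_pos)
        acceptors[(chrom, acceptor_pos)].add(donor_pos)
    return donors, acceptors


def mean_partners(site_map: SiteMap) -> float:
    """Average number of distinct partner sites per splice site."""
    if not site_map:
        return 0.0
    return sum(len(partners) for partners in site_map.values()) / len(site_map)


# Toy junctions (invented): one donor paired with three alternative acceptors
junctions = [
    ("Chr1", 100, 500),
    ("Chr1", 100, 520),
    ("Chr1", 100, 560),
    ("Chr1", 900, 1400),
]

donors, acceptors = partners_per_site(junctions)
print(mean_partners(donors), mean_partners(acceptors))
```

A higher mean number of acceptors per donor than donors per acceptor would reflect the pattern described for GT-AG and GC-AG introns, while values close to one for both would match the pattern at minor non-canonical splice site combinations.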

The observed usage of GT-TA and AA-TA splice site combinations under stress conditions in contrast to control conditions, as well as the slight increase in the number of supported minor non-canonical splice site combinations, require further testing, e.g. in other species or under different stress conditions. It would be interesting to validate the usage of different splice sites in response to stress and not just the expression of stress-related genes. In principle, it would be possible to assess the usage of splice sites under diverse environmental or developmental conditions as performed in this study for different plant species. While numerous RNA-Seq data sets are available per species, these analyses would require a large number of data sets generated under identical or at least similar conditions. Therefore, the identification of splicing variants dedicated to certain stress responses is beyond the scope of this work.

Limitations of the current analyses

Some constraints limit the power of the presented analyses. In accordance with the important plant database Araport11 [37] and previous analyses [30], only the transcript encoding the longest peptide sequence was considered per gene when investigating splice site conservation across species. Although the exclusion of alternative transcripts was necessary to compensate for differences in the annotation quality, more non-canonical splice sites could be revealed by investigations of all transcript versions in the future. The exclusion of annotated introns shorter than 20 bp as well as the minimal intron length cutoff of 20 bp during the RNA-Seq read mapping prevented the investigation of very small introns. There are reports of experimentally validated introns with a minimal length of 56 bp [67]. Although recent reports indicate a minimal intron length around 30 bp in humans [68] or even down to 10 bp [51], it is unclear if very short sequences should be called introns. Since spliceosomal removal of these very short sequences via lariat formation seems unlikely, a new terminology might be needed. The applied length cutoff was selected to avoid previously reported issues with false positives [51]. However, de novo identification of very short introns as recently performed for Mus musculus and H. sapiens [51, 69] could become feasible as RNA-Seq data sets based on similar protocols become available for a broad range of plant species. Variations between RNA-Seq samples posed another challenge. Since there is a substantial amount of variation within species [70, 71], we can assume that small differences in the genetic background of the analyzed material could bias the results. Splice sites of interest might be canonical splice site combinations in some accessions or subspecies, respectively, while they are non-canonical in others. Despite our attempts to collect RNA-Seq samples derived from a broad range of different conditions and tissues for each species, data of many specific physiological states are missing for most species. Therefore, we cannot exclude that certain non-canonical splice sites were missed in our splice site usage analysis due to a lack of gene expression under the investigated conditions.

Future perspectives

As costs for RNA-Seq data generation drop over the years [72], improved analyses will become possible over time. Investigation of homologous non-canonical splice sites poses several difficulties, as the exonic sequence is not necessarily conserved. Due to upstream changes in the exon-intron structure [73], the number of the non-canonical introns can differ between species. However, a computationally feasible approach to investigate the phylogeny of all non-canonical splice sites would significantly enhance our knowledge, e.g. about the emergence and loss of non-canonical splice sites. Experimental validation of splice sites in vivo and in vitro could be the next step. It is crucial for such analyses to avoid biases introduced by reverse transcription artifacts, e.g. by comparing different enzymes and avoiding random hexamers during cDNA synthesis [74]. Splice sites could be experimentally validated, e.g. by integration in the Aequorea victoria GFP sequence [75], to see if they are functional in plants. Our analyses support the concept that differences between plant species need to be taken into account when performing such investigations [76, 77].

Conclusion

Non-canonical splice site combinations are present and appear to be functionally relevant in most plants, although at low abundance. The frequency of splice site combinations decreases with the divergence from the canonical GT-AG combination; however, this pattern cannot be explained by simple accumulation of random mutations. RNA-Seq reads show a stronger conservation of the 5′ splice site when compared to the 3′ splice site, indicating the presence of multiple alternative 3′ splice sites. Initial analyses indicate variations in the usage of minor non-canonical splice sites under certain stress conditions, but further investigations are needed to understand the impact of environmental factors or developmental stages on usage of minor non-canonical splice sites.