Genome-wide analyses supported by RNA-Seq reveal non-canonical splice sites in plant genomes
Most eukaryotic genes comprise exons and introns thus requiring the precise removal of introns from pre-mRNAs to enable protein biosynthesis. U2 and U12 spliceosomes catalyze this step by recognizing motifs on the transcript in order to remove the introns. A process which is dependent on precise definition of exon-intron borders by splice sites, which are consequently highly conserved across species. Only very few combinations of terminal dinucleotides are frequently observed at intron ends, dominated by the canonical GT-AG splice sites on the DNA level.
Here we investigate the occurrence of diverse combinations of dinucleotides at predicted splice sites. Analyzing 121 plant genome sequences based on their annotation revealed strong splice site conservation across species, annotation errors, and true biological divergence from canonical splice sites. The frequency of non-canonical splice sites clearly correlates with their divergence from canonical ones indicating either an accumulation of probably neutral mutations, or evolution towards canonical splice sites. Strong conservation across multiple species and non-random accumulation of substitutions in splice sites indicate a functional relevance of non-canonical splice sites. The average composition of splice sites across all investigated species is 98.7% for GT-AG, 1.2% for GC-AG, 0.06% for AT-AC, and 0.09% for minor non-canonical splice sites. RNA-Seq data sets of 35 species were incorporated to validate non-canonical splice site predictions through gaps in sequencing reads alignments and to demonstrate the expression of affected genes.
We conclude that bona fide non-canonical splice sites are present and appear to be functionally relevant in most plant genomes, although at low abundance.
KeywordsGene structure Splicing Annotation Comparative genomics Transcriptomics Gene expression Natural diversity Evolution
Introns separate eukaryotic genes into exons [1, 2]. After their likely origin as selfish elements , introns subsequently evolved into beneficial components in eukaryotic genomes [4, 5, 6]. Historical debates concerning the evolutionary history of introns led to the “introns-first-hypothesis” which proposes that introns were already present in the last common ancestor of all eukaryotes [3, 7]. Although this putative ancestral genome is inferred to be intron-rich, several plant genomes accumulated more introns during their evolution generating the highly fragmented gene structures with average intron numbers between six and seven . Introner elements (IEs) , which behave similar to transposable elements, are one possible mechanism for the amplification of introns . Early introns probably originated from self-splicing class II introns [3, 11] and evolved into passive elements, that require removal by eukaryote-specific molecular machineries . No class II introns were identified in the nuclear genomes of sequenced extant eukaryotes  except for mitochondrial DNA (mtDNA) insertions [12, 13].
The removal of these introns during pre-mRNA processing is a complex and expensive step, which involves 5 snoRNAs and over 150 proteins building the spliceosome . In fact, a major U2  and a minor U12 spliceosome  are removing different intron types from eukaryotic pre-mRNAs . The major U2 spliceosome mostly recognises canonical GT-AG introns, but is additionally reported to remove AT-AC class I introns . Non-canonical AT-AC class II introns are spliced by the minor U2 spliceosome, which is also capable of removing some GT-AG introns [18, 19]. Highly conserved cis-regulatory sequences are required for the correct spliceosome recruitment to designated splice sites [20, 21, 22]. Although these sequences pose potential for deleterious mutations , some intron positions are conserved between very distant eukaryotic species like Homo sapiens and Arabidopsis thaliana .
Among the most important recognition sequences of spliceosomes are dinucleotides at both ends of spliceosomal introns which show almost no variation from GT at the 5′ end and AG at the 3′ end, respectively . Different types of alternative splicing generate diversity at the transcript level by combining exons in different combinations . This process results in a substantially increased diversity of peptide sequences [2, 26]. Special splicing cases e.g. utilizing a single nucleotide within an intron for recursive splicing  or generating circular RNAs  are called non-canonical splicing events  and build an additional layer of RNA and proteomic diversity. If this process is based on splice sites differing from GT-AG those splice sites are called non-canonical. Non-canonical splice sites were first identified before genome sequences became available on a massive scale (reviewed in ). GC-AG and AT-AC are classified as major non-canonical splice site combinations, while all deviations from these sequences are deemed to be minor non-canonical splice sites. More recently, advances in sequencing technologies and the development of novel sequence alignment tools now enable a systematic investigation of non-canonical splicing events [25, 30]. Comprehensive genome sequence assemblies and large RNA-Seq data sets are publicly available. Dedicated split-read aligners like STAR [31, 32] are able to detect non-canonical splice sites during the alignment of RNA-Seq reads to genomic sequences. Numerous differences in annotated non-canonical splice sites even between accessions of the same species  as well as the extremely low frequency of all non-canonical splice sites indicate that sequencing, assembly, and annotation are potential major sources of erroneously inferred splice sites [29, 30, 33]. Distinguishing functional splice sites from degraded sequences such as in pseudogenes is also still an unsolved issue. Nonetheless, the combined number of currently inferred minor non-canonical splice site combinations is even higher than the number of the major non-canonical AT-AC splice site combinations [30, 34].
Here, we analysed 121 whole genome sequences from across the entire plant kingdom to harness the power of a very large sample size and genomic variation accumulated over extensive periods of evolutionary time, to better understand splice site combinations. Although, only a small number of splice sites are considered as non-canonical, the potential number in 121 species is large. Furthermore, conservation of sequences between these species over a long evolutionary time scale may also serve as a strong indication for their functional relevance. We incorporated RNA-Seq data to differentiate between artifacts and bona fide cases of active non-canonical splice sites. Active splice sites are revealed by an RNA-Seq read alignment allowing quantification of splice site activity. We then identified homologous non-canonical splice sites across species and subjected the genes containing these splice sites to phylogenetic analyses. Conservation over a long evolutionary time, expression of the effected gene, and RNA-Seq reads spanning the predicted intron served as evidence to identify bona fide functional non-canonical splice site combinations.
Collection of data sets and quality control
Genome sequences (FASTA) and the corresponding annotation (GFF3) of 121 plant species (Additional file 1) were retrieved from the NCBI. Since all annotations were generated by GNOMON , these data sets should have an equal quality and thus allow comparisons between them. BUSCO v3  was deployed to assess the completeness and duplication level of all sets of representative peptide sequences using the reference data set ‘embryophyta odb9’.
Classification of annotated splice sites
Genome sequences and their annotation were processed by a Python script to identify the representative transcript per gene defined as the transcript that encodes the longest polypeptide sequence [30, 37]. Like all custom Python scripts relevant for this work, it is available with additional instructions at https://github.com/bpucker/ncss2018. Genes with putative annotation errors or inconsistencies were filtered out as done before in similar analyses . Focusing on the longest peptide is essential to avoid biases caused by different numbers of annotated isoforms in different species. Splice sites within the coding sequence of the longest transcripts were analyzed by extracting dinucleotides at the borders of all introns. Untranslated regions (UTRs) were avoided due to their more challenging and thus less reliable prediction [30, 39]. Splice sites and other sequences will be described based on their encoding DNA sequence (e.g. GT instead of GU for the conserved dinucleotide at the donor splice site). Based on terminal dinucleotides in introns, splice site combinations were classified as canonical (GT-AG) or non-canonical if they diverged from the canonical motif. A more detailed classification into major non-canonical splice site combinations (GC-AG, AT-AC) and all remaining minor non-canonical splice site combinations was applied. All following analyses were focused on introns and intron-like sequences equal or greater than 20 bp.
Investigation of splice site diversity
A Python script was applied to summarize all annotated combinations of splice sites that were detected in a representative transcript. The specific profile comprising frequency and diversity of splice site combinations in individual species was analyzed. Splice site combinations containing ambiguity characters were masked from this analysis as they are most likely caused by sequencing or annotation errors. Spearman correlation coefficients were computed pairwise between the splice site profiles of two species to measure their similarity. Flanking sequences of CA-GG and GC-AG splice sites in rice were investigated, because CA-GG splice sites seemed to be the result of an erroneous alignment. The conservation of flanking sequences was illustrated based on sequence web logos constructed at https://weblogo.berkeley.edu/logo.cgi.
Analysis of splice site conservation
Selected protein encoding transcript sequences with non-canonical splice sites were subjected to a search via BLASTn v2.2.28+  to identify homologues in other species to investigate the conservation of splice sites across plant species. As proof of concept, one previously validated non-canonical splice site containing gene , At1g79350 (rna15125), was investigated in more depth. Homologous transcripts were compared based on their annotation to investigate the conservation of non-canonical splice sites across species. Exon-intron structures of selected transcripts were plotted by a Python script using matplotlib  to facilitate manual inspection.
Validation of annotated splice sites
Publicly available RNA-Seq data sets of different species (Additional file 2) were retrieved from the Sequence Read Archive . Whenever possible, samples from different tissues and conditions were included. The selection was restricted to paired-end data sets to provide a high accuracy during the read mapping. Only species with multiple available data sets were considered for this analysis. All reads were mapped via STAR v2.5.1b  in 2-pass mode to the corresponding genome sequence using previously described cutoff values . A Python script utilizing BEDTools v2.25.0  was deployed to convert the resulting BAM files into customized coverage files. Next, the read coverage depth at all exon-intron borders was calculated based on the terminal nucleotides of an intron and the flanking exons. Splice sites were considered as supported by RNA-Seq if the read coverage depth dropped by at least 20% when moving from an exon into an intron (Additional file 3).
Phylogenetic tree construction
RbcL (large RuBisCO subunit) sequences of almost all investigated species were retrieved from the NCBI for the construction of a phylogenetic tree. MAFFT v.7  was deployed to generate an alignment which was trimmed to a minimal occupancy of 60% in each alignment column and finally subjected to FastTree v.2.1.10  for tree construction. Species without an available RbcL sequence were integrated manually by constructing subtrees based on scientific names via phyloT (https://phylot.biobyte.de/). Due to these manual adjustments, the branch lengths in the resulting tree are not accurate and only the topology (Additional file 4) was considered for further analyses.
Intron length analyses
Stress-related gene IDs of A. thaliana were retrieved from the literature  and corresponding genes in the NCBI annotations were identified through reciprocal best BLAST hits as previously described . Lengths of introns in these stress genes were compared against an equal number of randomly selected intron lengths from all remaining genes using the Wilcoxon test as implemented in the Python module scipy. Average values of the stress gene intron lengths as well as the randomly selected intron lengths were compared. This random selection and the following comparison were repeated 100 times to correct for random effects.
Minor non-canonical splice site combinations without ambiguous bases in introns longer than 5 kb were counted and compared against their frequency in shorter introns. After ranking all splice site combinations by this ratio, the frequency of the four bases A, C, G, and T was analyzed in correlation to their position in this list.
Comparison of non-canonical splice sites to overall sequence variation
A previously generated variant data set  was used to identify the general pattern of mutation and variant fixation between the two A. thaliana accessions Columbia-0 and Niederzenz-1. All homozygous SNPs in a given VCF file were considered for the calculation of nucleotide substitution rates. Corresponding substitution rates were calculated for all minor non-canonical splice sites by assuming they originated from the closest sequence among GT-AG, GC-AG, and AT-AC. General substitution rates in a species were compared against the observed substitution in minor non-canonical splice sites via Chi2 test.
Genomic properties of plants and diversity of non-canonical splice sites
Comparison of all genomic data sets revealed an average GC content of 36.3%, an average percentage of 7.8% of protein encoding sequence, and on average 95.7% of complete BUSCO genes (Additional file 5). Averaged across all 121 genomes, a genome contains an average of 27,232 genes with 4.5 introns per gene. The number of introns per gene was only slightly reduced to 4.15 when only introns enclosed by coding exons were considered for this analysis.
Our investigation of these 121 plant genome sequences revealed a huge variety of different non-canonical splice site combinations (Additional files 6 and 7). Nevertheless, most of all annotated introns display the canonical GT-AG dinucleotides at their borders. Despite the presence of a huge amount of non-canonical splice sites in almost all plant genomes, the present types and the frequencies of different types show a huge variation between species (Additional file 8). A phylogenetic signal in this data set is weak if it is present at all. The total number of splice site combinations ranged between 1505 (Bathycoccus prasinos) and 372,164 (Brasssica napus). Algae displayed a very low number of minor non-canonical splice site combinations, but other plant genome annotations within land plants also did not contain any minor non-canonical splice site combinations without ambiguity characters e.g. Medicago truncatula. Camelina sativa displayed the highest number of minor non-canonical splice site combinations (2902). There is a strong correlation between the number of non-canonical splice site combinations and the total number of splice sites (Spearman correlation coefficient = 0.53, p-value = 5.5*10− 10). However, there is almost no correlation between the number of splice sites and the genome size (Additional file 9).
Non-canonical splice sites are likely to be similar to canonical splice sites
The genome-wide distribution of genes with non-canonical splice sites did not reveal striking patterns. When looking at the chromosome-level genome sequences of A. thaliana, B. vulgaris, and V. vinifera (Additional files 10, 11 and 12), there were slightly less genes with non-canonical splice sites close to the centromeres. However, the total number of genes was reduced in these regions as well, so likely correlated with genic content.
One interesting species-specific property was the high frequency of non-canonical CA-GG splice site combinations in Oryza sativa which is accompanied by a low frequency of the major non-canonical GC-AG splice sites. In total, 233 CA-GG splice site combinations were identified. However, the transcript sequences can be aligned in a different way to support GC-AG sites close to and even overlapping with the annotated CA-GG splice sites. RNA-Seq reads supported 224 of these CA-GG splice sites. Flanking sequences of CA-GG and GC-AG splice sites were extracted and aligned to investigate the reason for these erroneous transcript alignments (Additional file 13). An additional G directly downstream of the 3′ AG splice site was only present when this splice site was predicted as GG. Cases where the GC-AG was predicted lack this G thus preventing the annotation of a CA-GG splice site combination.
Non-canonical splice sites in single copy genes
To assess the impact of gene copy number on the presence of non-canonical splice sites, we compared a group of presumably single copy genes against all other genes. The average percentage of genes with non-canonical splice sites among single copy BUSCO genes was 11.4%. The average percentage among all genes was only 10.4%. This uncorrected difference between both groups is statistically significant (p-value = 0.04, Mann-Whitney U test), but species-specific effects were obvious. While the percentage in some species is almost the same, other species show a much higher percentage of genes with non-canonical splice sites among BUSCO genes (Additional file 14). A couple of species displayed an inverted situation, having less genes with non-canonical splice sites among the BUSCO genes than the genome-wide average.
Stress-related genes were checked for increased intron sizes, because non-canonical splice site combinations might be associated with stress-response. Comparison of stress-related genes in A. thaliana, Beta vulgaris, Brassica oleracea, B.napus, B.rapa, and Vitis vinifera did not reveal a substantially increased intron size in these genes.
The likelihood of having a non-canonical splice site in a gene is almost perfectly correlated with the number of introns (Additional file 15). Analyzing this correlation across all plant species resulted in a sufficiently large sample size to see this effect even in genes with about 40 introns. Insufficient sample sizes kept us from investigating it for genes with even more introns.
Conservation of non-canonical splice sites
Non-canonical splice site combinations detected in A. thaliana Col-0 were compared to single nucleotide polymorphisms of 1135 accessions which were studied as part of the 1001 genomes project. Of 1296 non-canonical splice site combinations, 109 overlapped with listed variant positions. At 21 of those positions, the majority of all accessions displayed the Col-0 allele, while the remaining 88 positions were dominated by other alleles.
To differentiate between randomly occurring non-canonical splice sites (e.g. sequencing errors) and true biological variation, the conservation of non-canonical splice sites across multiple species can be analyzed. This approach was demonstrated for the selected candidate At1g79350 (rna15125). Manual inspection revealed that non-canonical splice sites were conserved in three positions in many putative homologous genes across various species (Additional file 16).
RNA-Seq-based validation of annotated splice sites
Quantification of splice site usage
There is a significant correlation between the usage of a 5′ splice site and the corresponding 3′ splice site. However, the Spearman correlation coefficient varies between all four groups of splice sites ranging from 0.42 in minor non-canonical splice site combinations to 0.82 in major non-canonical AT-AC splice site combinations.
In order to provide an example for the usage of minor non-canonical splice sites under stress conditions, four single RNA-Seq data sets of B. vulgaris were processed separately. They are the comparison of control vs. salt and control vs. high light . The number of RNA-Seq supported minor non-canonical splice site combinations increased between control and stress conditions from 17 to 19 and from 21 to 24, respectively. GT-TA and AA-TA were only supported by RNA-Seq reads derived from samples under stress conditions.
This inspection of non-canonical splice sites annotated in plant genome sequences was performed to capture the diversity and to assess the validity of these annotations, because previous studies indicate that annotations of non-canonical splice sites are a mixture of artifacts and bona fide splice sites [29, 34, 51]. Our results update and expand previous systematic analyses of non-canonical splice sites in smaller data sets [29, 30, 33, 34]. An extended knowledge about non-canonical splice sites in plants could benefit gene predictions [30, 52], as novel genome sequences are often annotated by lifting an existing annotation.
Confirmation of bona fide splicing from minor non-canonical combinations
Our analyses supported a variety of different non-canonical splice sites matching previous reports of bona fide non-canonical splice sites [29, 30, 34, 51]. Frequencies of different minor non-canonical splice site combinations are not random and vary between different combinations. Those combinations similar to the canonical combination or the major non-canonical splice site combinations are more frequent. Furthermore, our RNA-Seq analyses demonstrate the actual use of non-canonical splice sites, revealing a huge variety of different transcripts derived from non-canonical splice sites, which may be evolutionarily significant. Although some non-canonical splice sites may be located in pseudogenes, the transcriptional activity and accurate splicing at most non-canonical splice sites indicates functional relevance e.g. by contributing to functional diversity as previously postulated [2, 25, 26]. These findings are consistent with published reports that have demonstrated functional RNAs generated from non-canonical splice sites [30, 53].
In general, the pattern of non-canonical splice sites is very similar between species with major non-canonical splice sites accounting for most cases of non-canonical splicing. While the average across plants of 98.7% GT-AG canonical splice sites is in agreement with recent reports for A. thaliana , it is slightly lower than 99.2% predicted for mammals  or 99.3% as previously reported for Arabidopsis based on cDNAs . In contrast, the frequency of major non-canonical GC-AG splice sites in plants is almost twice the value reported for mammals . Most importantly the proportion of 0.09% minor non-canonical splice site combinations in plants is substantially higher than the estimation of 0.02% initially reported for mammals . Taking these findings together, both major and minor non-canonical splice sites could be a more significant phenomenon of splicing in plants than in animals. This hypothesis would be consistent with the notion that splicing in plants is a more complex and diverse process than that occurring in metazoan lineages [55, 56, 57]. An in-depth investigation of non-canonical splice sites in animals and fungi would be needed to validate this hypothesis.
Species-specific differences in minor non-canonical splice site combinations
As previous studies on non-canonical splice sites were often focused on one species  or a few model organisms [33, 34, 38], the observed variation among the plant genomes investigated here updates the current knowledge and revealed potential species-specific differences. However, small numbers of non-canonical splice sits in some species might prevent the detection of phylogenetic patterns in the genome-wide analysis. Nevertheless, conserved non-canonical splice site positions exist as presented on the gene level for At1g79350. Differences in the availability of hints in the gene prediction process and variation in the assembly quality might contribute to the observed differences in the number of non-canonical splice sites between closely related species.
The group of minor non-canonical splice sites displayed the largest variation between species, and a frequent non-canonical splice site combination (CA-GG) which appeared peculiar to O. sativa is probably due to an alignment error. In other words, the predicted CA-GG splice site combinations in rice can be conceived as major non-canonical GC-AG events by just splitting the transcript sequence in a different way during the alignment over the intron. An additional downstream G at the 3′ splice site seems to be responsible for leading to this annotation, because cases where GC-AG was correctly annotated do not display this G in the respective position. Dedicated alignment tools are needed to bioinformatically distinguish these events , otherwise manual inspection must be used to correctly resolve these situations.
Despite all artifacts described here and elsewhere [29, 33, 59], non-canonical splice sites seem to have conserved functions as indicated by conservation over long evolutionary periods displayed as presence in homologous sequences in multiple species [23, 29]. Our own analyses across multiple accessions of A. thaliana support this conjecture and suggest that some non-canonical splice sites are conserved in homologous loci at the intra-specific level. At the same time, there is intra-specific variability  that might be attributed to the accumulation of mutations prior to purifying selection. Assessing the variability within a species could be an additional approach to distinguish bona fide splice sites from artifacts or recent mutations.
Putative mechanisms for processing of minor non-canonical splice sites
We sought to understand possible correlations with minor non-canonical splice site combinations in order understand the mechanisms driving their occurrence. Therefore, we explored the impact of genomic position relative to centromeres, the effect of increased gene number, and the impact of intron length. The occurrence of non-canonical splice sites is reduced with proximity to the centromere, but this is likely due to reduced gene content in centromeric regions. Averaged across all species, there is a significantly higher proportion of non-canonical sites in single copy genes, but species-specific differences also violate this observation, suggesting that gene copy number is not an important determinant. However, non-canonical splice sites may be more important in splicing very long introns, because they appear in introns above 5 kb with a higher relative likelihood than canonical splice sites. Further investigations are needed to validate the observed lack of G in these splice site combinations and to identify an underlying pattern if it exists. When looking for an association of long introns with stress-related genes, no significant increase in their intron sizes was observed. However, it is still possible that these long introns belong to genes which were not previously described in relation to stress.
Previous studies postulated different non-spliceosomal removal mechanisms for such introns including the IRE1 / tRNA ligase system [60, 61] and short direct repeats leading to transcriptional slippage [62, 63]. It should be mentioned that many sequence variants of snRNAs are encoded in plant genomes . The presence of multiple spliceosome types in addition to the canonical U2 and the non-canonical U12 spliceosome could be another explanation .
Another hypothesis suggests parasitic splice sites using neighbouring recognition sites for the splicing machinery to enable their processing . The mere presence of GT close to the 5′ non-canonical splice site and AG close to the 3′ non-canonical splice site might be sufficient for this process to take place. These non-canonical splice sites are expected to be in frame with the associated GT-AG signals which could be responsible for recruiting the splicing machinery . This hypothesis is supported by the observation that splice sites seem to be missed sometimes thus leading to the use of the next splice site which is usually in frame with the original one . Further investigation might connect neighbouring sequences to the processing of minor non-canonical splice sites.
There is no evidence for RNA editing to modify splice sites yet, but previous studies found that modifications of mRNAs are necessary to enable proper splicing in some cases . Even so such a system is probably not in place for all minor non-canonical splice sites, a modification of nucleotides in the transcript would be another way to regulate gene expression at the post-transcriptional level.
Although, these hypotheses could be an additional or alternative explanations for the situation observed in O. sativa, considering the CA-GG cases as annotation and alignment errors seems more likely due to their unique presence in this species.
Usage of non-canonical splice sites
Our results could provide a strong foundation to further analyses of the splicing process by providing detailed information about the frequency at which splicing occurred at a certain splice site. The results indicate that this usage of different splice site types could vary substantially. A possible explanation for these observed differences is the mixture of RNA-Seq data sets, which contains samples from various tissues and different environmental or physiological conditions. Sequencing reads reflect the splicing events occurring under these specific conditions. As previously indicated by several reports, non-canonical splice sites might be more frequently used under stress conditions [25, 51, 63]. As most plants are unable to escape environmental conditions by movement, a higher frequency of non-canonical splice sites in sessile plant species compared to other taxonomic groups should be assessed in the future to explore whether there may be a link between non-canonical splice frequency and life habit.
The observation of a stronger usage of the donor splice site over the acceptor splice site in GT-AG and GC-AG splice site combinations is matching previous reports where one donor splice site can be associated with multiple acceptor splice sites [54, 66]. The absence of this effect at minor non-canonical splice site combinations might hint towards a different splicing mechanism, which is restricted to precisely one combination of donor and acceptor splice site.
The observed usage of GT-TA and AA-TA splice site combinations under stress conditions in contrast to control conditions as well as the slight increase in the number of supported minor non-canonical splice site combinations requires further testing e.g. in other species or under different stress conditions. It would be interesting to validate the usage of different splice sites in response to stress and not just the expression of stress-related genes. In principle, it would be possible to assess the usage of splice sites under diverse environmental or developmental conditions as performed in this study for different plant species. While numerous RNA-Seq data sets are available per species, these analyses would require a large number of data sets generated under identical or at least similar conditions. Therefore, the identification of splicing variants dedicated to certain stress responses is beyond the scope of this work.
Limitations of the current analyses
Some constraints limit the power of the presented analyses. In accordance with the important plant database Araport11  and previous analyses , only the transcript encoding the longest peptide sequence was considered per gene when investigating splice site conservation across species. Although the exclusion of alternative transcripts was necessary to compensate differences in the annotation quality, more non-canonical splice sites could be revealed by investigations of all transcript versions in the future. The exclusion of annotated introns shorter than 20 bp as well as the minimal intron length cutoff of 20 bp during the RNA-Seq read mapping prevented the investigation of very small introns. There are reports of experimentally validated introns with a minimal length of 56 bp . Although recent reports indicate a minimal intron length around 30 bp in humans  or even down to 10 bp , it is unclear if very short sequences should be called introns. Since spliceosomal removal of these very short sequences via lariat formation seems unlikely, a new terminology might be needed. The applied length cutoff was selected to avoid previously reported issues with false positives . However, de novo identification of very short introns as recently performed for Mus musculus and H. sapiens [51, 69] could become feasible as RNA-Seq data sets based on similar protocols become available for a broad range of plant species. Variations between RNA-Seq samples posed another challenge. Since there is a substantial amount of variation within species [70, 71], we can assume that small differences in the genetic background of the analyzed material could bias the results. Splice sites of interest might be canonical splice site combinations in some accessions or subspecies, respectively, while they are non-canonical in others. Despite our attempts to collect RNA-Seq samples derived from a broad range of different conditions and tissues for each species, data of many specific physiological states are missing for most species. Therefore, we cannot exclude that certain non-canonical splice sites were missed in our splice site usage analysis due to a lack of gene expression under the investigated conditions.
As costs for RNA-Seq data generation drops over the years , improved analyses will become possible over time. Investigation of homologous non-canonical splice sites poses several difficulties, as the exonic sequence is not necessarily conserved. Due to upstream changes in the exon-intron structure , the number of the non-canonical introns can differ between species. However, a computationally feasible approach to investigate the phylogeny of all non-canonical splice sites would significantly enhance our knowledge e.g. about the emergence and loss of non-canonical splice sites. Experimental validation of splice sites in vivo and in vitro could be the next step. It is crucial for such analyses to avoid biases introduced by reverse transcription artifacts e.g. by comparing different enzymes and avoiding random hexameters during cDNA synthesis . Splice sites could be experimentally validated e.g. by integration in the Aequoria vicotria GFP sequence  to see if they are functional in plants. Our analyses support the concept that differences between plant species need to be taken into account when performing such investigations [76, 77].
Non-canonical splice site combinations are present and appear to be functionally relevant in most plants, although at low abundance. The frequency of splice sites combinations decreases with the divergence from the canonical GT-AG combination, however, this pattern cannot be explained by simple accumulation of random mutations. RNA-Seq reads show a stronger conservation of the 5′ splice site when compared to the 3′ splice site indicating the presence of multiple alternative 3′ splice sites. Initial analyses indicate variations in the usage of minor non-canonical splice sites under certain stress conditions, but further investigations are needed to understand the impact of environmental factors or developmental stages on usage of minor non-canonical splice sites.
We are thankful to everyone involved in generating the data sets underlying this study.
We acknowledge support for the Article Processing Charge by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.
BP and SFB designed the research. BP performed bioinformatic analyses. BP and SFB interpreted the results and wrote the manuscript. Both authors read and approved the final version of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 4.Carmel L, Chorev M. The function of introns. Front Genet. 2012;3:55. https://doi.org/10.3389/fgene.2012.00055.
- 8.Csuros M, Rogozin IB, Koonin EV. A detailed history of intron-rich eukaryotic ancestors inferred from a global survey of 100 complete genomes. PLoS Comput Biol. 2011;7:9. https://doi.org/10.1371/journal.pcbi.1002150.
- 11.Zimmerly S, Semper C. Evolution of group II introns. Mob DNA. 2015;6:7. https://doi.org/10.1186/s13100-015-0037-5.
- 13.Pucker B, Holtgraewe D, Stadermann KB, Frey K, Huettel B, Reinhardt R, et al. A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. bioRxiv. 407627. https://doi.org/10.1101/407627.
- 18.Wu Q, Krainer AR. Splicing of a divergent subclass of AT-AC introns requires the major spliceosomal snRNAs. RNA N Y N. 1997;3:586–601.Google Scholar
- 30.Pucker B, Holtgräwe D, Weisshaar B. Consideration of non-canonical splice sites improves gene prediction on the Arabidopsis thaliana Niederzenz-1 genome sequence. BMC Res Notes. 2017;10:667. https://doi.org/10.1186/s13104-017-2985-y.
- 35.Souvorov A, Kapustin Y, Kiryutin B, Chetvernin V, Tatusova T, Lipman D. Gnomon – NCBI eukaryotic gene prediction tool. 2010. http://www.ncbi.nlm.nih.gov/core/assets/genome/files/Gnomon-description.pdf. Accessed 25 Sep 2018.
- 38.Qu W, Cingolani P, Zeeberg BR, Ruden DM. A bioinformatics-based alternative mRNA splicing code that may explain some disease mutations is conserved in animals. Front Genet. 2017;8:38. https://doi.org/10.3389/fgene.2017.00038.
- 43.Haak M, Vinke S, Keller W, Droste J, Rückert C, Kalinowski J, et al. High quality de novo transcriptome assembly of Croton tiglium. Front Mol Biosci. 2018;5:62. https://doi.org/10.3389/fmolb.2018.00062.
- 46.Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:1–10. https://doi.org/10.1371/journal.pone.0009490.
- 49.Pucker B. RNA-Seq read coverage depth of splice sites in plants. 2018. https://doi.org/10.4119/unibi/2931315. Accessed 11 Oct 2018.
- 63.Dubrovina AS, Kiselev KV, Zhuravlev YN. The role of canonical and noncanonical pre-mRNA splicing in plant stress responses. Biomed Res Int. 2013;2013:1–14. https://doi.org/10.1155/2013/264314.
- 68.Piovesan A, Caracausi M, Ricci M, Strippoli P, Vitale L, Pelleri MC. Identification of minimal eukaryotic introns through GeneBase, a user-friendly tool for parsing the NCBI gene databank. DNA Res Int J Rapid Publ Rep Genes Genomes. 2015;22:495–503.Google Scholar
- 73.Garcia-España A, Mares R, Sun T-T, DeSalle R. Intron evolution: testing hypotheses of intron evolution using the Phylogenomics of Tetraspanins. PLoS One. 2009;4:1–12. https://doi.org/10.1371/journal.pone.0004680.
- 74.Houseley J, Tollervey D. Apparent non-canonical trans-splicing is generated by reverse transcriptase in vitro. PLoS One. 2010;5:1–7. https://doi.org/10.1371/journal.pone.0012271.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.