Background

The majority of mammalian genes code for multiple transcript isoforms that contribute substantially to the vast complexity of both the mammalian transcriptome and proteome (E. T. [25, 38]). Each mature isoform is generated through a dynamic series of tightly coordinated actions that begin to occur as the nascent transcript is being synthesized [3]. The growing precursor RNA is sequentially bound by a myriad of RNA binding proteins (RNABPs) and small nucleolar RNAs (snoRNAs; reviewed in Wahl et al [37]) as the exon-intron boundaries become defined through these specific ribonucleoprotein complex interactions. The assembled ribonucleoprotein complex, termed the spliceosome, facilitates intron excision and covalent ligation of flanking exons across the gene locus, ultimately generating a mature transcript isoform.

While each exon-intron boundary inherently contains a splice site, contiguous exons are not always spliced together. Retained introns (Y. [18]), skipped exons [20], and cryptic splice sites [14] commonly diversify the profile of fully processed transcript isoforms. Splice site proximity, defined by RNA secondary structure, is a major factor in splice site selection [28]. Intron length and the presence or absence of inverted repeats can impact the physical distance between splice donor and acceptor [33]. Branch point sequence motifs [43] and nucleotides adjacent to splice sites [5] fine tune the strength of snoRNA interactions. Further, variations in polypyrimidine tracts can preferentially attract one RNABP over another [32]. An additional layer of regulation is provided by the cellular abundance and availability of individual RNABPs and snoRNAs, allowing for tissue and context specificity of RNA processing and alternative isoform expression [29].

The same splicing reaction that generates mature mRNAs can also fuse a downstream splice donor to an upstream splice acceptor (much like tying the end of a string to the beginning), in effect circularizing the transcript [34]. These circular RNAs (circRNAs) are covalently closed transcripts that inherently lack 5′ or 3′ ends, thereby enabling them to escape exonuclease destruction. This class of RNA has recently been shown to be evolutionarily conserved ([30]; P. L. [39]), highly abundant in humans, and for some genes, is the most prevalent transcript isoform [31]. The 3′ to 5′ back-splicing reaction, required for circRNA biogenesis, correlates with the speed of precursor transcript elongation (Y. [42]), occurring more frequently at splice sites flanked by long introns and introns containing reverse complementary sequences [12]. To date, little is known regarding the function of the vast majority of circRNAs. Of the relatively few that have been characterized, some have been shown to serve as microRNA sponges [11, 22], as direct regulators of parental gene expression (Z. [19]), in signaling between cells [15] and even as templates for translation [16, 24]. Further evidence of their importance in the cell is accumulating and their functions and mechanisms of action are being found to be generally quite distinct from their cognate linear counterparts [6].

Exploring the relationship between linear and circular RNA isoforms of a common parental gene can be facilitated by utilizing Next Generation Sequencing (NGS) technology. NGS based approaches have provided the framework to study the abundance of individual transcript isoforms at a large scale, allowing investigators to compare circular and linear isoform abundance. However, the majority of bioinformatic pipelines require prior knowledge of transcript structure. While useful for broad scale interpretations, these approaches fail to resolve the abundances of both linear and circular isoforms of each gene, the function of which can dramatically differ from one another. Between linear transcripts alone, alternatively spliced isoforms can code for proteins that are truncated [4], lack specific functional domains [26], have completely unique amino acid sequences [1], and in some cases, alter cell fate entirely [4, 10]. Here we present a visualization tool, SpliceV, that facilitates detailed exploration and visualization of transcript isoform expression in publication quality format. SpliceV facilitates within- and across-sample analyses and includes the display of predicted cis and trans regulatory factors to further assist in the biogenesis and function studies. Together, SpliceV should be a useful tool for a wide spectrum of the RNA biology research community.

Implementation

Our software package is written in Python 3 but is backwards compatible with Python 2.7, relying only upon the third-party libraries, matplotlib [36], and pysam. Source code can be found at https://github.com/flemingtonlab/SpliceV and can be installed from PyPI using the Python package manager, pip. SpliceV is written with a GNU 3.0 public license, provided with anonymous download and installation. Full usage information can be found in Additional file 1.

SpliceV generates plots of coverage, splice junctions, and back-splice junctions with customizable parameters, depicting expression of both the linear and circular isoforms of a given gene. Standard formats (BAM, GTF, and BED) are accepted as input files. BAM files are sequentially accessed by our software (rather than in parallel). In practice, this means that SpliceV first determines the chromosomal coordinates that mark the beginning and end of the input gene. Next, it extracts reads that fall within that range from each BAM file (one BAM file at a time). As BAM files are indexed (either prior to running SpliceV, or automatically by SpliceV), this process never requires loading of the entire file into memory, and we have no reason to believe that a personal laptop computer would have difficulty running SpliceV on many BAM files at once. Because junction calling sensitivity can be improved using specialized software, canonical and back-splice junction information can be extracted directly from BAM files or input separately as BED-formatted files containing the coordinates and quantities of each junction. The user is provided the flexibility of normalizing expression of each exon across all samples or for exon normalization to be confined within each sample (this helps visualize alternative splicing, intron retention, and exon exclusion). As introns are generally much larger than exons, an option to reduce intron size by a user-defined amount is also provided. In an effort to guide interpretation of gene specific splicing patterns, predicted or empirically determined RNA binding protein binding sites can be added to the plots (Fig. 1b-c; a stepwise tutorial to reproduce these figures is outlined in Additional file 2) by supplying a list of coordinates or utilizing the consensus binding sequences determined by Ray et al [27]. Because inverted ALU repeat elements impact RNA secondary structure, we have also incorporated the option to add a track of ALU elements to the plot.

Fig. 1
figure 1

a SpliceV plot of SPPL2A expression in Akata cells. Coverage (exon level coverage; color intensity of each exon, single nucleotide level coverage; height of the horizontal line bisecting each exon) and forward splice junctions (arches above exons) was derived from sequencing a poly(A)-selected library preparation. Back-splice junctions (curves below exons) were obtained from sequencing a ribodepleted, RNase-R treated library preparation. b SpliceV plot of FARSA in Akata cells. All junctions and coverage were derived from a ribodepleted, RNase-R treated sample. c SpliceV plot of GSE1 expression in ribodepleted, RNase-R treated (back-splice junctions) and poly(A)-selected (coverage and canonical junctions) SNU719 cells. Predicted binding sites for RNA binding proteins, RBM3, HNRNPL, HNRNPA1, PTBP1 are plotted along the FARSA transcript (b) and RBFOX1, and MATR3 sites are plotted along the GSE1 transcript (c). ALU elements are marked in (c)

Results

Multiple computational pipelines have been developed to detect and quantify circRNAs from high throughput RNA sequencing data ([13, 22, 40]; X.-O. [8, 41]). As circRNAs lack a poly(A) tail, ribodepleted library preparations are essential for circRNA detection. RNA preparations can then be treated with the exonuclease, RNase R, which exclusively digests linear RNAs, to increase the depth of circRNA coverage. To demonstrate the utility of SpliceV, we used libraries prepared from poly(A) selected (enriched for polyadenylated linear RNAs) or ribodepleted-RNase R-treated RNA from the Burkitt’s Lymphoma cell line, Akata, and the gastric carcinoma cell line, SNU719. Reads from each library were aligned using the STAR aligner v2.6.0a [7] to generate BAM and splice junction BED files. We further processed our alignments using find_circ [22] to interrogate the unmapped reads for back-splice junctions. Our first plot displays a prominent circular RNA formed via back-splicing from exon 5 to 3 of SPPL2A (Fig. 1a). For this plot, back-splicing (under arches) derived from RNase R-seq data is plotted with forward splicing (over arches) and exon level (exon color intensity) and single nucleotide level (horizontal line graph) coverage from poly(A)-RNA-seq data from Akata cells to illustrate circRNA data in the context of linear poly(A) transcript expression. Exon level coverage display provides easy visualization of selective exon utilization: for example, using forward- and back-splicing and coverage data derived from RNase R-seq data (Akata cells) show enriched coverage of the circular RNA exons 6–8 of the FARSA gene (Fig. 1b). Nevertheless, the simultaneous display of single nucleotide level coverage includes additional information that can help provide more detailed clarity in interpretation. For example, while the last exon of SPPL2A (Fig. 1a) shows low exon level coverage, there is an evident drop in single nucleotide level coverage soon after the splice acceptor site, likely illustrating the utilization of an upstream poly(A) site (3′ UTR shortening [21]). Therefore, while exon level coverage provides illustrative qualities for some more macroscopic analyses (e.g. enriched exon coverage of circularized exons (Fig. 1b in RNase R-seq data)), single nucleotide coverage provides granularity when needed.

The need that initially inspired us to develop SpliceV was the lack of available software to plot back-splicing in the context of coverage and forward splicing (for example, see Fig. 1a). This is not only useful for simple presentation of circRNA splicing information, but can also aid interpretation. For example, the display of forward splicing and coverage from poly(A)-seq data in the context of back-splicing data from RNase R-seq data for the GSE1 gene provides evidence of circle formation of exon 2 which precludes its inclusion in the cognate linear GSE1 isoform (Fig. 1c). In this case, exon 2 exclusion introduces a frameshift, ablating the canonical function of this gene.

To add utility to SpliceV in transcript biogenesis and isoform function analyses, we also incorporated the display of RNA binding protein predictions (Fig. 1b) based on empirically determined binding motifs (Ray et al) and user supplied ALU element sites (Fig. 1c). These features can assist the user in assessing the mechanisms of forward splicing, back splicing, alternative splicing, intron retention, etc. Further, since loaded RNA binding proteins control transcript localization as well as activity, these features can help assist the user in investigating transcript function.

To further illustrate the utility of SpliceV in investigational efforts, we next used SpliceV to visualize isoform level expression in two Gastric Carcinomas and one normal gastric tissue sample from The Cancer Genome Atlas (TCGA) [2]. Whole Exome Sequencing variant calls revealed that each of the two tumor samples had unique splice site mutations in the critical tumor suppressor gene, TP53 [17]. The Genomic Data Commons pipeline [9] for gene expression quantification revealed a slight increase in TP53 RNA levels in the tumor samples. Because the mutations in both tumors occurred in intronic regions, the impact on protein output is not easily determined. Using SpliceV to visualize RNA-seq data (Fig. 2), however, revealed likely haplotypic ablation of the mutated splice acceptor (Fig. 2b) or donor (Fig. 2c) site in these two samples. This led to the utilization of cryptic splice sites that produced frameshifts in each of the resulting transcripts. Also evident in sample BR-8483, based on the single nucleotide coverage line graph, is extensive intron retention, likely causing the resulting intron retained transcript to be subjected to non-sense mediated RNA decay. In both of these cases, SpliceV was able to assist in determining the negative impact of these two mutations on TP53 function, findings that are otherwise opaque to the user.

Fig. 2
figure 2

a SpliceV plot of RNA coverage and splicing from normal stomach tissue (wild type TP53). b SpliceV plot of a gastric tumor with a 1 base (T) deletion at a splice acceptor (chr17:7673610, HG38 genome build), disrupting the splicing from exon 8–9 and causing the utilization of a novel cryptic downstream splice acceptor at position chr17:7673590; resulting in a frameshift deletion. c SpliceV plot of a gastric tumor with a G > A splice donor variant at chr17:7675993. Part of the intron is retained (chr17:7675884–7,675,993) and a novel intronic splice donor site is utilized, with the same upstream acceptor. This introduces a frameshifting insertion into the protein coding sequence. Asterisks indicate the SNV location and insets are enlarged representations of the transcript structure. Nucleotide sequences at the cryptic splice sites are labeled, with the junctions occurring between the red and black bases in each figure. These samples were initially provided by The Cancer Genome Atlas [2], with alignments obtained from the Genomic Data Commons [9]

Conclusions

Here we present a new tool, SpliceV, that facilitates investigations into transcript biogenesis, isoform function and the generation of publication quality figures for the RNA biologist. SpliceV is fast (taking full advantage of the random access nature of BAM files), customizable (allowing users to control plotting aesthetics), and can filter data and make cross-sample comparisons. It is modular in structure, allowing for the inclusion of new features in future package releases. SpliceV should provide value to the toolkit of investigators studying RNA biology and function and should speed the time frame from data acquisition, data analysis to publication of results.

Availability and requirements

Project name: SpliceV

Project home page: https://github.com/flemingtonlab/SpliceV

Operating system: Platform independent

Programming language: Python

Other requirements: Python 2.7 or Python 3.0+

License: GNU Public License

Any restrictions to use by non-academics: License needed