Background

Alternative splicing (AS) is one of the important processes during RNA maturation in higher eukaryotes. By including or excluding alternative exons, AS increases the diversity of downstream RNA products. More than 90% of genes with multiple exons undergo AS [1]. The inclusion and exclusion of exons by AS shape the downstream protein diversity [2]. Furthermore, AS participates in many key biological processes, such as developmental stages [3], tissue types [4, 5], genders [6, 7], insect caste determination [8], and so on. Thus, understanding the regulation of AS is vital.

The regulation of AS relies on numerous cis-regulatory elements, including cis-acting splicing regulatory elements (SREs), splicing motifs, and enhancers. SREs include exonic/intronic splicing enhancers or silencers. Wang et al. had conducted a systematical method for the identification of these SREs [9]. Some splicing motifs have been reported to be correlated with regulation of AS. For example, Holste et al. had provided a computational framework to identify splicing motifs and to predict AS events [10]. Enhancer had also been reported to correlate to AS changes [11,12,13].

Enhancer is a cis-regulatory element known as its characteristics: high abundance in genome, regulating genes in highly variable location, and lack of discriminative DNA sequence [14]. Enhancers have been demonstrated to physically interact with promoter and polymerase during transcription elongation [15, 16]. This physical interaction shortens the distance between enhancer and gene body, and further grants enhancers an opportunity to influence AS. Previous studies had demonstrated that enhancer can affect alternative splicing. For example, the insertion of the SV40 transcriptional enhancer is capable of inhibition of inclusive form of fibronectin extra domain I [11]. Another one example is that the downstream enhancer of protocadherin alpha can loop back to bind with promoter by coupling of CTCF and further affect AS [17]. These studies had shown that enhancer is capable of affecting AS events.

A previous study suggested that most of enhancers are inactive (poised) until the proper factor binds on it [18]. Thus, it is challenging for biologists to design a high-throughput experiment to identify the enhancer-AS associations. Because there is no genome-wide study to identify the associations, in this context, we developed a bioinformatics pipeline (Fig. 1A) to find out the significant enhancer-AS associations on a genome-wide scale by analyzing large amount of human and mouse transcriptomes. We further constructed a platform entitled VEnAS (Visualization of Enhancer-associated Alternative Splicing) to present the enhancer-AS associations.

Fig. 1
figure 1

(A)Workflow or the analysis pipeline for identification of enhancer-AS associations. In the top left part of the analysis pipeline, we focused on enhancer datasets polish, including position refining and presence/absence calling. In the top right part, we focused on the processes for quantification of AS changes. We then conducted association analysis to identify enhancer-AS associations, and finally constructed a website called VEnAS for data visualization. (B) An example of refining enhancers between different tissues and cell lines.The blue boxes are representing to enhancers in different tissues or cell lines. The middle positions of enhancers are used for hierarchical clustering with centroid method. The cuttree threshold is set as 3 kilo bases. The green and orange boxes are representing to the two refined groups under the threshold

Methods

Data selection and preparation

We downloaded enhancer datasets which include 28 human and 24 mouse tissues and cell lines from enhancerAtlas [19]. These tissues and cell lines were chosen because they have at least three paired RNA-seq datasets for quantification of AS. To prevent the data imbalance, we down-sampled the number of RNA-seq datasets to three. We then downloaded the chosen 84 human (28*3) and 72 mouse (24*3) RNA-seq fastq files from Sequence Read Archive (SRA) [20]. These fastq files were mapped onto the latest genome (GRCh38 for human and GRCm38 for mouse) by HISAT2 [21] with default parameters.

Enhancer calling

The boundaries of enhancer could be incongruent due to tissue characteristics, enhancer calling methodologies, or batch effects from input data sets. Thus, refining the location of enhancers between different tissue types is required to eradicate the incongruence. To refine the enhancers between different tissues and cell lines, we took advantage of agglomerative hierarchical clustering with centroid method (Fig. 1B). We used the central position of each enhancer as input for hierarchical clustering. Previous studies had reported that the length of enhancer is ranged between 2–4 kilo bases [18, 22,23,24]. Thus, we set 3 kilo bases as a threshold to limit the growth of the clusters. After refining the location of enhancer, we were able to call the present or absent of enhancer between different tissues based on whether there is any enhancer located in the refined range.

Quantification and categorization of AS

The v94 human and mouse genome annotations were downloaded from Ensembl. CATANA [25] was used upon the human and mouse genome annotation to obtain the latest version of AS annotation. The latest AS annotation and the mapped bam files (from data preparation) were used for MISO [26] to compute percent splice in (PSI), which is an inclusion index based on the number of junction reads [27]. The equation of PSI is defined as

$${\text{PSI}} \, \text{=} \, \frac{\text{Junction reads supporting to inclusive form}}{\left(\text{Junction reads supporting to inclusive form}\text{ + Junction reads supporting to exclusive form}\right)}$$

To guarantee the AS changes of a given AS event from human 84 or mouse 72 samples are large enough, we removed the AS events with the PSI range across all samples less than 0.1. After that, we conducted Z-transformation upon all PSI values across tissues to capture the changes of a given AS between tissues. To categorize whether a tissue does have an AS change, the tissue having Z-transformed PSI value (Z-PSI) larger than 1 is defined as “inclusive shift”, while the tissue having Z-PSI smaller than -1 is defined as “exclusive shift”.

Association analysis

With the labels present/absent of enhancers and inclusive/exclusive shift of AS changes, for each enhancer-AS pair we can generate a two-by-two contingency table containing the number of samples in the four cells. We removed enhancer-AS pairs having low strength of association in the contingency table to improve the precision of the association analysis and reduce the false results. Thus, we only included the enhancer-AS pairs for analysis in which the odds ratio must be larger than 2 or less than 0.5 accompanied by the effective size constrain (the number difference between concordant and discordant cells must be larger than 10). Then the Fisher exact test was conducted exhaustively throughout all the enhancer-AS pairs to calculate the p-value. All the p-values were then adjusted by Benjamini–Hochberg procedure false discovery rate (FDR) to obtain q-values. An enhancer-AS association was considered significant if the q-value is smaller than 0.05.

Implementation of VEnAS

The VEnAS database was written by a combination of Perl, Python, and R for data processing and statistical analysis. The web server of VEnAS was implemented with a combination of PHP, Google Polymer framework, and MySQL on Ubuntu server. For efficiently storing and querying, the analysis result and other integrated data were subjected to database normalization. The schema of the normalized MySQL database table is shown in the Figure S1. The tables holding PSI and genomic location of enhancer were separated for parallel querying by MySQL. In addition, the table holding index for autocompletion during user query is shown on the top-left side of Figure S1. The keywords used for constructing index include Ensembl gene accession, gene symbol, and gene description.

Results

To identify the enhancer-AS associations on a genome-wide scale, we developed an analysis pipeline (Fig. 1A, detailed in Methods). We first curated the enhancer profiles and RNA-seq datasets of 28 human and 24 mouse tissues and cell lines for analysis. Since the profile of active enhancer is naturally varied between different tissues and cell lines [28], we refined the boundaries of enhancers and generated enhancer calling using the hierarchical clustering method. We then used CATANA and MISO to quantify and categorize AS events from RNA-seq datasets. To further check the similarity or overlapping event between different samples, we computed the Jaccard coefficient index (Figure S2). The result shows that the enhancer-AS events are quite similar within the triplicated samples under the same tissue type but different between tissues. The Fisher exact test was performed to identify the significant enhancer-AS associations with present/absent of enhancer and inclusive/exclusive shift of AS types.

Enhancer-AS associations in human and mouse

By conducting association analysis with absent/present of enhancer and inclusive/exclusive shift of AS event, we found that 3,242 human genes and 7,716 mouse genes have at least one significant enhancer-AS association, and 11,262 human enhancers and 26,083 mouse enhancers are participating in AS changes (Table 1 and Table 2). Previous study had mentioned that transcripts having alternative start and termination sites shape the major transcriptome diversity across human tissues [13]. As expected, in our results, the numbers of genes having associations between enhancers and the AS types regarding alternative transcription initiation and termination sites (AFE, ALE, ATSS and ATTS) are notably higher than the six canonical AS types (A5SS, A3SS, SE, RI, MSE, MXE) in human and mouse (Fig. 2).

Table 1 The counting table of human genes and enhancer having enhancers having enhancer-AS association for different AS types. The row “All” represents the number of genes or enhancers having associations in any types of AS. The column “input” means the number of genes or enhancers which are qualified for the analysis. The column “significant” represents the number of genes or enhancers pass the q-value smaller than 0.05
Table 2 The counting table of mouse genes and enhancer having enhancer-AS association for different AS types. The row “All” represents the number of genes or enhancers having associations in any types of AS. The column “input” means the number of genes or enhancers which are qualified for the analysis. The column “significant” represents the number of genes or enhancers pass the q-value smaller than 0.05
Fig. 2
figure 2

The counting number of genes and enhancer associated to AS changes in (A) human and (B) mouse. The number of genes having significant (FDR q-value < 0.05) enhancer-associated AS and the number of enhancers significantly (FDR q-value < 0.05) associated to AS changes are shown in x-axis. All ten types of AS are counted independently. Every gene having multiple significant events is counted once. The ten types of AS are including alternative 5’/3’ splice site (A5SS, A3SS), skipped exon (SE), retained intron (RI), multiple skipped exons (MSE), mutually exclusive exons (MXE), alternative first/last exons (AFE, ALE), and alternative transcription start/termination sites (ATSS, ATTS)

Gene and enhancer are many-to-many relationship [29]. One given gene could be associated to multiple enhancers, and vice versa. Here we would like to know that under consideration of association with AS changes, how many enhancers are associated to one given gene and how many genes are associated to one given enhancer. We further interrogated the association relationship between enhancer and genes by examining the number of enhancers per gene (also the genes per enhancer). According to the annotation from enhancerAtlas, on average, each gene is paired with 60.32 enhancers in human and 68.47 enhancers in mouse. Our association analysis suggests that given one gene, on average, 3.88 of the 60.32 enhancers (6.43%) in human and 3.54 of the 68.47 enhancers (5.17%) in mouse are associated to AS change (Figure S3A and S3B). For enhancers, on average each one enhancer is paired to 7.66 genes in human and 9.29 genes in mouse according to enhancerAtlas, but in our result one enhancer is significantly associated to AS change with only 1.28 genes human and 1.22 genes in mouse (Figure S3C and S3D).

Investigations of the genetic properties of identified enhancer-AS associations

To further understand the genetic properties of identified enhancer-AS associations, we observed the proportion of enhancer-AS associations which have both homologous genes and homologous enhancers between human and mouse. For each gene in human, we defined its homolog in mouse according to the homologs list provided in Mouse Genome Informatics (MGI) [30]. For each enhancer in human, we obtained its homologous enhancers in mouse by conducting the CrossMap [31] with the human and mouse chain file and, which is the pairwise alignment between two reference assemblies from Ensembl [32]. We found that about 52% of the significant and 35% of the insignificant enhancer-AS associations in human have homologous genes accompanied with homologous enhancers in mouse (Table 3). The Welch two sample t-test shows significant difference (p-value = 5.56 × 10–11) upon percentages of significant enhancer-AS pairs with homologous genes and enhancers in all ten types of AS against insignificant groups. This suggests that the significant enhancer-AS pairs are more likely to be the co-existence of homologous genes and homologous enhancers than insignificant enhancer-AS pairs. Similar trends with lower percentages were found when we check the significant enhancer-AS pairs (Welch two sample t-test p-value = 1.906 × 10–13) in mouse (Table 4). These results show that the significant enhancer-AS associations we identified are more likely to be the co-existence of homologous genes accompanied with homologous enhancers in both human and mouse rather than conservation of enhancer sequence only.

Table 3 The counting table of human significant and insignificant enhancer-AS pairs accompanied with both homologous enhancers. The numbers of total significant enhancer-AS pairs, significant enhancer-AS pairs with homologous genes and enhancers, total insignificant enhancer-AS pairs, insignificant enhancer-AS pairs with homologous genes or enhancers in all ten types of AS are provided. The percentages of enhancer-AS pairs with homologous genes and enhancers in all ten types of AS are calculated
Table 4 The counting table of mouse significant and insignificant enhancer-AS pairs accompanied with both homologous genes and homologous enhancers

Visualization of enhancer-AS associations

To visualize the enhancer-AS associations, we constructed a platform named VEnAS. VEnAS provides intuitive genomic architecture, association plot, and contingency table of all the significant enhancer-AS associations (Fig. 3). To query VEnAS, users can input Ensembl gene ID or gene symbol (Query 1 in Fig. 3). The auto-completion function would help users find out the gene of interests. The web server provides portable gene information for convenient linking to Ensembl, NCBI, and RefSeq (Result 1 in Fig. 3). After users select an AS type and a corresponding enhancer (Query 2 in Fig. 3), VEnAS shows the architecture of the gene with enhancer, association plot, and a two-by-two contingency table (Query 2 in Fig. 3). For splicing display, the bending curve drawn above exons represents the inclusive form of AS products, while the curve drawn below exons represents the exclusive form. The width of curves represents the number of biological replicates which support the association events. Moreover, the colors denote whether enhancer is active or inactive. In the top of a two-by-two contingency table, the FDR adjusted q-value of the Fisher exact test and the odds ratio are also provided. Inside the table, the color boxes are representing biological replicates having AS shifted to inclusion or exclusion. The color intensity of the boxes is proportional to the Z-PSI. The tissue name and Z-PSI would be displayed when the mouse cursor is hovering atop the box. Additionally, VEnAS provides batch retrieval function. The user could send a list of Ensembl gene ID(s) obtained from any other analysis tool or software in the batch retrieval web page through pasting in dialog box or uploading file. VEnAS can convert the visualized results into PDF file format for users for further analyses.

Fig. 3
figure 3

The webpage of VEnAS and query steps. Following by the queries (e.g. Query 1 and 2), users can obtain gene architecture, association plot, and detailed statistical information (e.g. Result 1 and 2) of VEnAS database conveniently

Case study

We have identified lots of enhancer-AS associations in this study. However, it is difficult to find out large scale biological evaluation or literature evidence. Hence, we performed comparative genomics analysis between human and mouse as well as observed the splicing events to further evaluate the identified associations. Below is a case demonstrating the robustness of our finding. In Manduchi et al.’s study [33], they identified 35 significant SNP marks and enhancers which are associated to Type 2 diabetes with combination of epigenomic markers and genome wide association studies (GWAS). In their result, gene ST3GAL4 is associated to two SNP markers located within an enhancer which is named chr11_1460 in our system. As shown in Fig. 4A, the enhancer is marked by ENCODE as a cis-regulatory element in human. As shown in comparative genomics data track, the genomic region of this enhancer is located within synteny between human and mouse. In mouse, the associated enhancer is named chr9_3600, which is also marked as a cis-regulatory element by ENCODE and within the syntenic region shared with human enhancer chr11_1460 (Fig. 4B). Furthermore, we utilized MISO to draw sashimi plots and PSI histograms [26] to illustrate that the presence/absence of the associated enhancer is associated to skipped exon event (SE) of ST3GAL4 in human (Fig. 4C). The PSI histograms show that all the PSI values are closed to “1”, i.e. the inclusive form dominated, in the samples where the enhancer is present. On the contrary, the PSI values are decreased to about 0.5 in the samples when the enhancer is absent. The strength of association between enhancer and SE is significant (\(q-\mathrm{value}=2.909\mathrm{ x }{10}^{-2}\), as shown in Fig. 4D). Taking together the literature evidence, comparative genomics data, and PSI distribution; we did successfully demonstrate the existence of the enhancer-AS association.

Fig. 4
figure 4

 A real case for the enhancer-AS association. The associated enhancers located in upstream of human gene ST3GAL4 (A) and mouse gene St3gal4 (B) marked by the red arrows. The tracks showing ENCODE cis-regulatory elements and genomic synteny between human and mouse are provided in the below. C The sashimi plots and PSI histograms of human ST3GAL4 2nd skipped exon (SE) event in 17 samples. The plots in red indicate that the samples having present associated enhancer chr11_1460, while the plots in green indicating the absence of chr11_1460. D The VEnAS result of 2nd SE event in ENSG00000110080 and associated enhancer chr11_1460. The location of enhancer indicated by red triangle in genomic architecture, association plot, and two-by-two contingency table are provided

Discussions

Previous studies showed that some enhancers are conserved between human and mouse [34] while some enhancers might be reprogrammed after human-mouse speciation [35]. To investigate whether the enhancers associated to AS existed in human and mouse are conserved or not, we further examined the conservation score difference between significant and insignificant enhancers. The conservation score of enhancer sequence between human and mouse were downloaded from Ensembl v94 compara 32 amniotes datasets [36]. After comparison, we didn’t find any difference of conservation score of enhancer sequence between significant and insignificant enhancer-AS associations (data not shown).

As we already know that enhancers could also serve as a hub for binding of transcription factors [18], we tried to annotate known motifs on enhancer regions by DREME [37] and TomTom [38] with position frequency matrix from JASPAR [39]. However, we didn’t find any differentially enriched known motifs shared between human and mouse. Though we didn’t find any advanced evidence, more data sets are required to conclude that the enhancers associated to AS are newly emerged or reprogrammed after human-mouse speciation.

In 2013, a new concept of super-enhancer had been proposed [40, 41]. Super-enhancers are considered to be a cluster of several different enhancers with exceptional higher binding of transcriptional coactivators [40, 41]. Super-enhancers are usually longer than typical enhancers, with a median length of 8.7 kb [42]. Recently, more and more super-enhancer databases about super-enhancer characteristics and associated genes are available, such as dbSUPER [43], SEdb [44], and SEA [45]. It has been reported that super-enhancer is capable of regulating alternative splicing in smooth muscle [46]. However, our current statistical analysis method is designed for one enhancer on one AS event rather than multiple/combinatorial enhancers on one AS event. To pin-point the correlation between the combination of transcription factors and AS events requires a more sophisticated method. In the future, we will pursue a genome-wide method to reveal the correlation between super-enhancer and alternative splicing event.

Conclusion

In this study, an analysis pipeline to identify enhancer-AS associations was proposed. We included 84 RNA-seq data sets across 28 tissues and cell lines in human and 72 RNA-seq data sets across 24 tissues and cell lines in mouse for analysis. In total, 3,242 human genes and 7,716 mouse genes having at least one significant enhancer-AS were identified. On average, about 5–6% of the enhancers of one given gene are associated to AS change, and one given enhancer is associated to 1.28 human or 1.22 mouse genes. The significant enhancer-AS associations are more likely to be the co-existence of homologous genes and homologous enhancers in both human and mouse. Finally, we constructed VEnAS to provide comprehensive enhancer-associated AS results for scientists, including genomic architecture, intuitive association plot, and contingency table. We believe that our study is helpful in further understanding the roles of enhancers on regulating alternative splicing.