Characterizing the interplay between gene nucleotide composition bias and splicing
Nucleotide composition bias plays an important role in the 1D and 3D organization of the human genome. Here, we investigate the potential interplay between nucleotide composition bias and the regulation of exon recognition during splicing.
By analyzing dozens of RNA-seq datasets, we identify two groups of splicing factors that activate either about 3200 GC-rich exons or about 4000 AT-rich exons. We show that splicing factor–dependent GC-rich exons have predicted RNA secondary structures at 5′ ss and are dependent on U1 snRNP–associated proteins. In contrast, splicing factor–dependent AT-rich exons have a large number of decoy branch points, SF1- or U2AF2-binding sites and are dependent on U2 snRNP–associated proteins. Nucleotide composition bias also influences local chromatin organization, with consequences for exon recognition during splicing. Interestingly, the GC content of exons correlates with that of their hosting genes, isochores, and topologically associated domains.
We propose that regional nucleotide composition bias over several dozens of kilobase pairs leaves a local footprint at the exon level and induces constraints during splicing that can be alleviated by local chromatin organization at the DNA level and recruitment of specific splicing factors at the RNA level. Therefore, nucleotide composition bias establishes a direct link between genome organization and local regulatory processes, like alternative splicing.
KeywordsSplicing Genomic Chromatin organization Nucleotide composition bias
Most eukaryotic genes comprise both exons and introns. Introns are defined at their 5′-end by the 5′ splicing site (ss), which interacts with the U1 snRNA, and at their 3′-end, by the branch point (BP; recognized by SF1), the polypyrimidine (Py) tract (recognized by U2AF2 or U2AF65), and the 3′ ss (recognized by U2AF1 or U2AF35) . SF1 and U2AF2 allow the recruitment of the U2 snRNP, which contains the U2 snRNA that interacts with the BP . Recent large-scale experiments have demonstrated that RNA secondary structures frequently occur in the vicinity of splicing sites and gene-by-gene analyses have demonstrated that RNA structures play a direct role in splicing regulation . For example, secondary structures at the 5′ ss can hinder the interactions between the 5′ ss and U1 snRNA and this structure-dependent constraint can be relaxed by RNA helicases, such as DDX5 and DDX17 [3, 4, 5]. Meanwhile, secondary structures at the 3′-end of short introns can replace the need for U2AF2 . Splicing signals are short degenerate sequences, and exons are much smaller than introns. How then are exons precisely defined? How are bona fide splicing signals distinguished from pseudo-signals or decoy signals? These questions have been intensively researched but still remain open.
Many (if not all) exons require a variety of splicing factors to be defined. Splicing factors that belong to different families of RNA binding proteins, such as the SR and hnRNP families, bind to short degenerate motifs either in exons or introns of pre-mRNAs . Splicing factor binding sites are low-complexity sequences comprising either the same nucleotide or dinucleotide [8, 9, 10]. Splicing factors modulate the recruitment of different spliceosome-associated components [7, 11].
Spliceosome assembly and the splicing process occur mostly during transcription [11, 12]. In this setting, the velocity of RNA polymerase II (RNAPII) influences exon recognition in a complex manner, as speeding up transcription elongation can either enhance or repress exon inclusion . RNAPII velocity is in turn influenced by the local chromatin organization, such as the presence of nucleosomes [11, 12]. Nucleosomes are preferentially positioned on exons because exons have a higher GC content than introns, which increases DNA bendability [14, 15, 16, 17]. In addition, Py tracts (mostly made of Ts) upstream of exons may form a nucleosome energetic barrier [14, 15, 16, 17]. Nucleosomes influence splicing by slowing down RNAPII in the vicinity of exons and by modulating the local recruitment of splicing regulators [11, 12]. Indeed, depending on their specific chemical modifications (e.g., methylation), histone tails can interact directly or indirectly with splicing factors . Therefore, exon recognition during the splicing process depends on a complex interplay between signals at the DNA level (e.g., nucleosome positioning) and signals at the RNA level (e.g., splicing factor binding sites).
Genes are not randomly organized across a genome, and nucleotide composition bias over genomic regions of varying lengths plays an important role in genome organization at multiple genomic scales. For example, isochores are large genomic regions (≥ 30 kbps) with a uniform GC content that differs from adjacent regions [19, 20, 21]. Isochores can be classified into five families, ranging from less than 37% of GC content to more than 53% [19, 20, 21]. GC-rich isochores have a higher density of genes than AT-rich isochores, and genes in GC-rich isochores contain smaller introns than genes in AT-rich isochores [22, 23, 24]. In addition, to be associated with specific gene features (e.g., intron size), the density of GC nucleotides in genes has consequences on splicing site sequences and on the splicing process . For example, it has been proposed that splicing of short introns in a GC-rich context may occur through the intron definition model, while the splicing of large introns in an AT-rich context may occur through the exon definition model [11, 26]. Collectively, these observations support a model in which the gene architecture and gene nucleotide composition bias (e.g., GC or AT content) influence local processes at the exon level, such as nucleosome positioning and intron removal. As exon recognition also depends on the binding to the pre-mRNAs of splicing factors that interact with compositionally biased sequences, one interesting possibility is that the nature of these splicing factors depends at least in part on the gene nucleotide composition bias. In this setting, we have recently reported that exons regulated by different splicing factors have different nucleotide composition bias .
Here, we have investigated the relationship between the splicing process, gene nucleotide composition bias, and chromatin organization at both the local and global levels. We initially identified sets of exons activated by different splicing factors and then demonstrated that analyzing the nucleotide composition bias provided a better understanding of the interplay between chromatin organization and splicing-related features, which collectively affect exon recognition. We propose that nucleotide composition bias not only contributes to the 1D and 3D genome organization, but has also local consequences at the exon level during the splicing process.
Splicing factor–dependent GC-rich and AT-rich exons
Publicly available RNA-seq datasets generated after knocking down or overexpressing individual splicing factors across different cell lines were analyzed [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43] (Additional file 1: Table S1). Using our recently published FARLINE pipeline, which allows the inclusion rate of exons from RNA-seq datasets to be quantified , we defined the sets of exons whose inclusion is activated by each of the 33 splicing factors that were analyzed (Additional file 2: Table S2). We focused on splicing factor–activated exons to uncover the splicing-related features characterizing exons whose recognition depends on at least one splicing factor. We identified 10,707 exons that were activated by at least one splicing factor from the 93,680 exons whose inclusion rate was quantified by FARLINE across all datasets (see the “Materials and methods” section).
Size analysis of introns flanking splicing factor–activated exons revealed that different sets of splicing factor–activated exons were flanked by introns that were either smaller or larger than the median size of human introns (Additional file 3: Figure S4a, b). Interestingly, there was a negative correlation between the GC content of exons and the size of their flanking introns (Fig. 1c, upper panel, P value < 10−16), as previously reported , but not between exon GC content and exon size (Fig. 1c, lower panel). Based on these observations, we defined two groups of exons. The GC-exon group depends on splicing factors activating GC-rich exons that are flanked by small introns (Fig. 1d, in blue), while the AT-exon group depends on splicing factors activating AT-rich exons that are flanked by large introns (Fig. 1d, in green). We excluded for further analyses exons regulated by SRSF2, SRSF3, or hnRNPC, as these splicing factors regulate GC-rich exons flanked by relatively large introns, as well as exons belonging to both groups (see the “Materials and methods” and “Discussion” sections). We next analyzed different splicing-related features by comparing 3182 GC exons to 4045 AT exons, representing two populations of exons that (i) differ in terms of both GC content and flanking intron size and (ii) are activated by distinct splicing factors (Fig. 1d, e and Additional file 2: Table S2).
Nucleotide composition bias and splicing-related features
GC exons were impoverished in Ts but enriched in Cs just upstream of their 3′ ss as compared to control exons (Fig. 2a, c). Therefore, the pattern of pyrimidines upstream of GC exons was similar to that of control exons (Fig. 2a), although the high density of Cs upstream of GC exons may have consequences on Py-tract recognition by U2AF2 (see below). Meanwhile, AT exons had a higher frequency of As upstream of their 3′ ss as compared to GC exons or control exons (Fig. 2a, c). We tested whether this was associated with a larger number of potential BP sites, which often contain As by using the pipeline developed by Corvelo et al. [47, 48, 49]. Indeed, a higher proportion of AT exons had more than two predicted BPs in their upstream intronic sequence, as compared to GC exons or control exons (Fig. 2e, left panel). Further, predicted BPs upstream of GC exons were embedded in sequences that contained a slightly higher proportion of Cs as compared to AT exons (Fig. 2f). The interaction between the BP and U2 snRNA was reported to be more stable when the BP is embedded in GC-rich sequences [47, 48, 49]. Accordingly, the number of hydrogen bonds between BP sites and U2 snRNA was higher for GC exons than for AT exons (Fig. 2e, right panel).
As there was a higher frequency of As and Ts upstream of AT exons as compared to control exons (Fig. 2a, c), we investigated whether this may interfere with the number of potential binding motifs for SF1 (which binds to UNA motifs) and U2AF2 (which binds to U-rich motifs) . As shown in Fig. 2g (left panel), AT exons contained a larger number of TNA motifs upstream of their 3′ ss as compared to GC exons. In addition, AT exons contained a larger number of low-complexity sequences made of three Ts within a four-nucleotide window upstream of the Py-tract as compared to GC exons (Fig. 2g, right panel). Supporting the biological relevance of this observation, the analysis of U2AF2 CLIP-seq datasets revealed a higher U2AF2-related signal upstream of AT exons, as compared to GC exons (Fig. 2h), which extended upstream of the Py tract of AT exons (green arrows). This is consistent with the differential pattern of T and C frequency between the two sets of exons (Fig. 2a) and with the fact that Us provide higher binding affinity to U2AF2 than Cs .
Nucleotide composition bias and dependency for specific spliceosome components
Exons skipped after depletion of SF1, U2AF2, SF3A3, or SF3B4 (but not U2AF1) that recognize splicing signals at 3′-ends of introns were in an AT-rich environment as compared to control exons (Fig. 3a). In addition, a larger proportion of U2-exons—that is, those activated by the U2 snRNP–associated factors including SF1, U2AF2, SF3A3 or SF3B4—contained more than two predicted BPs in their upstream intron as compared to U1 exons (e.g., those activated by SNRPC, SNRNP70, and/or DDX5/17) (Fig. 3c). In addition, U2 exons contained a larger number of SF1- and U2AF2-binding sites in their upstream intron as compared to U1 exons (Fig. 3d, e). In agreement with the T-frequency pattern (Fig. 3a), a broader U2AF2-derived signal was observed when comparing U2 exons to U1 exons (Fig. 3f).
To summarize, GC exons were predicted to have stable secondary structures at their 5′ ss (Fig. 2a, d), and the nucleotide composition bias and splicing-related features of U1 exons were similar to those of GC exons (Fig. 3a, b). Additionally, the increased frequency of As and Ts upstream of AT exons (Fig. 2a) was associated with increased numbers of potential decoy signals, and the nucleotide composition bias and splicing-related features of U2 exons were similar to those of AT exons (Fig. 3a, c–f). We therefore hypothesized that GC exons were more sensitive to U1 snRNP- than to U2 snRNP–associated factors, in contrast to AT exons. Accordingly, 28% of GC exons and 14% of AT exons were regulated by U1-related factors, while 27% of AT exons and 17% of GC exons were regulated by U2-related factors. In addition, GC exons were more likely to be affected by SNRPC, SNRNP70, or DDX5/17 depletion than AT exons, which were more likely to be affected by SF1, U2AF2, SF3A3, or SF3B4 depletion (Fig. 3g).
Nucleotide composition bias, gene features, and chromatin organization
It has been reported that GC-rich genes are more expressed as compared to AT-rich genes [24, 61, 62, 63]. Remarkably, for genes comprising either AT exons or GC exons, the RNAPII density was similar at promoters and the first exon but was higher at exons and introns of genes hosting GC exons than genes hosting AT exons (Fig. 5c, d, Additional file 3: Figure S5a). A similar result was observed for the pattern of RNAPII phosphorylated at Ser2 (Fig. 5e, f), suggesting that the RNAPII content on genes hosting GC exons was likely to be productive.
The pattern of nucleosomes on GC exons and AT exons prompted us to analyze histone tail modifications that play a role in splicing regulation (see the “Introduction” section). We analyzed publicly available ChIP-seq datasets generated across different cell lines (Additional file 1: Table S1). As shown in Fig. 6d, a higher density signal corresponding to H3K4me1, H3K4me2, H3K4me3, H3K9ac, and H3K27ac was detected on GC exons when compared to AT exons. No significant differences were observed for H3K9me3, H3K27me3, H3K36me3, H3K79me3, and H4K20me1 (Fig. 6d). The pattern of histone modifications did not seem to be specific to splicing factor–regulated exons. While there was a similar signal density of histone marks on the promoter and first exon of genes hosting either AT exons or GC exons, the H3K4me3 and H3K9ac density signals were higher across all the exons of the GC exons hosting genes (Fig. 6e, f, Additional file 3: Figure S5d).
To summarize, GC and AT exons were hosted by GC- and AT-rich genes (Fig. 5a, b), respectively, that were associated with different levels of RNAPII (Fig. 5c–f) and that were differentially organized at the chromatin level (Fig. 6). Owing to the relationship between nucleotide composition bias, gene expression level, chromatin organization, and the 1D and 3D genome organization, we investigated the interplay between splicing regulation and isochores or topologically associated domains (TADs) that define the 1D or 3D human genome organization, respectively.
GC and AT exons are hosted by genes belonging to different isochores and chromatin domains
Some genomic domains, named lamina-associated domains (LADs), are in close proximity to the nuclear envelop , and some DNA domains separated by several dozens of Kbps, named topologically associated domains (TADs), are in close proximity in the nuclear space . LADs and TADs have been annotated across different cell lines [64, 65]. We observed that LADs contain more frequently AT exons (~ 70%, Fig. 7d), while TADs contain a similar proportion of GC exons and AT exons (~ 55%). This is in agreement with the fact that LADs correspond to AT-rich regions . Interestingly, we observed a positive correlation between the GC content of exons and the GC content of the TAD they belong to (Fig. 7e). Furthermore, GC exons and AT exons cluster in different TADs. Indeed, some TADs contain a larger number of GC than AT exons, while other TADs contain a larger number of AT exons (Fig. 7f). This result was confirmed using different annotations of TADs across different cell lines (Additional file 3: Figure S6c, d).
As already mentioned, the GC content of exons correlated with the GC content of their hosted TADs. Furthermore, the GC content of each exon correlated with the average GC content of the exons in the same TAD (Additional file 3: Figure S7a). This suggested that splicing-related features, such as the MFE at the 5′ ss, the number of splicing-related decoys, or the number of exons regulated by U1 or U2 were associated with the hosting TAD. Indeed, the values of splicing-related features were not equitably distributed among TADs and exons regulated by U1 (or U2) were more concentrated in some TADs than in others (LRT test P value< 10−16; Additional file 3: Figure S7b).
Collectively, these observations support a model where nucleotide composition bias establishes a link between genomic organization (e.g., isochores or TADs) and the splicing process.
The rules that govern exon recognition during splicing and that explain the dependency of some exons on different classes of splicing factors remain to be clarified. We propose that nucleotide composition bias (e.g., GC content) over large genomic regions, which plays an important role in genome organization, leaves a footprint locally at the exon level and induces constraints during the exon recognition process that can be alleviated by local chromatin organization and different classes of RNA binding proteins.
The human genome is divided in isochores corresponding to regions of varying lengths (up to several dozens of kbps) having a uniform GC content that differs from adjacent regions [19, 20, 21]. GC-rich isochores have a higher density of genes than AT-rich isochores, and GC- and gene-rich genomic regions are highly expressed [19, 20, 21, 23, 24, 61, 62, 63] (Fig. 7g). Increasing evidence indicates that the one-dimensional genome organization, defined by regional nucleotide composition bias, is related to the three-dimensional genome architecture. For example, an overlap between isochores and TADs has been reported [19, 66]. Both the relationship between the 1D and 3D genome organization and the higher density of actively transcribed genes in GC-rich regions could be explained by the physicochemical properties of GC and AT nucleotides. For example, the nature of the base stacking interactions between GC nucleotides increases DNA structural polymorphism that in turn increases DNA bendability and flexibility with consequences on DNA folding [14, 15, 16, 67, 68, 69, 70]. Furthermore, the higher stability of G:C base pairing and frequency of GC-associated polymorphic structures have been proposed to increase the resistance of the DNA polymer to transcription-associated physical constraints. For example, the GC content is associated with the transition from the B- to Z-DNA form, the former “absorbing” the topological and torsional stresses that are generated during transcription [67, 68, 69, 70, 71]. In this setting, we observed that GC exons and their hosting GC-rich genes have a higher RNAPII density than AT exons and their hosting AT-rich genes (Fig. 5). Based on these observations, transcription and the three-dimensional genome organization may constrain the nucleotide composition of genomic regions, which in turn has “local” consequences on exon recognition during co-transcriptional splicing.
Supporting this possibility, first we observed a positive correlation between the GC content of exons, their flanking introns, and their hosting genes, isochores, and TADs (Figs. 1b, 5a, and 7a, e), in agreement with the notion that the GC content is uniform and homogenous regardless of the genomic scale [19, 20, 21]. Second, differential GC content is associated with specific constraints on the exon recognition process. For example, our analyses and reports from the literature support a model where a high local GC content favors the formation of RNA secondary structures that can hinder the recognition of the 5′ ss by occluding them, therefore limiting the access of U1 snRNA to the 5′ ss [3, 4, 5]. Accordingly, exons sensitive to the depletion of SNRPC and SNRNP70, two components of the U1 snRNP, as well as exons sensitive to the DDX5 and DDX17 helicases that enhance the recognition of structured 5′ ss owing to their RNA helicase activity [3, 4, 5, 55], are embedded in a GC-rich environment (Fig. 3a, b). In this setting, splicing factors that activate GC exons (for instance, hnRNPF, hnRNPH, PCBP1, RBFOX2, RBM22, RBM25, RBMX, SRSF1, SRSF5, SRSF6, and SRSF9) bind to G-, C-, or GC-rich motifs [8, 9, 10] (Additional file 3: Figure S8). Furthermore, hnRNPF, hnRNPH, RBFOX2, RBM22, RBM25, and several SRSF splicing factors are known to enhance U1 snRNP recruitment [72, 73, 74, 75, 76]. Therefore, a high GC content could increase the probability of generating secondary structure at the 5′ ss, which decreases exon recognition. Simultaneously, high GC content could increase the recruitment of splicing factor binding to GC-rich motifs, thereby enhancing U1 snRNP recruitment (Fig. 7g). While RNA secondary structures at the 5′ ss negatively impact exon recognition, structures at the 3′ ss favor exon recognition and replace the requirement for U2AF2 in splicing . In addition, a high GC content, as well as G- and C-rich motifs upstream of the BP, enhances U2 snRNA binding and BP recognition [47, 48, 77]. Accordingly, exons embedded in a GC-rich environment are more sensitive to factors associated with U1 snRNP than those with U2 snRNP (Fig. 3g).
Our results also support a model in which a high content of AT nucleotides in large introns can negatively influence exon recognition. Indeed, high AT content upstream of exons associated with a larger number of potential BPs (Fig. 2e), in agreement with a previous report . A high AT content upstream of exons also associated with a larger number of SF1- and U2AF2-binding sites (Fig. 2g, h). In this setting, increasing evidence indicates that binding of spliceosome-associated factors (e.g., SF1 or U2AF2) to pseudo-signals or decoy signals can inhibit splicing by decreasing the efficiency of spliceosome assembly [48, 53, 56, 57, 58, 59, 60]. These observations suggest that splicing factors that activate AT exons enhance exon recognition either by enhancing U2 snRNP recruitment or by binding to decoy splicing signals. Accordingly, splicing factors activating AT exons, including hnRNPA1, hnRNPM, RBM15, RBM39, SFPQ, and TRA2, interact with and enhance the recruitment of SF1, U2AF2, U2AF1, and/or U2 snRNP [78, 79, 80, 81, 82, 83]. In addition, some of these splicing factors can compete with U2AF2 or SF1 for binding to intronic 3′-end splicing signals, and binding of splicing factors such as hnRNPA1 and PTBP1 to decoy splicing signals has been proposed to “fill up” a surplus of splicing signals and consequently enhancing the recognition of bona fide splicing sites [48, 53, 56, 57, 58, 59, 60]. Therefore, a high AT content could increase the probability of generating decoy splicing signals at intronic 3′-ends, which would decrease exon recognition. However, a high AT content could simultaneously increase the probability of recruiting splicing factors at decoy signals, thereby strengthening the recruitment of spliceosome-related components (e.g., SF1 and U2AF2) to bona fide splicing signals and ultimately enhancing exon recognition (Fig. 7g).
While this work focused on exons that are activated by one of the 33 analyzed splicing factors, the U1 and U2 dependency relying on exon GC content is likely a general feature. Indeed, exons (labeled oGC or oAT exons) that shared the same GC composition and flanking intron size with GC or AT exons (but that were not activated by one of the analyzed auxiliary splicing factors) had the same splicing-related features than GC or AT exons, respectively, in terms of secondary structures at 5′ss, number of decoy sites, and U1 or U2 dependency (Additional file 3: Figure S9a-e). Of note, several splicing factors activating GC or AT exons were more frequently detected by ClIP-dataset analyses in the vicinity of oGC or oAT exons, respectively (Additional file 3: Figure S9f). Obviously, oGC and oAT exons might also be activated by different sets of splicing factors. Conversely, some exons annotated as being regulated by one of the analyzed splicing factor might be indirect targets, even though ClIP-dataset analyses revealed an enrichment of the analyzed splicing factors on or in the vicinity of their activated exons (Additional file 3: Figure S9 g) and even though exons that are co-regulated by a given splicing factor share many features, as shown throughout this work.
Finally, we grouped exons regulated by different splicing factors into two main categories. However, each set of exons that are co-regulated by a given splicing factor had a different combination of properties (e.g., splicing site strength, size of exons or flanking introns, Additional file 3: Figure S1, S4, and S10). A detailed annotation of these combinations and their comparison should improve the prediction of splicing factor–regulated exons. Actually, some splicing-related features of sets of co-regulated exons could be linked to the fact that each of these sets had a specific combination of nucleotide composition bias. For example, some sets of co-regulated exons were more enriched in Gs than in Cs (e.g., SRFS2 vs SRSF3, Additional file 3: Figure S2 and S10a). We also analyzed exonic purine- or pyrimidine-biases and observed that purine- and pyrimidine-rich exons had similar properties than AT- and GC-rich exons, respectively (Additional file 3: Figure S11 and S12). Obviously, more work is needed to better understand the interdependency between splicing-related features and nucleotide composition bias both at the exon and gene levels, which can have consequences on the coupling between transcription and splicing as detailed below.
The synchronization between transcription and splicing plays a major role in the exon recognition process [11, 12, 84, 85]. We propose that the coupling between transcription and splicing operates through different mechanisms depending on the gene nucleotide composition bias, which impacts chromatin organization and RNAPII dynamics. Indeed, at the chromatin level, nucleosomes are better positioned on exons in an AT-rich context than in a GC-rich context, and there is a higher density of nucleosomes in both GC-rich exons and introns (Fig. 6a–c), as already reported [14, 15, 16, 17]. This feature could result from the fact that exons embedded in an AT-rich context have a much higher GC content than their flanking intronic sequences, in contrast to exons embedded in a GC-rich context [11, 26] (Fig. 2a). In this setting, GC-rich stretches favor DNA wrapping around nucleosomes, because the stacking interactions between GC nucleotides allow DNA structural polymorphism that in turn increases DNA bendability; in contrast, T- and A-rich stretches form more rigid structures that create nucleosome energetic barriers [14, 15, 16, 17]. Therefore, increasing the intronic GC content increases the probability of nucleosomes sliding from exons to introns, while increasing the density of Ts and As in exonic flanking regions creates nucleosome energetic barriers. Consequently, transcription and splicing synchronization in an AT-rich environment could depend on nucleosomes being well-positioned on exons, as these would locally slow down RNAPII and thereby favor recruitment of splicing-related factors (Fig. 7g), as previously proposed . Of note, components associated with the U2 snRNP interact with chromatin-associated factors [86, 87].
Synchronization between transcription and splicing could rely on other mechanisms when exons are within a GC-rich environment. Indeed, both the higher density of nucleosomes across introns of GC-rich genes (Fig. 6a–c), and the higher stability of G:C base pairing, create constraints that reduce the velocity of RNAPII across both exons and introns. Accordingly, the rate of elongation by RNAPII is negatively correlated with gene GC content . Therefore, high GC content may facilitate the synchronization between transcription and splicing by “smoothing” RNAPII dynamics all along GC-rich genes. Of note, extensive interactions between the U1 snRNP and RNAPII-associated complexes have been reported ; therefore, a slower speed of RNAPII across GC-rich genes may facilitate U1 snRNP recruitment (Fig. 7g). Further supporting that gene GC content plays an important role in the interplay between gene expression levels and splicing, intron removal occurs more efficiently in highly expressed genes , and GC-rich genomic regions associate with nuclear speckles [91, 92]. It must also be underlined that introns flanking GC-rich exons have been shown to be efficiently and probably co-transcriptionally spliced .
Altogether, these observations suggest a link between nucleotide composition bias, genome organization, and RNA processing (Fig. 7g). We propose a model in which transcription and genome organization constrain the nucleotide composition of DNA over dozens of kbps. In turn, nucleotide composition bias induces local (at the exon level) constraints on the splicing process by affecting specific splicing-related features. However, constraints induced by nucleotide composition bias can be alleviated by specific mechanisms. For example, although AT exons are weakened in terms of intron 3′-end definition, this can be alleviated by an interplay between exon-positioned nucleosomes, U2 snRNP–associated factors, and splicing factors that bind decoy signals and/or enhance U2 snRNP recruitment. Likewise, while GC exons are weakened at their 5′ ss because of the formation of RNA secondary structures, this can be alleviated by an interplay between a slow RNAPII, U1 snRNP–associated factors and splicing factors that bind to GC-rich sequences and enhance the recruitment of the U1 snRNP. In this model, splicing factors would enhance the recognition of exons by counteracting splicing-associated constraints resulting from nucleotide composition bias and providing room for regulatory processes such as alternative splicing.
Materials and methods
RNA-seq dataset analyses and establishment of GC and AT exon sets
Publicly available RNA-seq datasets generated from different cell lines transfected with siRNAs or shRNAs targeting specific splicing factors, or transfected with splicing factor expression vectors, were recovered from GEO and ENCODE (Additional file 1: Table S1). RNA-seq datasets were analyzed using FARLINE, a computational program dedicated to analyze and quantify alternative splicing variations, as previously reported . This study focused on exons whose inclusion depends on at least one splicing factor. For this, the sets of exons that are activated by each analyzed splicing factor in at least one sample were defined (Additional file 2: Table S2). Exons that are regulated in an opposite way by the same splicing factor in different samples were eliminated. GC exons (or AT exons) were defined as exons whose inclusion depends on splicing factors activating exons with a median GC content higher (or lower, respectively) than control exons (threshold 49.3%) and having a median size of their smallest flanking intron smaller (or larger, respectively) than that of control exons (threshold 691 bp). Based on the results shown in Fig. 1d, GC exons were defined as exons being activated by SRSF9, PCBP1, RBMX, hnRNPF, RBFOX2, SRSF5, hnRNPH1, RBM22, RBM25, MBNL2, and SRSF6, while AT exons were defined as exons activated by TRA2A/B, RBM15, RBM39, hnRNPA2B1, KHSRP, hnRNPM, SRSF7, SFPQ, MBNL1, DAZAP1, PTBP1, hnRNPL, hnRNPK, FUS, QKI, hnRNPA1, and PCBP2. The set of control (CTRL) exons corresponds to human constitutive coding exons annotated in FASTERDB, excluding the exons activated by at least one of the analyzed splicing factors. For further analyses, exons found in the two-exon sets (i.e., exons regulated by two splicing factors of different classes) were eliminated (about 25%), leading to one list of 3182 GC exons and another of 4045 AT exons (Additional file 2: Table S2). U1 exons were defined as exons activated by the SNRPC, SNRNP70, and/or DDX5/17 factors, and U2 exons were defined as exons activated by the U2AF2, SF3B4, SF1, and/or SF3A3 factors. GA exons and CT exons were defined based on their GA and CT composition compared to the GA and CT composition of CTRL exons. All genomic annotations were from FasterDB .
Heatmaps and frequency maps
where DS and DC are the vectors of D values for the sets S and C. Heatmaps of Fig. 2 b and c were generated using a Mean function in formula (1). A linear model (with R lm() and summary() functions) was used to compare the GC content of exons activated by each splicing factor to the GC content of the control exons. With this model, a Student’s test was computed between control exons and each set of exons activated by a splicing factor. A generalized linear model for the Poisson distribution (with glm() and summary() functions) was used to compare the counts of As, Gs, Cs, or Ts, in the 25 first nucleotides upstream exons between exons activated by each splicing factor to control exons. With this model, a Wald’s test was computed between each splicing factors activated set of exons and control exons. The sequences corresponding to the intron-exon junctions (the last 100 nucleotides of the upstream intron and the first 50 nucleotides the exon) were recovered for each exon. The mean frequency of a given nucleotide was computed at each window position using sliding window with a size and a step of 20 and 1 nucleotide, respectively. The same procedure was applied for the sequences corresponding to exon-intron junctions defined as the last 50 nucleotides of the exon and the first 100 nucleotides of the downstream intron.
Splice site scores, minimum free energy, BP predictions, U2 binding energy, and motif count
The splice site scores were calculated for each FasterDB exons with MaxEntScan (http://genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html). The minimum free energy (MFE) was computed from exon-intron junction sequences (25 nucleotides within the intron and 25 nucleotides within the exon) using RNAFold from the ViennaRNA package  (v2.4.1; http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi). These sequences were split in two groups: the sequences centered on the 5′ ss and the sequences centered on 3′ ss. The Anscombe transformation was applied on MFE values to obtain Gaussian distributed values. An ANOVA model (with R, aov() function) was built and statistical differences between every couple of group of exons was tested with a Tukey’s test. Differences between MFE of exons activated by each spliceosome-associated factor and control exons were tested using a linear model (with R lm() and summary() functions). The number of branch points in a given sequence corresponding to the 100 nucleotides preceding 3′ ss was computed with SVM-BP finder (http://regulatorygenomics.upf.edu/Software/SVM_BP/). Only branch point sites with a svm score > 0 were considered. The U2 snRNA binding energy corresponds to the number of hydrogen bounds between the nucleotides surrounding the branch point of an RNA sequence (without the branch point adenine) and the branch point binding sequence of U2 snRNA. The RNAduplex script in the ViennaRNA package (v2.4.1) was used to determine the optimal hybridization structure between the branch point binding sequence of U2 snRNA (GUGUAGUA) and the RNA sequence. The RNA sequence is composed of 5 nucleotides before and 3 after the branch point. Then, the sum of hydrogen bounds forming between the RNA and the U2 sequence was computed. The number of TNA motifs was computed in the last 50 nucleotides of each intron. To test the differences for the three features mentioned above between groups of exons activated by spliceosome-associated factors and the control group of exons, the same procedure as the one applied for Fig. 2 b and c was used (see previous section) (Fig. 3c, d). To test the differences between every couple of group of exons, a Tukey’s test was used (with R, glh function (library multcomp)) (Fig. 2e, g, left panel). T-rich low-complexity sequences were computed between the 75th to the 35th nucleotides upstream the 3′ ss, using a sliding window (size 4, step 1 and at least three Ts). Statistical differences were tested by using a linear model for the negative binomial distribution (with glm.nb() function, library MASS). Statistical differences between every couple of group of exons were tested using a Tukey’s test (R, glh function).
V value: exons regulation by U1 or U2 snRNP–associated factors
where s = 1 if k > l; s = − 1 otherwise.
Statistical analysis at gene level
To test whether the GC content of genes hosting AT exons (AT genes) or GC exons (GC genes) was different, the GC content of the genes according to their size and their group was modeled with an ANOVA model (in R, aov() function). A Tukey’s test on this model was computed to compare between all the possible couples of gene groups (AT, GC, and control genes). To test if the median intron size was different for each couple of AT, GC, and control groups of genes, a Wilcoxon’s test was performed. To test if the gene size of GC, AT, and control genes was different, we built an ANOVA model (in R, aov() function). Then, a Tukey’s test was performed on this model. For those analyses, genes hosting both AT and GC exons were not considered.
CLIP-seq dataset analyses
Bed files from publicly available CLIP-seq datasets generated using U2AF2 antibodies (GSE83923, GSM2221657; GSE61603 or GSM1509288) were used to generate density maps. The bed files were first sorted and transformed into bedGraph files using the bedtools suite (v2.25.0). The bedGraph files were then converted into bigWig files using bedGraphToBigWig (v4). The 5′ ss and 3′ ss regions (comprising the ss, 200 nucleotides into the intron and 50 nucleotides into the exon) were considered. The proportions of GC or AT exons, or exons activated by at least one U1-associated factor (e.g., SNRPC, DDX5/17, and SNRNP70) or at least one U2-associated factor (e.g., U2AF2, SF3B4, SF1, and SF3A3) with CLIP peak signals at each nucleotide position of the 5′ ss and 3′ ss regions, were computed.
Analysis of MNase datasets and RNAPII, H3, and histone mark–related ChIP-seq datasets
ChIP-seq or MNase-seq datasets were recovered from Cistrome, ENCODE, and GEO databases (Additional file 1: Table S1). Coverage files (BigWig) were directly downloaded from Cistrome and RNAPII ChIP-seq from ENCODE. Annotations were lifted over from hg38 to hg19 if the coverage file came from Cistrome database. Otherwise, raw data were downloaded for analysis with homemade pipelines. Reads were trimmed and filtered for a minimum length of 25b using Cutadapt 1.16 (options: -m 25), trimmed at their 3′ end for a minimum quality of 20 (-q 20) and then filtered for minimum length of 25b (–m 25). The processed reads were mapped to hg19 with Bowtie2 2.3.3 (options: --very-sensitive --fr -I 100 -X 300 --no-mixed) and filtered for mapping quality over 10 with samtools view 1.6 (options: -b -q 10). For ChIP-seq experiments generated using sonication only, duplicates were removed with homemade tools, which check for coordinates and CIGAR of the Read and the Read 2 adaptor sequence if paired-end sequencing was used. Fragments were reconstituted from the reads, and fragment-coverage files were built using MACS2 126.96.36.19960309 (options: -g hs -B). The metaplots of ChIP/MNase-seq on genes were generated by recovering the fragment coverage (promoter – 1500b/+ 500 from the TSS; first exon, internal exons and introns: according to the coordinates of the annotation; splicing factor–regulated exons − 100/+ 100 from the center, or 500b into the intron and 50b into the exon from the splicing site [MNase and H3]; whole gene: according to the coordinates of the annotations and − 200/+ 200 from the annotation). In the case of RNAPII coverage, only exons regulated in the corresponding cell line, or annotation from their hosting gene, were considered. For internal exons and introns, the coverages of the annotations from the same gene were concatenated respecting their genomic order. The coverages recovered according to the coordinates of the annotations were the split into 1000 bins. The first 199 bins of “internal exons” or of introns were removed to avoid displaying signals influenced by the promoter. Metaplots were built by computing, at each position or bin, the average coverage across the annotations. Statistics were done by comparing average coverage in the annotations from two groups, which were cell-line specific, with a Wilcoxon’s test. In Fig. 5d, the average of the mean coverage on each regulated exons (− 100/+ 100 from center) was computed. For each ChIP-seq experiment, the ratio of the averages (“GC” - “AT”/max(“GC”, “AT”)) was computed and used to build the boxplot in Fig. 6c, and the statistics were computed with a Wilcoxon’s test of the average per experiment of GC compared to AT exons.
Isochores, LADs, TADs, and CLIP-seq dataset analyses
Chromosome coordinates of isochores, LADs, and TADs were recovered from previous publications or GEO (see Additional file 4: Table S3). The bedtool intersect command was used to determine the isochore and TAD and LAD regions to which each exon belongs. The percentage of GC, and the number of exons (GC- or AT exons), present in each annotated isochore, LAD, or TAD was calculated (Additional file 4: Table S3).
Statistical analysis at TAD level
After labelling exons accordingly to their hosting TADs (see above and Additional file 4: Table S3), we tested whether different splicing-related features were equally distributed between TADs. We modeled different features by TADs as a mixed effect on the number of exons with more than two branch points, or the number of U1 and U2 exons, while accounting for the size of TADs (i.e., the number of hosted exons). The models were implemented using the glmer function (package lme4 ) with family = binomial in R software (see Additional file 3: Figure S7). A likelihood ratio test (R software, function anova with test = “Chisq”) was used to test the effect of TADs. The same procedure was used with the function glmer.nb (package lme4) in R to model the number in TADs of TNA motifs and T-rich low-complexity sequences. The same procedure was used with the function lmer (package lme4) in R to model MFE values in TADs.
Cell culture, transfections, RNA preparation, and RT-(q)PCR
The human MCF-7 and HeLa cell lines were cultured in DMEM medium (GIBCO) complemented with 10% FBS, 1% glutamine, and 1% penicillin/streptomycin. Cells were reverse-transfected using Lipofectamine RNAiMax (Thermo Fisher) following the manufacturer’s instructions. Thirty nanomolars of siRNA was mixed with 100 nM of 2′-O-methylated antisense RNA oligonucleotides (AON) targeting intronic sequences. Cells were harvested 48 h after transfection, and total RNA was isolated using TriReagent. 1.5 μg of total RNA was DNAse-treated and retro-transcribed using the Maxima First Strand cDNA Kit (ThermoFisher) following the manufacturer’s instructions. RT-PCR reactions were performed using 12.5 ng cDNA and 0.5 U GoTaq® DNA polymerase (Promega). qRT-PCR reactions were performed using 7.5 ng cDNA and the SYBR® Premix Ex Taq (Takara). Melting curves were controlled to rule out the existence of non-specific products. Relative DNA levels were calculated using the ΔΔCt method (using the average Ct obtained from technical duplicates or triplicates). siRNA, AON, and primer sequences are provided in Additional file 1: Table S1. The cell lines were authenticated by ATCC.
We gratefully acknowledge the support from the PSMN (Pôle Scientifique de Modélisation Numérique) of the ENS de Lyon for the computing resources. We thank the members of the LBMC Biocomputing Center for their involvement in this project. We thank VA Raker for manuscript editing.
The review history is available as Additional file 5.
Peer review information
Anahita Bishop was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
SL performed bioinformatics analyses of ChIP-seq and Mnase-seq.NF performed bioinformatics analyses related to nucleotide composition bias and ClIP-seq. FA performed transfection experiments. JBC performed bioinformatics analyses related to isochore, TADs, and LADs. HP performed bioinformatics analyses of RNA-seq. LM supervised statistical analyses. CB, FM, and DA conceived the project and supervised the experimental work and the bioinformatics analyses. SL, NF, FA, and DA wrote the manuscript with the help and comments from all co-authors. All authors read and approved the final manuscript.
This work was funded by Fondation ARC (PGA120140200853), INCa (2014-154), ANR (CHROTOPAS), AFM-Téléthon, and LNCC. J.B.C was supported by Fondation de France.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
- 5.Lambert MP, Terrone S, Giraud G, Benoit-Pilven C, Cluet D, Combaret V, Mortreux F, Auboeuf D, Bourgeois CF. The RNA helicase DDX17 controls the transcriptional activity of REST and the expression of proneural microRNAs in neuronal differentiation. Nucleic Acids Res. 2018;46:7686–700.PubMedPubMedCentralCrossRefGoogle Scholar
- 9.Giudice G, Sanchez-Cabo F, Torroja C, Lara-Pezzi E. ATtRACT-a database of RNA-binding proteins and associated motifs. Database (Oxford). 2016;1–9.Google Scholar
- 24.Versteeg R, van Schaik BD, van Batenburg MF, Roos M, Monajemi R, Caron H, Bussemaker HJ, van Kampen AH. The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 2003;13:1998–2004.PubMedPubMedCentralCrossRefGoogle Scholar
- 34.Ge Z, Quek BL, Beemon KL, Hogg JR. Polypyrimidine tract binding protein 1 protects mRNAs from recognition by the nonsense-mediated mRNA decay pathway. Elife. 2016;5:e11155.Google Scholar
- 39.Juan-Mateu J, Alvelos MI, Turatsinze JV, Villate O, Lizarraga-Mollinedo E, Grieco FA, Marroqui L, Bugliani M, Marchetti P, Eizirik DL. SRp55 regulates a splicing network that controls human pancreatic beta-cell function and survival. Diabetes. 2018;67:423–36.PubMedCrossRefPubMedCentralGoogle Scholar
- 42.Perron G, Jandaghi P, Solanki S, Safisamghabadi M, Storoz C, Karimzadeh M, Papadakis AI, Arseneault M, Scelo G, Banks RE, et al. A general framework for interrogation of mRNA stability programs identifies RNA-binding proteins that govern Cancer Transcriptomes. Cell Rep. 2018;23:1639–50.PubMedCrossRefGoogle Scholar
- 44.Benoit-Pilven C, Marchet C, Chautard E, Lima L, Lambert MP, Sacomoto G, Rey A, Cologne A, Terrone S, Dulaurier L, et al. Complementarity of assembly-first and mapping-first approaches for alternative splicing annotation and differential analysis from RNAseq data. Sci Rep. 2018;8:4307.PubMedPubMedCentralCrossRefGoogle Scholar
- 80.Zhang L, Tran NT, Su H, Wang R, Lu Y, Tang H, Aoyagi S, Guo A, Khodadadi-Jamayran A, Zhou D, et al. Cross-talk between PRMT1-mediated methylation and ubiquitylation on RBM15 controls RNA splicing. Elife. 2015;4:e07938.Google Scholar
- 81.Mai S, Qu X, Li P, Ma Q, Cao C, Liu X. Global regulation of alternative RNA splicing by the SR-rich protein RBM39. Biochim Biophys Acta. 1859;2016:1014–24.Google Scholar
- 83.Cho S, Moon H, Loh TJ, Oh HK, Cho S, Choy HE, Song WK, Chun JS, Zheng X, Shen H. hnRNP M facilitates exon 7 inclusion of SMN2 pre-mRNA in spinal muscular atrophy by targeting an enhancer on exon 7. Biochim Biophys Acta. 1839;2014:306–15.Google Scholar
- 90.Pai AA, Henriques T, McCue K, Burkholder A, Adelman K, Burge CB. The kinetics of pre-mRNA splicing in the Drosophila genome and the influence of gene architecture. Elife. 2017;6:e32537.Google Scholar
- 93.Tilgner H, Knowles DG, Johnson R, Davis CA, Chakrabortty S, Djebali S, Curado J, Snyder M, Gingeras TR, Guigo R. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res. 2012;22:1616–25.PubMedPubMedCentralCrossRefGoogle Scholar
- 94.Mallinjoud P, Villemin JP, Mortada H, Polay Espinoza M, Desmet FO, Samaan S, Chautard E, Tranchevent LC, Auboeuf D. Endothelial, epithelial, and fibroblast cells exhibit specific splicing programs independently of their tissue of origin. Genome Res. 2014;24:511–21.PubMedPubMedCentralCrossRefGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.