Background

With the advent of high-throughput proteomic and transcriptomic methods, genome-wide data are giving previously unprecedented views of entire collections of gene products and their regulation. Recently, approaches based on nucleotide-enhanced UV cross-linking and oligo(dT) purification have shown that a number of proteins are able to bind to RNA [1, 2].

RNA-binding proteins (RBPs) are key regulators of post-transcriptional events [3] and influence gene expression by acting at various steps in RNA metabolism, including stabilization, processing, storing, transport and translation. RBP-mediated events have been described using recognition and regulatory elements in RNA sequences [4, 5] as well as expression profiles [6] that are tissue specific and conserved across species [79]. Although heterogeneity in gene regulation is responsible for phenotypic variation and evolution [10], very little is known about constitutive expression patterns controlled by RBPs [11, 12], which are the subject of this work.

Data from recent transcriptomic and proteomic studies [13, 14] are becoming attractive for studying mechanisms of gene regulation [15, 16]. Despite the increasing amount of genomic data, the development of computational methods for integrating, interpreting and understanding molecular networks remains challenging [17, 18]. Here we combine our predictions of protein–RNA interactions, based on catRAPID calculations [19, 20], with the information obtained from expression data to investigate constitutive regulatory mechanisms. The catRAPID approach has been previously employed to predict protein associations with non-coding RNAs [21, 22] as well as ribonucleoprotein interactions linked to neurodegenerative diseases [23, 24]. Our theoretical framework has been used to unravel self-regulatory pathways controlling gene expression [25]. The catRAPID omics algorithm, validated using photoactivatable-ribonucleoside-enhanced cross-linking and immunoprecipitation (PAR-CLIP) data, has been recently developed to predict protein–RNA associations at the transcriptomic and proteomic levels [26].

Using comprehensive and manually annotated databases of expression profiles in human tissues, at both protein and RNA levels, we investigated the correlation between RBP activity and regulation. The link between interaction propensity and expression levels was exploited to reveal the fine-tuned functional sub-networks responsible for regulatory control. To explore the results further, we developed the catRAPID express web server [27].

Results

In this study, we focused on the mRNA interactomes of RBPs detected through nucleotide-enhanced UV cross-linking and oligo(dT) purification approaches [1, 2]. Exploiting gene ontology (GO) annotations [28] for protein-coding genes, we systematically analyzed protein–RNA interactions and expression data for human tissues.

At present, few studies have investigated how altering protein expression affects the abundance of RNA targets. Interrogating the Gene Expression Omnibus (GEO) [29] and ArrayExpress databases [30], we found two human proteins, ELAV-like protein 1 (or human antigen R, HuR) [31] and Protein lin-28 homolog B (LIN28B) [32, 33], whose knock-down has been shown to alter the expression of target genes identified by PAR-CLIP (see Materials and methods).

Our predictions, made using the catRAPID algorithm [26], identified experimentally validated interactions with high significance (HuR: P = 10-8; LIN28B: P = 10-3; Fisher’s exact test; see Materials and methods). The interactions were effectively discriminated from non-interacting pairs using score distributions (LIN28B: P = 10-4; HuR: P = 10-16; Student’s t-test; see Materials and methods). Hence, catRAPID is very good at predicting physical interactions between a protein and RNA partners (other statistical tests are given in Materials and methods and Additional file 1).

To understand the regulation of HuR and LIN28B targets better, we studied the relation between interaction propensities and expression levels. We found that the expression of predicted HuR targets is altered (log-fold change, LFC) when HuR is knocked down (P < 10-5; Kolmogorov–Smirnov test; Figure 1A), which is in agreement with experimental data [31]. Similarly, predicted LIN28B targets are downregulated upon protein depletion (P < 10-2; Kolmogorov–Smirnov test; Figure 1B), as shown in a previous study [33]. Moreover, we compared the top 1% of predicted associations with the top 1% of experimental interactions and found the same enrichments for transcripts changing in expression levels upon protein depletion. Specifically, 62% of HuR experimental interactions and 63% of HuR predicted associations had LFC > 0. Similarly for LIN28B, 57% of experimental interactions and 56% of predicted associations had LFC > 0.

Figure 1
figure 1

Relation between protein and RNA regulation. (A) HuR interactome: our predictions, made using catRAPID [26], indicate that expression levels of RNA targets change upon HuR knock-down (log-fold changes, LFC), in agreement with experimental evidence [31] (P < 10-5; Kolmogorov–Smirnov test). (B) LIN28B interactome: RNA targets are downregulated upon LIN28B knock-down (LFC), as reported in a previous study [33] (P < 10-2; Kolmogorov–Smirnov test). In this analysis, the prediction of the interactions was highly significant (HuR: P < 10-8; LIN28B: P < 10-3; Fisher’s exact test). Our results indicate that changes in protein expression influence the abundance of RNA targets to a significant extent. HuR, human antigen R; LFC, log-fold change; LIN28B, lin-28 homolog B.

These HuR and LIN28B examples indicate that changes in protein expression influence the abundance of RNA targets, suggesting that a large-scale analysis of co-expression and interaction propensities could improve understanding of RBP-mediated regulatory mechanisms.

RNA-binding protein–mRNA interactions and relative expression profiles

Our predictions indicate that interacting molecules have both more correlated and anti-correlated expression patterns (see Materials and methods and Figure 2). By contrast, non-correlated expression is not associated with any enrichment in interaction propensity (Additional file 2: Figure S1A). We observed the same results using immunohistochemistry [34] and RNA sequencing data [6] to estimate protein abundances (Additional file 2: Figures S1B and S2; see Materials and methods). This finding is truly remarkable. Direct proportionality between protein and mRNA expression levels has been observed in bacteria and fungi [13, 14] but post-transcriptional modification is known to influence the overall abundance of the protein product in higher eukaryotes [35]. Since immunohistochemistry only provides a qualitative estimate of the amount of protein (see Materials and methods) and the analysis is restricted to 612 proteins, we used RNA sequencing for our predictions (1,156 RBPs).

Figure 2
figure 2

Protein–RNA interaction and expression. (A) In this analysis, we compared interacting and non-interacting protein–RNA pairs at different interaction propensity scores. Areas under the curve (AUCs), expressed as percentages, were used to select the same number of interacting and non-interacting protein–RNA pairs. (B) The same procedure was used to investigate positively and negatively correlated protein–RNA expression at different thresholds. (C) With respect to non-interacting protein–RNA pairs, the predicted associations had enriched positively correlated expression (that is, co-expression; see Materials and methods). (D) Compared to non-interacting protein–RNA pairs, the predicted associations had enriched negatively correlated expression (that is, anti-expression; see Materials and methods). Non-correlated protein–RNA expression did not show any similar trend (Additional file 1). AUC, area under the curve.

The enrichment shown in Figure 2 suggests that a good relation exists between interaction and expression of protein–RNA molecules, which should have co-evolved to be either co-expressed or anti-expressed to exert a regulatory function (Figure 2C,D).

Conservation of expression pattern for functionally related genes

We classified protein–RNA associations into four categories: interacting and co-expressed (IC), interacting and anti-expressed (IA), non-interacting and co-expressed (NIC) and non-interacting and anti-expressed (NIA). We applied conditional tests on each subset to detect significantly over-represented gene ontology (GO) terms (see Materials and methods and Additional file 3: Table S1).

For high interaction propensities, transcripts in the IC subset have more processes associated with cell cycle control, in particular the negative regulation of proliferation (Discussion; Additional file 3: Table S1).

Transcripts interacting with anti-expressed proteins (IA subset) are involved in survival, growth and differentiation processes and have more regulative functions at the DNA level (Discussion; Additional file 3: Table S1).

No clear functional assignments and/or insufficiently populated GO terms were found for transcripts in non-interacting protein–RNA pairs (NIC and NIA subsets).

Intrinsic disorder and RNA-binding protein interaction propensity

Recent findings suggest that RBPs have more structurally disordered regions [1]. To investigate the relation between disorder and RNA-binding ability, we used the IUPred algorithm [36]. For each protein, we extracted structurally disordered regions (IUPred score > 0.4 [1]) and calculated the interaction propensities with human transcripts. We considered both canonical RBPs (that is, containing RNA-binding domains) and putative RBPs (that is, lacking RNA-binding domains) [1]. With respect to the RNA-binding ability of full-length sequences, the contribution of disorder is higher at low interaction propensity scores and becomes negligible at high interaction propensities (see Materials and methods and Figure 3A). Nevertheless, the role of structural disorder is more pronounced in proteins lacking canonical RNA-binding domains, indicating that unfolded regions might be able to promote interactions with RNA (Figure 3B).

Figure 3
figure 3

RNA-binding ability and structural disorder. (A) For each protein, we calculated RNA interactions with full-length sequences as well as structurally disordered regions [1, 36]. When the interaction propensity score of a disordered region exceeds that of the full-length protein (points above the red line), disorder is considered to promote interaction with RNA molecules. (B) For 66% of the proteins (137 entries), disorder contributes at low interaction propensities, while full-length protein sequences dominate at high interaction propensities (Mann–Whitney U test). Overall, from low to high interaction propensities, the contribution of disorder decreases progressively with respect to that of the full-length protein (red and grey lines), in agreement with a previous analysis [25]. The role of disorder is more relevant in proteins lacking canonical RNA-binding domains (grey line), indicating that unstructured regions might have direct involvement in contacting RNA. Interaction propensities are averaged per protein. RBD, RNA-binding domain.

In a previous study we observed that catRAPID scores correlate with chemical affinities [21], which suggests that the interaction propensity can be used to estimate the strength of association [21, 26]. Hence, our results indicate that structural disorder might contribute to low-affinity interactions with RNA (Figure 3A,B), which is in agreement with what has been observed for protein–protein associations [37, 38]. As a matter of fact, it has been reported that disorder regions are able to promote promiscuous and non-specific interactions [39].

Discussion

Because they are associated with transcriptional control of gene expression, RBPs play fundamental roles in health and disease. Indeed, by binding to their target mRNAs, RBPs can influence protein production at different levels (transcription, translation and protein/mRNA degradation). Protein–RNA complexes are very dynamic and can undergo extensive remodeling. Thus, they can control the spatiotemporal regulation of target gene expression and the overall switching on and off of the distinct sets of genes involved in biological processes such as cell cycle progression, cell differentiation, cell response to metabolic stimuli and stress conditions, organ morphogenesis and embryonic development.

Co-expression and interaction propensity are features of cell cycle control

At high interaction propensities (AUC > 95%; see Materials and methods), the IC subset has more GO terms linked to cell cycle control and housekeeping functions such as nucleobase metabolism and purine biosynthesis (Figure 4 and Additional file 3: Table S1). In particular, mRNAs interacting with co-expressed RBPs code for negative regulators of cell proliferation and migration (translation, signaling and metabolite utilization). We found a number of tumor suppressors in the IC subset (AHRR, BAX, BRMS1, CDKN1A, CDKN2A, CTBP1, DAB2IP, DKK3, FLCN, FOXP1, GADD45G, GALR1, GTPBP4, HIC1, IGFBP3, IRF8, KLF4, MEN1, MLH1, NF2, NR0B2, PARK2, PAWR, PAX4, PAX5, PCGF2, PHB, PML, PPP1R1B, PPP2R4, PTPRJ, PYCARD, RHOA, SIRT2, TFAP2A, TNFAIP3, TRIM24, TSC2, TSG101, UCHL1). Interestingly, 90% of IC genes annotated with more functional categories (381 out of 422) are listed in the gene index of the National Institutes of Health’s Cancer Genome Anatomy Project [40]. Terms associated with inhibition of cellular pathways (especially the negative regulation of phosphorylation and regulation of protein serine/threonine kinase activity) are also more prevalent in the IC subset when immunochemistry data are used.

Figure 4
figure 4

GO enrichment for interacting mRNA–RBP pairs correlated in expression (IC subset). Using the catRAPID score distribution, we counted mRNA GO enrichment associated with different areas under the curve (see Materials and methods). The color gradient (yellow to red) indicates the AUC values (number of interactions: 20,702,804 for AUC > 50%, 10,351,402 for AUC > 75%, 2,070,280 for AUC > 95%). We found that cell cycle processes have more highly interacting mRNA–RBP pairs (AUC > 95%) that are correlated in expression. AUC, area under the curve; GO, gene ontology; IC, interacting and co-expressed; RBP, RNA-binding protein.

As mutations altering tumor suppression lead to aberrant proliferative events, we speculate that downregulation of specific genes is a mechanism for preventing indiscriminate cellular growth. In agreement with this hypothesis, it has been reported that somatic loss of function of the tumor suppressor tuberous sclerosis 2 (TSC-2) leads to the development of benign and malignant lesions in the myometrium, kidney and other tissues sharing common features such as a low rate of renewal and defects in the mitochondrial respiratory chain associated with oncogenesis [41, 42]. This gene is annotated in all the functional categories prevalent in the IC subset. Intriguingly, it is predicted that TSC-2 mRNA interacts strongly with Nuclear Protein 5A (NOP56). The interaction propensity is 175 corresponding to an AUC of 99.5%. This protein is an essential component of the splicing machinery [43] that is differentially expressed in leiomyoma and downregulated in response to hypoxia [44]. It is possible that hypoxia-dependent repression of NOP56 expression [4547] is a protective mechanism against fast growth and potential tumor progression. Indeed, it has been reported that NOP56 and TSC-2 are not differentially expressed in renal carcinomas and oncocytomas [48, 49] (ArrayExpress: E-GEOD-12090; ArrayExpress: E-GEOD-19982), indicating loss of regulation during malignant progression.

Based on these observations, we propose that downregulation of RBPs promoting the translation of dysfunctional tumor suppressors can prevent indiscriminate cellular growth and that loss of control can destine a cell to malignancy (additional examples are reported in Additional file 1).

Anti-expression and interaction propensity are features of repressing processes

For AUC > 95%, the IA subset has more terms associated with cell differentiation processes (for example, proximal/distal pattern formation) as well as inflammation (for example, positive regulation of isotype switching), which are known to be tightly linked [5052]. In fact, a number of differentiation cytokines (IL18, IL23 and EBI3/IL27) and stimulators of cytokine production (CD28 and CD80CCR2/CD192) are in the subset. Moreover, a large fraction of entries is also linked to protein–DNA complex assembly and regulation of transcription initiation from RNA polymerase II promoter (Figure 5 and Additional file 3: Table S1). It has been shown that 94% of genes in IA enriched functional categories (124 out of 132) are listed in the annotated gene index of the National Institutes of Health’s Cancer Genome Anatomy Project [40]. Remarkably, terms clearly associated with cell differentiation and inflammation (especially regulation of embryonic development and B cell activation involved in immune response) are more prevalent in the IA subset when immunochemistry data are used.

Figure 5
figure 5

GO enrichment for interacting mRNA–RBP pairs anti-correlated in expression (IA subset). Using the catRAPID score distribution, we evaluated mRNA GO enrichment associated with different areas under the curve (see Materials and methods). A color gradient (cyan to blue) shows the AUC values (number of interactions: 20,702,804 for AUC > 50%, 10,351,402 for AUC > 75%, 2,070,280 for AUC > 95%). We found that cell differentiation processes are more prevalent in interacting mRNA–RBP pairs (AUC > 95%) that are anti-correlated in expression. AUC, area under the curve; GO, gene ontology; IA, interacting and anti-expressed; RBP, RNA-binding protein.

IA genes share the common functional property of regulating survival, growth and differentiation processes. As RBPs play a crucial role in repressing gene expression [53, 54], IA associations could be involved in the regulation of proliferative events. Indeed, adult tissues are constantly maintained at the steady state [13] but a dramatic reawakening of growth, survival and differentiation genes occur in either physiological conditions (for example, wound healing [50]) or pathological progression to cancer [55].

In the IA set, we found YTHDC1 (YT521-B), which is a ubiquitously expressed member of the novel RNA-binding YTH-domain family [56]. YTHDC1 represses gene expression by either sequestering splicing factors or directly binding to transcripts [5759] (Additional file 2: Figure S5A). Among the transcripts that we predict to be potentially targeted by YTHDC1, we found several proto-oncogenes or tumor-associated genes such as RET, PRMT2, RARG and HOXA9 (RET: interaction propensity = 166; PRMT2: interaction propensity = 209; RARG: interaction propensity = 194; HOXA9: interaction propensity = 165; all corresponding to an AUC of 99.5%). In particular, alternatively spliced variants of PRMT2 were related to survival and the invasiveness of breast cancer cells [60, 61], while high expression of RARG and HOXA9 has been observed in human hepatocellular carcinomas and acute leukemia [62, 63]. We hypothesize that perturbation of the regulation by YTHDC1 of potentially oncogenic genes such as RET, PRMT2, RARG and HOXA9 could be involved in the pathogenesis of related tumors. In fact, experimental studies support the implications for YTHDC1 in cancer progression with regard to angiogenesis, growth factor signaling, immortalization, genetic instability, tissue invasion and apoptosis [59, 64, 65].

Similarly, the translational silencer TIA-1, also reported to induce mRNA decay [6668], is predicted to interact with the ubiquitously expressed NAP1L1 transcript (interaction propensity = 113 corresponding to an AUC of 95%), consistent with iCLIP data for HeLa cells (ArrayExpress: E-MTAB-432) [69] (Additional file 4: Table S2). Deregulation of NAP1L1 expression has been documented for several tumors such as small intestine carcinoid neoplasia [70], neuroendocrine tumors [71], ovarian cancer [72] and hepatoblastomas [73]. We hypothesize that TIA-1 plays a fundamental role in the post-transcriptional regulation of NAP1L1 and that alteration of this regulatory process contributes to NAP1L1-associated tumor development.

We note that repression of aberrant interactions can be achieved by gene silencing, which prevents the potential stabilizing action of RBPs on specific transcripts (Additional file 2: Figure S5B). For instance, the Nodal gene is normally silenced in adult tissues and its expression is associated with tumor progression [74]. Since Nodal is a member of the Transforming Growth Factor β (TGFB) superfamily and controls mesoderm formation and axial patterning during embryonic development [74], it is possible that Nodal interactions with specific RBPs lead to pathogenesis in adult tissues. Our predictions indicate that the transcript Nodal interacts with a number of anti-expressed RBPs (ADD1, API5, ARCN1, CANX, CAPRIN1, CCT6A, DKFZP434I0812, GSPT1, HSP90AB1, PKM, PUF60, XRCC5, YTHDC1 and YWHAZ). Since the exact mechanism regulating Nodal is at present unknown, we generated a list of protein partners that could be exploited for future experimental studies (Additional file 5: Table S3).

Conclusions

Comparative expression studies provide important insights into biological processes and can lead to the discovery of unknown regulation patterns. While evolutionary constraints on tissue-specific gene expression patterns have been extensively investigated [79, 75, 76], the constitutive regulation of RBP-mediated interactions is still poorly understood [11, 12]. It has been previously observed that cellular localization and gene expression levels impose stringent conditions on the physicochemical properties of both protein and RNA sequences [77, 78], but large-scale computational analyses of constitutive RBP-mediated regulatory networks have never been attempted before. Our study shows for the first time that the integration of in silico predictions [19] with ex vivo expression profile data [6, 34] can be used to discover distinct features of RBP biological functions.

We observed an enrichment of unique and functionally related GO terms for RBP–mRNA pairs associated with high interaction propensities and specific expression patterns. In our analysis, co-expression of interacting mRNA–RBP pairs (IC set) is linked to regulation of proliferation and cell cycle control, while anti-expression (IA set) is a characteristic feature of survival, growth and differentiation-specific processes. We do not exclude that RBP–mRNA associations displaying poor interaction propensities (NIC and NIA sets) might have important evolutionary implications as spatiotemporal separation and limited chemical reactivity could be ways to avoid aberrant associations [55].

We found that RNA-binding proteins are enriched in structurally disordered regions and that unfolded polypeptide fragments promote association with RNA molecules at low interaction propensities. As disordered proteins are highly reactive [37], it is reasonable to assume that interaction with RNA needs to be tightly regulated to avoid cellular damage [39]. In this regard, our results expand at the nucleic acid level what has been previously observed for the general promiscuity of natively unfolded proteins [38, 79].

In conclusion, we hope that our study of protein–RNA interaction and expression will be useful in the design of new experiments and for further characterizing ribonucleoprotein associations. A list of proposed interactions and a server for new inquiries are available at the catRAPID express webpage [27].

Materials and methods

Prediction for LIN28B and HuR interactions

We performed a number of tests to assess the quality of our calculations (see section on RNA-binding protein–mRNA interaction propensity) using PAR-CLIP data [31, 33]. In this analysis, we used all the RNA interactions present in our dataset (positive set: 285 sequences for LIN28B and 579 for HuR) and, due to the unavailability of non-bound RNAs, the full list of human transcripts (negative set: 105,000 sequences).

For the significance of interaction predictions, we performed Fisher’s exact test comparing the top 1% of predicted interactions with the remaining protein–RNA associations (HuR: P = 10-8; LIN28B: P = 10-3). Fisher’s exact test was computed using equal amounts (that is, 1% of the total interactions) of randomly extracted negative subsets (HuR: P = 10-7; LIN28B: P = 0.0002; Additional file 2: Figure S3).

For the significance of score distributions, we used Student’s t-test to compare the score distribution of positives and negatives (HuR: P = 10-16; LIN28B: P = 10-4). We also performed Student’s t-test using random extractions of negative subsets, each containing the same number of RNAs as positives (LIN28B: P = 0.03; HuR: P < 10-8; Student’s t-test).

Other statistical tests (receiver operating characteristics and precision/recall curves) are discussed in Additional file 1. The expression data for HuR and LIN28B were taken from the original manuscripts [31, 33] and processed as indicated by the authors. The datasets were downloaded from GEO [29] (GSE29943) and ArrayExpress [80] (E-GEOD-44615 and E-GEOD-44613).

mRNA dataset: Human BodyMap

The Human BodyMap (HBM) 2.0 contains expression data generated using the Hiseq 2000 system and it has expression profiles for a number of human tissues [22]. The HBM RNA sequencing (RNA-seq) data was downloaded from ArrayExpress [81] under accession number E-MTAB-513. The final mRNA dataset contained 35,818 transcripts (11,584 genes) with expression levels for 14 human tissues (see section on RNA-binding protein–mRNA expression). We considered all human cDNAs from EnsEMBL release 68. Transcripts incompatible with the catRAPID size restrictions (that is, 50 to 1,200 nucleotides) or not expressed in at least one tissue were filtered out. In the analysis, we evaluated different CD-HIT [82] sequence similarity cutoff thresholds (see section on Gene ontology analysis).

RNA-binding protein dataset: Human Protein Atlas

We considered all the RBPs reported in two studies on RBPs binding to mRNAs [1, 2]. The initial dataset consisted of 3,500 RBPs (832 genes). Proteins incompatible with catRAPID’s size restrictions (that is, 50 to 750 amino acids) and above a CD-HIT [82] sequence similarity cutoff of 75% were filtered out. Similarly, proteins not present in the Human Protein Atlas (HPA) database (version 11.0) [34] and not expressed in at least one tissue were discarded. The final RBP (HPA) dataset contained 612 proteins (491 genes) with expression levels for 14 human tissues (see section on RNA-binding protein–mRNA expression). All protein sequences were retrieved from EnsEMBL release 68.

RNA-binding protein dataset: Human BodyMap

As for RBPs in the HPA, filters on sequence size and redundancy were applied. Proteins not present in the Human BodyMap database (version 2.0) [6] were discarded. The final RBP (HBM) dataset contained 1,156 proteins (543 genes) with expression levels for 14 human tissues (see section on RNA-binding protein–mRNA expression). All protein sequences were retrieved from EnsEMBL release 68.

RNA-binding protein–mRNA expression

We analyzed 14 human tissues for which both immunohistochemistry [34] and transcript abundances [6] were available. At present, the Human Protein Atlas is the largest collection of protein abundance data available [34]. Transcripts in the mRNA dataset and proteins in the RBP dataset were represented by vectors containing the normalized relative abundance of the following tissues: adrenal gland, brain, breast, colon, heart, kidney, liver, lung, lymph, muscle, lymph node, ovary, prostate and thyroid. For the immunohistochemistry data, the read-outs ‘no’, ‘low’, ‘intermediate’ or ‘high’ expression were transformed into numbers (0, 1, 2, 3) and subject to Z-normalization per tissue. As for the transcript data, the vectors were Z-normalized using the average and standard deviation per tissue. For each RBP–mRNA combination we computed the pairwise Pearson’s correlation coefficient of the vectors. As shown in Additional file 2: Figures S1 and S2, we observed the same trends using immunohistochemistry [34] and RNA-seq data [6] to estimate protein abundances in human tissues.

RNA-binding protein–mRNA interaction propensity

We used catRAPID [19, 20] to compute the interaction propensity of each protein in the RBP dataset with each transcript in the mRNA dataset. catRAPID predicts protein–RNA associations by estimating the interaction propensity between amino acids and nucleotides using secondary structure information, hydrogen bonding and Van der Waals forces [19, 20]. The approach was previously applied to predict associations between different types of proteins and RNA molecules [21, 23]. Although each protein binds to distinct types of RNA structures [83], we observe that the contribution of hairpin loops accounts for 57% of the overall interaction propensity [19]. The catRAPID web server is publicly accessible from our webpage [84].

Protein–RNA interaction and expression

For a given protein, interacting (n int ) and non-interacting (n no-int ) protein–RNA pairs were compared at different AUCs (areas under the curve) of the interaction propensity distribution. The enrichment in positively correlated expression (Figure 2C) is calculated as:

enrichment co expressed interactions = n int r > r th n no int r > r th n no int r > r th
(1)

In Equation (1), the correlation coefficient r follows the distribution of protein–RNA expression and the parameter r th  > 0 corresponds to an AUC spanning the range 50% to 99.5% (Figure 2B).

Similarly, for negatively correlated expressions (Figure 2D):

enrichment anti expressed interactions = n int r < l th n no int r < l th n no int r < l th
(2)

In Equation (2), the parameter l th  < 0 corresponds to an AUC spanning the range 50% to 99.5% (Figure 2B).

Gene ontology analysis

For each area under the curve (AUC) of the catRAPID score distribution (50% < AUC < 99.5%), we created four subsets according to the correlation in tissue expression: (1) IC subset: positively correlating and interacting genes (expression correlation ≥ +0.7 and positive interaction propensities); (2) IA subset: negatively correlating and interacting genes (expression correlation ≤ −0.7 and positive interaction propensities); (3) NIC subset: positively correlating and non-interacting genes (expression correlation ≥ + 0.7 and negative interaction propensities); (4) NIA subset: negatively correlating and non-interacting genes (expression correlation ≤ −0.7 and negative interaction propensities). The expression correlation of |0.7| corresponds to AUC = 95% of the statistical distribution, for which we found the highest enrichments (Figure 2C,D). We systematically applied conditional tests for GO term over-representation in each subset using the GOStats package (version 2.28.0) available from Bioconductor [85]. To assess the over-representation of a GO term in one particular subset at a certain AUC, we considered five criteria (Additional file 3: Table S1; Additional file 6: Table S4; Additional file 2: Figure S6):

  1. 1.

    The GO term must be reported for more than two genes.

  2. 2.

    The P value of the GO term must be significant (P < 0.05) in the subset of interest and non-significant (P > 0.1) in the others.

  3. 3.

    The enrichment must be conserved with respect to: (a) the entire human transcriptome (that is, including RNAs longer than 1,200 nucleotides and independently of expression data), (b) the complete set of analyzed genes (that is, including RNAs shorter than 1,200 nucleotides and with available expression) and (c) all genes under the same AUC (that is, considering both interacting and non-interacting pairs at the two tails of the distribution).

  4. 4.

    The P value of the GO term must be non-significant (P > 0.1) in: (a) the complete set of analyzed genes compared to the human transcriptome (significance would indicate enrichment irrespective of the subset assignment) and (b) the list of transcripts compatible with catRAPID length requirements compared to the human transcriptome (significance would indicate length bias in the statistics; see section on length bias statistics).

  5. 5.

    The enrichment must be conserved after sequence redundancy reduction to the 80% identity threshold.

Length bias statistics

Due to the conformational space of nucleotide chains, prediction of RNA secondary structures is difficult when RNA sequences are >1,200 nucleotides and simulations cannot be completed on standard processors (2.5 GHz; 4 to 8 GB memory). To see whether GO enrichment is biased by the catRAPID length restriction, we used a hypergeometric test (see section on the RNA-binding protein–mRNA interaction propensity). If a GO term is enriched in the length-restricted set, it is excluded a priori from the analysis because genes annotated in that GO term would be only selected for the length range. Thus, we imposed that GO terms must be non-significant (P > 0.1) in the length-restricted set of genes (see section on gene ontology analysis). This condition ensures that there is no bias due to length restrictions for any GO term enriched in a particular subset (Additional file 3: Table S1).

Analysis of RNA-binding protein sequence disorder

The content of disordered regions in the RBP sequences was computed using IUPred [36]. For each protein, we extracted structurally disordered regions (IUPred score higher than 0.4) and calculated their interactions against the reference transcriptome. We compared the interaction propensities of each disordered region with that of the full-length protein and assessed if there was an increase or decrease of the interaction propensity score (Figure 3A). The contribution of the disordered region was evaluated using a Mann–Whitney U test, where a significant increase (P < 0.05; H0 < H1) in the interaction propensity score is associated with a positive contribution. From low to high interaction propensities, the contribution of disorder decreases progressively with respect to that of the full-length proteins (Figure 3A). The role of disorder is more pronounced in proteins lacking canonical RNA-binding domains, indicating that unstructured regions have a direct involvement in contacting RNA (Figure 3B).

Web server

catRAPID express [27] is a publicly available implementation of catRAPID [19, 20], which is used to study the relation between protein–RNA interaction propensity and expression in Homo sapiens. The tool has two components: (1) catRAPID predictions of protein–RNA interaction and (2) the computation of correlation using protein and RNA expression profiles [6, 34]. A description of how catRAPID makes predictions can be found in the Documentation, Tutorial and Frequently Asked Questions (FAQs) on the webpage. Expression profiles of the RBP dataset and mRNA dataset are assigned respectively to input proteins and RNA using a homology-based criterion (ten top-ranked proteins with a BLAST [86]e ≤ 0.01 and ≥75% whole sequence similarity; ten top-ranked transcripts with a BLAST e ≤ 0.01 and ≥95% whole sequence similarity). Sequence similarity is evaluated using the Needleman–Wunsch algorithm [87].