Introduction

RNAs, in conventional view, comprise mostly messenger RNAs (mRNAs) that are used by cells merely to convey genetic information and several typical noncoding RNAs (ncRNAs), including transfer RNA (tRNA) and ribosomal RNA (rRNA), which are directly involved in protein translation. However, as the rapid development of high-throughput RNA-sequencing biotechnologies, a large number of previously un-annotated ncRNAs have been uncovered1,2, which mainly contain microRNAs (miRNAs) and long noncoding RNAs (lncRNAs)3. Recently, accumulating studies have shown that ncRNAs play critical roles in a variety of biological processes and thus are associated with a wide spectrum of diseases4,5,6,7. Although they cannot encode proteins, ncRNAs are important regulators inside the complex regulatory networks of homeostasis8,9. Therefore, ncRNAs are essential to maintain health and are considered to have potential for disease diagnosis and therapy.

Given the large number of ncRNAs, it becomes increasingly urgent to annotate them. For this purpose, computational methods have a significant contribution10. In the post-genomic era, one of the most important topics is identifying the functional sites of ncRNAs for better understanding of their functions. However, methods for the above purpose remain limited. The diverse functions of RNAs, especially ncRNAs, are not only conferred by their nucleotide sequences, but also endowed by the secondary, tertiary and even quaternary structures11,12,13. To infer functional sites, several biochemical methods have been developed14, such as SHAPE and mutational profiling (SHAPE-MaP)15, RNase footprinting16, hydroxyl radical footprinting17 and short-range RNA-RNA crosslinking18. These methods contribute much to the identification of the functional sites of RNAs. However, they are rather laborious, time consuming and not economical. In addition, it is hard to keep up with the fast increasing number of ncRNAs. Moreover, the above approaches only work well on identifying big RNA domains or RNA motifs but perform poorly on identifying functional sites whose lengths range from one nucleotide to several nucleotides. Thus, developing alternative computational methods is increasingly needed.

In contrast, for proteins, one big class of more well-studied biological molecules, there are a number of in silico methods for the identification of their functional sites. Given that many public databases have been created for proteins to store related experimental data, such as protein 3D structures19, protein-protein interactions20,21,22 and known protein functional sites23,24,25,26, a myriad of computational methods for predicting protein functional sites have been released based on the above datasets27. These methods use diverse strategies, e.g. machine learning28, pattern recognition29, phylogenetic motif identification30,31,32, structure network analysis33, text mining34 and geometric analysis35,36,37,38. However, considering currently limited information about ncRNA functional sites and low sequence conservation of ncRNAs39, computational methods like machine learning and phylogenetic motif identification are not suitable so far to predict ncRNA functional sites. As for geometric analysis, it finds that the residues close to the protein centroid38 and the residues that resides on the protein surface35 could be the functional elements. These residues are normally located within the local extreme points of residue distance curve. We suppose that it could be applicable for ncRNA as well. In a previous study40, we have developed a tertiary geometry based computational method, Rsite, to identify ncRNA functional sites and have displayed its reliable accuracy. The major limitation of Rsite is that it works based on ncRNA tertiary structures. Up to now for most of ncRNAs, their experimentally derived tertiary structures are not available. In addition, it is still very hard to identify the tertiary structures of RNAs especially large RNA molecules (>=100nt) using computational prediction41,42,43. The application of Rsite is therefore largely restricted.

In this study, we first revealed a strong correlation between the nucleotide distances derived from the RNA tertiary structures and those derived from the RNA secondary structures. This makes it reasonable to develop secondary structure geometry based computational methods for ncRNA functional site prediction. Compared with ncRNA tertiary structures, secondary structures can be determined and predicted more simply and easily. Based on the above observation, here we developed a secondary structure based computational method, Rsite2, to predict the functional sites of ncRNAs. Finally, we confirmed that Rsite2 has a nearly equivalent accuracy as the tertiary structure based method (Rsite) but is much more feasible and efficient.

Results

ncRNA secondary structures and tertiary structures show significantly positive correlation

With a non-redundant list of RNA-containing 3D structures from the Nucleic Acid Database (NDB, http://ndbserver.rutgers.edu/)44,45 that records in total 1419 PDB IDs of the equivalence class representatives, we obtained 1034 biological assemblies of them from the Protein Data Bank (PDB, http://www.rcsb.org/pdb/home/home.do)19. After a screening procedure (see Methods), there left 203 ncRNA molecules for further 2D-3D structure correlation analysis. Almost all of the 203 ncRNAs are shorter than 200nt except a few rRNA. In the case of the Euclidean distances from nucleotides to molecular centroid, the overall Spearman correlation coefficient is 0.79 (p = 0) (Fig. 1a). In addition, for each specific ncRNA molecule, we calculated the q values using the R (v3.2.0) package ‘qvalue’46 (v1.99.1). It can be seen that 122 (60%) of the 203 ncRNAs exhibit significantly strong correlations (Spearman rho >0.6, q < 1e-5) (Fig. 1b) and 173 (85%) ncRNA molecules demonstrate significantly positive correlations (q < 0.01) between 2D and 3D structures.

Figure 1
figure 1

ncRNA secondary structure positively correlates with tertiary structure significantly.

(a) Nucleotide distances to corresponding centroids derived from the tertiary structures plotted versus those derived from the secondary structures of the 203 ncRNAs (Spearman rho = 0.79, p = 0). (b) −log10(q) plotted versus Spearman correlation coefficients of the 203 ncRNAs. Data points of ncRNAs that display significantly strong correlations (Spearman rho >0.6, q < 1e-5) are in red.

As shown above, it is reasonable to predict ncRNA functional sites on the basis of secondary structure-derived distance metrics. And we further proved the feasibility in the following sections.

Predicting the functional sites of the tRNA (Lys)

Firstly, we downloaded the secondary structure of the tRNA (Lys) in PostScript format from RNA STRAND (http://www.rnasoft.ca/strand/)47. We next calculated and plotted the distance curve (Fig. 2). As a result, Rsite2 identified 10 possible functional sites in total. The result shows that only 1 of the 7 known functional sites is missed (Table 1; Fig. 3). As a result, Rsite2 has a sensitivity of 86% (6/7) and a positive predictive value (PPV) of 60% (6/10) for the prediction of the functional sites of tRNA (Lys).

Table 1 The known functional sites (FSs) and the Rsite2 hits of the tRNA (Lys).
Figure 2
figure 2

The raw nucleotide distance curve (a) and the smoothed nucleotide distance curve (b) of the tRNA (Lys).

Figure 3
figure 3

The predicted functional sites of the tRNA (Lys) in a graphical representation.

The figure shows the secondary structure of the tRNA (Lys). A predicted functional site is colored red on condition that it hits a known functional site, otherwise blue. And the missed known functional sites are in green color.

Predicting the functional sites of the Diels-Alder ribozyme

In the same way as the above section, we calculated and plotted the distance curve of the Diels-Alder ribozyme (Fig. 4). As a result, Rsite2 predicted 7 putative functional sites for the Diels-Alder ribozyme. We successfully predicted all of the 3 known functional sites (Table 2; Fig. 5). The result shows that Rsite2 has a sensitivity of 100% (3/3) and a PPV of 43% (3/7).

Table 2 The known functional sites (FSs) and the Rsite2 hits of the Diels-Alder ribozyme.
Figure 4
figure 4

The raw nucleotide distance curve (a) and the smoothed nucleotide distance curve (b) of the Diels-Alder ribozyme.

Figure 5
figure 5

The predicted functional sites of the Diels-Alder ribozyme in a graphical representation.

The figure shows the secondary structure of the Diels-Alder ribozyme. A predicted functional site is colored red on condition that it hits a known functional site, otherwise blue. And the missed known functional sites are in green color.

Predicting the functional sites of the S. cerevisiae RNase P

Moreover, we applied Rsite2 to RNAase P whose secondary structure and functional sites were experimentally identified. For doing so, we downloaded the secondary structure of RNase P from RNA STRAND47. And then we calculated the distance curve (Fig. 6). As a result, Rsite2 predicted 25 putative functional sites for the RNase P. Rsite2 successfully hit 13 of the 17 functional protein binding sites derived from RNase footprinting assays48 (Table 3; Fig. 7). It shows that Rsite2 achieves a sensitivity of 76% (13/17) and a PPV of 52% (13/25).

Table 3 The known functional sites (FSs) and the Rsite2 hits of the S. cerevisiae RNase P.
Figure 6
figure 6

The raw nucleotide distance curve (a) and the smoothed nucleotide distance curve (b) of the S. cerevisiae RNase P.

Figure 7
figure 7

The predicted functional sites of the S. cerevisiae RNase P in a graphical representation. The figure shows the secondary structure of the S. cerevisiae RNase P. A predicted functional site is colored red on condition that it hits a known functional site, otherwise blue. And the missed known functional sites are in green color.

Predicting the functional sites of human mitochondrial transcripts

Recently, by digital RNase footprinting, Liu et al.49 mapped the functional segments of human mitochondrial transcripts that participate in interactions with proteins. We wondered whether these regions can be predicted by Rsite2. For this purpose, we first retrieved the transcript sequences (Feb. 2009 GRCh37/hg19) on the mitochondrial chromosome (chrM) using the UCSC Table Browser data retrieval tool50 (http://genome.ucsc.edu/). RNAfold51 was then used to predict the secondary structures of these mitochondrial RNAs. Finally, Rsite2 predicted the putative functional sites for each of them. After careful comparison, we found that Rsite2 successfully hits 32 (74%) of the 43 functional segments in the mitochondrial RNAs from the UCSC database50. Due to the low ratio of footprint’s length to RNA’s length and the large number of predicted sites and because in default Rsite2 merges sites no more than 2 nucleotides apart regardless of RNA’s length, PPVs range from 1% (1/72) to 44% (4/9) and the overall PPV is 7% (54/763) (Supplementary File 1).

Discussion

It is increasingly needed to develop computational methods for the identification of ncRNA functional sites. We previously developed an ncRNA tertiary structure based method, Rsite. However it is largely limited because the tertiary structures of most ncRNAs are not available. In this study, we revealed that there is a significantly positive correlation between ncRNA secondary structures and tertiary structures. This observation gives clues that secondary structure-derived distance metrics could be used to predict their functional sites. Based on this observation, we then presented a secondary structure based computational method to predict candidate functional sites of ncRNAs. Since we are not able to assess Rsite2 systematically for the reason that there are not standard datasets that we can employ as benchmarks at the present stage, we showed the efficiency and accuracy of Rsite2 by applying it to three ncRNAs and a set of mitochondrial RNAs. Compared with the first version of Rsite, Rsite2 was improved in the following two aspects. Firstly, Rsite2 runs on the RNA secondary structures. It is known that RNA secondary structures are much easier and more feasible to be obtained by wet-lab or computational experiments than tertiary structures. This makes Rsite2 much more feasible and efficient. In addition, it would provide great convenience to the functional site prediction of large ncRNA, of which we just know about the primary sequence and the secondary structure and facilitate the investigation of ncRNA functions. Secondly, Rsite2 provides the users with not only the source codes but also a web-server, which makes it very easy for users to predict the functional sites of their ncRNAs.

With simple single-stranded chains, ncRNAs have the ability to form certain common structural units which can serve as building blocks for subsequent elaborate construction of tertiary structures. So it makes sense that overall secondary structures of ncRNAs are positively correlated with their tertiary structures. However, due to the disparity between some predicted secondary structures and known secondary structures, their 2D-3D structure correlations are weak or even negative. For instance, an RNA pseudoknot with 3D domain swapping (PDB ID: 387D), whose length is 26, shows significantly negative correlation (Spearman rho = −0.82, p = 2e-07) if the secondary structure generated by RNAfold51 is used. Conversely, the correlation turns into a significantly positive one (Spearman rho = 0.69, p = 9e-05) once replaced with known 2D structure taken from RNA STRAND (http://www.rnasoft.ca/strand/)47.

Nonetheless, ncRNA secondary structure cannot perfectly substitute for tertiary structure in light of the inevitable information loss. It is quite difficult to precisely depict all of the more sophisticated tertiary structural motifs via secondary structural motifs. The more structural features there are unique to 3D-level, the more information loss comes from the replacement with 2D structure. Furthermore, we expect that taking other features such as sequence conservation and structure conservation into account would improve the current method.

From another perspective, the divergences between ncRNA secondary structure and tertiary structure might be hints for functional sites too. It is totally reasonable that nucleotides which seem apart in ncRNA secondary structure contact with each other in 3D space, forming a pocket that maybe catalyze chemical reactions, or a domain resembling other shapes, like a clamp. Besides, we hypothesize that, during the building process of tertiary structure, nucleotides folded inside out or outside in are candidates for functional sites. Further research about this is required.

The understanding of RNA folding principles is at an early stage52,53. When more details about RNA folding become available, we expect Rsite2 will be improved at that time. Finally, although limitations exist in the current method, we believe that the secondary structure based method (Rsite2) is an alternative solution for the prediction of the functional sites of ncRNAs.

Methods

The tertiary structure data of ncRNAs

To generally analyze the correlation between ncRNA secondary structure and tertiary structure, we first acquired the latest non-redundant PDB ID list of RNA-containing tertiary structures (Release 1.89, 2014-12-05) from NDB website (http://ndbserver.rutgers.edu/)44,45 and then downloaded from PDB server (http://www.rcsb.org/pdb/home/home.do)19 the corresponding biological assemblies of the equivalence class representatives. As a result, we obtained the tertiary structures of 1034 ncRNAs.

Predicting the secondary structures of ncRNAs

From the above structure files of ncRNAs, we extracted the nucleotide sequences using an in-house python program. We then predicted the secondary structures of these ncRNAs using RNAfold51 (v1.8.5), an easy-to-use tool for RNA secondary structure prediction. It produced for each ncRNA sequence one PostScript file, from which we further fetched out the 2D coordinates of each ncRNA molecule.

Calculating nucleotide location in ncRNA structures

We first obtained the 3D and the 2D coordinates from the PDB files and the PostScript files, respectively. We then calculated their molecular centroids based on the 3D and 2D coordinates. We presented two metrics to calculate the location of each nucleotide in an ncRNA. The first metric, nucleotide distance to centroid (NDC), is the Euclidean distances between any one nucleotide and the centroid. And we computed the sum of distances for each nucleotide between it and all the rest nucleotides in one RNA molecule as the second metric, called nucleotide distance sum (NDS). Both types of distances can quantitatively evaluate the location of one nucleotide. The outermost nucleotide has the biggest distance value and the innermost the smallest.

Correlation analysis between the 2D and the 3D structures of ncRNAs

We performed Spearman correlation analysis for the 2D and the 3D nucleotide locations calculated using the above two metrics. Before the above process, we filtered out the ncRNAs with multiple RNA chains, those no longer than 20nt, those whose coordinates are incomplete and those whose nucleotide composition does not cover all four kinds of bases.

Identifying ncRNA functional sites

We used three ncRNAs (tRNA (Lys), Diels-Alder ribozyme and RNase P) and a set of human mitochondria transcripts with well-annotated functional sites to validate the accuracy of Rsite2. For a given ncRNA, Rsite2 first calculates the two metrics of nucleotide location defined above based on the 2D coordinates derived from the predicted secondary structure. Then, Rsite2 utilizes the algorithms of Rsite40 to predict functional sites. In Results, the distance metric is NDC in default.

Algorithms of Rsite in brief

For an ncRNA with n nucleotides, we can get a distance curve (D) of length n, of which D(i) denotes the location of the ith nucleotide and smooth it with a Gaussian filter. Then we identify the nucleotides that are the extreme points in the distance curve as the putative functional sites. And multiple points close (<= 2nt in default for their sequence positions) to each other are integrated into one functional site.

Additional Information

How to cite this article: Zeng, P. and Cui, Q. Rsite2: an efficient computational method to predict the functional sites of noncoding RNAs. Sci. Rep. 6, 19016; doi: 10.1038/srep19016 (2016).