RMalign: an RNA structural alignment tool based on a novel scoring function RMscore
Predicting the 3D structure of RNA-protein complexes remains challenging. Our team recently proposed PRIME, a template-based approach that builds RNA-protein 3D complex structure models with a higher success rate than computational docking software. However, the scoring function of SARA, the RNA alignment algorithm used in PRIME, is size-dependent, which limits its ability to detect templates in some cases.
Herein, we developed a novel RNA 3D structural alignment approach, RMalign, which is based on a size-independent scoring function, RMscore. The parameter in RMscore is optimized on randomly selected RNA pairs, and the phase transition points (from dissimilar to similar) are determined on a separate set of randomly selected RNA pairs. In tRNA benchmarking, the precision of RMscore at the phase transition points is higher than that of SARAscore (0.88 and 0.78, respectively). In balance-FSCOR benchmarking, RMalign performs as well as ESA-RNA with a non-normalized score measuring RNA structural similarity. In balance-x-FSCOR benchmarking, RMalign performs much better than SARA, a state-of-the-art RNA 3D structural alignment approach, owing to its size-independent scoring function. Taking advantage of RMalign, we updated our RNA-protein modeling approach PRIME to version 2.0. PRIME 2.0 improves the success rate by about 10% over PRIME.
Based on the size-independent scoring function RMscore, a novel RNA 3D structural alignment approach, RMalign, has been developed and integrated into PRIME 2.0, which could be useful to the biological community for modeling protein-RNA interactions.
Keywords: RNA structural alignment, RMalign, RMscore, Protein-RNA interaction
Abbreviations
AUC: Area Under Curve
GSA: Gapless structural alignment
SSA: RNA secondary structural alignment
RNA plays important roles in many biological processes, such as gene regulation, subcellular localization and splicing. High-throughput global mapping of RNA duplexes at near base-pair resolution reveals that RNA interacts with RNA and RNA-binding proteins through higher-order architectures in living cells. Although the binding sites and binding regions of most of these interactions have been determined, the atomic details of the interactions are still missing, and these details are key to understanding the molecular mechanisms underlying RNA-RNA or RNA-protein recognition. With the increasing number of RNA and RNA-protein 3D structures deposited in the PDB, it is important to develop better bioinformatics tools for comparing RNA structures, which could provide a way to build atomic RNA-RNA or RNA-protein interaction models by inferring RNA structural homologs with low sequence similarity. Several RNA structure comparison approaches have been developed that apply different scoring strategies with a traditional sequence alignment algorithm [4, 5, 6, 7]. In these approaches, RNA 3D structures are represented with a structural alphabet (SA) [5, 6, 7] or dihedral angles, and a dynamic programming (DP) algorithm is then used to align the RNA sequences with a substitution scoring matrix. In addition, STAR3D employs a substitution scoring function that includes RMSD, aligned stack regions and the distance. SETTER is a secondary structure-based tertiary structure comparison algorithm that employs non-overlapping generalized secondary structure units (GSSUs) [9, 10, 11]. Among the other state-of-the-art alignment approaches, SARA applies a statistical scoring function to measure the similarity of RNA 3D structures [12, 13], and ESA-RNA uses a geodesic distance that integrates RNA sequence with 3D structure information to measure RNA similarity [14, 15]. Like ESA-RNA, which uses a geometric concept, R3D Align and FR3D employ geometric discrepancy to measure RNA similarity [16, 17].
CLICK is a topology-independent tool for comparing 3D structures that does not use a scoring function to measure structural similarity [18, 19]. Similar to SARA-Coffee, which is coupled with sequence alignments, SupeRNAlign iteratively superimposes RNA fragment structures with R3D and maximizes the local fit. The SupeRNAlign authors found that R3D scored best among the tools in their benchmark, which did not include ESA-RNA. Based on SARA, our team proposed the template-based approach PRIME to build RNA-protein complex 3D structure models; it shows a higher success rate than computational docking software. However, the scoring function of the RNA alignment algorithm SARA is size-dependent, which limits its ability to detect potential templates in some cases.
In this manuscript, we introduce an RNA structural alignment approach based on RMscore, a size-independent scoring function for measuring RNA structural similarity. First, we reveal a linear relationship between the logarithm of the RNA length and the logarithm of the radius of gyration (Rg) of RNA. In addition, the aligned correlation coefficient (ACC), which describes the relationship between RMSD and Rg, has a complex functional relationship with the RNA length. Combining these relationships, we derive a scoring function, RMscore, that depends only slightly on length (RMscore decreases only slightly as the length increases). RMscore is then applied to two randomly selected independent datasets to optimize its parameters and to determine the transition point between similar and dissimilar pairs. With the transition point, RMscore performs better than SARAscore [5, 13, 22] in selecting similar tRNA pairs. Based on RMscore, we then develop an RNA structural alignment method, RMalign. In RNA function classification, RMalign performs as well as ESA-RNA on balance-FSCOR. However, RNAs that share structural similarity may have different functions, so we also benchmark RMalign in structural classification, where it performs much better than SARA on balance-x-FSCOR. Finally, PRIME is updated to PRIME 2.0 by replacing SARA with RMalign. PRIME 2.0 improves the success rate by about 10% over the previous version when tested on the protein-RNA docking benchmark.
We download the coordinates of all structures containing at least one RNA chain from the PDB website, which yields 2557 RNA structures. Based on these structures, various datasets are constructed for different purposes. PDB-3775 is constructed to explore the relationship between the Rg of RNA and its length. The fragment-pairs dataset is built to study the relationship between the ACC and RMSD in RNA. The results from PDB-3775 and fragment-pairs are combined to estimate the expression of RMscore. Because calculating all-to-all alignments of the 3775 RNAs is time-consuming, we randomly select two non-overlapping RNA-RNA pair datasets (random pairs-0.3 M and random pairs-0.1 M), which are used to optimize the compensation and to determine the transition point, respectively. Because modeRNA was benchmarked on tRNA, tRNA-pairs are also constructed for benchmarking RMscore. Balance-FSCOR and balance-x-FSCOR are established to benchmark RMalign in functional and structural classification. The unbound protein-RNA docking set is employed to compare the performance of PRIME 2.0 and PRIME.
The 2557 RNA structures and their complexes from the PDB are separated into chains. A total of 3775 RNA chains are kept, excluding the RNA structures in mmCIF format. PDB-3775 represents all RNA structures in the PDB. The relationship between the Rg of RNA and the RNA length is explored in this dataset.
The ACC, which describes the relationship between RMSD and Rg, has been reported for proteins. To study the relationship between the ACC and RMSD in RNA, we generate a fragment-pair dataset based on PDB-3775. Only one fragment is randomly chosen from each RNA chain in PDB-3775, and all fragments of identical length are then paired. This strategy for generating structure fragments was previously used in the protein field.
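The fragment-pair construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the text does not say how the fragment length or start position is drawn, so the uniform sampling, the `min_len` parameter, and the function name `make_fragment_pairs` are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_fragment_pairs(chains, rng, min_len=4):
    """Pick one random contiguous fragment per chain, then pair equal-length fragments.

    chains: dict mapping a chain id to a sequence of residues (any objects).
    rng: a random.Random instance, so that sampling is reproducible.
    min_len: hypothetical lower bound on fragment length (not given in the text).
    Returns a list of ((id_a, start_a, frag_a), (id_b, start_b, frag_b)) tuples.
    """
    by_length = defaultdict(list)
    for chain_id, residues in chains.items():
        n = len(residues)
        if n < min_len:
            continue  # chain too short to yield a fragment
        length = rng.randint(min_len, n)    # assumption: uniform fragment length
        start = rng.randint(0, n - length)  # assumption: uniform start position
        by_length[length].append((chain_id, start, residues[start:start + length]))

    pairs = []
    for fragments in by_length.values():
        # Only fragments of identical length are paired with each other.
        pairs.extend(combinations(fragments, 2))
    return pairs
```

Because each chain contributes exactly one fragment, no pair ever contains two fragments from the same chain.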
Random pairs-0.3 M
We randomly select 0.3 million RNA pairs from the all-to-all pairs of RNA chains in PDB-3775 to optimize the parameters in RMscore. The alignment of each RNA pair is generated by needle. This dataset is named random pairs-0.3 M.
Random pairs-0.1 M
We randomly choose 0.1 million RNA pairs from the all-to-all pairs of RNA chains in PDB-3775 to determine the phase transition. Each RNA pair is aligned by SARA [5, 13], an RNA structural alignment protocol based on the unit-vector root-mean-square approach. This dataset is named random pairs-0.1 M.
We download all tRNA structures from NDB (http://ndbserver.rutgers.edu/). One RNA chain is extracted from each structure or complex, which yields 175 RNA chains. tRNA pairs are then constructed through all-to-all pairwise alignment by SARA.
FSCOR [5, 13], which was constructed from SCOR to benchmark RNA structural alignment methods, is downloaded from http://structure.biofold.org/sara/datasets.html. Positive pairs are generated from the RNAs with the same function in FSCOR. Negative pairs are generated by randomly selecting RNAs with different functions. The number of negative pairs is equal to the number of positive pairs. This dataset, which includes both the negative and the positive pairs, is named balance-FSCOR.
Structural similarity is used as the evaluation criterion for protein structural alignment protocols, whereas RNA function is the metric used in benchmarking on balance-FSCOR. We therefore construct balance-x-FSCOR to benchmark RNA structural alignment approaches using the RNA structural similarity score RMscore as the metric. First, FSCOR is clustered by RMalign with different RMscore cut-offs (x = 0.4, 0.45, 0.5 … 1.0) to construct x-FSCOR. Then 1000 positive and 1000 negative pairs are randomly selected from the all-to-all pairs of x-FSCOR. These datasets are named balance-x-FSCOR. If the number of positive or negative pairs is less than 1000, the dataset contains fewer pairs. The structural classes of the 419 RNA chains in FSCOR at the various cut-offs can be downloaded from www.rnabinding.com/RMalign/RMalign.html. Various cut-offs are tried because it is still unknown which value is appropriate for clustering RNA structures.
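A balanced set of this kind can be sketched as below. The function name `balanced_pairs` and the input format (a mapping from RNA id to a functional class label) are assumptions made for illustration; the FSCOR files themselves are not parsed here. When fewer negative pairs exist than positive ones, the sketch simply keeps all available negatives, which is also an assumption.

```python
import random
from itertools import combinations


def balanced_pairs(functions, rng):
    """Build positive and negative RNA pairs from functional class labels.

    functions: dict mapping an RNA id to its functional class label.
    rng: a random.Random instance used to sample the negative pairs.
    Returns (positives, negatives); positives are all same-function pairs,
    negatives are a random sample of different-function pairs of equal size.
    """
    positives, candidates = [], []
    for a, b in combinations(sorted(functions), 2):
        if functions[a] == functions[b]:
            positives.append((a, b))
        else:
            candidates.append((a, b))
    # Match the number of negatives to the number of positives where possible.
    k = min(len(positives), len(candidates))
    negatives = rng.sample(candidates, k)
    return positives, negatives
```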
Unbound protein-RNA docking set
The unbound set is used to compare the performance of PRIME and PRIME 2.0 in predicting protein-RNA complex structures. This set includes 49 protein-RNA structures from the protein-RNA docking benchmark.
Relationship between the Rg of RNA and the RNA length
The Rg of a protein is an important metric for describing its compactness. Previous studies of proteins [30, 31] reveal a scaling law Rg ∝ N^0.4, where N is the number of residues in the protein. Adopting a similar strategy, we investigate the relationship between the Rg of an RNA and its length. For simplicity, all RNA structures are represented by their C3’ atoms. The average log Rg of the RNAs located in the same length bin is calculated. After calculating the Rg of all RNAs in PDB-3775, we observe a scaling law Rg ∝ N^0.39 for RNA.
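The scaling exponent is the slope of a straight-line fit of log Rg against log N. The sketch below shows one way to compute it; it treats every distinct length as its own bin, which is an assumption (the bin width is not stated in the text), and the function names are illustrative only.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of points (e.g. C3' atoms) from their centroid."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(sq / n)


def fit_scaling_exponent(lengths, rgs):
    """Least-squares fit of mean log Rg per length bin against log N.

    Returns (slope, intercept), so that Rg is approximately exp(intercept) * N**slope.
    Assumption: each distinct length forms one bin.
    """
    bins = {}
    for n, rg in zip(lengths, rgs):
        bins.setdefault(n, []).append(math.log(rg))
    if len(bins) < 2:
        raise ValueError("need at least two distinct lengths to fit a slope")
    xs = [math.log(n) for n in sorted(bins)]
    ys = [sum(bins[n]) / len(bins[n]) for n in sorted(bins)]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx
```

Applied to Rg values computed with `radius_of_gyration` for every chain in a dataset, the returned slope plays the role of the exponent in Rg ∝ N^slope.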
Relationship between the ACC and the RNA length
Here, LN is the length of the fragments in the fragment-pairs dataset.
Searching engine of RMscore
Benchmark of RMscore on tRNA pairs
tRNAs are selected to benchmark RMscore because the RNA homology modelling method modeRNA was also benchmarked on a tRNA dataset. To compare the ability of RMscore and SARAscore (normalized SARAscore) to select similar RNA structures, we determine the phase transition points of RMscore and SARAscore in random pairs-0.1 M. All alignments of RNA pairs are generated by SARA. The SARAscore of a target is normalized by dividing it by the SARAscore obtained when the target is aligned with itself. After the phase transition point from dissimilar to similar pairs is determined, RMscore and SARAscore are tested on the tRNA pairs to distinguish similar (RMSD <= 5 Å) from dissimilar (RMSD > 5 Å) pairs. The alignments of the tRNA pairs are also generated by SARA. A possible application of RMscore is to measure the similarity between native RNA structures and RNA models.
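The evaluation described above amounts to thresholding a similarity score and checking the selected pairs against the 5 Å RMSD criterion. A minimal sketch follows; the function names are illustrative, the scores and RMSDs are assumed to have been computed elsewhere, and the threshold is whatever transition point was determined for the score in question.

```python
def normalized_score(score_pair, score_self):
    """Normalize a raw alignment score by the score of the target aligned with itself."""
    return score_pair / score_self


def precision_at_threshold(scores, rmsds, threshold, rmsd_cutoff=5.0):
    """Precision of calling a pair 'similar' when its score reaches the threshold.

    A pair counts as truly similar when its RMSD is <= rmsd_cutoff (in Å).
    Returns None when no pair reaches the threshold.
    """
    selected = [r for s, r in zip(scores, rmsds) if s >= threshold]
    if not selected:
        return None
    true_positives = sum(1 for r in selected if r <= rmsd_cutoff)
    return true_positives / len(selected)
```

Running `precision_at_threshold` once per scoring function, each with its own transition point, gives numbers that can be compared in the way the tRNA benchmark compares RMscore and SARAscore.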
Step 1: Initial structural alignment
In this step, three types of alignment are used to obtain an initial alignment: RNA secondary structural alignment (SSA), gapless structural alignment (GSA), and an alignment combining SSA and GSA.
1.1 RNA SSA. We consider five secondary structural states of RNA, calculated by X3DNA: stem, bulge, internal loop, hairpin loop and other. The RNA sequence can therefore be represented by a string over 5 characters. The DP algorithm is then used to align the RNAs. Aligned nts with identical/different secondary structural states are scored 1/0. The gap-opening penalty is set to −1.
1.2 RNA GSA. The second initial structural alignment is GSA. In TMalign, the TM-score is used as the comparison metric; in RMalign, we employ RMscore to select the best alignment.
1.3 Alignment combining SSA and GSA. In the third initial alignment, we combine SSA and GSA using a scoring matrix that is a half/half combination of the secondary structure score matrix (from the first initial alignment) and the distance score matrix (from the second initial alignment).
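The first two initial alignments can be sketched as below. This is an illustration under assumptions, not the RMalign source: the DP uses a simple linear gap cost of −1 per gap position (the text only specifies the gap-opening penalty), the one-letter state codes are arbitrary, only placements where the shorter chain fits entirely inside the longer one are enumerated, and `score_fn` is a stand-in for RMscore, whose formula is not reproduced here.

```python
def align_states(a, b, gap=-1.0):
    """Global DP alignment of two secondary-structure state strings.

    Identical states score 1, different states score 0, and each gap
    position costs `gap` (linear gap cost, an assumption).
    Returns (score, aligned_a, aligned_b) with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + (1.0 if a[i - 1] == b[j - 1] else 0.0)
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diag = 1.0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 0.0
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def gapless_alignments(len_a, len_b):
    """Yield index-pair lists for every gapless placement of the shorter chain."""
    short, long_ = min(len_a, len_b), max(len_a, len_b)
    for offset in range(long_ - short + 1):
        if len_a <= len_b:
            yield [(i, i + offset) for i in range(short)]
        else:
            yield [(i + offset, i) for i in range(short)]


def best_gapless_alignment(a, b, score_fn):
    """Return the gapless placement with the highest score_fn(a, b, pairs)."""
    return max(gapless_alignments(len(a), len(b)), key=lambda pairs: score_fn(a, b, pairs))
```

In an actual structural setting `a` and `b` would be coordinate lists and `score_fn` would superpose the paired nts before scoring; the enumeration itself is unchanged.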
Step 2: Scoring for an alignment
The alignments are obtained from step 1. In this step, each alignment is scored by RMscore. First, the alignment is divided into fragments of length (4, 8, …, LN), where LN is the length of the alignment. The complete alignment is then rotated by the converged rotation matrix, which is obtained by iteratively superposing, with the Kabsch algorithm, the nts of the fragment with a distance of less than 5 Å. Second, the RMscore is calculated with Eq. 3. A new fragment is then obtained by shifting one nt from the 5′ end toward the 3′ end, and the rotation process is repeated until the fragment reaches the 3′ end. All possible fragments are tried. Finally, the rotation matrix with the best RMscore is kept.
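The iterative superposition in this step can be sketched with NumPy as follows. This is a generic Kabsch implementation with a distance-cutoff refinement loop, not the RMalign source code. As an assumption borrowed from TM-score style searches, each iteration re-selects the close pairs from the whole alignment rather than only from the seed fragment; the function names and the `max_iter` limit are also illustrative.

```python
import numpy as np


def kabsch(P, Q):
    """Return rotation R and translation t minimizing ||(P @ R.T + t) - Q||.

    P and Q are (n, 3) arrays of paired coordinates.
    """
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pc).T @ (Q - Qc)                 # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Qc - R @ Pc
    return R, t


def refine_superposition(P, Q, seed_idx, cutoff=5.0, max_iter=20):
    """Iterate Kabsch on the pairs closer than `cutoff` until the set is stable.

    seed_idx: indices of the aligned pairs in the starting fragment.
    Returns the rotation and translation from the last superposition.
    """
    idx = np.asarray(seed_idx)
    for _ in range(max_iter):
        R, t = kabsch(P[idx], Q[idx])
        dist = np.linalg.norm(P @ R.T + t - Q, axis=1)
        new_idx = np.flatnonzero(dist < cutoff)
        if len(new_idx) < 3 or np.array_equal(new_idx, idx):
            break                              # too few pairs, or converged
        idx = new_idx
    return R, t
```

Calling `refine_superposition` once per seed fragment and keeping the transform with the best score mirrors the search described above.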
Step 3: Update the alignment
The RMscore rotation matrix is obtained from step 2. In this step, the aligned RNA structure is rotated by the RMscore rotation matrix, and a scoring similarity matrix S(i, j) is then calculated according to Eq. 6. The new alignment is obtained by the DP algorithm with the scoring similarity matrix and a gap-opening penalty of −0.6. If the new alignment differs from the previous alignment, the process returns to step 2; otherwise, the process goes to step 4.
Step 4: Final scoring and output
In this step, the alignment is scored using all the aligned nts, and the result is output.
Benchmark of RMalign on balance-FSCOR and balance-x-FSCOR
To test the performance of RMalign in RNA function classification and to compare it with ESA-RNA, balance-FSCOR is constructed from FSCOR. However, the same structure may have different functions, and the purpose of an RNA structural alignment approach is to detect structural similarity. We therefore also benchmark RMalign on balance-x-FSCOR. The AUC value is used as the metric to measure performance.
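For reference, the AUC of a similarity score over labelled pairs equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it directly from that definition; the function name is illustrative and nothing here is specific to RMalign.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons of positives and negatives.

    scores: similarity score of each pair (higher means more similar).
    labels: truthy for positive pairs, falsy for negative pairs.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both positive and negative pairs are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of pairs, which is acceptable for balanced sets of a few thousand pairs; a rank-based formulation would be preferable for much larger sets.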
Predicting protein-RNA 3D structure
We previously developed PRIME, an approach to predict protein-RNA 3D structures. PRIME was tested on an unbound protein-RNA docking benchmark, and the result shows that it performs better than 3dRPC. We update PRIME to v2.0 because RMalign performs better than SARA on balance-x-FSCOR. An approach similar to that of PRIME is adopted to build the protein-RNA complex structure model. The transformation matrices of TM-score and RMscore are applied to superimpose the target protein and RNA onto the templates. The ligand RMSD of the RNA C3’ atoms between the model and the native structure is calculated, and the quality of the model is measured by this ligand RMSD. A prediction is defined as “acceptable” if the ligand RMSD is <= 10 Å.
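The acceptance criterion can be sketched as follows. The code assumes that the model and native RNA C3’ coordinates are already expressed in a common frame and listed in the same order; how that frame is obtained is outside this sketch, and the function names are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    sq = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(sq / len(coords_a))


def success_rate(ligand_rmsds, cutoff=10.0):
    """Fraction of predictions whose ligand RMSD is <= cutoff (in Å)."""
    acceptable = sum(1 for r in ligand_rmsds if r <= cutoff)
    return acceptable / len(ligand_rmsds)
```

Computing `success_rate` over the same targets for two methods gives the kind of success-rate comparison reported for PRIME and PRIME 2.0.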
Principle and benchmark of RMscore
Benchmark of RMalign and comparison with other approaches
Predicting protein-RNA 3D structure with PRIME 2.0 and comparison with PRIME
Here we introduce an RNA structural alignment approach, RMalign, which uses RMscore as the similarity score. The definition of RMscore is derived from TM-score, which has been applied successfully in protein structural alignment. However, RMscore shows a slight dependence on RNA length, which may be caused by the flexible structure of RNA. It is hard to benchmark RMscore in the way TM-score was benchmarked because RNA research lags behind protein research. For example, the best way to benchmark RMscore would be to compare the similarity between RNA models and native structures in RNA structure modelling. However, no related studies of size-independent scoring functions have been reported. Moreover, the RNA homology modelling method modeRNA employs RMSD or LG-score, which was introduced as an auxiliary metric, to measure RNA structural similarity without any modification. Given this situation, we study the relationship between RMscore and RMSD in RNA. The result shows that RMscore = 0.5 can discriminate similar from dissimilar structures. In the tRNA-pair benchmark, RMscore increases as RMSD decreases, similar to the relationship between protein sequence identity and structural similarity.
In this study, we develop a novel RNA 3D structure alignment approach, RMalign, which is based on the size-independent scoring function RMscore. We systematically analyzed the relationship of RNA sequence and structure to the binding mode, and exhaustively benchmarked the predictive modeling. The results show that, in pairwise structure comparison of 172 tRNA structures, RMalign significantly outperforms SARA. With SARA replaced by RMalign, the success rate of PRIME (v2.0) improves by 10%. In the RNA function prediction benchmark, RMalign also shows a very high precision, with an AUC of 0.95, which is as good as ESA-RNA. The study provides a foundation for size-independent RNA structural alignment that is applicable to protein-RNA complex structure modeling and to RNA function and fold classification. On the basis of these results, we designed and implemented an RNA alignment tool, which should be useful for the biological community interested in RNA structural studies.
We thank the National Supercomputer Center in Guangzhou for support of computing resources.
This work has been supported by the National Natural Science Foundation of China; the National High Technology Research and Development Program of China [2012AA020402]; the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase) under Grant No. U1501501; and the Fundamental Research Funds for the Central Universities [2016YXMS017]. The funding body had no role in the design of the study, the collection, analysis, or interpretation of data, or the writing of the manuscript.
JZ wrote the software and prepared the data; SL, JX and XH performed the analysis; JZ drafted the manuscript. All authors read, revised and approved the final manuscript.
The authors declare that they have no competing interests.
21. Piatkowski P, Jablonska J, Zyla A, Niedzialek D, Matelska D, Jankowska E, Walen T, Dawson WK, Bujnicki JM. SupeRNAlign: a new tool for flexible superposition of homologous RNA structures and inference of accurate structure-based sequence alignments. Nucleic Acids Res. 2017;45(16):e150.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.