RNAm5Cfinder: A Web-server for Predicting RNA 5-methylcytosine (m5C) Sites Based on Random Forest

Li, Jianwei; Huang, Yan; Yang, Xiaoyue; Zhou, Yiran; Zhou, Yuan

doi:10.1038/s41598-018-35502-4

Article
Open access
Published: 23 November 2018

Volume 8, article number 17299, (2018)
Scientific Reports

Jianwei Li^1,2,na1, Yan Huang^1,na1 (ORCID: orcid.org/0000-0001-8063-6110), Xiaoyue Yang^1, Yiran Zhou^2 & Yuan Zhou^2

Abstract

5-methylcytosine (m5C) is a common nucleobase modification, and recent investigations have indicated its prevalence in cellular RNAs including mRNA, tRNA and rRNA. With the rapid accumulation of m5C site data, it has become not only feasible but also important to build an accurate model to predict m5C sites in silico. For this purpose, we developed a web-server named RNAm5Cfinder, based on RNA sequence features and a machine learning method, to predict RNA m5C sites in eight tissue/cell types from mouse and human. We confirmed the accuracy and usefulness of RNAm5Cfinder by independent tests. The results show that the comprehensive predictor and the cell-specific predictors can pinpoint generic and tissue-specific m5C sites with an area under the curve (AUC) of no less than 0.77 and 0.87, respectively. The RNAm5Cfinder web-server is freely available at http://www.rnanut.net/rnam5cfinder.

Introduction

RNA modification plays an important role in all three domains of life^1,2,3. To date, more than 150 kinds of RNA modifications have been discovered, and 5-methylcytosine (m5C) is one of the most prevalent modification types⁴. Thanks to novel applications of high-throughput sequencing techniques for detecting RNA m5C modification (e.g., bisulfite sequencing and aza-IP), a pilot whole-transcriptome map of m5C sites has become available, in which the modification sites mainly appear in the anticodon loop and the variable region of tRNAs and rRNAs, and in the coding sequences of mRNAs^5,6,7,8,9. Similar to other nucleobase modifications in RNA, m5C influences RNA structural stability and translation efficiency, and further research has revealed that it can promote mRNA export and regulate tissue differentiation^10,11. However, the functions of m5C in RNA are still not fully understood, partly because the experimental identification of m5C sites remains expensive and labor-intensive. To help address this, we developed a web-based tool named RNAm5Cfinder to predict m5C sites. It helps researchers screen potential m5C sites easily and quickly, and provides a new tool for exploring the functional implications of m5C.

RNAm5Cfinder is a platform with an easy-to-use web interface for predicting m5C modification sites in RNA sequences. It adopts one-hot encoding for coding RNA sequences and the random forest algorithm, a supervised machine learning method for classification problems. Because m5C is a tissue-specific modification, we built an independent predictor for each tissue/cell type. We then optimized each predictor independently by cross-validation and benchmarked the predictors by independent tests. To the best of our knowledge, RNAm5Cfinder is the first m5C predictor that allows tissue-specific m5C sites to be predicted with competitive precision.

Results and Discussion

Establishment of the predictor and performance benchmarking

The m5C modification data, covering 7 mouse tissues and human HeLa cells, were collected from previous studies^10,12. We first integrated all m5C sites to build a comprehensive (generic) predictor. During training, we used five-fold cross-validation to optimize the ratio of positives to negatives in the training data set and the number of decision trees in the random forest predictor. The results suggest that the optimal parameters are a 1:30 ratio and 300 decision trees. To verify the performance of the predictor, we benchmarked it against other state-of-the-art published web servers for predicting RNA m5C sites on the same independent test set. We found two available online servers for predicting RNA m5C sites: iRNA-PseColl, developed by Feng et al., and M5C-HPCR, developed by Zhang et al.^13,14. Both can predict m5C sites in RNA sequences, but neither permits tissue-specific prediction. Therefore, we compared the performance of our comprehensive predictor with iRNA-PseColl and M5C-HPCR. Note that the thresholds of these servers are fixed, so the performance of each corresponds to a single point on the ROC (receiver operating characteristic) curve (Fig. 1). As for the strategy for coding RNA sequences, RNAm5Cfinder adopts one-hot encoding. We re-trained our predictor with Feng's coding strategy and found that the performance was slightly reduced (Fig. 2), indicating that one-hot encoding is at least comparable to the current state-of-the-art method for RNA m5C site prediction. Another reason for choosing one-hot encoding is that it is time-saving and therefore gives users a better experience than other strategies.

The performance of the tissue-specific predictors

Because the modification spectrum differs between cell types and tissues, one comprehensive predictor cannot accurately predict the m5C sites of each specific tissue or cell type. We therefore built tissue-specific training and independent test sets, in which the RNA m5C modification data came from experiments on a single tissue or cell type, to test and benchmark the tissue-specific m5C predictors (Table 1). To verify the robustness of the constructed tissue-specific predictors, we performed both intra- and inter-tissue independent tests for each of them. From each independent test set, we removed the samples that were used to train the predictors for the other tissues. In other words, only tissue-specific sites were considered in the intra- and inter-tissue independent tests. The results are summarized in Fig. 3. Clearly, the intra-tissue prediction performances, which are all above 0.87 in terms of AUC, are substantially better than the inter-tissue prediction performances. This is consistent with previous studies, in which m5C was implied to be a tissue-specific modification¹⁰. This result also supports the need to build tissue-specific m5C predictors.
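The filtering step described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the original: sites are represented as hypothetical `(transcript_id, position)` tuples, and the function name and the dictionary layout are invented for the example.

```python
def tissue_specific_test_set(test_sets, training_sets, tissue):
    """Keep only the test sites of `tissue` that were not used to train
    the predictor of any other tissue.

    test_sets / training_sets: dict mapping tissue name -> iterable of
    (transcript_id, position) tuples (a hypothetical representation).
    """
    used_elsewhere = set()
    for other, sites in training_sets.items():
        if other != tissue:
            used_elsewhere.update(sites)
    return [site for site in test_sets[tissue] if site not in used_elsewhere]


# Toy data: the site ("T1", 10) is shared by both tissues.
training_sets = {
    "liver": [("T1", 10), ("T2", 55)],
    "brain": [("T3", 7)],
}
test_sets = {
    "liver": [("T4", 3)],
    "brain": [("T1", 10), ("T5", 42)],
}
brain_specific = tissue_specific_test_set(test_sets, training_sets, "brain")
```

Here the brain test set loses `("T1", 10)` because that site was used to train the liver predictor, so only brain-specific sites remain for the intra- and inter-tissue comparison.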

Table 1 AUC of independent test of different predictors.

The construction of RNAm5Cfinder web-server

To serve the community, we built a web-server named RNAm5Cfinder with the optimized comprehensive and tissue-specific predictors described above. RNAm5Cfinder has a user-friendly interface and a step-by-step guide. It takes FASTA sequences as input and provides an option to switch between the comprehensive predictor and the tissue-specific predictors. We also provide 3 levels of threshold stringency, corresponding to false positive rates of 1%, 5% and 10%. Because analyzing a large dataset can take a long time, RNAm5Cfinder can also send the results to an e-mail address supplied by the user.
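One common way to derive score cutoffs for fixed false positive rates is to read them off the ranked scores of known negative samples. The sketch below illustrates that idea only; the source does not describe how the server's thresholds were computed, and the function names and the toy score list are assumptions made for the example.

```python
import math


def threshold_for_fpr(negative_scores, target_fpr):
    """Return a cutoff t such that calling a site positive when
    score > t yields a false positive rate of at most target_fpr
    on the supplied negative scores."""
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives allowed to score above the cutoff
    # (small epsilon guards against floating-point round-off).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def empirical_fpr(negative_scores, threshold):
    """Fraction of negatives that would be called positive."""
    false_positives = sum(1 for s in negative_scores if s > threshold)
    return false_positives / len(negative_scores)


# Toy negative scores 0.00, 0.01, ..., 0.99 (illustrative, not real data).
neg_scores = [i / 100 for i in range(100)]
thresholds = {fpr: threshold_for_fpr(neg_scores, fpr) for fpr in (0.01, 0.05, 0.10)}
```

A stricter false positive rate gives a higher cutoff, so the 1% setting reports fewer, more confident sites than the 10% setting.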

Methods

Datasets

We gathered three available m5C datasets from the GEO database: GSE90963 (human HeLa cells), GSE93749 (human HeLa cells; heart, muscle, brain, kidney and liver of mouse) and GSE83432 (mouse ESC and brain). The m5C sites from the three datasets were first mapped to Ensembl transcripts (queried in February 2018; the genome versions are GRCh37 for human and GRCm38 for mouse)¹⁵. For genes with multiple transcripts, we picked the mRNA transcript with the most modification sites to ensure the quality and reliability of the data. One quarter of the m5C sites were randomly selected as the independent test set, and the rest were used to train the predictors. The negative samples were randomly selected from the non-modified C sites in the transcripts. Since the ratio of positive to negative training samples can affect the precision of the prediction model, we tested 3 ratios (1:10, 1:30 and 1:50) and chose the best one (1:30) based on cross-validation. To reflect real-world data, all of the non-modified C sites were used as negative samples in the independent test sets (Table 2).
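The hold-out split and the negative sampling described above can be sketched as follows. This is a minimal illustration for a single transcript, not the authors' pipeline: the function names, the fixed random seed and the toy transcript are all assumptions.

```python
import random


def split_positives(positive_sites, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the positive sites as the test set."""
    rng = random.Random(seed)
    shuffled = list(positive_sites)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def sample_negatives(transcript, positive_positions, n_positives, ratio=30, seed=0):
    """Randomly pick non-modified C positions as training negatives,
    aiming for `ratio` negatives per positive (1:30 by default)."""
    positives = set(positive_positions)
    candidates = [i for i, base in enumerate(transcript)
                  if base == "C" and i not in positives]
    rng = random.Random(seed)
    k = min(len(candidates), n_positives * ratio)
    return rng.sample(candidates, k)


# Toy transcript with 200 C positions, four of which are "modified".
transcript = "ACGC" * 100
m5c_positions = [1, 5, 9, 13]
train_pos, test_pos = split_positives(m5c_positions)
train_neg = sample_negatives(transcript, m5c_positions, len(train_pos))
```

For an independent test set, the sampling step would be skipped and every non-modified C would serve as a negative, as stated above.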

Table 2 The sample size of different tissues’ training and test datasets.

Sequence encoding

To train the machine learning model, the RNA sequence flanking each modified or non-modified site must be translated into a numeric feature encoding. In this study, two encoding strategies were tested and compared: one-hot encoding¹⁶ and Feng's encoding¹⁴. One-hot encoding uses n bits of 0 or 1 to represent n kinds of nucleotide state. For each position, A, G, C and T are translated into the vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1), respectively. Feng's encoding also uses four values to represent a specific nucleotide. Unlike one-hot encoding, however, the first three bits in Feng's encoding represent three physicochemical characteristics (the ring number, the chemical functionality and the number of hydrogen bonds), and the fourth value represents the accumulated occurrence frequency of the nucleotide in the sequence. Therefore, A, G, C and T are translated into the vectors (1, 1, 1, FreqA), (1, 0, 0, FreqG), (0, 1, 0, FreqC) and (0, 0, 1, FreqT), respectively. The flanking sequence window size for both the one-hot and Feng's encodings is 10, which was optimized by five-fold cross-validation. Based on performance and complexity, we finally chose the one-hot encoding strategy.
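The two encodings can be sketched in Python as below. Several details are assumptions rather than statements from the source: the window is taken to be 10 nucleotides on each side of the candidate C, positions past the transcript ends are padded with a hypothetical `N` symbol encoded as zeros, U is mapped to T, and the "accumulated occurrence frequency" is interpreted as the count of that nucleotide up to and including the current position divided by the position index.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "G": (0, 1, 0, 0),
           "C": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
# First three bits of Feng's encoding, as listed in the text.
FENG_BITS = {"A": (1, 1, 1), "G": (1, 0, 0),
             "C": (0, 1, 0), "T": (0, 0, 1)}


def extract_window(seq, site, flank=10, pad="N"):
    """Return the candidate site plus `flank` bases on each side,
    padding with `pad` where the window runs past the sequence ends."""
    seq = seq.upper().replace("U", "T")
    left = seq[max(0, site - flank):site]
    right = seq[site + 1:site + 1 + flank]
    return (pad * (flank - len(left)) + left + seq[site]
            + right + pad * (flank - len(right)))


def one_hot_encode(window):
    """Four 0/1 features per position; unknown symbols become all zeros."""
    features = []
    for base in window:
        features.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return features


def feng_encode(window):
    """Three property bits plus the running frequency of the base."""
    features = []
    counts = {}
    for index, base in enumerate(window, start=1):
        counts[base] = counts.get(base, 0) + 1
        if base in FENG_BITS:
            features.extend(FENG_BITS[base] + (counts[base] / index,))
        else:
            features.extend((0, 0, 0, 0.0))
    return features


window = extract_window("ACGTACGTACGTACGTACGTACG", 9, flank=2)  # "TACGT"
one_hot_features = one_hot_encode(window)
feng_features = feng_encode(window)
```

Under these assumptions a full window of 21 positions yields 84 features with either encoding; one-hot needs only a table lookup per position, whereas Feng's encoding also has to track running counts.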

Machine learning algorithm

We tested four machine learning methods on the integrated RNA m5C sites: logistic regression, naïve Bayes, decision tree (with parameters minsplit = 35, cp = 0.00001 and maxdepth = 30) and random forest (RF). The performance of each algorithm is shown in Table 3. Considering both efficiency and accuracy, we chose RF as our preferred algorithm. RF is a robust machine learning framework that has been widely used in the medical and biological informatics fields¹⁷. An RF consists of a large ensemble of classification and regression trees (CARTs). The number of CARTs is defined as n_tree, which was also optimized by cross-validation. The random forest algorithm was implemented using the ‘randomForest’ package in R¹⁸.

Table 3 Performance of different machine learning algorithm.

Performance evaluation

In this study we used the ROC (receiver operating characteristic) curve, which is less affected by unbalanced test data sets, to evaluate the performance of the predictors. The ROC curve reflects the overall relationship between sensitivity and specificity when different thresholds are applied. Sensitivity and specificity are defined as

$$\text{Sensitivity}=\frac{TP}{TP+FN} \tag{1}$$

$$\text{Specificity}=\frac{TN}{TN+FP} \tag{2}$$

where TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative samples, respectively. The larger the area under the curve (AUC), the higher the prediction performance. We benchmarked our predictors on the independent test sets. We also compared the comprehensive predictor of RNAm5Cfinder with iRNA-PseColl and M5C-HPCR on the same independent test set. The binary (yes or no) prediction results of iRNA-PseColl and M5C-HPCR were obtained by submitting the RNA sequences to their servers.
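As a worked illustration of Eqs. (1) and (2) and of the AUC, the sketch below computes all three in plain Python. It is not the evaluation code used in the study: the function names and the toy labels and scores are assumptions, and the AUC is computed with the pairwise rank formulation (the fraction of positive/negative pairs in which the positive scores higher, with ties counted as one half), which equals the area under the ROC curve.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Eqs. (1) and (2) for the rule 'predict positive if score >= threshold'."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of
    positive and negative scores (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 modified and 3 non-modified sites (illustrative only).
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.1, 0.8]
sens, spec = sensitivity_specificity(labels, scores, threshold=0.45)
auc = roc_auc(labels, scores)
```

A server with a fixed threshold, such as the two compared here, yields exactly one (sensitivity, specificity) pair and therefore a single point on the ROC plot rather than a full curve. The pairwise loop is quadratic in the number of samples, so a rank-based implementation would be preferable for test sets that include every non-modified C.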

Construction of web-server platform

The user interface and message response mechanisms are based on JavaScript and Ajax. The data processing module is written in PHP5 and converts the input sequences into the numeric sequence encoding used for the subsequent random forest prediction.

Conclusions

From the above analyses, we conclude that RNAm5Cfinder is an efficient tool for predicting m5C sites. Compared with other predictors, RNAm5Cfinder has two advantages: (1) a larger and more up-to-date dataset, which together with the random forest machine learning framework results in better performance; and (2) the ability to predict tissue-specific m5C sites. We believe that RNAm5Cfinder has great potential and that, as more m5C site data become available, its performance could be further improved.

References

1. Liu, N. & Pan, T. RNA epigenetics. Transl Res 165, 28–35 (2015).
2. Marbaniang, C. N. & Vogel, J. Emerging roles of RNA modifications in bacteria. Curr Opin Microbiol 30, 50–57 (2016).
3. Omer, A. D., Ziesche, S., Decatur, W. A., Fournier, M. J. & Dennis, P. P. RNA-modifying machines in archaea. Mol Microbiol 48, 617–29 (2003).
4. Boccaletto, P. et al. MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res 46, D303–D307 (2018).
5. Squires, J. E. et al. Widespread occurrence of 5-methylcytosine in human coding and non-coding RNA. Nucleic Acids Res 40, 5023–33 (2012).
6. Hussain, S., Aleksic, J., Blanco, S., Dietmann, S. & Frye, M. Characterizing 5-methylcytosine in the mammalian epitranscriptome. Genome Biol 14, 215 (2013).
7. Schaefer, M., Pollex, T., Hanna, K. & Lyko, F. RNA cytosine methylation analysis by bisulfite sequencing. Nucleic Acids Res 37, e12 (2009).
8. Khoddami, V. & Cairns, B. R. Identification of direct targets and modified bases of RNA cytosine methyltransferases. Nat Biotechnol 31, 458–64 (2013).
9. Hussain, S. et al. NSun2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs. Cell Rep 4, 255–61 (2013).
10. Yang, X. et al. 5-methylcytosine promotes mRNA export - NSUN2 as the methyltransferase and ALYREF as an m(5)C reader. Cell Res 27, 606–625 (2017).
11. Blanco, S. et al. The RNA-methyltransferase Misu (NSun2) poises epidermal stem cells to differentiate. Plos Genet 7, e1002403 (2011).
12. Amort, T. et al. Distinct 5-methylcytosine profiles in poly(A) RNA from mouse embryonic stem cells and brain. Genome Biol 18, 1 (2017).
13. Zhang, M. et al. Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble. Anal Biochem 550, 41–48 (2018).
14. Feng, P. et al. iRNA-PseColl: Identifying the Occurrence Sites of Different RNA Modifications by Incorporating Collective Effects of Nucleotides into PseKNC. Mol Ther Nucleic Acids 7, 155–163 (2017).
15. Kersey, P. J. et al. Ensembl Genomes 2018: an integrated omics infrastructure for non-vertebrate species. Nucleic Acids Res 46, D802–D808 (2018).
16. Zhou, Y., Zeng, P., Li, Y. H., Zhang, Z. & Cui, Q. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res 44, e91 (2016).
17. Chen, X. & Ishwaran, H. Random forests for genomic data analysis. Genomics 99, 323–9 (2012).
18. Liaw, A. & Wiener, M. Classification and Regression by randomForest. R News 2, 18–22 (2002).

Acknowledgements

We appreciate the researchers who shared their m5C site data. This study was supported by National Natural Science Foundation of China [81672113 to J.L.], Natural Science Foundation of Hebei Province [C2018202083 to J.L.] and the Fundamental Research Funds for the Central Universities [BMU2017YJ004 to Y.Z.].

Author information

Jianwei Li and Yan Huang contributed equally.

Authors and Affiliations

Institute of Computational Medicine, School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
Jianwei Li, Yan Huang & Xiaoyue Yang
Department of Biomedical Informatics, School of Basic Medical Sciences, Center for Noncoding RNA Medicine, Peking University, Beijing, China
Jianwei Li, Yiran Zhou & Yuan Zhou

Contributions

Yuan Zhou designed the study. Jianwei Li and Yan Huang performed the analysis. Xiaoyue Yang and Yiran Zhou assisted with data collection and web-server construction. Yan Huang drafted the manuscript. Jianwei Li and Yuan Zhou revised the manuscript.

Corresponding author

Correspondence to Yuan Zhou.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Li, J., Huang, Y., Yang, X. et al. RNAm5Cfinder: A Web-server for Predicting RNA 5-methylcytosine (m5C) Sites Based on Random Forest. Sci Rep 8, 17299 (2018). https://doi.org/10.1038/s41598-018-35502-4

Received: 06 July 2018
Accepted: 07 November 2018
Published: 23 November 2018
DOI: https://doi.org/10.1038/s41598-018-35502-4
Springer Nature Limited
