Abstract
Pseudouridine is one of the most prevalent post-transcriptional RNA modifications. Identifying pseudouridine sites is an essential step toward understanding RNA function, structural stabilization, translation, and RNA stability; however, high-throughput experimental techniques remain expensive and time-consuming, both in laboratory work and in the biochemical processing they require. Developing an efficient machine-learning method for identifying pseudouridine sites is therefore important for both academic research and drug development. Motivated by this, we present an effective layered ensemble model, designated iPseU-Layer, for the identification of RNA pseudouridine sites. The proposed iPseU-Layer approach is built on three machine learning layers: a feature selection layer, a feature extraction and fusion layer, and a prediction layer. The feature selection layer reduces the dimensionality of the input and can be regarded as a data pre-processing stage. The feature extraction and fusion layer uses an ensemble of several machine learning algorithms, whose outputs serve as the input to the next layer. The prediction layer applies a classic random forest to produce the final result. Furthermore, we systematically validate the model using cross-validation tests and an independent test, comparing it with current state-of-the-art models. The proposed iPseU-Layer achieves promising predictive performance in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient. Collectively, these findings indicate that the iPseU-Layer framework is a feasible and effective strategy for the prediction of RNA pseudouridine sites.
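The abstract names four evaluation metrics without defining them. As a minimal sketch, the standard definitions can be computed from confusion-matrix counts as shown below. The function name `binary_metrics` and the example counts are illustrative assumptions, not taken from the paper, and the zero-denominator conventions are one common choice rather than the authors' stated one.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: true pseudouridine sites predicted as positive/negative.
    tn/fp: non-pseudouridine sites predicted as negative/positive.
    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    total = tp + tn + fp + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    acc = (tp + tn) / total if total else 0.0   # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews corr. coef.
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Made-up counts for illustration only (not results from the paper).
print(binary_metrics(tp=40, tn=35, fp=15, fn=10))
```

Unlike accuracy, the Matthews correlation coefficient ranges from -1 to 1 and uses all four counts, which is why it is often reported alongside sensitivity and specificity.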
Acknowledgements
This work was supported by the Research Foundation for Advanced Talents (Nos. 2019BS007, 31401204) of Henan University of Technology and by the National Natural Science Foundation of China (Grant Nos. 61673082, 61773352).
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Cite this article
Mu, Y., Zhang, R., Wang, L. et al. iPseU-Layer: Identifying RNA Pseudouridine Sites Using Layered Ensemble Model. Interdiscip Sci Comput Life Sci 12, 193–203 (2020). https://doi.org/10.1007/s12539-020-00362-y
DOI: https://doi.org/10.1007/s12539-020-00362-y