XG-PseU: an eXtreme Gradient Boosting based method for identifying pseudouridine sites


As one of the most prevalent post-transcriptional modifications, pseudouridine (Ψ) participates in a range of biological processes. Efficient detection of pseudouridine sites is therefore important for revealing its functions. Although experimental techniques can identify Ψ sites at single-base resolution, they remain labor intensive and expensive. To complement these experiments, computational methods have recently been proposed for identifying Ψ sites, but their performance is still unsatisfactory. In this paper, we propose an eXtreme Gradient Boosting (xgboost)-based method, called XG-PseU, which identifies Ψ sites from an optimal feature set obtained by combining forward feature selection with incremental feature selection. Our results demonstrate that XG-PseU is superior, or at least complementary, to existing methods for identifying pseudouridine sites. Finally, a freely available online web server for XG-PseU was established. We hope that XG-PseU will become a useful tool for computationally identifying Ψ sites.
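The feature selection idea named in the abstract can be illustrated with a minimal sketch. It assumes that forward selection is used to order the features and that incremental feature selection then keeps the best-scoring prefix of that order; the abstract does not spell out how the two steps are combined, so this pairing is an assumption. The function names and `toy_score` are hypothetical; `toy_score` stands in for the cross-validated accuracy of a classifier trained on the chosen feature columns.

```python
from typing import Callable, List, Sequence, Tuple

# A scoring function maps a list of feature indices to a validation score.
Score = Callable[[Sequence[int]], float]


def forward_ranking(n_features: int, score: Score) -> List[int]:
    """Greedy forward selection: repeatedly add the feature that gives the
    highest score when joined to the features already selected."""
    selected: List[int] = []
    remaining = list(range(n_features))
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def incremental_selection(ranking: Sequence[int], score: Score) -> Tuple[List[int], float]:
    """Incremental feature selection: score the top-1, top-2, ..., top-n
    features of the ranking and keep the best-scoring prefix."""
    best_subset: List[int] = []
    best_score = float("-inf")
    for k in range(1, len(ranking) + 1):
        subset = list(ranking[:k])
        s = score(subset)
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, best_score


def toy_score(subset: Sequence[int]) -> float:
    """Hypothetical stand-in for cross-validated accuracy: features 0, 2 and 3
    help, every other feature slightly hurts."""
    gain = {0: 0.20, 2: 0.10, 3: 0.03}
    return 0.5 + sum(gain.get(f, -0.02) for f in subset)


ranking = forward_ranking(5, toy_score)
subset, best = incremental_selection(ranking, toy_score)
print("ranking:", ranking)
print("selected subset:", subset, "score:", round(best, 2))
```

In a real pipeline the scoring function would train and evaluate a model, for example a gradient-boosted tree classifier such as `xgboost.XGBClassifier`, on the selected columns under cross-validation. The toy score is used here only so that the sketch runs without external dependencies.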






This work was supported by the National Natural Science Foundation of China (31771471, 61772119) and the Natural Science Foundation for Distinguished Young Scholar of Hebei Province (No. C2017209244).

Author information

Correspondence to Wei Chen or Hao Lin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Stefan Hohmann.

Electronic supplementary material


Supplementary material 1 (DOCX 127 kb)


About this article


Cite this article

Liu, K., Chen, W. & Lin, H. XG-PseU: an eXtreme Gradient Boosting based method for identifying pseudouridine sites. Mol Genet Genomics 295, 13–21 (2020).



  • Pseudouridine
  • eXtreme Gradient Boosting
  • Feature selection
  • Web server