Abstract
Protein–DNA and protein–RNA interactions are involved in many biological processes, regulate many cellular functions, and are associated with many human diseases. To understand the molecular mechanisms of protein–DNA and protein–RNA binding, it is important to identify which residues in a protein sequence bind to DNA and RNA. At present, few methods specifically identify DNA- and RNA-binding sites in disease-related proteins. In this study, we therefore combined four machine learning algorithms into an ensemble classifier (EPDRNA) to predict DNA- and RNA-binding sites in disease-related proteins. The dataset used to build the model was collated from the UniProt and PDB databases, and the position-specific scoring matrix (PSSM), physicochemical properties and amino acid type were used as features. EPDRNA adopted soft voting and, in 10-fold cross-validation on the training sets, achieved its best AUC of 0.73 for DNA-binding sites and its best AUC of 0.71 for RNA-binding sites. To further verify the performance of the model, we assessed EPDRNA on independent test datasets for both DNA-binding site and RNA-binding site prediction. EPDRNA achieved 85% recall and 25% precision on the protein–DNA independent test set, and 82% recall and 27% precision on the protein–RNA independent test set. The EPDRNA webserver is freely available at http://www.s-bioinformatics.cn/epdrna.
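As an illustration of the soft-voting step and of the recall and precision figures reported above, the following minimal Python sketch averages per-residue binding probabilities from four base classifiers and scores the resulting predictions. It is not the EPDRNA implementation: the function names, the 0.5 decision threshold and the toy probabilities and labels are all assumptions made for this example, and the abstract does not say which four algorithms were combined.

```python
from typing import List, Sequence, Tuple


def soft_vote(prob_per_model: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-residue binding probabilities of all models (soft voting)."""
    n_models = len(prob_per_model)
    n_residues = len(prob_per_model[0])
    if any(len(p) != n_residues for p in prob_per_model):
        raise ValueError("every model must score the same residues")
    return [sum(p[i] for p in prob_per_model) / n_models for i in range(n_residues)]


def predict(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a residue as binding (1) when its averaged probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for the binding class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical probabilities from four base classifiers for five residues.
model_probs = [
    [0.9, 0.2, 0.6, 0.4, 0.1],
    [0.8, 0.4, 0.7, 0.6, 0.2],
    [0.7, 0.1, 0.3, 0.7, 0.3],
    [0.6, 0.5, 0.8, 0.5, 0.2],
]
labels = [1, 0, 1, 0, 0]  # hypothetical ground truth: 1 = binding residue

averaged = soft_vote(model_probs)
predicted = predict(averaged)
precision, recall = precision_recall(labels, predicted)
print(predicted, round(precision, 2), round(recall, 2))
```

In this toy case every true binding residue is recovered while one non-binding residue is also flagged, so recall exceeds precision. The independent test results above show the same qualitative pattern of high recall and low precision.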
Acknowledgements
The authors are grateful to Warren Lyford DeLano, the author of PyMOL, and to the authors of DP-Bind and DRNApred for making their methods available. The authors also thank the anonymous reviewers for their valuable suggestions and comments, which have improved this paper. This work was supported by the National Natural Science Foundation of China (No. 62262050) and the Special Fund of the National Natural Science Foundation of China (No. 62141204).
Author information
Contributions
YengE Feng designed the project, performed the analysis and drafted the manuscript. CanZhuang Sun collected the data, carried out the computation of binding sites and set up the web server. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Sun, C., Feng, Y. EPDRNA: A Model for Identifying DNA–RNA Binding Sites in Disease-Related Proteins. Protein J (2024). https://doi.org/10.1007/s10930-024-10183-3
DOI: https://doi.org/10.1007/s10930-024-10183-3