Abstract
The incidence of thyroid cancer and breast cancer is increasing every year, and the specific pathogenesis is unclear. Post-translational modifications are an important regulatory mechanism that affects the function of almost all proteins. They are essential for a diverse and well-functioning proteome and can integrate metabolism with physiological and pathological processes. In recent years, post-translational modifications have become a research hotspot, with methylation, phosphorylation, acetylation and succinylation being the main focus. SUMOylated proteins are predominantly localized in the nucleus, and SUMO regulates nuclear processes, including cell cycle control and DNA repair. SUMOylation has been increasingly implicated in cancer, Alzheimer’s disease, and Parkinson’s disease. Identifying and characterizing SUMOylation sites is therefore essential for modification-specific proteomics. This study proposes a novel schema for predicting protein SUMOylation sites that combines natural language features (Word2Vec) with sequence-based features. We also propose a new model, called RSX_SUMO, for predicting protein SUMOylation sites. In our experiments, RSX_SUMO achieves the highest performance in both five-fold cross-validation and independent testing, reaching an accuracy of 88.6% and an MCC of 0.743 on the independent test set. A comparison with several existing prediction models shows that the proposed model outperforms them. We hope that these findings offer useful guidance to researchers working on related problems.
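The accuracy and Matthews correlation coefficient (MCC) reported above are standard confusion-matrix metrics for binary classifiers. The following is a minimal sketch of how they are computed, assuming a binary SUMOylation-site classifier; the function names and the counts are hypothetical illustrations and are not taken from the paper's code or results.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention
    when one row or column of the confusion matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when positive and negative sites are imbalanced.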
Acknowledgment
The authors sincerely thank TUEBA for partial financial support of this research under the TNU-level project ID: ÐH2023-TN08-05.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tran, TX., Nguyen, VN., Le, N. (2023). Incorporating Natural Language-Based and Sequence-Based Features to Predict Protein Sumoylation Sites. In: Nguyen, N.T., Le-Minh, H., Huynh, CP., Nguyen, QV. (eds) The 12th Conference on Information Technology and Its Applications. CITA 2023. Lecture Notes in Networks and Systems, vol 734. Springer, Cham. https://doi.org/10.1007/978-3-031-36886-8_7
DOI: https://doi.org/10.1007/978-3-031-36886-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36885-1
Online ISBN: 978-3-031-36886-8
eBook Packages: Intelligent Technologies and Robotics (R0)