Abstract
Enzymes play considerable roles in disease diagnosis and biological function. Extracting features that truly reflect the intrinsic properties of proteins is the most critical step in the automatic identification of enzymes. Although many feature extraction methods have been proposed, some challenges remain. In this study, we developed a predictor called IHEC_RAAC, which can identify whether a protein is a human enzyme and distinguish the functional class of a human enzyme. To improve the feature representation ability, protein sequences were encoded by a new feature vector called ‘reduced amino acid cluster’. We evaluated 673 amino acid reduction alphabets to determine the optimal feature representation scheme. In the tenfold cross-validation test, IHEC_RAAC identified human enzymes with an accuracy of 74.66% and discriminated human enzyme classes with an accuracy of 54.78%, which were 2.06% and 8.68% higher than the state-of-the-art predictors, respectively. Additionally, results on the independent dataset indicated that IHEC_RAAC can effectively predict human enzymes and human enzyme classes, providing guidance for protein research. A user-friendly web server, IHEC_RAAC, is freely accessible at http://bioinfor.imu.edu.cn/ihecraac.
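To make the idea of a reduced amino acid cluster concrete, the following is a minimal sketch of how a sequence might be rewritten in a reduced alphabet and turned into a fixed-length feature vector. The five-cluster grouping, the function names, and the dipeptide (k = 2) composition are assumptions chosen for illustration only; they are not one of the 673 alphabets or the exact encoding used by IHEC_RAAC.

```python
from collections import Counter
from itertools import product

# Hypothetical 5-cluster reduction scheme, chosen only for illustration;
# it is NOT one of the alphabets evaluated in the paper.
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KRDE", "GP"]


def build_mapping(clusters):
    """Map each amino acid to the first letter of its cluster."""
    return {aa: group[0] for group in clusters for aa in group}


def reduce_sequence(seq, mapping):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues that are not in the mapping (e.g. 'X') are skipped.
    """
    return "".join(mapping[aa] for aa in seq.upper() if aa in mapping)


def kmer_composition(reduced, alphabet, k=2):
    """Return normalized k-mer frequencies in a fixed, sorted order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    windows = len(reduced) - k + 1
    counts = Counter(reduced[i:i + k] for i in range(windows))
    total = max(windows, 1)  # avoid division by zero for short sequences
    return [counts[km] / total for km in kmers]


mapping = build_mapping(CLUSTERS)
alphabet = sorted(set(mapping.values()))
reduced = reduce_sequence("MKVLAAGTWY", mapping)
features = kmer_composition(reduced, alphabet, k=2)
```

With five clusters, the dipeptide feature vector has 25 dimensions instead of the 400 needed for the full 20-letter alphabet, which is the kind of dimensionality reduction such schemes aim for. A vector like `features` could then be passed to any standard classifier.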
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 62061034, 61702290, 61861036), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (NJYT-18-B01), and the Fund for Excellent Young Scholars of Inner Mongolia (2017JQ04).
Author information
Contributions
YZ designed this work. HW and QX performed the data analyses and wrote the manuscript. PL and LZ contributed significantly to analysis and manuscript preparation. YH helped perform the analysis with constructive discussions.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Human/animal rights statement
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
No individual participants were included in this study; therefore, no informed consent was necessary.
Additional information
Handling editor: Y. Su.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, H., Xi, Q., Liang, P. et al. IHEC_RAAC: a online platform for identifying human enzyme classes via reduced amino acid cluster strategy. Amino Acids 53, 239–251 (2021). https://doi.org/10.1007/s00726-021-02941-9
DOI: https://doi.org/10.1007/s00726-021-02941-9