Abstract
Several machine learning models have been formulated to classify proteins by thermostability, an important prerequisite for industrial use. Here we describe a classification model for a specific enzyme, serine protease. To build the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that distinguished thermophilic from non-thermophilic serine proteases with an accuracy of 97.11%, higher than that of other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that lets protein engineers combine their sequence data with the classification model, align query sequences against custom databases, and identify similar novel enzymes along with their thermophilic nature.
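The feature extraction step mentioned above can be illustrated with a minimal sketch. It computes plain amino acid composition, one of the simplest sequence-derived feature vectors; this is an assumption made for illustration, not necessarily the feature set used in the study (the abbreviations list suggests pseudo-amino acid composition was involved). The function name and the toy sequence are hypothetical.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so every feature vector
# has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard symbols (e.g. X, gaps) are ignored. Illustrative only:
    the study's actual features are not specified in this section.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not taken from the study's dataset.
toy_sequence = "MKVLGGEERR"
features = amino_acid_composition(toy_sequence)
```

A 20-dimensional vector of this kind, computed per sequence, could then be passed to a feature selection step and to a classifier such as a random forest.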
Abbreviations
- rRNA: ribosomal ribonucleic acid
- ML: Machine learning
- SVM: Support vector machines
- DNN: Deep neural networks
- RF: Random forest
- VAL: Valine
- ILE: Isoleucine
- GLU: Glutamic acid
- ARG: Arginine
- GLY: Glycine
- MET: Methionine
- GLN: Glutamine
- SPs: Serine Protease
- API: Application Programming Interface
- PseAA: Pseudo-amino acid composition
- Se: Sensitivity
- Sp: Specificity
- Acc: Accuracy
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- ROC: Receiver operating characteristic
- TPR: True positive rate
- FPR: False positive rate
- BLAST: Basic local alignment search tool
- PHP: Hypertext Preprocessor
- HTML: Hypertext Markup Language
- CSS: Cascading Style Sheet
Ethics declarations
Conflict of interest
The authors have no conflicts of interest or funding information to declare.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary Video S1 (MP4 8530 KB)
About this article
Cite this article
Sunny, J.S., Kumar, A., Nisha, K. et al. Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease. Biologia 77, 3615–3622 (2022). https://doi.org/10.1007/s11756-022-01214-4
DOI: https://doi.org/10.1007/s11756-022-01214-4