Investigating the Performance of Machine Learning Methods in Predicting Functional Properties of the Hydrogenase Variants

  • Research Paper
Biotechnology and Bioprocess Engineering

Abstract

Improving a functional property of an enzyme via mutagenesis remains a challenging problem because of the vast search space and the difficulty of predicting the effects of mutations. Machine learning has proven proficient at solving similar problems with unprecedented speed, owing to recent advances in computing power and analytical algorithms. In this study, we investigate the performance of machine learning methods in predicting the H2 production activity and O2 tolerance of hydrogenase variants. Experimentally measured activities and tolerances of 377 variants carrying single or double amino acid replacements are used to train and test seven types of machine learning models. Two feature sets represent each variant: a binary representation of the amino acid sequence, and VHSE, a series of vectors quantifying the physicochemical properties of amino acids. The results show that the VHSE features yield higher performance than the binary representation, particularly in the correlation coefficient and the coefficient of determination, and also in the root mean square error. Next, we analyze how model performance changes with data size and heterogeneity to provide insights into designing an effective mutagenesis library for machine learning. The best performance was obtained when a support vector machine or ridge regression was trained on a large, homogeneous data set. In this manner, our study reveals the factors affecting the performance of machine learning in identifying enzyme variants with enhanced function.
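As a rough illustration of two ideas named in the abstract, the sketch below builds a binary (one-hot) representation of a variant and computes the three reported metrics: correlation coefficient, coefficient of determination, and root mean square error. It is not the authors' pipeline. The sequence, substitutions, and activity values are invented toy data, and the helper names (`one_hot_encode`, `apply_substitutions`, `regression_metrics`) are hypothetical. A VHSE encoding would presumably replace each 20-entry binary column with that residue's descriptor values from a published table, which is not reproduced here.

```python
import math

# The 20 standard residues, ordered by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence):
    """Return a flat 0/1 list with 20 entries per residue."""
    features = []
    for residue in sequence:
        column = [0] * len(AMINO_ACIDS)
        column[AMINO_ACIDS.index(residue)] = 1
        features.extend(column)
    return features


def apply_substitutions(wild_type, substitutions):
    """Apply substitutions such as 'K2R' (1-based position) to a sequence."""
    residues = list(wild_type)
    for sub in substitutions:
        old, pos, new = sub[0], int(sub[1:-1]), sub[-1]
        if residues[pos - 1] != old:
            raise ValueError(f"{sub}: expected {old} at position {pos}")
        residues[pos - 1] = new
    return "".join(residues)


def regression_metrics(y_true, y_pred):
    """Return Pearson r, coefficient of determination (R^2), and RMSE."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return {
        "pearson_r": cov / math.sqrt(var_t * var_p),
        "r2": 1 - ss_res / var_t,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy data, invented for illustration only (not from the study).
wild_type = "MKTAY"
variant = apply_substitutions(wild_type, ["K2R", "A4G"])  # a double variant
x = one_hot_encode(variant)
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

In practice, the feature vectors of all variants would be stacked into a matrix and passed to a regression model such as ridge regression or a support vector machine, with the metrics computed on held-out variants.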

Acknowledgements

This work was supported by the 2023 Hongik University Research Fund and KISTI R&D Innovation Support Program (KSC-2020-INO-0051).

Author information

Corresponding author

Correspondence to Jamin Koo.

Ethics declarations

The authors declare no conflict of interest.

Neither ethical approval nor informed consent was required for this study.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Choi, G., Kim, W. & Koo, J. Investigating the Performance of Machine Learning Methods in Predicting Functional Properties of the Hydrogenase Variants. Biotechnol Bioproc E 28, 143–151 (2023). https://doi.org/10.1007/s12257-022-0330-3

  • DOI: https://doi.org/10.1007/s12257-022-0330-3
