Investigating the Performance of Machine Learning Methods in Predicting Functional Properties of the Hydrogenase Variants

Choi, Gyucheol; Kim, Wonjun; Koo, Jamin

doi:10.1007/s12257-022-0330-3

Research Paper
Published: 11 February 2023

Volume 28, pages 143–151, (2023)

Biotechnology and Bioprocess Engineering

Abstract

Improving a functional property of an enzyme via mutagenesis remains challenging because of the vast search space and the difficulty of predicting the effects of mutations. Machine learning has proven proficient at solving similar problems with unprecedented speed, owing to recent advances in computing power and analytical algorithms. In this study, we investigate the performance of machine learning methods in predicting the H₂ production activity and O₂ tolerance of hydrogenase variants. Experimentally measured activities and tolerances of 377 variants carrying single or double amino acid replacements are used to train and test seven types of machine learning models. Two feature sets are used to represent each variant: a binary representation of the amino acid sequence, and VHSE, a series of vectors quantifying the physicochemical properties of amino acids. The results show that VHSE features yield higher performance than the binary representation, particularly in the correlation coefficient and the coefficient of determination, and also in the root mean square error. Next, model performance is analyzed with respect to changes in data size and heterogeneity to provide insight into designing an effective mutagenesis library for machine learning. The best performance was obtained when a support vector machine or ridge regression was trained on a large, homogeneous data set. In this manner, our study reveals the factors affecting the performance of machine learning in identifying enzyme variants with enhanced function.
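As a rough illustration of two ideas named in the abstract, the sketch below builds a binary (one-hot) representation of a variant sequence and computes the three reported metrics (root mean square error, Pearson correlation coefficient, and coefficient of determination). It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the mutation notation (for example `A2G`), and the toy sequence are hypothetical, and the abstract does not say how the features or metrics were actually implemented. A VHSE encoding would replace the 20 indicator values per residue with that residue's VHSE descriptor values; those values are not given in this text and are therefore omitted.

```python
import math

# Assumption: the 20 standard amino acids, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def apply_mutations(wild_type, mutations):
    """Return the variant sequence for mutations such as ["A2G"].

    Hypothetical notation: original residue, 1-based position, new residue.
    """
    seq = list(wild_type)
    for m in mutations:
        old, pos, new = m[0], int(m[1:-1]), m[-1]
        if seq[pos - 1] != old:
            raise ValueError(f"{m}: wild type has {seq[pos - 1]} at position {pos}")
        seq[pos - 1] = new
    return "".join(seq)


def one_hot(sequence):
    """Binary representation: 20 indicator values per residue, concatenated."""
    features = []
    for residue in sequence:
        features.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return features


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (sd_t * sd_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy usage with a made-up four-residue "wild type" (not a hydrogenase sequence).
variant = apply_mutations("MACD", ["A2G"])
x = one_hot(variant)  # feature vector of length 4 * 20
```

In practice the feature vectors for all variants would be passed to a regression model, and the metrics would be computed on held-out variants; which of the seven model types and which validation scheme to use is left open here.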

Acknowledgements

This work was supported by the 2023 Hongik University Research Fund and KISTI R&D Innovation Support Program (KSC-2020-INO-0051).

Author information

Gyucheol Choi and Wonjun Kim contributed equally to the manuscript.

Authors and Affiliations

Department of Chemical Engineering, Hongik University, Seoul, 04066, Korea
Gyucheol Choi, Wonjun Kim & Jamin Koo

Corresponding author

Correspondence to Jamin Koo.

Ethics declarations

The authors declare no conflict of interest.

Neither ethical approval nor informed consent was required for this study.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary material, approximately 53.2 KB.

About this article

Cite this article

Choi, G., Kim, W. & Koo, J. Investigating the Performance of Machine Learning Methods in Predicting Functional Properties of the Hydrogenase Variants. Biotechnol Bioproc E 28, 143–151 (2023). https://doi.org/10.1007/s12257-022-0330-3

Received: 25 October 2022
Revised: 12 December 2022
Accepted: 14 December 2022
Published: 11 February 2023
Issue Date: February 2023
DOI: https://doi.org/10.1007/s12257-022-0330-3
