Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems

  • Conference paper
Bioinformatics and Biomedical Engineering (IWBBIO 2023)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13919)

Abstract

Discovering the functions of uncharacterized enzymes is one of the most common tasks in bioinformatics. Functional annotation methods based on phylogenetic properties have been the gold standard in genome annotation. However, these methods succeed only when the minimum requirements for establishing similarity or homology are met. As an alternative, machine learning and deep learning methods have proven helpful, yielding functional classification systems for various bioinformatics tasks. Nevertheless, a clear strategy is still needed for building predictive models and for representing amino acid sequences numerically. In this work, we address the functional classification of enzyme sequences (EC number) with machine learning methods, exploring several alternatives for training predictive models and for numerical representation. The results show that the best performance is achieved with representations based on pre-trained models. However, a clear strategy for training the models is still needed. Exploring several alternatives, we observe that the CNN-based architectures proposed in this work learn and extract patterns from complex systems more readily, achieving performances above 97% with binary cross-entropy values below 0.05. Finally, we discuss the strategies explored and outline future work on integrated methods for functional classification and the discovery of new enzymes to support current bioinformatics tools.
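
To give a sense of scale for the binary cross-entropy figure quoted in the abstract, the following is a minimal sketch of how that loss is computed for a single sequence. It is not the authors' code: the helper name `binary_cross_entropy`, the one-hot encoding over the seven top-level EC classes, and the example probabilities are all assumptions made for illustration, since the abstract does not state how labels were encoded.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over per-class probabilities (hypothetical helper)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip probabilities so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Assumed encoding: one position per top-level EC class (EC 1 to EC 7).
# The example sequence is labelled EC 3 (hydrolases).
target = [0, 0, 1, 0, 0, 0, 0]

# Made-up model outputs, not results from the paper.
confident = [0.01, 0.01, 0.95, 0.01, 0.01, 0.01, 0.01]
uncertain = [0.10, 0.10, 0.60, 0.10, 0.10, 0.10, 0.10]

loss_confident = binary_cross_entropy(target, confident)
loss_uncertain = binary_cross_entropy(target, uncertain)

print(f"confident prediction: {loss_confident:.4f}")
print(f"uncertain prediction: {loss_uncertain:.4f}")
```

Under these assumptions, only the confident prediction falls below 0.05, which suggests that a loss in that range corresponds to a model assigning high probability to the correct class and low probability to all others.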

Acknowledgments

The authors acknowledge funding by the MAG-2095 project, Ministry of Education, Chile. DMO acknowledges ANID for the project “SUBVENCIÓN A INSTALACIÓN EN LA ACADEMIA CONVOCATORIA AÑO 2022”, Folio 85220004. The authors gratefully acknowledge support from the Centre for Biotechnology and Bioengineering - CeBiB (PIA project FB0001, Conicyt, Chile).

Author information

Corresponding author

Correspondence to David Medina-Ortiz.

Ethics declarations

Conflict of Interest Statement

The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fernández, D., Olivera-Nappa, Á., Uribe-Paredes, R., Medina-Ortiz, D. (2023). Exploring Machine Learning Algorithms and Protein Language Models Strategies to Develop Enzyme Classification Systems. In: Rojas, I., Valenzuela, O., Rojas Ruiz, F., Herrera, L.J., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2023. Lecture Notes in Computer Science, vol 13919. Springer, Cham. https://doi.org/10.1007/978-3-031-34953-9_24

  • DOI: https://doi.org/10.1007/978-3-031-34953-9_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34952-2

  • Online ISBN: 978-3-031-34953-9

  • eBook Packages: Computer Science, Computer Science (R0)
