ProTranslator: Zero-Shot Protein Function Prediction Using Textual Description

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2022)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13278)

Abstract

Accurately identifying proteins and genes that carry out a given function is a prerequisite for a broad range of biomedical applications. Despite encouraging progress in computational protein function prediction, it remains challenging to annotate proteins with a novel function that is not yet in the Gene Ontology and has no annotated proteins. This limitation, a “side effect” of the widely used multi-label classification formulation of protein function prediction, hampers the study of new pathways and biological processes and slows research across biomedical areas. Here, we tackle this problem by annotating proteins with a function based only on its textual description, so that no proteins associated with the function need to be known in advance. The key idea of our method, ProTranslator, is to redefine protein function prediction as a machine translation problem that translates the word sequence describing a function into the amino acid sequence of a protein. We can then transfer annotations from functions with similar textual descriptions to annotate a novel function. We observed substantial improvement in annotating novel and sparsely annotated functions on the CAFA3, SwissProt and GOA datasets. We further demonstrated that our method accurately predicts the gene members of a given pathway in Reactome, KEGG and MSigDB based only on the pathway description. Finally, we showed how ProTranslator enables us to generate a textual description, instead of a function label, for a set of proteins, providing a new scheme for protein function prediction. We envision that ProTranslator will give rise to a protein function “search engine” that returns a list of proteins based on free text queried by the user.
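
The annotation-transfer idea in the abstract can be illustrated with a toy sketch. This is not the ProTranslator model: it assumes a bag-of-words vector in place of a learned text encoder, and the function identifiers (`F1` to `F3`), protein identifiers (`P1` to `P4`) and the helper names `embed`, `cosine` and `zero_shot_scores` are all hypothetical. It only shows how a function with no annotated proteins might still receive ranked candidates from functions with similar descriptions.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_scores(novel_description, known_functions):
    """Rank proteins for a function that has no annotations of its own.

    known_functions maps a (hypothetical) function id to a pair:
    (textual description, set of annotated protein ids).
    Each protein receives the highest description similarity among
    the known functions it is annotated with.
    """
    query = embed(novel_description)
    scores = {}
    for description, proteins in known_functions.values():
        sim = cosine(query, embed(description))
        for protein in proteins:
            scores[protein] = max(scores.get(protein, 0.0), sim)
    # Highest score first; ties broken by protein id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data, invented for illustration only.
known = {
    "F1": ("regulation of dendrite morphogenesis in neurons", {"P1", "P2"}),
    "F2": ("microtubule stabilization during mitosis", {"P2", "P3"}),
    "F3": ("glucose transport across the membrane", {"P4"}),
}

ranking = zero_shot_scores("dendrite morphogenesis regulation", known)
for protein, score in ranking:
    print(f"{protein}\t{score:.3f}")
```

In this toy setting, proteins annotated with the function whose description shares the most words with the query rank first. A learned encoder would replace the word-overlap similarity with a similarity in embedding space, which is the part this sketch does not attempt to reproduce.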

Author information

Corresponding author

Correspondence to Sheng Wang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, H., Wang, S. (2022). ProTranslator: Zero-Shot Protein Function Prediction Using Textual Description. In: Pe'er, I. (ed.) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_17

  • DOI: https://doi.org/10.1007/978-3-031-04749-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04748-0

  • Online ISBN: 978-3-031-04749-7

  • eBook Packages: Computer Science, Computer Science (R0)
