Abstract
Accurately finding proteins and genes that have a certain function is a prerequisite for a broad range of biomedical applications. Despite the encouraging progress of existing computational approaches in protein function prediction, it remains challenging to annotate proteins to a novel function that is not collected in the Gene Ontology and does not have any annotated proteins. This limitation, a “side effect” of the widely used multi-label classification setting for protein function prediction, hampers the study of new pathways and biological processes and slows down research in various biomedical areas. Here, we tackle this problem by annotating proteins to a function based only on its textual description, so that we do not need to know any proteins associated with this function. The key idea of our method, ProTranslator, is to redefine protein function prediction as a machine translation problem, which translates the description word sequence of a function to the amino acid sequence of a protein. We can then transfer annotations from functions that have similar textual descriptions to annotate a novel function. We observed substantial improvement in annotating novel functions and sparsely annotated functions on the CAFA3, SwissProt and GOA datasets. We further demonstrated how our method accurately predicted gene members for a given pathway in Reactome, KEGG and MSigDB based only on the pathway description. Finally, we showed how ProTranslator enabled us to generate the textual description, instead of the function label, for a set of proteins, providing a new scheme for protein function prediction. We envision that ProTranslator will give rise to a protein function “search engine” that returns a list of proteins based on the free text queried by the user.
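The annotation-transfer idea in the abstract can be illustrated with a minimal sketch. It is not the ProTranslator model: a bag-of-words cosine similarity stands in for a learned text encoder, and the function names, descriptions and protein identifiers below are made-up toy data. The helper names (`bag_of_words`, `cosine`, `score_proteins`) are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a learned text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score_proteins(novel_description, known_functions):
    """Rank proteins for a function that has no annotations of its own.

    known_functions maps a function name to (description, annotated proteins).
    Each protein's score is the summed text similarity between the novel
    description and the descriptions of the functions it is annotated to.
    """
    query = bag_of_words(novel_description)
    scores = {}
    for description, proteins in known_functions.values():
        sim = cosine(query, bag_of_words(description))
        for protein in proteins:
            scores[protein] = scores.get(protein, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up data (assumption): two annotated functions, three proteins.
known = {
    "f1": ("repair of damaged dna strands", {"P1", "P2"}),
    "f2": ("transport of glucose across membrane", {"P3"}),
}
ranking = score_proteins("dna damage repair response", known)
for protein, score in ranking:
    print(protein, round(score, 3))
```

In this toy setting the proteins annotated to the DNA-repair function rank above the glucose-transport protein, because only the former shares words with the query. A real system would replace the word-overlap similarity with learned text and sequence representations.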
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, H., Wang, S. (2022). ProTranslator: Zero-Shot Protein Function Prediction Using Textual Description. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_17
DOI: https://doi.org/10.1007/978-3-031-04749-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science (R0)