Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins

Journal of Grid Computing

Abstract

Accurate annotation of protein functions is important for a thorough understanding of molecular biology. A large number of proteins remain uncharacterized because little supporting information is available for them. For many uncharacterized proteins, the only information available is the amino acid sequence. This motivates the need for sequence-based computational techniques that can precisely annotate uncharacterized proteins. In this paper, we propose DeepSeq, a deep learning architecture that uses only protein sequence information to predict the functions associated with a protein. The prediction process does not require handcrafted features; rather, the architecture automatically extracts representations from the input sequence data. Our experiments with DeepSeq indicate significant improvements in prediction accuracy over other sequence-based methods: the model achieves an overall validation accuracy of 86.72%, with an F1 score of 71.13%. Moreover, using the automatically learned features and without any changes to DeepSeq, we successfully solved a different problem, namely protein function localization, with no human intervention. Finally, we discuss how the same architecture can be used to solve even more complicated problems, such as prediction of 2D and 3D structure as well as protein-protein interactions.
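
To make the idea of sequence-only input concrete, the following is a minimal sketch of how an amino acid sequence might be turned into fixed-length integer indices before being fed to a learned embedding layer. It is not taken from the DeepSeq code base: the names `AMINO_ACIDS`, `AA_TO_INDEX`, and `encode_sequence`, the reserved padding/unknown indices, and the truncate-or-pad scheme are all assumptions made for illustration.

```python
# Hypothetical preprocessing sketch; names and scheme are assumptions,
# not the authors' implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
PAD, UNK = 0, 1  # reserved indices: padding, and unknown/non-standard residue
AA_TO_INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(seq: str, max_len: int) -> list[int]:
    """Map residues to integer indices, truncating or right-padding to max_len."""
    indices = [AA_TO_INDEX.get(aa, UNK) for aa in seq.upper()[:max_len]]
    return indices + [PAD] * (max_len - len(indices))


if __name__ == "__main__":
    # "X" is not a standard residue, so it maps to the UNK index.
    print(encode_sequence("MKTX", 6))
```

No handcrafted features enter at this stage: the indices only identify residues, and any representation of them would be learned by the network during training.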

References

  1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool (BLAST). J. Mol. Biol. 215(3), 403–410 (1990)

  2. Benso, A., Carlo, S.D., Ur Rehman, H., Politano, G., Savino, A., Suravajhala, P.: A combined approach for genome wide protein function annotation/prediction. Proteome Sci. 11(Suppl. 1), 1–12 (2013)

  3. Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., Yuan, Y.: Predicting function: from genes to genomes and back. J. Mol. Biol. 283(4), 707–725 (1998)

  4. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-Based Models for Speech Recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015)

  5. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555 (2014)

  6. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)

  7. Dahl, G.E., Sainath, T.N., Hinton, G.E.: Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8609–8613. IEEE (2013)

  8. Deng, M., Zhang, K., Mehta, S., Chen, T., Sun, F.: Prediction of protein function using protein-protein interaction data. J. Comput. Biol. 10(6), 947–960 (2003)

  9. Duvenaud, D., Maclaurin, D., Adams, R.: Early Stopping as Nonparametric Variational Inference. In: Artificial Intelligence and Statistics, pp. 1070–1077 (2016)

  10. Friedberg, I.: Automated protein function prediction–the genomic challenge. Brief. Bioinform. 7(3), 225–242 (2006)

  11. Gaudet, P., Livstone, M.S., Lewis, S.E., Thomas, P.D.: Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief. Bioinform. 12(5), 449–462 (2011)

  12. The Gene Ontology Consortium: Gene Ontology Consortium: going forward. Nucleic Acids Res. 43(Database Issue), D1049–D1056 (2015)

  13. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772 (2014)

  14. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)

  15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  16. Huttenhower, C., Hibbs, M., Myers, C., Troyanskaya, O.G.: A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22(23), 2890–2897 (2006)

  17. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 448–456 (2015)

  18. Jiang, Y., Oron, T.R., Clark, W.T., Bankapur, A.R., D’Andrea, D., Lepore, R., Funk, C.S., Kahanda, I., Verspoor, K.M., Ben-Hur, A., et al.: An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biology 17(1), 184–203 (2016). https://doi.org/10.1186/s13059-016-1037-6

  19. Kanehisa, M., et al.: Kanehisa Laboratories at Institute for Chemical Research (ICR). Kyoto University, Japan (2015). http://www.kanehisa.jp/en/db_growth.html

  20. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)

  21. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  22. Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., Bengio, Y.: Batch Normalized Recurrent Neural Networks. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2657–2661. IEEE (2016)

  23. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  24. Leifert, G., Strauß, T., Grüning, T., Wustlich, W., Labahn, R.: Cells in multidimensional recurrent neural networks. J. Mach. Learn. Res. 17(1), 3313–3349 (2016)

  25. Letovsky, S., Kasif, S.: Predicting protein function from protein-protein interaction data: a probabilistic approach. Bioinformatics 19(suppl. 1), i197–i204 (2003)

  26. Lowe, D.G.: Object recognition from local scale-invariant features. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157. IEEE (1999)

  27. Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., Eisenberg, D.: Detecting protein function and protein-protein interactions from genome sequences. Science 285(5428), 751–753 (1999)

  28. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)

  29. Mitchell, A., Chang, H.Y., Daugherty, L., Fraser, M., Hunter, S., Lopez, R., McAnulla, C., McMenamin, C., Nuka, G., Pesseat, S., et al.: InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43(Database Issue), D213–21 (2015)

  30. Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M.: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21(suppl. 1), i302–i310 (2005)

  31. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814 (2010)

  32. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv:1609.03499 (2016)

  33. Pal, D., Eisenberg, D.: Inference of protein function from protein structure. Structure 13(1), 121–130 (2005)

  34. Pazos, F., Sternberg, M.J.: Automated prediction of protein function and detection of functional sites from structure. Proc. Natl. Acad. Sci. USA 101(41), 14754–14759 (2004)

  35. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O.: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96(8), 4285–4288 (1999)

  36. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Representation. In: Empirical Methods in Natural Language Processing, vol. 14, pp. 1532–1543 (2014)

  37. Piovesan, D., Giollo, M., Leonardi, E., Ferrari, C., Tosatto, S.C.E.: INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res. 43(W1), W134–40 (2015). https://doi.org/10.1093/nar/gkv523

  38. Radivojac, P., Clark, W.T., Oron, T.R., Schnoes, A.M., et al.: A large-scale evaluation of computational protein function prediction. Nat. Methods 10(3), 221–227 (2013)

  39. Shen, L.X., Basilion, J.P., Stanton, V.P.: Single-nucleotide polymorphisms can cause different structural folds of mRNA. Proc. Natl. Acad. Sci. 96(14), 7871–7876 (1999)

  40. Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plann. Inference 90(2), 227–244 (2000)

  41. Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016)

  42. Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1891–1898 (2014)

  43. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)

  44. The UniProt Consortium: UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015)

  45. Vazquez, A., Flammini, A., Maritan, A., Vespignani, A.: Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol. 21(6), 697–700 (2003)

  46. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)

  47. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)

  48. Watson, J.D., Laskowski, R.A., Thornton, J.M.: Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol. 15(3), 275–284 (2005)

  49. Xin, F., Radivojac, P.: Computational methods for identification of functional residues in protein structures. Curr. Protein Pept. Sci. 12(6), 456–469 (2011)

Acknowledgements

The execution of our deep learning experiments was made possible by the gracious contribution of a Tesla K40 GPU by NVIDIA Corporation. The contents of this paper are not necessarily endorsed by the funding agencies.

Funding

Hafeez Ur Rehman’s contribution to this work was partially supported by the HEC under Grant Number 21-915/SRGP/R&D/HEC/2016.

Author information

Contributions

M.N. conceived the idea of using deep learning for bioinformatics. H.R. provided domain knowledge and structured the problem. Both M.N. and H.R. wrote the code for the experiments. G.P. ran experiments and analyzed results. A.B. helped analyze results and formalize the details of the discussion. All authors contributed to manuscript preparation and review.

Corresponding author

Correspondence to Hafeez Ur Rehman.

Ethics declarations

Competing interests

There are no competing financial interests associated with this research work.

Consent to publish

All authors consent to publication of the details presented in this manuscript.

Additional information

Availability of data and materials

We provide the dataset (also freely available from UniProt [44]) and our code for the whole pipeline as open source at http://github.com/recluze/deepseq. Instructions for executing the code are provided in the attached README.

About this article

Cite this article

Nauman, M., Ur Rehman, H., Politano, G. et al. Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins. J Grid Computing 17, 225–237 (2019). https://doi.org/10.1007/s10723-018-9450-6

  • DOI: https://doi.org/10.1007/s10723-018-9450-6
