Abstract
Accurate annotation of protein functions is important for a profound understanding of molecular biology. A large number of proteins remain uncharacterized because of the sparsity of available supporting information. For a large set of uncharacterized proteins, the only type of information available is their amino acid sequence. This motivates the development of sequence-based computational techniques that can precisely annotate uncharacterized proteins. In this paper, we propose DeepSeq, a deep learning architecture that uses only protein sequence information to predict the associated functions. The prediction process does not require handcrafted features; rather, the architecture automatically extracts representations from the input sequence data. Our experiments with DeepSeq indicate significant improvements in prediction accuracy over other sequence-based methods: the model achieves an overall validation accuracy of 86.72%, with an F1 score of 71.13%. Moreover, using the automatically learned features and without any changes to DeepSeq, we successfully solved a different problem, namely protein function localization, with no human intervention. Finally, we discuss how the same architecture can be used to solve even more complicated problems such as prediction of 2D and 3D structure as well as protein-protein interactions.
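To make the "sequence-only" setting concrete, the following is a minimal sketch of how a raw amino acid sequence might be turned into a fixed-length integer input for a neural network, and how an F1 score combines precision and recall for one protein's predicted function labels. It is not taken from the DeepSeq code base: the names `encode_sequence` and `f1_score`, the reserved padding and unknown indices, and the per-protein form of the F1 computation are all assumptions made for illustration.

```python
# Hypothetical sketch; names and index conventions are assumptions,
# not the DeepSeq implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD, UNKNOWN = 0, 1                   # assumed reserved indices
INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_len):
    """Map a protein sequence to a fixed-length list of integer indices.

    Sequences longer than max_len are truncated, shorter ones are
    right-padded with PAD, and non-standard residues map to UNKNOWN.
    """
    ids = [INDEX.get(aa, UNKNOWN) for aa in sequence.upper()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def f1_score(true_labels, predicted_labels):
    """Harmonic mean of precision and recall over one protein's label set."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)
```

An integer encoding of this kind is a common starting point because it lets the network learn its own residue representations instead of relying on handcrafted features; the actual preprocessing used by the authors may differ.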
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool (BLAST). J. Mol. Biol. 215(3), 403–410 (1990)
Benso, A., Carlo, S.D., Ur Rehman, H., Politano, G., Savino, A., Suravajhala, P.: A combined approach for genome wide protein function annotation/prediction. Proteome Sci. 11(suppl. 1), 1–12 (2013)
Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., Yuan, Y.: Predicting function: from genes to genomes and back. J. Mol. Biol. 283(4), 707–725 (1998)
Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-Based Models for Speech Recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015)
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555 (2014)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
Dahl, G.E., Sainath, T.N., Hinton, G.E.: Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 8609–8613 (2013)
Deng, M., Zhang, K., Mehta, S., Chen, T., Sun, F.: Prediction of protein function using protein-protein interaction data. J. Comput. Biol. 10(6), 947–960 (2003)
Duvenaud, D., Maclaurin, D., Adams, R.: Early Stopping as Nonparametric Variational Inference. In: Artificial Intelligence and Statistics, pp. 1070–1077 (2016)
Friedberg, I.: Automated protein function prediction–the genomic challenge. Brief. Bioinform. 7(3), 225–242 (2006)
Gaudet, P., Livstone, M.S., Lewis, S.E., Thomas, P.D.: Phylogenetic-based propagation of functional annotations within the gene ontology consortium. Brief. Bioinform. 12(5), 449–462 (2011)
The Gene Ontology Consortium: Gene Ontology Consortium: going forward. Nucleic Acids Res. 43(Database Issue), D1049–D1056 (2015)
Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772 (2014)
Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J., Scholkopf, B.: Support vector machines. IEEE Intell. Syst. Appl. 13(4), 18–28 (1998)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Huttenhower, C., Hibbs, M., Myers, C., Troyanskaya, O.G.: A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22(23), 2890–2897 (2006)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 448–456 (2015)
Jiang, Y., Oron, T.R., Clark, W.T., Bankapur, A.R., D’Andrea, D., Lepore, R., Funk, C.S., Kahanda, I., Verspoor, K.M., Ben-Hur, A., et al.: An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biology 17(1), 184–203 (2016). https://doi.org/10.1186/s13059-016-1037-6
Kanehisa, M., et al.: Kanehisa Laboratories at Institute for Chemical Research (ICR). Kyoto University, Japan (2015). http://www.kanehisa.jp/en/db_growth.html
Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., Bengio, Y.: Batch Normalized Recurrent Neural Networks. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 2657–2661 (2016)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Leifert, G., Strauß, T., Grüning, T., Wustlich, W., Labahn, R.: Cells in multidimensional recurrent neural networks. J. Mach. Learn. Res. 17(1), 3313–3349 (2016)
Letovsky, S., Kasif, S.: Predicting protein function from protein-protein interaction data: a probabilistic approach. Bioinformatics 19(suppl. 1), i197–i204 (2003)
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. IEEE, vol. 2, pp. 1150–1157 (1999)
Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., Eisenberg, D.: Detecting protein function and protein-protein interactions from genome sequences. Science 285(5428), 751–753 (1999)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv:1301.3781 (2013)
Mitchell, A., Chang, H.Y., Daugherty, L., Fraser, M., Hunter, S., Lopez, R., McAnulla, C., McMenamin, C., Nuka, G., Pesseat, S., et al.: InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43(Database Issue), D213–D221 (2015)
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B., Singh, M.: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21(suppl. 1), i302–i310 (2005)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv:1609.03499 (2016)
Pal, D., Eisenberg, D.: Inference of protein function from protein structure. Structure 13(1), 121–130 (2005)
Pazos, F., Sternberg, M.J.: Automated prediction of protein function and detection of functional sites from structure. Proc. Natl. Acad. Sci. USA 101(41), 14754–14759 (2004)
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O.: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96(8), 4285–4288 (1999)
Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Representation. In: Empirical Methods in Natural Language Processing, vol. 14, pp. 1532–1543 (2014)
Piovesan, D., Giollo, M., Leonardi, E., Ferrari, C., Tosatto, S.C.E.: INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res. 43(W1), W134–W140 (2015). https://doi.org/10.1093/nar/gkv523
Radivojac, P., Clark, W.T., Oron, T.R., Schnoes, A.M., et al.: A large-scale evaluation of computational protein function prediction. Nat. Methods 10(3), 221–227 (2013)
Shen, L.X., Basilion, J.P., Stanton, V.P.: Single-nucleotide polymorphisms can cause different structural folds of mRNA. Proc. Natl. Acad. Sci. 96(14), 7871–7876 (1999)
Shimodaira, H.: Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plann. Inference 90(2), 227–244 (2000)
Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016)
Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1891–1898 (2014)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
The UniProt Consortium: UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015)
Vazquez, A., Flammini, A., Maritan, A., Vespignani, A.: Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol. 21(6), 697–700 (2003)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11(Dec), 3371–3408 (2010)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Watson, J.D., Laskowski, R.A., Thornton, J.M.: Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol. 15(3), 275–284 (2005)
Xin, F., Radivojac, P.: Computational methods for identification of functional residues in protein structures. Curr. Protein Pept. Sci. 12(6), 456–469 (2011)
Acknowledgements
The execution of our deep learning experiments was made possible by the gracious contribution of a Tesla K40 GPU by NVIDIA Corporation. The contents of this paper are not necessarily endorsed by the funding agencies.
Funding
Hafeez Ur Rehman’s contribution to this work was partially supported by the HEC under grant number 21-915/SRGP/R&D/HEC/2016.
Author information
Contributions
M.N. conceived the idea of using deep learning for bioinformatics; H.R. provided domain knowledge and structured the problem. Both M.N. and H.R. wrote the code for the experiments. G.P. ran experiments and analyzed results. A.B. helped analyze the results and formalize the details of the discussion. All authors contributed to manuscript preparation and review.
Ethics declarations
Competing interests
There are no competing financial interests associated with this research work.
Consent to publish
All authors consent to publication of the details presented in this manuscript.
Additional information
Availability of data and materials
We provide the dataset (also freely available from UniProt [44]) and our code for the whole pipeline as open source at http://github.com/recluze/deepseq. Instructions for executing the code are provided in the attached README.
About this article
Cite this article
Nauman, M., Ur Rehman, H., Politano, G. et al. Beyond Homology Transfer: Deep Learning for Automated Annotation of Proteins. J Grid Computing 17, 225–237 (2019). https://doi.org/10.1007/s10723-018-9450-6
DOI: https://doi.org/10.1007/s10723-018-9450-6