Multi-label classifier for protein sequence using heuristic-based deep convolution neural network

Applied Intelligence

Abstract

Deep learning techniques have recently proved very useful for classifying sequential data. Protein sequences belong to functional classes according to the structure of their sequences, and annotating protein sequences with their functional classes is a multi-label task. Considerably more data is available for the primary structure of proteins than for the secondary, tertiary, and quaternary structures. Clustering-based techniques require expert domain knowledge to work with such extensive data samples. Traditional methods use n-gram features of amino acids while ignoring the relationship between motifs and the amino acid sequence. This paper proposes an efficient method for classifying proteins into their functional classes using a convolutional neural network based on heuristic rules. The proposed approach works on the primary structure of protein sequences and considers the relationship between motifs and amino acids, including the affinity information between them. It also takes into account the locations of amino acids in the protein sequence. In addition to achieving high performance in the classification of protein sequences, we propose a heuristic approach that improves the precision and recall of the individual functional classes and handles the data imbalance problem. Compared with other competitive approaches, the proposed approach achieves better precision, recall, AUC, and subset accuracy. The greatest challenge in multi-label classification is handling data imbalance, which arises from variance in the frequencies of the labels in the data. This imbalance is addressed by weight modulation in the loss function, which influences the learning process.
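As a rough illustration of two ideas named in the abstract, the sketch below shows one common way to modulate per-label weights in a multi-label loss, and how subset accuracy is computed. The abstract does not give the paper's actual weighting scheme, so the negative-to-positive frequency ratio used here is an assumption, and the function names `label_weights`, `weighted_bce`, and `subset_accuracy` are hypothetical.

```python
import math


def label_weights(Y):
    """Per-label positive weight: ratio of negative to positive examples.

    Assumption: this ratio is one common choice, not the paper's scheme.
    """
    n = len(Y)
    weights = []
    for j in range(len(Y[0])):
        pos = sum(row[j] for row in Y)
        # Fall back to 1.0 for labels that never occur, to avoid dividing by zero.
        weights.append((n - pos) / pos if pos else 1.0)
    return weights


def weighted_bce(Y, P, pos_weight, eps=1e-12):
    """Mean binary cross-entropy; positive terms of label j are scaled by pos_weight[j]."""
    total = 0.0
    for y_row, p_row in zip(Y, P):
        for y, p, w in zip(y_row, p_row, pos_weight):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
            total -= w * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / (len(Y) * len(Y[0]))


def subset_accuracy(Y, P, threshold=0.5):
    """Fraction of samples whose full predicted label set matches the true set."""
    hits = sum(
        all((p >= threshold) == bool(y) for y, p in zip(y_row, p_row))
        for y_row, p_row in zip(Y, P)
    )
    return hits / len(Y)


# Toy data: four sequences, two functional classes; the second class is rare.
Y = [[1, 0], [1, 0], [0, 1], [1, 0]]
P = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.4], [0.7, 0.6]]

weights = label_weights(Y)
loss = weighted_bce(Y, P, weights)
accuracy = subset_accuracy(Y, P)
```

With weights of this kind, a missed positive for a rare class contributes more to the loss than a missed positive for a frequent class, which is the general effect the abstract attributes to weight modulation. Subset accuracy is strict: a sample counts as correct only when every label is predicted correctly.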

Author information

Corresponding author

Correspondence to Vikas Chauhan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chauhan, V., Tiwari, A., Joshi, N. et al. Multi-label classifier for protein sequence using heuristic-based deep convolution neural network. Appl Intell 52, 2820–2837 (2022). https://doi.org/10.1007/s10489-021-02529-6

  • DOI: https://doi.org/10.1007/s10489-021-02529-6
