Abstract
Deep learning techniques have recently proven very useful for classifying sequential data. Protein sequences belong to functional classes according to the structure of their sequences, and annotating protein sequences with their corresponding functional classes is a multi-label task. Far more data is available for the primary structure of proteins than for the secondary, tertiary, and quaternary structures. Clustering-based techniques require expert domain knowledge drawn from extensive data samples. Traditional methods use n-gram features of amino acids while ignoring the relationship between motifs and the amino acid sequence. This paper proposes an efficient method to classify proteins into their functional classes using a convolutional neural network based on heuristic rules. The proposed approach works on the primary structure of protein sequences and considers the relationships among motifs and amino acids, the locations of amino acids within the protein sequence, and the affinity information between amino acids and motifs. Along with achieving high performance in the classification of protein sequences, we propose a heuristic approach to improve the precision and recall of the individual functional classes. This heuristic approach improves performance and handles the data imbalance problem. The proposed approach is compared with other competitive approaches and achieves better precision, recall, AUC, and subset accuracy. The greatest challenge in multi-label classification is handling data imbalance, which arises from variance in the frequencies of the labels in the data. This imbalance is addressed by modulating the weights in the loss function to influence the learning process.
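As an illustration of weight modulation in a multi-label loss, the following is a minimal sketch and not the authors' formulation, which the abstract does not specify. It assumes a binary cross-entropy loss whose positive term is scaled per label by the ratio of negative to positive examples; the function names `label_weights` and `weighted_bce` and the toy data are hypothetical.

```python
import numpy as np

def label_weights(Y):
    """Per-label positive weights (assumed scheme): rarer labels get larger weights."""
    Y = np.asarray(Y, dtype=float)
    pos = Y.sum(axis=0)                  # number of positives per label
    neg = Y.shape[0] - pos               # number of negatives per label
    return neg / np.maximum(pos, 1.0)    # guard against labels with no positives

def weighted_bce(Y, P, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled per label."""
    Y = np.asarray(Y, dtype=float)
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    loss = -(pos_weight * Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P))
    return float(loss.mean())

# Toy data: 4 sequences, 3 functional classes; the last class is rare.
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])
P = np.full(Y.shape, 0.5)   # an uninformative predictor
P[3, 2] = 0.1               # the only positive of the rare class is badly missed

w = label_weights(Y)
print("weights:", w)
print("unweighted loss:", weighted_bce(Y, P, np.ones(Y.shape[1])))
print("weighted loss:  ", weighted_bce(Y, P, w))
```

With these weights, a missed positive of the rare class contributes more to the loss than it would unweighted, which pushes training toward better recall on infrequent labels.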
Cite this article
Chauhan, V., Tiwari, A., Joshi, N. et al. Multi-label classifier for protein sequence using heuristic-based deep convolution neural network. Appl Intell 52, 2820–2837 (2022). https://doi.org/10.1007/s10489-021-02529-6
DOI: https://doi.org/10.1007/s10489-021-02529-6