Abstract
In bioinformatics, identifying an effective protein feature is essential, because the success of any classification technique relies heavily on informative and distinct features. Several existing classifiers use a single type of disulphide bond (viz. parallel or alternate) as a feature. However, computational efficiency may be improved by identifying an appropriate combination of disulphide bonds that serves as a single feature. Hence, this paper studies the various combinations of disulphide bonds to formulate a potent protein feature that can be utilised in other studies to achieve better protein classification results without incorporating redundant data. A data mining approach is applied to the seven different combinations of the three disulphide bond types (viz. parallel, alternate and quad) to identify the best feature. A statistical analysis based on the confusion matrix and various point metrics (such as sensitivity, specificity, recall and precision) shows a high level of accuracy and F score for the feature formed by combining two disulphide bonds, i.e. the alternate and quad bonds. The average F score achieved for this combination is approximately 0.9, and the average accuracy is more than 93%. This is reported as an unprecedented level for any individual feature considered so far. Also, combining two disulphide bonds instead of three requires less computational time. Overall, the analytical results of this study indicate that the combination of alternate and quad disulphide bonds can be used as an effective feature in protein classification.
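As a rough illustration of the setup described in the abstract, the sketch below enumerates the seven non-empty combinations of the three bond types and computes the point metrics from a binary confusion matrix. It is not the authors' implementation: the function names `feature_combinations` and `point_metrics` and the example counts are hypothetical, chosen only to show how the quantities relate.

```python
from itertools import combinations

# The three disulphide bond types named in the abstract.
BOND_TYPES = ("parallel", "alternate", "quad")


def feature_combinations(bond_types=BOND_TYPES):
    """Return every non-empty subset of the bond types, smallest subsets first."""
    return [
        combo
        for size in range(1, len(bond_types) + 1)
        for combo in combinations(bond_types, size)
    ]


def point_metrics(tp, fp, fn, tn):
    """Compute point metrics from binary confusion-matrix counts.

    tp, fp, fn, tn are true positives, false positives,
    false negatives and true negatives.
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # identical to recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }


if __name__ == "__main__":
    for combo in feature_combinations():
        print(" + ".join(combo))
    # Hypothetical counts, for illustration only (not results from the paper).
    print(point_metrics(tp=45, fp=5, fn=5, tn=45))
```

Three bond types give 2^3 - 1 = 7 non-empty combinations, which matches the seven candidate features the abstract mentions. Note that sensitivity and recall are the same quantity, so the sketch reports it once.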
Cite this article
Saha, S., Paul, T. & Bhattacharya, T. A study to find a potent feature by combining the various disulphide bonds of protein using data mining technique. Netw Model Anal Health Inform Bioinforma 10, 36 (2021). https://doi.org/10.1007/s13721-021-00311-9
DOI: https://doi.org/10.1007/s13721-021-00311-9