A study to find a potent feature by combining the various disulphide bonds of protein using data mining technique

  • Original Article
Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

In bioinformatics, identifying an effective protein feature is essential: the success of any classification technique relies heavily on informative and distinct features. Various existing classifiers use a single type of disulphide bond (viz. parallel or alternate) as a feature. However, computational efficiency may be increased by identifying an appropriate combination of disulphide bonds and treating it as a single feature. Hence, this paper studies the various combinations of disulphide bonds in order to formulate a potent protein feature that can be used in other studies to achieve better protein classification results without incorporating redundant data. A data mining approach is applied to the seven different combinations of the three disulphide bond types (viz. parallel, alternate and quad) to identify the best feature. A statistical analysis in terms of the confusion matrix and various point metrics (such as sensitivity, specificity, recall and precision) shows a high level of accuracy and F score for the feature formed by combining two disulphide bonds, i.e. the alternate and quad bonds. The average F score achieved by this combination is approximately 0.9, and the average accuracy is more than 93%. This is an unprecedented level of precision for any individual feature considered so far in any research methodology. In addition, combining two disulphide bonds instead of three requires less computational time. Overall, the analytical results of this study reveal that the combination of alternate and quad disulphide bonds can be used as an effective feature in any form of protein classification.
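As an illustration of the two ideas in the abstract, the following minimal Python sketch enumerates the seven non-empty combinations of the three bond types and computes the point metrics from a binary confusion matrix. It is not the authors' implementation: the function names `bond_combinations` and `point_metrics` are hypothetical, and the confusion-matrix counts in the usage lines are made-up placeholders, not results from the paper. Note that sensitivity and recall are the same quantity, so only one value is computed for both.

```python
from itertools import combinations

# The three disulphide bond types named in the abstract.
BOND_TYPES = ("parallel", "alternate", "quad")


def bond_combinations(bond_types=BOND_TYPES):
    """Return every non-empty subset of the bond types, smallest first."""
    return [
        combo
        for size in range(1, len(bond_types) + 1)
        for combo in combinations(bond_types, size)
    ]


def point_metrics(tp, fp, fn, tn):
    """Compute standard point metrics from binary confusion-matrix counts.

    tp, fp, fn, tn are the true-positive, false-positive, false-negative
    and true-negative counts. Ratios with a zero denominator return 0.0.
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # identical to recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Usage: three bond types give 3 + 3 + 1 = 7 candidate features.
print(len(bond_combinations()))  # 7

# Placeholder counts for illustration only (not data from the study).
print(point_metrics(tp=9, fp=1, fn=1, tn=9))
```

In a comparison of this kind, each of the seven combinations would be evaluated with the same classifier and the same metrics, so that differences in F score and accuracy can be attributed to the feature rather than to the evaluation procedure.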

Author information

Corresponding author

Correspondence to Suprativ Saha.

About this article

Cite this article

Saha, S., Paul, T. & Bhattacharya, T. A study to find a potent feature by combining the various disulphide bonds of protein using data mining technique. Netw Model Anal Health Inform Bioinforma 10, 36 (2021). https://doi.org/10.1007/s13721-021-00311-9

  • DOI: https://doi.org/10.1007/s13721-021-00311-9
