An Approach to Find Proper Execution Parameters of n-Gram Encoding Method Based on Protein Sequence Classification

  • Suprativ Saha
  • Tanmay Bhattacharya
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1046)


Various protein sequence classification approaches have been developed to assign unknown sequences to their classes or families with a certain accuracy. Feature extraction from the protein sequence is a key step in all of these approaches, and the N-gram encoding method is a popular feature extraction procedure. To keep computational time low and classification accuracy high, however, an upper limit on ‘N’ in the N-gram encoding method must be fixed. In addition, the standard deviation of a protein sequence is one of the important feature values extracted by the N-gram encoding method. This feature can be computed in two different ways: using the standard mean value or using the floating mean value. It is therefore also important to determine which of these methods is the proper way to calculate the standard deviation. This paper presents an experimental investigation to find the upper limit of N for the N-gram encoding method and to identify the proper technique for calculating the standard deviation as a feature extracted from an unknown protein sequence.
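
As a rough illustration of the kind of feature discussed here, the following Python sketch counts overlapping n-grams in a sequence for every n up to a chosen upper limit and summarises the counts by their arithmetic mean and population standard deviation. It is a minimal sketch under stated assumptions, not the authors' method: the function names `ngram_counts` and `ngram_features`, the toy sequence, and the choice to take statistics over the observed n-gram counts are all hypothetical, and the floating mean variant is not sketched because the abstract does not define it. With 20 standard amino acids there are 20^n possible n-grams, which is one reason the cost grows quickly with N.

```python
from collections import Counter
from statistics import mean, pstdev


def ngram_counts(sequence: str, n: int) -> Counter:
    """Count overlapping n-grams (length-n substrings) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_features(sequence: str, max_n: int) -> dict:
    """Map each n in 1..max_n to (mean, standard deviation) of n-gram counts.

    Assumption: the statistics are taken over the counts of the n-grams
    that actually occur in the sequence, using the ordinary arithmetic
    mean. The abstract does not give the exact definition.
    """
    features = {}
    for n in range(1, max_n + 1):
        counts = list(ngram_counts(sequence, n).values())
        if not counts:
            # The sequence is shorter than n, so there are no n-grams.
            continue
        features[n] = (mean(counts), pstdev(counts))
    return features


if __name__ == "__main__":
    toy_sequence = "MKVLAAGIVKMKVL"  # hypothetical toy sequence
    for n, (mu, sigma) in ngram_features(toy_sequence, max_n=3).items():
        print(f"n={n}: mean={mu:.3f}, std={sigma:.3f}")
```

Raising `max_n` in such a sketch adds one feature pair per value of n while the number of distinct n-grams to track grows, which is the trade-off between computational time and accuracy that an upper limit on N is meant to control.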


Neural network based classifier; N-gram encoding method; Standard mean value; Standard deviation; Floating mean value



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, Brainware University, Barasat, Kolkata, India
  2. Department of Information Technology, Techno India, Saltlake, Kolkata, India
