Feature Filtering of Amino Acid Sequences Using Rough Set Theory

  • Amit Paul
  • Jaya Sil
  • Chitrangada Das Mukhopadhyay
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 31)


Numerous algorithms have been developed for extracting meaningful information from high-dimensional biological data sets. However, because they must handle large numbers of features and objects, these algorithms are often complex and their procedures lengthy. Feature selection reduces the complexity of analyzing high-dimensional biological data and is becoming an essential step in bioinformatics research. This paper addresses the feature selection problem by exploiting the inter-object feature distribution in protein sequence data, where the importance of each amino acid is determined by its appearance in the proteins. The proposed algorithm is compared with other well-known feature selection methods and shows a significant improvement in classification accuracy.


Protein · Importance factor · Oscillation factor · Classification



Copyright information

© Springer India 2015

Authors and Affiliations

  • Amit Paul (1)
  • Jaya Sil (2)
  • Chitrangada Das Mukhopadhyay (3)
  1. Computer Science and Engineering, St. Thomas College of Engineering and Technology, Khidirpore, India
  2. Computer Science and Technology, Bengal Engineering and Science University, Shibpur, India
  3. Health Care Science and Technology, Bengal Engineering and Science University, Shibpur, India
