A Survey on Filter Techniques for Feature Selection in Text Mining

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 236)


Irrelevant features usually make up a large portion of a document's representation. Rather than capturing the actual content of the document, such features inflate the dimensionality of the representation model and the computational complexity of the underlying algorithm, and hence degrade performance. This makes it necessary to select the relevant features from the given feature space, and feature selection plays a key role in removing the irrelevant ones. Feature selection methods are broadly categorized into three groups: filter, wrapper, and embedded. Filter methods are widely used in text mining because of their simplicity, low computational complexity, and efficiency. In this article, we provide a brief survey of filter feature selection methods, along with some recent developments in this area.
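
To make the idea of a filter method concrete, here is a minimal sketch in Python. It assumes information gain as the scoring criterion, which is one common choice and not necessarily the one the survey emphasizes. The toy corpus and the names `entropy`, `information_gain`, and `select_top_k` are hypothetical and chosen only for illustration. Each term is scored independently of any classifier, and only the top-scoring terms are kept.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_term = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional


def select_top_k(docs, labels, k):
    """Rank every vocabulary term by its score and keep the k best (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of terms (illustrative data only).
docs = [
    {"cheap", "pills", "offer"},
    {"cheap", "offer", "now"},
    {"meeting", "agenda", "now"},
    {"meeting", "notes", "agenda"},
]
labels = ["spam", "spam", "ham", "ham"]

selected = select_top_k(docs, labels, 2)
print(selected)
```

Because the score of each term depends only on the data and not on a downstream learner, the ranking is computed once and can be reused with any classifier or clustering algorithm. In this toy corpus the term "now" occurs equally often in both classes, so it receives zero information gain and ranks last.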


Text mining, Text categorization, Text clustering, Feature extraction, Feature selection, Filter methods



Copyright information

© Springer India 2014

Authors and Affiliations

  1. Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management Gwalior, Gwalior, India
