A Clustering Based Feature Selection Method Using Feature Information Distance for Text Data

  • Shilong Chao
  • Jie Cai
  • Sheng Yang
  • Shulin Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9771)


Feature selection is a key step in text classification. This paper proposes a new feature selection method based on feature clustering, using an information distance measure to build a feature cluster space. First, the K-medoids clustering algorithm groups the features into k clusters. Second, the feature with the largest mutual information with the class is selected from each cluster to form a candidate subset. Finally, the target number of features is chosen from this subset according to the mRMR criterion. The algorithm fully accounts for the diversity among features and, unlike the incremental search of mRMR, avoids falling prematurely into a local optimum. Experimental results show that the features selected by the proposed algorithm achieve better classification accuracy.
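The pipeline sketched in the abstract can be illustrated in code. The sketch below is an illustrative reconstruction, not the authors' implementation: it assumes discrete feature values, uses the normalized information distance D(X, Y) = 1 − I(X; Y) / H(X, Y) as the clustering metric, a plain K-medoids loop, and a greedy mean-redundancy mRMR pass; all function names are hypothetical.

```python
import numpy as np

def entropy(x):
    """Shannon entropy (bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    """Joint entropy of two discrete variables."""
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_info(x, y):
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def info_distance(x, y):
    # Normalized information distance: D = 1 - I(X;Y) / H(X,Y).
    h = joint_entropy(x, y)
    return 1.0 - mutual_info(x, y) / h if h > 0 else 0.0

def k_medoids(dist, k, n_iter=100, seed=0):
    """Basic K-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):  # medoid = member with minimal total distance
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(dist[:, medoids], axis=1)

def select_features(X, y, k, target):
    """Cluster features by information distance, keep the max-MI
    representative of each cluster, then rank them with greedy mRMR."""
    n_features = X.shape[1]
    dist = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            dist[i, j] = dist[j, i] = info_distance(X[:, i], X[:, j])
    _, labels = k_medoids(dist, k)
    # Step 2: per cluster, the feature with the largest MI with the class.
    reps = []
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members):
            reps.append(members[np.argmax([mutual_info(X[:, f], y) for f in members])])
    # Step 3: greedy mRMR (relevance minus mean redundancy) on the subset.
    selected, candidates = [], list(reps)
    while candidates and len(selected) < target:
        def score(f):
            rel = mutual_info(X[:, f], y)
            red = np.mean([mutual_info(X[:, f], X[:, s]) for s in selected]) if selected else 0.0
            return rel - red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Because redundant features (small information distance) fall into the same cluster, at most one of them survives to the mRMR stage, which is how the method enforces diversity before the incremental search begins.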


Keywords: Text classification · Feature selection · Clustering · Diversity



This research was supported by the National Natural Science Foundation of China (Grant No. 61472467).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Shilong Chao¹
  • Jie Cai¹
  • Sheng Yang¹ (corresponding author)
  • Shulin Wang¹

  1. College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
