A General Framework of Feature Selection for Text Categorization

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5632)

Abstract

Many feature selection methods have been proposed for text categorization. However, their performance is usually verified only by experiments, so the results depend on the corpora used and may not be accurate. This paper proposes a novel feature selection framework, Distribution-Based Feature Selection (DBFS), which is based on the distribution differences of features. The framework generalizes most state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR. Using the components of this framework, the performance of many feature selection methods can be estimated by theoretical analysis. DBFS also sheds light on the merits and drawbacks of many existing methods and helps in choosing suitable feature selection methods for specific domains. Moreover, a weighted model based on DBFS is given, from which feature selection methods suited to unbalanced datasets can be derived. The experimental results show that the derived methods are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.
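For readers unfamiliar with the baselines named in the abstract, the sketch below scores a single term against a single category with CHI and IG, using the usual textbook definitions over a 2x2 document-count table. It does not implement DBFS or the weighted model, whose definitions are not given on this page. The function names and the counts a, b, c, d are assumptions introduced only for illustration.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square (CHI) score of a term for one category.

    Hypothetical count names (not from the paper):
    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def _entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Information gain (IG) of a term for a binary in/out-of-category split."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = _entropy([a + c, b + d])
    # Class entropy after observing whether the term is present or absent.
    h_given_term = ((a + b) / n) * _entropy([a, b]) + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term
```

Under these definitions, a term spread evenly across both classes scores zero on both measures, while a term that appears in every document of one class and in none of the other gets the maximum score for that table. Features would then be ranked by score and the top k kept.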

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jing, H., Wang, B., Yang, Y., Xu, Y. (2009). A General Framework of Feature Selection for Text Categorization. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_49

  • DOI: https://doi.org/10.1007/978-3-642-03070-3_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03069-7

  • Online ISBN: 978-3-642-03070-3

  • eBook Packages: Computer Science (R0)
