A General Framework of Feature Selection for Text Categorization

Jing, Hongfang; Wang, Bin; Yang, Yahui; Xu, Yan

doi:10.1007/978-3-642-03070-3_49

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5632)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

Many feature selection methods have been proposed for text categorization. However, their performance is usually verified experimentally, so the results depend on the corpora used and may not be accurate. This paper proposes a novel feature selection framework, Distribution-Based Feature Selection (DBFS), based on the distribution difference of features. The framework generalizes most state-of-the-art feature selection methods, including OCFS, MI, ECE, IG, CHI and OR. Using the components of this framework, the performance of many feature selection methods can be estimated by theoretical analysis. DBFS also sheds light on the merits and drawbacks of many existing feature selection methods, and it helps to select suitable methods for specific domains. Moreover, a weighted model based on DBFS is given so that suitable feature selection methods for unbalanced datasets can be derived. Experimental results show that the derived methods are more effective than CHI, IG and OCFS on both balanced and unbalanced datasets.
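The abstract does not give the DBFS formulas, so the following is only a minimal sketch of two of the existing methods it names, CHI and IG, in their common textbook form for a single term and a single category. The function names and the 2x2 count layout (a, b, c, d) are assumptions made for illustration and are not taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square (CHI) score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def _entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((x / total) * math.log2(x / total) for x in counts if x > 0)


def information_gain(a, b, c, d):
    """Information gain (IG) of a term for a binary category split.

    Computed as H(class) - H(class | term present or absent),
    using the same 2x2 counts as chi_square.
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = _entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * _entropy([a, b]) \
        + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term
```

In a typical pipeline, each vocabulary term would be scored this way and the highest-scoring terms kept as features. Both scores are driven by how differently the term is distributed inside and outside the category, which is the kind of distribution difference the abstract refers to.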

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Hongfang Jing & Bin Wang
Graduate University, Chinese Academy of Sciences, Beijing, 100080, China
Hongfang Jing
School of Software & Microelectronics, Peking University, Beijing, 102600, China
Yahui Yang
Center of Network Information and Education Technology, Beijing Language and Culture University, Beijing, 100083, China
Yan Xu

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Körnerstr. 10, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Jing, H., Wang, B., Yang, Y., Xu, Y. (2009). A General Framework of Feature Selection for Text Categorization. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_49

DOI: https://doi.org/10.1007/978-3-642-03070-3_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3
eBook Packages: Computer Science, Computer Science (R0)
