Abstract
Feature selection is an important research issue in text categorization, because thousands of features are often involved even when the simplest document representation model, the so-called bag-of-words, is used. Among the many approaches to feature selection, an important one is to rank features with a scoring function and filter out the low-ranking ones. Many such scoring functions are widely used in text categorization. Most past feature selection studies have focused on comparing these functions in terms of the accuracy achieved. For any scoring function, however, many selection strategies can be applied to produce the final feature set. In this paper, we compare several such strategies and propose a new one. We conducted tests comparing five selection strategies on four datasets, using three distinct classifiers and four common feature scoring functions. The results make it possible to better understand which strategies suit particular classification settings.
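To make the distinction between a scoring function and a selection strategy concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes chi-square as the scoring function and shows two illustrative strategies (a global top-k ranking and a per-category top-k union). The abstract does not name the five strategies compared, so the function names, the toy documents, and the choice of strategies here are all hypothetical.

```python
from collections import defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: docs in the category containing the term
    b: docs outside the category containing the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def score_features(docs, labels):
    """Return {category: {term: score}} for bag-of-words docs (sets of terms)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    scores = {}
    for cat in sorted(set(labels)):
        n_cat = sum(1 for y in labels if y == cat)
        scores[cat] = {}
        for term in vocab:
            a = sum(1 for doc, y in zip(docs, labels) if y == cat and term in doc)
            b = sum(1 for doc, y in zip(docs, labels) if y != cat and term in doc)
            scores[cat][term] = chi_square(a, b, n_cat - a, n - n_cat - b)
    return scores


def select_global_top_k(scores, k):
    """Illustrative strategy 1: rank terms by their best score over all
    categories and keep the k highest (ties broken alphabetically)."""
    best = defaultdict(float)
    for per_term in scores.values():
        for term, s in per_term.items():
            best[term] = max(best[term], s)
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return set(ranked[:k])


def select_per_category_top_k(scores, k):
    """Illustrative strategy 2: keep the k best terms of each category
    and return the union, so every category contributes features."""
    selected = set()
    for per_term in scores.values():
        ranked = sorted(per_term, key=lambda t: (-per_term[t], t))
        selected.update(ranked[:k])
    return selected


# Toy corpus (hypothetical): each document is a set of terms.
docs = [
    {"goal", "team"}, {"goal", "coach"},
    {"vote", "party"}, {"vote", "team"},
    {"stock", "market"}, {"stock", "party"},
]
labels = ["sport", "sport", "politics", "politics", "finance", "finance"]

scores = score_features(docs, labels)
global_features = select_global_top_k(scores, 4)
per_category_features = select_per_category_top_k(scores, 2)
```

Both functions start from the same scores but can return different feature sets. On unbalanced data in particular, a global ranking can leave some categories with few or no selected features, whereas the per-category union guarantees each category a share.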
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Soucy, P., Mineau, G.W. (2003). Feature Selection Strategies for Text Categorization. In: Xiang, Y., Chaib-draa, B. (eds) Advances in Artificial Intelligence. Canadian AI 2003. Lecture Notes in Computer Science, vol 2671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44886-1_41
DOI: https://doi.org/10.1007/3-540-44886-1_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40300-5
Online ISBN: 978-3-540-44886-0
eBook Packages: Springer Book Archive