Feature Selection Strategies for Text Categorization

Soucy, Pascal; Mineau, Guy W.

doi:10.1007/3-540-44886-1_41

Conference paper
First Online: 01 January 2003

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2671)

Abstract

Feature selection is an important research issue in text categorization because thousands of features are often involved, even under the simplest document representation model, the so-called bag-of-words. Among the many approaches to feature selection, an important one uses a scoring function to rank features so that they can be filtered. Many such functions are widely used in text categorization. Past feature selection studies have mostly compared these measures in terms of the accuracy achieved. For any given measure, however, many selection strategies can be applied to produce the resulting feature set. In this paper, we compare several such strategies and propose a new one. Tests were conducted to compare five selection strategies on four datasets, using three distinct classifiers and four common feature scoring functions. The results make it possible to better understand which strategies suit particular classification settings.
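The distinction between a scoring function and a selection strategy can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract names neither the four scoring functions nor the five strategies, so the chi-square statistic, the toy corpus, and the two strategies shown (a single global ranking versus a per-category ranking) are stand-ins, not the paper's method.

```python
# Toy corpus (hypothetical): each document is (set of words, category label).
DOCS = [
    ({"goal", "team", "match"}, "sport"),
    ({"goal", "team", "coach"}, "sport"),
    ({"stock", "market", "bank"}, "finance"),
    ({"stock", "market", "team"}, "finance"),
    ({"vote", "party", "election"}, "politics"),
    ({"vote", "party", "market"}, "politics"),
]


def chi_square(term, category, docs):
    """Chi-square statistic for one term/category pair, from document counts."""
    n = len(docs)
    a = sum(1 for words, label in docs if term in words and label == category)
    b = sum(1 for words, label in docs if term in words and label != category)
    c = sum(1 for words, label in docs if term not in words and label == category)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score_table(docs):
    """Score every vocabulary term against every category."""
    vocab = sorted(set().union(*(words for words, _ in docs)))
    cats = sorted({label for _, label in docs})
    return {t: {c: chi_square(t, c, docs) for c in cats} for t in vocab}


def select_global(scores, k):
    """Assumed strategy A: rank terms by their best score over all
    categories and keep the top k (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda t: (-max(scores[t].values()), t))
    return set(ranked[:k])


def select_per_category(scores, k):
    """Assumed strategy B: keep the top k terms of each category
    separately, then take the union of the per-category sets."""
    cats = next(iter(scores.values())).keys()
    selected = set()
    for c in cats:
        ranked = sorted(scores, key=lambda t: (-scores[t][c], t))
        selected.update(ranked[:k])
    return selected


if __name__ == "__main__":
    table = score_table(DOCS)
    print("global top 2:      ", sorted(select_global(table, 2)))
    print("per-category top 1:", sorted(select_per_category(table, 1)))
```

Both functions consume the same score table, so the scoring function is held fixed and only the way the ranking is turned into a feature set changes. On this toy corpus the global ranking can leave a category without any of its own top-scoring terms, whereas the per-category ranking guarantees each category contributes at least k terms, at the cost of a feature set whose size depends on the number of categories.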

Author information

Authors and Affiliations

Copernic Research, Copernic Inc., Québec, Canada
Pascal Soucy
Department of Computer Science, Université Laval, Québec, Canada
Pascal Soucy & Guy W. Mineau

Editor information

Editors and Affiliations

Department of Computing and Information Science, College of Physical and Engineering Science, University of Guelph, Guelph, Ontario, Canada, N1G 2W1
Yang Xiang
Dépt. Informatique-Génie Logiciel, Université Laval, Pavillon Pouliot, Ste-Foy, PQ, Canada, G1K 7P4
Brahim Chaib-draa

About this paper

Cite this paper

Soucy, P., Mineau, G.W. (2003). Feature Selection Strategies for Text Categorization. In: Xiang, Y., Chaib-draa, B. (eds) Advances in Artificial Intelligence. Canadian AI 2003. Lecture Notes in Computer Science, vol 2671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44886-1_41

DOI: https://doi.org/10.1007/3-540-44886-1_41
Published: 27 May 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40300-5
Online ISBN: 978-3-540-44886-0
eBook Packages: Springer Book Archive