
Feature Selection Strategies for Text Categorization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 2671))

Abstract

Feature selection is an important research issue in text categorization, because thousands of features are often involved even when the simplest document representation model, the so-called bag-of-words, is used. Among the many approaches to feature selection, an important one uses a scoring function to rank features and then filters out the low-ranked ones. Many such functions are widely used in text categorization. Past feature selection studies have mostly compared these measures in terms of the accuracy achieved. For any given measure, however, many selection strategies can be applied to produce the resulting feature set. In this paper, we compare several such strategies and propose a new one. Tests were conducted to compare five selection strategies on four datasets, using three distinct classifiers and four common feature scoring functions. The results make it possible to better understand which strategies are suited to particular classification settings.
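To make the distinction between a scoring function and a selection strategy concrete, here is a minimal Python sketch. It is an illustration only and assumes details the abstract does not give: chi-square is used as the scoring function, and the two strategies shown (a global ranking by maximum score, and a per-category top-k union) are generic examples, not necessarily the strategies compared or proposed in the paper. All function names and the toy corpus are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score from a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def score_features(docs, labels):
    """Return {category: {term: score}} for bag-of-words documents (sets of tokens)."""
    vocab = set().union(*docs)
    scores = {}
    for cat in set(labels):
        scores[cat] = {}
        for term in vocab:
            n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == cat)
            n10 = sum(1 for d, y in zip(docs, labels) if term in d and y != cat)
            n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y == cat)
            n00 = len(docs) - n11 - n10 - n01
            scores[cat][term] = chi_square(n11, n10, n01, n00)
    return scores


def select_global_max(scores, k):
    """Illustrative strategy A: rank terms by their best score over all categories, keep k."""
    best = {}
    for per_term in scores.values():
        for term, s in per_term.items():
            best[term] = max(best.get(term, 0.0), s)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return set(ranked[:k])


def select_per_category(scores, k_per_cat):
    """Illustrative strategy B: keep the top k terms of each category, then take the union."""
    selected = set()
    for per_term in scores.values():
        ranked = sorted(per_term, key=lambda t: (-per_term[t], t))
        selected.update(ranked[:k_per_cat])
    return selected


# Toy corpus (hypothetical data, for illustration only).
docs = [
    {"ball", "goal", "team"},
    {"goal", "match", "team"},
    {"vote", "party", "team"},
    {"vote", "election", "party"},
]
labels = ["sport", "sport", "politics", "politics"]

scores = score_features(docs, labels)
global_set = select_global_max(scores, 3)
per_cat_set = select_per_category(scores, 2)
```

The same scores can yield different feature sets depending on the strategy: here the global ranking keeps three terms, while the per-category union may keep fewer or more depending on how much the categories' top lists overlap.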




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Soucy, P., Mineau, G.W. (2003). Feature Selection Strategies for Text Categorization. In: Xiang, Y., Chaib-draa, B. (eds) Advances in Artificial Intelligence. Canadian AI 2003. Lecture Notes in Computer Science, vol 2671. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44886-1_41


  • DOI: https://doi.org/10.1007/3-540-44886-1_41


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40300-5

  • Online ISBN: 978-3-540-44886-0

  • eBook Packages: Springer Book Archive
