Feature Selection Strategy in Text Classification

  • Pui Cheong Gabriel Fung
  • Fred Morstatter
  • Huan Liu
Conference paper

DOI: 10.1007/978-3-642-20841-6_3

Part of the Lecture Notes in Computer Science book series (LNCS, volume 6634)
Cite this paper as:
Fung P.C.G., Morstatter F., Liu H. (2011) Feature Selection Strategy in Text Classification. In: Huang J.Z., Cao L., Srivastava J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg

Abstract

Traditionally, the best number of features is determined by a so-called “rule of thumb” or by using a separate validation dataset. We can find no explanation of why these approaches yield the best number, nor is there a formal feature selection model for obtaining it. In this paper, we conduct an in-depth empirical analysis and argue that simply selecting the features with the highest scores may not be the best strategy. A highest-scores approach reduces many documents to zero length, so that they cannot contribute to the training process. Accordingly, we formulate the feature selection process as a dual-objective optimization problem and identify the best number of features for each document automatically. Extensive experiments are conducted to verify our claims. The encouraging results indicate that our proposed framework is effective.
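
The zero-length effect described in the abstract can be illustrated with a small sketch. This is not the authors' framework or their dual-objective formulation. It is a toy comparison under stated assumptions: the corpus is invented, document frequency stands in for whatever feature-scoring metric is used, and the function names (`select_global_top_k`, `select_per_document_top_k`) are hypothetical. It shows only that a corpus-wide cutoff can leave a document with no features, while choosing features per document does not.

```python
from collections import Counter

# Toy corpus (invented data, for illustration only).
docs = [
    ["stock", "market", "stock", "price"],
    ["market", "price", "trade"],
    ["football", "goal"],
    ["stock", "trade", "market"],
]


def document_frequency_scores(docs):
    """Score each term by how many documents contain it.

    Assumption: document frequency is a stand-in for any ranking metric.
    """
    scores = Counter()
    for doc in docs:
        scores.update(set(doc))
    return scores


def select_global_top_k(docs, scores, k):
    """Keep only the k highest-scoring terms across the whole corpus."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken alphabetically
    keep = set(ranked[:k])
    return [[t for t in doc if t in keep] for doc in docs]


def select_per_document_top_k(docs, scores, k):
    """Keep up to k highest-scoring distinct terms within each document."""
    selected = []
    for doc in docs:
        ranked = sorted(set(doc), key=lambda t: (-scores[t], t))
        keep = set(ranked[:k])
        selected.append([t for t in doc if t in keep])
    return selected


def count_empty(docs):
    """Number of documents left with no features."""
    return sum(1 for doc in docs if not doc)


scores = document_frequency_scores(docs)
global_selected = select_global_top_k(docs, scores, k=3)
per_doc_selected = select_per_document_top_k(docs, scores, k=2)

print("empty after global top-3:", count_empty(global_selected))
print("empty after per-document top-2:", count_empty(per_doc_selected))
```

In this toy corpus the third document shares no terms with the others, so the corpus-wide cutoff empties it and it could no longer contribute to training. The per-document variant keeps it non-empty, at the cost of retaining lower-scoring features.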

Keywords

  • Feature Selection
  • Feature Ranking
  • Text Classification
  • Selection Strategy

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Pui Cheong Gabriel Fung (1)
  • Fred Morstatter (1)
  • Huan Liu (1)

  1. Arizona State University, Tempe, United States of America
