
The Use of Multi-Criteria in Feature Selection to Enhance Text Categorization

  • Son Doan
  • Susumu Horiguchi

Abstract

Feature selection remains an important issue in text categorization. Previous work has typically used a filter model, in which features are ranked by a single measure and then selected against a given threshold. In this paper, we present a novel approach to feature selection based on multiple criteria per feature: instead of a single criterion, several criteria are computed for each feature, and a selection procedure with a separate threshold for each criterion is proposed. This framework is well suited to text data and is applied here to text categorization. Experimental results on the Reuters-21578 benchmark show that the approach is promising and enhances the performance of a text categorization system.
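The filter-model selection described above, extended to multiple criteria with one threshold per criterion, can be sketched as follows. The toy corpus, the two criteria (document frequency and pointwise mutual information), the threshold values, and the intersection rule for combining criteria are all illustrative assumptions, not the paper's exact procedure.

```python
import math

# Toy labeled corpus: (tokens, class label). Illustrative data only.
docs = [
    (["grain", "wheat", "export"], "grain"),
    (["wheat", "price", "market"], "grain"),
    (["oil", "barrel", "price"], "crude"),
    (["oil", "export", "market"], "crude"),
]
classes = ("grain", "crude")

def doc_freq(term):
    """Criterion 1: fraction of documents containing the term."""
    return sum(term in toks for toks, _ in docs) / len(docs)

def mutual_info(term, cls):
    """Criterion 2: pointwise mutual information between term and class."""
    n = len(docs)
    p_t = sum(term in toks for toks, _ in docs) / n
    p_c = sum(c == cls for _, c in docs) / n
    p_tc = sum(term in toks and c == cls for toks, c in docs) / n
    if p_tc == 0:
        return float("-inf")
    return math.log(p_tc / (p_t * p_c))

vocab = sorted({t for toks, _ in docs for t in toks})

# Multi-criteria selection: each criterion has its own threshold, and a
# term is kept only if it passes BOTH (an intersection rule; the paper's
# actual combination procedure may differ).
DF_THRESHOLD = 0.5
MI_THRESHOLD = math.log(2)
selected = [
    t for t in vocab
    if doc_freq(t) >= DF_THRESHOLD
    and max(mutual_info(t, c) for c in classes) >= MI_THRESHOLD
]
print(selected)  # → ['oil', 'wheat']
```

With a single criterion (document frequency alone), generic terms such as "price" and "market" would also survive; adding the mutual-information threshold filters out terms that occur often but carry no class information.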

Keywords

Feature Selection, Mutual Information, Text Categorization, Baseline Method, Optimal Subset



Copyright information

© Springer-Verlag/Wien 2005

Authors and Affiliations

  • Son Doan (1)
  • Susumu Horiguchi (1)
  1. Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa, Japan
