Rough Set Feature Selection Methods for Case-Based Categorization of Text Documents

  • Kalyan Moy Gupta
  • Philip G. Moore
  • David W. Aha
  • Sankar K. Pal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)

Abstract

Textual case bases can contain thousands of features in the form of tokens or words, which can inhibit classification performance. Recent developments in rough set theory and its applications to feature selection offer promising approaches for selecting and reducing the number of features. We adapt two rough set feature selection methods for use on n-ary class text categorization problems. We also introduce a new method for selecting features that computes the union of features selected from randomly partitioned training subsets. A comparative evaluation against a conventional method on the Reuters-21578 data set shows that our method can dramatically decrease training time without compromising classification accuracy. We also found that randomized training set partitions dramatically reduce training time.
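The union-over-partitions idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `n_parts` and `seed` parameters, and in particular the per-subset selector are assumptions made for illustration. The selector shown here simply ranks tokens by document frequency as a stand-in for a rough set feature selection method, which the abstract does not specify in enough detail to reproduce.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Set, Tuple

# A document is a (set of tokens, class label) pair. Hypothetical representation.
Document = Tuple[Set[str], str]
Selector = Callable[[Sequence[Document]], Set[str]]


def top_tokens_by_document_frequency(docs: Sequence[Document], m: int = 2) -> Set[str]:
    """Placeholder selector (an assumption, NOT a rough set method):
    keep the m tokens that occur in the most documents."""
    counts = Counter(tok for tokens, _ in docs for tok in tokens)
    # Break ties alphabetically so the result does not depend on input order.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:m])


def union_of_partition_selections(
    docs: Sequence[Document],
    select: Selector,
    n_parts: int,
    seed: int = 0,
) -> Set[str]:
    """Randomly split docs into n_parts subsets, run the selector on each
    subset, and return the union of the selected features."""
    rng = random.Random(seed)
    shuffled: List[Document] = list(docs)
    rng.shuffle(shuffled)
    # Stride slicing yields n_parts disjoint subsets of near-equal size.
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    selected: Set[str] = set()
    for part in parts:
        if part:  # skip empty subsets when n_parts > len(docs)
            selected |= select(part)
    return selected


docs = [
    ({"oil", "price"}, "crude"),
    ({"oil", "barrel"}, "crude"),
    ({"wheat", "crop"}, "grain"),
    ({"wheat", "price"}, "grain"),
]
features = union_of_partition_selections(
    docs, top_tokens_by_document_frequency, n_parts=2, seed=1
)
```

Because each subset is smaller than the full training set, a selector whose cost grows faster than linearly with the number of training documents would do less total work across the partitions than on the whole set, which is one plausible reading of the training-time reduction reported above.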

Keywords

Feature Selection; Training Time; Information Gain; Feature Selection Method; Conditional Attribute
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Kalyan Moy Gupta (1, 2)
  • Philip G. Moore (1, 2)
  • David W. Aha (1)
  • Sankar K. Pal (3)

  1. ITT Industries, Alexandria, USA
  2. Naval Research Laboratory, Washington, DC, USA
  3. Indian Statistical Institute, Kolkata, India
