Using Typical Testors for Feature Selection in Text Categorization

  • Aurora Pons-Porrata
  • Reynaldo Gil-García
  • Rafael Berlanga-Llavori
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4756)

Abstract

A major difficulty of text categorization problems is the high dimensionality of the feature space. Thus, feature selection is often performed in order to increase both the efficiency and effectiveness of the classification. In this paper, we propose a feature selection method based on Testor Theory. This criterion takes into account inter-feature relationships. We experimentally compared our method with the widely used information gain using two well-known classification algorithms: k-nearest neighbour and Support Vector Machine. Two benchmark text collections were chosen as the testbeds: Reuters-21578 and Reuters Corpus Version 1 (RCV1-v2). We found that our method consistently outperformed information gain for both classifiers and both data collections, especially when aggressive feature selection is carried out.
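To make the notion of a testor concrete, here is a minimal sketch in Python using the classical definition from Testor Theory: a feature subset is a testor if no two objects from different classes look identical when restricted to that subset, and it is a typical testor if removing any one feature breaks that property. The abstract does not describe the paper's comparison criteria, its search algorithm, or how typical testors are turned into a final term set, so everything below (Boolean term presence, strict equality, brute-force enumeration, the toy terms and class names) is an assumption made for illustration only.

```python
from itertools import combinations


def is_testor(rows, labels, features):
    """True if no two rows from different classes agree on every feature in `features`."""
    for (r1, c1), (r2, c2) in combinations(zip(rows, labels), 2):
        if c1 != c2 and all(r1[f] == r2[f] for f in features):
            return False
    return True


def is_typical_testor(rows, labels, features):
    """A testor is typical (irreducible) if dropping any single feature breaks it.

    Checking single removals is enough because every superset of a testor
    is itself a testor.
    """
    if not is_testor(rows, labels, features):
        return False
    return all(
        not is_testor(rows, labels, [g for g in features if g != f])
        for f in features
    )


def typical_testors(rows, labels):
    """Brute-force enumeration; exponential in the number of features, toy use only."""
    n = len(rows[0])
    return [
        subset
        for k in range(1, n + 1)
        for subset in combinations(range(n), k)
        if is_typical_testor(rows, labels, subset)
    ]


# Hypothetical Boolean document-term matrix (1 = term occurs in the document).
terms = ["oil", "price", "wheat", "export"]
docs = [
    (1, 1, 0, 0),  # class "crude"
    (1, 0, 0, 1),  # class "crude"
    (0, 1, 1, 0),  # class "grain"
    (0, 0, 1, 1),  # class "grain"
]
labels = ["crude", "crude", "grain", "grain"]

selected = typical_testors(docs, labels)
selected_terms = [[terms[i] for i in subset] for subset in selected]
```

In this toy collection only the single terms "oil" and "wheat" separate the two classes on their own, so they are the typical testors, while "price" and "export" together cannot distinguish a "crude" document from a "grain" one. Real collections need a dedicated algorithm rather than exhaustive enumeration.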

Keywords

feature selection, typical testors, text categorization

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Aurora Pons-Porrata (1)
  • Reynaldo Gil-García (1)
  • Rafael Berlanga-Llavori (2)
  1. Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba
  2. Computer Science, Universitat Jaume I, Castellón, Spain
