Taking Advantage of the Web for Text Classification with Imbalanced Classes

  • Rafael Guzmán-Cabrera
  • Manuel Montes-y-Gómez
  • Paolo Rosso
  • Luis Villaseñor-Pineda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4827)


Supervised approaches to text classification commonly require high-quality training data to construct an accurate classifier. Unfortunately, in many real-world applications the training sets are extremely small and have imbalanced class distributions. To address these problems, this paper proposes a novel approach to text classification that combines under-sampling with a semi-supervised learning method. In particular, the proposed semi-supervised method is especially suited to working with very few training examples and relies on the automatic extraction of untagged data from the Web. Experimental results on a subset of the Reuters-21578 text collection indicate that the proposed approach can be a practical solution to the class-imbalance problem, since it achieves very good results using very small training sets.
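The combination described above, under-sampling followed by semi-supervised learning on untagged text, can be illustrated with a minimal sketch. It is not the authors' implementation: the word-overlap classifier, the confidence margin, the function names, and the hard-coded `web_snippets` (standing in for text that would be retrieved from the Web) are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, seed=0):
    """Randomly drop examples so each class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append(text)
    target = min(len(texts) for texts in by_class.values())
    balanced = []
    for label, texts in sorted(by_class.items()):
        balanced.extend((text, label) for text in rng.sample(texts, target))
    return balanced


def train(examples):
    """Build one word-count profile per class (a toy stand-in for a real classifier)."""
    profiles = defaultdict(Counter)
    for text, label in examples:
        profiles[label].update(text.lower().split())
    return profiles


def predict(profiles, text):
    """Return (label, margin): the best-scoring class and its lead over the runner-up."""
    words = text.lower().split()
    scores = {}
    for label, counts in profiles.items():
        total = sum(counts.values()) or 1
        scores[label] = sum(counts[w] / total for w in words)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return ranked[0][0], ranked[0][1] - runner_up


def self_train(labeled, unlabeled, rounds=3, per_round=1):
    """Each round, move the most confidently classified untagged texts into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        profiles = train(labeled)
        scored = []
        for text in pool:
            label, margin = predict(profiles, text)
            if margin > 0:  # skip texts the classifier cannot separate
                scored.append((margin, label, text))
        scored.sort(key=lambda item: -item[0])
        chosen = scored[:per_round]
        if not chosen:
            break
        for _, label, text in chosen:
            labeled.append((text, label))
            pool.remove(text)
    return train(labeled), labeled


# Hypothetical imbalanced training set: four "grain" examples, one "crude".
train_set = [
    ("wheat harvest grain export", "grain"),
    ("grain prices wheat tonnes", "grain"),
    ("corn grain shipment", "grain"),
    ("wheat crop forecast", "grain"),
    ("oil crude barrel opec", "crude"),
]
# Hypothetical untagged snippets, standing in for Web search results.
web_snippets = ["opec crude output cut", "barrel prices oil refinery", "wheat tonnes shipped"]

balanced = undersample(train_set)
profiles, grown = self_train(balanced, web_snippets)
print(len(balanced), len(grown), predict(profiles, "crude oil barrel")[0])
```

Under-sampling runs first so that the initial classifier is not dominated by the majority class; the self-training loop then compensates for the resulting small training set by adding only the untagged texts it labels with the largest margin.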


Keywords: Text Classification, Unlabeled Data, Training Instance, Minority Class, Imbalanced Class
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Rafael Guzmán-Cabrera (1, 2)
  • Manuel Montes-y-Gómez (3)
  • Paolo Rosso (2)
  • Luis Villaseñor-Pineda (3)
  1. FIMEE, Universidad de Guanajuato, Mexico
  2. DSIC, Universidad Politécnica de Valencia, Spain
  3. LTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
