Taking Advantage of the Web for Text Classification with Imbalanced Classes

Guzmán-Cabrera, Rafael; Montes-y-Gómez, Manuel; Rosso, Paolo; Villaseñor-Pineda, Luis

doi:10.1007/978-3-540-76631-5_79

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4827)

Abstract

Supervised approaches to text classification commonly require high-quality training data to build an accurate classifier. Unfortunately, in many real-world applications the training sets are extremely small and have imbalanced class distributions. To address these problems, this paper proposes a novel approach to text classification that combines under-sampling with a semi-supervised learning method. In particular, the proposed semi-supervised method is specially suited to working with very few training examples and incorporates the automatic extraction of untagged data from the Web. Experimental results on a subset of the Reuters-21578 text collection indicate that the proposed approach can be a practical solution to the class-imbalance problem, since it achieves very good results using very small training sets.
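The under-sampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' exact procedure, so the code below assumes plain random under-sampling, in which every class is reduced to the size of the smallest one. The function name `undersample`, the `(text, label)` data layout, and the toy labels are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def undersample(examples, seed=0):
    """Randomly reduce every class to the size of the smallest class.

    `examples` is a list of (text, label) pairs. This is an assumed,
    generic form of under-sampling, not the procedure from the paper.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # Size of the rarest class: every class is cut down to this size.
    target = min(len(items) for items in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Toy imbalanced training set: 8 documents of one class, 2 of another.
toy = [("doc %d" % i, "majority") for i in range(8)]
toy += [("doc %d" % i, "minority") for i in range(8, 10)]

balanced = undersample(toy)
counts = {
    label: sum(1 for _, l in balanced if l == label)
    for label in ("majority", "minority")
}
print(counts)
```

Under-sampling discards labeled examples, which is one reason a method of this kind might be paired with a semi-supervised step that brings in additional untagged data.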

This work was done under partial support of CONACYT-Mexico (43990), MCyT-Spain (TIN2006-15265-C06-04), and PROMEP (UGTO-121).

Author information

Authors and Affiliations

FIMEE, Universidad de Guanajuato, Mexico
Rafael Guzmán-Cabrera
DSIC, Universidad Politécnica de Valencia, Spain
Rafael Guzmán-Cabrera & Paolo Rosso
LTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico
Manuel Montes-y-Gómez & Luis Villaseñor-Pineda

Editor information

Alexander Gelbukh, Ángel Fernando Kuri Morales

Cite this paper

Guzmán-Cabrera, R., Montes-y-Gómez, M., Rosso, P., Villaseñor-Pineda, L. (2007). Taking Advantage of the Web for Text Classification with Imbalanced Classes. In: Gelbukh, A., Kuri Morales, Á.F. (eds) MICAI 2007: Advances in Artificial Intelligence. MICAI 2007. Lecture Notes in Computer Science, vol 4827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76631-5_79

DOI: https://doi.org/10.1007/978-3-540-76631-5_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76630-8
Online ISBN: 978-3-540-76631-5
eBook Packages: Computer Science, Computer Science (R0)
