Taking Advantage of the Web for Text Classification with Imbalanced Classes
Abstract
Supervised approaches to text classification commonly require high-quality training data to construct an accurate classifier. Unfortunately, in many real-world applications the training sets are extremely small and have imbalanced class distributions. To address these problems, this paper proposes a novel approach to text classification that combines under-sampling with a semi-supervised learning method. In particular, the proposed semi-supervised method is designed to work with very few training examples and automatically extracts unlabeled data from the Web. Experimental results on a subset of the Reuters-21578 text collection indicate that the proposed approach can be a practical solution to the class-imbalance problem, since it achieves very good results using very small training sets.
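To make the under-sampling step concrete, the following is a minimal sketch of random under-sampling, in which every class is reduced to the size of the smallest one. The abstract does not say which under-sampling variant the authors use, so random selection is an assumption here, and the function name `undersample`, the `(text, label)` pair format, and the toy data are all hypothetical. The semi-supervised step and the Web extraction are not shown.

```python
import random
from collections import defaultdict


def undersample(examples, seed=0):
    """Randomly reduce every class to the size of the smallest class.

    `examples` is a list of (text, label) pairs. Random selection is an
    assumption; the paper's actual selection strategy may differ.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # The minority class sets the number of examples kept per class.
    target = min(len(items) for items in by_label.values())

    balanced = []
    for label in sorted(by_label):
        # Sample without replacement so no example is duplicated.
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Hypothetical toy data: 2 minority-class and 8 majority-class documents.
toy = [("wheat prices rise", "grain")] * 2 + [("profits up", "earn")] * 8
balanced = undersample(toy)
```

With this toy input, `balanced` holds two examples per class. Under-sampling discards labeled majority-class documents, which is one reason to pair it with a method that can draw on additional unlabeled data.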
Keywords
Text Classification, Unlabeled Data, Training Instance, Minority Class, Imbalanced Class