Enhancing Text Classification by Information Embedded in the Test Set
Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks the training sets are extremely small and generating them is very expensive. To tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. The method is based on the idea that similar documents must belong to the same category. In particular, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is appropriate for small training sets, where it could significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.
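The idea of combining a document's own content with the categories of similar test documents can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes a simple centroid-based scorer with cosine similarity over bag-of-words vectors, and every name in it (`classify_with_neighbors`, `k`, `weight`, the toy data) is hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroids(train):
    """Sum the term vectors of each category's training documents."""
    proto = {}
    for text, label in train:
        proto.setdefault(label, Counter()).update(vectorize(text))
    return proto


def classify_with_neighbors(train, test, k=2, weight=0.5):
    """Label test documents from their own content plus their test-set neighbors.

    `k` (number of neighbors) and `weight` (influence of the neighbors)
    are hypothetical parameters of this sketch.
    """
    proto = centroids(train)
    vecs = [vectorize(t) for t in test]
    # Step 1: score every test document against every category on its own content.
    own = [{c: cosine(v, p) for c, p in proto.items()} for v in vecs]
    labels = []
    for i, v in enumerate(vecs):
        # Step 2: find the k most similar other documents in the same test set.
        sims = sorted(
            ((cosine(v, u), j) for j, u in enumerate(vecs) if j != i),
            reverse=True,
        )[:k]
        combined = {}
        for c in proto:
            # Step 3: average the neighbors' category scores, weighted by similarity.
            neigh = sum(s * own[j][c] for s, j in sims) / len(sims) if sims else 0.0
            combined[c] = (1 - weight) * own[i][c] + weight * neigh
        # Ties are broken alphabetically by category name.
        labels.append(max(sorted(combined), key=combined.get))
    return labels


# Toy data: one training document per category.
train = [("goal match team", "sports"), ("vote election party", "politics")]
test_docs = [
    "team match wins trophy",
    "trophy wins coach",      # shares no term with the training set
    "election party senate",
    "senate debate bill",     # shares no term with the training set
]
print(classify_with_neighbors(train, test_docs))
```

In this toy run the second and fourth documents share no vocabulary with the training set, so their own-content scores are zero for every category. They receive a label only because each resembles another test document that does overlap with the training data, which is the situation the method targets when training sets are very small.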