Enhancing Text Classification by Information Embedded in the Test Set

Ramírez-de-la-Rosa, Gabriela; Montes-y-Gómez, Manuel; Villaseñor-Pineda, Luis

doi:10.1007/978-3-642-12116-6_53

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Abstract

Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and expensive to generate. To tackle this problem, we propose a new text classification method that takes advantage of the information embedded in the test set itself. The method rests on the idea that similar documents must belong to the same category. In particular, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is well suited to small training sets, where it can significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.
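
The idea in the abstract can be illustrated with a small sketch. This is not the authors' formulation, which the abstract does not specify; it is a minimal example under stated assumptions: class centroids built from raw term counts, cosine similarity, and two hypothetical parameters, `k` (number of test-set neighbors) and `lam` (weight given to a document's own content). Each test document is scored by combining its own similarity to each class with the class similarities of its most similar neighbors in the same test set.

```python
import math
from collections import Counter


def vectorize(text):
    # Bag-of-words term counts (assumption: whitespace tokenization).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids(train):
    # One summed term-count vector per class (a simple class prototype).
    cents = {}
    for text, label in train:
        cents.setdefault(label, Counter()).update(vectorize(text))
    return cents


def classify_with_neighbors(train, test, k=1, lam=0.5):
    # lam weighs the document's own content; (1 - lam) weighs its neighbors.
    cents = centroids(train)
    vecs = [vectorize(t) for t in test]
    # Score of each test document against each class, from its own content.
    own = [{c: cosine(v, p) for c, p in cents.items()} for v in vecs]
    labels = []
    for i, v in enumerate(vecs):
        # The k most similar *other* documents in the same test set.
        sims = sorted(
            ((cosine(v, u), j) for j, u in enumerate(vecs) if j != i),
            reverse=True,
        )[:k]
        scores = {}
        for c in cents:
            neigh = sum(s * own[j][c] for s, j in sims) / k if sims else 0.0
            scores[c] = lam * own[i][c] + (1 - lam) * neigh
        labels.append(max(scores, key=scores.get))
    return labels


# Toy data (invented for illustration only).
train = [("goal match team", "sports"), ("vote election party", "politics")]
test = ["team goal striker", "striker keeper penalty",
        "party vote senate", "senate bill"]
print(classify_with_neighbors(train, test, k=1))
```

In this toy data the second and fourth test documents share no terms with the training set, so their own content gives no evidence for either class; they are labeled through the neighbor each one resembles within the test set.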

Author information

Authors and Affiliations

Laboratory of Language Technologies, National Institute of Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Sta. María Tonantzintla, Pue., 72840, Mexico
Gabriela Ramírez-de-la-Rosa, Manuel Montes-y-Gómez & Luis Villaseñor-Pineda

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Ramírez-de-la-Rosa, G., Montes-y-Gómez, M., Villaseñor-Pineda, L. (2010). Enhancing Text Classification by Information Embedded in the Test Set. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_53

DOI: https://doi.org/10.1007/978-3-642-12116-6_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science, Computer Science (R0)
