
Enhancing Text Classification by Information Embedded in the Test Set

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Abstract

Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and expensive to generate. To tackle this problem, we propose a new text classification method that takes advantage of the information embedded in the test set itself. The method rests on the idea that similar documents must belong to the same category. In particular, it classifies each document by considering not only its own content but also the categories assigned to other similar documents from the same test set. Experimental results on four data sets of different sizes are encouraging. They indicate that the proposed method is well suited to small training sets, where it could significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.
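The idea in the abstract can be illustrated with a small sketch. The abstract does not specify the base classifier, the similarity measure, or how the two sources of evidence are combined, so everything below is an assumption made for illustration: a class-centroid scorer over raw term counts, cosine similarity, the `k` most similar test documents as neighbors, and a linear mix controlled by a hypothetical weight `lam`. It should not be read as the authors' exact method.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids(train):
    """Sum the term vectors of each class; train is a list of (text, label)."""
    cents = {}
    for text, label in train:
        cents.setdefault(label, Counter()).update(tf_vector(text))
    return cents


def classify_with_neighbors(train, test_texts, k=2, lam=0.5):
    """Label test documents using their own content plus the evidence
    carried by their k most similar documents in the same test set.

    lam (hypothetical parameter) weights a document's own score;
    1 - lam weights the similarity-weighted scores of its neighbors.
    """
    cents = centroids(train)
    vecs = [tf_vector(t) for t in test_texts]
    # Score of each test document against each class, from content alone.
    own = [{c: cosine(v, cv) for c, cv in cents.items()} for v in vecs]

    labels = []
    for i, v in enumerate(vecs):
        # k most similar other documents in the test set.
        sims = sorted(
            ((cosine(v, u), j) for j, u in enumerate(vecs) if j != i),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in sims)
        combined = {}
        for c in cents:
            neigh = sum(s * own[j][c] for s, j in sims) / total if total else 0.0
            combined[c] = lam * own[i][c] + (1 - lam) * neigh
        labels.append(max(combined, key=combined.get))
    return labels


train = [("vote election party", "politics"), ("goal match team", "sports")]
test_docs = [
    "team goal striker penalty",
    "striker penalty referee",  # shares no term with the training set
    "election vote ballot",
]
print(classify_with_neighbors(train, test_docs, k=1, lam=0.5))
```

In this toy run the second test document has no vocabulary in common with the two training examples, so its own scores are zero for both classes. It is nevertheless similar to the first test document, and that neighbor's evidence is what separates the classes for it. This is the situation with very small training sets that the abstract describes.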




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ramírez-de-la-Rosa, G., Montes-y-Gómez, M., Villaseñor-Pineda, L. (2010). Enhancing Text Classification by Information Embedded in the Test Set. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_53


  • DOI: https://doi.org/10.1007/978-3-642-12116-6_53

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12115-9

  • Online ISBN: 978-3-642-12116-6

  • eBook Packages: Computer Science, Computer Science (R0)
