Supervised term weighting centroid-based classifiers for text categorization

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's intrinsic design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the mere presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to handle multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence; we call the resulting classifier CFC–JS. Our proposed supervised term weighting schemes have been evaluated on five datasets; the KL- and JS-weighted classifiers consistently outperformed baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which were otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improves centroid-based classifiers but also benefits SVM classifiers.
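The abstract does not spell out the weighting formulas, so the following is only a minimal sketch of the general idea, not the paper's exact scheme. It assumes that each term is modeled as a Bernoulli variable (present or absent in a document) per class, that probabilities are Laplace-smoothed, and that the multi-class JS divergence takes the generalized form \(H(\sum_i \pi_i P_i) - \sum_i \pi_i H(P_i)\) with uniform priors by default. The function names and the example counts are hypothetical.

```python
import math


def smoothed_prob(docs_with_term, docs_in_class):
    """Laplace-smoothed estimate of P(term present | class), strictly in (0, 1)."""
    return (docs_with_term + 1) / (docs_in_class + 2)


def entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, for 0 < p < 1."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def kl_weight(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in bits; zero when p == q."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))


def js_weight(probs, priors=None):
    """Generalized JS divergence: H(sum_i w_i P_i) - sum_i w_i H(P_i)."""
    if priors is None:
        priors = [1.0 / len(probs)] * len(probs)
    mixture = sum(w * p for w, p in zip(priors, probs))
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(priors, probs))


# Hypothetical counts: (documents containing the term, documents in the class).
discriminative = [smoothed_prob(90, 100), smoothed_prob(10, 100)]
uniform = [smoothed_prob(50, 100), smoothed_prob(50, 100)]

print(kl_weight(*discriminative), kl_weight(*uniform))
print(js_weight(discriminative), js_weight(uniform))
```

Under these assumptions, a term spread evenly across classes receives a weight of zero from both measures, while a term concentrated in one class receives a positive weight. Unlike KL, the JS form is symmetric and accepts any number of classes, which is one plausible reason for using it in the multi-class setting.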

Figs. 1–17

Notes

  1. In \(tf \times rf\), uniformly distributed terms, that is, those that appear equally in both classes, are assigned a constant weight of \(1.58 tf\).

  2. Generated by Wordle (http://www.wordle.net).

  3. We use the LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) library with linear kernels and default parameters.
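The constant in Note 1 can be checked with a short calculation. This sketch assumes the usual definition of the relevance frequency factor, \(rf = \log_2(2 + a / \max(1, c))\), where \(a\) and \(c\) are the numbers of positive-class and negative-class documents containing the term; the formula is not stated in this excerpt, and the function names and counts below are illustrative only.

```python
import math


def rf(a, c):
    """Relevance frequency under the assumed definition log2(2 + a / max(1, c))."""
    return math.log2(2 + a / max(1, c))


def tf_rf(tf, a, c):
    """Term weight tf * rf for a term with raw frequency tf."""
    return tf * rf(a, c)


# A term found in equally many documents of both classes (a == c >= 1)
# always gets rf = log2(3), about 1.585, whatever the actual counts are.
print(rf(40, 40))
print(rf(7, 7))
print(tf_rf(3, 40, 40))
```

Because the ratio \(a / c\) equals 1 whenever the two counts match, the weight reduces to \(\log_2 3 \cdot tf\), which matches the \(1.58\, tf\) stated in Note 1 up to rounding.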

Acknowledgments

This research was supported in part by Singapore Ministry of Education’s Academic Research Fund Tier 2 grant ARC 9/12 (MOE2011-T2-2-056).

Author information

Corresponding author

Correspondence to Tam T. Nguyen.

About this article

Cite this article

Nguyen, T.T., Chang, K. & Hui, S.C. Supervised term weighting centroid-based classifiers for text categorization. Knowl Inf Syst 35, 61–85 (2013). https://doi.org/10.1007/s10115-012-0559-9

  • DOI: https://doi.org/10.1007/s10115-012-0559-9
