Supervised term weighting centroid-based classifiers for text categorization

Nguyen, Tam T.; Chang, Kuiyu; Hui, Siu Cheung

doi:10.1007/s10115-012-0559-9

Regular Paper
Published: 09 September 2012

Volume 35, pages 61–85, (2013)
Knowledge and Information Systems

Tam T. Nguyen¹,
Kuiyu Chang¹ &
Siu Cheung Hui¹

706 Accesses
22 Citations

Abstract

In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining the intrinsic and worthy design goals of CFC, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to handle multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence, yielding CFC–JS. Our proposed supervised term weighting schemes have been evaluated on five datasets; KL- and JS-weighted classifiers consistently outperformed baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which were otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improves centroid-based classifiers but also benefits SVM classifiers.
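As a rough illustration of divergence-based term weighting, the sketch below scores each term by how differently it is distributed across classes: a KL divergence for two classes and a generalized JS divergence (in the sense of Lin 1991) for any number of classes. The Bernoulli term-presence model, the Laplace smoothing, the toy corpus, and all function names are assumptions made for this example; the exact CFC–KL and CFC–JS formulas are not given in the abstract and may differ.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence D(Bern(p) || Bern(q)) in bits; assumes 0 < p, q < 1."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def entropy(p):
    """Shannon entropy of Bern(p) in bits."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

def js_divergence(probs, priors):
    """Generalized JS divergence: entropy of the mixture minus mean entropy."""
    mixture = sum(w * p for w, p in zip(priors, probs))
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(priors, probs))

def term_presence_probs(docs_by_class, term):
    """Laplace-smoothed P(term present | class), one value per class (assumed model)."""
    return [
        (sum(term in doc for doc in docs) + 1) / (len(docs) + 2)
        for docs in docs_by_class
    ]

# Hypothetical toy corpus: each document is a set of terms, grouped by class.
docs_by_class = [
    [{"goal", "match", "the"}, {"goal", "team", "the"}, {"match", "the"}],
    [{"vote", "party", "the"}, {"vote", "the"}, {"party", "election", "the"}],
]
priors = [0.5, 0.5]  # assumed equal class priors

weights = {}
for term in ("goal", "vote", "the"):
    probs = term_presence_probs(docs_by_class, term)
    weights[term] = {
        "kl": bernoulli_kl(probs[0], probs[1]),
        "js": js_divergence(probs, priors),
    }

for term, w in weights.items():
    print(f"{term}: KL={w['kl']:.3f} JS={w['js']:.3f}")
```

In this toy setting, a term such as "the" that occurs equally in both classes receives zero weight under either measure, while class-specific terms such as "goal" and "vote" receive positive weight. Note that KL is asymmetric in its arguments, whereas JS is symmetric and extends directly to more than two classes.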

Notes

In \(tf \times rf\), uniformly distributed terms, that is, those that appear equally in both classes, are assigned a constant weight of \(1.58 tf\).
Generated by Wordle (http://www.wordle.net).
We use the LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) library with linear kernels and default parameters.
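The constant in the first note can be checked with a short calculation. The sketch below assumes the relevance frequency definition of Lan et al. (2009), \(rf = \log_2(2 + a / \max(1, c))\), where \(a\) and \(c\) are the numbers of positive-class and negative-class documents containing the term; the function name and the sample counts are chosen only for illustration.

```python
import math

def relevance_frequency(a, c):
    """rf = log2(2 + a / max(1, c)), assuming the definition of Lan et al. (2009)."""
    return math.log2(2 + a / max(1, c))

# A term that appears in equally many documents of both classes (a == c >= 1)
# gets rf = log2(3), whatever the actual count is.
uniform_rf = relevance_frequency(50, 50)
print(f"{uniform_rf:.4f}")
```

Because the ratio \(a / c\) equals 1 for any such term, the weight reduces to \(\log_2 3 \approx 1.58\) times \(tf\), so uniformly distributed terms are never driven to zero under \(tf \times rf\).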


Acknowledgments

This research was supported in part by Singapore Ministry of Education’s Academic Research Fund Tier 2 grant ARC 9/12 (MOE2011-T2-2-056).

Author information

Authors and Affiliations

School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
Tam T. Nguyen, Kuiyu Chang & Siu Cheung Hui

Corresponding author

Correspondence to Tam T. Nguyen.

About this article

Cite this article

Nguyen, T.T., Chang, K. & Hui, S.C. Supervised term weighting centroid-based classifiers for text categorization. Knowl Inf Syst 35, 61–85 (2013). https://doi.org/10.1007/s10115-012-0559-9

Received: 09 March 2012
Revised: 28 April 2012
Accepted: 22 August 2012
Published: 09 September 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10115-012-0559-9
