Model-induced term-weighting schemes for text classification

Kim, Hyun Kyung; Kim, Minyoung

doi:10.1007/s10489-015-0745-z

Published: 15 January 2016

Applied Intelligence, volume 45, pages 30–43 (2016)

Abstract

The bag-of-words representation of text data is very popular for document classification. In the recent literature, it has been shown that properly weighting the term feature vector can improve the classification performance significantly beyond the original term-frequency based features. In this paper we demystify the success of the recent term-weighting strategies as well as provide possibly more reasonable modifications. We then propose novel term-weighting schemes that can be induced from the well-known document probabilistic models such as the Naive Bayes and the multinomial term model. Interestingly, some of the intuition-based term-weighting schemes coincide exactly with the proposed derivations. Our term-weighting schemes are tested on large-scale text classification problems/datasets where we demonstrate improved prediction performance over existing approaches.
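
To make the idea of weighting a term-frequency vector concrete, here is a minimal Python sketch. It builds raw bag-of-words counts and rescales each term by the classic inverse document frequency, log(N / df_k). This is only a familiar baseline chosen for illustration, not one of the schemes proposed in the paper; the toy documents and the helper names (`term_frequencies`, `idf_weights`, `apply_weights`) are assumptions.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for term, count in Counter(toks).items():
            vec[index[term]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def idf_weights(vectors):
    """Inverse document frequency log(N / df_k) for each term k.

    df_k is never zero here because the vocabulary is built from the
    same documents that are being counted.
    """
    n_docs = len(vectors)
    weights = []
    for k in range(len(vectors[0])):
        df = sum(1 for vec in vectors if vec[k] > 0)
        weights.append(math.log(n_docs / df))
    return weights


def apply_weights(vectors, weights):
    """Rescale every term-count vector elementwise by the per-term weights."""
    return [[x * w for x, w in zip(vec, weights)] for vec in vectors]


# Hypothetical toy corpus, used only to exercise the functions.
docs = ["good movie good plot", "bad movie", "good acting"]
vocab, tf = term_frequencies(docs)
weights = idf_weights(tf)
weighted = apply_weights(tf, weights)
```

The weighted vectors, rather than the raw counts, would then be passed to a linear classifier. A supervised scheme differs only in how `weights` is computed, typically from class-conditional statistics instead of corpus-wide document frequencies.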

Notes

¹ While there are slack variables to be optimized as well, they tend to be highly sparse, with only a few non-zero entries for the so-called support vectors.
² There is no reason to constrain the feature vector to be bounded from below by 1 in the SVM framework. Recall that it is the magnitudes of the features, not their signs, that indicate importance, since one can simply adapt the signs of the SVM weight parameters accordingly. The lower bound on the weights in [20, 21] stems solely from their physical interpretation of the weights as adjusted term frequencies.
³ In (10), it is typical to add positive constants to the numerators and denominators of \(p_k\) and \(\overline{p}_{k}\) (e.g., 1 to the numerators and 2 to the denominators), a practice known as Laplace smoothing. Throughout the paper we skip the Laplace smoothers for simplicity; incorporating them does not alter the core derivations.
⁴ http://www.csie.ntu.edu.tw/~cjlin/liblinear/.
⁵ http://web.ist.utl.pt/acardoso/datasets/. (Also, [2])
⁶ http://ai.stanford.edu/~amaas/data/sentiment/.
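
The Laplace smoothing mentioned in the note above (1 added to numerators, 2 to denominators) can be sketched in a few lines of Python. Equation (10) is not reproduced on this page, so the sketch assumes that p_k and its barred counterpart are the fractions of positive-class and negative-class documents containing term k. The log-ratio weight at the end is an illustrative choice and is not claimed to be the paper's formula.

```python
import math


def smoothed_rate(count, total, alpha=1.0, beta=2.0):
    """Laplace-smoothed proportion: (count + alpha) / (total + beta).

    With alpha=1 and beta=2 the estimate stays strictly between 0 and 1,
    even when the term appears in none or in all of the documents.
    """
    return (count + alpha) / (total + beta)


def log_ratio_weight(pos_df, n_pos, neg_df, n_neg):
    """Illustrative weight log(p_k / p_bar_k) from smoothed document rates.

    pos_df / neg_df: number of positive / negative documents containing the term.
    n_pos / n_neg: total number of positive / negative documents.
    (Assumed definitions; see the lead-in.)
    """
    p = smoothed_rate(pos_df, n_pos)
    p_bar = smoothed_rate(neg_df, n_neg)
    return math.log(p / p_bar)


# A term seen in 3 of 8 positive documents and in none of 8 negative ones.
# Without smoothing the ratio would divide by zero; with it the weight is finite.
w = log_ratio_weight(pos_df=3, n_pos=8, neg_df=0, n_neg=8)
```

Smoothing matters most for rare terms, where an unsmoothed zero count would send a ratio-based or log-based weight to infinity.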

References

Akhmedova S, Semenkin E, Sergienko R (2014) Automatically generated classifiers for opinion mining with different term weighting schemes. In: International Conference on Informatics in Control, Automation and Robotics
Cardoso-Cachopo A (2007) Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa
Carmel D, Mejer A, Pinter Y, Szpektor I (2014) Improving term weighting for community question answering search using syntactic analysis. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management
Chy A, Seddiqui M, Das S (2014) Bangla news classification using naive Bayes classifier. In: International Conference on Computer and Information Technology
Crammer K, Singer Y, Cristianini N, Shawe-Taylor J, Williamson B (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2
Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge from the world wide web. In: Proceedings of the 21st national conference on Artificial intelligence (AAAI)
Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the ACM symposium on Applied computing
Deng ZH, Luo KH, Yu HL (2014) A study of supervised term weighting scheme for sentiment analysis. Expert Syst Appl 41(7):3506–3513
Deng ZH, Tang SW, Yang DQ, Zhang M, Li LY, Xie KQ (2004) A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications. Lect Notes Comput Sci 3007:588–597
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
Domingos P, Pazzani M (1996) Beyond independence: conditions for the optimality of the simple Bayesian classifier. In: International Conference on Machine Learning
Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley
Escalante HJ, Garcia-Limon MA, Morales-Reyes A, Graff M, Gomez MM, Morales EF, Martinez-Carranza J (2015) Term-weighting learning via genetic programming for text classification. Knowl-Based Syst 83:176–189
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. J Mach Learn Res 9:1871–1874
Fattah MA (2015) New term weighting schemes with combination of multiple classifiers for sentiment analysis. Neurocomputing 167:434–442
Jiang H, Li P, Hu X, Wang S (2009) An improved method of term weighting for text classification. In: IEEE International Conference on Intelligent Computing and Intelligent Systems
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European Conference on Machine Learning
Ko Y (2012) A study of term weighting schemes using class information for text classification. In: ACM SIGIR conference on Research and development in information retrieval
Kocabas I, Dincer BT, Karaoglan B (2014) A nonparametric term weighting method for information retrieval based on measuring the divergence from independence. Inf Retr 17(2):153–176
Lan M, Tan C, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735
Lan M, Tan CL, Low HB (2006) Proposing a new term weighting scheme for text categorization. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)
Lang K (1995) Newsweeder: learning to filter netnews. In: International Conference on Machine Learning
Lewis D, Knowles K (1997) Threading electronic mail: a preliminary study. Int J Inf Process Manag 33(2):209–217
Li JR, Yang K (2010) News clustering system based on text mining. In: International Conference on Advanced Management Science
Liu WY, Wang L, Wang T (2010) Online supervised learning from multi-field documents for email spam filtering. In: International Conference on Machine Learning and Cybernetics
Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A bayesian approach to filtering junk e-mail. In: Proceedings of the 21st national conference on Artificial intelligence (AAAI)
Schölkopf B, Smola A (2002) Learning with Kernels. MIT Press, Cambridge
Shavlik J, Eliassi-Rad T (1998) Intelligent agents for web-based tasks: An advice-taking approach. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)
Soucy P, Mineau GW (2005) Beyond tfidf weighting for text categorization in the vector space model. In: International Joint Conference on Artificial Intelligence
Spärck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28:11–21
Upasana CS (2010) A survey on text classification techniques for e-mail filtering. In: International Conference on Machine Learning and Computing
Xuan NP, Quang HL (2014) A new improved term weighting scheme for text categorization. Advances in Intelligent Systems and Computing 244:261–270
Youquan H, Jianfang X, Cheng X (2011) An improved naive Bayesian algorithm for web page text classification. In: International Conference on Fuzzy Systems and Knowledge Discovery

Author information

Authors and Affiliations

Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology, Seoul, 139-743, Korea
Hyun Kyung Kim & Minyoung Kim

Corresponding author

Correspondence to Minyoung Kim.

Ethics declarations

This work is supported by the National Research Foundation of Korea (NRF-2013R1A1A1076101). The authors have no conflict of interest. This research does not involve human participants or animals. Consent to submit this manuscript has been received tacitly from the authors’ institution, Seoul National University of Science & Technology.

About this article

Cite this article

Kim, H.K., Kim, M. Model-induced term-weighting schemes for text classification. Appl Intell 45, 30–43 (2016). https://doi.org/10.1007/s10489-015-0745-z

Issue Date: July 2016