Abstract
The bag-of-words representation of text data is very popular for document classification. Recent literature has shown that properly weighting the term feature vector can improve classification performance significantly beyond that of the original term-frequency features. In this paper we demystify the success of recent term-weighting strategies and suggest modifications that are arguably more reasonable. We then propose novel term-weighting schemes that can be induced from well-known probabilistic document models such as Naive Bayes and the multinomial term model. Interestingly, some of the intuition-based term-weighting schemes coincide exactly with the proposed derivations. Our term-weighting schemes are tested on large-scale text classification datasets, where we demonstrate improved prediction performance over existing approaches.
Notes
While there are slack variables to be optimized as well, they often tend to be highly sparse with a few non-zero entries for the so-called support vectors.
There is no reason to constrain the feature vector to be bounded from below by 1 in the SVM framework. Recall that it is the magnitudes of the features, not their signs, that make them important, since one can simply adapt the signs of the SVM weight parameters accordingly. The lower bound imposed on the weights in [20, 21] stems solely from their physical interpretation of the weights as adjusted term frequencies.
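The sign argument in this note can be checked with a minimal sketch. The helper names `linear_score` and `flip_signs` and the numeric values below are hypothetical and chosen for illustration only; they are not taken from the paper. The sketch shows that negating a feature together with its weight leaves a linear decision score unchanged.

```python
def linear_score(weights, features, bias=0.0):
    """Decision score of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def flip_signs(vector, flip_mask):
    """Negate the entries of `vector` wherever `flip_mask` is True."""
    return [-v if flip else v for v, flip in zip(vector, flip_mask)]


# Illustrative values only (assumptions, not from the paper).
weights = [0.5, -1.25, 2.0]
features = [1.0, 0.5, 3.0]
mask = [True, False, True]  # features whose sign is flipped

original = linear_score(weights, features, bias=0.25)
flipped = linear_score(flip_signs(weights, mask),
                       flip_signs(features, mask),
                       bias=0.25)
```

Because the squared norm of the weight vector is also unaffected by sign changes, the regularized objective takes the same value for the flipped pair, which is why only the magnitudes of the features matter.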
In (10), it is typical to add positive constants to the numerators and denominators of \(p_{k}\) and \(\overline{p}_{k}\) (e.g., 1 to the numerators and 2 to the denominators), a practice often known as Laplace smoothing. Throughout the paper we omit the Laplace smoothers for simplicity; incorporating them does not alter the core derivations.
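As a minimal sketch of the smoothing described in this note, assume each probability is estimated as a ratio of document counts (the number of documents in a class that contain the term, over the number of documents in that class). That reading of (10), the function name `smoothed_rate`, and its parameter names are assumptions made for illustration; only the constants 1 and 2 come from the note.

```python
def smoothed_rate(docs_with_term, docs_total, add_num=1.0, add_den=2.0):
    """Laplace-smoothed ratio (n + add_num) / (N + add_den).

    Assumption: n is the number of documents in a class that contain
    the term and N is the number of documents in that class.
    """
    if docs_total < 0 or not 0 <= docs_with_term <= docs_total:
        raise ValueError("counts must satisfy 0 <= docs_with_term <= docs_total")
    return (docs_with_term + add_num) / (docs_total + add_den)


# A term never seen in a class of 8 documents no longer gets probability 0,
# and a term seen in all 8 documents no longer gets probability 1.
unseen = smoothed_rate(0, 8)
always = smoothed_rate(8, 8)
```

Keeping the estimates strictly between 0 and 1 avoids taking the logarithm of zero or dividing by zero when such probabilities enter log-ratio weights.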
http://web.ist.utl.pt/acardoso/datasets/ (see also [2]).
References
Akhmedova S, Semenkin E, Sergienko R (2014) Automatically generated classifiers for opinion mining with different term weighting schemes. In: International Conference on Informatics in Control, Automation and Robotics
Cardoso-Cachopo A (2007) Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa
Carmel D, Mejer A, Pinter Y, Szpektor I (2014) Improving term weighting for community question answering search using syntactic analysis. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management
Chy A, Seddiqui M, Das S (2014) Bangla news classification using naive Bayes classifier. In: International Conference on Computer and Information Technology
Crammer K, Singer Y, Cristianini N, Shawe-Taylor J, Williamson B (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2
Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge from the world wide web. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)
Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the ACM symposium on Applied computing
Deng ZH, Luo KH, Yu HL (2014) A study of supervised term weighting scheme for sentiment analysis. Expert Syst Appl 41(7):3506–3513
Deng ZH, Tang SW, Yang DQ, Zhang M, Li LY, Xie KQ (2004) A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications. Lect Notes Comput Sci 3007:588–597
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
Domingos P, Pazzani M (1996) Beyond independence: conditions for the optimality of the simple Bayesian classifier. In: International Conference on Machine Learning
Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley
Escalante HJ, Garcia-Limon MA, Morales-Reyes A, Graff M, Gomez MM, Morales EF, Martinez-Carranza J (2015) Term-weighting learning via genetic programming for text classification. Knowl-Based Syst 83:176–189
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. J Mach Learn Res 9:1871–1874
Fattah MA (2015) New term weighting schemes with combination of multiple classifiers for sentiment analysis. Neurocomputing 167:434–442
Jiang H, Li P, Hu X, Wang S (2009) An improved method of term weighting for text classification. In: IEEE International Conference on Intelligent Computing and Intelligent Systems
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European Conference on Machine Learning
Ko Y (2012) A study of term weighting schemes using class information for text classification. In: ACM SIGIR conference on Research and development in information retrieval
Kocabas I, Dincer BT, Karaoglan B (2014) A nonparametric term weighting method for information retrieval based on measuring the divergence from independence. Inf Retr 17(2):153–176
Lan M, Tan C, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735
Lan M, Tan CL, Low HB (2006) Proposing a new term weighting scheme for text categorization. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)
Lang K (1995) Newsweeder: learning to filter netnews. In: International Conference on Machine Learning
Lewis D, Knowles K (1997) Threading electronic mail: a preliminary study. Int J Inf Process Manag 33(2):209–217
Li JR, Yang K (2010) News clustering system based on text mining. In: International Conference on Advanced Management Science
Liu WY, Wang L, Wang T (2010) Online supervised learning from multi-field documents for email spam filtering. In: International Conference on Machine Learning and Cybernetics
Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)
Schölkopf B, Smola A (2002) Learning with Kernels. MIT Press, Cambridge
Shavlik J, Eliassi-Rad T (1998) Intelligent agents for web-based tasks: An advice-taking approach. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)
Soucy P, Mineau GW (2005) Beyond tfidf weighting for text categorization in the vector space model. In: International Joint Conference on Artificial Intelligence
Spärck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28:11–21
Upasana CS (2010) A survey on text classification techniques for e-mail filtering. In: International Conference on Machine Learning and Computing
Xuan NP, Quang HL (2014) A new improved term weighting scheme for text categorization. Advances in Intelligent Systems and Computing 244:261–270
Youquan H, Jianfang X, Cheng X (2011) An improved naive Bayesian algorithm for web page text classification. In: International Conference on Fuzzy Systems and Knowledge Discovery
Ethics declarations
This work is supported by the National Research Foundation of Korea (NRF-2013R1A1A1076101). The authors have no conflict of interest. This research involves neither human participants nor animals. Consent to submit this manuscript has been received tacitly from the authors’ institution, Seoul National University of Science & Technology.
About this article
Cite this article
Kim, H.K., Kim, M. Model-induced term-weighting schemes for text classification. Appl Intell 45, 30–43 (2016). https://doi.org/10.1007/s10489-015-0745-z
DOI: https://doi.org/10.1007/s10489-015-0745-z