Model-induced term-weighting schemes for text classification

Applied Intelligence

Abstract

The bag-of-words representation of text data is widely used for document classification. Recent literature has shown that properly weighting the term feature vector can improve classification performance significantly over the original term-frequency-based features. In this paper we explain why recent term-weighting strategies succeed and suggest modifications that are arguably better justified. We then propose novel term-weighting schemes induced from well-known probabilistic document models such as Naive Bayes and the multinomial term model. Interestingly, some intuition-based term-weighting schemes coincide exactly with the proposed derivations. We test our term-weighting schemes on large-scale text classification datasets and demonstrate improved prediction performance over existing approaches.

Notes

  1. While there are slack variables to be optimized as well, they tend to be highly sparse, with only a few non-zero entries corresponding to the so-called support vectors.

  2. There is no reason to constrain the feature vector to be bounded from below by 1 in the SVM framework. Recall that features with large magnitudes are considered important regardless of their signs, since one can simply adapt the signs of the SVM weight parameters accordingly. The bound imposed in [20, 21] stems solely from their physical interpretation of the weights as adjusted term frequencies.

  3. In (10), it is typical to add positive constants to the numerators and denominators of \(p_{k}\) and \(\overline {p}_{k}\) (e.g., 1 to the numerators and 2 to the denominators), which is known as Laplace smoothing. Throughout the paper we skip the Laplace smoothers for simplicity; incorporating them does not alter the core derivations.

  4. http://www.csie.ntu.edu.tw/~cjlin/liblinear/.

  5. http://web.ist.utl.pt/acardoso/datasets/. (Also, [2])

  6. http://ai.stanford.edu/~amaas/data/sentiment/.
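
The Laplace smoothing described in Note 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: Equation (10) is not reproduced on this page, so the sketch assumes that \(p_{k}\) and \(\overline {p}_{k}\) are the fractions of positive-class and negative-class training documents that contain term k. The function names `smoothed_fraction` and `estimate_p_and_p_bar` are hypothetical.

```python
from typing import Sequence, Tuple


def smoothed_fraction(count: int, total: int,
                      num_const: float = 1.0, den_const: float = 2.0) -> float:
    """Return (count + num_const) / (total + den_const).

    The defaults (1 and 2) are the constants mentioned in Note 3.
    """
    if not 0 <= count <= total:
        raise ValueError("count must lie between 0 and total")
    return (count + num_const) / (total + den_const)


def estimate_p_and_p_bar(term_present: Sequence[bool],
                         is_positive: Sequence[bool]) -> Tuple[float, float]:
    """Smoothed estimates for one term k.

    ASSUMPTION (not confirmed by the source): p_k is the share of
    positive documents containing the term, and p_bar_k is the share
    of negative documents containing it.
    """
    if len(term_present) != len(is_positive):
        raise ValueError("inputs must have the same length")
    n_pos = sum(1 for y in is_positive if y)
    n_neg = len(is_positive) - n_pos
    k_pos = sum(1 for t, y in zip(term_present, is_positive) if t and y)
    k_neg = sum(1 for t, y in zip(term_present, is_positive) if t and not y)
    return smoothed_fraction(k_pos, n_pos), smoothed_fraction(k_neg, n_neg)


if __name__ == "__main__":
    # Five toy documents: does each contain the term, and is it positive?
    present = [True, True, False, False, True]
    positive = [True, True, True, False, False]
    p_k, p_bar_k = estimate_p_and_p_bar(present, positive)
    print(p_k, p_bar_k)
```

With these constants the estimate always lies strictly between 0 and 1, so ratios and logarithms built from it stay finite even for terms that never occur in one of the classes.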

References

  1. Akhmedova S, Semenkin E, Sergienko R (2014) Automatically generated classifiers for opinion mining with different term weighting schemes. In: International Conference on Informatics in Control, Automation and Robotics

  2. Cardoso-Cachopo A (2007) Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa

  3. Carmel D, Mejer A, Pinter Y, Szpektor I (2014) Improving term weighting for community question answering search using syntactic analysis. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management

  4. Chy A, Seddiqui M, Das S (2014) Bangla news classification using naive Bayes classifier. In: International Conference on Computer and Information Technology

  5. Crammer K, Singer Y, Cristianini N, Shawe-Taylor J, Williamson B (2001) On the algorithmic implementation of multiclass kernel-based vector machines. J Mach Learn Res 2

  6. Craven M, DiPasquo D, Freitag D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge from the world wide web. In: Proceedings of the 21st national conference on Artificial intelligence (AAAI)

  7. Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the ACM symposium on Applied computing

  8. Deng ZH, Luo KH, Yu HL (2014) A study of supervised term weighting scheme for sentiment analysis. Expert Syst Appl 41(7):3506–3513

  9. Deng ZH, Tang SW, Yang DQ, Zhang M, Li LY, Xie KQ (2004) A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications. Lect Notes Comput Sci 3007:588–597

  10. Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923

  11. Domingos P, Pazzani M (1996) Beyond independence: conditions for the optimality of the simple Bayesian classifier. In: International Conference on Machine Learning

  12. Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley

  13. Escalante HJ, Garcia-Limon MA, Morales-Reyes A, Graff M, Gomez MM, Morales EF, Martinez-Carranza J (2015) Term-weighting learning via genetic programming for text classification. Knowl-Based Syst 83:176–189

  14. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. J Mach Learn Res 9:1871–1874

  15. Fattah MA (2015) New term weighting schemes with combination of multiple classifiers for sentiment analysis. Neurocomputing 167:434–442

  16. Jiang H, Li P, Hu X, Wang S (2009) An improved method of term weighting for text classification. In: IEEE International Conference on Intelligent Computing and Intelligent Systems

  17. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European Conference on Machine Learning

  18. Ko Y (2012) A study of term weighting schemes using class information for text classification. In: ACM SIGIR conference on Research and development in information retrieval

  19. Kocabas I, Dincer BT, Karaoglan B (2014) A nonparametric term weighting method for information retrieval based on measuring the divergence from independence. Inf Retr 17(2):153–176

  20. Lan M, Tan C, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735

  21. Lan M, Tan CL, Low HB (2006) Proposing a new term weighting scheme for text categorization. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)

  22. Lang K (1995) Newsweeder: learning to filter netnews. In: International Conference on Machine Learning

  23. Lewis D, Knowles K (1997) Threading electronic mail: a preliminary study. Int J Inf Process Manag 33(2):209–217

  24. Li JR, Yang K (2010) News clustering system based on text mining. In: International Conference on Advanced Management Science

  25. Liu WY, Wang L, Wang T (2010) Online supervised learning from multi-field documents for email spam filtering. In: International Conference on Machine Learning and Cybernetics

  26. Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

  27. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press

  28. Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

  29. Sahami M, Dumais S, Heckerman D, Horvitz E (1998) A Bayesian approach to filtering junk e-mail. In: Proceedings of the 21st national conference on Artificial intelligence (AAAI)

  30. Schölkopf B, Smola A (2002) Learning with Kernels. MIT Press, Cambridge

  31. Shavlik J, Eliassi-Rad T (1998) Intelligent agents for web-based tasks: An advice-taking approach. In: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI)

  32. Soucy P, Mineau GW (2005) Beyond tfidf weighting for text categorization in the vector space model. In: International Joint Conference on Artificial Intelligence

  33. Spärck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28:11–21

  34. Upasana CS (2010) A survey on text classification techniques for e-mail filtering. In: International Conference on Machine Learning and Computing

  35. Xuan NP, Quang HL (2014) A new improved term weighting scheme for text categorization. Advances in Intelligent Systems and Computing 244:261–270

  36. Youquan H, Jianfang X, Cheng X (2011) An improved naive Bayesian algorithm for web page text classification. In: International Conference on Fuzzy Systems and Knowledge Discovery

Author information

Corresponding author

Correspondence to Minyoung Kim.

Ethics declarations

This work is supported by the National Research Foundation of Korea (NRF-2013R1A1A1076101). The authors have no conflict of interest. This research does not involve human participants or animals. Consent to submit this manuscript has been received tacitly from the authors’ institution, Seoul National University of Science & Technology.

Cite this article

Kim, H.K., Kim, M. Model-induced term-weighting schemes for text classification. Appl Intell 45, 30–43 (2016). https://doi.org/10.1007/s10489-015-0745-z

  • DOI: https://doi.org/10.1007/s10489-015-0745-z
