
On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification

  • Research Article - Computer Engineering and Computer Science
Arabian Journal for Science and Engineering

Abstract

The performance of text classification depends on the choice of an appropriate term weighting scheme, among other parameters. Supervised term weighting has become popular in recent years because it may provide a discriminative vector space representation for text documents belonging to different classes. A term weighting scheme generally consists of three factors: the term frequency factor, the collection frequency factor, and the length normalization factor. Term weighting studies have mostly focused on developing new collection frequency factors. However, the term frequency factor also plays an important role, especially in supervised term weighting. In this study, we extensively analyze the effects of using different term frequency factors on seven supervised term weighting schemes. Six of these schemes were applied in previous studies; we derived the seventh from an existing feature selection method that had not been used as a weighting method before. The analysis is performed using SVM and Rocchio classifiers on two widely known benchmark datasets with different characteristics. Experimental results show that modifying the term frequency factor increased the performance of almost all of the weighting schemes. In addition, schemes using the square root function-based term frequency factor (SQRT_TF) are more successful than those using the term frequency (TF) and logarithmic function-based term frequency (LOG_TF) factors. According to the experimental results and statistical analysis, TF appears to be the least effective of the three term frequency factors.
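The three-factor structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give exact formulas, so the forms below are assumptions: TF is taken as the raw count, LOG_TF as log(1 + tf), and SQRT_TF as sqrt(tf). The function names, the `collection_factor` dictionary (standing in for any supervised collection frequency factor), and the use of cosine (L2) length normalization are likewise hypothetical choices made for illustration.

```python
import math
from typing import Callable, Dict


def tf(count: int) -> float:
    """Term frequency factor: the raw count (assumed form)."""
    return float(count)


def log_tf(count: int) -> float:
    """Logarithmic term frequency factor, assumed here to be log(1 + tf)."""
    return math.log(1 + count)


def sqrt_tf(count: int) -> float:
    """Square root term frequency factor, assumed here to be sqrt(tf)."""
    return math.sqrt(count)


def weight_document(
    counts: Dict[str, int],
    collection_factor: Dict[str, float],
    tf_factor: Callable[[int], float],
) -> Dict[str, float]:
    """Combine the three factors for one document.

    weight(t) = tf_factor(count_t) * collection_factor[t], followed by
    cosine (L2) length normalization. Terms missing from
    collection_factor receive a weight of zero.
    """
    weights = {
        term: tf_factor(count) * collection_factor.get(term, 0.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return weights  # all-zero vector: nothing to normalize
    return {term: w / norm for term, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical document and collection frequency factors.
    doc_counts = {"goal": 9, "team": 1}
    cf = {"goal": 1.0, "team": 1.0}
    for factor in (tf, log_tf, sqrt_tf):
        print(factor.__name__, weight_document(doc_counts, cf, factor))
```

With equal collection frequency factors, the square root and logarithmic forms dampen the dominance of the frequent term relative to the raw count, which is the kind of effect the study compares across schemes.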



Author information

Corresponding author

Correspondence to Turgut Dogan.

Ethics declarations

Conflict of interest

This manuscript is the original work of the author and has not been published, nor has it been submitted simultaneously elsewhere. No competing interests are at stake, and there is no conflict of interest with other people or organizations that could inappropriately influence or bias the content of the paper.


Cite this article

Dogan, T., Uysal, A.K. On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification. Arab J Sci Eng 44, 9545–9560 (2019). https://doi.org/10.1007/s13369-019-03920-9


  • DOI: https://doi.org/10.1007/s13369-019-03920-9
