Abstract
The performance of text classification depends on the choice of term weighting scheme as well as on other parameters. Supervised term weighting schemes have become popular in recent years because they can provide a discriminative vector space representation for text documents belonging to different classes. A term weighting scheme generally consists of three factors: the term frequency factor, the collection frequency factor, and the length normalization factor. Term weighting studies have mostly focused on developing new collection frequency factors. However, the term frequency factor also plays an important role, especially in supervised term weighting. In this study, we extensively analyzed the effects of using different term frequency factors on seven supervised term weighting schemes. Six of these schemes were applied in previous studies; we derived the seventh from an existing feature selection method that had not previously been used as a weighting method. The analysis was performed using SVM and Rocchio classifiers on two widely known benchmark datasets with different characteristics. Experimental results showed that modifying the term frequency factor increased the performance of almost all of the supervised term weighting schemes. In addition, schemes using the square root function-based term frequency factor (SQRT_TF) were more successful than those using the raw term frequency (TF) and logarithmic function-based term frequency (LOG_TF) factors. According to the experimental results and statistical analysis, TF appears to be the least effective of the three term frequency factors.
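The three-factor structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tf_raw`, `tf_log`, `tf_sqrt`, `weight_vector`) are hypothetical, the LOG_TF form `log(1 + tf)` is an assumption (other variants such as `1 + log(tf)` are also common), cosine normalization is assumed for the length normalization factor, and the collection frequency values are supplied by the caller rather than computed by any particular supervised scheme.

```python
import math


def tf_raw(tf):
    """TF: the raw term frequency, used unchanged."""
    return float(tf)


def tf_log(tf):
    """LOG_TF: logarithmic damping; log(1 + tf) is an assumed form."""
    return math.log(1.0 + tf)


def tf_sqrt(tf):
    """SQRT_TF: square root damping of the raw term frequency."""
    return math.sqrt(tf)


def weight_vector(term_counts, cf_weights, tf_factor):
    """Combine the three factors for one document.

    term_counts: dict mapping term -> raw count in the document
    cf_weights:  dict mapping term -> collection frequency factor
                 (hypothetical values; any supervised scheme could supply them)
    tf_factor:   one of tf_raw, tf_log, tf_sqrt
    """
    # Term frequency factor times collection frequency factor.
    raw = {
        term: tf_factor(count) * cf_weights.get(term, 0.0)
        for term, count in term_counts.items()
    }
    # Length normalization factor: cosine (L2) normalization, assumed here.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw
    return {term: w / norm for term, w in raw.items()}


# Illustrative usage with made-up counts and collection frequency values.
doc = {"market": 9, "goal": 1}
cf = {"market": 2.0, "goal": 0.5}
for factor in (tf_raw, tf_log, tf_sqrt):
    print(factor.__name__, weight_vector(doc, cf, factor))
```

Both LOG_TF and SQRT_TF grow more slowly than the raw count, so a term that occurs many times in a document dominates the weighted vector less than it does under TF.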
Ethics declarations
Conflict of interest
This manuscript is the original work of the author and has neither been published nor been submitted simultaneously elsewhere. It is specifically stated that no competing interests are at stake, and that there is no conflict of interest with other people or organizations that could inappropriately influence or bias the content of the paper.
About this article
Cite this article
Dogan, T., Uysal, A.K. On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification. Arab J Sci Eng 44, 9545–9560 (2019). https://doi.org/10.1007/s13369-019-03920-9
DOI: https://doi.org/10.1007/s13369-019-03920-9