Abstract
A major problem in text classification is the high-dimensional feature space of text data. Feature selection (FS) algorithms eliminate irrelevant and redundant terms, thereby increasing the accuracy and speed of a text classifier. For text classification, FS algorithms must be designed with the highly imbalanced classes of text data in mind. To this end, ensemble algorithms such as the improved global feature selection scheme (IGFSS) and the variable global feature selection scheme (VGFSS) were recently proposed. These algorithms, which combine local and global FS metrics, have shown promising results, with VGFSS better able to address the class imbalance issue. However, both schemes depend heavily on the underlying local and global FS metrics, and existing FS metrics struggle to select relevant terms from data with highly imbalanced classes. In this paper, we propose a new FS metric named inherent distinguishing feature selector (IDFS), which selects terms having greater relevance to classes and is highly effective for imbalanced data sets. We compare the performance of IDFS against five well-known FS metrics, both as a stand-alone FS algorithm and as part of the IGFSS and VGFSS frameworks, on five benchmark data sets using two classifiers, namely support vector machines and random forests. Our results show that in both scenarios IDFS selects smaller subsets and achieves higher micro and macro \(F_1\) values, thus outperforming the existing FS metrics.
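To illustrate why both micro and macro \(F_1\) are reported for imbalanced data, here is a minimal sketch of the two averages on a toy two-class problem. The function name `f1_scores` and the toy labels are illustrative assumptions and are not taken from the paper; the formulas are the standard single-label definitions.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1, per-class F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro F1: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro F1: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class


# Hypothetical toy data: nine documents of a majority class, one of a minority
# class, and a classifier that always predicts the majority class.
y_true = ["major"] * 9 + ["minor"]
y_pred = ["major"] * 10
micro, macro, per_class = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro average stays high because the majority class dominates the pooled counts, while the macro average drops because the minority class is never predicted correctly.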
Notes
In text classification, a category is the same as a class.
In the paper, “features,” “words” and “terms” are used interchangeably.
We use “filters,” “FS metrics” and “feature ranking metrics” interchangeably.
Cite this article
Ali, M.S., Javed, K. A Novel Inherent Distinguishing Feature Selector for Highly Skewed Text Document Classification. Arab J Sci Eng 45, 10471–10491 (2020). https://doi.org/10.1007/s13369-020-04763-5
DOI: https://doi.org/10.1007/s13369-020-04763-5