Skip to main content
Log in

A Novel Inherent Distinguishing Feature Selector for Highly Skewed Text Document Classification

  • Research Article-Computer Engineering and Computer Science
  • Published:
Arabian Journal for Science and Engineering Aims and scope Submit manuscript

Abstract

A major problem text classification faces is the high dimensional feature space of the text data. Feature selection (FS) algorithms are used for eliminating the irrelevant and redundant terms, thus increasing accuracy and speed of a text classifier. For text classification, FS algorithms have to be designed keeping the highly imbalanced classes of the text data in view. To this end, more recently ensemble algorithms (e.g., improved global feature selection scheme (IGFSS) and variable global feature selection scheme (VGFSS)) were proposed. These algorithms, which combine local and global FS metrics, have shown promising results with VGFSS having better capability of addressing the class imbalance issue. However, both these schemes are highly dependent on the underlying local and global FS metrics. Existing FS metrics get confused while selecting relevant terms of a data with highly imbalanced classes. In this paper, we propose a new FS metric named inherent distinguished feature selector (IDFS), which selects terms having greater relevance to classes and is highly effective for imbalanced data sets. We compare performance of IDFS against five well-known FS metrics as a stand-alone FS algorithm and as a part of the IGFSS and VGFSS frameworks on five benchmark data sets using two classifiers, namely support vector machines and random forests. Our results show that IDFS in both scenarios selects smaller subsets, and achieves higher micro and macro \(F_1\) values, thus outperforming the existing FS metrics.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

Notes

  1. In text classification, category is the same as the class.

  2. In the paper, “features,” “words” and “terms” are used interchangeably.

  3. We use filters, FS metrics or feature ranking metrics interchangeably.

  4. http://ana.cachopo.org/datasets-for-single-label-text-categorization.

  5. http://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/.

References

  1. Uysal, A.K.; Gunal, S.: A novel probabilistic feature selection method for text classification. Knowl. Based Syst. 36, 226–235 (2012)

    Google Scholar 

  2. Grimes, S.: Unstructured data and the 80 percent rule. http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/. Accessed 13 Oct 2019 (2019)

  3. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)

    Google Scholar 

  4. Marin, A.; Holenstein, R.; Sarikaya, R.; Ostendorf, M.: Learning phrase patterns for text classification using a knowledge graph and unlabeled data. In: 15th Annual Conference of the International Speech Communication Association (2014)

  5. Li, X.; Xie, H.; Chen, L.; Wang, J.; Deng, X.: News impact on stock price return via sentiment analysis. Knowl. Based Syst. 69(1), 14–23 (2014)

    Google Scholar 

  6. Rao, Y.; Xie, H.; Li, J.; Jin, F.; Wang, F.L.; Li, Q.: Social emotion classification of short text via topic-level maximum entropy model. Inf. Manag. 53(8), 978–986 (2016)

    Google Scholar 

  7. Uysal, A.K.: An improved global feature selection scheme for text classification. Expert Syst. Appl. 43, 82–92 (2016)

    Google Scholar 

  8. Mironczuk, M.; Protasiewicz, J.: A recent overview of the state-of-the-art elements of text classification. Expert Syst. Appl. 106, 36–54 (2018)

    Google Scholar 

  9. Joachims, T.: Learning To Classify Text using Support Vector Machines. Kluwer Academic Publishers, Berlin (2002)

    Google Scholar 

  10. Aggarwal, C.C.; Zhai, C.: A survey of text classification algorithms. In: Mining Text Data, pp. 163–222. Springer (2012)

  11. Grimmer, J.; Stewart, B.M.: Text as data: the promise and pitfalls of automatic content analysis methods for political texts. Pol. Anal. 21, 267–297 (2013)

    Google Scholar 

  12. Ko, Y.: A study of term weighting schemes using class information for text classification. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1029–1030. Citeseer (2012)

  13. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3(Mar), 1289–1305 (2003)

    MATH  Google Scholar 

  14. Lan, M.; Tan, C.L.; Low, H.B.; Sung, S.Y.: A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In: Special Interest Tracks and Posters of the 14th International conference on World Wide Web, pp. 1032–1033 (2005)

  15. Zhang, W.; Yoshida, T.; Tang, X.: A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst. Appl. 38, 2758–2765 (2011)

    Google Scholar 

  16. Manning, C.D.; Raghavan, P.; Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

    MATH  Google Scholar 

  17. Lan, M.; Tan, C.L.; Su, J.; Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 31(4), 721–735 (2008)

    Google Scholar 

  18. Chen, K.; Zhang, Z.; Long, J.; Zhang, H.: Turning from tf-idf to tf-igm for term weighting in text classification. Expert Syst. Appl. 66, 245–260 (2016)

    Google Scholar 

  19. Sabbaha, T.; Selamat, A.; Selamat, M.H.; Al-Anzi, F.S.; Viedmae, E.H.; Krejcar, O.; Fujita, H.: Modified frequency-based term weighting schemes for text classification. Appl. Soft Comput. 58, 193–206 (2017)

    Google Scholar 

  20. Mengle, S.S.; Goharian, N.: Ambiguity measure feature-selection algorithm. J. Am. Soc. Inf. Sci. Technol. 60(5), 1037–1050 (2009)

    Google Scholar 

  21. Maruf, S.; Javed, K.; Babri, H.A.: Improving text classification performance with random forests-based feature selection. Arab. J. Sci. Eng. 41, 951–964 (2016)

    Google Scholar 

  22. Saeed, M.; Javed, K.; Babri, H.A.: Machine learning using bernoulli mixture models: clustering, rule extraction and dimensionality reduction. Neurocomputing 119(7), 366–374 (2013)

    Google Scholar 

  23. Aceto, G.; Ciuonzo, D.; Montieri, A.; Pescapé, A.: Multi-classification approaches for classifying mobile app traffic. J. Netw. Comput. Appl. 103, 131–145 (2018)

    Google Scholar 

  24. Harish, B.; Revanasiddappa, M.: A comprehensive survey on various feature selection methods to categorize text documents. Int. J. Comput. Appl. 164(8), 1–7 (2017)

    Google Scholar 

  25. Javed, K.; Maruf, S.; Babri, H.A.: A two-stage markov blanket based feature selection algorithm for text classification. Neurocomputing 157, 91–104 (2015)

    Google Scholar 

  26. Javed, K.; Babri, H.; Saeed, M.: Feature selection based on class-dependent densities for high-dimensional binary data. IEEE Trans. Knowl. Data Eng. 24(3), 465–477 (2012)

    Google Scholar 

  27. Yang, Y.; Pedersen, J.O.: A comparative study on feature selection in text categorization. ICML 97, 412–420 (1997)

    Google Scholar 

  28. Shang, W.; Huang, H.; Zhu, H.; Lin, Y.: A novel feature selection algorithm for text categorization. Expert Syst. Appl. 33, 1–5 (2007)

    Google Scholar 

  29. Ogura, H.; Amano, H.; Kondo, M.: Comparison of metrics for feature selection in imbalanced text classification. Expert Syst. Appl. 38(5), 4978–4989 (2011)

    Google Scholar 

  30. TaşCı, Ş.; Güngör, T.: Comparison of text feature selection policies and using an adaptive framework. Expert Syst. Appl. 40(12), 4871–4886 (2013)

    Google Scholar 

  31. Agnihotri, D.; Verma, K.; Tripathi, P.: Variable global feature selection scheme for automatic classification of text documents. Expert Syst. Appl. 81, 268–281 (2017)

    Google Scholar 

  32. Cortes, C.; Vapnik, V.: Support vector networks. Mach. Learn. 20(3), 273–297 (1995)

    MATH  Google Scholar 

  33. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

    MATH  Google Scholar 

  34. Montieri, A.; Ciuonzo, D.; Aceto, G.; Pescapé, A.: Anonymity services tor, i2p, jondonym: classifying in the dark (web). IEEE Trans. Depend. Sec. Comput. 17(3), 662–675 (2020)

    Google Scholar 

  35. Javed, K.; Babri, H.A.; Saeed, M.: Impact of a metric of association between two variables on performance of filters for binary data. Neurocomputing 143, 248–260 (2014a)

    Google Scholar 

  36. Bolon-Canedo, V.; Sanchez-Marono, N.; Alonso-Betanzos, A.: Feature Selection for High-Dimensional Data. Springer, Basel (2015)

    Google Scholar 

  37. Labani, M.; Moradi, P.; Ahmadizar, F.; Jalili, M.: A novel multivariate filter method for feature selection in text classification problems. Eng. Appl. Artif. Intell. 70, 25–37 (2018)

    Google Scholar 

  38. Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L.: Feature Extraction: Foundations and Applications. Springer, Berlin (2006)

    MATH  Google Scholar 

  39. Guyon, I.; Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3(Mar), 1157–1182 (2003)

    MATH  Google Scholar 

  40. Rehman, A.; Javed, K.; Babri, H.A.; Asim, N.: Selection of the most relevant terms based on a max-min ratio metric for text classification. Expert Syst. Appl. 114, 78–96 (2018)

    Google Scholar 

  41. Javed, K.; Saeed, M.; Babri, H.A.: The correctness problem: evaluating the ordering of binary features in rankings. Knowl. Inf. Syst. 39(3), 543–563 (2014b)

    Google Scholar 

  42. Uğuz, H.: A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl. Based Syst. 24(7), 1024–1032 (2011)

    Google Scholar 

  43. Srividhya, V.; Anitha, R.: Evaluating preprocessing techniques in text categorization. Int. J. Comput. Sci. Appl. 47(11), 49–51 (2010)

    Google Scholar 

  44. Flach, P.: Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, Cambridge (2012)

    MATH  Google Scholar 

  45. Forman, G.: Bns feature scaling: an improved representation over TF-IDF for SVM text classification. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 263–270. ACM (2008)

  46. Liu, H.; Sun, J.; Liu, L.; Zhang, H.: Feature selection with dynamic mutual information. Pattern Recognit. 42(7), 1330–1339 (2009)

    MATH  Google Scholar 

  47. Wang, D.; Zhang, H.; Liu, R.; Lv, W.; Wang, D.: t-test feature selection approach based on term frequency for text categorization. Pattern Recognit. Lett. 45, 1–10 (2014)

    Google Scholar 

  48. Lee, C.; Lee, G.G.: Information gain and divergence-based feature selection for machine learning-based text categorization. Inf. Process. Manag. 42(1), 155–165 (2006)

    Google Scholar 

  49. Pinheiro, R.H.; Cavalcanti, G.D.; Correa, R.F.; Ren, T.I.: A global-ranking local feature selection method for text categorization. Expert Syst. Appl. 39(17), 12851–12857 (2012)

    Google Scholar 

  50. Rehman, A.; Javed, K.; Babri, H.A.: Feature selection based on a normalized difference measure for text classification. Inf. Process. Manag. 53(2), 473–489 (2017)

    Google Scholar 

  51. Chen, J.; Huang, H.; Tian, S.; Qu, Y.: Feature selection for text classification with naïve bayes. Expert Syst. Appl. 36(3), 5432–5435 (2009)

    Google Scholar 

  52. Wang, F.; Li, Ch; Wang Js, XuJ; Li, L.: A two-stage feature selection method for text categorization by using category correlation degree and latent semantic indexing. J. Shanghai Jiaotong Univ. (Sci.) 20(1), 44–50 (2015)

    Google Scholar 

  53. Mladenic, D.; Grobelnik, M.: Feature selection for unbalanced class distribution and naive bayes. In: Proceedings of the 16nth International Conference on Machine Learning, pp. 258–267 (1999)

  54. Cachopo, AMdJC; et al.: Improving Methods for Single-Label Text Categorization. Instituto Superior Técnico, Portugal (2007)

    Google Scholar 

  55. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)

    Google Scholar 

  56. Montieri, A.; Ciuonzo, D.; Bovenzi, G.; Persico, V.; Pescapé, A.: A dive into the dark web: Hierarchical traffic classification of anonymity tools. In: IEEE Transactions on Network Science and Engineering, pp. 1–1 (2019)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Muhammad Sajid Ali.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ali, M.S., Javed, K. A Novel Inherent Distinguishing Feature Selector for Highly Skewed Text Document Classification. Arab J Sci Eng 45, 10471–10491 (2020). https://doi.org/10.1007/s13369-020-04763-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13369-020-04763-5

Keywords

Navigation