Utility-based feature selection for text classification

  • Heyong Wang
  • Ming Hong
  • Raymond Yiu Keung Lau
Regular Paper

Abstract

Feature selection is an important step before a classification task, used to reduce excessive computational costs and improve classification performance. This paper presents a novel feature selection method based on the concept of utility, which is grounded in economics theory. In particular, we focus on a utility-based feature selection method for enhancing text classification. Unlike existing feature selection methods, the proposed method selects discriminative semantic terms according to how authors use terms to express the main ideas in textual documents, i.e., the “utility of terms,” a criterion that measures the usefulness of terms in expressing authors’ main ideas. To the best of our knowledge, our work represents the first successful research on leveraging economics theory to develop a semantically rich feature selection method for improving text classification. Our empirical tests on six UCI benchmark datasets confirm that the proposed method often outperforms other state-of-the-art feature selection methods in text classification. Moreover, our method provides an economics-based explanation of term weighting for information retrieval and semantic information acquisition in textual documents.

Keywords

Feature selection, Text classification, Text mining, Utility theory, Economics theory

Notes

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant No. 71731006) and the Natural Science Foundation of Guangdong Province (Grant No. 2018A030313795). Lau’s work was supported by grants from the Research Grants Council of the Hong Kong SAR (Projects: CityU 11502115 and CityU 11525716), the NSFC Basic Research Program (Project: 71671155), and the Shenzhen Municipal Science and Technology Innovation Fund (Project: JCYJ20160229165300897).

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Department of E-Business, South China University of Technology, Guangzhou, China
  2. Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, China
