Abstract
Feature selection is an important step before a classification task, used to reduce computational costs and improve classification performance. This paper presents a novel feature selection method based on the concept of utility, which is grounded in economic theory. In particular, we focus on a utility-based feature selection method for enhancing text classification. Unlike existing feature selection methods, the proposed method selects discriminative semantic terms according to how authors use terms to express the main ideas in textual documents, i.e., the “utility of terms,” a criterion that measures the usefulness of terms for expressing authors’ main ideas. To the best of our knowledge, our work represents the first successful research on leveraging economic theory to develop a semantically rich feature selection method that improves text classification. Our empirical tests on six UCI benchmark datasets confirm that the proposed method often outperforms other state-of-the-art feature selection methods in text classification. Moreover, our method provides an economic explanation of term weighting for information retrieval and semantic information acquisition in textual documents.
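As a rough illustration of where a filter-style feature selection step sits before classification, the sketch below ranks terms by a chi-square score and keeps the top k. Chi-square is used here only as a familiar stand-in; it is not the utility measure proposed in this paper, and the function names (`chi_square_scores`, `select_top_k`) and the toy corpus are hypothetical.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its largest chi-square statistic over the classes.

    docs: list of token lists; labels: list of class labels (same length).
    NOTE: chi-square is a stand-in scorer, not the paper's utility measure.
    """
    n = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()  # term -> number of documents containing it
    joint = Counter()     # (term, class) -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            joint[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = joint[(term, label)]  # in class, contains term
            b = df - a                # other classes, contains term
            c = size - a              # in class, lacks term
            d = n - size - b          # other classes, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus: two "spam" and two "ham" documents.
docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "watches", "now"],
    ["meeting", "agenda", "for", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels)
selected = select_top_k(docs, labels, 3)
print(selected)
```

On this toy corpus the terms that occur in exactly one class ("buy", "cheap", "meeting") receive the highest scores, while "now", which appears in both classes, is ranked lower. A classifier would then be trained on the reduced vocabulary only.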
Acknowledgements
This research was supported by the National Natural Science Foundation of China (Grant No. 71731006) and the Natural Science Foundation of Guangdong Province (Grant No. 2018A030313795). Lau’s work was supported by grants from the Research Grants Council of the Hong Kong SAR (Projects: CityU 11502115 and CityU 11525716), the NSFC Basic Research Program (Project: 71671155), and the Shenzhen Municipal Science and Technology Innovation Fund (Project: JCYJ20160229165300897).
Cite this article
Wang, H., Hong, M. & Lau, R.Y.K. Utility-based feature selection for text classification. Knowl Inf Syst 61, 197–226 (2019). https://doi.org/10.1007/s10115-018-1281-z