A New Feature Selection Algorithm Based on Category Difference for Text Categorization

Zhang, Wang; Chen, Chanjuan; Jiang, Lei; Bai, Xu

doi:10.1007/978-3-030-26075-0_25

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11642)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

Feature selection is an important step in text categorization: it reduces dimensionality and can improve classifier performance. Many popular feature selection methods do not consider the difference between the distributions of different categories on a feature. In this paper, we propose a new filter-based feature selection algorithm, fused distance feature selection (FDFS), which evaluates the significance of a feature by taking into account the difference between the distributions of different categories and selects the most discriminative features while keeping the number of features minimal. The proposed algorithm is evaluated from both inside and outside perspectives on four benchmark document datasets (20-Newsgroups, WebKB, CSDMC2010 and Ohsumed) using Linear Support Vector Machine (LSVM) and Multinomial Naïve Bayes (MNB) classifiers. The experimental results indicate that the proposed method is competitive, with an average ranking of 1.25 on LSVM and 1 on MNB.
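
The abstract does not give the FDFS formula, so the following is only a minimal sketch of the general idea of a filter method that scores a term by how differently it is distributed across categories. The score used here (the largest gap between per-category document frequencies) is a stand-in chosen for illustration, not the paper's measure, and the names `category_doc_freq`, `spread_score` and `select_features`, as well as the toy corpus, are hypothetical.

```python
from collections import Counter, defaultdict


def category_doc_freq(docs, labels):
    """Map each category to {term: fraction of that category's documents containing the term}."""
    n_docs = Counter(labels)
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(set(tokens))  # count a term once per document
    return {c: {t: k / n_docs[c] for t, k in counts[c].items()} for c in n_docs}


def spread_score(term, freq):
    """Illustrative score (NOT FDFS): largest gap between any two categories' document frequencies."""
    values = [freq[c].get(term, 0.0) for c in freq]
    return max(values) - min(values)


def select_features(docs, labels, k):
    """Rank the vocabulary by the illustrative score and keep the top k terms."""
    freq = category_doc_freq(docs, labels)
    vocab = sorted({t for tokens in docs for t in tokens})
    # Descending score; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(vocab, key=lambda t: (-spread_score(t, freq), t))
    return ranked[:k]


# Toy corpus, invented for this example only.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]
top = select_features(docs, labels, 3)
print(top)
```

In this toy corpus "team" appears in both categories and therefore scores lower than terms confined to one category. Like any filter method, the ranking is computed independently of the classifier, so the same selected terms could be passed to either an LSVM or an MNB model.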

Acknowledgement

This paper is supported by the National Science Foundation for Young Scientists of China (Grant No. 61702507).

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Wang Zhang, Lei Jiang & Xu Bai
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Wang Zhang
China National Machinery Industry Corporation, Beijing, China
Chanjuan Chen

Corresponding author

Correspondence to Xu Bai.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Jie Shao
Hong Kong Polytechnic University, Hong Kong, China
Man Lung Yiu
The University of Tokyo, Tokyo, Japan
Masashi Toyoda
Zhejiang University, Hangzhou, China
Dongxiang Zhang
National University of Singapore, Singapore, Singapore
Wei Wang
Peking University, Beijing, China
Bin Cui

About this paper

Cite this paper

Zhang, W., Chen, C., Jiang, L., Bai, X. (2019). A New Feature Selection Algorithm Based on Category Difference for Text Categorization. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11642. Springer, Cham. https://doi.org/10.1007/978-3-030-26075-0_25

DOI: https://doi.org/10.1007/978-3-030-26075-0_25
Published: 17 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26074-3
Online ISBN: 978-3-030-26075-0
eBook Packages: Computer Science, Computer Science (R0)
