Abstract
Feature selection is an important task in machine learning. It can improve classification accuracy and effectively reduce dataset dimensionality by removing non-discriminative features. Although a large body of research has focused on feature selection for text classification, few works have addressed the problem for multi-label data in a big data context. This paper therefore proposes a distributed feature selection approach for multi-label textual big data based on the weighted chi-square method. First, a standard problem transformation is applied to convert the multi-label data into single-label data. Then, the algorithm assigns a weight to each feature based on its category term frequency and calculates the chi-square statistic using that weight. The proposed method is implemented on the Hadoop framework using the MapReduce programming model. Finally, a set of experiments was conducted on three benchmark text datasets to evaluate the effectiveness of the proposed approach. A comparative analysis of the results against state-of-the-art techniques shows that the method is efficient, robust and scalable.
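The pipeline outlined in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not name the transformation or give the weighting formula, so the copy transformation (one single-label instance per document-label pair) and the weight (the share of a term's occurrences that fall in a category) are assumptions made here for illustration, and all function names are hypothetical. Only the chi-square statistic itself is the standard 2x2 contingency form.

```python
from collections import Counter, defaultdict


def copy_transform(docs):
    """Assumed transformation: one single-label instance per (document, label) pair."""
    return [(tokens, label) for tokens, labels in docs for label in labels]


def chi_square(a, b, c, d):
    """Standard chi-square for a term/category 2x2 table.

    a: instances in the category that contain the term
    b: instances outside the category that contain the term
    c: instances in the category without the term
    d: instances outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def weighted_chi_square(docs):
    """Score every (term, label) pair with weight * chi-square."""
    instances = copy_transform(docs)
    n = len(instances)
    doc_freq = defaultdict(Counter)   # instances of a label that contain the term
    term_freq = defaultdict(Counter)  # raw occurrences of the term within a label
    label_size = Counter()            # number of instances per label

    # Counting pass; in a MapReduce setting this is the part that would be distributed.
    for tokens, label in instances:
        label_size[label] += 1
        term_freq[label].update(tokens)
        doc_freq[label].update(set(tokens))

    total_df, total_tf = Counter(), Counter()
    for label in label_size:
        total_df.update(doc_freq[label])
        total_tf.update(term_freq[label])

    scores = {}
    for label, size in label_size.items():
        for term in total_df:
            a = doc_freq[label][term]
            b = total_df[term] - a
            c = size - a
            d = n - size - b
            # Assumed weight: fraction of the term's occurrences inside this label.
            weight = term_freq[label][term] / total_tf[term]
            scores[(term, label)] = weight * chi_square(a, b, c, d)
    return scores


# Toy corpus: (tokens, labels) pairs, invented for illustration.
docs = [
    (["goal", "match", "goal"], ["sport"]),
    (["vote", "match"], ["politics", "sport"]),
    (["vote", "law"], ["politics"]),
]
scores = weighted_chi_square(docs)
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
```

Keeping the counting pass separate from the scoring pass mirrors the way such counts are usually aggregated by key before a statistic is computed, which is what makes the approach a natural fit for a map and reduce decomposition.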
Keywords
- Chi-square
- Feature selection
- Hadoop
- MapReduce
- Multi-label
- Text classification
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Amazal, H., Ramdani, M., Kissi, M. (2020). Towards a Feature Selection for Multi-label Text Classification in Big Data. In: Hamlich, M., Bellatreche, L., Mondal, A., Ordonez, C. (eds) Smart Applications and Data Analysis. SADASC 2020. Communications in Computer and Information Science, vol 1207. Springer, Cham. https://doi.org/10.1007/978-3-030-45183-7_14
DOI: https://doi.org/10.1007/978-3-030-45183-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45182-0
Online ISBN: 978-3-030-45183-7
eBook Packages: Computer Science, Computer Science (R0)