Abstract
Text classification is a challenging computational task in the machine learning community because of the growing volume of natural language text documents available in electronic form. Feature selection (FS) is an essential phase of this process because thousands of candidate feature sets may be considered in text classification. This paper proposes an enhanced binary grey wolf optimizer (GWO) within a wrapper FS approach to tackle Arabic text classification problems. The proposed binary GWO (BGWO) serves as the search component of the wrapper-based feature selection technique. The performance of the proposed method is investigated using different learning models, including decision trees, K-nearest neighbour, Naive Bayes, and SVM classifiers. Three public Arabic datasets, namely Alwatan, Akhbar-Alkhaleej, and Al-jazeera-News, are used to evaluate the efficacy of the different BGWO-based wrapper methods. Results and analysis show that the SVM-based feature selection technique using the proposed binary GWO with an elite-based crossover scheme is more effective on Arabic text classification problems than its peers.
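To make the wrapper idea concrete, the following minimal Python sketch shows how a binary feature mask (one bit per feature, as a binary optimizer such as BGWO would produce) might be scored by training and testing a classifier on the selected features only. It is an illustration under stated assumptions, not the authors' implementation: the function names `knn1_loo_error` and `wrapper_fitness`, the 1-nearest-neighbour stand-in classifier, the leave-one-out evaluation, the weight `alpha = 0.99`, and the toy data are all hypothetical choices that the abstract does not specify.

```python
from typing import List, Sequence


def knn1_loo_error(X: Sequence[Sequence[float]], y: Sequence[int],
                   mask: Sequence[int]) -> float:
    """Leave-one-out error of a 1-nearest-neighbour classifier
    that only looks at the features whose mask bit is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 1.0  # assumption: an empty feature subset counts as worst case
    errors = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for k in range(len(X)):
            if k == i:
                continue  # hold out the sample being classified
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        if best_label != y[i]:
            errors += 1
    return errors / len(X)


def wrapper_fitness(X: Sequence[Sequence[float]], y: Sequence[int],
                    mask: Sequence[int], alpha: float = 0.99) -> float:
    """Lower is better: weighted sum of classification error and the
    fraction of selected features (alpha is an assumed weight)."""
    error = knn1_loo_error(X, y, mask)
    ratio = sum(mask) / len(mask)
    return alpha * error + (1 - alpha) * ratio


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X: List[List[float]] = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0],
                        [1.0, 5.1], [1.1, 1.1], [0.9, 9.1]]
y: List[int] = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    for mask in ([1, 0], [0, 1], [1, 1]):
        print(mask, wrapper_fitness(X, y, mask))
```

On this toy data the mask that keeps only feature 0 receives a lower (better) fitness than the masks that include the noisy feature, which is the signal a wrapper-based search uses to steer candidate solutions toward smaller, more discriminative feature subsets.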
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Appendices
Sample of Arabic datasets
A text from sport category in Akhbar-Alkhaleej dataset:
A text from economic category in Al-jazeera-News dataset:
A text from culture category in Alwatan dataset:
Cite this article
Chantar, H., Mafarja, M., Alsawalqah, H. et al. Feature selection using binary grey wolf optimizer with elite-based crossover for Arabic text classification. Neural Comput & Applic 32, 12201–12220 (2020). https://doi.org/10.1007/s00521-019-04368-6
DOI: https://doi.org/10.1007/s00521-019-04368-6