Abstract
Over the last decade, high-dimensional data has proliferated in document mining fields such as text summarization, text clustering, and text classification. The curse of dimensionality degrades the performance of classification models, and feature selection is an effective strategy for mitigating it. In this work, we present the Tom and Jerry Optimization (TJO) technique for feature subset selection. The proposed work measures a candidate's fitness using the classifier error rate and the ratio of selected features. The performance of the proposed scheme is evaluated on two popular benchmark text corpora and compared with five metaheuristic approaches. The best success rate obtained by the proposed scheme is 95.77%, with a best precision of 0.9509, recall of 0.9577, and F1-score of 0.9541. The comparison results show that the proposed feature subset selection scheme outperforms the compared strategies.
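The abstract states that candidate fitness combines the classifier error rate with the rate of features chosen. A minimal sketch of such a wrapper-style fitness function is shown below; the weight `alpha` and the exact combination formula are assumptions for illustration, not taken from the paper.

```python
def fitness(error_rate, feature_mask, alpha=0.99):
    """Score a candidate feature subset (lower is better).

    error_rate   -- classification error of a model trained on the subset
    feature_mask -- binary list; 1 marks a selected feature
    alpha        -- assumed trade-off weight between accuracy and subset size
    """
    # Fraction of features the candidate keeps.
    selected_ratio = sum(feature_mask) / len(feature_mask)
    # Weighted sum: favor low error first, then smaller subsets.
    return alpha * error_rate + (1 - alpha) * selected_ratio
```

A candidate that keeps half the features with a 10% error rate would score `0.99 * 0.10 + 0.01 * 0.5 = 0.104`; the optimizer would prefer any candidate with a lower combined score.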
Data Availability Statement
The datasets used in this study are available in a public repository.
Cite this article
Thirumoorthy, K., Britto, J.J.J. A feature selection model for document classification using Tom and Jerry Optimization algorithm. Multimed Tools Appl 83, 10273–10295 (2024). https://doi.org/10.1007/s11042-023-15828-6