
A feature selection model for document classification using Tom and Jerry Optimization algorithm

Multimedia Tools and Applications

Abstract

Over the last decade, high-dimensional data has proliferated across document mining tasks such as text summarization, text clustering, and text classification. The curse of dimensionality degrades a classification model's performance, and the feature selection strategy is extremely effective in dealing with this issue. In this work, we present the Tom and Jerry Optimization (TJO) technique for feature subset selection. The proposed scheme measures each candidate's fitness using the classifier error rate and the rate of selected features. Its performance is evaluated on two popular benchmark text corpora and compared with five metaheuristic approaches. The best success rate obtained by the proposed scheme is 95.77%, with a best precision of 0.9509, recall of 0.9577, and F1-score of 0.9541. The comparison results show that the proposed feature subset selection scheme outperforms the standard approaches.
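The fitness measure described above is the core of the wrapper approach. The following is a minimal sketch of one common weighted-sum formulation, in which the classifier error rate and the fraction of retained features are both minimized. The weight ALPHA, the 5-fold cross-validation, and the naive Bayes classifier are illustrative assumptions; the abstract does not specify the exact formulation the authors use.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical trade-off weight between error rate and subset size;
# the paper's exact value is not stated in the abstract.
ALPHA = 0.99

def fitness(mask, X, y, clf):
    """Fitness of a binary feature mask (lower is better): a weighted sum
    of the classifier error rate and the fraction of features retained."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():                      # an empty subset is invalid
        return 1.0
    acc = cross_val_score(clf, X[:, mask], y, cv=5).mean()
    error_rate = 1.0 - acc
    selection_ratio = mask.sum() / mask.size
    return ALPHA * error_rate + (1.0 - ALPHA) * selection_ratio

# Usage: score a random candidate subset on a toy term-frequency matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 40))      # 100 documents, 40 terms
y = rng.integers(0, 2, size=100)            # binary class labels
candidate = rng.random(40) < 0.5            # random binary feature mask
print(fitness(candidate, X, y, MultinomialNB()))
```

In a metaheuristic such as TJO, this fitness would be evaluated for every candidate mask in the population at each iteration, steering the search toward subsets that are both small and accurate.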


Data Availability Statement

The datasets used in this study are available in public repositories; see the Notes below.

Notes

  1. https://data.mendeley.com/datasets/9rw3vkcfy4/6

  2. https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews


Author information

Corresponding author

Correspondence to K Thirumoorthy.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Thirumoorthy, K., Britto, J.J.J. A feature selection model for document classification using Tom and Jerry Optimization algorithm. Multimed Tools Appl 83, 10273–10295 (2024). https://doi.org/10.1007/s11042-023-15828-6


