Abstract
Several comparative studies have reported findings relevant to the Text Categorization (TC) task, and all provide valuable observations. However, most of them address Western languages, especially English. This paper takes a step toward filling this gap by focusing on a less commonly investigated language, Arabic, to provide a more balanced perspective. Specifically, it investigates the performance of well-known machine learning methods that have been successfully applied to automatic TC, namely Naïve Bayes, Support Vector Machines, and Decision Trees. The investigation also covers pre-processing techniques and feature selection methods that address the high dimensionality of the data: stop-word elimination, stemming, and lemmatization as pre-processing techniques, and TF-IDF and Chi-square as feature selection methods. All possible combinations are considered. To make the comparison accurate and comprehensive, we trained and evaluated the selected classifiers and pre-processing techniques on common ground, using a large, balanced in-house corpus of 300,000 news articles distributed equally across six categories. The findings demonstrate the effectiveness of combining pre-processing techniques, feature selection methods, and classifiers.
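As a rough illustration of what "all possible combinations" could look like, the sketch below enumerates a grid over the techniques named in the abstract and includes the standard 2x2 contingency-table form of the Chi-square term score. The identifiers (`PREPROCESSING`, `experiment_grid`, `chi_square`) are hypothetical, and treating every subset of the pre-processing steps as a separate configuration is an assumption: the abstract does not say whether stemming and lemmatization were ever applied together.

```python
from itertools import combinations, product

# Technique names come from the abstract; the grid layout is an assumption.
PREPROCESSING = ["stop_word_elimination", "stemming", "lemmatization"]
FEATURE_SELECTION = ["tfidf", "chi_square"]
CLASSIFIERS = ["naive_bayes", "svm", "decision_tree"]


def preprocessing_subsets(steps):
    """Yield every subset of the steps, including the empty one (raw text)."""
    for size in range(len(steps) + 1):
        for subset in combinations(steps, size):
            yield subset


def experiment_grid():
    """Return every (pre-processing subset, feature method, classifier) triple."""
    return list(
        product(preprocessing_subsets(PREPROCESSING), FEATURE_SELECTION, CLASSIFIERS)
    )


def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from a 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or category never varies
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    grid = experiment_grid()
    print(len(grid), "configurations")
    print(grid[0])
    print(chi_square(10, 0, 0, 10))
```

Under this assumption the grid holds 8 pre-processing subsets, 2 feature selection methods, and 3 classifiers; a study that treats stemming and lemmatization as mutually exclusive would produce a smaller grid.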
Notes
- 1. https://ar.wikipedia.org/wiki/قائمة_أسماء_الأسد_في_اللغة_العربية.
- 2.
- 3.
- 4. https://www.nltk.org/_modules/nltk/stem/arlstem.html [last accessed: January 24, 2021].
- 5. https://pypi.org/project/Tashaphyne/ [last accessed: January 24, 2021].
- 6. https://arabicstemmer.com/ [last accessed: January 24, 2021].
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
El Kah, A., Zeroual, I. (2021). Improved Document Categorization Through Feature-Rich Combinations. In: Hassanien, A.E., et al. Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021). AICV 2021. Advances in Intelligent Systems and Computing, vol 1377. Springer, Cham. https://doi.org/10.1007/978-3-030-76346-6_32
DOI: https://doi.org/10.1007/978-3-030-76346-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76345-9
Online ISBN: 978-3-030-76346-6
eBook Packages: Intelligent Technologies and Robotics (R0)