Improved Document Categorization Through Feature-Rich Combinations

El Kah, Anoual; Zeroual, Imad

doi:10.1007/978-3-030-76346-6_32


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1377)

Included in the following conference series:

The International Conference on Artificial Intelligence and Computer Vision


Abstract

Several comparative studies report new findings relevant to the Text Categorization (TC) task, and all provide valuable observations. However, most of them address Western languages, especially English. This paper takes a step toward filling this gap by focusing on a less commonly investigated language, Arabic, to provide a more balanced perspective. To that end, it presents a deeper investigation of the performance of some well-known machine learning methods that have been successfully applied to automatic TC, such as Naïve Bayes, Support Vector Machines, and Decision Trees. The investigation also covers pre-processing techniques and feature selection methods that address the high dimensionality of the data. Specifically, stop-word elimination, stemming, and lemmatization are the pre-processing techniques included, along with TF-IDF and Chi-square as the feature selection methods. All possible combinations are considered. To make this study accurate and comprehensive, we trained and evaluated the selected classifiers and pre-processing techniques on common ground. For this purpose, we used a large, balanced in-house corpus of 300,000 news articles equally distributed across six categories. The findings demonstrate the effectiveness of combining the pre-processing techniques, feature selection methods, and classifiers.
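To make the phrase "all possible combinations" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that any subset of the three pre-processing techniques (including none) may be paired with one feature selection method and one classifier; whether the chapter combines stemming with lemmatization is not stated in the abstract. The function names are hypothetical, and the TF-IDF and Chi-square formulas are the common textbook variants, which may differ from the ones used in the chapter.

```python
from itertools import combinations, product
import math

# Components named in the abstract; the grouping below is an assumption.
PREPROCESSING = ["stop-word elimination", "stemming", "lemmatization"]
FEATURE_SELECTION = ["TF-IDF", "Chi-square"]
CLASSIFIERS = ["Naive Bayes", "Support Vector Machine", "Decision Tree"]


def preprocessing_subsets(steps):
    """Return every subset of the steps, including the empty baseline."""
    return [subset
            for size in range(len(steps) + 1)
            for subset in combinations(steps, size)]


def experiment_grid():
    """Pair each pre-processing subset with one selector and one classifier."""
    return list(product(preprocessing_subsets(PREPROCESSING),
                        FEATURE_SELECTION,
                        CLASSIFIERS))


def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(n_docs / docs_with_term)
    return tf * idf


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or category never varies
    return n * (a * d - c * b) ** 2 / denominator


if __name__ == "__main__":
    grid = experiment_grid()
    # 8 pre-processing subsets x 2 selectors x 3 classifiers
    print(len(grid), "configurations under the stated assumptions")
    for steps, selector, classifier in grid[:3]:
        print(steps or ("no pre-processing",), selector, classifier)
```

Under these assumptions the grid holds 48 configurations. A term that occurs in every document of one category and in no other document receives a high Chi-square score, while a term spread evenly across categories scores zero, which is why the statistic is useful for pruning a high-dimensional vocabulary.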


Notes

1. https://ar.wikipedia.org/wiki/قائمة_أسماء_الأسد_في_اللغة_العربية
2. https://www.internetworldstats.com/stats7.htm
3. http://www.httrack.com/
4. https://www.nltk.org/_modules/nltk/stem/arlstem.html [last accessed: January 24, 2021]
5. https://pypi.org/project/Tashaphyne/ [last accessed: January 24, 2021]
6. https://arabicstemmer.com/ [last accessed: January 24, 2021]


Author information

Authors and Affiliations

Faculty of Sciences, Mohammed First University, Oujda, Morocco
Anoual El Kah
L-STI, T-IDMS, Faculty of Sciences and Techniques, Moulay Ismail University, Meknes, Morocco
Imad Zeroual


Editor information

Editors and Affiliations

Information Technology Department, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Faculty of Sciences and Techniques, Hassan 1st University, Settat, Morocco
Abdelkrim Haqiq
School of Medicine, University of Missouri, Columbia, MO, USA
Peter J. Tonellato
ISAE-ENSMA, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche
British University Vietnam, Hung Yen, Vietnam
Sam Goundar
Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
Ahmad Taher Azar
NEST Research Group, ENSEM, Hassan II University of Casablanca, Casablanca, Morocco
Essaid Sabir
High National School for Computer Science and Systems Analysis (ENSIAS), Mohammed V University, Rabat, Morocco
Driss Bouzidi


About this paper

Cite this paper

El Kah, A., Zeroual, I. (2021). Improved Document Categorization Through Feature-Rich Combinations. In: Hassanien, A.E., et al. Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021). AICV 2021. Advances in Intelligent Systems and Computing, vol 1377. Springer, Cham. https://doi.org/10.1007/978-3-030-76346-6_32


DOI: https://doi.org/10.1007/978-3-030-76346-6_32
Published: 29 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76345-9
Online ISBN: 978-3-030-76346-6
eBook Packages: Intelligent Technologies and Robotics (R0)
