Effects of Light Stemming on Feature Extraction and Selection for Arabic Documents Classification

  • Yousif A. Alhaj
  • Mohammed A. A. Al-qaness
  • Abdelghani Dahou
  • Mohamed Abd Elaziz
  • Dongdong Zhao
  • Jianwen Xiang
Part of the Studies in Computational Intelligence book series (SCI, volume 874)


This chapter studies the effects of light stemming on feature extraction for Arabic document classification, where Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) are used to extract features. Feature selection methods such as Chi-square (Chi2), Information Gain (IG), and Singular Value Decomposition (SVD) are used to select the most relevant features. K-Nearest Neighbor (KNN), Logistic Regression (LR), and Support Vector Machine (SVM) classifiers are used to build the classification model. Experiments are conducted on a public dataset collected from Arabic websites, namely the BBC Arabic dataset. The results show that SVM outperforms LR and KNN, and that BoW outperforms TF-IDF when no stemming is applied. Using the Robust Arabic Light Stemmer (ARLStem) as the main light stemmer has a positive effect over the baseline when combined with TF-IDF. With Chi2 as the feature selection technique, SVM achieved an F1-micro of 0.9568 using BoW with 5000 selected features. With IG, SVM achieved an F1-micro of 0.9588 using BoW with 4000 selected features. Finally, with SVD, SVM reached an F1-micro of 0.9569 using BoW with 5000 selected features. These best results were all obtained without stemming.
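The feature extraction and selection steps named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the chapter's implementation: the function names are hypothetical, the toy English tokens stand in for tokenised Arabic text, no stemming is applied, the TF-IDF weighting (raw count times ln(N/df)) is one common variant chosen as an assumption, and the Chi2 score is computed on term presence and taken as the maximum over classes.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc.split()).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by idf = ln(N / df) (assumed variant, no smoothing)."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]


def chi2_scores(matrix, labels):
    """Chi-square of term presence vs. class, maximised over classes."""
    n = len(matrix)
    scores = []
    for j in range(len(matrix[0])):
        best = 0.0
        for cls in set(labels):
            a = b = c = d = 0  # cells of the 2x2 contingency table
            for row, y in zip(matrix, labels):
                present = row[j] > 0
                if present and y == cls:
                    a += 1      # term present, document in class
                elif present:
                    b += 1      # term present, document not in class
                elif y == cls:
                    c += 1      # term absent, document in class
                else:
                    d += 1      # term absent, document not in class
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores.append(best)
    return scores


def select_top_k(vocab, scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus with two classes.
docs = ["goal match team", "team goal win",
        "market price bank", "bank price trade"]
labels = ["sport", "sport", "business", "business"]

vocab, bow = bag_of_words(docs)
weights = tf_idf(bow)
top_terms = select_top_k(vocab, chi2_scores(bow, labels), k=4)
print(top_terms)  # → ['bank', 'goal', 'price', 'team']
```

In this toy corpus the four selected terms are exactly those that occur in every document of one class and in none of the other, which is the behaviour Chi2 selection is meant to reward. The reduced vocabulary would then be passed to a classifier such as SVM.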


Arabic text classification, Feature extraction, Feature selection, Stemming technique



This work was partially supported by the National Natural Science Foundation of China (Grant No. 61672398, 61806151), the Defense Industrial Technology Development Program (Grant No. JCKY2018110C165), and the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012).



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Yousif A. Alhaj (1)
  • Mohammed A. A. Al-qaness (2)
  • Abdelghani Dahou (1)
  • Mohamed Abd Elaziz (3), corresponding author
  • Dongdong Zhao (1)
  • Jianwen Xiang (1), corresponding author
  1. Hubei Key Laboratory of Transportation of Internet of Things, School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
  2. School of Computer Science, Wuhan University, Wuhan, China
  3. Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt
