Improving Arabic Lemmatization Through a Lemmas Database and a Machine-Learning Technique

  • Driss Namly
  • Karim Bouzoubaa
  • Abdelhamid El Jihad
  • Si Lhoussain Aouragh
Part of the Studies in Computational Intelligence book series (SCI, volume 874)


Lemmatization is a key preprocessing step and an important component of many natural language applications. For Arabic, lemmatization is a complex task because of the richness of Arabic morphology. In this paper, we present a new lemmatizer that combines a lexicon-based approach with a machine-learning-based approach to determine the lemma. The lexicon-based step provides context-free lemmatization, and a Hidden Markov Model then selects the most appropriate lemma according to the sentence context. In our evaluations, the lemmatizer achieves an accuracy of over 91%, outperforming state-of-the-art Arabic lemmatizers.
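The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the lemma-bigram counts, the add-one smoothing, the constant emission score, and the names `LEXICON`, `BIGRAMS`, `log_transition` and `viterbi_lemmatize` are all assumptions made for illustration, and English words stand in for Arabic ones. A lexicon lookup proposes context-free candidate lemmas for each word, and Viterbi decoding over lemma-to-lemma transitions keeps the most probable sequence.

```python
import math

# Hypothetical toy lexicon: surface word -> context-free candidate lemmas.
LEXICON = {
    "he": ["he"],
    "saw": ["see", "saw"],
    "leaves": ["leaf", "leave"],
}

# Hypothetical lemma-bigram counts, as if taken from a lemma-annotated corpus.
BIGRAMS = {
    ("<s>", "he"): 6,
    ("he", "see"): 5, ("he", "saw"): 1,
    ("see", "leaf"): 4, ("see", "leave"): 1,
    ("saw", "leaf"): 1, ("saw", "leave"): 1,
}

# All lemmas known to the lexicon; its size is used for smoothing.
VOCAB = sorted({lemma for cands in LEXICON.values() for lemma in cands})


def log_transition(prev, cur):
    """Log P(cur | prev) with add-one smoothing (an assumed smoothing choice)."""
    total = sum(n for (p, _), n in BIGRAMS.items() if p == prev)
    return math.log((BIGRAMS.get((prev, cur), 0) + 1) / (total + len(VOCAB)))


def viterbi_lemmatize(words):
    """Return the most probable lemma sequence for a list of words.

    The lexicon acts as a hard filter on the hidden states, so the emission
    score is treated as constant and only transitions are scored.
    """
    if not words:
        return []
    # best[lemma] = (log-probability of the best path ending in lemma, path)
    best = {"<s>": (0.0, [])}
    for word in words:
        # Unknown words fall back to themselves as their only candidate.
        candidates = LEXICON.get(word, [word])
        step = {}
        for cur in candidates:
            score, path = max(
                ((s + log_transition(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]
```

On this toy data, `viterbi_lemmatize(["he", "saw", "leaves"])` returns `["he", "see", "leaf"]`: both ambiguous words are resolved by the transition scores rather than by the lexicon alone.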


Keywords: Arabic NLP; Arabic lemmatization; Lexicon-based lemmatization; Machine-learning-based lemmatization; Hidden Markov model; Viterbi algorithm



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Driss Namly (1, corresponding author)
  • Karim Bouzoubaa (1)
  • Abdelhamid El Jihad (2)
  • Si Lhoussain Aouragh (3)

  1. Mohammadia School of Engineers, Mohammed V University, Rabat, Morocco
  2. Institute of Arabization Studies and Research, Mohammed V University, Rabat, Morocco
  3. Faculty of Legal, Economic and Social Sciences - Sale, Mohammed V University, Rabat, Morocco
