
A hybrid approach for Arabic lemmatization

  • Mohamed Boudchiche
  • Azzeddine Mazroui

Abstract

In this article we present an Arabic lemmatizer that assigns a single lemma to each word of an Arabic sentence, taking the word's context into account. The proposed system comprises two modules. The first performs an out-of-context analysis based on the morphosyntactic analyser Alkhalil Morpho Sys 2, which returns the potential lemmas of each word. The second uses the context to select the correct lemma from among these potential lemmas. For this purpose, we use a statistical technique based on hidden Markov models, in which the words of the sentence are the observations and the lemmas are the hidden states. We validate this approach on a labelled corpus of about 500,000 words. The lemmatizer gives the correct lemma for more than 99.24% of the words in the training set and for about 94.45% of the words in the test set.
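The second module can be illustrated with a small decoding sketch. It is not the authors' implementation: it is a minimal Viterbi decoder, assuming that each word comes with a non-empty list of candidate lemmas (standing in for the analyser's output) and that transition, emission and start probabilities have already been estimated from a labelled corpus. The function name `viterbi_lemmas`, the transliterated toy tokens, all probability values and the `floor` fallback (a crude substitute for proper smoothing) are hypothetical and chosen only for illustration.

```python
import math


def viterbi_lemmas(words, candidates, trans, emit, start, floor=1e-6):
    """Pick one lemma per word by Viterbi decoding over candidate lemmas.

    words      : list of observed word forms
    candidates : dict word -> non-empty list of potential lemmas
    trans      : dict (prev_lemma, lemma) -> P(lemma | prev_lemma)
    emit       : dict (lemma, word) -> P(word | lemma)
    start      : dict lemma -> P(lemma opens the sentence)
    floor      : probability used for unseen events (assumed, not from the paper)
    """
    if not words:
        return []

    def logp(p):
        return math.log(p if p > 0 else floor)

    # best[i][lemma] = (log-score of the best path ending in lemma, back-pointer)
    best = [{}]
    for lem in candidates[words[0]]:
        score = logp(start.get(lem, floor)) + logp(emit.get((lem, words[0]), floor))
        best[0][lem] = (score, None)

    for i in range(1, len(words)):
        layer = {}
        for lem in candidates[words[i]]:
            e = logp(emit.get((lem, words[i]), floor))
            # Best predecessor among the previous word's candidate lemmas.
            prev, score = max(
                ((p, s + logp(trans.get((p, lem), floor)))
                 for p, (s, _) in best[i - 1].items()),
                key=lambda t: t[1],
            )
            layer[lem] = (score + e, prev)
        best.append(layer)

    # Backtrack from the best final lemma.
    last = max(best[-1], key=lambda l: best[-1][l][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy data (invented numbers): the undiacritized form "ktb" is ambiguous
# between a verb lemma and a noun lemma; "fy" is a preposition.
toy_candidates = {"fy": ["fiy"], "ktb": ["kataba", "kitAb"]}
toy_trans = {("fiy", "kitAb"): 0.6, ("fiy", "kataba"): 0.05}
toy_emit = {("fiy", "fy"): 0.9, ("kataba", "ktb"): 0.5, ("kitAb", "ktb"): 0.2}
toy_start = {"fiy": 0.1}

print(viterbi_lemmas(["fy", "ktb"], toy_candidates, toy_trans, toy_emit, toy_start))
# -> ['fiy', 'kitAb']
```

In this toy case the verb lemma has the higher emission probability for "ktb", but the transition from the preposition favours the noun lemma, so the context decides. Restricting the states at each position to the candidate lemmas keeps the search space small compared with decoding over the full lemma inventory.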

Keywords

  • Arabic natural language processing
  • Lemmatization
  • Morphological analyser
  • Hidden Markov model
  • Viterbi algorithm

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Mathematics and Computer Science, Faculty of Sciences, Mohammed First University, Oujda, Morocco
