Restoring Arabic vowels through omission-tolerant dictionary lookup

تشْكيل الكَلِمات عَبْرَ مَوارد حاسوبيّة
  • Alexis Amid Neme
  • Sébastien Paumier
Original Paper

Abstract

Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is, depending on the author, genre and field, fewer than 3 percent of words include any explicit vowel. Although numerous studies have been published on restoring omitted vowels in speech technologies, little attention has been given to this problem in work on written Arabic technologies. In this research, we present Arabic-Unitex, an Arabic language resource, with emphasis on vowel representation and encoding. Specifically, we present two dozen rules that formalize a detailed description of vowel omission in written text. These typographical rules are integrated into large-coverage resources for morphological annotation. For vowel restoration, our resources can identify words in which no vowels are shown, as well as words in which the vowels are partially or fully included. By taking these rules into account, our resources compute, for each word form, a list of compatible fully vowelized candidates through omission-tolerant dictionary lookup. In our previous studies, we proposed a straightforward encoding of taxonomy for verbs (Neme in Proceedings of the international workshop on lexical resources (WoLeR) at ESSLLI, 2011) and broken plurals (Neme and Laporte in Lang Sci, 2013, http://dx.doi.org/10.1016/j.langsci.2013.06.002). While traditional morphology is based on derivational rules, our description is based on inflectional ones. The breakthrough lies in reversing the traditional root-and-pattern Semitic model into a pattern-and-root model, giving precedence to patterns over roots. The lexicon is built and updated manually and contains 76,000 fully vowelized lemmas. It is then inflected by means of finite-state transducers (FSTs), generating 6 million forms. The coverage of these inflected forms is extended by formalized grammars, which accurately describe agglutinations around a core verb, noun, adjective or preposition. A laptop needs one minute to generate the 6 million inflected forms in a 340-MB flat file, which is then compressed in 2 min into 11 MB for fast retrieval. Our program analyzes running text at 5000 words per second (20 pages per second). Based on these comprehensive linguistic resources, we created a spell checker that detects any invalid or misplaced vowel in a fully or partially vowelized form. Finally, our resources provide a lexical coverage of more than 99 percent of the words used in popular newspapers, and restore vowels in words (out of context) simply and efficiently.
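
The idea of omission-tolerant lookup can be illustrated with a minimal Python sketch. It is not the Arabic-Unitex implementation: it replaces the two dozen omission rules with the single simplifying assumption that any diacritic may be omitted, and it uses a three-form toy lexicon instead of the 6 million inflected forms. All function names (`strip_diacritics`, `segment`, `compatible`, `build_index`, `restore`) are hypothetical. A form is treated as a compatible candidate when it has the same letters as the input and carries every diacritic the input shows, in the same position.

```python
from collections import defaultdict

# Assumption: the optional marks are the Arabic diacritics U+064B..U+0652
# (tanwin, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the bare letter skeleton of a word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def segment(word):
    """Split a word into (letter, set of diacritics written on it) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            letter, marks = units[-1]
            units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def compatible(observed, candidate):
    """True if candidate has the same letters and every mark shown in observed."""
    obs, cand = segment(observed), segment(candidate)
    if len(obs) != len(cand):
        return False
    return all(ol == cl and om <= cm for (ol, om), (cl, cm) in zip(obs, cand))


def build_index(vowelized_forms):
    """Index fully vowelized forms by their unvowelized skeleton."""
    index = defaultdict(list)
    for form in vowelized_forms:
        index[strip_diacritics(form)].append(form)
    return index


def restore(word, index):
    """List the fully vowelized candidates compatible with a word."""
    return [f for f in index.get(strip_diacritics(word), []) if compatible(word, f)]


# Toy lexicon, written with escapes so the code points are unambiguous.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kutiba, "it was written"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"
INDEX = build_index([KATABA, KUTIBA, KUTUB])

unvowelized = "\u0643\u062A\u0628"       # ktb, no vowel shown
partial = "\u0643\u064F\u062A\u0628"     # damma on the first letter only
misvowelized = "\u0643\u0650\u062A\u0628"  # kasra on the first letter

print(restore(unvowelized, INDEX))   # all three forms
print(restore(partial, INDEX))       # kutiba and kutub only
print(restore(misvowelized, INDEX))  # no candidate
```

In this sketch, an empty candidate list for a partially vowelized input whose skeleton is in the index plays the role of the spell-checker signal for an invalid or misplaced vowel. A production system would store the index in a compressed automaton rather than a Python dictionary.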

Keywords

Arabic language, Arabic language resources, Arabic NLP, Semitic morphology, Root-and-pattern model, Pattern-and-root model, Finite state transducers, Local grammars, Compression algorithm, Vocalisation, Vowelization

References

  1. Abandah, G. A., Graves, A., Al-Shagoor, B., Arabiyat, A., Jamour, F., & Al-Taee, M. (2015). Automatic diacritization of Arabic text using recurrent neural networks. International Journal on Document Analysis and Recognition, 18, 183–197. https://doi.org/10.1007/s10032-015-0242-2.
  2. Abdel-Nour, J. (2006). Dictionnaire Abdel-Nour al-Mufassal Arabe-Français (10th ed., p. 2034). Beirut: Dar El-Ilm Lil-Malayin.
  3. Al-Bawab, M., Mrayati, M., Alam, Y. M., & Al-Tayyan, M. H. (1994). A computerized morpho-syntactic system of Arabic. The Arabian Journal of Science and Engineering, 19, 461–480.
  4. Al-Ghalāyini, M. (2007). “Jāmi3 al-durūs al-’arabiyah” (A university grammar textbook). 1st edition 1912 (Vol. 3, p. 570). Beirut: Dar El-Ilm Lil-Malayin.
  5. Al-Jar, K. (1973). Al-mu’jam al-’arabiy al-Hadith. Larousse. 1973, 53,500 lexical entries. 1300 pages, 2 columns. (in Arabic).
  6. Altantawy, M., Habash, N., & Rambow, O. (2011). Fast yet rich morphological analysis. In Proceedings of the 9th international workshop on finite state methods and natural language processing (FSMNLP) (pp. 116–124).
  7. Altantawy, M., Habash, N., Rambow, O., & Saleh, I. (2010). Morphological analysis and generation of Arabic nouns: A morphemic functional approach. In Proceedings of the language resource and evaluation conference (LREC), Malta (pp. 851–858).
  8. Attia, M. (2006). An ambiguity-controlled morphological analyzer for modern standard Arabic modelling finite state networks. In Challenges of Arabic for NLP/MT conference. London: The British Computer Society.
  9. Attia, M., Pecina, P., Tounsi, L., Toral, A., & van Genabith, J. (2011). An open-source finite state morphological transducer for modern standard Arabic. In International workshop on finite state methods and natural language processing (FSMNLP), Blois.
  10. Attia, M., Pecina, P., Samih, Y., Shaalan, K., & van Genabith, J. (2015). Arabic spelling error detection and correction. Natural Language Engineering. https://doi.org/10.1017/S1351324915000030.
  11. Azmi, A., & Almajed, R. S. (2015). A survey of automatic Arabic diacritization techniques. Natural Language Engineering, 21, 477–495. https://doi.org/10.1017/S1351324913000284.
  12. Bebah, M., Chennoufi, A., Mazroui, A., & Lakhouaja, A. (2014). Hybrid approaches for automatic vowelization of Arabic texts. International Journal on Natural Language Computing, 3, 53–71. https://doi.org/10.5121/ijnlc.2014.3404.
  13. Beesley, K. (2005). Xerox Arabic morphological analysis and generation romanization, transcription and transliteration.
  14. Ben Mesmia, F., Friburger, N., Haddar, K., & Maurel, D. (2015). Arabic named entity recognition process using transducer cascade and Arabic Wikipedia. In Proceedings of recent advances in natural language processing, Hissar (pp. 48–54).
  15. Boudchiche, M., Mazroui, A., Bebah, M. O. A. O., Lakhouaja, A., & Boudlal, A. (2014). L’Analyseur Morphosyntaxique AlKhalil, Morpho Sys 2. https://doi.org/10.13140/rg.2.1.4280.0085.
  16. Boudchiche, M., Mazroui, A., Bebah, M. O. A. O., Lakhouaja, A., & Boudlal, A. (2016). AlKhalil Morpho Sys 2: A robust Arabic morpho-syntactic analyzer. Journal of King Saud University-Computer and Information Sciences, 12, 3. https://doi.org/10.1016/j.jksuci.2016.05.002.
  17. Boudlal, A., Lakhouaja, A., Mazroui, A., Meziane, A., Bebah, M. O. A. O., & Shoul, M. (2010). Alkhalil Morpho SYS1: A morphosyntactic analysis system for arabic texts. In International Arab conference on information technology, Benghazi (pp. 1–6).
  18. Buckwalter, T. (2004). Buckwalter Arabic morphological analyser version 2.0. Linguistic Data Consortium (LDC) Catalog Number LDC2004L02. ISBN 1-58563-324-0.
  19. Buckwalter, T. (2007). Issues in Arabic morphological analysis. In A. van den Bosch & A. Soudi (Eds.), Arabic computational morphology. Knowledge-based and empirical methods. Text, speech and language technology (Vol. 38, pp. 23–41). Berlin: Springer.
  20. Buckwalter Arabic Morphological Analyzer Version 1.0. (2002). LDC Catalog No.: LDC2002L49.
  21. Buckwalter Arabic Morphological Analyzer Version 2.0. (2004). LDC Catalog No.: LDC2004L02.
  22. Chennoufi, A., & Mazroui, A. (2016). Morphological, syntactic and diacritics rules for automatic diacritization of Arabic sentences. Journal of King Saud University - Computer and Information Sciences, 29, 156–163.
  23. Debili, F., Achour, H., & Souissi, E. (2002). La langue arabe et l’ordinateur: de l’étiquetage grammatical à la voyellation automatique. In Correspondances de l’IRMC, No. 71, Tunis.
  24. Elshafei, M., Al-Muhtaseb, H., & Alghamdi, M. (2006). Machine generation of Arabic diacritical marks. In The 2006 world congress in computer science computer engineering, and applied computing, Las Vegas, USA (pp. 128–133).
  25. Fairon, C., Paumier, S., & Watrin, P. (2005). Can we parse without tagging? In 2nd Language & technology conference (LTC’05), 2005, Poznan (pp. 473–477).
  26. Habash, N. (2010). Introduction to Arabic natural language processing. Synthesis lectures on human language technologies. San Rafael: Morgan & Claypool. https://doi.org/10.2200/S00277ED1V01Y201008HLT010.
  27. Habash, N., & Rambow, O. (2005). Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the conference of the American Association for Computational Linguistics, New York.
  28. Habash, N., & Rambow, O. (2006). MAGEAD: A morphological analyzer and generator for the Arabic dialects. In Proceedings of the international conference on computational linguistics and annual meeting of the Association for Computational Linguistics (COLING-ACL), Sydney (pp. 681–688).
  29. Habash, N., & Rambow, O. (2007). Arabic diacritization through full morphological tagging. In Proceedings of the North American chapter of the Association for Computational Linguistics (NAACL), Rochester, New York.
  30. Hamdi, A. (2012). Apport de la diacritisation dans l’analyse morphosyntaxique de l’arabe (pp. 247–254).
  31. Hamed, O., & Zesch, T. (2017). A survey and comparative study of Arabic diacritization tools. In JLCL (Vol. 32, No. 1). http://jlcl.org/content/5-allissues/1-Heft1-2017/Heft1-2017.pdf.
  32. Krstev, C., Stanković, R., & Vitas, D. (2018). Knowledge and rule-based diacritic restoration in Serbian. In Proceedings of the 3rd international conference computational linguistics in Bulgaria, CLIB-2018 (pp. 41–51).
  33. Maamouri, M., et al. (2010). LDC standard Arabic morphological analyzer (SAMA) Version 3.1 LDC2010L01. Web Download. Philadelphia: Linguistic Data Consortium.
  34. Maamouri, M., Bies, A., & Kulick, S. (2006). Diacritization: A challenge to Arabic treebank annotation and parsing. In Proceedings of the British computer society Arabic NLP/MT conference, London.
  35. Maamouri, M., Kulick, S., & Bies, A. (2008). Diacritic annotation in the Arabic treebank and its impact on parser evaluation. In Proceedings of the 6th international conference on language resources and evaluation (LREC 2008), Marrakech, May 28–30, 2008.
  36. Mubarak, H., & Darwish, K. (2014). Automatic correction of Arabic text: A cascaded approach. In Proceedings of the EMNLP 2014 workshop on Arabic natural language processing (NLP).
  37. Neme, A. (2011). A lexicon of Arabic verbs constructed on the basis of Semitic taxonomy and using finite-state transducers. In Proceedings of the international workshop on lexical resources (WoLeR) at ESSLLI.
  38. Neme, A. A. (2014). Why Microsoft Arabic spell checker is ineffective. Linguistica Communication. Arabic Language in Information Technology, 16, 55. http://www.al-erfan.com/.
  39. Neme, A., & Laporte, É. (2013). Pattern-and-root inflectional morphology: The Arabic broken plural. Language Sciences. https://doi.org/10.1016/j.langsci.2013.06.002.
  40. Neuhoff, D. (1975). The Viterbi algorithm as an aid in text recognition (Corresp.). IEEE Transactions on Information Theory, 21, 222–226. https://doi.org/10.1109/TIT.1975.1055355.
  41. Pasha, A., Al-Badrashiny, M., Diab, M., El Kholy, A., Eskander, R., Habash, N., Pooleery, M., Rambow, O., & Roth, R. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of the 9th international conference on language resources and evaluation LREC, May 2014.
  42. Revuz, D. (1992). Minimization of acyclic deterministic automata in linear time. Theoretical Computer Science, 92(1), 181–189.
  43. Sak, H., Güngör, T., & Saraçlar, M. (2011). Resources for Turkish morphological processing. Language Resources & Evaluation, 45, 249. https://doi.org/10.1007/s10579-010-9128-6.
  44. Smrz, O. (2007). ElixirFM—Implementation of functional Arabic morphology. In Computational approaches to Semitic languages, ACL 2007, Prague.
  45. Traboulsi, H. (2009). Arabic named entity extraction: A local grammar-based approach. In Proceedings of the international multi-conference on computer science and information technology (IMCSIT 2009), Mragowo (pp. 139–143).
  46. Wintner, S. (2008). Strengths and weaknesses of finite-state technology: a case study morphological grammar development. Natural Language Engineering, 14(4), 457–469. https://doi.org/10.1017/S1351324907004676.
  47. Zalmout, N., & Habash, N. (2017). Don’t throw those morphological analyzers away just yet: Neural morphological disambiguation for Arabic. In Proceedings of conference on empirical methods in natural language processing, Copenhagen, September, 2017 (pp. 715–724).
  48. Zitouni, I., Sorensen, J. S., & Sarikaya, R. (2006). Maximum entropy based restoration of Arabic diacritics. In Proceedings of ACL’06.

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. LIGM, UPEM, CNRS, ENPC, ESIEE, Université Paris-Est, Marne-la-Vallée, France
