Tharawat: A Vision for a Comprehensive Resource for Arabic Computational Processing

  • Mona Diab
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9041)


In this paper, we present a vision for a comprehensive, unified lexical resource for the computational processing of Arabic, covering as many of its variants as possible. We review the current state of three existing resources and then propose a method to link and augment them in a manner that would render them even more useful for natural language processing, whether targeting enabling technologies such as part-of-speech tagging or parsing, or applications such as machine translation or information extraction. The unified lexical resource, Tharawat (meaning "treasures"), is an extension of our core resource Tharwa, a three-way computational lexicon of Dialectal Arabic, Modern Standard Arabic, and English lemma correspondents. Tharawat will incorporate two other current resources: SANA, our Arabic sentiment lexicon, and MuSTalAHAt, our multiword expression (MWE) counterpart of Tharwa, which lists MWEs and their correspondents rather than lemmas and their correspondents. Moreover, we present a roadmap for linking Tharawat to existing English resources and corpora, leveraging advanced machine learning techniques and crowdsourcing methods. Such resources are at the core of NLP technologies, and we believe that a resource of this kind could lead to significant advances in Arabic NLP. Possessing such resources for a language such as Arabic could be quite impactful for the development of advanced scientific material and hence lead to an Arabic scientific and economic revolution.
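As a rough illustration of the linking idea described above, the following minimal Python sketch shows one way a three-way entry (dialectal, MSA, English) might carry an optional sentiment label and an MWE flag in a single record. It is not the actual Tharwa, SANA, or MuSTalAHAt schema: the class names, field names, transliterations, and the sample entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical three-way correspondence (not the real Tharwa schema)."""
    dialectal: str        # dialectal Arabic form, romanized for readability
    msa: str              # Modern Standard Arabic correspondent
    english: str          # English correspondent
    is_mwe: bool = False  # True for a multiword expression, False for a lemma


class UnifiedLexicon:
    """Toy container that links entries with optional sentiment labels."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}
        self._polarity: Dict[str, str] = {}

    def add(self, entry: Entry, polarity: Optional[str] = None) -> None:
        # Index by the dialectal form; a real resource would need richer keys
        # (part of speech, dialect, sense) to handle ambiguity.
        self._entries[entry.dialectal] = entry
        if polarity is not None:
            self._polarity[entry.dialectal] = polarity

    def lookup(self, dialectal: str) -> Optional[dict]:
        entry = self._entries.get(dialectal)
        if entry is None:
            return None
        return {
            "msa": entry.msa,
            "english": entry.english,
            "is_mwe": entry.is_mwe,
            "polarity": self._polarity.get(dialectal),
        }


# Illustrative entries only; they are not taken from any of the resources.
lex = UnifiedLexicon()
lex.add(Entry("Hlw", "jamiyl", "nice"), polarity="positive")
lex.add(Entry("SbAH Alxyr", "SabAH Alxayr", "good morning", is_mwe=True))
print(lex.lookup("Hlw"))
```

Keying everything on one shared identifier (here, the dialectal form) is the simplest way to join separate resources; it is a design assumption of this sketch, not a claim about how Tharawat links its components.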


Machine Translation, Sentiment Analysis, Parallel Corpus, Lexical Resource, Modern Standard Arabic
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Computer Science, The George Washington University, Washington, USA
