Language Resources and Evaluation, Volume 51, Issue 3, pp 745–775

Curras: an annotated corpus for the Palestinian Arabic dialect

  • Mustafa Jarrar
  • Nizar Habash
  • Faeq Alrimawi
  • Diyam Akra
  • Nasser Zalmout
Original Paper

Abstract

In this article we present Curras, the first morphologically annotated corpus of the Palestinian Arabic dialect. Palestinian Arabic is one of the many primarily spoken dialects of the Arabic language. Arabic dialects are generally under-resourced compared to Modern Standard Arabic, the primarily written and official form of Arabic. We begin with background that situates Palestinian Arabic linguistically and historically and compares it to Modern Standard Arabic and Egyptian Arabic in terms of phonological, morphological, orthographic, and lexical variations. We then describe the methodology we developed to collect Palestinian Arabic text to guarantee a variety of representative domains and genres. We also discuss the annotation process we used, which extended previous efforts for annotation guideline development, and utilized existing automatic annotation solutions for Standard Arabic and Egyptian Arabic. The annotation guidelines and annotation meta-data are described in detail. The Curras Palestinian Arabic corpus consists of more than 56K tokens, which are annotated with rich morphological and lexical features. The inter-annotator agreement results indicate a high degree of consistency.

Keywords

Palestinian Arabic; Palestinian corpus; Arabic morphology; Conventional Orthography for Dialectal Arabic; Dialectal Arabic; Word annotation

Notes

Acknowledgments

This work is part of our ongoing Curras project, funded by the Palestinian Ministry of Higher Education, Scientific Research Council. We wish to thank Owen Rambow, Ramy Eskander and Faisal Al-Shargi for their support with DIWAN and MADAMIRA. We would also like to thank Rami Asia for developing the Curras portal, Bahya Mustafa and Mohammad Dwaikat for their support during the annotation process, and Mahdi Arar for helpful conversations and fruitful discussions in the early stages of this work. Last but not least, we would like to thank the “Watan Aa Watar” actors for their support and for providing us with the scripts of their TV show.

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Mustafa Jarrar (1)
  • Nizar Habash (2)
  • Faeq Alrimawi (1)
  • Diyam Akra (1)
  • Nasser Zalmout (2)

  1. Birzeit University, Birzeit, Palestine
  2. New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
