Supervised collaboration for syntactic annotation of Quranic Arabic

  • Original Paper
Language Resources and Evaluation

Abstract

The Quranic Arabic Corpus (http://corpus.quran.com) is a collaboratively constructed linguistic resource initiated at the University of Leeds, with multiple layers of annotation including part-of-speech tagging, morphological segmentation (Dukes and Habash 2010) and syntactic analysis using dependency grammar (Dukes and Buckwalter 2010). The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400-year-old central religious text of Islam. This project contrasts with other Arabic treebanks by providing a deep linguistic model based on the historical traditional grammar known as i′rāb (إعراب). By adapting this well-known canon of Quranic grammar into a familiar tagset, it is possible to encourage online annotation by Arabic linguists and Quranic experts. This article presents a new approach to linguistic annotation of an Arabic corpus: online supervised collaboration using a multi-stage approach. The stages are automatic rule-based tagging, initial manual verification, and online supervised collaborative proofreading. A popular website attracting thousands of visitors per day, the Quranic Arabic Corpus has approximately 100 unpaid volunteer annotators, each suggesting corrections to existing linguistic tagging. To ensure a high-quality resource, a small number of expert annotators are promoted to a supervisory role, allowing them to review or veto suggestions made by other collaborators. The Quran also benefits from a large body of existing historical grammatical analysis, which may be leveraged during this review. In this paper we evaluate and report on the effectiveness of the chosen annotation methodology. We also discuss the unique challenges of annotating Quranic Arabic online and describe the custom linguistic software used to aid collaborative annotation.
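
The supervised proofreading stage described in the abstract can be pictured with a minimal Python sketch. It is not the project's software: the class names, the token identifier format, and the example tags are all assumptions made for illustration. It only shows the idea that any volunteer may suggest a correction, while the published tag changes only when a supervisor approves it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    VETOED = "vetoed"


@dataclass
class Suggestion:
    """A volunteer's proposed correction to one token's tag (hypothetical model)."""
    token_id: str        # illustrative key for a token; the real scheme may differ
    proposed_tag: str
    author: str
    status: Status = Status.PENDING


@dataclass
class Corpus:
    tags: Dict[str, str]                                   # token_id -> current tag
    supervisors: Set[str] = field(default_factory=set)     # users allowed to review
    queue: List[Suggestion] = field(default_factory=list)  # all suggestions made

    def suggest(self, token_id: str, proposed_tag: str, author: str) -> Suggestion:
        # Any collaborator may propose a change; the published tag is untouched.
        suggestion = Suggestion(token_id, proposed_tag, author)
        self.queue.append(suggestion)
        return suggestion

    def review(self, suggestion: Suggestion, reviewer: str, accept: bool) -> Status:
        # Only supervisors may approve or veto a pending suggestion.
        if reviewer not in self.supervisors:
            raise PermissionError(f"{reviewer} is not a supervisor")
        if suggestion.status is not Status.PENDING:
            raise ValueError("suggestion has already been reviewed")
        if accept:
            self.tags[suggestion.token_id] = suggestion.proposed_tag
            suggestion.status = Status.APPROVED
        else:
            suggestion.status = Status.VETOED
        return suggestion.status


if __name__ == "__main__":
    corpus = Corpus(tags={"token-1": "N"}, supervisors={"supervisor_a"})
    s = corpus.suggest("token-1", "PN", author="volunteer_b")
    print(corpus.review(s, "supervisor_a", accept=True), corpus.tags)
```

Keeping suggestions separate from the published tags is one simple way to model a veto: a rejected suggestion is recorded but never reaches the corpus.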

Notes

  1. comp-quran mail archive: http://www.mail-archive.com/comp-quran@comp.leeds.ac.uk.

  2. Available online: http://www.archive.org/download/imkam12.

  3. http://corpus.quran.com/messageboard.jsp.

  4. This functional role from traditional i'rāb, along with related syntactic dependencies, is described further in the online annotation guidelines: http://corpus.quran.com/documentation/circumstantialaccusative.jsp.

  5. Annotation guidelines for gender tagging: http://corpus.quran.com/documentation/gender.jsp.

References

  • Abbas, N. (2009). Qurany: A tool to search for concepts in the Quran. MSc Research Thesis, School of Computing, University of Leeds.

  • Al-Mubarakpuri, S. (2003). Tafsir Ibn Kathir. Riyadh: Darussalam Publishers.

  • Al-Saif, A., & Markert, K. (2010). The Leeds Arabic discourse treebank: Annotating discourse connectives for Arabic. In Language resources and evaluation conference (LREC). Valletta, Malta.

  • Amara, N., & Bouslama, F. (2005). Classification of Arabic script using multiple sources of information: State of the art and perspectives. International Journal on Document Analysis and Recognition, 5(4), 195–212.

  • Ansari, H. (2000). Learning the language of the Quran. New Delhi: MMI Publishers.

  • Atwell, E. (2008). Development of tagsets for part-of-speech tagging. In Corpus linguistics: An international handbook. Mouton de Gruyter.

  • Atwell, E., Dukes, K., Sharaf, A., Habash, N., Louw, B., Abu Shawar, B., et al. (2010). Understanding the Quran: A new grand challenge for computer science and artificial intelligence. Edinburgh, Scotland: ACM/BCS Visions of Computer Science.

  • Bamman, D., Mambrini, F., & Crane, G. (2009). An ownership model of annotation: The Ancient Greek dependency treebank. In Proceedings of the eighth international workshop on treebanks and linguistic theories, Milan.

  • Bies, A., Ferguson, M., Katz, K., & MacIntyre, R. (1995). Bracketing guidelines for Treebank II style, Penn Treebank project. Philadelphia: University of Pennsylvania.

  • Bies, A., & Maamouri, M. (2003). Penn Arabic treebank guidelines. http://www.ircs.upenn.edu/arabic.

  • Böhm, K., & Daub, E. (2008). Geographical analysis of hierarchical business structures by interactive drill down. In Proceedings of the 16th ACM SIGSPATIAL international conference on advances in geographic information. Irvine, California.

  • Bow, C., Hughes, B., & Bird, S. (2003). Towards a general model of interlinear text. In Proceedings of EMELD workshop.

  • Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.

  • Chamberlain, J., Kruschwitz, U., & Poesio, M. (2009). Constructing an anaphorically annotated corpus with non-experts: Assessing the quality of collaborative annotations. In Proceedings of the 2009 workshop on the people’s web meets NLP: Collaboratively constructed semantic resources.

  • Dror, J., Shaharabani, D., Talmon, R., & Wintner, S. (2004). Morphological analysis of the Qur’an. Literary and Linguistic Computing, 19(4), 431–452.

  • Dukes, K., & Buckwalter, T. (2010). A dependency treebank of the Quran using traditional Arabic grammar. In Proceedings of the 7th international conference on informatics and systems (INFOS). Cairo, Egypt.

  • Dukes, K., Atwell, E., & Sharaf, A. M. (2010). Syntactic annotation guidelines for the Quranic Arabic treebank. In Language resources and evaluation conference (LREC). Valletta, Malta.

  • Dukes, K., & Habash, N. (2010). Morphological annotation of Quranic Arabic. In Language resources and evaluation conference (LREC). Valletta, Malta.

  • Gasser, M. (2010). A dependency grammar for Amharic. In Workshop on language resources and human language technologies for Semitic languages, language resources and evaluation conference (LREC). Valletta, Malta.

  • Habash, N. (2007). Arabic morphological representations for machine translation. Arabic Computational Morphology: Knowledge-Based and Empirical Methods (pp. 263–285). Springer.

  • Habash, N. (2010). Introduction to Arabic natural language processing. In G. Hirst (Ed.), Synthesis lectures on human language technologies. California: Morgan & Claypool Publishers.

  • Habash, N., Faraj, R., & Roth, R. (2009a). Syntactic annotation in the Columbia Arabic treebank. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR). Cairo, Egypt.

  • Habash, N., Gabbard, R., Rambow, O., Kulick, S., & Marcus, M. (2007). Determining case in Arabic: Learning complex linguistic behavior requires complex linguistic features. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), Prague, Czech Republic.

  • Habash, N., & Rambow, O. (2007). Arabic diacritization through full morphological tagging. In Proceedings of the North American chapter of the association for computational linguistics (NAACL). Rochester, New York.

  • Habash, N., Rambow, O., & Roth, R. (2009b). MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt.

  • Habash, N., & Roth, R. (2009). CATiB: The Columbia Arabic treebank. In Proceedings of ACL’09. Suntec, Singapore.

  • Hajič, J., Smrž, O., Zemanek, P., Snaidauf, J., & Beska, E. (2004). Prague Arabic dependency treebank: development in data and tools. In Proceedings of the NEMLAR international conference on Arabic language resources and tools.

  • Jones, A. (2005). Arabic through the Qur’an. Islamic Texts Society.

  • Kittur, A., & Kraut, R. (2010). Beyond Wikipedia: Coordination and conflict in online production groups. In Proceedings of the 2010 ACM conference on computer supported cooperative work. Savannah, Georgia, USA.

  • Kruijff, G. (2006). Dependency grammar. The encyclopedia of language and linguistics (2nd Ed). Amsterdam: Elsevier.

  • Lane, E. (1992). Arabic-English lexicon. Islamic Texts Society.

  • Maamouri, M., Bies, A., & Buckwalter, T. (2004). The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In NEMLAR conference on Arabic language resources and tools. Cairo, Egypt.

  • Mace, J. (2007). Arabic verbs. Bennett & Bloom.

  • Muhammad, E. (2007). From the treasures of Arabic morphology. Karachi: Zam Zam Publishers.

  • Nowak, S., & Rüger, S. (2010). How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the international conference on multimedia information retrieval. Philadelphia, Pennsylvania.

  • Owens, J. (1988). The foundations of grammar: An introduction to medieval Arabic grammatical theory. Amsterdam and Philadelphia: John Benjamins Publishers.

  • Pietersma, A. (2002). A new paradigm for addressing old questions: The relevance of the interlinear model for the study of the Septuagint. In Bible and computer: The Stellenbosch AIBI-6 conference.

  • Salih, B. (2007). al-i′rāb al-mufassal li-kitāb allāh al-murattal (“A detailed grammatical analysis of the recited Quran using i′rāb”). Beirut: Dar Al-Fikr.

  • Sawalha, M., & Atwell, E. (2010). Fine-grain morphological analyzer and part-of-speech tagger for Arabic text. Language resources and evaluation conference (LREC). Valletta, Malta.

  • Smrž, O. (2007). Functional Arabic morphology: Formal system and implementation. PhD Thesis, Charles University, Prague, Czech Republic.

  • Smrž, O., & Hajič, J. (2006). The other Arabic treebank: Prague dependencies and functions. Arabic computational linguistics: Current implementations. CSLI Publications.

  • Snow, R., O’Connor, B., Jurafsky, D., & Ng, A. Y. (2008). Cheap and fast: But is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP.

  • Soudi, A., Bosch, A., & Neumann, G. (Eds.) (2007). Introductory chapter. Arabic Computational Morphology: Knowledge-based and Empirical Methods (Springer).

  • Su, Q., Pavlov, D., Chow, J., & Baker, W. (2007). Internet-scale collection of human-reviewed data. In Proceedings of WWW.

  • Wilson, A. (2000). Conceptual glossary and index to the Vulgate translation of the Gospel according to John. Hildesheim: Olms-Weidmann.

  • Wright, W. (2007). A grammar of the Arabic language. London: Simon Wallenberg Press.

  • Yusof, R., Zainuddin, R., Baba, M., & Yusof, Z. (2010). Qur’anic words stemming. Arabian Journal for Science and Engineering (AJSE), 35(2C), 37–49.

  • Zaidi, S., Laskri, M., & Abdelali, A. (2010). Arabic collocations extraction using GATE. IEEE ICMWI'10. Algiers, Algeria.

  • Zitouni, I., Sorensen, J. S., & Sarikaya, R. (2006). Maximum entropy based restoration of Arabic diacritics. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics. Sydney, Australia.

Acknowledgments

We would like to thank Lydia Lau and Katja Markert at the School of Computing, University of Leeds for providing invaluable feedback and numerous suggestions to improve the quality of this article. We thank Wajdi Zaghouani at the Linguistic Data Consortium, University of Pennsylvania for assistance in devising the Amazon Mechanical Turk experiment for tagging the Quran via crowdsourcing. We also acknowledge the hard work of the supervisors and other volunteer collaborators involved in online annotation of the Quranic Arabic Corpus.

Author information

Corresponding author

Correspondence to Kais Dukes.

About this article

Cite this article

Dukes, K., Atwell, E. & Habash, N. Supervised collaboration for syntactic annotation of Quranic Arabic. Lang Resources & Evaluation 47, 33–62 (2013). https://doi.org/10.1007/s10579-011-9167-7

  • DOI: https://doi.org/10.1007/s10579-011-9167-7
