Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go

Intelligent Natural Language Processing: Trends and Applications

Part of the book series: Studies in Computational Intelligence (SCI, volume 740)

Abstract

Arabic is an old Semitic language whose lexicon and grammar were standardized long ago and remain deeply rooted and well established. It is a morphologically rich language, characterized by extensive derivation and inflection. It is an international language with over 500 million native speakers across 29 countries. Over the last 15 years, Arabic has shown the highest growth among the top ten online languages, and the volume of stored electronic information in Arabic is consequently increasing rapidly. Despite this proud heritage, lexical richness, and growth in online users, Arabic remains relatively under-resourced compared with languages that have a similar or smaller number of speakers (e.g., French and German). This chapter covers the major progress that has been made in Arabic linguistic resources, primarily corpus compilation, and the challenges that researchers face in developing such resources. It is hoped that this overview of Arabic corpus linguistics will guide current and future research directions.

Notes

  1. http://tanzil.net/
  2. http://quranytopics.appspot.com/
  3. http://shamela.ws/
  4. http://ksucorpus.ksu.edu.sa
  5. http://www.arrawdah.com
  6. http://learning.aljazeera.net
  7. https://sourceforge.net/projects/tashkeela/
  8. http://www.medar.info/MEDAR_Survey_III.pdf
  9. http://www.alwatan.com.sa
  10. http://alrai.com/
  11. http://www.comp.leeds.ac.uk/eric/latifa/research.htm
  12. https://www.sketchengine.co.uk/
  13. http://amara.org
  14. https://wit3.fbk.eu/
  15. http://www.ted.com/

Author information

Corresponding author

Correspondence to Imad Zeroual.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Zeroual, I., Lakhouaja, A. (2018). Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_29

  • DOI: https://doi.org/10.1007/978-3-319-67056-0_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67055-3

  • Online ISBN: 978-3-319-67056-0

  • eBook Packages: Engineering, Engineering (R0)
