Abstract
Arabic is an old Semitic language whose lexicon and grammar were standardized long ago and remain deeply rooted. It is a morphologically rich language characterized by derivation and inflection, and an international language with over 500 million native speakers in around 29 countries. Over the last 15 years, Arabic has shown the highest growth among the top ten online languages, and the volume of electronically stored Arabic text is increasing rapidly as a result. Despite this proud heritage, lexical richness, and online user growth, Arabic remains a relatively under-resourced language compared with languages that have smaller or similar speaker populations (e.g., French and German). This chapter covers the major progress made in Arabic linguistic resources, primarily corpus compilation, and the challenges that researchers face in developing them. It is hoped that this overview of Arabic corpus linguistics will guide current and future research directions.
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Zeroual, I., Lakhouaja, A. (2018). Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_29
DOI: https://doi.org/10.1007/978-3-319-67056-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67055-3
Online ISBN: 978-3-319-67056-0
eBook Packages: Engineering, Engineering (R0)