Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go

Zeroual, Imad; Lakhouaja, Abdelhak

doi:10.1007/978-3-319-67056-0_29

Part of the book series: Studies in Computational Intelligence (SCI, volume 740)

Abstract

Arabic is an old Semitic language whose lexicon and grammar were standardized long ago and are deeply rooted in history. It is a morphologically rich language characterized by derivation and inflection. It is an international language with over 500 million native speakers across 29 countries. Over the last 15 years, Arabic has shown the highest growth among the ten most-used online languages. Consequently, the volume of electronically stored Arabic text is increasing rapidly. Despite this proud heritage, lexical richness, and growth in online users, Arabic remains relatively under-resourced compared with languages of smaller or similar population size (e.g., French and German). This chapter covers the major progress made in Arabic linguistic resources, primarily corpus compilation, and the challenges researchers face in developing such resources. It is hoped that this overview of Arabic corpus linguistics will guide current and future research directions.

Author information

Authors and Affiliations

Computer Sciences Laboratory, Faculty of Sciences, Mohammed First University, Oujda, Morocco
Imad Zeroual & Abdelhak Lakhouaja

Corresponding author

Correspondence to Imad Zeroual.

Editor information

Editors and Affiliations

The British University in Dubai, Dubai, United Arab Emirates
Khaled Shaalan
Faculty of Computers and Information Technology, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Faculty of Computers and Information, Ain Shams University, Cairo, Egypt
Fahmy Tolba

About this chapter

Cite this chapter

Zeroual, I., Lakhouaja, A. (2018). Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_29

DOI: https://doi.org/10.1007/978-3-319-67056-0_29
Published: 18 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67055-3
Online ISBN: 978-3-319-67056-0
eBook Packages: Engineering, Engineering (R0)