Survey of the Arabic Machine Translation Corpora

Babaali, Baligh; Salem, Mohammed

doi:10.1007/978-3-031-18516-8_15

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 593)

Included in the following conference series:

International Symposium on Modelling and Implementation of Complex Systems

Abstract

Machine translation (MT) is an important area of natural language processing (NLP) that helps overcome language barriers and ease cross-lingual communication. This paper reviews the MT approaches available in the literature to encourage researchers to study these techniques. In recent years, the neural approach has dominated the field of MT. This approach depends on datasets with a large number of parallel sentences, which Arabic MT lacks. This paper therefore summarizes the major Arabic MT corpora, covering both Standard Arabic and Dialectal Arabic, and discusses their characteristics, which we consider key to MT and which may help address the open challenges in Arabic MT.

Notes

1. https://github.com/ModernMT/MMT
2. Language Models.
3. http://www.ldc.upenn.edu
4. http://uncorpora.org
5. https://catalog.ldc.upenn.edu/LDC2011T11
6. https://conferences.unite.un.org/UNCorpus/
7. http://www.opensubtitles.org
8. https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis
9. http://www.statmt.org/cc-aligned/
10. https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix
11. https://l10n.gnome.org
12. https://dumps.wikimedia.org/other/contenttranslation
13. https://tico-19.github.io/index.html
14. https://translations.launchpad.net
15. http://data.statmt.org/news-commentary/v16/documents.tgz
16. Map from Wikipedia distributed under a CC BY 3.0 license.
17. https://camel.abudhabi.nyu.edu/gumar/
18. http://alt.qcri.org/~hmubarak/EGY-MGR-LEV-GLF-2-MSA.zip
19. https://sites.google.com/nyu.edu/madar/
20. https://github.com/xprogramer/DZDC12
21. https://github.com/darija-open-dataset/dataset

References

Abainia, K.: DZDC12: a new multipurpose parallel Algerian Arabizi-French code-switched corpus. Lang. Resour. Eval. 54(2), 419–455 (2020). https://doi.org/10.1007/s10579-019-09454-8
Abu El-khair, I.: 1.5 billion words Arabic corpus. arXiv e-prints (2016)
Ashraf, N., Ahmad, M.: Machine translation techniques and their comparative study. Int. J. Comput. Appl. 125(7), 25–31 (2015). Published by Foundation of Computer Science (FCS), NY, USA
Azouaou, F., Guellil, I.: ALG/FR: a step by step construction of a lexicon between Algerian dialect and French. In: The 31st Pacific Asia Conference on Language, Information and Computation PACLIC, vol. 31 (2017)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015) (2015)
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72 (2005)
Bies, A., et al.: Transliteration of Arabizi into Arabic orthography: developing a parallel annotated Arabizi-Arabic script SMS/chat corpus. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), Doha, Qatar, pp. 93–103. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/W14-3612
Bojar, O., et al.: Findings of the 2017 conference on machine translation (WMT17). In: Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark. Association for Computational Linguistics (2017)
Bojar, O., et al.: Findings of the 2018 conference on machine translation (WMT18). In: Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels. Association for Computational Linguistics (2018)
Bouamor, H., et al.: The MADAR Arabic dialect corpus and lexicon. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA) (2018)
Brown, P.F., et al.: A statistical approach to language translation. In: Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics (1988)
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
Charoenpornsawat, P., Sornlertlamvanich, V., Charoenporn, T.: Improving translation quality of rule-based machine translation. In: COLING-02: Machine Translation in Asia (2002)
Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. In: Proceedings of the 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST 2014), pp. 103–111. Association for Computational Linguistics (2014)
Clinchant, S., Jung, K.W., Nikoulina, V.: On the use of BERT for neural machine translation. arXiv Preprint arXiv:1909.12744 (2019)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Do, T.N.D.: Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée. Ph.D. thesis, Université de Grenoble (2011)
Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, HLT 2002, pp. 138–145. Morgan Kaufmann Publishers Inc., San Francisco (2002)
Eisele, A., Chen, Y.: MultiUN: a multilingual corpus from united nation documents. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, pp. 2868–2872. European Language Resources Association (ELRA) (2010)
El-Kishky, A., Chaudhary, V., Guzmán, F., Koehn, P.: CCAligned: a massive collection of cross-lingual web-document pairs. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 5960–5969. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.480
Eriguchi, A., Rarrick, S., Matsushita, H.: Combining translation memory with neural machine translation. In: Proceedings of the 6th Workshop on Asian Translation, pp. 123–130 (2019)
Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: Precup, D., Teh, Y.W. (eds.) Proceedings of Machine Learning Research, Sydney, Australia, vol. 70, pp. 1243–1252. PMLR, International Convention Centre (2017)
Grimes, S., Li, X., Bies, A., Kulick, S., Ma, X., Strassel, S.: Creating Arabic-English parallel word-aligned treebank corpora at LDC. In: Proceedings of Language Resources and Evaluation Conference (LREC 2010), Malta (2010)
Habash, N., Zalmout, N., Taji, D., Hoang, H., Alzate, M.: Parallel corpus for evaluating machine translation between Arabic and European languages. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 235–241. Association for Computational Linguistics (2017)
Habash, N.Y.: Introduction to Arabic natural language processing. Synthesis Lect. Hum. Lang. Technol. 3(1), 1–187 (2010). https://doi.org/10.2200/S00277ED1V01Y201008HLT010
Han, A.L.F., Wong, D.F., Chao, L.S.: LEPOR: a robust evaluation metric for machine translation with augmented factors. In: Proceedings of COLING 2012: Posters, Mumbai, India, pp. 441–450. The COLING 2012 Organizing Committee (2012)
Jarrar, M., Habash, N., Akra, D., Zalmout, N.: Building a corpus for Palestinian Arabic: a preliminary study. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), Doha, Qatar, pp. 18–27 (2014)
Johnson, M., et al.: Google’s multilingual neural machine translation system: enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 5, 339–351 (2017)
Khalifa, S., Habash, N., Abdulrahim, D., Hassan, S.: A large scale corpus of Gulf Arabic. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, pp. 4282–4289. European Language Resources Association (ELRA) (2016)
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
Lison, P., Tiedemann, J.: OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, pp. 923–929. European Language Resources Association (ELRA) (2016)
Maamouri, M., Bies, A., Buckwalter, T., Mekki, W.: The Penn Arabic treebank: building a large-scale annotated Arabic corpus. In: NEMLAR Conference on Arabic Language Resources and Tools (2004)
Meftouh, K., Harrat, S., Jamoussi, S., Abbas, M., Smaili, K.: Machine translation experiments on PADIC: a parallel Arabic DIalect corpus. In: Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, Shanghai, China, pp. 26–34 (2015)
Mostefa, D., Laïb, M., Chaudiron, S., Choukri, K., Chalendar, G.: A multilingual named entity corpus for Arabic, English and French. MEDAR 2009, 2nd (2009)
Mubarak, H.: Dial2MSA: a tweets corpus for converting dialectal Arabic to modern standard Arabic. In: OSACT 3: The 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, p. 49 (2018)
Nagao, M.: Framework of a mechanical translation between Japanese and English by analogy principle. In: Elithorn, A., Banerji, R. (eds.) Artificial and Human Intelligence, pp. 173–180. North-Holland (1984)
Okpor, M.D.: Machine translation approaches: issues and challenges. IJCSI Int. J. Comput. Sci. Issues 11(2), 159–165 (2014)
Outchakoucht, A., Es-Samaali, H.: Moroccan dialect-darija-open dataset. arXiv preprint arXiv:2103.09687 (2021)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics (2002)
Parker, R., Graff, D., Chen, K., Kong, J., Maeda, K.: Arabic Gigaword Fifth Edition LDC2011T11. https://catalog.ldc.upenn.edu/LDC2011T11
Rafalovitch, A., Dale, R.: United Nations general assembly resolutions: a six-language parallel corpus. In: Proceedings of Machine Translation Summit XII: Posters, Ottawa, Canada (2009)
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2020)
Samy, D., Moreno-Sandoval, A., Guirao, J.M., Alfonseca, E.: Building a parallel multilingual corpus (Arabic-Spanish-English). In: LREC, pp. 2176–2181 (2006)
Schwenk, H., Wenzek, G., Edunov, S., Grave, E., Joulin, A., Fan, A.: CCMatrix: mining billions of high-quality parallel sentences on the web. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6490–6500. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.507
Shterionov, D., Casanellas, P.N.L., Superbo, R., O’Dowd, T.: Empirical evaluation of NMT and PBSMT quality for large-scale translation production. In: Conference Booklet, p. 74 (2017)
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, Massachusetts, USA, pp. 223–231. Association for Machine Translation in the Americas (2006)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), pp. 3104–3112 (2014)
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, vol. 2012, pp. 2214–2218. European Language Resources Association (ELRA) (2012)
Tripathi, S., Sarkhel, J.K.: Approaches to machine translation. Ann. Libr. Inform. Stud. 57, 388–393 (2010)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wikipedia: Machine translation (2020). https://en.wikipedia.org/wiki/Machine_translation. Accessed 3 Feb 2020
Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
Zhu, J., et al.: Incorporating BERT into neural machine translation. In: Proceedings of the Eighth International Conference on Learning Representations, Addis Ababa, Ethiopia (Online) (2020)
Ziemski, M., Junczys-Dowmunt, M., Pouliquen, B.: The United Nations parallel corpus v1.0. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, pp. 3530–3534. European Language Resources Association (ELRA) (2016)

Author information

Authors and Affiliations

University of Mascara, Mascara, Algeria
Baligh Babaali & Mohammed Salem

Corresponding author

Correspondence to Baligh Babaali.

Editor information

Editors and Affiliations

Faculty of New Information and Communication Technologies, University of Constantine2, Constantine, Algeria
Salim Chikhi
Sistemas Informáticos, Universidad de Castilla - La Mancha, Albacete, Spain
Gregorio Diaz-Descalzo
Faculty of Technology, University of Saida, Saida, Algeria
Abdelmalek Amine
Faculty of Information and Communication Technology, University of Constantine2, Constantine, Algeria
Allaoua Chaoui
Faculty of New Information and Communication Technologies, Université Constantine 2, Constantine, Algeria
Djamel Eddine Saidouni
Department of Mathematics and Computer Science, University of El Oued, El-Oued, Algeria
Mohamed Khireddine Kholladi

Cite this paper

Babaali, B., Salem, M. (2023). Survey of the Arabic Machine Translation Corpora. In: Chikhi, S., Diaz-Descalzo, G., Amine, A., Chaoui, A., Saidouni, D.E., Kholladi, M.K. (eds) Modelling and Implementation of Complex Systems. MISC 2022. Lecture Notes in Networks and Systems, vol 593. Springer, Cham. https://doi.org/10.1007/978-3-031-18516-8_15

DOI: https://doi.org/10.1007/978-3-031-18516-8_15
Published: 13 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18515-1
Online ISBN: 978-3-031-18516-8
eBook Packages: Engineering, Engineering (R0)
