Orthographic and morphological processing for English–Arabic statistical machine translation

El Kholy, Ahmed; Habash, Nizar

doi:10.1007/s10590-011-9110-0

Published: 21 September 2011

Volume 26, pages 25–45, (2012)
Machine Translation

Ahmed El Kholy¹ &
Nizar Habash¹

Abstract

Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.
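
To make the idea of orthographic normalization (reduction) concrete, the following is a minimal Python sketch. It assumes a commonly used reduction for Arabic, in which hamzated and madda Alif forms are mapped to bare Alif and Alif Maqsura is mapped to Ya. That mapping, and the name `reduce_orthography`, are illustrative assumptions; the abstract does not specify the exact rules the authors used.

```python
# Illustrative sketch only: one common way to reduce Arabic orthography.
# The character mapping is an assumption, not the article's documented method.

ALIF_VARIANTS = "\u0622\u0623\u0625"  # Alif with madda, hamza above, hamza below
BARE_ALIF = "\u0627"
ALIF_MAQSURA = "\u0649"
YA = "\u064A"

# Build a single translation table so normalization is one pass over the text.
_REDUCE_TABLE = str.maketrans(
    {**{ch: BARE_ALIF for ch in ALIF_VARIANTS}, ALIF_MAQSURA: YA}
)


def reduce_orthography(text: str) -> str:
    """Map Alif variants to bare Alif and Alif Maqsura to Ya."""
    return text.translate(_REDUCE_TABLE)
```

Reduction of this kind is lossy: several enriched spellings collapse into one reduced form, which is why a separate enrichment step is needed to recover orthographically correct output after translation.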

Author information

Authors and Affiliations

Center for Computational Learning Systems, Columbia University, 475 Riverside Drive, Suite 850, New York, NY, 10115, USA
Ahmed El Kholy & Nizar Habash

Corresponding author

Correspondence to Ahmed El Kholy.

About this article

Cite this article

El Kholy, A., Habash, N. Orthographic and morphological processing for English–Arabic statistical machine translation. Machine Translation 26, 25–45 (2012). https://doi.org/10.1007/s10590-011-9110-0

Received: 05 July 2010
Accepted: 20 August 2011
Published: 21 September 2011
Issue Date: March 2012
DOI: https://doi.org/10.1007/s10590-011-9110-0
