Orthographic and morphological processing for English–Arabic statistical machine translation

Machine Translation

Abstract

Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization improve SMT quality by reducing data sparsity. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text and then jointly enriching and detokenizing the output outperforms training on enriched text.
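
To make the terms "reduced" text and "detokenization" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the article's method: the character mapping (Hamzated Alif forms to bare Alif, Alif Maqsura to Ya) is a commonly used Arabic normalization that is assumed here, the `+` clitic marker convention is hypothetical, and the function names are invented for this example. Real detokenization also needs orthographic adjustments at clitic boundaries, which this naive concatenation does not attempt, and none of the six techniques the abstract mentions are reproduced.

```python
# Assumed mapping for illustration only; the article's exact rules may differ.
REDUCE_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alif with Hamza above -> bare Alif
    "\u0625": "\u0627",  # Alif with Hamza below -> bare Alif
    "\u0622": "\u0627",  # Alif with Madda above -> bare Alif
    "\u0649": "\u064A",  # Alif Maqsura -> Ya
})


def reduce_orthography(text: str) -> str:
    """Collapse the assumed orthographic variants to a single form."""
    return text.translate(REDUCE_MAP)


def simple_detokenize(tokens: list[str]) -> str:
    """Naively reattach clitics marked with '+' (hypothetical convention).

    'w+' is treated as a proclitic that attaches to the next token,
    '+hA' as an enclitic that attaches to the previous word.
    """
    words: list[str] = []
    glue = False  # True when the previous token was a proclitic
    for tok in tokens:
        is_enclitic = tok.startswith("+") and len(tok) > 1
        is_proclitic = tok.endswith("+") and len(tok) > 1
        core = tok.strip("+") if (is_enclitic or is_proclitic) else tok
        if (glue or is_enclitic) and words:
            words[-1] += core  # attach to the word being built
        else:
            words.append(core)  # start a new word
        glue = is_proclitic
    return " ".join(words)
```

The detokenizer is shown on Latin transliteration for readability. A system trained on reduced text would additionally need an enrichment step to restore the collapsed characters, which is not a simple inverse of `reduce_orthography` because the mapping is many-to-one.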

Corresponding author

Correspondence to Ahmed El Kholy.

About this article

Cite this article

El Kholy, A., Habash, N. Orthographic and morphological processing for English–Arabic statistical machine translation. Machine Translation 26, 25–45 (2012). https://doi.org/10.1007/s10590-011-9110-0

  • DOI: https://doi.org/10.1007/s10590-011-9110-0
