Machine Translation

Volume 26, Issue 1–2, pp 25–45

Orthographic and morphological processing for English–Arabic statistical machine translation

Article

Abstract

Much of the work on statistical machine translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this article, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We also examine a set of six detokenization techniques and evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text, then jointly enriching and detokenizing the output, outperforms training on enriched text.
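To make the notion of orthographic normalization concrete, the following is a minimal sketch of one common form of "reduced" Arabic orthography: collapsing the Hamzated and Madda forms of Alif into bare Alif, and Alif Maqsura into Ya. The abstract does not spell out the article's exact normalization rules, so the character mapping and the names `REDUCED_MAP` and `reduce_orthography` are assumptions chosen for illustration only.

```python
# Hypothetical sketch of "reduced" orthographic normalization for Arabic.
# The mapping is an assumption based on common practice; it is not taken
# from the article itself.
REDUCED_MAP = str.maketrans({
    "\u0623": "\u0627",  # Alif with Hamza above -> bare Alif
    "\u0625": "\u0627",  # Alif with Hamza below -> bare Alif
    "\u0622": "\u0627",  # Alif with Madda above -> bare Alif
    "\u0649": "\u064A",  # Alif Maqsura -> Ya
})


def reduce_orthography(text: str) -> str:
    """Return text with Alif variants and Alif Maqsura collapsed."""
    return text.translate(REDUCED_MAP)


if __name__ == "__main__":
    # The preposition "ila" (to), written with Hamza below and Alif Maqsura.
    print(reduce_orthography("\u0625\u0644\u0649"))
```

The mapping is many-to-one, so it cannot be undone by a simple character table. This is why producing enriched output from a system trained on reduced text requires a separate enrichment step, as the abstract describes.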

Keywords

Arabic language, Morphology, Tokenization, Detokenization, Statistical machine translation


Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Center for Computational Learning Systems, Columbia University, New York, USA
