Machine Translation, Volume 26, Issue 1–2, pp 47–65

A comparison of segmentation methods and extended lexicon models for Arabic statistical machine translation

  • Saša Hasan
  • Saab Mansour
  • Hermann Ney


In this article, we first investigate different methodologies of Arabic segmentation for statistical machine translation by comparing a rule-based segmenter to different statistically based segmenters. We also present a method for segmentation that serves the needs of a real-time translation system without impairing translation accuracy. Second, we report on extended lexicon models based on triplets that incorporate sentence-level context during the decoding process. Results on different translation tasks show improvements in both BLEU and TER scores.


Keywords: Statistical machine translation, Segmentation, Extended lexicon models





Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Human Language Technology and Pattern Recognition Group, Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany
