Evaluation of 2-way Iraqi Arabic–English speech translation systems using automated metrics


The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program faced many challenges in applying automated measures of translation quality to Iraqi Arabic–English speech translation dialogues. Features of speech data in general, and of Iraqi Arabic data in particular, undermine basic assumptions of automated measures that depend on matching system outputs to reference translations. These features are described along with the challenges they present for evaluating machine translation quality using automated metrics. We show that scores for translation into Iraqi Arabic exhibit higher correlations with human judgments when they are computed from normalized system outputs and reference translations. Orthographic normalization, lexical normalization, and operations involving light stemming all resulted in higher correlations with human judgments.
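To make the normalization step concrete, the sketch below shows the kind of orthographic normalization and light stemming the abstract refers to, applied to Arabic text before computing reference-based metrics. The specific rules (alef-variant folding, diacritic stripping, the prefix/suffix lists) are illustrative assumptions in the style of common Arabic light stemmers, not the exact rule set used in the TRANSTAC evaluations.

```python
import re

# Short-vowel diacritics (tanween, harakat) and the dagger alef.
DIACRITICS = re.compile(r'[\u064B-\u0652\u0670]')

def normalize(text: str) -> str:
    """Illustrative orthographic normalization for Arabic text."""
    text = DIACRITICS.sub('', text)                         # strip diacritics
    text = re.sub('[\u0622\u0623\u0625]', '\u0627', text)   # alef variants -> bare alef
    text = text.replace('\u0649', '\u064A')                 # alef maqsura -> ya
    text = text.replace('\u0629', '\u0647')                 # ta marbuta -> ha
    return text

def light_stem(token: str) -> str:
    """Illustrative light stemming: strip one common prefix and one
    common suffix, keeping stems of at least two characters."""
    for pfx in ('\u0648\u0627\u0644', '\u0627\u0644'):      # wal-, al-
        if token.startswith(pfx) and len(token) - len(pfx) >= 2:
            token = token[len(pfx):]
            break
    for sfx in ('\u0647\u0627', '\u064A\u0646', '\u0648\u0646', '\u0629'):
        if token.endswith(sfx) and len(token) - len(sfx) >= 2:
            token = token[:-len(sfx)]
            break
    return token
```

In a pipeline of this shape, both the system output and the reference translations would be passed through `normalize` (and optionally `light_stem`, token by token) before scoring, so that surface-orthography mismatches no longer count against n-gram overlap.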




Author information


Corresponding author

Correspondence to Sherri Condon.

Additional information

Approved for Public Release: 11-0118. Distribution Unlimited. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. Some of the material in this article was originally presented at the Language Resources and Evaluation Conference (LREC) 2008 in Marrakesh, Morocco and at the 2009 MT Summit XII in Ottawa, Canada.

About this article

Cite this article

Condon, S., Arehart, M., Parvaz, D. et al. Evaluation of 2-way Iraqi Arabic–English speech translation systems using automated metrics. Machine Translation 26, 159–176 (2012).
