The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program (http://1.usa.gov/transtac) faced many challenges in applying automated measures of translation quality to Iraqi Arabic–English speech translation dialogues. Features of speech data in general, and of Iraqi Arabic data in particular, undermine basic assumptions of automated measures that depend on matching system outputs to reference translations. These features are described along with the challenges they present for evaluating machine translation quality with automated metrics. We show that scores for translation into Iraqi Arabic correlate better with human judgments when they are computed from normalized system outputs and reference translations: orthographic normalization, lexical normalization, and operations involving light stemming all yielded higher correlations.
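The abstract names three families of preprocessing operations: orthographic normalization, lexical normalization, and light stemming. As a rough illustration of what such operations look like for Arabic text, the sketch below implements a few commonly used rules (diacritic removal, alef and ta-marbuta unification, and single-clitic stripping). The specific character mappings and clitic lists are illustrative assumptions, not the normalization rules the authors actually used.

```python
import re

# Arabic short-vowel diacritics (tashkeel), U+064B-U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def orthographic_normalize(text: str) -> str:
    """Collapse common Arabic orthographic variants to a single form."""
    text = DIACRITICS.sub("", text)                         # drop diacritics
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # hamzated alefs -> bare alef
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")                 # ta marbuta -> ha
    return text

# Illustrative clitic inventories (not the paper's actual lists).
PREFIXES = ["\u0627\u0644", "\u0648"]               # "al-" article, "wa-" conjunction
SUFFIXES = ["\u0647\u0627", "\u0643", "\u0647"]     # possessive clitics -ha, -k, -h

def light_stem(token: str) -> str:
    """Very light stemming: strip at most one leading and one trailing clitic."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) > len(p) + 1:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) > len(s) + 1:
            token = token[:-len(s)]
            break
    return token

def normalize(sentence: str) -> str:
    """Apply orthographic normalization, then light stemming, token by token."""
    return " ".join(light_stem(orthographic_normalize(t)) for t in sentence.split())
```

In an evaluation pipeline of the kind the abstract describes, both system outputs and reference translations would be passed through the same `normalize` step before metric scores (e.g., BLEU or METEOR) are computed, so that purely orthographic or morphological mismatches are not penalized.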
Approved for Public Release: 11-0118. Distribution Unlimited. The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. Some of the material in this article was originally presented at the Language Resources and Evaluation Conference (LREC) 2008 in Marrakesh, Morocco and at the 2009 MT Summit XII in Ottawa, Canada.
Condon, S., Arehart, M., Parvaz, D. et al. Evaluation of 2-way Iraqi Arabic–English speech translation systems using automated metrics. Machine Translation 26, 159–176 (2012). https://doi.org/10.1007/s10590-011-9105-x