Machine Translation, Volume 23, Issue 2–3, pp 181–193

Measuring machine translation quality as semantic equivalence: A metric based on entailment features

  • Sebastian Padó
  • Daniel Cer
  • Michel Galley
  • Dan Jurafsky
  • Christopher D. Manning


Current evaluation metrics for machine translation have increasing difficulty in distinguishing good from merely fair translations. We believe the main problem to be their inability to properly capture meaning: a good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that assesses the quality of MT output through its semantic equivalence to the reference translation, based on a rich set of match and mismatch features motivated by textual entailment. We first evaluate this metric against a combination metric of four state-of-the-art scores. Our metric predicts human judgments better than the combination metric. Combining the entailment and traditional features yields further improvements. Then, we demonstrate that the entailment metric can also be used as a learning criterion in minimum error rate training (MERT) to improve parameter estimation in MT system training. A manual evaluation of the resulting translations indicates that the new model obtains a significant improvement in translation quality.


MT evaluation · Automated metric · MERT · Semantics · Entailment · Linguistic analysis · Paraphrase





Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  • Sebastian Padó (1), corresponding author
  • Daniel Cer (2)
  • Michel Galley (2)
  • Dan Jurafsky (2)
  • Christopher D. Manning (2)

  1. Stuttgart University, Stuttgart, Germany
  2. Stanford University, Stanford, USA
