Machine Translation, Volume 24, Issue 1, pp 15–26

Metrics for MT evaluation: evaluating reordering


Abstract

Translating between dissimilar languages requires an account of how divergent word orders express the same semantic content. Reordering poses a serious problem for statistical machine translation systems and has generated a considerable body of research aimed at meeting its challenges. Direct evaluation of reordering requires automatic metrics that explicitly measure the quality of word order choices in translations. Current metrics, such as BLEU, evaluate reordering only indirectly. We analyse the ability of current metrics to capture reordering performance. We then introduce permutation distance metrics as a direct method for measuring word order similarity between translations and reference sentences. By correlating all metrics with a novel method for eliciting human judgements of reordering quality, we show that current metrics are largely influenced by lexical choice and are unable to distinguish between different reordering scenarios. We also show that permutation distance metrics correlate very well with human judgements and are impervious to lexical differences.
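To make the idea concrete, the sketch below computes a normalised Kendall's tau distance between two permutations, one simple member of the permutation distance family the abstract refers to. This is an illustrative reading, not the authors' exact formulation: the function name kendall_tau_distance, the normalisation by the total number of index pairs, and the toy permutations are all assumptions made for the example.

    # Illustrative sketch (Python): normalised Kendall's tau distance between
    # two permutations of the same index set. The exact distance definition and
    # normalisation used in the article may differ; names here are assumptions.
    from itertools import combinations

    def kendall_tau_distance(p, q):
        """Fraction of index pairs whose relative order differs in p and q."""
        assert sorted(p) == sorted(q) == list(range(len(p)))
        pos_q = {v: i for i, v in enumerate(q)}  # position of each element in q
        discordant = sum(
            1 for a, b in combinations(p, 2)  # a precedes b in p...
            if pos_q[a] > pos_q[b]            # ...but follows b in q
        )
        return discordant / (len(p) * (len(p) - 1) / 2)

    # One adjacent swap in a three-word sentence: 1 discordant pair out of 3.
    print(kendall_tau_distance([0, 1, 2], [0, 2, 1]))  # 0.3333...

Identical orderings score 0 and a full reversal scores 1. Because such a distance operates on word positions rather than on surface tokens, two translations with identical word order but different lexical choices receive the same score, which is exactly the insensitivity to lexical differences that the abstract highlights.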

Keywords

Machine translation · Metrics · Reordering · BLEU · METEOR · TER · Permutation distances · Human evaluation


References

  1. Birch A, Osborne M, Koehn P (2008) Predicting success in machine translation. In: Proceedings of EMNLP
  2. Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of BLEU in machine translation research. In: Proceedings of EACL
  3. Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-) evaluation of machine translation. In: Proceedings of the second workshop on statistical machine translation, Prague, Czech Republic, pp 136–158
  4. Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: Proceedings of the third workshop on statistical machine translation, Columbus, OH, pp 70–106
  5. Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on statistical machine translation. In: Proceedings of the fourth workshop on statistical machine translation, Athens, Greece, pp 1–28
  6. Giménez J, Màrquez L (2007) Linguistic features for automatic evaluation of heterogeneous MT systems. In: Proceedings of the ACL workshop on statistical machine translation
  7. Hirschberg DS (1975) A linear space algorithm for computing maximal common subsequences. Commun ACM 18(6): 341–343
  8. Kendall M, Gibbons JD (1990) Rank correlation methods. Oxford University Press, New York
  9. Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the ACL companion demo and poster sessions, Prague, Czech Republic, pp 177–180
  10. Lapata M (2003) Probabilistic text structuring: experiments with sentence ordering. Comput Linguist 29(2): 263–317
  11. Lapata M (2006) Automatic evaluation of information ordering: Kendall's tau. Comput Linguist 32(4): 471–484
  12. Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of the ACL workshop on statistical machine translation, pp 228–231
  13. Liang P, Taskar B, Klein D (2006) Alignment by agreement. In: Proceedings of the human language technology conference of NAACL, pp 104–111
  14. Lin C-Y, Och F (2004) ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In: Proceedings of COLING, p 501
  15. Padó S, Galley M, Manning CD, Jurafsky D (2009) Textual entailment features for machine translation evaluation. In: Proceedings of the EACL workshop on statistical machine translation (WMT)
  16. Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, Philadelphia, PA, pp 311–318
  17. Ronald S (1998) More distance functions for order-based encodings. In: Proceedings of the IEEE conference on evolutionary computation, pp 558–563
  18. Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA
  19. Ulam S (1972) Some ideas and prospects in biomathematics. Annual Review of Biophysics and Bioengineering 1: 277–292

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. University of Edinburgh, Edinburgh, UK
  2. University of Oxford, Oxford, UK
