Machine Translation

Volume 24, Issue 1, pp 51–65

Significance tests of automatic machine translation evaluation metrics

Abstract

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
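The paper itself presents no code, but the bootstrap procedure it relies on is easy to sketch. The following is a minimal illustration, not the authors' implementation: it assumes a user-supplied corpus_metric function (for example a corpus-level BLEU or NIST scorer) and a list of (hypothesis, reference) segment pairs, resamples the test set with replacement, and reports a percentile confidence interval for the corpus score. The names bootstrap_ci, pairs and corpus_metric are illustrative only.

    import random
    from typing import Callable, Sequence, Tuple

    def bootstrap_ci(
        pairs: Sequence[Tuple[str, str]],
        corpus_metric: Callable[[Sequence[Tuple[str, str]]], float],
        n_resamples: int = 1000,
        alpha: float = 0.05,
        seed: int = 0,
    ) -> Tuple[float, float]:
        """Percentile-bootstrap confidence interval for a corpus-level MT metric."""
        rng = random.Random(seed)
        n = len(pairs)
        scores = []
        for _ in range(n_resamples):
            # Resample the test set with replacement and re-score the whole corpus,
            # so count-based metrics such as BLEU/NIST are aggregated correctly.
            resample = [pairs[rng.randrange(n)] for _ in range(n)]
            scores.append(corpus_metric(resample))
        scores.sort()
        lo = scores[int((alpha / 2) * n_resamples)]
        hi = scores[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
        return lo, hi

In the same spirit as Koehn (2004), two systems can be compared with a paired variant: score both systems on each resample and count how often one outscores the other; if it wins on, say, 95% of the resamples, the difference is taken to be significant at that level.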

Keywords

Machine translation evaluation · Significance test · Bootstrap · Confidence interval · Evaluation suite construction

References

  1. Amigó E, Gonzalo J, Peñas A, Verdejo F (2005) QARLA: a framework for the evaluation of text summarization systems. In: ACL ’05: Proceedings of the 43rd annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, pp 280–289
  2. Amigó E, Giménez J, Gonzalo J, Màrquez L (2006) MT evaluation: human-like vs. human acceptable. In: Proceedings of the COLING/ACL main conference poster sessions. Association for Computational Linguistics, Morristown, NJ, USA, pp 17–24
  3. Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. Association for Computational Linguistics, Ann Arbor, Michigan, pp 65–72
  4. Bisani M, Ney H (2004) Bootstrap estimates for confidence intervals in ASR performance evaluation. In: Proceedings of the 2004 IEEE international conference on acoustics, speech, and signal processing (ICASSP 2004). Montreal, Quebec, Canada
  5. Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of Bleu in machine translation research. In: Proceedings of the 11th conference of the European chapter of the Association for Computational Linguistics: EACL 2006. Trento, Italy, pp 249–256
  6. Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2007) (Meta-) evaluation of machine translation. In: StatMT ’07: Proceedings of the second workshop on statistical machine translation. Association for Computational Linguistics, Morristown, NJ, USA, pp 136–158
  7. Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman & Hall, Boca Raton
  8. Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of EMNLP 2004. Barcelona, Spain
  9. Leusch G, Ueffing N, Ney H (2003) A novel string-to-string distance measure with applications to machine translation evaluation. In: Proceedings of MT Summit IX. New Orleans, LA
  10. Lin C-Y, Och FJ (2004) ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In: COLING ’04: Proceedings of the 20th international conference on computational linguistics. Association for Computational Linguistics, Morristown, NJ, USA, p 501
  11. Liu D, Gildea D (2005) Syntactic features for evaluation of machine translation. In: ACL 2005 workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization
  12. Nießen S, Vogel S, Ney H, Tillmann C (1998) A DP based search algorithm for statistical machine translation. In: Proceedings of the 17th international conference on computational linguistics. Association for Computational Linguistics, Morristown, NJ, USA, pp 960–967
  13. NIST (2003) Automatic evaluation of machine translation quality using N-gram co-occurrence statistics. Technical report, NIST, http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
  14. Owczarzak K, van Genabith J, Way A (2007) Evaluating machine translation with LFG dependencies. Mach Transl 21(2): 95–119
  15. Padó S, Galley M, Jurafsky D, Manning CD (2009) Robust machine translation evaluation with entailment features. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP. Association for Computational Linguistics, Suntec, Singapore, pp 297–305
  16. Papineni K, Roukos S, Ward T, Zhu W (2001) Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center
  17. Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, pp 223–231
  18. Zhang Y (2008) Structured language model for statistical machine translation. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA
  19. Zhang Y, Vogel S, Waibel A (2004) Interpreting Bleu/NIST scores: how much improvement do we need to have a better system? In: Proceedings of the 4th international conference on language resources and evaluation. Lisbon, Portugal

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA