Machine Translation, Volume 24, Issue 1, pp 51–65

Significance tests of automatic machine translation evaluation metrics

Authors

  • Ying Zhang
    • Carnegie Mellon University
  • Stephan Vogel
    • Carnegie Mellon University
Article

DOI: 10.1007/s10590-010-9073-6

Cite this article as:
Zhang, Y. & Vogel, S. Machine Translation (2010) 24: 51. doi:10.1007/s10590-010-9073-6

Abstract

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU, METEOR and the related NIST metric, are becoming increasingly important in MT research and development. This paper presents a significance test-driven comparison of n-gram-based automatic MT evaluation metrics. Statistical significance tests use bootstrapping methods to estimate the reliability of automatic machine translation evaluations. Based on this reliability estimation, we study the characteristics of different MT evaluation metrics and how to construct reliable and efficient evaluation suites.
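The bootstrapping approach described in the abstract can be illustrated with a short sketch: resample the test set's sentences with replacement, recompute the corpus-level metric on each resample, and take percentiles of the resulting distribution as a confidence interval. The sketch below is a generic percentile bootstrap, not the paper's exact procedure; the toy per-sentence n-gram counts and the pooled-precision "metric" are illustrative assumptions standing in for a full BLEU computation.

```python
import random

def bootstrap_ci(per_sentence_stats, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a corpus-level metric.

    per_sentence_stats: list of per-sentence sufficient statistics
    metric: maps a list of such statistics to a single corpus score
    """
    rng = random.Random(seed)
    n = len(per_sentence_stats)
    scores = []
    for _ in range(n_resamples):
        # Resample sentences with replacement and rescore the whole "corpus".
        sample = [per_sentence_stats[rng.randrange(n)] for _ in range(n)]
        scores.append(metric(sample))
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Toy data: (matched n-grams, total n-grams) per hypothesis sentence.
# The corpus score pools counts across sentences, as BLEU pools n-grams.
counts = [(3, 4), (2, 5), (4, 4), (1, 3), (2, 4)]
pooled_precision = lambda c: sum(m for m, _ in c) / sum(t for _, t in c)
low, high = bootstrap_ci(counts, pooled_precision)
```

Because the metric is computed on pooled counts rather than by averaging per-sentence scores, the resampled statistic behaves like the corpus-level metric it mimics; two systems' intervals can then be compared, or paired resamples used, to judge whether a score difference is significant.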

Keywords

Machine translation evaluation · Significance test · Bootstrap · Confidence interval · Evaluation suite construction

Copyright information

© Springer Science+Business Media B.V. 2010