Machine Translation, Volume 24, Issue 1, pp 27–38

Metric and reference factors in minimum error rate training


Abstract

In Minimum Error Rate Training (MERT), Bleu is often used as the error function, even though it has been shown to correlate less well with human judgment than metrics such as Meteor and Ter. In this paper, we present empirical results showing that, under certain data conditions, parameters tuned on Bleu may lead to sub-optimal Bleu scores. Such scores can be improved significantly by tuning on a different metric altogether, e.g. Meteor, yielding 0.0082 Bleu, or a 3.38% relative improvement, on the WMT08 English–French data. We analyze the influence of the number of references and of the choice of metric on the outcome of MERT, and experiment on several data sets. We show the problems that arise when tuning on a metric that is not designed for the single-reference scenario, and point out some possible solutions.
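To make the role of the error function in MERT concrete, below is a minimal Python sketch of metric-pluggable weight tuning over a fixed n-best list, in the spirit of Och (2003; reference 13 below). It is an illustration only: the data layout (a list of candidate dicts with a "features" vector) and the sentence_error callback, which could wrap 1 − Meteor, Ter, or a sentence-level Bleu approximation, are assumptions of this sketch, not the authors' implementation, and the grid search merely stands in for Och's exact line minimization.

    # Toy, metric-pluggable MERT-style tuning over a fixed n-best list.
    # All names here (rescore, sentence_error, the grid search) are
    # illustrative assumptions, not the paper's implementation.

    def rescore(nbest, weights):
        """For each source sentence, pick the candidate with the highest
        weighted sum of log-linear feature scores."""
        return [max(cands,
                    key=lambda c: sum(w * f for w, f in zip(weights, c["features"])))
                for cands in nbest]

    def corpus_error(hyps, sentence_error):
        """Average a per-sentence error (e.g. 1 - Meteor, or Ter). Note that
        Bleu is a corpus-level metric, so a real implementation aggregates
        its sufficient statistics instead of averaging sentence scores."""
        return sum(sentence_error(h) for h in hyps) / len(hyps)

    def mert_sweep(nbest, weights, sentence_error, steps=50, span=1.0):
        """One coordinate-wise sweep: grid-search each weight and keep
        improvements. Och (2003) instead minimizes exactly along each
        search direction by exploiting the piecewise structure of the
        error; the grid search here only keeps the sketch short."""
        weights = list(weights)
        for i in range(len(weights)):
            base = weights[i]
            best_w = base
            best_err = corpus_error(rescore(nbest, weights), sentence_error)
            for step in range(steps):
                weights[i] = base - span + 2.0 * span * step / (steps - 1)
                err = corpus_error(rescore(nbest, weights), sentence_error)
                if err < best_err:
                    best_w, best_err = weights[i], err
            weights[i] = best_w
        return weights

Swapping the sentence_error callback is precisely the experimental knob the paper varies: the same search can minimize a Bleu-based, Ter-based, or Meteor-based error, and the resulting weights can then be evaluated under any of the metrics.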

Keywords

Minimum Error Rate Training · Machine translation evaluation · Log-linear phrase-based statistical machine translation · BLEU · METEOR · TER · Chunk penalty


References

  1. Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. Ann Arbor, MI, pp 65–72
  2. Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of Bleu in machine translation research. In: EACL-2006, Proceedings of the 11th conference of the European chapter of the association for computational linguistics. Trento, Italy, pp 249–256
  3. Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: Proceedings of the third workshop on statistical machine translation. Columbus, OH, pp 70–106
  4. Cer D, Jurafsky D, Manning C (2008) Regularization and search for minimum error rate training. In: Proceedings of the third workshop on statistical machine translation. Columbus, OH, pp 26–34
  5. Chiang D, DeNeefe S, Chan YS, Ng HT (2008) Decomposability of translation metrics for improved evaluation and efficient algorithms. In: Proceedings of the 2008 conference on empirical methods in natural language processing. Honolulu, HI, pp 610–619
  6. Dyer C, Setiawan H, Marton Y, Resnik P (2009) The University of Maryland statistical machine translation system for the fourth workshop on machine translation. In: Proceedings of the fourth workshop on statistical machine translation. Athens, Greece, pp 145–149
  7. He Y, Way A (2009) Improving the objective function in minimum error rate training. In: Proceedings of the twelfth machine translation summit. Ottawa, ON, Canada, pp 238–245
  8. Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Proceedings of the 2004 conference on empirical methods in natural language processing (EMNLP-2004). Barcelona, Spain, pp 388–395
  9. Lambert P, Giménez J, Costa-jussà MR, Amigó E, Banchs RE, Màrquez L, Fonollosa JAR (2006) Machine translation system development based on human likeness. In: Proceedings of the IEEE/ACL workshop on spoken language technology. Palm Beach, Aruba, pp 246–249
  10. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8): 707–710
  11. Macherey W, Och F, Thayer I, Uszkoreit J (2008) Lattice-based minimum error rate training for statistical machine translation. In: Proceedings of the 2008 conference on empirical methods in natural language processing. Honolulu, HI, pp 725–734
  12. Moore RC, Quirk C (2008) Random restarts in minimum error rate training for statistical machine translation. In: Proceedings of the 22nd international conference on computational linguistics (Coling 2008). Manchester, UK, pp 585–592
  13. Och FJ (2003) Minimum error rate training in statistical machine translation. In: 41st annual meeting of the association for computational linguistics. Sapporo, Japan, pp 160–167
  14. Och FJ, Ney H (2002) Discriminative training and maximum entropy models for statistical machine translation. In: 40th annual meeting of the association for computational linguistics. Philadelphia, PA, pp 295–302
  15. Owczarzak K, van Genabith J, Way A (2007) Labelled dependencies in machine translation evaluation. In: Proceedings of the second workshop on statistical machine translation. Prague, Czech Republic, pp 104–111
  16. Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: 40th annual meeting of the association for computational linguistics. Philadelphia, PA, pp 311–318
  17. Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006, Proceedings of the 7th conference of the association for machine translation in the Americas, visions for the future of machine translation. Cambridge, MA, pp 223–231

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. CNGL, School of Computing, Dublin City University, Dublin, Ireland
