Machine Translation

Volume 24, Issue 1, pp 39–50

Machine translation evaluation versus quality estimation


Abstract

Most evaluation metrics for machine translation (MT) require reference translations for each sentence in order to produce a score reflecting certain aspects of its quality. The de facto standard metrics, BLEU and NIST, are known to correlate well with human evaluation at the corpus level, but not at the segment level. As an attempt to overcome these two limitations, we address the problem of evaluating the quality of MT as a prediction task, where reference-independent features are extracted from the input sentences and their translations, and a quality score is obtained based on models produced from training data. We show that this approach yields better correlation with human evaluation than commonly used metrics, even with models trained on different MT systems, language pairs and text domains.
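The prediction task sketched in the abstract can be illustrated with a minimal, self-contained example: compute a few reference-independent features from a source sentence and its translation, then fit a regression model mapping features to quality scores. The three features and the plain gradient-descent linear model below are invented stand-ins for illustration only; the actual system described in the paper uses a much richer feature set and SVM regression.

```python
# Illustrative quality-estimation (QE) sketch: reference-independent
# features + a regression model trained on quality scores.
# Features and learner are hypothetical, NOT the paper's configuration.

def extract_features(source: str, translation: str) -> list:
    """Features computed from the source sentence and its MT output only
    (no reference translation needed)."""
    src = source.split()
    tgt = translation.split()
    return [
        len(tgt) / max(len(src), 1),                          # target/source length ratio
        sum(t.isalpha() for t in tgt) / max(len(tgt), 1),     # share of alphabetic tokens
        min(len(src), 30) / 30.0,                             # normalized source length
    ]

def train_linear_qe(features, scores, lr=0.05, epochs=3000):
    """Fit a linear model w.x + b to quality scores with stochastic
    gradient descent (stand-in for the SVM regression used in the paper)."""
    n = len(features[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(features, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predicted quality score for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

Because the features never consult a reference translation, a model trained this way can be applied to unseen input at run time, which is the key difference between quality estimation and reference-based metrics such as BLEU or NIST.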

Keywords

Machine translation evaluation · Quality estimation · Confidence estimation


References

  1. Albrecht J, Hwa R (2007a) A re-examination of machine learning approaches for sentence-level MT evaluation. In: 45th meeting of the association for computational linguistics, Prague, pp 880–887
  2. Albrecht J, Hwa R (2007b) Regression for sentence-level MT evaluation with pseudo references. In: 45th meeting of the association for computational linguistics, Prague, pp 296–303
  3. Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2003) Confidence estimation for machine translation. Technical report, Johns Hopkins University, Baltimore
  4. Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: 20th COLING, Geneva, pp 315–321
  5. Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: 3rd workshop on statistical machine translation, Columbus, pp 70–106
  6. Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on statistical machine translation. In: 4th workshop on statistical machine translation, Athens, pp 1–28
  7. Chang C, Lin C (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
  8. Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
  9. Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Conference on human language technology, San Diego, pp 138–145
  10. Gamon M, Aue A, Smets M (2005) Sentence-level MT evaluation without reference translations: beyond language modeling. In: 10th meeting of the European association for machine translation, Budapest
  11. Gandrabur S, Foster G (2003) Confidence estimation for translation prediction. In: 7th conference on natural language learning, Edmonton, pp 95–102
  12. Giménez J, Màrquez L (2008) A smorgasbord of features for automatic MT evaluation. In: 3rd workshop on statistical machine translation, Columbus, OH, pp 195–198
  13. Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges CJC, Smola AJ (eds) Advances in kernel methods—support vector learning. MIT Press, Cambridge
  14. Johnson H, Sadat F, Foster G, Kuhn R, Simard M, Joanis E, Larkin S (2006) Portage: with smoothed phrase tables and segment choice models. In: Workshop on statistical machine translation, New York, pp 134–137
  15. Kääriäinen M (2009) Sinuhe—statistical machine translation using a globally trained conditional exponential family translation model. In: Conference on empirical methods in natural language processing, Singapore, pp 1027–1036
  16. Kadri Y, Nie JY (2006) Improving query translation with confidence estimation for cross language information retrieval. In: 15th ACM international conference on information and knowledge management, Arlington, pp 818–819
  17. Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Conference on empirical methods in natural language processing, Barcelona
  18. Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: 2nd workshop on statistical machine translation, Prague, pp 228–231
  19. Lin CY, Och FJ (2004) ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In: 20th COLING, Geneva, pp 501–507
  20. Padó S, Galley M, Jurafsky D, Manning CD (2009) Textual entailment features for machine translation evaluation. In: 4th workshop on statistical machine translation, Athens, pp 37–41
  21. Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th meeting of the association for computational linguistics, Philadelphia, pp 311–318
  22. Quirk CB (2004) Training a sentence-level machine translation confidence measure. In: 4th language resources and evaluation conference, Lisbon, pp 825–828
  23. Saunders C (2008) Application of Markov approaches to SMT. Technical report, SMART Project Deliverable 2.2
  24. Simard M, Cancedda N, Cavestro B, Dymetman M, Gaussier E, Goutte C, Yamada K (2005) Translating with non-contiguous phrases. In: Conference on empirical methods in natural language processing, Vancouver, pp 755–762
  25. Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: 7th conference of the association for machine translation in the Americas, Cambridge, MA, pp 223–231
  26. Specia L, Turchi M, Cancedda N, Dymetman M, Cristianini N (2009) Estimating the sentence-level quality of machine translation systems. In: 13th meeting of the European association for machine translation, Barcelona
  27. Ueffing N, Ney H (2005) Application of word-level confidence measures in interactive statistical machine translation. In: 10th meeting of the European association for machine translation, Budapest, pp 262–270

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Research Group in Computational Linguistics, University of Wolverhampton, Wolverhampton, UK
  2. Indian Institute of Information Technology, Allahabad, India
  3. European Commission – JRC (IPSC), Ispra, Italy
