Machine Translation, 22:1

Regression for machine translation evaluation at the sentence level


Abstract

Machine learning offers a systematic framework for developing metrics that use multiple criteria to assess the quality of machine translation (MT). However, learning introduces additional complexities that may affect the resulting metric’s effectiveness. First, a learned metric is more reliable for translations that are similar to its training examples; this calls into question whether it is as effective in evaluating translations from systems that are not its contemporaries. Second, metrics trained on different sets of training examples may exhibit variations in their evaluations. Third, expensive developmental resources (such as translations that have been evaluated by humans) may be needed as training examples. This paper investigates these concerns in the context of using regression to develop metrics for evaluating machine-translated sentences. We track a learned metric’s reliability across a five-year period to measure the extent to which the learned metric can evaluate sentences produced by other systems. We compare metrics trained under different conditions to measure their variations. Finally, we present an alternative formulation of metric training in which the features are based on comparisons against pseudo-references, in order to reduce the demand on human-produced resources. Our results confirm that regression is a useful approach for developing new metrics for MT evaluation at the sentence level.
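The regression formulation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: it uses toy data, two simple n-gram-precision features computed against a (pseudo-)reference, and plain gradient-descent least squares in place of the richer feature set and learner (e.g. support vector regression) that such work typically employs. All data, feature choices, and hyperparameters below are illustrative assumptions.

```python
def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also appear in the reference."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    hyp_ngrams = [tuple(hyp_toks[i:i + n]) for i in range(len(hyp_toks) - n + 1)]
    ref_ngrams = {tuple(ref_toks[i:i + n]) for i in range(len(ref_toks) - n + 1)}
    if not hyp_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in hyp_ngrams) / len(hyp_ngrams)

def features(hyp, refs):
    """Bias term plus best n-gram precision (n = 1, 2) over the references.

    With pseudo-references, `refs` would hold outputs of other MT systems
    rather than human translations; the feature computation is unchanged.
    """
    return [1.0] + [max(ngram_precision(hyp, r, n) for r in refs) for n in (1, 2)]

def fit_least_squares(X, y, lr=0.1, steps=2000):
    """Gradient-descent least squares; a stand-in for the SVM regression learner."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w

# Toy training set: (MT output, human adequacy judgment on a 1-5 scale).
refs = ["the cat sat on the mat"]
train = [
    ("the cat sat on the mat", 5.0),
    ("the cat is on the mat", 4.0),
    ("a dog ran in the park", 1.0),
]
X = [features(h, refs) for h, _ in train]
y = [s for _, s in train]
w = fit_least_squares(X, y)

def score(hyp):
    """Predicted quality of an unseen hypothesis sentence."""
    return sum(wj * xj for wj, xj in zip(w, features(hyp, refs)))

# A hypothesis closer to the reference should receive a higher predicted score.
assert score("the cat sat on the mat") > score("a dog ran in the park")
```

The learned weights encode how strongly each overlap feature predicts human judgments; the reliability concerns raised above correspond to how well `score` transfers to hypotheses unlike those in `train`.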

Keywords

Machine translation · Evaluation metrics · Machine learning


Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

Department of Computer Science, University of Pittsburgh, Pittsburgh, USA
