
Machine Translation

Volume 24, Issue 3–4, pp 209–240

Linguistic measures for automatic machine translation evaluation

  • Jesús Giménez
  • Lluís Màrquez
Original Paper

Abstract

Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias into the development cycle which, in some cases, has been reported to have very negative consequences. To tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding substantially improved evaluation quality.
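The combination of heterogeneous similarity scores described above can be sketched as a simple uniform linear combination of per-dimension measures. This is only an illustrative sketch, not the authors' implementation: the function name, the three dimensions, and all score values below are hypothetical assumptions.

```python
# Minimal sketch of combining heterogeneous MT evaluation measures
# by uniform linear combination. All names and values are hypothetical.

def combine_scores(scores):
    """Uniformly average per-dimension similarity scores (each in [0, 1])."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-dimension scores for two candidate systems.
system_a = {"lexical": 0.62, "syntactic": 0.48, "semantic": 0.55}
system_b = {"lexical": 0.58, "syntactic": 0.60, "semantic": 0.59}

# The combined measure can rank systems even when individual
# dimensions disagree (here, A wins on lexical but B wins overall).
print(round(combine_scores(system_a), 4))  # 0.55
print(round(combine_scores(system_b), 4))  # 0.59
```

The point of the sketch is that a combined measure remains usable even when some component measures are unreliable on a given sentence (e.g., due to parsing failures), since the other dimensions still contribute to the aggregate score.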

Keywords

Machine translation · Automatic evaluation methods · Linguistic analysis · Syntactic similarity · Semantic similarity · Combined measures



Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Universitat Politècnica de Catalunya, Barcelona, Spain
