
Linguistic measures for automatic machine translation evaluation

  • Original Paper
Machine Translation

Abstract

Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias into the development cycle which, in some cases, has been reported to carry very negative consequences. To tackle this methodological problem, we explore a novel path towards heterogeneous automatic machine translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding substantially improved evaluation quality.
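The combination idea mentioned at the end of the abstract can be sketched as a uniform linear combination of normalized metric scores: each measure scores every candidate segment, scores are rescaled to a common range, and the per-measure scores are averaged. This is a minimal illustration only, not the paper's actual implementation; the metric names and score values below are invented for the example.

```python
def normalize(scores):
    """Min-max normalize a list of raw metric scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def combine(metric_scores):
    """Uniformly average normalized scores across metrics.

    metric_scores: dict mapping metric name -> per-segment scores.
    Returns one combined score per segment.
    """
    normalized = [normalize(v) for v in metric_scores.values()]
    # zip(*...) walks the segments; each `col` holds one segment's
    # normalized score under every metric.
    return [sum(col) / len(normalized) for col in zip(*normalized)]

# Hypothetical scores for three candidate translations under three
# measures operating at different linguistic levels.
scores = {
    "lexical":   [0.60, 0.20, 0.40],
    "syntactic": [0.90, 0.10, 0.50],
    "semantic":  [0.70, 0.30, 0.60],
}
print(combine(scores))  # → [1.0, 0.0, 0.5833...]
```

Uniform weighting keeps the combination parameter-free, which matters when no human-judged development data is available to tune per-metric weights.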



Author information

Correspondence to Jesús Giménez.


About this article

Cite this article

Giménez, J., Màrquez, L. Linguistic measures for automatic machine translation evaluation. Machine Translation 24, 209–240 (2010). https://doi.org/10.1007/s10590-011-9088-7