Abstract
Assessing the quality of candidate translations involves diverse linguistic facets. However, most automatic evaluation methods in use today rely on limited quality assumptions, such as lexical similarity. This introduces a bias into the development cycle that, in some cases, has been reported to have serious negative consequences. In order to tackle this methodological problem, we explore a novel path towards heterogeneous automatic Machine Translation evaluation. We have compiled a rich set of specialized similarity measures operating at different linguistic dimensions and analyzed their individual and collective behaviour over a wide range of evaluation scenarios. Results show that measures based on syntactic and semantic information are able to provide more reliable system rankings than lexical measures, especially when the systems under evaluation are based on different paradigms. At the sentence level, while some linguistic measures perform better than most lexical measures, others perform substantially worse, mainly due to parsing problems. Their scores are, however, suitable for combination, yielding substantially improved evaluation quality.
Giménez, J., Màrquez, L. Linguistic measures for automatic machine translation evaluation. Machine Translation 24, 209–240 (2010). https://doi.org/10.1007/s10590-011-9088-7