Neutralizing the Effect of Translation Shifts on Automatic Machine Translation Evaluation
State-of-the-art automatic Machine Translation (MT) evaluation is based on the idea that the closer MT output is to Human Translation (HT), the higher its quality. Automatic evaluation is therefore typically approached by measuring some form of similarity between machine and human translations. The most widely used evaluation systems calculate similarity at the surface level, for example by counting shared word n-grams. The correlation between automatic and manual evaluation scores at the sentence level is still not satisfactory. One of the main reasons is that metrics assign too low a score to acceptable candidate translations, because they cannot handle lexical and syntactic variation between possible translation options. Acceptable differences between candidate and reference translations are frequently due to optional translation shifts. It is common practice in HT to paraphrase what could be viewed as a close version of the source text in order to adapt it to target-language use. When a reference translation contains such changes, using it as the only point of comparison is less informative, because the differences are not indicative of MT errors. To alleviate this problem, we design a paraphrase generation system based on a set of rules that model prototypical optional shifts that human translators may have applied. By applying the rules to the available human reference, the system generates additional references in a principled and controlled way. We show how using linguistic rules to generate additional references neutralizes the negative effect of optional translation shifts on n-gram-based MT evaluation.
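The effect described above can be illustrated with a minimal sketch of clipped n-gram precision computed against one or several references, in the style of BLEU's modified precision. This is not the authors' system: the sentences are invented for illustration, the extra reference is written by hand to stand in for the output of a hypothetical passive-to-active rule, and the function names `ngrams` and `clipped_precision` are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the word n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Share of candidate n-grams found in at least one reference.

    Each candidate n-gram count is clipped to the maximum count of that
    n-gram in any single reference, as in BLEU's modified precision.
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


# Invented example: the human reference contains an optional shift
# (passive voice), while the MT output stays close to an active source.
candidate = "the committee took the decision"
human_ref = "the decision was taken by the committee"
# Hand-written stand-in for a rule-generated additional reference
# (hypothetical passive-to-active rule, not taken from the paper).
extra_ref = "the committee took the decision"

single = clipped_precision(candidate, [human_ref], 2)
multi = clipped_precision(candidate, [human_ref, extra_ref], 2)
print(single, multi)
```

With the human reference alone, only two of the four candidate bigrams match, although the candidate is an acceptable translation; adding the paraphrased reference removes that penalty. Taking the per-n-gram maximum over references keeps the single-reference behaviour unchanged when no extra references are supplied.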
Keywords: Translation shifts, Machine Translation Evaluation, Paraphrase Generation