Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

Klubička, Filip; Toral, Antonio; Sánchez-Cartagena, Víctor M.

doi:10.1007/s10590-018-9214-x

Published: 10 February 2018

Volume 32, pages 195–215, (2018)

Machine Translation

Filip Klubička¹ (ORCID: orcid.org/0000-0001-9712-6141), Antonio Toral² & Víctor M. Sánchez-Cartagena³

Abstract

This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established multidimensional quality metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we designed an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which made the annotation process feasible and accurate. Errors in MT outputs were then annotated by two annotators following this taxonomy. Subsequently, we carried out a statistical analysis which showed that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conducted an additional analysis of agreement errors in which we distinguished between short-distance (phrase-level) and long-distance (sentence-level) errors. We discovered that phrase-based MT approaches are of limited use for long-distance agreement phenomena, for which neural MT was found to be especially effective.
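
To make the idea of testing per-error-type differences concrete, the following is a minimal sketch, not the authors' implementation. It assumes that, for a given MQM error type, each system's output can be summarised as the number of tokens carrying that error versus the number of tokens without it, and that two systems are compared with a Pearson chi-squared test on the resulting 2×2 table. The function names and all counts are hypothetical and chosen only for illustration; the relative error reduction helper shows the kind of arithmetic behind a figure such as the reported 54%.

```python
import math


def chi_squared_2x2(err_a, total_a, err_b, total_b):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table.

    Rows are two MT systems; columns are tokens with and without a given
    error type. Returns (statistic, p_value) for 1 degree of freedom.
    This table layout is an assumption made for illustration.
    """
    observed = [
        [err_a, total_a - err_a],
        [err_b, total_b - err_b],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_sums)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand_total
            if expected == 0:
                raise ValueError("expected count is zero; test is undefined")
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


def relative_error_reduction(errors_baseline, errors_system):
    """Fraction of the baseline's errors that the other system removes."""
    return (errors_baseline - errors_system) / errors_baseline


# Hypothetical counts, not taken from the paper.
stat, p = chi_squared_2x2(err_a=200, total_a=2000, err_b=100, total_b=2000)
print(f"chi2 = {stat:.3f}, p = {p:.3g}")
print(f"relative reduction = {relative_error_reduction(200, 100):.0%}")
```

Under this setup, a difference for an error type would be called significant when the p-value falls below a chosen threshold such as 0.05; when many error types are tested, a correction for multiple comparisons may also be appropriate.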

Notes

http://www.statmt.org/wmt16/translation-task.html.
http://www.statmt.org/wmt17/translation-task.html.
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.
https://www.clarin.si/repository/xmlui/handle/11356/1058.
http://tinyurl.com/CroatianAcquis.
http://www.opensubtitles.org/.
http://opus.nlpl.eu/SETIMES2.php.
http://opus.nlpl.eu/TedTalks.php.
http://opus.nlpl.eu/.
http://www.statmt.org/wmt12/translation-task.html.
http://www.statmt.org/wmt13/translation-task.html.
https://github.com/moses-smt/mosesdecoder/tree/RELEASE-3.0.
http://www.qt21.eu/mqm-definition/definition-2015-06-16.html.
http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html.
http://www.qt21.eu/downloads/MQM-usage-guidelines.pdf.
http://www.translate5.net/.
The instructions include a handy decision tree to aid in the annotation process. It can be found at the following URL: http://www.qt21.eu/downloads/annotatorsGuidelines-2014-06-11.pdf.
https://github.com/GreenParachute/mqm-eng-cro/.
Unlike in SMT jargon, here a phrase refers to a grammatical unit, not just a string of contiguous words.

Acknowledgements

We would like to extend our thanks to Maja Popović, who provided invaluable advice, and Denis Kranjčić, who performed the annotation together with Filip Klubička, first author of the paper. This research was partly funded by the ADAPT Centre, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. This research has also received funding from the European Union Seventh Framework Programme FP7/2007-2013 under Grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and the Swiss National Science Foundation Grant 74Z0_160501 (ReLDI).

Author information

Authors and Affiliations

Dublin Institute of Technology, Dublin, Ireland
Filip Klubička
University of Groningen, Groningen, The Netherlands
Antonio Toral
Prompsit Language Engineering, Alacant, Spain
Víctor M. Sánchez-Cartagena

Corresponding author

Correspondence to Filip Klubička.

About this article

Cite this article

Klubička, F., Toral, A. & Sánchez-Cartagena, V.M. Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian. Machine Translation 32, 195–215 (2018). https://doi.org/10.1007/s10590-018-9214-x

Received: 24 August 2017
Accepted: 27 January 2018
Published: 10 February 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s10590-018-9214-x
