Machine Translation, Volume 32, Issue 3, pp 195–215

Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

  • Filip Klubička
  • Antonio Toral
  • Víctor M. Sánchez-Cartagena


Abstract

This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established multidimensional quality metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, and compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which makes the annotation process feasible and accurate. Two annotators then annotate the errors in the MT outputs following this taxonomy. A subsequent statistical analysis shows that the best-performing system (neural) reduces the errors produced by the worst-performing system (pure phrase-based) by more than half (54%). Moreover, an additional analysis of agreement errors, in which we distinguish between short-distance (phrase-level) and long-distance (sentence-level) errors, reveals that phrase-based MT approaches are of limited use for long-distance agreement phenomena, whereas neural MT handles them especially effectively.
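The abstract describes testing whether per-error-type count differences between two MT systems are statistically significant. As a minimal illustrative sketch (not the paper's exact procedure), one common way to test such a difference is a Pearson chi-squared test on a 2×2 contingency table of errors versus non-errors; the counts below are hypothetical:

```python
import math

def chi2_test_2x2(errors_a, total_a, errors_b, total_b):
    """Pearson chi-squared test (1 degree of freedom) on a 2x2 table:
    errors vs. non-errors for two MT systems (hypothetical setup)."""
    table = [
        [errors_a, total_a - errors_a],
        [errors_b, total_b - errors_b],
    ]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-squared with 1 df, via the complementary
    # error function: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical error counts per system over the same evaluation set
stat, p = chi2_test_2x2(errors_a=120, total_a=2000, errors_b=55, total_b=2000)
print(f"chi2 = {stat:.2f}, p = {p:.2g}")
```

A p-value below 0.05 would indicate that the two systems' error rates for the given MQM error type differ significantly; libraries such as SciPy offer the same test as `scipy.stats.chi2_contingency`.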


Keywords

Neural machine translation · Statistical machine translation · Phrase-based machine translation · Factored models · Human evaluation · Error annotation · Multidimensional quality metrics (MQM)



Acknowledgements

We would like to extend our thanks to Maja Popović, who provided invaluable advice, and Denis Kranjčić, who performed the annotation together with Filip Klubička, first author of the paper. This research was partly funded by the ADAPT Centre, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. This research has also received funding from the European Union Seventh Framework Programme FP7/2007-2013 under Grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and the Swiss National Science Foundation Grant 74Z0_160501 (ReLDI).



Copyright information

© Springer Science+Business Media B.V., part of Springer Nature 2018

Authors and Affiliations

  1. Dublin Institute of Technology, Dublin, Ireland
  2. University of Groningen, Groningen, The Netherlands
  3. Prompsit Language Engineering, Alacant, Spain
