Abstract
This article reports a multifaceted comparison between statistical and neural machine translation (MT) systems that were developed for translation of data from massive open online courses (MOOCs). The study uses four language pairs: English to German, Greek, Portuguese, and Russian. Translation quality is evaluated using automatic metrics and human evaluation, carried out by professional translators. Results show that neural MT is preferred in side-by-side ranking, and is found to contain fewer overall errors. Results are less clear-cut for some error categories, and for temporal and technical post-editing effort. In addition, results are reported based on sentence length, showing advantages and disadvantages depending on the particular language pair and MT paradigm.
Notes
SDL has recently claimed to have cracked Russian to English NMT. See https://www.sdl.com/about/newsmedia/press/2018/sdl-cracksrussian-to-englishneural-machinetranslation.html
Britz et al. (2017), for example, used 250,000 GPU hours, equivalent to roughly 75,000 kWh for GPU power consumption alone, when testing various methods of building and extending NMT systems.
Results were statistically significant in a one-way ANOVA pairwise comparison (p < 0.05).
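As a quick check on the energy figure in the note on Britz et al. (2017), the short sketch below derives the average per-GPU power draw that the two quoted numbers imply. The function name is hypothetical, and the resulting wattage is only what follows arithmetically from the quoted totals; it is not a figure reported in the source.

```python
# Figures quoted in the note on Britz et al. (2017).
GPU_HOURS = 250_000   # total GPU hours
ENERGY_KWH = 75_000   # approximate GPU energy consumption in kWh


def implied_average_power_watts(gpu_hours: float, energy_kwh: float) -> float:
    """Return the average per-GPU power draw (watts) implied by the totals.

    Hypothetical helper for illustration: energy (kWh) divided by time (h)
    gives kW, and multiplying by 1000 converts kW to W.
    """
    return energy_kwh * 1000.0 / gpu_hours


if __name__ == "__main__":
    watts = implied_average_power_watts(GPU_HOURS, ENERGY_KWH)
    print(f"Implied average draw: {watts:.0f} W per GPU")
```

Under these numbers the two totals are consistent with each other if each GPU draws about 0.3 kW on average.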
References
Abdelali A, Guzman F, Sajjad H, Vogel S (2014) The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the 9th international conference on language resources and evaluation (LREC’14), Reykjavik, Iceland, pp 1856–1862
Aziz W, Castilho S, Specia L (2012) PET: a tool for post-editing and assessing machine translation. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, pp 3982–3987
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, vol 29. Ann Arbor, Michigan pp 65–72
Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas, pp 257–267
Biber D, Conrad S (2009) Register, genre, and style. Cambridge University Press, Cambridge
Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Neveol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K, Zampieri M (2016) Findings of the 2016 conference on machine translation. In: Proceedings of the 1st conference on machine translation, Berlin, Germany, pp 131–198
Britz D, Goldie A, Luong M, Le QV (2017) Massive exploration of neural machine translation architectures. arXiv:1703.03906
Burchardt A, Macketanz V, Dehdari J, Heigold G, Peter JT, Williams P (2017) A linguistic evaluation of rule-based, phrase-based, and neural MT engines. Prague Bull Math Linguist 108(1):159–170
Castilho S, Moorkens J, Gaspari F, Calixto I, Tinsley J, Way A (2017a) Is neural machine translation the new state of the art? Prague Bull Math Linguist 108(1):109–120
Castilho S, Moorkens J, Gaspari F, Sennrich R, Sosoni V, Georgakopoulou P, Lohar P, Way A, Miceli Barone AV, Gialama M (2017b) A comparative quality evaluation of PBSMT and NMT using professional translators. In: Proceedings of MT Summit 2017, Nagoya, Japan, pp 116–131
Cettolo M, Girardi C, Federico M (2012) WIT³: web inventory of transcribed and translated talks. In: Proceedings of the 16th conference of the European association for machine translation (EAMT), Trento, Italy, pp 261–268
Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, Montreal, Canada, pp 427–436
Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting of the association for computational linguistics, Ann Arbor, Michigan, pp 263–270
Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259, http://arxiv.org/abs/1409.1259
Costa-jussà MR, Farrús M (2015) Towards human linguistic machine translation evaluation. Digit Scholarsh Humanit 30(2):157–166
Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the 2nd international conference on human language technology research, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, HLT ’02, pp 138–145
Durrani N, Fraser A, Schmid H (2013) Model with minimal translation units, but decode with phrases. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies NAACL, Atlanta, GA, USA, pp 1–11
Durrani N, Sajjad H, Hoang H, Koehn P (2014) Integrating an unsupervised transliteration model into statistical machine translation. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics, EACL 2014, Gothenburg, Sweden, pp 148–153
Elliott D, Hartley A, Atwell E (2004) A fluency error categorization scheme to guide automated machine translation evaluation. In: Machine Translation: From Real Users to Research: Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, Berlin and Heidelberg, Springer, pp 64–73
Federico M, Negri M, Bentivogli L, Turchi M (2014) Assessing the impact of translation errors on machine translation quality with mixed-effects models. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, pp 1643–1653
Galley M, Manning CD (2008) A simple and effective hierarchical phrase reordering model. In: Proceedings of the conference on empirical methods in natural language processing, Honolulu, Hawaii, EMNLP ’08, pp 848–856
Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: Software engineering, testing, and quality assurance for natural language processing (SETQA-NLP ’08). Columbus, OH, USA, pp 49–57
Gaspari F, Hutchins WJ (2007) Online and free! Ten years of online machine translation: origins, developments, current use and future prospects. In: Proceedings of MT Summit XI, Copenhagen, Denmark, pp 199–206
Hassan H, Aue A, Chen C, Chowdhary V, Clark J, Federmann C, Huang X, Junczys-Dowmunt M, Lewis W, Li M, Liu S, Liu TY, Luo R, Menezes A, Qin T, Seide F, Tan X, Tian F, Wu L, Wu S, Xia Y, Zhang D, Zhang Z, Zhou M (2018) Achieving human parity on automatic Chinese to English news translation. https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Heafield K (2011) Faster and smaller language model queries. In: Proceedings of the 6th workshop on statistical machine translation, Edinburgh, Scotland, UK, pp 187–197
Jean S, Firat O, Cho K, Memisevic R, Bengio Y (2015) Montreal neural machine translation systems for WMT’15. In: Proceedings of the 10th workshop on statistical machine translation, Lisbon, Portugal, pp 134–140
Klubička F, Toral A, Sánchez-Cartagena VM (2017) Fine-grained human evaluation of neural versus phrase-based machine translation. Prague Bull Math Linguist 108(1):121–132
Kneser R, Ney H (1995) Improved backing-off for m-gram language modeling. In: Proceedings of the international conference on acoustics, speech, and signal processing, vol 1. Detroit, MI, USA, pp 181–184
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th machine translation summit, Phuket, Thailand, pp 79–86
Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the 1st workshop on neural machine translation, Vancouver, BC, Canada, pp 28–39
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the ACL-2007 demo and poster sessions, Association for computational linguistics, Prague, Czech Republic, pp 177–180
Koponen M (2010) Assessing machine translation quality with error analysis. In: Electronic proceedings of the KaTu symposium on translation and interpreting studies. vol 4, pp 1–12
Krings HP (2001) Repairing texts: empirical investigations of machine translation post-editing processes. Kent State University Press, Kent
Kucera H, Francis WN (1967) Computational analysis of present-day American English. Brown University Press, Providence
Lehtonen M (2015) On sentence length distribution as an authorship attribute. In: Kim KJ (ed) Information science and applications. Springer, Berlin, Heidelberg, pp 811–818
Ljubešić N, Bago P, Boras D (2010) Statistical machine translation of Croatian weather forecast: How much data do we need? In: Proceedings of the ITI 2010, 32nd international conference on information technology interfaces, SRCE University Computing Centre, Zagreb, pp 91–96
Lommel A, DePalma DA (2016) Europe’s leading role in machine translation: how Europe is driving the shift to MT. Common Sense Advisory, Boston
Lommel A, Uszkoreit H, Burchardt A (2014) Multidimensional quality metrics (MQM): a framework for declaring and describing translation quality metrics. Tradumàtica 12:455–463
Luong MT, Manning CD (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the international workshop on spoken language translation 2015, Da Nang, Vietnam, pp 76–79
Moorkens J (2017) Under pressure: translation in times of austerity. Perspectives 25(3):464–477
Moorkens J, O’Brien S (2015) Post-editing evaluations: trade-offs between novice and professional participants. In: Proceedings of European association for machine translation (EAMT), Antalya, Turkey, pp 75–81
Neubig G, Morishita M, Nakamura S (2015) Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. arXiv:1510.05203
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, Philadelphia, Pennsylvania, pp 311–318
Popović M (2015) chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the 10th workshop on machine translation (WMT 2015), Lisbon, Portugal, pp 392–395
Popović M (2017) Comparing language related issues for NMT and PBMT between German and English. Prague Bull Math Linguist 108(1):209–220
Popović M, Arcan M, Lommel A (2016) Potential and limits of using post-edits as reference translations for MT evaluation. Balt J Mod Comput 4(2):218–229
Schuster M, Johnson M, Thorat N (2016) Zero-shot translation with Google’s multilingual neural machine translation system. https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
Sennrich R, Haddow B, Birch A (2016a) Edinburgh neural machine translation systems for WMT 16. In: Proceedings of the 1st conference on machine translation (WMT16), Berlin, Germany, pp 371–376
Sennrich R, Haddow B, Birch A (2016b) Improving neural machine translation models with monolingual data. In: Proceedings of the 54th annual meeting of the association for computational linguistics (ACL 2016), Berlin, Germany, pp 86–96
Sennrich R, Haddow B, Birch A (2016c) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics (ACL 2016), Berlin, Germany, pp 1715–1725
Sennrich R, Birch A, Currey A, Germann U, Haddow B, Heafield K, Miceli Barone AV, Williams P (2017a) The University of Edinburgh’s neural MT systems for WMT17. In: Proceedings of the 2nd conference on machine translation, Copenhagen, Denmark, pp 389–399
Sennrich R, Firat O, Cho K, Birch A, Haddow B, Hitschler J, Junczys-Dowmunt M, Läubli S, Miceli Barone AV, Mokry J, Nadejde M (2017b) Nematus: a toolkit for neural machine translation. In: Proceedings of the software demonstrations of the 15th conference of the European chapter of the association for computational linguistics, Valencia, Spain, pp 65–68
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of the association for machine translation in the Americas, Cambridge, Massachusetts, pp 223–231
Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th international conference on language resources and evaluation, Genoa, Italy, pp 2142–2147
Steinberger R, Eisele A, Klocek S, Pilos S, Schlüter P (2012) DGT-TM: a freely available translation memory in 22 languages. In: Proceedings of the 8th international conference on language resources and evaluation, Istanbul, Turkey, pp 454–459
Stymne S (2013) Using a grammar checker and its error typology for annotation of statistical machine translation errors. In: Proceedings of the 24th Scandinavian conference of linguistics, pp 332–344
Stymne S, Ahrenberg L (2012) On the practice of error analysis for machine translation evaluation. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, pp 1785–1790
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. arXiv:1409.3215
Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’2012), Istanbul, Turkey, pp 2214–2218
Toral A, Sánchez-Cartagena VM (2017) A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, Valencia, Spain, pp 1063–1073
Tyers FM, Alperen MS (2010) South-East European Times: a parallel corpus of Balkan languages. In: Proceedings of the LREC workshop on exploitation of multilingual resources and tools for Central and (South-) Eastern European languages, Malta, pp 49–53
Štajner S, Querido A, Rendeiro N, Rodrigues JA, Branco A (2016) Use of domain-specific language resources in machine translation. In: Proceedings of the 10th international conference on language resources and evaluation (LREC’16), Paris, France, pp 592–598
Westin I (2002) Language change in English newspaper editorials. Rodopi, Amsterdam and New York
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144
Zeiler MD (2012) ADADELTA: an adaptive learning rate method. arXiv:1212.5701
Acknowledgements
The TraMOOC project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644333. The ADAPT Centre for Digital Content Technology at Dublin City University is funded under the Science Foundation Ireland Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We would also like to thank Maja Popović for invaluable brainstorming.
Cite this article
Castilho, S., Moorkens, J., Gaspari, F. et al. Evaluating MT for massive open online courses. Machine Translation 32, 255–278 (2018). https://doi.org/10.1007/s10590-018-9221-y
DOI: https://doi.org/10.1007/s10590-018-9221-y