Evaluating MT for massive open online courses

A multifaceted comparison between PBSMT and NMT systems

Abstract

This article reports a multifaceted comparison between statistical and neural machine translation (MT) systems that were developed for translation of data from massive open online courses (MOOCs). The study uses four language pairs: English to German, Greek, Portuguese, and Russian. Translation quality is evaluated using automatic metrics and human evaluation, carried out by professional translators. Results show that neural MT is preferred in side-by-side ranking, and is found to contain fewer overall errors. Results are less clear-cut for some error categories, and for temporal and technical post-editing effort. In addition, results are reported based on sentence length, showing advantages and disadvantages depending on the particular language pair and MT paradigm.

Notes

  1. SDL has recently claimed to have cracked Russian to English NMT. See https://www.sdl.com/about/newsmedia/press/2018/sdl-cracksrussian-to-englishneural-machinetranslation.html

  2. http://tramooc.eu/

  3. http://www.statmt.org/wmt16/

  4. Britz et al. (2017), for example, used 250,000 GPU hours, equivalent to roughly 75,000 kWh for GPU power consumption alone, when testing various methods of building and extending NMT systems.

  5. http://www.opensubtitles.org

  6. https://translate.yandex.ru/corpus

  7. Results were statistically significant in a one-way ANOVA pairwise comparison (p < 0.05).

  8. http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html

References

  • Abdelali A, Guzman F, Sajjad H, Vogel S (2014) The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the 9th international conference on language resources and evaluation (LREC’14), Reykjavik, Iceland, pp 1856–1862

  • Aziz W, Castilho S, Specia L (2012) PET: a tool for post-editing and assessing machine translation. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, pp 3982–3987

  • Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473

  • Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, vol 29. Ann Arbor, Michigan pp 65–72

  • Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas, pp 257–267

  • Biber D, Conrad S (2009) Register, genre, and style. Cambridge University Press, Cambridge

  • Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Neveol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K, Zampieri M (2016) Findings of the 2016 conference on machine translation. In: Proceedings of the 1st conference on machine translation, Berlin, Germany, pp 131–198

  • Britz D, Goldie A, Luong M, Le QV (2017) Massive exploration of neural machine translation architectures. arXiv:1703.03906

  • Burchardt A, Macketanz V, Dehdari J, Heigold G, Peter JT, Williams P (2017) A linguistic evaluation of rule-based, phrase-based, and neural MT engines. Prague Bull Math Linguist 108(1):159–170

  • Castilho S, Moorkens J, Gaspari F, Calixto I, Tinsley J, Way A (2017a) Is neural machine translation the new state of the art? Prague Bull Math Linguist 108(1):109–120

  • Castilho S, Moorkens J, Gaspari F, Sennrich R, Sosoni V, Georgakopoulou P, Lohar P, Way A, Miceli Barone AV, Gialama M (2017b) A comparative quality evaluation of PBSMT and NMT using professional translators. MT Summit 2017. Nagoya, Japan, pp 116–131

  • Cettolo M, Girardi C, Federico M (2012) Wit³: web inventory of transcribed and translated talks. In: Proceedings of the 16th conference of the European association for machine translation (EAMT), Trento, Italy, pp 261–268

  • Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, Montreal, Canada, pp 427–436

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Ann Arbor, Michigan, pp 263–270

  • Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259, http://arxiv.org/abs/1409.1259

  • Costa-jussà MR, Farrús M (2015) Towards human linguistic machine translation evaluation. Digit Scholarsh Humanit 30(2):157–166

  • Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the 2nd international conference on human language technology research, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, HLT ’02, pp 138–145

  • Durrani N, Fraser A, Schmid H (2013) Model with minimal translation units, but decode with phrases. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies NAACL, Atlanta, GA, USA, pp 1–11

  • Durrani N, Sajjad H, Hoang H, Koehn P (2014) Integrating an unsupervised transliteration model into statistical machine translation. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics, EACL 2014, Gothenburg, Sweden, pp 148–153

  • Elliott D, Hartley A, Atwell E (2004) A fluency error categorization scheme to guide automated machine translation evaluation. In: Machine Translation: From Real Users to Research: Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, Berlin and Heidelberg, Springer, pp 64–73

  • Federico M, Negri M, Bentivogli L, Turchi M (2014) Assessing the impact of translation errors on machine translation quality with mixed-effects models. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, pp 1643–1653

  • Galley M, Manning CD (2008) A simple and effective hierarchical phrase reordering model. In: Proceedings of the conference on empirical methods in natural language processing, Honolulu, Hawaii, EMNLP ’08, pp 848–856

  • Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: Software engineering, testing, and quality assurance for natural language processing (SETQA-NLP ’08). Columbus, OH, USA, pp 49–57

  • Gaspari F, Hutchins WJ (2007) Online and free! Ten years of online machine translation: origins, developments, current use and future prospects. In: Proceedings of MT Summit XI, Copenhagen, Denmark, pp 199–206

  • Hassan H, Aue A, Chen C, Chowdhary V, Clark J, Federmann C, Huang X, Junczys-Dowmunt M, Lewis W, Li M, Liu S, Liu TY, Luo R, Menezes A, Qin T, Seide F, Tan X, Tian F, Wu L, Wu S, Xia Y, Zhang D, Zhang Z, Zhou M (2018) Achieving human parity on automatic Chinese to English news translation. https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf

  • Heafield K (2011) Faster and smaller language model queries. In: Proceedings of the 6th workshop on statistical machine translation, Edinburgh, Scotland, UK, pp 187–197

  • Jean S, Firat O, Cho K, Memisevic R, Bengio Y (2015) Montreal neural machine translation systems for WMT’15. In: Proceedings of the 10th workshop on statistical machine translation, Lisbon, Portugal, pp 134–140

  • Klubička F, Toral A, Sánchez-Cartagena VM (2017) Fine-grained human evaluation of neural versus phrase-based machine translation. Prague Bull Math Linguist 108(1):121–132

  • Kneser R, Ney H (1995) Improved backing-off for m-gram language modeling. In: Proceedings of the international conference on acoustics, speech, and signal processing, vol 1. Detroit, MI, USA, pp 181–184

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th machine translation summit, Phuket, Thailand, pp 79–86

  • Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the 1st workshop on neural machine translation, Vancouver, BC, Canada, pp 28–39

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the ACL-2007 demo and poster sessions, Association for computational linguistics, Prague, Czech Republic, pp 177–180

  • Koponen M (2010) Assessing machine translation quality with error analysis. In: Electronic proceedings of the KaTu symposium on translation and interpreting studies. vol 4, pp 1–12

  • Krings HP (2001) Repairing texts: empirical investigations of machine translation post-editing processes. Kent State University Press, Kent

  • Kucera H, Francis WN (1967) Computational analysis of present-day American English. Brown University Press, Providence

  • Lehtonen M (2015) On sentence length distribution as an authorship attribute. In: Kim KJ (ed) Information science and applications. Springer, Berlin, Heidelberg, pp 811–818

  • Ljubešić N, Bago P, Boras D (2010) Statistical machine translation of Croatian weather forecast: How much data do we need? In: Proceedings of the ITI 2010, 32nd international conference on information technology interfaces, SRCE University Computing Centre, Zagreb, pp 91–96

  • Lommel A, DePalma DA (2016) Europe’s leading role in machine translation: how Europe is driving the shift to MT. Common Sense Advisory, Boston

  • Lommel A, Uszkoreit H, Burchardt A (2014) Multidimensional quality metrics (MQM): a framework for declaring and describing translation quality metrics. Tradumàtica 12:455–463

  • Luong MT, Manning CD (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the international workshop on spoken language translation 2015, Da Nang, Vietnam, pp 76–79

  • Moorkens J (2017) Under pressure: translation in times of austerity. Perspectives 25(3):464–477

  • Moorkens J, O’Brien S (2015) Post-editing evaluations: trade-offs between novice and professional participants. In: Proceedings of European association for machine translation (EAMT), Antalya, Turkey, pp 75–81

  • Neubig G, Morishita M, Nakamura S (2015) Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. arXiv preprint arXiv:1510.05203

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, Philadelphia, Pennsylvania, pp 311–318

  • Popović M (2015) chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the 10th workshop on machine translation (WMT 2015), Lisbon, Portugal, pp 392–395

  • Popović M (2017) Comparing language related issues for NMT and PBMT between German and English. Prague Bull Math Linguist 108(1):209–220

  • Popović M, Arcan M, Lommel A (2016) Potential and limits of using post-edits as reference translations for MT evaluation. Balt J Mod Comput 4(2):218–229

  • Schuster M, Johnson M, Thorat N (2016) Zero-shot translation with Google’s multilingual neural machine translation system. https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html

  • Sennrich R, Haddow B, Birch A (2016a) Edinburgh neural machine translation systems for WMT 16. In: Proceedings of the 1st conference on machine translation (WMT16), Berlin, Germany, pp 371–376

  • Sennrich R, Haddow B, Birch A (2016b) Improving neural machine translation models with monolingual data. In: Proceedings of the 54th annual meeting of the association for computational linguistics (ACL 2016), Berlin, Germany, pp 86–96

  • Sennrich R, Haddow B, Birch A (2016c) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics (ACL 2016), Berlin, Germany, pp 1715–1725

  • Sennrich R, Birch A, Currey A, Germann U, Haddow B, Heafield K, Miceli Barone AV, Williams P (2017a) The University of Edinburgh’s neural MT systems for WMT17. In: Proceedings of the 2nd conference on machine translation, Copenhagen, Denmark, pp 389–399

  • Sennrich R, Firat O, Cho K, Birch A, Haddow B, Hitschler J, Junczys-Dowmunt M, Läubli S, Miceli Barone AV, Mokry J, Nadejde M (2017b) Nematus: a toolkit for neural machine translation. In: Proceedings of the software demonstrations of the 15th conference of the European chapter of the association for computational linguistics, Valencia, Spain, pp 65–68

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th conference of the association for machine translation in the Americas, Cambridge, Massachusetts, pp 223–231

  • Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th international conference on language resources and evaluation, Genoa, Italy, pp 2142–2147

  • Steinberger R, Eisele A, Klocek S, Pilos S, Schlüter P (2012) DGT-TM: a freely available translation memory in 22 languages. In: Proceedings of the 8th international conference on language resources and evaluation, Istanbul, Turkey, pp 454–459

  • Stymne S (2013) Using a grammar checker and its error typology for annotation of statistical machine translation errors. In: Proceedings of the 24th Scandinavian conference of linguistics, pp 332–344

  • Stymne S, Ahrenberg L (2012) On the practice of error analysis for machine translation evaluation. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, pp 1785–1790

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. arXiv:1409.3215

  • Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’2012), Istanbul, Turkey, pp 2214–2218

  • Toral A, Sánchez-Cartagena VM (2017) A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, Valencia, Spain, pp 1063–1073

  • Tyers FM, Alperen MS (2010) South-East European Times: a parallel corpus of Balkan languages. In: Proceedings of the LREC workshop on exploitation of multilingual resources and tools for Central and (South-) Eastern European languages, Malta, pp 49–53

  • Štajner S, Querido A, Rendeiro N, Rodrigues JA, Branco A (2016) Use of domain-specific language resources in machine translation. In: Proceedings of the 10th international conference on language resources and evaluation (LREC’16), Paris, France, pp 592–598

  • Westin I (2002) Language change in English newspaper editorials. Rodopi, Amsterdam and New York

  • Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144

  • Zeiler MD (2012) ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701

Acknowledgements

The TraMOOC project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644333. The ADAPT Centre for Digital Content Technology at Dublin City University is funded under the Science Foundation Ireland Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We would also like to thank Maja Popović for invaluable brainstorming.

Corresponding author

Correspondence to Sheila Castilho.

About this article

Cite this article

Castilho, S., Moorkens, J., Gaspari, F. et al. Evaluating MT for massive open online courses. Machine Translation 32, 255–278 (2018). https://doi.org/10.1007/s10590-018-9221-y

  • DOI: https://doi.org/10.1007/s10590-018-9221-y

Keywords

  • Neural MT
  • Statistical MT
  • Human MT evaluation
  • MOOCs