Machine Translation, Volume 32, Issue 3, pp 255–278

Evaluating MT for massive open online courses

A multifaceted comparison between PBSMT and NMT systems
  • Sheila Castilho
  • Joss Moorkens
  • Federico Gaspari
  • Rico Sennrich
  • Andy Way
  • Panayota Georgakopoulou

Abstract

This article reports a multifaceted comparison between statistical and neural machine translation (MT) systems that were developed for translation of data from massive open online courses (MOOCs). The study uses four language pairs: English to German, Greek, Portuguese, and Russian. Translation quality is evaluated using automatic metrics and human evaluation, carried out by professional translators. Results show that neural MT is preferred in side-by-side ranking, and is found to contain fewer overall errors. Results are less clear-cut for some error categories, and for temporal and technical post-editing effort. In addition, results are reported based on sentence length, showing advantages and disadvantages depending on the particular language pair and MT paradigm.

Keywords

Neural MT; Statistical MT; Human MT evaluation; MOOCs

Notes

Acknowledgements

The TraMOOC project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 644333. The ADAPT Centre for Digital Content Technology at Dublin City University is funded under the Science Foundation Ireland Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We would also like to thank Maja Popović for invaluable brainstorming.

Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. ADAPT Centre, Dublin City University, Dublin, Ireland
  2. University of Edinburgh, Edinburgh, Scotland
  3. Deluxe Media Europe, Athens, Greece
  4. ADAPT Centre, School of Applied Language and Intercultural Studies, Dublin City University, Dublin, Ireland