Abstract
Terminology translation plays a critical role in domain-specific machine translation (MT). Phrase-based statistical MT (PB-SMT) has been the dominant approach to MT for the past 30 years, both in academia and industry. Neural MT (NMT), an end-to-end learning approach to MT, is steadily taking the place of PB-SMT. In this paper, we conduct a comparative qualitative evaluation and a comprehensive error analysis of terminology translation in PB-SMT and NMT in two translation directions: English-to-Hindi and Hindi-to-English. To the best of our knowledge, no gold standard is available for evaluating terminology translation quality in MT. We therefore select an evaluation test set from a legal-domain corpus and create a gold standard for evaluating terminology translation in MT. We also propose an error typology that covers terminology translation errors in MT. We translate the sentences of the test set with our MT systems, and the resulting terminology translations are manually classified according to the error typology. We evaluate the MT systems' performance on terminology translation and present our findings, revealing the strengths, weaknesses, and similarities of PB-SMT and NMT in term translation.
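The abstract outlines an evaluation workflow: a gold standard of source terms with their reference translations, MT output for the test set, and manual classification of term translations by error type. As a minimal illustration of the bookkeeping involved, the following Python sketch checks whether a reference term translation appears in each output sentence and tallies manually assigned error labels. It is not the authors' method: the `TermEntry` structure, the helper names, the exact-substring matching, the toy sentences, and the error labels are all hypothetical assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TermEntry:
    """One gold-standard term occurrence (hypothetical structure)."""
    sent_id: int            # index of the test-set sentence containing the term
    source_term: str        # term as it appears in the source sentence
    references: list[str]   # acceptable target-side translations (variants)


def term_matched(entry: TermEntry, hypothesis: str) -> bool:
    """Return True if any reference translation occurs in the MT output.

    Both sides are lowercased, mirroring lowercased experiments. Plain
    substring matching is a simplification: it ignores inflection and can
    also fire inside a longer word.
    """
    hyp = hypothesis.lower()
    return any(ref.lower() in hyp for ref in entry.references)


def term_accuracy(gold: list[TermEntry], outputs: dict[int, str]) -> float:
    """Fraction of gold term occurrences whose translation appears in the output."""
    if not gold:
        return 0.0
    hits = sum(term_matched(e, outputs.get(e.sent_id, "")) for e in gold)
    return hits / len(gold)


def tally_errors(labels: list[str]) -> Counter:
    """Count the error labels that annotators assigned to one system's terms."""
    return Counter(labels)


if __name__ == "__main__":
    # Toy Hindi-to-English data (romanized source terms); not from the paper.
    gold = [
        TermEntry(1, "uchch nyayalay", ["high court"]),
        TermEntry(2, "apil", ["appeal"]),
    ]
    outputs = {1: "the high court dismissed the petition", 2: "the plea was rejected"}
    print(term_accuracy(gold, outputs))  # 0.5
    # Placeholder labels, not the paper's error typology.
    print(tally_errors(["missing", "wrong_sense", "missing"]))
```

Substring matching only flags candidate terminology errors; the typology-based classification described in the abstract still requires human judgment, which is why `tally_errors` simply counts labels supplied by annotators.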
Notes
The MT research community views the WMT translation shared tasks (http://www.statmt.org/wmt19/) as the benchmark for evaluating automatic translation systems. In the WMT16 translation shared task (Bojar et al. 2016), we witnessed the rise of NMT, which surpassed the then-mainstream method (i.e. PB-SMT) in a number of translation tasks (e.g. Sennrich et al. 2016a). In the WMT18 translation shared task (Bojar et al. 2018), the majority of the submissions (33) were based on deep-learning approaches, and only three were PB-SMT systems.
International Workshop on Spoken Language Translation (http://workshop2015.iwslt.org/).
A field within geomorphology, specializing in the study of karst formations. https://en.wiktionary.org/wiki/karstology.
For the sake of clarity, we use Roman script instead of Devanagari for Hindi when showing translation examples. Note that the Hindi corpus itself was written in Devanagari.
"Hindi" is a language name whose first letter should be capitalized. However, we carried out our experiments on lowercased text, which is why this named entity is shown in lowercase.
In this example, the reference English sentence is the literal translation of the source Hindi sentence.
Halsbury is a location name whose first letter appears here in lowercase (cf. footnote 14).
References
Arčan M, Buitelaar P (2017) Translating domain-specific expressions in knowledge bases with neural machine translation. CoRR. arXiv:1709.02184
Arčan M, Turchi M, Tonelli S, Buitelaar P (2017) Leveraging bilingual terminology to improve machine translation in a CAT environment. Nat Lang Eng 23(5):763–788
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. CoRR. arXiv:1607.06450
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: 3rd International conference on learning representations (ICLR 2015), San Diego, CA
Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 257–267, Austin, TX
Beyer AM, Macketanz V, Burchardt A, Williams P (2017) Can out-of-the-box NMT beat a Domain-trained Moses on Technical Data? In: Proceedings of EAMT user studies and project/product descriptions, pp 41–46, Prague, Czech Republic
Bojar O, Diatka V, Rychlý P, Straňák P, Suchomel V, Tamchyna A, Zeman D (2014) HindEnCorp—Hindi-English and Hindi-only corpus for machine translation. In: Proceedings of the ninth international language resources and evaluation conference (LREC’14), pp 3550–3555, Reykjavik, Iceland
Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Neveol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K, Zampieri M (2016) Findings of the 2016 conference on machine translation. In: Proceedings of the first conference on machine translation, pp 131–198, Berlin, Germany
Bojar O, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Monz C (2018) Findings of the 2018 conference on machine translation (WMT18). In: Proceedings of the third conference on machine translation, vol. 2: shared task papers, pp 272–307. Association for Computational Linguistics, Brussels, Belgium
Burchardt A, Macketanz V, Dehdari J, Heigold G, Peter J-T, Williams P (2017) A linguistic evaluation of rule-based, phrase-based, and neural MT engines. Prague Bull Math Linguist 108(1):159–170
Castilho S, Moorkens J, Gaspari F, Sennrich R, Sosoni V, Georgakopoulou P, Lohar P, Way A, Barone AVM, Gialama M (2017) A comparative quality evaluation of PBSMT and NMT using professional translators. In: Proceedings of MT Summit XVI, the 16th machine translation summit, pp 116–131, Nagoya, Japan
Cettolo M, Niehues J, Stüker S, Bentivogli L, Cattoni R, Federico M (2015) The IWSLT 2015 evaluation campaign. In: Proceedings of the twelfth international workshop on spoken language translation (IWSLT 2015), Da Nang, Vietnam
Chatterjee R, Negri M, Turchi M, Federico M, Specia L, Blain F (2017) Guiding neural machine translation decoding with external knowledge. In: Proceedings of the second conference on machine translation, pp 157–168. Association for Computational Linguistics, Copenhagen, Denmark
Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: Proceedings of the 2012 conference of the North American Chapter of the Association for Computational Linguistics: human language technologies, pp 427–436, Montréal, Canada
Cho K, van Merriënboer B, Gülçehre Ç, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1724–1734, Doha, Qatar
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
Crego JM, Kim J, Klein G, Rebollo A, Yang K, Senellart J, Akhanov E, Brunelle P, Coquard A, Deng Y, Enoue S, Geiss C, Johanson J, Khalsa A, Khiari R, Ko B, Kobus C, Lorieux J, Martins L, Nguyen D, Priori A, Riccardi T, Segal N, Servan C, Tiquet C, Wang B, Yang J, Zhang D, Zhou J, Zoldan P (2016) Systran’s pure neural machine translation systems. CoRR. arXiv:1610.05540
Denkowski M, Lavie A (2011) Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of the sixth workshop on statistical machine translation, pp 85–91, Edinburgh, Scotland
Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 1045–1054, Portland, Oregon, USA
Farajian MA, Turchi M, Negri M, Bertoldi N, Federico M (2017) Neural vs. phrase-based machine translation in a multi-domain scenario. In: Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics: vol. 2, short papers, pp 280–284, Valencia, Spain
Farajian MA, Bertoldi N, Negri M, Turchi M, Federico M (2018) Evaluation of terminology translation in instance-based neural MT adaptation. In: Proceedings of the 21st annual conference of the European Association for Machine Translation, pp 149–158, Alicante, Spain
Gage P (1994) A new algorithm for data compression. C Users J 12(2):23–38
Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. CoRR. arXiv:1512.05287
Haque R, Penkale S, Way A (2014) Bilingual termbank creation via log-likelihood comparison and phrase-based statistical machine translation. In: Proceedings of the 4th international workshop on computational terminology (Computerm), pp 42–51, Dublin, Ireland
Haque R, Penkale S, Way A (2018) TermFinder: log-likelihood comparison and phrase-based statistical machine translation models for bilingual terminology extraction. Lang Resour Eval 52(2):365–400
Haque R, Hasanuzzaman M, Way A (2019a) Investigating terminology translation in statistical and neural machine translation: a case study on English-to-Hindi and Hindi-to-English. In: Proceedings of RANLP 2019: recent advances in natural language processing, pp 437–446, Varna, Bulgaria
Haque R, Hasanuzzaman M, Way A (2019b) TermEval: an automatic metric for evaluating terminology translation in MT. In: Proceedings of CICLing 2019, the 20th international conference on computational linguistics and intelligent text processing, La Rochelle, France
Haque R, Hasanuzzaman M, Way A (2019c) Terminology translation in low-resource scenarios. Information 10(9):273
Hasler E, de Gispert A, Iglesias G, Byrne B (2018) Neural machine translation decoding with terminology constraints. In: Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 2 (short papers), pp 506–512. Association for Computational Linguistics, New Orleans, LA
Hassan H, Aue A, Chen C, Chowdhary V, Clark J, Federmann C, Huang X, Junczys-Dowmunt M, Lewis W, Li M, Liu S, Liu T, Luo R, Menezes A, Qin T, Seide F, Tan X, Tian F, Wu L, Wu S, Xia Y, Zhang D, Zhang Z, Zhou M (2018) Achieving human parity on automatic Chinese to English news translation. CoRR. arXiv:1803.05567
He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. CoRR. arXiv:1512.03385
Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser–Ney language model estimation. In: Proceedings of the 51st annual meeting of the Association for Computational Linguistics (vol. 2: short papers), pp 690–696, Sofia, Bulgaria
Hokamp C, Liu Q (2017) Lexically constrained decoding for sequence generation using grid beam search. In: Proceedings of the 55th annual meeting of the Association for Computational Linguistics (vol. 1: long papers), pp 1535–1546, Vancouver, BC
Huang G, Zhang J, Zhou Y, Zong C (2016) A simple, straightforward and effective model for joint bilingual terms detection and word alignment in SMT. In: Natural language understanding and intelligent applications, ICCPOL/NLPCC 2016, vol 10102, pp 103–115
Huang L, Chiang D (2007) Forest rescoring: faster decoding with integrated language models. In: Proceedings of the 45th annual meeting of the Association for Computational Linguistics, pp 144–151, Prague, Czech Republic
Isabelle P, Cherry C, Foster GF (2017) A challenge set approach to evaluating machine translation. CoRR. arXiv:1704.07431
James F (2000) Modified Kneser-Ney smoothing of n-gram models. Tech. Rep. 00.07. Research Institute for Advanced Computer Science
Junczys-Dowmunt M, Dwojak T, Hoang H (2016) Is neural machine translation ready for deployment? A case study on 30 translation directions. CoRR. arXiv:1610.01108
Junczys-Dowmunt M, Grundkiewicz R, Dwojak T, Hoang H, Heafield K, Neckermann T, Seide F, Germann U, Fikri Aji A, Bogoychev N, Martins AFT, Birch A (2018) Marian: Fast neural machine translation in C++. In: Proceedings of ACL 2018, system demonstrations, pp 116–121. Association for Computational Linguistics, Melbourne, Australia
Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP), pp 1700–1709, Seattle, WA
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR. arXiv:1412.6980
Kinoshita S, Oshio T, Mitsuhashi T (2017) Comparison of SMT and NMT trained with large patent corpora: Japio at WAT2017. In: Proceedings of the 4th workshop on Asian translation (WAT2017), pp 140–145. Asian Federation of Natural Language Processing
Klubička F, Toral A, Sánchez-Cartagena VM (2017) Fine-grained human evaluation of neural versus phrase-based machine translation. CoRR. arXiv:1706.04389
Klubička F, Toral A, Sánchez-Cartagena VM (2018) Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian. CoRR. arXiv:1802.01451
Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Lin D, Wu D (eds) Proceedings of the 2004 conference on empirical methods in natural language processing (EMNLP), pp 388–395, Barcelona, Spain
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X: the tenth machine translation summit, pp 79–86, Phuket, Thailand
Koehn P, Knowles R (2017) Six challenges for neural machine translation. CoRR. arXiv:1706.03872
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: HLT-NAACL 2003: conference combining Human Language Technology conference series and the North American Chapter of the Association for Computational Linguistics conference series, pp 48–54, Edmonton, AB
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: ACL 2007, proceedings of the interactive poster and demonstration sessions, pp 177–180, Prague, Czech Republic
Kunchukuttan A, Mehta P, Bhattacharyya P (2017) The IIT Bombay English-Hindi parallel corpus. CoRR. arXiv:1710.02855
Lommel AR, Uszkoreit H, Burchardt A (2014) Multidimensional Quality Metrics (MQM): a framework for declaring and describing translation quality metrics. Tradumática: tecnologies de la traducció (12):455–463
Long Z, Utsuro T, Mitsuhashi T, Yamamoto M (2016) Translation of patent sentences with a large vocabulary of technical terms using neural machine translation. In: Proceedings of the 3rd workshop on Asian translation (WAT2016), pp 47–57, Osaka, Japan
Macketanz V, Avramidis E, Burchardt A, Helcl J, Srivastava A (2017) Machine translation: phrase-based, rule-based and neural approaches with linguistic evaluation. Cybern Inf Technol 17(2):28–43
Mitkov R (2002) Anaphora resolution. Longman, Harlow
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL-2002: 40th annual meeting of the Association for Computational Linguistics. ACL, Philadelphia, PA, pp 311–318
Pinnis M (2015) Dynamic terminology integration methods in statistical machine translation. In: Proceedings of the 18th annual conference of the European Association for Machine Translation (EAMT 2015), pp 89–96, Antalya, Turkey
Pinnis M, Ljubešić N, Ştefănescu D, Skadiņa I, Tadić M, Gornostay T (2012) Term extraction, tagging, and mapping tools for under-resourced languages. In: Proceedings of the 10th conference on terminology and knowledge engineering (TKE 2012), pp 193–208, Madrid, Spain
Popović M (2017) Comparing language related issues for NMT and PBMT between German and English. Prague Bull Math Linguist 108(1):209–220
Popović M, Ney H (2011) Towards automatic error analysis of machine translation output. Comput Linguist 37(4):657–688
Post M, Vilar D (2018) Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In: Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1 (long papers), pp 1314–1324, New Orleans, LA
Press O, Wolf L (2016) Using the output embedding to improve language models. CoRR. arXiv:1608.05859
Rigouts Terryn A, Hoste V, Lefever E (2019) In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora. Lang Resour Eval 54:385–418
Sennrich R, Haddow B, Birch A (2015) Improving neural machine translation models with monolingual data. CoRR. arXiv:1511.06709
Sennrich R, Haddow B, Birch A (2016a) Edinburgh neural machine translation systems for WMT 16. In: Proceedings of the first conference on machine translation, pp 371–376, Berlin, Germany
Sennrich R, Haddow B, Birch A (2016b) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the Association for Computational Linguistics (volume 1: long papers), pp 1715–1725, Berlin, Germany
Shterionov D, Nagle P, Casanellas L, Superbo R, O’Dowd T (2017) Empirical evaluation of NMT and PBSMT quality for large-scale translation production. In: User track of the 20th annual conference of the European Association for Machine Translation (EAMT), pp 74–79, Prague, Czech Republic
Skadiņš R, Puriņš M, Skadiņa I, Vasiļjevs A (2011) Evaluation of SMT in localization to under-resourced inflected language. In: Proceedings of the 15th international conference of the European Association for Machine Translation (EAMT 2011), pp 35–40, Leuven, Belgium
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th biennial conference of the Association for Machine Translation in the Americas (AMTA-2006), pp 223–231, Cambridge, MA
Specia L, Harris K, Blain F, Burchardt A, Macketanz V, Skadiņa I, Negri M, Turchi M (2017) Translation quality and productivity: a study on rich morphology languages. In: Proceedings of MT summit XVI, the 16th machine translation summit, pp 55–71, Nagoya, Japan
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of the 27th international conference on neural information processing systems, NIPS’14, pp 3104–3112, Montreal, Canada
Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’2012), pp 2214–2218, Istanbul, Turkey
Toral A, Sánchez-Cartagena VM (2017) A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions. CoRR. arXiv:1701.02901
Toral A, Way A (2018) What level of quality can neural machine translation attain on literary text? In: Translation quality assessment. Springer, Cham, pp 263–287
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. CoRR. arXiv:1706.03762
Vaswani A, Zhao Y, Fossum V, Chiang D (2013) Decoding with large-scale neural language models improves translation. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1387–1392, Seattle, Washington, USA
Vintar Š (2018) Terminology translation accuracy in statistical versus neural MT: an evaluation for the English–Slovene language pair. In: Du J, Arčan M, Liu Q, Isahara H (eds) Proceedings of the LREC 2018 workshop MLP–MomenT: the second workshop on multi-language processing in a globalising world and the first workshop on multilingualism at the intersection of knowledge bases and machine translation, pp 34–37, Miyazaki, Japan. European Language Resources Association (ELRA), Paris
Way A (2018) Quality expectations of machine translation. In: Translation quality assessment: from principles to practice. Springer, Cham
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR. arXiv:1609.08144
Yeh A (2000) More accurate tests for the statistical significance of result differences. In: Proceedings of the 18th conference on computational linguistics, vol 2, COLING 2000, pp 947–953, Saarbrücken, Germany
Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The United Nations parallel corpus v1.0. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), pp 3530–3534, Portorož, Slovenia
Acknowledgements
The ADAPT Centre for Digital Content Technology is funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106) and is co-funded under the European Regional Development Fund. This project has partially received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713567, and the publication has emanated from research supported in part by a research grant from SFI under Grant Number 13/RC/2077.
About this article
Cite this article
Haque, R., Hasanuzzaman, M. & Way, A. Analysing terminology translation errors in statistical and neural machine translation. Machine Translation 34, 149–195 (2020). https://doi.org/10.1007/s10590-020-09251-z
DOI: https://doi.org/10.1007/s10590-020-09251-z