Abstract
This paper reports on the organization and results of the first Automatic Translation Memory Cleaning Shared Task. This shared task is aimed at finding automatic ways of cleaning translation memories (TMs) that have not been properly curated and thus include incorrect translations. As a follow-up to the shared task, we also conducted two surveys, one targeting the teams participating in the shared task, and the other targeting professional translators. While the researcher-oriented survey aimed to gather the participants' opinions on the shared task, the translator-oriented survey aimed to better understand what constitutes a good TM unit and to inform decisions that will be taken in future editions of the task. In this paper, we report on the process of data preparation and the evaluation of the automatic systems submitted, as well as on the results of the collected surveys.
Notes
Most CAT tools, such as SDL Trados Studio (http://www.sdl.com/cxc/language/translation-productivity/trados-studio/) and memoQ (https://www.memoq.com/), also include dedicated QA modules.
Although in some cases MT may have been used to produce translations, translators have to verify that such translations are correct before they are stored as new TUs in a TM.
In the absence of sufficient context, any translation which had some context in which it would be adequate was accepted.
Unfortunately, to the best of our knowledge, as of September 2016 only one of the participating teams (the JUMT team) has released its system. The FBK system was trained using the open-source TM cleaner TMop (Jalili Sabet et al. 2016). All systems are described in the working notes available at http://rgcl.wlv.ac.uk/nlp4tm2016/working-notes-on-cleaning-of-translation-memories-shared-task/.
In some cases, when using the web interface, translators assigned the wrong language codes to the segments, e.g. English segments were labeled as “it” and Italian segments as “en”. Although out of the scope of this paper, it would be interesting to investigate how the errors coming from each of these three sources differ.
See http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html for a benchmark.
Two segments are different if the segments as character strings are different after space normalization.
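This comparison can be sketched as a small check; `normalize_spaces` is a hypothetical helper standing in for whatever space normalization the pipeline applies:

```python
import re

def normalize_spaces(segment: str) -> str:
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", segment).strip()

def segments_differ(a: str, b: str) -> bool:
    # Two segments count as different iff their space-normalized
    # character strings differ.
    return normalize_spaces(a) != normalize_spaces(b)
```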
The annotation guidelines are available at: http://rgcl.wlv.ac.uk/nlp4tm2016/shared-task/.
The inter-annotator agreement results and a more detailed report of how the data was prepared can be found in Barbu et al. (2016).
Note, of course, that this may not reflect real-world situations, where the translation utility of TUs may need to be weighted. However, we believe opting for the “equal weights” scenario is justified for two reasons: (i) we wanted to avoid using several different evaluation metrics, and assigning equal weights guarantees a fair evaluation that does not reward a system for simply labelling every test item as “good”; and (ii) since the proportions of positive and negative instances differ across the test sets for each language pair, we decided to avoid data set-specific evaluation metrics biased towards the characteristics of a particular test set. This strategy has the added advantage of yielding results that are as comparable as possible across the different language-pair settings.
The script that computes the baselines can be downloaded from http://rgcl.wlv.ac.uk/resources/NLP4TM2016/baselines.py.
The random forest classifier, like the extremely randomized trees classifier, is an ensemble learning method that reduces overfitting by combining the outputs of multiple decision trees into a single class label. The two algorithms differ slightly in how splits are chosen: deterministically (optimising a splitting criterion) in the case of random forests, and at random in the case of extremely randomized trees.
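As a minimal sketch of this distinction, using scikit-learn (cited below); the synthetic data and hyperparameters are purely illustrative and not those of any participating system:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for TU feature vectors.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Both ensembles aggregate many decision trees into one label; they differ
# in how split thresholds are picked (optimised vs. drawn at random).
results = {}
for name, clf in [
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("extra trees", ExtraTreesClassifier(n_estimators=100, random_state=0)),
]:
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {results[name]:.3f}")
```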
See Footnote 18.
For consistency between the binary tasks and the fine-grained task, the Averaged \(F_1\) score for the fine-grained task corresponds to the macro-averaged \(F_1\) score. The macro average computes the average precision, recall or \(F_1\) score over the classes, whereas the micro average first sums the true positives, true negatives, false positives and false negatives across classes and only then computes the precision, recall or \(F_1\) score. Due to space restrictions we cannot present all the measures computed to evaluate the performance of each system. For a more detailed presentation, the interested reader can consult the overview summary available at http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/results-1st-shared_task.
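The macro/micro distinction can be made concrete with a short dependency-free sketch; the labels and predictions below are invented for illustration:

```python
def per_class_f1(y_true, y_pred, label):
    # Precision, recall and F1 for one class, treated one-vs-rest.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    # Average the per-class F1 scores, giving every class equal weight.
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    # Pool TP/FP/FN over all classes first; for single-label multi-class
    # data this reduces to plain accuracy.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, on `y_true = [1, 1, 2, 3]` and `y_pred = [1, 2, 2, 3]`, macro-\(F_1\) averages the per-class scores (7/9 ≈ 0.78), while micro-\(F_1\) equals the accuracy, 0.75.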
References
Ataman D, Jalili Sabet M, Turchi M, Negri M (2016) FBK HLT-MT participation in the 1st translation memory cleaning shared task. Working notes on cleaning of translation memories shared task. http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/fbkhltmt-workingnote
Barbu E (2015) Spotting false translation segments in translation memories. In: Proceedings of the workshop on natural language processing for translation memories, Hissar, Bulgaria, pp 9–16
Barbu E, Parra Escartín C, Bentivogli L, Negri M, Turchi M, Federico M, Mastrostefano L, Orasan C (2016) 1st shared task on automatic translation memory cleaning. In: Proceedings of the 2nd workshop on natural language processing for translation memories (NLP4TM 2016), Portorož, Slovenia, pp 1–5
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: COLING 2004: proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, pp 315–321
Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation, Berlin, Germany, pp 554–563
Buck C, Koehn P (2016) UEdin participation in the 1st translation memory cleaning shared task. Working notes on cleaning of translation memories shared task. http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/ChristianBuck-TM_Cleaning_Shared_Task
Burchardt A, Lommel A (2014) Practical guidelines for the use of MQM in scientific research on translation quality. Technical report, DFKI, Berlin, Germany
Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 10–51
Camargo de Souza JG, Esplà-Gomis M, Turchi M, Negri M (2013) Exploiting qualitative information from automatic word alignment for cross-lingual NLP tasks. In: Proceedings of the 51st annual meeting of the association for computational linguistics (volume 2: short papers), Sofia, Bulgaria, pp 771–776
Esplà-Gomis M, Forcada ML (2009) Bitextor, a free/open-source software to harvest translation memories from multilingual websites. In: Proceedings of the MT summit XII—workshop: beyond translation memories: new tools for translators, Ottawa, ON, Canada
Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Comput Linguist 19(1):76–102
Gandrabur S, Foster G (2003) Confidence estimation for translation prediction. In: Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, Edmonton, Canada, pp 95–102
Girardi C, Bentivogli L, Farajian MA, Federico M (2014) MT-equal: a toolkit for human assessment of machine translation output. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: system demonstrations, Dublin, Ireland, pp 120–123
Jalili Sabet M, Negri M, Turchi M, Camargo de Souza JG, Federico M (2016) TMop: a tool for unsupervised translation memory cleaning. In: Proceedings of ACL-2016 system demonstrations, Berlin, Germany, pp 49–54
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT summit X, conference proceedings: the tenth machine translation summit, Phuket, Thailand, pp 79–86
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the association for computational linguistics, companion volume proceedings of the demo and poster sessions, Prague, Czech Republic, pp 177–180
Kuncheva LI (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience, Hoboken
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
Lommel A (2015) Multidimensional quality metrics (MQM) definition. Technical report, DFKI, Berlin, Germany
Mandorino V (2016) The Lingua Custodia participation in the NLP4TM2016 TM cleaning shared task. Working notes on cleaning of translation memories shared task. http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/description_LinguaCustodia
McNemar Q (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2):153–157
Nahata N, Nayak T, Pal S, Naskar S (2016) Rule based classifier for translation memory cleaning. Working notes on cleaning of translation memories shared task. http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/Working_Note-JUMTTeam
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL-2002: 40th annual meeting of the association for computational linguistics, Philadelphia, pp 311–318
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Petrov S, Das D, McDonald R (2012) A universal part-of-speech tagset. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, pp 2089–2096
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of international conference on new methods in language processing, Manchester, UK
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, volume 1: long papers, Berlin, Germany, pp 1715–1725
Søgaard A, Agić V, Martínez Alonso H, Plank B, Bohnet B, Johannsen A (2015) Inverted indexing for cross-lingual NLP. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: long papers), Beijing, China, pp 1713–1722
Specia L, Turchi M, Cancedda N, Dymetman M, Cristianini N (2009) Estimating the sentence-level quality of machine translation systems. In: EAMT-2009: proceedings of the 13th annual conference of the European association for machine translation, Barcelona, Spain, pp 28–35
Steinberger R, Eisele A, Klocek S, Pilos S, Schlüter P (2012) DGT-TM: a freely available translation memory in 22 languages. In: Proceedings of the eighth international conference on language resources and evaluation (LREC’12), Istanbul, Turkey, pp 454–459
Tiedemann J (2011) Bitext alignment. In: Hirst G (ed) Synthesis lectures on human language technologies. Morgan & Claypool, San Rafael
Trombetti M (2009) Creating the world's largest translation memory. In: MT summit XII: proceedings of the twelfth machine translation summit, Ottawa, ON, Canada, pp 9–16
Wolff F (2016) Unisa system submission at NLP4TM 2016. Working notes on cleaning of translation memories shared task. http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/UNISA
Zwahlen A, Carnal O, Läubli S (2016) Automatic TM cleaning through MT and POS tagging: Autodesk's submission to the NLP4TM 2016 shared task. Working notes on cleaning of translation memories shared task. http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/nlp4tm-adsk
Acknowledgements
The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA Grant Agreement No. 317471. Part of the work has been supported by the EC-funded project ModernMT (H2020 Grant Agreement No. 645487). We are grateful to Translated for giving us access to the MyMemory database. We would also like to thank the anonymous reviewers for their valuable feedback, which helped improve this paper, and for the ideas for future work they have provided us with. Last but not least, we want to thank the six annotators who annotated the data used in the shared task.
Cite this article
Barbu, E., Parra Escartín, C., Bentivogli, L. et al. The first Automatic Translation Memory Cleaning Shared Task. Machine Translation 30, 145–166 (2016). https://doi.org/10.1007/s10590-016-9183-x