Automatic translation memory cleaning

  • Matteo Negri
  • Duygu Ataman
  • Masoud Jalili Sabet
  • Marco Turchi
  • Marcello Federico

Abstract

We address the problem of automatically cleaning a translation memory (TM) by identifying problematic translation units (TUs). In this context, we treat as “problematic TUs” those containing useless translations from the point of view of the user of a computer-assisted translation tool. We approach TM cleaning both as a supervised and as an unsupervised learning problem. In both cases, we take advantage of Translation Memory open-source purifier, an open-source TM cleaning tool also presented in this paper. The two learning paradigms are evaluated on different benchmarks extracted from MyMemory, the world’s largest public TM. Our results indicate the effectiveness of the supervised approach in the ideal condition in which labelled training data is available, and the viability of the unsupervised solution for challenging situations in which training data is not accessible.

Keywords

  • Translation memories
  • Machine learning
  • Data cleaning

Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. Fondazione Bruno Kessler, Trento, Italy
  2. Fondazione Bruno Kessler, Università degli Studi di Trento, Trento, Italy
  3. School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
