Machine Translation, Volume 30, Issue 3–4, pp 167–181

Combining off-the-shelf components to clean a translation memory

  • Friedel Wolff


We present a system to identify erroneous entries in a translation memory. It is a machine learning system that learns to classify entries according to either a strict or a permissive view on correctness. It is trained on features relating to segment length, translation quality checks, spelling and grammar errors, and additionally uses external data for detecting problems with fluency and lexical choice.
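To make the kind of input such a classifier sees more concrete, the following is a minimal sketch of per-entry feature extraction covering segment length and one simple translation quality check. The function name `tm_entry_features`, the feature names, and the example segments are illustrative assumptions, not the feature set used in the paper; spelling, grammar, fluency and lexical-choice features would need external tools and data and are left out.

```python
import re


def tm_entry_features(source: str, target: str) -> dict:
    """Compute a few simple, hypothetical features for one translation memory entry."""
    src_tokens = source.split()
    tgt_tokens = target.split()

    # Numbers usually survive translation unchanged, so a mismatch hints at an error.
    src_numbers = sorted(re.findall(r"\d+", source))
    tgt_numbers = sorted(re.findall(r"\d+", target))

    return {
        # Segment length features (characters and whitespace-separated tokens).
        "src_chars": len(source),
        "tgt_chars": len(target),
        "char_ratio": len(target) / max(len(source), 1),
        "token_ratio": len(tgt_tokens) / max(len(src_tokens), 1),
        # Simple quality checks of the kind a translation QA tool might run.
        "numbers_match": src_numbers == tgt_numbers,
        "target_empty": not target.strip(),
        "identical": source.strip() == target.strip(),
    }


# Example entry (made-up segments, for illustration only).
features = tm_entry_features("Save 3 files", "Stoor 3 lêers")
```

A dictionary like this could be converted to a numeric vector and passed to any standard classifier, trained once on labels reflecting the strict view of correctness and once on labels reflecting the permissive view.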


Keywords: Translation memory; Translation memory cleaning; Translation quality



This research was supported by the Academy of African Languages and Science Strategic Project of the University of South Africa. The author thanks the anonymous reviewers for valuable feedback.



Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. AALS, College of Graduate Studies, University of South Africa, Pretoria, South Africa
