Combining off-the-shelf components to clean a translation memory

Abstract

We present a machine learning system that identifies erroneous entries in a translation memory by learning to classify entries according to either a strict or a permissive view of correctness. It is trained on features derived from segment length, translation quality checks, and spelling and grammar errors, and it additionally uses external data to detect problems with fluency and lexical choice.
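
As a rough illustration of what features relating to segment length might look like, the sketch below computes character and word length ratios for one source/target pair. The names `LengthFeatures` and `length_features` are hypothetical and are not taken from the system described here; the abstract does not specify the exact features used, so this is only one plausible reading.

```python
from dataclasses import dataclass


@dataclass
class LengthFeatures:
    """Length-based features for one entry (hypothetical, for illustration)."""
    src_chars: int
    tgt_chars: int
    char_ratio: float
    word_ratio: float


def length_features(source: str, target: str) -> LengthFeatures:
    """Compute simple length features for a translation memory entry.

    Ratios are target/source. An empty source yields a ratio of 0.0 so
    that the function never divides by zero (an assumption of this sketch).
    """
    src_chars, tgt_chars = len(source), len(target)
    src_words, tgt_words = len(source.split()), len(target.split())
    char_ratio = tgt_chars / src_chars if src_chars else 0.0
    word_ratio = tgt_words / src_words if src_words else 0.0
    return LengthFeatures(src_chars, tgt_chars, char_ratio, word_ratio)


# Example entry: an English source segment with a French target segment.
example = length_features("Open the file", "Ouvrir le fichier")
print(example)
```

Features like these could be passed to a classifier alongside the other signals; an entry whose ratios fall far outside the typical range for a language pair may be worth a closer look.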

Notes

  1. For full details, see Barbu et al. (2016).

  2. Personal correspondence with Barbu.

  3. http://docs.translatehouse.org/projects/translate-toolkit/.

  4. https://github.com/hlt-mt/TMOP.

  5. http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/pofilter_tests.html.

  6. We quote the test names as they are used in the documentation of pofilter.

  7. https://hunspell.github.io/.

  8. http://www.abisource.com/projects/enchant/.

  9. https://languagetool.org/.

  10. From https://www.languagetool.org/languages/.

  11. http://kheafield.com/code/kenlm/.

  12. https://github.com/clab/fast_align.

  13. This is the only classifier for which the random seed can not make the runs fully reproducible. See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

References

  1. Barbu E (2015) Spotting false translation segments in translation memories. In: Proceedings of the Workshop Natural Language Processing for Translation Memories, Association for Computational Linguistics, Hissar, Bulgaria, pp 9–16, http://www.aclweb.org/anthology/W15-5202

  2. Barbu E, Parra Escartín C, Bentivogli L, Negri M, Turchi M, Federico M, Mastrostefano L, Orăsan C (2016) 1st shared task on automatic translation memory cleaning preparation and lessons learned. In: 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016), Portorož, Slovenia, LREC 2016, http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-NLP4TM_Proceedings.pdf

  3. Dougherty G (2013) Pattern Recognition and Classification: An Introduction. Springer. doi:10.1007/978-1-4614-5323-9

  4. Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 644–648, http://www.aclweb.org/anthology/N13-1073

  5. Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1):75–102, http://dl.acm.org/citation.cfm?id=972450.972455

  6. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3:1157–1182, http://www.jmlr.org/papers/v3/guyon03a.html

  7. Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp 690–696, http://kheafield.com/professional/edinburgh/estimate_paper.pdf

  8. Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Conference Proceedings: the tenth Machine Translation Summit, AAMT, Phuket, Thailand, pp 79–86, http://mt-archive.info/MTS-2005-Koehn.pdf

  9. Lagoudaki E (2006) Translation memories survey 2006: Users’ perceptions around TM use. In: Proceedings of the ASLIB International Conference Translating & the Computer, vol 28

  10. Miłkowski M (2010) Developing an open-source, rule-based proofreading tool. Software: Practice and Experience 40(7):543–566, doi:10.1002/spe.971

  11. O’Brien S (2007) Eye-tracking and translation memory matches. Perspectives: Studies in translatology 14(3):185–205

  12. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830

  13. Specia L, Paetzold G, Scarton C (2015) Multi-level translation quality prediction with QuEst++. In: Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp 115–120, http://www.aclweb.org/anthology/P15-4020

  14. Tiedemann J (2011) Bitext alignment. Synth Lect Hum Lang Technol. doi:10.2200/S00367ED1V01Y201106HLT014

  15. Zariņa I, Ņikiforovs P, Skadiņš R (2015) Word alignment based parallel corpora evaluation and cleaning using machine learning techniques. In: El-Kahlout ID, Özkan M, Sánchez-Martínez F, Ramírez-Sánchez G, Hollowood F, Way A (eds) Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, pp 185–192, http://aclweb.org/anthology/W15-4924

Acknowledgements

This research was supported by the Academy of African Languages and Science Strategic Project of the University of South Africa. The author thanks the anonymous reviewers for valuable feedback.

Author information

Corresponding author

Correspondence to Friedel Wolff.

About this article

Cite this article

Wolff, F. Combining off-the-shelf components to clean a translation memory. Machine Translation 30, 167–181 (2016). https://doi.org/10.1007/s10590-016-9186-7

Keywords

  • Translation memory
  • Translation memory cleaning
  • Translation quality