Translation project adaptation for MT-enhanced computer assisted translation

Abstract

The effective integration of MT technology into computer-assisted translation tools is a challenging topic for both academic research and the translation industry. In particular, professional translators consider it crucial that MT systems adapt to the feedback they provide. In this paper, we propose an adaptation scheme that tunes a statistical MT system to a translation project using small amounts of post-edited text, such as that produced by a single user in as little as one day of work. The same scheme can be applied on a larger scale to focus general-purpose models on the specific domain of interest. We assess our method on two domains, information technology and legal, and on four translation directions, from English into French, Italian, Spanish and German. The main outcome is that our adaptation strategy can be very effective, provided that the seed data used for adaptation is ‘close enough’ to the remaining text to be translated; otherwise, MT quality neither improves nor worsens, which shows the robustness of our method.

Notes

  1. In computer-assisted translation (CAT), translators work with special text editors, simply called CAT tools, which integrate several translation aids, such as translation memories, terminology dictionaries, spell checkers, concordancers and, more recently, MT engines.

  2. http://www.matecat.com.

  3. http://www.caitra.org.

  4. The exponential function is applied to binary features to neutralize the log function that is applied to all features participating in the log-linear model.

  5. Available from http://www.statmt.org/wmt13/translation-task.html.

  6. 2013/488/EU: “Council Decision of 23 September 2013 on the security rules for protecting EU classified information”.

  7. http://eur-lex.europa.eu/.

  8. A report by the European Parliament, not included in the training data, containing a proposal for financial regulations in the European Union, available at: http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+REPORT+A7-2013-0039+0+DOC+XML+V0//EN.
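
The point made in note 4 can be illustrated with a minimal sketch. It assumes a log-linear model that scores an entry as the weighted sum of the logarithms of its stored feature values; the function names, feature values and weights below are illustrative assumptions, not taken from the paper. Under that assumption, storing a binary indicator b as exp(b) makes its contribution to the score equal to its weight times b:

```python
import math

def log_linear_score(stored_values, weights):
    """Weighted sum of log feature values (assumed form of the model)."""
    return sum(w * math.log(v) for v, w in zip(stored_values, weights))

def encode_binary(flag):
    """Store a binary indicator as exp(flag): e when set, 1.0 when unset."""
    return math.exp(1.0 if flag else 0.0)

# Hypothetical entry: two probability features plus one binary indicator.
probs = [0.25, 0.5]
weights = [0.3, 0.2, 0.7]  # the last weight belongs to the binary feature

score_on = log_linear_score(probs + [encode_binary(True)], weights)
score_off = log_linear_score(probs + [encode_binary(False)], weights)

# The log cancels the exp, so setting the indicator shifts the score
# by the weight of the binary feature (up to floating-point rounding).
print(score_on - score_off)
```

Without the exponential, an unset indicator stored as 0 would have an undefined logarithm, and a set indicator stored as 1 would contribute nothing to the score.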

Acknowledgments

This work was supported by the MateCAT project, which is funded by the EC under the 7th Framework Programme.

Author information

Corresponding author

Correspondence to Mauro Cettolo.

Cite this article

Cettolo, M., Bertoldi, N., Federico, M. et al. Translation project adaptation for MT-enhanced computer assisted translation. Machine Translation 28, 127–150 (2014). https://doi.org/10.1007/s10590-014-9152-1

Keywords

  • Statistical machine translation
  • Self-tuning MT
  • Domain adaptation
  • Project adaptation
  • Computer-assisted translation