
Improving translation memory matching and retrieval using paraphrases

Machine Translation

Abstract

Most current translation memory (TM) systems work at the string level (character or word level) and lack semantic knowledge while matching. They use simple edit-distance (ED) calculated on the surface form or some variation of it (stem, lemma), which does not take any semantic aspects into consideration in matching. This paper presents a novel and efficient approach to incorporating semantic information in the form of paraphrasing (PP) in the ED metric. The approach computes ED while efficiently considering paraphrases using dynamic programming and greedy approximation. In addition to using automatic evaluation metrics such as BLEU and METEOR, we carried out an extensive human evaluation in which we measured post-editing time, keystrokes, HTER and HMETEOR, and conducted three rounds of subjective evaluations. Our results show that PP substantially improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
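To make the idea of a paraphrase-aware edit distance concrete, the following is a minimal sketch in Python. It is not the authors' implementation: it uses plain dynamic programming only (no greedy approximation), and it assumes that a paraphrase pair may be aligned at zero cost. The function names `paraphrase_edit_distance` and `fuzzy_score`, the normalisation by the longer segment, and the example paraphrase pair are all hypothetical choices made for illustration.

```python
def paraphrase_edit_distance(query, candidate, paraphrases=()):
    """Word-level edit distance in which known paraphrase spans match at cost 0.

    `paraphrases` is an iterable of (phrase, phrase) string pairs. Treating a
    paraphrase match as free is an assumption of this sketch.
    """
    a, b = query.split(), candidate.split()

    # Store every pair in both directions as token tuples.
    pairs = set()
    for x, y in paraphrases:
        tx, ty = tuple(x.split()), tuple(y.split())
        pairs.add((tx, ty))
        pairs.add((ty, tx))

    # d[i][j] = cost of turning the first i query tokens into the first j
    # candidate tokens.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Standard Levenshtein moves: deletion, insertion, substitution.
            best = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
            )
            # Extra move: both segments end in the two sides of a paraphrase.
            for p, q in pairs:
                if (i >= len(p) and j >= len(q)
                        and tuple(a[i - len(p):i]) == p
                        and tuple(b[j - len(q):j]) == q):
                    best = min(best, d[i - len(p)][j - len(q)])
            d[i][j] = best
    return d[len(a)][len(b)]


def fuzzy_score(query, candidate, paraphrases=()):
    """Similarity in [0, 1]; normalising by the longer segment is an assumption."""
    longest = max(len(query.split()), len(candidate.split()), 1)
    return 1.0 - paraphrase_edit_distance(query, candidate, paraphrases) / longest


if __name__ == "__main__":
    q = "the period laid down in article 4 ( 3 )"
    c = "the period referred to in article 4 ( 3 )"
    pp = [("laid down in", "referred to in")]  # hypothetical paraphrase entry
    print(paraphrase_edit_distance(q, c))      # 2
    print(paraphrase_edit_distance(q, c, pp))  # 0
```

In this toy example the two segments differ by two word substitutions under plain ED, but they become an exact match once the paraphrase pair is taken into account, which is the kind of change in retrieval ranking that the abstract describes.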


Notes

  1. OmegaT is an open source TM available from http://www.omegat.org.

  2. https://github.com/rohitguptacs/TMAdvanced.

  3. p < 0.05, one-tailed Welch’s t-test for PET and KS, \({\chi }^2\) test for SE2 and SE3. Because of the small sample size for SE3, no significance test was performed on an individual segment basis. The segments differ, so each requires a different PET and KS; because the 30 segments represent 30 different tasks, we cannot apply the t-test to them as a whole. However, we applied the chi-square test to the subjective evaluations.

  4. For HMETEOR, higher is better and for HTER, lower is better.

  5. Statistically significant, \({\chi }^2\) test, \(p<0.001\).

  6. Statistically significant, \({\chi }^2\) test, \(p<0.001\).

  7. In this section, “all evaluations” refers to all four evaluations, viz. PET, KS, SE2 and SE3.

  8. Seg #9 was skipped by one of the translators, so we have 10 evaluators for this segment instead of 11 evaluators for all other segments.

References

  • Aziz W, de Sousa SCM, Specia L (2012) PET: a tool for post-editing and assessing machine translation. In: Proceedings of the eighth international conference on language resources and evaluation (LREC 2012). Istanbul, Turkey, pp. 3982–3987

  • Clark JP (2002) System, method, and product for dynamically aligning translations in a translation-memory system. US Patent 6,345,244

  • Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation. Baltimore, MD, pp. 376–380

  • de Sousa SCM, Aziz W, Specia L (2011) Assessing the post-editing effort for automatic and semi-automatic translations of DVD subtitles. In: Proceedings of recent advances in natural language processing. Hissar, Bulgaria, pp. 97–103

  • Du J, Jiang J, Way A (2010) Facilitating translation using source language paraphrase lattices. In: Proceedings of the 2010 conference on empirical methods in natural language processing. Cambridge, MA, pp. 420–429

  • Ganitkevitch J, Van Durme B, Callison-Burch C (2013) PPDB: the paraphrase database. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies. Atlanta, GA, pp. 758–764

  • Gupta R, Orăsan C (2014) Incorporating paraphrasing in translation memory matching and retrieval. In: Proceedings of the seventeenth annual conference of the European Association for Machine Translation (EAMT2014). Dubrovnik, Croatia, pp. 3–10

  • Gupta R, Orăsan C, Zampieri M, Vela M, van Genabith J (2015) Can translation memories afford not to use paraphrasing? In: Proceedings of the 18th annual conference of the European Association for Machine Translation (EAMT). Antalya, pp. 35–42

  • Hodász G, Pohl G (2005) MetaMorpho TM: a linguistically enriched translation memory. Workshop on modern approaches in translation technologies. Borovets, pp. 26–30

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Conference proceedings: the tenth machine translation summit (MT Summit X). Phuket, pp. 79–86

  • Koponen M, Aziz W, Ramos L, Specia L (2012) Post-editing time as a measure of cognitive effort. In: Proceedings of the AMTA 2012 workshop on post-editing technology and practice (WPTP 2012). San Diego, CA, pp 11–20

  • Langlais P, Lapalme G (2002) TransType: development-evaluation cycles to boost translator’s productivity. Mach Transl 17(2):77–98

  • Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys Doklady 10:707–710


  • Macklovitch E, Russell G (2000) What’s been forgotten in translation memory. Envisioning machine translation in the information future: 4th conference of the Association for Machine Translation in the Americas, AMTA 2000. Cuernavaca, Mexico, pp. 137–146

  • Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41

  • Onishi T, Utiyama M, Sumita E (2010) Paraphrase lattice for statistical machine translation. In: Proceedings of the ACL 2010 conference short papers. Uppsala, Sweden, pp. 1–5

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics. Philadelphia, PA, pp. 311–318

  • Pekar V, Mitkov R (2006) New generation translation memory: content-sensitive matching. In: Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters. Berne, Switzerland

  • Petrov S, Barrett L, Thibaux R, Klein D (2006) Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics. Sydney, Australia, pp. 433–440

  • Planas E, Furuse O (1999) Formalizing translation memories. In: Proceedings of MT summit VII MT in the great translation era. Singapore, pp. 331–339

  • Simard M, Fujita A (2012) A poor man’s translation memory using machine translation evaluation metrics. In: Proceedings of the tenth conference of the association for machine translation in the Americas. San Diego, CA

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006: proceedings of the 7th conference of the association for machine translation in the Americas, visions for the future of machine translation. Cambridge, MA, pp. 223–231

  • Somers H (2003) Translation memory systems. In: Somers H (ed) Computers and translation: a translator’s guide. John Benjamins Publishing Company, Amsterdam, pp 31–48


  • Steinberger R, Eisele A, Klocek S, Pilos S, Schlüter P (2012) DGT-TM: a freely available translation memory in 22 languages. In: Proceedings of the 8th international conference on language resources and evaluation (LREC’2012). Istanbul, Turkey, pp. 454–459

  • Timonera K, Mitkov R (2015) Improving translation memory matching through clause splitting. In: Proceedings of the workshop on natural language processing for translation memories (NLP4TM). Hissar, Bulgaria, pp. 17–23

  • Utiyama M, Neubig G, Onishi T, Sumita E (2011) Searching translation memories for paraphrases. In: Proceedings of the 13th machine translation summit. Xiamen, China, pp. 325–331

  • Vela M, Neumann S, Hansen-Schirra S (2007) Querying multi-layer annotation and alignment in translation corpora. In: Proceedings of the Corpus linguistics conference CL2007. Birmingham

  • Whyman EK, Somers HL (1999) Evaluation metrics for a translation memory system. Softw-Pract Exp 29(14):1265–1284


  • Zampieri M, Vela M (2014) Quantifying the influence of MT output in the translators’ performance: a case study in technical translation. Workshop on humans and computer-assisted translation (HaCaT 2014). Gothenburg, Sweden, pp. 93–98


Acknowledgments

The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007–2013 under REA Grant Agreement No. 317471 and from the EC-funded project QT21 under Horizon 2020, ICT 17, Grant Agreement No. 645452.

Author information

Corresponding author

Correspondence to Rohit Gupta.

About this article


Cite this article

Gupta, R., Orăsan, C., Zampieri, M. et al. Improving translation memory matching and retrieval using paraphrases. Machine Translation 30, 19–40 (2016). https://doi.org/10.1007/s10590-016-9180-0


  • DOI: https://doi.org/10.1007/s10590-016-9180-0
