Abstract
Most current translation memory (TM) systems operate at the string level (character or word level) and lack semantic knowledge when matching. They use a simple edit distance (ED) calculated on the surface form or some variation of it (stem, lemma), which does not take any semantic aspects into consideration. This paper presents a novel and efficient approach to incorporating semantic information, in the form of paraphrasing (PP), into the ED metric. The approach computes ED while efficiently considering paraphrases, using dynamic programming and greedy approximation. In addition to using automatic evaluation metrics such as BLEU and METEOR, we have carried out an extensive human evaluation in which we measured post-editing time, keystrokes, HTER and HMETEOR, and conducted three rounds of subjective evaluations. Our results show that PP substantially improves TM matching and retrieval, and that translation performance increases when translators use paraphrase-enhanced TMs.
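To make the idea of a paraphrase-aware edit distance concrete, the following is a minimal sketch and not the algorithm evaluated in the paper: it uses plain dynamic programming only, with no greedy approximation. It assumes a small hand-written list of paraphrase pairs and treats any listed pair as a zero-cost match inside a word-level Levenshtein computation. The function name `paraphrase_edit_distance` and the example segments are hypothetical, chosen for illustration.

```python
def paraphrase_edit_distance(source, candidate, paraphrases):
    """Word-level edit distance where listed paraphrase pairs match at zero cost.

    `paraphrases` is an iterable of (phrase, phrase) string pairs; this
    representation is an assumption made for the sketch.
    """
    src, cand = source.split(), candidate.split()

    # Store each pair in both directions as token tuples; skip empty phrases.
    pairs = set()
    for a, b in paraphrases:
        ta, tb = tuple(a.split()), tuple(b.split())
        if ta and tb:
            pairs.add((ta, tb))
            pairs.add((tb, ta))

    n, m = len(src), len(cand)
    # dp[i][j] = cost of turning the first i source words into the first j candidate words.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == cand[j - 1] else 1
            best = min(dp[i - 1][j] + 1,        # deletion
                       dp[i][j - 1] + 1,        # insertion
                       dp[i - 1][j - 1] + sub)  # match or substitution
            # A paraphrase pair ending at (i, j) is consumed at no cost.
            for pa, pb in pairs:
                la, lb = len(pa), len(pb)
                if (la <= i and lb <= j
                        and tuple(src[i - la:i]) == pa
                        and tuple(cand[j - lb:j]) == pb):
                    best = min(best, dp[i - la][j - lb])
            dp[i][j] = best
    return dp[n][m]


a = "the period laid down in article 4"
b = "the period referred to in article 4"
print(paraphrase_edit_distance(a, b, []))                                   # 2
print(paraphrase_edit_distance(a, b, [("laid down in", "referred to in")]))  # 0
```

Without paraphrases the two segments differ by two word substitutions; once the phrase pair is supplied they are treated as an exact match. Looping over every paraphrase pair at every cell is the naive part of this sketch, and it is this cost that an efficient approach would need to avoid on a large paraphrase database.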
Notes
OmegaT is an open source TM available from http://www.omegat.org.
p < 0.05, one-tailed Welch’s t-test for PET and KS, \({\chi }^2\) test for SE2 and SE3. Because of the small sample size for SE3, no significance test was performed on an individual segment basis. The segments differ, so each one requires a different PET and KS; we therefore cannot apply the t-test to all 30 segments as a whole, because they represent 30 different tasks. We did, however, apply the chi-square test to the subjective evaluations.
For HMETEOR, higher is better and for HTER, lower is better.
Statistically significant, \({\chi }^2\) test, \(p<0.001\).
Statistically significant, \({\chi }^2\) test, \(p<0.001\).
In this section, “all evaluations” refers to the four evaluations, viz. PET, KS, SE2 and SE3.
Seg #9 was skipped by one of the translators, so we have 10 evaluators for this segment instead of 11 evaluators for all other segments.
Acknowledgments
The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007–2013) under REA Grant Agreement No. 317471 and from the EC-funded project QT21 under Horizon 2020, ICT 17, Grant Agreement No. 645452.
Cite this article
Gupta, R., Orăsan, C., Zampieri, M. et al. Improving translation memory matching and retrieval using paraphrases. Machine Translation 30, 19–40 (2016). https://doi.org/10.1007/s10590-016-9180-0
DOI: https://doi.org/10.1007/s10590-016-9180-0