Improving retrieval performance of translation memories using morphosyntactic analyses and generalized suffix arrays

Weitz, Melanie

doi:10.1007/s10590-017-9193-3

Published: 04 May 2017

Volume 31, pages 117–146, (2017)
Machine Translation

Melanie Weitz ORCID: orcid.org/0000-0003-3701-9816¹

Abstract

Since the 1990s, translation memory (TM) systems have been among the most widely used tools in computer-aided translation. However, most commercially available systems still consider two segments similar only if their character sequences are identical or differ marginally in spelling. Linguistic similarities are disregarded, so semantically identical or similar segments with different (morpho)syntactic structures are retrieved with a lower similarity value than expected, or not at all. The iMem (Intelligent Translation Memories) research project aimed to improve the retrieval of precisely these segments by analyzing their (morpho)syntactic structure and identifying the longest common substrings (LCS) between two sentences by means of generalized suffix arrays. The results of the morphosyntactic analysis are stored in the so-called iMem-TM, an independent relational database that is connected to a commercial, non-linguistically enhanced TM via its API. Base words were used to build the suffixes in order to increase the probability of finding a larger number of LCS between the two sentences. Furthermore, an existing algorithm for generalized suffix arrays was extended with an additional array that records which sentence each suffix derives from. In this way, identical repeating LCS within the same sentence are ignored, whereas identical repeating LCS between two different sentences are still considered. If more than one identical repeating LCS exists between the sentences, the best-matching LCS is determined by calculating the positional differences for the identical repeating LCS and choosing the one with the minimal positional difference.
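
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the iMem implementation: the function names (`build_gsa`, `best_common_substring`), the naive sort-based construction, and the tie-breaking rule are assumptions made for illustration only. The input is assumed to be already reduced to base words (no morphological analyzer is called here), suffixes are built over word tokens, and an `owners` array plays the role of the additional array that records which sentence each suffix derives from.

```python
from typing import List, Optional, Tuple


def common_prefix_len(x: List[str], y: List[str]) -> int:
    """Number of leading tokens shared by two token lists."""
    n = 0
    for s, t in zip(x, y):
        if s != t:
            break
        n += 1
    return n


def build_gsa(a: List[str], b: List[str]):
    """Naively build a generalized suffix array over two token lists.

    Returns three parallel lists: the sorted suffixes, the sentence each
    suffix derives from (0 for a, 1 for b), and its start position.
    """
    entries = [
        (tokens[i:], owner, i)
        for owner, tokens in enumerate((a, b))
        for i in range(len(tokens))
    ]
    entries.sort(key=lambda e: e[0])  # lexicographic order on token lists
    suffixes = [e[0] for e in entries]
    owners = [e[1] for e in entries]
    starts = [e[2] for e in entries]
    return suffixes, owners, starts


def best_common_substring(
    a: List[str], b: List[str]
) -> Optional[Tuple[List[str], int, int]]:
    """Longest common token sequence between a and b.

    Pairs of suffixes from the same sentence are skipped. Among equally
    long candidates, the one with the smallest positional difference
    |pos_a - pos_b| is chosen (an assumed tie-breaking rule).
    """
    suffixes, owners, starts = build_gsa(a, b)
    best = None  # (-length, positional difference, pos_a, pos_b)
    for k in range(len(suffixes)):
        for m in range(k + 1, len(suffixes)):
            n = common_prefix_len(suffixes[k], suffixes[m])
            if n == 0:
                break  # sorted order: later suffixes share nothing either
            if owners[k] == owners[m]:
                continue  # repetition inside one sentence is ignored
            if owners[k] == 0:
                pos_a, pos_b = starts[k], starts[m]
            else:
                pos_a, pos_b = starts[m], starts[k]
            cand = (-n, abs(pos_a - pos_b), pos_a, pos_b)
            if best is None or cand < best:
                best = cand
    if best is None:
        return None
    length, pos_a, pos_b = -best[0], best[2], best[3]
    return a[pos_a:pos_a + length], pos_a, pos_b


sentence_a = "press the button and hold the button".split()
sentence_b = "release the button".split()
print(best_common_substring(sentence_a, sentence_b))
```

In this example "the button" occurs twice in the first sentence (positions 1 and 5) and once in the second (position 1). The pairing inside the first sentence is skipped because both suffixes have the same owner, and of the two cross-sentence pairings the one with positional difference 0 is preferred over the one with difference 4. The quadratic pair scan is adequate for sentence-length input; a production system would more likely use a linear-time construction with an LCP array.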

Notes

In the form of its function AutoSuggest.
http://www.casmacat.eu/.
https://www.matecat.com/.
http://similis.org/.
http://www.omegat.org/.
http://www.opentm2.org/.
Formerly known as Cologne University of Applied Sciences.
In English: Optimizing Commercial Translation Memory Systems through Integration of Morphosyntactic Analyses.
MPRO stands for Morphological Program (Maas 1998).
http://www.sqlite.org/.
As the iMem-algorithm only works on the source language segment of a TU, any other target language could have been used to create the TMs.
ST stands for source language text.

References

Aluru S (2004) Suffix trees and suffix arrays. In: Mehta DP, Sahni S (eds) Handbook of data structures and applications. Chapman & Hall/CRC, Boca Raton, pp 29-1–29-22
Azzano D (2009) CAT und MÜ—Getrennte Welten? In: Seewald-Heeg U, Stein D (eds) Journal for Language Technology and Computational Linguistics (JLCL). Maschinelle Übersetzung von der Theorie zur Anwendung—machine translation—theory and applications, vol 24(3). Hochschule Anhalt, Ludwig-Maximilians-Universität München, Köthen, München, pp 19–36
Azzano D, Reinke U, Sauer M (2011) Ansätze zur Verbesserung der Retrieval-Leistung kommerzieller Translation-Memory-Systeme. In: Hedeland H, Schmidt T, Wörner K (eds) Multilingual resources and multilingual applications, proceedings of the conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011. Universität Hamburg, Hamburg, Germany, pp 123–128
Baisa V, Horák A, Medved’ M (2015) Increasing coverage of translation memories with linguistically motivated segment combination methods. In: RANLP, proceedings of the workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria, pp 31–35
Bloodgood M, Strauss B (2014) Translation memory retrieval methods. In: EACL 2014, proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp 202–210
Bowker L (2002) Computer-aided translation technology: a practical introduction. University of Ottawa Press, Ottawa
Bruckner C, Plitt M (2001) Evaluating the operational benefit of using machine translation output as translation memory input. In: MT Summit VIII, proceedings of the 8th Machine Translation Summit, workshop on MT evaluation, Santiago de Compostela, Spain, pp 61–65
Carl M, Schmidt-Wigger A (1998) Shallow post morphological processing with KURD. In: Powers DMW (ed) NeMLaP3/CoNLL98, proceedings of the joint conferences on new methods in Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Stroudsburg, pp 257–265
Chatzitheodorou K (2015) Improving translation memory fuzzy matching by paraphrasing. In: RANLP, proceedings of the workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria, pp 24–30
Dara A, Dandapat S, Groves D, van Genabith J (2013) TMTprime: a recommender system for MT and TM integration. In: HLT-NAACL 2013, proceedings of the 2013 annual conference of the North American Chapter of the Association for Computational Linguistics, Demonstration Session, Atlanta, Georgia, USA, pp 10–13
Elita N, Gavrila M (2006) Enhancing translation memories with semantic knowledge. In: CESCL, proceedings of the 1st central European Student Conference in Linguistics, Budapest, Hungary
Federico M, Bertoldi N, Cettolo M, Negri M, Turchi M, Trombetti M, Cattelan A, Farina A, Lupinetti D, Martines A, Massidda A, Schwenk H, Barrault L, Blain F, Koehn P, Buck C, Germann U (2014) The Matecat Tool. In: Tounsi L, Rak R (eds) COLING 2014, proceedings of the 25th international conference on computational linguistics: system demonstrations. Dublin, Ireland, pp 129–132
Flanagan K (2014) Filling in the gaps: what we need from TM subsegment recall. http://www.kftrans.co.uk/lift/FillingInTheGaps.pdf. Accessed 26 Feb 2015
Gordon I (1997) The TM revolution—what does it really mean? In: Translating and the computer 19: papers from the Aslib conference, London
Gupta R, Orăsan C (2014) Incorporating paraphrasing in translation memory matching and retrieval. In: Tadić M, Koehn P, Roturier J, Way A (eds) EAMT 2014, proceedings of the 17th annual conference of the European Association for Machine Translation, Dubrovnik, Croatia, pp 3–10
He Y, Ma Y, van Genabith J, Way A (2010a) Bridging SMT and TM with translation recommendation. In: Hajic J, Carberry S, Clark S (eds) ACL 2010, proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 622–630
He Y, Ma Y, Way A, van Genabith J (2010b) Integrating N-best SMT Outputs into a TM System. In: Huang CR, Jurafsky D (eds) COLING 2010, 23rd international conference on computational linguistics, Posters Volume, Beijing, China, pp 374–382
Henrich A (2008) Information retrieval 1—Grundlagen, Modelle, Anwendungen, version 1.2 (rev.: 5727, status: 7 January 2008). Otto-Friedrich-Universität, Lehrstuhl für Medieninformatik, Bamberg, 2001–2008. http://www.uni-bamberg.de/fileadmin/uni/fakultaeten/wiai_lehrstuehle/medieninformatik/Dateien/Publikationen/2008/henrich-ir1-1.2.pdf. Accessed 9 April 2014
IBM (2016) OpenTM2 for Windows—translator’s reference, version 1.3.1, 5th edn, p 321. https://github.com/OpenTM2/opentm2-source/blob/master/Doc/Opentm2TranslatorsReference.pdf. Accessed 23 Sept 2016
Koehn P, Senellart J (2010) Convergence of translation memory and statistical machine translation. In: Zhechev V (ed) JEC 2010, proceedings of the 2nd joint EM+/CGNL workshop: bringing MT to the user: research on integrating MT in the translation industry, Denver, Colorado, pp 21–31
Koehn P, Alabau V, Carl M et al (2013) CASMACAT—Final Public Report. http://www.casmacat.eu/uploads/Deliverables/final-public-report.pdf. Accessed 26 April 2016
Kuhns RJ (2007) Advanced leveraging—the new generation of TMs. TAUS Report. TAUS B.V., De Rijp
Lagoudaki E (2006) Translation Memories Survey 2006. Translation memory systems: enlightening users’ perspective. Imperial College London. http://www3.imperial.ac.uk/pls/portallive/docs/1/7307707.PDF. Accessed 19 July 2011
Maas HD (1998) Multilingualität in MPRO. Institut für Angewandte Informationswissenschaft, Saarbrücken. http://iai.iai-sb.de/docs/mmpro.pdf. Accessed 7 Feb 2012
Maas HD, Rösener C, Theofilidis A (2009) Morphosyntactic and semantic analysis of text: The MPRO tagging procedure. In: Mahlow C, Piotrowski M (eds) State of the art in computational morphology: workshop on Systems and Frameworks for Computational Morphology (SFCM 2009), Zurich, Switzerland, Proceedings. Communications in Computer and Information Science, vol 41. Springer, Heidelberg, pp 76–87
Macklovitch E (2000) Two types of translation memory. In: Translating and the computer 22, proceedings of the 22nd international conference on translating and the computer, London, England. Aslib, London
McTait K (2001) Linguistic knowledge and complexity in an EBMT system based on translation patterns. In: MT Summit VIII, proceedings of the 8th machine translation summit, workshop on MT evaluation, Santiago de Compostela, Spain, pp 23–34
Mitkov R, Corpas G (2008) Improving third generation translation memory systems through identification of rhetorical predicates. http://www.academia.edu/7441422/Improving_Third_Generation_Translation_Memory_Systems. Accessed 26 April 2016
Nielsen J (1993) Usability engineering. Morgan Kaufmann, San Francisco
Planas E (2005) SIMILIS Second-generation translation memory software. In: Translating and the computer 27, proceedings of the 27th international conference on translating and the computer, London, England. Aslib, London
Planas E, Furuse O (1999) Formalizing translation memories. In: MT Summit VII, proceedings of the 7th machine translation summit: MT in the great translation era, Kent Ridge Digital Labs, Singapore, pp 331–339
Planas E, Furuse O (2000) Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In: COLING 2000, proceedings of the 18th international conference on computational linguistics, Universität des Saarlandes, Saarbrücken, Germany, vol 2. Morgan Kaufmann Publishers, San Francisco, pp 621–627
Rapp R (2002) A part-of-speech-based search algorithm for translation memories. In: LREC 2002, proceedings of the 3rd international conference on language resources and evaluation, Las Palmas de Gran Canaria, Spain, pp 466–472
Reinke U (2004) Translation Memories: Systeme - Konzepte - Linguistische Optimierung. Peter Lang, Frankfurt am Main
Reinke U (2013) State of the art in translation memory technology. In: Rehm G, Sasaki F, Stein D, Witt A (eds) Translation: computation, corpora, cognition. Special issue on language technologies for a multilingual Europe, vol 3(1), pp 27–48
Rösener C (2005) Die Stecknadel im Heuhaufen: Natürlichsprachiger Zugang zu Volltextdatenbanken. Dissertation. Peter Lang, Frankfurt am Main
Schäler R (2001) Beyond translation memories. In: MT Summit VIII, proceedings of the 8th machine translation summit, workshop on MT evaluation, Santiago de Compostela, Spain, pp 49–55
Schmitt I (2006) Ähnlichkeitssuche in Multimedia-Datenbanken - Retrieval, Suchalgorithmen, Abfragebehandlung. Oldenbourg Wissenschaftsverlag GmbH, Munich
Sikes R (2007) Fuzzy matching in theory and practice. Multiling Comput Technol 18(6):39–43
Smolej V (n. d.) OmegaT 3.0—user guide. Documentation of the open-source TMS OmegaT. http://ob.nubati.net/ditundat/omegat/docs/en30/appendix.TokenizerPlugin.inOmegaT.html#d0e10802. Accessed 26 Apr 2016
Somers H, Fernandez Diaz G (2004) Translation memory vs. example-based MT: what’s the difference? Int J Transl 16(2):5–33
Timonera K, Mitkov R (2015) Improving translation memory matching through clause splitting. In: RANLP, proceedings of the workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria, pp 17–23
Vanallemeersch T, Vandeghinste V (2014) Improving fuzzy matching through syntactic knowledge. In: Translating and the computer 36, proceedings of the 36th international conference on translating and the computer, London, England, AsLing, London, pp 90–99
Zhechev V, van Genabith J (2010a) Maximising TM performance through sub-tree alignment and SMT. In: AMTA 2010, proceedings of the 9th conference of the association for machine translation in the Americas, Denver, Colorado
Zhechev V, van Genabith J (2010b) Seeding statistical machine translation with translation memory output through tree-based structural alignment. In: SSST-4, proceedings of the 4th workshop on syntax and structure in statistical translation, Beijing, China

Acknowledgements

The research project Intelligente Translation Memories durch computerlinguistische Optimierung (in English: Intelligent Translation Memories by Computational Linguistic Enhancement) (iMem) was funded by the German Federal Ministry of Education and Research (BMBF) (Grant No. 17N0109). Furthermore, it was supported by the Institute of the Society for the Promotion of Applied Information Sciences at the Saarland University (IAI) and Trados GmbH, Stuttgart. The author would like to thank Prof. Dr. Johann Haller, Prof. Dr. Uwe Reinke and Prof. Dr. Josef van Genabith for their helpful feedback regarding the related dissertation on this topic. Many thanks are also owed to the peer reviewers of this article.

Author information

Authors and Affiliations

Department of Applied Linguistics, Translation and Interpreting, Saarland University, Saarbrücken, Germany
Melanie Weitz

Authors

Melanie Weitz

Corresponding author

Correspondence to Melanie Weitz.

About this article

Cite this article

Weitz, M. Improving retrieval performance of translation memories using morphosyntactic analyses and generalized suffix arrays. Machine Translation 31, 117–146 (2017). https://doi.org/10.1007/s10590-017-9193-3

Received: 15 May 2016
Accepted: 28 March 2017
Published: 04 May 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10590-017-9193-3
