Improving retrieval performance of translation memories using morphosyntactic analyses and generalized suffix arrays

Machine Translation

Abstract

Since the 1990s, translation memory (TM) systems have been among the most widely used tools in computer-aided translation. However, most commercially available systems still consider two segments similar only if their character sequences are identical or differ marginally in spelling. Linguistic similarities are disregarded, so that semantically identical or similar segments with different (morpho)syntactic structures are retrieved with a lower similarity value than expected, or not at all. The iMem (Intelligent Translation Memories) research project aimed to improve the retrieval of such segments by analyzing their (morpho)syntactic structure and identifying the longest common substrings (LCS) between two sentences by means of generalized suffix arrays. The results of the morphosyntactic analysis are stored in the so-called iMem-TM, an independent relational database that is connected to a commercial, non-linguistically enhanced TM via its API. Base words were used to build the suffixes, which increases the probability of finding a larger number of LCS between the two sentences. Furthermore, an existing algorithm for generalized suffix arrays was extended with an additional array that records which sentence each suffix derives from. In this way, identical repeating LCS within the same sentence are ignored, whereas identical repeating LCS between two different sentences are still considered. If more than one identical repeating LCS exists between the sentences, the best-matching LCS is determined by calculating the positional differences for the identical repeating LCS and choosing the one with the minimal positional difference.
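The matching step described in the abstract can be illustrated with a small Python sketch. It is not the iMem implementation: the function names, the naive sort-based construction of the suffix array, and the restriction to a single longest common run are all assumptions made for illustration. The input is assumed to be already tokenized and reduced to base words (for example, "be" instead of "is"), and positions are counted in tokens.

```python
def common_prefix_len(x, y):
    """Number of leading tokens that two token lists share."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n


def longest_common_token_run(sent_a, sent_b):
    """Hypothetical helper: longest run of base words shared by two sentences.

    Returns (run, start_in_a, start_in_b) or None if nothing is shared.
    """
    # Generalized suffix array, built naively: all suffixes of both
    # sentences in one sorted list (fine for sentence-length input).
    entries = sorted(
        (tokens[pos:], origin, pos)
        for origin, tokens in enumerate((sent_a, sent_b))
        for pos in range(len(tokens))
    )
    suffixes = [e[0] for e in entries]
    origin = [e[1] for e in entries]  # additional array: source sentence of each suffix
    start = [e[2] for e in entries]   # token position of each suffix

    best_len, best_run = 0, None
    for i in range(1, len(entries)):
        if origin[i] == origin[i - 1]:
            continue  # a repeat inside one sentence is not a match
        k = common_prefix_len(suffixes[i - 1], suffixes[i])
        if k > best_len:
            best_len, best_run = k, suffixes[i][:k]
    if best_len == 0:
        return None

    # The run may occur several times; collect every occurrence per sentence.
    occ = {0: [], 1: []}
    for suffix, src, pos in entries:
        if suffix[:best_len] == best_run:
            occ[src].append(pos)

    # Choose the pairing with the minimal positional difference.
    a, b = min(
        ((p, q) for p in occ[0] for q in occ[1]),
        key=lambda pair: abs(pair[0] - pair[1]),
    )
    return best_run, a, b


sent_a = "the cover be open and the cover be lock".split()
sent_b = "please check whether the cover be close".split()
print(longest_common_token_run(sent_a, sent_b))
# (['the', 'cover', 'be'], 5, 3)
```

In this example the run "the cover be" occurs twice in the first sentence (token positions 0 and 5) and once in the second (position 3). The repeat within the first sentence is never compared with itself, and the occurrence at position 5 is chosen because its positional difference to position 3 is smaller.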

Notes

  1. In the form of its function AutoSuggest.

  2. http://www.casmacat.eu/.

  3. https://www.matecat.com/.

  4. http://similis.org/.

  5. http://www.omegat.org/.

  6. http://www.opentm2.org/.

  7. Formerly known as Cologne University of Applied Sciences.

  8. In English: Optimizing Commercial Translation Memory Systems through Integration of Morphosyntactic Analyses.

  9. MPRO stands for Morphological Program (Maas 1998).

  10. http://www.sqlite.org/.

  11. As the iMem-algorithm only works on the source language segment of a TU, any other target language could have been used to create the TMs.

  12. ST stands for source language text.

References

  • Aluru S (2004) Suffix trees and suffix arrays. In: Mehta DP, Sahni S (eds) Handbook of data structures and applications. Chapman & Hall/CRC, Boca Raton, pp 29-1–29-22

  • Azzano D (2009) CAT und MÜ—Getrennte Welten? In: Seewald-Heeg U, Stein D (eds) Journal for Language Technology and Computational Linguistics (JLCL). Maschinelle Übersetzung von der Theorie zur Anwendung—machine translation—theory and applications, vol 24(3). Hochschule Anhalt, Ludwig-Maximilians-Universität München, Köthen, München, pp 19–36

  • Azzano D, Reinke U, Sauer M (2011) Ansätze zur Verbesserung der Retrieval-Leistung kommerzieller Translation-Memory-Systeme. In: Hedeland H, Schmidt T, Wörner K (eds) Multilingual resources and multilingual applications, proceedings of the conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011. Universität Hamburg, Hamburg, Germany, pp 123–128

  • Baisa V, Horák A, Medved’ M (2015) Increasing coverage of translation memories with linguistically motivated segment combination methods. In: RANLP, proceedings of the workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria, pp 31–35

  • Bloodgood M, Strauss B (2014) Translation memory retrieval methods. In: EACL 2014, proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp 202–210

  • Bowker L (2002) Computer-aided translation technology: a practical introduction. University of Ottawa Press, Ottawa

  • Bruckner C, Plitt M (2001) Evaluating the operational benefit of using machine translation output as translation memory input. In: MT Summit VIII, proceedings of the 8th Machine Translation Summit, workshop on MT evaluation, Santiago de Compostela, Spain, pp 61–65

  • Carl M, Schmidt-Wigger A (1998) Shallow post morphological processing with KURD. In: Powers DMW (ed) NeMLaP3/CoNLL98, proceedings of the joint conferences on new methods in Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Stroudsburg, pp 257–265

  • Chatzitheodorou K (2015) Improving translation memory fuzzy matching by paraphrasing. In: RANLP, proceedings of the workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria, pp 24–30

  • Dara A, Dandapat S, Groves D, van Genabith J (2013) TMTprime: a recommender system for MT and TM integration. In: HLT-NAACL 2013, proceedings of the 2013 annual conference of the North American Chapter of the Association for Computational Linguistics, Demonstration Session, Atlanta, Georgia, USA, pp 10–13

  • Elita N, Gavrila M (2006) Enhancing translation memories with semantic knowledge. In: CESCL, proceedings of the 1st central European Student Conference in Linguistics, Budapest, Hungary

  • Federico M, Bertoldi N, Cettolo M, Negri M, Turchi M, Trombetti M, Cattelan A, Farina A, Lupinetti D, Martines A, Massidda A, Schwenk H, Barrault L, Blain F, Koehn P, Buck C, Germann U (2014) The Matecat Tool. In: Tounsi L, Rak R (eds) COLING 2014, proceedings of the 25th international conference on computational linguistics: system demonstrations. Dublin, Ireland, pp 129–132

  • Flanagan K (2014) Filling in the gaps: what we need from TM subsegment recall. http://www.kftrans.co.uk/lift/FillingInTheGaps.pdf. Accessed 26 Feb 2015

  • Gordon I (1997) The TM revolution—what does it really mean? In: Translating and the computer 19: papers from the Aslib conference, London

  • Gupta R, Orăsan C (2014) Incorporating paraphrasing in translation memory matching and retrieval. In: Tadić M, Koehn P, Roturier J, Way A (eds) EAMT 2014, proceedings of the 17th annual conference of the European Association for Machine Translation, Dubrovnik, Croatia, pp 3–10

  • He Y, Ma Y, van Genabith J, Way A (2010a) Bridging SMT and TM with translation recommendation. In: Hajic J, Carberry S, Clark S (eds) ACL 2010, proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp 622–630

  • He Y, Ma Y, Way A, van Genabith J (2010b) Integrating N-best SMT Outputs into a TM System. In: Huang CR, Jurafsky D (eds) COLING 2010, 23rd international conference on computational linguistics, Posters Volume, Beijing, China, pp 374–382

  • Henrich A (2008) Information retrieval 1—Grundlagen, Modelle, Anwendungen, version 1.2 (rev.: 5727, status: 7 January 2008). Otto-Friedrich-Universität, Lehrstuhl für Medieninformatik, Bamberg, 2001–2008. http://www.uni-bamberg.de/fileadmin/uni/fakultaeten/wiai_lehrstuehle/medieninformatik/Dateien/Publikationen/2008/henrich-ir1-1.2.pdf. Accessed 9 April 2014

  • IBM (2016) OpenTM2 for Windows—translator’s reference, version 1.3.1, 5th edn, p 321. https://github.com/OpenTM2/opentm2-source/blob/master/Doc/Opentm2TranslatorsReference.pdf. Accessed 23 Sept 2016

  • Koehn P, Senellart J (2010) Convergence of translation memory and statistical machine translation. In: Zhechev V (ed) JEC 2010, proceedings of the 2nd joint EM+/CGNL workshop: bringing MT to the user: research on integrating MT in the translation industry, Denver, Colorado, pp 21–31

  • Koehn P, Alabau V, Carl M et al (2013) CASMACAT—Final Public Report. http://www.casmacat.eu/uploads/Deliverables/final-public-report.pdf. Accessed 26 April 2016

  • Kuhns RJ (2007) Advanced leveraging—the new generation of TMs. TAUS Report. TAUS B.V., De Rijp

  • Lagoudaki E (2006) Translation Memories Survey 2006. Translation memory systems: enlightening users’ perspective. Imperial College London. http://www3.imperial.ac.uk/pls/portallive/docs/1/7307707.PDF. Accessed 19 July 2011

  • Maas HD (1998) Multilingualität in MPRO. Institut für Angewandte Informationswissenschaft, Saarbrücken. http://iai.iai-sb.de/docs/mmpro.pdf. Accessed 7 Feb 2012

  • Maas HD, Rösener C, Theofilidis A (2009) Morphosyntactic and semantic analysis of text: The MPRO tagging procedure. In: Mahlow C, Piotrowski M (eds) State of the art in computational morphology: workshop on Systems and Frameworks for Computational Morphology (SFCM 2009), Zurich, Switzerland, Proceedings. Communications in Computer and Information Science, vol 41. Springer, Heidelberg, pp 76–87

  • Macklovitch E (2000) Two types of translation memory. In: Translating and the computer 22, proceedings of the 22nd international conference on translating and the computer, London, England. Aslib, London

  • McTait K (2001) Linguistic knowledge and complexity in an EBMT system based on translation patterns. In: MT Summit VIII, proceedings of the 8th machine translation summit, workshop on MT evaluation, Santiago de Compostela, Spain, pp 23–34

  • Mitkov R, Corpas G (2008) Improving third generation translation memory systems through identification of rhetorical predicates. http://www.academia.edu/7441422/Improving_Third_Generation_Translation_Memory_Systems. Accessed 26 April 2016

  • Nielsen J (1993) Usability engineering. Morgan Kaufmann, San Francisco

  • Planas E (2005) SIMILIS Second-generation translation memory software. In: Translating and the computer 27, proceedings of the 27th international conference on translating and the computer, London, England. Aslib, London

  • Planas E, Furuse O (1999) Formalizing translation memories. In: MT Summit VII, proceedings of the 7th machine translation summit: MT in the great translation era, Kent Ridge Digital Labs, Singapore, pp 331–339

  • Planas E, Furuse O (2000) Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In: COLING 2000, proceedings of the 18th international conference on computational linguistics, Universität des Saarlandes, Saarbrücken, Germany, vol 2. Morgan Kaufmann Publishers, San Francisco, pp 621–627

  • Rapp R (2002) A part-of-speech-based search algorithm for translation memories. In: LREC 2002, proceedings of the 3rd international conference on language resources and evaluation, Las Palmas de Gran Canaria, Spain, pp 466–472

  • Reinke U (2004) Translation Memories: Systeme - Konzepte - Linguistische Optimierung. Peter Lang, Frankfurt am Main

  • Reinke U (2013) State of the art in translation memory technology. In: Rehm G, Sasaki F, Stein D, Witt A (eds) Translation: computation, corpora, cognition. Special issue on language technologies for a multilingual Europe, vol 3(1), pp 27–48

  • Rösener C (2005) Die Stecknadel im Heuhaufen: Natürlichsprachiger Zugang zu Volltextdatenbanken. Dissertation. Peter Lang, Frankfurt am Main

  • Schäler R (2001) Beyond translation memories. In: MT Summit VIII, proceedings of the 8th machine translation summit, workshop on MT evaluation, Santiago de Compostela, Spain, pp 49–55

  • Schmitt I (2006) Ähnlichkeitssuche in Multimedia-Datenbanken - Retrieval, Suchalgorithmen, Abfragebehandlung. Oldenbourg Wissenschaftsverlag GmbH, Munich

  • Sikes R (2007) Fuzzy matching in theory and practice. Multiling Comput Technol 18(6):39–43

  • Smolej V (n. d.) OmegaT 3.0—user guide. Documentation of the open-source TMS OmegaT. http://ob.nubati.net/ditundat/omegat/docs/en30/appendix.TokenizerPlugin.inOmegaT.html#d0e10802. Accessed 26 Apr 2016

  • Somers H, Fernandez Diaz G (2004) Translation memory vs. example-based MT: what’s the difference? Int J Transl 16(2):5–33

  • Timonera K, Mitkov R (2015) Improving translation memory matching through clause splitting. In: RANLP, proceedings of the workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria, pp 17–23

  • Vanallemeersch T, Vandeghinste V (2014) Improving fuzzy matching through syntactic knowledge. In: Translating and the computer 36, proceedings of the 36th international conference on translating and the computer, London, England, AsLing, London, pp 90–99

  • Zhechev V, van Genabith J (2010a) Maximising TM performance through sub-tree alignment and SMT. In: AMTA 2010, proceedings of the 9th conference of the association for machine translation in the Americas, Denver, Colorado

  • Zhechev V, van Genabith J (2010b) Seeding statistical machine translation with translation memory output through tree-based structural alignment. In: SSST-4, proceedings of the 4th workshop on syntax and structure in statistical translation, Beijing, China

Acknowledgements

The research project Intelligente Translation Memories durch computerlinguistische Optimierung (in English: Intelligent Translation Memories by Computational Linguistic Enhancement) (iMem) was funded by the German Federal Ministry of Education and Research (BMBF) (Grant No. 17N0109). Furthermore, it was supported by the Institute of the Society for the Promotion of Applied Information Sciences at the Saarland University (IAI) and Trados GmbH, Stuttgart. The author would like to thank Prof. Dr. Johann Haller, Prof. Dr. Uwe Reinke and Prof. Dr. Josef van Genabith for their helpful feedback regarding the related dissertation on this topic. Many thanks are also owed to the peer reviewers of this article.

Author information

Correspondence to Melanie Weitz.

About this article

Cite this article

Weitz, M. Improving retrieval performance of translation memories using morphosyntactic analyses and generalized suffix arrays. Machine Translation 31, 117–146 (2017). https://doi.org/10.1007/s10590-017-9193-3

  • DOI: https://doi.org/10.1007/s10590-017-9193-3
