
A system for terminology extraction and translation equivalent detection in real time

Efficient use of statistical machine translation phrase tables

Machine Translation


In this paper we present a system for automatic terminology extraction and automatic detection of equivalent terms in the target language. It is designed to be used alongside a computer-assisted translation (CAT) tool and automatically provides term candidates and their translations each time the translator moves from one segment to the next. The system uses several sources of information: the text of the segment being translated and of the whole translation project, the translation memories assigned to the project, and a translation phrase table from a statistical machine translation system. It also uses the terminological database assigned to the project in order to avoid presenting terms that are already known. The use of translation phrase tables allows us to exploit very large parallel corpora very efficiently. We have used Moses to compute and query the translation phrase tables. The program is written in Python and can be used with any CAT tool. In our experiments we used OmegaT, a well-known open-source CAT tool. Evaluation results are presented for the English–Spanish language pair in three subject areas (politics, finance, and medicine).
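The lookup step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it assumes a plain-text phrase table in the `source ||| target ||| scores` layout that Moses training typically produces, and it assumes the third score is the direct translation probability. The function names (`parse_phrase_table`, `suggest_equivalents`), the `min_prob` threshold, and the sample entries are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical sample entries in the assumed "source ||| target ||| scores" layout.
SAMPLE_TABLE = [
    "interest rate ||| tipo de interés ||| 0.6 0.4 0.7 0.5",
    "interest rate ||| tasa de interés ||| 0.3 0.2 0.2 0.3",
    "central bank ||| banco central ||| 0.9 0.8 0.85 0.7",
]


def parse_phrase_table(lines: Iterable[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Map each source phrase to a list of (target phrase, probability) pairs."""
    table: Dict[str, List[Tuple[str, float]]] = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # skip malformed lines
        source, target, scores = fields[0], fields[1], fields[2].split()
        if len(scores) < 3:
            continue  # not enough scores to read the assumed column
        # Assumption: the third score is the direct probability p(target | source).
        table.setdefault(source, []).append((target, float(scores[2])))
    return table


def suggest_equivalents(
    candidates: Iterable[str],
    table: Dict[str, List[Tuple[str, float]]],
    known_terms: Set[str],
    min_prob: float = 0.1,
) -> Dict[str, str]:
    """Return the most probable target phrase for each new term candidate."""
    suggestions: Dict[str, str] = {}
    for term in candidates:
        if term in known_terms:
            continue  # already in the project's terminological database
        options = [(t, p) for t, p in table.get(term, []) if p >= min_prob]
        if options:
            suggestions[term] = max(options, key=lambda pair: pair[1])[0]
    return suggestions


table = parse_phrase_table(SAMPLE_TABLE)
result = suggest_equivalents(
    ["interest rate", "central bank", "hedge fund"],
    table,
    known_terms={"central bank"},
)
print(result)
```

In this sketch, candidates already present in the terminological database are skipped, and candidates with no phrase-table entry simply yield no suggestion. A real deployment over very large corpora would query a binarised phrase table rather than parse text lines in memory.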

















Author information


Corresponding author

Correspondence to Antoni Oliver.


Cite this article

Oliver, A. A system for terminology extraction and translation equivalent detection in real time. Machine Translation 31, 147–161 (2017).
