Computers and the Humanities, Volume 38, Issue 2, pp 163–189

Extracting Multilingual Lexicons from Parallel Corpora

  • Dan Tufiş
  • Ana Maria Barbu
  • Radu Ion


This paper describes our recent work on the automatic extraction of translation equivalents from parallel corpora. We present three increasingly complex algorithms: a simple iterative baseline and two more elaborate non-iterative versions. The baseline is described mainly for illustrative purposes, whereas the non-iterative algorithms illustrate different working hypotheses, which may be motivated by different kinds of applications and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual part-of-speech (POS) preservation, while for the third POS invariance is not an extraction condition. The algorithms were evaluated on three different corpora and several language pairs.
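To make the idea of an iterative baseline with a POS-preservation constraint concrete, here is a minimal sketch. It is not the algorithm from the paper, whose details are not given in this abstract. It assumes a sentence-aligned, lemmatized and tagged bitext, scores candidate pairs with the Dice coefficient (an assumption; other association measures could be used), and greedily accepts the best pair on each pass. The function name `extract_equivalents`, the `min_score` threshold and the toy English-Romanian data are all hypothetical.

```python
from collections import Counter
from itertools import product


def extract_equivalents(bitext, min_score=0.5, require_same_pos=True):
    """Greedy iterative extraction of translation-equivalent candidates.

    bitext: list of (source_tokens, target_tokens); each token is a
    (lemma, pos) tuple. Returns a list of (source, target, score).
    """
    # Work on mutable copies so accepted tokens can be removed.
    pairs = [(list(src), list(tgt)) for src, tgt in bitext]
    lexicon = []
    while True:
        src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
        for src, tgt in pairs:
            src_set, tgt_set = set(src), set(tgt)
            src_freq.update(src_set)
            tgt_freq.update(tgt_set)
            for s, t in product(src_set, tgt_set):
                # POS preservation: only pair tokens with the same tag.
                if require_same_pos and s[1] != t[1]:
                    continue
                co_freq[(s, t)] += 1
        if not co_freq:
            break

        def dice(pair):
            s, t = pair
            return 2 * co_freq[pair] / (src_freq[s] + tgt_freq[t])

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(co_freq), key=dice)
        score = dice(best)
        if score < min_score:
            break
        s, t = best
        lexicon.append((s, t, score))
        # Remove the accepted pair wherever both members co-occur,
        # so the next pass recounts on the remaining tokens.
        for src, tgt in pairs:
            if s in src and t in tgt:
                src[:] = [tok for tok in src if tok != s]
                tgt[:] = [tok for tok in tgt if tok != t]
    return lexicon


# Hypothetical toy data: three aligned sentence pairs.
toy_bitext = [
    ([("house", "N"), ("be", "V")], [("casă", "N"), ("fi", "V")]),
    ([("house", "N"), ("see", "V")], [("casă", "N"), ("vedea", "V")]),
    ([("dog", "N"), ("be", "V")], [("câine", "N"), ("fi", "V")]),
]
lexicon = extract_equivalents(toy_bitext)
```

Setting `require_same_pos=False` drops the tag filter, which loosely mirrors a setting where POS invariance is not an extraction condition; on realistic data this enlarges the candidate space considerably.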

Keywords: alignment, evaluation, lemmatization, tagging, translation equivalence





Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Dan Tufiş (1)
  • Ana Maria Barbu (1)
  • Radu Ion (1)

  1. Romanian Academy (RACAI), Bucharest 5, Romania
