Language Resources and Evaluation, Volume 44, Issue 1–2, pp 79–95

Compositionality and lexical alignment of multi-word terms



The automatic compilation of bilingual term lists from specialized comparable corpora using lexical alignment has been successful for single-word terms (SWTs), but remains disappointing for multi-word terms (MWTs). The main reported problems are the low frequency of MWTs and the variability of their syntactic structures in the source and target languages. This paper defines a general framework dedicated to the lexical alignment of MWTs from comparable corpora that includes a compositional translation process and the standard lexical context analysis. Because the compositional method, which is based on the translation of lexical items, is restrictive, we introduce an extended compositional method that bridges the gap between MWTs of different syntactic structures through morphological links. We experimented with the two compositional methods on the French–Japanese alignment task. The results show a significant improvement for the translation of MWTs and advocate further morphological analysis in lexical alignment.
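As a rough illustration of the compositional translation idea mentioned in the abstract, the sketch below translates each lexical item of a multi-word term through a bilingual dictionary, recombines the translations in every word order, and keeps only the combinations attested among terms extracted from the target corpus. Everything in it is an assumption made for illustration: the function name `compositional_translations`, the toy dictionary, and the toy target term list are hypothetical, and a French-to-English toy pair is used for readability even though the paper addresses French–Japanese. It is not the authors' implementation.

```python
from itertools import permutations, product


def compositional_translations(source_term, dictionary, target_terms):
    """Return attested target-side candidates for a multi-word source term.

    Hypothetical helper: `dictionary` maps a source word to a list of target
    words, and `target_terms` is a set of terms found in the target corpus.
    """
    words = source_term.split()
    # Look up each lexical item; give up if any item has no dictionary entry.
    options = [dictionary.get(word, []) for word in words]
    if not all(options):
        return []
    candidates = set()
    # Combine one translation per item, in every word order, because the
    # target language may order the components differently.
    for combo in product(*options):
        for order in permutations(combo):
            candidates.add(" ".join(order))
    # Keep only the combinations attested in the target-side term list.
    return sorted(candidates & set(target_terms))


# Toy data, invented for this example only.
toy_dictionary = {
    "fatigue": ["fatigue", "tiredness"],
    "chronique": ["chronic"],
}
toy_target_terms = {"chronic fatigue", "muscle fatigue"}

print(compositional_translations("fatigue chronique", toy_dictionary, toy_target_terms))
```

The sketch returns an empty list as soon as one lexical item is missing from the dictionary, which is one way to see why a purely compositional method is restrictive. The morphological links used by the extended method are not modelled here.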


Terminology mining, Comparable corpora, Lexical alignment, Compositional translation



Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Université de Nantes, LINA-UMR CNRS 6241, Nantes Cedex 3, France
