Bilingual Terminology Mining from Language for Special Purposes Comparable Corpora

Building and Using Comparable Corpora

Abstract

In this paper we study the problem of compiling a bilingual lexicon from language for special purposes (LSP) comparable corpora. We first define comparability for a specialized comparable corpus and stress the distinction between expert and non-expert documents. We then turn to the contextual information method used for bilingual lexicon extraction and show its limits: the context vectors discriminate poorly because little data is available, and translating the context vectors is harder because the dictionary lacks specialized translations. For each problem, we propose a solution that relies on the linguistic properties of LSP. For the first limit, we propose to strengthen the representativeness of the lexical contexts using domain-specific vocabulary, called anchor points, notably neoclassical compounds and transliterations. For the second limit, we propose to use the parallel sentences present in the comparable corpus to build a specialized bilingual lexicon directly tied to the specialized vocabulary of the comparable corpus. Our experiments support both strategies and show that the resulting candidate translations are of better quality.
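The contextual information method mentioned in the abstract can be illustrated with a minimal sketch. This is not the chapter's implementation: it assumes the generic form of the approach (raw co-occurrence counts in a fixed window, dictionary translation of the context vector, cosine similarity), and the function names, the toy French and English sentences, and the tiny seed dictionary are all hypothetical. The chapter itself may use different association and similarity measures.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=2):
    """Count the words that co-occur with `target` within +/- `window` tokens."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vec[tokens[j]] += 1
    return vec


def translate_vector(vec, dictionary):
    """Map source-language context words to the target language.

    Context words missing from the dictionary are dropped, which is the
    second limit described in the abstract.
    """
    out = Counter()
    for word, count in vec.items():
        for translation in dictionary.get(word, []):
            out[translation] += count
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(src_tokens, tgt_tokens, term, candidates, dictionary, window=2):
    """Rank target-language candidates by similarity to the translated context of `term`."""
    src_vec = translate_vector(context_vector(src_tokens, term, window), dictionary)
    scored = [(c, cosine(src_vec, context_vector(tgt_tokens, c, window)))
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one French and one English sentence, plus a seed dictionary.
src = "le diabète provoque une hyperglycémie chronique chez le patient".split()
tgt = ("diabetes causes chronic hyperglycemia in the patient "
       "while insulin lowers glucose").split()
seed = {"provoque": ["causes"], "chronique": ["chronic"], "patient": ["patient"]}

ranked = rank_candidates(src, tgt, "hyperglycémie", ["hyperglycemia", "insulin"], seed)
unranked = rank_candidates(src, tgt, "hyperglycémie", ["hyperglycemia", "insulin"], {})
```

With the seed dictionary, `hyperglycemia` ranks above `insulin`. With an empty dictionary (`unranked`), every score is zero, which mirrors why missing specialized dictionary entries hurt the method.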

Author information

Corresponding author

Correspondence to Béatrice Daille.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Morin, E., Daille, B., Prochasson, E. (2013). Bilingual Terminology Mining from Language for Special Purposes Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_14

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science, Computer Science (R0)
