Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNISA, volume 8455)

Abstract

Creating and maintaining terminologies by hand is known to be a resource-intensive task for human experts. We report on efforts to support this process computationally by treating term acquisition as a machine translation-guided classification problem that capitalizes on parallel multilingual corpora. Experiments are described for the French, German, Spanish and Dutch parts of a multilingual biomedical terminology, for which we generated 18k, 23k, 19k and 12k new terms and synonyms, respectively; about one half relate to concepts that had not been lexically labeled before. Based on an expert assessment of a sample of the novel German segment, about 80% of these newly acquired terms were judged to be linguistically correct and biomedically reasonable additions to the terminology.
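
The idea of term acquisition as a translation-guided classification problem can be illustrated with a small sketch. It is not the authors' implementation: the `Candidate` record, the `accept` threshold rule (a stand-in for a trained classifier), the concept IDs and the example terms are all hypothetical and chosen only for illustration. The sketch assumes that a translation model proposes target-language candidates for English terms, that a classifier accepts or rejects each candidate, and that accepted candidates are then split into synonyms for concepts that already carry a target-language label and labels for concepts that had none.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One translation candidate for a terminology concept (hypothetical record)."""
    concept_id: str          # illustrative concept identifier
    source_term: str         # English term of the concept
    target_term: str         # candidate proposed by a translation model
    translation_prob: float  # the only feature used in this sketch


def accept(candidate: Candidate, threshold: float = 0.5) -> bool:
    """Stand-in for a trained classifier: a single-feature threshold rule."""
    return candidate.translation_prob >= threshold


def enrich(
    existing: dict[str, set[str]],
    candidates: list[Candidate],
    threshold: float = 0.5,
) -> tuple[list[Candidate], list[Candidate]]:
    """Split accepted candidates into synonyms and labels for unlabeled concepts."""
    synonyms: list[Candidate] = []
    new_labels: list[Candidate] = []
    for cand in candidates:
        known = existing.get(cand.concept_id, set())
        # Skip rejected candidates and terms the terminology already contains.
        if not accept(cand, threshold) or cand.target_term in known:
            continue
        # A concept with at least one existing target label gains a synonym;
        # otherwise the candidate labels a previously unlabeled concept.
        (synonyms if known else new_labels).append(cand)
    return synonyms, new_labels


# Toy German example; all data below is made up for illustration.
existing = {"C001": {"Herzinfarkt"}}
candidates = [
    Candidate("C001", "myocardial infarction", "Myokardinfarkt", 0.8),
    Candidate("C001", "myocardial infarction", "Herzinfarkt", 0.9),
    Candidate("C002", "kidney stone", "Nierenstein", 0.7),
    Candidate("C002", "kidney stone", "Stein", 0.2),
]
synonyms, new_labels = enrich(existing, candidates)
print([c.target_term for c in synonyms])
print([c.target_term for c in new_labels])
```

In this toy run, one candidate is kept as a synonym and one as a label for a concept without a German term, which mirrors the split between new synonyms and newly labeled concepts reported in the abstract.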

Keywords

  • Machine Translation
  • Named Entity Recognition
  • Statistical Machine Translation
  • Computational Linguistics
  • Parallel Corpus

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

  • Chapter length: 12 pages

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Hellrich, J., Hahn, U. (2014). Enhancing Multilingual Biomedical Terminologies via Machine Translation from Parallel Corpora. In: Métais, E., Roche, M., Teisseire, M. (eds) Natural Language Processing and Information Systems. NLDB 2014. Lecture Notes in Computer Science, vol 8455. Springer, Cham. https://doi.org/10.1007/978-3-319-07983-7_2

  • DOI: https://doi.org/10.1007/978-3-319-07983-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07982-0

  • Online ISBN: 978-3-319-07983-7

  • eBook Packages: Computer Science, Computer Science (R0)