Adding Multilingual Terminological Resources to Parallel Corpora for Statistical Machine Translation Deteriorates System Performance: A Negative Result from Experiments in the Biomedical Domain

Part of the Lecture Notes in Computer Science book series (LNAI, volume 9302)

Abstract

Unlike many other domains, biomedicine not only provides a wide range of parallel text corpora on which to train statistical machine translation (SMT) systems, but also offers substantial amounts of ‘parallel lexicons’ in the form of multilingual terminologies. We incorporated these lexical repositories, together with common parallel text corpora, into a Moses-based SMT system and three commercial systems, and ran experiments on four language pairs, three text genres and several corpus sizes to measure the effects of adding the lexical knowledge sources. Much to our surprise, the SMT systems additionally equipped with ‘parallel lexicons’ underperformed in comparison with those systems trained on parallel text corpora only. This effect was shown consistently for all systems, both by Bleu scores and by assessments from human judges.
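For readers unfamiliar with the Bleu metric named in the abstract, the following is a minimal sketch of a corpus-level Bleu computation (modified n-gram precision up to 4-grams, geometric mean, brevity penalty). It assumes whitespace tokenization, a single reference per hypothesis and no smoothing; the function names `ngrams` and `corpus_bleu` are illustrative and are not taken from the paper or from Moses.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level Bleu; assumes one reference per hypothesis, no smoothing."""
    matched = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

A comparison like the one described above would call `corpus_bleu` once per system on the same test set, for example once with the output of a system trained on parallel corpora only and once with the output of a system that also received the terminologies, and then compare the two scores.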

Keywords

  • Machine translation
  • Biomedicine
  • Terminologies



Author information

Corresponding author

Correspondence to Johannes Hellrich.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hellrich, J., Hahn, U. (2015). Adding Multilingual Terminological Resources to Parallel Corpora for Statistical Machine Translation Deteriorates System Performance: A Negative Result from Experiments in the Biomedical Domain. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_57

  • DOI: https://doi.org/10.1007/978-3-319-24033-6_57

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24032-9

  • Online ISBN: 978-3-319-24033-6

  • eBook Packages: Computer Science, Computer Science (R0)