Adding Multilingual Terminological Resources to Parallel Corpora for Statistical Machine Translation Deteriorates System Performance: A Negative Result from Experiments in the Biomedical Domain

  • Johannes Hellrich
  • Udo Hahn
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9302)


Unlike many other domains, biomedicine not only provides a wide range of parallel text corpora on which to train statistical machine translation (SMT) systems, but also offers substantial amounts of ‘parallel lexicons’ in the form of multilingual terminologies. We integrated these lexical repositories, together with common parallel text corpora, into a Moses-based SMT system and three commercial systems, and performed experiments on four language pairs, three text genres and several corpus sizes to measure the effects of adding the lexical knowledge sources. Much to our surprise, the SMT systems additionally equipped with ‘parallel lexicons’ underperformed in comparison with systems trained on parallel text corpora only. This effect was shown consistently for all systems, both by BLEU scores and by assessments from human judges.


Keywords: Machine translation, Biomedicine, Terminologies




Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany