Abstract
Unlike many other domains, biomedicine not only provides a wide range of parallel text corpora on which to train statistical machine translation (SMT) systems, but also offers a substantial number of ‘parallel lexicons’ in the form of multilingual terminologies. We integrated these lexical repositories, together with common parallel text corpora, into a Moses-based SMT system and three commercial systems, and ran experiments on four language pairs, three text genres and several corpus sizes to measure the effect of adding the lexical knowledge sources. Much to our surprise, the SMT systems additionally equipped with ‘parallel lexicons’ underperformed those trained on parallel text corpora only. This effect was shown consistently for all systems, both by Bleu scores and by assessments from human judges.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Hellrich, J., Hahn, U. (2015). Adding Multilingual Terminological Resources to Parallel Corpora for Statistical Machine Translation Deteriorates System Performance: A Negative Result from Experiments in the Biomedical Domain. In: Král, P., Matoušek, V. (eds.) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol. 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_57
DOI: https://doi.org/10.1007/978-3-319-24033-6_57
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science, Computer Science (R0)