Enabling Medical Translation for Low-Resource Languages

Musleh, Ahmad; Durrani, Nadir; Temnikova, Irina; Nakov, Preslav; Vogel, Stephan; Alsaad, Osama

doi:10.1007/978-3-319-75487-1_1

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical domain, our initial focus has been on gathering suitable training data from various sources. We applied a variety of methods ranging from fully automatic extraction from the Web to manual annotation of test data. Moreover, we developed a method for automatically augmenting the training data with synthetically generated variants, which yielded a very sizable improvement of more than 3 BLEU points absolute.

Notes

1. http://qatar-weill.cornell.edu.
2. http://www.universaldoctor.com.
3. http://medibabble.com.
4. http://www.canopyapps.com.
5. http://mavroinc.com/medical.html.
6. http://duochart.com.
7. http://www.khresmoi.eu.
8. http://www.wikipedia.org.
9. http://www.wiktionary.org.
10. http://www.omegawiki.org.
11. https://github.com/tesseract-ocr.
12. http://www.opensubtitles.com.
13. http://www.ncbi.nlm.nih.gov/mesh.
14. EMILLE contains about 12,000 sentences of comparable data in Hindi and Urdu. We were able to align about 7,000 sentences to build an Urdu-to-Hindi system.
15. We used mkcls to cluster the data into 50 clusters.

Acknowledgments

The authors would like to thank Naila Khalisha and Manisha Bansal for their contributions towards the project.

Author information

Authors and Affiliations

Qatar Computing Research Institute, HBKU, Doha, Qatar
Ahmad Musleh, Nadir Durrani, Irina Temnikova, Preslav Nakov & Stephan Vogel
Texas A&M University in Qatar, Doha, Qatar
Osama Alsaad

Corresponding author

Correspondence to Nadir Durrani.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Musleh, A., Durrani, N., Temnikova, I., Nakov, P., Vogel, S., Alsaad, O. (2018). Enabling Medical Translation for Low-Resource Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_1

DOI: https://doi.org/10.1007/978-3-319-75487-1_1
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)
