Abstract
We present research towards bridging the language gap between migrant workers in Qatar and medical staff. In particular, we present the first steps towards the development of a real-world Hindi-English machine translation system for doctor-patient communication. As this is a low-resource language pair, especially for speech and for the medical domain, our initial focus has been on gathering suitable training data from various sources. We applied a variety of methods ranging from fully automatic extraction from the Web to manual annotation of test data. Moreover, we developed a method for automatically augmenting the training data with synthetically generated variants, which yielded a very sizable improvement of more than 3 BLEU points absolute.
Notes
- 14. EMILLE contains about 12,000 sentences of comparable data in Hindi and Urdu. We were able to align about 7,000 sentences to build an Urdu-to-Hindi system.
- 15. We used mkcls to cluster the data into 50 clusters.
Acknowledgments
The authors would like to thank Naila Khalisha and Manisha Bansal for their contributions towards the project.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Musleh, A., Durrani, N., Temnikova, I., Nakov, P., Vogel, S., Alsaad, O. (2018). Enabling Medical Translation for Low-Resource Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_1
DOI: https://doi.org/10.1007/978-3-319-75487-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science (R0)