Captioning Ultrasound Images Automatically

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11767)


We describe an automatic natural language processing (NLP)-based image captioning method that generates descriptions of fetal ultrasound video content by modelling the vocabulary commonly used by sonographers and sonologists. The generated captions are similar to the words spoken by a sonographer when describing the scan experience in terms of visual content and performed scanning actions. Using full-length second-trimester fetal ultrasound videos and text derived from accompanying expert voice-over audio recordings, we train deep learning models consisting of convolutional neural networks and recurrent neural networks in merged configurations to generate captions for ultrasound video frames. We evaluate different model architectures using established general metrics (BLEU, ROUGE-L) and application-specific metrics. Results show that the proposed models can learn joint representations of image and text to generate relevant and descriptive captions for anatomies, such as the spine, the abdomen, the heart, and the head, in clinical fetal ultrasound scans.
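In the merged configuration described above (following Tanti et al., refs. 20–21), the RNN encodes only the caption prefix, and the CNN image features are fused with the text representation late, just before the word-prediction layer. The toy forward pass below sketches this idea in plain Python; all dimensions, the simple tanh RNN, and the random stand-in weights are illustrative assumptions, not the paper's actual architecture or trained parameters.

```python
import math
import random

random.seed(0)

# Illustrative sizes only; a real model would be far larger.
vocab_size, embed_dim, hidden_dim, img_dim = 20, 4, 6, 8

def mat(rows, cols):
    """Random matrix standing in for trained weights."""
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    """Matrix-vector product over plain lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

W_embed = mat(vocab_size, embed_dim)      # word embeddings
W_xh = mat(hidden_dim, embed_dim)         # RNN input weights
W_hh = mat(hidden_dim, hidden_dim)        # RNN recurrent weights
W_img = mat(hidden_dim, img_dim)          # projection of CNN image features
W_out = mat(vocab_size, 2 * hidden_dim)   # merged representation -> vocabulary

def rnn_encode(prefix_ids):
    """Encode a caption prefix (word IDs) with a plain tanh RNN; text only."""
    h = [0.0] * hidden_dim
    for t in prefix_ids:
        pre = [a + b for a, b in zip(matvec(W_xh, W_embed[t]), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
    return h

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def next_word_distribution(img_features, prefix_ids):
    """Merge architecture: image and text meet only at the output layer."""
    text_state = rnn_encode(prefix_ids)
    img_state = [math.tanh(v) for v in matvec(W_img, img_features)]
    merged = text_state + img_state       # late fusion by concatenation
    return softmax(matvec(W_out, merged))

# Stand-in for CNN (e.g. VGG-style) features of one ultrasound frame.
img = [random.gauss(0, 1) for _ in range(img_dim)]
probs = next_word_distribution(img, [1, 5, 9])
assert abs(sum(probs) - 1.0) < 1e-9      # valid distribution over the vocabulary
```

At inference time, a caption would be grown word by word: sample or argmax from `probs`, append the chosen ID to the prefix, and repeat until an end token. The design point of the merge configuration is that the RNN never sees the image, which keeps its hidden state a pure language model of sonographer vocabulary.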


Keywords: Image description · Image captioning · Deep learning · Natural language processing · Recurrent neural networks · Fetal ultrasound
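Of the general metrics the abstract mentions, ROUGE-L (Lin, 2004; ref. 11) scores a candidate caption against a reference by the length of their longest common subsequence (LCS), combined into an F-measure of LCS-based precision and recall. A minimal sketch follows; the recall weight `beta=1.2` is an illustrative default, not necessarily the setting used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score: LCS-based precision and recall, recall-weighted by beta."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

# Hypothetical captions, for illustration only.
score = rouge_l("the fetal heart is visible", "the heart is visible")
```

Unlike BLEU's contiguous n-gram matching, the LCS only requires matched words to appear in the same order, so ROUGE-L tolerates insertions such as "fetal" in the candidate above.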



We acknowledge the ERC (ERC-ADG-2015 694 project PULSE), the EPSRC (EP/MO13774/1), the Rhodes Trust, and the NIHR BRC funding scheme.

Supplementary material

490278_1_En_37_MOESM1_ESM.pdf (1.9 MB)
Supplementary material 1 (PDF 1985 KB)


References

  1. Bernardi, R., et al.: Automatic description generation from images: a survey of models, datasets, and evaluation measures. In: IJCAI, pp. 4970–4974 (2017)
  2. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP, pp. 1724–1734. ACL (2014)
  3. Department of Engineering Science, University of Oxford: PULSE.
  4. Elliott, D., Keller, F.: Image description using visual dependency representations. In: EMNLP, pp. 1292–1302 (2013)
  5. Goodfellow, I., et al.: Deep Learning (2016)
  6. Google Cloud: Cloud Speech-to-Text.
  7. Google Code Archive: Word2Vec (2013).
  8. GrammarBot: Grammar Check API.
  9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2015)
  11. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
  12. Lyndon, D., et al.: Neural captioning for the ImageCLEF 2017 medical image challenges. In: CEUR Workshop Proceedings, vol. 1866 (2017)
  13. McCarthy, P.M., Jarvis, S.: MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behav. Res. Methods 42(2), 381–392 (2010)
  14. Mikolov, T., et al.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)
  15. Ordonez, V., et al.: Im2Text: describing images using 1 million captioned photographs. In: Advances in NIPS, pp. 1143–1151 (2011)
  16. Papineni, K., et al.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on ACL, pp. 311–318. ACL (2002)
  17. Pennington, J., et al.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
  18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
  19. Sloetjes, H., Wittenburg, P.: Annotation by category - ELAN and ISO DCR. In: LREC (2008)
  20. Tanti, M., et al.: What is the role of recurrent neural networks (RNNs) in an image caption generator? ACL, pp. 51–60 (2017)
  21. Tanti, M., et al.: Where to put the image in an image caption generator. Nat. Lang. Eng. 24(3), 467–489 (2018)
  22. Vinyals, O., et al.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on CVPR, pp. 3156–3164 (2015)
  23. You, Q., et al.: Image captioning with semantic attention. In: Proceedings of the IEEE Conference on CVPR, pp. 4651–4659 (2016)
  24. Zeng, X.H., et al.: Understanding and generating ultrasound image description. J. Comput. Sci. Technol. 33(5), 1086–1100 (2018)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Institute of Biomedical Engineering, University of Oxford, Oxford, UK
  2. Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
