A Curriculum Learning Based Approach to Captioning Ultrasound Images

Part of the Lecture Notes in Computer Science book series (LNIP, volume 12437)


We present a novel curriculum learning approach to train a natural language processing (NLP) based fetal ultrasound image captioning model. Datasets containing medical images and corresponding textual descriptions are relatively rare and hence smaller than datasets of natural images and their captions. This fact inspired us to develop an approach to train a captioning model suitable for small-sized medical data. Our datasets are prepared using real-world ultrasound video along with synchronised and transcribed sonographer speech recordings. We propose a “dual-curriculum” method for the ultrasound image captioning problem, which relies on building and learning from curricula of image and text information. We compare several distance measures for creating the dual curriculum and observe the best performance using the Wasserstein distance for image information and the tf-idf metric for text information. The evaluation results show an improvement in all performance metrics when using curriculum learning over stochastic mini-batch training, both for the individual task of image classification and when using a dual curriculum for image captioning.
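The abstract names the two ingredients of the dual curriculum (a Wasserstein distance for images and a tf-idf metric for text) but does not spell out how they are turned into a training order. The following is a minimal sketch of one plausible reading, not the authors' implementation. Everything specific in it is an assumption made for illustration: images are flat lists of pixel intensities, image difficulty is the 1-D Wasserstein distance to the pixelwise mean image, text difficulty is the mean tf-idf weight of a caption's distinct words, and the two scores are combined by summing their ranks. The function names `wasserstein_1d`, `mean_tfidf`, `rank` and `dual_curriculum` are hypothetical.

```python
import math
from collections import Counter


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-size empirical samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-size samples")
    # For equal-size samples, W1 is the mean absolute gap between sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def mean_tfidf(captions):
    """Mean tf-idf weight over the distinct words of each caption."""
    docs = [c.lower().split() for c in captions]
    n = len(docs)
    # Document frequency: in how many captions does each word occur?
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        weights = [(tf[w] / len(d)) * math.log(n / df[w]) for w in tf]
        scores.append(sum(weights) / len(weights))
    return scores


def rank(values):
    """Rank of each value, where 0 is the smallest."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def dual_curriculum(images, captions):
    """Return sample indices ordered from 'easy' to 'hard' (assumed scoring)."""
    n = len(images)
    # Assumption: the reference distribution is the pixelwise mean image.
    reference = [sum(img[k] for img in images) / n for k in range(len(images[0]))]
    img_scores = [wasserstein_1d(img, reference) for img in images]
    txt_scores = mean_tfidf(captions)
    # Assumption: combine the two curricula by summing ranks; ties go to the lower index.
    combined = [a + b for a, b in zip(rank(img_scores), rank(txt_scores))]
    return sorted(range(n), key=lambda i: (combined[i], i))


# Toy data, invented purely for illustration.
images = [[0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 1.0, 1.0], [0.4, 0.5, 0.5, 0.6]]
captions = ["the head", "the heart outflow tract", "the head and abdomen"]
order = dual_curriculum(images, captions)
print(order)
```

Under this reading, training would present samples in `order` (for example, in successive mini-batches) rather than in random order, which is the contrast with stochastic mini-batch training drawn in the abstract.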


  • Image captioning
  • Curriculum learning
  • Fetal ultrasound

M. Alsharid and R. El-Bouri—Equal contribution.

  • DOI: 10.1007/978-3-030-60334-2_8
  • Chapter length: 10 pages




We acknowledge the ERC (ERC-ADG-2015 694 project PULSE), the EPSRC (EP/MO13774/1), the Rhodes Trust, and the NIHR BRC funding scheme. ReB is supported by an EPSRC Industrial Strategy Challenge Fund PhD studentship.


Corresponding author

Correspondence to Mohammad Alsharid.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Alsharid, M., El-Bouri, R., Sharma, H., Drukker, L., Papageorghiou, A.T., Noble, J.A. (2020). A Curriculum Learning Based Approach to Captioning Ultrasound Images. In: , et al. Medical Ultrasound, and Preterm, Perinatal and Paediatric Image Analysis. ASMUS 2020, PIPPI 2020. Lecture Notes in Computer Science, vol 12437. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60333-5

  • Online ISBN: 978-3-030-60334-2

  • eBook Packages: Computer Science, Computer Science (R0)