Ultrasound Image Representation Learning by Modeling Sonographer Visual Attention

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11492)


Image representations are commonly learned from class labels, which are a simplistic approximation of human image understanding. In this paper we demonstrate that transferable representations of images can be learned without manual annotations by modeling human visual attention. The basis of our analyses is a unique gaze tracking dataset of sonographers performing routine clinical fetal anomaly screenings. Models of sonographer visual attention are learned by training a convolutional neural network (CNN) to predict gaze on ultrasound video frames through visual saliency prediction or gaze-point regression. We evaluate the transferability of the learned representations to the task of ultrasound standard plane detection in two contexts. Firstly, we perform transfer learning by fine-tuning the CNN with a limited number of labeled standard plane images. We find that fine-tuning the saliency predictor is superior to training from random initialization, with an average F1-score improvement of 9.6% overall and 15.3% for the cardiac planes. Secondly, we train a simple softmax regression on the feature activations of each CNN layer in order to evaluate the representations independently of transfer learning hyper-parameters. We find that the attention models derive strong representations, approaching the precision of a fully-supervised baseline model for all but the last layer.
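The layer-wise evaluation described above (a simple softmax regression trained on frozen feature activations, so that representation quality is judged independently of transfer-learning hyper-parameters) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual pipeline: the toy `feats` array stands in for CNN layer activations, and the probe is plain multinomial logistic regression fit by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(feats, labels, n_classes, lr=0.1, epochs=200):
    """Fit a softmax regression on fixed (frozen) features via gradient descent.

    The backbone that produced `feats` is never updated; only the
    linear classifier (W, b) is trained, i.e. a "linear probe".
    """
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        p = softmax(feats @ W + b)
        grad = p - onehot                  # gradient of cross-entropy w.r.t. logits
        W -= lr * feats.T @ grad / n
        b -= lr * grad.mean(axis=0)
    return W, b

# Toy stand-in for feature activations from one CNN layer:
# two well-separated Gaussian clusters, one per class.
feats = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(3, 1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

W, b = train_linear_probe(feats, labels, n_classes=2)
preds = softmax(feats @ W + b).argmax(axis=1)
accuracy = (preds == labels).mean()
```

In the paper's setting, one such probe is trained per layer; comparing probe accuracies across layers (and against a fully-supervised baseline) indicates where in the network the attention model's representations remain transferable.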


Keywords: Representation learning · Gaze tracking · Fetal ultrasound · Self-supervised learning · Saliency prediction · Transfer learning · Convolutional neural networks



This work is supported by the ERC (ERC-ADG-2015 694581, project PULSE) and the EPSRC (EP/R013853/1 and EP/M013774/1). AP is funded by the NIHR Oxford Biomedical Research Centre.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Engineering Science, University of Oxford, Oxford, UK
  2. Nuffield Department of Women’s and Reproductive Health, University of Oxford, Oxford, UK
