
Improving Image Representations via MoCo Pre-training for Multimodal CXR Classification


Part of the Lecture Notes in Computer Science book series (LNCS, volume 13413)


Multimodal learning, here defined as learning from multiple input data types, has exciting potential for healthcare. However, current techniques rely on the availability of large multimodal datasets, which are rare in the medical domain. In this work, we focus on improving the image features fed into multimodal image-text Transformer architectures, evaluating on a medical multimodal classification task with dual inputs: chest X-ray images (CXRs) and the indication text passages from the corresponding radiology reports. We demonstrate that self-supervised Momentum Contrast (MoCo) pre-training of the image representation model on a large set of unlabelled CXR images improves multimodal performance compared to supervised ImageNet pre-training. MoCo yields a \(0.6\%\) absolute improvement in AUROC-macro when using the full MIMIC-CXR training set, and a \(5.1\%\) absolute improvement when limited to \(10\%\) of the training data.

To the best of our knowledge, this is the first demonstration of MoCo image pre-training for multimodal learning in medical imaging.
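For readers unfamiliar with MoCo [19], the core mechanism is a momentum-updated key encoder paired with an InfoNCE loss computed against a queue of negative keys. The following is a minimal numerical sketch of those two pieces, not the authors' implementation; the function names and the NumPy formulation are illustrative assumptions:

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    # Exponential moving average: the key encoder slowly trails the
    # query encoder instead of receiving gradients directly (MoCo).
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    # InfoNCE for a single query: one positive key against a queue of
    # negative keys, all compared by cosine similarity.
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    logits = np.concatenate(([q @ k_pos], queue @ q)) / temperature
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                  # positive key sits at index 0
```

In MoCo the queue stores encoded keys from previous batches, so the number of negatives can greatly exceed the batch size, avoiding the very large batches that SimCLR [22] requires.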


  • Multimodal learning
  • Multimodal BERT
  • Image representation
  • Self-supervised image pre-training
  • CXR classification



  1. Due to limited computing power, we did not evaluate the contrastive learning approach proposed by [21], which was trained on 16–64 Cloud TPU cores.


  1. Huang, S.C., Pareek, A., Seyyedi, S., Banerjee, I., Lungren, M.P.: Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. npj Digital Medicine 3(1) (2020)

  2. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017).

  3. Jacenków, G., O’Neil, A.Q., Tsaftaris, S.A.: Indication as prior knowledge for multimodal disease classification in chest radiographs with transformers. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI)


  4. Hendricks, L.A., Mellor, J., Schneider, R., Alayrac, J.-B., Nematzadeh, A.: Decoupling the role of data, attention, and losses in multimodal transformers. Trans. Assoc. Comput. Linguistics 9, 570–585 (2021).

  5. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)


  6. Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)

  7. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, 32 (2019)


  8. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

  9. Krishna, R., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123(1), 32–73 (2017)


  10. Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, R.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)

  11. Ramesh, A., et al.: Zero-shot text-to-image generation. In: International Conference on Machine Learning, pp. 8821–8831. PMLR (2021)

  12. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  13. Johnson, A.E.W., et al.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)

  14. Johnson, A.E.W., et al.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6(1), December 2019.

  15. Goldberger, A.L., et al.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)

  16. Demner-Fushman, D., et al.: Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inf. Assoc. 23(2), 304–310 (2016)


  17. van Sonsbeek, T., Worring, M.: Towards Automated Diagnosis with Attentive Multi-modal Learning Using Electronic Health Records and Chest X-Rays. In: Syeda-Mahmood, T., Drechsler, K., Greenspan, H., Madabhushi, A., Karargyris, A., Linguraru, M.G., Oyarzun Laura, C., Shekhar, R., Wesarg, S., González Ballester, M.Á., Erdt, M. (eds.) CLIP/ML-CDS -2020. LNCS, vol. 12445, pp. 106–114. Springer, Cham (2020).


  18. Liao, R., Moyer, D., Cha, M., Quigley, K., Berkowitz, S., Horng, S., Golland, P., Wells, W.M.: Multimodal representation learning via maximization of local mutual information. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) MICCAI 2021. LNCS, vol. 12902, pp. 273–283. Springer, Cham (2021).


  19. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)


  20. Sowrirajan, H., Yang, J., Ng, A.Y., Rajpurkar, P.: MoCo pretraining improves representation and transferability of chest X-ray models. In: Medical Imaging with Deep Learning, pp. 728–744. PMLR (2021)

  21. Azizi, S., et al.: Big self-supervised models advance medical image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3478–3488 (2021)


  22. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)


  23. Vu, Y.N.T., et al.: MedAug: contrastive learning leveraging patient metadata improves representations for chest X-ray interpretation. In: Machine Learning for Healthcare Conference, pp. 755–769. PMLR (2021)

  24. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)

  25. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017)

  26. Kiela, D., Bhooshan, S., Firooz, H., Testuggine, D.: Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950 (2019)

  27. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)

  28. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  29. Irvin, J., et al.: CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 590–597 (2019)

  30. Singh, A., et al.: MMF: A multimodal framework for vision and language research (2020).

  31. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)


  32. Zhang, Y., Chen, Q., Yang, Z., Lin, H., Lu, Z.: BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 6 (2019)

  33. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014



Author information



Corresponding author

Correspondence to Francesco Dalla Serra.


A Per-Class Results


Table 4. Per-class AUROC scores using different ResNet-50 initializations. The models are fine-tuned on the full training set (top) and on 10% of the training set (bottom).
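The per-class and macro-averaged AUROC scores referenced in Table 4 can be illustrated with a small sketch. This is a hypothetical rank-based (Mann-Whitney) implementation assuming no tied scores, not the paper's evaluation code:

```python
import numpy as np

def auroc(y_true, scores):
    # AUROC via the Mann-Whitney U statistic (assumes no tied scores).
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auroc_macro(Y, S):
    # Unweighted mean of per-class AUROC over the label columns.
    return float(np.mean([auroc(Y[:, c], S[:, c]) for c in range(Y.shape[1])]))
```

Macro averaging weights every class equally, so rare findings influence the aggregate score as much as common ones.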


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dalla Serra, F., Jacenków, G., Deligianni, F., Dalton, J., O’Neil, A.Q. (2022). Improving Image Representations via MoCo Pre-training for Multimodal CXR Classification. In: Yang, G., Aviles-Rivero, A., Roberts, M., Schönlieb, C.-B. (eds) Medical Image Understanding and Analysis. MIUA 2022. Lecture Notes in Computer Science, vol 13413. Springer, Cham.



  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12052-7

  • Online ISBN: 978-3-031-12053-4

  • eBook Packages: Computer Science, Computer Science (R0)