Multimodal Representation Learning via Maximization of Local Mutual Information

  • Conference paper
  • In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021 (MICCAI 2021)

Abstract

We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal is to learn useful image representations by exploiting the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information, making use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information terms is typically a lower bound on the global mutual information. Our experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
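
To make the training objective concrete, the sketch below shows one way such an objective can be implemented in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes convolutional image features of shape (B, C, H, W) and one embedding per report, and it uses an InfoNCE-style contrastive discriminator, one member of the family of neural mutual information estimators the abstract refers to. The names LocalMIScorer and local_infonce_loss, the 1x1-convolution and linear projection heads, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMIScorer(nn.Module):
    """Scores agreement between every local image feature and a text embedding."""

    def __init__(self, img_channels: int, txt_dim: int, hidden: int = 256):
        super().__init__()
        # A 1x1 conv projects each spatial location; a linear layer projects the text.
        self.img_proj = nn.Conv2d(img_channels, hidden, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, hidden)

    def forward(self, img_feats: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) local CNN features; txt_emb: (B, D) report embeddings.
        z_img = self.img_proj(img_feats)          # (B, hidden, H, W)
        z_txt = self.txt_proj(txt_emb)            # (B, hidden)
        # scores[i, j, h, w] = <z_img[i, :, h, w], z_txt[j]>: all image-text pairings.
        return torch.einsum("ichw,jc->ijhw", z_img, z_txt)

def local_infonce_loss(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE lower bound on local MI, averaged over spatial locations.

    Matched image-text pairs within the batch are positives; all other
    pairings serve as negatives for the contrastive discriminator.
    """
    b, _, h, w = scores.shape
    per_loc = scores.permute(2, 3, 0, 1).reshape(h * w, b, b)      # (HW, B, B)
    target = torch.arange(b, device=scores.device).expand(h * w, b)
    return F.cross_entropy(per_loc.reshape(-1, b), target.reshape(-1))

# Usage sketch with random stand-ins for encoder outputs:
B, C, D, H, W = 8, 512, 768, 7, 7
img_feats = torch.randn(B, C, H, W, requires_grad=True)  # e.g. a CNN feature map
txt_emb = torch.randn(B, D)                              # e.g. a BERT report embedding
scorer = LocalMIScorer(C, D)
loss = local_infonce_loss(scorer(img_feats, txt_emb))
loss.backward()  # gradients flow back into both encoders during joint training
```

Minimizing this cross-entropy maximizes an InfoNCE lower bound on the mutual information at each spatial location; averaging over locations corresponds to optimizing the sum of local bounds discussed in the abstract.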



Acknowledgments

This work was supported in part by NIH NIBIB NAC P41EB015902, Wistron, IBM Watson, MIT Deshpande Center, MIT J-Clinic, MIT Lincoln Lab, and US Air Force.

Author information

Correspondence to Ruizhi Liao.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Liao, R., et al. (2021). Multimodal Representation Learning via Maximization of Local Mutual Information. In: de Bruijne, M., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, vol. 12902. Springer, Cham. https://doi.org/10.1007/978-3-030-87196-3_26

  • DOI: https://doi.org/10.1007/978-3-030-87196-3_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87195-6

  • Online ISBN: 978-3-030-87196-3

  • eBook Packages: Computer Science, Computer Science (R0)
