Multimodal Representation Learning via Maximization of Local Mutual Information

  • Conference paper
  • In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021 (MICCAI 2021)

Abstract

We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal is to learn useful image representations by exploiting the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information, making use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information terms is typically a lower bound on the global mutual information. Our experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
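
To make the training objective concrete, the sketch below shows one way such an objective can be implemented in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes convolutional image features of shape (B, C, H, W) and one embedding per report, and it uses an InfoNCE-style contrastive discriminator, one member of the family of neural mutual information estimators the abstract refers to. The names LocalMIScorer and local_infonce_loss, the 1x1-convolution and linear projection heads, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMIScorer(nn.Module):
    """Scores agreement between every local image feature and a text embedding."""

    def __init__(self, img_channels: int, txt_dim: int, hidden: int = 256):
        super().__init__()
        # A 1x1 conv projects each spatial location; a linear layer projects the text.
        self.img_proj = nn.Conv2d(img_channels, hidden, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, hidden)

    def forward(self, img_feats: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, C, H, W) local CNN features; txt_emb: (B, D) report embeddings.
        z_img = self.img_proj(img_feats)          # (B, hidden, H, W)
        z_txt = self.txt_proj(txt_emb)            # (B, hidden)
        # scores[i, j, h, w] = <z_img[i, :, h, w], z_txt[j]>: all image-text pairings.
        return torch.einsum("ichw,jc->ijhw", z_img, z_txt)

def local_infonce_loss(scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE lower bound on local MI, averaged over spatial locations.

    Matched image-text pairs within the batch are positives; all other
    pairings serve as negatives for the contrastive discriminator.
    """
    b, _, h, w = scores.shape
    per_loc = scores.permute(2, 3, 0, 1).reshape(h * w, b, b)      # (HW, B, B)
    target = torch.arange(b, device=scores.device).expand(h * w, b)
    return F.cross_entropy(per_loc.reshape(-1, b), target.reshape(-1))

# Usage sketch with random stand-ins for encoder outputs:
B, C, D, H, W = 8, 512, 768, 7, 7
img_feats = torch.randn(B, C, H, W, requires_grad=True)  # e.g. a CNN feature map
txt_emb = torch.randn(B, D)                              # e.g. a BERT report embedding
scorer = LocalMIScorer(C, D)
loss = local_infonce_loss(scorer(img_feats, txt_emb))
loss.backward()  # gradients flow back into both encoders during joint training
```

Minimizing this cross-entropy maximizes an InfoNCE lower bound on the mutual information at each spatial location; averaging over locations corresponds to optimizing the sum of local bounds discussed in the abstract.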



Acknowledgments

This work was supported in part by NIH NIBIB NAC P41EB015902, Wistron, IBM Watson, MIT Deshpande Center, MIT J-Clinic, MIT Lincoln Lab, and US Air Force.

Author information

Correspondence to Ruizhi Liao.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Liao, R., et al. (2021). Multimodal Representation Learning via Maximization of Local Mutual Information. In: de Bruijne, M., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, vol. 12902. Springer, Cham. https://doi.org/10.1007/978-3-030-87196-3_26

  • DOI: https://doi.org/10.1007/978-3-030-87196-3_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87195-6

  • Online ISBN: 978-3-030-87196-3

  • eBook Packages: Computer Science, Computer Science (R0)
