Abstract
Medical vision-and-language learning has recently attracted considerable attention from the biomedical community. Thanks to the development of large pre-trained models, performance on medical multi-modal learning benchmarks has improved substantially. However, as model sizes grow rapidly, fully fine-tuning these large pre-trained models becomes costly: a separate copy of the huge parameter set must be trained and stored for every downstream task. We therefore propose a parameter-efficient transfer learning method named Medical Multi-Modal Adapter (M\(^3\)AD) to mitigate this problem. We adopt the state-of-the-art M\(^3\)AE model as our baseline; it is pre-trained on 30k medical image-text pairs with multiple proxy tasks and has about 340M parameters. Specifically, we first insert general adapters after the multi-head attention and feed-forward layers in all transformer blocks of M\(^3\)AE. We then design a modality-fusion adapter that adopts a multi-head attention mechanism and insert it into the cross-modal encoder to enhance multi-modal interactions. In contrast to full fine-tuning, we freeze most parameters of M\(^3\)AE and train only the inserted adapters, which are much smaller. Extensive experiments on three medical visual question answering datasets and one medical multi-modal classification dataset demonstrate the effectiveness of the proposed method: M\(^3\)AD achieves performance competitive with full fine-tuning while requiring far fewer trainable parameters and much less memory.
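To make the adapter design described above concrete, the following PyTorch sketch shows (i) a general bottleneck adapter of the kind inserted after attention and feed-forward sublayers, (ii) a modality-fusion adapter built around multi-head cross-attention, and (iii) freezing the backbone so that only adapter parameters are trained. This is a minimal illustration, not the authors' implementation: the hidden size (768), bottleneck dimension (64), module names, and the name-based freezing rule are all assumptions made for the example.

# Minimal sketch of the two adapter types described in the abstract (PyTorch).
# Dimensions, module names, and the freezing rule are illustrative assumptions,
# not taken from the paper.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """General adapter: down-project -> nonlinearity -> up-project,
    with a residual connection around the bottleneck."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class ModalityFusionAdapter(nn.Module):
    """Fusion adapter: cross-attention from one modality's tokens to the
    other modality's tokens, followed by a bottleneck adapter."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8,
                 bottleneck_dim: int = 64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.adapter = BottleneckAdapter(hidden_dim, bottleneck_dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(query=x, key=context, value=context)
        return self.adapter(x + attended)


def freeze_backbone_except_adapters(model: nn.Module) -> None:
    """Freeze all pre-trained weights; only parameters whose names contain
    'adapter' remain trainable (hypothetical naming convention)."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()


if __name__ == "__main__":
    # Toy check: text tokens attend to image tokens, hidden size 768.
    text = torch.randn(2, 16, 768)
    image = torch.randn(2, 49, 768)
    fusion = ModalityFusionAdapter()
    print(fusion(text, image).shape)  # torch.Size([2, 16, 768])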
Cite this paper
Yu, Z., Qiao, Y., Xie, Y., Wu, Q. (2024). Multi-modal Adapter for Medical Vision-and-Language Learning. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds) Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14348. Springer, Cham. https://doi.org/10.1007/978-3-031-45673-2_39