
Multi-modal Adapter for Medical Vision-and-Language Learning

  • Conference paper
  • Machine Learning in Medical Imaging (MLMI 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14348)


Abstract

Recently, medical vision-and-language learning has attracted great attention from the biomedical community. Thanks to the development of large pre-trained models, performance on medical multi-modal learning benchmarks has improved substantially. However, due to the rapid growth of model size, fully fine-tuning these large pre-trained models has become costly, as it requires training and storing a huge set of parameters for every downstream task. We therefore propose a parameter-efficient transfer learning method named Medical Multi-Modal Adapter (M\(^3\)AD) to mitigate this problem. We select the state-of-the-art M\(^3\)AE model as our baseline, which is pre-trained on 30k medical image-text pairs with multiple proxy tasks and has about 340M parameters. Specifically, we first insert general adapters after the multi-head attention layers and feed-forward layers in all transformer blocks of M\(^3\)AE. We then design a modality-fusion adapter that adopts a multi-head attention mechanism and insert it into the cross-modal encoder to enhance multi-modal interactions. In contrast to full fine-tuning, we freeze most parameters in M\(^3\)AE and train only the inserted adapters, which are much smaller. Extensive experimental results on three medical visual question answering datasets and one medical multi-modal classification dataset demonstrate the effectiveness of our proposed method: M\(^3\)AD achieves performance competitive with full fine-tuning while using far fewer trainable parameters and much less memory.
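To make the adapter-based design concrete, the sketch below shows a minimal PyTorch-style implementation of the general idea: a bottleneck adapter placed after the multi-head attention and feed-forward sub-layers of a transformer block, a cross-attention-based modality-fusion adapter of the kind that would sit in the cross-modal encoder, and the freezing of all non-adapter parameters. This is an illustrative sketch, not the authors' released code; the module names, hidden sizes, reduction factor, and exact placement relative to layer normalization are assumptions.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_dim: int, reduction: int = 8):
        super().__init__()
        bottleneck = hidden_dim // reduction
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class FusionAdapter(nn.Module):
    """Illustrative modality-fusion adapter: tokens of the other modality attend
    into the bottleneck via multi-head cross-attention before the up-projection
    (the paper's exact design may differ)."""

    def __init__(self, hidden_dim: int, num_heads: int = 4, reduction: int = 8):
        super().__init__()
        bottleneck = hidden_dim // reduction
        self.down_q = nn.Linear(hidden_dim, bottleneck)
        self.down_kv = nn.Linear(hidden_dim, bottleneck)
        self.cross_attn = nn.MultiheadAttention(bottleneck, num_heads, batch_first=True)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        q = self.down_q(x)        # current modality (e.g. image tokens)
        kv = self.down_kv(other)  # other modality (e.g. text tokens)
        fused, _ = self.cross_attn(q, kv, kv)
        return x + self.up(self.act(q + fused))


class AdaptedTransformerBlock(nn.Module):
    """Transformer block with adapters after the attention and feed-forward sub-layers."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.adapter_attn = Adapter(hidden_dim)  # inserted after multi-head attention
        self.adapter_ffn = Adapter(hidden_dim)   # inserted after the feed-forward layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.adapter_attn(attn_out))
        x = self.norm2(x + self.adapter_ffn(self.ffn(x)))
        return x


def freeze_backbone(model: nn.Module) -> None:
    """Freeze everything except the inserted adapters."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name


block = AdaptedTransformerBlock()
freeze_backbone(block)
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable: {trainable} / total: {total}")  # adapters are a small fraction
```

The point of the sketch is the parameter budget: only the small down- and up-projections (and the fusion adapter's attention) receive gradients, so the trainable parameters per downstream task are a small fraction of the roughly 340M frozen backbone weights.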


References

  1. Abacha, A.B., Gayen, S., Lau, J., Rajaraman, S., Demner-Fushman, D.: NLM at ImageCLEF 2018 visual question answering in the medical domain. In: CLEF (Working Notes) (2018)

  2. Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: overview of the medical visual question answering task at ImageCLEF 2019. In: CLEF (2019)

  3. Chen, Z., et al.: Multi-modal masked autoencoders for medical vision-and-language pre-training. In: MICCAI, vol. 13435, pp. 679–689 (2022). https://doi.org/10.1007/978-3-031-16443-9_65

  4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)

  5. Do, T., Nguyen, B.X., Tjiputra, E., Tran, M., Tran, Q.D., Nguyen, A.: Multiple meta-model quantifying for medical visual question answering. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) MICCAI 2021. LNCS, vol. 12905, pp. 64–74. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87240-3_7

  6. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  7. Eslami, S., de Melo, G., Meinel, C.: Does CLIP benefit visual question answering in the medical domain as much as it does in the general domain? arXiv preprint arXiv:2112.13906 (2021)

  8. Gong, H., Chen, G., Liu, S., Yu, Y., Li, G.: Cross-modal self-attention with multi-task pre-training for medical visual question answering. In: ICMR, pp. 456–460. ACM (2021)

  9. He, R., et al.: On the effectiveness of adapter-based tuning for pretrained language model adaptation. In: Zong, C., Xia, F., Li, W., Navigli, R. (eds.) ACL/IJCNLP, pp. 2208–2222 (2021)

  10. Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: ICML, pp. 2790–2799 (2019)

  11. Hu, E., et al.: LoRA: low-rank adaptation of large language models. In: ICLR (2022)

  12. Johnson, M., et al.: Google's multilingual neural machine translation system: enabling zero-shot translation. Trans. Assoc. Comput. Linguistics 5, 339–351 (2017)

  13. Khare, Y., Bagal, V., Mathew, M., Devi, A., Priyakumar, U.D., Jawahar, C.V.: MMBERT: multimodal BERT pretraining for improved medical VQA. In: ISBI, pp. 1033–1036. IEEE (2021)

  14. Lau, J., Gayen, S., Abacha, A.B., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 5 (2018)

  15. Li, Y., Wang, H., Luo, Y.: A comparison of pre-trained vision-and-language models for multimodal representation learning across medical images and reports. In: Park, T., et al. (eds.) IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1999–2004 (2020)

  16. Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: ISBI, pp. 1650–1654. IEEE (2021)

  17. Liu, B., Zhan, L.-M., Wu, X.-M.: Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) MICCAI 2021. LNCS, vol. 12902, pp. 210–220. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87196-3_20

  18. Mahabadi, R.K., Henderson, J., Ruder, S.: Compacter: efficient low-rank hypercomplex adapter layers. In: NeurIPS, pp. 1022–1035 (2021)

  19. Moon, J.H., Lee, H., Shin, W., Choi, E.: Multi-modal understanding and generation for medical images and text via vision-language pre-training. CoRR (2021)

  20. Moon, J.H., Lee, H., Shin, W., Kim, Y., Choi, E.: Multi-modal understanding and generation for medical images and text via vision-language pre-training. IEEE J. Biomed. Health Informatics 26(12), 6070–6080 (2022)

  21. Nguyen, B.D., Do, T.-T., Nguyen, B.X., Do, T., Tjiputra, E., Tran, Q.D.: Overcoming data limitation in medical visual question answering. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11767, pp. 522–530. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32251-9_57

  22. Pfeiffer, J., et al.: AdapterHub: a framework for adapting transformers. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54. Association for Computational Linguistics (2020)

  23. Ren, F., Zhou, Y.: CGMVQA: a new classification and generative model for medical visual question answering. IEEE Access 8, 50626–50636 (2020)

  24. Subramanian, S., et al.: MedICaT: a dataset of medical images, captions, and textual references. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2112–2120. Association for Computational Linguistics (2020)

  25. Wu, T., Singh, S., Paul, S., Burns, G.A., Peng, N.: MELINDA: a multimodal dataset for biomedical experiment method classification. In: AAAI, pp. 14076–14084 (2021)

  26. Yan, X., Li, L., Xie, C., Xiao, J., Gu, L.: Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain. In: CLEF (2019)

  27. Zhan, L., Liu, B., Fan, L., Chen, J., Wu, X.: Medical visual question answering via conditional reasoning. In: ACM MM, pp. 2345–2354. ACM (2020)


Author information


Correspondence to Qi Wu.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, Z., Qiao, Y., Xie, Y., Wu, Q. (2024). Multi-modal Adapter for Medical Vision-and-Language Learning. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds) Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14348. Springer, Cham. https://doi.org/10.1007/978-3-031-45673-2_39


  • DOI: https://doi.org/10.1007/978-3-031-45673-2_39


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45672-5

  • Online ISBN: 978-3-031-45673-2

  • eBook Packages: Computer Science, Computer Science (R0)
