
MoViT: Memorizing Vision Transformers for Medical Image Analysis

  • Conference paper
  • Machine Learning in Medical Imaging (MLMI 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14349)


Abstract

The synergy of long-range dependencies from transformers and local representations of image content from convolutional neural networks (CNNs) has led to advanced architectures and increased performance for various medical image analysis tasks due to their complementary benefits. However, compared with CNNs, transformers require considerably more training data, owing to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets continues to be problematic, particularly in medical imaging, where both annotation effort and data protection limit data availability. In this work, inspired by the human decision-making process of correlating new “evidence” with previously memorized “experience”, we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache historical attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, which updates the stored external memories with a historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT, applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially when only a small amount of annotated data is available. More importantly, MoViT can reach performance competitive with ViT using only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures that may reduce the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.
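For illustration, below is a minimal sketch (Python/NumPy) of the caching mechanism the abstract describes, under stated assumptions: the class name, the ring-buffer slot layout, and the momentum and top_k hyperparameters are hypothetical choices rather than the authors' implementation, and the prototypical attention learning step (distilling the memory into a representative subset) is omitted. The idea shown: an external memory caches (key, value) attention snapshots during training, occupied slots are refreshed with a temporal moving average rather than overwritten, and at lookup time each query attends over its top-k most similar cached entries.

    import numpy as np


    class ExternalAttentionMemory:
        """Toy external (key, value) memory with a temporal moving-average
        update and top-k retrieval. Illustrative sketch only; names and
        hyperparameters are assumptions, not the MoViT implementation."""

        def __init__(self, num_slots, dim, momentum=0.9, top_k=4):
            self.keys = np.zeros((num_slots, dim), dtype=np.float32)
            self.values = np.zeros((num_slots, dim), dtype=np.float32)
            self.filled = np.zeros(num_slots, dtype=bool)
            self.momentum = momentum
            self.top_k = top_k
            self._next = 0  # ring-buffer write pointer

        def update(self, new_keys, new_values):
            # Cache attention snapshots; blend into occupied slots with a
            # temporal moving average instead of overwriting them outright.
            for k, v in zip(new_keys, new_values):
                i = self._next % len(self.keys)
                if self.filled[i]:
                    m = self.momentum
                    self.keys[i] = m * self.keys[i] + (1.0 - m) * k
                    self.values[i] = m * self.values[i] + (1.0 - m) * v
                else:
                    self.keys[i], self.values[i] = k, v
                    self.filled[i] = True
                self._next += 1

        def retrieve(self, queries):
            # For each query, attend over its top-k most similar cached keys
            # (scaled dot-product scores, softmax-normalised over the k hits).
            keys, values = self.keys[self.filled], self.values[self.filled]
            scores = queries @ keys.T / np.sqrt(queries.shape[-1])   # (Q, M)
            k = min(self.top_k, keys.shape[0])
            top = np.argsort(-scores, axis=-1)[:, :k]                # (Q, k)
            top_scores = np.take_along_axis(scores, top, axis=-1)
            w = np.exp(top_scores - top_scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            return np.einsum("qk,qkd->qd", w, values[top])           # (Q, d)


    # Usage: cache a few "snapshots" during training, then query at inference.
    rng = np.random.default_rng(0)
    mem = ExternalAttentionMemory(num_slots=32, dim=8)
    mem.update(rng.standard_normal((16, 8)), rng.standard_normal((16, 8)))
    print(mem.retrieve(rng.standard_normal((4, 8))).shape)  # (4, 8)

In the full model, the cached pairs would presumably come from a transformer attention layer and the retrieved values would be fused with the local self-attention output; shrinking the memory to a small prototype subset is what would yield the inference speedup described above.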



Acknowledgments

This work was supported in part by grants from the National Institutes of Health (R37CA248077, R01CA228188). The MRI equipment in this study was funded by the NIH grant: 1S10ODO21648.

Author information


Corresponding author

Correspondence to Mathias Unberath.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shen, Y. et al. (2024). MoViT: Memorizing Vision Transformers for Medical Image Analysis. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds) Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14349. Springer, Cham. https://doi.org/10.1007/978-3-031-45676-3_21


  • DOI: https://doi.org/10.1007/978-3-031-45676-3_21


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45675-6

  • Online ISBN: 978-3-031-45676-3

  • eBook Packages: Computer Science, Computer Science (R0)
