Abstract
Combining the long-range dependency modeling of transformers with the local representations of image content learned by convolutional neural networks (CNNs) has produced advanced architectures and improved performance on a variety of medical image analysis tasks, owing to their complementary strengths. However, compared with CNNs, transformers require considerably more training data, due to their larger parameter counts and lack of inductive bias. This need for ever-larger datasets is particularly problematic in medical imaging, where both annotation effort and data protection limit data availability. In this work, inspired by the human decision-making process of correlating new “evidence” with previously memorized “experience”, we propose the Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache historical attention snapshots during training. To prevent overfitting, we incorporate a novel memory update scheme, attention temporal moving average, which updates the stored external memories with a historical moving average. To speed up inference, we design a prototypical attention learning method that distills the external memory into smaller, representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT, applied to varied medical image analysis tasks, outperforms vanilla transformer models across data regimes, especially when only a small amount of annotated data is available. Notably, MoViT matches the performance of a vanilla ViT with only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures that may reduce the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.
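The two mechanisms named in the abstract, caching attention snapshots in an external memory updated by a temporal moving average, and attending jointly over local tokens and that memory, can be illustrated with a minimal NumPy sketch. The function names, the momentum constant, and the single-head, unbatched tensor shapes below are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ema_memory_update(memory_kv, new_kv, momentum=0.99):
    """Attention temporal moving average (illustrative): blend the cached
    key/value snapshots with the current training step's snapshots so the
    external memory tracks a historical moving average rather than the
    latest batch alone."""
    return momentum * memory_kv + (1.0 - momentum) * new_kv

def memory_augmented_attention(q, local_k, local_v, mem_k, mem_v):
    """Single-head scaled dot-product attention over the concatenation of
    local tokens and cached external-memory tokens."""
    k = np.concatenate([local_k, mem_k], axis=0)
    v = np.concatenate([local_v, mem_v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over all (local + memory) positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In this sketch, prototypical attention learning would then replace `mem_k`/`mem_v` with a small set of representative entries (e.g., cluster centers of the cached snapshots), shrinking the concatenated key/value tensors at inference time.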
Acknowledgments
This work was supported in part by grants from the National Institutes of Health (R37CA248077, R01CA228188). The MRI equipment in this study was funded by the NIH grant: 1S10ODO21648.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Shen, Y. et al. (2024). MoViT: Memorizing Vision Transformers for Medical Image Analysis. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds) Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14349. Springer, Cham. https://doi.org/10.1007/978-3-031-45676-3_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45675-6
Online ISBN: 978-3-031-45676-3
eBook Packages: Computer Science, Computer Science (R0)