
MoViT: Memorizing Vision Transformers for Medical Image Analysis

  • Conference paper
  • Machine Learning in Medical Imaging (MLMI 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14349)


Abstract

The synergy of long-range dependencies from transformers and local representations of image content from convolutional neural networks (CNNs) has led to advanced architectures and increased performance for various medical image analysis tasks due to their complementary benefits. However, compared with CNNs, transformers require considerably more training data, owing to a larger number of parameters and an absence of inductive bias. The need for increasingly large datasets continues to be problematic, particularly in medical imaging, where both annotation effort and data protection limit data availability. In this work, inspired by the human decision-making process of correlating new “evidence” with previously memorized “experience”, we propose a Memorizing Vision Transformer (MoViT) to alleviate the need for large-scale datasets to successfully train and deploy transformer-based architectures. MoViT leverages an external memory structure to cache historical attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, which updates the stored external memories with a historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets. We evaluate our method on a public histology image dataset and an in-house MRI dataset, demonstrating that MoViT, applied to varied medical image analysis tasks, can outperform vanilla transformer models across varied data regimes, especially when only a small amount of annotated data is available. More importantly, MoViT can reach performance competitive with ViT using only 3.0% of the training data. In conclusion, MoViT provides a simple plug-in for transformer architectures that may reduce the training data needed to achieve acceptable models for a broad range of medical image analysis tasks.
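For illustration, below is a minimal sketch (Python/NumPy) of the caching mechanism the abstract describes, under stated assumptions: the class name, the ring-buffer slot layout, and the momentum and top_k hyperparameters are hypothetical choices rather than the authors' implementation, and the prototypical attention learning step (distilling the memory into a representative subset) is omitted. The idea shown: an external memory caches (key, value) attention snapshots during training, occupied slots are refreshed with a temporal moving average rather than overwritten, and at lookup time each query attends over its top-k most similar cached entries.

    import numpy as np


    class ExternalAttentionMemory:
        """Toy external (key, value) memory with a temporal moving-average
        update and top-k retrieval. Illustrative sketch only; names and
        hyperparameters are assumptions, not the MoViT implementation."""

        def __init__(self, num_slots, dim, momentum=0.9, top_k=4):
            self.keys = np.zeros((num_slots, dim), dtype=np.float32)
            self.values = np.zeros((num_slots, dim), dtype=np.float32)
            self.filled = np.zeros(num_slots, dtype=bool)
            self.momentum = momentum
            self.top_k = top_k
            self._next = 0  # ring-buffer write pointer

        def update(self, new_keys, new_values):
            # Cache attention snapshots; blend into occupied slots with a
            # temporal moving average instead of overwriting them outright.
            for k, v in zip(new_keys, new_values):
                i = self._next % len(self.keys)
                if self.filled[i]:
                    m = self.momentum
                    self.keys[i] = m * self.keys[i] + (1.0 - m) * k
                    self.values[i] = m * self.values[i] + (1.0 - m) * v
                else:
                    self.keys[i], self.values[i] = k, v
                    self.filled[i] = True
                self._next += 1

        def retrieve(self, queries):
            # For each query, attend over its top-k most similar cached keys
            # (scaled dot-product scores, softmax-normalised over the k hits).
            keys, values = self.keys[self.filled], self.values[self.filled]
            scores = queries @ keys.T / np.sqrt(queries.shape[-1])   # (Q, M)
            k = min(self.top_k, keys.shape[0])
            top = np.argsort(-scores, axis=-1)[:, :k]                # (Q, k)
            top_scores = np.take_along_axis(scores, top, axis=-1)
            w = np.exp(top_scores - top_scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            return np.einsum("qk,qkd->qd", w, values[top])           # (Q, d)


    # Usage: cache a few "snapshots" during training, then query at inference.
    rng = np.random.default_rng(0)
    mem = ExternalAttentionMemory(num_slots=32, dim=8)
    mem.update(rng.standard_normal((16, 8)), rng.standard_normal((16, 8)))
    print(mem.retrieve(rng.standard_normal((4, 8))).shape)  # (4, 8)

In the full model, the cached pairs would presumably come from a transformer attention layer and the retrieved values would be fused with the local self-attention output; shrinking the memory to a small prototype subset is what would yield the inference speedup described above.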



Acknowledgments

This work was supported in part by grants from the National Institutes of Health (R37CA248077, R01CA228188). The MRI equipment in this study was funded by the NIH grant: 1S10ODO21648.

Author information


Corresponding author

Correspondence to Mathias Unberath.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shen, Y. et al. (2024). MoViT: Memorizing Vision Transformers for Medical Image Analysis. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds) Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14349. Springer, Cham. https://doi.org/10.1007/978-3-031-45676-3_21


  • DOI: https://doi.org/10.1007/978-3-031-45676-3_21


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45675-6

  • Online ISBN: 978-3-031-45676-3

  • eBook Packages: Computer Science, Computer Science (R0)
