
Abstract

Transformer architectures have developed rapidly in computer vision over the past two years. Influenced by the Transformer's success in natural language processing and by the Vision Transformer (ViT), which applies multi-head self-attention to image classification, a growing number of researchers are exploring Transformer-based vision models. Although the prospects for Transformers in computer vision are broad, some inherent problems remain to be overcome. This article introduces the basic principles of the Transformer and the Vision Transformer, and focuses on the problems that arise when applying Transformers in computer vision, such as their large computational and data requirements. Starting from these issues, the author analyzes the improvement strategies proposed in related papers and finally points out future development directions for visual Transformers, in order to further promote research in this field and the exploration of more efficient visual Transformer models.
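
For readers unfamiliar with the mechanism the abstract refers to, the sketch below is a minimal ViT-style block in Python (PyTorch): an image is cut into 16 × 16 patches, linearly embedded, and passed through multi-head self-attention, following the general recipe of Dosovitskiy et al. It also illustrates why computation is a central concern of this review, since attention over N patch tokens scales as O(N²). The class name, layer sizes, and configuration are illustrative assumptions, not taken from the paper or from any specific model it surveys.

```python
# Minimal ViT-style patch embedding + multi-head self-attention (PyTorch).
# Illustrative sketch only: the class name and layer sizes are hypothetical.
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, heads=3):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2          # 196 tokens for 224/16
        # Non-overlapping patch embedding via a strided convolution.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(dim)
        # Self-attention over N tokens costs O(N^2 * dim) time and O(N^2) memory:
        # this is the "large computation" problem the review discusses.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                    # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)           # (B, N, dim)
        return out

x = torch.randn(1, 3, 224, 224)
print(TinyViTBlock()(x).shape)                               # torch.Size([1, 196, 192])
```

Note that doubling the input resolution quadruples the number of tokens and thus increases the attention cost roughly sixteen-fold, which is one motivation for efficiency-oriented designs such as windowed attention.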

Author information

Correspondence to Xizhan Gao.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ma, S., Gao, X., Jiang, L., Xu, R. (2024). A Review of Visual Transformer Research. In: You, P., Liu, S., Wang, J. (eds) Proceedings of International Conference on Image, Vision and Intelligent Systems 2023 (ICIVIS 2023). ICIVIS 2023. Lecture Notes in Electrical Engineering, vol 1163. Springer, Singapore. https://doi.org/10.1007/978-981-97-0855-0_33

  • DOI: https://doi.org/10.1007/978-981-97-0855-0_33

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0854-3

  • Online ISBN: 978-981-97-0855-0

  • eBook Packages: Engineering, Engineering (R0)
