Abstract
Transformers have advanced rapidly in computer vision over the past two years. Influenced by the Transformer's success in natural language processing and by the Vision Transformer's use of multi-head attention for image classification, a growing number of researchers are applying Transformers to vision tasks. Transformer models hold broad promise in computer vision, but several inherent problems remain to be overcome. This article introduces the basic principles of the Transformer and the Vision Transformer, and focuses on the difficulties that arise when Transformers are applied to vision, in particular their heavy computation and data requirements. Starting from these issues, we analyze the improvement strategies proposed in related work, and conclude by outlining future directions for visual Transformers based on representative models, with the aim of promoting further research in this field and the design of more efficient visual Transformer models.
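As a minimal illustration of the Vision Transformer pipeline summarized above (splitting an image into 16 × 16 patches, embedding each patch as a token, and applying multi-head self-attention), the following PyTorch sketch shows one encoder block. It is an illustrative assumption, not the implementation of any paper discussed here: the names PatchEmbed and ViTBlock are invented, and positional embeddings and the full encoder stack are omitted for brevity.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn a (B, 3, 224, 224) image into a (B, 196, dim) patch sequence."""
    def __init__(self, patch=16, dim=768):
        super().__init__()
        # A stride-16 convolution is equivalent to slicing out 16x16 patches
        # and applying one shared linear projection to each patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                     # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim)

class ViTBlock(nn.Module):
    """One pre-norm encoder block: multi-head self-attention + MLP, residuals."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention
        return x + self.mlp(self.norm2(x))

# Usage: prepend a learnable class token and classify from its output,
# as in the original Vision Transformer recipe (positional embeddings omitted).
embed, block = PatchEmbed(), ViTBlock()
cls = nn.Parameter(torch.zeros(1, 1, 768))
img = torch.randn(2, 3, 224, 224)
tokens = torch.cat([cls.expand(2, -1, -1), embed(img)], dim=1)  # (2, 197, 768)
out = block(tokens)
logits = nn.Linear(768, 1000)(out[:, 0])  # class token -> 1000-way logits

The quadratic cost of self-attention over all 196 patch tokens is precisely the "large computation" problem the abstract refers to, which hierarchical and windowed variants aim to reduce.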
Cite this paper
Ma, S., Gao, X., Jiang, L., Xu, R. (2024). A Review of Visual Transformer Research. In: You, P., Liu, S., Wang, J. (eds) Proceedings of International Conference on Image, Vision and Intelligent Systems 2023 (ICIVIS 2023). ICIVIS 2023. Lecture Notes in Electrical Engineering, vol 1163. Springer, Singapore. https://doi.org/10.1007/978-981-97-0855-0_33