Abstract
Transformers have advanced rapidly in computer vision over the past two years. Influenced by the Transformer's success in natural language processing and by the Vision Transformer's use of multi-head attention for image classification, a growing number of researchers are applying Transformers to vision tasks. Transformer models hold broad promise in computer vision, but several inherent problems remain to be overcome. This article introduces the basic principles of the Transformer and the Vision Transformer, and focuses on the difficulties that arise when Transformers are applied to vision, in particular their heavy computation and data requirements. Starting from these issues, we analyze the improvement strategies proposed in related work, and conclude by outlining future directions for visual Transformers based on representative models, with the aim of promoting further research in this field and the design of more efficient visual Transformer models.
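As a minimal illustration of the Vision Transformer pipeline summarized above (splitting an image into 16 × 16 patches, embedding each patch as a token, and applying multi-head self-attention), the following PyTorch sketch shows one encoder block. It is an illustrative assumption, not the implementation of any paper discussed here: the names PatchEmbed and ViTBlock are invented, and positional embeddings and the full encoder stack are omitted for brevity.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn a (B, 3, 224, 224) image into a (B, 196, dim) patch sequence."""
    def __init__(self, patch=16, dim=768):
        super().__init__()
        # A stride-16 convolution is equivalent to slicing out 16x16 patches
        # and applying one shared linear projection to each patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                     # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim)

class ViTBlock(nn.Module):
    """One pre-norm encoder block: multi-head self-attention + MLP, residuals."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention
        return x + self.mlp(self.norm2(x))

# Usage: prepend a learnable class token and classify from its output,
# as in the original Vision Transformer recipe (positional embeddings omitted).
embed, block = PatchEmbed(), ViTBlock()
cls = nn.Parameter(torch.zeros(1, 1, 768))
img = torch.randn(2, 3, 224, 224)
tokens = torch.cat([cls.expand(2, -1, -1), embed(img)], dim=1)  # (2, 197, 768)
out = block(tokens)
logits = nn.Linear(768, 1000)(out[:, 0])  # class token -> 1000-way logits

The quadratic cost of self-attention over all 196 patch tokens is precisely the "large computation" problem the abstract refers to, which hierarchical and windowed variants aim to reduce.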
Cite this paper
Ma, S., Gao, X., Jiang, L., Xu, R. (2024). A Review of Visual Transformer Research. In: You, P., Liu, S., Wang, J. (eds) Proceedings of International Conference on Image, Vision and Intelligent Systems 2023 (ICIVIS 2023). ICIVIS 2023. Lecture Notes in Electrical Engineering, vol 1163. Springer, Singapore. https://doi.org/10.1007/978-981-97-0855-0_33