ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

Abstract

Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependencies using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens and lack an intrinsic inductive bias (IB) for modeling local visual structures and handling scale variance, which must instead be learned implicitly from large-scale training data with long training schedules. In this paper, we leverage these two IBs and propose the ViTAE transformer, which utilizes a reduction cell for multi-scale features and a normal cell for locality. The two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over the baseline and representative models. Moreover, we scale the ViTAE model up to 644 M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set, without using extra private data. This demonstrates that the introduced inductive biases still help when the model size becomes large. The source code and pretrained models are publicly available.
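
To make the cell design concrete, below is a minimal PyTorch sketch of a locality-aware ("normal") cell in the spirit described above: a depth-wise convolutional branch runs in parallel with multi-head self-attention, and the two outputs are fused before the feed-forward network. The module name, layer sizes, \(3\times 3\) depth-wise kernel, and normalization choices are illustrative assumptions rather than the authors' released implementation; the reduction cell, which additionally downsamples and aggregates multi-scale context, is omitted for brevity.

```python
import torch
import torch.nn as nn


class NormalCellSketch(nn.Module):
    """Locality-aware transformer cell: multi-head self-attention (global branch)
    runs in parallel with a small depth-wise convolution (local branch), and the
    two outputs are summed before the feed-forward network."""

    def __init__(self, dim: int = 64, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Parallel convolutional branch injecting a locality inductive bias.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) token sequence; hw: (H, W) spatial layout with N == H * W.
        B, N, C = x.shape
        H, W = hw
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)                        # global branch
        fmap = y.transpose(1, 2).reshape(B, C, H, W)
        conv_out = self.local(fmap).flatten(2).transpose(1, 2)  # local branch
        x = x + attn_out + conv_out                             # fuse both branches
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    cell = NormalCellSketch(dim=64, num_heads=4)
    tokens = torch.randn(2, 14 * 14, 64)
    print(cell(tokens, (14, 14)).shape)  # torch.Size([2, 196, 64])
```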

Notes

  1. Although the projection layer in a transformer can be viewed as a \(1\times 1\) convolution (Chen et al., 2021c), the term convolution here refers to kernels of larger size, e.g., \(3 \times 3\), which are widely used in typical CNNs to extract spatial features; the short snippet below illustrates the \(1\times 1\) case.
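
The \(1\times 1\) equivalence mentioned in this note can be checked directly. The following self-contained PyTorch snippet (an illustrative check, not code from the paper) copies the weights of a token-wise linear projection into a \(1\times 1\) convolution and verifies that both produce the same output; the locality IB discussed in the paper instead comes from larger spatial kernels such as \(3\times 3\).

```python
import torch
import torch.nn as nn

# A token-wise linear projection and a 1x1 convolution over the corresponding
# feature map compute the same result once their weights are shared.
dim_in, dim_out, H, W = 8, 16, 4, 4
linear = nn.Linear(dim_in, dim_out)
conv1x1 = nn.Conv2d(dim_in, dim_out, kernel_size=1)
with torch.no_grad():
    conv1x1.weight.copy_(linear.weight.view(dim_out, dim_in, 1, 1))
    conv1x1.bias.copy_(linear.bias)

tokens = torch.randn(1, H * W, dim_in)                  # (B, N, C) token view
fmap = tokens.transpose(1, 2).reshape(1, dim_in, H, W)  # (B, C, H, W) map view
out_linear = linear(tokens)
out_conv = conv1x1(fmap).flatten(2).transpose(1, 2)
assert torch.allclose(out_linear, out_conv, atol=1e-6)
```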

References

  • Adelson, E. H., Anderson, C. H., Bergen, J. R., Burt, P. J., & Ogden, J. M. (1984). Pyramid methods in image processing. RCA Engineer, 29(6), 33–41.

  • Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al. (2021). Xcit: Cross-covariance image transformers. Advances in Neural Information Processing Systems, 34, 20014–20027.

  • Ba, J.L., Kiros, J.R., Hinton, G.E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450

  • Bao, H., Dong, L., Piao, S., Wei, F. (2021). Beit: Bert pre-training of image transformers. In: International conference on learning representations

  • Bay, H., Tuytelaars, T., Van Gool, L. (2006). Surf: Speeded up robust features. In: European conference on computer vision, Springer, pp. 404–417

  • Beyer, L., Hénaff, O.J., Kolesnikov, A., Zhai, X., Oord, Avd. (2020). Are we done with imagenet? arXiv preprint arXiv:2006.07159

  • Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A. (2021). Understanding robustness of transformers for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10231–10241

  • Burt, P.J., Adelson, E.H. (1987). The laplacian pyramid as a compact image code. In: Readings in computer vision, Elsevier, pp. 671–679

  • Cai, Z., Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162

  • Cai, Z., & Vasconcelos, N. (2019). Cascade r-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1483–1498.

  • Chen, C.F., Panda, R., Fan, Q. (2021a). Regionvit: Regional-to-local attention for vision transformers. In: international conference on learning representations

  • Chen, C.F.R, Fan, Q., Panda, R. (2021b). Crossvit: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 357–366

  • Chen, L.C., Papandreou, G., Schroff, F., Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

  • Chen, X., Xie, S., He, K. (2021c). An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9640–9649

  • Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z. (2022). Mobile-former: Bridging mobilenet and transformer. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5270–5279

  • Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking attention with performers. In: International conference on learning representations

  • Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34, 9355–9366.

  • Contributors, M. (2020). MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation

  • Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.

  • d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In International conference on machine learning, PMLR, pp. 2286–2296

  • Demirel, H., & Anbarjafari, G. (2010). Image resolution enhancement by using discrete and stationary wavelet decomposition. IEEE Transactions on Image Processing, 20(5), 1458–1460.

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp. 248–255

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations

  • Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C. (2021). Multiscale vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6824–6835

  • Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M. (2021). Levit: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12259–12269

  • Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M. (2022). Visual attention network. arXiv preprint arXiv:2202.09741

  • Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. Advances in Neural Information Processing Systems, 34, 15908–15919.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

  • He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778

  • He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2961–2969

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 16000–16009

  • He, L., Dong, Y., Wang, Y., Tao, D., & Lin, Z. (2021). Gauge equivariant transformer. Advances in Neural Information Processing Systems, 34, 27331–27343.

  • Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J. (2021). Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 11936–11945

  • Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

  • Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708

  • Ke, Y., Sukthankar, R. (2004). Pca-sift: A more distinctive representation for local image descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, IEEE, vol 2, pp. II–II

  • Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186

  • Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N. (2020). Big transfer (bit): General visual representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, Springer, pp. 491–507

  • Krause, J., Stark, M., Deng, J., Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia

  • Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.

  • Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H. (2017). Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632

  • LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.

  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

  • Lee, Y., Kim, J., Willette, J., Hwang, S.J. (2022). Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7287–7296

  • Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L. (2021). Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707

  • Lin, G., Shen, C., Van Den Hengel, A., Reid, I. (2016). Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, Springer, pp. 740–755

  • Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12009–12019

  • Loshchilov, I., Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  • Loshchilov, I., Hutter, F. (2018). Decoupled weight decay regularization. In International Conference on Learning Representations

  • Luo, W., Li, Y., Urtasun, R., & Zemel, R. S. (2016). Understanding the effective receptive field in deep convolutional neural networks. Advances in Neural Information Processing Systems, 29, 4898–4906.

  • Ng, P. C., & Henikoff, S. (2003). Sift: Predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13), 3812–3814.

  • Nilsback, M.E., Zisserman, A. (2008). Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing

  • Olkkonen, H., & Pesola, P. (1996). Gaussian pyramid wavelet transform for multiresolution analysis of images. Graphical Models and Image Processing, 58(4), 394–398.

  • Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V. (2012). Cats and dogs. In Proceedings of the IEEE conference on computer vision and pattern recognition

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32

  • Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., Ye, Q. (2021). Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 367–376

  • Pham, H., Dai, Z., Xie, Q., Le, Q.V. (2021). Meta pseudo labels. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11557–11568

  • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In international conference on machine learning, PMLR, pp. 8748–8763

  • Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10428–10436

  • Rublee, E., Rabaud, V., Konolige, K., Bradski, G. (2011). Orb: An efficient alternative to sift or surf. In: Proceedings of the IEEE/CVF international conference on computer vision, IEEE, pp. 2564–2571

  • Sabour, S., Frosst, N., Hinton, G.E. (2017). Dynamic routing between capsules. Advances in Neural Information Processing Systems 30

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520

  • Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 618–626

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826

  • Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI conference on artificial intelligence, vol 31

  • Tan, M., Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, PMLR, pp. 6105–6114

  • Tang, S., Gong, R., Wang, Y., Liu, A., Wang, J., Chen, X., Yu, F., Liu, X., Song, D., Yuille, A. et al (2021). Robustart: Benchmarking robustness on architecture design and training techniques. arXiv preprint arXiv:2109.05211

  • Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 24261–24272.

  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021a). Training data-efficient image transformers & distillation through attention. In International conference on machine learning, PMLR, pp. 10347–10357

  • Touvron, H., Sablayrolles, A., Douze, M., Cord, M., & Jégou, H. (2021b). Grafit: Learning fine-grained image representations with coarse labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 874–884

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30

  • Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021a). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/cvf international conference on computer vision, pp. 568–578

  • Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., & Liu, W. (2021b). Crossformer: A versatile vision transformer hinging on cross-scale attention. In International conference on learning representations

  • Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.

  • Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 14668–14678

  • Wightman, R. (2019). Pytorch image models. https://github.com/rwightman/pytorch-image-models, 10.5281/zenodo.4414861

  • Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 22–31

  • Xia, Z., Pan, X., Song, S., Li, L.E., & Huang, G. (2022). Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4794–4803

  • Xiao, B., Wu, H., & Wei, Y. (2018a). Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp. 466–481

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018b). Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pp. 418–434

  • Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9653–9663

  • Xu, Y., Zhang, Q., Zhang, J., & Tao, D. (2021). Vitae: Vision transformer advanced by exploring intrinsic inductive bias. Advances in Neural Information Processing Systems, 34, 28522–28535.

  • Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2022). Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems

  • Yan, H., Li, Z., Li, W., Wang, C., Wu, M., & Zhang, C. (2021). Contnet: Why not use convolution and transformer at the same time? arXiv preprint arXiv:2104.13497

  • Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., & Gao, J. (2021). Focal self-attention for local-global interactions in vision transformers. Advances in Neural Information Processing Systems

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International conference on learning representations

  • Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480

  • Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., & Tao, D. (2021). Ap-10k: A benchmark for animal pose estimation in the wild. In: Thirty-fifth conference on neural information processing systems datasets and benchmarks Track (Round 2)

  • Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). Metaformer is actually what you need for vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10819–10829

  • Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., & Wu, W. (2021a). Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International conference on computer vision, pp. 579–588

  • Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., & Yan, S. (2021b). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558–567

  • Zeiler, M.D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision, Springer, pp. 818–833

  • Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12104–12113

  • Zhang, J., Cao, Y., Wang, Y., Wen, C., & Chen, C.W. (2018). Fully point-wise convolutional neural network for modeling statistical regularities in natural images. In Proceedings of the 26th ACM international conference on Multimedia, pp. 984–992

  • Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L., & Gao, J. (2021). Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2998–3008

  • Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2022). Vsa: Learning varied-size window attention in vision transformers. In Proceedings of the European conference on computer vision (ECCV)

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890

  • Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., & Torr, P.H. et al (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6881–6890

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3), 302–321.

Author information

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Frederic Jurie.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Mr. Qiming Zhang, Mr. Yufei Xu, and Dr. Jing Zhang are supported by the Australian Research Council Research Project FL-170100117.

About this article

Cite this article

Zhang, Q., Xu, Y., Zhang, J. et al. ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond. Int J Comput Vis 131, 1141–1162 (2023). https://doi.org/10.1007/s11263-022-01739-w

