Abstract
Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependencies using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens and lack an intrinsic inductive bias (IB) for modeling local visual structures and dealing with scale variance; such knowledge is instead learned implicitly from large-scale training data over longer training schedules. In this paper, we leverage these two IBs and propose the ViTAE transformer, which utilizes a reduction cell to embed multi-scale features and a normal cell to model locality. The two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over the baseline and representative models. Besides, we scale up our ViTAE model to 644M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set, without using extra private data. This demonstrates that the introduced inductive bias still helps when the model size becomes large. The source code and pretrained models are publicly available at https://github.com/ViTAE-Transformer/ViTAE-Transformer.
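For concreteness, below is a minimal PyTorch sketch of the two cell types described in the abstract. The channel sizes, dilation rates, and the exact fusion of the convolution and attention branches are illustrative assumptions rather than the paper's exact design; see the original ViTAE formulation (Xu et al., 2021) for the precise architecture.

```python
import torch
import torch.nn as nn

class ReductionCell(nn.Module):
    """Embeds multi-scale context: parallel dilated convolutions (a pyramid
    of receptive fields) downsample the input, and self-attention then
    models long-range dependencies among the resulting tokens."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3), num_heads=4):
        super().__init__()
        # Each branch sees a different receptive field but yields the same
        # spatial size, so the outputs can be concatenated and fused.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, 1)
        self.norm = nn.LayerNorm(out_ch)
        self.attn = nn.MultiheadAttention(out_ch, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) tokens
        n = self.norm(t)
        return t + self.attn(n, n, n)[0], (H, W)

class NormalCell(nn.Module):
    """Models locality: a depthwise 3x3 convolution branch runs in parallel
    with multi-head self-attention, and the two results are summed."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, t, hw):                   # t: (B, N, C), N = H*W
        H, W = hw
        B, N, C = t.shape
        n = self.norm(t)
        local = self.conv(n.transpose(1, 2).reshape(B, C, H, W))
        return t + self.attn(n, n, n)[0] + local.flatten(2).transpose(1, 2)

# Toy usage: one reduction cell followed by one normal cell.
tokens, hw = ReductionCell(3, 64)(torch.randn(2, 3, 32, 32))
tokens = NormalCell(64)(tokens, hw)             # (2, 256, 64)
```

Stacking such cells isotropically yields the vanilla ViTAE, while stacking reduction cells between stages of normal cells yields the multi-stage ViTAEv2.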
Notes
Although the projection layer in a transformer can be viewed as a \(1\times 1\) convolution (Chen et al., 2021c), the term convolution here refers to convolutions with larger kernels, e.g., \(3 \times 3\), which are widely used in typical CNNs to extract spatial features.
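This distinction is easy to verify in code. A minimal PyTorch check (shapes are arbitrary) shows that a token-wise linear projection is numerically identical to a \(1\times 1\) convolution, while a \(3\times 3\) convolution additionally aggregates spatial neighbors:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 14, 14)                  # (B, C, H, W) feature map
linear = nn.Linear(64, 128)                     # token-wise projection
conv1x1 = nn.Conv2d(64, 128, kernel_size=1)
# Copy the linear weights into the 1x1 convolution.
conv1x1.weight.data = linear.weight.data.view(128, 64, 1, 1)
conv1x1.bias.data = linear.bias.data

tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
out_linear = linear(tokens).transpose(1, 2).reshape(1, 128, 14, 14)
print(torch.allclose(out_linear, conv1x1(x), atol=1e-6))  # True

# A 3x3 convolution, in contrast, mixes each token with its spatial
# neighborhood -- the local aggregation the footnote refers to.
out_3x3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)(x)  # same shape
```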
References
Adelson, E. H., Anderson, C. H., Bergen, J. R., Burt, P. J., & Ogden, J. M. (1984). Pyramid methods in image processing. RCA Engineer, 29(6), 33–41.
Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al. (2021). Xcit: Cross-covariance image transformers. Advances in Neural Information Processing Systems, 34, 20014–20027.
Ba, J.L., Kiros, J.R., Hinton, G.E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450
Bao, H., Dong, L., Piao, S., Wei, F. (2021). Beit: Bert pre-training of image transformers. In: International conference on learning representations
Bay, H., Tuytelaars, T., Van Gool, L. (2006). Surf: Speeded up robust features. In: European conference on computer vision, Springer, pp. 404–417
Beyer, L., Hénaff, O.J., Kolesnikov, A., Zhai, X., & van den Oord, A. (2020). Are we done with imagenet? arXiv preprint arXiv:2006.07159
Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A. (2021). Understanding robustness of transformers for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10231–10241
Burt, P.J., Adelson, E.H. (1987). The laplacian pyramid as a compact image code. In: Readings in computer vision, Elsevier, pp. 671–679
Cai, Z., Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162
Cai, Z., & Vasconcelos, N. (2019). Cascade r-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1483–1498.
Chen, C.F., Panda, R., Fan, Q. (2021a). Regionvit: Regional-to-local attention for vision transformers. In: international conference on learning representations
Chen, C.F.R, Fan, Q., Panda, R. (2021b). Crossvit: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 357–366
Chen, L.C., Papandreou, G., Schroff, F., Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587
Chen, X., Xie, S., He, K. (2021c). An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9640–9649
Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z. (2022). Mobile-former: Bridging mobilenet and transformer. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5270–5279
Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking attention with performers. In: International conference on learning representations
Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34, 9355–9366.
MMSegmentation Contributors. (2020). MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation
Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.
d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In International conference on machine learning, PMLR, pp. 2286–2296
Demirel, H., & Anbarjafari, G. (2010). Image resolution enhancement by using discrete and stationary wavelet decomposition. IEEE Transactions on Image Processing, 20(5), 1458–1460.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp. 248–255
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C. (2021). Multiscale vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6824–6835
Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M. (2021). Levit: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12259–12269
Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M. (2022). Visual attention network. arXiv preprint arXiv:2202.09741
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. Advances in Neural Information Processing Systems, 34, 15908–15919.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.
He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778
He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2961–2969
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 16000–16009
He, L., Dong, Y., Wang, Y., Tao, D., & Lin, Z. (2021). Gauge equivariant transformer. Advances in Neural Information Processing Systems, 34, 27331–27343.
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J. (2021). Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 11936–11945
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708
Ke, Y., Sukthankar, R. (2004). Pca-sift: A more distinctive representation for local image descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, IEEE, vol 2, pp. II–II
Devlin, J., Chang, M.W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N. (2020). Big transfer (bit): General visual representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, Springer, pp. 491–507
Krause, J., Stark, M., Deng, J., Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.
Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H. (2017). Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632
LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Lee, Y., Kim, J., Willette, J., Hwang, S.J. (2022). Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7287–7296
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L. (2021). Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707
Lin, G., Shen, C., Van Den Hengel, A., Reid, I. (2016). Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, Springer, pp. 740–755
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12009–12019
Loshchilov, I., Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
Loshchilov, I., Hutter, F. (2018). Decoupled weight decay regularization. In International Conference on Learning Representations
Luo, W., Li, Y., Urtasun, R., & Zemel, R. S. (2016). Understanding the effective receptive field in deep convolutional neural networks. Advances in Neural Information Processing Systems, 29, 4898–4906.
Ng, P. C., & Henikoff, S. (2003). Sift: Predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13), 3812–3814.
Nilsback, M.E., Zisserman, A. (2008). Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing
Olkkonen, H., & Pesola, P. (1996). Gaussian pyramid wavelet transform for multiresolution analysis of images. Graphical Models and Image Processing, 58(4), 394–398.
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V. (2012). Cats and dogs. In Proceedings of the IEEE conference on computer vision and pattern recognition
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32
Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., Ye, Q. (2021). Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 367–376
Pham, H., Dai, Z., Xie, Q., Le, Q.V. (2021). Meta pseudo labels. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11557–11568
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In international conference on machine learning, PMLR, pp. 8748–8763
Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10428–10436
Rublee, E., Rabaud, V., Konolige, K., Bradski, G. (2011). Orb: An efficient alternative to sift or surf. In: Proceedings of the IEEE/CVF international conference on computer vision, IEEE, pp. 2564–2571
Sabour, S., Frosst, N., Hinton, G.E. (2017). Dynamic routing between capsules. Advances in Neural Information Processing Systems 30
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 618–626
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI conference on artificial intelligence, vol 31
Tan, M., Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, PMLR, pp. 6105–6114
Tang, S., Gong, R., Wang, Y., Liu, A., Wang, J., Chen, X., Yu, F., Liu, X., Song, D., Yuille, A. et al (2021). Robustart: Benchmarking robustness on architecture design and training techniques. arXiv preprint arXiv:2109.05211
Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 24261–24272.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021a). Training data-efficient image transformers & distillation through attention. In International conference on machine learning, PMLR, pp. 10347–10357
Touvron, H., Sablayrolles, A., Douze, M., Cord, M., & Jégou, H. (2021b). Grafit: Learning fine-grained image representations with coarse labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 874–884
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30
Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021a). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/cvf international conference on computer vision, pp. 568–578
Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., & Liu, W. (2021b). Crossformer: A versatile vision transformer hinging on cross-scale attention. In International conference on learning representations
Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.
Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 14668–14678
Wightman, R. (2019). PyTorch image models. https://github.com/rwightman/pytorch-image-models. https://doi.org/10.5281/zenodo.4414861
Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 22–31
Xia, Z., Pan, X., Song, S., Li, L.E., & Huang, G. (2022). Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4794–4803
Xiao, B., Wu, H., & Wei, Y. (2018a). Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp. 466–481
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018b). Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pp. 418–434
Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9653–9663
Xu, Y., Zhang, Q., Zhang, J., & Tao, D. (2021). Vitae: Vision transformer advanced by exploring intrinsic inductive bias. Advances in Neural Information Processing Systems, 34, 28522–28535.
Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2022). Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems
Yan, H., Li, Z., Li, W., Wang, C., Wu, M., & Zhang, C. (2021). Contnet: Why not use convolution and transformer at the same time? arXiv preprint arXiv:2104.13497
Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., & Gao, J. (2021). Focal self-attention for local-global interactions in vision transformers. Advances in Neural Information Processing Systems
Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International conference on learning representations
Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480
Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., & Tao, D. (2021). Ap-10k: A benchmark for animal pose estimation in the wild. In: Thirty-fifth conference on neural information processing systems datasets and benchmarks Track (Round 2)
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). Metaformer is actually what you need for vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10819–10829
Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., & Wu, W. (2021a). Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International conference on computer vision, pp. 579–588
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., & Yan, S. (2021b). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558–567
Zeiler, M.D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision, Springer, pp. 818–833
Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12104–12113
Zhang, J., Cao, Y., Wang, Y., Wen, C., & Chen, C.W. (2018). Fully point-wise convolutional neural network for modeling statistical regularities in natural images. In Proceedings of the 26th ACM international conference on Multimedia, pp. 984–992
Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L., & Gao, J. (2021). Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2998–3008
Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2022). Vsa: Learning varied-size window attention in vision transformers. In Proceedings of the European conference on computer vision (ECCV)
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., & Torr, P.H. et al (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6881–6890
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641
Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3), 302–321.
Additional information
Communicated by Frederic Jurie.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Mr. Qiming Zhang, Mr. Yufei Xu, and Dr. Jing Zhang are supported by the Australian Research Council Research Project FL-170100117.
Cite this article
Zhang, Q., Xu, Y., Zhang, J. et al. ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond. Int J Comput Vis 131, 1141–1162 (2023). https://doi.org/10.1007/s11263-022-01739-w