ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

Abstract

Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependencies using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens and lack an intrinsic inductive bias (IB) for modeling local visual structures and handling scale variance, which must instead be learned implicitly from large-scale training data with long training schedules. In this paper, we leverage these two IBs and propose the ViTAE transformer, which utilizes a reduction cell for multi-scale features and a normal cell for locality. The two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over the baseline and representative models. Moreover, we scale the ViTAE model up to 644 M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set, without using extra private data. This demonstrates that the introduced inductive biases still help when the model size becomes large. The source code and pretrained models are publicly available.
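
To make the cell design concrete, below is a minimal PyTorch sketch of a locality-aware ("normal") cell in the spirit described above: a depth-wise convolutional branch runs in parallel with multi-head self-attention, and the two outputs are fused before the feed-forward network. The module name, layer sizes, \(3\times 3\) depth-wise kernel, and normalization choices are illustrative assumptions rather than the authors' released implementation; the reduction cell, which additionally downsamples and aggregates multi-scale context, is omitted for brevity.

```python
import torch
import torch.nn as nn


class NormalCellSketch(nn.Module):
    """Locality-aware transformer cell: multi-head self-attention (global branch)
    runs in parallel with a small depth-wise convolution (local branch), and the
    two outputs are summed before the feed-forward network."""

    def __init__(self, dim: int = 64, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Parallel convolutional branch injecting a locality inductive bias.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) token sequence; hw: (H, W) spatial layout with N == H * W.
        B, N, C = x.shape
        H, W = hw
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)                        # global branch
        fmap = y.transpose(1, 2).reshape(B, C, H, W)
        conv_out = self.local(fmap).flatten(2).transpose(1, 2)  # local branch
        x = x + attn_out + conv_out                             # fuse both branches
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    cell = NormalCellSketch(dim=64, num_heads=4)
    tokens = torch.randn(2, 14 * 14, 64)
    print(cell(tokens, (14, 14)).shape)  # torch.Size([2, 196, 64])
```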

Notes

  1. Although the projection layer in a transformer can be viewed as a \(1\times 1\) convolution (Chen et al., 2021c), the term convolution here refers to kernels of larger size, e.g., \(3 \times 3\), which are widely used in typical CNNs to extract spatial features; the short snippet below illustrates the \(1\times 1\) case.
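
The \(1\times 1\) equivalence mentioned in this note can be checked directly. The following self-contained PyTorch snippet (an illustrative check, not code from the paper) copies the weights of a token-wise linear projection into a \(1\times 1\) convolution and verifies that both produce the same output; the locality IB discussed in the paper instead comes from larger spatial kernels such as \(3\times 3\).

```python
import torch
import torch.nn as nn

# A token-wise linear projection and a 1x1 convolution over the corresponding
# feature map compute the same result once their weights are shared.
dim_in, dim_out, H, W = 8, 16, 4, 4
linear = nn.Linear(dim_in, dim_out)
conv1x1 = nn.Conv2d(dim_in, dim_out, kernel_size=1)
with torch.no_grad():
    conv1x1.weight.copy_(linear.weight.view(dim_out, dim_in, 1, 1))
    conv1x1.bias.copy_(linear.bias)

tokens = torch.randn(1, H * W, dim_in)                  # (B, N, C) token view
fmap = tokens.transpose(1, 2).reshape(1, dim_in, H, W)  # (B, C, H, W) map view
out_linear = linear(tokens)
out_conv = conv1x1(fmap).flatten(2).transpose(1, 2)
assert torch.allclose(out_linear, out_conv, atol=1e-6)
```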

References

  • Adelson, E. H., Anderson, C. H., Bergen, J. R., Burt, P. J., & Ogden, J. M. (1984). Pyramid methods in image processing. RCA Engineer, 29(6), 33–41.

  • Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al. (2021). Xcit: Cross-covariance image transformers. Advances in Neural Information Processing Systems, 34, 20014–20027.

  • Ba, J.L., Kiros, J.R., Hinton, G.E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450

  • Bao, H., Dong, L., Piao, S., Wei, F. (2021). Beit: Bert pre-training of image transformers. In: International conference on learning representations

  • Bay, H., Tuytelaars, T., Van Gool, L. (2006). Surf: Speeded up robust features. In: European conference on computer vision, Springer, pp. 404–417

  • Beyer, L., Hénaff, O.J., Kolesnikov, A., Zhai, X., Oord, Avd. (2020). Are we done with imagenet? arXiv preprint arXiv:2006.07159

  • Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A. (2021). Understanding robustness of transformers for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10231–10241

  • Burt, P.J., Adelson, E.H. (1987). The laplacian pyramid as a compact image code. In: Readings in computer vision, Elsevier, pp. 671–679

  • Cai, Z., Vasconcelos, N. (2018). Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162

  • Cai, Z., & Vasconcelos, N. (2019). Cascade r-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1483–1498.

  • Chen, C.F., Panda, R., Fan, Q. (2021a). Regionvit: Regional-to-local attention for vision transformers. In: international conference on learning representations

  • Chen, C.F.R, Fan, Q., Panda, R. (2021b). Crossvit: Cross-attention multi-scale vision transformer for image classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 357–366

  • Chen, L.C., Papandreou, G., Schroff, F., Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

  • Chen, X., Xie, S., He, K. (2021c). An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9640–9649

  • Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z. (2022). Mobile-former: Bridging mobilenet and transformer. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5270–5279

  • Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., et al. (2020). Rethinking attention with performers. In: International conference on learning representations

  • Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., & Shen, C. (2021). Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34, 9355–9366.

  • Contributors, M. (2020). MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation

  • Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965–3977.

  • d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L. (2021). Convit: Improving vision transformers with soft convolutional inductive biases. In International conference on machine learning, PMLR, pp. 2286–2296

  • Demirel, H., & Anbarjafari, G. (2010). Image resolution enhancement by using discrete and stationary wavelet decomposition. IEEE Transactions on Image Processing, 20(5), 1458–1460.

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp. 248–255

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations

  • Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C. (2021). Multiscale vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6824–6835

  • Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M. (2021). Levit: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12259–12269

  • Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M. (2022). Visual attention network. arXiv preprint arXiv:2202.09741

  • Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. Advances in Neural Information Processing Systems, 34, 15908–15919.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

  • He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778

  • He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2961–2969

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 16000–16009

  • He, L., Dong, Y., Wang, Y., Tao, D., & Lin, Z. (2021). Gauge equivariant transformer. Advances in Neural Information Processing Systems, 34, 27331–27343.

  • Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J. (2021). Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 11936–11945

  • Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

  • Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708

  • Ke, Y., Sukthankar, R. (2004). Pca-sift: A more distinctive representation for local image descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, IEEE, vol 2, pp. II–II

  • Devlin, J., Chang, M.W., Lee, K., Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186

  • Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N. (2020). Big transfer (bit): General visual representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, Springer, pp. 491–507

  • Krause, J., Stark, M., Deng, J., Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia

  • Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.

  • Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H. (2017). Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 624–632

  • LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.

  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

  • Lee, Y., Kim, J., Willette, J., Hwang, S.J. (2022). Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7287–7296

  • Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L. (2021). Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707

  • Lin, G., Shen, C., Van Den Hengel, A., Reid, I. (2016). Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, Springer, pp. 740–755

  • Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12009–12019

  • Loshchilov, I., Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  • Loshchilov, I., Hutter, F. (2018). Decoupled weight decay regularization. In International Conference on Learning Representations

  • Luo, W., Li, Y., Urtasun, R., & Zemel, R. S. (2016). Understanding the effective receptive field in deep convolutional neural networks. Advances in Neural Information Processing Systems, 29, 4898–4906.

  • Ng, P. C., & Henikoff, S. (2003). Sift: Predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13), 3812–3814.

  • Nilsback, M.E., Zisserman, A. (2008). Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing

  • Olkkonen, H., & Pesola, P. (1996). Gaussian pyramid wavelet transform for multiresolution analysis of images. Graphical Models and Image Processing, 58(4), 394–398.

  • Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V. (2012). Cats and dogs. In Proceedings of the IEEE conference on computer vision and pattern recognition

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32

  • Peng, Z., Huang, W., Gu, S., Xie, L., Wang, Y., Jiao, J., Ye, Q. (2021). Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 367–376

  • Pham, H., Dai, Z., Xie, Q., Le, Q.V. (2021). Meta pseudo labels. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11557–11568

  • Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In international conference on machine learning, PMLR, pp. 8748–8763

  • Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10428–10436

  • Rublee, E., Rabaud, V., Konolige, K., Bradski, G. (2011). Orb: An efficient alternative to sift or surf. In: Proceedings of the IEEE/CVF international conference on computer vision, IEEE, pp. 2564–2571

  • Sabour, S., Frosst, N., Hinton, G.E. (2017). Dynamic routing between capsules. Advances in Neural Information Processing Systems 30

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520

  • Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 618–626

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826

  • Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In: Proceedings of the AAAI conference on artificial intelligence, vol 31

  • Tan, M., Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, PMLR, pp. 6105–6114

  • Tang, S., Gong, R., Wang, Y., Liu, A., Wang, J., Chen, X., Yu, F., Liu, X., Song, D., Yuille, A. et al (2021). Robustart: Benchmarking robustness on architecture design and training techniques. arXiv preprint arXiv:2109.05211

  • Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al. (2021). Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 24261–24272.

  • Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021a). Training data-efficient image transformers & distillation through attention. In International conference on machine learning, PMLR, pp. 10347–10357

  • Touvron, H., Sablayrolles, A., Douze, M., Cord, M., & Jégou, H. (2021b). Grafit: Learning fine-grained image representations with coarse labels. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 874–884

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30

  • Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021a). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/cvf international conference on computer vision, pp. 568–578

  • Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., & Liu, W. (2021b). Crossformer: A versatile vision transformer hinging on cross-scale attention. In International conference on learning representations

  • Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2022). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.

  • Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 14668–14678

  • Wightman, R. (2019). Pytorch image models. https://github.com/rwightman/pytorch-image-models, 10.5281/zenodo.4414861

  • Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 22–31

  • Xia, Z., Pan, X., Song, S., Li, L.E., & Huang, G. (2022). Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4794–4803

  • Xiao, B., Wu, H., & Wei, Y. (2018a). Simple baselines for human pose estimation and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp. 466–481

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018b). Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pp. 418–434

  • Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., & Hu, H. (2022). Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9653–9663

  • Xu, Y., Zhang, Q., Zhang, J., & Tao, D. (2021). Vitae: Vision transformer advanced by exploring intrinsic inductive bias. Advances in Neural Information Processing Systems, 34, 28522–28535.

  • Xu, Y., Zhang, J., Zhang, Q., & Tao, D. (2022). Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems

  • Yan, H., Li, Z., Li, W., Wang, C., Wu, M., & Zhang, C. (2021). Contnet: Why not use convolution and transformer at the same time? arXiv preprint arXiv:2104.13497

  • Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., & Gao, J. (2021). Focal self-attention for local-global interactions in vision transformers. Advances in Neural Information Processing Systems

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International conference on learning representations

  • Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480

  • Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., & Tao, D. (2021). Ap-10k: A benchmark for animal pose estimation in the wild. In: Thirty-fifth conference on neural information processing systems datasets and benchmarks Track (Round 2)

  • Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). Metaformer is actually what you need for vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10819–10829

  • Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., & Wu, W. (2021a). Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International conference on computer vision, pp. 579–588

  • Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., & Yan, S. (2021b). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 558–567

  • Zeiler, M.D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision, Springer, pp. 818–833

  • Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 12104–12113

  • Zhang, J., Cao, Y., Wang, Y., Wen, C., & Chen, C.W. (2018). Fully point-wise convolutional neural network for modeling statistical regularities in natural images. In Proceedings of the 26th ACM international conference on Multimedia, pp. 984–992

  • Zhang, P., Dai, X., Yang, J., Xiao, B., Yuan, L., Zhang, L., & Gao, J. (2021). Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2998–3008

  • Zhang, Q., Xu, Y., Zhang, J., & Tao, D. (2022). Vsa: Learning varied-size window attention in vision transformers. In Proceedings of the European conference on computer vision (ECCV)

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890

  • Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., & Torr, P.H. et al (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6881–6890

  • Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 633–641

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3), 302–321.

Author information

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Frederic Jurie.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Mr. Qiming Zhang, Mr. Yufei Xu, and Dr. Jing Zhang are supported by the Australian Research Council Research Project FL-170100117.

About this article

Cite this article

Zhang, Q., Xu, Y., Zhang, J. et al. ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond. Int J Comput Vis 131, 1141–1162 (2023). https://doi.org/10.1007/s11263-022-01739-w

