Abstract
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that captures global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both “spatial tokens” and “channel tokens”. With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show that DaViT backbones achieve state-of-the-art performance on four different tasks. Specifically, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K without extra training data, using 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/microsoft/DaViT.
M. Ding—This work was done while Mingyu was an intern at Microsoft.
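To make the two token scopes concrete, here is a minimal PyTorch-style sketch of the idea (our own illustration, not the official DaViT implementation; module names such as ChannelGroupAttention and SpatialWindowAttention, and parameters such as num_groups and window_size, are assumptions): spatial attention treats each patch as a token and restricts it to local windows, while channel attention treats each grouped channel as a token whose feature vector spans all spatial positions, which is what gives it a global view at linear cost in the number of patches.

# Minimal sketch (not the official DaViT code) of the two attention types
# described in the abstract. Assumptions: PyTorch, single-head attention,
# channels divisible by num_groups, feature-map sides divisible by window_size.
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Attention over channel tokens: channels define the token scope,
    spatial positions form each token's feature vector."""

    def __init__(self, dim, num_groups=8):
        super().__init__()
        self.num_groups = num_groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):            # x: (B, N, C), N = H*W patches
        B, N, C = x.shape
        g, cg = self.num_groups, C // self.num_groups
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape so that channels (within each group) act as tokens of length N.
        q = q.reshape(B, N, g, cg).permute(0, 2, 3, 1)   # (B, g, cg, N)
        k = k.reshape(B, N, g, cg).permute(0, 2, 3, 1)
        v = v.reshape(B, N, g, cg).permute(0, 2, 3, 1)
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)   # (B, g, cg, cg) channel-to-channel scores
        out = attn.softmax(dim=-1) @ v                   # every score already mixes all N positions
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


class SpatialWindowAttention(nn.Module):
    """Attention over spatial tokens, restricted to non-overlapping windows
    so the cost stays linear in the number of patches."""

    def __init__(self, dim, window_size=7):
        super().__init__()
        self.ws = window_size
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):      # x: (B, H*W, C)
        B, N, C = x.shape
        ws = self.ws
        # Partition the H x W grid into (ws x ws) windows and fold them into the batch.
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * (C ** -0.5)   # attention among the ws*ws patches of a window
        x = self.proj(attn.softmax(dim=-1) @ v)
        # Undo the window partition.
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x

For example, on a 56x56 feature map with 96 channels, the spatial path attends within 7x7 windows, while the channel path folds all 3,136 positions into each of the 96 channel tokens, which is what gives the channel branch its global receptive field.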
Acknowledgement
Ping Luo is supported by the General Research Fund of Hong Kong (No. 27208720, No. 17212120, and No. 17200622).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J., Yuan, L. (2022). DaViT: Dual Attention Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_5
DOI: https://doi.org/10.1007/978-3-031-20053-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20052-6
Online ISBN: 978-3-031-20053-3
eBook Packages: Computer Science, Computer Science (R0)