
DaViT: Dual Attention Vision Transformers

Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13684)

Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both “spatial tokens” and “channel tokens”. With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show that DaViT backbones achieve state-of-the-art performance on four different tasks. Specifically, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K without extra training data, using 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/microsoft/DaViT.

M. Ding—This work was done while Mingyu was an intern at Microsoft.
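As a rough illustration of the dual-attention idea described in the abstract, the sketch below pairs a grouped channel attention (channels act as tokens, spatial positions act as the feature dimension) with a windowed spatial attention. It is a minimal PyTorch sketch under stated assumptions: the module names, grouping scheme, and hyperparameters are illustrative and do not reproduce the official microsoft/DaViT implementation.

```python
# Illustrative sketch only: minimal renderings of the two attention variants
# described in the abstract. Names and hyperparameters are hypothetical and
# do not mirror the official microsoft/DaViT code.
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Attention over channel tokens: groups of channels attend to each other,
    with the spatial dimension serving as the token feature dimension."""

    def __init__(self, dim: int, groups: int = 8):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N spatial positions and C channels
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)           # each: (B, groups, C//groups, N)
        q = q * (N ** -0.5)                            # scale by the (spatial) feature dim
        attn = (q @ k.transpose(-2, -1)).softmax(-1)   # (B, groups, C//g, C//g): channel-to-channel
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)


class SpatialWindowAttention(nn.Module):
    """Standard multi-head self-attention applied within local spatial windows."""

    def __init__(self, dim: int, heads: int = 8, window: int = 7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, H*W, C); partition into non-overlapping window x window patches
        B, N, C = x.shape
        w = self.window
        assert H % w == 0 and W % w == 0
        x = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)                    # (B * num_windows, w*w, C)
        x, _ = self.attn(x, x, x)                      # attention within each window
        x = x.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, N, C)


if __name__ == "__main__":
    B, H, W, C = 2, 14, 14, 96
    tokens = torch.randn(B, H * W, C)
    y = SpatialWindowAttention(C)(tokens, H, W)        # local, fine-grained interactions
    z = ChannelGroupAttention(C)(y)                    # global interactions via channel tokens
    print(y.shape, z.shape)                            # both torch.Size([2, 196, 96])
```

In a DaViT-style block the two attentions alternate (spatial window attention followed by channel group attention), each wrapped with the usual residual connections and feed-forward layers; only the attention cores are sketched here.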



Acknowledgement

Ping Luo is supported by the General Research Fund of Hong Kong (No. 27208720, No. 17212120, and No. 17200622).

Author information

Corresponding authors

Correspondence to Bin Xiao or Ping Luo.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 260 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J., Yuan, L. (2022). DaViT: Dual Attention Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_5

  • DOI: https://doi.org/10.1007/978-3-031-20053-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20052-6

  • Online ISBN: 978-3-031-20053-3

  • eBook Packages: Computer Science, Computer Science (R0)
