Abstract
This paper tackles the high computational and space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as is common practice, and view each patch as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models global dependencies over the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
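To make the hierarchy concrete, the following is a minimal PyTorch sketch of the three steps described above: attention within local windows, attention over merged tokens, and aggregation of the two. It is illustrative only and not the authors' implementation (see the repository above for that); the class name, the window size, the average-pooling used to merge patches, the nearest-neighbor upsampling, and the sum aggregation are all assumptions made for this sketch.

```python
# A minimal sketch of hierarchical attention in the spirit of H-MHSA.
# NOT the authors' implementation; window size, average-pool merging,
# nearest-neighbor upsampling, and sum aggregation are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=4, window=7, merge=2):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window  # side length of a local patch group (assumed)
        self.merge = merge    # pooling factor that merges small patches (assumed)

    def forward(self, x):
        # x: (B, H, W, C) token map; H and W are assumed divisible
        # by both the window size and the merge factor.
        B, H, W, C = x.shape
        w = self.window

        # Step 1: local relationship modeling -- attention within w x w windows.
        local = x.view(B, H // w, w, W // w, w, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        local, _ = self.local_attn(local, local, local)
        local = local.view(B, H // w, W // w, w, w, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

        # Step 2: merge small patches into larger ones, then attend globally
        # over the (much smaller) set of merged tokens.
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.merge)  # (B, C, H/m, W/m)
        Hm, Wm = pooled.shape[2], pooled.shape[3]
        g = pooled.flatten(2).transpose(1, 2)                     # (B, Hm*Wm, C)
        g, _ = self.global_attn(g, g, g)
        g = g.transpose(1, 2).reshape(B, C, Hm, Wm)
        g = F.interpolate(g, size=(H, W), mode="nearest").permute(0, 2, 3, 1)

        # Step 3: aggregate the local and global attentive features.
        return local + g

# Usage: a 56 x 56 token map with 64 channels.
tokens = torch.randn(2, 56, 56, 64)
out = HierarchicalAttentionSketch(dim=64)(tokens)
print(out.shape)  # torch.Size([2, 56, 56, 64])
```

In this sketch, each attention call sees at most w² local tokens or (H/m)·(W/m) merged tokens, which illustrates why the hierarchical scheme avoids the quadratic cost of full-resolution MHSA while still propagating global context.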
Acknowledgements
This work was supported by A*STAR Career Development Fund, Singapore (No. C233312006). Open access funding provided by Swiss Federal Institute of Technology Zurich.
Ethics declarations
The authors declare that they have no conflicts of interest regarding this work.
Additional information
Colored figures are available in the online version at https://link.springer.com/journal/11633
Yun Liu received the B.Eng. and Ph. D. degrees in computer science from Nankai University, China in 2016 and 2020, respectively. Then, he worked with Prof. Luc Van Gool for one and a half years as a postdoctoral scholar at Computer Vision Lab, ETH Zürich, Switzerland. Currently, he is a senior scientist at Institute for Infocomm Research (I2R), A*STAR, Singapore.
His research interests include computer vision and machine learning (especially deep learning).
Yu-Huan Wu received the Ph. D. degree in computer science from Nankai University, China in 2022, advised by Prof. Ming-Ming Cheng. He is a scientist at the Institute of High Performance Computing (IHPC), A*STAR, Singapore. He has published more than 10 papers in top-tier journals and conferences such as IEEE TPAMI, IEEE TIP, CVPR, and ICCV.
His research interests include computer vision and medical imaging.
Guolei Sun received the M. Sc. degree in computer science from King Abdullah University of Science and Technology, Saudi Arabia in 2018. From 2018 to 2019, he worked as a research engineer at the Inception Institute of Artificial Intelligence, UAE. Currently, he is a Ph. D. candidate at ETH Zürich, Switzerland, under the supervision of Prof. Luc Van Gool. He has published more than 20 papers in top journals and conferences such as TPAMI, CVPR, ICCV, and ECCV.
His research interests include computer vision and deep learning for tasks such as semantic segmentation, video understanding, and object counting.
Le Zhang received the Ph. D. degree in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore in 2016. From 2016 to 2018, he was a postdoctoral fellow at the Advanced Digital Sciences Center (ADSC), Singapore. From 2018 to 2021, he was a research scientist at the Institute for Infocomm Research (I2R), A*STAR, Singapore. He is now a professor with the School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), China, and an Associate Editor of Neural Networks, Neurocomputing, and IET Biometrics.
His research interests include computer vision and machine learning.
Ajad Chhatkuli received the M. Sc. degree in computer vision from the University of Burgundy, France in 2013, and the Ph. D. degree in computer vision from the University of Clermont Auvergne, France in 2017 under the supervision of Prof. Adrien Bartoli and Dr. Daniel Pizarro. He is currently a postdoctoral researcher supervised by Prof. Luc Van Gool at ETH Zürich, Switzerland.
His research interests include template-based and template-free non-rigid 3D reconstruction.
Luc Van Gool received the B.Eng. degree in electromechanical engineering from the Katholieke Universiteit Leuven, Belgium in 1981. Currently, he is a professor at the Katholieke Universiteit Leuven, Belgium and at ETH Zürich, Switzerland, where he leads computer vision research and teaches at both institutions. He has been a program committee member of several major computer vision conferences. He received several Best Paper awards, won a David Marr Prize and a Koenderink Award, and was nominated Distinguished Researcher by the IEEE Computer Science committee. He is a co-founder of 10 spin-off companies.
His research interests include 3D reconstruction and modelling, object recognition, tracking, gesture analysis, and the combination of those.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Y., Wu, YH., Sun, G. et al. Vision Transformers with Hierarchical Attention. Mach. Intell. Res. 21, 670–683 (2024). https://doi.org/10.1007/s11633-024-1393-8