
Self-slimmed Vision Transformer

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13671)


Abstract

Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks. However, such powerful transformers incur a heavy computation burden because of the exhaustive token-to-token comparison. Previous works focus on dropping insignificant tokens to reduce the computational cost of ViTs, but as the dropping ratio increases, this hard manner inevitably discards vital tokens, which limits its efficiency. To address this issue, we propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT. Specifically, we first design a novel Token Slimming Module (TSM), which boosts the inference efficiency of ViTs through dynamic token aggregation. Unlike token hard dropping, our TSM softly integrates redundant tokens into fewer informative ones; it can dynamically zoom visual attention without cutting off discriminative token relations in the images, even at a high slimming ratio. Furthermore, we introduce a concise Feature Recalibration Distillation (FRD) framework, in which we design a reverse version of TSM (RTSM) to recalibrate the unstructured tokens in a flexible auto-encoder manner. Because teacher and student share a similar structure, our FRD can effectively leverage structural knowledge for better convergence. Finally, we conduct extensive experiments to evaluate our SiT. The results demonstrate that our method can speed up ViTs by \(1.7\times \) with a negligible accuracy drop, and even by \(3.6\times \) while maintaining \(97\%\) of their performance. Surprisingly, by simply arming LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet. Code is available at https://github.com/Sense-X/SiT.
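For a concrete picture of the soft token aggregation idea described in the abstract, the following is a minimal PyTorch sketch: instead of discarding tokens, a learned, normalized weighting combines all N input tokens into K < N output tokens. This is only an illustration under our own assumptions (the SoftTokenSlimming module name, the single linear projection, and the token counts are hypothetical), not the authors' TSM implementation; see the official repository for the actual method.

```python
# Illustrative sketch of "soft" token slimming: N input tokens are aggregated
# into K < N output tokens through learned, normalized weights, so no token is
# hard-dropped. NOT the authors' TSM; names and design details are assumptions.
import torch
import torch.nn as nn


class SoftTokenSlimming(nn.Module):
    """Aggregate N input tokens into K < N output tokens via learned weights."""

    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # For each input token, predict its contribution to every output token.
        self.weight_proj = nn.Linear(dim, num_out_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) patch tokens
        scores = self.weight_proj(x)            # (batch, N, K)
        weights = scores.softmax(dim=1)         # normalize over the N inputs
        slimmed = weights.transpose(1, 2) @ x   # (batch, K, dim) soft aggregation
        return slimmed


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 384)           # e.g. 14x14 patch tokens, dim 384
    slim = SoftTokenSlimming(dim=384, num_out_tokens=49)
    print(slim(tokens).shape)                   # torch.Size([2, 49, 384])
```

The point mirrored here is that every input token contributes to some output token, so a high slimming ratio reduces the token count without hard-cutting any token relation, in contrast to dropping-based methods.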

Z. Zong and K. Li contributed equally during their internship at SenseTime.



Acknowledgements

This work is partially supported by National Key R&D Program of China under Grant 2019YFB2102400, National Natural Science Foundation of China (61876176), the Joint Lab of CAS-HK, Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).

Author information

Corresponding author

Correspondence to Yu Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1124 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zong, Z. et al. (2022). Self-slimmed Vision Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13671. Springer, Cham. https://doi.org/10.1007/978-3-031-20083-0_26

  • DOI: https://doi.org/10.1007/978-3-031-20083-0_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20082-3

  • Online ISBN: 978-3-031-20083-0

  • eBook Packages: Computer Science, Computer Science (R0)
