
Self-slimmed Vision Transformer

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13671)


Abstract

Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks. However, such powerful transformers incur a heavy computation burden because of the exhaustive token-to-token comparison. Previous works focus on dropping insignificant tokens to reduce the computational cost of ViTs, but as the dropping ratio increases, this hard manner inevitably discards vital tokens, which limits its efficiency. To address this issue, we propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT. Specifically, we first design a novel Token Slimming Module (TSM), which boosts the inference efficiency of ViTs through dynamic token aggregation. Unlike token hard dropping, our TSM softly integrates redundant tokens into fewer informative ones; it can dynamically zoom visual attention without cutting off discriminative token relations in the images, even at a high slimming ratio. Furthermore, we introduce a concise Feature Recalibration Distillation (FRD) framework, in which we design a reverse version of TSM (RTSM) to recalibrate the unstructured tokens in a flexible auto-encoder manner. Because teacher and student share a similar structure, our FRD can effectively leverage structural knowledge for better convergence. Finally, we conduct extensive experiments to evaluate our SiT. The results demonstrate that our method can speed up ViTs by \(1.7\times \) with a negligible accuracy drop, and even by \(3.6\times \) while maintaining \(97\%\) of their performance. Surprisingly, by simply arming LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet. Code is available at https://github.com/Sense-X/SiT.
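For a concrete picture of the soft token aggregation idea described in the abstract, the following is a minimal PyTorch sketch: instead of discarding tokens, a learned, normalized weighting combines all N input tokens into K < N output tokens. This is only an illustration under our own assumptions (the SoftTokenSlimming module name, the single linear projection, and the token counts are hypothetical), not the authors' TSM implementation; see the official repository for the actual method.

```python
# Illustrative sketch of "soft" token slimming: N input tokens are aggregated
# into K < N output tokens through learned, normalized weights, so no token is
# hard-dropped. NOT the authors' TSM; names and design details are assumptions.
import torch
import torch.nn as nn


class SoftTokenSlimming(nn.Module):
    """Aggregate N input tokens into K < N output tokens via learned weights."""

    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # For each input token, predict its contribution to every output token.
        self.weight_proj = nn.Linear(dim, num_out_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) patch tokens
        scores = self.weight_proj(x)            # (batch, N, K)
        weights = scores.softmax(dim=1)         # normalize over the N inputs
        slimmed = weights.transpose(1, 2) @ x   # (batch, K, dim) soft aggregation
        return slimmed


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 384)           # e.g. 14x14 patch tokens, dim 384
    slim = SoftTokenSlimming(dim=384, num_out_tokens=49)
    print(slim(tokens).shape)                   # torch.Size([2, 49, 384])
```

The point mirrored here is that every input token contributes to some output token, so a high slimming ratio reduces the token count without hard-cutting any token relation, in contrast to dropping-based methods.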

Z. Zong and K. Li contributed equally during their internship at SenseTime.



Acknowledgements

This work is partially supported by National Key R&D Program of China under Grant 2019YFB2102400, National Natural Science Foundation of China (61876176), the Joint Lab of CAS-HK, Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).

Author information

Corresponding author

Correspondence to Yu Liu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1124 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zong, Z. et al. (2022). Self-slimmed Vision Transformer. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13671. Springer, Cham. https://doi.org/10.1007/978-3-031-20083-0_26

  • DOI: https://doi.org/10.1007/978-3-031-20083-0_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20082-3

  • Online ISBN: 978-3-031-20083-0

  • eBook Packages: Computer Science, Computer Science (R0)
