Abstract
Vision transformers (ViTs) have recently drawn great attention in computer vision due to their remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited resources. To alleviate this issue, we propose TinyViT, a new family of tiny, efficient vision transformers pretrained on large-scale datasets with our proposed fast distillation framework. The central idea is to transfer knowledge from large pretrained models to small ones, while enabling the small models to reap the dividends of massive pretraining data. More specifically, we apply distillation during pretraining for knowledge transfer. The logits of large teacher models are sparsified and stored on disk in advance to save memory cost and computation overhead. The tiny student transformers are automatically scaled down from a large pretrained model under computation and parameter constraints. Comprehensive experiments demonstrate the efficacy of TinyViT. It achieves a top-1 accuracy of 84.8% on ImageNet-1k with only 21M parameters, comparable to Swin-B pretrained on ImageNet-21k while using 4.2 times fewer parameters. Moreover, with increased image resolution, TinyViT reaches 86.5% accuracy, slightly better than Swin-L while using only 11% of its parameters. Last but not least, we demonstrate the good transferability of TinyViT on various downstream tasks. Code and models are available at https://github.com/microsoft/Cream/tree/main/TinyViT.
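To make the distillation pipeline in the abstract concrete, here is a minimal PyTorch sketch, not the released implementation: a one-time teacher pass that keeps only the top-K probabilities per image and writes them to disk, and a student loss that reconstructs dense soft labels from those sparse entries by spreading the leftover probability mass uniformly over the remaining classes. The constant K = 100, the function names, and the .npz storage layout are assumptions for illustration; the actual framework also records data-augmentation parameters so each stored logit vector matches its exact augmented view, a bookkeeping step omitted here.

```python
# Sketch of TinyViT-style fast pretraining distillation (illustrative, not the
# authors' code): run the teacher once, store sparse top-K soft labels on disk,
# then train the student against the stored labels without the teacher in memory.
import numpy as np
import torch
import torch.nn.functional as F

K = 100               # logits kept per image (assumption; an illustrative choice)
NUM_CLASSES = 21841   # e.g. ImageNet-21k

@torch.no_grad()
def save_teacher_logits(teacher, loader, path):
    """One-time pass: store sparse top-K teacher probabilities on disk."""
    teacher.eval()
    values, indices = [], []
    for images, _ in loader:
        probs = F.softmax(teacher(images), dim=-1)
        topv, topi = probs.topk(K, dim=-1)
        values.append(topv.cpu().numpy().astype(np.float16))   # fp16 to cut storage
        indices.append(topi.cpu().numpy().astype(np.int32))
    np.savez(path, values=np.concatenate(values), indices=np.concatenate(indices))

def densify(topv, topi, num_classes=NUM_CLASSES):
    """Rebuild soft labels: stored mass at the top-K classes, remainder uniform."""
    topv = topv.float()
    rest = (1.0 - topv.sum(dim=-1, keepdim=True)).clamp(min=0)
    soft = torch.zeros(topv.size(0), num_classes)
    soft += rest / (num_classes - K)        # smooth the tail uniformly
    soft.scatter_(1, topi.long(), topv)     # restore the exact top-K entries
    return soft

def distill_loss(student_logits, topv, topi):
    """Cross-entropy between stored teacher soft labels and student predictions."""
    soft_targets = densify(topv, topi).to(student_logits.device)
    return -(soft_targets * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
```

Because the teacher never runs during student training, per-step memory and compute fall to the student alone, and under this layout the stored labels cost only about 6K bytes per image (fp16 value plus int32 index for each of the K entries).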
K. Wu, J. Zhang, and H. Peng contributed equally. Work done while Kan and Jinnian were interns at Microsoft.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, K. et al. (2022). TinyViT: Fast Pretraining Distillation for Small Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13681. Springer, Cham. https://doi.org/10.1007/978-3-031-19803-8_5
DOI: https://doi.org/10.1007/978-3-031-19803-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19802-1
Online ISBN: 978-3-031-19803-8
eBook Packages: Computer Science, Computer Science (R0)