Abstract
Vision transformers have recently achieved remarkable success on various computer vision tasks; however, their high model complexity makes them difficult to deploy on resource-constrained devices. Quantization is an effective approach to reducing model complexity, and data-free quantization, which can address data privacy and security concerns during deployment, has received widespread interest. Unfortunately, existing methods, such as BN regularization, were designed for convolutional neural networks and cannot be applied to vision transformers, whose architectures differ significantly. In this paper, we propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers, which generates “realistic” samples based on the vision transformer’s unique properties for calibrating the quantization parameters. Specifically, we analyze the properties of the self-attention module and reveal a general difference (patch similarity) in its processing of Gaussian noise versus real images. This insight guides us to design a relative value metric that optimizes the Gaussian noise to approximate real images, which are then used to calibrate the quantization parameters. Extensive experiments and ablation studies on various benchmarks validate the effectiveness of PSAQ-ViT, which can even outperform real-data-driven methods. Code is available at: https://github.com/zkkli/PSAQ-ViT.
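To make the idea concrete, below is a minimal sketch of the sample-generation step described in the abstract: pairwise patch similarities are collected from a frozen full-precision ViT, their distribution is estimated with a differentiable kernel density estimate, and Gaussian noise images are optimized so that this distribution becomes more diverse (entropy-like objective), after which the images calibrate the quantizer. This is an illustrative approximation, not the released implementation; `model_forward_with_hooks` and the hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_similarity_entropy(feats, bandwidth=0.05, n_points=64):
    # feats: (B, N, D) patch tokens taken at the output of one self-attention block
    feats = F.normalize(feats, dim=-1)
    sim = (feats @ feats.transpose(1, 2)).flatten()      # pairwise cosine similarities
    # Differentiable Gaussian kernel density estimate of the similarity distribution
    grid = torch.linspace(-1.0, 1.0, n_points, device=sim.device)
    dist = (grid[:, None] - sim[None, :]) / bandwidth
    density = torch.exp(-0.5 * dist ** 2).mean(dim=1) + 1e-8
    density = density / density.sum()
    # Entropy of the (normalized) density; higher means more diverse patch responses
    return -(density * density.log()).sum()

# Optimize Gaussian noise so its patch-similarity statistics better resemble real images
x = torch.randn(8, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.2)
for _ in range(500):
    optimizer.zero_grad()
    # Hypothetical helper: runs the frozen FP32 ViT and returns per-block patch tokens
    feats_per_block = model_forward_with_hooks(x)
    loss = -sum(patch_similarity_entropy(f) for f in feats_per_block)  # maximize entropy
    loss.backward()
    optimizer.step()

# The optimized images x are then fed through the model once more to collect
# activation ranges and set the quantization parameters (calibration).
```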
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 62276255; in part by the Scientific Instrument Developing Project of the Chinese Academy of Sciences under Grant YJKYYQ20200045.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Z., Ma, L., Chen, M., Xiao, J., Gu, Q. (2022). Patch Similarity Aware Data-Free Quantization for Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13671. Springer, Cham. https://doi.org/10.1007/978-3-031-20083-0_10
DOI: https://doi.org/10.1007/978-3-031-20083-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20082-3
Online ISBN: 978-3-031-20083-0