Are Vision Transformers Robust to Patch Perturbations?

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13672)

Recent advances in Vision Transformers (ViTs) have demonstrated impressive performance in image classification, making them a promising alternative to Convolutional Neural Networks (CNNs). Unlike a CNN, a ViT represents an input image as a sequence of image patches. This patch-based input representation raises an interesting question: how does ViT perform, compared to CNNs, when individual input image patches are perturbed with natural corruptions or adversarial perturbations? In this work, we study the robustness of ViT to patch-wise perturbations. Surprisingly, we find that ViTs are more robust to naturally corrupted patches than CNNs, whereas they are more vulnerable to adversarial patches. Furthermore, we discover that the attention mechanism greatly affects the robustness of vision transformers. Specifically, the attention module can help improve the robustness of ViT by effectively ignoring naturally corrupted patches. However, when ViTs are attacked by an adversary, the attention mechanism can be easily fooled into focusing more on the adversarially perturbed patches, causing a misclassification. Based on our analysis, we propose a simple temperature-scaling-based method to improve the robustness of ViT against adversarial patches. Extensive qualitative and quantitative experiments are performed to support our findings, understanding, and improvement of ViT robustness to patch-wise perturbations across a set of transformer-based architectures.
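
The abstract does not spell out where the temperature is applied, so the following is only a minimal sketch under one assumption: that temperature scaling divides the attention logits before the softmax, which flattens the attention distribution. The function name `attention_weights`, the toy logits, and the temperature value 4.0 are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def attention_weights(logits, temperature=1.0):
    """Row-wise softmax of attention logits divided by a temperature.

    Assumption for illustration: the temperature acts on the attention
    softmax. A temperature above 1 flattens the distribution.
    """
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy attention logits of one query over five patches. Patch 2 stands in
# for a patch whose logit an adversary has inflated (hypothetical values).
logits = np.array([[0.5, 0.2, 6.0, 0.1, 0.3]])

sharp = attention_weights(logits, temperature=1.0)
smooth = attention_weights(logits, temperature=4.0)

print("T=1 weight on patch 2:", sharp[0, 2])
print("T=4 weight on patch 2:", smooth[0, 2])
```

In this toy setting the larger temperature reduces the share of attention captured by the single dominant patch while leaving the ranking of patches unchanged, which matches the intuition in the abstract that attention drawn to an adversarial patch is what causes the mistake.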


Author information

Corresponding author

Correspondence to Jindong Gu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gu, J., Tresp, V., Qin, Y. (2022). Are Vision Transformers Robust to Patch Perturbations? In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13672. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19774-1

  • Online ISBN: 978-3-031-19775-8

  • eBook Packages: Computer Science, Computer Science (R0)
