An Impartial Take to the CNN vs Transformer Robustness Contest

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13673)

Following the surge in popularity of Transformers in Computer Vision, several studies have attempted to determine whether they are more robust to distribution shifts and provide better uncertainty estimates than Convolutional Neural Networks (CNNs). The almost unanimous conclusion is that they are, and it is often conjectured, more or less explicitly, that this supposed superiority is due to the self-attention mechanism. In this paper we perform extensive empirical analyses showing that recent state-of-the-art CNNs (particularly ConvNeXt [20]) can be as robust and reliable as the current state-of-the-art Transformers, and sometimes even more so. However, there is no clear winner. Therefore, although it is tempting to declare the definitive superiority of one family of architectures over another, they seem to enjoy similarly extraordinary performance on a variety of tasks while also suffering from similar vulnerabilities, such as texture, background, and simplicity biases.

  1. We omit ViT-B/32 and ViT-L/32 as we find that they always underperform ViT-B/16 and ViT-L/16 (a similar observation was made in [28]). Similarly, we omit DeiT [36] as it underperforms Swin Transformers.

  2. Consider that ViT-L/32 has about 307M parameters and ViT-L/16 about 305M, yet ViT-L/32 requires about 15 GFLOPs while ViT-L/16 requires about 61 GFLOPs; ViT-L/32 also exhibits lower accuracy and robustness than ViT-B/32 [28].

  3. We understand that defining complexity is subjective. Here we assume that a dataset that is visually more complex (having more colors, shapes, textures, etc.) across the training set requires learning more complex features.

  4. We oversample the OoD samples (\(4\times \)) so that the in-distribution and OoD datasets have 10000 samples each. We could also rebalance them by randomly sampling 2000 of the 10000 in-distribution samples, but this could induce some variance in the metrics; we also observed that the average of this strategy coincides with the oversampling strategy.
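The trade-off described in note 4 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' setup: the scores are synthetic, the set sizes (1000 and 250) are arbitrary, and the hypothetical `detection_accuracy` helper uses a fixed threshold as a stand-in for whichever metrics the paper reports. It compares deterministic oversampling of the smaller set with repeated random subsampling of the larger one.

```python
import numpy as np


def detection_accuracy(in_scores, ood_scores, threshold):
    """Fraction of samples classified correctly when a confidence score
    >= threshold is read as in-distribution and a lower score as OoD."""
    correct_in = np.sum(in_scores >= threshold)
    correct_ood = np.sum(ood_scores < threshold)
    return (correct_in + correct_ood) / (len(in_scores) + len(ood_scores))


rng = np.random.default_rng(0)

# Synthetic confidence scores (assumption: illustrative sizes and
# distributions, not the datasets used in the paper).
in_scores = rng.beta(5, 2, size=1000)
ood_scores = rng.beta(2, 3, size=250)
threshold = 0.5

# Strategy A: repeat every OoD sample so both sets have the same size.
factor = len(in_scores) // len(ood_scores)
acc_oversampled = detection_accuracy(
    in_scores, np.repeat(ood_scores, factor), threshold
)

# Strategy B: randomly subsample the in-distribution set, many times.
accs_subsampled = np.array([
    detection_accuracy(
        rng.choice(in_scores, size=len(ood_scores), replace=False),
        ood_scores,
        threshold,
    )
    for _ in range(2000)
])

print("oversampling:", acc_oversampled)
print("subsampling mean:", accs_subsampled.mean())
print("subsampling std:", accs_subsampled.std())
```

Under these assumptions, oversampling gives a single deterministic value, while each random subsample gives a slightly different one; the mean over many subsamples approaches the oversampled value, which matches the observation in the note.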


  1. Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. arXiv preprint arXiv:1907.02893 (2019)

  2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  3. Bai, Y., Mei, J., Yuille, A., Xie, C.: Are transformers more robust than CNNs? In: NeurIPS (2021)

  4. Condessa, F., Kovacevic, J., Bioucas-Dias, J.: Performance measures for classification systems with rejection. Pattern Recogn. (2015)

  5. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Sig. Syst. 2, 303–314 (1989)

  6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 CVPR, pp. 248–255 (2009)

  7. Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Sig. Process. Mag. 29(6), 141–142 (2012)

  8. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: ICLR (2021)

  9. Fort, S., Ren, J., Lakshminarayanan, B.: Exploring the limits of out-of-distribution detection. In: NeurIPS (2021)

  10. Fumera, G., Roli, F.: Support vector machines with embedded reject option. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 68–82. Springer, Heidelberg (2002)

  11. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: ICML 2017, pp. 1321–1330 (2017)

  12. Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: ICCV (2021)

  13. Hendrycks, D., Gimpel, K.: Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR abs/1606.08415 (2016)

  14. Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: CVPR (2021)

  15. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  16. Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., Bengio, S.: Fantastic generalization measures and where to find them. In: ICLR (2020)

  17. Kolesnikov, A., et al.: Big transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020)

  18. Landgrebe, T.C.W., Tax, D.M.J., Paclík, P., Duin, R.P.W.: The interaction between classification and reject performance for distance-based reject-option classifiers. Pattern Recogn. Lett. 27(8), 908–917 (2006)

  19. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021)

  20. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: CVPR (2022)

  21. Malinin, A., Mlodozeniec, B., Gales, M.: Ensemble distribution distillation. In: ICLR (2020)

  22. Minderer, M., et al.: Revisiting the calibration of modern neural networks. In: NeurIPS (2021)

  23. Morrison, K., Gilby, B., Lipchak, C., Mattioli, A., Kovashka, A.: Exploring corruption robustness: inductive biases in vision transformers and MLP-Mixers. CoRR abs/2106.13122 (2021)

  24. Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P.H., Dokania, P.K.: Calibrating deep neural networks using focal loss. In: NeurIPS (2020)

  25. Naeini, M.P., Cooper, G.F., Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2901–2907 (2015)

  26. Neyshabur, B., Bhojanapalli, S., Mcallester, D., Srebro, N.: Exploring generalization in deep learning. In: Guyon, I., et al. (eds.) NeurIPS, vol. 30. Curran Associates, Inc. (2017)

  27. Neyshabur, B., Bhojanapalli, S., Srebro, N.: A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In: ICLR (2018)

  28. Paul, S., Chen, P.Y.: Vision transformers are robust learners. In: AAAI (2022)

  29. Pinto, F., Torr, P., Dokania, P.: Are vision transformers always more robust than convolutional neural networks? In: NeurIPS Workshop on Distribution Shifts: Connecting Methods and Applications (2021)

  30. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)

  31. Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: ICML (2019)

  32. Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: ImageNet-21K pretraining for the masses (2021)

  33. Sanyal, A., Torr, P.H.S., Dokania, P.K.: Stable rank normalization for improved generalization in neural networks and GANs. In: ICLR (2020)

  34. Shah, H., Tamuly, K., Raghunathan, A., Jain, P., Netrapalli, P.: The pitfalls of simplicity bias in neural networks. In: NeurIPS (2020)

  35. Tang, S., et al.: RobustART: benchmarking robustness on architecture design and training techniques. arXiv (2021)

  36. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)

  37. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, vol. 30 (2017)

  38. Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. In: NeurIPS, pp. 10506–10518 (2019)

  39. Wightman, R.: PyTorch image models (2019)

  40. Xiao, K., Engstrom, L., Ilyas, A., Madry, A.: Noise or signal: the role of image backgrounds in object recognition. In: ICLR (2021)

  41. Yuan, L., et al.: Tokens-to-Token ViT: training vision transformers from scratch on ImageNet. In: ICCV (2021)

  42. Zhang, C., et al.: Delving deep into the generalization of vision transformers under distribution shifts. In: CVPR (2022)


This work is supported by the UKRI grant: Turing AI Fellowship EP/W002981/1 and EPSRC/MURI grant: EP/N019474/1. We would like to thank the Royal Academy of Engineering and FiveAI. Francesco Pinto’s PhD is funded by the European Space Agency (ESA). PD would like to thank Anuj Sharma and Kemal Oksuz for their comments on the draft.

Author information

Corresponding author

Correspondence to Francesco Pinto.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 748 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pinto, F., Torr, P.H.S., Dokania, P.K. (2022). An Impartial Take to the CNN vs Transformer Robustness Contest. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19777-2

  • Online ISBN: 978-3-031-19778-9

  • eBook Packages: Computer Science, Computer Science (R0)
