
FrequencyLowCut Pooling - Plug and Play Against Catastrophic Overfitting

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Over the last years, Convolutional Neural Networks (CNNs) have been the dominant neural architecture across a wide range of computer vision tasks. From an image and signal processing point of view, this success may be surprising, as the inherent spatial pyramid design of most CNNs apparently violates basic signal processing laws, i.e. the Sampling Theorem, in their down-sampling operations. However, since poor sampling appeared not to affect model accuracy, this issue had been broadly neglected until model robustness started to receive more attention. Recent work [18] in the context of adversarial attacks and distribution shifts showed that there is a strong correlation between the vulnerability of CNNs and the aliasing artifacts induced by poor down-sampling operations. This paper builds on these findings and introduces an aliasing-free down-sampling operation that can easily be plugged into any CNN architecture: FrequencyLowCut pooling. Our experiments show that, in combination with simple Fast Gradient Sign Method (FGSM) adversarial training, our hyper-parameter-free operator substantially improves model robustness and avoids catastrophic overfitting. Our code is available at https://github.com/GeJulia/flc_pooling.


References

  1. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 484–501. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_29
  2. Andriushchenko, M., Flammarion, N.: Understanding and improving fast adversarial training. In: Advances in Neural Information Processing Systems, vol. 33, pp. 16048–16059 (2020)
  3. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017)
  4. Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J.C., Liang, P.S.: Unlabeled data improves adversarial robustness. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
  5. Chaman, A., Dokmanic, I.: Truly shift-invariant convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3773–3783 (2021)
  6. Chen, T., Zhang, Z., Liu, S., Chang, S., Wang, Z.: Robust overfitting may be mitigated by properly learned smoothening. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=qZzy5urZw9
  7. Cheng, M., Le, T., Chen, P.Y., Yi, J., Zhang, H., Hsieh, C.J.: Query-efficient hard-label black-box attack: an optimization-based approach. arXiv preprint arXiv:1807.04457 (2018)
  8. Croce, F., et al.: RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670 (2020)
  9. Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML (2020)
  10. Darlow, L.N., Crowley, E.J., Antoniou, A., Storkey, A.J.: CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505 (2018)
  11. Durall, R., Keuper, M., Keuper, J.: Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions (2020)
  12. Engstrom, L., Ilyas, A., Salman, H., Santurkar, S., Tsipras, D.: Robustness (Python library) (2019). https://github.com/MadryLab/robustness
  13. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231 (2018)
  14. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall Inc. (2006)
  15. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2015)
  16. Gowal, S., Qin, C., Uesato, J., Mann, T., Kohli, P.: Uncovering the limits of adversarial training against norm-bounded adversarial examples (2021)
  17. Gowal, S., Rebuffi, S.A., Wiles, O., Stimberg, F., Calian, D.A., Mann, T.A.: Improving robustness using generated data. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
  18. Grabinski, J., Keuper, J., Keuper, M.: Aliasing coincides with CNNs vulnerability towards adversarial attacks. In: The AAAI-2022 Workshop on Adversarial Machine Learning and Beyond (2022). https://openreview.net/forum?id=vKc1mLxBebP
  19. Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: Proceedings of the International Conference on Learning Representations (2019)
  20. Hendrycks, D., Lee, K., Mazeika, M.: Using pre-training can improve model robustness and uncertainty. In: International Conference on Machine Learning, pp. 2712–2721. PMLR (2019)
  21. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851 (2020)
  22. Hossain, M.T., Teng, S.W., Sohel, F., Lu, G.: Anti-aliasing deep image classifiers using novel depth adaptive blurring and activation function (2021)
  23. Jung, S., Keuper, M.: Spectral distribution aware image generation. In: AAAI (2021)
  24. Karras, T., et al.: Alias-free generative adversarial networks. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
  25. Kim, H., Lee, W., Lee, J.: Understanding catastrophic overfitting in single-step adversarial training (2020)
  26. Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto, May 2012
  27. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial machine learning at scale (2017)
  28. Li, Q., Shen, L., Guo, S., Lai, Z.: Wavelet integrated CNNs for noise-robust image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7245–7254 (2020)
  29. Li, Q., Shen, L., Guo, S., Lai, Z.: WaveCNet: wavelet integrated CNNs to suppress aliasing effect for noise-robust image classification. IEEE Trans. Image Process. 30, 7074–7089 (2021). https://doi.org/10.1109/tip.2021.3101395
  30. Lohn, A.J.: Downscaling attack and defense: turning what you see back into what you get (2020)
  31. Lorenz, P., Strassel, D., Keuper, M., Keuper, J.: Is RobustBench/AutoAttack a suitable benchmark for adversarial robustness? In: The AAAI-2022 Workshop on Adversarial Machine Learning and Beyond (2022). https://openreview.net/forum?id=aLB3FaqoMBs
  32. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
  33. Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582 (2016)
  34. Rade, R., Moosavi-Dezfooli, S.M.: Helper-based adversarial training: reducing excessive margin to achieve a better accuracy vs. robustness trade-off. In: ICML 2021 Workshop on Adversarial Machine Learning (2021). https://openreview.net/forum?id=BuD2LmNaU3a
  35. Rebuffi, S.A., Gowal, S., Calian, D.A., Stimberg, F., Wiles, O., Mann, T.: Fixing data augmentation to improve adversarial robustness (2021)
  36. Rice, L., Wong, E., Kolter, Z.: Overfitting in adversarially robust deep learning. In: International Conference on Machine Learning, pp. 8093–8104. PMLR (2020)
  37. Rony, J., Hafemann, L.G., Oliveira, L.S., Ayed, I.B., Sabourin, R., Granger, E.: Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4322–4330 (2019)
  38. Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., Madry, A.: Do adversarially robust ImageNet models transfer better? In: Advances in Neural Information Processing Systems, vol. 33, pp. 3533–3545 (2020)
  39. Sehwag, V., et al.: Improving adversarial robustness using proxy distributions (2021)
  40. Shannon, C.: Communication in the presence of noise. Proc. IRE 37(1), 10–21 (1949). https://doi.org/10.1109/JRPROC.1949.232969
  41. Stutz, D., Hein, M., Schiele, B.: Relating adversarially robust generalization to flat minima (2021)
  42. Szegedy, C., et al.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014). http://arxiv.org/abs/1312.6199
  43. Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., Gu, Q.: Improving adversarial robustness requires revisiting misclassified examples. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=rklOg6EFwS
  44. Wong, E., Rice, L., Kolter, J.Z.: Fast is better than free: revisiting adversarial training. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJx040EFvH
  45. Wu, D., Xia, S.T., Wang, Y.: Adversarial weight perturbation helps robust generalization. In: Advances in Neural Information Processing Systems, vol. 33, pp. 2958–2969 (2020)
  46. Xiao, Q., Li, K., Zhang, D., Jin, Y.: Wolf in sheep's clothing - the downscaling attack against deep learning applications (2017)
  47. Zhang, H., Yu, Y., Jiao, J., Xing, E.P., Ghaoui, L.E., Jordan, M.I.: Theoretically principled trade-off between robustness and accuracy. In: International Conference on Machine Learning (2019)
  48. Zhang, J., Zhu, J., Niu, G., Han, B., Sugiyama, M., Kankanhalli, M.: Geometry-aware instance-reweighted adversarial training. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=iAX0l6Cz8ub
  49. Zhang, R.: Making convolutional networks shift-invariant again. In: ICML (2019)
  50. Zou, X., Xiao, F., Yu, Z., Lee, Y.J.: Delving deeper into anti-aliasing in convnets. In: BMVC (2020)

Author information

Correspondence to Julia Grabinski.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 417 KB)

A Appendix

A.1 Training Schedules

CIFAR-10 Adversarial Training Schedule: For our baseline experiments on CIFAR-10, we used the PRN-18 as well as the WRN-28-10 architecture, as they offer a good trade-off between complexity and feasibility. For the PRN-18 models, we trained for 300 epochs with a batch size of 512 and a cyclic learning rate schedule with a maximal learning rate of 0.2 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5e^{-4}\). The loss is computed via cross-entropy, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 8/255 and an \(\alpha \) of 10/255 (in fast FGSM, the attack is computed once with step size \(\alpha \) and then projected back onto the \(\epsilon \)-ball). For the WRN-28-10, we used a similar training schedule but trained for only 200 epochs with a smaller maximal learning rate of 0.08.

CIFAR-10 Clean Training Schedule: Each model is trained without AT. We used 300 epochs and a batch size of 512 for each training run, with a cyclic learning rate schedule with a maximal learning rate of 0.2 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5e^{-4}\). The loss is computed via cross-entropy, and we use Stochastic Gradient Descent (SGD) as the optimizer.

CINIC-10 Adversarial Training Schedule: For our baseline experiments on CINIC-10, we used the PRN-18 architecture. We used 300 epochs and a batch size of 512 for each training run, with a cyclic learning rate schedule with a maximal learning rate of 0.1 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5e^{-4}\). The loss is computed via cross-entropy, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 8/255 and an \(\alpha \) of 10/255.

CIFAR-100 Adversarial Training Schedule: For our baseline experiments on CIFAR-100, we used the PRN-18 architecture, as it offers a good trade-off between complexity and feasibility. We used 300 epochs and a batch size of 512 for each training run, with a cyclic learning rate schedule with a maximal learning rate of 0.01 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5e^{-4}\). The loss is computed via cross-entropy, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 8/255 and an \(\alpha \) of 10/255.

ImageNet Adversarial Training Schedule: For our experiment on ImageNet, we used the ResNet50 architecture. We trained for 150 epochs with a batch size of 400 and a multi-step learning rate schedule with an initial learning rate of 0.1, \(\gamma =0.1\), and milestones at [30, 60, 90, 120]. We set the momentum to 0.9 and the weight decay to \(5e^{-4}\). The loss is computed via cross-entropy, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 4/255 and an \(\alpha \) of 5/255.
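The FGSM step used in all of the schedules above (one signed-gradient step of size \(\alpha \), projected back onto the \(\epsilon \)-ball) can be sketched in a few lines. This is a minimal numpy illustration of the perturbation rule only, not the authors' training code; the function name is ours.

```python
import numpy as np

def fgsm_perturb(x, grad, alpha=10 / 255, eps=8 / 255):
    """Fast FGSM step: move by alpha along the gradient sign, then
    project the perturbation onto the L-inf eps-ball and clip back
    to the valid pixel range [0, 1]."""
    delta = alpha * np.sign(grad)
    delta = np.clip(delta, -eps, eps)   # projection onto the eps-ball
    return np.clip(x + delta, 0.0, 1.0)

# toy example with a uniform gradient: the 10/255 step is clipped to 8/255
x = np.full((3, 32, 32), 0.5)
grad = np.ones_like(x)
x_adv = fgsm_perturb(x, grad)
print(np.max(np.abs(x_adv - x)))  # 8/255, i.e. about 0.0314
```

Because \(\alpha > \epsilon \), the step always saturates the \(\epsilon \)-ball wherever the gradient is nonzero, which is the intended behavior of this fast single-step attack.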

A.2 ImageNet Training Efficiency

When evaluating practical training times (in minutes per epoch) on ImageNet, we cannot measure a difference in cost between a ResNet50 with FLC pooling and one with strided convolutions.

We varied the number of dataloader workers during clean training on 4 A-100 GPUs and measured \({\approx }43\) min for 12 workers, \({\approx }22\) min for 48 workers and \({\approx }18\) min for 72 workers for both variants. FGSM-based AT with the pipeline by [42] takes 1:07 h per epoch for both FLC pooling and strided convolutions. We conclude that training with FLC pooling is scalable in terms of practical runtime (any increase lies in the millisecond-to-second range) and that training times are likely governed by other factors.

The training time of our model should be comparable to that of Wong et al. [44], while other reported methods have significantly longer training times. Yet, the clean accuracy of the proposed model using FLC pooling improves by about 8% over the one reached by [44], with a 1% improvement in robust accuracy. For example, [12] increases training time by a factor of four compared to our model, already on CIFAR-10 (see Table 6), while achieving results overall comparable to ours. The model by Salman et al. [38] is trained with the training schedule from Madry et al. [32] and uses a multi-step adversarial attack for training. Since no training script for this model on ImageNet has been released, we can only roughly estimate its training time. As it adopts the training schedule from Madry et al., we assume a similar training time increase by a factor of four, in line with the multi-step times reported for PGD in Table 6.

A.3 Aliasing Free Down-Sampling

Previous approaches like [49, 50] have proposed to apply blurring operations before down-sampling, with the purpose of achieving models with improved shift invariance. To this end, they apply Gaussian blurring directly to the feature maps via convolution. In the following, we briefly discuss why this setting cannot guarantee aliasing-free feature maps, even if large convolutional kernels were applied, and why, in contrast, the proposed FLC pooling can guarantee to prevent aliasing.

To prevent aliasing, the feature maps need to be band-limited before down-sampling [14]. This band limitation ensures that, after down-sampling, no replicas of the frequency spectrum overlap (see Fig. 3). To guarantee the required band limitation for sub-sampling by a factor of two, from N to N/2 where N is the size of the original signal, one has to remove (reduce to zero) all frequency components above N/2.

Spatial Filtering Based Approaches. [49, 50] propose to apply approximated Gaussian filter kernels to the feature map. This operation is motivated by the fact that a Gaussian in the spatial domain corresponds to a Gaussian in the frequency (e.g. Fourier) domain: as the standard deviation of the Gaussian in the spatial domain increases, the standard deviation of its frequency representation decreases. Yet, the Gaussian distribution has infinite support regardless of its standard deviation, i.e. the function never actually drops to zero. A convolution in the spatial domain corresponds to a point-wise multiplication in the frequency domain.

Therefore, even after convolving a signal with a perfect Gaussian filter with a large standard deviation (and infinite support), all frequency components that were \(\ne 0\) before the convolution will remain \(\ne 0\) afterwards (although smaller in magnitude). Thus, convolution with a Gaussian, even in theoretically ideal settings, can only reduce the apparent aliasing; some amount of aliasing will always persist. In practice, these ideal settings are not even given: prior works such as [49, 50] have to employ approximated Gaussian filters with finite support (usually not larger than \(7\times 7\)).
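This can be checked numerically: even a fairly wide Gaussian blur only attenuates the frequencies that would alias under 2× down-sampling, it never removes them entirely. The following is our own numpy illustration, not taken from the paper.

```python
import numpy as np

# 1-D Gaussian kernel with a fairly large standard deviation
n = 64
t = np.arange(n) - n // 2
sigma = 3.0
g = np.exp(-t**2 / (2 * sigma**2))
g /= g.sum()

# frequency response of the blur (shift the centered kernel to index 0)
G = np.abs(np.fft.fft(np.fft.ifftshift(g)))

# frequencies that would alias under 2x down-sampling (|k| >= n/4)
high = G[n // 4 : 3 * n // 4]
print(high.max())  # tiny, but strictly greater than zero
```

The maximal response in the aliasing-prone band is heavily suppressed but never reaches zero, matching the infinite-support argument above.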

FLC Pooling. FLC pooling therefore operates directly in the frequency domain, where it removes all frequencies that can cause aliasing.

This operation in the Fourier domain is called the ideal low pass filter and corresponds to a point-wise multiplication of the Fourier transform of the feature maps with a rectangular pulse \(H(\hat{m},\hat{n})\).

$$\begin{aligned} H(\hat{m},\hat{n}) = {\left\{ \begin{array}{ll} 1 &{} \text {for all}\,\hat{m},\hat{n}\, \text {below M/2 and N/2}\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

This trivially guarantees that all frequencies above M/2 and N/2 are zero.
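The combination of this ideal low pass with 2× down-sampling can be sketched in a few lines of numpy: transform to the Fourier domain, keep only the centered low-frequency quadrant (equivalent to multiplying with \(H(\hat{m},\hat{n})\) and then sub-sampling), and transform back at half resolution. This is our own simplified sketch, assuming even spatial sizes and our own scaling choice; the reference implementation is in the linked repository.

```python
import numpy as np

def flc_pool(feat):
    """FrequencyLowCut pooling sketch for a single 2-D feature map."""
    M, N = feat.shape
    F = np.fft.fftshift(np.fft.fft2(feat))      # move DC to the center
    low = F[M // 4 : M // 4 + M // 2,           # keep only the central
            N // 4 : N // 4 + N // 2]           # (low-frequency) quadrant
    out = np.fft.ifft2(np.fft.ifftshift(low))
    return np.real(out) / 4.0                   # /4 preserves the mean

x = np.random.default_rng(0).standard_normal((8, 8))
y = flc_pool(x)
print(y.shape)  # (4, 4): half resolution, with no aliased frequencies
```

Since all frequencies outside the kept quadrant are discarded before the inverse transform, the output cannot contain any aliased components by construction.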

Could We Apply FLC Pooling as Convolution in the Spatial Domain? In the spatial domain, the ideal low pass filter operation from above corresponds to a convolution of the feature maps with the Fourier transform of the rectangular pulse \(H(\hat{m},\hat{n})\) (by the Convolution Theorem, e.g. [14]). The Fourier transform of the rectangle function is

$$\begin{aligned} \mathrm {sinc}(m,n) = {\left\{ \begin{array}{ll} \frac{\sin \left( \sqrt{m^2+n^2}\right) }{\sqrt{m^2+n^2}} &{} m, n \ne 0 \\ 1 &{} m,n = 0 \end{array}\right. } \end{aligned}$$
(7)

However, while the ideal low pass filter in the Fourier domain has finite support, specifically all frequencies above N/2 are zero, \(\mathrm {sinc}(m,n)\) in the spatial domain has infinite support. Hence, an infinitely large convolution kernel would be needed to apply perfect low pass filtering in the spatial domain, which is obviously not possible in practice: the standard kernel size in CNNs is \(3\times 3\), and kernels larger than \(7\times 7\) are hardly ever applied.
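The effect of this truncation can be illustrated numerically: cutting the sinc kernel down to a practical size, e.g. 7 taps in 1-D, leaves a clearly nonzero response in the band that an ideal low pass would remove entirely. This is our own sketch, not from the paper.

```python
import numpy as np

# 7-tap truncation of the ideal half-band (sinc) low-pass filter
t = np.arange(-3, 4)
h = np.sinc(t / 2.0)   # np.sinc(x) = sin(pi x) / (pi x)
h /= h.sum()

# frequency response of the truncated kernel on a 64-point grid
n = 64
H = np.abs(np.fft.fft(h, n))

# response in the band the ideal filter would zero out (|k| >= n/4)
stop = H[n // 4 : 3 * n // 4]
print(stop.max())  # clearly nonzero -- far from an ideal low pass
```

An ideal low pass would make `stop` identically zero; the truncated kernel leaves substantial leakage, which is exactly the aliasing argument made above.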

Table 10. Evaluation of clean and robust accuracy (higher is better) under AutoAttack [9] with our trained models. The numbers reported by the original authors may differ due to different hyper-parameter selection. We report each model's confidence on its correct predictions on clean data (Clean Confidence) and its confidence on false predictions caused by adversarial perturbations (Perturbation Confidence). The top row reports the baseline without adversarial training.

A.4 Model Confidences

In Table 10, we evaluate the confidence of model predictions. We compare each model's confidence on correctly classified clean examples to its confidence on wrongly classified adversarial examples. Ideally, the confidence on the adversarial examples should be lower. The results show that FLC pooling yields comparably high confidence on correctly classified clean examples, with a 20% confidence gap to wrongly classified adversarial examples. In contrast, the baseline model is highly confident in both cases. Other models, even state-of-the-art robust ones, have on average lower confidences, but are even less confident in their correct predictions on clean examples than in their erroneous predictions on adversarial examples (e.g. MART [43] and PGD [27]). Only the model from [45] achieves a trade-off preferable to that of the proposed FLC model.

A.5 AutoAttack Attack Structure

In the main paper we showed one example of an image optimized by AutoAttack [9] to fool our model and the baseline in Fig. 5. In Fig. 6, we give more examples for better visualisation and comparison.

Fig. 6.

Spectrum and spectral differences of adversarial perturbations created by AutoAttack with \(\epsilon =\frac{8}{255}\) on the baseline model as well as our FLC Pooling. The classes from top left down to the bottom right are: Bird, Frog, Automobile, Ship, Cat and Truck.

Fig. 7.

FLC pooling plus, which either includes the original down-sampled signal, as is done traditionally (right), or the high frequency components filtered by a high pass filter in the Fourier domain and down-sampled in the spatial domain by an identity convolution of stride two (left).

A.6 Ablation Study: Additional Frequency Components

In addition to the low frequency components, we tested different settings in which we establish a second path through which we aim to add high frequency or original information. We either add up the feature maps or concatenate them. The procedure for including a second path is illustrated in Fig. 7. One approach is to execute the standard down-sampling and add the result to the FLC pooled feature map. The other is to apply a high pass filter to the feature map and down-sample the result; afterwards, the FLC pooled feature maps and the high pass filtered, down-sampled ones are added. With this ablation, we aim to see whether we lose too much information through the aggressive FLC pooling and whether the high frequency information discarded by FLC pooling would be needed. Table 11 shows that we can gain minor points in clean accuracy, but not in robust accuracy. Since we did not see any improvement in robustness, but an increase in training time per epoch as well as a minor increase in model size, we stick to the simple FLC pooling.
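The additive variant with a high pass second path can be sketched as follows. This is our own numpy illustration of the idea (function names are ours); the concatenation variant would stack the two outputs along the channel dimension instead of adding them.

```python
import numpy as np

def flc_pool(feat):
    """Low-frequency path: crop the centered spectrum to half size."""
    M, N = feat.shape
    F = np.fft.fftshift(np.fft.fft2(feat))
    low = F[M // 4 : M // 4 + M // 2, N // 4 : N // 4 + N // 2]
    return np.real(np.fft.ifft2(np.fft.ifftshift(low))) / 4.0

def highpass_strided(feat):
    """Second path: zero the low frequencies in the Fourier domain,
    then down-sample spatially with stride two (identity convolution)."""
    M, N = feat.shape
    F = np.fft.fftshift(np.fft.fft2(feat))
    F[M // 4 : M // 4 + M // 2, N // 4 : N // 4 + N // 2] = 0
    hp = np.real(np.fft.ifft2(np.fft.ifftshift(F)))
    return hp[::2, ::2]

x = np.random.default_rng(1).standard_normal((8, 8))
combined = flc_pool(x) + highpass_strided(x)  # additive variant
print(combined.shape)  # (4, 4)
```

Note that the strided sub-sampling in the second path reintroduces aliasing of the high frequency content, which is consistent with the observation that this path does not improve robust accuracy.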

Table 11. Accuracies on CIFAR-10 for the baseline and for LowCutPooling plus the original or high frequency part of the feature maps, down-sampled in the spatial domain, under FGSM training. The additional data does not improve robust accuracy and yields only a minor improvement in clean accuracy. Due to the additional computations necessary for the high frequency/original path, we decided to fully discard it and stick to pure low frequency cutting.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Grabinski, J., Jung, S., Keuper, J., Keuper, M. (2022). FrequencyLowCut Pooling - Plug and Play Against Catastrophic Overfitting. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_3


  • DOI: https://doi.org/10.1007/978-3-031-19781-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19780-2

  • Online ISBN: 978-3-031-19781-9

  • eBook Packages: Computer Science, Computer Science (R0)
