## Abstract

In recent years, Convolutional Neural Networks (CNNs) have been the dominant neural architecture in a wide range of computer vision tasks. From an image and signal processing point of view, this success might be a bit surprising, as the inherent spatial pyramid design of most CNNs apparently violates basic signal processing laws, i.e. the *Sampling Theorem*, in their down-sampling operations. However, since poor sampling appeared not to affect model accuracy, this issue was broadly neglected until model robustness started to receive more attention. Recent work [18] in the context of adversarial attacks and distribution shifts showed that there is in fact a strong correlation between the vulnerability of CNNs and aliasing artifacts induced by poor down-sampling operations. This paper builds on these findings and introduces an aliasing-free down-sampling operation which can easily be plugged into any CNN architecture: FrequencyLowCut pooling. Our experiments show that, in combination with simple and fast adversarial training based on the Fast Gradient Sign Method (FGSM), our hyper-parameter-free operator substantially improves model robustness and avoids catastrophic overfitting. Our code is available at https://github.com/GeJulia/flc_pooling.


## References

Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 484–501. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_29

Andriushchenko, M., Flammarion, N.: Understanding and improving fast adversarial training. In: Advances in Neural Information Processing Systems, vol. 33, pp. 16048–16059 (2020)

Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE (2017)

Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J.C., Liang, P.S.: Unlabeled data improves adversarial robustness. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

Chaman, A., Dokmanic, I.: Truly shift-invariant convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3773–3783 (2021)

Chen, T., Zhang, Z., Liu, S., Chang, S., Wang, Z.: Robust overfitting may be mitigated by properly learned smoothening. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=qZzy5urZw9

Cheng, M., Le, T., Chen, P.Y., Yi, J., Zhang, H., Hsieh, C.J.: Query-efficient hard-label black-box attack: an optimization-based approach. arXiv preprint arXiv:1807.04457 (2018)

Croce, F., et al.: RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670 (2020)

Croce, F., Hein, M.: Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: ICML (2020)

Darlow, L.N., Crowley, E.J., Antoniou, A., Storkey, A.J.: CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505 (2018)

Durall, R., Keuper, M., Keuper, J.: Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions (2020)

Engstrom, L., Ilyas, A., Salman, H., Santurkar, S., Tsipras, D.: Robustness (python library) (2019). https://github.com/MadryLab/robustness

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231 (2018)

Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Prentice-Hall Inc. (2006)

Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples (2015)

Gowal, S., Qin, C., Uesato, J., Mann, T., Kohli, P.: Uncovering the limits of adversarial training against norm-bounded adversarial examples (2021)

Gowal, S., Rebuffi, S.A., Wiles, O., Stimberg, F., Calian, D.A., Mann, T.A.: Improving robustness using generated data. In: Advances in Neural Information Processing Systems, vol. 34 (2021)

Grabinski, J., Keuper, J., Keuper, M.: Aliasing coincides with CNNs vulnerability towards adversarial attacks. In: The AAAI-2022 Workshop on Adversarial Machine Learning and Beyond (2022). https://openreview.net/forum?id=vKc1mLxBebP

Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: Proceedings of the International Conference on Learning Representations (2019)

Hendrycks, D., Lee, K., Mazeika, M.: Using pre-training can improve model robustness and uncertainty. In: International Conference on Machine Learning, pp. 2712–2721. PMLR (2019)

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851 (2020)

Hossain, M.T., Teng, S.W., Sohel, F., Lu, G.: Anti-aliasing deep image classifiers using novel depth adaptive blurring and activation function (2021)

Jung, S., Keuper, M.: Spectral distribution aware image generation. In: AAAI (2021)

Karras, T., et al.: Alias-free generative adversarial networks. In: Advances in Neural Information Processing Systems, vol. 34 (2021)

Kim, H., Lee, W., Lee, J.: Understanding catastrophic overfitting in single-step adversarial training (2020)

Krizhevsky, A.: Learning multiple layers of features from tiny images. University of Toronto, May 2012

Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial machine learning at scale (2017)

Li, Q., Shen, L., Guo, S., Lai, Z.: Wavelet integrated CNNs for noise-robust image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7245–7254 (2020)

Li, Q., Shen, L., Guo, S., Lai, Z.: WaveCNet: wavelet integrated CNNs to suppress aliasing effect for noise-robust image classification. IEEE Trans. Image Process. **30**, 7074–7089 (2021). https://doi.org/10.1109/tip.2021.3101395

Lohn, A.J.: Downscaling attack and defense: turning what you see back into what you get (2020)

Lorenz, P., Strassel, D., Keuper, M., Keuper, J.: Is robustbench/autoattack a suitable benchmark for adversarial robustness? In: The AAAI-2022 Workshop on Adversarial Machine Learning and Beyond (2022). https://openreview.net/forum?id=aLB3FaqoMBs

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)

Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582 (2016)

Rade, R., Moosavi-Dezfooli, S.M.: Helper-based adversarial training: reducing excessive margin to achieve a better accuracy vs. robustness trade-off. In: ICML 2021 Workshop on Adversarial Machine Learning (2021). https://openreview.net/forum?id=BuD2LmNaU3a

Rebuffi, S.A., Gowal, S., Calian, D.A., Stimberg, F., Wiles, O., Mann, T.: Fixing data augmentation to improve adversarial robustness (2021)

Rice, L., Wong, E., Kolter, Z.: Overfitting in adversarially robust deep learning. In: International Conference on Machine Learning, pp. 8093–8104. PMLR (2020)

Rony, J., Hafemann, L.G., Oliveira, L.S., Ayed, I.B., Sabourin, R., Granger, E.: Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4322–4330 (2019)

Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., Madry, A.: Do adversarially robust imagenet models transfer better? In: Advances in Neural Information Processing Systems, vol. 33, pp. 3533–3545 (2020)

Sehwag, V., et al.: Improving adversarial robustness using proxy distributions (2021)

Shannon, C.: Communication in the presence of noise. Proc. IRE **37**(1), 10–21 (1949). https://doi.org/10.1109/JRPROC.1949.232969

Stutz, D., Hein, M., Schiele, B.: Relating adversarially robust generalization to flat minima (2021)

Szegedy, C., et al.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014). http://arxiv.org/abs/1312.6199

Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., Gu, Q.: Improving adversarial robustness requires revisiting misclassified examples. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=rklOg6EFwS

Wong, E., Rice, L., Kolter, J.Z.: Fast is better than free: revisiting adversarial training. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJx040EFvH

Wu, D., Xia, S.T., Wang, Y.: Adversarial weight perturbation helps robust generalization. In: Advances in Neural Information Processing Systems, vol. 33, pp. 2958–2969 (2020)

Xiao, Q., Li, K., Zhang, D., Jin, Y.: Wolf in sheep’s clothing - the downscaling attack against deep learning applications (2017)

Zhang, H., Yu, Y., Jiao, J., Xing, E.P., Ghaoui, L.E., Jordan, M.I.: Theoretically principled trade-off between robustness and accuracy. In: International Conference on Machine Learning (2019)

Zhang, J., Zhu, J., Niu, G., Han, B., Sugiyama, M., Kankanhalli, M.: Geometry-aware instance-reweighted adversarial training. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=iAX0l6Cz8ub

Zhang, R.: Making convolutional networks shift-invariant again. In: ICML (2019)

Zou, X., Xiao, F., Yu, Z., Lee, Y.J.: Delving deeper into anti-aliasing in convnets. In: BMVC (2020)


## A Appendix

### A.1 Training Schedules

**CIFAR-10 Adversarial Training Schedule:** For our baseline experiments on CIFAR-10, we used the PreAct ResNet-18 (PRN-18) as well as the WideResNet-28-10 (WRN-28-10) architecture, as they give a good trade-off between complexity and feasibility. For the PRN-18 models, we trained for 300 epochs with a batch size of 512 and a cyclic learning rate schedule with a maximal learning rate of 0.2 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5 \times 10^{-4}\). The loss is the cross-entropy loss, and we use Stochastic Gradient Descent (SGD) as the optimizer. For adversarial training (AT), we used the FGSM attack with an \(\epsilon \) of 8/255 and an \(\alpha \) of 10/255 (in Fast FGSM, the attack is computed once with step size \(\alpha \) and then projected onto the \(\epsilon \)-ball). For the WRN-28-10 we used a training schedule similar to that of the PRN-18 models, but with only 200 epochs and a smaller maximal learning rate of 0.08.
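As a rough illustration of the single-step attack described above, the following NumPy sketch computes one Fast FGSM perturbation: one step of size \(\alpha \) along the sign of the input gradient, followed by a projection onto the \(\epsilon \)-ball. The uniform random start inside the \(\epsilon \)-ball follows the common Fast FGSM recipe and is an assumption here, as are the helper name `fast_fgsm_delta` and the toy linear loss; in a real pipeline the gradient would come from the network's loss.

```python
import numpy as np

EPSILON = 8 / 255   # L-infinity budget stated for CIFAR-10
ALPHA = 10 / 255    # single-step size stated for CIFAR-10


def fast_fgsm_delta(grad_fn, x, epsilon=EPSILON, alpha=ALPHA, rng=None):
    """Return one Fast FGSM perturbation for input x (hypothetical helper).

    grad_fn(x_adv) must return the gradient of the loss w.r.t. the input.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Assumption: start from a uniform random point inside the epsilon-ball.
    delta = rng.uniform(-epsilon, epsilon, size=x.shape)
    # One signed-gradient step of size alpha.
    delta = delta + alpha * np.sign(grad_fn(x + delta))
    # Project the perturbation back onto the epsilon-ball.
    delta = np.clip(delta, -epsilon, epsilon)
    # Keep the perturbed image inside the valid pixel range [0, 1].
    return np.clip(x + delta, 0.0, 1.0) - x


# Toy usage: the "loss" is the inner product w . x, so its input gradient is w.
w = np.array([[1.0, -2.0], [0.5, -0.1]])
x = np.full((2, 2), 0.5)
delta = fast_fgsm_delta(lambda x_adv: w, x)
```

Because \(\alpha > \epsilon \), the step always pushes each pixel's perturbation to the side of the gradient sign, and the projection then caps its magnitude at \(\epsilon \).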

**CIFAR-10 Clean Training Schedule:** Each model is trained without AT. We used 300 epochs, a batch size of 512 for each training run, and a cyclic learning rate schedule with a maximal learning rate of 0.2 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5 \times 10^{-4}\). The loss is the cross-entropy loss, and we use Stochastic Gradient Descent (SGD) as the optimizer.
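The text only fixes the extremes of the learning rate cycle (0 and 0.2) and the number of epochs. The sketch below assumes a single piecewise-linear cycle that rises to the maximum and then decays back to the minimum; the shape, the peak position `peak_epoch`, and the function name `cyclic_lr` are illustrative assumptions, not values taken from the paper.

```python
def cyclic_lr(epoch, total_epochs=300, lr_max=0.2, lr_min=0.0, peak_epoch=120):
    """Piecewise-linear schedule lr_min -> lr_max -> lr_min (assumed shape)."""
    if epoch <= peak_epoch:
        # Linear warm-up towards the maximal learning rate.
        return lr_min + (lr_max - lr_min) * epoch / peak_epoch
    # Linear decay back to the minimal learning rate.
    remaining = total_epochs - peak_epoch
    return lr_max - (lr_max - lr_min) * (epoch - peak_epoch) / remaining


# Learning rate at the start of every epoch, plus the final value.
schedule = [cyclic_lr(e) for e in range(301)]
```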

**CINIC-10 Adversarial Training Schedule:** For our baseline experiments on CINIC-10 we used the PRN-18 architecture. We used 300 epochs, a batch size of 512 for each training run, and a cyclic learning rate schedule with a maximal learning rate of 0.1 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5 \times 10^{-4}\). The loss is the cross-entropy loss, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 8/255 and an \(\alpha \) of 10/255.

**CIFAR-100 Adversarial Training Schedule:** For our baseline experiments on CIFAR-100 we used the PRN-18 architecture, as it gives a good trade-off between complexity and feasibility. We used 300 epochs, a batch size of 512 for each training run, and a cyclic learning rate schedule with a maximal learning rate of 0.01 and a minimal learning rate of 0. We set the momentum to 0.9 and the weight decay to \(5 \times 10^{-4}\). The loss is the cross-entropy loss, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 8/255 and an \(\alpha \) of 10/255.

**ImageNet Adversarial Training Schedule:** For our experiment on ImageNet we used the ResNet50 architecture. We trained for 150 epochs with a batch size of 400 and a multistep learning rate schedule with an initial learning rate of 0.1, \(\gamma =0.1\), and milestones [30, 60, 90, 120]. We set the momentum to 0.9 and the weight decay to \(5 \times 10^{-4}\). The loss is the cross-entropy loss, and we use Stochastic Gradient Descent (SGD) as the optimizer. For the AT, we used the FGSM attack with an \(\epsilon \) of 4/255 and an \(\alpha \) of 5/255.
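The multistep schedule can be written in closed form: the learning rate is multiplied by \(\gamma \) once for every milestone that has been reached. The following sketch uses the values stated above; the function name `multistep_lr` is hypothetical, and the convention that the decay takes effect at the milestone epoch itself is an assumption.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=0.1, gamma=0.1, milestones=(30, 60, 90, 120)):
    """Learning rate after decaying by gamma at every milestone already reached."""
    # bisect_right counts the milestones that are <= epoch.
    return base_lr * gamma ** bisect_right(milestones, epoch)


# Learning rate at the start of each of the 150 epochs.
schedule = [multistep_lr(e) for e in range(150)]
```

Under this convention, epochs 0 to 29 run at 0.1, epochs 30 to 59 at 0.01, and so on, down to \(10^{-5}\) from epoch 120 onwards.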

### A.2 ImageNet Training Efficiency

When evaluating practical training times per epoch on ImageNet, we cannot see a measurable difference in cost between a ResNet50 with FLC pooling and one with strided convolution.

We varied the number of data-loader workers for clean training on four A100 GPUs and measured \(\approx 43\) min for 12 workers, \(\approx 22\) min for 48 workers, and \(\approx 18\) min for 72 workers for both variants. FGSM-based AT with the pipeline by [42] takes 1:07 h per epoch for both FLC pooling and strided convolutions. We conclude that training with FLC pooling is scalable in terms of practical runtime (the runtime increase is in the range of milliseconds to seconds) and that training times are likely governed by other factors.

The training time of our model should be comparable to that of Wong et al. [44], while other reported methods have significantly longer training times. Yet, the clean accuracy of the proposed model using FLC pooling improves by about 8% over the one reached by [44], with a 1% improvement in robust accuracy. For example, [12] increases the training time by a factor of four compared to our model, already on CIFAR-10 (see Table 6), while achieving results that are overall comparable to ours. The model by Salman et al. [38] is trained with the training schedule from Madry et al. [32] and uses a multi-step adversarial attack for training. Since the training script of this model on ImageNet has not been released, we can only roughly estimate its training time. Since they adopt the training schedule from Madry et al., we assume a similar increase in training time by a factor of four, which is in line with the multi-step times reported for PGD in Table 6.

### A.3 Aliasing Free Down-Sampling

Previous approaches like [49, 50] have proposed applying blurring operations before down-sampling, with the purpose of achieving models with improved shift invariance. To this end, they apply Gaussian blurring directly to the feature maps via convolution. In the following, we briefly discuss why this setting cannot guarantee that aliasing in the feature maps is prevented, even if large convolutional kernels were applied, and why, in contrast, the proposed FLC pooling can.

To prevent aliasing, the feature maps need to be band-limited before down-sampling [14]. This band limitation is needed to ensure that no replicas of the frequency spectrum overlap after down-sampling (see Fig. 3). To guarantee the required band limitation for sub-sampling by a factor of two to *N*/2, where *N* is the size of the original signal, one has to remove (reduce to zero) all frequency components above *N*/2.
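The effect of skipping the band limitation can be seen on a small 1D example. The following NumPy sketch (signal length and frequency are arbitrary illustrative choices) sub-samples a pure cosine by keeping every second sample and compares the dominant frequency bin before and after.

```python
import numpy as np

N = 16
n = np.arange(N)
# A pure cosine with 6 cycles over the signal length.
signal = np.cos(2 * np.pi * 6 * n / N)

# Naive down-sampling by a factor of two: keep every second sample.
subsampled = signal[::2]

# Dominant frequency bin (in cycles per signal length) before and after.
peak_before = int(np.argmax(np.abs(np.fft.rfft(signal))))
peak_after = int(np.argmax(np.abs(np.fft.rfft(subsampled))))
print(peak_before, peak_after)  # 6 2
```

The 8 remaining samples cannot represent 6 cycles, so the component folds back and shows up as a spurious component with 2 cycles. This is the aliasing that a band limitation applied before sub-sampling is meant to rule out.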

**Spatial Filtering Based Approaches.** [49, 50] propose to apply approximated Gaussian filter kernels to the feature map. This operation is motivated by the fact that an actual Gaussian in the spatial domain corresponds to a Gaussian in the frequency (e.g. Fourier) domain. As the standard deviation of the Gaussian in the spatial domain increases, the standard deviation of its frequency representation decreases. Yet, the Gaussian distribution has infinite support, regardless of its standard deviation, i.e. the function never actually drops to zero. The convolution in the spatial domain corresponds to the point-wise multiplication in the frequency domain.

Therefore, even after convolving a signal with a perfect Gaussian filter with a large standard deviation (and infinite support), all frequency components that were \(\ne 0\) before the convolution will still be \(\ne 0\) afterwards (although smaller in magnitude). Specifically, the convolution with a Gaussian, even in theoretically ideal settings, can reduce the apparent aliasing, but some amount of aliasing will always persist. In practice, these ideal settings are not given: prior works such as [49, 50] have to employ approximated Gaussian filters with finite support (usually not larger than \(7\times 7\)).
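A quick numerical check of this argument: the sketch below builds a 7-tap sampled Gaussian and inspects its magnitude response in the band that folds back after sub-sampling by two. The standard deviation `sigma`, the 1D setting, and the signal length `N` are illustrative assumptions; frequencies are indexed in cycles per signal length.

```python
import numpy as np

sigma = 1.0                      # assumed standard deviation
taps = np.arange(-3, 4)          # 7 taps, the largest kernel size mentioned above
kernel = np.exp(-taps**2 / (2 * sigma**2))
kernel /= kernel.sum()           # normalise to unit gain at frequency zero

N = 32                           # length of a hypothetical 1D feature row
# Magnitude response of the zero-padded kernel at bins 0 .. N/2.
response = np.abs(np.fft.rfft(kernel, n=N))

# Bins above N/4 fold back onto lower bins after 2x sub-sampling.
aliasing_band = response[N // 4 + 1:]
print(aliasing_band.min() > 0)   # True
```

The blur attenuates the high frequencies but leaves every one of them non-zero, so some aliasing survives the subsequent sub-sampling.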

**FLC Pooling.** Therefore, FLC pooling operates directly in the frequency domain, where it removes all frequencies that can cause aliases.

This operation in the Fourier domain is called the *ideal low pass filter* and corresponds to a point-wise multiplication of the Fourier transform of the feature maps with a rectangular pulse \(H(\hat{m},\hat{n})\), which is one for all frequencies up to *M*/2 and *N*/2 and zero otherwise.

This trivially guarantees that all frequencies above *M*/2 and *N*/2 are zero.
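A minimal NumPy sketch of this idea is given below. It is not the authors' released implementation (see the repository linked in the abstract) but an illustration under stated assumptions: a single 2D feature map, a centred spectrum from which the central block of half the height and width is kept, and a rescaling chosen here so that the mean value of the map is preserved. The function name `flc_pool` is hypothetical.

```python
import numpy as np


def flc_pool(feature_map):
    """Halve the spatial size of a 2D map by cropping its centred spectrum."""
    h, w = feature_map.shape
    out_h, out_w = h // 2, w // 2
    # Transform and move the zero frequency to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    # Keep only the central low-frequency block (ideal low pass filter).
    top = h // 2 - out_h // 2
    left = w // 2 - out_w // 2
    low = spectrum[top:top + out_h, left:left + out_w]
    # Back to the spatial domain; the result has shape (out_h, out_w).
    pooled = np.fft.ifft2(np.fft.ifftshift(low))
    # Assumption: rescale so that the mean of the map is preserved.
    return pooled.real * (out_h * out_w) / (h * w)


# A constant map passes through unchanged; a checkerboard, which consists
# only of the highest frequency, is removed completely.
smooth = np.full((8, 8), 3.0)
checker = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0
print(np.allclose(flc_pool(smooth), 3.0), np.allclose(flc_pool(checker), 0.0))
```

Strided sub-sampling of the same checkerboard would instead return a constant map of value -1, i.e. a high-frequency pattern aliased onto frequency zero.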

**Could We Apply FLC Pooling as Convolution in the Spatial Domain?** In the spatial domain, the ideal low pass filter operation from above corresponds to a convolution of the feature maps with the Fourier transform of the rectangular pulse \(H(\hat{m},\hat{n})\) (by the Convolution Theorem, e.g. [14]). The Fourier transform of the rectangle function is the sinc function, *sinc*(*m*, *n*).

However, while the ideal low pass filter in the Fourier domain has finite support (specifically, all frequencies above *N*/2 are zero), *sinc*(*m*, *n*) in the spatial domain has infinite support. Hence, we would need an infinitely large convolution kernel to apply perfect low pass filtering in the spatial domain, which is obviously not possible in practice. In CNNs the standard kernel size is \(3\times 3\), and kernels larger than \(7\times 7\) are rarely applied.
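The following 1D NumPy sketch illustrates the point on a finite grid, where the spatial counterpart of the rectangular pulse is a periodic sinc. The signal length `N`, the cut-off, and the 7-tap truncation are illustrative assumptions; frequencies are indexed in cycles per signal length.

```python
import numpy as np

N = 64
freqs = np.fft.fftfreq(N) * N                 # frequency index of each DFT bin
rect = (np.abs(freqs) < N / 4).astype(float)  # ideal low pass: 1 in the pass band, 0 elsewhere

# Spatial kernel equivalent to the ideal low pass, centred at index N // 2.
kernel = np.fft.fftshift(np.fft.ifft(rect).real)
nonzero_taps = int(np.count_nonzero(np.abs(kernel) > 1e-9))

# Truncate to 7 taps, as a spatial convolution in a CNN realistically would.
truncated = kernel[N // 2 - 3 : N // 2 + 4]
stopband = np.abs(np.fft.fft(truncated, n=N))[np.abs(freqs) >= N / 4]
print(nonzero_taps, stopband.max() > 1e-3)
```

In this setting every one of the 64 spatial taps is non-zero, and the truncated 7-tap kernel no longer suppresses the stop band, which is why the filter is applied in the frequency domain instead.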

### A.4 Model Confidences

In Table 10, we evaluate the confidence of model predictions. We compare each model’s confidence on correctly classified clean examples to its confidence on wrongly classified adversarial examples. Ideally, the confidence on the adversarial examples should be lower. The results for the different methods show that FLC yields comparably high confidence on correctly classified clean examples, with a 20% gap in confidence to wrongly classified adversarial examples. In contrast, the baseline model is highly confident in both cases. Other models, even state-of-the-art robust ones, have lower confidences on average but are even less confident in their correct predictions on clean examples than in their erroneous predictions on adversarial examples (e.g. MART [43] and PGD [27]). Only the model from [45] has a trade-off preferable to that of the proposed FLC model.

### A.5 AutoAttack Attack Structure

In the main paper we showed one example of an image optimized by AutoAttack [9] to fool our model and the baseline in Fig. 5. In Fig. 6, we give more examples for better visualisation and comparison.

### A.6 Ablation Study: Additional Frequency Components

In addition to the low frequency components, we tested different settings in which we establish a second path through which we aim to add high frequency or the original information. We either added up the feature maps or concatenated them. The procedure for including a second path is shown in Fig. 7. One approach is to execute the standard down-sampling and add its result to the FLC pooled feature map. The other is to apply a high pass filter to the feature map and down-sample the filtered feature maps. Afterwards, the FLC pooled feature maps and the high pass filtered and down-sampled ones are added. With this ablation, we aim to see whether we lose too much through the aggressive FLC pooling and whether we would need the additional high frequency information that FLC pooling discards. Table 11 shows that we can gain minor points for the clean but not for the robust accuracy. Since we did not see any improvement in robustness, but did see an increase in training time per epoch as well as a minor increase in model size, we stick to the simple FLC pooling.

## Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

## About this paper

### Cite this paper

Grabinski, J., Jung, S., Keuper, J., Keuper, M. (2022). FrequencyLowCut Pooling - Plug and Play Against Catastrophic Overfitting. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_3

DOI: https://doi.org/10.1007/978-3-031-19781-9_3

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-19780-2

Online ISBN: 978-3-031-19781-9

eBook Packages: Computer Science, Computer Science (R0)