Introduction

Deep learning [36] has achieved remarkable success across a wide spectrum of applications, including image classification [23, 27], object detection [52, 55], image generation [31, 84], and information fusion [57, 61, 83, 86]. However, despite these advances, extensive research [4, 60, 85] has shown that deep neural networks are inherently fragile. Malicious attackers can make well-trained networks produce grossly erroneous predictions simply by adding tiny perturbations to the original images. Attacks that exploit this weakness are called adversarial attacks, and the corresponding distorted images are adversarial examples [18, 60]. Because of these attacks, security-critical application scenarios such as autonomous driving [15, 32, 39], face recognition [13, 82], and medical diagnosis based on medical image analysis [22, 30, 43] are exposed to very high risks.

Fig. 1

The class activation maps of samples from ImageNette, generated by GradCAM. Best viewed in color

To address the huge security threat that adversarial examples pose to intelligent systems, researchers from both academia and industry have devoted substantial pioneering effort to adversarial example defenses [49]. Among all defense approaches, adversarial training (AT) and its variants [44, 63, 78] have emerged as common choices for mitigating malicious attacks, but they provide only limited adversarial robustness and often reduce clean accuracy. More recently, some researchers [2, 71] have attempted to counteract adversarial examples by modifying the original network architectures to denoise the malicious effects in the feature space. Figure 1 shows the class activation maps of clean samples and their adversarial examples in the feature space. Bai et al. [2] leveraged channel activation suppression (CAS) to strengthen significant feature channels while suppressing abnormal ones using weighted scores. Furthermore, Yan et al. [71] proposed CIFS, which utilizes an importance mask generation module to extend the vanilla CAS method.

The above approaches enhance the adversarial robustness of target models compared to the original AT methods, but they still have several drawbacks. First, existing methods that modify network architectures typically rely on global average pooling (GAP) to denoise and obtain channel features. Nevertheless, this technique does not effectively eliminate the malicious activations induced within channels by adversarial examples [67]. Furthermore, these algorithms predominantly focus on mitigating redundant activation values within feature channels, thereby overlooking the spatial dimension’s role in robust image classification [16, 65, 81]. Additionally, the majority of adversarial defense techniques substantially compromise clean image performance, which hinders the practical deployment of defense-enhanced models.

To address the aforementioned challenges, in this paper we propose the RSA algorithm, which consists of a feature Refinement module, a feature activation Suppression module, and a feature Alignment module. Specifically, to mitigate the abnormally low channel activation values caused by adversarial perturbations, the feature refinement module first detects and refines abnormal values in the channel dimension. Next, we utilize the feature activation suppression module, with a parallel complementary mode, to suppress redundant activation values in both the channel and spatial dimensions. To avoid the clean accuracy drop caused by adversarial training, we additionally leverage the feature alignment module to impose two constraints on the hidden features: a consistency constraint that makes features of the same category more compact, and a knowledge distillation constraint that guides the features through an independent teacher network.

Overall, previous research often prioritizes the development of adversarial training strategies or confines its scope to the feature channel dimension alone, thereby neglecting the broader, multi-dimensional consequences of adversarial noise at the feature level. Upon analysis, we observe that adversarial examples disrupt both the channel and spatial dimensions of the feature space, subsequently degrading model performance. Therefore, we develop a novel feature-level defense mechanism, RSA. Our method operates at the feature level and can enhance adversarial robustness while preserving clean accuracy. Our findings underscore the importance of feature-level defenses for mitigating the adverse impact of adversarial examples on model performance.

In conclusion, our main contributions can be summarized as follows:

1. We conducted a quantitative analysis to assess the influence of adversarial attacks on the sample feature space. Building upon these insights, we introduce two novel modules: Feature Refinement and Feature Activation Suppression. These modules are designed to mitigate the adverse effects of adversarial attacks and reduce redundant activation values across both channel and spatial dimensions.

2. To avoid the model performance drop in clean accuracy after adversarial training, we impose an extra feature alignment module with a consistency constraint and a knowledge distillation constraint on the feature space after refinement and suppression.

3. Extensive quantitative and qualitative experiments on five public datasets and three widely used backbone networks demonstrate that our proposed RSA algorithm has overall superior performance compared to state-of-the-art defense methods.

Related work

In this section, we briefly review the commonly used adversarial defense algorithms. We divide the defense approaches into three categories: input transformation defenses, adversarial training (AT) and its variants, and network structure optimization.

Input transformation defenses

The exploration of adversarial example defenses in the early research community focused on transforming the input images, such as the minimum variance method [20] and the JPEG compression algorithm [42]. After that, Raff et al. [51] stochastically combined multiple input transformations, akin to ensemble learning, to counter potential adaptive adversarial attacks. Similarly, Taran et al. [62] introduced a defense mechanism based on randomized diversification over a set of input transformations. In addition, some researchers explored other methods based on input transformations. Jia et al. [28] utilized learnable compression modules to mitigate malicious perturbations. Samangouei et al. [53] proposed Defense-GAN to counter adversarial examples with GANs. Inspired by [53], Yuan et al. [74] leveraged an ensemble generative cleaning network with a feedback loop to clean images of adversarial patterns. Several defense methods based on input denoising have also been proposed. Gu et al. [19] attempted to use a denoising autoencoder and a deep contractive network to remove adversarial noise. Liao et al. [41] proposed a denoiser guided by high-level representations, which alleviates the error amplification effect of standard denoisers. Liang et al. [40] introduced a smoothing spatial filter with a scalar quantization mechanism to reduce the impact of adversarial attacks. In general, input transformation-based defenses are usually easy to deploy, but they are not effective against many powerful attacks.

Adversarial training and variants

Adversarial training (AT) has been the most successful approach for training robust neural networks against adversarial examples. AT was first proposed by Goodfellow et al. [18], who exposed the original deep neural networks to adversarial examples during the training phase. Madry et al. [44] proved that the solution to the Min–Max optimization problem of adversarial training gives the optimal model parameters when faced with adversarial examples. Moreover, they advocated using projected gradient descent (PGD) to generate adversarial examples during adversarial training, and it has become a widely used and effective approach to resist malicious attacks.

Although adversarial training could improve the robustness of models to some extent, it required substantial additional computing resources and time. Unfortunately, AT also degraded the generalization performance on clean samples. To better enhance the vanilla AT approach, multiple AT variants have been explored. Zhang et al. [78] proposed the TRADES algorithm to simultaneously control two parts of the original adversarial error: natural error and boundary error. Meanwhile, the MART [63] algorithm considered the impact of misclassified clean samples on model robustness during training. Wu et al. [66] proposed Adversarial Weight Perturbation (AWP), which adversarially perturbed both inputs and weights to explicitly regularize the flatness of the weight loss landscape. Some researchers have also tried to utilize single-step defenses with relaxation terms and regularizers to reduce the computational cost of AT, such as Guided Adversarial Training (GAT) [58] and NuAT [59]. To address the performance decline on clean images due to AT, Lamb et al. [34] proposed Interpolated Adversarial Training, which employed interpolation-based methods in the framework of AT. Cui et al. [9] proposed LBGAT to guide robust network training with the classifier boundary of clean models. Yang et al. [72] proposed DAAT to retain clean accuracy through a naturally trained calibration network. To counter the catastrophic overfitting problem in adversarial training, FGSM-GA [1] and FGSM-MEP [29] utilized the GradAlign method and a prior-based initialization method, respectively. Similarly, Chen et al. [5] proposed an approach that alleviates the catastrophic overfitting of multi-exit networks with lower time complexity than vanilla AT. To improve the generalization performance of models after AT, the FAT algorithm [79] generated more diverse adversarial examples without reducing the robustness of the models. Smooth Adversarial Training (SAT) [56] leveraged the concept of curriculum learning to allow networks to achieve a better clean accuracy versus robustness trade-off than vanilla AT.

Network structure optimization

Researchers have also explored ways to design and optimize network structures for better adversarial robustness in combination with AT. Early works focused on denoising or making attacks more difficult by adding auxiliary structures. Xie et al. [69] leveraged the non-local attention module used in semantic segmentation to eliminate aberrant noise in the feature maps of adversarial examples. He et al. [24] injected trainable Gaussian noise into either activations or weights during training to improve the performance of vanilla AT. Mygdalis et al. [47] proposed M-SVDD-D, which defends against adversarial examples by increasing the noise energy required to deceive the protected models, thereby decreasing the effectiveness of adversarial attacks. The more effective methods based on network structure optimization can currently be divided into the following categories.

Knowledge-distillation-based methods: Some researchers attempted to leverage the concept of knowledge distillation (KD) to defend against adversarial noise, including ARD [17], RSLAD [87], and MTARD [80]. KD-based approaches usually employ one or more teacher models, which can be adversarially or standardly trained, to guide the training of the target student model. ARD was a classical KD-based defense. Inspired by it, RSLAD exploited robust soft labels to train small student models through large teacher models. Alternatively, MTARD used multiple teacher models and a dynamic training algorithm to balance the influence of the adversarial and clean teachers.

NAS-based and pruning-based methods: Some defense methods such as RobNet [21] and DS-NET [14] utilized neural architecture search (NAS) to find robust modules and structures and redesign families of robust architectures. Meanwhile, several pruning-based algorithms have also been proposed, such as SAP [11], AdvPrune [73], kWTA [68], HYDRA [54], and MAD [37]. The purpose of these approaches was to preserve adversarial robustness while compressing model size.

Channel activation suppression-based methods: In recent years, the correlation between channel activation values and adversarial robustness has attracted increasing attention. Bai et al. [2] found that there were disparities in the magnitudes and frequencies of channel activation values between clean and adversarial examples. They therefore proposed the channel activation suppression (CAS) algorithm, which inhibits adversarial noise in abnormal channels by implementing global average pooling (GAP) and an extra auxiliary classifier branch. Based on CAS, Yan et al. [71] divided the channels of feature maps into positively relevant (PR) channels and negatively relevant (NR) channels. They then introduced a new mechanism named channel-wise importance-based feature selection (CIFS), which aligns PR channels and suppresses NR ones, thereby improving the adversarial robustness of target models.

Although the vanilla CAS and CIFS algorithms improve adversarial robustness, both of them align the aberrant feature activation values only from the perspective of the channel dimension. Besides, the GAP operation used in CAS cannot effectively eliminate the effects of anomalous activations in the hidden features; we assume this is because GAP is merely a global sum-and-average operation and does not remove the adverse effects within the features. Moreover, these algorithms still do not avoid the decreased performance of models on clean samples after adversarial training. To address the above drawbacks, in this paper we propose a novel model, RSA, which reduces the malicious effects of adversarial examples to improve adversarial robustness while avoiding the performance decline on clean samples.

Fig. 2

The difference between a the standard training model, b our baseline model CAS, and c our proposed RSA model. Our RSA model consists of three modules: feature refinement module, feature activation suppression module, and feature alignment module (consistency constraint (CC) and knowledge distillation (KD)). Best viewed in color

Method

In this section, we introduce the proposed RSA algorithm. We first investigate the difference in activation values between adversarial examples and clean samples in the corresponding channels of the feature maps, based on which we propose a feature refinement module. After that, we introduce the feature activation suppression module, which acts on both the channel and spatial dimensions. Finally, we present the feature alignment module, which consists of a consistency constraint applied to the extra branches and a knowledge distillation operation on the reweighted features. These three components together constitute our proposed RSA algorithm. Figure 2 compares a standard training model, the baseline CAS model, and our proposed RSA model. As shown in Figure 2(c), perturbed features from adversarial images are first refined by the feature refinement module. After that, the feature activation suppression module reweights the refined features. Finally, the feature alignment module adopts two constraints, the consistency constraint (CC) and knowledge distillation (KD), to align the reweighted features.

Feature refinement

CAS [2] utilizes global average pooling (GAP) to obtain the global channel representation and regards it as a preliminary denoising function for adversarial examples as well. However, GAP essentially averages all activations of each channel, and we assume that it does not effectively remove the influence of malicious noise; [67] also demonstrates that GAP is not a reliable information aggregation method. Although adversarial examples contain only human-imperceptible perturbations, their adverse effects may be amplified through forward propagation. These malicious influences are reflected as irregular differences in feature activations, and a simple GAP may not be able to purify them.

Fig. 3

The difference between the channel activation values of clean samples and their corresponding adversarial examples. Each subgraph shows the average channel values of all samples of a specific class. The horizontal axis represents each channel of the feature maps, and the vertical axis represents the difference between the activation values of the clean samples and their corresponding adversarial examples in this channel. Red indicates that the activation values of the clean samples are higher than those of the adversarial examples, while blue represents the opposite. For display purposes, we select samples from three classes of the CIFAR-10 dataset and observe the top 100 of the 512 channels in terms of activation magnitude. Best viewed in color

To inspect the influence on feature activation values, we extract the activations of both clean and adversarial examples from the penultimate layer of an adversarially trained network equipped with CAS. We then investigate the channel-wise representation differences between clean samples and adversarial examples as follows. After applying GAP to the extracted features, we first sort the channels of the clean samples by activation magnitude in descending order. We then fix this channel order and compute the difference between the activations of the clean and adversarial examples within each channel. In Figure 3, red indicates that clean images have higher activations than adversarial examples in a channel, and blue indicates the opposite.
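For concreteness, the analysis procedure just described can be sketched roughly as follows. This is a minimal PyTorch-style illustration; the function names, tensor shapes, and top-k choice are our own assumptions rather than released analysis code.

```python
import torch

def channel_activation_gap(features: torch.Tensor) -> torch.Tensor:
    """Global average pooling over spatial dims: (N, C, H, W) -> (N, C)."""
    return features.mean(dim=(2, 3))

def sorted_channel_difference(clean_feats: torch.Tensor,
                              adv_feats: torch.Tensor,
                              top_k: int = 100) -> torch.Tensor:
    """Average channel activations over the batch, sort channels by the clean
    magnitude in descending order, and return clean-minus-adversarial
    differences for the top-k channels (positive values -> clean is higher)."""
    clean_gap = channel_activation_gap(clean_feats).mean(dim=0)  # (C,)
    adv_gap = channel_activation_gap(adv_feats).mean(dim=0)      # (C,)
    order = torch.argsort(clean_gap, descending=True)            # fix channel order by clean magnitude
    diff = (clean_gap - adv_gap)[order]
    return diff[:top_k]
```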

As shown in Figure 3, in channels with higher activation magnitudes, the activation values of the original images are clearly larger than those of the adversarial examples. As the channel activation values decrease, adversarial examples have slightly larger activation amplitudes than clean samples, but the difference is not as large as that in the high-amplitude channels. Adversarial examples perturb the feature distributions, which likely causes the above activation differences. Therefore, we believe it is necessary to deal with the aberrant channel activation values caused by adversarial perturbations, thereby improving the robustness of target models.

Fig. 4

Schematic diagram of the feature refinement module, showing the process of operating on a channel of a feature map. The red in the figure represents the detected outliers, while the blue in the corresponding position represents the corrected value. Best viewed in color

Some research works [6, 50] suggest that channels with different activation values contribute differently to correct classification, and we believe that a similar rationale also holds in adversarial defense. We suspect that channels with large activation values on the original images are more important for correct image classification. In these channels, the activation magnitudes of adversarial examples are mostly lower than those of clean samples. We believe that the addition of adversarial noise changes the feature activation values and causes these differences. For each channel, if some of the smaller activation values are amplified, the overall activation magnitude of the channel is correspondingly slightly increased, and the adverse effects may be alleviated. We therefore attempt to refine the minimum value in each channel; in this way, the overall activation of the whole channel can be enlarged without significantly disturbing the feature distribution of the data. Specifically, we propose a novel feature refinement module that revises these minimum activation values and requires no additional trainable parameters. The schematic diagram of the feature refinement module is shown in Figure 4. For each channel in the feature map, the module first calculates the average activation value and then scales the minimum value in the channel to this average. The purpose of this module is to raise the minimum value so as to refine the overall magnitude of the channel activations, and we assume that modifying only one value does not unduly distort the feature distribution. The operation of the feature refinement module is formulated as follows:

$$\begin{aligned} X^{'} = X - X \bigotimes MinMask + Mean(X) \bigotimes MinMask. \end{aligned}$$
(1)

X represents the original hidden features, and \(X^{'}\) stands for the refined features. MinMask is a binary mask matrix in which 1 marks the location of the minimum value in each channel and 0 marks all other positions. Mean(X) computes the average activation value of each channel, and \(\bigotimes \) represents element-wise multiplication. After the above operation, the aberrantly small activation magnitudes in the channels of perturbed features caused by adversarial noise are likely to be amplified and refined.
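A minimal PyTorch-style sketch of Eq. (1) is given below, assuming the module simply replaces the single minimum activation of each channel with that channel's mean; the class and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureRefinement(nn.Module):
    """Parameter-free refinement: scale the per-channel minimum to the channel mean (Eq. 1)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        flat = x.view(n, c, h * w)
        # Location of the minimum activation within each channel.
        min_idx = flat.argmin(dim=2, keepdim=True)                    # (N, C, 1)
        min_mask = torch.zeros_like(flat).scatter_(2, min_idx, 1.0)   # MinMask in Eq. (1)
        # Per-channel mean, broadcast over spatial positions.
        mean = flat.mean(dim=2, keepdim=True)                         # Mean(X)
        refined = flat - flat * min_mask + mean * min_mask
        return refined.view(n, c, h, w)
```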

Feature activation suppression

Our baseline model utilizes channel-wise activation suppression and an auxiliary classifier to calculate the importance of channel features and adjust the activation values. However, we suppose that the activation difference also exists in the spatial domain, apart from the channel perspective. Similar to the qualitative analysis in the previous section, we draw a heatmap to investigate whether there is a difference among activations in the spatial domain. Figure 5 displays three \(4*4\) feature maps, where the value at each spatial position represents the magnitude difference between the clean images and their corresponding adversarial examples at this position (averaged over all channels). In this figure, red represents positive values and blue the opposite. It can be clearly observed that there are apparent differences in the spatial distribution of features between the two types of samples. Thus, we assume that in addition to suppressing the harmful redundant activation values from the channel perspective, similar operations should also be migrated to the spatial level of the feature space. We suppose that reweighting the feature maps through both channel and spatial activation suppression yields complementary effects on model robustness.

Fig. 5

Analysis of the difference between adversarial and clean sample feature maps from the spatial perspective. Each subgraph shows the average spatial activation values of all samples of a specific class. Red represents that the activation value of the clean samples at the corresponding position is higher than that of the adversarial examples, while blue represents the opposite. For a more intuitive analysis, the feature maps belong to all samples of three classes in the CIFAR-10 dataset. For ease of presentation and analysis, the feature maps have been converted to a single channel (2D) by averaging the activation values over channels. Best viewed in color

Therefore, inspired by the CBAM model [65] and other research utilizing spatial attention [16, 81], we extend the vanilla feature activation suppression module into a parallel complementary mode. The new feature activation suppression module suppresses both channel-wise and spatial-wise redundant activation values simultaneously, and we suppose that it can make a better complementary effect to counter adversarial examples. Figure 6 shows the schematic diagram of the proposed feature activation suppression module.

Given an intermediate feature map \(Z\in {R^{C\times H\times W}}\) obtained by the above feature refinement module, where C, H, and W represent the channel, height, and width of the feature map Z, respectively, the feature activation suppression module estimates the channel-wise activation \(F^c\) and the spatial-wise activation \(F^s\) by leveraging the global average pooling (GAP) operation simultaneously. Since the features are already refined by the previous module, we can apply the GAP operation. Additionally, for spatial activations, we use an extra \(1*1\) convolutional layer before the GAP operation to aggregate features. Formally, the two complementary activations can be computed as:

$$\begin{aligned} F^{c} = \frac{1}{H\times W}\sum _{i=1}^{H}\sum _{j=1}^{W}Z(i,j), \end{aligned}$$
(2)
$$\begin{aligned} F^{s} = \frac{1}{C}\sum _{k=1}^{C}Conv^{1*1}(Z(k)). \end{aligned}$$
(3)
Fig. 6

Schematic illustration of the feature activation suppression module. The corresponding weights are obtained from two auxiliary classifiers respectively, and we use those weight vectors to reweight the refined features. We finally aggregate the channel and spatial reweighted features by the tensor addition. Best viewed in color

As shown above, we aggregate spatial information by GAP and generate the channel context descriptor \(F^{c}\). The spatial feature descriptor \(F^{s}\) is obtained by a similar operation along the channel axis after an extra 1*1 depthwise separable convolutional layer. After extracting the two complementary context descriptors, we utilize two auxiliary fully connected (FC) layers as extra classifiers, with \(F^{c}\) and \(F^{s}\) as their respective inputs. For a multi-class classification task with K classes, the weight of an auxiliary FC layer can be written as W = \([W^{1},W^{2},...,W^{K}]\), in which each weight vector corresponds to one class. Therefore, we extract the specific weight vector related to the class of each example and leverage it to reorganize the above activations. We adopt the labels of the training data to select the weight vectors when training the model. In the inference phase, we pick the weight components using the predicted labels of the auxiliary layers, since the ground-truth labels are not accessible at this stage. In short, the activation reorganization can be performed as below:

$$\begin{aligned} Z^{'} = Z \bigotimes W^{true}, \quad \text {in the training phase}, \end{aligned}$$
(4)
$$\begin{aligned} Z^{'} = Z \bigotimes W^{predict}, \quad \text {in the testing phase}, \end{aligned}$$
(5)

where \(\bigotimes \) represents the element-wise multiplication, and \(W^{true}\) and \(W^{predict}\) stand for the weight vectors corresponding to the true labels and the predicted labels in the auxiliary FC layers, respectively. \(Z^{'}\) is the new feature reactivated by the above weight vectors, and it will be input into subsequent layers of the backbone networks. It should be noted that \(Z^{'}\) has two forms, \(Z^{c}\) and \(Z^{s}\), corresponding to channel dimension and space dimension, respectively.

Each of the channel-wise and spatial-wise context descriptors generates a weight vector through its auxiliary classifier, so we have two independent weight components to suppress the refined feature map Z. Vanilla CBAM utilizes a sequential arrangement to enhance the original intermediate features, but here we apply a parallel mode to merge the channel and spatial features after activation suppression. Finally, the ultimate feature is computed as:

$$\begin{aligned} Z^{sup} = \frac{1}{2}\times (Z^{c} + Z^{s}), \end{aligned}$$
(6)

where \(Z^{c}\) and \(Z^{s}\) stand for the channel-suppressed and spatial-suppressed features, respectively, and \(Z^{sup}\) is the final aggregated feature. Through the above operations, we aggregate the features reweighted by the feature activation suppression module to enhance adversarial robustness from both the channel and spatial perspectives.
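The parallel suppression described by Eqs. (2)–(6) can be sketched as follows. This is a hedged PyTorch-style illustration: the module name, the width of the 1*1 convolution, and the way predicted labels are taken from the auxiliary logits at inference are assumptions on our part, not a definitive implementation. At training time, the two auxiliary logits would additionally feed the modified loss of Eq. (7).

```python
import torch
import torch.nn as nn

class FeatureActivationSuppression(nn.Module):
    """Sketch of the parallel channel/spatial suppression (Eqs. 2-6)."""

    def __init__(self, channels: int, height: int, width: int, num_classes: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)            # aggregation before spatial GAP
        self.fc_channel = nn.Linear(channels, num_classes, bias=False)         # auxiliary classifier (channel)
        self.fc_spatial = nn.Linear(height * width, num_classes, bias=False)   # auxiliary classifier (spatial)

    def forward(self, z: torch.Tensor, labels: torch.Tensor = None):
        n, c, h, w = z.shape
        f_c = z.mean(dim=(2, 3))                                  # Eq. (2): (N, C)
        f_s = self.conv1x1(z).mean(dim=1).view(n, h * w)          # Eq. (3): (N, H*W)
        logits_c, logits_s = self.fc_channel(f_c), self.fc_spatial(f_s)

        # Pick the class-specific weight vectors: true labels while training,
        # the auxiliary classifiers' predictions at inference (Eqs. 4-5).
        idx_c = labels if labels is not None else logits_c.argmax(dim=1)
        idx_s = labels if labels is not None else logits_s.argmax(dim=1)
        w_c = self.fc_channel.weight[idx_c].view(n, c, 1, 1)      # (N, C, 1, 1)
        w_s = self.fc_spatial.weight[idx_s].view(n, 1, h, w)      # (N, 1, H, W)

        z_sup = 0.5 * (z * w_c + z * w_s)                         # Eq. (6): parallel fusion
        return z_sup, logits_c, logits_s
```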

Some prior works such as MIRNet [77] utilize dual attention (channel and spatial) mechanisms to capture contextual information, but both the motivation and the approach of MIRNet differ from the feature activation suppression module proposed in this paper. The dual attention in MIRNet fuses more discriminative features for image restoration and enhancement, whereas our feature activation suppression module aims to suppress the malicious effects of adversarial examples in both channel and spatial activation values. Besides, MIRNet uses dual attention to reweight the vanilla features separately and fuses the reweighted features by concatenation, whereas our module leverages extra auxiliary classifiers to obtain the weight vectors for the channel and spatial dimensions and fuses the suppressed features by addition. In general, the feature fusion used by MIRNet differs from our feature activation suppression module in terms of both motivation and specific practice.

Feature alignment

The feature refinement module and the feature activation suppression module above are both designed to counteract the malicious features caused by adversarial examples, but the generalization performance on clean images still declines due to adversarial training. We therefore attempt to impose constraints on the feature space so that it is not overly distorted by adversarial training. Our goal is to force the denoised feature map \(Z^{sup}\) to be closer to that extracted from a standard trained model on clean samples. Furthermore, we also try to tighten the above reweighted feature space, which may decrease the distance between intra-class samples and make the hidden features more compact.

Fig. 7

Schematic illustration of the consistency constraint in the feature alignment module. Note that when applying the consistency constraint, the characteristics of samples of the same category become more compact. Best viewed in color

To make the features within each category more compact and discriminative, we first add a consistency constraint to the feature activation suppression module. It imposes an additional restriction on the target model's feature space by modifying the classification loss, inspired by the center constraint proposed for the face recognition task [64]. Specifically, as shown in Figure 7, we add this consistency constraint to the two auxiliary classifiers in the feature activation suppression module. Formally, the modified loss of the auxiliary classifiers in the feature activation suppression module is computed as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{fas}&=\lambda _1\mathcal {L}_{ce_a}+\lambda _2 \mathcal {L}_{center} \\&=-\lambda _1\sum _{i=1}^{m} \log \frac{e^{W_{y_{i}}^{T} \varvec{x}_{i}+b_{y_{i}}}}{\sum _{j=1}^{n} e^{W_{j}^{T} \varvec{x}_{i}+b_{j}}}+\frac{\lambda _2}{2} \sum _{i=1}^{m}\left\| \varvec{x}_{i}-\varvec{c}_{y_{i}}\right\| _{2}^{2}, \end{aligned} \end{aligned}$$
(7)

where \(\mathcal {L}_{ce_a}\) stands for the vanilla cross-entropy loss of the auxiliary classifiers, which is the same as in the baseline model, and \(\mathcal {L}_{center}\) represents the assistant consistency constraint. The variable \({c}_{y_{i}}\) is the class center of a specific class \(y_{i}\). \(\lambda _1\) and \(\lambda _2\) are the adjustment coefficients between the cross-entropy loss and the consistency constraint. The previous work [70] also proposed a class-aware constraint on the feature space. The difference between our feature alignment module and that algorithm is that [70] constrains the features of clean and adversarial examples modified by the deep generative networks at the front end of its pipeline, and the constraint acts in the backbone network of the target models. In contrast, the consistency constraint used in this paper acts only on the additional branches of the feature activation suppression module proposed in the previous section and is not applied to the backbone networks.
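As a sketch, the modified auxiliary loss of Eq. (7) might be computed as below. We average over the mini-batch instead of summing, and how the class centers \(c_{y_i}\) are maintained (e.g., as running averages updated per mini-batch) is left as an implementation choice, so this should be read as an assumption-laden illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def auxiliary_branch_loss(logits: torch.Tensor, features: torch.Tensor,
                          labels: torch.Tensor, centers: torch.Tensor,
                          lambda1: float = 1.0, lambda2: float = 0.1) -> torch.Tensor:
    """Modified auxiliary-classifier loss of Eq. (7): cross-entropy plus a
    center-based consistency term. `centers` is a (num_classes, feat_dim)
    tensor of per-class feature centers."""
    ce = F.cross_entropy(logits, labels)                                   # L_{ce_a}
    center_term = 0.5 * ((features - centers[labels]) ** 2).sum(dim=1).mean()  # L_{center}
    return lambda1 * ce + lambda2 * center_term
```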

Fig. 8

Schematic illustration of the knowledge distillation constraint in the feature alignment module. In this figure, the model in the top row is our RSA model, and we regard it as a student model. The bottom is a teacher model, which has the same architecture as our RSA model and is trained on the clean images. Best viewed in color

Moreover, we apply a knowledge distillation mechanism [76] to transfer knowledge, in the form of attention maps, from an independent teacher model to the adversarially trained student model. The teacher model has the same network architecture as our target student model and is trained only on clean samples. After extracting feature maps from the same layer of the two models, we generate a pair of attention maps for knowledge transfer and then perform the knowledge distillation operation. As shown in Figure 8, for the reweighted feature \(X^{S}\) (from the adversarially trained student model) processed by feature refinement and suppression, and the feature \(X^{T}\) extracted from the same layer of the corresponding teacher model, we impose a knowledge distillation constraint by optimizing the following loss function:

$$\begin{aligned} L_{kd} = Distance\left( Kn\left( X^{T} \right) , Kn\left( X^{S} \right) \right) , \end{aligned}$$
(8)

where \(X^{T}\) and \(X^{S}\) \(\in R^{C\times H\times W}\), and their corresponding knowledge \(Kn\left( X^{T} \right) \) and \(Kn\left( X^{S} \right) \) \(\in R^{1\times H\times W}\), since we deploy a GAP operation along the channel dimension of the hidden feature X to convert it into a 2D tensor. To make training more stable, we impose a Min–Max normalization on the extracted 2D knowledge. The function \(Distance\left( \right) \) measures the distance between two 2D tensors; one can choose the common L1 or L2 distance. By adding this constraint, the knowledge from the teacher model specifically guides the features of the student model. The above optimization target encourages the distorted feature space to be close to the original clean feature distribution. It can reduce the performance drop of adversarially trained models on clean samples and also weaken the influence of adversarial examples to a certain extent.
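A minimal sketch of the knowledge extraction \(Kn(\cdot)\) and the distillation loss of Eq. (8) is given below, assuming channel-wise averaging followed by per-map Min–Max normalization and the L1 distance adopted in our experiments; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_knowledge(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Kn(X): average over the channel dimension, then Min-Max normalize
    each 2D map for stable training. (N, C, H, W) -> (N, H, W)."""
    attn = x.mean(dim=1)                                     # channel-wise GAP
    flat = attn.view(attn.size(0), -1)
    mn = flat.min(dim=1, keepdim=True).values
    mx = flat.max(dim=1, keepdim=True).values
    norm = (flat - mn) / (mx - mn + eps)
    return norm.view_as(attn)

def kd_loss(teacher_feats: torch.Tensor, student_feats: torch.Tensor) -> torch.Tensor:
    """Eq. (8) with the L1 distance as Distance()."""
    return F.l1_loss(attention_knowledge(teacher_feats),
                     attention_knowledge(student_feats))
```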

Therefore, the final loss function of the entire RSA can be expressed as:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{ce} + \alpha \mathcal {L}_{fasc} + \beta \mathcal {L}_{fass} + \gamma \mathcal {L}_{kd}, \end{aligned}$$
(9)

where \(\mathcal {L}_{ce}\) stands for the vanilla classification loss in the backbone network. \(\mathcal {L}_{fasc}\) and \(\mathcal {L}_{fass}\) represent the modified loss of the auxiliary classifiers in channel and spatial dimension, respectively. \(\mathcal {L}_{kd}\) represents the knowledge distillation loss. \(\alpha \), \(\beta \), and \(\gamma \) are the weights of components within the loss function.

Experiments

In this section, we evaluate our proposed RSA algorithm from both quantitative and qualitative perspectives through extensive experiments. We first introduce the five datasets used in the experiments and present the basic hyperparameter settings. Afterward, we analyze the results of both white-box and black-box attack defenses, including comparisons with our baseline model and other state-of-the-art adversarial defense approaches. We finally conduct a series of qualitative experiments and ablation studies to further investigate and demonstrate the effectiveness of our proposed algorithm.

Table 1 White-box experimental results of clean accuracy and robust accuracy on CIFAR-10 under network architecture ResNet-18 and WRN-34-10. - represents the absence of corresponding results in the compared literature. Best results are shown in bold

Experiment settings

Datasets We choose five classical datasets to evaluate our RSA model: CIFAR-10 [33], CIFAR-100 [33], SVHN [48], Tiny ImageNet [10], and ImageNette [25]. CIFAR-10 contains 60,000 three-channel color images of 32*32 pixels, covering 10 different object categories. CIFAR-100 is also composed of 60,000 colored images, with 100 categories. SVHN is an excerpt of house numbers from Google Street View images in a style similar to MNIST [35]. This dataset has 10 different digit classes; the training set has 73,257 images and the test set has 26,032 images. Tiny ImageNet contains 100,000 colored images, all downsized to 64*64 pixels; each class has 500 training images and 50 validation images. ImageNette is a subset of 10 easily classified classes from ImageNet [10], and its images have a higher resolution than those of the other four datasets mentioned above. ImageNette consists of 9469 training images and 3925 validation images.

Experimental setup To evaluate the effectiveness of our algorithm on various network architectures, we choose ResNet-18 [23], WRN-34-10 [75], and PreActResNet-18 [23] as our backbone networks. Note that our RSA model is trained through adversarial training. For fair comparisons, we adopt the same training strategy and hyper-parameters as our baseline model. Therefore, we adversarially train the above three backbone networks for 200 epochs with adversarial examples generated by PGD-10 (\(\varepsilon =8/255\), step size \(=2/255\), and random initialization). We leverage SGD with a learning rate of \(1e-2\), momentum 0.9, and weight decay \(2e-4\) to optimize our model. To be consistent with the baseline model, we insert the proposed feature refinement module and feature activation suppression module into the last residual block of the chosen backbone networks. The classifiers within the extra branches of the feature activation suppression module use the modified loss mentioned above, while the normal cross-entropy loss is used on the backbone network for classification. In the experiments, we set the weight coefficients \(\alpha \), \(\beta \), and \(\gamma \) to 2, 0.01, and 0.001, respectively, and the parameters \(\lambda _1\) and \(\lambda _2\) in the modified loss of the channel dimension to 1 and 0.1, respectively. We choose the L1 distance to measure the distance of hidden features between the teacher and the target student model in the knowledge distillation constraint. In this paper, we evaluate the performance of the proposed model using clean and robust accuracy. Clean accuracy measures the model's recognition accuracy on clean samples, while robust accuracy assesses the model's performance under adversarial threats. This comprehensive evaluation provides a more thorough assessment of the model's robustness. In summary, all of our experimental settings and evaluation metrics align with mainstream practices in the field of adversarial defense, ensuring a precise assessment of the proposed algorithm's effectiveness.
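For reference, the PGD-10 adversarial example generation used during training can be sketched as follows: a generic \(l_{\infty }\) PGD loop with the hyper-parameters stated above, given purely as an illustrative sketch rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8/255, step_size=2/255, steps=10):
    """L-infinity PGD with random initialization (PGD-10 configuration above)."""
    images = images.detach()
    delta = torch.empty_like(images).uniform_(-eps, eps)          # random start
    delta = (images + delta).clamp(0, 1) - images
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(images + delta), labels)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach()
        delta = (images + delta).clamp(0, 1) - images             # keep pixels in valid range
    return (images + delta).detach()
```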

White-box attack defense

Table 2 White-box experimental results of clean accuracy and robust accuracy on CIFAR-100 under network architecture ResNet-18 and WRN-34-10. - represents the absence of corresponding results in the compared literature. Best results are shown in bold
Table 3 White-box experimental results of clean accuracy and robust accuracy on SVHN under network architecture ResNet-18 and WRN-34-10. Best results are shown in bold
Table 4 White-box experimental results of clean accuracy and robust accuracy on Tiny ImageNet under network architecture PreActResNet-18. Best results are shown in bold
Table 5 White-box experimental results of clean accuracy and robust accuracy on ImageNette under network architecture ResNet-18. Best results are shown in bold

To verify the effectiveness of our proposed RSA algorithm, we first test it against the four most frequently used white-box attack algorithms: FGSM [18], PGD-20 [44], CW-\(\infty \) [3], and AutoAttack [7, 8], and then compare the defense results with other advanced adversarial defense algorithms. The \(l_{\infty }\)-norm of the perturbations is bounded by \(\epsilon = 8/255\). The settings of the attack methods used in the experiments align with those of the baseline model. Since different defense methods are evaluated on different datasets and network architectures, we select the backbone networks and datasets for testing and comparison according to the compared algorithms, so as to evaluate our proposed model as completely as possible. In general, we compare RSA with five classes of defense algorithms: our baseline models, the improved adversarial training-based methods, the knowledge distillation-based methods, the NAS-based methods, and the pruning-based methods.

Compared with the baseline models: Our baseline model CAS [2] and its variant CIFS [71] only consider the effect of adversarial examples from the perspective of feature channel activation values, while omitting the pernicious effect of polluted features in the spatial domain. At the same time, neither method considers the clean accuracy drop brought by adversarial training. Our RSA outperforms both algorithms in clean and robust accuracy, surpassing the baseline model CAS in clean accuracy and in all four adversarial robustness metrics. In particular, the clean accuracy of RSA on ResNet-18/CIFAR-10 is more than 2\(\%\) higher than that of these two algorithms, while the robust accuracy improvement is more than 10\(\%\) under PGD-20 and more than \(5\%\) under AA. As shown in Table 3, on SVHN/WRN-34-10, RSA also exceeds the baseline model by more than 5\(\%\) in all robust accuracy metrics. In addition, we report the white-box attack results on the ImageNette dataset in Table 5, which we believe confirms the effectiveness of our RSA on higher-resolution images.

Compared with the improved adversarial training-based models: We also compare our proposed RSA with several advanced defense approaches that directly change the paradigm of adversarial training, including GAT [58], NuAT [59], PCL [46], TLA [45], FGSM-GA [1], FGSM-MEP [29], AWP [66], LBGAT [9], FAT [79], SAT [56], and UIAT [12]. As shown in Tables 1, 2, 3 and 4, our proposed RSA achieves better performance than the above algorithms. It is worth mentioning that RSA not only outperforms these AT-improved algorithms in terms of robust accuracy, but also maintains a high level of clean accuracy, suggesting that our RSA architecture can improve the adversarial robustness of target models while avoiding an excessive degradation of their performance on clean images. In particular, our algorithm is better than the recently proposed UIAT in terms of both clean accuracy and adversarial robustness on the CIFAR-10, CIFAR-100, and SVHN datasets. It is noteworthy that the TLA algorithm also leverages metric learning techniques to enhance adversarial robustness. In contrast to our approach, TLA employs a triplet loss on the features of the penultimate layer, while we incorporate consistency constraints into additional branches. Furthermore, our method addresses activation value suppression at both the feature channel and spatial levels, which contributes to its superior performance.

Compared with the knowledge-distillation-based models: As we design a knowledge distillation constraint in the feature alignment module, we also compare the performance of the proposed RSA with several knowledge-distillation-based adversarial defense methods, including ARD [17], RSLAD [87], MTARD [80], and AdaAD [26]. It can be seen from Tables 1 and 3 that our algorithm surpasses RSLAD in almost all clean and robust accuracy metrics, which shows that the first two modules we propose can indeed improve the adversarial robustness of models. Specifically, RSA outperforms RSLAD by about 10\(\%\) under FGSM and PGD-20 on ResNet-18/CIFAR-10, and by about 5\(\%\) under CW-\(\infty \) on ResNet-18/CIFAR-100. Compared with MTARD, our algorithm achieves better clean and robust accuracy; in particular, it outperforms MTARD by about 8\(\%\), 7\(\%\), and 9\(\%\) in robust accuracy under FGSM, PGD-20, and CW-\(\infty \), respectively. Although MTARD utilizes a dynamic training paradigm to balance the clean and adversarial teacher models, the results suggest that the clean teacher dominates overwhelmingly; in contrast, our RSA model balances clean and robust performance better than MTARD. Finally, compared with the current state-of-the-art algorithm AdaAD, our proposed RSA model performs better both in terms of clean accuracy and in terms of adversarial robustness under the FGSM and AutoAttack tests.

Compared with the NAS-based & the pruning-based models: RobNet-large-v2 [21] and DS-NET [14] utilize neural architecture search (NAS) to find the most robust network from artificially designed modules and atomic structures. As shown in Table 1, our RSA surpasses these two NAS-based defense methods in clean and robust accuracy. It is worth noting that although the clean accuracy using WideResNet-34-10 on the CIFAR-10 and SVHN datasets is slightly lower than that of DS-NET, the adversarial robustness of our algorithm is much better, especially under PGD-20, CW, and AA, where it is 10\(\%\), 9\(\%\), and 7\(\%\) higher than DS-NET, respectively. We also compare our RSA with five pruning-based approaches: AdvPrune [73], kWTA [68], HYDRA [54], SAP [11], and MAD [37]. The clean accuracy and robust accuracy of our proposed algorithm exceed those of both kWTA and SAP. Likewise, our RSA algorithm outperforms HYDRA in all accuracy metrics and is substantially ahead of AdvPrune in terms of robust accuracy. Although MAD is slightly higher than our algorithm on some individual metrics, the overall performance of our model is still better than that of the pruning-based methods.

The results under AutoAttack: AutoAttack [7, 8] is an ensemble of two Auto-PGD attacks and two complementary attacks, FAB and Square Attack. It is regarded as a more powerful attack and has become a widely used, fair evaluation of adversarial robustness, so we also report the AutoAttack results in each table. Our RSA outperforms both our baseline model CAS and the other compared approaches on all five datasets. We believe these results not only illustrate the effectiveness of our proposed RSA algorithm, but also demonstrate that the performance improvement does not stem from obfuscated gradients or unfair evaluations.

Black-box attack defense

Table 6 Black-box experimental results of robust accuracy on CIFAR-10 based on ResNet-18. Best results are shown in bold

Although our model is not specifically designed to defend against black-box attacks, we also test its defensive performance against them. For fairness of comparison, we adopt the same experimental settings as the baseline model CAS and select two types of black-box attacks, transfer-based and query-based. The former includes PGD-20 and CW-\(\infty \), while the latter uses the \(\mathcal {N}\)A attack [38]. In the transfer attack, the adversarial examples are generated from a standard-trained ResNet-50 model, which we regard as the proxy model. When testing the \(\mathcal {N}\)A attack, we randomly sample 1000 images from the test sets of the CIFAR-10 and SVHN datasets and limit the maximum number of queries to 20,000. The experimental results are shown in Table 6. It can be seen that the black-box robustness of our proposed model is better than that of the baseline model under all three black-box attacks. Moreover, our proposed method also outperforms the baseline when combined with other advanced adversarial defense methods, such as TRADES [78] and MART [63]. These results empirically show that our proposed RSA algorithm not only improves the defense against white-box attacks but also performs better against diverse black-box attacks compared with our baseline model.

Fig. 9

Comparison of results under two gray-box attack scenarios. The left side shows the robust accuracy under the PGD attack with different numbers of attack steps, while the right side shows the robust accuracy under the FGSM attack with different perturbation radii. Best viewed in color

Gray-box attack defense

In addition to the aforementioned tests evaluating the model's defense capabilities against white-box and black-box attacks, we conduct supplementary assessments to gauge its robustness under gray-box attack scenarios. In gray-box attacks, adversaries possess an intermediate level of information compared to white-box and black-box adversaries. Specifically, in this context, we assume that gray-box attackers have access to the model's specific structure, its final output, and gradient information, but remain unaware of the exact defense method employed. To evaluate the model's performance under gray-box attacks, we consider two scenarios on the ImageNette dataset: PGD attacks with a continuously increasing number of attack steps, and FGSM attacks with a progressively increasing perturbation radius. Figure 9 presents the results, comparing our model with the baseline model under the above two gray-box attack scenarios. It demonstrates that our model exhibits superior adversarial robustness in both scenarios. In summary, our model consistently outperforms the baseline model across white-box, black-box, and gray-box attack scenarios, thereby substantiating the effectiveness of the proposed algorithm against diverse threat models.

Qualitative experiment

Fig. 10

Comparison of the CAS model and our RSA model in terms of the difference in feature activations in the channel dimension. The subplots in the top row are the results of the CAS model for three categories randomly chosen from the CIFAR-10 dataset; below are the corresponding results of our RSA model. The top and bottom subplots of each column belong to the same image category, and the channels represented by the abscissa are in one-to-one correspondence. As in the schematic diagrams of the Method section, red represents that the activation value of clean samples in a channel is higher than that of adversarial examples, while blue is the opposite. Best viewed in color

Fig. 11

Comparison of the CAS model and our RSA model in terms of the difference in feature activations in the spatial dimension. The subplots in the top row are the results of the CAS model for three categories randomly chosen from the CIFAR-10 dataset; below are the corresponding results of our RSA model. The top and bottom subplots of each column belong to the same image category, and the spatial positions on the upper and lower heat maps correspond to each other. As in the schematic diagrams of the Method section, red represents that the activation value of clean samples at a position is higher than that of adversarial examples, while blue is the opposite. Best viewed in color

Fig. 12

Comparison of the baseline model and our RSA model in terms of the MSE distance between clean images and their adversarial examples. On the left are the results on CIFAR-10, and on the right are those on ImageNette. The horizontal axis represents the categories in the dataset. Best viewed in color

Fig. 13

Attention maps of samples of ImageNette obtained from a standard trained ResNet-18 and RSA. Best viewed in color

In addition to the quantitative experiments conducted under white-box and black-box attacks, we also analyze the superiority of the RSA algorithm through qualitative experiments. To investigate the differences in features between our RSA model and the baseline model CAS, we design two types of experiments, in the channel and spatial dimensions.

We compare the activation value differences between clean samples and adversarial examples at the same channel/spatial positions in CAS and RSA, aggregated over samples of the same category. We first investigate the channel dimension. As shown in Figure 10, in the channels with large activation values (the left part of each subplot), our RSA algorithm effectively reduces the activation value differences observed in CAS. Although RSA slightly enlarges the activation values of adversarial examples in a few channels (the noticeable blue parts in the second column), the magnitude of this difference is smaller than in CAS. Therefore, the results suggest that the RSA model can effectively reduce the differences between the activation values of the two types of samples in the channel dimension.

For the spatial dimension, we carry out a similar analysis. It can be seen from Figure 11 that the differences in activation values at corresponding spatial positions are, on the whole, clearly smaller for our RSA algorithm than for the baseline model CAS. Therefore, it can also be qualitatively shown that the RSA algorithm effectively eliminates the differences between the feature activation values of the two types of samples in the spatial dimension.

Furthermore, we conduct a comparative analysis of the feature distances between RSA and the baseline models when applied to clean samples and their corresponding adversarial counterparts. Specifically, we measure the Mean Squared Error (MSE) distance between each clean image and its corresponding adversarial example (generated using PGD-20) at the output features of the penultimate layer, categorized accordingly. Figure 12 presents the findings for both CIFAR-10 and ImageNette datasets. As illustrated in the figure, on the CIFAR-10 dataset, our model significantly reduces the feature distance between adversarial and normal samples compared to the baseline model. Moreover, there is a slight reduction observed on the ImageNette dataset as well. These results underscore the efficacy of our proposed algorithm in mitigating the deleterious impact of adversarial samples within the feature space.

We also provide attention maps of adversarial examples obtained by our RSA model and a standard trained ResNet-18 on ImageNette separately. As shown in Figure 13, the standard trained network cannot correctly locate the region in the images where the classification targets are located due to adversarial noise, while our proposed RSA model can reduce the effects and correctly focus on the features of the classification targets.

Ablation study

Table 7 Analysis of the effectiveness of proposed RSA modules at different blocks of ResNet-18 on CIFAR-10. Best results are shown in bold

To better investigate the RSA algorithm, we conduct multiple sets of ablation and comparison experiments to explore: (1) the impact of the insertion position of the RSA modules in the backbone network on the final performance, and (2) the impact of each of the three modules on the final performance.

Impact of RSA insertion position: First, to explore the impact of the insertion location of the RSA modules on performance, we insert them into different blocks of ResNet-18 for comparison. Note that the three modules designed in this paper are meant to be used in conjunction, so all three modules are inserted together into each block. The results of the comparison experiments are shown in Table 7; the largest adversarial robustness improvement is obtained when the RSA modules are inserted into the deep blocks of the backbone network (e.g., block 4 and block 3+4), whereas insertion into shallow blocks gives poor performance. We believe that the activation values of the deeper features of the backbone network are more relevant to correct category prediction, while the shallow features may adversely affect classification. Also, inserting RSA into block 4 yields a better trade-off between robustness and clean accuracy, so we insert the RSA modules into block 4 in all experiments.

Table 8 Sensitivity analysis of the weight of proposed feature activation module and feature alignment module of ResNet-18 on ImageNette. Best results are shown in bold. Our report results are underlined
Table 9 Analysis of the impacts among different modules of the RSA model on CIFAR-10, CIFAR-100, and SVHN based on ResNet-18 and WRN-34-10. Best results are shown in bold

Sensitivity analysis of RSA loss weights: We conduct a sensitivity analysis on the weights within the RSA loss function. To ensure consistency with the original baseline model, we hold \(\alpha \) fixed at 2 while varying the values of \(\beta \) and \(\gamma \) to assess their impact on the final performance. These experiments are executed using the ResNet-18 architecture on the ImageNette dataset, and the results are presented in Table 8. It is evident from the table that our proposed model is insensitive to changes in the various weight values, indicating robust generalization of the RSA model. The underlined results in the table strike a better balance between clean accuracy and adversarial robustness, representing the second-best outcomes in both aspects; therefore, we choose them as the final reported results.

Impact of sub-components of RSA: In addition, to explore the impact of the various sub-components of our proposed algorithm on adversarial robustness and clean accuracy, we design another ablation experiment. As shown in Table 9, we compare the robust and clean accuracy of the full RSA algorithm (V7 in each sub-table) with the other six variants. We first compare the performance of the three proposed modules when they are used independently. A certain degree of improvement in adversarial robustness can be obtained using only the feature refinement module, but it also has the largest clean-accuracy reduction among V1 to V3. We infer that this is because, when reducing the malicious effects in the channel dimension, the module may also lose some information and affect the feature distribution of clean images. In contrast, using only the feature alignment module achieves the highest accuracy on clean samples but provides limited robustness; the reason is that the feature alignment module constrains the distance between adversarial and clean samples in the feature space, so it mainly increases the performance of the model on clean samples after adversarial training. The overall performance of using the feature activation suppression module alone lies between the above two modules. Therefore, we believe that the three modules have different functions and purposes and are more suitable to be used in combination.

We then compare the complete RSA model with the other variants. The performance of using the three proposed modules in combination is better on both clean and adversarial examples. Although the best performance on clean samples on SVHN/WRN-34-10 is obtained using only the feature alignment module, its robust accuracy is substantially lower than that of the full RSA. Therefore, we believe that our proposed RSA, incorporating the three modules together, can improve the trade-off between the clean accuracy and robust accuracy of target models trained with adversarial training.

Discussion

Our proposed RSA algorithm has demonstrated its effectiveness in countering adversarial attacks, as evidenced by both quantitative and qualitative analyses. Through feature-level defense, which considers both channel and spatial dimensions, along with an additional alignment constraint, RSA not only enhances model adversarial robustness but also mitigates significant drops in clean accuracy. These outcomes underscore the practical promise of our method in security-critical applications.

While our RSA outperforms the compared state-of-the-art algorithms, it is not without limitations. The effectiveness of our feature alignment module relies heavily on the availability of powerful teacher models, which may constrain its applicability in cases where such models are not readily accessible. Moreover, the exploration of adversarial defense methods and principles for countering adversarial examples at the feature level remains an open issue, demanding further investigation in upcoming studies. Additionally, further in-depth research is needed to address potential challenges posed by unseen attacks in the future, marking an important direction for enhancing our model’s robustness against stronger adversarial threats.

Conclusion

In this paper, we propose a novel adversarial example defense algorithm RSA, which first leverages the feature refinement module to restore and refine the overall activation magnitude in the feature channels, and then utilizes the feature activation suppression module to reweight the high-order features in both channel and spatial domains. The feature space is finally aligned by a knowledge distillation operation and an extra consistency constraint on the two auxiliary branches. Extensive experiments and comparisons with other state-of-the-art defense algorithms on five public datasets and three widely used backbone networks demonstrate the superiority of our proposed RSA algorithm. Through experimental analysis, we argue that feature-level protection plays an important role in defending against adversarial examples. In the future, we will conduct further research into the patterns and characteristics of adversarial examples at the sample feature level. We will also integrate these findings with the latest feature consistency and restoration methods to explore more effective strategies for enhancing model robustness. Simultaneously, we are committed to investigating strategies for augmenting the adversarial robustness of foundational visual models and large multi-modal models, aiming to ensure their safety and reliability.