Introduction

Deep learning [36] has achieved remarkable success across a wide spectrum of applications, including image classification [23, 27], object detection [52, 55], image generation [31, 84], and information fusion [57, 61, 83, 86]. However, despite these advances, extensive research [4, 60, 85] has shown that deep neural networks are inherently fragile. Malicious attackers can make well-trained networks produce grossly erroneous predictions simply by adding tiny perturbations to the original images. Attacks that exploit this weakness are called adversarial attacks, and the corresponding distorted images are adversarial examples [18, 60]. Because of these attacks, security-critical application scenarios such as autonomous driving [15, 32, 39], face recognition [13, 82], and medical diagnosis based on medical image analysis [22, 30, 43] are exposed to very high risks.

Fig. 1

The class activation maps of samples from ImageNette, generated by GradCAM. Best viewed in color

To address the huge security threat that adversarial examples pose to intelligent systems, researchers from both academia and industry have devoted substantial pioneering effort to adversarial example defenses [49]. Among all defense approaches, adversarial training (AT) and its variants [44, 63, 78] have emerged as common choices for mitigating malicious attacks, but they provide only limited adversarial robustness and often reduce clean accuracy. More recently, some researchers [2, 71] have attempted to counteract adversarial examples by modifying the original network architectures to denoise the malicious effects in the feature space. Figure 1 shows the class activation maps of clean samples and their adversarial examples in the feature space. Bai et al. [2] leveraged channel activation suppression (CAS) to strengthen significant feature channels while suppressing abnormal ones using weighted scores. Furthermore, Yan et al. [71] proposed CIFS, which utilizes an importance mask generation module to extend the vanilla CAS method.

The above approaches enhance the adversarial robustness of target models compared to the original AT methods, but they still have several drawbacks. First, existing methods that modify network architectures typically rely on global average pooling (GAP) to denoise and obtain channel features. Nevertheless, this technique does not effectively eliminate the malicious activations induced within channels by adversarial examples [67]. Furthermore, these algorithms predominantly focus on mitigating redundant activation values within feature channels, thereby overlooking the spatial dimension’s role in robust image classification [16, 65, 81]. Additionally, the majority of adversarial defense techniques substantially compromise clean image performance, which hinders the practical deployment of defense-enhanced models.

To address the aforementioned challenges, in this paper we propose the RSA algorithm, which consists of a feature Refinement module, a feature activation Suppression module, and a feature Alignment module. Specifically, to mitigate the abnormally low channel activation values caused by adversarial perturbations, the feature refinement module first detects and refines abnormal values in the channel dimension. Next, we utilize the feature activation suppression module, with a parallel complementary mode, to suppress redundant activation values in both the channel and spatial dimensions. To avoid the clean accuracy drop caused by adversarial training, we additionally leverage the feature alignment module to impose two constraints on the hidden features: a consistency constraint that makes features of the same category more compact, and a knowledge distillation constraint that guides the features through an independent teacher network.

Overall, previous research often prioritizes the development of adversarial training strategies or confines its scope to the feature channel dimension alone, thereby neglecting the broader, multi-dimensional consequences of adversarial noise at the feature level. Upon analysis, we observe that adversarial examples disrupt both the channel and spatial dimensions of the feature space, subsequently degrading model performance. Therefore, we develop a novel feature-level defense mechanism, RSA. Our method operates at the feature level and can enhance adversarial robustness while preserving clean accuracy. Our findings underscore the importance of feature-level defenses for mitigating the adverse impact of adversarial examples on model performance.

In conclusion, our main contributions can be summarized as follows:

1. We conducted a quantitative analysis to assess the influence of adversarial attacks on the sample feature space. Building upon these insights, we introduce two novel modules: Feature Refinement and Feature Activation Suppression. These modules are designed to mitigate the adverse effects of adversarial attacks and reduce redundant activation values across both channel and spatial dimensions.

2. To avoid the model performance drop in clean accuracy after adversarial training, we impose an extra feature alignment module with a consistency constraint and a knowledge distillation constraint on the feature space after refinement and suppression.

3. Extensive quantitative and qualitative experiments on five public datasets and three widely used backbone networks demonstrate that our proposed RSA algorithm has overall superior performance compared to state-of-the-art defense methods.

Related work

In this section, we briefly review the commonly used adversarial defense algorithms. We divide the defense approaches into three categories: input transformation defenses, adversarial training (AT) and its variants, and network structure optimization.

Input transformation defenses

The exploration of adversarial example defenses in the early research community focused on transforming the input images, such as the minimum variance method [20] and the JPEG compression algorithm [42]. After that, Raff et al. [51] stochastically combined multiple input transformations, akin to ensemble learning, to counter potential adaptive adversarial attacks. Similarly, Taran et al. [62] introduced a defense mechanism based on randomized diversification over a set of input transformations. In addition, some researchers explored other methods based on input transformations. Jia et al. [28] utilized learnable compression modules to mitigate malicious perturbations. Samangouei et al. [53] proposed Defense-GAN to counter adversarial examples with GANs. Inspired by [53], Yuan et al. [74] leveraged an ensemble generative cleaning network with a feedback loop to clean images of adversarial patterns. Several defense methods based on input denoising have also been proposed. Gu et al. [19] attempted to use a denoising autoencoder and a deep contractive network to remove adversarial noise. Liao et al. [41] proposed a denoiser guided by high-level representations, which alleviates the error amplification effect of standard denoisers. Liang et al. [40] introduced a smoothing spatial filter with a scalar quantization mechanism to reduce the impact of adversarial attacks. In general, input transformation-based defenses are usually easy to deploy, but they are not effective against many powerful attacks.

Adversarial training and variants

Adversarial training (AT) has been the most successful approach for training robust neural networks against adversarial examples. AT was first proposed by Goodfellow et al. [18], who exposed the original deep neural networks to adversarial examples during the training phase. Madry et al. [44] proved that the solution to the Min–Max optimization problem of adversarial training gives the optimal model parameters when faced with adversarial examples. Moreover, they advocated using projected gradient descent (PGD) to generate adversarial examples during adversarial training, and it has become a widely used and effective approach to resist malicious attacks.

Although adversarial training could improve the robustness of models to some extent, it required substantial additional computing resources and time. Unfortunately, AT also degraded the generalization performance on clean samples. To better enhance the vanilla AT approach, multiple AT variants have been explored. Zhang et al. [78] proposed the TRADES algorithm to simultaneously control two parts of the original adversarial error: natural error and boundary error. Meanwhile, the MART [63] algorithm considered the impact of misclassified clean samples on model robustness during training. Wu et al. [66] proposed Adversarial Weight Perturbation (AWP), which adversarially perturbed both inputs and weights to explicitly regularize the flatness of the weight loss landscape. Some researchers have also tried to utilize single-step defenses with relaxation terms and regularizers to reduce the computational cost of AT, such as Guided Adversarial Training (GAT) [58] and NuAT [59]. To address the performance decline on clean images due to AT, Lamb et al. [34] proposed Interpolated Adversarial Training, which employed interpolation-based methods in the framework of AT. Cui et al. [9] proposed LBGAT to guide robust network training with the classifier boundary of clean models. Yang et al. [72] proposed DAAT to retain clean accuracy through a naturally trained calibration network. To counter the catastrophic overfitting problem in adversarial training, FGSM-GA [1] and FGSM-MEP [29] utilized the GradAlign method and a prior-based initialization method, respectively. Similarly, Chen et al. [5] proposed an approach that alleviates the catastrophic overfitting of multi-exit networks with lower time complexity than vanilla AT. To improve the generalization performance of models after AT, the FAT algorithm [79] generated more diverse adversarial examples without reducing the robustness of the models. Smooth Adversarial Training (SAT) [56] leveraged the concept of curriculum learning to allow networks to achieve a better clean accuracy versus robustness trade-off than vanilla AT.

Network structure optimization

Researchers have also explored ways to design and optimize network structures for better adversarial robustness in combination with AT. Early works focused on denoising or making attacks more difficult by adding auxiliary structures. Xie et al. [69] leveraged the non-local attention module used in semantic segmentation to eliminate aberrant noise in the feature maps of adversarial examples. He et al. [24] injected trainable Gaussian noise into either activations or weights during training to improve the performance of vanilla AT. Mygdalis et al. [47] proposed M-SVDD-D, which defends against adversarial examples by increasing the noise energy required to deceive the protected models, thereby decreasing the effectiveness of adversarial attacks. The more effective methods based on network structure optimization can currently be divided into the following categories.

Knowledge-distillation-based methods: Some researchers attempted to leverage the concept of knowledge distillation (KD) to defend against adversarial noise, including ARD [17], RSLAD [87], and MTARD [80]. KD-based approaches usually employ one or more teacher models, which can be adversarially or standardly trained, to guide the training of the target student model. ARD was a classical KD-based defense. Inspired by it, RSLAD exploited robust soft labels to train small student models through large teacher models. Alternatively, MTARD used multiple teacher models and a dynamic training algorithm to balance the influence of the adversarial and clean teachers.

NAS-based and pruning-based methods: Some defense methods such as RobNet [21] and DS-NET [14] utilized neural architecture search (NAS) to find robust modules and structures and redesign families of robust architectures. Meanwhile, several pruning-based algorithms have also been proposed, such as SAP [11], AdvPrune [73], kWTA [68], HYDRA [54], and MAD [37]. The purpose of these approaches was to preserve adversarial robustness while compressing model size.

Channel activation suppression-based methods: In recent years, the correlation between channel activation values and adversarial robustness has attracted increasing attention. Bai et al. [2] found that there were disparities in the magnitudes and frequencies of channel activation values between clean and adversarial examples. They therefore proposed the channel activation suppression (CAS) algorithm, which inhibits adversarial noise in abnormal channels by implementing global average pooling (GAP) and an extra auxiliary classifier branch. Based on CAS, Yan et al. [71] divided the channels of feature maps into positively relevant (PR) channels and negatively relevant (NR) channels. They then introduced a new mechanism named channel-wise importance-based feature selection (CIFS), which aligns PR channels and suppresses NR ones, thereby improving the adversarial robustness of target models.

Although the vanilla CAS and CIFS algorithms improve adversarial robustness, both of them align the aberrant feature activation values only from the perspective of the channel dimension. Besides, the GAP operation used in CAS cannot effectively eliminate the effects of anomalous activations in the hidden features; we assume this is because GAP is merely a global sum-and-average operation and does not remove the adverse effects within the features. Moreover, these algorithms still do not avoid the decreased performance of models on clean samples after adversarial training. To address the above drawbacks, in this paper we propose a novel model, RSA, which reduces the malicious effects of adversarial examples to improve adversarial robustness while avoiding the performance decline on clean samples.

Fig. 2

The difference between a the standard training model, b our baseline model CAS, and c our proposed RSA model. Our RSA model consists of three modules: feature refinement module, feature activation suppression module, and feature alignment module (consistency constraint (CC) and knowledge distillation (KD)). Best viewed in color

Method

In this section, we introduce the proposed RSA algorithm. We first investigate the difference in activation values between adversarial examples and clean samples in the corresponding channels of the feature maps, based on which we propose a feature refinement module. After that, we introduce the feature activation suppression module, which acts on both the channel and spatial dimensions. Finally, we present the feature alignment module, which consists of a consistency constraint applied to the extra branches and a knowledge distillation operation on the reweighted features. These three components together constitute our proposed RSA algorithm. Figure 2 compares a standard training model, the baseline CAS model, and our proposed RSA model. As shown in Figure 2(c), perturbed features from adversarial images are first refined by the feature refinement module. After that, the feature activation suppression module reweights the refined features. Finally, the feature alignment module adopts two constraints, the consistency constraint (CC) and knowledge distillation (KD), to align the reweighted features.

Feature refinement

CAS [2] utilizes global average pooling (GAP) to obtain the global channel representation and regards it as a preliminary denoising function for adversarial examples as well. However, GAP essentially averages all activations of each channel, and we assume that it does not effectively remove the influence of malicious noise; [67] also demonstrates that GAP is not a reliable information aggregation method. Although adversarial examples contain only human-imperceptible perturbations, their adverse effects may be amplified through forward propagation. These malicious influences are reflected as irregular differences in feature activations, and a simple GAP may not be able to purify them.

Fig. 3

The difference between the channel activation values of clean samples and their corresponding adversarial examples. Each subgraph shows the average channel values of all samples of a specific class. The horizontal axis represents each channel of the feature maps, and the vertical axis represents the difference between the activation values of the clean samples and their corresponding adversarial examples in this channel. Red indicates that the activation values of the clean samples are higher than those of the adversarial examples, while blue represents the opposite. For display purposes, we select samples from three classes of the CIFAR-10 dataset and observe the top 100 of the 512 channels in terms of activation magnitude. Best viewed in color

To inspect the influence on feature activation values, we extract the activations of both clean and adversarial examples from the penultimate layer of an adversarially trained network equipped with CAS. We then investigate the channel-wise representation differences between clean samples and adversarial examples as follows. After applying GAP to the extracted features, we first sort the channels of the clean samples by activation magnitude in descending order. We then fix this channel order and compute the difference between the activations of the clean and adversarial examples within each channel. In Figure 3, red indicates that clean images have higher activations than adversarial examples in a channel, and blue indicates the opposite.
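For concreteness, the analysis procedure just described can be sketched roughly as follows. This is a minimal PyTorch-style illustration; the function names, tensor shapes, and top-k choice are our own assumptions rather than released analysis code.

```python
import torch

def channel_activation_gap(features: torch.Tensor) -> torch.Tensor:
    """Global average pooling over spatial dims: (N, C, H, W) -> (N, C)."""
    return features.mean(dim=(2, 3))

def sorted_channel_difference(clean_feats: torch.Tensor,
                              adv_feats: torch.Tensor,
                              top_k: int = 100) -> torch.Tensor:
    """Average channel activations over the batch, sort channels by the clean
    magnitude in descending order, and return clean-minus-adversarial
    differences for the top-k channels (positive values -> clean is higher)."""
    clean_gap = channel_activation_gap(clean_feats).mean(dim=0)  # (C,)
    adv_gap = channel_activation_gap(adv_feats).mean(dim=0)      # (C,)
    order = torch.argsort(clean_gap, descending=True)            # fix channel order by clean magnitude
    diff = (clean_gap - adv_gap)[order]
    return diff[:top_k]
```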

As shown in Figure 3, in channels with higher activation magnitudes, the activation values of the original images are clearly larger than those of the adversarial examples. As the channel activation values decrease, adversarial examples have slightly larger activation amplitudes than clean samples, but the difference is not as large as that in the high-amplitude channels. Adversarial examples perturb the feature distributions, which likely causes the above activation differences. Therefore, we believe it is necessary to deal with the aberrant channel activation values caused by adversarial perturbations, thereby improving the robustness of target models.

Fig. 4

Schematic diagram of the feature refinement module, showing the process of operating on a channel of a feature map. The red in the figure represents the detected outliers, while the blue in the corresponding position represents the corrected value. Best viewed in color

Some research works [6, 50] suggest that channels with different activation values contribute differently to correct classification, and we believe that a similar rationale also holds in adversarial defense. We suspect that channels with large activation values on the original images are more important for correct image classification. In these channels, the activation magnitudes of adversarial examples are mostly lower than those of clean samples. We believe that the addition of adversarial noise changes the feature activation values and causes these differences. For each channel, if some of the smaller activation values are amplified, the overall activation magnitude of the channel is correspondingly slightly increased, and the adverse effects may be alleviated. We therefore attempt to refine the minimum value in each channel; in this way, the overall activation of the whole channel can be enlarged without significantly disturbing the feature distribution of the data. Specifically, we propose a novel feature refinement module that revises these minimum activation values and requires no additional trainable parameters. The schematic diagram of the feature refinement module is shown in Figure 4. For each channel in the feature map, the module first calculates the average activation value and then scales the minimum value in the channel to this average. The purpose of this module is to raise the minimum value so as to refine the overall magnitude of the channel activations, and we assume that modifying only one value does not unduly distort the feature distribution. The operation of the feature refinement module is formulated as follows:

$$\begin{aligned} X^{'} = X - X \bigotimes MinMask + Mean(X) \bigotimes MinMask. \end{aligned}$$
(1)

X represents the original hidden features, and \(X^{'}\) stands for the refined features. MinMask is a binary mask matrix in which 1 marks the location of the minimum value in each channel and 0 marks all other positions. Mean(X) computes the average activation value of each channel, and \(\bigotimes \) represents element-wise multiplication. After the above operation, the aberrantly small activation magnitudes in the channels of perturbed features caused by adversarial noise are likely to be amplified and refined.
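A minimal PyTorch-style sketch of Eq. (1) is given below, assuming the module simply replaces the single minimum activation of each channel with that channel's mean; the class and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureRefinement(nn.Module):
    """Parameter-free refinement: scale the per-channel minimum to the channel mean (Eq. 1)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        flat = x.view(n, c, h * w)
        # Location of the minimum activation within each channel.
        min_idx = flat.argmin(dim=2, keepdim=True)                    # (N, C, 1)
        min_mask = torch.zeros_like(flat).scatter_(2, min_idx, 1.0)   # MinMask in Eq. (1)
        # Per-channel mean, broadcast over spatial positions.
        mean = flat.mean(dim=2, keepdim=True)                         # Mean(X)
        refined = flat - flat * min_mask + mean * min_mask
        return refined.view(n, c, h, w)
```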

Feature activation suppression

Our baseline model utilizes channel-wise activation suppression and an auxiliary classifier to calculate the importance of channel features and adjust the activation values. However, we suppose that the activation difference also exists in the spatial domain, apart from the channel perspective. Similar to the qualitative analysis in the previous section, we draw a heatmap to investigate whether there is a difference among activations in the spatial domain. Figure 5 displays three \(4*4\) feature maps, where the value at each spatial position represents the magnitude difference between the clean images and their corresponding adversarial examples at this position (averaged over all channels). In this figure, red represents positive values and blue the opposite. It can be clearly observed that there are apparent differences in the spatial distribution of features between the two types of samples. Thus, we assume that in addition to suppressing the harmful redundant activation values from the channel perspective, similar operations should also be migrated to the spatial level of the feature space. We suppose that reweighting the feature maps through both channel and spatial activation suppression yields complementary effects on model robustness.

Fig. 5

Analysis of the difference between adversarial and clean sample feature maps from the spatial perspective. Each subgraph shows the average spatial activation values of all samples of a specific class. Red represents that the activation value of the clean samples at the corresponding position is higher than that of the adversarial examples, while blue represents the opposite. For a more intuitive analysis, the feature maps belong to all samples of three classes in the CIFAR-10 dataset. For ease of presentation and analysis, the feature maps have been converted to a single channel (2D) by averaging the activation values over channels. Best viewed in color

Therefore, inspired by the CBAM model [65] and other research utilizing spatial attention [16, 81], we extend the vanilla feature activation suppression module into a parallel complementary mode. The new feature activation suppression module suppresses both channel-wise and spatial-wise redundant activation values simultaneously, and we suppose that it can make a better complementary effect to counter adversarial examples. Figure 6 shows the schematic diagram of the proposed feature activation suppression module.

Given an intermediate feature map \(Z\in {R^{C\times H\times W}}\) obtained by the above feature refinement module, where C, H, and W represent the channel, height, and width of the feature map Z, respectively, the feature activation suppression module estimates the channel-wise activation \(F^c\) and the spatial-wise activation \(F^s\) by leveraging the global average pooling (GAP) operation simultaneously. Since the features are already refined by the previous module, we can apply the GAP operation. Additionally, for spatial activations, we use an extra \(1*1\) convolutional layer before the GAP operation to aggregate features. Formally, the two complementary activations can be computed as:

$$\begin{aligned} F^{c} = \frac{1}{H\times W}\sum _{i=1}^{H}\sum _{j=1}^{W}Z(i,j), \end{aligned}$$
(2)
$$\begin{aligned} F^{s} = \frac{1}{C}\sum _{k=1}^{C}Conv^{1*1}(Z(k)). \end{aligned}$$
(3)
Fig. 6

Schematic illustration of the feature activation suppression module. The corresponding weights are obtained from two auxiliary classifiers respectively, and we use those weight vectors to reweight the refined features. We finally aggregate the channel and spatial reweighted features by the tensor addition. Best viewed in color

As shown above, we aggregate spatial information by GAP and generate the channel context descriptor \(F^{c}\). The spatial feature descriptor \(F^{s}\) is obtained by a similar operation along the channel axis after an extra 1*1 depthwise separable convolutional layer. After extracting the two complementary context descriptors, we utilize two auxiliary fully connected (FC) layers as extra classifiers, with \(F^{c}\) and \(F^{s}\) as their respective inputs. For a multi-class classification task with K classes, the weight of an auxiliary FC layer can be written as W = \([W^{1},W^{2},...,W^{K}]\), in which each weight vector corresponds to one class. Therefore, we extract the specific weight vector related to the class of each example and leverage it to reorganize the above activations. We adopt the labels of the training data to select the weight vectors when training the model. In the inference phase, we pick the weight components using the predicted labels of the auxiliary layers, since the ground-truth labels are not accessible at this stage. In short, the activation reorganization can be performed as below:

$$\begin{aligned} Z^{'} = Z \bigotimes W^{true}, \quad \text {in the training phase}, \end{aligned}$$
(4)
$$\begin{aligned} Z^{'} = Z \bigotimes W^{predict}, \quad \text {in the testing phase}, \end{aligned}$$
(5)

where \(\bigotimes \) represents the element-wise multiplication, and \(W^{true}\) and \(W^{predict}\) stand for the weight vectors corresponding to the true labels and the predicted labels in the auxiliary FC layers, respectively. \(Z^{'}\) is the new feature reactivated by the above weight vectors, and it will be input into subsequent layers of the backbone networks. It should be noted that \(Z^{'}\) has two forms, \(Z^{c}\) and \(Z^{s}\), corresponding to channel dimension and space dimension, respectively.

Each of the channel-wise and spatial-wise context descriptors generates a weight vector through its auxiliary classifier, so we have two independent weight components to suppress the refined feature map Z. Vanilla CBAM utilizes a sequential arrangement to enhance the original intermediate features, but here we apply a parallel mode to merge the channel and spatial features after activation suppression. Finally, the ultimate feature is computed as:

$$\begin{aligned} Z^{sup} = \frac{1}{2}\times (Z^{c} + Z^{s}), \end{aligned}$$
(6)

where \(Z^{c}\) and \(Z^{s}\) stand for the channel-suppressed and spatial-suppressed features, respectively, and \(Z^{sup}\) is the final aggregated feature. Through the above operations, we aggregate the features reweighted by the feature activation suppression module to enhance adversarial robustness from both the channel and spatial perspectives.
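The parallel suppression described by Eqs. (2)–(6) can be sketched as follows. This is a hedged PyTorch-style illustration: the module name, the width of the 1*1 convolution, and the way predicted labels are taken from the auxiliary logits at inference are assumptions on our part, not a definitive implementation. At training time, the two auxiliary logits would additionally feed the modified loss of Eq. (7).

```python
import torch
import torch.nn as nn

class FeatureActivationSuppression(nn.Module):
    """Sketch of the parallel channel/spatial suppression (Eqs. 2-6)."""

    def __init__(self, channels: int, height: int, width: int, num_classes: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)            # aggregation before spatial GAP
        self.fc_channel = nn.Linear(channels, num_classes, bias=False)         # auxiliary classifier (channel)
        self.fc_spatial = nn.Linear(height * width, num_classes, bias=False)   # auxiliary classifier (spatial)

    def forward(self, z: torch.Tensor, labels: torch.Tensor = None):
        n, c, h, w = z.shape
        f_c = z.mean(dim=(2, 3))                                  # Eq. (2): (N, C)
        f_s = self.conv1x1(z).mean(dim=1).view(n, h * w)          # Eq. (3): (N, H*W)
        logits_c, logits_s = self.fc_channel(f_c), self.fc_spatial(f_s)

        # Pick the class-specific weight vectors: true labels while training,
        # the auxiliary classifiers' predictions at inference (Eqs. 4-5).
        idx_c = labels if labels is not None else logits_c.argmax(dim=1)
        idx_s = labels if labels is not None else logits_s.argmax(dim=1)
        w_c = self.fc_channel.weight[idx_c].view(n, c, 1, 1)      # (N, C, 1, 1)
        w_s = self.fc_spatial.weight[idx_s].view(n, 1, h, w)      # (N, 1, H, W)

        z_sup = 0.5 * (z * w_c + z * w_s)                         # Eq. (6): parallel fusion
        return z_sup, logits_c, logits_s
```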

Some prior works such as MIRNet [77] utilize dual attention (channel and spatial) mechanisms to capture contextual information, but both the motivation and the approach of MIRNet differ from the feature activation suppression module proposed in this paper. The dual attention in MIRNet fuses more discriminative features for image restoration and enhancement, whereas our feature activation suppression module aims to suppress the malicious effects of adversarial examples in both channel and spatial activation values. Besides, MIRNet uses dual attention to reweight the vanilla features separately and fuses the reweighted features by concatenation, whereas our module leverages extra auxiliary classifiers to obtain the weight vectors for the channel and spatial dimensions and fuses the suppressed features by addition. In general, the feature fusion used by MIRNet differs from our feature activation suppression module in terms of both motivation and specific practice.

Feature alignment

The feature refinement module and the feature activation suppression module above are both designed to counteract the malicious features caused by adversarial examples, but the generalization performance on clean images still declines due to adversarial training. We therefore attempt to impose constraints on the feature space so that it is not overly distorted by adversarial training. Our goal is to force the denoised feature map \(Z^{sup}\) to be closer to that extracted from a standard trained model on clean samples. Furthermore, we also try to tighten the above reweighted feature space, which may decrease the distance between intra-class samples and make the hidden features more compact.

Fig. 7

Schematic illustration of the consistency constraint in the feature alignment module. Note that when applying the consistency constraint, the characteristics of samples of the same category become more compact. Best viewed in color

To make the features within each category more compact and discriminative, we first add a consistency constraint to the feature activation suppression module. It imposes an additional restriction on the target model's feature space by modifying the classification loss, inspired by the center constraint proposed for the face recognition task [64]. Specifically, as shown in Figure 7, we add this consistency constraint to the two auxiliary classifiers in the feature activation suppression module. Formally, the modified loss of the auxiliary classifiers in the feature activation suppression module is computed as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{fas}&=\lambda _1\mathcal {L}_{ce_a}+\lambda _2 \mathcal {L}_{center} \\&=-\lambda _1\sum _{i=1}^{m} \log \frac{e^{W_{y_{i}}^{T} \varvec{x}_{i}+b_{y_{i}}}}{\sum _{j=1}^{n} e^{W_{j}^{T} \varvec{x}_{i}+b_{j}}}+\frac{\lambda _2}{2} \sum _{i=1}^{m}\left\| \varvec{x}_{i}-\varvec{c}_{y_{i}}\right\| _{2}^{2}, \end{aligned} \end{aligned}$$
(7)

where \(\mathcal {L}_{ce_a}\) stands for the vanilla cross-entropy loss of the auxiliary classifiers, which is the same as in the baseline model, and \(\mathcal {L}_{center}\) represents the assistant consistency constraint. The variable \({c}_{y_{i}}\) is the class center of a specific class \(y_{i}\). \(\lambda _1\) and \(\lambda _2\) are the adjustment coefficients between the cross-entropy loss and the consistency constraint. The previous work [70] also proposed a class-aware constraint on the feature space. The difference between our feature alignment module and that algorithm is that [70] constrains the features of clean and adversarial examples modified by the deep generative networks at the front end of its pipeline, and the constraint acts in the backbone network of the target models. In contrast, the consistency constraint used in this paper acts only on the additional branches of the feature activation suppression module proposed in the previous section and is not applied to the backbone networks.
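As a sketch, the modified auxiliary loss of Eq. (7) might be computed as below. We average over the mini-batch instead of summing, and how the class centers \(c_{y_i}\) are maintained (e.g., as running averages updated per mini-batch) is left as an implementation choice, so this should be read as an assumption-laden illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def auxiliary_branch_loss(logits: torch.Tensor, features: torch.Tensor,
                          labels: torch.Tensor, centers: torch.Tensor,
                          lambda1: float = 1.0, lambda2: float = 0.1) -> torch.Tensor:
    """Modified auxiliary-classifier loss of Eq. (7): cross-entropy plus a
    center-based consistency term. `centers` is a (num_classes, feat_dim)
    tensor of per-class feature centers."""
    ce = F.cross_entropy(logits, labels)                                   # L_{ce_a}
    center_term = 0.5 * ((features - centers[labels]) ** 2).sum(dim=1).mean()  # L_{center}
    return lambda1 * ce + lambda2 * center_term
```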

Fig. 8

Schematic illustration of the knowledge distillation constraint in the feature alignment module. In this figure, the model in the top row is our RSA model, and we regard it as a student model. The bottom is a teacher model, which has the same architecture as our RSA model and is trained on the clean images. Best viewed in color

Moreover, we apply a knowledge distillation mechanism [76] to transfer knowledge, in the form of attention maps, from an independent teacher model to the adversarially trained student model. The teacher model has the same network architecture as our target student model and is trained only on clean samples. After extracting feature maps from the same layer of the two models, we generate a pair of attention maps for knowledge transfer and then perform the knowledge distillation operation. As shown in Figure 8, for the reweighted feature \(X^{S}\) (from the adversarially trained student model) processed by feature refinement and suppression, and the feature \(X^{T}\) extracted from the same layer of the corresponding teacher model, we impose a knowledge distillation constraint by optimizing the following loss function:

$$\begin{aligned} L_{kd} = Distance\left( Kn\left( X^{T} \right) , Kn\left( X^{S} \right) \right) , \end{aligned}$$
(8)

where \(X^{T}\) and \(X^{S}\) \(\in R^{C\times H\times W}\), and their corresponding knowledge \(Kn\left( X^{T} \right) \) and \(Kn\left( X^{S} \right) \) \(\in R^{1\times H\times W}\), since we deploy a GAP operation along the channel dimension of the hidden feature X to convert it into a 2D tensor. To make training more stable, we impose a Min–Max normalization on the extracted 2D knowledge. The function \(Distance\left( \right) \) measures the distance between two 2D tensors; one can choose the common L1 or L2 distance. By adding this constraint, the knowledge from the teacher model specifically guides the features of the student model. The above optimization target encourages the distorted feature space to be close to the original clean feature distribution. It can reduce the performance drop of adversarially trained models on clean samples and also weaken the influence of adversarial examples to a certain extent.
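A minimal sketch of the knowledge extraction \(Kn(\cdot)\) and the distillation loss of Eq. (8) is given below, assuming channel-wise averaging followed by per-map Min–Max normalization and the L1 distance adopted in our experiments; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_knowledge(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Kn(X): average over the channel dimension, then Min-Max normalize
    each 2D map for stable training. (N, C, H, W) -> (N, H, W)."""
    attn = x.mean(dim=1)                                     # channel-wise GAP
    flat = attn.view(attn.size(0), -1)
    mn = flat.min(dim=1, keepdim=True).values
    mx = flat.max(dim=1, keepdim=True).values
    norm = (flat - mn) / (mx - mn + eps)
    return norm.view_as(attn)

def kd_loss(teacher_feats: torch.Tensor, student_feats: torch.Tensor) -> torch.Tensor:
    """Eq. (8) with the L1 distance as Distance()."""
    return F.l1_loss(attention_knowledge(teacher_feats),
                     attention_knowledge(student_feats))
```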

Therefore, the final loss function of the entire RSA can be expressed as:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{ce} + \alpha \mathcal {L}_{fasc} + \beta \mathcal {L}_{fass} + \gamma \mathcal {L}_{kd}, \end{aligned}$$
(9)

where \(\mathcal {L}_{ce}\) stands for the vanilla classification loss in the backbone network. \(\mathcal {L}_{fasc}\) and \(\mathcal {L}_{fass}\) represent the modified loss of the auxiliary classifiers in channel and spatial dimension, respectively. \(\mathcal {L}_{kd}\) represents the knowledge distillation loss. \(\alpha \), \(\beta \), and \(\gamma \) are the weights of components within the loss function.

Experiments

In this section, we evaluate our proposed RSA algorithm from both quantitative and qualitative perspectives through extensive experiments. We first introduce the five datasets used in the experiments and present the basic hyperparameter settings. Afterward, we analyze the results of both white-box and black-box attack defenses, including comparisons with our baseline model and other state-of-the-art adversarial defense approaches. We finally conduct a series of qualitative experiments and ablation studies to further investigate and demonstrate the effectiveness of our proposed algorithm.

Table 1 White-box experimental results of clean accuracy and robust accuracy on CIFAR-10 under network architecture ResNet-18 and WRN-34-10. - represents the absence of corresponding results in the compared literature. Best results are shown in bold

Experiment settings

Datasets We choose five classical datasets to evaluate our RSA model: CIFAR-10 [33], CIFAR-100 [33], SVHN [48], Tiny ImageNet [10], and ImageNette [25]. CIFAR-10 contains 60,000 three-channel color images of 32*32 pixels, covering 10 different object categories. CIFAR-100 is also composed of 60,000 colored images, with 100 categories. SVHN is an excerpt of house numbers from Google Street View images in a style similar to MNIST [35]. This dataset has 10 different digit classes; the training set has 73,257 images and the test set has 26,032 images. Tiny ImageNet contains 100,000 colored images, all downsized to 64*64 pixels; each class has 500 training images and 50 validation images. ImageNette is a subset of 10 easily classified classes from ImageNet [10], and its images have a higher resolution than those of the other four datasets mentioned above. ImageNette consists of 9469 training images and 3925 validation images.

Experimental setup To evaluate the effectiveness of our algorithm on various network architectures, we choose ResNet-18 [23], WRN-34-10 [75], and PreActResNet-18 [23] as our backbone networks. Note that our RSA model is trained through adversarial training. For fair comparisons, we adopt the same training strategy and hyper-parameters as our baseline model. Therefore, we adversarially train the above three backbone networks for 200 epochs with adversarial examples generated by PGD-10 (\(\varepsilon =8/255\), step size \(=2/255\), and random initialization). We leverage SGD with a learning rate of \(1e-2\), momentum 0.9, and weight decay \(2e-4\) to optimize our model. To be consistent with the baseline model, we insert the proposed feature refinement module and feature activation suppression module into the last residual block of the chosen backbone networks. The classifiers within the extra branches of the feature activation suppression module use the modified loss mentioned above, while the normal cross-entropy loss is used on the backbone network for classification. In the experiments, we set the weight coefficients \(\alpha \), \(\beta \), and \(\gamma \) to 2, 0.01, and 0.001, respectively, and the parameters \(\lambda _1\) and \(\lambda _2\) in the modified loss of the channel dimension to 1 and 0.1, respectively. We choose the L1 distance to measure the distance of hidden features between the teacher and the target student model in the knowledge distillation constraint. In this paper, we evaluate the performance of the proposed model using clean and robust accuracy. Clean accuracy measures the model's recognition accuracy on clean samples, while robust accuracy assesses the model's performance under adversarial threats. This comprehensive evaluation provides a more thorough assessment of the model's robustness. In summary, all of our experimental settings and evaluation metrics align with mainstream practices in the field of adversarial defense, ensuring a precise assessment of the proposed algorithm's effectiveness.
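For reference, the PGD-10 adversarial example generation used during training can be sketched as follows: a generic \(l_{\infty }\) PGD loop with the hyper-parameters stated above, given purely as an illustrative sketch rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8/255, step_size=2/255, steps=10):
    """L-infinity PGD with random initialization (PGD-10 configuration above)."""
    images = images.detach()
    delta = torch.empty_like(images).uniform_(-eps, eps)          # random start
    delta = (images + delta).clamp(0, 1) - images
    for _ in range(steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(images + delta), labels)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + step_size * grad.sign()).clamp(-eps, eps).detach()
        delta = (images + delta).clamp(0, 1) - images             # keep pixels in valid range
    return (images + delta).detach()
```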

White-box attack defense

Table 2 White-box experimental results of clean accuracy and robust accuracy on CIFAR-100 under network architecture ResNet-18 and WRN-34-10. - represents the absence of corresponding results in the compared literature. Best results are shown in bold
Table 3 White-box experimental results of clean accuracy and robust accuracy on SVHN under network architecture ResNet-18 and WRN-34-10. Best results are shown in bold
Table 4 White-box experimental results of clean accuracy and robust accuracy on Tiny ImageNet under network architecture PreActResNet-18. Best results are shown in bold
Table 5 White-box experimental results of clean accuracy and robust accuracy on ImageNette under network architecture ResNet-18. Best results are shown in bold

To verify the effectiveness of our proposed RSA algorithm, we first test it against the four most frequently used white-box attack algorithms: FGSM [18], PGD-20 [44], CW-\(\infty \) [3], and AutoAttack [7, 8], and then compare the defense results with other advanced adversarial defense algorithms. The \(l_{\infty }\)-norm of the perturbations is bounded by \(\epsilon = 8/255\). The settings of the attack methods used in the experiments align with those of the baseline model. Since different defense methods are evaluated on different datasets and network architectures, we select the backbone networks and datasets for testing and comparison according to the compared algorithms, so as to evaluate our proposed model as completely as possible. In general, we compare RSA with five classes of defense algorithms: our baseline models, the improved adversarial training-based methods, the knowledge distillation-based methods, the NAS-based methods, and the pruning-based methods.

Compared with the baseline models: Our baseline model CAS [2] and its variant CIFS [71] only consider the effect of adversarial examples from the perspective of feature channel activation values, while omitting the pernicious effect of polluted features in the spatial domain. At the same time, neither method considers the clean accuracy drop brought by adversarial training. Our RSA outperforms both algorithms in clean and robust accuracy, surpassing the baseline model CAS in clean accuracy and in all four adversarial robustness metrics. In particular, the clean accuracy of RSA on ResNet-18/CIFAR-10 is more than 2\(\%\) higher than that of these two algorithms, while the robust accuracy improvement is more than 10\(\%\) under PGD-20 and more than \(5\%\) under AA. As shown in Table 3, on SVHN/WRN-34-10, RSA also exceeds the baseline model by more than 5\(\%\) in all robust accuracy metrics. In addition, we report the white-box attack results on the ImageNette dataset in Table 5, which we believe confirms the effectiveness of our RSA on higher-resolution images.

Compared with the improved adversarial training-based models: We also compare our proposed RSA with several advanced defense approaches that directly change the paradigm of adversarial training, including GAT [58], NuAT [59], PCL [46], TLA [45], FGSM-GA [1], FGSM-MEP [29], AWP [66], LBGAT [9], FAT [79], SAT [56], and UIAT [12]. As shown in Tables 1, 2, 3 and 4, our proposed RSA achieves better performance than the above algorithms. It is worth mentioning that RSA not only outperforms these AT-improved algorithms in terms of robust accuracy, but also maintains a high level of clean accuracy, suggesting that our RSA architecture can improve the adversarial robustness of target models while avoiding an excessive degradation of their performance on clean images. In particular, our algorithm is better than the recently proposed UIAT in terms of both clean accuracy and adversarial robustness on the CIFAR-10, CIFAR-100, and SVHN datasets. It is noteworthy that the TLA algorithm also leverages metric learning techniques to enhance adversarial robustness. In contrast to our approach, TLA employs a triplet loss on the features of the penultimate layer, while we incorporate consistency constraints into additional branches. Furthermore, our method addresses activation value suppression at both the feature channel and spatial levels, which contributes to its superior performance.

Compared with the knowledge-distillation-based models: As we design a knowledge distillation constraint in the feature alignment module, we also compare the performance of the proposed RSA with several knowledge-distillation-based adversarial defense methods, including ARD [17], RSLAD [87], MTARD [80], and AdaAD [26]. It can be seen from Tables 1 and 3 that our algorithm surpasses RSLAD in almost all clean and robust accuracy metrics, which shows that the first two modules we propose can indeed improve the adversarial robustness of models. Specifically, RSA outperforms RSLAD by about 10\(\%\) under FGSM and PGD-20 on ResNet-18/CIFAR-10, and by about 5\(\%\) under CW-\(\infty \) on ResNet-18/CIFAR-100. Compared with MTARD, our algorithm achieves better clean and robust accuracy; in particular, it outperforms MTARD by about 8\(\%\), 7\(\%\), and 9\(\%\) in robust accuracy under FGSM, PGD-20, and CW-\(\infty \), respectively. Although MTARD utilizes a dynamic training paradigm to balance the clean and adversarial teacher models, the results suggest that the clean teacher dominates overwhelmingly; in contrast, our RSA model balances clean and robust performance better than MTARD. Finally, compared with the current state-of-the-art algorithm AdaAD, our proposed RSA model performs better both in terms of clean accuracy and in terms of adversarial robustness under the FGSM and AutoAttack tests.

Compared with the NAS-based & the pruning-based models: RobNet-large-v2 [21] and DS-NET [14] utilize neural architecture search (NAS) to find the most robust network from artificially designed modules and atomic structures. As shown in Table 1, our RSA surpasses these two NAS-based defense methods in clean and robust accuracy. It is worth noting that although the clean accuracy using WideResNet-34-10 on the CIFAR-10 and SVHN datasets is slightly lower than that of DS-NET, the adversarial robustness of our algorithm is much better, especially under PGD-20, CW, and AA, where it is 10\(\%\), 9\(\%\), and 7\(\%\) higher than DS-NET, respectively. We also compare our RSA with five pruning-based approaches: AdvPrune [73], kWTA [68], HYDRA [54], SAP [11], and MAD [37]. The clean accuracy and robust accuracy of our proposed algorithm exceed those of both kWTA and SAP. Likewise, our RSA algorithm outperforms HYDRA in all accuracy metrics and is substantially ahead of AdvPrune in terms of robust accuracy. Although MAD is slightly higher than our algorithm on some individual metrics, the overall performance of our model is still better than that of the pruning-based methods.

The results under AutoAttack: AutoAttack [7, 8] is an ensemble of two Auto-PGD attacks and two complementary attacks, FAB and Square Attack. It is regarded as a more powerful attack and has become a widely used, fair evaluation of adversarial robustness, so we also report the AutoAttack results in each table. Our RSA outperforms both our baseline model CAS and the other compared approaches on all five datasets. We believe these results not only illustrate the effectiveness of our proposed RSA algorithm, but also demonstrate that the performance improvement does not stem from obfuscated gradients or unfair evaluations.

Black-box attack defense

Table 6 Black-box experimental results of robust accuracy on CIFAR-10 based on ResNet-18. Best results are shown in bold

Although our model is not specifically designed to defend against black-box attacks, we also test its defensive performance against them. For fairness of comparison, we adopt the same experimental settings as the baseline model CAS and select two types of black-box attacks, transfer-based and query-based. The former includes PGD-20 and CW-\(\infty \), while the latter uses the \(\mathcal {N}\)A attack [38]. In the transfer attack, the adversarial examples are generated from a standard-trained ResNet-50 model, which we regard as the proxy model. When testing the \(\mathcal {N}\)A attack, we randomly sample 1000 images from the test sets of the CIFAR-10 and SVHN datasets and limit the maximum number of queries to 20,000. The experimental results are shown in Table 6. It can be seen that the black-box robustness of our proposed model is better than that of the baseline model under all three black-box attacks. Moreover, our proposed method also outperforms the baseline when combined with other advanced adversarial defense methods, such as TRADES [78] and MART [63]. These results empirically show that our proposed RSA algorithm not only improves the defense against white-box attacks but also performs better against diverse black-box attacks compared with our baseline model.

Fig. 9

Comparison of results under two gray-box attack scenarios. The left side shows the robust accuracy under the PGD attack with different numbers of attack steps, while the right side shows the robust accuracy under the FGSM attack with different perturbation radii. Best viewed in color

Gray-box attack defense

In addition to the aforementioned tests evaluating the model's defense capabilities against white-box and black-box attacks, we conduct supplementary assessments to gauge its robustness under gray-box attack scenarios. In gray-box attacks, adversaries possess an intermediate level of information compared to white-box and black-box adversaries. Specifically, in this context, we assume that gray-box attackers have access to the model's specific structure, its final output, and gradient information, but remain unaware of the exact defense method employed. To evaluate the model's performance under gray-box attacks, we consider two scenarios on the ImageNette dataset: PGD attacks with a continuously increasing number of attack steps, and FGSM attacks with a progressively increasing perturbation radius. Figure 9 presents the results, comparing our model with the baseline model under the above two gray-box attack scenarios. It demonstrates that our model exhibits superior adversarial robustness in both scenarios. In summary, our model consistently outperforms the baseline model across white-box, black-box, and gray-box attack scenarios, thereby substantiating the effectiveness of the proposed algorithm against diverse threat models.

Qualitative experiment

Fig. 10

Comparison of the CAS model and our RSA model in terms of the difference in feature activations in the channel dimension. The subplots in the top row are the results of the CAS model for three categories randomly chosen from the CIFAR-10 dataset; below are the corresponding results of our RSA model. The top and bottom subplots of each column belong to the same image category, and the channels represented by the abscissa are in one-to-one correspondence. As in the schematic diagrams of the Method section, red represents that the activation value of clean samples in a channel is higher than that of adversarial examples, while blue is the opposite. Best viewed in color

Fig. 11

Comparison of the CAS model and our RSA model in terms of the difference in feature activations in the spatial dimension. The subplots in the top row are the results of the CAS model for three categories randomly chosen from the CIFAR-10 dataset; below are the corresponding results of our RSA model. The top and bottom subplots of each column belong to the same image category, and the spatial positions on the upper and lower heat maps correspond to each other. As in the schematic diagrams of the Method section, red represents that the activation value of clean samples at a position is higher than that of adversarial examples, while blue is the opposite. Best viewed in color

Fig. 12

Comparison of the baseline model and our RSA model in terms of the MSE distance between clean images and their adversarial examples. On the left are the results on CIFAR-10, and on the right are those on ImageNette. The horizontal axis represents the categories in the dataset. Best viewed in color

Fig. 13

Attention maps of samples of ImageNette obtained from a standard trained ResNet-18 and RSA. Best viewed in color

In addition to the quantitative experiments conducted under white-box and black-box attacks, we also analyze the superiority of the RSA algorithm through qualitative experiments. To investigate the differences in features between our RSA model and the baseline model CAS, we design two types of experiments, in the channel and spatial dimensions.

We compare the activation value differences between clean samples and adversarial examples at the same channel/spatial positions in CAS and RSA, aggregated over samples of the same category. We first investigate the channel dimension. As shown in Figure 10, in the channels with large activation values (the left part of each subplot), our RSA algorithm effectively reduces the activation value differences observed in CAS. Although RSA slightly enlarges the activation values of adversarial examples in a few channels (the noticeable blue parts in the second column), the magnitude of this difference is smaller than in CAS. Therefore, the results suggest that the RSA model can effectively reduce the differences between the activation values of the two types of samples in the channel dimension.

For the spatial dimension, we carry out a similar analysis. It can be seen from Figure 11 that the differences in activation values at corresponding spatial positions are, on the whole, clearly smaller for our RSA algorithm than for the baseline model CAS. Therefore, it can also be qualitatively shown that the RSA algorithm effectively eliminates the differences between the feature activation values of the two types of samples in the spatial dimension.

Furthermore, we conduct a comparative analysis of the feature distances between RSA and the baseline models when applied to clean samples and their corresponding adversarial counterparts. Specifically, we measure the Mean Squared Error (MSE) distance between each clean image and its corresponding adversarial example (generated using PGD-20) at the output features of the penultimate layer, categorized accordingly. Figure 12 presents the findings for both CIFAR-10 and ImageNette datasets. As illustrated in the figure, on the CIFAR-10 dataset, our model significantly reduces the feature distance between adversarial and normal samples compared to the baseline model. Moreover, there is a slight reduction observed on the ImageNette dataset as well. These results underscore the efficacy of our proposed algorithm in mitigating the deleterious impact of adversarial samples within the feature space.

We also provide attention maps of adversarial examples obtained by our RSA model and a standard trained ResNet-18 on ImageNette separately. As shown in Figure 13, the standard trained network cannot correctly locate the region in the images where the classification targets are located due to adversarial noise, while our proposed RSA model can reduce the effects and correctly focus on the features of the classification targets.

Ablation study

Table 7 Analysis of the effectiveness of proposed RSA modules at different blocks of ResNet-18 on CIFAR-10. Best results are shown in bold

To better investigate the RSA algorithm, we conduct multiple sets of ablation and comparison experiments to explore: (1) the impact of the insertion position of the RSA modules in the backbone network on the final performance, and (2) the impact of each of the three modules on the final performance.

Impact of RSA insertion position: First, to explore the impact of the insertion location of the RSA modules on performance, we insert them into different blocks of ResNet-18 for comparison. Note that the three modules designed in this paper are meant to be used in conjunction, so all three modules are inserted together into each block. The results of the comparison experiments are shown in Table 7; the largest adversarial robustness improvement is obtained when the RSA modules are inserted into the deep blocks of the backbone network (e.g., block 4 and block 3+4), whereas insertion into shallow blocks gives poor performance. We believe that the activation values of the deeper features of the backbone network are more relevant to correct category prediction, while the shallow features may adversely affect classification. Also, inserting RSA into block 4 yields a better trade-off between robustness and clean accuracy, so we insert the RSA modules into block 4 in all experiments.

Table 8 Sensitivity analysis of the weight of proposed feature activation module and feature alignment module of ResNet-18 on ImageNette. Best results are shown in bold. Our report results are underlined
Table 9 Analysis of the impacts among different modules of the RSA model on CIFAR-10, CIFAR-100, and SVHN based on ResNet-18 and WRN-34-10. Best results are shown in bold

Sensitivity analysis of RSA loss weights: We conduct a sensitivity analysis on the weights within the RSA loss function. To ensure consistency with the original baseline model, we hold \(\alpha \) fixed at 2 while varying the values of \(\beta \) and \(\gamma \) to assess their impact on the final performance. These experiments are executed using the ResNet-18 architecture on the ImageNette dataset, and the results are presented in Table 8. It is evident from the table that our proposed model is insensitive to changes in the various weight values, indicating robust generalization of the RSA model. The underlined results in the table strike a better balance between clean accuracy and adversarial robustness, representing the second-best outcomes in both aspects; therefore, we choose them as the final reported results.

Impact of sub-components of RSA: In addition, to explore the impact of the various sub-components of our proposed algorithm on adversarial robustness and clean accuracy, we design another ablation experiment. As shown in Table 9, we compare the robust and clean accuracy of the full RSA algorithm (V7 in each sub-table) with the other six variants. We first compare the performance of the three proposed modules when they are used independently. A certain degree of improvement in adversarial robustness can be obtained using only the feature refinement module, but it also has the largest clean-accuracy reduction among V1 to V3. We infer that this is because, when reducing the malicious effects in the channel dimension, the module may also lose some information and affect the feature distribution of clean images. In contrast, using only the feature alignment module achieves the highest accuracy on clean samples but provides limited robustness; the reason is that the feature alignment module constrains the distance between adversarial and clean samples in the feature space, so it mainly increases the performance of the model on clean samples after adversarial training. The overall performance of using the feature activation suppression module alone lies between the above two modules. Therefore, we believe that the three modules have different functions and purposes and are more suitable to be used in combination.

We then compare the complete RSA model with the other variants. The performance of using the three proposed modules in combination is better on both clean and adversarial examples. Although the best performance on clean samples on SVHN/WRN-34-10 is obtained using only the feature alignment module, its robust accuracy is substantially lower than that of the full RSA. Therefore, we believe that our proposed RSA, incorporating the three modules together, can improve the trade-off between the clean accuracy and robust accuracy of target models trained with adversarial training.

Discussion

Our proposed RSA algorithm has demonstrated its effectiveness in countering adversarial attacks, as evidenced by both quantitative and qualitative analyses. Through feature-level defense, which considers both channel and spatial dimensions, along with an additional alignment constraint, RSA not only enhances model adversarial robustness but also mitigates significant drops in clean accuracy. These outcomes underscore the practical promise of our method in security-critical applications.

While our RSA outperforms the compared state-of-the-art algorithms, it is not without limitations. The effectiveness of our feature alignment module relies heavily on the availability of powerful teacher models, which may constrain its applicability in cases where such models are not readily accessible. Moreover, the exploration of adversarial defense methods and principles for countering adversarial examples at the feature level remains an open issue, demanding further investigation in upcoming studies. Additionally, further in-depth research is needed to address potential challenges posed by unseen attacks in the future, marking an important direction for enhancing our model’s robustness against stronger adversarial threats.

Conclusion

In this paper, we propose a novel adversarial example defense algorithm RSA, which first leverages the feature refinement module to restore and refine the overall activation magnitude in the feature channels, and then utilizes the feature activation suppression module to reweight the high-order features in both channel and spatial domains. The feature space is finally aligned by a knowledge distillation operation and an extra consistency constraint on the two auxiliary branches. Extensive experiments and comparisons with other state-of-the-art defense algorithms on five public datasets and three widely used backbone networks demonstrate the superiority of our proposed RSA algorithm. Through experimental analysis, we argue that feature-level protection plays an important role in defending against adversarial examples. In the future, we will conduct further research into the patterns and characteristics of adversarial examples at the sample feature level. We will also integrate these findings with the latest feature consistency and restoration methods to explore more effective strategies for enhancing model robustness. Simultaneously, we are committed to investigating strategies for augmenting the adversarial robustness of foundational visual models and large multi-modal models, aiming to ensure their safety and reliability.