1 Introduction

Deep neural networks (DNNs) can be easily fooled into making wrong predictions by seemingly negligible perturbations of their input data, called adversarial examples. The authors of [1] first demonstrated the existence of adversarial examples for neural networks in the image domain. Since then, adversarial examples have been identified in various other domains, such as speech recognition [2, 3] and natural language processing [4, 5]. The prevalence of adversarial examples has severe security implications for real-world applications, making the development of robust machine learning models essential. As a result, the robustness of neural networks has become a central research topic in deep learning in recent years [6].

Several defense strategies have been proposed to make DNNs more robust and reliable [7,8,9,10,11,12,13]. However, most of them have later been shown to be ineffective against stronger attacks [14, 15], and overall progress has been slow [16]. As robustness improvements are mostly in the single-digit percentage range, a reliable evaluation of new defense strategies is critical to identify methods that actually improve robustness. An inaccurate evaluation of new defenses can lead to the adoption of ineffective defense strategies, which in turn may hinder progress in robustness research. Moreover, an accurate quantification of network robustness is necessary to assess the risk of deploying machine learning models in the real world.

To improve the robustness quantification of neural networks, the community has established helpful guidelines [17,18,19]. Nevertheless, the reported worst-case robustness of DNNs is still repeatedly reduced by ever stronger attacks, and a precise evaluation remains a challenging problem [16].

In this work, we explore the classification decisions of 19 recently published DNNs to identify weak points of current adversarial attacks (Fig. 1). We restrict our analysis to DNNs that have been trained with a variety of different methods to be robust against adversarial attacks. Our analysis can be summarized by four main findings:

  1.

    We show that 24.2% of the CIFAR10 [20] test data cannot be correctly classified by any of the 21 evaluated models in the presence of adversarial attacks. This finding emphasizes the vast gap between the robust and clean performance of current neural networks, beyond prior analyses of individual models.

  2.

    Further, we observe that untargeted adversarial attacks cause misclassifications into only a limited number of classes of the dataset.

  3.

    Additionally, we find that samples for which the loss of the model does not change along the direction of the initial gradient are more difficult to attack. This limits current gradient-based attacks, which exploit these gradient directions without sufficiently exploring the loss landscape.

  4.

    Furthermore, we observe that the robustness of a model is difficult to assess accurately if it exhibits unusually high over- or under-confidence.

Fig. 1

We investigate failure cases of adversarial attacks for 19 different neural networks trained to be robust against adversarial attacks on the CIFAR10 dataset [20]. Based on the results of this observational study, we propose the Jitter loss function to improve the effectiveness of adversarial attacks. We benchmark Jitter against 5 state-of-the-art (SOTA) adversarial attacks on 30 different neural networks. In all cases, Jitter achieves the highest success rate and efficiency of all attacks

We leverage these observations to design a new loss function that improves the success rate of adversarial attacks compared to current state-of-the-art loss functions. More specifically, we encourage target diversity in untargeted attacks by injecting noise into the model output. Additionally, we introduce scale invariance into the loss function by normalizing the output logits to a fixed value range. Thereby, we circumvent the gradient obfuscation problem caused by models with low-confidence predictions or irregularly large output logits [16, 21]. Moreover, we propose a simple yet effective mechanism that minimizes the magnitude of the perturbations, as shown in Fig. 2, without compromising the success rate of an attack. This leads to the definition of an objective function for adversarial attacks, which we refer to as Jitter. We empirically evaluate our loss function on an extensive benchmark consisting of 30 different models proposed in the literature. We show that Jitter-based attacks consistently outperform prior attacks on all 30 analyzed models by up to 11.8 percentage points. Additionally, Jitter-based attacks generate perturbations with a 4.2 times smaller norm on average. Lastly, we analyze the effect of Jitter on the classification decisions to explain its effectiveness. PyTorch-like pseudo code to compute the Jitter loss and a Jitter-based PGD attack is given in the Appendix.

Fig. 2

Difference of adversarial perturbations created by Cross-Entropy (CE)-based attacks and Jitter-based attacks. Original images are shown in the first row, CE-based perturbations in the second row, and Jitter-based perturbations in the last row. In contrast to CE-based attacks, Jitter-based attacks mainly attack the most salient regions of the image. Thus, the \(\ell_2\) norm of Jitter-based perturbations is considerably smaller, while Jitter-based attacks are still more effective than prior attacks

2 Notation

Let \(f_{\theta}: [0, 1]^{d} \rightarrow \mathbb{R}^{C}\) be a DNN classifier parameterized by \(\theta \in \Theta\) with \(f_{\theta}: x \mapsto z\). Here, x is a d-dimensional input image, z is the respective output vector (logits) of the DNN, and C denotes the number of classes. The ground truth class label of a given image is described by \(y \in \{1, \dots, C\}\), while the predicted class label \(\hat{y} \in \{1, \dots, C\}\) is given by \(\operatorname{argmax}(z)\). The confidence values for every class are given by \(\operatorname{softmax}(z)\).

Adversarial examples \(x_{adv} = x + \gamma\) aim to change the input data of DNNs such that the classification decision of the network is altered, but the class label remains the same for human perception. Additionally, \(x_{adv}\) is constrained to the data domain, i.e., \(x_{adv} \in [0,1]^{d}\). A common way to enforce semantic similarity to the original sample is to restrict the magnitude \(\epsilon\) of the adversarial perturbation \(\gamma\) by an \(\ell_p\)-norm bound, such that \(\|\gamma\|_{p} \leq \epsilon\). We refer to the set of valid adversarial examples that fulfill these constraints as S. As prior work mainly focuses on \(p = \infty\) and thus most models are available for this threat model, we focus on \(p = \infty\) in this work as well. Furthermore, we restrict our analysis to untargeted white-box adversarial attacks, as done in prior work [16, 22].
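To make the threat model concrete, the following PyTorch sketch projects a candidate adversarial example back into the set of valid adversarial examples S for the \(\ell_{\infty}\) case (a minimal illustration; the function name and interface are ours, not taken from the paper).

```python
import torch

def project_to_valid_set(x_adv: torch.Tensor, x: torch.Tensor, eps: float) -> torch.Tensor:
    """Project a candidate adversarial example back into the valid set S:
    the l_inf ball of radius eps around the clean image x, intersected with
    the image domain [0, 1]^d."""
    # Clip the perturbation gamma = x_adv - x to the l_inf budget eps.
    gamma = torch.clamp(x_adv - x, min=-eps, max=eps)
    # Keep the perturbed image inside the valid data range [0, 1].
    return torch.clamp(x + gamma, min=0.0, max=1.0)
```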

3 Related work

One of the most widely used adversarial attacks, Projected Gradient Descent (PGD), was proposed by [10]. PGD is an iterative gradient-based attack in which multiple small gradient updates are used to find the adversarial perturbation

$$ x_{adv}^{t + 1} = \Pi_{S} \left( x_{adv}^{t} + \alpha \cdot \operatorname{sign}\left( \nabla_{x} \mathcal{L}(f_{\theta}(x_{adv}^{t}), y) \right) \right) $$
(1)

where \(0 < \alpha \leq \epsilon\) and \(x_{adv}^{t}\) describes the adversarial example at iteration t. The loss of the attack is given by \(\mathcal{L}(f_{\theta}(x_{adv}^{t}), y)\). \(\Pi_{S}(x)\) is a projection operator that keeps \(x_{adv}^{t+1}\) within the set of valid adversarial examples S, and sign is the component-wise signum operator. The starting point of the attack \(x_{adv}^{0}\) is randomly chosen within the \(\epsilon\)-ball around the original input. Several variants of iterative gradient-based attacks have been proposed that are more effective than vanilla PGD [19, 22,23,24]. Recently, [16] proposed the Auto-PGD (APGD) attack. In contrast to previous PGD versions, APGD requires considerably less hyperparameter tuning and was shown to be more effective than other PGD-based attacks against a variety of models [16]. Nevertheless, one important component of all gradient-based attacks is their optimization objective. The most commonly used objective is the Cross-Entropy (CE) loss. Carlini and Wagner [21] observe that CE-based attacks fail against models with large logits. They propose the Carlini & Wagner (CW) loss function \(-z_{y} + \max_{i \neq y} z_{i}\), which does not make use of the softmax function and thereby reduces the scaling problem. Nevertheless, [16] observe that the scale dependence of the CW loss can still lead to failed attacks against models with exceptionally large logits. They address this issue with the scale- and shift-invariant Difference of Logit Ratio (DLR) loss and show its effectiveness on an extensive benchmark. Recently, Pintor et al. [25] proposed the Fast Minimum Norm (FMN) attack, which is robust to hyperparameter choices, creates low-norm adversarial perturbations, and is computationally less complex than previous attack approaches. Another method to improve the robustness evaluation of machine learning models is to combine multiple conceptually different attacks into an attack ensemble [16, 26]. We explore and discuss the limitations of current adversarial attacks in the following section.
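To make the update rule in (1) concrete, the sketch below implements a plain untargeted \(\ell_{\infty}\) PGD attack with the CE loss in PyTorch; it is a minimal reading of the attack described above, and the step size, iteration count, and function signature are illustrative choices rather than the settings used in the cited works.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=100):
    """Minimal untargeted l_inf PGD attack following (1), using the CE loss."""
    # Random starting point inside the eps-ball around x, as described above.
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-eps, eps), 0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()  # signed gradient ascent step
            # Projection Pi_S: l_inf ball around x intersected with [0, 1]^d.
            x_adv = torch.clamp(torch.clamp(x_adv - x, -eps, eps) + x, 0.0, 1.0)
    return x_adv.detach()
```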

4 Robust misclassifications

While several benchmarks of adversarial robustness exist in the literature, they generally do not investigate the classification decisions of multiple models simultaneously [16]. To investigate common limitations among models in the literature and obtain a more holistic overview of model robustness, we explore the classification decisions of 21 of the 30 different models in the presence of adversarial attacks. We restrict our analysis to the 21 models trained on the CIFAR10 dataset [20], as only a limited number of pre-trained models are available for the other datasets. The labels “airplane” and “automobile” have been renamed to “plane” and “car”, respectively. We choose the recently proposed Auto-PGD (APGD) with the Difference of Logit Ratio (DLR) loss as the attack to perturb the images, as it is one of the most efficient and reliable gradient-based attacks [16]. These choices and the specific hyperparameters are described in more detail in Section 6. A summary of the most important symbols and abbreviations used in this work is given in Table 1.

Table 1 Summary of the most important abbreviations and symbols used in the methods and experiments sections of this work

4.1 Distribution of misclassifications

Recent studies mainly focus on common evaluation metrics to assess the robustness of DNNs. This includes the worst-case robustness of a classifier [16] and the magnitude of the perturbation norm necessary to fool the classifier for individual inputs [21]. Here, we provide insights into the classification decisions and numerical properties of a large and diverse set of models from the literature. We focus on models that are trained to be robust to adversarial attacks. Furthermore, all models are attacked individually to find the respective worst-case robustness.

Figure 3a shows how the 19 most robust models misclassify inputs attacked by APGD. We left out the models by [14] and [27] from this analysis, as they show negligible robustness against strong adversarial attacks. Out of the 10,000 test samples of the CIFAR10 dataset, 3298 are correctly classified by all 19 models, while 2423 samples are consistently misclassified by all models. These two groups correspond to the leftmost (green, dashed) and rightmost (red) bars of the histogram. Hence, a considerable fraction of the test set cannot be robustly classified by any of the 19 models. Further, this highlights the vast accuracy gap between adversarial and clean data, beyond prior analyses of the trade-off between robustness and accuracy on individual models. Inspired by prior work [28], we will refer to images in the first group, which are never misclassified, as robust images and to images in the second group, which are always misclassified, as non-robust images. The gray bars in between show the remaining 4279 samples that are misclassified by at least one model, but not by all models. Figure 3b summarizes the class distribution of robust and non-robust images. There is a considerable difference in frequency between the two groups for most classes. Images from the classes “plane”, “car”, “horse”, “ship”, and “truck” are often classified correctly, while “bird”, “cat”, “deer”, and “dog” are mostly misclassified.

Fig. 3

Analysis of the misclassification decisions of 19 out of the 21 analyzed CIFAR10 models. The two models by [14] and [27] showed no considerable robustness against strong attacks and were therefore excluded from this analysis. Subfigure (a) shows by how many models each attacked input is misclassified. Robust images that are never misclassified are shown in the leftmost column (green, dashed), and non-robust images that are always misclassified are shown in the rightmost column (red). Subfigure (b) displays the difference between the class distributions of robust and non-robust images. Both statistics are calculated on the test set of CIFAR10

We additionally explored the average of the confusion matrices of all models for adversarially perturbed images. Note that the CIFAR10 dataset is balanced and contains an equal number of samples for all classes. Figure 4 shows the confusion matrix of only the misclassifications. The confusion matrix contains only a few large values, which is in line with the previous observation that some classes are easier to perturb than others. Furthermore, the matrix is largely symmetric: classes are mainly confused in pairs. This includes semantically meaningful pairs such as “cat” and “dog” or “car” and “truck”, but it also includes other pairs that generally share similar image backgrounds, such as “plane” and “ship”, “deer” and “frog”, and “deer” and “bird”. Examples of non-robust and robust images are shown in Fig. 5. The non-robust images contain outliers and mislabeled images. These include:

  • Subfigure (a): A seaplane that is classified as a ship.

  • Subfigure (b): A ship in the air that is classified as a plane.

  • Subfigure (c): A golf cart and an ambulance that are labeled as a car but classified as a truck.

  • Subfigure (d): A vintage car that is labeled as a truck but classified as a car.

Fig. 4

Averaged confusion matrices of all models for adversarially perturbed inputs on the CIFAR10 dataset (only misclassified samples are shown). We also observe substantial sparsity in the summed confusion matrices of both the CIFAR100 and the ImageNet models, where only a small fraction of all entries is larger than 0

Fig. 5

(a-d) Non-robust images that are correctly classified by all CIFAR10 models under normal conditions but misclassified by all models under attack (DLR-based APGD). Additionally, only examples of images that are misclassified as the same target class by all models (e.g., plane images that are misclassified as ships by all models) are shown. (e-h) Robust images that are correctly classified by all CIFAR10 models even under attack

We additionally observe considerable sparsity of the misclassification confusion matrices of the CIFAR100 [20] and ImageNet [29] datasets. For all CIFAR100 and ImageNet models, only 17% and 1.4% of the entries in the confusion matrix are higher than 0 (note that for an attack that induces optimal target class diversity, 100% and 7.2% of the entries in the confusion matrix would be higher than 0, respectively). Concurrent work by [30] made a similar observation on the ImageNet dataset. They find that untargeted adversarial attacks mostly cause misclassifications into semantically similar classes.

4.2 Attacks against robust and non-robust images

We further analyzed the behavior of the DLR-based adversarial attack for robust and non-robust images. This is exemplified in Fig. 6 for the model proposed by [8]. We explored how the CW loss [21] (y-axis) changes during the attack optimization (x-axis in Fig. 6a) and how the CW loss changes along the direction of the final adversarial perturbation starting from a clean image (x-axis in Fig. 6b). We choose to display the CW loss and not the DLR loss, as the CW loss can be directly related to the classification decision of a classifier (inputs with \(\mathcal{L}_{CW} > 0\) are misclassified). Note that we still use the DLR loss in the attack optimization and only use the CW loss for display purposes. We additionally calculate the CW loss on the softmax output of the network such that its values are scaled between −1 and 1. The subfigures show the mean loss value over the sets of robust and non-robust images as a solid line. The individual loss values for 10 randomly drawn samples from each set are shown as dashed lines. For non-robust images, the CW loss increases rapidly during the first attack iterations, and most of these images are successfully attacked within the first attack iteration. Moreover, for non-robust images, the CW loss increases steadily along the direction of the final adversarial perturbation on average, which indicates that the initial gradient direction is a good approximation of the final attack direction. In contrast, for robust images, following the gradient directions in the vicinity of the original image is not effective, and the CW loss changes only marginally during the attack.
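For reference, the CW loss evaluated on the softmax output, as used for the plots in this subsection, could be computed as in the following sketch (our implementation; positive values indicate misclassification and the values are bounded between −1 and 1).

```python
import torch

def cw_loss_on_softmax(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """CW loss of [21] evaluated on the softmax output: positive values
    indicate misclassification; values lie in (-1, 1)."""
    s = torch.softmax(z, dim=1)
    s_y = s.gather(1, y.unsqueeze(1)).squeeze(1)        # score of the true class
    s_other = s.clone()
    s_other.scatter_(1, y.unsqueeze(1), float('-inf'))  # mask out the true class
    return s_other.amax(dim=1) - s_y                    # max_{i != y} s_i - s_y
```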

Fig. 6

Analysis of the CW loss [21] (y-axis) for robust and non-robust images during (a) DLR-based adversarial attack optimization and (b) along the direction of the final adversarial attack perturbation γ found by the DLR-based attack. In (b), x describes a clean image and x + γ the adversarial example. In both subfigures, the average loss over the whole sets of robust and non-robust images is shown by a solid line. Additionally, the loss for 10 individual examples from each of the sets is shown by dashed lines

4.3 Logit and confidence distribution

Next, we analyze the distribution of the output logits z and the confidence of all CIFAR10 models and relate these properties to the difficulty of the robustness evaluation. Prior work observed that simply scaling the output of a DNN will lead to vanishing gradients when the softmax function is used in the last layer of the network [16]. This phenomenon occurs due to finite arithmetic and thus limited precision, where the CE loss is quantized to 0 and the model effectively obfuscates the gradient from the attack. The CE loss is given by

$$ \operatorname{CE}(z, y) = -\log\left(\operatorname{softmax}(z)_{y}\right), \quad \text{where } \operatorname{softmax}(z)_{y} = \frac{e^{z_{y}}}{\sum_{j=1}^{C} e^{z_{j}}}. $$
(2)
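The quantization effect can be reproduced in a few lines of PyTorch (an illustrative example of ours, not taken from the paper): scaling up a logit vector drives the float32 CE loss of (2) to exactly zero, and the gradient with respect to the logits vanishes.

```python
import torch
import torch.nn.functional as F

# With moderate logits the CE loss of (2) is informative, but after scaling the
# same logits up, float32 log-softmax saturates, the loss is quantized to exactly
# 0, and the gradient w.r.t. the logits vanishes (gradient obfuscation).
for scale in (1.0, 100.0):
    z = torch.tensor([[5.0, 1.0, -2.0]]) * scale
    z.requires_grad_(True)
    loss = F.cross_entropy(z, torch.tensor([0]))
    loss.backward()
    print(f"scale={scale}: loss={loss.item():.6f}, "
          f"max |dloss/dz|={z.grad.abs().max().item():.2e}")
# scale=1.0   -> small positive loss and a non-zero gradient
# scale=100.0 -> loss 0.000000 and a gradient of exactly 0
```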

Figure 7 summarizes the logit and confidence distributions of the analyzed CIFAR10 models. The model proposed by [27] shows exceptionally large logits that exceed the available floating-point precision, thus obfuscating its gradients. Furthermore, the models by [27] and [8] exhibit a considerably higher average confidence (0.948) than all other models (0.666, excluding the models with exceptionally low confidence [14, 15]). In contrast, the models by [14] and [15] reveal a different phenomenon, where the logits are close to zero and show a substantially lower standard deviation than those of the other models. Consequently, the logits are generally mapped to a limited value range by the softmax function, where all values are similar. Thus, the loss may change only slightly between attack iterations, which decreases the attack performance. This is also reflected in the low confidence of the two models, where the most confident prediction has a probability of < 0.51, while it is ≈ 1 for all other models.

Fig. 7

(a) Box plots of the logit distribution of clean images from the CIFAR10 test set for all models analyzed in this paper. (b) Box plots of the confidence distribution of all models. Only the highest softmax output of every prediction is considered. The models highlighted by gray shading and text show considerably lower robustness against strong attacks than suggested by standard PGD [8, 14, 15, 27]. Models that use additional data during training are marked with a *

Note that prior work already observed that models with exceptionally large logits are difficult to attack [16, 21]. However, we observe that the robustness of all 4 models identified in the above analysis is difficult to assess in practice, including models with above or below average confidence but without exceptionally large logits. For these 4 models, the difference between a standard robustness evaluation with CE-based PGD and stronger attacks is larger than 7% and considerably less accurate than for the other 17 models [31]. This highlights that the distribution of the output logits z can be a possible failure case for an accurate robustness evaluation, even when the logits are not exceptionally large. We explain how we combat this problem in comparison to prior work in Section 5.1.

5 Enhancing adversarial attacks

In the previous section, we explored the misclassifications of robust DNNs under adversarial attacks. The experiments showed a general consistency between the different models. Specifically, we discovered that in the untargeted setting, common attacks focus on only a limited number of target classes. At the same time, current attacks often fail to find adversarial examples if the initial gradient direction is not a good approximation of the final attack direction, and in these cases they cannot change the classification loss even slightly. Additionally, we observed that the scale and distribution of the output logits are linked to the success rate of adversarial attacks. Based on these observations, we now design a novel loss function that makes adversarial attacks more effective. We first describe the two main components of this loss function. Subsequently, we elaborate on how we can minimize the norm of the final adversarial perturbation without compromising the attack’s success rate. This is important as adversarial attacks should not change the label for human perception, which is linked to the perturbation magnitude.

5.1 Scale invariance

Previous work has already demonstrated that high output logits can lead to gradient obfuscation and weaken adversarial attacks [16, 21]. We additionally observe that a small value range of the logits can also lead to attack failure. We propose to scale the output logits by the following rule:

$$ {\hat{z} = \alpha \cdot \frac{z}{\| z \|_{\infty}}} $$
(3)

where α is an easy-to-tune scalar value that controls the lowest and highest possible output values of the softmax function. After rescaling, the logits lie within a fixed value range \(\hat{z} \in [-\alpha, \alpha]^{C}\), which solves the aforementioned problems. We additionally define the scaled softmax output as \(\hat{s} = \operatorname{softmax}(\hat{z})\), where softmax is the element-wise softmax operator. While other loss functions are already designed to be scale invariant [16] or to handle large output logits [21], they are not suited to be combined with the loss function modification that we propose in the next section.
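A possible PyTorch realization of the rescaling in (3) together with the scaled softmax output \(\hat{s}\) is sketched below; the function name and the small numerical safeguard against all-zero logits are our additions.

```python
import torch

def scaled_softmax(z: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    """Rescale logits as in (3) and return the scaled softmax output s_hat.
    z has shape (batch, C); alpha = 10 is the value used in the experiments."""
    # Normalize every logit vector by its l_inf norm and rescale by alpha.
    # The clamp only guards against an all-zero logit vector (our addition).
    z_hat = alpha * z / z.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    return torch.softmax(z_hat, dim=1)
```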

5.2 Attack target diversity and attack exploration

Figure 4 demonstrates that untargeted adversarial attacks mainly induce misclassifications into a limited number of classes. We argue that this behavior limits the effectiveness of adversarial attacks. This is further supported by prior work showing that performing targeted attacks against every possible class is usually more effective than applying a single untargeted attack [16, 32]. However, these so-called multi-targeted attacks are computationally expensive and do not scale to datasets with a large number of output classes. Besides, Fig. 6 shows that current adversarial attacks have difficulties changing the loss for robust images. We argue that this stems from a poor trade-off between attack exploitation and attack exploration: gradient-based attacks exploit the local gradient information to find adversarial examples with no incentive to explore.

To address the above problems, we propose to perturb the scaled softmax output of a model after each forward pass with Gaussian noise \({\hat {s}}_{\text {Noise}} = {\hat {s}} + \mathcal {N}(0, \sigma )\) to prompt adversarial attacks to further explore the input space. Here, the noise magnitude is controlled by the hyperparameter σ.

Still, the CE loss depends only on the output of the ground truth class, so adding noise to the other output values has no impact. Other loss functions such as DLR or CW operate on non-normalized logits, which makes it difficult to find a suitable σ, as the logit range changes between inputs. Instead, we replace the CE loss with the Euclidean distance between the one-hot encoded ground truth vector Y of the class label and the scaled output of the model. The proposed rescaling makes it easier to tune the σ hyperparameter for individual models in our experiments. We additionally observe the scaled Euclidean distance loss to be more effective than the DLR or CW loss even without noise injection (see Table 2). More details are given in Section 6. The loss function is given by the following equation:

$$ \mathcal{L}_{2} = {\|}{\hat{s}} - Y{\|}_{2}. $$
(4)
Table 2 Ablation results for the individual Jitter components for the model proposed by [14]

Combining the Euclidean distance loss with the scaling described in (3) and the noise injection described above, the loss function is given by the following equation:

$$ \mathcal{L}_{Noise} = {\|}{\hat{s}}_{\text{Noise}} - Y{\|}_{2}. $$
(5)

Injecting gradient noise to improve the convergence of optimization algorithms is well-motivated by previous work [33,34,35,36]. Neelakantan et al. [36] found that adding noise to the weight updates of a neural network during training not only improves the generalization ability of the model but also leads to a lower training loss. They attribute this to the additional exploration of the parameter space induced by the noise. Furthermore, non-gradient-based algorithms such as simulated annealing [33] or genetic algorithms [34] utilize randomness to escape local optima in non-convex optimization landscapes. However, to the best of our knowledge, enhancing gradient-based adversarial attacks by adding noise during the optimization has not been investigated in existing work.

5.3 Minimizing the norm of the perturbation

Finally, we aim to encourage the attack to find small perturbations. As long as no successful perturbation is found, we apply the loss function presented in (5). As soon as the adversarial attack is able to change the predicted label of the model, we additionally aim to minimize the norm of the adversarial perturbation. Furthermore, we only override the current perturbation if the norm-minimized perturbation also leads to a successful attack. This procedure can never decrease the success rate of the attack and effectively minimizes the norm of the adversarial perturbation in our experiments. In addition, the norm (or other distance measures) can be freely chosen according to the respective problem (e.g., \(\ell_1\), \(\ell_2\), \(\ell_{\infty}\)) as long as it is differentiable. The final loss function can be defined as

$$ \mathcal{L}_{Jitter} = \left\{\begin{array}{ll} \frac{\|\hat{s}_{\text{Noise}} - Y\|_{2}}{\|\gamma\|_{p}} & \text{if } \hat{y} \neq y \\ \|\hat{s}_{\text{Noise}} - Y\|_{2} & \text{if } \hat{y} = y \end{array}\right. $$
(6)
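Putting the components together, a minimal PyTorch sketch of (6) could look as follows. The authors provide PyTorch-like pseudo code in their Appendix, so the interface and the batch reduction below reflect our reading rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def jitter_loss(z, y, gamma, alpha=10.0, sigma=0.1, p=float('inf')):
    """Sketch of the Jitter loss in (6).
    z: logits (batch, C); y: labels (batch,); gamma: current perturbation."""
    # Scale invariance, (3): rescale the logits by their l_inf norm.
    z_hat = alpha * z / z.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    s_hat = torch.softmax(z_hat, dim=1)
    # Noise injection (Section 5.2): perturb the scaled softmax output.
    s_noise = s_hat + sigma * torch.randn_like(s_hat)
    # Euclidean distance to the one-hot ground-truth vector, (5).
    Y = F.one_hot(y, num_classes=z.shape[1]).float()
    loss = (s_noise - Y).norm(p=2, dim=1)
    # Norm minimization (Section 5.3): divide by ||gamma||_p only for samples
    # that are already misclassified, so the success rate can never decrease.
    misclassified = z.argmax(dim=1) != y
    gamma_norm = gamma.flatten(1).norm(p=p, dim=1).clamp_min(1e-12)
    loss = torch.where(misclassified, loss / gamma_norm, loss)
    # Reduction to a scalar for gradient-based attacks (our choice).
    return loss.mean()
```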

The effect of the different components is exemplified in Table 2 for the model proposed in [14]. Every component decreases the accuracy of the model and therefore increases the success rate of the attack. The norm minimization does not affect the attack performance and is therefore excluded from the table. We additionally analyzed the influence of noise injection for the DLR and CW loss functions. However, since the logits used by the DLR and CW loss functions are not normalized, we additionally scaled σ by the largest logit of every sample in the batch. Note that noise injection into the output does not improve the performance when using the CE loss in our experiments. This may be attributed to the fact that the CE loss depends only on the output of the ground truth class, so adding noise to the other output values has little impact.

6 Experiments

We conducted a series of experiments to evaluate the effectiveness of the proposed Jitter loss function. Furthermore, we inspect the perturbations generated with the Jitter loss function to explain its effectiveness compared to other state-of-the-art loss functions. All experiments were conducted on a single NVIDIA V100 GPU.

6.1 Data and models

All experiments were performed on the CIFAR10, CIFAR100 [20], and ImageNet datasets [29]. We chose CIFAR10 for our initial analysis as many pre-trained models exist for this dataset. CIFAR100 and ImageNet were used to evaluate if the findings on CIFAR10 and the proposed Jitter loss generalize to more complicated classification tasks. We gathered 30 models from the literature for the attack evaluation. All models were either taken from the RobustBench library [31] or from the GitHub repositories of the authors directly [14, 27, 39, 49]. We excluded some models from RobustBench as we had difficulties getting them to run correctly. We only considered models which are trained to be robust against \(\ell _{\infty }\)-norm attacks. The resulting benchmark contains a diverse set of models which are trained with different methods.

6.2 Threat model

We compare the performance of different loss functions for the Auto-PGD (APGD) attack [16], which is one of the state-of-the-art iterative gradient-based attacks. Moreover, APGD has no hyperparameters such as the step size and thus enables a less biased comparison between different loss functions. We compare Jitter to three different loss functions and two popular gradient-based adversarial attacks. This includes the Cross-Entropy (CE) loss, which is the standard loss function for training supervised DNNs and the most widely used loss function for gradient-based adversarial attacks. We also consider the Carlini & Wagner (CW) loss proposed by [21], which shows considerably better results than CE when the model exhibits high output logits. Additionally, we include the Difference of Logit Ratio (DLR) loss proposed in [16], which was shown to achieve more stable results than the CE and CW losses. Lastly, we compare Jitter to the recently proposed B&B and Fast Minimum Norm (FMN) attacks, which have been shown to be effective against several different defenses and robust to hyperparameter choices [22, 25]. All attacks are untargeted \(\ell_{\infty}\)-norm attacks and use 100 attack iterations. We use a perturbation budget of 𝜖 = 8/255 for the CIFAR10 and CIFAR100 models and a perturbation budget of 𝜖 = 4/255 for the ImageNet models.

6.3 Jitter hyperparameter

Compared to CE and DLR, Jitter introduces two additional hyperparameters. The first hyperparameter α rescales the softmax input and directly controls the possible minimum and maximum value of the output logits and the average magnitude of the gradient. Note that values for α close to or greater than ≈ 88 will result in an overflow of 32-bit float values in the softmax function (the largest 32-bit float is \(3.402823 \cdot 10^{38} \approx e^{88.7}\)) and thereby lead to numerical issues. Thus, we can focus on 0 < α ≪ 88. In a preliminary experiment, we explored different values for α between 2 and 20, observed a stable performance for all values, and therefore chose α = 10 for all remaining experiments. The second hyperparameter σ controls the amount of noise added to the rescaled softmax output \(\hat{s}\). We tuned σ for every model individually on a batch of 100 samples by testing values σ ∈ {0, 0.05, 0.1, 0.15, 0.2}. Note that tuning σ on a small batch for each model introduces only a negligible overhead (≈ 1% additional runtime). Additionally, we analyzed the sensitivity of the attack performance with respect to σ for all models. Values between 0.05 and 0.2 resulted in similar success rates, while values above 0.25 decreased the attack performance compared to no noise injection on average.
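The per-model selection of σ described above could be implemented along the following lines (our sketch; the attack itself is passed in as a generic callable that stands in for a Jitter-based APGD run on the probe batch).

```python
import torch

def tune_sigma(model, attack_fn, x_probe, y_probe,
               candidates=(0.0, 0.05, 0.1, 0.15, 0.2)):
    """Pick sigma on a small probe batch: lower robust accuracy = stronger attack.
    attack_fn(model, x, y, sigma) should return adversarial examples, e.g. a
    Jitter-based APGD attack (not defined here)."""
    best_sigma, best_acc = candidates[0], float('inf')
    for sigma in candidates:
        x_adv = attack_fn(model, x_probe, y_probe, sigma=sigma)
        with torch.no_grad():
            acc = (model(x_adv).argmax(dim=1) == y_probe).float().mean().item()
        if acc < best_acc:
            best_sigma, best_acc = sigma, acc
    return best_sigma
```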

7 Results and discussion

In this section, we summarize and discuss the results of the experiments.

7.1 Attack performance

Table 3 compares the performance of the different loss functions on the CIFAR10, CIFAR100, and ImageNet datasets. The best attack for every model is highlighted in bold, and the second-best attack is underlined in each row. The minimum and maximum difference between Jitter and the other attacks is shown in the two rightmost columns. The proposed Jitter loss achieves superior performance compared to the other attacks for 29 out of the 30 models. For the model proposed in [27], the B&B and FMN attacks achieve a marginally higher success rate within the 100 attack iterations. Nevertheless, every iteration of the B&B and FMN attacks is considerably slower than an iteration of Jitter-based attacks in our experiments (we use the implementations provided in the GitHub repositories of the authors). Jitter still reaches a 100% success rate faster than the B&B and FMN attacks, which makes it the most efficient attack in all experiments. Furthermore, Jitter achieves the same success rate on average 49% faster than CE-based attacks, 37% faster than CW-based attacks, 35% faster than DLR-based attacks, 162% faster than B&B attacks, and 46% faster than FMN attacks. Moreover, the Jitter loss is the only loss function that is consistently better than the other loss functions. In contrast, the other five attacks differ in performance between the individual experiments, as shown in Table 3. Combining the DLR and CW losses with noise injection led to equal or higher success rates in all cases, but both losses remain less effective than Jitter on average. A more extensive overview is given in Appendix C. To evaluate the performance of Jitter with a higher computational budget, we compared DLR and Jitter using 1000 model evaluations (5 restarts and 200 iterations) for all CIFAR10 models. While the success rate of Jitter increased by up to 6.51%, the high-budget version of DLR performed worse than 100-iteration Jitter in all cases. The results are summarized in Appendix D.

Table 3 Accuracy [%] of the evaluated models when attacked with \(\ell _{\infty }\) norm APGD attacks using different loss functions

7.2 Induced misclassifications

We designed Jitter to increase the target diversity of untargeted adversarial attacks and compare the target diversity of the different loss functions for all 30 models from all explored datasets. The average increase in target diversity of Jitter compared to the other loss functions is: CE: 36%, CW: 52%, DLR: 155%, and \({\mathscr{L}}_{2}\) (given in (4)): 57%. Moreover, noise injection into the output of the CW and DLR losses increases the attack target diversity by 49% and 56%, respectively. This shows that the increase in target diversity stems from the noise injection into the output and not from the \({\mathscr{L}}_{2}\) loss. Nevertheless, the combination of noise injection with the \({\mathscr{L}}_{2}\) loss was more effective in our experiments than the combination of noise injection with CW and DLR (more details are given in Appendix C). Figure 8 exemplifies the increased attack target diversity of \({\mathscr{L}}_{Jitter}\)-based attacks for the model proposed in [8]. Green squares denote that an attack changed the classification decision to the respective class at least once. \({\mathscr{L}}_{Jitter}\)-based attacks reach a considerably higher number of different target classes than the other two attacks. Further, the \({\mathscr{L}}_{DLR}\)-based attack was not able to successfully attack the classes “car” and “truck”, which reduces its success rate compared to Jitter.

Fig. 8

Illustration of the attack target diversity for \({\mathscr{L}}_{Jitter}\)-based, \({\mathscr{L}}_{2}\)-based, and \({\mathscr{L}}_{DLR}\)-based attacks. Subfigure (a), (b), and (c) show binarized confusion matrices for the different attacks, where more green squares indicate a higher target diversity

Analogous to the analysis presented in Fig. 6, we explored the behavior of Jitter-based adversarial attacks for robust and non-robust images for the same model [8]. The subfigures of Fig. 9 show the CW loss [21] on the y-axis during the attack optimization (Fig. 9a) and along the direction of an adversarial perturbation (Fig. 9b). As in Fig. 6, the individual loss values for 10 randomly drawn samples are shown as dashed lines for the sets of robust and non-robust images, while the mean over the whole sets is shown as a solid line. In comparison to DLR-based attacks (Fig. 6), Jitter-based attacks (Fig. 9) exhibit a considerably larger fluctuation of the loss during the attack optimization for both robust and non-robust images. In contrast to DLR-based attacks, Jitter-based attacks also show considerable loss changes for robust images and thereby achieve higher success rates. Besides, while DLR-based attacks generally find adversarial directions that directly increase the CW loss, Jitter-based attacks mainly find adversarial directions that do not directly increase the CW loss, as can be seen from the nearly constant mean near the clean input x in Fig. 9b. Moreover, the mean CW loss of Jitter-based attacks exceeds the misclassification threshold (\({\mathscr{L}}_{CW} = 0\)) noticeably later than that of DLR-based attacks, even for non-robust images (Jitter: 0.87, DLR: 0.48). DLR-based attacks by design follow the direction of the steepest ascent. In contrast, Jitter-based attacks have a better trade-off between attack exploration and attack exploitation due to the injected noise. This enables Jitter-based attacks to find perturbation directions that are sub-optimal in the beginning but lead to a misclassification at the final adversarial perturbation.

Fig. 9

Analysis of the CW loss [21] (y-axis) during (a) \({\mathscr{L}}_{Jitter}\)-based adversarial attack optimization and (b) along the direction of the final adversarial attack perturbation γ found by the Jitter-based attack. In (b), x describes a clean image and x + γ the adversarial example. In both subfigures, the average loss over the whole sets of robust and non-robust images is shown by a solid line. Additionally, the loss for 10 individual examples from each of the sets is shown by dashed lines

7.3 Attack norm and structure

In a final experiment, we examined the average perturbation norm of the different attack configurations for all 30 models. We choose to minimize the \(\ell_2\) norm with Jitter, as differences in the \(\ell_2\) norm are easier to interpret than for the \(\ell_{\infty}\) norm (e.g., the attack focusing on specific regions). The average \(\ell_2\) perturbation norms over all samples for the different attacks are: CE: 0.52, CW: 0.54, DLR: 0.55, B&B: 1.29, FMN: 1.09, and Jitter: 0.19. Jitter achieves considerably lower average perturbation norms than the other attacks. In contrast to Jitter, the B&B and FMN attacks are not able to minimize the perturbation norm considerably within 100 attack iterations. Note that the average \(\ell_{\infty}\) norms of both B&B and FMN are also considerably larger than that of Jitter in our experiments (B&B: 0.024, FMN: 0.022, Jitter: 0.009). An overview is given in Fig. 10.

Fig. 10

Box plots showing the distribution of the \(\ell_2\) perturbation magnitude over all models for the different loss functions

We also inspect the structure of the perturbations. Figure 2 displays the perturbations of CE- and Jitter-based attacks for several images. To plot the perturbations, we calculate the absolute sum over every color channel and show the magnitude as a color gradient, where no change is denoted by black. CE-based attacks generally attack every pixel in an image. In comparison, Jitter-based attacks mainly focus on the salient regions of an image. We enforce this through the regularization of the perturbation γ within our loss function and thereby enable Jitter-based attacks to create successful and low-norm adversarial perturbations. In an ablation study, we found that combining other loss functions with the norm-minimization mechanism described in Section 5.3 also successfully reduces the perturbation norm of those attacks. For this experiment, we evaluated CE-based, CW-based, and \({\mathscr{L}}_{2}\)-based attacks. However, in our experiments, the perturbation norm of Jitter-based attacks was always the lowest on average. This is expected, as we only minimize the perturbation norm of an attack once the attack is successful, which prevents the norm minimization from reducing the effectiveness of the attack. As Jitter-based attacks find successful adversarial perturbations faster than other attacks in our experiments (see Section 7.1), more iterations are dedicated to minimizing the perturbation norm for Jitter-based attacks than for the other attacks.

7.4 Discussion

In an extensive benchmark study, we compared the proposed Jitter loss function to other adversarial attacks from the literature. Our experiments showed that Jitter achieves higher attack success rates and efficiency compared to prior methods. Moreover, Jitter-based attacks exhibit lower perturbation norms. The proposed Jitter loss has two key components that lead to increased attack success rate and efficiency.

  1)

    Scale invariance. Previous work observed that exceptionally large output logits in neural networks reduce the efficiency of gradient-based attacks [21]. We further found that gradient-based attacks are also inefficient when models show exceptionally low or high confidence values. Both issues can be solved by normalizing and rescaling the output logits of neural networks to a specific range. In addition, rescaling the output logits of neural networks makes it simpler to integrate noise injection into the Jitter loss function. For a more detailed explanation, refer to Section 5.2. Noise injection is the second key component that improves the success rate of Jitter-based attacks.

  2)

    Noise injection. Our experiments revealed that existing untargeted adversarial attacks induce misclassifications into only a limited number of target classes. We show that the target class diversity can be increased by injecting noise into the scaled softmax output of the model. We observe that this simultaneously increases the success rate of the attack, which indicates a connection between the target class diversity of an untargeted attack and its effectiveness. Moreover, noise injection also increased the success rate of previous adversarial attack methods in our experiments and is thus not limited to the proposed Jitter loss function. The effect of injecting noise into the output of a neural network was studied from a theoretical perspective in prior work, which could yield another explanation for the effectiveness of noise injection apart from the empirical observation of increased target class diversity. Zhu et al. [53] demonstrate from a theoretical perspective that using gradient Langevin dynamics (GLD) instead of regular gradient descent can help to escape local minima during optimization. The ability to escape sharp and poor local minima could also improve the effectiveness of adversarial attacks. However, in GLD, Gaussian noise is added directly to the gradient and not to the output before performing the gradient calculation. Further investigations are necessary to clarify the connection between Jitter and GLD.

8 Conclusion and outlook

In this paper, we analyze the classification decisions of a diverse set of models that are trained to be robust against adversarial attacks. This analysis gives an indication of the limits of the robustness of current models on the CIFAR10 dataset. We utilize insights from our analysis to create a novel loss function, named Jitter, that increases the efficiency and success rate of adversarial attacks. Specifically, we enforce scale invariance of the loss function and encourage attack exploration and a diverse set of target classes by adding Gaussian noise to the scaled softmax output. In addition to the analysis on CIFAR10, we show that the proposed attack generalizes to two other benchmark datasets, CIFAR100 and ImageNet. Our experiments demonstrate that analyzing failure cases of adversarial attacks over multiple models at the same time is an effective way to design stronger adversarial attacks. The proposed method shows superior attack efficiency compared to five other popular attacks from the literature on all 30 analyzed models across all three datasets. In all cases, Jitter achieved a higher success rate in a shorter amount of time. Moreover, the average perturbation norm of Jitter-based attacks is considerably lower than that of prior methods, which is achieved without compromising the success rate of the attack. Future work will explore whether using Jitter for adversarial training can further improve the robustness of models against strong attacks. Theoretical analysis was beyond the scope of this paper but will be explored in future work. This includes connections between the proposed Jitter loss function and gradient Langevin dynamics.