1 Introduction

In recent years, deep learning models have begun to exceed human-level performance. In 2015, a deep learning model called ResNet [23] surpassed human performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and its record was later broken by more advanced architectures. Goodfellow et al. [18] proposed a system that outperforms human operators in reading address information from Google Street View imagery and in solving CAPTCHAs. In the game playing domain, an AI system named AlphaGo defeated the world Go champion in 2016 [11]. Today, many advanced systems built upon deep learning models offer a very high degree of success in different domains. As a result of this success, DNN models are being used in various fields, ranging from medical diagnosis and autonomous vehicles to game playing and machine translation. However, during the rise of DNN models, researchers focused mainly on building more accurate models, and little attention was paid to the reliability and robustness of these models. Deep learning models indeed require a more elaborate evaluation, since they do not provide uncertainty estimates and are prone to over-confident or under-confident predictions [5, 21]. Besides, they have intrinsic vulnerabilities that let intruders easily exploit them [50].

Numerous researchers have been striving to comprehend and quantify the uncertainty in DNNs in order to make more reliable decisions. As a result, many research fields, such as robotics and medical diagnosis, have begun to rely on uncertainty-based reasoning rather than on a most likely point estimate. For instance, in scenarios like lesion identification [43] and glioma segmentation [13], researchers have employed predictive uncertainty estimates when evaluating medical images. Recently, Ghoshal et al. [17] employed uncertainty quantification for the Coronavirus (COVID-19) detection task in X-ray images. Using uncertainty-based reasoning, the authors significantly improved the diagnostic performance of the human-machine alliance on a COVID-19 imaging dataset.

Similar to the efforts for improving the reliability of model predictions, we have witnessed numerous studies aiming to improve the robustness of DNN models. The history of these efforts goes back to the end of 2013, when researchers discovered that existing deep neural networks are vulnerable to attacks. Szegedy et al. [53] first noticed the presence of adversarial examples in the context of image classification. The authors showed that it is possible to perturb an image by a small amount and change how it is classified: minimal and nearly imperceptible perturbations of the data samples are sufficient to fool the most advanced classifiers into incorrect classification. As in steganography, where researchers aim to efficiently hide secret data within digital content [35,36,37], the perturbation applied to an adversarial sample should be quasi-imperceptible.

Adversarial machine learning attacks perturb input instances in a direction that maximizes the chance of wrong decision making, resulting in false predictions. These attacks degrade the model’s prediction performance because the algorithm can no longer correctly classify the perturbed inputs. Thus, attacks exploiting the vulnerability of DNNs can seriously undermine the security of these machine learning (ML) based systems, sometimes with devastating consequences. In medical applications, a perturbation attack can lead to an incorrect diagnosis of a disease, causing severe harm to the patient’s health and damaging the healthcare economy [14]. Another example is autonomous navigation: automobiles that use ML to drive through traffic without human intervention while avoiding accidents. A wrong decision caused by an adversarial attack could lead to a fatal accident [42, 52]. Hence, defending against adversarial attempts and increasing the robustness of ML models without compromising their accuracy is of crucial importance. Given that these ML models will be used in critical tasks, we should focus our efforts not only on the performance of the models, but also on the security of these systems.

Adversarial ML is a burgeoning area of research, and scientists around the world are striving to develop new attack algorithms, which in turn will help to develop more robust architectures resistant to malicious attack threats. In this study, we focus on adversarial attack strategies based on epistemic uncertainty maximization instead of traditional loss maximization. By looking at the problem from a different perspective and proposing alternative ways of crafting adversarial samples, we hope to contribute to future studies that aim to develop more effective and uncertainty-aware defense approaches. The common approach of adversarial machine learning attacks is to create craftily-designed inputs based on model loss maximization. Unlike previous research in the literature, we follow a slightly modified strategy and craft adversarial samples by exploiting the vulnerability of the model through its quantified epistemic uncertainty. We show that perturbing the input image in a direction that maximizes the model’s uncertainty amplifies model loss and results in wrong predictions. The new approach combines the strengths of both directions to produce more destructive attacks via joint uncertainty and loss maximization. We have released our source code on GitHub.

To sum up, our main contributions in this paper are:

  • We utilize a new metric (epistemic uncertainty of the model) which can be exploited to craft adversarial examples.

  • We show that the performance of pure uncertainty-based attacks is indeed as high as that of attacks based on the model loss.

  • We demonstrate that crafting adversarial examples using both the model loss and uncertainty yields better performance in adversarial attacks.

  • We empirically show that the maximum value of quantified uncertainty for a sample can be located beyond the decision boundary of the model, where the loss is at its maximum value.

This study is organized as follows. Section 2 introduces some of the known attack types in the literature. In Section 3, we will introduce the concept of uncertainty together with the main types and discuss how we can quantify epistemic uncertainty. Section 4 will give the details of our approach. We will present our experimental results in Section 5 and conclude our work in Section 6.

2 Related work

Since the discovery of DNN’s vulnerability to adversarial attacks [53], a vast amount of research has been conducted in both devising new adversarial attacks and defending against these attacks with more robust DNN models [9, 26, 47, 48].

Deep learning models contain many vulnerabilities and weaknesses, making them difficult to defend against adversarial attacks. For instance, they are often sensitive to small changes in the input data, resulting in unexpected results in the final output of the model. Figure 1 shows how an adversary would exploit such a vulnerability and manipulate the model through a carefully crafted perturbation of the input data. The malicious input is produced by slightly perturbing the original image so that a "West Highland White Terrier" (dog) is misclassified as "Paper Towel" with very high confidence.

Fig. 1 An adversarial example

Traditionally, attack strategies are mainly based on perturbing the input instance to maximize the model’s loss. A great number of adversarial attack algorithms have been proposed in recent years. In this section, we briefly describe some of the well-known adversarial attack algorithms.

2.1 Fast-gradient sign method

This method, also known as FGSM [19], is one of the earliest and most popular adversarial attacks to date. FGSM uses the gradient of the model’s loss function with respect to the input to determine in which direction the pixel values of the source image should be altered, and then changes all pixels simultaneously by one step in the direction that maximizes the loss. For a model with a classification loss function L(𝜃, x, y), where 𝜃 represents the model parameters, x is the input to the model (a sample input image in our case), and y_true is the label of the input, we can generate adversarial samples using the formula below:

$$ \mathbf{x}^{*} = \mathbf{x} + \epsilon sign\left( \nabla_{x} L(\theta,\mathbf{x},y_{true}) \right) $$
(1)

One last key point about FGSM is that it is designed to be fast rather than optimal, i.e. it is not intended to find the minimum required adversarial perturbation. Besides, the success ratio of this method is relatively low for small 𝜖 values compared to other attack types.
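As a concrete illustration, a minimal PyTorch sketch of (1) might look as follows; the `model`, `loss_fn` and the assumed [0, 1] pixel range are illustrative choices and are not taken from the paper's released code.

```python
import torch

def fgsm_attack(model, loss_fn, x, y_true, epsilon):
    """Craft x* = x + epsilon * sign(grad_x L(theta, x, y_true)), cf. (1)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y_true)
    loss.backward()
    # One step in the direction of the sign of the loss gradient.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    # Keep pixel values in a valid range (assuming inputs are scaled to [0, 1]).
    return x_adv.clamp(0.0, 1.0).detach()
```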

2.2 Iterative gradient sign method

Kurakin et al. [30] proposed a small but effective improvement to FGSM. In their approach, rather than taking a single step of size 𝜖 in the direction of the gradient sign, they take several smaller steps of size α and use the given 𝜖 value to clip the result. This attack type is referred to as the Basic Iterative Method (BIM), and it is simply FGSM applied to an input image iteratively. Generating perturbed images under the \(\ell_{\infty}\) norm for the BIM attack is given by (2).

$$ \begin{aligned} \mathbf{x}_{0}^{*} & = \mathbf{x} \\ \mathbf{x}_{N+1}^{*} & = Clip_{\mathbf{x}, \epsilon} \left\{ \mathbf{x}_{N}^{*} + \alpha \, sign \left( \nabla_{\mathbf{x}} L(\mathbf{x}_{N}^{*}, y_{true}) \right) \right\} \end{aligned} $$
(2)

where x is the input sample, x*_N is the adversarial sample produced at the N-th iteration, L is the loss function of the model, y_true is the actual label of the input sample, 𝜖 is a tunable parameter limiting the maximum level of perturbation under the \(\ell_{\infty}\) norm, and α is the step size.

The success ratio of the BIM attack is higher than that of FGSM [31]. This is mainly because the attacker can decide how far past the decision boundary an adversarial sample will be pushed by simply adjusting the 𝜖 parameter.

BIM attacks can be grouped under two main types, namely BIM-A and BIM-B. In the former, we stop the iterations as soon as we succeed in fooling the model (passing the decision boundary), while in the latter, we continue the attack until the provided number of iterations is exhausted, pushing the input further beyond the decision boundary.
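For completeness, a hedged PyTorch sketch of the iterative variant, including the early-stopping behaviour that separates BIM-A from BIM-B, could look like the following; the single-sample batch and the [0, 1] pixel range are our assumptions.

```python
import torch

def bim_attack(model, loss_fn, x, y_true, epsilon, alpha, n_iter, stop_at_boundary=False):
    """L-infinity BIM. stop_at_boundary=True ~ BIM-A, False ~ BIM-B."""
    x_orig = x.clone().detach()
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y_true)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            # Clip to the epsilon-ball around the original image and to valid pixels.
            x_adv = torch.min(torch.max(x_adv, x_orig - epsilon), x_orig + epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
        if stop_at_boundary:
            with torch.no_grad():
                # Assuming a single-sample batch here.
                if model(x_adv).argmax(dim=1).item() != y_true.item():
                    break  # BIM-A: stop once the decision boundary is crossed
    return x_adv.detach()
```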

2.3 Projected gradient descent

Projected Gradient Descent (PGD) was proposed by Madry et al. [40]. PGD perturbs a clean image x for a number of iterations with a small step size in the direction of the gradient of the loss function. Unlike BIM, after each perturbation step it projects the resulting adversarial sample back onto the 𝜖-ball of the input sample instead of clipping. Moreover, instead of starting from the original point (𝜖 = 0 in all dimensions), PGD uses a random start, which can be described as:

$$ \mathbf{x}_{0} = \mathbf{x} + U\left( -\epsilon, +\epsilon \right) $$
(3)

where \(U\left (-\epsilon , +\epsilon \right )\) is the uniform distribution on the interval (− 𝜖, +𝜖).
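A minimal PyTorch sketch of PGD with the random start of (3) and the projection step might look as follows; the names and the [0, 1] range are assumptions rather than the authors' implementation.

```python
import torch

def pgd_attack(model, loss_fn, x, y_true, epsilon, alpha, n_iter):
    """L-infinity PGD: random start, signed gradient steps, projection onto the eps-ball."""
    x_orig = x.clone().detach()
    # Random start: x_0 = x + U(-eps, +eps), cf. (3).
    x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y_true)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            # Project back onto the epsilon-ball around x rather than clipping the step.
            x_adv = x_orig + torch.clamp(x_adv - x_orig, -epsilon, epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```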

2.4 Other popular attack strategies

Since the main focus of this study is to propose an alternative and effective strategy to traditional loss-based attacks, we only mention some of the other well-known attack types without discussing them in detail. One of the most popular attack algorithms to date was proposed by Carlini and Wagner [8]. Their strategy redefines attacks as optimization problems that can be solved with gradient descent to craft more powerful and effective adversarial samples. As another example, Moosavi-Dezfooli et al. [41] proposed an attack named DeepFool, based on the assumption that neural network models behave as linear classifiers whose classes are separated by hyperplanes. Their algorithm starts from the initial input point x_0 and, at each iteration t, calculates the closest hyperplane and the minimum perturbation amount, which is the orthogonal projection onto that hyperplane. The algorithm then computes x_{t+1} by adding this minimal perturbation to x_t and checks whether misclassification has occurred.

Adversarial machine learning is a trending research field, and new attack algorithms are being proposed at a rapid pace. Some of the latest studies are Bandit [27], Square Attack [3], JSMA [46] and HSJA [10]. Moreover, there are some recent studies that utilize MC Dropout sampling and uncertainty information to craft adversarial samples. Liu et al. [38] introduced a Universal Adversarial Perturbation (UAP) approach that utilizes a metric named virtual uncertainty, extracted from the model’s structural activation. Nonetheless, estimating the model’s uncertainty in this way involves aggregating the virtual uncertainties of all neurons, which is computationally costly and unattractive. In this study, unlike traditional loss-based attacks, we focus on developing an efficient and effective approach by utilizing epistemic uncertainty estimates of the model derived from the final softmax score outputs.

3 Preliminaries

Traditionally, predictive models have been forced to provide a decision even in ambiguous cases where the model is not sure about its prediction and the quality of its prediction is expected to be low. Assuming that a model’s prediction is always correct, without any reasoning based on its uncertainty, may lead to catastrophic outcomes. This fact has led researchers to propose various ways of quantifying uncertainty and to suggest models that abstain under certain conditions, such as when the model’s uncertainty is high, thus improving reliability [33, 54].

In this section, we will first introduce the two types of uncertainty in machine learning. Then, we will present how we can quantify Epistemic Uncertainty in the context of deep learning.

3.1 Uncertainty in machine learning

There are two different types of uncertainty in machine learning: epistemic uncertainty and aleatoric uncertainty [2, 24, 56].

3.1.1 Epistemic uncertainty

Epistemic uncertainty refers to uncertainty caused by a lack of knowledge and by the limited data needed for a perfect predictor [4]. It can be categorized into two groups, approximation uncertainty and model uncertainty, as depicted in Fig. 2.

Fig. 2 Epistemic uncertainty

Approximation uncertainty

In a standard machine learning task, the learner is given data points from an independent, identically distributed dataset and tries to induce a hypothesis \(\hat {h}\) from the hypothesis space \({\mathscr{H}}\) by choosing an appropriate learning method with its related hyperparameters, minimizing the expected loss (risk) under a selected loss function \(\ell\). However, what the learner actually minimizes is the empirical risk \(R_{emp}\), which is only an estimate of the real risk \(R(h)\). The induced \(\hat {h}\) is an approximation of \(h^{*}\), the optimal hypothesis within \({\mathscr{H}}\) and the true risk minimizer. This gap gives rise to the approximation uncertainty: the quality of the induced hypothesis is not perfect, and the learned model will always be prone to errors.

Model uncertainty

Suppose the chosen hypothesis space \({\mathscr{H}}\) does not include the perfect predictor. In that case, the learner has no chance of finding a hypothesis function that can successfully map all possible inputs to outputs. This leads to a discrepancy between the ground truth f and the best possible function \(h^{*}\) within \({\mathscr{H}}\), called model uncertainty.

However, the Universal Approximation Theorem states that for any target function f, there exists a neural network that can approximate it [12, 57]. The hypothesis space \({\mathscr{H}}\) is huge for deep neural networks, so it is reasonable to assume that \(h^{*} = f\). We can therefore ignore the model uncertainty of deep neural networks and care only about the approximation uncertainty. Consequently, in deep learning tasks, the actual source of epistemic uncertainty is the approximation uncertainty.

Epistemic uncertainty reflects the confidence a model has in its prediction [39]. The underlying cause is uncertainty about the parameters of the model. This type of uncertainty is apparent in areas with limited training data, where the model weights are not optimized well. High epistemic uncertainty is observed when the model is asked to predict a sample drawn from a shifted version of the training data distribution or an out-of-domain sample [16].

3.1.2 Aleatoric uncertainty

Aleatoric uncertainty refers to the variability in the outcome of an experiment that is due to inherent random effects [22]. This type of uncertainty cannot be reduced even if we have enough training samples [49]. An excellent example of this phenomenon is the noise observed in the measurements of a sensor.

Figure 3 depicts the plot of a simple nonlinear function (\(\sin \limits ({0.3 \times x})\) where x ∈ [0,12]). In the region on the right where data points are densely populated (9 < x < 12), the samples are noisy, leading to high aleatoric uncertainty. As an example, these points may represent the measurements of a faulty sensor; one can conclude that the sensor produces errors around x = 10.5 for some inherent reason. We can also conclude that the middle region of the figure corresponds to an area of high epistemic uncertainty, because there are not enough training samples there for our model to describe the data well. Moreover, we can assume that the area of high epistemic uncertainty corresponds to the area with low prediction accuracy.

Fig. 3 Illustration of epistemic and aleatoric uncertainty

3.2 Quantifying epistemic uncertainty in deep neural networks

Techniques that quantify a model’s uncertainty are necessary for robust decision making. Given that deep learning models will be used in areas where safety and reliability are critical concerns, as in autonomous driving and medical applications, researchers need to pay the utmost attention to prediction uncertainty. This will help to increase the quality of model predictions.

In recent years, a significant amount of research has been conducted on quantifying uncertainty in deep learning models. Most of this work was based on Bayesian Neural Networks, which learn the posterior distribution over weights to quantify predictive uncertainty [44]. However, Bayesian NNs come with additional computational cost and inference issues. Therefore, several approximations to Bayesian methods have been developed which make use of variational inference [7, 20, 25, 45]. On the other hand, Lakshminarayanan et al. [32] used a deep ensemble approach as an alternative to Bayesian NNs to quantify predictive uncertainty, but this requires training several NNs, which may not be feasible in practice. A more efficient and elegant approach was proposed by Gal et al. [15]. The authors showed that a neural network with inference-time dropout is equivalent to a Bayesian approximation of the Gaussian process, and that the uncertainty of the prediction hypothesis can be approximated by averaging probabilistic feed-forward Monte Carlo (MC) dropout samples at prediction time.

Inference-time dropout acts as an ensemble approach: for each member of the ensemble, different neurons in the network layers are dropped according to the dropout ratio at prediction time. The predictive mean is the average of the predictions over the T dropout iterations and is used as the final inference \(\hat {y}\) for the input instance \(\hat {\mathbf {x}}\) in the dataset. The overall prediction uncertainty is approximated by the variance of the probabilistic feed-forward Monte Carlo (MC) dropout samples at prediction time. The final prediction is defined as follows:

$$ p(\hat{y}=c|\hat{\mathbf{x}},\mathcal{D}) \approx \hat{\mu}_{pred} = \frac{1}{T} \sum\limits_{t=1}^{T} p(\hat{y}=c|\hat{\mathbf{x}},\hat{\theta}_{t}) $$
(4)

where \(\hat{\theta}_{t}\) denotes the model weights obtained by applying dropout at the t-th MC iteration, \(\mathcal {D}\) is the input dataset, T is the number of MC dropout predictions, and \(\hat{\mathbf{x}}\) is the input sample. The label of the input sample is estimated from the mean of the T Monte Carlo dropout predictions \(p(\hat {y}|\hat{\theta}_{t},\mathcal {D})\).

Figure 4 shows the general overview of the Monte Carlo dropout based classification algorithm. At prediction time, random neurons in each layer are dropped out (with probability p) from the base neural network model to create a new model. As a result, T different classification models can be used to predict the class label of the input instance and to quantify the uncertainty of the overall prediction. For each test input sample x, the predicted label is the one with the highest predictive mean, and the variance of \(p(\hat {y})\) is used as a measure of the model’s epistemic uncertainty.

Fig. 4 Illustration of the Monte Carlo dropout based Bayesian prediction

We chose the MC dropout method due to its simplicity and efficiency. The approach needs only a single trained model to measure uncertainty, whereas techniques such as Deep Ensemble need multiple models. Moreover, one can take the derivative of the computed variance term with respect to each input sample and use it to craft adversarial samples that evade the model.
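To make the procedure concrete, the following is a minimal PyTorch sketch of MC dropout prediction and uncertainty quantification; the helper name and the use of `model.train()` to keep dropout active (which also toggles other train-time layers such as batch normalization) are our assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, T=50):
    """Run T stochastic forward passes with dropout active and summarize them."""
    model.train()  # keeps dropout layers sampling at prediction time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])  # (T, B, C)
    model.eval()
    mean_probs = probs.mean(dim=0)                 # predictive mean -> final prediction
    var_probs = probs.var(dim=0, unbiased=False)   # per-class variance over the T passes
    uncertainty = var_probs.mean(dim=1)            # mean variance over classes (epistemic)
    return mean_probs, uncertainty
```

The predicted label is then the argmax of `mean_probs`, and `uncertainty` is the scalar epistemic uncertainty assigned to each input.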

4 Approach

The uncertainty of the model is higher in areas with a limited number of training points. Due to this ignorance about the ground truth, we cannot obtain a perfect model that accurately predicts every possible test input. Figure 5 shows the prediction of a regression model trained on a limited number of data points constrained to some interval. In this simple example, we trained a single-hidden-layer NN with ten neurons to learn the linear function y = −x + 1 (a minimal sketch of this experiment is given after Fig. 5). As can be seen from the graph, in the areas where we do not have enough training points, the uncertainty values obtained from the MC dropout estimates of our model are high, which can be interpreted as low prediction quality: our model has difficulty deciding the correct output values. Consistently, we also observe high error in these areas. For this reason, we conclude that the high epistemic uncertainty area coincides with the low prediction accuracy area. Accordingly, we claim that pushing the model’s limits by testing it in extreme conditions, with input outside the training data distribution (input from a shifted domain), may cause model prediction failure.

Fig. 5 Uncertainty estimates obtained from a regression model
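A small sketch reproducing the spirit of this regression experiment is shown below; the training interval, dropout ratio and number of epochs are illustrative assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

# Toy data: y = -x + 1 observed only on a limited interval (illustrative choice).
x_train = torch.linspace(-1.0, 1.0, 50).unsqueeze(1)
y_train = -x_train + 1.0

model = nn.Sequential(nn.Linear(1, 10), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(10, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(model(x_train), y_train).backward()
    opt.step()

# MC-dropout predictions on a wider interval than the training data.
x_test = torch.linspace(-4.0, 4.0, 200).unsqueeze(1)
model.train()  # keep dropout active at prediction time
with torch.no_grad():
    preds = torch.stack([model(x_test) for _ in range(50)])
mean, std = preds.mean(dim=0), preds.std(dim=0)  # std grows far from the training interval
```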

Adversarial attacks aim to find the perturbation δ, constrained to some interval 𝜖, that results in maximum loss and thus fools the classifier. We can express this mathematically as in (5), where F𝜃(x) is our neural network.

$$ \underset{\|\delta\| \leq \epsilon}{arg max} \ell(F_{\theta}(x+\delta), y) $$
(5)

Like most attacks in the literature, an attacker can perturb the input image in a direction that maximizes the loss, with the direction obtained from the gradient of the loss function. We show, however, that instead of the loss function, another effective choice is the model’s epistemic uncertainty. Our alternative method uses the model’s epistemic uncertainty as the tool for creating successfully manipulated adversarial input instances. In contrast to loss-based adversarial machine learning attacks, this method provides an alternative strategy in which the attacker mounts an effective attack by exploiting the model’s difficulty in interpreting shifted-domain samples with respect to the data observed during training.

To verify that our intuition holds, we performed a simple experiment and depicted the loss surfaces of a trained CNN model (a digit classifier) within a constrained epsilon neighbourhood of the original input data points, as shown in Fig. 6. Figure 6b shows the model’s loss values along the direction of the loss gradient and a random direction; the maximum loss value observed is 3.783. Then, as shown in Fig. 6c, we depicted the model loss surface along the direction of the gradient of the model’s epistemic uncertainty and the same random direction used previously; this time, the maximum loss value is 3.713, close to the previous one. Out of 784 pixel directions, the signs of the loss and uncertainty gradients agreed in 693 and differed in 91, so the model loss can be maximized by perturbing the input image in a slightly different direction than the usual one. Lastly, we depicted the model loss surface along both the loss and uncertainty gradient directions, as in Fig. 6d, and reached a loss value of 4.167, which is larger than in the previous two attempts. In Fig. 6b, c and d, the points where the color of the loss surface changes indicate that the model prediction has flipped from the correct class (7) to a wrong class (2). Therefore, we conclude that perturbing the image in both directions leads to misclassification.

Fig. 6 Loss surfaces in different directions. The maximum loss values are 3.783, 3.713 and 4.167 for b, c and d, respectively
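Loss surfaces like those in Fig. 6 can be produced by evaluating the loss on a grid spanned by two chosen directions; a hedged sketch is given below, where the direction vectors, radius and grid resolution are our assumptions.

```python
import torch

def loss_surface(model, loss_fn, x, y_true, dir1, dir2, radius, steps=25):
    """Evaluate the loss on the grid x + a*dir1 + b*dir2 for a, b in [-radius, radius]."""
    coords = torch.linspace(-radius, radius, steps)
    surface = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(coords):
            for j, b in enumerate(coords):
                x_pert = (x + a * dir1 + b * dir2).clamp(0.0, 1.0)
                surface[i, j] = loss_fn(model(x_pert), y_true)
    return surface

# Example directions: the sign of the loss gradient (as in Fig. 6b) and a random sign vector.
# x.requires_grad_(True)
# grad = torch.autograd.grad(loss_fn(model(x), y_true), x)[0]
# surface = loss_surface(model, loss_fn, x.detach(), y_true,
#                        grad.sign(), torch.randn_like(x).sign(), radius=0.3)
```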

The loss surfaces of DNN models are well known to be highly non-linear, with many local minima and maxima in high-dimensional space. Finding the global extrema numerically is an NP-hard problem [6, 28], and no optimization approach based on a naive method like gradient descent on the model’s loss function is guaranteed to reach them; eventually, the optimizer gets stuck at local extrema. However, the above experiment shows that slightly changing the direction at each gradient step by leveraging the model uncertainty can increase the performance of the proposed attacks.

We conducted the same experiment on a different sample from the MNIST test dataset. Figure 7 shows that the maximum loss value along the uncertainty gradient direction is far greater than the maximum loss value along the loss gradient direction, and the maximum loss value along the hybrid direction is larger than both. Besides, there is no possibility of misclassification along the loss gradient direction, as there is no visible colour change in the surface plot of Fig. 7b, whereas in Fig. 7c there are yellow regions where the model misclassifies the input image along the uncertainty gradient direction. Again, when we analyzed the signs of the loss and uncertainty gradients, only 639 of the 784 directions were the same and 145 were different, a much larger disagreement than in the first experiment.

Fig. 7 Loss surfaces in different directions. The maximum loss values are 0.285, 0.811 and 0.845 for b, c and d, respectively

The epistemic uncertainty yields a better direction in our second experiment because our model, like all trained ML models, is not the "perfect" predictor but only an approximation of the oracle function. The model itself carries an inherent "approximation uncertainty", which sometimes induces sub-optimal solutions. Consequently, any method that relies solely on the trained model (which is not the optimum model) will be less effective.

4.1 Proposed epistemic uncertainty based attacks

Previous attack types in the literature have been designed to exploit the model loss and maximize its value within a constrained neighbourhood of the input data points, and this approach has produced quite successful results. However, one possible drawback of these attacks is that they rely solely on the trained ML model, which inevitably suffers from approximation error. We can overcome this problem by utilizing an additional metric, the epistemic uncertainty of the model. This additional uncertainty information has a correcting effect and improves convergence towards global extrema by yielding a higher loss value; the results shown in Figs. 6 and 7 support this argument. Therefore, we can reformulate existing attacks using model uncertainty instead of model loss, or even benefit from both.

4.1.1 Fast gradient sign method (uncertainty-based)

The formulation used in traditional loss-based FGSM attack is given below:

$$ \mathbf{x}^{adv} = \mathbf{x} + \epsilon \cdot sign(\nabla_{\mathbf{x}} \ell(\mathbf{x},y_{true})) $$
(6)

where x is the input (clean) image, x^adv is the perturbed adversarial image, \(\ell\) is the classification loss function, and y_true is the true label for the input x.

Our modified FGSM attack (uncertainty-based) is formulated as:

$$ \mathbf{x}^{adv} = \mathbf{x} + \epsilon \cdot sign(\nabla_{\mathbf{x}} U(\mathbf{x}, F, p, T)) $$
(7)

where x is the input (clean) image, x^adv is the perturbed adversarial image, U is the uncertainty metric (mean variance) obtained from T different MC dropout estimates, F is the prediction model kept in training mode (so that dropout remains active), p is the dropout ratio used in the dropout layers, and T is the number of MC dropout samples.

The steps for computing the uncertainty metric (the mean variance of T predictions) are given below; a code sketch follows the list:

  • Step 1: For an input image x, obtain T different predictions p_t(x) by Monte Carlo dropout sampling, where each prediction is a vector of softmax scores over the C classes.

    $$p_{t} (\mathbf{x})= \mathcal{F}(\mathbf{x},p,T), \quad t = 1,\dots,T$$
  • Step 2: Compute the average prediction score over the T different outputs:

    $$p_{T} (\mathbf{x})= \frac{1}{T} \sum\limits_{t=1}^{T} p_{t}(\mathbf{x}) $$
  • Step 3: Compute the variance of the T predictions for each class:

    $$\sigma^{2} (\mathbf{x}) = \frac{1}{T} \sum\limits_{t=1}^{T} \left( p_{t}(\mathbf{x}) - p_{T}(\mathbf{x}) \right)^{2} $$
  • Step 4: Compute the expected value of the variance over all classes by taking their average:

    $$ U(\mathbf{x},F,p,T)=E\left[\sigma^{2} (\mathbf{x})\right]$$
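A minimal PyTorch sketch of (7), combining the four steps above with a single signed gradient step, could look like the following; the forward passes are kept on the autograd graph so that the variance can be differentiated with respect to the input, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def epistemic_uncertainty(model, x, T=50):
    """Steps 1-4: mean over classes of the per-class variance of T MC-dropout softmax outputs."""
    model.train()  # keep dropout active; note this also toggles other train-time layers
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(T)])  # (T, B, C)
    var = ((probs - probs.mean(dim=0)) ** 2).mean(dim=0)                 # variance per class
    return var.mean(dim=1)                                               # expectation over classes

def uncertainty_fgsm(model, x, epsilon, T=50):
    """Uncertainty-based FGSM: x_adv = x + eps * sign(grad_x U(x, F, p, T)), cf. (7)."""
    x_adv = x.clone().detach().requires_grad_(True)
    u = epistemic_uncertainty(model, x_adv, T).sum()
    u.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```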

4.1.2 Basic iterative attack (uncertainty-based)

In this section, we first provide the pseudo-code for the known loss-based BIM attack types in Algorithms 1 and 2. Then, we provide our proposed uncertainty-based BIM attack variants in Algorithms 3 and 4. All the attack types proposed here are designed under the \(L_{\infty }\) norm.

Algorithms 1 and 2: Loss-based BIM attack variants
Algorithms 3 and 4: Uncertainty-based BIM attack variants
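Since the algorithm listings are not reproduced here, the following hedged sketch shows how the type-A uncertainty-based variant (stopping once the prediction flips) could be realized; the single-sample batch and the compact uncertainty term are our assumptions.

```python
import torch
import torch.nn.functional as F

def uncertainty_bim_a(model, x, y_true, epsilon, alpha, n_iter, T=20):
    """L-infinity BIM driven by the gradient of the MC-dropout uncertainty (type A)."""
    x_orig = x.clone().detach()
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        model.train()  # dropout stays active for the T MC samples
        probs = torch.stack([F.softmax(model(x_adv), dim=1) for _ in range(T)])
        u = ((probs - probs.mean(dim=0)) ** 2).mean(dim=0).mean()  # mean variance over classes
        u.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = torch.min(torch.max(x_adv, x_orig - epsilon), x_orig + epsilon).clamp(0.0, 1.0)
        model.eval()
        with torch.no_grad():
            if model(x_adv).argmax(dim=1).item() != y_true.item():
                break  # type A: stop as soon as the decision boundary is crossed
    return x_adv.detach()
```

The type-B variant would instead use the full iteration budget and, after fooling the model, continue with steps that decrease the uncertainty, as discussed in Section 4.2.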

4.1.3 Basic iterative attack (hybrid approach)

Here, we present the pseudo-code for our Hybrid Approach in Algorithm 5. As with the previous BIM attack variants, the Hybrid Approach is also designed under the \(\ell _{\infty }\) norm. At each iteration, we step in both the direction of the model loss gradient and the direction of the model uncertainty gradient. These two metrics complement each other and yield a better result.

Algorithm 5: Hybrid (loss- and uncertainty-based) BIM attack
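A hedged sketch of the hybrid idea, taking one signed step along the loss gradient and one along the uncertainty gradient at every iteration (the equal weighting of the two steps is our assumption), is given below.

```python
import torch
import torch.nn.functional as F

def hybrid_bim(model, loss_fn, x, y_true, epsilon, alpha, n_iter, T=20):
    """L-infinity BIM stepping along both the loss gradient and the uncertainty gradient."""
    x_orig = x.clone().detach()
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        # Uncertainty term: mean class-wise variance of T MC-dropout softmax outputs.
        model.train()
        probs = torch.stack([F.softmax(model(x_adv), dim=1) for _ in range(T)])
        u = ((probs - probs.mean(dim=0)) ** 2).mean(dim=0).mean()
        grad_unc = torch.autograd.grad(u, x_adv)[0]
        # Loss term: ordinary classification loss in evaluation mode.
        model.eval()
        grad_loss = torch.autograd.grad(loss_fn(model(x_adv), y_true), x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad_loss.sign() + alpha * grad_unc.sign()
            x_adv = torch.min(torch.max(x_adv, x_orig - epsilon), x_orig + epsilon).clamp(0.0, 1.0)
    return x_adv.detach()
```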

4.2 Visualizing gradient path for uncertainty-based attacks

Figure 8 shows a simplified example of the gradient path for our uncertainty-based BIM attack variants. In the figure, low-uncertainty regions are shown in blue and high-uncertainty regions in red. Figure 8a shows an example where uncertainty-based BIM attack type A is successful, but we would expect type B to fail for this specific example: at the intermediate iteration where we pass the decision boundary from the source to the target class, we are on the left side of the uncertainty hill, so when we subsequently try to decrease the uncertainty, we perturb the image back to the original class manifold. For Fig. 8b, in contrast, we would expect both uncertainty-based BIM attack types A and B to succeed, because this time, at the intermediate iteration where we pass the decision boundary, we are on the right side of the uncertainty hill.

Fig. 8 Uncertainty gradient path

4.3 Capability of the attacker

We assumed that the attacker’s primary purpose is to evade the model by applying a carefully crafted perturbation to the input data. In a real-world scenario, the white-box setting is the most desirable choice for an adversary who does not want to risk being caught; the drawback is that it requires the attacker to access the model in order to generate adversarial examples. After capturing the model information, the attacker can exploit the model’s vulnerabilities in the same way as in an adversary’s sandbox environment.

However, the attacker must solve an optimization problem to decide which regions of the input data must be changed so that the manipulation is not easily noticed by the human eye. By solving this optimization problem using one of the available attack methods [1, 19, 30, 40], the attacker aims to reduce the classification performance of the model on the adversarial data as much as possible. In this study, to limit the maximum perturbation allowed to the attacker, we used the \(\ell _{\infty }\) norm, which bounds the maximum pixel-wise difference between the original and adversarial images.

5 Results

5.1 Experimental setup

We trained our CNN models on the MNIST (Digit) [34] and MNIST (Fashion) [55] datasets and achieved accuracy rates of 99.05% and 91.15%, respectively. The model architectures are given in Table 1 and the selected hyperparameters in Table 2. For the CIFAR10 dataset [29], we used a pretrained VGG-A (11 weight layers) model [51] and applied transfer learning by freezing the convolution layers, changing the number of neurons in the output layer from 1000 to 10, and updating the weights of the dense layers only, for 10 epochs. In this way, we achieved an accuracy rate of 89.07% on the test data. Since the pretrained VGG model was trained on the ImageNet dataset, we rescaled the CIFAR10 images from 32 × 32 to 224 × 224 and applied the same normalization procedure, normalizing all pixels with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. The adversarial settings used throughout our experiments are provided in Table 3. Finally, we used T = 50 MC dropout samples when quantifying uncertainty.
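A hedged torchvision sketch of this transfer-learning setup is shown below; the optimizer, learning rate and batch size are illustrative assumptions, and the released code may differ.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Rescale CIFAR-10 to 224x224 and apply the ImageNet normalization described above.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = models.vgg11(pretrained=True)        # VGG-A with 11 weight layers
for param in model.features.parameters():    # freeze the convolutional layers
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 10)    # 1000-way ImageNet head -> 10 CIFAR-10 classes

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)  # illustrative choice
for epoch in range(10):                      # update only the dense layers for 10 epochs
    for images, labels in loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
```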

Table 1 CNN model architectures
Table 2 Model training parameters
Table 3 Adversarial settings of our experiments: α and i respectively denote the step size and the number of attack steps for a perturbation budget 𝜖

5.2 Visualizing uncertainty under different attack variants

Figure 9 shows the change in the quantified uncertainty values of the model during different BIM attack variants. In this experiment, we applied all of the attack variants to the 23rd test sample from the MNIST (Digit) dataset, whose original label is 6. For types A and B of the loss- and uncertainty-based attacks, the attack succeeded at the 11th and 13th iterations, respectively, and the input image was misclassified as 4. In Fig. 9a, we stop iterating as soon as we succeed in fooling the model, whereas in Fig. 9b we continue to perturb the image, but this time in a direction that minimizes the uncertainty. After the last iteration, the predicted label was still 4, and the uncertainty level had decreased compared to the moment of misclassification. For this sample, our uncertainty-based BIM attack type B was successful because, while maximizing the model uncertainty to cross the decision boundary, we also went beyond the point of maximum uncertainty. One last important point is that with the hybrid approach, which utilizes both loss and uncertainty, we could fool the model after the 6th iteration, which is much faster. This also supports our claim that the hybrid approach is more effective than the others.

Fig. 9 Change of uncertainty values during different BIM attack variants

5.3 Experimental results

During our experiments, we only perturbed test samples that were correctly classified by our models in their original states, since an intruder would have no reason to perturb samples that are already misclassified.

The results show that our Hybrid Approach of using both the model’s loss and its uncertainty yields the best performance. The success rates of pure loss-based and pure uncertainty-based attacks are similar to each other. We also observe that the success rates of uncertainty-based attack types A and B differ. We argue that the point of maximum uncertainty for any class does not lie on the model’s decision boundary, as it does for the model loss; instead, it can lie beyond the decision boundary. Therefore, during the gradient-based search, we may pass the decision boundary without reaching the peak uncertainty value, and when we then start to decrease the uncertainty after passing the decision boundary (fooling the model), it is possible to move back into the original class. This is not the case for loss-based approaches: since we maximize the loss with respect to a reference class, the loss keeps increasing throughout the gradient-based maximization. Figure 10 shows some examples of adversarial samples crafted using the different methods discussed in this study, and Table 4 shows the results of our experiments.

Fig. 10 Some example images from MNIST (Digit), MNIST (Fashion) and CIFAR-10. The original image is shown in the left-most column and adversarial samples crafted with the different methods are in the other columns

Table 4 Attack success rates on different datasets

6 Conclusion

In this study, we proposed new attack algorithms that perturb the input in a direction that maximizes the model’s epistemic uncertainty instead of its loss. We observed performance comparable to loss-based approaches. We also introduced a new strategy for finding points with higher loss values within a specified \(\ell_{p}\) norm interval when crafting adversarial samples: a hybrid approach that steps in the gradient directions of both loss and uncertainty at each iteration. We showed that the attack success rates are higher with this approach.

The aim of this study was not to propose the most powerful attack to date. Instead, we aimed to show that there exist other powerful metrics, different from the model’s loss, that can be exploited to craft adversarial examples. Besides, we empirically demonstrated that relying solely on the trained model may not always be the best option, as it is just an approximation of the best predictor, whereas epistemic uncertainty information can be very advantageous in cases where the model is misleading. We also showed that the combined use of uncertainty and loss yields better attack performance.

In future work, we will investigate the possible use of uncertainty metrics for defense purposes. We believe that carefully minimizing the quantified uncertainty of an input sample can help reduce any adversarial effect and thereby help restore the input sample to its original data manifold.