1 Introduction

The rapid improvement of deep learning methods has resulted in major breakthroughs in multiple fields of machine learning, including image classification. Novel deep learning architectures achieve very accurate classification performance; however, these models are sensitive to adversarial perturbations [47]. Generating adversarial examples is an effective attack method that can mislead, for example, a face recognition system through an obfuscation attack (rejecting a genuine subject) or an impersonation attack (matching to an impostor), where the attacker aims to have someone identified as the target person [51]. Counterfactuals are very similar to adversarial examples because the aim is to find a different data point with a different classification output under the constraint that this example should be as similar as possible to the original data point. They can be used for explaining classifications, so counterfactuals have been proposed to fulfill the “right to an explanation” in the General Data Protection Regulation (GDPR) [7, 49]. Unfortunately, images (related to the GDPR) can be used to violate users’ privacy and can be exploited to perform targeted attacks for retrieving personal information, for example, from social network platforms [5, 8].

The goal of adversarial attacks is to change the output label of the model by adding special noise to the input. The attacker also minimizes the norm of the attack vector to hide the adversarial perturbations from human eyes. The aim of this paper was to work out a defense method against small adversarial perturbations with the capability of detecting adversarial attacks and robust classification performance.

Most machine learning techniques were designed to work on a specific problem set, where the training and test data are sampled from the same distribution. In most cases, it can be shown that it is possible to mislead these models by supplying input data that is not from the same statistical distribution as the training data. We call an (input, noise) pair the adversarial example for a given model if the model correctly classifies the input but fails to classify the noised input. In most real-world scenarios, it can be crucial to train models that are robust against adversarial attacks [44]. In this scientific field, there are many different approaches for training a robust model. In this paper, we focus on image classification problems, where the goal was to develop a new defense method against adversarial examples to incorporate two aspects of adversarial defense, the capability of detecting adversarial attacks and the robust training for better classification performance [30].

The rest of the paper is organized as follows. Section 2 contains the literature review. In Section 3, we examine attack methods and a new variant of the defense method. Section 4 contains the proposed 2N labeling defense method with the two aspects mentioned above. In Section 5, we discuss the evaluation plan for defense methods. Section 6 compares the state-of-the-art methods with our method based on the results. Section 7 concludes the paper.

2 Related works

2.1 Adversarial attacks

An adversarial attack tries to mislead the classifier model by adding carefully crafted noise to the input. The attacks can be divided into two groups based on the goal of the attacker. The goal of targeted attacks is to find a noise vector (with a minimal norm) that changes the output of the model to a specific class. In the case of untargeted attacks, the attacker wants to change the output of the model to any class but the correct one. These attacks can be formally written as follows:

$$ {\boldsymbol{\delta}}^{opt}=\arg\ {\min}_{\boldsymbol{\delta}}\left\Vert \boldsymbol{\delta} \right\Vert\ \mathrm{such}\ \mathrm{that}\ C\left(\boldsymbol{x}+\boldsymbol{\delta} \right)={y}^{\ast }\ \left(\mathrm{targeted}\right) $$
$$ {\boldsymbol{\delta}}^{opt}=\arg\ {\min}_{\boldsymbol{\delta}}\left\Vert \boldsymbol{\delta} \right\Vert\ \mathrm{such}\ \mathrm{that}\ C\left(\boldsymbol{x}+\boldsymbol{\delta} \right)\ne C\left(\boldsymbol{x}\right)\ \left(\mathrm{untargeted}\right), $$

where δopt is the optimal noise vector, C(x) = arg max F(x), F is the classifier function that returns a probability distribution over the possible labels, and y* is the label desired by the attacker. The minimized norm can be any vector norm; in this paper, L∞ was used. An input and perturbation pair (x, δ) is an adversarial example in the computer vision field if it successfully misleads a given model while a human can still correctly classify the image. In this paper, untargeted attacks are investigated, but targeted attacks can be examined similarly.

Adversarial attacks are widely studied because they can identify vulnerabilities of machine learning models before (or after) they are deployed; they can find perturbations for a given input. Applications of adversarial attack techniques can be found in computer vision, natural language processing [20], cyberspace security, healthcare [41], and the physical world [42]; our examples come only from computer vision. The attack methods used to generate such perturbations rely on detailed model information (gradient-based attacks [30, 33]), on confidence scores such as class probabilities (score-based attacks [25]), or on the final model decision (decision-based attacks [39], e.g., Boundary Attack [4], Evolutionary Attack [12], or the HopSkipJumpAttack algorithm [11]). The Boundary Attack follows the decision boundary between adversarial and non-adversarial samples using a simple rejection sampling algorithm in conjunction with a proposal distribution and a dynamic step-size adjustment [4]. The Evolutionary Attack algorithm was proposed to generate adversarial examples in the decision-based black-box setting. The method improves efficiency by modeling the local geometry of the search directions while reducing the dimension of the search space. Existing face recognition models are extremely vulnerable to adversarial attacks in the black-box setting, which raises security concerns and motivates the development of more robust face recognition models [12]. The HopSkipJumpAttack algorithm [11] estimates the gradient direction using binary information at the decision boundary. The algorithm is capable of optimizing the parameters for both targeted and untargeted attacks, but decision-based adversarial attacks require many iterations. This paper focuses on decision-based adversarial attacks, where the model can be considered a black box.

2.2 Defense methods

Many different approaches exist for defending against adversarial examples [34], such as distillation [38], gradient hiding [48], noise injection [18], and adversarial training [14]. The distillation technique, which was previously used to reduce Deep Neural Network (DNN) dimensionality, was investigated as a defense method against adversarial perturbations. Based on this, defensive distillation [38] was defined and evaluated on standard DNN architectures. The authors analytically showed how distillation impacts models learned by deep neural network architectures during training, and they demonstrated that defensive distillation can reduce the success rate of attacks against DNNs. However, Carlini and Wagner [6] developed a set of attacks that compute norm-restricted additive perturbations that completely break defensive distillation. Liu et al. [28] improved defensive distillation with feature distillation, but the original deficiency of distillation-type methods remained.

A natural defense technique against gradient-based attacks is the gradient hiding method [48], which tries to hide information about the gradients of the model from the adversary. This technique has two disadvantages: (i) non-gradient-based machine learning methods can avoid this type of attack, but they are less accurate; and (ii) a gradient hiding defense is easily fooled by learning a surrogate model of the black box that has gradients and crafting adversarial examples with it [39].

Another approach is noise injection; training the network with Gaussian noise is an effective technique to perform model regularization, thus improving DNN model robustness against input variation. The Parametric-Noise-Injection (PNI) method [18] involves trainable Gaussian noise injection at each layer of DNN on either activation or weights through solving the min-max optimization problem, embedded with adversarial training. These parameters are trained explicitly to achieve improved robustness. Fan et al. [13] proposed two detectors (a statistical detector and the Gaussian noise injection detector) and integrated them. The statistical detector extracts Subtractive Pixel Adjacency Matrix (SPAM) and uses the second-order Markov transition probability matrix to model SPAM to highlight the statistical anomaly hidden in an adversarial input. Then a classifier using SPAM-based feature is applied to detect the adversarial input containing perturbation. The Gaussian noise injection detector first injects an additive Gaussian noise into the input and then feeds both the original input and the injected one into the classifier. By comparing the two outputs difference, the detector is applied to detect adversarial input containing perturbation; if the difference exceeds a threshold, the input is adversarial; otherwise legitimate. These methods focus on only detection, and they do not attempt to strengthen the attacked model after the detection.

Feature squeezing [50, 54] is another defense technique. Its main idea is to reduce the complexity of the data representation so that the perturbations disappear because of the lower sensitivity. For image datasets, this technique (i) reduces the color depth at the pixel level, that is, uses fewer values to encode the colors; or (ii) applies a smoothing filter over the images to map multiple inputs to the same value. Shaham et al. [43] investigated various image manipulation basis functions, such as low-pass filtering and JPEG compression, as defense mechanisms. Other authors [26] also used filtering, where the high-frequency components were removed, and rotation was applied as an image manipulation basis function. Although these techniques provide a countermeasure against adversarial attacks, feature squeezing and basis function manipulations significantly worsen the accuracy of the model.
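The color depth reduction variant of feature squeezing can be illustrated with a minimal Python sketch. The function name reduce_bit_depth, the [0, 1] pixel range, and the example values are assumptions made for this illustration; they are not taken from the cited works.

```python
def reduce_bit_depth(pixels, bits):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels.

    Hypothetical helper: nearby values (e.g. a clean pixel and a slightly
    perturbed one) are mapped to the same quantized level.
    """
    levels = 2 ** bits - 1
    # int(v + 0.5) rounds non-negative values to the nearest integer.
    return [int(p * levels + 0.5) / levels for p in pixels]


clean = [0.30, 0.70]
perturbed = [0.34, 0.66]  # small perturbation of the clean pixels
squeezed_clean = reduce_bit_depth(clean, bits=2)
squeezed_perturbed = reduce_bit_depth(perturbed, bits=2)
```

In this toy case both inputs are squeezed to the same values, so the perturbation no longer reaches the classifier; the price is the coarser representation of every input, which is consistent with the accuracy loss mentioned above.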

Most defense research tries to modify the model to increase its robustness [21]; we call this robust training. For example, a Convolutional Neural Network (CNN) was trained with perturbed samples manipulated by various transformations and contaminated by different noises to foster the robustness of the CNN against adversarial attacks, because both adversarial and noisy samples undermine the classifier accuracy. The authors proposed a combination of a convolutional denoising autoencoder with a classifier (CDAEC) as a defensive structure [17].

In other papers, the methods try to detect the presence of adversarial attacks instead of trying to increase the robustness of the classifier in different fields [2], like speech recognition [29], time-series data [1], face recognition [32], or computer vision [53]. In computer vision, checking the intrinsic consistencies in the input data is a possible way to detect adversarial attacks (e.g., by checking the object co-occurrence relationships in complex scenes). Motivated by the observation that language descriptions of natural scene images have already captured the object co-occurrence relationships that can be learned by a language model, the authors developed an approach to perform context consistency checks using such language models. Another example in computer vision is the Fourier analysis, where the information in the Fourier domain of input images and feature maps can be used to distinguish original test samples from adversarial images. Based on this analysis two detection methods are proposed [16], (i) a method that employs the magnitude spectrum of the input images to detect an adversarial attack, and (ii) a second one, which builds upon the first and additionally extracts the phase of Fourier coefficients of feature-maps at different layers of the deep neural network.

The NULL labeling method [19] is a state-of-the-art defense method with detection capability. The advantage of this method is that it assigns perturbed inputs to the NULL label instead of classifying them into their original label. This method can be regarded as the most effective defense mechanism against adversarial attacks [9]. It rejects adversarial examples accurately without compromising the accuracy on clean data. The only disadvantage of the NULL labeling method is that the probabilities of the original N classes are usually low, so even if the method is able to detect the attack, it is hard to decide what the original class might have been.

These defense methods aim to solve only the detection or the robust classification task; previous research did not investigate the possibility of solving both at the same time. If a model could be trained for both tasks, then multitask learning [31] could improve the adversarial robustness of the models. Our research is similar to the NULL labeling method [19] (i.e., detection of the adversarial samples), but we are also interested in the classification after the detection. This means that the aim of our method is to filter out the suspected adversarial images and then classify the remaining ones. The theoretical analysis of this combined task is one of the novelties of this paper.

Lots of methods described in this section are not able to detect the adversarial samples. Naturally, they can be built into a pipeline process, where the first part is a new binary classifier to detect and filter the adversarial examples, and the second part is the original method. Thus, we implemented some defense methods; these methods with a binary classifier were evaluated and compared with our method. Besides that, we compared the NULL labeling method (as shown in Sections 3.3 and 3.4) as the most similar to our proposed method.

3 Examined attack methods and a new variant of the defense method

3.1 Attack methods

The most frequently used attack method is the Fast Gradient Sign Method (FGSM) [14], which approximates the optimal adversarial perturbation within a given L∞ bound around the original input. This means that the norm of the attack noise vector cannot be higher than a given threshold. During training, the cost function of the machine learning model can be linearized around the current value of the model parameters, and an optimal max-norm constrained perturbation can be obtained as follows:

$$ \boldsymbol{\delta} =\varepsilon \operatorname{sign}\left({\nabla}_{\boldsymbol{x}}J\left(\boldsymbol{\theta}, \boldsymbol{x},y\right)\right) $$

Here δ is the adversarial perturbation, ε is the amplitude of the attack, and θ is the model parameter vector. J is a cost function, which penalizes the divergence from the desired class. The divergence can be measured by cross-entropy, as Eq. 4 presents.

$$ H\left(p,q\right)=-\sum \limits_{i=1}^n{p}_i\log {q}_i $$

If y is the correct class and y* is the desired class for the targeted attack, the cost function of the attacker can be written as follows:

$$ J\left(\boldsymbol{\theta}, \boldsymbol{x},y\right)=\left\{\begin{array}{c}-\log {F}_{y\ast}\left(\boldsymbol{x}\right)\kern0.5em \left(\mathrm{targeted}\right)\\ {}\log {F}_y\left(\boldsymbol{x}\right)\kern0.75em \left(\mathrm{untargeted}\right),\end{array}\right. $$

where the negative sign disappears for the untargeted attack because of the entropy maximization at this type of attack. There are extensions to this method that use an iterative approach for generating adversarial noise, for example, Iterative Fast Gradient Sign Method [24] or Projected Gradient Descent [10]. With these methods, the noise vector can be closer to the optimum at the cost of computing gradients multiple times.
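As an illustration of Eq. 3, the following minimal Python sketch computes the one-step FGSM perturbation for a small linear softmax classifier. The classifier, its weights W and b, and the helper names (predict_proba, loss_gradient, fgsm) are assumptions made for this example and are not the examined implementation. The sketch assumes that J is the cross-entropy −log F_y(x) with respect to the correct label y, so that adding the perturbation increases the loss (untargeted attack).

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict_proba(W, b, x):
    """F(x) of a hypothetical linear softmax classifier."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def loss_gradient(W, b, x, y):
    """Gradient of the assumed loss J = -log F_y(x) with respect to x."""
    p = predict_proba(W, b, x)
    # d(-log p_y) / d(logit_k) = p_k - 1[k == y]
    dlogits = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    return [sum(dlogits[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(W, b, x, y, eps):
    """One-step perturbation delta = eps * sign(grad_x J), so ||delta||_inf <= eps."""
    return [eps * sign(g) for g in loss_gradient(W, b, x, y)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy two-class example (all numbers are illustrative).
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x, y = [0.6, 0.4], 0
delta = fgsm(W, b, x, y, eps=0.2)
x_adv = [xi + di for xi, di in zip(x, delta)]
clean_label = argmax(predict_proba(W, b, x))
adv_label = argmax(predict_proba(W, b, x_adv))
```

In this toy setting the clean input is assigned to class 0, while the perturbed input is assigned to class 1, although no component of the input changed by more than ε. The iterative variants mentioned above would repeat this step with a smaller step size and project the result back into the ε-ball.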

Besides the L∞ norm constrained perturbations, another attack type minimizes the L0 norm [3], which means that the attacker wants to modify the minimal number of elements (i.e., pixels in the case of images) in the input vector regardless of the magnitude of each modification. In our research, all pixels were allowed to change (so the L∞ norm was minimized). The examined Smooth Targeted Attack Based on Gradient Method (STG) [19] belongs to the second type, where the attacker can generate adversarial noise with a given L0 constraint. The STG attack is an iterative approach for finding adversarial examples, meaning that it modifies one input feature at a time.

3.2 Adversarial training defense method

Having reviewed the adversarial attacks, we now present the defense methods. Adversarial training (or adversarial learning) (ADV) [14] is a defense method against L∞ bounded adversarial attacks. To learn a robust model, adversarial training uses an extended loss function defined as follows:

$$ \overset{\sim }{J}\left(\boldsymbol{\theta}, \boldsymbol{x},y\right)=\alpha J\left(\boldsymbol{\theta}, \boldsymbol{x},y\right)+\left(1-\alpha \right)J\left(\boldsymbol{\theta}, \boldsymbol{x}+\varepsilon \operatorname{sign}\left({\nabla}_{\boldsymbol{x}}J\left(\boldsymbol{\theta}, \boldsymbol{x},y\right)\right),y\right), $$

where J is the original, \( \overset{\sim }{J} \) is the extended loss function and α is the importance of the original samples against the adversarial ones. Using this method, the model can classify images correctly even with adversarial noise present, but by this design, it would be difficult to obtain attack probability from the output of the model trained with Adversarial training.
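The extended loss of Eq. 6 can be sketched in a few lines of Python. The sketch is model-agnostic: loss and grad_x are hypothetical callables supplied by the user (the model parameters θ are assumed to be captured inside them), and the quadratic toy loss below is only an assumption used to make the example runnable.

```python
def sign(v):
    return (v > 0) - (v < 0)


def adversarial_training_loss(loss, grad_x, x, y, eps, alpha):
    """alpha * J(x, y) + (1 - alpha) * J(x + eps * sign(grad_x J(x, y)), y)."""
    g = grad_x(x, y)
    x_adv = [xi + eps * sign(gi) for xi, gi in zip(x, g)]
    return alpha * loss(x, y) + (1.0 - alpha) * loss(x_adv, y)


# Toy loss (assumption): squared error of a model that sums its inputs.
def toy_loss(x, y):
    return (sum(x) - y) ** 2


def toy_grad_x(x, y):
    return [2.0 * (sum(x) - y) for _ in x]


extended = adversarial_training_loss(
    toy_loss, toy_grad_x, x=[1.0, 2.0], y=2.0, eps=0.5, alpha=0.5
)
```

Here the clean loss is 1.0 and the loss on the FGSM-perturbed input is 4.0, so the extended loss with α = 0.5 is 2.5. In an actual training loop this value would be minimized with respect to θ.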

3.3 NULL labeling defense method

Another defense method is NULL labeling [19], which extends the label set of an N label classification problem with a new label (NULL), which represents the presence of adversarial noise.

$$ {S}_{NULL}=\left\{{C}_1,{C}_2,\dots, {C}_N,{C}_{NULL}\right\}, $$

where {C1, C2, …, CN} is the label set of the original N label classification problem and SNULL is the extended label set for NULL labeling. NULL labeling also defines a training strategy. First, the model is pre-trained on the original dataset, without any perturbations, until the model achieves a certain accuracy in the N class classification problem.

In the next phase, the model is trained on clean samples with probability α and on adversarial examples (generated with the STG method [19]) with probability 1 − α. NULL label probabilities are assigned to each adversarial example as follows:

$$ P\left(c= NULL\ \mathrm{for}\ \mathrm{attack}\ \boldsymbol{\delta} \right)=f\left(\boldsymbol{\delta} \right)=\frac{\left|\left\{\boldsymbol{\tau} \in \Delta :{\left\Vert \boldsymbol{\tau} \right\Vert}_0<{\left\Vert \boldsymbol{\delta} \right\Vert}_0\right\}\right|}{\left|\Delta \right|}, $$

where ∆ is the set of perturbations for adversarial examples on the validation set. The desired probability distribution used to calculate the cross-entropy loss for the samples without perturbation is the following:

$$ P\left(c=i\right)=\left\{\begin{array}{c}q,\kern0.5em \mathrm{if}\ i={c}_{original}\\ {}0,\kern3.25em \mathrm{if}\ i=\mathrm{NULL}\\ {}\frac{1-q}{\mathrm{N}-1},\kern4.5em \mathrm{else}\end{array}\right. $$

Setting the value of q to less than 1 (but close to 1) is called label smoothing [35], which prevents the model from learning to give overconfident predictions [40]. The probabilities for the samples with perturbation are calculated as follows:

$$ P\left(c=i\right)=\left\{\begin{array}{c}q\left(1-f\left(\boldsymbol{\delta} \right)\right),\kern0.5em \mathrm{if}\ i={c}_{original}\\ {}f\left(\boldsymbol{\delta} \right),\kern7em \mathrm{if}\ i=\mathrm{NULL}\\ {}\frac{\left(1-q\right)\left(1-f\left(\boldsymbol{\delta} \right)\right)}{\mathrm{N}-1},\kern3.75em \mathrm{else}\end{array}\right. $$
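The target distributions of Eqs. 8 to 10 can be assembled as in the following Python sketch. The function names, the 0-based class indices, and the convention of storing the NULL label at the last index are assumptions made for this illustration.

```python
def null_probability(delta_l0, validation_l0_norms):
    """f(delta) of Eq. 8: share of validation perturbations with a smaller L0 norm."""
    smaller = sum(1 for t in validation_l0_norms if t < delta_l0)
    return smaller / len(validation_l0_norms)


def null_labeling_target(c_original, n_classes, q, f_delta=0.0):
    """Target over N classes plus NULL (last index).

    f_delta = 0 gives the clean-sample target of Eq. 9; f_delta > 0 gives
    the perturbed-sample target of Eq. 10.
    """
    other = (1.0 - q) * (1.0 - f_delta) / (n_classes - 1)
    target = [other] * n_classes + [f_delta]
    target[c_original] = q * (1.0 - f_delta)
    return target


f = null_probability(3, [1, 2, 3, 4])  # two of four norms are smaller
clean_target = null_labeling_target(0, n_classes=3, q=0.9)
perturbed_target = null_labeling_target(0, n_classes=3, q=0.9, f_delta=f)
```

Both targets sum to one. For the perturbed sample, half of the probability mass moves to the NULL label, and the remaining half keeps the smoothed shape of the clean target.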

3.4 NULL labeling variant for L∞ bounded attacks

In this paper, we focused on L∞ bounded attacks, thus the NULL labeling training strategy had to be modified to support L∞ instead of L0 bounded attacks. We achieved this by replacing the STG method with FGSM for generating adversarial inputs, where the attack amplitude εFGSM was sampled from a uniform distribution. We also had to define a new function to assign NULL label probabilities for the FGSM perturbations:

$$ f\left(\boldsymbol{\delta} \right)=\frac{{\left\Vert \boldsymbol{\delta} \right\Vert}_{\infty }}{\max \left\{{\left\Vert \boldsymbol{\tau} \right\Vert}_{\infty }\ |\ \boldsymbol{\tau} \in \Delta \right\}}, $$

where this function normalizes the attack amplitude for each sample by dividing with the largest εFGSM used for generating the perturbations. Since we use a uniform distribution for sampling ε, this function works similarly to the original f, defined in the paper (the probability of the attack is proportional to the ratio of the number of attacks with less amplitude to the number of all attacks).
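A minimal Python sketch of Eq. 11 is given below. The function names and the example perturbations are assumptions made for this illustration.

```python
def linf_norm(delta):
    return max(abs(d) for d in delta)


def null_probability_linf(delta, validation_perturbations):
    """f(delta) of Eq. 11: L-infinity norm normalized by the largest norm in Delta."""
    largest = max(linf_norm(t) for t in validation_perturbations)
    return linf_norm(delta) / largest


# FGSM perturbations have the same magnitude (epsilon) in every component.
validation = [[0.1, -0.1, 0.1], [0.02, 0.02, -0.02]]
f_half = null_probability_linf([0.05, -0.05, 0.05], validation)
f_full = null_probability_linf([0.1, 0.1, -0.1], validation)
```

An attack with half of the largest amplitude receives a NULL probability of about 0.5, and the strongest attack receives 1.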

4 Proposed 2N labeling defense method

4.1 The architecture of the 2N labeling method

The proposed 2N labeling defense method extends the original class label set to twice as many classes. As with the other methods, our defense method consists of two phases, training and test; Fig. 1 presents the first phase. During training, the model should be prepared for the attack, thus the attacker is simulated by a module. The simulated attacker module calculates the optimal noise based on Eq. 12, and in the next step, this noise is added to the input, resulting in the perturbed input x*. In this branch of the process, the class label is converted to the appropriate extended class by adding N to y; this helps the model learn to distinguish the attacked and original inputs. Label smoothing (details are in Section 4.2) is applied to both the original and the perturbed examples; here the two branches of the process (original and perturbed data) are merged. The extended class label set gives the model greater flexibility during training; this extended classifier produces the predictions. For deep neural network classifiers, the class label extension is achieved by modifying the output layer, which requires twice as many nodes. Based on the calculated losses J, the iterative learning process (i.e., the robust training) computes the optimal noise δ and the optimal model parameters θ as the next two equations show, where \( {\overline{y}}_{T, smooth} \) and \( {\overline{y}}_{smooth} \) are the smooth variants of the target and the original class labels.

Fig. 1 Block diagram of the proposed 2N labeling defense method during the training phase

$$ {\delta}^{\ast }={argmin}_{\delta }\ J\left(\theta, x+\delta, {\overline{y}}_{T, smooth}\right),\kern0.5em s.t.{\left\Vert \delta \right\Vert}_{\infty}\sim U, $$
$$ {\theta}^{\ast }={argmin}_{\theta }\ \left(\alpha J\left(\theta, x,{\overline{y}}_{smooth}\right)+\left(1-\alpha \right)J\left(\theta, x+{\delta}^{\ast },{\overline{y}}_{smooth}\right)\right), $$

The block diagram of the test phase is shown in Fig. 2, where the trained model does not know whether an attacker is present (in unknown circumstances, the input data can be original or perturbed). Since the model was trained as shown in Fig. 1, the attacked model has defense capability. The extended class label set facilitates two-stage decisions, where the first stage (detector module) tries to detect the perturbed data. The detector module predicts whether the instance is a perturbed input or not (see Eq. 15 later). The suspected perturbed instances are removed, and final decisions (as the second stage) are made on the others, where the 2N classes are converted to the original N classes (see Eq. 16 later).

Fig. 2 Block diagram of the proposed 2N labeling defense method during the test phase

4.2 Extending class label set for 2N labeling method

Our idea for a new defense method is similar to NULL labeling, but instead of extending the original label set with one (NULL) label, the proposed method duplicates each original label (S2N denotes the set of labels for 2N labeling in the next equation),

$$ {S}_{2N}=\left\{{C}_1,{\mathrm{C}}_2,\dots, {C}_N,{C}_{N+1},{C}_{N+2},\dots, {C}_{2N}\right\} $$

where Ci is the original label (at index i, where i = 1,2, …, N) and CN + i stands for the same label but with adversarial perturbation (noise) present.

For higher amplitude adversarial attacks, the output distribution of the NULL labeling model shifts towards the NULL label, making it harder to determine the original label of the input. In contrast, our method directly learns to classify these high amplitude attacks correctly. This technique also allows us to detect the presence of adversarial perturbation. If we calculate the sum of the probabilities of the second N labels, we get the probability that noise is present:

$$ P\left(\mathrm{attack}\right)=\sum \limits_{i=N+1}^{2N}{F}_i\left(\boldsymbol{x}\right), $$

where Fi(x) is the value at index i of the output probability distribution vector of the model. This P(attack) can be used in our detector module (in Fig. 2); if the P(attack) of an instance is larger than a predefined threshold, then the method removes this instance. The remaining instances should be converted into the original classes to obtain the final decision (as the last step of the test phase). Eq. 16 presents the conversion formula; the class with the largest probability will be the output of the model.

$$ P\left(c=i\right)={F}_i\left(\boldsymbol{x}\right)+{F}_{i+N}\left(\boldsymbol{x}\right), $$
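The two-stage decision of Eqs. 15 and 16 can be sketched in Python as follows. The function names, the 0-based indices, the threshold value, and the example output vectors are assumptions made for this illustration; returning None stands for a filtered (removed) instance.

```python
def attack_probability(F, n):
    """Eq. 15: sum of the probabilities of the second N labels."""
    return sum(F[n:2 * n])


def collapse_to_n(F, n):
    """Eq. 16: P(c = i) = F_i + F_{i+N} for the original N classes."""
    return [F[i] + F[i + n] for i in range(n)]


def classify_with_filter(F, n, threshold):
    """Return None for a suspected attack, otherwise the predicted class index."""
    if attack_probability(F, n) > threshold:
        return None
    probs = collapse_to_n(F, n)
    return max(range(n), key=probs.__getitem__)


# Hypothetical model outputs for N = 3 (2N = 6 output nodes).
F_clean_like = [0.5, 0.1, 0.1, 0.2, 0.05, 0.05]
F_attack_like = [0.05, 0.05, 0.1, 0.1, 0.6, 0.1]
kept = classify_with_filter(F_clean_like, n=3, threshold=0.5)
removed = classify_with_filter(F_attack_like, n=3, threshold=0.5)
```

The first output has P(attack) of about 0.3 and is classified as class 0; the second has P(attack) of about 0.8 and is removed by the detector.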

In the training phase (as can be seen in Fig. 1), we used label smoothing [35] for the 2N labeling technique. The label smoothing was performed as follows:

$$ P\left(c=i\right)=\left\{\begin{array}{c}q,\kern0.5em \mathrm{if}\ i=\hat{c}\\ {}\frac{1-q}{2\mathrm{N}-1},\kern1em \mathrm{else}\end{array}\right. $$

where \( \hat{c}={c}_{original} \) if there is no perturbation, otherwise \( \hat{c}={c}_{original}+\mathrm{N} \).
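The smoothed 2N target of Eq. 17 can be built as in the following Python sketch. The function name and the 0-based class indices are assumptions made for this illustration.

```python
def two_n_target(c_original, n, q, perturbed):
    """Smoothed target over 2N labels; perturbed samples use label c_original + N."""
    c_hat = c_original + n if perturbed else c_original
    other = (1.0 - q) / (2 * n - 1)
    target = [other] * (2 * n)
    target[c_hat] = q
    return target


clean_target = two_n_target(1, n=3, q=0.9, perturbed=False)
attacked_target = two_n_target(1, n=3, q=0.9, perturbed=True)
```

For N = 3 and original class index 1, the clean target peaks at index 1 and the perturbed target peaks at index 4, while the remaining mass 1 − q is spread evenly over the other 2N − 1 labels.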

In the next section, we will show that filtering inputs can increase the overall predictive performance in an adversarial setting.

4.3 Proving that filtering can increase the predictive performance

Using robust training (in the learning phase) and filtering (in the test phase) allows us to deal with adversarial examples in two stages. First, our solution tries to identify all adversarial samples (which are then filtered out); after that, the ones missed can still be correctly classified if the model is robust. In real-world scenarios, it can be more important to prevent attacks from succeeding than to evaluate each input; this justifies the filtering mentioned before. Let us denote the ratio of the perturbed inputs, the number of all inputs, the accuracy on original instances, and the accuracy on perturbed instances by pa, N, accori, and accatt, respectively (in this section, N denotes the number of inputs rather than the number of classes). For the original (non-attacked) instances, the number of true positives (TPori) can be calculated from the accuracy as can be seen in Eq. 18.

$$ {TP}_{ori}={acc}_{ori}\cdotp N\cdotp \left(1-{p}_a\right) $$

In the case of the attack, the number of the true positive perturbed instances (TPatt) also can be calculated from the accuracy (Eq. 19).

$$ {TP}_{att}={acc}_{att}\cdotp N\cdotp {p}_a $$

Before the filtering, the accuracy of the mixed data (original and perturbed instances) is equal to the sum of true positive instances divided by the number of all instances as shown in Eq. 20.

$$ { ac c}_{before}=\frac{TP_{ori}+{TP}_{att}}{N}={ ac c}_{ori}\cdotp \left(1-{p}_a\right)+{ac\mathrm{c}}_{att}\cdotp {p}_a $$

The filtering process discards some perturbed instances. The ratio of the discarded perturbed inputs is the recall of the perturbed class in the binary classification, denoted by Ratt, so the ratio of the not detected (i.e., not discarded) perturbed instances to all perturbed instances equals 1 − Ratt. The filtering process also passes original instances; their ratio is denoted by Rori because this is the recall of the original instances. After the filtering, the accuracy can be expressed by the recall of the perturbed and original classes (Eq. 21); after simplification of Eq. 21, we can write accafter as can be seen in Eq. 22.

$$ {acc}_{after}=\frac{{TP}_{ori}\cdotp {R}_{ori}+{TP}_{att}\cdotp \left(1-{R}_{att}\right)}{N\cdotp \left(1-{p}_a\right)\cdotp {R}_{ori}+N\cdotp {p}_a\cdotp \left(1-{R}_{att}\right)} $$
$$ {acc}_{after}=\frac{acc_{ori}\cdotp \left(1-{p}_a\right)\cdotp {R}_{ori}+{acc}_{att}\cdotp {p}_a\cdotp \left(1-{R}_{att}\right)}{\left(1-{p}_a\right)\cdotp {R}_{ori}+{p}_a\cdotp \left(1-{R}_{att}\right)} $$

If we examine when the inequality accbefore < accafter holds, we arrive at the following inequality:

$$ 0<{p}_a\cdotp \left(1-{p}_a\right)\cdotp \left({acc}_{ori}-{acc}_{att}\right)\cdotp \left({R}_{ori}-{\overline{R}}_{att}\right), $$

where \( {\overline{R}}_{att} \) is the ratio of not discarded perturbed instances: \( {\overline{R}}_{att}=1-{R}_{att} \). The step-by-step derivation can be seen in the Appendix.

Here pa · (1 − pa) is always positive (for 0 < pa < 1), and we can assume that accori − accatt is also positive (because the aim of the attacker is to degrade the classifier model, thus accatt will be lower than accori). Then the inequality can be reduced to

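$$ {\overline{R}}_{att}<{R}_{ori} $$

This means that we get an increase in accuracy when the ratio of the not discarded perturbed samples is less than the ratio of the not discarded original samples; equivalently, when the ratio of discarded original samples is less than the ratio of discarded perturbed samples.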
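Eqs. 20 and 22 can be evaluated numerically to see how the recall values of the filter influence the accuracy. The following Python sketch does this; the function names and all numerical values are assumptions chosen only for illustration.

```python
def accuracy_before(acc_ori, acc_att, p_a):
    """Eq. 20: accuracy on the mixed data without filtering."""
    return acc_ori * (1 - p_a) + acc_att * p_a


def accuracy_after(acc_ori, acc_att, p_a, r_ori, r_att):
    """Eq. 22: accuracy on the instances that pass the filter."""
    kept_ori = (1 - p_a) * r_ori    # share of original instances that pass
    kept_att = p_a * (1 - r_att)    # share of perturbed instances that pass
    return (acc_ori * kept_ori + acc_att * kept_att) / (kept_ori + kept_att)


before = accuracy_before(acc_ori=0.9, acc_att=0.5, p_a=0.5)

# Filter that passes 95% of the original and discards 80% of the perturbed inputs.
after_good = accuracy_after(0.9, 0.5, 0.5, r_ori=0.95, r_att=0.8)

# Filter that passes only 10% of the original but 80% of the perturbed inputs.
after_bad = accuracy_after(0.9, 0.5, 0.5, r_ori=0.1, r_att=0.2)
```

With these illustrative numbers the accuracy without filtering is 0.70; the first filter raises it to about 0.83, whereas the second filter lowers it to about 0.54.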

In this section, we examined the effect of the filtering by comparing two cases, one with filtering, and another without filtering. The filtering can be investigated from another point of view. In the next section, we discuss the effect of the different attack strengths (attack amplitudes) on the model and the role of filtering in this effect.

4.4 The behavior of a model and investigating the accuracy diminution

Let us investigate the effect of different attack strengths. As a general phenomenon, the accuracy decreases as the attack strength increases, and the probability of the attack (calculated by the model on perturbed examples) increases with epsilon. Fig. 3 presents this phenomenon for two models in two different variants, where the accuracies and the average probabilities of the attack are shown as a function of the attack strength (epsilon). The trends of the curves are general for every model that possesses detector capability (i.e., is able to calculate the probability of an attack). The phenomenon for our model and for an arbitrary model can be seen in the left and the right diagram of Fig. 3, respectively. In the right variant, the probability of the attack is constant over a long range (it increases only briefly at the end), thus attack detection encounters difficulties. A probability of the attack that increases over a long range (as in the left diagram) facilitates correct detection. If a defense method possesses an accurate detector, then the filtering is able to compensate for the accuracy reduction (theoretically, even constant accuracy is achievable).

Fig. 3 Accuracy and the average probability of the attack detection as a function of the attack strength

If the goal is to reach constant accuracy after the filtering, then we should examine the conditions that must hold. We call this constant accuracy situation the attack amplitude insensitivity phenomenon. The attack strength increases step by step, and considering steps i and i + 1, we would like to achieve equal accuracies in two consecutive steps as Eq. 25 shows.

$$ {acc}_{after\_i}={acc}_{after\_i+1} $$

Let us denote the accuracy on perturbed instances (with robust training but before filtering) by \( {acc}_{att} \) (it appears as \( acc{\prime}_{att} \) in the equations below); \( {p}_a \), \( {acc}_{ori} \), \( {R}_{ori} \), and \( {\overline{R}}_{att} \) are the same as in the previous subsection. As the attack strength increases, \( {acc}_{att} \) decreases; the ratio of this reduction between steps i and i + 1 is denoted by \( {r}_Z \). \( {\overline{R}}_{att} \), i.e. the ratio of not discarded perturbed instances, should also decrease as a function of the attack strength. The detector aims to decrease \( {\overline{R}}_{att} \) as much as it can, so it tries to minimize \( {r}_X \), where \( {r}_X={\overline{R}}_{att\_i+1}/{\overline{R}}_{att\_i} \). Let us suppose that the attacker achieved a diminution of \( {r}_Z \), where \( {r}_Z={acc}_{att\_i+1}/{acc}_{att\_i} \). What value of \( {r}_X \) should the detector achieve so that the final accuracy does not decrease? To answer this question, we need a formula that relates \( {r}_Z \) and \( {r}_X \). Eq. 25 can be continued: the latter accuracy, i.e. \( {acc}_{after\_i+1} \), can be expressed with \( {r}_Z \) and \( {r}_X \) from Eq. 22.

$$ {acc}_{after\_i}=\frac{acc_{ori}\cdotp \left(1-{p}_a\right)\cdotp {R}_{ori}+\left( acc{\prime}_{att}\cdotp {r}_Z\right)\cdotp {p}_a\cdotp \left({\overline{R}}_{att}\cdotp {r}_X\right)}{\left(1-{p}_a\right)\cdotp {R}_{ori}+{p}_a\cdotp \left({\overline{R}}_{att}\cdotp {r}_X\right)} $$

The relationship between \( {r}_Z \) and \( {r}_X \) can be derived from the previous equation, and \( {r}_X \) can be expressed as Eq. 27 shows.

$$ {r}_X=\frac{\left(1-{p}_a\right)\cdotp {R}_{ori}\cdotp \left({acc}_{ori}-{acc}_{after\_i}\right)}{p_a\cdotp {\overline{R}}_{att}\cdotp \left({acc}_{after\_i}- acc{\prime}_{att}\cdotp {r}_Z\right)} $$

The condition for a nearly constant accuracy is that the defense method achieves this \( {r}_X \) value between two consecutive steps as often as possible.
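The relationship between the two ratios can be checked numerically. The following minimal sketch evaluates the two fractions above for one step; the function names and all numeric values are hypothetical and chosen only for illustration, and the first function assumes that the accuracy after filtering at step i has the form of the first fraction with both ratios equal to 1.

```python
def accuracy_after_filtering(acc_ori, acc_att, p_a, r_ori, r_att_kept):
    """Accuracy on the inputs that survive the filter.

    acc_ori / acc_att: accuracy on original / perturbed inputs before filtering,
    p_a: proportion of attacked inputs,
    r_ori / r_att_kept: ratio of NOT discarded original / perturbed inputs.
    """
    kept_ori = (1.0 - p_a) * r_ori
    kept_att = p_a * r_att_kept
    return (acc_ori * kept_ori + acc_att * kept_att) / (kept_ori + kept_att)


def required_r_x(acc_ori, acc_att, p_a, r_ori, r_att_kept, r_z):
    """Ratio r_X that keeps the filtered accuracy unchanged when acc_att shrinks by r_Z."""
    acc_after_i = accuracy_after_filtering(acc_ori, acc_att, p_a, r_ori, r_att_kept)
    numerator = (1.0 - p_a) * r_ori * (acc_ori - acc_after_i)
    # The denominator is zero when acc_after_i equals acc_att * r_z; not handled here.
    denominator = p_a * r_att_kept * (acc_after_i - acc_att * r_z)
    return numerator / denominator


# Hypothetical values for step i (not taken from the experiments).
acc_ori, acc_att, p_a, r_ori, r_att_kept = 0.9, 0.6, 0.5, 0.95, 0.4
r_z = 0.8  # the attacker reduces the accuracy on perturbed inputs to 80%

acc_now = accuracy_after_filtering(acc_ori, acc_att, p_a, r_ori, r_att_kept)
r_x = required_r_x(acc_ori, acc_att, p_a, r_ori, r_att_kept, r_z)
acc_next = accuracy_after_filtering(acc_ori, acc_att * r_z, p_a, r_ori, r_att_kept * r_x)
print(f"r_X = {r_x:.4f}, accuracy at step i = {acc_now:.4f}, at step i+1 = {acc_next:.4f}")
```

With these example values the detector has to let through only about 64% as many perturbed instances as in the previous step to keep the filtered accuracy unchanged.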

4.5 Training the selected model

In our work, the VGG-19 model architecture [45] was selected as the attacked model for evaluating the defense methods. The VGG architecture consists of convolutional and pooling layers; its configurations differ in the number of weight layers, and we used the one with 19 layers. In the implementation, each model layer consists of a convolutional layer followed by a batch normalization layer and a ReLU (Rectified Linear Unit) activation.

To support NULL and 2N labeling, the model architecture had to be modified only at the final layer. In the case of NULL labeling, the length of the output vectors at the final layer is N + 1, and for the 2N model, it is 2⋅N, where N is the number of classes in the original classification problem. We used a value of 0.8 for the q parameter of label smoothing.

Both models were trained for 200 epochs with early stopping [52]; we used cross-entropy as the loss function and trained the models with Stochastic Gradient Descent [22] with a Cosine Annealing learning rate [15]. When training the adversarial learning (ADV) model, alpha was set to 0.5, and for the NULL and 2N labeling models, we added noise to 50% of the images. During the training, the epsilon value of the FGSM perturbations was drawn from a uniform distribution between 0 and 0.3 for each training sample.
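The sampling scheme for the training perturbations can be sketched as follows. This is only an illustration under stated assumptions: pixel values are assumed to lie in [0, 1], the loss gradient is assumed to be supplied by the training framework, and the helper names `fgsm_perturb` and `sample_training_epsilon` are hypothetical.

```python
import random


def fgsm_perturb(pixels, loss_gradient, epsilon):
    """FGSM step: shift every pixel by epsilon along the sign of the loss gradient."""
    def sign(g):
        return (g > 0) - (g < 0)

    # Clipping to [0, 1] is an assumption about the pixel range.
    return [min(1.0, max(0.0, x + epsilon * sign(g)))
            for x, g in zip(pixels, loss_gradient)]


def sample_training_epsilon(rng, perturb_ratio=0.5, eps_max=0.3):
    """Return None for a clean sample, otherwise an epsilon drawn from U(0, eps_max)."""
    if rng.random() >= perturb_ratio:
        return None
    return rng.uniform(0.0, eps_max)


# Toy usage with a three-pixel "image" and a made-up gradient.
image = [0.2, 0.5, 0.9]
gradient = [0.7, -1.3, 0.0]
adversarial = fgsm_perturb(image, gradient, 0.25)

rng = random.Random(0)
epsilon = sample_training_epsilon(rng)
training_input = image if epsilon is None else fgsm_perturb(image, gradient, epsilon)
```

Drawing a separate epsilon for every perturbed sample exposes the model to the whole amplitude range between 0 and 0.3 within each epoch.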

5 Evaluation plan for defense methods

5.1 Indicators for evaluation

Accuracy, i.e. the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined, is a widely used indicator of model quality for classification problems; we used this indicator in our evaluation.

Adversarial sample transferability is the property that some adversarial samples produced to mislead a specific model can mislead other models even if their architectures greatly differ [37]. Different models trained to solve the same problem tend to be vulnerable to similar perturbations. If we have two models, \( {M}_a \) and \( {M}_b \), and \( {X}_a \) is a set of adversarial examples on \( {M}_a \), then the ratio \( \mid {X}_b\mid /\mid {X}_a\mid \) is called the transferability of \( {X}_a \) to \( {M}_b \), where \( {X}_b \) is the subset of \( {X}_a \) whose elements are also adversarial examples for \( {M}_b \). If a model has significantly lower transferability, we expect it to be more robust against adversarial attacks; thus, transferability can be a good measure of robustness.
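The definition of transferability translates directly into a few lines of code. In the following minimal sketch, the function name and the predicate `is_adversarial_for_b` are hypothetical; the predicate stands for whatever check decides that an example also misleads the second model.

```python
def transferability(adversarial_for_a, is_adversarial_for_b):
    """Return |X_b| / |X_a|, where X_b holds the elements of X_a that also mislead model b."""
    if not adversarial_for_a:
        raise ValueError("X_a must contain at least one adversarial example")
    x_b = [x for x in adversarial_for_a if is_adversarial_for_b(x)]
    return len(x_b) / len(adversarial_for_a)


# Toy usage: ten adversarial examples for model a, identified by index;
# three of them (a made-up set) also mislead model b.
x_a = list(range(10))
fooled_b = {1, 4, 7}
ratio = transferability(x_a, lambda x: x in fooled_b)
print(ratio)
```

A value of 0 means that none of the examples crafted on the first model transfer to the second one, and a value of 1 means that all of them do.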

5.2 Defense methods in the comparison

The aim of using adversarial defense methods is to prevent a decrease in accuracy while lowering the transferability of models in an environment where the models can be exposed to adversarial attacks. To generate adversarial perturbations, the attacker used a substitute model (with VGG-19 architecture), which was trained on the whole training set with early stopping. The attacked model (i.e. the parameters of its VGG-19 architecture) was unknown to the attacker, even though it was similar to the substitute model.

There were 5 defense models in the comparison (ADV: Adversarial learning [14], NULL: NULL labeling [19], 2N: 2N labeling, CDAEC: CDAEC method [17], DISTILL: Defense Distillation [38]); for each model, we trained 5 different instances (for the 5-fold cross-validation). Although the epsilon value was between 0 and 0.3 during training, the epsilon value of the FGSM perturbations in testing was drawn from a uniform distribution between 0 and 0.5 in order to investigate larger attacks and attacks of unknown amplitude as well. We used the test set (in each fold of the cross-validation) to evaluate the models; half of the samples contained an adversarial perturbation (generated with the FGSM algorithm using the substitute model).

5.3 Evaluation pipelines

We used the evaluation pipelines shown in the following diagrams to measure the robust classification performance (Fig. 4) and the performance of the combined task, i.e. the filtering and N-class classification subtasks (Fig. 5). The module “FGSM using the substitute model” appears in both figures; the FGSM attacker can observe the inputs and the outputs of the attacked model but has no knowledge of its internals. After the observation, the attacker constructs a substitute model for the original attacked model and creates adversarial examples (adversarial images) based on this substitute model. In the evaluation, the attacked model receives both the adversarial and the original images. In the scenario described in Fig. 4, the detector module is omitted when the defense method possesses detector capabilities (e.g. the NULL and 2N labeling methods); otherwise, the attacked model is simply trained by the defense method (e.g. ADV, CDAEC, DISTILL).

Fig. 4 Block diagram showing the evaluation pipeline for the robust classification task (only N-class classification task, i.e. without filtering)

Fig. 5 Block diagram showing the evaluation pipeline for the filtering and N-class classification task

In the scenario described in Fig. 5, a detector module is added to the attacked model when the defense method does not possess detector capabilities (e.g. ADV, CDAEC, DISTILL). The filtering in the detector is a binary classification task: each model should predict whether the input was perturbed or not. More precisely, we need to calculate the probability that a given input contains an adversarial perturbation. By defining a threshold probability, we can assign a label (perturbed or not perturbed) to each input. The inputs predicted as perturbed are excluded from the subsequent classification task.
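The thresholding step can be illustrated with a short sketch. The threshold value of 0.5, the convention that a probability at or above the threshold means "perturbed", and the tuple layout of the samples are assumptions made only for this example.

```python
def filter_and_score(samples, threshold=0.5):
    """Discard inputs flagged as perturbed and score the classifier on the rest.

    samples: list of (attack_probability, predicted_label, true_label) tuples.
    Returns (accuracy on kept inputs or None if nothing is kept, number discarded).
    """
    kept = [s for s in samples if s[0] < threshold]
    discarded = len(samples) - len(kept)
    if not kept:
        return None, discarded
    correct = sum(1 for _, predicted, true in kept if predicted == true)
    return correct / len(kept), discarded


# Toy usage with made-up detector outputs and labels.
samples = [
    (0.1, 3, 3),  # kept, classified correctly
    (0.9, 2, 5),  # discarded by the filter
    (0.2, 1, 1),  # kept, classified correctly
    (0.7, 4, 4),  # discarded, although it would have been classified correctly
    (0.3, 0, 6),  # kept, misclassified
]
accuracy, discarded = filter_and_score(samples)
print(accuracy, discarded)
```

In this toy case, the accuracy on all five inputs is 3/5, while the accuracy on the three kept inputs is 2/3, even though one correctly classified input was discarded.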

The NULL and 2N labeling techniques are able to calculate attack probabilities by design; for the other defense methods, however, we had to train a separate binary model (BIN) specifically for the filtering task. For the BIN model, we used the VGG-16 architecture (a shallower network worked better). The BIN model was trained by adding adversarial perturbations with various amplitudes to the inputs. With this addition, the attack probabilities can be calculated for each model as shown in Table 1.

Table 1 How the probability of the attack was calculated at each method

The last task is to predict the label for the inputs that the filter method did not discard. For this task, we had to transform the output of the NULL and 2N models into a probability distribution over the original N labels; this output transformation to N-class probabilities can be seen in Table 2. The transformation is not required for the other defense methods.

Table 2 How the probability distribution over the original N labels was obtained for each model
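To make the two output transformations more concrete, the following sketch shows one plausible reading of them; it is an assumption made for illustration and not a reproduction of Table 1 or Table 2. It assumes that the NULL model stores the NULL class in the last of its N + 1 outputs, and that the 2N model stores the N original classes in the first half and their N adversarial counterparts in the second half of its output vector. All function names are hypothetical.

```python
def null_attack_probability(probs):
    """NULL model (N + 1 outputs): assume the last entry is the NULL (perturbed) class."""
    return probs[-1]


def null_to_n_class(probs):
    """Renormalize the first N entries so that they sum to 1 (assumes they are not all zero)."""
    head = probs[:-1]
    total = sum(head)
    return [p / total for p in head]


def two_n_attack_probability(probs):
    """2N model: assume the second half of the outputs holds the adversarial classes."""
    n = len(probs) // 2
    return sum(probs[n:])


def two_n_to_n_class(probs):
    """Merge each original class with its assumed adversarial counterpart."""
    n = len(probs) // 2
    return [probs[i] + probs[n + i] for i in range(n)]


# Toy usage with N = 3 and made-up softmax outputs.
null_output = [0.2, 0.1, 0.1, 0.6]
two_n_output = [0.1, 0.05, 0.05, 0.6, 0.1, 0.1]
print(null_attack_probability(null_output), null_to_n_class(null_output))
print(two_n_attack_probability(two_n_output), two_n_to_n_class(two_n_output))
```

Under these assumptions, both models yield a single attack probability for the filter and an N-class distribution for the classifier from one forward pass.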

Thanks to the two scenarios, all methods can be evaluated both with and without filtering, and the accuracy and transferability values can be measured in each case.

5.4 Two datasets in the evaluation

We used the CIFAR-10 [23] and GTSRB [46] datasets to evaluate the adversarial defense methods mentioned above. The CIFAR-10 dataset consists of 60,000 RGB images of size 32 × 32 in 10 classes, with 6000 images per class. The dataset was split into a training set (50,000 images) and a test set (10,000 images).

The GTSRB (German Traffic Sign Recognition Benchmark) dataset is a benchmark for multi-class, single-image classification tasks; each image contains one traffic sign. The images are stored in PPM format, and their sizes vary from 15 × 15 to 250 × 250 pixels. The dataset consists of 43 classes and contains more than 50,000 images in total; 75% of them form the training set, and the remainder form the test set.

6 Results

6.1 Accuracy with and without using filtering

The mean accuracy was calculated for each method against various attack amplitudes, as can be seen in Fig. 6 and Fig. 7. The vertical lines indicate the minimal and maximal accuracy values for each method (since the multiple model instances from the cross-validation resulted in different accuracies), and the middle points (representing the mean values) are connected into a line. In section 4.2, we presented the possibility of increasing accuracy by filtering from an analytical point of view; Fig. 6 and Fig. 7 provide empirical evidence that this can indeed be achieved.

Fig. 6 Accuracy for each method with and without using filtering on CIFAR-10

Fig. 7 Accuracy for each method with and without using filtering on GTSRB

The increase in accuracy with filtering is conspicuous for 2N and NULL labeling (as can be seen in Fig. 6 and Fig. 7; the difference was also large for CDAEC and DISTILL); for the ADV models, however, this effect was much smaller. A possible cause of this phenomenon is multi-task learning: while the 2N and NULL model instances are trained to solve the original N-label classification problem and to detect adversarial noise simultaneously, the other models and the BIN filter model were trained completely separately (leaving no chance for one task to improve the performance of the other).

On the CIFAR-10 dataset, Fig. 6 shows that NULL labeling achieves higher accuracy for lower attack amplitudes (while 2N labeling is the most accurate for stronger attacks); however, the accuracy of the 2N labeling method is not sensitive to the amplitude of the attack. This means that models trained with the 2N labeling technique can operate at a nearly constant accuracy level regardless of the presence or amplitude of the attack. In a real-world scenario, this is the desired behavior: if an attacker has feedback on the success rate from the attacked system, the attacker can find out which amplitudes work best against the system. If the performance of the system is independent of the attack amplitude, it is much harder for the attacker to find an adversarial perturbation that can mislead the system.

On the GTSRB dataset (as can be seen in Fig. 7), the sensitivity of the 2N labeling method is also very low, and NULL labeling shows a similar phenomenon. To compare their sensitivities, we calculated the differences between consecutive accuracies in the diagrams and divided them by the corresponding differences of epsilons; the maximum of these quotients gives the maximal observed slope of the accuracy curve. Besides (i) this maximal sensitivity, we calculated (ii) the standard deviation of the accuracy and (iii) the maximal absolute difference (range) of the accuracy over the whole range of attack amplitudes. Table 3 shows that the 2N labeling method possesses the lowest sensitivity, standard deviation, and range values on both datasets, so it is the least sensitive model.

Table 3 Max sensitivity: largest sensitivity between 0 and 0.5. Std. dev: standard deviation of the accuracies. Range: maximal absolute difference of accuracy (this is also averaged for each model instance)
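The three sensitivity indicators can be computed from an accuracy curve as in the following minimal sketch. Taking the absolute value of each slope and using the population standard deviation are assumptions of this example, and the sample curve is made up rather than taken from the measurements.

```python
import statistics


def sensitivity_metrics(epsilons, accuracies):
    """Max slope, standard deviation, and range of an accuracy curve.

    epsilons must be strictly increasing and as long as accuracies.
    """
    points = list(zip(epsilons, accuracies))
    slopes = [abs(a2 - a1) / (e2 - e1)
              for (e1, a1), (e2, a2) in zip(points, points[1:])]
    return {
        "max_sensitivity": max(slopes),
        "std_dev": statistics.pstdev(accuracies),
        "range": max(accuracies) - min(accuracies),
    }


# Hypothetical accuracy curve over attack amplitudes between 0 and 0.5.
epsilons = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
accuracies = [0.90, 0.89, 0.88, 0.86, 0.85, 0.85]
metrics = sensitivity_metrics(epsilons, accuracies)
print(metrics)
```

A perfectly amplitude-insensitive model would yield zero for all three indicators.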

The overall mean accuracies and the accuracies at stronger attacks are compared in Table 4. For the former, the epsilon of the attacks was between 0 and 0.5; for the latter, the attacks were stronger than expected (the defense technique was trained to withstand attacks with lower noise amplitude, since the epsilon value during training was sampled from a uniform distribution between 0 and 0.3). 2N labeling reached the highest accuracy against attacks with higher amplitude, as can be seen in the 3rd and 5th columns of Table 4. The best overall accuracy was also achieved by 2N labeling, except on CIFAR-10, where it was the second-highest; however, the difference in accuracy there is only 1%.

Table 4 Mean accuracy for each method if the attacks are sampled from a uniform distribution

Since Table 4 presents only the mean values, we planned a paired comparison (our model with each competitor model), similarly to other research [27], and conducted statistical tests on each pair to verify whether the bold value is significantly better than the other one in the pair. We first designed two-sample Student’s t-tests to test the null hypothesis that one of the two population means is larger than the other. This test can be used only if the variances of the two populations are assumed to be equal. We examined this assumption with an F-test, and it did not hold. Consequently, Welch’s t-test had to be used instead of the two-sample Student’s t-test, because it compares the means of two populations without requiring equal variances. We investigated each pair at the 95% confidence level (i.e. a significance level of 0.05); the detailed results of the Welch’s t-tests can be seen in the Appendix, in Table 6. Table 4 shows that the mean of NULL is larger than that of 2N, but the corresponding value in Table 6 (0.2523, the result of Welch’s t-test) is above 0.05, so this difference is not significant and we cannot conclude that NULL is better than 2N. All other values are less than 0.05; therefore, 2N is significantly better than the other competitors. Based on the accuracy, the 2N labeling method is the best option to choose if we want to retain accuracy in adversarial circumstances. The fact that 2N labeling was able to learn an attack amplitude insensitive model makes it the best choice by far.
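For reference, the statistic behind Welch’s t-test can be computed as in the following minimal sketch, which uses the unpooled standard error and the Welch-Satterthwaite approximation of the degrees of freedom. The function name and the sample accuracies are hypothetical. Turning the statistic into a p-value additionally requires the cumulative distribution function of Student’s t distribution, which the Python standard library does not provide (SciPy’s `scipy.stats.ttest_ind` with `equal_var=False` performs the complete test).

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up accuracies of two methods over five cross-validation folds.
method_a = [0.91, 0.92, 0.90, 0.93, 0.92]
method_b = [0.88, 0.85, 0.90, 0.84, 0.87]
t_stat, dof = welch_t(method_a, method_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike the two-sample Student’s t-test, this computation never pools the two variances, which is why it remains valid when the F-test rejects equal variances.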

6.2 Test results of transferability

Transferability values for 3 defense methods were calculated and are presented in Fig. 8 and Fig. 9, where the middle points (representing the mean values) are connected into a line, and the vertical line segments show the range of the result values. Significant improvements in transferability can be seen with filtering compared to the cases without filtering (for 2N the difference is large). Similarly to accuracy, the 2N labeling method is capable of training a model that operates at a nearly constant transferability regardless of the attack amplitude. The filtered transferability for 2N labeling is almost zero for every attack amplitude; thus, we were able to achieve an almost ideal defense. This means that even when an adversarial example passes through the filter, our model will be misled by the attack only with a very low probability.

Fig. 8 Transferability for methods with and without using filtering on CIFAR-10

Fig. 9 Transferability for methods with and without using filtering on GTSRB

Table 5 shows that the best (lowest) transferability was achieved by 2N labeling. The differences in transferability are conspicuous: the transferability of 2N labeling is (i) 5 times lower than that of the second-best NULL labeling on GTSRB, and (ii) 3 times lower than that of the second-best BIN+DISTILL method on CIFAR-10. We also examined the values in Table 5 from a statistical point of view, and several Welch’s t-tests were conducted similarly to the accuracy investigation (at the 95% confidence level). The detailed results of the hypothesis tests can be seen in the Appendix, in Table 7; it can be concluded that the 2N labeling method is significantly better than the competitor methods.

Table 5 Mean transferability for each method if the attacks are sampled from a uniform distribution [0, 0.5]

7 Conclusion

Adversarial attacks are a timely and frequently researched topic in the machine learning literature. Our novel contribution to this literature is the concept of combining filtering and robust training in a common model. This allows the model to learn multiple tasks, which improves the quality of the model.

For image classification problems, we elaborated a new combined defense method, the 2N labeling method, based on the idea of an extended class label set. In this method, we allow the models under attack to discard samples considered to contain an adversarial perturbation, because we proved theoretically that filtering can increase the accuracy. In the paper, we investigated the minimal condition for the positive effect of the filtering. Based on the mathematical derivation, it can be concluded that the accuracy increases when the ratio of discarded non-perturbed examples is less than the ratio of discarded perturbed examples. This means that the filtering module should filter out a higher rate of perturbed instances than of original instances. If a filtering module operated randomly, the two rates would be the same, which would result in the same accuracy as without filtering. Consequently, besides the 2N labeling method, any filtering module that operates better than randomly can achieve a rise in accuracy.

Another contribution of the paper is the investigation of the model behavior, considering the accuracy diminution and the average probability of the attack detection. We showed that if a defense method possesses an accurate detector, then the filtering is able to compensate for the accuracy reduction; we stated that even constant accuracy is attainable. This was not only a theoretical statement, since we also demonstrated it experimentally: our defense method learns a model that can operate at a nearly constant classification performance regardless of the presence or amplitude of adversarial attacks. We named this the attack amplitude insensitivity phenomenon, which was measured with accuracy and transferability. This phenomenon was most strongly observed with our 2N labeling method on both datasets.

We compared the 2N and NULL labeling methods with three defense methods: Adversarial learning, the CDAEC method, and the Defense Distillation method. To measure the effectiveness of each method, accuracy and transferability were calculated on two malicious validation datasets. The results showed that filtering yields better performance for all indicators. In terms of accuracy, the 2N labeling and the L variant of NULL labeling were the best according to our measurements (there was only a small difference between them, which was not significant), but in terms of accuracy against unexpected attacks (with large attack amplitude), the 2N labeling method surpassed the others. In terms of transferability, the 2N labeling method performed significantly better than NULL labeling and the other competitor methods.

To summarize, the paper highlights four new scientific results: (i) the combination of two capabilities, robust training and the detection of adversarial examples, (ii) the idea of an extended class label set for a new defense method, (iii) the theoretical analysis of the filtering in the detector, (iv) the theoretical possibility of, and experiments related to, the attack amplitude insensitivity phenomenon.

Most adversarial examples restrict their perturbations by an Lp-norm; thus, our defense method has focused on these types of perturbations. In future work, we plan to overcome this limitation of the defense, because unrestricted adversarial examples have also appeared in the literature [36].