1 Introduction

Despite their impressive performance on a variety of tasks, it has been known for more than a decade that machine-learning algorithms can be misled by different adversarial attacks, staged either at training or at test time [1, 2]. After the first attacks proposed against linear classifiers in 2004 [3, 4], Biggio et al. [5, 6] were the first to show that nonlinear machine-learning algorithms, including support vector machines (SVMs) and neural networks, can be misled by gradient-based optimization attacks [1]. Nevertheless, such vulnerabilities of learning algorithms became widely known only after Szegedy et al. [7, 8] demonstrated that deep learning algorithms exhibiting superhuman performance on image classification tasks suffer from the same problems. They showed that even slightly manipulating the pixels of an input image can be sufficient to induce deep neural networks to misclassify its content. Such attacks have since been popularized under the name of adversarial examples [2, 6, 7].

Since the seminal work by Szegedy et al. [7, 8], many defense methods have been proposed to mitigate the threat of adversarial examples. Most of the proposed defenses have been shown to be ineffective against more sophisticated attacks (i.e., attacks that are aware of the defense mechanism), leaving the problem of defending neural networks against adversarial examples still open. According to [2], the most promising defenses can be broadly categorized into two families. The first includes approaches based on robust optimization and game-theoretical models [9–11]. These approaches, which also encompass adversarial training [8], explicitly model the interactions between the classifier and the attacker to learn robust classifiers. The underlying idea is to incorporate knowledge of potential attacks during training. The second family of defenses (complementary to the first) is based on the idea of rejecting samples that exhibit an outlying behavior with respect to unperturbed training data [12–16].

In this work, we focus on defenses based on rejection mechanisms and try to improve their effectiveness. In fact, it has been shown that only relying upon the feature representation learned by the last network layer to reject adversarial examples is not sufficient [12, 13]. In particular, it happens that adversarial examples become indistinguishable from samples of the target class at such a higher representation level even for small input perturbations. To overcome this issue, we propose here a defense mechanism, named deep neural rejection (DNR), based on analyzing the representations of input samples at different network layers and on rejecting samples which exhibit anomalous behavior with respect to that observed from the training data at such layers (Section 2). With respect to similar approaches based on analyzing different network layers [14, 15], our defense does not require generating adversarial examples during training, and it is thus less computationally demanding.

We evaluate our defense against an adaptive white-box attacker that is aware of the defense mechanism and tries to bypass it. To this end, we propose a novel gradient-based attack that accounts for the rejection mechanism and aims to craft adversarial examples that avoid it (Section 3).

It is worth remarking here that correctly evaluating a defense mechanism is a crucial point when proposing novel defenses against adversarial examples [2, 17]. The majority of previous work proposing defense methods against adversarial examples has only evaluated such defenses against previous attacks rather than against an ad hoc attack crafted specifically against the proposed defense (see, e.g., [15, 18, 19] and all the other re-evaluated defenses in [17, 20]). The problem with these black-box and gray-box evaluations in which the attack is essentially unaware of the defense mechanism is that they are overly optimistic. It has indeed been shown afterwards that such defenses can be easily bypassed by simple modifications to the attack algorithm [17, 20, 21]. For instance, many defenses have been found to perform gradient obfuscation, i.e., they learn functions which are harder to optimize for gradient-based attacks; however, they can be easily bypassed by constructing a smoother, differentiable approximation of their function, e.g., via learning a surrogate model [2, 6, 22–25] or replacing network layers which obfuscate gradients with smoother mappings [17, 20, 21]. In our case, an attack that is unaware of the defense mechanism may tend to craft adversarial examples in areas of the input space which are assigned to the rejection class; thus, such attacks, as well as previously proposed ones, may rarely bypass our defense. For this reason, we believe that our adaptive white-box attack, along with the security evaluation methodology adopted in this work, provides another significant contribution to the state of the art related to the problem of properly evaluating defenses against adversarial examples.

The security evaluation methodology advocated in [2, 17, 26], which we also adopt in this work, consists of evaluating the accuracy of the system against attacks crafted with an increasing amount of perturbation. The corresponding security evaluation curve [2] shows how gracefully the performance decreases as the attack increases in strength, up to the point where the defense reaches zero accuracy. This is another important phenomenon to be observed, since any defense against test-time evasion attacks has to fail when the perturbation is sufficiently large (or, even better, unbounded); in fact, in the unbounded case, the attacker can ideally replace the source sample with any other sample from another class [17]. If accuracy under attack does not reach zero for very large perturbations, then it may be that the attack algorithm fails to find a good optimum (i.e., a good adversarial example). This in turn means that we are probably providing an optimistic evaluation of the defense. As suggested in [17], the purpose of a security evaluation should not be to show which attacks the defense withstands, but rather to show when the defense fails. If one shows that breaking the defense requires perturbations large enough to compromise the content of the input sample and its nature (i.e., its true label), then the defense mechanism can be considered sufficiently robust. Another relevant point is to show that such a breakdown point occurs at a larger perturbation than that exhibited by competing defenses, to show that the proposed defense is more robust than previously proposed ones.

The empirical evaluation reported in Section 4, using both MNIST handwritten digits and CIFAR10 images, provides results consistent with the aforementioned aspects. First, it shows that our adaptive white-box attack is able to break our defense at larger perturbations. Second, it shows that our method improves on the performance of competing rejection mechanisms which only leverage the deep representation learned at the output network layer. We thus believe that our analysis unveils a promising way of defending against adversarial examples.

We conclude the paper by discussing related work (Section 5), the main contributions of this work, and its limitations, along with promising future research directions (Section 6).

2 Deep neural rejection

The underlying idea of our DNR method is to estimate the distribution of unperturbed training points at different network layers and reject anomalous samples that may be incurred at test time, including adversarial examples. The architecture of DNR is shown in Fig. 1.

Fig. 1

Architecture of deep neural rejection (DNR). DNR considers different network layers and learns an SVM with the RBF kernel on each of their representations. The outputs of these SVMs are then combined using another RBF SVM, which will provide prediction scores \(s_{1}, \ldots, s_{c}\) for each class. This classifier will reject samples if the maximum score \(\max_{k=1,\ldots,c} s_{k}\) is not higher than the rejection threshold θ. This decision rule can be equivalently represented as \(\arg\max_{k=0,\ldots,c} s_{k}(\boldsymbol{x})\), if we consider rejection as an additional class with \(s_{0} = \theta\)

Before delving into the details of our method, let us introduce some notation. We denote the prediction function of a deep neural network with \(f : \mathcal {X} \mapsto \mathcal {Y}\), where \(\mathcal {X} \subseteq \mathbb R^{d}\) is the d-dimensional space of input samples (e.g., image pixels), \(\mathcal {Y} \subseteq \mathbb R^{c}\) is the space of the output predictions (i.e., the estimated confidence values for each class), and c is the number of classes. If we assume that the network consists of m layers, then the prediction function f can be rewritten to make this explicit as \(f(\boldsymbol{x}) = \phi_{1}(\phi_{2}(\ldots \phi_{m}(\boldsymbol{x}; \boldsymbol{w}_{m}) \ldots; \boldsymbol{w}_{2}); \boldsymbol{w}_{1})\), where \(\phi_{1}\) and \(\phi_{m}\) denote the mapping functions learned respectively by the output and the input layer, and \(\boldsymbol{w}_{1}\) and \(\boldsymbol{w}_{m}\) are their weight parameters (learned during training).

For our defense mechanism to work, one first has to select a set of network layers; e.g., in Fig. 1, we select the outer layers \(\phi_{1}\), \(\phi_{2}\), and \(\phi_{3}\). Let us assume that the representation of the input sample x at level \(\phi_{i}\) is \(\boldsymbol{z}_{i}\). Then, on each of these selected representations, DNR learns an SVM with the RBF kernel, \(g_{i}(\boldsymbol{z}_{i})\), trying to correctly predict the class of the input sample. The confidence values on the c classes provided by this classifier are then concatenated with those provided by the other base SVMs and used to train a combiner, again an RBF SVM. The combiner will output predictions \(s_{1}, \ldots, s_{c}\) for the c classes, but will reject samples if the maximum confidence score \(\max_{k=1,\ldots,c} s_{k}\) is not higher than a rejection threshold θ. This decision rule can be compactly represented as \(\arg\max_{k=0,\ldots,c} s_{k}(\boldsymbol{x})\), where we define an additional, constant output \(s_{0}(\boldsymbol{x}) = \theta\) for the rejection class. According to this rule, if \(s_{0}(\boldsymbol{x}) = \theta\) is the highest value in the set, the sample is rejected; otherwise, it is assigned to the class exhibiting the largest confidence value.
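The decision rule with rejection can be illustrated with a minimal Python sketch. The function name `dnr_decide` and the example scores are hypothetical; only the rule itself (rejection as class 0 with constant score θ) comes from the text.

```python
def dnr_decide(scores, theta):
    """Return 0 to reject, or the 1-based index k of the predicted class.

    scores: combiner outputs [s_1, ..., s_c] for one sample.
    theta:  rejection threshold, treated as the constant score s_0.
    """
    # Prepend s_0 = theta so that rejection competes as class 0.
    augmented = [theta] + list(scores)
    # max() returns the first maximizer, so a tie with theta means "reject",
    # matching "reject if the maximum score is not higher than theta".
    return max(range(len(augmented)), key=lambda k: augmented[k])


print(dnr_decide([0.1, 0.7, 0.2], theta=0.5))  # 2 (class 2 exceeds theta)
print(dnr_decide([0.1, 0.4, 0.2], theta=0.5))  # 0 (rejected)
```

Treating rejection as an extra class keeps the rule a single arg max, which is also what allows the attack in Section 3 to exclude class 0 from the set of competing classes.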

As proposed in [12], we use an RBF SVM here to ensure that the confidence values \(s_{1}, \ldots, s_{c}\), for each given class, decrease as x moves further away from regions of the feature space which are densely populated by training samples of that class. This property, named compact abating probability in open-set problems [13, 28], is a desirable property to easily implement a distance-based rejection mechanism such as the one required in our case to detect outlying samples. With respect to [12], we train this combiner on top of other base classifiers rather than only on the representation learned by the last network layer, to further improve the detection of adversarial examples. For this reason, in the following, we refer to the approach by Melis et al. [12], rejecting samples based only on their representation at the last layer, as neural rejection (NR), and to ours, exploiting also representations from other layers, as deep neural rejection (DNR).
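To make the abating behavior concrete, one can write the class score in the standard kernel-SVM form. This is an assumption for illustration: the exact SVM formulation and any score calibration are not spelled out here, and the symbols \(\alpha_{i}^{(k)}\), \(b_{k}\), and \(\gamma\) are introduced only for this sketch.

$$ s_{k}(\boldsymbol{z}) = \sum_{i} \alpha_{i}^{(k)} \exp\left(-\gamma \|\boldsymbol{z} - \boldsymbol{z}_{i}\|^{2}\right) + b_{k} $$

Here \(\boldsymbol{z}_{i}\) are the support vectors and \(\alpha_{i}^{(k)}\) are signed dual coefficients. As \(\boldsymbol{z}\) moves away from all support vectors, every exponential term vanishes and \(s_{k}(\boldsymbol{z}) \to b_{k}\). Under this form, samples far from the training data are therefore rejected whenever \(b_{k} \leq \theta\) for every class k.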

3 Attacking deep neural rejection

To properly evaluate the security, or adversarial robustness, of rejection-based defenses against adaptive white-box adversarial examples, we propose the following. Given a source sample x and a maximum-allowed ε-sized perturbation, the attacker can optimize a defense-aware adversarial example \(\boldsymbol{x}^{\star}\) by solving the following constrained optimization problem:

$$ \boldsymbol{x}^{\star} = \arg\,\min_{\boldsymbol{x}^{\prime} : \| \boldsymbol{x} - \boldsymbol{x}^{\prime}\| \leq \varepsilon} \Omega(\boldsymbol{x}^{\prime}), \quad \text{where} \quad \Omega(\boldsymbol{x}^{\prime}) = s_{y}(\boldsymbol{x}^{\prime}) - \max_{j \notin \{0, y\}} s_{j}(\boldsymbol{x}^{\prime}), \tag{1} $$

where \(\|\boldsymbol{x} - \boldsymbol{x}^{\prime}\| \leq \varepsilon\) is an \(\ell_{p}\)-norm constraint (typical norms used for crafting adversarial examples are \(\ell_{1}\), \(\ell_{2}\), and \(\ell_{\infty}\), for which efficient projection algorithms exist [29]), \(y \in \mathcal {Y} = \{1, \ldots, c\}\) is the true class, and 0 is the rejection class. In practice, the attacker minimizes the output of the true class while maximizing the output of the highest-scoring competing class (excluding the reject class) to achieve (untargeted) evasion. This amounts to performing a strong maximum-confidence evasion attack (rather than searching for a minimum-distance adversarial example). We refer the reader to [2, 6, 24] for a more detailed discussion on this topic. While we focus here on untargeted (error-generic) attacks, our formulation can be extended to account for targeted (error-specific) evasion, as also done in [12].

The optimization problem in Eq. (1) can be solved through a standard projected gradient descent (PGD) algorithm, as given in Algorithm 1. In our experiments, we consider a variable step size η (by doubling the initial step size ten times) and select the point \(\boldsymbol{x}^{\prime}\) minimizing the objective at each update step. This allows our attack to escape local minima which may hinder the optimization process, and consequently, it allows us to obtain a more reliable security evaluation of the proposed detection method.
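The following is a minimal, self-contained sketch of a PGD loop for the objective in Eq. (1) on a two-dimensional toy problem. It is not the implementation used in the experiments. The class scores are hypothetical RBF bumps around fixed prototypes (`CENTERS`, `GAMMA`), standing in for the combiner outputs. The step-size schedule is one possible reading of the doubling strategy described above. All function names are illustrative.

```python
import math

# Hypothetical class prototypes and kernel width (stand-ins for a trained model).
CENTERS = {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (0.0, 2.0)}
GAMMA = 1.0


def score(x, k):
    """RBF-shaped score s_k(x); it decays as x moves away from class k."""
    cx, cy = CENTERS[k]
    return math.exp(-GAMMA * ((x[0] - cx) ** 2 + (x[1] - cy) ** 2))


def score_grad(x, k):
    """Analytic gradient of s_k with respect to x."""
    cx, cy = CENTERS[k]
    s = score(x, k)
    return (-2 * GAMMA * (x[0] - cx) * s, -2 * GAMMA * (x[1] - cy) * s)


def omega(x, y):
    """Objective of Eq. (1): true-class score minus best competing score.

    The reject class 0 is never a candidate here, so minimizing omega pushes
    the sample toward a confident (non-rejected) wrong class.
    """
    return score(x, y) - max(score(x, j) for j in CENTERS if j != y)


def omega_grad(x, y):
    j = max((j for j in CENTERS if j != y), key=lambda j: score(x, j))
    gy, gj = score_grad(x, y), score_grad(x, j)
    return (gy[0] - gj[0], gy[1] - gj[1])


def project_l2(x, x0, eps):
    """Project x onto the l2 ball of radius eps centered at x0."""
    dx, dy = x[0] - x0[0], x[1] - x0[1]
    norm = math.hypot(dx, dy)
    if norm <= eps:
        return x
    return (x0[0] + eps * dx / norm, x0[1] + eps * dy / norm)


def pgd_attack(x0, y, eps, eta=0.01, n_iter=100, n_doublings=10):
    x = x0
    for _ in range(n_iter):
        g = omega_grad(x, y)
        # Try eta, 2*eta, 4*eta, ... and keep the projected point with the
        # lowest objective value.
        candidates = [
            project_l2((x[0] - eta * 2 ** i * g[0], x[1] - eta * 2 ** i * g[1]), x0, eps)
            for i in range(n_doublings + 1)
        ]
        best = min(candidates, key=lambda c: omega(c, y))
        if omega(best, y) >= omega(x, y):
            break  # no candidate improves the objective
        x = best
    return x


x0 = (0.2, 0.1)
adv = pgd_attack(x0, y=1, eps=1.5)
print("omega before:", omega(x0, 1), "after:", omega(adv, 1))
```

Because the reject class is excluded from the inner maximum, lowering Ω raises the score of a competing class. With a suitable threshold θ, the resulting point therefore tends to land outside the reject region instead of merely leaving the true class.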

In Fig. 2, we report an example on a bi-dimensional toy problem to show how our defense-aware attack works against a rejection-based defense mechanism.

Fig. 2

Our defense-aware attack against an RBF SVM with rejection, on a 3-class bi-dimensional classification problem. The initial sample \(\boldsymbol{x}_{0}\) and the adversarial example \(\boldsymbol{x}^{\star}\) are respectively represented as a red hexagon and a green star, while the \(\ell_{2}\)-norm perturbation constraint \(\|\boldsymbol{x}_{0} - \boldsymbol{x}^{\prime}\|_{2} \leq \varepsilon\) is shown as a black circle. The left plot shows the decision region of each class, along with the reject region (in white). The right plot shows the values of the attack objective \(\Omega(\boldsymbol{x})\) (in colors), which correctly steers our attack away from the reject region

4 Experimental analysis

In this section, we evaluate the security of the proposed DNR method against adaptive, defense-aware adversarial examples. We consider two common computer vision benchmarks for this task, i.e., handwritten digit recognition (MNIST data) and image classification (CIFAR10 data). Our goal is to investigate whether and to which extent DNR can improve security against adversarial examples, in particular, compared to the previously proposed neural rejection (NR) defense (which only leverages the feature representation learned at the last network layer to reject adversarial examples) [12]. All the experiments presented in this section are based on the open-source Python library secml [30], which we plan to extend in the near future to include an implementation of both DNR and NR.

4.1 Experimental setup

We discuss here the experimental setup used to evaluate our defense mechanism.

4.1.1 Datasets

As mentioned before, we run experiments on MNIST and CIFAR10 data. MNIST handwritten digit data consists of 60,000 training and 10,000 test gray-scale 28×28 images. CIFAR10 consists of 50,000 training and 10,000 test RGB 32×32 images. We normalized the images of both datasets in [0, 1] by simply dividing the input pixel values by 255.

4.1.2 Train-test splits

We average the results on five different runs. In each run, we consider 10,000 training samples and 1000 test samples, randomly drawn from the corresponding datasets. To avoid overfitting, we train the DNR combiner on the outputs of the base SVMs computed on a separate validation set, using a procedure known as stacked generalization [27]. We use a 3-fold cross-validation procedure to subdivide the training dataset into three folds. Three times, we learn the base SVMs on two folds and classify the remaining (validation) fold. We then concatenate the predicted values for each validation fold and use such values to train the combiner. The deep neural networks (DNNs) used in our experiments are pre-trained on a training dataset (different from the ones that we use to train the SVMs) of 30,000 and 40,000 training samples, respectively, for MNIST and CIFAR10.
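The stacking step can be sketched as follows. To keep the example dependency-free, the base learner is a nearest-centroid scorer with RBF-shaped outputs, used here purely as a stand-in for the base RBF SVMs. The function names, toy data, and fold assignment are all hypothetical.

```python
import math
import random


def fit_centroids(X, y):
    """Stand-in base learner: one centroid per class (NOT an SVM)."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for d, v in enumerate(xi):
            acc[d] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}


def rbf_scores(model, x, gamma=1.0):
    """One RBF-shaped confidence value per class, ordered by class label."""
    return [
        math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, model[k])))
        for k in sorted(model)
    ]


def out_of_fold_scores(X, y, n_folds=3, seed=0):
    """Score every sample with a model that never saw it during fitting."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    oof = [None] * len(X)
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in val_idx:
            oof[i] = rbf_scores(model, X[i])
    return oof


# Toy data: two classes, two hypothetical "layer" representations per sample.
labels = [1] * 6 + [2] * 6
layer_a = [(0.1 * i, 0.0) for i in range(6)] + [(2.0 + 0.1 * i, 1.0) for i in range(6)]
layer_b = [(0.0, 0.2 * i) for i in range(6)] + [(1.0, 3.0 + 0.2 * i) for i in range(6)]

# Concatenate the out-of-fold scores of each base learner; these rows would
# be the training inputs of the combiner.
oof_a = out_of_fold_scores(layer_a, labels)
oof_b = out_of_fold_scores(layer_b, labels)
combiner_inputs = [a + b for a, b in zip(oof_a, oof_b)]
print(len(combiner_inputs), len(combiner_inputs[0]))
```

Training the combiner on out-of-fold scores matters because scores computed on the base learners' own training points would look overly confident, and the combiner would learn to trust them too much.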

4.1.3 Classifiers

We compare the DNR approach (which implements rejection here based on the representations learned by three different network layers) against an undefended DNN (without any rejection mechanism) and against the NR defense by Melis et al. [12] (which implements rejection on top of the representation learned by the output network layer). To implement the undefended DNNs for the MNIST dataset, we used the same architecture suggested by Carlini et al. [21]. For CIFAR10, instead, we considered a lightweight network that, despite its size, achieves high performance. The two considered architectures are shown in Table 1, whereas Table 2 shows the model parameters that we used to train the aforementioned architectures. The three layers considered by the DNR classifier are the last three layers for the network trained on MNIST, and the last layer plus the last batch norm layer and the second to the last max-pooling layer for the one trained on CIFAR10 (chosen to obtain a reasonable amount of features).

Table 1 Model architecture for MNIST (left) and CIFAR10 (right) networks
Table 2 Parameters used to train MNIST and CIFAR10 networks

4.1.4 Security evaluation

We compare these classifiers in terms of their security evaluation curves [2], reporting classification accuracy against an increasing \(\ell_{2}\)-norm perturbation size ε, used to perturb all test samples. In particular, classification accuracy is computed as follows:

  • In the absence of adversarial perturbation (i.e., for ε=0), classification accuracy is computed as usual, but considering rejects as errors;

  • In the presence of adversarial perturbation (i.e., for ε>0), all test samples become adversarial examples, and we consider them correctly classified if they are assigned either to the rejection class or to their original class (which typically happens when the perturbation is too small to cause a misclassification).

For DNR and NR, we also report the rejection rates, computed by dividing the number of rejected samples by the number of test samples. Note that the difference between accuracy and rejection rate at each ε>0 corresponds to the fraction of adversarial examples which are not rejected but still correctly assigned to their original class. Accordingly, under this setting, classifiers exhibiting higher accuracies under attack (ε>0) can be considered more robust.
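The two accuracy definitions and the rejection rate can be expressed as a short Python sketch. The function names and the toy predictions are hypothetical. Predictions use 0 for the rejection class, as in Section 2.

```python
def accuracy_with_reject(preds, labels, eps):
    """Accuracy as defined above; preds use 0 for 'rejected'."""
    if eps == 0:
        # Unperturbed samples: a reject counts as an error.
        correct = sum(p == t for p, t in zip(preds, labels))
    else:
        # Adversarial samples: rejecting them counts as correct.
        correct = sum(p == t or p == 0 for p, t in zip(preds, labels))
    return correct / len(labels)


def rejection_rate(preds):
    return sum(p == 0 for p in preds) / len(preds)


preds = [1, 0, 2, 3]
labels = [1, 2, 2, 1]
print(accuracy_with_reject(preds, labels, eps=0))    # 0.5
print(accuracy_with_reject(preds, labels, eps=0.5))  # 0.75
print(rejection_rate(preds))                         # 0.25
```

In this toy case, the difference 0.75 - 0.25 = 0.5 is the fraction of samples that are not rejected and still assigned to their original class (the first and third samples).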

4.1.5 Parameter setting

We use a 5-fold cross-validation procedure to select the hyperparameters that maximize classification accuracy on the unperturbed training data, and set the rejection threshold θ for NR and DNR to reject 10% of the samples when no attack is performed (at ε=0).
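One simple way to obtain such a threshold is to take an empirical quantile of the maximum class scores on unperturbed samples. The sketch below assumes exactly that. The text does not state which sample set is used to set θ, and the function name `pick_threshold` and the toy scores are hypothetical.

```python
import math


def pick_threshold(max_scores, reject_fraction=0.10):
    """Return theta such that about `reject_fraction` of samples have
    max score <= theta (and are hence rejected). Ties can reject more."""
    ranked = sorted(max_scores)
    n_reject = int(reject_fraction * len(ranked))
    if n_reject == 0:
        return -math.inf  # too few samples to reject any of them
    return ranked[n_reject - 1]


# Toy maximum scores for 20 unperturbed samples.
scores = [i / 20 for i in range(1, 21)]
theta = pick_threshold(scores)
rejected = sum(s <= theta for s in scores)
print(theta, rejected / len(scores))
```

The comparison `s <= theta` mirrors the decision rule of Section 2, where a sample is rejected when its maximum score is not higher than θ.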

4.2 Experimental results

The results are reported in Fig. 3. In the absence of attack (ε=0), the undefended DNNs slightly outperform NR and DNR, since the latter also wrongly reject some unperturbed samples.

Fig. 3

Security evaluation curves for MNIST (left) and CIFAR10 (right) data, reporting mean accuracy (and standard deviation) against ε-sized attacks

Under attack (ε>0), when the amount of injected perturbation is small, the rejection rate of both NR and DNR increases jointly with ε, as the adversarial examples are located far from the rest of the training classes in the representation space (i.e., the intermediate representations learned by the neural network). For larger ε, both NR and DNR can no longer correctly detect the adversarial examples, as they tend to become indistinguishable from the rest of the training samples (in the representation space in which NR and DNR operate). Both defenses outperform the undefended DNNs on the adversarial samples, and DNR slightly outperforms NR, exhibiting a more graceful decrease in performance. Although NR tends to reject more samples for ε∈[0.1,1] on CIFAR10 and for ε=0.5 on MNIST, its accuracy is lower than that of DNR. The reason is that DNR remains more accurate than NR when classifying samples that are not rejected. This also means that DNR provides tighter boundaries closer to the training classes than NR, thanks to the exploitation of lower-level network representations, which makes the corresponding defended classifier more difficult to evade. In Figs. 5 and 6, we show some adversarial examples computed respectively on the MNIST and CIFAR10 datasets against the considered classifiers.

Fig. 4

Influence of the rejection threshold θ on classifier accuracy under attack (y-axis) vs false rejection rate (i.e., fraction of wrongly rejected unperturbed samples) on MNIST (top) and CIFAR10 (bottom) for NR (left) and DNR (right), for different ε-sized attacks. The dashed line highlights the performance at 10% false rejection rate (i.e., the operating point used in our experiments)

Fig. 5

Adversarial examples computed on the MNIST data to evade DNN, NR, and DNR. The source image is reported on the left, followed by the (magnified) adversarial perturbation crafted with ε=1 against each classifier, and the resulting adversarial examples. We remind the reader that the attacks considered in this work are untargeted, i.e., they succeed when the attack sample is not correctly assigned to its true class

Finally, in Fig. 4, we show how the selection of the rejection threshold θ allows us to trade security against adversarial examples (i.e., accuracy on the y-axis) for a more accurate classifier on the unperturbed samples (reported in terms of the rejection rate of unperturbed samples on the x-axis). In particular, increasing (decreasing) the rejection threshold amounts to increasing (decreasing) the fraction of correctly detected adversarial examples and to increasing (decreasing) the rejection rate when no attack is performed.

Fig. 6

Adversarial examples computed on the CIFAR10 dataset adding a perturbation computed with ε=0.2. See the caption of Fig. 5 for further details

5 Related work

Different approaches have been recently proposed to perform rejection of samples that are outside of the training data distribution [31–33]. For example, Thulasidasan et al. [31] and Geifman et al. [32] have proposed novel loss functions accounting for rejection of inputs on which the classifier is not sufficiently confident. Geifman et al. [33] have proposed a method that allows the system designer to set the desired risk level by adding a rejection mechanism to a pre-trained neural network architecture. These approaches have, however, not been originally tested against adversarial examples, and it is thus of interest to assess their performance under attack in future work, also in comparison to our proposal.

Even if the majority of approaches implementing rejection or abstaining classifiers have not considered the problem of defending against adversarial examples, some recent work has explored this direction too [12, 13]. Nevertheless, with respect to the approach proposed in this work, they have only considered the output of the last network layer and perform rejection based solely on that specific feature representation. In particular, Bendale and Boult [13] have proposed a rejection mechanism based on reducing the open-set risk in the feature space of the activation vectors extracted from the last layer of the network, while Melis et al. [12] have applied a threshold on the output of an RBF SVM classifier. Despite these differences, the rationale of the two approaches is quite similar and resembles the older idea of distance-based rejection.

Few approaches have considered a multi-layer detection scheme similar to that envisioned in our work [14–16, 34, 35]. However, most of these approaches require generating adversarial examples at training time, which is computationally intensive, especially for high-dimensional problems and large datasets [14, 15, 34, 35]. Finding a methodology to tune the hyperparameters for generating the attack samples is also an open research challenge. Finally, even though the deep k-nearest neighbors approach by Papernot et al. [16] does not require generating adversarial examples at training time, it requires computing the distance of each test sample against all the training points at different network layer representations, which again raises scalability issues for high-dimensional problems and large datasets.

6 Conclusions and future work

We have proposed deep neural rejection (DNR), i.e., a multi-layer rejection mechanism that, differently from other state-of-the-art rejection approaches against adversarial examples, does not require generating adversarial examples at training time and is thus less computationally demanding. Our approach can be applied to pre-trained network architectures to implement a defense against adversarial attacks at test time. The base classifiers and the combiner used in our DNR approach are trained separately. As future work, it would be interesting to perform an end-to-end training of the proposed classifier similarly to the approaches proposed in [31] and [32]. Another research direction may be that of testing our defense against training-time poisoning attacks [2, 5, 36–38].