
1 Introduction

The success of deep neural nets for pattern recognition [35] has been a main driver behind the recent surge of interest in AI. A substantial part of this success is due to the Convolutional Neural Net (CNN) [5, 20] and its descendants, applied to image recognition tasks. Respective methods have reached the application level in business and industry [38] and have led to a wide variety of deployed models for critical applications like automated driving [2] or biometrics [46].

However, concerns regarding the reliability of deep neural networks have been raised by the discovery of so-called adversarial examples [41]. These inputs are specifically generated to “fool” [28] a classifier: to a human they visually appear to belong to one class, but the addition of barely visible perturbations causes the network to misclassify them with high confidence (see Fig. 1). The perturbations are obtained by an optimization process on the input: the network weights are fixed, and the input pixels are optimized for the dual criterion of (a) being classified differently from the true class, and (b) minimizing the changes to the input. A growing body of literature confirms the impact of this discovery on practical applications of neural nets [1]. It raises questions about how they achieve their performance (and in what respect differently from humans), and threatens serious deployments with the possibility of tailor-made adversarial attacks.

For instance, Su et al. [40] report successful attacks on neural networks that modify only a single pixel. The attack works without access to either the internal structure or the gradients of the network under attack. Moosavi-Dezfooli et al. [27] furthermore show the existence of universal adversarial perturbations that can be added to any image to fool a specific model, whereas the transferability of perturbations from one model to another is shown, for example, by Xu et al. [44]. The impact of similar attacks extends beyond classification [26], carries over to modalities other than images [6], and also applies to models other than neural networks [31]. Finally, adversarial attacks have been shown to work reliably even after perturbed images have been printed and captured again with a mobile phone camera [18]. Apparently, such research touches a weak spot.

On the other hand, there is a recent interest in the interpretability of AI agents and in particular machine learning models [30, 42]. It goes hand in hand with societal developments like the new European legislation on data protection that is impacting any organization using algorithms on personal data [13]. While neural networks are publicly perceived as “black boxes” with respect to how they arrive at their conclusions [15], several methods have been developed recently to allow insight into the representation and decision surface of a trained model, improving interpretability. Prime candidates amongst these methods are feature visualization approaches that make the operations in hidden layers of a CNN visible [29, 37, 45]. They can thus serve a human engineer as a diagnostic tool in support of reasoning over success and failure of a model on the task at hand.

In this paper, we propose to use a specific form of CNN feature visualization, namely feature response maps, not only to trace the effect of adversarial inputs on algorithmic decisions throughout the CNN, but also as input to a novel automated detection approach based on a statistical analysis of the feature responses via the average local spatial entropy of the image. The goal is to decide whether a model is currently under attack by the given input. Our approach has the advantage over existing methods of not changing the network architecture, i.e., not affecting classification accuracy; and of being interpretable both to humans and machines, an intriguing property also for future work on the method. Experiments on the validation set of ImageNet [34] with the VGG19 network [36] show the validity of our approach for detecting various state-of-the-art attacks.

Below, Sect. 2 reviews related work in contrast to our approach. Section 3 presents the background on adversarial attacks and feature response estimation before Sect. 4 introduces our approach in detail. Section 5 reports on experimental evaluations, and Sect. 6 concludes with an outlook to future work.

2 Related Work

Work on adversarial examples for neural networks is a very active research field. Potential attacks and defenses are published at a high rate and have been surveyed recently by Akhtar and Mian [1]. Amongst potential defenses, directly comparable to our approach are those that focus solely on detecting a possible attack rather than additionally recovering the correct classification.

On one hand, several detection approaches exist that exploit specific abnormal behavioral traces that adversarial examples leave while passing through a neural network: Liang et al. [22] consider the artificial perturbations as noise in the input and attempt to detect it by quantizing and smoothing image filters. A similar concept underlies the feature squeezing approach by Xu et al. [43], which compares the network’s output on raw and filtered inputs and raises a flag if it detects a large difference between the two. Feinman et al. [9] observe the network’s output confidence as estimated by dropout in the forward pass [11], and Lu et al.’s SafetyNet [23] looks for abnormal patterns in the ReLU activations of higher layers. In contrast, our method performs detection based on statistics of activation patterns in the complete representation learning part of the network as observed in feature response maps, whereas Li and Li [21] directly observe convolutional filter statistics there.

On the other hand, a second class of detection approaches trains sophisticated classifiers for directly sorting out malformed inputs: Meng and Chen’s MagNet [24] learns the manifold of friendly images, rejects far away ones as hostile and modifies close outliers to be attracted to the manifold before feeding them back to the network under attack. Grosse et al. [14] enhance the output of an attacked classifier by an additional class and retrain the model to directly classify adversarial examples as such. Metzen et al. [25] have a similar goal but target it via an additional subnetwork. In contrast, our method uses a simple threshold-based detector and pushes all decision power to the human-interpretable feature extraction via the feature response maps.

Finally, as shown in [1], different and mutually exclusive explanations for the existence of adversarial examples and the nature of neural network decision boundaries exist in the literature. Because our method enables a human investigator to trace attacks visually, it can be helpful in this debate in the future.

3 Background

We briefly present adversarial attacks and feature response estimation in general before assembling both parts into our detection approach in the next Section.

Fig. 1. Examples of different state-of-the-art adversarial attacks on a VGG19 model: original image and label (left), perturbation (middle), and mislabeled adversarial example (right). In the middle column, a difference of zero is encoded as white and the maximum difference as black for visual enhancement.

3.1 Adversarial Attacks

The main idea of adversarial attacks is to find a small perturbation for a given image that changes the decision of the Convolutional Neural Network. Pioneering work [41] demonstrated that negligible and visually insignificant perturbations could lead to considerable deviations in the networks’ output. The problem of finding a perturbation \(\varvec{\eta }\) for a normalized clean image \(\varvec{I} \in \mathbb {R}^m\), where m is the image width \(\times \) height, is stated as follows [41]:

$$\begin{aligned} \min _{\varvec{\eta }} \parallel \varvec{\eta }\parallel _2 \quad \text {s.t.} \quad \mathscr {C}(\varvec{I}+\varvec{\eta })\ne \ell ;\quad \varvec{I}+\varvec{\eta } \in [0,1]^m \end{aligned}$$
(1)

where \(\mathscr {C}(.)\) denotes the classifier and \(\ell \) is the ground truth label. Szegedy et al. [41] proposed to find the perturbation by solving the optimization problem in Eq. 1 for an arbitrary label \(\ell ^\prime \) that differs from the ground truth. In practice, box-constrained Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [10] is used to find perturbations satisfying Eq. 1 and to improve computational efficiency. However, optimization based on the L-BFGS algorithm remains computationally inefficient compared with gradient-based methods. Therefore, we use several gradient-based attacks, the one-pixel attack, and the boundary attack to compute adversarial examples (see Fig. 1).

Fast Gradient Sign Method (FGSM) [12] is a method for computing adversarial perturbations based on the gradient \(\nabla _{\varvec{I}}J(\varvec{\theta }, \varvec{I}, \ell )\) of the cost function with respect to the original image pixel values:

$$\begin{aligned} \varvec{\eta } = \epsilon \ \text {sign} (\nabla _{\varvec{I}}J(\varvec{\theta }, \varvec{I}, \ell )) \end{aligned}$$
(2)

where \(\varvec{\theta }\) represents the network parameters and \(\epsilon \) is a constant factor that constrains the max-norm \(l_\infty \) of the additive perturbation \(\varvec{\eta }\). The ground truth label is denoted by \(\ell \) in Eq. 2. The \(\text {sign}\) function in Eq. 2 computes the element-wise sign of the gradient of the loss function with respect to the input image. Computing the perturbation in Eq. 2 in a single step gives the method its name. FGSM is a white box attack, i.e., the algorithm for finding the adversarial example requires access to the weights and gradients of the network.
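
For illustration, the single-step update of Eq. 2 takes only a few lines of code. The following is a minimal sketch, assuming a PyTorch classifier model, a batched input tensor image with values in [0, 1], and its ground-truth label label (hypothetical names, not part of our implementation):

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, image, label, epsilon=0.01):
    """Single-step FGSM (Eq. 2): eta = epsilon * sign(grad_I J(theta, I, l))."""
    # image: float tensor of shape [1, 3, H, W] in [0, 1]; label: long tensor of shape [1]
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)    # cost function J(theta, I, l)
    loss.backward()
    eta = epsilon * image.grad.sign()              # element-wise sign of the input gradient
    adversarial = (image + eta).clamp(0.0, 1.0)    # keep the perturbed image in the valid range
    return adversarial.detach(), eta.detach()
```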

Gradient attack is a simple and straightforward realization of finding adversarial perturbations in the FoolBox toolbox [33]. It optimizes the pixel values of an original image to minimize the ground truth label confidence in a single step.

One pixel attack [40] is a semi-black box approach to computing adversarial examples using differential evolution [39]. The algorithm is not white box since it does not need the gradient information of the classifier; however, it is not fully black box either, as it needs the class probabilities. The iterative algorithm starts with randomly initialized parent perturbations. The generated offspring compete with their parents at each iteration, and the winners advance to the next step. The algorithm stops when the ground truth label probability drops below 5%.
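
The following simplified sketch illustrates this scheme; it is not the reference implementation of Su et al. [40]. It assumes a hypothetical function predict_probs(image) that returns class probabilities for an H×W×3 image with values in [0, 1], and it evolves a single candidate pixel encoded as (x, y, r, g, b) with SciPy's differential evolution:

```python
import numpy as np
from scipy.optimize import differential_evolution

def one_pixel_attack(image, true_label, predict_probs, max_iter=100):
    """Evolve one pixel (x, y, r, g, b) that minimizes the ground-truth class probability."""
    height, width, _ = image.shape
    bounds = [(0, height - 1), (0, width - 1), (0, 1), (0, 1), (0, 1)]  # position and RGB value

    def apply_candidate(candidate):
        x, y, r, g, b = candidate
        perturbed = image.copy()
        perturbed[int(x), int(y)] = [r, g, b]      # change exactly one pixel
        return perturbed

    def objective(candidate):
        return predict_probs(apply_candidate(candidate))[true_label]

    def early_stop(candidate, convergence):
        return objective(candidate) < 0.05         # stop once the true-class probability is below 5%

    result = differential_evolution(objective, bounds, maxiter=max_iter,
                                    popsize=10, recombination=1.0, callback=early_stop)
    return apply_candidate(result.x)
```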

DeepFool [28] is a white box iterative approach in which the closest direction to the decision boundary is computed in every step. This is equivalent to finding the path to the orthogonal projection of the data point onto the affine hyperplane that separates the binary classes. The initial method for binary classifiers can be extended to the multi-class task by treating it as multiple one-versus-all binary classifications. After finding the optimal update toward the decision boundary, the perturbation is added to the given image. The iterations continue, estimating the optimal perturbation and applying it to the perturbed image from the previous step, until the network decision changes.
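
A compact sketch of the linearized multi-class update described above could look as follows; it is a simplification of the reference implementation (for instance, it iterates over all classes instead of pre-selecting the most likely candidates) and assumes a PyTorch model that returns logits for a batched input tensor:

```python
import torch

def deepfool(model, image, true_label, num_classes, max_iter=50, overshoot=0.02):
    """Step toward the closest linearized decision boundary until the prediction flips."""
    pert_image = image.clone().detach()            # shape [1, 3, H, W], values in [0, 1]
    for _ in range(max_iter):
        x = pert_image.clone().detach().requires_grad_(True)
        logits = model(x)[0]
        if logits.argmax().item() != true_label:
            break                                  # decision changed: attack succeeded
        grad_true = torch.autograd.grad(logits[true_label], x, retain_graph=True)[0]
        best_ratio, best_w, best_f = None, None, None
        for k in range(num_classes):
            if k == true_label:
                continue
            grad_k = torch.autograd.grad(logits[k], x, retain_graph=True)[0]
            w_k = grad_k - grad_true               # normal of the linearized boundary (k vs. true)
            f_k = (logits[k] - logits[true_label]).item()
            ratio = abs(f_k) / (w_k.norm().item() + 1e-8)
            if best_ratio is None or ratio < best_ratio:
                best_ratio, best_w, best_f = ratio, w_k, f_k
        # orthogonal projection onto the closest linearized boundary, slightly overshot
        r = (abs(best_f) / (best_w.norm() ** 2 + 1e-8)) * best_w
        pert_image = (pert_image + (1 + overshoot) * r).clamp(0.0, 1.0)
    return pert_image
```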

Boundary attack is a reliable black-box attack proposed by Brendel et al. [3]. The iterative algorithm starts from an image that is already adversarial and iteratively reduces the distance between this image and the original image, searching for an adversarial example with minimum distance from the original image.
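
A heavily simplified sketch of this random walk along the decision boundary is given below; the reference implementation additionally projects the random step onto a hypersphere around the original image and adapts both step sizes dynamically. It assumes a hypothetical oracle is_adversarial(image) that queries the model under attack:

```python
import numpy as np

def boundary_attack(original, adversarial_start, is_adversarial,
                    n_steps=5000, random_step=0.1, toward_step=0.01):
    """Random walk near the decision boundary that moves toward the original image."""
    adv = adversarial_start.copy()
    for _ in range(n_steps):
        # 1) random step, scaled relative to the current distance to the original image
        direction = np.random.randn(*adv.shape)
        direction *= random_step * np.linalg.norm(adv - original) / (np.linalg.norm(direction) + 1e-12)
        candidate = np.clip(adv + direction, 0.0, 1.0)
        # 2) small step toward the original image to reduce the distance
        candidate = np.clip(candidate + toward_step * (original - candidate), 0.0, 1.0)
        # 3) keep the candidate only if it still fools the model under attack
        if is_adversarial(candidate):
            adv = candidate
    return adv
```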

3.2 Feature Response Estimation

The idea of visualizing CNNs through feature responses is to find out which region of the image leads to the final decision of the network. Computing feature responses enhances the interpretability of the classifier. In this paper, we use this visualization tool to track the effect of the adversarial attacks on a CNN’s decision as well as to detect perturbed examples automatically.

Fig. 2. Effect of adversarial attacks on feature responses: original image and feature response (left), perturbed versions (right).

Erhan et al. [8] used backpropagation for visualizing the feature responses of CNNs. This is implemented by evaluating an arbitrary image in the forward pass, retaining the values of the activated neurons at the final convolutional layer, and backpropagating these activations to the original image. The feature response has higher intensities in the regions that cause larger activation values in the network (see Fig. 2). The information from the max-pooling layers in the forward pass can further improve the quality of the visualizations: Zeiler et al. [45] proposed to compute “switches”, the positions of the maxima in all pooling regions, and then to construct the feature response using transposed convolution [7] layers.

Finally, Springenberg et al. [37] proposed a combination of both methods called guided backpropagation. In this approach, the “switch” information (the max-pooling spatial information) is kept, and the activations are propagated backwards with the guidance of the “switches”. This method yields the best performance in visualizing the network’s inner workings; therefore, we use guided backpropagation for computing feature response maps in this paper.
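
In modern frameworks, guided backpropagation can be realized by overriding the backward pass of the ReLU units so that only positive gradients are propagated where the forward activation was positive. The sketch below assumes a PyTorch VGG-style model and is a simplification (it backpropagates the top-class score rather than the retained final-layer activations described above):

```python
import torch
import torch.nn as nn

def guided_backprop_map(model, image):
    """Feature response map via guided backpropagation (input-gradient magnitude)."""
    handles = []

    def guide_relu(module, grad_input, grad_output):
        # ReLU's backward already zeroes gradients where the forward input was negative;
        # additionally clamp negative incoming gradients ("guidance").
        return (torch.clamp(grad_input[0], min=0.0),)

    for module in model.modules():
        if isinstance(module, nn.ReLU):
            module.inplace = False                 # in-place ReLU interferes with backward hooks
            handles.append(module.register_full_backward_hook(guide_relu))

    x = image.clone().detach().requires_grad_(True)  # shape [1, 3, H, W]
    logits = model(x)
    logits[0, logits.argmax()].backward()          # backpropagate the top-class score

    for handle in handles:                         # restore the unmodified network
        handle.remove()

    response = x.grad[0].abs().max(dim=0).values   # collapse the color channels
    return response / (response.max() + 1e-12)     # normalize to [0, 1]
```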

4 Human-Interpretable Detection of Adversarial Attacks

After reviewing the necessary background in the last Section, we now present our work on tracing adversarial examples in feature response maps, which inspired a novel approach to the automatic detection of adversarial perturbations in images. Using visual representations of the inner workings of a neural network in this manner additionally provides a human expert with guidance in developing deep convolutional networks with increased reliability and interpretability.

4.1 Tracing Adversarial Attacks in Feature Responses

The research question followed in this work is to gain insight into the reasons behind the misclassification of adversarial examples. Their effect on the feature response of a CNN is traced, for example, in Fig. 2. The general phenomenon observed in all experiments is the broader feature response of adversarial examples. In contrast, Fig. 2 demonstrates that the network looks at a smaller region of the image (it is more focused) in the case of unmanipulated samples.

The adversarial images are visually very similar to the original ones; however, they are not correctly recognized by deep CNNs. The original idea that triggered this study is that the focus of the CNN changes during an adversarial attack and leads to the incorrect decision. Conversely, the network makes the correct decision once it focuses on the right region of the image. Visualizing the feature response provides this and other interesting information regarding the decision making in neural networks: for instance, the image of the submarine in Fig. 2 can be considered a good candidate for an adversarial attack since the CNN is making its decision based on an object in the background (see the feature response of the original submarine in Fig. 2).

Fig. 3. Input, feature response, and local spatial entropy for clean and perturbed images, respectively.

4.2 Detecting Adversarial Attacks Using Spatial Entropy

Experiments for tracing the effect of adversarial attacks on feature responses thus suggested that a CNN classifier focuses on a broader region of the input if it has been maliciously perturbed. Figure 2 demonstrates this connection for decision making in the case of clean inputs compared with manipulated ones. The effect of adversarial manipulation is also visible in the local spatial entropy of the gray-scale feature responses (see Fig. 3). The feature responses are first converted to gray-scale images, and local spatial entropies are computed on the transformed feature responses as follows [4]:

$$\begin{aligned} S_k = - \sum _{i} \sum _{j} \varvec{h}_k(i, j) \log _2 (\varvec{h}_k(i, j)) \end{aligned}$$
(3)

where \(S_k\) is the local spatial entropy of a small part (patch) of the input image and \(\varvec{h}_k\) represents the normalized 2D histogram value of the \(k^{th}\) patch. The indices i and j scan through the height and width of the image patches. The patch size is \(3 \times 3\), the same as the filter size of the first layer of the used CNN (VGG19 [36]). The local spatial entropies of the corresponding feature responses are presented in Fig. 3, and their difference between clean and adversarial examples suggests that perturbed images can be detected based on this feature.
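
In NumPy terms, one possible reading of Eq. 3 (in which the patch intensities themselves are normalized into the 2D distribution \(\varvec{h}_k\); the cited reference [4] may define the histogram differently) can be sketched as follows, assuming a gray-scale feature response with values in [0, 1] and non-overlapping patches, which is an assumption on our side:

```python
import numpy as np

def local_spatial_entropy(patch):
    """Eq. 3: entropy of the patch intensities normalized into a 2D distribution h_k(i, j)."""
    h = patch.astype(np.float64)
    h = h / (h.sum() + 1e-12)                      # normalized 2D histogram h_k
    h = h[h > 0]                                   # treat 0 * log2(0) as 0
    return float(-np.sum(h * np.log2(h)))

def entropy_map(gray_response, patch_size=3):
    """Local spatial entropies S_k over all (here: non-overlapping) 3x3 patches."""
    height, width = gray_response.shape
    entropies = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = gray_response[top:top + patch_size, left:left + patch_size]
            entropies.append(local_spatial_entropy(patch))
    return np.array(entropies)
```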

Accordingly, we propose to use the average local spatial entropy of an image as the final single measure to decide whether an attack has occurred or not. The average local spatial entropy \(\bar{S}\) is defined as:

$$\begin{aligned} \bar{S} = \frac{1}{K} \sum _{k} S_k \end{aligned}$$
(4)

where K is the number of patches on the complete feature response and \(S_k\) is the local spatial entropy as defined in Eq. 3 and depicted in the last row of Fig. 3. Our detector makes the final decision by comparing the average local spatial entropy from Eq. 4 with a selected threshold, i.e., we use this feature to measure the spatial complexity of an input image (feature response).
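
Putting Eq. 4 and the threshold together, the detector reduces to a few lines. The sketch below reuses entropy_map from the previous sketch; the threshold value and the direction of the comparison (whether perturbed inputs score higher or lower) are not fixed here but would be calibrated from the score distributions in Fig. 4(a):

```python
def average_local_spatial_entropy(gray_response):
    """Eq. 4: mean of the local spatial entropies over all K patches."""
    return float(entropy_map(gray_response).mean())

def is_suspicious(gray_response, threshold, perturbed_scores_higher=True):
    """Flag the input as a suspected adversarial example by thresholding Eq. 4."""
    score = average_local_spatial_entropy(gray_response)
    return score > threshold if perturbed_scores_higher else score < threshold
```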

5 Experimental Results

To confirm the value of our final metric in Eq. 4, we first perform experiments to visually compare the approximated distributions of the average local spatial entropy of feature responses for clean and perturbed images. We use the validation set of ImageNet [34] with more than 50,000 images from 1,000 classes, and again the VGG19 CNN [36]. Perturbations for this experiment are computed only via the Fast Gradient Sign Method (FGSM) for computational reasons. Figure 4(a) shows that the clean images are separable from the perturbed examples, although there is some overlap between the distributions.

Fig. 4. (a) Distribution of average local spatial entropy in clean images (green) versus adversarial examples (red), as computed on the ImageNet validation set [34]. (b) Receiver operating characteristic (ROC) curve of the performance of our detection algorithm on different attacks. (Color figure online)

Computing adversarial perturbations using evolutionary and iterative algorithms is demanding in terms of time and computational resources. However, we would like to apply the proposed detector to a wide range of adversarial attacks. Therefore, we have drawn a number of images from the validation set of ImageNet for each attack and present the detection performance of our method in Fig. 4. The selection of images is done sequentially by class and file name up to a total number of images per method that could be processed in a reasonable amount of time (see Table 1). We base our experiments on the FoolBox benchmarking implementation [33], running on a Pascal-based TitanX GPU.

Table 1. Numerical evaluation of detection performance on the three different adversarial attacks. Column two gives the number of tested attacks and the approximate elapsed run time. An adversarial attack counts as successful if the perturbation changes the prediction. Columns four and five show the average confidence values of the true (ground truth) and wrong (target) class after a successful attack, respectively. The last columns show detection rates for different false positive rates.
Table 2. Performance of similar adversarial attack detection methods. The Area Under the Curve (AUC) is the average value over all attacks in the third and last rows.

Figure 4(b) presents the Receiver Operating Characteristic (ROC) curves of the proposed detector, and numerical evaluations are provided in Table 1. Our detection method performs better on gradient-based perturbations than on the single-pixel attack. Furthermore, Table 1 suggests that the best adversarial attack detection performance is achieved for FGSM and boundary attack perturbations, where the network confidences change the most. This observation suggests that the proposed detector is more sensitive to attacks that are stronger in fooling the network (i.e., that change the ground truth and target class confidences more drastically). Using feature responses, we detect more than \(91\%\) of the perturbed samples at a low false positive rate (\(1\%\)).
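
The detection rates at fixed false positive rates in Table 1 correspond to reading points off the ROC curve. The following sketch shows how such numbers can be computed with scikit-learn from two hypothetical arrays of average local spatial entropies, assuming that perturbed images receive the higher scores (negate the scores if the opposite holds):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

def detection_rate_at_fpr(clean_scores, adversarial_scores, target_fpr=0.01):
    """TPR, threshold, and AUC of the entropy detector at a fixed false positive rate."""
    labels = np.concatenate([np.zeros(len(clean_scores)), np.ones(len(adversarial_scores))])
    scores = np.concatenate([clean_scores, adversarial_scores])
    fpr, tpr, thresholds = roc_curve(labels, scores)
    idx = max(np.searchsorted(fpr, target_fpr, side="right") - 1, 0)  # last point with fpr <= target
    return tpr[idx], thresholds[idx], auc(fpr, tpr)
```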

In general, it is difficult to directly compare different studies on attack detectors, since they use a vast variety of neural network models, datasets, attacks, and experimental setups. We present a short overview of the performance of current detection approaches in Table 2. Our approach is most similar to the methods of Liang et al. [22] and Xu et al. [43]. The detector proposed in this paper outperforms both based on the results presented in their work; however, we cannot guarantee identical implementations and parameterizations of the used attacks (e.g., the subset of used images, or the learning rates for the optimization of perturbations). Similarly, adaptive noise reduction in the original publication [22] was applied to only four classes of the ImageNet dataset and defended a model based on CaffeNet, which differs from our experimental setup.

6 Discussion and Conclusion

The presented results demonstrate the reality of adversarial attacks: improving the robustness of CNNs is necessary. However, we conducted further preliminary experiments on binary (cat versus dog [32]) and ternary (among three classes of cars [16]) classification tasks as proxies for the kind of few-class classification settings frequently arising in practice. They suggest that it is more challenging to find adversarial examples in such a setting, without plenty of “other classes” to pick from for misclassification. Figure 5 illustrates these results.

Fig. 5. Successful adversarial examples created by DeepFool [28] for binary and ternary classification tasks are only possible with notable visible perturbations.

In this paper, we have presented an approach to detecting adversarial attacks based on human-interpretable feature response maps. We traced the effect of adversarial perturbations on the visual focus of the network in original images, which inspired a simple yet robust approach for automatic detection. The proposed method is based on thresholding the average local spatial entropy of the feature response maps and detects at least \(91\%\) of state-of-the-art adversarial attacks with a low false positive rate on the validation set of ImageNet. However, the results are not directly comparable with methods in the literature because of the diversity of experimental setups and implementations of attacks.

Our results verify that feature responses are informative for detecting specific cases of failure in deep CNNs. The proposed detector also serves to increase the interpretability of neural network decisions, which is an increasingly important topic on the way towards robust and reliable AI. Future work will therefore concentrate on developing reliable and interpretable image classification methods for practical use cases, based on our preliminary results for binary and ternary classification.