An end-to-end convolutional network for joint detecting and denoising adversarial perturbations in vehicle classification

Deep convolutional neural networks (DCNNs) have been widely deployed in real-world scenarios. However, DCNNs are easily fooled by adversarial examples, which presents challenges for critical applications such as vehicle classification. To address this problem, we propose a novel end-to-end convolutional network for joint detection and removal of adversarial perturbations by denoising (DDAP). The DDAP denoiser removes adversarial perturbations from examples flagged by the DDAP detector. The proposed method can be regarded as a pre-processing step: it does not require modifying the structure of the vehicle classification model and hardly affects classification results on clean images. We consider four kinds of adversarial attack (FGSM, BIM, DeepFool, and PGD) to verify DDAP's capabilities when trained on the BIT-Vehicle and other public datasets, and show that it provides better defense than other state-of-the-art defensive methods.


Introduction
In recent years, deep convolutional neural networks (DCNNs) have been widely used in many different tasks, such as image recognition [1][2][3], self-driving vehicles [4], semantic segmentation [5], and vehicle re-identification [6]. As an essential requirement of an intelligent transport system, remarkable performance has been achieved in vehicle classification [7,8].
However, recent studies [9][10][11] have shown that DCNNs are vulnerable to adversarial examples, specially crafted by adding minute perturbations to natural images. Such perturbations can cause a classifier to assign an incorrect label with high confidence. Figure 1 shows an adversarial example that causes misclassification of an SUV as a bus. Clearly, it is important to make deep convolutional neural networks robust in the face of adversarial attacks.
Previous defenses against adversarial attacks are mainly of two kinds. The first kind trains a detector network [12][13][14], which acts as a filter rejecting malicious input to the target model. The other kind uses a defensive model to reduce the effects of adversarial perturbations and improve the adversarial robustness of the target model [15][16][17]. However, Xie et al. [18] have shown that denoising may affect the performance of the target model on clean images, as information valid for classification may be lost in the denoising process.
In this paper, we propose a new method based on joint detection and removal of adversarial perturbations by denoising (DDAP). Unlike previous work, our defensive method combines an adversarial perturbation detector and a denoiser, using joint learning for end-to-end training. Adversarial examples detected by the detector are passed to the denoiser to remove perturbations. The detector and denoiser share the same parameters in the feature extraction stage to reduce the amount of calculation.
The main contributions of this paper are:
• an end-to-end defensive method combining detection and removal of adversarial perturbations by denoising, for vehicle classification, which can be applied as a pre-processing step to improve the robustness of vehicle classification models;
• a new loss function for joint supervised training of the adversarial perturbation detector and denoiser, which is beneficial to optimization of the model;
• an evaluation on two vehicle datasets which shows that our method provides state-of-the-art performance against both white-box and black-box attacks, with negligible reduction in classifier performance on clean images.
The rest of this paper is organized as follows. Section 2 reviews popular adversarial attacks and defense methods. Our proposed approach is described in detail in Section 3. Section 4 provides experimental results in defending against adversarial attacks. Finally, Section 5 concludes our work.

Related work
In this section, the literature on adversarial attack methods is reviewed, and then mechanisms for detection and defense against adversarial attacks are introduced.

Adversarial attack methods
The concept of adversarial examples was proposed by Szegedy et al. [19]; subsequent work [9][10][11] showed that DNNs are vulnerable to adversarial examples. Given a classification model f, deliberately adding subtle perturbations p to a correctly classified image x causes f to give a wrong output, yet with high confidence, such that f(x + p) ≠ f(x). We now describe some well-known adversarial attack algorithms. It should be emphasized that we are concerned with untargeted adversarial attacks.
Goodfellow et al. [20] introduced the attack known as the single-step fast gradient sign method (FGSM). It perturbs the image by a fixed amount in the direction of the gradient sign, so as to flip the output of the classifier. An adversarial example can be expressed as
x′ = x + ε · sign(∇_x J(x, y)),
where J(x, y) is the loss function measuring the classification error, usually cross-entropy. Maximizing J causes the example to no longer belong to the correct class y after the noise is added. During optimization, the difference between the original example and the adversarial example must stay within a certain range, ‖x′ − x‖_∞ ≤ ε; sign(·) is the sign function, applied to the partial derivative of the loss function with respect to x.
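As a minimal sketch (assuming images in the range [0, 1] and a cross-entropy loss; this is an illustration, not our exact implementation), FGSM can be written in PyTorch as follows:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.15):
    """Single-step FGSM: shift every pixel by eps in the sign of the loss gradient."""
    model.eval()
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)      # J(x, y)
    loss.backward()
    # x' = x + eps * sign(dJ/dx); keep pixels in the assumed valid range [0, 1]
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```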
Extending FGSM, BIM [9] generates adversarial examples by applying FGSM multiple times with a smaller step α. In each iteration, clip() is used to ensure that the generated perturbation stays within the ε-neighborhood of the original image x:
x′_{k+1} = clip_{x,ε}(x′_k + α · sign(∇_x J(x′_k, y))),
where clip_{x,ε}(·) constrains each pixel of its argument to lie within [x − ε, x + ε].
DeepFool [21] perturbs the image by a small vector at each step, gradually pushing it towards the classification boundary until incorrect classification occurs. Let f denote the classifier model, f_i(x) the ith dimension of the output, and f_î(x) the dimension with the largest output. Writing f′_i = f_i − f_î and w′_i = ∇f_i − ∇f_î, the closest decision boundary l = argmin_{i≠î} |f′_i(x_k)| / ‖w′_i(x_k)‖_2 is found, and the adversarial example for iteration k + 1 is
x_{k+1} = x_k + (|f′_l(x_k)| / ‖w′_l(x_k)‖²_2) · w′_l(x_k).
DeepFool is shown to generate smaller perturbations than FGSM while achieving similar deception rates.
Projected gradient descent (PGD) [22] can be regarded as an iterative attack similar to FGSM: FGSM uses only one step, while PGD performs multiple iterations, taking a small step each time, and the change in each iteration is clipped to a specified range. The difference between PGD and BIM is that the former starts from a randomly perturbed point. PGD calculates adversarial samples for iteration k + 1 as
x′_{k+1} = Proj_{x,ε}(x′_k + α · sign(∇_x J(x′_k, y))),
where Proj_{x,ε}(·) projects its argument back into the ε-neighborhood of x.

Detection and defense methods
The authors of Ref. [24] verified that combining kernel density estimates in the subspace of the last hidden layer with Bayesian neural network uncertainty estimates can effectively discover adversarial perturbations. Adaptive noise reduction with scalar quantization and a smoothing spatial filter is used to detect adversarial noise in Ref. [25]. A transferability prediction difference method [13] detects adversarial examples by measuring the transferability difference across various DNN models. Papernot et al. [26] proposed defensive distillation to resist adversarial attacks: the outputs produced by the original neural network are used as new labels to train a distillation network with the same architecture and distillation temperature T, which is then used for classification. However, Carlini and Wagner [27] showed that defensive distillation is unable to increase the robustness of neural networks. Samangouei et al. [28] used a generative adversarial network to suppress adversarial attacks on the MNIST [29] digits dataset, but the results are hard to transfer to other datasets. Liao et al. [15] proposed eliminating adversarial perturbations guided by high-level features, with the output difference before softmax as the loss function. Prakash et al. [17] developed a technique that combines computationally efficient image transformation, redistribution of pixel values, and soft wavelet denoising to overcome perturbations. Mustafa et al. [16] proposed a novel defensive mechanism in which image super-resolution enhances image quality and projects adversarial examples back onto the manifold of natural images.

Method
The framework of our proposed DDAP method (see Fig. 2) consists of two models. The first is an adversarial perturbation detector and the second is an adversarial perturbation denoiser. Both share the same parameters for feature extraction. We now explain the proposed defensive method in detail.
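As a minimal sketch of this data flow (the module interfaces shown here — extractor, detector, denoiser, classifier — are placeholders rather than the exact signatures of our implementation):

```python
import torch

@torch.no_grad()
def ddap_inference(x, extractor, detector, denoiser, classifier):
    """Route one input image through DDAP: denoise only if the detector flags it."""
    feats = extractor(x)                       # shared feature extraction E(x)
    is_adv = detector(feats).argmax(dim=1)     # 0 = clean image, 1 = adversarial example
    if is_adv.item() == 1:                     # assumes a single image per call
        x = denoiser(feats, x)                 # reconstruct a clean image from E(x) and x
    return classifier(x)                       # unchanged target vehicle classifier
```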

Adversarial perturbation detector
Adversarial attackers generate small perturbations that are often imperceptible to humans, yet fool the classifier. We emphasize that adversarial images with added perturbations change the pixel distribution and fall outside the data manifold for real examples [14,30]. Therefore, we can train a detection network to determine whether an example is adversarial through the feature representation of the input data [14,23].
Fig. 2 Overview of our end-to-end framework. We jointly detect and remove adversarial perturbations by denoising. When the input x is recognized as an adversarial example by the detector, Dec(x) = 1, adversarial perturbations are removed by the denoiser model before x is classified. Any clean image x is passed directly to the classification model.
As illustrated in Fig. 3, given an input image x ∈ R^{3×w×h}, we define E to be a mapping function from x to the features produced by feature extraction, and Dec to be a mapping function from those features to the prediction category of the detector. More specifically, the input data is forwarded through multiple blocks to obtain E(x). To speed up calculation and reduce the number of parameters, normal convolution in blocks 2 to 5 is replaced by depthwise convolution and pointwise convolution, except for the first convolution layer of each block [31]. Furthermore, the first convolutional layer in blocks 2 to 5 adopts a 2 × 2 stride for feature downsampling, making E(x) 4 times smaller than x. E(x) is then fed into three conv units and a fully-connected layer to learn the discriminative difference between the features of clean images and adversarial examples. Note that the last conv unit also uses a 2 × 2 stride; the fully-connected layer followed by softmax produces a two-dimensional vector. In the inference phase, we select the index with the maximum value as the detector output: zero represents a clean image and one an adversarial example. If Dec(E(x)) indicates a clean image, the original image x is delivered directly to the target classification model. Otherwise, the feature representation E(x) is passed to the denoiser to eliminate the adversarial perturbations.
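A rough sketch of the light conv unit and the detector head follows; the channel width, the number of units, and the pooling before the fully-connected layer are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class LightConvUnit(nn.Module):
    """Depthwise convolution followed by pointwise convolution (used in blocks 2-5)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class DetectorHead(nn.Module):
    """Three conv units (the last with stride 2) and a two-way fully-connected layer."""
    def __init__(self, feat_ch=128):
        super().__init__()
        self.convs = nn.Sequential(
            LightConvUnit(feat_ch, feat_ch),
            LightConvUnit(feat_ch, feat_ch),
            LightConvUnit(feat_ch, feat_ch, stride=2),
            nn.AdaptiveAvgPool2d(1),            # assumed pooling before the classifier
        )
        self.fc = nn.Linear(feat_ch, 2)         # index 0 = clean, index 1 = adversarial

    def forward(self, feats):
        return self.fc(self.convs(feats).flatten(1))   # logits; softmax applied later
```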

Adversarial perturbation denoiser
If the detector classifies the input x as an adversarial sample, the denoiser employs the features E(x) to reconstruct the sample and remove the perturbations. The reconstruction function Den is constrained so that the reconstructed sample is as similar as possible to the original data: Den(E(x)) ≈ x. In fact, the combination of feature extraction and the denoiser can be considered a variant of an autoencoder, where the feature extractor encodes the features and the denoiser reconstructs a clean sample. Previous literature [30,32] shows that adversarial examples usually lie outside the data manifold, and an autoencoder can place them back on the manifold by learning the manifold structure. Thus, the feature extractor and denoiser can defend against adversarial attacks by removing perturbations. Figure 4 details the denoising process. The feature extraction parameters are reused; the denoiser comprises four blocks and a 1 × 1 convolutional layer. To reduce the loss of spatial information caused by downsampling, skip-connections [5] are introduced, so that the feature maps recovered by upsampling carry additional low-level feature information provided by a fusion unit, which performs upsampling and a concatenation operation. The output of each denoiser block is upsampled using bilinear interpolation and then concatenated with the output of the corresponding block from feature extraction. Unlike the blocks in feature extraction, the conv units of all blocks in the denoiser use a stride of 1 × 1. In addition, following Refs. [15,33], we use residual learning instead of directly reconstructing a whole image, which benefits deep neural network training: the residual generated by the last 1 × 1 convolutional layer is added to the input x, converting it into a clean image. Although the structure of our denoiser is similar to that in Ref. [15], there are some obvious differences: our denoiser shares feature extraction with the detector, and lightweight convolutions are used. We next consider the denoiser loss function.
Fig. 4 Architecture of the adversarial perturbation denoiser. If the input x is an adversarial image, the denoiser aims to transform it into clean data. The conv unit and light conv unit are the same as in Fig. 3. The fusion unit performs bilinear interpolation and feature concatenation.
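The fusion unit and the residual output can be sketched as follows (a simplified illustration assuming three-channel input images; the block structure and channel counts are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUnit(nn.Module):
    """Bilinearly upsample decoder features and concatenate the matching encoder features."""
    def forward(self, dec_feats, enc_feats):
        up = F.interpolate(dec_feats, size=enc_feats.shape[-2:],
                           mode='bilinear', align_corners=False)
        return torch.cat([up, enc_feats], dim=1)        # skip-connection

class ResidualOutput(nn.Module):
    """Final 1x1 convolution producing a residual that is added back to the input image."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 3, kernel_size=1)

    def forward(self, dec_feats, x):
        # dec_feats must already be at the input resolution at this point
        return x + self.conv(dec_feats)                 # residual learning: x + residual
```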

Network loss function
To detect and eliminate adversarial perturbations, our end-to-end defensive method includes a detector and a denoiser. When training the detector, each image belongs to exactly one of two categories: clean image or adversarial image. We therefore use a cross-entropy cost function to measure the difference between prediction and expectation:
L_det = −(1/N) Σ_{i=1}^{N} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],
where y_i is the ground-truth label of image x_i (1 for an adversarial example, 0 for a clean image) and p_i represents the predicted probability that x_i is an adversarial example.
The goal of training the denoiser is to make the difference between the recovered image and the clean image as small as possible. However, any remaining perturbations may still influence the response of the target classification model. To overcome this problem, unlike Ref. [15], we combine pixel-level loss and high-level feature loss in the cost function used to supervise training of the denoiser:
L_den = (1/N) Σ_{i=1}^{N} [ ‖Den(E(x̂_i)) − x_i‖_1 + ‖f_l(Den(E(x̂_i))) − f_l(x_i)‖_1 ],
where x̂_i denotes the adversarial sample of x_i, Den(E(x̂_i)) is the recovered output of the denoiser, and f_l denotes the response of the lth convolutional layer from the bottom of the target classification model, with l set to 1. The L1 norm is used to calculate both the pixel-level loss and the high-level feature loss.
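In code, this objective amounts to an L1 pixel loss plus an L1 loss on the response of the first convolutional layer of the frozen target classifier; a sketch (the mean reduction is an assumption):

```python
import torch

def denoiser_loss(x_denoised, x_clean, feat_fn):
    """Pixel-level L1 loss plus feature-level L1 loss.

    feat_fn(img) should return the response f_l of the l-th convolutional layer
    of the frozen target classification model (l = 1 here).
    """
    pixel_loss = torch.abs(x_denoised - x_clean).mean()
    feature_loss = torch.abs(feat_fn(x_denoised) - feat_fn(x_clean)).mean()
    return pixel_loss + feature_loss
```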
Based on the above analysis, the cost function of DDAP can be formulated as
L_DDAP = α L_det + β L_den,
where α and β are hyperparameters. For training stability, alternate training is adopted. First, the parameters for feature extraction and the denoiser are trained using clean images and adversarial examples until the network converges. Next, the feature extraction parameters are frozen, and the detector is trained until convergence. Finally, we fine-tune the detector and denoiser with a small learning rate.
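The alternating schedule can be sketched as below; the optimizer settings, the single pass over each loader, and the loader interfaces are placeholders, and the final fine-tuning stage is only indicated.

```python
import torch
import torch.nn.functional as F

def train_ddap(extractor, denoiser, detector, loaders, feat_fn, device='cuda'):
    # Stage 1: train feature extraction + denoiser on (adversarial, clean) image pairs.
    opt1 = torch.optim.SGD(list(extractor.parameters()) +
                           list(denoiser.parameters()), lr=1e-2, momentum=0.9)
    for x_adv, x_clean in loaders['denoise']:
        x_adv, x_clean = x_adv.to(device), x_clean.to(device)
        x_rec = denoiser(extractor(x_adv), x_adv)
        loss = (F.l1_loss(x_rec, x_clean) +                    # pixel-level loss
                F.l1_loss(feat_fn(x_rec), feat_fn(x_clean)))   # high-level feature loss
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze feature extraction, then train the detector alone.
    for p in extractor.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.SGD(detector.parameters(), lr=1e-2, momentum=0.9)
    for x, is_adv in loaders['detect']:
        logits = detector(extractor(x.to(device)))
        loss = F.cross_entropy(logits, is_adv.to(device))
        opt2.zero_grad(); loss.backward(); opt2.step()

    # Stage 3 (not shown): unfreeze everything and fine-tune the detector and
    # denoiser jointly with a small learning rate until convergence.
```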

Evaluation
In this section, we validate the capabilities of the proposed DDAP method against various adversarial attacks, including FGSM [20], BIM [9], DeepFool [21], and PGD [22]. DDAP is then compared with other advanced defense methods, LGD [15], SR [16], PD [17], and TPD [13], on the BIT-Vehicle dataset [34] and an online Public dataset (https://github.com/CNHNLP/public-dataset). In addition, we consider the performance on clean images and visualize feature maps.

Setup
We use a pre-trained Inception-v3 [3] network as the target classifier. To assess the proposed model, adversarial images are required for training and testing, so we used several attack methods to construct adversarial samples. The adversarial training set was generated using FGSM, BIM, and DeepFool, while the adversarial test set was generated using FGSM, BIM, DeepFool, and PGD; adding PGD at test time verifies the robustness of our method against unseen adversarial attacks. The perturbation level of these attack methods was set to 0.15. Table 1 gives the parameters used for training our method. Different experimental settings are used in different training phases to keep optimization stable: while training the denoiser, hyperparameter α is set to 0; conversely, when training the detector, hyperparameter β is set to 0. SGD optimization was applied, and an early-stopping strategy was used during training. The proposed DDAP method was implemented with the PyTorch deep learning framework and deployed on an NVIDIA TESLA P100 GPU.
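As an illustration of how the adversarial sets can be assembled (the attack functions and loader interface are placeholders; only the 0.15 perturbation budget follows the setup above):

```python
import torch

def build_adversarial_set(model, loader, attacks, device='cuda'):
    """Run every attack over the loader and collect (adversarial, clean, label) triples."""
    model.eval()
    samples = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        for attack in attacks:              # e.g. [fgsm_attack, bim_attack, deepfool_attack]
            x_adv = attack(model, x, y)     # each attack uses a perturbation budget of 0.15
            samples.append((x_adv.cpu(), x.cpu(), y.cpu()))
    return samples
```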

BIT-Vehicle dataset
The BIT-Vehicle dataset targets vehicle classification. The adversarial test set provides 7880 samples based on the test set of the target classifier. We used the adversarial training set to supervise training of the denoiser. In addition, we also selected 1000 images from both the training set and the adversarial training set to supervise the detector. The optimized model is compared with other advanced defense models on the adversarial test set. Table 2 reports the defensive performance of DDAP on the BIT-Vehicle dataset. Without a defensive model, the classification accuracy of the target classifier is significantly lowered, dropping to zero under BIM and PGD attacks. DDAP provides better accuracy under all kinds of attack than the other defensive methods. Although PGD is not used for training, we still achieve 95.4% accuracy, reflecting the robustness of our defense method.
We also investigated the capabilities of these defense methods under black-box attack, in which the attacker has no knowledge of the target classifier. Therefore, another classifier, ResNet-18 [2], was trained on the BIT-Vehicle training set, and the adversarial test set was constructed using this ResNet-18 classifier and the same attack methods. As illustrated in Table 3, even though the classification accuracy of DDAP decreases slightly compared with the white-box setting, it still outperforms the other methods. LGD is clearly sensitive to black-box attacks, its classification accuracy dropping by nearly 30%, which may be caused by an insufficient ability to learn the manifold structure.

Public dataset
The Public dataset contains 10 vehicle categories: Bus, Family Sedan, Fire Engine, Heavy Truck, Jeep, Minibus, Racing Car, SUV, Taxi, and Truck. We extracted 1400 images as the training set and 200 images as the test set. An adversarial training set and adversarial test set were constructed as before. Table 4 shows the defensive performance of DDAP on the Public dataset. SR and PD may fail to defend against the FGSM attack, as the classification accuracy of the target classifier is only 25% and 16%, respectively. In contrast, both LGD and DDAP achieve an acceptable defensive effect, and DDAP attains higher accuracy than LGD. DDAP enables the target classifier to reach 96.0% accuracy under PGD attack, confirming the robustness of the proposed method against different attackers. It should be pointed out that, compared with the other attackers, DeepFool's effect on the defensive models is limited, and the accuracy of the target classifier only slightly decreases. Table 5 compares the average forward time of these defensive methods; DDAP requires less inference time due to its lightweight convolutions and the parameter sharing of feature extraction. Figure 5 shows adversarial examples with imperceptible perturbations; after removing the perturbations, there is almost no difference between defended images and clean images, helping the target classifier predict correctly.
As the first step of the network, it is critical that the detector correctly detects adversarial examples. We compare our method with TPD in terms of recognition accuracy on the Public adversarial test set. For TPD, trained Inception-v3 and ResNet-18 models with softmax normalization are taken as the defensive model and target model, respectively. Table 6 summarizes the recognition performance of TPD and DDAP. To balance TPD's classification accuracy for clean and adversarial examples, thresholds of 0.05 and 0.1 were considered. The recognition accuracy achieved by TPD is 73.9% and 71.2% for these two thresholds, far short of the 99.1% achieved by DDAP.

Performance on clean images
As mentioned in the introduction, using defensive models may reduce the accuracy of the target classifier on clean images. Table 7 provides target classifier results for various defensive methods on the BIT-Vehicle and Public test sets. Using LGD greatly reduces the classification accuracy, from 79.5% to 65.4%. DDAP hardly affects the performance of the classifier on clean images, reducing it only from 97.5% to 97.1%; this is also better than the results obtained by SR and PD.

Ablation study for DDAP
In order to verify the effectiveness of the design, we compared the performance of DDAP with or without the detector on two vehicle datasets. The test set includes clean images and the adversarial test set. Table 8 shows that DDAP with the detector achieves better classification accuracy than without, indicating the benefit of adding the detector.

DUNET
If we remove the detector from DDAP, it essentially degenerates to DUNET [15]. Ideally, if DUNET is trained on a combination of clean and adversarial examples, it might learn both to distinguish clean images from adversarial examples and to denoise adversarial perturbations at the same time: when a clean example is input, it should output the original clean example; when an adversarial example is input, it should output a repaired example. We therefore combined the adversarial training set and the training set of BIT-Vehicle to form a joint training set, and combined the test set with adversarial samples generated by the attack methods to form a joint test set. The joint training set was used to train both DDAP and DUNET. Table 9 gives the classification accuracy on the joint test set. DDAP achieves better results than DUNET under all kinds of attack, which may be because it is difficult for DUNET to learn to recognize and reduce perturbations at the same time.

Further validation on the CompCars dataset
To further validate the effectiveness of the design, the various defense methods were also tested on the web-nature scenario of the CompCars dataset [35]. Table 10 gives the results. As before, these methods effectively defend against DeepFool, and DDAP achieves the best classification accuracy for all attacks.

CAMs visualization
Class activation maps (CAMs) [36] help to interpret convolutional neural networks by visualizing the discriminative features they learn; redder regions are those to which the target classifier is more sensitive. Figure 6 shows the class activation maps for the top-1 prediction of the Inception-v3 classifier on BIT-Vehicle data. On clean images, the target classifier focuses on the vehicle region. The perturbations generated by FGSM divert attention away from these discriminative features, resulting in incorrect classification. By removing the perturbations, DDAP refocuses attention on the discriminative region.
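A sketch of how a class activation map can be computed for a torchvision Inception-v3 classifier is given below; the hooked layer name Mixed_7c and the use of the final fully-connected weights follow the standard CAM recipe and are assumptions for illustration, not a description of our exact visualization code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import inception_v3

def class_activation_map(model, x, class_idx):
    """Weight the last conv feature maps by the fully-connected weights of one class."""
    model.eval()
    feats = {}
    hook = model.Mixed_7c.register_forward_hook(
        lambda mod, inp, out: feats.update(maps=out))
    with torch.no_grad():
        model(x)                               # x: normalised 1x3x299x299 input image
    hook.remove()
    fmap = feats['maps'][0]                    # (C, H, W) feature maps of the last block
    weights = model.fc.weight[class_idx]       # (C,) classifier weights for class_idx
    cam = F.relu(torch.einsum('c,chw->hw', weights, fmap))
    return cam / (cam.max() + 1e-8)            # normalised to [0, 1] for display

# Example: model = inception_v3(weights='DEFAULT').eval()  # or a fine-tuned classifier
```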

Conclusions
In this paper, we have presented an end-to-end approach, DDAP, to defend against adversarial attacks in vehicle classification. It detects adversarial examples and eliminates adversarial perturbations without changing the structure of the vehicle classifier.
As our experiments demonstrate, DDAP resists a variety of powerful adversarial attacks and is robust under both white-box and black-box attacks. The DDAP model only slightly decreases the performance of the vehicle classifier on clean images, and it outperforms other state-of-the-art defensive methods on the available vehicle datasets.