1 Introduction

Deep neural networks (DNNs) have achieved high performance in various applications, such as image classification [1], object detection [2], and natural language processing [3]. Despite this success, DNNs are vulnerable to adversarial attacks, which craft malicious inputs that cause the model to misclassify. These malicious inputs, created by adding imperceptible perturbations to the original sample, are called adversarial examples, and various attack methods have been proposed to generate them [4,5,6]. The purpose of these attacks is to maximize the classification error while keeping the perturbations small enough that a human cannot distinguish between a normal sample and an adversarial example. Adversarial attacks can also deceive models trained on different network architectures or different subsets of training data [4]. Moreover, they are effective against DNNs deployed in real-world settings, such as self-driving cars [7, 8], facial recognition [9, 10], and object detection [11].

To combat adversarial attacks, two main types of defense methods have been proposed. The first is denoising [12, 13], which mitigates the perturbations on adversarial examples before they are input to the target model; after the perturbations are removed, the denoised inputs are returned to the DNN model so that it can classify them correctly. The second is adversarial training [14, 15], which generates adversarial examples and then trains the model to classify them as the correct class. However, denoising methods and adversarial training share a common limitation: after training, the classification accuracy of the DNN for clean images decreases. Denoising methods are trained separately from the DNN model to remove adversarial perturbations, so the DNN model is never trained on the denoised clean images and its accuracy on them drops. Adversarial training likewise decreases the classification accuracy for clean images because it trains the model using only adversarial examples [14] or by adjusting the balance between the natural and robust errors [15].

To overcome this limitation, we propose a hybrid adversarial training method that trains the denoising network and the DNN model simultaneously. The proposed method repeats three steps in every epoch: 1) it generates adversarial examples to deceive the DNN model; 2) it trains the denoising network to remove the adversarial perturbations, producing denoised clean images and denoised adversarial examples from the clean images and adversarial examples, respectively; and 3) it trains the DNN model to correctly classify the non-denoised clean images, non-denoised adversarial examples, denoised clean images, and denoised adversarial examples. We demonstrate that the proposed training method yields stronger robustness against adversarial attacks and higher classification accuracy for clean images than either a denoising network trained separately from the DNN model or existing adversarial training methods. The main contributions of this study are summarized as follows:

  • We propose a hybrid adversarial training method that results in higher classification accuracy for clean images and enables DNN models to be more robust against various adversarial attacks.

  • We show that the clean image classification performance of a model trained using our proposed method is higher than that resulting from state-of-the-art methods.

  • We show that our proposed method outperforms several adversarial training methods as well as denoising methods whose denoisers are trained separately from the DNN model.

The remainder of this paper is organized as follows. Section 2 reviews the related work on adversarial attacks, adversarial training, and denoising-based adversarial defense methods. Section 3 describes our proposed training method. Section 4 compares the performance of the proposed method with that of several state-of-the-art methods. Section 5 discusses the experimental results and limitations of the proposed method. Section 6 presents concluding remarks.

2 Background and related work

In this section, we introduce various adversarial attack methods, denoising methods to mitigate adversarial perturbation on adversarial examples, and relevant background on adversarial training. We also summarize the advantages and disadvantages of denoising methods and adversarial training methods.

2.1 Adversarial attacks

Adversarial attacks can be categorized into two types of threat models: white-box and black-box. A white-box adversary has full access to the DNN parameters, network architecture, and weights, whereas a black-box adversary has no knowledge of the target DNN or cannot access it.

Szegedy et al. [4] were the first to show that for white-box attacks, the existence of small perturbations in images might lead to DNN misclassification. They generated adversarial examples using the box-constrained limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm. Additionally, they were the first to demonstrate the property of transferability, where adversarial examples that deceive one model could deceive other models. Further, Goodfellow et al. [16] proposed the fast gradient sign method (FGSM) to generate adversarial examples using only a one-step gradient update along the direction of the gradient at each pixel. Kurakin et al. [17] and Madry et al. [14] proposed multi-step attack methods, called the iterative fast gradient sign method (I-FGSM) and projected gradient descent (PGD), both of which are capable of generating adversarial examples and achieving higher attack success rates than FGSM. Moosavi-Dezfooli et al. [6] proposed DeepFool, which is a simple yet accurate method for computing and comparing the robustness of different models to adversarial perturbations. Carlini et al. [18] proposed an optimization-based attack method that determines the smallest perturbation that can deceive the target model.

Alternatively, Papernot et al. [19] introduced the first black-box attack method using a substitute model. Park et al. [20] proposed a substitute model attack method that emulates a partial target model by adjusting the partial classification boundary to decrease the number of queries. Additionally, Dong et al. [21] introduced a class of attack methods, called the momentum-based iterative method (MIM), in which the gradients of the loss function are accumulated at each iteration to stabilize the optimization and avoid a poor local maximum. This method results in more transferable adversarial examples.

2.2 Denoising-based adversarial defenses

To combat such adversarial attacks, denoising-based defense methods perform image denoising to remove the adversarial perturbations. Liao et al. [22] proposed a high-level representation guided denoiser (HGD), which uses a loss function defined as the difference between the target model’s outputs activated by the natural image and the denoised image. Samangouei et al. [23] proposed Defense-GAN, which projects adversarial examples onto the range of a trained generative adversarial network (GAN) to approximate the input sample with a generated clean sample. Song et al. [24] proposed PixelDefend, which reconstructs adversarial examples so that they follow the training distribution using PixelCNN [25]. Prakash et al. [12] proposed pixel deflection, a defense method that randomly replaces some pixels with nearby pixels and then applies wavelet-based denoising. Naseer et al. [13] proposed NRP, a method that recovers a legitimate sample from a given adversarial example using an adversarially trained purifier network. Kang et al. [26] proposed CAP-GAN, which integrates pixel-level and feature-level consistency to achieve reasonable purification under cycle-consistent learning. These denoising methods can mitigate adversarial perturbations on adversarial examples, but they can also remove important information from clean images. Consequently, the accuracy of the DNN model on clean images decreases because the model may not correctly classify the denoised clean images.

2.3 Adversarial training

Adversarial training is a defense method for effectively enhancing the robustness of models against adversarial attacks. Adversarial training is based on the principles of model loss maximization and minimization. In the maximization step, adversarial examples are generated by an adversarial attack using each mini-batch included in the training set. In the minimization step, the model parameters are updated using the generated adversarial examples for loss minimization. Recently, many studies [14,15,16, 27, 28] have focused on analyzing and improving adversarial machine learning. For example, Goodfellow et al. [16] were the first to suggest feeding generated adversarial examples into the model in the training process, while Madry et al. [14] formulated adversarial training as a min-max optimization problem as follows:

$$ \min_{\theta} \mathbb{E}_{\mathcal{D}}\left [\max_{x^{\prime}\in\mathcal{S}_{x}} \textit{L}_{ce}\left (f_{\theta}\left (x^{\prime} \right ), y \right ) \right], $$
(1)

where 𝜃 is a parameter of the model, x is an original image, \(x^{\prime }\) is an adversarial example, \(f_{\theta }\left (\cdot \right )\) is the output vector of the model, \(\mathcal {D}\) is the distribution of the training dataset, Lce is the cross-entropy (CE) loss function, and \(\mathcal{S}_{x}\) is the allowed perturbation space that is typically selected as an Lp norm ball around x. The inner maximization step generates adversarial examples using the FGSM [16] or PGD [14] attack methods. In the outer minimization step, the model is trained to minimize the adversarial loss induced by the inner maximization step.
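As an illustration of the inner maximization in (1), the following is a minimal PyTorch-style sketch of an L∞ PGD attack; the function and parameter names (e.g., pgd_attack, step_size) are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.031, step_size=0.007, num_steps=10):
    """Inner maximization of (1): L_inf PGD around the clean image x.
    Assumes `model` maps image batches to logits; a sketch, not the
    reference implementation."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    x_adv = torch.clamp(x_adv, 0.0, 1.0)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()                 # ascend the CE loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project back into the eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)                    # keep a valid image
    return x_adv.detach()
```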

Zhang et al. [15] proposed the tradeoff-inspired adversarial defense via surrogate-loss minimization (TRADES) to optimize a regularized surrogate loss, and defined it as follows:

$$ \min_{\theta} \mathbb{E}_{\mathcal{D}} \left [ \textit{L}_{ce}\left (f_{\theta}\left (x \right ), y \right ) + \max_{x^{\prime}\in\mathcal{S}_{x}} \textit{L}_{kl}\left (f_{\theta}\left (x \right ), f_{\theta}\left (x^{\prime} \right )\right )/\lambda \right ], $$
(2)

where Lkl is the Kullback–Leibler divergence loss function, and λ is a parameter that adjusts the balance between the natural and robust errors. The inner maximization step finds the adversarial example that maximally changes the model’s prediction relative to the natural image; minimizing this term in the outer step pushes the decision boundary away from the sample instances. The outer minimization step also reduces the natural error by minimizing the difference between the prediction for the natural image and its ground truth. Ryu et al. [34] proposed an adversarial training method that trains DNNs to be robust against transferable adversarial examples and to maximize their classification performance for clean images in black-box settings.
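To make (2) concrete, here is a minimal PyTorch-style sketch of the TRADES training loss given an adversarial example (assumed to be produced by an inner PGD that maximizes the same KL term); the names trades_loss and lam are illustrative, and the λ convention follows (2).

```python
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, lam=1.0):
    """Surrogate loss of (2): natural CE plus KL(f(x) || f(x_adv)) / lambda.
    Illustrative sketch, not the reference TRADES implementation."""
    logits_nat = model(x)
    logits_adv = model(x_adv)
    natural_loss = F.cross_entropy(logits_nat, y)
    # KL divergence between the natural and adversarial predictive distributions
    robust_loss = F.kl_div(F.log_softmax(logits_adv, dim=1),
                           F.softmax(logits_nat, dim=1),
                           reduction='batchmean')
    return natural_loss + robust_loss / lam
```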

Shafahi et al. [27] and Zheng et al. [28] proposed adversarial training methods that reduce the computational cost, called “free” adversarial training and adversarial training with transferable adversarial examples (ATTA), respectively. The “free” method [27] eliminates the overhead of generating adversarial examples by recycling the gradient information computed when updating the model parameters. ATTA [28] generates adversarial examples with fewer attack iterations by accumulating adversarial perturbations across epochs. Because the purpose of these methods is to reduce the computational cost of adversarial training, both use the maximization and minimization steps of Madry’s adversarial training (MAT) [14] or TRADES [15]. These adversarial training methods make DNN models robust against various adversarial attacks, but the resulting models have lower classification accuracy for clean images than conventionally trained DNN models.

3 Hybrid adversarial training

In this section, we describe our proposed hybrid adversarial training (HAT) method, which trains a DNN model to be robust against adversarial attacks while retaining its classification accuracy for clean images. Figure 1 shows the process employed in the HAT method. HAT first generates adversarial examples to deceive the DNN model and then inputs both the clean images and the adversarial examples to the denoising network to mitigate the adversarial perturbations. The denoising network reconstructs the clean images and the adversarial examples. The HAT method then trains the DNN model using the clean images, adversarial examples, reconstructed clean images, and reconstructed adversarial examples. This process is repeated every epoch to train the denoising network and the DNN model.

Fig. 1 Visual illustration of the proposed hybrid adversarial training (HAT) method. The main idea is to train the DNN model and the denoising network simultaneously

The HAT method has two effects: 1) the DNN model can correctly classify the denoised clean images because it is trained using the denoised clean images; and 2) adversarial perturbations on the adversarial examples are mitigated by the denoising network, and the denoised adversarial examples are then inputted into the DNN model, thus enabling the DNN model to correctly classify the denoised adversarial examples.

3.1 Loss function of denoising network

To mitigate adversarial perturbations on adversarial examples, we train an autoencoder-based denoising network with a U-Net [29] structure. The denoising network has four convolution layers and four de-convolution layers with batch normalization. We define a loss term to minimize the difference between the reconstructed adversarial example and the clean image as follows:

$$ \textit{L}_{img} \left (x, x^{\prime} \right ) = {\left \| \mathcal{P}_{\psi} \left (x^{\prime} \right ) - x \right \|}_{2}, $$
(3)

where ψ is a parameter of the denoising network, \(\mathcal {P}_{\psi }\left (\cdot \right )\) is the output of the denoising network, x is a clean image, \(x^{\prime }\) is an adversarial example generated by adding adversarial perturbations to the clean image, and \(\left \| \cdot \right \|_{2}\) is the L2 distance. The reconstructed adversarial example mitigates the adversarial perturbations on the adversarial example because it is reconstructed in a manner similar to the clean image. In addition, we add a loss term \(\textit{L}_{adv}^{recon}\) so that the DNN model correctly classifies the reconstructed adversarial example. \(\textit {L}_{adv}^{recon}\) is defined depending on the adversarial training method [14, 15]. If we use MAT [14], \(\textit {L}_{adv}^{recon}\) is defined as follows:

$$ \textit{L}_{adv}^{recon} = \textit{L}_{ce}\left (f_{\theta}\left (\mathcal{P}_{\psi} \left (x^{\prime} \right ) \right ), y \right ), $$
(4)

where y is the ground truth corresponding to x, 𝜃 is a parameter of the model, \(f_{\theta }\left (\cdot \right )\) is the output of the model for a particular input, and Lce is the CE loss function. If we use TRADES [15], \(\textit {L}_{adv}^{recon}\) is defined as follows:

$$ \textit{L}_{adv}^{recon} = \textit{L}_{kl}\left (f_{\theta}\left (x \right ), f_{\theta}\left (\mathcal{P}_{\psi} \left (x^{\prime} \right ) \right )\right ), $$
(5)

where Lkl is the Kullback–Leibler divergence loss function. To train the denoising network, we minimize the loss function as follows:

$$ \textit{L}_{\mathcal{P}_{\psi}} = \textit{L}_{img} + \alpha \cdot \textit{L}_{adv}^{recon}, $$
(6)

where α controls the weighting of \(\textit {L}_{adv}^{recon}\). Note that we train the denoising network using only adversarial examples. When both adversarial examples and clean images are used, the denoising network reconstructs the input image to intermediate values between the clean image and the adversarial example; in other words, it cannot effectively mitigate the adversarial perturbations on adversarial examples.
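The following PyTorch-style sketch shows how (3)–(6) could be computed with the MAT-style term (4). The simplified autoencoder stands in for the paper's four-convolution/four-deconvolution U-Net-style denoiser; all class, function, and parameter names are illustrative assumptions rather than the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleDenoiser(nn.Module):
    """Simplified stand-in for the U-Net-style denoiser (four convolution and
    four deconvolution layers with batch normalization); layer sizes here are
    illustrative only."""
    def __init__(self, in_ch=3, width=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.ConvTranspose2d(width, in_ch, 4, stride=2, padding=1),
            nn.Sigmoid(),  # keep outputs in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


def denoiser_loss(denoiser, model, x, x_adv, y, alpha=1.0):
    """Loss (6) for the denoising network: L_img from (3) plus
    alpha * L_adv^recon, shown with the MAT-style term (4)."""
    recon_adv = denoiser(x_adv)                                 # P_psi(x')
    l_img = (recon_adv - x).flatten(1).norm(p=2, dim=1).mean()  # (3), batch mean
    l_adv_recon = F.cross_entropy(model(recon_adv), y)          # (4)
    return l_img + alpha * l_adv_recon
```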

3.2 Loss function of the DNN model

To train a DNN model that correctly classifies the reconstructed input images, we use the clean images and adversarial examples as well as the reconstructed clean images and reconstructed adversarial examples. We first define a loss term to correctly classify the clean image as follows:

$$ \textit{L}_{clean} = \textit{L}_{ce} \left (f_{\theta} \left (x \right ), y \right) $$
(7)

To correctly classify the clean image reconstructed by the denoising network, we define another loss term as follows:

$$ \textit{L}_{clean}^{recon} = \textit{L}_{ce} \left (f_{\theta} \left (\mathcal{P}_{\psi} \left (x \right ) \right ), y \right) $$
(8)

To correctly classify the adversarial example, we define a loss term Ladv depending on the adversarial training method used; it has the same form as (4) or (5) but is computed on the non-denoised adversarial example rather than on its reconstruction. Combining all of these terms, we define the loss function for training the DNN model as follows:

$$ \textit{L}_{f_{\theta}} = \textit{L}_{clean} + \beta_{1} \cdot \textit{L}_{clean}^{recon} + \beta_{2} \cdot \textit{L}_{adv} + \beta_{3} \cdot \textit{L}_{adv}^{recon}, $$
(9)

where β1, β2, and β3 control the weighting of \(\textit {L}_{clean}^{recon}\), Ladv, and \(\textit {L}_{adv}^{recon}\), respectively.
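As a concrete illustration of (9) with the MAT-style adversarial terms, the sketch below computes the classifier loss; detaching the denoiser outputs so that this loss updates only the DNN model is our assumption, and all names are illustrative.

```python
import torch.nn.functional as F

def classifier_loss(model, denoiser, x, x_adv, y,
                    beta1=1.0, beta2=1.0, beta3=1.0):
    """Loss (9) for the DNN model, using the MAT-style adversarial terms.
    Detaching the denoiser outputs (so only the model is updated) is an
    assumption, not a detail from the paper."""
    recon_clean = denoiser(x).detach()       # denoised clean image
    recon_adv = denoiser(x_adv).detach()     # denoised adversarial example
    l_clean = F.cross_entropy(model(x), y)                   # (7)
    l_clean_recon = F.cross_entropy(model(recon_clean), y)   # (8)
    l_adv = F.cross_entropy(model(x_adv), y)                 # L_adv, MAT form
    l_adv_recon = F.cross_entropy(model(recon_adv), y)       # (4) on the reconstruction
    return l_clean + beta1 * l_clean_recon + beta2 * l_adv + beta3 * l_adv_recon
```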

The HAT method trains the denoising network and the DNN model simultaneously. We construct a loss function for the denoising network to mitigate adversarial perturbations on adversarial examples and add loss terms to correctly classify the clean images and the adversarial examples denoised by the denoising network in the loss function of the DNN model. Thus, the loss functions of the denoising network and the DNN model are connected to each other in every epoch. Consequently, the denoising network can more effectively mitigate adversarial perturbations on adversarial examples that deceive the DNN model, and the DNN model can more accurately classify the clean images and the adversarial examples denoised by the denoising network.
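Putting the pieces together, one epoch of the simultaneous training described above could look like the following sketch, reusing the pgd_attack, denoiser_loss, and classifier_loss helpers sketched earlier; the update order and optimizer handling are our assumptions, not the authors' exact procedure.

```python
def hat_epoch(model, denoiser, loader, model_opt, denoiser_opt,
              eps=0.031, step_size=0.007, num_steps=10,
              alpha=1.0, beta1=1.0, beta2=1.0, beta3=1.0, device="cuda"):
    """One HAT training epoch: (1) craft adversarial examples, (2) update the
    denoiser with loss (6), (3) update the DNN model with loss (9)."""
    model.train()
    denoiser.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        # Step 1: adversarial examples against the current DNN model.
        x_adv = pgd_attack(model, x, y, eps, step_size, num_steps)

        # Step 2: update the denoising network with loss (6).
        denoiser_opt.zero_grad()
        denoiser_loss(denoiser, model, x, x_adv, y, alpha).backward()
        denoiser_opt.step()

        # Step 3: update the DNN model with loss (9).
        model_opt.zero_grad()
        classifier_loss(model, denoiser, x, x_adv, y, beta1, beta2, beta3).backward()
        model_opt.step()
```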

4 Experimental evaluation and results

We evaluated our proposed HAT method on four benchmark datasets: MNIST [30], CIFAR-10, CIFAR-100 [31], and the German Traffic Sign Recognition Benchmark (GTSRB) [35]. DNNs trained using the proposed method were compared with conventionally trained DNN models, both with and without a denoising network, and with DNNs trained using state-of-the-art adversarial training methods, including MAT [14], TRADES [15], and ATTA [28].

4.1 Training setting

The MNIST dataset contains 60,000 training and 10,000 test images with input sizes of 1× 28 ×28 and 10 classes. To train the model on the MNIST dataset, we used a network with four convolutional layers followed by three fully connected layers—the same architecture as that used in [15]. We set the perturbation budget 𝜖 = 0.3, step size α = 0.01, number of iterations K = 40, learning rate η = 0.1, and batch size m = 128, then ran 100 epochs on the training dataset.

The CIFAR-10 and CIFAR-100 datasets contain 50,000 training and 10,000 test images with input sizes of 3× 32 ×32. The GTSRB dataset contains 39,209 training and 12,630 test images of various sizes and 43 classes; we resized the GTSRB images to 3× 32 ×32. To train models on the CIFAR-10, CIFAR-100, and GTSRB datasets, we used the wide residual network (WRN)-34-10 [32], the same architecture as that used in [15]. The CIFAR-100 dataset is more challenging than CIFAR-10 because it includes more classes (100 vs. 10), leaving only 600 images per class. We set the perturbation budget 𝜖 = 0.031, step size α = 0.007, number of iterations K = 10, learning rate η = 0.1, and batch size m = 64, then ran 100 epochs on the training dataset. For all datasets, when training the DNN model with MAT [14] in the proposed method, we set α and all betas to 0.5 or 1.0; when training with TRADES [15], we set α, β2, and β3 to 1.0 or 6.0 and β1 to 0.5 or 1.0.
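For reference, the training hyperparameters listed above can be collected into a single configuration; this simply restates the values in this subsection, with illustrative key names.

```python
# Training hyperparameters from Section 4.1 (key names are illustrative).
TRAIN_CONFIG = {
    "mnist":    {"eps": 0.3,   "step_size": 0.01,  "pgd_steps": 40,
                 "lr": 0.1, "batch_size": 128, "epochs": 100},
    "cifar10":  {"eps": 0.031, "step_size": 0.007, "pgd_steps": 10,
                 "lr": 0.1, "batch_size": 64,  "epochs": 100},
    "cifar100": {"eps": 0.031, "step_size": 0.007, "pgd_steps": 10,
                 "lr": 0.1, "batch_size": 64,  "epochs": 100},
    "gtsrb":    {"eps": 0.031, "step_size": 0.007, "pgd_steps": 10,
                 "lr": 0.1, "batch_size": 64,  "epochs": 100},
}
```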

4.2 Attack setting

To verify the robustness of our method, we used the FGSM [16], PGD [14], DeepFool [6], CW [18], and MIM [21] attacks to generate adversarial examples. For the MNIST dataset, we set the perturbation budget 𝜖 to 0.3 for all attack methods, and the step size α to 0.01 and the number of iterations K to 40 for the PGD and MIM attacks. For the CIFAR-10, CIFAR-100, and GTSRB datasets, we set the perturbation budget 𝜖 to 0.031 for all attack methods, and the step size α to 0.003 and the number of iterations K to 20 for the PGD and MIM attacks. Moreover, we performed the CW attack by applying the CW objective function [18] within the PGD framework. The attack parameters were the same as in [14].
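The evaluation attack settings above can likewise be summarized as a configuration; again, the key names are illustrative only.

```python
# Attack hyperparameters from Section 4.2 (key names are illustrative).
ATTACK_CONFIG = {
    "mnist":             {"eps": 0.3,   "pgd_mim_step_size": 0.01,  "pgd_mim_steps": 40},
    "cifar10_100_gtsrb": {"eps": 0.031, "pgd_mim_step_size": 0.003, "pgd_mim_steps": 20},
}
```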

4.3 Selecting optimal alpha and betas

To study the robustness of the proposed method against adversarial attacks, we must select suitable values for alpha and the betas. To do so, we trained the denoising network and DNN model while varying alpha and the betas. When applying MAT, α and all betas were set to either 0.5 or 1.0. When applying TRADES, α, β2, and β3 were set to either 1.0 or 6.0, and β1 to either 0.5 or 1.0.

For the MNIST dataset, when α and all betas were set to 1.0, the DNN model trained using MAT had the lowest classification accuracy of 99.49% for clean images, but it was the most robust against the FGSM, PGD, DeepFool, CW, and MIM attacks, with accuracies of 98.12%, 97.67%, 98.89%, 99.08%, and 97.45%, respectively. Therefore, we selected 1.0 as the optimal value for α and all betas. When we set alpha and all betas to 1.0, the DNN model trained using TRADES had the highest classification accuracy of 99.51% for the clean images and was the most robust against the FGSM, DeepFool, and MIM attacks, with accuracies of 98.04%, 98.88%, and 97.38%, respectively. In addition, it was similarly robust to models trained using other parameter settings against the PGD and CW attacks. Therefore, we again selected 1.0 as the optimal value for α and all betas. Table 1 shows the robustness of the DNN models trained using the proposed HAT method against adversarial attacks on the MNIST dataset according to the values of alpha and betas.

Table 1 Robustness of the DNN model trained using the proposed hybrid adversarial training (HAT) method according to alpha and betas on the MNIST dataset

For the CIFAR-10 dataset, when we set α and all betas to 1.0, the DNN model trained using MAT had the third-highest classification accuracy of 89.20% for the clean images, but it was the most robust against the FGSM, PGD, CW, and MIM attacks, with accuracies of 93.99%, 94.78%, 75.41%, and 95.02%, respectively, and the second-most robust against the DeepFool attack, with an accuracy of 89.13%. Therefore, we selected 1.0 as the optimal value of α and all betas. When we set α = 1.0, β1 = 0.5, β2 = 1.0, and β3 = 1.0, the DNN model trained using TRADES had the highest classification accuracy of 90.58% for the clean images and was the most robust against the FGSM, PGD, CW, and MIM attacks, with accuracies of 80.07%, 71.69%, 71.98%, and 76.44%, respectively. Therefore, we set the optimal values as α = 1.0, β1 = 0.5, β2 = 1.0, and β3 = 1.0. Table 2 shows the robustness of the DNN models trained using the proposed HAT method against adversarial attacks on the CIFAR-10 dataset according to the values of alpha and betas.

Table 2 Robustness of the DNN model trained using the proposed HAT method according to alpha and betas on the CIFAR-10 dataset

For the CIFAR-100 dataset, when we set α and all betas to 1.0, the model trained using MAT had the lowest classification accuracy of 62.58% for the clean images, but it was the most robust against all adversarial attacks (FGSM, PGD, DeepFool, CW, and MIM), with accuracies of 71.54%, 72.99%, 62.45%, 46.33%, and 73.58%, respectively. Therefore, we set α and all betas to 1.0. When we set α = 6.0, β1 = 0.5, β2 = 6.0, and β3 = 6.0, the model trained using TRADES had the second-highest classification accuracy of 66.55% for the clean images, but it was the most robust against the FGSM, PGD, DeepFool, and CW attacks, with accuracies of 43.89%, 45.97%, 66.97%, and 50.21%, respectively, and the second-most robust against the MIM attack, with an accuracy of 45.60%. Therefore, we selected the optimal values as α = 6.0, β1 = 0.5, β2 = 6.0, and β3 = 6.0. Table 3 shows the robustness of the DNN models trained using the proposed HAT method against adversarial attacks on the CIFAR-100 dataset according to the values of alpha and betas.

Table 3 Robustness of the DNN model trained using the proposed HAT method according to alpha and betas on the CIFAR-100 dataset

For the GTSRB dataset, when we set α = 1.0, β1 = 0.5, β2 = 1.0, and β3 = 1.0, the model trained using MAT had the highest classification accuracy of 95.52% for the clean images. In addition, it was the most robust against all adversarial attacks (FGSM, PGD, DeepFool, CW, and MIM), with accuracies of 91.94%, 81.84%, 91.20%, 86.29%, and 81.62%, respectively. Therefore, we selected the optimal values as α = 1.0, β1 = 0.5, β2 = 1.0, and β3 = 1.0. When we set α and all betas to 1.0, the model trained using TRADES had the highest classification accuracy of 96.56% for the clean images. In addition, it was the most robust against the FGSM, PGD, and MIM attacks, with accuracies of 82.53%, 81.10%, and 81.50%, respectively, and comparably robust against the DeepFool and CW attacks, with accuracies of 92.97% and 80.01%. Therefore, we selected 1.0 as the optimal value for α and all betas. Table 4 shows the robustness of the DNN models trained using the proposed method against adversarial attacks on the GTSRB dataset according to the values of alpha and betas.

Table 4 Robustness of the DNN model trained using the proposed HAT method according to alpha and betas on the GTSRB dataset

4.4 Robustness comparison with previous methods

To compare the robustness of the proposed method with that of the conventionally trained DNN models with and without the denoising network, we first conventionally trained the DNN model and then trained the denoising network using adversarial examples to deceive the conventionally trained DNN model. In addition, we compared the robustness of the proposed method with that of the previous methods, including MAT [14], TRADES [15], and ATTA [28].

For the MNIST dataset, the conventionally trained DNN model without the denoising network had an accuracy of 99.46% for the clean images, but it was vulnerable to the PGD, DeepFool, CW, and MIM attacks, with accuracies of 9.38%, 1.06%, 2.70%, and 17.07%, respectively, and achieved 63.81% against the FGSM attack. The conventionally trained DNN model with the denoising network had lower accuracy for the clean images than the model without the denoising network, but it was more robust against all adversarial attacks. The DNN models trained using MAT, TRADES, and ATTA were more robust than the conventionally trained DNN model without the denoising network against all attack methods, but their classification accuracies for the clean images were lower. In addition, the DNN models trained using MAT, TRADES, and ATTA were less robust than the conventionally trained DNN model with the denoising network against most adversarial attacks. The DNN models trained using the proposed method had higher accuracies for clean images than the conventionally trained DNN model with the denoising network and the DNN models trained using MAT, TRADES, and ATTA. In addition, the DNN models trained using the proposed method were more robust than the DNN models trained using MAT, TRADES, and ATTA against most adversarial attacks. Table 5 compares the robustness of the proposed method with that of the previous methods against various adversarial attacks on the MNIST dataset.

Table 5 Robustness comparison of our proposed HAT method versus conventionally trained DNN models without and with a denoising network and DNN models trained using adversarial training methods (MAT [14], TRADES [15], and ATTA [28]) on the MNIST dataset
Table 6 Robustness comparison of our proposed HAT method versus conventionally trained models without and with a denoising network and DNN models trained using adversarial training methods (MAT [14], TRADES [15], and ATTA [28]) on the CIFAR-10 dataset

For the CIFAR-10 dataset, the conventionally trained DNN model without the denoising network had an accuracy of 96.01% for the clean images, but it was vulnerable to all adversarial attacks. The conventionally trained DNN model with the denoising network had lower accuracy for the clean images, but it was more robust against all adversarial attacks. The DNN models trained using MAT, TRADES, and ATTA were more robust than the conventionally trained DNN model without the denoising network against all attack methods, but their classification accuracies for the clean images were lower. In addition, the DNN models trained using MAT, TRADES, and ATTA were more robust than the conventionally trained DNN model with the denoising network against some adversarial attacks. The DNN models trained using the proposed method had higher accuracies for clean images than the conventionally trained DNN model with the denoising network and the DNN models trained using MAT, TRADES, and ATTA. In addition, the DNN models trained using the proposed method were more robust than the DNN models trained using MAT, TRADES, and ATTA against all adversarial attacks. Table 6 compares the robustness of the proposed method with that of the previous methods against various adversarial attacks on the CIFAR-10 dataset.

For the CIFAR-100 dataset, the conventionally trained DNN model without the denoising network had an accuracy of 79.45% for the clean images, but it was very vulnerable to all adversarial attacks. The DNN model with the denoising network had lower accuracy for the clean images, but it was more robust against all adversarial attacks. The DNN models trained using MAT, TRADES, and ATTA had lower accuracies for the clean images than the conventionally trained DNN model with the denoising network, and they were also less robust against all adversarial attacks. The DNN model trained using the proposed method with MAT had lower accuracy for the clean images than the conventionally trained DNN model with the denoising network, but higher accuracy than the DNN models trained using TRADES and ATTA. In addition, the DNN model trained using the proposed method with MAT was more robust than the conventionally trained DNN model with the denoising network against the FGSM, PGD, and MIM attacks, and more robust than the DNN models trained using MAT and TRADES against all adversarial attacks. Table 7 compares the robustness of the proposed method with that of the previous methods against adversarial attacks on the CIFAR-100 dataset.

Table 7 Robustness comparison of our proposed HAT method versus conventionally trained models without and with a denoising network and DNN models trained using adversarial training methods (MAT [14], TRADES [15], and ATTA [28]) on the CIFAR-100 dataset

For the GTSRB dataset, the conventionally trained DNN model without the denoising network had an accuracy of 98.69% for the clean images, but it was very vulnerable to all adversarial attacks. The DNN model with the denoising network had lower accuracy for the clean images, but it was more robust against all adversarial attacks; it was also more robust than the DNN models trained using MAT, TRADES, and ATTA against all adversarial attacks. The DNN model trained using the proposed method with MAT had higher accuracy for the clean images than the conventionally trained DNN model with the denoising network. In addition, the DNN models trained using the proposed method were as robust as the conventionally trained DNN model with the denoising network against all adversarial attacks. Table 8 compares the robustness of the proposed method with that of the previous methods against adversarial attacks on the GTSRB dataset.

Table 8 Robustness comparison of our proposed HAT method versus conventionally trained models without and with a denoising network and DNN models trained using adversarial training methods (MAT [14], TRADES [15], and ATTA [28]) on the GTSRB dataset

5 Discussion

In this section, we discuss the classification performance for clean images, robustness against adversarial attacks, and the denoising network.

5.1 Classification performance

For the clean images, the conventionally trained DNN model with a denoising network has lower accuracy than the model without a denoising network because the conventionally trained DNN model is not trained on the denoised clean images and the denoising network may remove important features of the clean images. The adversarially trained DNN models also have lower accuracy than the conventionally trained DNN model because MAT [14] trains the DNN model using only adversarial examples, whereas TRADES [15] trains the DNN model by adjusting the balance between the natural and robust errors. Furthermore, ATTA [28] generates adversarial examples using MAT or TRADES while accumulating adversarial perturbations in every epoch to reduce the computational cost of adversarial training; therefore, the DNN model trained using ATTA also has lower accuracy than the conventionally trained DNN model. Our proposed method uses not only clean images and adversarial examples but also denoised clean images and denoised adversarial examples to train the DNN model; that is, it trains the DNN model on four times as many images as conventional training, which enables the trained model to correctly classify the denoised clean images.

5.2 Robustness against adversarial attacks

Our proposed method yields greater robustness than previous adversarial training methods, including MAT, TRADES, and ATTA. The proposed method repeatedly trains the denoising network to remove the adversarial perturbations on adversarial examples by minimizing the difference between adversarial examples reconstructed by the denoising network and the original images. In addition, the proposed method repeatedly trains the DNN model to correctly classify the original images and adversarial examples as well as their reconstructions produced by the denoising network. In the inference step, an input image is first passed through the denoising network, which mitigates any adversarial perturbations, and the denoised image is then fed to the adversarially trained DNN model. Therefore, the adversarially trained DNN model correctly classifies adversarial examples whose perturbations have been mitigated by the denoising network.
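A minimal sketch of this inference pipeline, under the assumption that the trained denoiser and classifier are simply chained as described (names are illustrative):

```python
import torch

@torch.no_grad()
def predict(model, denoiser, x):
    """Inference: denoise the input first, then classify the denoised image."""
    model.eval()
    denoiser.eval()
    logits = model(denoiser(x))
    return logits.argmax(dim=1)
```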

5.3 Denoising network

In this study, we used only an autoencoder-based denoising network to mitigate adversarial perturbations on adversarial examples, and we were able to defend DNN models against various adversarial attacks with this simple network. However, there are many other denoising network structures, including NRP [13] and CAP-GAN [26], that are more effective than autoencoder-based denoising networks and could be used in the proposed method instead. With such denoising networks, we could expect higher classification accuracy for clean images and more robust performance against adversarial attacks. However, because networks such as NRP [13] and CAP-GAN [26] have more complex architectures, using them in the proposed method would require more time to train both the denoising network and the DNN model.

5.4 Limitations

Our study has the following limitations. First, the computational cost of adversarial training is higher than that of conventional training because adversarial examples must be generated to deceive the DNN model during training, which increases the training time. Our proposed method requires even more time than standard adversarial training because it trains the DNN model and the denoising network simultaneously.

Second, denoising networks are vulnerable to adaptive attacks, such as the backward pass differentiable approximation (BPDA) attack [33]. In this threat model, adversaries are assumed to know not only the DNN model but also the exact denoising network that the defender uses; in other words, they can generate adversarial examples that deceive the DNN model even after the adversarial perturbations have been mitigated by the denoising network. If adversaries performed a BPDA attack against the proposed method, we anticipate that the generated adversarial examples would not be correctly classified.

6 Conclusion and future work

We proposed a hybrid adversarial training method that trains a DNN model and a denoising network simultaneously. The proposed method trains the DNN model such that it correctly classifies non-denoised clean images and adversarial examples as well as denoised clean images and adversarial examples. We showed that the DNN model trained using the proposed method has higher classification accuracy than the conventionally trained DNN model with a denoising network on the MNIST, CIFAR-10, CIFAR-100, and GTSRB datasets. In addition, the DNN model trained using the proposed method was more robust against adversarial attacks than the DNN model trained using previous adversarial training methods on all datasets. However, the denoising network is vulnerable to adaptive attacks such as BPDA. In future studies, we will focus on adversarial defense methods that are robust to adaptive attacks. In addition, we will extend our hybrid adversarial training method to 3D engineering applications [36,37,38].