LADDER: Latent boundary-guided adversarial training

Deep Neural Networks (DNNs) have recently achieved great success in many classification tasks. Unfortunately, they are vulnerable to adversarial attacks, which fool DNN models with adversarial examples crafted by adding small perturbations, especially in model sharing scenarios. Adversarial training has proven to be the most effective strategy: it injects adversarial examples into model training to improve the robustness of DNN models against adversarial attacks. However, adversarial training based on existing adversarial examples fails to generalize well to standard, unperturbed test data. To achieve a better trade-off between standard accuracy and adversarial robustness, we propose a novel adversarial training framework called LAtent bounDary-guided aDvErsarial tRaining (LADDER) that adversarially trains DNN models on latent boundary-guided adversarial examples. As opposed to most existing methods, which generate adversarial examples in the input space, LADDER generates a myriad of high-quality adversarial examples by adding perturbations to latent features. The perturbations are made along the normal of the decision boundary constructed by an SVM with an attention mechanism. We analyze the merits of our generated boundary-guided adversarial examples from a boundary field perspective and a visualization view. Extensive experiments and detailed analysis on MNIST, SVHN, CelebA, and CIFAR-10 validate the effectiveness of LADDER in achieving a better trade-off between standard accuracy and adversarial robustness as compared with vanilla DNNs and competitive baselines.


Introduction
In recent years, deep neural networks (DNNs) have been successfully applied to empower many advanced applications, such as image processing (Huang et al., 2017; Shi et al., 2021), speech generation (Seide et al., 2011; Amodei et al., 2016) and natural language processing (Fedus et al., 2018; Alikaniotis et al., 2016). Nevertheless, training a DNN model often requires large amounts of labeled data and significant parameter-tuning effort. This has catalyzed new ways of developing DNN models as a service that can be shared by a third party, leading to many offline and online Machine Learning as a Service (MLaaS) platforms that provide shared services for various tasks based on DNN models, such as image and video analysis from the AWS pretrained AI Services (Amazon, 2019) and powerful image analysis from Google Cloud Vision (Google, 2019). In such model sharing scenarios, however, the increasing use of DNN models has raised serious security and reliability concerns. As illustrated in Fig. 1, the service provider trains a DNN model using the training data that they have collected, expecting it to achieve high classification accuracy on test samples with similar distributions. End-users then feed their own test samples, mostly unknown to the service provider in advance, into the shared model to obtain prediction results. However, very often, test samples are likely to be mixed with unknown adversarial examples (Szegedy et al., 2013; Yuan et al., 2019), which are generated by attackers by adding small and hardly visible perturbations to standard test samples. What makes it even worse is that adversarial examples can be generated by various types of unknown attacks, resulting in a dramatic drop in classification accuracy. This is due to the fact that the shared DNN models are not robustly trained to defend against unknown adversarial attacks before they are released as a service. Thus, research works have been proposed to improve the robustness of DNN models against adversarial attacks.
Adversarial training (Goodfellow et al., 2015; Shaham et al., 2018; Shrivastava et al., 2017) is one of the most successful techniques for improving the robustness of DNN models against adversarial attacks. Its key idea is to augment the training data with adversarial examples to retrain DNN models. However, as studied in (Tsipras et al., 2019), there exists a trade-off between standard accuracy on clean data and adversarial robustness: better robustness often leads to worse classification accuracy on clean data. To address this, several recent studies (Zhang et al., 2019b; Ding et al., 2020; Wang et al., 2020) have proposed adding regularization terms to the loss function of adversarial training; these methods employ the modified, regularized adversarial training loss to achieve better generalization and adversarial robustness. This raises a research question: could adversarial examples generated by the existing methods account for the trade-off between standard accuracy and adversarial robustness? This motivates us to explore the curses and blessings of adversarial examples for achieving a better trade-off between standard accuracy and robustness against adversarial attacks.
The adversarial examples generated by existing methods suffer from two major curses. First, most of these methods (Goodfellow et al., 2015; Madry et al., 2018; Papernot et al., 2016b; Carlini and Wagner, 2017) generate adversarial examples by adding a small perturbation to legitimate samples in the input space, which often crafts only adversarial examples with repeating patterns. DNN models adversarially trained with these examples would be effective only in defending against very specific types of adversarial attacks, while remaining vulnerable to others. This presses the need to increase the diversity of generated adversarial examples, so that DNN models can fully explore the unknown data space to improve their robustness against adversarial attacks.
Second, the generated adversarial samples might hurt the standard accuracy of DNN models on clean (legitimate) samples. As shown in Fig. 2(a), the adversarial examples generated by most existing methods lie within a ball in the input space. When adversarially trained with these samples, the decision boundary of the DNN classifier would change dramatically compared with the original one, leading legitimate samples (e.g., sample A) to be misclassified. Moreover, as perturbations are added in the input space, the generated adversarial examples contain a lot of noise, as can be seen from the adversarial examples (in Fig. 6) generated by FGSM (Goodfellow et al., 2015) or JSMA (Papernot et al., 2016b). Consequently, this would hurt the standard accuracy.

(Fig. 2: decision boundaries of the vanilla classifier and the adversarially trained classifier in the input space, with the boundary field.)

Related Work
This section reviews two branches of related literature: adversarial attack methods and adversarial defence methods.
Gradient-based Attacks. Gradient-based attacks mainly add perturbations in the direction of the gradient of the loss function with respect to the input sample. Goodfellow et al. (2015) proposed the fast gradient sign method (FGSM), which perturbs an input in the direction of the sign of the gradient (∇_x J(θ, x, y)) of the loss, i.e., x′ = x + ε · sign(∇_x J(θ, x, y)).

Decision-based Attacks. Decision-based attacks manipulate the labels of training data to make the learned DNN model beneficial to their specific purposes. This line of methods uses Eq. (1) as a criterion, i.e., changing the original label to a target label, to generate adversarial examples. For a given classifier f, the predicted label of an input sample x is f(x), and f(x′) is the label of the adversarial example x′:

f(x′) ≠ f(x), where x′ = x + δ,    (1)

where δ is a small perturbation added to an input sample x, and x′ is the generated adversarial example. Carlini and Wagner (2017) proposed an approach, CW, to generate adversarial examples by adding small changes to the original images in the input space. CW minimizes the distance between benign examples and adversarial ones, while enforcing the labels of the adversarial examples to be the targeted ones. A deep learning based attack method was developed by Song et al. (2018), which explores the latent space of AC-GAN (Odena et al., 2017) to generate adversarial images that are most likely to mislead the targeted classifier. GeoDA (Rahmati et al., 2020) estimates the decision boundary for each data point in the input space to generate adversarial examples. In contrast, our LADDER trains an SVM to obtain the decision boundary between any two classes in the latent space.
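For illustration, the FGSM update x′ = x + ε · sign(∇_x J(θ, x, y)) can be sketched in a few lines of NumPy. This is a minimal toy sketch, not the paper's setup: it uses a logistic-regression "model" (whose input gradient has a closed form) as a stand-in for a DNN, and all names here are ours.

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """One-step FGSM against a logistic-regression 'model' (toy stand-in for a DNN).

    With loss J = -log p(y|x) and p = sigmoid(w.x + b), the input gradient is
    grad_x J = (p - y) * w, so the attack is x' = x + eps * sign(grad_x J).
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probability of class 1
    grad_x = (p - y) * w                     # analytic gradient of the loss w.r.t. x
    return x + eps * np.sign(grad_x)

# Toy example: a point correctly classified as class 1 gets pushed across the boundary.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, 0.5]), 1.0
x_adv = fgsm_perturb(x, w, b, y, eps=0.9)
```

With ε = 0.9 the perturbed point ends up on the other side of the linear decision boundary, which is exactly the label change expressed by Eq. (1).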
From the knowledge accessibility point of view, adversarial attacks can be divided into white-box attacks and black-box attacks. Under white-box attack settings, the attackers have access to full knowledge about the target model, i.e., the model structure and model parameters. On the contrary, the attackers have no knowledge about the target model under black-box attack settings. In this work, we are mainly concerned with model sharing scenarios, where the model structure and model parameters are unknown to attackers. Thus, we focus on defending against black-box attacks.

Adversarial Defence
For various types of adversarial attacks, a key research question is: how can we improve the adversarial robustness of a DNN model before it is deployed as a service? In response, adversarial defence strategies have been proposed to mitigate the effect of adversarial attacks (Papernot et al., 2016c; Papernot and McDaniel, 2017; Samangouei et al., 2018; Meng and Chen, 2017; Guo et al., 2018; Kannan et al., 2018; Mustafa et al., 2019; Xiao et al., 2020).
Gradient Masking Gradient masking methods (Papernot et al., 2017) construct a model that does not provide useful gradients to be attacked. For example, defensive distillation (Papernot et al., 2016c; Papernot and McDaniel, 2017) learns a smooth targeted defence model over d training rounds, where the labels predicted by the (d − 1)-th model are used as ground truth to train the d-th model, except for the first round.
Recently, Tsipras et al. (2019) found that adversarial robustness is at odds with standard accuracy on clean samples. Several recent methods were proposed to trade off adversarial robustness against standard accuracy. Most of these methods add different regularization terms to the adversarial training loss to achieve a better trade-off (Zhang et al., 2019b; Ding et al., 2020; Wang et al., 2020; Zhang et al., 2021). TRADES (Zhang et al., 2019b) is a regularization-based method that minimizes the loss between the predicted labels and the ground truth of legitimate samples, as well as the 'difference' between the predictions on legitimate samples and on the corresponding adversarial examples; the 'difference' term serves as the regularization. GAIRAT (Zhang et al., 2021) uses the distance to the decision boundary to assign a weight to each adversarial example in the adversarial training loss; the weight is estimated by how difficult it is for PGD to attack the input, rather than by the actual distance to the decision boundary. On the contrary, our LADDER derives an explicit decision boundary to generate adversarial examples for adversarial training. It is worth noting that the existing regularization-based methods and our LADDER method provide two different views on achieving a better trade-off between standard accuracy and adversarial robustness. Our LADDER method can be used in combination with these regularization-based methods to further boost their performance (see our empirical results in Section 4.5).
Different from the existing methods that use a regularized loss to improve the trade-off, AVmixup (Lee et al., 2020) is a data augmentation method that uses linear interpolation to acquire augmented examples in the input space based on adversarial examples generated by the PGD attack. However, as shown in Manifold Mixup (Verma et al., 2019a), AVmixup may produce less semantically meaningful examples as a result of linear interpolation in the input space. In contrast, our LADDER method alleviates this issue by adding perturbations along the normal of the decision boundary in the latent space.

Latent Boundary-guided Adversarial Training
To achieve a better trade-off between standard accuracy and adversarial robustness, LADDER generates better adversarial examples based on a decision boundary constructed in the latent space. Perturbations are added to the latent features along the normal of the decision boundary and inverted back to the input space by a trained generator. Through adversarial training on the generated examples, LADDER achieves a better trade-off between standard accuracy and adversarial robustness.
For clarity, we first define key notations and symbols. Define X = {x_1, ..., x_n} as the set of samples, where n is the number of samples. For each sample x_i, z_i is the latent feature vector extracted from a trained DNN model; d is the normal of the decision boundary; ε is the perturbation added to the latent feature vector z_i; and x̂_i is the sample generated from the latent feature z_i.

Latent Boundary-guided Generation
The adversarial examples are generated by perturbing the latent space, guided by the decision boundary, which is obtained from an attention SVM learned on latent features.

Boundary-guided Attention
To approximate local decision boundaries in the latent space, we train a linear SVM with an attention mechanism (Zhang et al., 2019a) on the latent features of a DNN model. This idea is grounded in the theoretical proof by Li et al. (2018) that the last layer of a neural network trained with cross-entropy loss converges to a linear SVM: for any neural network used for binary or multiclass classification, as the cross-entropy loss gradually approaches 0, the last-layer weights converge to the solution of an SVM. Specifically, we use the latent feature z, the input to the last layer of the DNN model, to train a linear SVM with an attention mechanism. The trained linear SVM provides an explicit margin, as compared to other linear models such as a linear mixture model.
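The idea of fitting a linear boundary to latent features can be sketched with a tiny hinge-loss SVM trained by subgradient descent. This is a toy NumPy sketch under our own assumptions (2-D latent features, a hand-rolled solver rather than the paper's training procedure, with attention omitted); its only purpose is to show how the boundary normal d is recovered from latent features.

```python
import numpy as np

def fit_linear_svm(Z, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Fit (w, b) minimizing hinge loss + L2 penalty by subgradient descent.

    Z: (n, d) latent feature vectors; y: labels in {-1, +1}.
    The decision boundary is {z : w.z + b = 0}, with normal w."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (Z[i] @ w + b)
            if margin < 1:                       # hinge active: push the point out
                w += lr * (y[i] * Z[i] - lam * w)
                b += lr * y[i]
            else:                                # only the L2 shrinkage applies
                w += lr * (-lam * w)
    return w, b

# Two well-separated latent clusters standing in for two classes.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = fit_linear_svm(Z, y)
normal = w / np.linalg.norm(w)   # unit normal d of the decision boundary
```

The unit vector `normal` plays the role of d in the perturbation step described below; in the paper the SVM is trained on attention-weighted latent features rather than raw ones.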
By employing an attention mechanism when training the linear SVM, our aim is to capture a better representation, with different weights assigned to different elements of the latent features. To this end, an attention layer is added to process the latent features before passing them to the SVM. The attention layer is defined as follows:

β_i = tanh(conv(z_i)), and z_i^att = β_i ⊙ z_i,    (2)

where z_i is a latent feature vector, the input to the last layer of the DNN model; conv is the convolutional operation; tanh is the activation function; β_i is the learnable attention weight vector; and z_i^att is the output after applying the attention weights to the latent feature vector z_i.
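The attention layer β_i = tanh(conv(z_i)), z_i^att = β_i ⊙ z_i can be sketched directly in NumPy. This is a minimal sketch under our own assumptions: a fixed 1-D convolution kernel stands in for the learned convolution, whereas in the paper the kernel is trained jointly with the SVM.

```python
import numpy as np

def attention_weights(z, kernel):
    """beta = tanh(conv(z)): a 1-D convolution over the latent vector
    followed by tanh, giving per-element weights in (-1, 1)."""
    return np.tanh(np.convolve(z, kernel, mode="same"))

def apply_attention(z, kernel):
    """z_att = beta ⊙ z: element-wise reweighting of the latent features."""
    beta = attention_weights(z, kernel)
    return beta * z, beta

z = np.array([0.5, -1.0, 2.0, 0.1, -0.3])   # toy 5-dimensional latent feature
kernel = np.array([0.2, 0.5, 0.2])          # hypothetical learned conv kernel
z_att, beta = apply_attention(z, kernel)
```

The output `z_att` has the same dimensionality as `z`, so it can be fed to the linear SVM in place of the raw latent features.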

Latent Feature Perturbation
After training the SVM with attention, the latent features of each sample are perturbed along the normal of the decision boundary of the SVM. The normal d provides a direction to guide the generation, while the attention weight β_i captures the importance of the different components of the latent features for moving across the boundary. Different perturbations can be added to the same latent features z_i to obtain the perturbed latent features z_i^j by:

z_i^j = z_i + ε_j (β_i ⊙ d),    (3)

where the vector d is the normal of the decision boundary of the linear SVM; β_i is the attention weight vector obtained by the SVM for each sample; ε_j > 0 is the perturbation magnitude; and j is the index of the perturbation. x̂_i^j (j = 0, 1, 2) are the images generated after perturbing the latent features; y_i^j (j = 0, 1, 2) are the predicted labels of the generated images.
When the perturbation is large enough, the class label of the perturbed latent features changes from positive to negative, or vice versa; that is, they cross the decision boundary of the DNN model. As shown in Fig. 3, the perturbed latent features z_i^1 move from the left side of the decision boundary to the right side. As the perturbation continues to increase, the perturbed latent features z_i^2 move far away from the decision boundary. The effect of the perturbation is empirically investigated in Section 4.4.
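The perturbation step z_i^j = z_i + ε_j (β_i ⊙ d) and the boundary crossing it induces can be sketched as follows. This is a toy NumPy sketch with our own values (a fixed 2-D boundary and uniform attention weights), only to show that increasing ε eventually flips the sign of the SVM score.

```python
import numpy as np

def perturb_latent(z, d, beta, eps):
    """z^j = z + eps_j * (beta ⊙ d): move latent features along the
    attention-weighted boundary normal d by step size eps_j."""
    return z + eps * (beta * d)

# Linear boundary {z : w.z + b = 0}; increasing eps eventually crosses it.
w, b = np.array([1.0, 1.0]), 0.0
d = w / np.linalg.norm(w)                   # unit normal of the SVM boundary
beta = np.ones_like(d)                      # uniform attention for illustration
z = np.array([-2.0, -1.0])                  # starts on the negative side
scores = [perturb_latent(z, d, beta, e) @ w + b for e in (0.0, 1.0, 4.0)]
```

The three scores mirror Fig. 3: ε = 0 and a small ε keep z_i on its original side (negative score), while a large enough ε pushes z_i^j across the boundary (positive score).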

Boundary-guided Generation
To enable humans to understand the changes that the perturbations to the latent features cause in the input space, we train a generator to invert the perturbed latent features back to the input space.
For a specific DNN model (e.g., LeNet), we learn a generator Ĝ on the training set X_train = {x_1, ..., x_n} to map latent features back to the input space. As shown in Fig. 3, each sample in the training set is fed into the DNN model to extract the corresponding latent features. The output of the last fully connected layer of the DNN model is used to construct the set of latent features Z_train = {z_1, ..., z_n}. Each sample x_i and its corresponding latent features z_i are fed to the designed generator. The objective function of G over the neural network class G is defined as follows:

Ĝ = arg min_{G ∈ G} Σ_{i=1}^{n} ‖G(z_i) − x_i‖_p,    (4)

where ‖·‖_p denotes the L_p norm (p = 1 or p = 2 in this paper), and z_i = Φ(x_i), where Φ is the feature extractor of the DNN model. A mapping between the latent space and the input space is learned by optimizing Eq. (4).
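The reconstruction objective of Eq. (4) with p = 2 can be sketched with a linear "generator" G(z) = Az, for which the minimizer has a closed form. This is a deliberately simplified stand-in, under our own assumptions, for the paper's deconvolutional generator; it only illustrates fitting a latent-to-input mapping by minimizing Σ_i ‖G(z_i) − x_i‖_2².

```python
import numpy as np

def fit_linear_generator(Z, X, ridge=1e-6):
    """Least-squares 'generator' G(z) = A z minimizing sum_i ||A z_i - x_i||_2^2.

    Z: (n, k) latent features; X: (n, m) flattened 'images'.
    A tiny ridge term keeps the normal equations well conditioned."""
    k = Z.shape[1]
    A = np.linalg.solve(Z.T @ Z + ridge * np.eye(k), Z.T @ X).T
    return A

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 8))               # latent features Phi(x_i)
A_true = rng.normal(size=(16, 8))
X = Z @ A_true.T                            # toy 'images' exactly linear in z
A = fit_linear_generator(Z, X)
recon_err = np.linalg.norm(X - Z @ A.T)     # value of the Eq. (4) objective (p = 2)
```

Once the mapping is fitted, perturbed latent features can be passed through it to obtain input-space samples, which is the role Ĝ plays in the next step.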
The reconstructed sample x̂_i can be obtained by passing z_i to the trained generator Ĝ, that is, x̂_i = Ĝ(z_i + β ⊙ d). For example, in Fig. 3, different latent features (z_i, z_i^1 and z_i^2) are fed into the trained generator Ĝ to obtain the corresponding samples x̂_i^j (j = 0, 1, 2) in the input space. When the generated samples are fed into the targeted DNN model, the predicted labels should be the same as the ground-truth label ŷ. That is, the following condition should be satisfied:

f(x̂_i^j) = ŷ.    (5)

As shown in Fig. 3, samples x̂_i^0 and x̂_i^2 are generated from z_i and z_i^2, with perturbations ε_0 = 0 and ε_2 added, respectively. Among them, z_i does not cross the boundary, and the predicted label of x̂_i^0 is still the ground-truth label, y_i^0 = 3. For z_i^2, which has crossed the boundary, the predicted label of its corresponding sample x̂_i^2 has changed to 5, i.e., y_i^2 = 5. This satisfies the condition specified by Eq. (5). However, Eq. (5) does not always hold for some perturbed latent features; Fig. 3 provides such an illustration. The perturbed z_i^1 has crossed the boundary, and the predicted label y_i^1 of its generated sample x̂_i^1 has also changed to 5. However, the ground-truth label of x̂_i^1 is still 3. Such samples, whose predicted labels and ground-truth labels are inconsistent, are adversarial examples that are effective for attacking the targeted DNN model.

Latent Boundary-guided Adversarial Training
To improve the adversarial robustness of the DNN models, we adopt the adversarial training method (Goodfellow et al., 2015; Shaham et al., 2018) that uses the generated adversarial examples to augment the training data for retraining. Specifically, we obtain the perturbed latent features z_i^j through Eq. (3) and pass z_i^j to the trained generator Ĝ to generate adversarial examples. Then, we adversarially train the DNN model with the following adversarial loss function:

α J(θ, x, y) + (1 − α) J(θ, Ĝ(z + βd), y),

where J(θ) is the original loss function of the DNN model and θ denotes the parameters of the targeted DNN model. The first and second terms are the losses for the original training samples x and the generated adversarial examples Ĝ(z + βd), respectively. α is the weighting factor that trades off the two terms, and is usually set to 0.5. Through adversarial training on the generated boundary-guided adversarial examples, the adversarially trained DNN model can achieve a better trade-off between standard accuracy and adversarial robustness. Without loss of generality, other consistency losses (Zhang et al., 2018; Liu and Tan, 2021; Verma et al., 2019b) from semi-supervised learning could also be used here, but they would require additional modifications to be adapted for our adversarial training purposes.

Complexity Analysis. Compared with adversarial-training-based defence methods that generate adversarial examples in the input space, the extra overhead of LADDER mainly lies in the construction of a linear SVM and the training of our generator. The complexity of training a linear SVM is O(n²), where n = 400 is the number of samples used to train the SVM in our method. The complexity of training our generator is related to the number of layers and the number of weights in the generator. After the generator is trained, generating adversarial examples requires only one forward pass through the trained generator. For the adversarial training part, our method has the same computational complexity as standard adversarial training, as we use the original adversarial training loss function to adversarially train the model.
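The weighted adversarial training objective α J(clean) + (1 − α) J(adv) can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the model is abstracted away, and the losses are computed from the model's probabilities on the true class for clean and adversarial batches.

```python
import numpy as np

def cross_entropy(p_true):
    """Mean negative log-likelihood given the model's probabilities
    for the ground-truth class of each sample."""
    return -np.mean(np.log(np.clip(p_true, 1e-12, None)))

def adversarial_loss(p_clean, p_adv, alpha=0.5):
    """alpha * J(clean) + (1 - alpha) * J(adv): the mixed objective that
    weights original samples against boundary-guided adversarial examples."""
    return alpha * cross_entropy(p_clean) + (1 - alpha) * cross_entropy(p_adv)

p_clean = np.array([0.9, 0.8])   # hypothetical confidence on clean samples
p_adv = np.array([0.6, 0.5])     # lower confidence on adversarial samples
loss = adversarial_loss(p_clean, p_adv, alpha=0.5)
```

With α = 0.5 the two terms contribute equally, matching the default weighting described above; α = 1 recovers standard training on clean data only.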

Experimental Evaluation
In this section, we present experimental results to show the effectiveness of our method in achieving a better trade-off between standard accuracy and adversarial robustness. We conduct extensive experiments on MNIST (LeCun and Cortes, 1998), SVHN (Netzer et al., 2011), CelebA (Liu et al., 2015), and CIFAR-10 (Krizhevsky et al., 2009).

P2: Standard Accuracy and Adversarial Robustness. We evaluate the standard accuracy and adversarial robustness of different adversarially trained models and demonstrate the competitiveness of our LADDER method (Section 4.3). In model sharing scenarios, we focus on adversarial robustness against black-box attacks. Detailed experiments on adversarial robustness against white-box attacks can be found in Appendix 6.2.
P3: Effect of Perturbation. We investigate how the perturbation impacts the performance of our LADDER method (Section 4.4).

P4: Complement to Regularization-based Adversarial Training Methods. We verify the complementary effect of LADDER on existing regularization-based adversarial training methods in achieving a better trade-off between standard accuracy and adversarial robustness (Section 4.5).

Experiments Settings
Datasets and Shared DNN Models. We conduct our experiments on four datasets: MNIST (a grey-scale digits dataset), SVHN (a colorful digits dataset), CelebA (a human face image dataset), and CIFAR-10 (a natural image dataset). On the four datasets, we use DNN models with different architectures and depths, LeNet (LeCun et al., 1995), SVHNNet (a shallow VGG model), CelebANet (a deep VGG model (Simonyan and Zisserman, 2014)), and Cifar-Net (ResNet18), as the targeted classifiers for defence in model sharing scenarios. Note that, on CelebA, because the size of the original images is 178×218, we first pre-process the images to 128×128 using DLIB (Dlib, 2019): we detect faces in the images and crop them to square sizes. Our task is to classify whether an input image is smiling or non-smiling.
Baseline Methods. The competing methods used for comparison are summarized as follows. FGSM (Goodfellow et al., 2015), JSMA (Papernot et al., 2016b), PGD (Madry et al., 2018), CW (Carlini and Wagner, 2017) and AutoAttack (Croce and Hein, 2020b) are five baselines that generate adversarial examples by adding perturbations in the input space. Song et al. (2018) is a baseline method that generates adversarial examples in the latent space. TRADES (Zhang et al., 2019b) is a representative method that adds regularization to the adversarial training loss to improve the trade-off between standard accuracy and adversarial robustness; it uses adversarial examples generated by FGSM for adversarial training. We also compare with another baseline, TRADES+LADDER, which combines TRADES with LADDER. This baseline is used to assess whether methods that regularize the adversarial training loss can be complemented by adversarial examples generated with our LADDER method.
For the ablation study, we compare with two variants of our LADDER method, LADDER cavRandom and LADDER Random, which use different strategies for generating adversarial examples x̂_i. LADDER cavRandom adds random noise δ to the normal of the decision boundary obtained from the SVM: x̂_i = Ĝ(z_i + β(d + δ)). LADDER Random uses random noise γ in place of the normal of the decision boundary: x̂_i = Ĝ(z_i + βγ). The two variants are used to show LADDER's effectiveness in using the normal of the decision boundary to guide the generation of adversarial examples.
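The three perturbation directions behind LADDER and its two ablations can be sketched side by side. This is a toy NumPy sketch with our own names and noise scale, only to make the distinction concrete: the boundary normal d itself, d plus random noise δ, and a pure random direction γ.

```python
import numpy as np

def make_direction(d, variant, rng, noise_scale=0.5):
    """Return the (unit-normalized) perturbation direction per variant:
    'ladder'    -> the boundary normal d itself,
    'cavRandom' -> d plus random noise delta,
    'random'    -> a pure random direction gamma (d is ignored)."""
    if variant == "ladder":
        v = d.astype(float)
    elif variant == "cavRandom":
        v = d + noise_scale * rng.normal(size=d.shape)
    elif variant == "random":
        v = rng.normal(size=d.shape)
    else:
        raise ValueError(variant)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d = np.array([1.0, 0.0, 0.0])               # toy unit boundary normal
dirs = {v: make_direction(d, v, rng) for v in ("ladder", "cavRandom", "random")}
```

Only the 'ladder' variant moves exactly along the boundary normal; the ablations progressively discard that guidance, which is what the comparison in the experiments isolates.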
For FGSM, PGD, JSMA and CW, we generate adversarial examples using the open-source attack library cleverhans (Papernot et al., 2016a). For the method of Song et al. (2018), AutoAttack (Croce and Hein, 2020b) and TRADES (Zhang et al., 2019b), we use the source code released by the authors. The number of generated adversarial examples for adversarial training on each dataset is: 4,500 on MNIST; 4,500 on SVHN; 2,000 on CelebA; and 50,000 on CIFAR-10. The hyper-parameters used for all methods in adversarial training are summarized in Table 6 in Appendix 6.1.

Fidelity of Generator
We first validate the performance of our trained generator in terms of the quality of the generated samples. Here, MNIST is used as a case study to visualize and analyze the results. We train a generator on MNIST using latent features with a dimension of 500, the input to the last layer of LeNet. All components of the generator architecture except for the activation functions are provided in Table 10 in Appendix 6.

To demonstrate the quality of adversarial examples generated by LADDER on natural images, we show adversarial examples of selected classes on SVHN, CIFAR-10 and CelebA in Fig. 5. These examples are generated by perturbing the latent features with different perturbation magnitudes ε. As we can see, the generated images are of high quality, without any pepper noise.

Diversity of Generated Adversarial Examples
We compare adversarial examples generated by our LADDER method with those of other methods (FGSM, JSMA, PGD and (Song et al., 2018)) on MNIST. For LADDER, we used the trained generator to generate diverse adversarial examples. This diversity property enables LADDER to be more effective in defending against adversarial attacks.

High-Quality Adversarial Examples near Boundary
Fig. 7 shows adversarial examples generated by our LADDER method in relation to the decision boundary. Compared with the adversarial examples generated by FGSM and JSMA (see Fig. 6), LADDER is able to generate non-blurry adversarial examples with no noise in the background. Such high-quality adversarial examples would not hurt the standard accuracy.
From a classification perspective, samples close to the decision boundary are more likely to be misclassified by a classifier; these samples should be more useful for constructing a classifier with good standard accuracy. LADDER uses the normal of the decision boundary as a guide to generate adversarial examples near the boundary. As shown in Fig. 7(a), the original sample is clearly a digit 2. When we increase the perturbation (ε) added to its corresponding latent features along the normal of the decision boundary, the predicted label of the generated samples changes from 2 to 3. The two examples near the decision boundary, generated with ε = 7 and ε = 9, are inherently ambiguous, making it difficult even for humans to make a judgement. If we add these ambiguous adversarial examples with labels to the training set, they would enrich the data space near the decision boundary, thereby improving the generalization of the trained classifier. The same holds for Fig. 7(b): the DNN classifier predicts the two examples, generated with ε = 9 and ε = 10, as class 1 and class 7, although they look very similar. Such similar adversarial examples are beneficial for improving the standard accuracy. We provide more illustrative examples in Fig. 10 in Appendix 6.3 to demonstrate the ability of our generator to produce sensible adversarial examples and the effectiveness of perturbing latent features along the normal of the decision boundary.

Standard Accuracy and Adversarial Robustness
To validate the efficacy of our LADDER method on standard accuracy, i.e., the accuracy on clean test datasets, as well as adversarial robustness, we conduct experiments on MNIST, SVHN, CelebA and CIFAR-10.
The results are reported in Tables 1-4, where row 1 indicates the vanilla model and the other rows indicate adversarially trained models; column 2 represents the clean test dataset, and columns 3-8 represent the different attack methods used to generate adversarial examples for attacking the targeted model. Under our setting, we focus on defence methods based on adversarial training, and each adversarially trained model is trained on adversarial examples generated by a different method under the white-box setting. For FGSM and PGD, we set the perturbation to 0.3; for CW, we use the L2 norm distance. For CW and JSMA, adversarial examples are generated under the targeted attack condition; for the other methods, under the untargeted attack condition. For LADDER, the architectures of the generators on the four datasets are provided in Appendix 6.4. To improve the generation performance on natural images, i.e., CelebA and CIFAR-10, we generalize LADDER by replacing the L_p norm loss with the adversarial loss used in generative adversarial networks (GANs) (Goodfellow et al., 2014) to train a stronger generator. This leads to a variant of our method called LADDER-GAN.
In model sharing scenarios, after a trained model is released, it could be targeted by different attacks that are unknown to the trained model. Thus, we focus on black-box attacks, as indicated in columns 3-8, where adversarial examples are generated with no access to the trained models. In particular, we assess the ability of each adversarially trained model to defend against other types of attacks. Thus, the robustness results of each adversarially trained model are not reported against the same attack used to generate its adversarial training examples.
We focus on assessing the performance of different models in terms of both standard accuracy on the clean test dataset and adversarial robustness. We thus calculate the average rank for each adversarially trained model to show its trade-off between standard accuracy and adversarial robustness against several attacks. We also find that LADDER achieves the best performance in defending against the (Song et al., 2018) attack, compared with all other adversarially trained models. As for the overall performance on defending against all attacks and on the clean test dataset, LADDER achieves an average rank of 3.71, outperforming all other methods. This shows that LADDER achieves a better trade-off between standard accuracy and adversarial robustness. Compared with the two variants LADDER cavRandom and LADDER Random, LADDER improves the average rank by 2.15 and 3.15, respectively. This validates the necessity of using the normal of the decision boundary as guidance for generating adversarial examples. In terms of adversarial robustness, LADDER improves over the vanilla model in defending against the FGSM attack and the PGD attack by 9.2% and 19.27%, respectively. When defending against the JSMA attack, LADDER performs similarly to the vanilla model. Among all attacks, LADDER achieves the best performance in defending against the CW attack and AutoAttack, compared with the other adversarially trained models. Overall, LADDER and LADDER Random achieve an average rank of 3.29, which is the highest among all adversarially trained models except for the CW Adv. model. Yet, LADDER achieves better performance than CW Adv. on the clean test dataset and against the PGD attack, and achieves the same performance against AutoAttack. LADDER cavRandom also outperforms the FGSM, PGD and Song et al. Adv. models. This confirms the usefulness of leveraging latent features to generate adversarial examples.

Results on CelebA
Table 3 reports the standard accuracy and adversarial robustness of the vanilla model and the different adversarially trained models on CelebA. In terms of standard accuracy on the clean test dataset, LADDER yields the highest accuracy, while LADDER-GAN achieves the second-best performance. Of the two variants of LADDER, LADDER cavRandom performs better than LADDER Random, while both variants outperform the FGSM, PGD, JSMA and CW Adv. models. This shows that performing feature perturbations in the latent space is beneficial for achieving better standard accuracy.
As for the adversarial robustness against adversarial attacks, LADDER achieves better performance in defending against the FGSM, JSMA, PGD and (Song et al., 2018) attacks, compared with most of the baseline models. In particular, for the PGD attack, LADDER improves the accuracy from 13.90% to 27.60%. As a whole, LADDER achieves an average rank of 3.29, the best among all methods; the smaller the average rank, the better the overall performance in defending against adversarial attacks while simultaneously achieving good standard accuracy. LADDER-GAN and LADDER Random both achieve an average rank of 4.71, behind only Song et al. Adv. and LADDER. This demonstrates the overall effectiveness of LADDER and its variants.

Results on CIFAR-10
We also compare the standard accuracy and adversarial robustness of LADDER with other baseline methods on CIFAR-10, a more challenging dataset for the generation task. Table 4 shows the classification results of the vanilla model and different adversarially trained models. We can see that LADDER is the second best performer among all adversarially trained models, achieving an accuracy of 85.92%. Only the JSMA Adv. model performs slightly better than LADDER, with a small gap of 1.95%. Compared to the Song et al., FGSM and PGD Adv. models, LADDER achieves significant improvements of 37.26%, 18.71%, and 6.8%, respectively. The performance of LADDER-GAN slightly lags behind LADDER. This signifies the competitive performance of LADDER in achieving good standard accuracy on CIFAR-10.
For adversarial robustness, LADDER achieves the best performance when defending the (Song et al., 2018) attack. In general, LADDER achieves a better average rank than the FGSM, PGD and Song et al. based adversarially trained models. As the generation task on CIFAR-10 is more challenging, we also compare with LADDER-GAN. As can be seen, LADDER-GAN improves the average rank of LADDER from 4.0 to 3.86. Yet, we find that LADDER and LADDER-GAN perform worse than the CW and JSMA adversarially trained models. This indicates that generator-based defence methods have difficulties in achieving the most appealing results on challenging datasets like CIFAR-10. Our findings reaffirm the results of (Song et al., 2018) and those reported in (Jang et al., 2019), where a recursive and stochastic generator is used to generate adversarial examples for adversarial training. We leave further investigation of this problem to future work.

Analysis of the Trade-off between Standard Accuracy and Robustness
To visually demonstrate the advantages of our LADDER method in achieving a better trade-off between standard accuracy and adversarial robustness, we explicitly compare the trade-off performance of different defence methods with respect to different numbers of adversarial examples on CelebA as a case study. Specifically, we vary the number of examples used to adversarially train the models from 100 to 2,000. The classification results are plotted in Fig. 8, where there are 7 points for each adversarially trained model. In the figure, the x-axis indicates the accuracy of adversarially trained models on adversarial examples generated by Song et al. (2018), and the y-axis indicates the standard accuracy of adversarially trained models on the clean CelebA test dataset. A method that achieves a better trade-off is expected to be located closer to the top-right corner. It can be clearly seen that our LADDER method and its variants (marked as circles) are located in the top-right corner. Markedly, our LADDER method outperforms FGSM Adv., PGD Adv., and CW Adv. by a large margin. Again, this confirms that our LADDER method is able to achieve a better trade-off between standard accuracy and adversarial robustness.

Effect of Perturbation
Next, we empirically evaluate the effect of the perturbation ε on the performance of our LADDER method. First, we study the impact of ε on standard accuracy.
To adversarially train the LeNet, we randomly select 450 images per class from the MNIST dataset to generate 4,500 adversarial examples for each perturbation ε ∈ {0.1, 2.0, 5.0, 7.0, 10.0, 15.0, 20.0}. These adversarial examples with different perturbations are separately used to adversarially train the LeNet. We then perform classification on the clean MNIST test dataset using these adversarially trained LeNet models. The results are reported in Fig. 9, colored in blue. It is clear to observe that: (1) as ε increases, the classification accuracy of the adversarially trained models first decreases and then slightly increases at a later stage; (2) with different ε values, the changes in classification accuracy are within an interval of only 1.56%; (3) when ε is not too large, i.e., ε < 7, the performance of the adversarially trained models and the vanilla LeNet is very close.
Second, we study the impact of ε on adversarial robustness. We reuse the adversarially trained LeNet models from the previous step for these experiments. These models are used to defend against adversarial examples generated using the Song et al. (2018) attack method. From the red part in Fig. 9, we can see that, as ε increases, the performance of the adversarially trained model drops slightly. Overall, with different ε values, our LADDER method achieves stable performance within a reasonably small range.

Complement to Regularization-based Adversarial Training Methods
Experiments are further performed to verify whether LADDER can complement existing regularization-based adversarial training methods, which regularize the adversarial loss to achieve a better trade-off between standard accuracy and adversarial robustness. TRADES (Zhang et al., 2019b) is a strong competing method in this category. To achieve the same objective, our LADDER method takes a complementary approach: it generates better adversarial examples while using the original adversarial training loss. We expect that the performance of TRADES can be improved in combination with LADDER.
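For reference, TRADES trades standard accuracy off against robustness through a regularized objective of roughly the following form (after Zhang et al., 2019b); LADDER leaves this loss untouched and instead supplies better adversarial examples:

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[ \mathrm{CE}\big(f_{\theta}(x), y\big)
  + \beta \max_{\|x' - x\| \le \epsilon}
    \mathrm{KL}\big(f_{\theta}(x) \,\|\, f_{\theta}(x')\big) \Big]
```

where CE is the cross-entropy loss on clean inputs, KL is the Kullback-Leibler divergence between predictions on clean and perturbed inputs, and β balances natural accuracy against the boundary regularizer.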
We perform experiments on MNIST, SVHN, CelebA and CIFAR-10 to compare LADDER, TRADES and the combined TRADES+LADDER. The results are shown in Table 5. As we can see, LADDER achieves better defence performance than TRADES in 18 out of 28 cases on the four datasets. Especially on SVHN, LADDER outperforms TRADES against all attacks and on the clean test dataset. As expected, TRADES+LADDER is found to outperform TRADES in most cases (23 out of 28) on the four datasets. This proves that, by generating high-quality and diverse adversarial examples, LADDER can complement regularization-based adversarial training methods.
In the future, we will extend our work in the following aspects. Firstly, our method generates adversarial examples by perturbing along the normal of the decision boundary to reduce the level of minimal perturbations in the latent space. For inverting to the input space, we will try to derive theoretical bounds on when the perturbations of our generated examples are narrower than the L_p norm perturbations in the input space. Secondly, for complex datasets like CIFAR-10 and ImageNet, where the generation task is more challenging, we have made attempts to use an adversarial loss rather than the L_p norm loss for training a strong generator. We will investigate how to generate better adversarial examples to boost the adversarial robustness on complex datasets. Finally, we would like to reduce the computational complexity of our proposed method by removing the generator and directly using the adversarial feature vectors in the latent space for adversarial training.

Hyper-parameters in Experiments
The hyper-parameters used for adversarial training in our experiments are summarized in Table 6. Notably, our LADDER method achieves the best defence performance when defending CW attacks. Compared with the vanilla model, LADDER improves the performance against all attacks except for CW and (Song et al., 2018). Overall, our proposed method exhibits competitive performance in defending white-box attacks.

LADDER's Robustness against LADDER Attacks
We have also conducted experiments to compare the defence performance of the vanilla model and our trained model against white-box adversarial examples generated using LADDER. The results are reported in Table 8. As can be clearly seen, our trained model (LADDER) significantly improves the defence performance over the vanilla model on the three datasets by 28.54%, 31.56%, and 5.95%, respectively. This proves the efficacy of our trained models against adversarial examples generated under white-box settings using the same latent-based attack method.

Susceptibility of Baseline Methods Against LADDER Attacks
In Table 9, rows 3-7 show the defence performance of five conventional adversarially trained models against the adversarial examples generated by LADDER on three datasets. As compared to the vanilla model, the best performer among the five conventional adversarially trained models improves the defence performance by only 3.55% on MNIST. All five conventional adversarially trained models exhibit worse defence performance than the vanilla model on SVHN and CelebA. This confirms the susceptibility of the conventional adversarially trained models to the adversarial examples generated using LADDER. In contrast, the adversarially trained LADDER model is very successful in defending against adversarial samples generated using LADDER.

Illustrative Examples Generated by LADDER
We further provide more illustrative examples to demonstrate the ability of our generator to generate sensible adversarial examples and the effectiveness of perturbing latent features along the normal of the decision boundary.
Table 9: Defence performance of the conventional adversarially trained models and our LADDER method against adversarial samples generated using LADDER. The best method is highlighted in bold.
Fig. 10: Illustrative examples generated using our LADDER method with an increasing perturbation (ε).
As shown in Fig. 10, the first column is the original input image; the last column is the randomly sampled target class; columns 2-16 are the generated examples, produced by adding different perturbations (ε) to the latent features of the original inputs. From column 2 to column 16, the perturbation increases gradually from 0.5 to 30.0 along the normal of the decision boundary between the class of the original input and the target class. We can see from the figure that, as we increase the perturbation, the generated examples gradually change from the original class to the target class, and when the perturbation is too large (i.e., the last 3 columns), the generated images are distorted. The images marked with red rectangles are inherently ambiguous between the class of the original input and the target class, making it difficult even for humans to make a judgement. These images enrich the data space near the decision boundary, thereby improving the generalization of the trained classifier.
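The sweep over columns 2-16 can be sketched as shifting a latent feature by ε along the (optionally attention-weighted) unit normal of the SVM decision boundary before decoding. The toy numpy snippet below illustrates only this perturbation step, with made-up vectors; the trained generator and SVM are omitted, and all names are hypothetical:

```python
import numpy as np

def perturb_along_normal(z, w, eps, beta=None):
    """Shift a latent feature z by eps along the unit normal of a linear
    SVM decision boundary with weight vector w (the boundary normal).
    beta optionally carries element-wise attention weights."""
    d = w if beta is None else beta * w   # attention-weighted normal
    d_hat = d / np.linalg.norm(d)         # unit direction
    return z + eps * d_hat

# Made-up latent feature and boundary normal (||w|| = 5).
z = np.zeros(4)
w = np.array([3.0, 0.0, 4.0, 0.0])
z_adv = perturb_along_normal(z, w, eps=10.0)
print(z_adv)  # -> [6. 0. 8. 0.]
```

Sweeping `eps` over increasing values and decoding each `z_adv` with the generator would reproduce the gradual class transition shown in the figure.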

Generator Architectures
The neural network architectures of the boundary-guided generator for MNIST, SVHN, CelebA and CIFAR-10 are detailed in this part.

Fig. 1 :
Fig. 1: Attacks in model sharing scenarios. The end user performs classification tasks through the shared model without knowledge of the adversarial examples generated by attackers to attack the shared model.
Fig. 2: (a): Adversarial examples generated by the existing methods are often within a ball in the input space, which leads the decision boundary of the adversarially trained classifier (dotted line) to change dramatically. The legitimate sample, sample A, would be misclassified. (c): Our LADDER method perturbs latent features of samples within the boundary field in the latent space. The generated adversarial examples would reside in a more restricted area. Sample A would thus be classified correctly. (b): The classifier adversarially trained on existing adversarial examples (red dotted line) would misclassify legitimate samples (samples B and C), which would hurt the standard accuracy. The classifier adversarially trained using our method (purple dashed line) would change less remarkably than the existing one (red dotted line) and would correctly classify these samples.
and MagNet (Meng and Chen, 2017) are three typical methods that remove the perturbation added to adversarial examples to reconstruct a clean sample similar to the legitimate one. The reconstructed examples can be easily recognised by the model, compared with adversarial examples.

Fig. 3 :
Fig. 3: Overview of Latent Boundary-guided Adversarial Training. LADDER generates adversarial examples by perturbing latent features along the normal of the decision boundary obtained from an SVM with an attention mechanism. These generated adversarial examples are inverted to the input space via a trained generator to adversarially train the DNN model. z_i is the latent feature of one original sample x_i; β is the attention weight; d is the normal of the decision boundary; z_i^1 and z_i^2 are the perturbed latent features for generation; x_i^j (j = 0, 1, 2) are the images generated after perturbing the latent features; y_i^j (j = 0, 1, 2) are the predicted labels of the generated images.
from four perspectives. The source code of our implementation is provided. P1: Blessings of Adversarial Examples. To show the merits of our latent boundary-guided adversarial examples, we visualize and analyse the generated adversarial examples (Section 4.2).

Fig. 4 :
Fig. 4: Reconstructed images of our generator trained on MNIST. (a) and (c) indicate the original training and test images, whereas (b) and (d) show the generated training and test images.

Fig. 5 :Fig. 6 :
Fig. 5: Adversarial examples generated by our LADDER method on SVHN, CIFAR-10 and CelebA, where the texts on the left indicate the actual class labels.

Fig. 9 :
Fig. 9: Classification accuracy of vanilla LeNet and adversarially trained LeNet on the MNIST test dataset and on adversarial examples with different perturbations ε. Best viewed in color.
function with respect to input examples as perturbation. Built upon FGSM, the one-step attack method, Madry et al. (2018) proposed a multi-step attack method called PGD. PGD iteratively uses the gradient information and generates adversarial examples from the results of the last step. Similarly, DI2-FGSM (Xie et al., 2019) generates adversarial examples from the results of the last step, but it uses the gradients of stochastically transformed inputs rather than the original ones. Papernot et al. (2016b) introduced a saliency map based on the Jacobian matrix into the generation of adversarial examples. The saliency values, computed by the forward derivative of a target model, are used as an indicator to determine the locations in the input examples at which to add perturbations. This method is called the Jacobian saliency map attack (JSMA).
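As a concrete toy illustration of the FGSM and PGD update rules described above, the sketch below uses a linear model with a squared loss so the input gradient is available in closed form; this is not the setting of our experiments, and all names and numbers are illustrative:

```python
import numpy as np

def grad_x(w, x, y):
    # Gradient of the squared loss L = 0.5 * (w.x - y)^2 w.r.t. the input x.
    return (w @ x - y) * w

def fgsm(w, x, y, eps):
    # FGSM: a single step of size eps along the sign of the input gradient.
    return x + eps * np.sign(grad_x(w, x, y))

def pgd(w, x, y, eps, alpha, steps):
    # PGD: repeated signed-gradient steps of size alpha, each followed by a
    # projection back into the L-infinity ball of radius eps around x.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_x(w, x_adv, y))
        x_adv = x + np.clip(x_adv - x, -eps, eps)
    return x_adv

w = np.array([1.0, -2.0])   # toy model weights
x = np.array([0.5, 0.5])    # toy input
x_fgsm = fgsm(w, x, y=0.0, eps=0.1)
x_pgd = pgd(w, x, y=0.0, eps=0.1, alpha=0.05, steps=3)
```

The projection step is what distinguishes PGD from merely iterating FGSM: it keeps the accumulated perturbation within the prescribed ε-ball.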
DLS by using a difference of logits ratio (DLR) loss. Then, four attack methods, including APGD-CE, APGD-DLR, FAB

is proved to be one of the most effective defence methods, which augments the training data with adversarial examples when training the targeted model. It can be achieved either by training the targeted model with original samples augmented with adversarial examples

Table 1 :
SVHN: Classification accuracy (%) of vanilla and adversarially trained models on clean test dataset and adversarial examples. Smaller means better for the average rank (Avg. Rank). The best method is highlighted in bold and the second best is underlined. The average rank is calculated over the ranks of each adversarially trained model on the clean test dataset and in defending all attacks, and is reported as the last column in the tables.
4.3.1 Results on SVHN
Table 1 reports the standard accuracy on the clean SVHN test dataset and the adversarial robustness of different models against adversarial attacks. We can see that, among all the adversarially trained models, LADDER achieves the second best standard accuracy (91.71%) on the clean test dataset, lagging behind only the Song et al. Adv. model. Compared with other models, LADDER achieves an improvement of 3.96% and 3.05% over PGD and FGSM, respectively. Compared with its two variants, LADDER cavRandom and LADDER Random, LADDER performs better on the clean test dataset.

Table 2 :
MNIST: Classification accuracy (%) of vanilla and adversarially trained LeNet on clean and adversarial examples. Smaller means better for the average rank (Avg. Rank). The best method is highlighted in bold and the second best is underlined.
Table 2 reports the classification results of the vanilla and adversarially trained LeNet models on the clean MNIST test dataset and their adversarial robustness. In terms of standard accuracy on the clean test dataset, LADDER performs the best and LADDER cavRandom achieves the second best among all the adversarially trained models. The performance of LADDER (99.12%) is very close to that of the vanilla model (99.13%), with only a 0.01% difference. Moreover, LADDER outperforms the baseline PGD Adv. by a large margin of 8.33%. LADDER is also observed to perform better than its two counterparts, LADDER cavRandom and LADDER Random, while LADDER cavRandom performs better than LADDER Random.

Table 3 :
CelebA: Classification accuracy (%) of vanilla and adversarially trained CelebANet on clean and adversarial examples. Smaller means better for the average rank (Avg. Rank). The best method is highlighted in bold and the second best is underlined.

Table 4 :
CIFAR-10: Classification accuracy (%) of vanilla and adversarially trained models on clean and adversarial examples. Smaller means better for the average rank (Avg. Rank). The best method is highlighted in bold and the second best is underlined.
Fig. 8: Classification accuracy of different defence methods on adversarial examples generated by Song et al. (2018) and on the clean CelebA test dataset. Each model is adversarially trained on varying numbers of adversarial examples, with 7 points for each method compared in the figure. Best viewed in color.

Table 5 :
Classification accuracy (%) of vanilla and adversarially trained models on the clean test dataset and on adversarial examples generated by different attack methods. Higher means better for classification accuracy. The best results are highlighted in bold.

Table 6 :
Hyper-parameters of adversarial training for all methods.
To verify the robustness of our method in defending white-box attacks, we have conducted experiments under white-box settings, where attack methods generate adversarial examples with gradients available from the network. The comparison of our method and other baseline methods on MNIST is shown in Table 7. As can be seen, our methods (LADDER and LADDER Random) are the two best performers for defending different types of attacks simultaneously.

Table 7 :
Defending white-box attacks targeted on LeNet: classification accuracy of the vanilla LeNet and adversarially trained LeNet models on white-box adversarial examples generated by different attack methods. Smaller means better for the average rank (Avg. Rank). The best method is in bold and the second best is underlined.

Table 8 :
Classification accuracy on white-box adversarial examples generated by LADDER.
In each table, Linear indicates a linear transformation; Conv Transpose denotes a transposed convolution; Conv represents a convolution; BN represents batch normalization; kernels means the number of kernels; kernel means the dimensions of a kernel; stride means the step size of convolutions; ReLU means the ReLU activation function.

Table 10 :
The architecture of boundary-guided generator for MNIST.

Table 11 :
The architecture of boundary-guided generator for SVHN.

Table 12 :
The architecture of boundary-guided generator for CelebA.