A3T: accuracy aware adversarial training

Adversarial training has been empirically shown to be more prone to overfitting than standard training. The exact underlying reasons are still not fully understood. In this paper, we identify one cause of overfitting related to the current practice of generating adversarial examples from misclassified samples. We show that, following current practice, generating adversarial examples from misclassified samples results in samples that are harder to classify than the original ones. This leads to a complex adjustment of the decision boundary during training and hence to overfitting. To mitigate this issue, we propose A3T, an accuracy aware AT method that generates adversarial examples differently for misclassified and correctly classified samples. We show that our approach achieves better generalization while maintaining comparable robustness to state-of-the-art AT methods on a wide range of computer vision, natural language processing, and tabular tasks.


Introduction
Deep learning has achieved excellent generalization on a wide variety of tasks [1][2][3][4][5]. However, deepnets are also known to be vulnerable to adversarial attacks [6][7][8][9], due to the non-smooth nature of their associated loss landscapes. Adversarial training (AT) [10] is established as the most efficient mechanism for defending against these attacks and consists of training on adversarially perturbed samples instead of on the original ones. Yet, AT is also more prone to overfitting [11][12][13][14][15][16] and therefore results in lower generalization than standard training. The reasons behind this are still under-explored. Nevertheless, several classical and modern deep learning remedies for overfitting, e.g., regularization [17][18][19][20] and data augmentation [13,21,22], have been proposed without a full understanding of the root causes.

Recently, Dong et al. [14] showed that overfitting is due to augmenting the training data with 'hard' to classify adversarial samples that incentivize a complicated adjustment of the decision boundaries. The authors argue that these 'hard' adversarial samples are generated from samples close to the decision boundary with noisy or inappropriate ground-truth one-hot encoded labels. They show that previously proposed techniques for label and weight smoothing [20,23,24] can alleviate this issue.
In this paper, we identify another category of 'hard' adversarial samples. Specifically, we show that adversarial samples generated from misclassified samples in every iteration of the training are, counter-intuitively, always placed maximally away from the decision boundary, as illustrated in Figure 1. Hence, they induce a large loss, leading to substantial perturbations in the local region of the misclassified point. This problem is exacerbated in high-capacity models, as they have the flexibility to make local changes to a decision boundary. To address this problem, we propose Accuracy Aware Adversarial Training (A3T), which controls the generation of adversarial samples in a manner that is aware of the current predicted label of the original training example. While Wang et al. [25] and Ding et al. [26] proposed formulations that generate adversarial samples differently for misclassified samples, their insights were empirically driven, and they neither identify nor directly address the root cause of poor generalization, i.e., adversarial samples from misclassified samples are generated maximally away from the decision boundary. In contrast, A3T proposes a simpler fix that directly addresses the root cause, i.e., it places adversarial samples from misclassified samples closer to the decision boundary.
We show that A3T improves upon previously adopted techniques for mitigating overfitting in AT and achieves better generalization while having comparable robustness to state-of-the-art models on toy experiments as well as on computer vision, natural language processing, and tabular applications. This demonstrates that we have identified a root cause of overfitting that is unaddressed in the literature.

Related work

Adversarial Training
Adversarial Training (AT) [10] is a defense mechanism against adversarial attacks, which aim at fooling a trained supervised machine learning model by adding imperceptible perturbations to the inputs to cause misclassification. Modern deepnets are particularly vulnerable to such attacks. Due to the non-smooth nature of their loss landscapes, it is possible to find, at almost any input sample, the direction of a steep gradient (adversarial direction) such that perturbations along this direction lead to high loss and potentially a different prediction [27]. To address this, in AT, the model is trained to correctly classify perturbed versions of the inputs. Formally, AT is formulated as a min-max program, searching for the best parameters θ of a classifier f_θ under the worst-case perturbation δ applied to an input x [11], i.e.,

$$\min_{\theta} \sum_{(x,y)\in D} \max_{\delta\in\Delta} \ell\big(f_\theta(x+\delta), y\big) \tag{1}$$

where ℓ is a loss function (e.g., cross-entropy loss) and ∆ is a constrained set of imperceptible perturbations. The outer optimization program is classically solved using gradient descent. Popular approaches for solving the constrained inner maximization program include the Fast Gradient Sign Method (FGSM) [10,27] and Projected Gradient Descent (PGD) [11]. In FGSM, the perturbation δ is computed via:

$$\delta = \alpha \, \mathrm{sgn}\big(\nabla_{x}\, \ell(f_\theta(x), y)\big) \tag{2}$$

with α being the learning rate and sgn the sign operator. PGD is a multi-step variant of FGSM. It starts at a randomly initialized perturbation in the feasible set ∆ and iteratively applies a gradient ascent update followed by a projection onto ∆:

$$\delta^{(t+1)} = \Pi_{\Delta}\Big(\delta^{(t)} + \alpha \, \mathrm{sgn}\big(\nabla_{\delta}\, \ell(f_\theta(x+\delta^{(t)}), y)\big)\Big), \quad t = 0, \dots, k-1 \tag{3}$$

where Π_∆ is the projection operator onto ∆ and k is the number of steps the algorithm runs.
AT is only defined for correctly classified samples. However, adversarial examples are generated following the same procedure for both correctly classified and misclassified samples. We will show that this makes the model more prone to overfitting.
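To make the FGSM and PGD procedures described above concrete, the following is a minimal NumPy sketch for a linear binary classifier with logistic loss. The function names, the choice of a linear model, and the numeric values are illustrative assumptions rather than details taken from the paper; ∆ is assumed to be an L∞ ball of radius `eps`.

```python
import numpy as np

def logistic_loss(theta, b, x, y):
    """Logistic loss log(1 + exp(-y * (theta @ x + b))) for a label y in {+1, -1}."""
    return np.log1p(np.exp(-y * (theta @ x + b)))

def logistic_loss_grad_x(theta, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * (theta @ x + b)
    return -y * theta / (1.0 + np.exp(margin))

def fgsm(theta, b, x, y, alpha):
    """Single signed-gradient step of size alpha."""
    return alpha * np.sign(logistic_loss_grad_x(theta, b, x, y))

def pgd(theta, b, x, y, eps, alpha, steps, rng):
    """Signed-gradient ascent with projection onto the L-inf ball of radius eps."""
    delta = rng.uniform(-eps, eps, size=x.shape)  # random start inside the ball
    for _ in range(steps):
        grad = logistic_loss_grad_x(theta, b, x + delta, y)
        # For an L-inf ball, the projection is a coordinate-wise clip.
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return delta

# Illustrative values (assumptions): a correctly classified point with label +1.
theta, b = np.array([1.0, 1.0]), -1.0
x, y = np.array([1.0, 1.0]), 1
delta_fgsm = fgsm(theta, b, x, y, alpha=0.1)
delta_pgd = pgd(theta, b, x, y, eps=0.4, alpha=0.1, steps=10,
                rng=np.random.default_rng(0))
```

For this correctly classified point, both perturbations move the sample towards the decision boundary and increase the loss, which is the behavior AT is designed around.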

Overfitting in AT
Different extensions to the standard AT objective (Eq. 1) have been proposed to alleviate overfitting, including Local Distributional Smoothness (LDS) [28][29][30], Margin Maximization (MM) [31,32], Label-Smoothing Regularization (LSR) [32] and Misclassification Awareness (MA) [25]. Objectives of notable methods from these different families are presented in Tab. 1. However, all these attempts try to fix overfitting without identifying the root causes behind it. Naturally, creating a truly robust model would require a deeper understanding of the source of this vulnerability. Recently, Dong et al. [14] have shown that overfitting is caused by generating 'hard' to classify adversarial samples that are memorized by the network, i.e., the network adjusts its boundaries in a complex way around these samples in order to produce correct classifications. The authors argue that these 'hard' adversarial samples are generated from samples close to the boundaries and hence their associated one-hot encoded ground-truth label is likely to be inaccurate or noisy. To fix this, the authors propose label smoothing of the adversarial samples. Yu et al. [15] empirically show that small-loss data is responsible for overfitting and propose a minimum loss constrained adversarial training procedure that increases the loss of small-loss data.
Local Distributional Smoothness (LDS) extends the robust optimization problem in Eq. 1 with a regularization term that encourages the label distribution around each sample point to be locally smooth. This is achieved by minimizing the KL divergence between the label distribution of the original sample and that of the corresponding adversarially generated example [28,29]. Intuitively, this forces the decision boundary to be sufficiently far away from the training samples [28,29]. AT objectives from this family include TRADES [29] and VILLA [30] (Lines 2 and 3 in Table 1). A major drawback of this approach is that it equally reinforces correct and incorrect predictions, i.e., it encourages the adversarial examples from misclassified samples to have the same label as the original (misclassified) samples. This results in a drop in generalization.
Margin Maximization (MM) methods propose to individually tune the perturbation strength for each sample. Intuitively, adopting a fixed level of perturbation for all the samples can potentially result in pushing adversarial examples further away from the boundary, causing them to be mixed with samples of other classes. This would force the model to learn a non-smooth and complex decision boundary. AT objectives from this class include MMA [26], IAAT [31], and CAT [32] (Lines 4-6 in Tab. 1). Note that MM methods only generate adversarial examples from correctly classified samples.
Label-Smoothing Regularization (LSR): One cause of overfitting is the use of one-hot encoded ground-truth labels for adversarial samples [14], which essentially forces the model to assign an overconfident probability of one to all samples of a given class. This can be noisy or inappropriate in the case of adversarial examples, as they are deliberately generated by pushing close-to-the-boundary samples over the boundary to the side of another class. To address this, Cheng et al. [32] propose applying a label-smoothing regularization depending on the perturbation tolerance of each sample.
Misclassification Awareness (MA): Wang et al. [25] empirically show that adversarial examples from misclassified samples result in a drop in robustness and propose extending the standard training loss with a misclassification aware regularization (Line 7 in Tab. 1). However, this has the same drawback as LDS approaches. In this work, we provide a deeper understanding of the drawback of the current practice of generating adversarial examples from misclassified samples and propose an easier fix that addresses the root cause.
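The LDS drawback noted above can be seen directly in a small sketch of a KL-based smoothness term. This is an illustration under stated assumptions, not the exact TRADES or VILLA objective: `lds_regularizer` is a hypothetical helper that only compares the model's predicted distributions on clean and perturbed inputs.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) for each row; tiny avoids log(0)."""
    return np.sum(p * (np.log(p + tiny) - np.log(q + tiny)), axis=-1)

def lds_regularizer(logits_clean, logits_adv):
    """Mean KL divergence between predictions on clean and perturbed inputs."""
    return kl_divergence(softmax(logits_clean), softmax(logits_adv)).mean()
```

Because the term takes no ground-truth label as input, it is minimized whenever the two predictions agree, whether or not the clean prediction is correct.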

Accuracy Aware Adversarial Training (A3T)
Given a data set D = {(x_i, y_i)}, we consider a K-class classification setup. The classifier is a θ-parameterized score function f_θ, where the i-th class is assigned the score f^i_θ(x). The predicted label of x is denoted as ŷ = arg max_i f^i_θ(x). The classic adversarial training problem is formulated as a min-max optimization problem [11]:

$$\min_{\theta} \sum_{(x,y)\in D} \max_{\delta\in\Delta} \ell\big(f_\theta(x+\delta), y\big) \tag{4}$$

Here ℓ is the loss function used for training the classifier, and ∆ is typically an L∞ norm ball bounded by ε. For linear binary classification problems, the inner maximization can be solved analytically. Specifically, suppose f_θ(x) = θx + b with class labels y ∈ {+1, −1} and logistic loss ℓ(f_θ(x), y) = log(1 + exp(−y f_θ(x))). The optimal perturbation is then δ* = −y ε sgn(θ). For nonlinear models (e.g., deepnets), the inner maximization has to be solved in an iterative fashion using PGD or its variants [11]. In the case of correctly classified training samples, the optimal perturbation results in (possibly misclassified) adversarial examples that are closer to the decision boundary than the original ones. However, in the case of misclassified samples, the same optimal perturbation results in adversarial examples that are further away from the decision boundary than the original samples. For high-capacity models, AT with such samples will result in an extremely non-smooth boundary with poor generalization.
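A3T Loss: To prevent overfitting, A3T aims at generating adversarial examples that are close to the decision boundary. To achieve this, it generates adversarial examples differently for correctly classified and misclassified samples, i.e., it solves different inner maximization objectives depending on the sample's prediction accuracy (Eq. 5): the inner maximization over δ ∈ ∆ uses the loss ℓ(f_θ(x + δ), y) for samples in D+_θ and the loss ℓ(f_θ(x + δ), ŷ) for samples in D−_θ. Here, D+_θ and D−_θ are the sets of correctly classified and misclassified examples, respectively. Note that the loss terms of the two inner maximizations have different arguments, i.e., y vs ŷ. In particular, for the misclassified examples, the loss uses the predicted label ŷ as its second argument. The optimal perturbation δ* can be computed using PGD as follows:

$$\delta^{(t+1)} = \begin{cases} \Pi_{\Delta}\Big(\delta^{(t)} + \alpha\, \mathrm{sgn}\big(\nabla_{\delta}\, \ell(f_\theta(x+\delta^{(t)}), y)\big)\Big) & \text{if } (x,y)\in D^{+}_{\theta} \\ \Pi_{\Delta}\Big(\delta^{(t)} + \alpha\, \mathrm{sgn}\big(\nabla_{\delta}\, \ell(f_\theta(x+\delta^{(t)}), \hat{y})\big)\Big) & \text{if } (x,y)\in D^{-}_{\theta} \end{cases} \tag{6}$$

Since ŷ = y holds for samples in D+_θ, Eq. 6 can be simplified to

$$\delta^{(t+1)} = \Pi_{\Delta}\Big(\delta^{(t)} + \alpha\, \mathrm{sgn}\big(\nabla_{\delta}\, \ell(f_\theta(x+\delta^{(t)}), \hat{y})\big)\Big) \tag{7}$$

Combining Eq. 5 and Eq. 7 results in an alternative training objective to Eq. 5 (Eq. 8), in which the perturbation of every sample is obtained from the inner maximization with respect to the predicted label ŷ.

Theorem 1 Let f_θ(x) = θx + b be a linear model trained with a logistic loss ℓ. Assume that (x_i, y_i) is a misclassified training example and that δ_i^AT and δ_i^A3T are the solutions to the inner maximization using standard AT (Eq. 4) and A3T (Eq. 5), respectively. Then |θ(x_i + δ_i^A3T) + b| ≤ |θ(x_i + δ_i^AT) + b|.

Proof See Appendix.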
Intuitively, Theorem 1 states that adversarial examples from misclassified samples generated using A3T are closer to the decision boundary than the ones generated using AT.
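The label-aware perturbation step described above can be sketched for a multi-class linear softmax classifier as follows. This is an illustration under assumptions, not the authors' implementation: `a3t_perturbation` and its helpers are hypothetical names, ∆ is taken to be an L∞ ball of radius `eps`, and the perturbation starts at zero rather than at a random point so that the example is deterministic. The only difference from standard PGD is that the loss is evaluated against the predicted label ŷ instead of the ground-truth label y.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(W, b, x, label):
    """Cross-entropy of the linear classifier W @ x + b against `label`."""
    return -np.log(softmax(W @ x + b)[label])

def ce_grad_x(W, b, x, label):
    """Gradient of the cross-entropy with respect to the input x."""
    p = softmax(W @ x + b)
    p[label] -= 1.0
    return W.T @ p

def a3t_perturbation(W, b, x, eps, alpha, steps):
    """PGD on the loss w.r.t. the predicted label (hypothetical helper)."""
    y_hat = int(np.argmax(W @ x + b))  # current prediction, right or wrong
    delta = np.zeros_like(x)           # assumption: zero init for determinism
    for _ in range(steps):
        grad = ce_grad_x(W, b, x + delta, y_hat)
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return delta, y_hat

# Illustrative 3-class model on 2-d inputs (values are assumptions).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([0.5, 0.2])
delta, y_hat = a3t_perturbation(W, b, x, eps=0.3, alpha=0.1, steps=5)
```

For a correctly classified sample this coincides with the usual PGD update, since ŷ = y. For a misclassified sample it moves the point away from the wrongly predicted class rather than further into it.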
Example 1: Consider a 2-d binary linear classification setup, where the decision boundary is given by x_1 + x_2 = 1 and labels are y = ±1. The classifier weights are therefore (θ_1, θ_2) = (1, 1). The adversarial perturbation is δ = −y ε sgn(θ). Suppose a data point (x_1, x_2) with label +1 is misclassified as −1. Then the traditional perturbation based on the original label yields (x_1 − ε, x_2 − ε), whereas the proposed approach yields (x_1 + ε, x_2 + ε). Thus the proposed approach pushes the adversarial sample corresponding to (x_1, x_2) towards the boundary.
Note that A3T, Label-Smoothing Regularization (LSR) and Margin Maximization (MM) are complementary. LSR addresses a different cause of overfitting, as explained in Sec. 2. MM is orthogonal, as it searches for the optimal perturbation set ∆ instead of the optimal perturbation δ in a predefined set ∆. We denote by A3T+ the approach that combines A3T, LSR and MM. The A3T+ objective and algorithm are given in Line 8 of Table 1 and Alg. 2, respectively.
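Example 1 can be checked numerically. The sketch below uses the boundary x_1 + x_2 = 1 from the example; the specific point (0.3, 0.4) and the budget ε = 0.1 are illustrative assumptions chosen so that the point has label +1 but is predicted as −1.

```python
import numpy as np

theta, b = np.array([1.0, 1.0]), -1.0   # decision boundary x1 + x2 = 1
eps = 0.1                                # assumed perturbation budget
x, y = np.array([0.3, 0.4]), 1           # score is -0.3, so the point is misclassified
y_hat = 1 if theta @ x + b > 0 else -1

def distance_to_boundary(p):
    """Euclidean distance from p to the hyperplane theta @ p + b = 0."""
    return abs(theta @ p + b) / np.linalg.norm(theta)

x_at = x - y * eps * np.sign(theta)       # standard AT: uses the true label
x_a3t = x - y_hat * eps * np.sign(theta)  # A3T: uses the predicted label

d_orig = distance_to_boundary(x)
d_at = distance_to_boundary(x_at)
d_a3t = distance_to_boundary(x_a3t)
```

With these values the AT sample lands at (0.2, 0.3), further from the boundary than the original point, while the A3T sample lands at (0.4, 0.5), closer to it.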

Improvement over Prior Work
So far, two misclassification aware adversarial training approaches have been proposed [25,26]. Wang et al. [25] were the first to explicitly recognize that misclassified and correctly classified samples should be treated differently and proposed a misclassification aware adversarial training method (MART). Their training objective consists of two separate loss terms, as given in Line 7 of Table 1. The first term corresponds to the adversarial loss used in conventional AT and does not distinguish between correct and misclassified examples. The second term is a KL divergence loss, introduced by [28] and weighted by (1 − f_θ(x)), effectively lowering the impact of correctly classified samples and encouraging label smoothness around misclassified examples. In the context of Figure 1, MART populates the space around the misclassified examples with new samples of both the true class and the incorrectly predicted class. The true class samples will be located in a direction that is maximally away from the decision boundary and will serve as hard examples for the model. In contrast, the regularization based label smoothing will add samples of the incorrect class isotropically. In effect, the latter group of samples neutralizes the impact of the hard examples created by the conventional AT method and prevents the model from exhibiting even worse overfitting behavior.

[Algorithm 1: A3T. Only the loop structure is recoverable: for each epoch e = 1, ..., E and each training sample, compute ŷ_i ← arg max_{j∈{1...K}} f^j_θ(x_i), then run the perturbation steps s = 1, ..., S.]
Ding et al. [26] use margin maximization to propose an implicit treatment of misclassified examples. The perturbation margin of each example is individually determined by gradually increasing the perturbation strength until an adversarial sample is generated. Since misclassified examples are adversarial in nature, they yield the lowest possible margin. Thus, in effect, adversarial samples created for misclassified examples are not significantly perturbed.
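The per-sample margin idea can be illustrated with a deliberately simplified search for a linear binary classifier. This is not the MMA algorithm of [26]: `smallest_flipping_budget` is a hypothetical helper that scans an assumed grid of budgets and returns the first one whose worst-case L∞ perturbation changes the prediction.

```python
import numpy as np

def smallest_flipping_budget(theta, b, x, y, eps_grid):
    """Return the first budget in eps_grid that flips the prediction, else None."""
    for eps in eps_grid:
        x_adv = x - y * eps * np.sign(theta)  # worst-case L-inf perturbation
        if y * (theta @ x_adv + b) <= 0:      # prediction no longer matches y
            return float(eps)
    return None

theta, b = np.array([1.0, 1.0]), -1.0
eps_grid = np.linspace(0.0, 1.0, 11)          # assumed grid: 0.0, 0.1, ..., 1.0

# Correctly classified point: a non-zero budget is needed to flip it.
margin_correct = smallest_flipping_budget(theta, b, np.array([0.9, 0.8]), 1, eps_grid)
# Misclassified point: it is already "adversarial", so the budget is zero.
margin_wrong = smallest_flipping_budget(theta, b, np.array([0.3, 0.4]), 1, eps_grid)
```

The misclassified point receives a zero budget, which matches the observation above that such examples are effectively left unperturbed by margin-based methods.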
Our approach, A3T, fundamentally differs from the above methods in that the adversarial samples created include easy samples that are placed towards the decision boundary, as opposed to creating no samples or samples that are further away from the boundary. These easy samples essentially serve to correct for any potential deformation in that part of the decision boundary. In other words, we attribute the cause of a misclassified example to the presence of opposing class examples in the immediate vicinity of that example, preventing the model from successfully fitting them. By creating samples that yield lower loss, our method reduces the effect of misclassified examples on the resulting decision boundary.

Results on Synthetic Data
To better observe the behavior of A3T, we evaluate it on a synthetic dataset introduced in [30]. This dataset contains 1,016 samples from two classes in a two-dimensional space over the real numbers. Samples in each class are created on two trajectories, each following the shape of a crescent moon. Each data point is then projected onto a 100-dimensional vector space. For classification, a model with 100 neurons is trained using only 16 randomly selected training samples per class. With such a small training sample size the model is prone to overfitting, thereby allowing a better evaluation of the generalization capability. The model is trained for 400 epochs at a learning rate of 0.01. For the first 100 epochs, the standard training loss is used, and in the remaining epochs the adversarial training loss is incorporated. Adversarial samples are created through a PGD attack (Eq. 3) after applying a random perturbation (normally distributed with zero mean and standard deviation of 0.05), with only five PGD steps, assuming α = 0.1 and ∆ = 0.4.
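A data setup of this kind can be sketched as follows. The exact generator of [30] and the projection used in the paper are not specified in the text, so the crescent parameterization, the noise level, and the Gaussian random projection below are all assumptions made for illustration; only the sample counts and dimensions follow the description above.

```python
import numpy as np

def two_crescents(n_per_class, noise, rng):
    """Two interleaved crescent-shaped classes in 2-d (assumed parameterization)."""
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)
    lower = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)
    x = np.concatenate([upper, lower])
    x = x + rng.normal(0.0, noise, size=x.shape)
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return x, y

rng = np.random.default_rng(0)
x2d, y = two_crescents(508, noise=0.05, rng=rng)   # 1,016 samples in total
projection = rng.normal(size=(2, 100))             # assumed random projection
x100 = x2d @ projection                            # 100-dimensional inputs

# 16 randomly selected training samples per class, as in the described setup.
train_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=16, replace=False) for c in (0, 1)
])
x_train, y_train = x100[train_idx], y[train_idx]
```

Keeping the 2-d coordinates alongside the projected inputs makes it possible to plot decision boundaries in the original plane while training in the 100-dimensional space.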
To verify our intuition on the operation of A3T (Fig. 1), we first show the resulting decision boundary for this binary classification task. As part of the tests, 20 models are generated for each training approach using the same set of training samples and fixed parameter values. Following the standard training (for 100 epochs), all models were found to correctly classify all but one (out of 32) training samples, overall exhibiting a good fitting capability. The decision boundaries are then drawn based on the positions of all data points classified with the least confidence, i.e., by averaging the locations of all training samples in all runs that were classified with confidence in the 0.49-0.51 probability range. Figure 3 shows the decision boundaries obtained after the first 100 epochs and when the training is completed as gray and black lines, respectively, for both AT and A3T. All adversarial samples created beyond epoch 100 are also displayed as orange and turquoise dots in the figure.
The difference between the conventional AT approach and A3T can best be seen around the misclassified examples of the red class, at the upper end of the crescent. In the case of A3T, the adversarial samples, i.e., the orange points around the misclassified examples, are created between the misclassified example and the decision boundary. For AT, in contrast, adversarial samples are created further away from the boundary. This further results in intermixing of red-class adversarial samples with the blue-class data points, which may force the model to learn an extremely non-smooth decision boundary. In other respects, it can be seen that the final decision boundaries (black lines) for AT and A3T are very similar, and therefore their robustness behavior may be expected to be on par.
To further examine the generalization behavior, we repeated the same experiment by randomly selecting the training points for each run and averaging results over 50 runs. Figure 4 shows all training points used as well as the corresponding decision boundaries. (Note that training examples remain the same for both AT and A3T within a run, but they change across runs.) Here, the generalization capability of A3T is more visible, as it is able to correctly classify most of the blue-class points. In the case of red-class data points, both methods perform similarly. These observations are also reflected in the measured classification accuracies, where A3T achieved 83.6%, i.e., 1.2% better than AT.

Results on Real Data
We now assess the robustness and generalization trade-off yielded by A3T on tasks related to vision, natural language processing, and tabular data. Models obtained by standard training and adversarial training are evaluated under attack-free and attack circumstances. The models are attacked using adversarial samples produced by PGD [11] and AutoAttack [33].

Vision Tasks
Most adversarial training methods are proposed to improve the robustness of image classifiers. The effectiveness of these methods has been evaluated on the CIFAR-10 dataset using fixed architectures such as WideResNet-34-10 and WideResNet-28-10 with varying attack and training parameters. Therefore, in our tests, we also consider a similar setting.
MART is the only other method that explicitly formulates a treatment for misclassified examples during adversarial training. Therefore, we start by comparing A3T against MART on how they cope with overfitting using the standard cross-entropy loss. In their paper, the authors of MART use a boosted cross-entropy loss and report a significant improvement in accuracy (∼3%) compared to the use of the standard cross-entropy loss. Since the boosted loss can be generically incorporated into all adversarial training approaches, we only considered the use of the cross-entropy loss to ensure a comparison on an equal footing. To be in line with the previous methods, we trained all models for a total of 100 epochs using SGD with momentum 0.9, weight decay 2 × 10^−4, and an initial learning rate of 0.1, which we decay by 90% at the 75th and 90th epochs. Table 2 provides the corresponding accuracy results for standard and under-attack settings. As can be seen, the most notable difference concerns the natural accuracy values, where A3T performed 4.7% better than MART. This can be attributed to A3T's ability to better mitigate overfitting, which mainly reveals itself in the model's generalization ability. However, this gain comes at the cost of reduced robustness, where MART is seen to perform 2-3% better. When compared to conventional AT, A3T improves both natural and robust accuracy by around 2%.
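The step schedule described above can be written as a small function. This is a sketch under one assumption not stated in the text: epochs are counted from 0, so the first decay takes effect at the start of epoch index 75. The function name and signature are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(75, 90), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor  # a decay by 90% keeps 10% of the previous rate
    return lr

# Learning rate for each of the 100 training epochs.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under this convention the rate is 0.1 for epochs 0-74, 0.01 for epochs 75-89, and 0.001 for the remaining epochs.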
Next, we compare A3T with several adversarial training methods that leverage the ideas of MM [26,32], LSR [31,32,34], and LDS [29], as well as the A3T+ method that incorporates the MM and LSR approaches [32]. Here we determined that the results reported by these methods include some inconsistencies, making a fair comparison challenging. These include the learning-rate schedule used by a method, which affects the proneness to robust overfitting. (That is, schedules that include a larger number of training epochs are expected to report lower final accuracy values.) Another factor relates to the PGD step size used during attacks, which determines the strength of the attack applied to a model. To circumvent these ambiguities, we decided to use the robustness results reported by the AutoAttack (AA) benchmark and evaluated A3T and A3T+ accordingly. Table 3 provides the corresponding results for several adversarial training methods. These results demonstrate that A3T and A3T+ yield the highest natural accuracy after the Bilateral AT method, which exhibited very limited robustness under AA attack. In terms of robust accuracy, CAT yields the best accuracy, performing only marginally better than A3T+ (+0.2%) and noticeably better than A3T (+3.7%). However, both A3T and A3T+ yield a higher natural accuracy compared to CAT (by 1.6-2.8%). These results overall present a more granular view of the generalization vs. robustness trade-off, with MM and LSR allowing a shift in favor of robustness, while misclassification awareness and LDS tilt the balance more towards improved generalization. When the average accuracy is evaluated, it is seen that A3T+ offers a better trade-off between natural and AA accuracy as it allows an increase in the former

Natural Language Processing Tasks
Tests are performed on the GLUE tasks [36]. This benchmark contains seven datasets for two sentiment analysis tasks, two similarity tasks and three inference tasks. Due to the discrete nature of text, generating adversarial samples in the input space is nontrivial because it involves projecting the continuous perturbation computed in the embedding space to the input space. Although several input-domain approaches have been proposed to create adversarial samples, they are not efficient [37]. An alternative approach is to perform adversarial training in the latent space without the need for creating adversarial inputs [30]. In our experiments, we also adopt this approach both during adversarial training and when performing adversarial attacks. It must be noted that, in the case of launching an adversarial attack, this approach is impractical to implement, but it nevertheless corresponds to a worst-case attack setting and allows a better evaluation of the model's robustness.
For tests, we fine-tuned the DeBERTa-base model [38] from the HuggingFace library. For each dataset in the GLUE benchmark, we created a fine-tuned, task-specific model by first training the model for three epochs at the suggested learning rate of 2e-5 [38]. Then, the first five layers are frozen and fine-tuning is continued for another three epochs with either standard or adversarial training. In all cases, adversarial samples are generated using three-step PGD assuming ∆ = 0.01, using random initialization with zero mean and standard deviation of 0.005. Table 4 reports the classification performance yielded by various models under both normal and adversarial attack scenarios to evaluate the improvement of A3T over AT. As compared to the baseline accuracy of the model, i.e., the attack-free test setting, A3T yields a small drop in the average performance (-2.5%) but performs noticeably better (+3.6%) than AT, as displayed in the last line of the table. Similarly, under an adversarial attack test setting, A3T yields an improvement of 1.5% over conventional AT. The improvement due to A3T is more noticeable when the results of the CoLA task are removed from the average, as it uses a different metric than accuracy (third to the last line). In that case, the use of A3T is found to result in a performance improvement of 2.8% over AT.

Tabular Tasks
The robustness of machine learning models that make financial decisions against adversarial attacks is another important area of concern. Since financial data is mostly in tabular format, we tested the effectiveness of A3T on tabular data for two classification tasks. The first test involves Matlab's Retail Credit Panel dataset [39], and the task is to predict the overall yearly default rate for subjects. Each dataset sample contains the risk factor, year, and default status (0 or 1) for a subject. To predict what percentage of subjects default in a given year, a logistic regression model with two hidden layers, consisting of 256 and 128 neurons, is trained over 500 epochs with a learning rate of 0.001. The adversarial training is only applied between epochs 100 and 500, using the attack parameters obtained after a grid search.
Since the input data is very low-dimensional, no adversarial attack is performed. Instead, we investigated how adversarial training impacts the model's performance in the attack-free setting. Figure 5 provides the average default rates predicted by different models after 50 runs in comparison to the ground truth rates. Results show that A3T predictions follow the realized default rates more closely than other approaches. To better evaluate the fit of each model, the mean squared error (MSE) of predictions with respect to actual rates is also computed across all years. Accordingly, A3T is seen to yield the lowest MSE among all approaches, even outperforming the non-robust model.
The second task is binary classification and involves two datasets, namely, the European card dataset (ECD) [40] and the Adult dataset [41]. The former dataset contains 492 fraud and 10K genuine transactions randomly downsampled from more than 280K transactions with 31 features, and the goal is to identify the type of transaction. The latter involves 32K samples with nine features, and the objective is to predict whether the income of a subject is higher than $50K or not. All categorical values are represented by one-hot encoding, and logistic regression models with the same network architecture described above are trained. Rather than using a fixed adversarial training setting, multiple robust models with different training parameters are generated. Each model is then tested under 12 adversarial attack scenarios parameterized by the same parameter value configurations considered in the grid search, and the average of the resulting accuracy values is taken as the model's under-attack prediction accuracy. The standard and under-attack accuracy achievable by a model is finally determined by averaging the corresponding values over five runs. The resulting accuracy values are plotted in Fig. 6.
In the figure, the upper right corner corresponds to a performance regime where a model performs well both in attack-free and under-attack settings. Thus, the best performing model can be identified based on how close it gets to that corner. It can be seen that several robust versions of A3T exhibit high under-attack performance with only a slight drop in performance in the attack-free setting. This result also demonstrates the importance of the choice of adversarial training parameters on model accuracy.

Discussion and Conclusions
Adversarial training uses adversarial samples to retrain machine learning models in order to promote robustness. The robustness of a model can, however, only be attained at the expense of model generalization. In this work, we propose a new adversarial training method that yields a more favorable robustness-generalization trade-off. We discuss further points below to provide better insight into our proposed approach.
Generation of non-adversarial samples. The underlying idea of our misclassification-aware adversarial training approach (A3T) is to prevent the creation of hard examples from misclassified samples, thereby reducing the risk of overfitting. A3T effectively realizes this by generating samples that are non-adversarial in nature. That is, instead of applying a perturbation that maximizes the loss, A3T computes a perturbation that minimizes the loss, defeating the purpose of an adversarial sample. Hence, the samples newly generated by A3T are maximally close to the decision boundary, as opposed to adversarial samples that are maximally away from the boundary. From this perspective, imposing a bound on the extent of loss reduction may not seem meaningful. However, since samples with arbitrarily low loss values will constitute easy examples, they will likely be less informative for the model. Therefore, we also evaluated how much to reduce the loss of a misclassified sample. Our test results indicated that A3T's generalization ability does not improve when generated samples have a loss similar to that of the misclassified one. Tests also indicated that a loss reduction around ∆ makes A3T most effective.
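For the binary logistic case used in Theorem 1, the statement that A3T's perturbation minimizes rather than maximizes the loss on the true label can be spelled out in a few lines. This is a short sketch under the assumptions y ∈ {+1, −1}, ŷ = −y for a misclassified sample, and logistic loss; it does not cover the multi-class case.

```latex
\begin{align*}
\ell\big(f_\theta(x+\delta), y\big) &= \log\big(1 + e^{-y f_\theta(x+\delta)}\big), \\
\ell\big(f_\theta(x+\delta), \hat{y}\big) &= \log\big(1 + e^{\,y f_\theta(x+\delta)}\big)
  \qquad (\hat{y} = -y), \\
\arg\max_{\delta \in \Delta} \ell\big(f_\theta(x+\delta), \hat{y}\big)
  &= \arg\max_{\delta \in \Delta} \; y f_\theta(x+\delta)
   = \arg\min_{\delta \in \Delta} \ell\big(f_\theta(x+\delta), y\big).
\end{align*}
```

The second line is increasing in y f_θ(x + δ) while the first is decreasing in it, so maximizing the loss against the predicted label selects the same perturbation as minimizing the loss against the true label.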
Incorporation of LDS with A3T. Since MM, LSR, LDS, and MA are proposed to address different aspects of adversarial training, a question to be answered is if and to what extent these design choices interfere with each other. In fact, our results with A3T+ show that combining MM, LSR, and MA helps achieve a more preferable generalization vs. robustness trade-off. However, incorporating LDS with A3T, i.e., using A3T during inner maximization and LDS as an additional regularization, will be ineffective. Crucially, LDS applies label smoothing by placing samples of the predicted class, which for a misclassified sample includes opposing class samples, around the misclassified sample. In contrast, A3T generates samples of the same class as the misclassified sample towards the decision boundary. Hence, incorporating the two together will potentially contribute to the model's overfitting, as samples of opposing classes will be placed in the same vicinity.
Influence on robust overfitting. Robust overfitting is another artifact of adversarial training wherein a model's test accuracy remains similar but the robust accuracy exhibits a drop as the number of training epochs increases [13,14]. Our observations show that A3T also suffers from robust overfitting. This is in agreement with the findings of MART, where a larger variation between best and last accuracies is observed compared to A3T. Hence, we can deduce that the root cause of this phenomenon does not mainly relate to the treatment of misclassified examples.

Acknowledgement
This work is partially supported by the Qatar National Research Fund (QNRF) grant NPRP11C-1229-170007.
Proof. First note that if (x_i, y_i) is a training example with y_i ∈ {+1, −1}, then the misclassified label is 1 − 2y_i. Also, note that the distance between a data point x_0 and the decision boundary θx + b is given by |θx_0 + b| / ‖θ‖.
Based on the assumption that x_i is misclassified, then
Since L is the logistic loss, it is a monotonically decreasing function; therefore

for (x_i, y_i) ∈ D do
5: ŷ_i ← arg max f(x_i, θ)
if arg max f_θ(x_i + δ_i) = y_i then
MM_i ← min( max , i ) MM
end for
20: end procedure

Fig. 1 (a) In adversarial training, adversarial examples generated from misclassified samples are always placed maximally away from the decision boundary, making the model more prone to overfitting; (b) A3T adjusts the adversarial objective to generate adversarial examples close to the decision boundary and thus encourages learning smoother loss landscapes, which yields better generalization.

Fig. 2 Adversarial training of a linear classifier using standard AT and A3T on a two-dimensional synthetic dataset. The black, red, and green lines represent the decision boundaries of the models trained with standard training, AT, and A3T, respectively. Samples A and B are misclassified by all the models. A3T places the adversarial examples generated from A and B (A_A3T and B_A3T) closer to the boundary than A_AT and B_AT, which are generated using AT. We omit correctly classified training examples for better visualization.

Example 2:
Consider the linear classification setup presented in Figure 2. The standard training decision boundary corresponds to the black line. The AT decision boundary is shown in orange. A3T results in the green decision boundary. Note that the A3T boundary is closer to the standard boundary than the AT one. This is due to the adversarial examples from misclassified samples A and B being placed closer to the boundary in the case of A3T. This also implies that the performance of A3T on clean test samples is closer to that of the original classifier.
A3T algorithm: For completeness, the full algorithm is shown in Alg. 1.

Algorithm 1
Algorithm of A3T
Require: E: the number of epochs; D = {(x_i, y_i)}_{i=1}^{n}: the dataset; f_θ(x): the machine learning model parametrized by θ; δ: the perturbation, initialized by σ and limited by ε; τ: the global learning rate; α: the adversarial learning rate; S: the number of PGD steps; Π: the projection function.
Ensure: Model parameters θ
1: procedure A3T
2:
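Since only the inputs of Algorithm 1 are listed here, the following is a rough sketch of how a training loop with these inputs could look, not a reconstruction of the algorithm itself. It assumes a linear logistic model with labels in {+1, −1}, plain SGD with learning rate τ (`tau`), an L∞ projection Π (`project`), uniform initialization of δ in [−σ, σ], and S signed-gradient steps of size α that ascend the loss for correctly classified samples and descend it for misclassified ones. The MM and LSR components are omitted, and all function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sign(v):
    return (v > 0) - (v < 0)

def project(delta, eps):
    # Assumed projection: clip each coordinate of delta to [-eps, eps]
    return [max(-eps, min(eps, d)) for d in delta]

def a3t_sketch(data, epochs, eps, sigma, tau, alpha, steps, seed=0):
    rng = random.Random(seed)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            # Accuracy check on the clean sample picks the direction:
            # +1 ascends the loss (correct), -1 descends it (misclassified).
            correct = (1 if dot(w, x) + b >= 0 else -1) == y
            direction = 1.0 if correct else -1.0
            delta = [rng.uniform(-sigma, sigma) for _ in range(dim)]
            for _ in range(steps):
                xp = [xi + di for xi, di in zip(x, delta)]
                # dL/dz for L = log(1 + exp(-y z)), with z = w.xp + b
                coeff = -y * sigmoid(-y * (dot(w, xp) + b))
                grad_x = [coeff * wj for wj in w]
                delta = project(
                    [d + direction * alpha * sign(g)
                     for d, g in zip(delta, grad_x)], eps)
            # SGD step on the perturbed sample
            xp = [xi + di for xi, di in zip(x, delta)]
            coeff = -y * sigmoid(-y * (dot(w, xp) + b))
            w = [wj - tau * coeff * xj for wj, xj in zip(w, xp)]
            b -= tau * coeff
    return w, b
```

The only difference from a standard PGD-based training loop in this sketch is the `direction` flag, which is where the accuracy awareness enters.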

Fig. 3 The orange and turquoise dots show adversarial samples created by the AT and A3T methods. For each method, decision boundaries are determined based on the position of test points that are assigned to the correct class with a probability of around 0.5. The gray line is obtained by standard training (fixed after epoch 100) and the black line corresponds to adversarial training (obtained after epoch 400). A3T pushes the adversarial samples of the misclassified example (orange-colored sample) closer to the decision boundary.

Fig. 4 Resulting decision boundaries obtained after 50 runs when training samples are picked at random for each run. The improved generalization capability of A3T is reflected in the measured accuracy and can also be seen in the way it classifies test samples more accurately than AT.

Fig. 5 Yearly default rates predicted by robust and standard models, averaged over 50 runs for each model. MSE values are computed between model predictions and realized, ground-truth rates. Computed values are normalized by e−5 for ease of viewing.

Fig. 6 Model accuracy values for both normal and under-attack settings, obtained by averaging over five runs. For both AT and A3T, 12 models are created for varying parameter values that govern the adversarial sample generation process. A3T is found to yield the most favorable performance, exhibiting high accuracy in both attack-free and under-attack scenarios.

Table 1
Optimization objectives for AT methods and their main characteristics.

Table 2
Comparison of the two misclassification-aware adversarial training methods based on results obtained on the CIFAR-10 dataset using the WideResNet-34-10 model². A3T results in better generalization than MART at similar levels of robustness.

Table 4
Adversarial training results on the GLUE benchmark. Results show that, for all tasks, A3T performs on par with or better than conventional AT.

while keeping the robust accuracy on par with other methods. This finding further strengthens the idea that MM, LSR, and misclassification awareness address different aspects of adversarial training and that they are indeed compatible with each other.