1 Introduction

Deep neural networks have achieved considerable success in many fields (He et al. 2016; Devlin et al. 2018; Mnih et al. 2015), such as computer vision, natural language processing, and reinforcement learning, and have emerged as a transformative force due to their remarkable efficacy and broad applicability. However, these powerful models are vulnerable to imperceptible perturbations (Goodfellow et al. 2014), i.e., adversarial examples (AEs). An AE, denoted as \(x^{\prime }\), can be crafted by adding an adversarial perturbation \(\varvec{\delta }\) to a natural example x, i.e., \(x^{\prime } = x + \varvec{\delta }\). Adversarial attacks typically cause the classifier \(h_{\varvec{\theta }}\) to make an incorrect prediction. Such a perturbation \(\varvec{\delta }\) is small and imperceptible to humans, bounded by an \(L_p\)-norm ball, which can be written as \(\left\| x^{\prime } - x \right\| _p \le \epsilon\).

AEs were first introduced by Szegedy et al. (2013), which made the community aware of the vulnerability of neural networks and inspired the development of adversarial defenses, including defensive distillation (Papernot et al. 2016), feature squeezing (Xu et al. 2017), and adversarial training (AT) (Goodfellow et al. 2014). Among them, AT has been regarded as the most powerful one (Athalye et al. 2018). The basic idea of AT is to incorporate both natural examples and AEs during the training stage, enabling models to perform better against AEs compared to standard training (ST). Formally, AT can be formulated as a min-max problem, i.e.,

$$\begin{aligned} \min _{\varvec{\theta }} \mathbb {E}_{(\varvec{x}, y) \sim \mathcal {D}}\left[ \max _{\Vert \varvec{\delta }\Vert _p \le \epsilon } L\left( h_{\varvec{\theta }}(\varvec{x}+\varvec{\delta }), y\right) \right] , \end{aligned}$$
(1)

where the inner maximization searches for perturbations that maximize the loss, while the outer minimization optimizes the network parameters. Madry et al. (2017) proposed a multi-step gradient-based attack, known as the PGD attack, to solve the inner maximization of AT; it significantly improves the adversarial robustness of neural networks against various attacks and has been deemed the standard baseline method of AT, referred to as PGD-AT in this paper. Building on this idea, researchers have proposed various variants, such as TRADES (Zhang et al. 2019), MART (Wang et al. 2019), AWP (Wu et al. 2020), and S\(^2\)O (Jin et al. 2022). These methods broadly follow three directions: objective functions, data augmentation, and weight perturbation. However, most of them do not take the presence of noisy labels into account, whereas real-world datasets are reported to have an inherent label noise rate between 8% and 38.5% (Xiao et al. 2015).

Therefore, designing an effective AT algorithm in the presence of inherent label noise is a nontrivial research challenge, yet it remains under-explored by the community. Although some literature discusses the relationship between AT and noisy labels (Zhu et al. 2021; Zhang et al. 2021; Dong et al. 2021), these works are mainly concerned with how to strengthen AT by injecting noisy labels, rather than with AT under inherent label noise, i.e., noisy labels that already exist in the original dataset. This paper makes the first attempt to tackle this research challenge. Two metrics are important for evaluating the performance of AT: (1) the natural-robust trade-off, i.e., the trade-off between natural accuracy and robust accuracy; and (2) the extent of robust overfitting, i.e., the phenomenon that robust accuracy decreases after a certain training epoch while natural accuracy keeps increasing or remains relatively constant, which in turn hinders the natural-robust trade-off.

First, we empirically evaluate the performance of three recent AT methods on CIFAR-10 with injected inherent noisy labels. We observe that both natural accuracy and robust accuracy decrease significantly with increasing noise rate, across all AT methods. Furthermore, in the presence of inherent label noise, the natural accuracy also begins to decline after a specific training epoch, i.e., natural overfitting, in addition to the already observed robust overfitting; we call this phenomenon “double overfitting” in this paper. Conversely, when there is no inherent label noise, the natural accuracy consistently improves or remains stable throughout training. The Cross-Entropy (CE) loss, although widely used in the aforementioned mainstream AT methods, has been shown to be non-robust to label noise (Feng et al. 2021), which may degrade their generalization performance. To address this issue, we propose incorporating noisy-robust loss functions into AT to enhance generalizability in the presence of label noise. The performance of these methods is shown in Fig. 1.

Fig. 1

The learning curves of natural accuracy and PGD robust accuracy for PGD-AT, MART, and TRADES under 0% (natural/robust), 20% and 40% inherent symmetric label noise on CIFAR-10 with \(\ell _{\infty }\) threat model

To accurately assess the true performance of a model trained with inherent label noise, the common practice is to train the model on a training dataset that contains noisy labels and then evaluate it on a clean test dataset without any noisy labels. This ensures that the model's performance is evaluated correctly: if the test dataset also contained a proportionate amount of noisy labels, the label noise would confound the evaluation and the measured accuracy would not reflect the model's actual effectiveness. Figure 2 provides an overview of AT with inherent label noise, which can be seen as a general framework for our setting, including the noisy training set, adversarial example generation, the classifier, and prediction on the clean test dataset. Our main contributions are as follows:

Fig. 2

Overview of AT with inherent label noise

\(\bullet\) We investigate AT with inherent label noise and observe that it is typically unstable and prone to poor generalization. Furthermore, we empirically identify the occurrence of the “double overfitting” phenomenon, where both the natural accuracy on natural examples and the robust accuracy on AEs start to decline after a certain training stage;

\(\bullet\) From the perspective of objective functions for AT with inherent label noise, we replace the non-robust CE loss with a noisy-robust loss function and further propose Noisy-Robust Adversarial Training (NRAT);

\(\bullet\) Theoretically and empirically, we demonstrate that NRAT achieves comparable or superior performance to recent AT methods when dealing with inherent label noise.

2 Related works

2.1 Adversarial training algorithms

This section introduces three widely recognized AT algorithms, which are PGD-AT (Madry et al. 2017), TRADES (Zhang et al. 2019), and MART (Wang et al. 2019) (We use \(h_{\varvec{\theta }}\) to denote the classifier with model parameter \(\varvec{\theta }\)). First, we elaborate on their objective functions employed during the training process. Subsequently, we present the objective function of our NRAT and conduct a comparative analysis.

PGD-AT The idea of PGD-AT is straightforward: it first generates AEs and then directly optimizes the loss on them. Despite its simple and intuitive formulation, it has been empirically shown to achieve excellent adversarial robustness. The training loss function is given by

$$\begin{aligned} \ell ^{PGD-AT}\left( \textbf{x}_{i}, y_{i}, \varvec{\theta }\right) ={\text {CE}}\left( h_{\varvec{\theta }} \left( \textbf{x}_{i}^{\prime }\right) ,y_{i}\right) . \end{aligned}$$
(2)
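For concreteness, the following is a minimal PyTorch-style sketch of the loss in Eq. (2), not the authors' reference implementation; the `pgd_attack` helper is a hypothetical function performing the inner maximization (a concrete sketch of such a function is given in Sect. 3.1).

```python
import torch.nn.functional as F

def pgd_at_loss(model, x, y, pgd_attack):
    """Eq. (2): cross-entropy evaluated on adversarial examples.

    `pgd_attack` is a hypothetical helper implementing the inner
    maximization (see the PGD sketch in Sect. 3.1); it returns x' = x + delta.
    """
    x_adv = pgd_attack(model, x, y)           # inner maximization
    logits_adv = model(x_adv)                 # h_theta(x')
    return F.cross_entropy(logits_adv, y)     # CE(h_theta(x'), y_i)
```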

TRADES TRADES aims to trade off natural accuracy and robust accuracy by employing the CE loss for natural examples and incorporating a KL-divergence regularization term for adversarial examples. Its objective function can be written as

$$\begin{aligned} \ell ^{TRADES}\left( \textbf{x}_{i}, y_{i}, \varvec{\theta }\right) =\sum _{i=1}^{n} \left\{ {\text {CE}}\left( h_{\varvec{\theta }}\left( \textbf{x}_{i}\right) , y_{i}\right) +\beta \cdot \max _{\textbf{x}_{i}^{\prime } \in \mathcal {S}} {\text {KL}}\left( h_{\varvec{\theta }}\left( \textbf{x}_{i}\right) \Vert h_{\varvec{\theta }}\left( \textbf{x}_{i}^{\prime }\right) \right) \right\} , \end{aligned}$$
(3)

where the first term aims to maximize natural accuracy, and the second term aims to improve robust accuracy by minimizing the distance between the predictions of natural examples and adversarial examples, thereby encouraging the outputs to be smooth. The hyperparameter \(\beta\) controls the trade-off between natural accuracy and robust accuracy.
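A hedged PyTorch-style sketch of the TRADES objective in Eq. (3) follows; the adversarial example \(\textbf{x}_i^{\prime }\) is assumed to have been generated beforehand by (approximately) maximizing the KL term, and `beta` is the trade-off hyperparameter.

```python
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, beta=6.0):
    """Eq. (3): CE on natural examples plus beta * KL(p(x) || p(x')).

    x_adv is assumed to be pre-generated by maximizing the KL term
    (the inner max in Eq. (3)); beta controls the natural-robust trade-off.
    """
    logits_nat = model(x)
    logits_adv = model(x_adv)
    ce = F.cross_entropy(logits_nat, y)
    # KL(p_nat || p_adv): target = natural distribution, input = log p_adv
    kl = F.kl_div(F.log_softmax(logits_adv, dim=1),
                  F.softmax(logits_nat, dim=1),
                  reduction='batchmean')
    return ce + beta * kl
```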

MART The fundamental idea behind MART is to treat misclassified and correctly classified examples as distinct instances and assign different optimization directions for them. The training loss can be formulated as follows

$$\begin{aligned}{} & {} \ell ^{MART}\left( \textbf{x}_{i}, y_{i}, \varvec{\theta }\right) =\text {BCE} \left( h_{\varvec{\theta }}\left( \textbf{x}_{i}^{\prime }\right) ,y_{i}\right) \nonumber \\{} & {} \quad +\lambda \cdot \text {KL}\left( h_{\varvec{\theta }}\left( \textbf{x}_{i}\right) \Vert h_{\varvec{\theta }}\left( \textbf{x}_{i}^{\prime }\right) \right) \cdot \left( 1-h_{\varvec{\theta }}\left( \textbf{x}_{i}\right) \right) , \end{aligned}$$
(4)

where BCE denotes the boosted cross-entropy, which can be written as

$$\begin{aligned} \text {BCE}\left( h_{\varvec{\theta }} \left( \textbf{x}_{i}^{\prime }\right) ,y_{i}\right) = \text {CE}\left( h_{\varvec{\theta }} \left( \textbf{x}_{i}^{\prime }\right) ,y_{i}\right) -\log \left( 1-\max _{k \ne y_{i}} \varvec{p}_{k}\left( \textbf{x}_{i}^{\prime }, \varvec{\theta }\right) \right) , \end{aligned}$$
(5)

where \(\varvec{p}_{k}\left( \textbf{x}_{i}^{\prime }, \varvec{\theta }\right)\) denotes the softmax probability that the model assigns to class k for input \(\textbf{x}_{i}^{\prime }\), so the second term penalizes the largest probability assigned to any wrong class. Empirically, BCE can mitigate the insufficient learning of CE to some extent. The hyperparameter \(\lambda\) balances the influence of misclassified and correctly classified examples.
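A minimal sketch of the BCE and MART losses in Eqs. (4)-(5) is given below; it follows the formulas above rather than any particular reference implementation, and again assumes a pre-generated adversarial batch.

```python
import torch
import torch.nn.functional as F

def bce_loss(logits_adv, y):
    """Eq. (5): CE plus a margin term on the largest wrong-class probability."""
    probs = F.softmax(logits_adv, dim=1)
    ce = F.cross_entropy(logits_adv, y)
    # zero out the true class, then take the largest remaining probability
    wrong_probs = probs.clone()
    wrong_probs.scatter_(1, y.unsqueeze(1), 0.0)
    margin = -torch.log(1.0 - wrong_probs.max(dim=1).values + 1e-12)
    return ce + margin.mean()

def mart_loss(model, x, x_adv, y, lam=5.0):
    """Eq. (4): BCE on adversarial examples plus a KL term weighted by 1 - p_{y_i}."""
    logits_nat, logits_adv = model(x), model(x_adv)
    p_nat = F.softmax(logits_nat, dim=1)
    # per-example KL(p_nat || p_adv)
    kl = (p_nat * (F.log_softmax(logits_nat, dim=1)
                   - F.log_softmax(logits_adv, dim=1))).sum(dim=1)
    weight = 1.0 - p_nat.gather(1, y.unsqueeze(1)).squeeze(1)
    return bce_loss(logits_adv, y) + lam * (kl * weight).mean()
```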

2.2 Interactions between adversarial training and noisy labels

Noisy labels are unavoidable in real-world datasets due to errors in manual annotation or annotation platforms (Xiao et al. 2015). Accordingly, research on learning with noisy labels has emerged. While the analysis of noisy labels in ST has been extensively explored (Li et al. 2017; Natarajan et al. 2013), researchers have recently started investigating the relationship between noisy labels and AT, i.e., AT with noisy labels. Zhu et al. (2021) explored the distinctions between AT and ST in the presence of noisy labels from the perspective of the smoothing effects of AT and the loss landscape. Zhang et al. (2021) proposed NoiLIn, which gradually injects noisy labels into both the inner maximization and outer minimization stages of AT to improve adversarial robustness. Dong et al. (2021) studied AT under random labels (almost 100% noisy labels), identified the memorization of one-hot labels as the cause of robust overfitting, and adopted temporal ensembling to mitigate this overfitting. In short, these works explore the properties of noisy labels in AT and how to enhance AT's performance on clean datasets by introducing noisy labels. Inherent label noise, in contrast, refers to the situation where the dataset itself already contains noisy labels, which is the setting studied in this paper.

2.3 Robust loss functions and learning with noisy labels

It has been demonstrated that a DNN trained with a suitably adjusted loss function \(\mathcal {L}\), namely a noisy-robust loss function (referred to as a robust loss function hereafter), can approach the best classifier \(h_{\varvec{\theta }}\) under some mild assumptions for both symmetric and asymmetric label noise (Ghosh et al. 2017). These robust loss functions satisfy the following equation (for a K-class classification problem, \(K>1\), and any training example x)

$$\begin{aligned} \sum _{j=1}^{K} \mathcal {L}(h(\varvec{x}), j)=C, \end{aligned}$$
(6)

where C is a constant. Equation (6) indicates that these loss functions are symmetric and considered noise-tolerant according to the definitions in Ghosh et al. (2017). Although many loss functions satisfy this symmetry, the most commonly used CE loss does not. Recent studies (Zhang and Sabuncu 2018; Amid et al. 2019) have demonstrated that adopting robust loss functions is the most straightforward and generic approach for effectively training deep neural networks with inherent label noise. Specifically, Ma et al. (2020) proposed the robust loss function NCE\(+\)RCE following an active-passive loss (APL) framework, which currently achieves state-of-the-art performance in ST. For a K-class classification task, NCE (Ma et al. 2020) is the normalized version of CE, and RCE (Wang et al. 2019) is the reverse version of CE. We provide a more detailed explanation of both NCE and RCE in the next section.

In addition to noisy-robust loss functions, several other methods are available for learning under label noise, including label correction (Zheng et al. 2021) and collaborative learning (Han et al. 2018). However, these methods often involve complicated procedures, making their application to AT challenging, especially since AT already requires significant computational resources and such additions may incur further costs. Therefore, in this paper, we focus on realizing AT under inherent label noise using a robust loss function.

3 Noisy-robust adversarial training

In the previous sections, we discussed robust loss functions that exhibit the symmetry property. Based on this, we present a novel perspective on AT with inherent label noise, i.e., replacing the non-robust loss function with a robust counterpart to enhance the performance of AT in the presence of inherent label noise. Finally, we conduct a comprehensive comparison between our proposed NRAT and existing approaches.

3.1 Basic notation of AT with inherent label noise

For a K-class classification task, let \(X=\{(x_i, y_i)\}_{i=1,\ldots ,n}\) be the training dataset drawn from an input distribution \(\mathcal {D}\) with n training instances, where \(x_i\in \mathbb {R}^d\) represents a natural example and \(y_i\in \{1,\ldots , K\}\) denotes its annotated label, which may be incorrect; we therefore denote by \(y_i^*\) the true label of \(x_i\). We use \(\varvec{q}(k \mid \varvec{x})\) to represent the label distribution of sample x over classes \(k\in \{1,\ldots ,K\}\), with \(\sum _{k=1}^{K} \varvec{q}(k \mid \varvec{x})=1\). We consider two types of label noise, symmetric and asymmetric, with an overall noise rate \(\eta \in [0, 1]\). For each class j flipped to k, we denote the class-wise noise rate by \(\eta _{jk}\). Symmetric label noise means that each label has the same probability of being flipped to any other class, i.e., \(\eta _{jk} = \frac{\eta }{K-1}, j \ne k\), while asymmetric noise refers to labels being flipped between similar classes, e.g., the class “truck” being flipped to “automobile”.

Given a classifier \(h_{\varvec{\theta }}\) with model parameters \(\varvec{\theta }\) (for simplicity, we may omit \(\varvec{\theta }\) in what follows), it predicts the class of an input example as

$$\begin{aligned} h_{\varvec{\theta }}\left( {x}\right) =\arg \max _{k} \varvec{p}_k(x,\varvec{\theta }), \ \text{ where } \ \varvec{p}_k(x,\varvec{\theta })=\frac{e^{\varvec{z}_{k}(x, \varvec{\theta })}}{\sum _{j=1}^{K} e^{\varvec{z}_{j}(x, \varvec{\theta })}} , \end{aligned}$$
(7)

where \(\varvec{z}_{k}(x, \varvec{\theta })\) denotes the logit output of the network for class k and \(\varvec{p}_k(x,\varvec{\theta })\) represents the softmax output for x. We then denote by \(x^\prime\) an AE, and let \(X^\prime\) and \(\mathcal {D}^\prime\) be the adversarial set and distribution, respectively. We perform the PGD attack to produce AEs, i.e.,

$$\begin{aligned} x^0= & {} x + \sigma ,\; \text{ where }\; \sigma \sim \mathcal {N}(0,1), \end{aligned}$$
(8)
$$\begin{aligned} x^{t+1}= & {} \Pi _{x+\mathcal {S}}\left( x^t + \alpha \, \text{ sign }\left( \nabla _x\mathcal {L}(\varvec{\theta }, x^t, y)\right) \right) , \end{aligned}$$
(9)

where x denotes the natural example, \(x^0\) is obtained by perturbing x with random noise \(\sigma\) sampled from the normal distribution \(\mathcal {N}(0,1)\), t denotes the current step, \(\alpha\) is the step size, \(\Pi\) denotes the projection operator, and \(\mathcal {S} \subseteq \mathbb {R}^{d}\) denotes the perturbation set of AEs. Based on these definitions, the adversarial risk to be optimized for the given dataset and classifier \(h_{\varvec{\theta }}\) is defined as follows:

$$\begin{aligned} \mathcal {R}_{adv}\left( h_{\varvec{\theta }}\right) =\frac{1}{n} \sum _{i=1}^{n} \max _{\textbf{x}_{i}^{\prime } \in \mathcal {S}} \mathbb {1}\left( h_{\varvec{\theta }}\left( \textbf{x}_{i}^{\prime }\right) \ne y_{i}\right) . \end{aligned}$$
(10)
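For illustration, a minimal PyTorch-style sketch of the PGD generation in Eqs. (8)-(9) under the \(\ell _\infty\) threat model is given below. The Gaussian initialization of Eq. (8) is scaled down here so that the starting point stays close to x (many implementations instead use uniform noise in \([-\epsilon , \epsilon ]\)), and the attack loss defaults to CE as in PGD-AT; this is a sketch, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Eqs. (8)-(9): random start, then iterated signed-gradient ascent steps
    projected back onto the l_inf ball of radius eps around x."""
    x_adv = (x + 0.01 * torch.randn_like(x)).clamp(0, 1).detach()  # Eq. (8), scaled noise
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()          # ascent step of Eq. (9)
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # projection Pi_{x+S}
            x_adv = x_adv.clamp(0, 1)                    # keep a valid image
    return x_adv.detach()
```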

3.2 Interactions between adversarial training and robust loss functions

In Sect. 2.3, we noted that loss functions with the symmetry property are robust to inherent label noise. In this section, we delve into the theoretical details. Following Ghosh et al. (2017) and Ma et al. (2020), we first show that a symmetric loss function exhibits noise tolerance under both symmetric and asymmetric label noise with certain mild assumptions.

Lemma 1

In a multi-class classification problem, let a loss function L satisfy Eq. (6). Then L is noisy-robust under symmetric label noise if the noise rate \(\eta < \frac{K-1}{K}\) (Ghosh et al. 2017).

Lemma 2

In a multi-class classification problem, suppose L satisfies Eq. (6) and \(0\le L(h(x),k)\le \frac{C}{K-1},\ \forall k \in \{1,\ldots ,K\}\). If \(R(h^*) = 0\), then L is noisy-robust under asymmetric label noise provided that \(\eta _{jk} < \eta _{jj}\) (Ghosh et al. 2017), where C is a constant and \(h^*\) denotes the global minimizer.

Ghosh et al. (2017) provide detailed proofs of these two lemmas; since Lemma 2 is not easy to follow, we also provide our own proof in "Appendix A". Under the above conditions on the label noise rate \(\eta\), the learning risk under clean labels, R(h), and the risk under noisy labels with noise rate \(\eta\), \(R^\eta (h)\), share the same global minimizer \(h^*\), i.e., the loss function L is noisy-robust.

The above discussion focuses on robust loss functions for ST, i.e., for natural examples. Referring to Eqs. (8) and (9), AEs are generated at the input level. Therefore, under symmetric label noise, a loss function L remains noisy-robust for AEs if it is already noisy-robust for natural examples, since Lemma 1 imposes no additional conditions on the inputs. For asymmetric label noise, first note that the condition \(0\le L(h(x),k)\le \frac{C}{K-1},\ \forall k\) can easily be satisfied by a typical loss function (Ma et al. 2020). However, when replacing natural examples with AEs, the value of L will intuitively increase significantly, potentially exceeding this upper bound. The other condition, \(R(h^*) = 0\), is a restrictive requirement for the noisy-robust theory to hold, as it means that \(h^*\) achieves 100% classification accuracy. In practice, satisfactory performance can still be achieved as long as \(R(h^*)\) is close to 0 (Ma et al. 2020); as \(R(h^*)\) increases, the corresponding performance tends to decline. Currently, the SOTA robust accuracy against AEs on CIFAR-10 is below 70%, indicating a high value of \(R(h^*)\) and hence low utility of noisy-robust loss functions for AEs. Based on the above analysis, in the case of asymmetric label noise, AEs may not satisfy the two conditions for the noisy-robust property to hold. We can conclude that:

Proposition 1

In a multi-class classification problem, suppose L is a noisy-robust loss function. Then L remains noisy-robust for AEs under symmetric label noise, while it may be non-robust for AEs under asymmetric label noise.

Given the above analysis, note that mainstream AT algorithms currently rely on the CE loss, and some of them directly optimize the loss on AEs, as in PGD-AT in Eq. (2) and MART in Eq. (4). By Proposition 1, replacing CE with a robust loss function may therefore not be effective for these methods under asymmetric label noise. In contrast, TRADES in Eq. (3) optimizes the loss on natural examples and uses a regularization term to align the distributions of AEs and natural examples, which makes it mathematically well-suited for incorporating a robust loss function. We also verify this proposition empirically in the experiments.

3.3 Noisy robust cross entropy loss

First, we demonstrate why the simple mean absolute error (MAE) is symmetric while CE is not. For a K-class classification task, recall Eq. (6): for MAE, \(\sum _{j=1}^{K} \mathcal {L}(h(\varvec{x}), j) =\sum _{j=1}^{K}\left( 2-2\varvec{p}(j \mid \varvec{x})\right) = 2(K-1)\), which is a constant. For CE, \(\sum _{j=1}^{K} \mathcal {L}(h(\varvec{x}), j) = -\sum _{j=1}^{K}\log \varvec{p}(j \mid \varvec{x})\), whose value clearly varies across different x, i.e., the sum is not a constant for CE. However, the simplicity of MAE makes it susceptible to underfitting on large datasets, while CE, although widely used and effective, lacks the noisy-robust property. Thus, it is natural to consider combining the advantages of both MAE and CE.
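This symmetry argument can be checked numerically; the small sketch below sums MAE and CE over all candidate labels for a few random softmax outputs and confirms that the MAE sum equals the constant \(2(K-1)\) while the CE sum varies with x.

```python
import torch
import torch.nn.functional as F

K = 10
p = F.softmax(torch.randn(5, K), dim=1)     # 5 random prediction vectors

# MAE(h(x), j) = ||e_j - p||_1 = 2 - 2 p_j; summing over j gives 2(K-1)
mae_sum = (2.0 - 2.0 * p).sum(dim=1)
print(mae_sum)                              # 18.0 (= 2*(K-1)) for every example

# CE(h(x), j) = -log p_j; the sum over j depends on x, so CE is not symmetric
ce_sum = (-torch.log(p)).sum(dim=1)
print(ce_sum)                               # varies from example to example
```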

Ma et al. (2020) divide loss functions into active and passive ones according to whether the loss depends only on the value of \(\varvec{p}(k=y|x)\). Specifically, CE is considered an active loss function, while MAE is regarded as a passive loss function. They further argue that combining an active loss function with a passive one benefits from complementary learning, as demonstrated by Kim et al. (2019); this combination is referred to as the Active-Passive Loss (APL) framework. Furthermore, if both the active and passive components are noisy-robust, the resulting APL loss is also noisy-robust.

To convert CE into a noisy-robust APL form, we require noisy-robust active and passive versions of CE. The normalized CE (NCE) in Eq. (11), obtained by dividing CE by \(\sum _{j=1}^{K} \mathcal {L}(h(\varvec{x}), j)\), is proven to be a noisy-robust active loss function in Ma et al. (2020), while the reverse CE (RCE) in Eq. (12) is a noisy-robust passive loss function. Currently, the SOTA robust loss function is the combination NCE+RCE.

$$\begin{aligned}{} & {} \begin{aligned} N C E = \frac{CE}{{\sum _{j=1}^K}\mathcal {L}(h(x), j)}&= \frac{-\sum _{k=1}^{K} \varvec{q}(k \mid \varvec{x}) \log \varvec{p}(k \mid \varvec{x})}{-\sum _{j=1}^{K} \sum _{k=1}^{K} \varvec{q}(y=j \mid \varvec{x}) \log \varvec{p}(k \mid \varvec{x})}\\&= \log _{\prod _{k}^{K} \varvec{p}(k \mid \varvec{x})} \varvec{p}(y \mid \varvec{x}), \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} RCE = {-\sum _{k=1}^{K} \varvec{p}(k \mid \varvec{x}) \log \varvec{q}(k \mid \varvec{x})}. \end{aligned}$$
(12)

By definition, both NCE and RCE satisfy the symmetry in Eq. (6). Therefore, NCE+RCE can be regarded as a robust variant of CE when noisy labels are inherent in the training dataset.
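A hedged sketch of the NCE and RCE terms in Eqs. (11)-(12) is shown below. For one-hot labels, \(\log \varvec{q}(k \mid \varvec{x})\) is \(\log 0\) for \(k \ne y\); following a common implementation convention, we clamp \(\varvec{q}\) to a small floor so the logarithm stays finite (the exact floor is an implementation choice, not prescribed by the equations).

```python
import torch
import torch.nn.functional as F

def nce_rce_loss(logits, y, alpha=1.0, beta=1.0, eps=1e-7, label_floor=1e-4):
    """APL loss: alpha * NCE (Eq. (11)) + beta * RCE (Eq. (12))."""
    K = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp().clamp(min=eps, max=1.0)
    q = F.one_hot(y, K).float()

    # NCE: CE divided by the sum of CE over all K candidate labels
    nce = (-(q * log_p).sum(dim=1)) / (-log_p.sum(dim=1))

    # RCE: -sum_k p(k|x) log q(k|x), with q clamped away from zero
    rce = -(p * torch.log(q.clamp(min=label_floor))).sum(dim=1)

    return (alpha * nce + beta * rce).mean()
```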

3.4 Noisy robust adversarial training

In Sect. 3.2, we analyzed why TRADES is suitable for incorporating a robust loss function while PGD-AT and MART are less compatible in this regard. Our NRAT is formally based on TRADES, with enhancements to both of its components, aimed at making it more effective on datasets with inherent label noise. We first restate the original objective function of TRADES:

$$\begin{aligned} \ell ^{TRADES}\left( \textbf{x}_{i}, y_{i}, \varvec{\theta }\right) =\sum _{i=1}^{n} \left\{ \text {CE}\left( h_{\varvec{\theta }}\left( \textbf{x}_{i}\right) , y_{i}\right) +\beta \cdot \max _{\textbf{x}_{i}^{\prime } \in \mathcal {S}} \text {KL}\left( h_{\varvec{\theta }}\left( \textbf{x}_{i}\right) \Vert h_{\varvec{\theta }}\left( \textbf{x}_{i}^{\prime }\right) \right) \right\} , \end{aligned}$$
(13)

We already know that CE is not robust to inherent label noise. We now show that the KL-divergence term can also be replaced by a more robust alternative. The KL-divergence is given by

$$\begin{aligned} \text {KL}(p(x_i, \varvec{\theta })||p(x_i^\prime , \varvec{\theta })) = \sum _{k=1}^{K}p_k(x_i, \varvec{\theta })\log \frac{p_k(x_i, \varvec{\theta })}{p_k(x_i^\prime , \varvec{\theta })}, \end{aligned}$$
(14)

which is an asymmetric measure that treats the two distributions unequally. In learning with noisy labels, the given label distribution \(\varvec{q}(k|x)\) for x may not accurately reflect the true distribution, whereas the predicted distribution \(\varvec{p}(k|x)\) may represent the true distribution better to some extent (Wang et al. 2019). Therefore, in learning with noisy labels, it is more robust to use both \(\text {KL}(p_{1}||p_{2})\) and \(\text {KL}(p_{2}||p_{1})\) to obtain a symmetric divergence measure

$$\begin{aligned} \text {KL}_{sym}(p(x_i, \varvec{\theta })||p(x_i^\prime , \varvec{\theta })) = \frac{1}{2}\left\{ \text {KL}(p(x_i, \varvec{\theta })||p(x_i^\prime , \varvec{\theta })) + \text {KL}(p(x_i^\prime , \varvec{\theta })||p(x_i, \varvec{\theta }))\right\} . \end{aligned}$$
(15)
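A minimal sketch of the symmetric KL term in Eq. (15), computed directly from the two logit vectors, might look as follows.

```python
import torch.nn.functional as F

def symmetric_kl(logits_nat, logits_adv):
    """Eq. (15): 0.5 * [ KL(p_nat || p_adv) + KL(p_adv || p_nat) ]."""
    log_p_nat = F.log_softmax(logits_nat, dim=1)
    log_p_adv = F.log_softmax(logits_adv, dim=1)
    p_nat, p_adv = log_p_nat.exp(), log_p_adv.exp()
    kl_na = (p_nat * (log_p_nat - log_p_adv)).sum(dim=1)   # KL(p_nat || p_adv)
    kl_an = (p_adv * (log_p_adv - log_p_nat)).sum(dim=1)   # KL(p_adv || p_nat)
    return 0.5 * (kl_na + kl_an).mean()
```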

Another argument supporting the use of the symmetric KL-divergence is the memorization effect (Dong et al. 2021) of neural networks, i.e., their capability to fit the training data well even in the presence of noisy labels. Mislabeled examples tend to incur larger losses than correctly labeled examples (Song et al. 2019), leading to increased uncertainty in the output probabilities \(p(x_i, \varvec{\theta })\) and \(p(x_i^\prime , \varvec{\theta })\) (assuming \(x_i\) is a mislabeled example). Consequently, as the label noise rate increases, the asymmetry between the two directions of the KL-divergence is magnified across the dataset. We also analyze the robust risk of the symmetric KL-divergence under symmetric label noise in Appendix B and show that it tends to have a tighter bound than the KL-divergence. Based on this analysis, we replace the two terms in the original TRADES with more robust alternatives that are better suited to datasets with inherent noisy labels; the training objective function of NRAT is defined as follows

$$\begin{aligned}{} & {} \ell ^{NRAT}\left( \textrm{x}_{i}, y_{i}, \varvec{\theta }\right) : =\sum _{i=1}^{n}\bigg \{ L_{apl}\left( h_{\varvec{\theta }}\left( \textrm{x}_{i}\right) , y_{i}\right) \nonumber \\{} & {} \quad +\lambda \cdot \max _{\textbf{x}_{i}^{\prime } \in \mathcal {S}} \text {KL}_{\text{ sym } }\left( h_{\varvec{\theta }}\left( \textrm{x}_{i}\right) \Vert h_{\varvec{\theta }}\left( \textrm{x}_{i}^{\prime }\right) \right) \bigg \}, \end{aligned}$$
(16)

where \(L_{apl}\) denotes the robust loss function NCE\(+\)RCE following the APL framework in Eqs. (11) and (12), and \(\text {KL}_{sym}\) denotes the symmetric KL-divergence in Eq. (15). The pseudocode of the training algorithm for NRAT is given below.

(Pseudocode of the NRAT training algorithm)
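Since the pseudocode figure is only summarized here, we also give a minimal sketch of one NRAT training step implementing Eq. (16). It reuses the hypothetical `pgd_attack`, `nce_rce_loss`, and `symmetric_kl` sketches from the previous sections and is not the authors' reference implementation; in particular, for brevity the adversarial example is produced with the CE-based PGD sketch, whereas Eq. (16) prescribes maximizing the symmetric KL term in the inner loop.

```python
def nrat_step(model, optimizer, x, y, lam=6.0):
    """One NRAT update (Eq. (16)): APL loss on natural examples plus
    lam times the symmetric KL between natural and adversarial predictions."""
    x_adv = pgd_attack(model, x, y)    # inner maximization (simplified; see note above)
    model.train()
    logits_nat, logits_adv = model(x), model(x_adv)
    loss = nce_rce_loss(logits_nat, y) + lam * symmetric_kl(logits_nat, logits_adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```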

3.5 Relation to existing work

PGD-AT, MART, and TRADES are commonly used as baselines for newly proposed AT algorithms, as in Wu et al. (2020). In Sect. 3.4, we introduced NRAT as an enhanced, more noisy-robust objective function compared to TRADES. In this section, we focus on the distinctions between NRAT and MART, since the two share a similar formulation of their objective functions.

MART divides the training examples into correctly classified and misclassified ones, which is analogous to the division into correctly labeled and mislabeled examples. When facing noisy labels, we can likewise divide the natural training set into two subsets: examples with correct labels, \(\mathcal {S}^+\), and examples with noisy labels, \(\mathcal {S}^-\). Given a classifier \(h_{\varvec{\theta }}^*\) that satisfies \(\mathcal {R}(h_{\varvec{\theta }}^*) = 0\), we have:

$$\begin{aligned} \begin{aligned} \mathcal {S}_{h_{\varvec{\theta }}}^{+}=\left\{ i: i \in [n], h_{\varvec{\theta }}^* \left( \textbf{x}_{i}\right) =y_{i} = y_{i}^*\right\} ; \\ \mathcal {S}_{h_{\varvec{\theta }}}^{-}=\left\{ i: i \in [n], h_{\varvec{\theta }}^* \left( \textbf{x}_{i}\right) = y_{i} \ne y_{i}^*\right\} . \end{aligned} \end{aligned}$$
(17)

However, in learning with noisy labels, we do not know in advance which labels are correct and which are not. Hence, it becomes necessary to minimize the overall risk \(\mathcal {R}\left( h_{\varvec{\theta }}\right)\) rather than dividing it into subsets as MART does. Our NRAT algorithm therefore follows PGD-AT and TRADES in minimizing the risk over the whole dataset. MART, on the other hand, applies different objective functions to correctly classified and misclassified examples. This design achieves strong performance on clean datasets, but in the presence of noisy labels an example flagged as misclassified may actually be predicted correctly, and vice versa, so the two objectives may be applied to inappropriate examples, thereby diminishing MART's performance.

4 Experiments

In this section, we empirically evaluate the performance of the proposed NRAT on the CIFAR-10 dataset against two types of injected inherent label noise: symmetric noise and asymmetric noise. We compare our method with three existing AT methods and their variants on the noisy dataset with varying label noise rates.

4.1 Experimental setup

Baselines We consider three well-known AT algorithms as baselines: (1) PGD-AT; (2) TRADES; (3) MART. To evaluate the effectiveness of robust loss functions in these algorithms, we also replace the CE loss used in these algorithms with NCE\(+\)RCE as three additional baselines, i.e., (4) PGD-AT-APL; (5) TRADES-APL; (6) MART-APL.

Generation of label noise To simulate real-world datasets that may contain inherent label noise, we inject two types of research-oriented label noise into the original CIFAR-10 dataset. Symmetric label noise means that each label has an equal probability of being flipped to any other class; in contrast, asymmetric noise involves label flipping between similar classes, which is more representative of real-world scenarios. For asymmetric label noise, we flip labels between \(TRUCK \leftrightarrow AUTOMOBILE\), \(BIRD \leftrightarrow AIRPLANE\), \(DEER \leftrightarrow HORSE\), and \(CAT\leftrightarrow DOG\), following (Zhang and Sabuncu 2018); a sketch of this injection procedure is given below. We consider noise rates of 20% and 40% to simulate the noise levels in real-world datasets, and we also report the performance on the clean dataset without any label noise (0%). We also provide the results of NRAT on MNIST and FashionMNIST in Table 4 in "Appendix C".
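The following is a sketch of the injection procedure under our assumptions (symmetric flips drawn uniformly over the other classes; asymmetric flips restricted to the class pairs listed above); the exact sampling details are an implementation choice.

```python
import numpy as np

# CIFAR-10 class indices: 0 airplane, 1 automobile, 2 bird, 3 cat, 4 deer,
# 5 dog, 6 frog, 7 horse, 8 ship, 9 truck
ASYM_MAP = {9: 1, 1: 9, 2: 0, 0: 2, 4: 7, 7: 4, 3: 5, 5: 3}

def inject_label_noise(labels, noise_rate, noise_type="symmetric",
                       num_classes=10, seed=0):
    """Return a noisy copy of `labels`.

    symmetric : a `noise_rate` fraction of labels is flipped, each to a
                uniformly chosen different class.
    asymmetric: labels of the paired classes above are flipped to their
                counterpart with probability `noise_rate`.
    """
    rng = np.random.RandomState(seed)
    noisy = np.asarray(labels, dtype=np.int64).copy()
    if noise_type == "symmetric":
        flip_idx = rng.choice(len(noisy), size=int(noise_rate * len(noisy)),
                              replace=False)
        for i in flip_idx:
            candidates = [c for c in range(num_classes) if c != noisy[i]]
            noisy[i] = rng.choice(candidates)
    else:
        for i in range(len(noisy)):
            c = int(noisy[i])
            if c in ASYM_MAP and rng.rand() < noise_rate:
                noisy[i] = ASYM_MAP[c]
    return noisy
```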

Adversarial training settings For AT, we train ResNet18 for all algorithms, largely following the standard settings in Rice et al. (2020), with some adjustments to better suit AT with noisy labels. Specifically, we use stochastic gradient descent (SGD) with momentum 0.9 and weight decay 5e-4 for 200 training epochs. We use standard data augmentation, i.e., random crops and random horizontal flips, and apply data normalization for all methods. For the training attack, we use PGD-10 with random initialization, perturbation limit \(\epsilon =8/255\), and step size 2/255. The standard default initial learning rate is 0.1; however, since training is prone to gradient collapse under AT with inherent label noise, we choose smaller initial learning rates from [0.01, 0.05, 0.1] for different noise rates, the general principle being to choose the largest learning rate that avoids gradient collapse. (In "Appendix D", we provide an additional experiment that replaces the CE loss in the PGD attack with our proposed loss function.)

We use the \(\ell _\infty\) threat model for all methods. We do not train any WideResNet, since it usually shows a similar trend to ResNet18 while being much more time-consuming. For NCE\(+\)RCE in NRAT, we follow the CIFAR-10 setting in Ma et al. (2020), i.e., the coefficients of both terms are 1. The hyperparameters of the baselines are consistent with their original papers: \(\lambda = 5\) for MART and \(\beta = 6\) for TRADES. For our NRAT, we try \(\lambda \in \{4, 6, 8, 10\}\) and find that \(\lambda = 6\) yields the best empirical results across different noise rates; we report the best natural-robust trade-off performance for all methods. All experiments are run on a server with an Intel i7-12700F CPU and an RTX 3090 GPU. Note that we do not apply any training tricks such as gradient clipping or label smoothing, in order to accurately compare the different objective strategies.

4.2 Performance evaluation

Adversarial attacks We use two typical white-box attacks, PGD-20 and CW-20 (Carlini and Wagner 2017) (the \(\ell _\infty\) version of the CW loss optimized by PGD-20), as well as the more powerful AutoAttack (Croce et al. 2020), to evaluate the baselines and NRAT. AutoAttack is an ensemble of parameter-free attacks and serves as a reliable metric for assessing the robustness of a model. While some of these attacks may not behave as strict white-box attacks in the noisy-label setting (for instance, CW-20 may be affected by gradient obfuscation caused by the random label flipping), we believe they reflect the robustness of the model to a reasonable extent.

To evaluate performance, we report "natural" and "robust" accuracy, i.e., the accuracy on natural test images and on adversarial test images generated by the different attacks, respectively, from which the natural-robust trade-off of each method can be observed. Another metric for AT is the degree of robust overfitting, so we also report the "Best" (highest) and "Last" (final-epoch) natural and robust accuracy; the smaller the gap between them, the lower the degree of overfitting. Results for symmetric and asymmetric label noise are shown in Tables 1 and 2, respectively. Recalling Fig. 1, even natural accuracy overfits when label noise is present for the baselines; our results show that this double overfitting can be largely mitigated by our method.

Table 1 Best and last robustness performance (%) on CIFAR-10 with inherent symmetric label noise at 0%, 20% and 40% noise rates
Table 2 Best and last robustness performance (%) on CIFAR-10 with inherent asymmetric label noise at 20% and 40% noise rates

Remark for Tables 1 and 2. Under symmetric label noise, Table 1 shows that NRAT outperforms the baselines in terms of best robust accuracy, while on the clean dataset TRADES exhibits superior robust performance. Comparing MART with MART-APL and TRADES with TRADES-APL, we observe that robust performance improves in both cases under 20% and 40% symmetric label noise, with MART-APL showing a particularly significant improvement. These results indicate that, in the presence of noisy labels, a robust loss function is a more robust alternative to the CE loss.

While MART and TRADES exhibit significant robust overfitting, with gaps between the last and best performance of around 11% to 16% under 20% symmetric label noise and 18% to 28% under 40% symmetric label noise, the APL versions significantly mitigate the double overfitting issue (around 6% to 10% under 20% symmetric label noise and 8% to 10% under 40% symmetric label noise). This demonstrates the effectiveness of robust loss functions in addressing double overfitting. We provide the learning curves for MART-APL and TRADES-APL in Fig. 3 below:

Fig. 3

The learning curves of natural accuracy and PGD robust accuracy for MART-APL and TRADES-APL under 0% (natural/robust), 20% and 40% inherent symmetric/asymmetric label noise on CIFAR-10 with \(\ell _{\infty }\) threat model

Under asymmetric label noise, Table 2 further demonstrates that NRAT achieves the highest robust performance under both 20% and 40% noise. Another noteworthy observation is that the best robust performance of MART-APL consistently falls below that of MART, which aligns with Proposition 1: robust loss functions may be non-robust for AEs under asymmetric label noise, and here NCE+RCE performs around 1% to 3.5% lower than the already non-robust CE. A further notable phenomenon is that the robust overfitting of PGD-AT, MART, and TRADES is less pronounced under asymmetric label noise than under symmetric label noise. This suggests that, in AT, the CE loss is relatively more robust to asymmetric than to symmetric label noise, whereas in ST, asymmetric label noise is generally considered more challenging.

The effectiveness of symmetric KL-divergence The key difference between TRADES-APL and our NRAT is whether a symmetric KL-divergence is used. Considering their performance in Tables 1 and 2, NRAT consistently achieves higher robust performance but lower natural performance than TRADES-APL. This highlights the role of the symmetric KL-divergence: it narrows the gap between natural and robust performance, albeit at the cost of some natural accuracy. Given that robust performance is the primary focus of AT, we consider this trade-off an appropriate compromise.

The performance of PGD-AT-APL As analyzed in Sect. 3.2, PGD-AT is not well-suited for APL. Empirically, the training process of PGD-AT-APL exhibits a peculiar tendency, with very low natural accuracy (less than 30%) and high robust accuracy (more than 60%). We therefore do not include the results of PGD-AT-APL in the tables above. The underlying reasons for this behavior remain an open issue that requires further investigation.

Further discussion with TRADES We make a fuller comparison between NRAT and TRADES here. First, under symmetric label noise, NRAT attains a higher best robust accuracy, and as the noise rate increases from 20 to 40%, the improvement becomes more apparent, around 2% to 4% under different attacks. Another improvement is that NRAT mitigates double overfitting: in the second part of Table 1, TRADES shows a significant double overfitting issue, with a large gap between the last and best accuracy, whereas the last performance of NRAT is much higher than that of TRADES. Second, asymmetric label noise shows the opposite trend: although NRAT still outperforms TRADES in best robust accuracy, the gap narrows as the noise rate increases. This is related to the condition \(R(h^{*}) = 0\) in Lemma 2; as the noise rate increases, \(R(h^{*})\) tends to move further from 0, which limits the performance of the NCE+RCE loss.

4.3 Mitigating double overfitting

Although NRAT partially mitigates the issue of double overfitting, there is still significant robust overfitting, resulting in a substantial gap between the best and last performance: around 7% to 10% for symmetric label noise and 8% to 12% for asymmetric label noise. To further address this, we introduce weight perturbation. Adversarial weight perturbation (AWP) (Wu et al. 2020) adversarially perturbs both the inputs and the weights during training. The input perturbation is produced via the PGD attack, while the weight perturbation can be written as

$$\begin{aligned} \textbf{v} \leftarrow \Pi _{\gamma }\left( \textbf{v}+\eta \frac{\nabla _{\textbf{v}} \frac{1}{n} \sum _{i=1}^{n} \ell \left( \textbf{f}_{\textbf{w}+\textbf{v}}\left( \textbf{x}_{i}^{\prime }\right) , y_{i}\right) }{\left\| \nabla _{\textbf{v}} \frac{1}{n} \sum _{i=1}^{n} \ell \left( \textbf{f}_{\textbf{w}+\textbf{v}}\left( \textbf{x}_{i}^{\prime }\right) , y_{i}\right) \right\| }\Vert \textbf{w}\Vert \right) , \end{aligned}$$
(18)

where \(\textbf{v}\) denotes the weight perturbation, which can be solved by multi-step methods such as PGD, \(\eta\) here denotes the step size of the weight perturbation (not to be confused with the noise rate), and n is the batch size. Combining \(x^{\prime }\) and \(\textbf{v}\) in adversarial training has been shown to enhance adversarial robustness as well as alleviate robust overfitting. Furthermore, we empirically demonstrate that NRAT is compatible with AWP and can effectively mitigate the issue of double overfitting in the presence of label noise. The comparison between NRAT and NRAT-AWP is shown in Fig. 4 and Table 3.
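A much-simplified, single-step sketch of the weight perturbation in Eq. (18) is given below; the reference AWP implementation (Wu et al. 2020) normalizes the perturbation layer by layer and projects it onto a \(\gamma\)-ball, which we only approximate here on a per-parameter basis.

```python
import torch

def awp_perturb(model, loss_fn, x_adv, y, gamma=0.005):
    """One simplified ascent step of Eq. (18): move each weight tensor in the
    normalized gradient direction, scaled by gamma * ||w||.
    Returns the applied perturbation so it can be removed after the update."""
    loss = loss_fn(model(x_adv), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    perturbation = []
    with torch.no_grad():
        for p, g in zip(params, grads):
            v = gamma * p.norm() * g / (g.norm() + 1e-12)   # normalized ascent step
            p.add_(v)                                       # w <- w + v
            perturbation.append(v)
    return perturbation

def awp_restore(model, perturbation):
    """Undo the weight perturbation after the outer minimization step."""
    params = [p for p in model.parameters() if p.requires_grad]
    with torch.no_grad():
        for p, v in zip(params, perturbation):
            p.sub_(v)
```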

Fig. 4

The learning curves of natural accuracy and PGD robust accuracy for NRAT and NRAT-AWP under 0% (natural/robust), 20% and 40% inherent symmetric/asymmetric label noise on CIFAR-10 with \(\ell _{\infty }\) threat model

Table 3 Robustness performance (%) on CIFAR-10 of NRAT-AWP and NRAT with 20% and 40% symmetric/asymmetric label noise

It is clear that NRAT-AWP achieves higher robust accuracy and significantly mitigates robust overfitting. The performance gap is less than 5% across all label noise rates.

4.4 AT with generated data

Currently, one of the most effective approaches in AT is leveraging additional data. For instance, Wang et al. (2023) used the elucidating diffusion model (EDM) (Turkeltaub et al. 2023) to generate millions of additional images for AT, leading to state-of-the-art performance on the RobustBench (Croce et al. 2020) leaderboard. However, it is worth noting that these generated datasets may also contain an unknown proportion of noisy labels. Out of curiosity, we also trained NRAT on these additional data. (We refer to their method as DM_AT in this section.)

Settings for this part We use the 1M generated images provided by Wang et al. (2023), following most of the settings outlined in Sect. 4.1. For each method (DM_AT and NRAT), we train a WideResNet-28-10 model on this large dataset. Additionally, as per Wang et al. (2023), we apply label smoothing with a value of 0.1 and set aside the first 1024 images of the training set as a fixed validation set in place of the CIFAR-10 test data: since the distribution of the generated data still differs from that of the CIFAR-10 test set, a fixed validation set sampled from the generated data provides a fairer comparison by eliminating the impact of the distribution gap. We train each method for 150 epochs to observe the training tendency. The performance on the validation set is shown in Fig. 5.

Fig. 5

The learning curves of natural accuracy and PGD robust accuracy on the validation set for DM_AT and NRAT using 1 M generated data with \(\ell _{\infty }\) threat model

Although the exact number of noisy labels in the generated dataset is unknown, it is clear from Fig. 5 that NRAT exhibits higher clean accuracy on the validation set compared to DM_AT. However, the robust accuracy of NRAT appears slightly lower than that of DM_AT. The best natural accuracy achieved is 77.4% for DM_AT and 82.66% for NRAT, while the best robust accuracy is 49.41% for DM_AT and 49.02% for NRAT.

5 Conclusion

In this paper, we first investigate the performance of existing AT methods when confronted with inherent label noise and observe that they exhibit poor generalization in this setting. To address this issue, we propose a novel noisy-robust adversarial training algorithm, NRAT, which incorporates a robust loss function and a more robust regularization term to enhance adversarial robustness in the presence of inherent label noise. This work combines techniques from the fields of noisy labels and AT, aiming to improve adversarial robustness on more realistic datasets. Comprehensive experiments show that, under inherent label noise, NRAT achieves comparable or superior performance to existing AT algorithms in terms of robust accuracy and robust overfitting. Furthermore, we empirically show that NRAT is well-suited for training with large generated datasets, which is the state-of-the-art practice for improving adversarial training.