Introduction

Deep Neural Networks (DNNs) have achieved remarkable success across a wide range of fields [1,2,3,4,5,6,7], and research on Learning with Noisy Labels (LNL) aims to improve the accuracy of DNNs when the training dataset contains noisy labels (i.e., incorrect labels). This field has recently gained significant attention [8,9,10,11,12,13,14,15,16] for two main reasons. First, DNNs are susceptible to label noise, which can negatively affect their performance [9, 11, 17,18,19]. Second, real-world datasets often contain noisy labels [8, 10, 14, 18,19,20]. Without effective interventions, DNNs' performance on real-world datasets can be severely degraded by noisy labels [9, 11, 13, 17,18,19]. Therefore, research on LNL methods is crucial for improving DNNs' performance on real-world datasets.

Popular LNL methods for DNNs can be coarsely classified into two types: loss reweighting and label correction. Loss reweighting methods frequently treat high-loss samples as noisy samples and restrain their gradients [9, 11, 14, 19, 21,22,23]. Examples include methods that modify the loss function to down-weight the gradient of high-loss samples [9, 11, 21, 24] and methods that utilize samples' loss values during training to filter out potential noisy samples [12, 14, 22, 23, 25, 26]. On the other hand, label correction methods focus on correcting samples with low prediction probability on their observed labels [10, 16, 18, 20, 27,28,29,30,31,32,33]. The main approaches for label correction include training with only corrected labels [10, 18, 30, 33] and training with a combination of noisy and corrected labels [20, 31, 32]. In general, for both loss reweighting and label correction methods, samples with high learning difficulty (i.e., samples on whose observed labels the DNN has low prediction probability) are often considered noisy [10, 14, 18, 19, 22]. Since noisy samples are usually fitted through brute-force memorization [34, 35], they tend to have higher learning difficulty than simple clean samples. Therefore, existing LNL methods are effective in separating noisy samples from simple clean samples.

Existing LNL methods frequently regard samples with high learning difficulty as noisy samples. Nevertheless, irregular feature patterns from hard clean samples can also cause high learning difficulty for DNNs, so these samples can be mis-corrected or filtered out by existing LNL methods. Although hard clean samples are only a minority in the dataset, they play a vital role in improving DNNs' generalization [12, 35,36,37,38,39]. Thus, separating noisy samples from hard clean samples through a more effective criterion can further improve DNN's performance on noisily labeled datasets. Although previous works have utilized samples' learning difficulty (logits output or loss) across different training epochs to distinguish noisy samples from hard clean samples [12, 14, 16], they did not exploit the primary difference between the two. Hard clean samples possess correct labels, so their high learning difficulty primarily stems from their irregular feature patterns; this makes the features learned from other clean samples inapplicable to them, and consequently their learning difficulty is higher than that of other clean samples. In contrast, the high learning difficulty of noisy samples is mainly caused by incorrect labels, and many of them possess feature patterns similar to those of simple clean samples. Ignoring this difference may cause existing LNL methods to mis-classify hard clean samples as noisy samples.

In this paper, we propose the Samples' Learning Risk-based Learning with Noisy Labels (SLRLNL) method to better separate noisy samples from hard clean samples, thus improving the DNN's learning of hard clean samples while mitigating label noise. To be specific, a sample's learning risk is the variation in the DNN's accuracy on the training dataset after learning that sample. As will be demonstrated in this paper, a sample's learning risk is jointly determined by its learning difficulty and its feature similarity to other samples, so only samples with both high learning difficulty and feature patterns similar to other samples will be detected as noisy. Since hard clean samples often possess feature patterns dissimilar to other samples, SLRLNL can separate noisy samples from hard clean samples more effectively than existing LNL methods that rely solely on samples' learning difficulty.

We divide our proposed SLRLNL method into two processes. The first is the label correction process, in which we identify noisy samples through samples' learning risk and then correct them to obtain clean samples, improving the DNN's performance. Furthermore, to extract useful information from samples with irregular feature patterns (i.e., hard samples), we propose a Relabeling-based Label Augmentation (RLA) process as the second process of SLRLNL. This process prevents the DNN from memorizing hard noisy samples and enhances the learning of hard clean samples, thus extracting useful information from the hard samples. Specifically, in each epoch, the RLA process selects different samples that are likely to be hard samples and temporarily relabels them to another probable class. Because this process mainly relabels hard samples, it can prevent the DNN from memorizing hard noisy samples. As the relabeled class may contain valuable semantic information, the temporary relabeling of the selected hard samples also encourages the DNN to learn more generalized knowledge from them, thereby improving the DNN's generalization performance.

The effectiveness of the learning risk-based label correction process in identifying noisy samples, and the effectiveness of the RLA process in enhancing the learning of hard samples, are evaluated through empirical studies. We conduct experiments on five frequently used datasets, including four image classification datasets (CIFAR-10 and CIFAR-100 [40], Animal-10N [27], and Clothing1M [41]) and one natural language processing dataset (Docred [2]). The experimental results on these datasets demonstrate that our proposed method achieves better performance than existing LNL methods. The source code for SLRLNL can be found at https://github.com/yangbo1973/SLRLNL.

In summary, the contributions of this paper are as follows.

  • We propose the SLRLNL method to better separate noisy samples from hard clean samples. To detect noisy samples, the SLRLNL method utilizes samples’ learning risk as selection criterion. Since samples’ learning risk is comprehensively determined by samples’ learning difficulty and samples’ feature similarity to other samples, the SLRLNL method can correct noisy labels more effectively without hindering the learning of hard clean samples. Compared to existing LNL methods, our proposed SLRLNL can further enhance DNN’s performance in practice.

  • In addition to identifying and correcting noisy labels using the samples’ learning risk, we propose the RLA process to extract more meaningful information from the hard samples.

  • We conduct experiments on the CIFAR-10, CIFAR-100, Animal-10N, Clothing1M, and Docred datasets. Our proposed method is evaluated against existing popular LNL methods, and the experimental results demonstrate that it achieves better performance than the other LNL methods.

The structure of this paper is as follows: The section Related work reviews related works in the context of our method. The section Preliminaries presents the preliminaries necessary for our proposed method. In the section The proposed method, we outline our proposed method. The section Experimental setup describes the experimental setup, the section Experimental results presents the experimental results obtained using our method, and the final section concludes this paper.

Related work

Loss reweighting methods

Loss reweighting methods are effective in demoting the influence of noisy samples. Popular loss reweighting methods first detect noisy samples through samples' learning difficulty (e.g., loss value or prediction probability on the observed label) and then reduce the weight of the detected samples or filter them out. Examples include methods that utilize the DNN's prediction probability to detect and filter noisy samples [12, 14, 19, 22, 25, 26, 42,43,44,45] and methods that modify the loss function to reduce the weight of high-loss samples [9, 11, 24, 46]. These methods are effective in mitigating the adverse impact of noisy labels; however, since hard clean samples frequently possess irregular feature patterns, existing loss reweighting methods have an undesirable tendency to discard useful hard clean samples and train the DNN with only simple samples. Consequently, these methods can bias the DNN's training process [32] and damage its performance.

Label correction methods

During the training of a DNN, the gradient from clean samples can influence the DNN's prediction probability for noisy samples [19, 35]; therefore, the DNN's prediction probability can be utilized to detect and correct noisy labels [10, 15,16,17,18, 20, 27,28,29,30,31,32,33]. In general, these methods regard samples whose labels are highly inconsistent with the DNN's prediction probability as noisy and then correct them using the DNN's prediction output. Nevertheless, since hard clean samples possess feature patterns dissimilar to those of simple clean samples, existing label correction methods can easily misinterpret hard clean samples as noisy samples and mis-correct them. Since hard clean samples play an important role in the DNN's generalization performance [36,37,38,39], mis-correcting them can degrade the DNN's performance.

Identifying hard samples

Hard samples are essential for DNN's generalization [35, 37, 47], and thus research on identifying hard clean samples is also important. For example, Lin et al. [48] regard samples with rare labels as hard samples, and Huang et al. [49] identify hard samples through the DNN's prediction probability. Koh et al. [50] evaluate samples' informativeness through the variation of the DNN's parameters after sample removal, and Harutyunyan et al. [37] search for hard samples through the mutual information between the sample and the DNN's parameters.

Data augmentation

Data augmentation is an effective measure for improving DNNs' generalization. In general, data augmentation methods improve DNNs' generalization by applying transformations to the samples' input. Examples include image rotation, flipping, cropping, and random scaling in image classification tasks [51], and synonym replacement, random insertion, swapping, and deletion in natural language processing tasks [52]. While the methods listed above are task-specific, label augmentation, which trains DNNs with constructed artificial labels, can be used across tasks to encourage DNNs to learn more generalized knowledge from the training samples [53, 54].

Preliminaries

Table 1 Basic notations

In this section, we provide notations and definitions related to our proposed method; the basic notations are listed in Table 1. Generally, the aim of this paper is to improve the DNN's accuracy when the training dataset contains noisy samples. We focus on the multi-class classification task and denote \(D=\{s_1,s_2,\ldots ,s_n\}\) as the clean training dataset, where \(s_i=({\textbf {x}}_i,y_i)\in ({\mathcal {X}}, {\mathcal {Y}})\) is the ith sample of D. \({\textbf {x}} \in {\mathcal {X}}\) is the input for the DNN and \(y \in {\mathcal {Y}}\) is the ground truth label, where \({\mathcal {X}}\) is the space of input data and \({\mathcal {Y}} = \{1,\ldots ,K\}\), with K the number of categories of the classification task. Then, we define the observed training dataset as \({\tilde{D}}=\{{\tilde{s}}_1,{\tilde{s}}_2,\ldots ,{\tilde{s}}_n\}\), where \({\tilde{s}}_i=({\textbf {x}}_i,{\tilde{y}}_i)\) is the ith sample of \({\tilde{D}}\). In practice, it is unknown whether the label of an observed sample \({\tilde{s}}_i\) is correct or not. We define noisy samples as follows:

Definition 1

(Noisy sample)

For an observed sample \({\tilde{s}}_i = ({\textbf {x}}_i, {\tilde{y}}_i)\), \({\tilde{s}}_i\) is a noisy sample if \({\tilde{y}}_i \ne y_i\).

Define \({\varvec{\theta }}\) as the parameters of the DNN. Given the DNN's parameters \({\varvec{\theta }}\), define \(\phi (\cdot ;{\varvec{\theta }}):{\mathcal {X}}\rightarrow \mathbb {R}^{N_l}\), \(g(\cdot ;{\varvec{\theta }}):{\mathcal {X}}\rightarrow \mathbb {R}^{K}\), and \(f(\cdot ;{\varvec{\theta }}):{\mathcal {X}}\rightarrow \mathbb {R}^{K}\) as the functions that map the sample's input to the DNN's penultimate layer output, the DNN's logits output, and the DNN's prediction probability (output after the softmax layer), respectively. \(\phi (\cdot ;{\varvec{\theta }})\) can be considered the DNN's feature extractor, and the DNN's penultimate layer output for a sample can also be considered the feature representation of that sample.
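To make this notation concrete, the following is a minimal sketch of how \(\phi \), g, and f can be obtained from a single network, assuming a torchvision ResNet-18 (an illustrative choice; the actual backbones used are listed in the experimental setup):

```python
import torch
import torchvision

# Hypothetical setup: a torchvision ResNet-18 for a 10-class task.
model = torchvision.models.resnet18(num_classes=10)
model.eval()

def phi(x):
    # phi(.; theta): everything up to the penultimate layer,
    # i.e., the global-pooled features before the final fc layer
    feats = torch.nn.Sequential(*list(model.children())[:-1])(x)
    return feats.flatten(1)                 # shape (batch, N_l), here N_l = 512

def g(x):
    return model(x)                         # g(.; theta): logits, shape (batch, K)

def f(x):
    return torch.softmax(model(x), dim=1)   # f(.; theta): prediction probabilities
```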

In this paper, for the sake of simplicity, when given the DNN’s parameters \({\varvec{\theta }}\), the DNN’s penultimate layer output, logits output, and prediction probability for the sample with input \({\textbf {x}}_i\) can be abbreviated as \({\textbf {z}}_i\), \({\textbf {u}}_i\), and \({\textbf {p}}_i\), respectively. The matrices for the DNN’s penultimate layer output, logits output, and prediction probability for the dataset D can be abbreviated as \({\textbf {Z}}_D\), \({\textbf {U}}_D\), and \({\textbf {P}}_D\), respectively. The matrices for the one-hot form of clean labels and observed labels for datasets D and \({\tilde{D}}\) can be abbreviated as \({\textbf {Y}}_D\) and \(\tilde{{\textbf {Y}}}_{{\tilde{D}}}\), respectively. When replacing the subscript D of these abbreviated symbols above with another set of samples, it denotes the DNN’s output matrices for that set.

Generally, existing LNL methods frequently attempt to identify noisy samples through samples' learning difficulty or its extensions (the sample's loss value or prediction probability) [10, 19, 22]. The learning difficulty is defined as follows:

Definition 2

(Learning difficulty)

Define \(1-f_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }})\) to be the DNN's learning difficulty for the observed sample \(({\textbf {x}}_i,{\tilde{y}}_i)\).

Higher learning difficulty for sample \({\tilde{s}}_i\) indicates that the DNN will take longer to eventually fit it. Although utilizing samples' learning difficulty can effectively identify noisy samples, it may mistake hard clean samples with irregular feature patterns for noisy samples. Although these samples are a minority in the dataset, they are essential for the DNN's generalization [35,36,37,38,39]. In this paper, we define hard samples as samples with high learning difficulty when trained on a clean dataset. They are defined as hard clean samples if their observed labels are consistent with their ground truth labels; otherwise, they are defined as hard noisy samples, as shown below:

Definition 3

(Hard clean/noisy sample)

For an observed sample \({\tilde{s}}_i = ({\textbf {x}}_i, {\tilde{y}}_i)\), \({\tilde{s}}_i\) is a hard clean sample if \({\tilde{y}}_i = y_i\) and \(1 - f_{y_i}({\textbf {x}}_i;{\varvec{\theta }}^*) > \tau \), and is a hard noisy sample if \({\tilde{y}}_i \ne y_i\) and \(1 - f_{y_i}({\textbf {x}}_i;{\varvec{\theta }}^*) > \tau \). \({\varvec{\theta }}^*\) denotes the optimal parameters of a DNN trained on the clean dataset D, and \(\tau \) is the selection threshold for hard samples.

Other than noisy samples and hard clean samples, the simple clean sample is defined as follows:

Definition 4

(Simple clean sample)

For an observed sample \({\tilde{s}}_i = ({\textbf {x}}_i, {\tilde{y}}_i)\), \({\tilde{s}}_i\) is a simple clean sample if \({\tilde{y}}_i = y_i\) and \(1 - f_{y_i}({\textbf {x}}_i;{\varvec{\theta }}^*) \le \tau \), where \({\varvec{\theta }}^*\) denotes the optimal parameters of a DNN trained on the clean dataset.

Define \(\ell (y, f({\textbf {x}}; {\varvec{\theta }}))\) and \(E(y, f({\textbf {x}}; {\varvec{\theta }}))\) as the loss function and evaluation metric function, respectively. Both functions can be utilized to measure how close the DNN’s prediction probability is to the sample’s label. In this paper, we propose to better separate noisy samples from both hard clean samples and simple clean samples through the learning risk of samples. The definitions of samples’ learning risk are given as follows:

Definition 5

(Learning risk)

Denote the learning risk of sample \({\tilde{s}}_i\) by \(\bigtriangleup E_{{\tilde{s}}_i\rightarrow D}\), which is the variation of the DNN's empirical risk on the clean training set D after updating with the gradient from \({\tilde{s}}_i\):

$$\begin{aligned} \bigtriangleup E_{{\tilde{s}}_i\rightarrow D}&=\frac{1}{|D|}\sum _{({\textbf {x}}_d,y_d)\in D}E(y_d,f({\textbf {x}}_d;{\varvec{\theta }}+\bigtriangleup {\varvec{\theta }}_i))\nonumber \\&\quad -E(y_d,f({\textbf {x}}_d;{\varvec{\theta }})) {,} \end{aligned}$$
(1)

where \(\bigtriangleup {\varvec{\theta }}_i\) is the variation of the DNN’s parameters after updating the gradient from \({\tilde{s}}_i\).

The proposed method

Our proposed method is presented in this section. In the section Calculation for samples’ learning risk, we demonstrate the calculation method for samples’ learning risk. The section Empirical study of separating noisy samples from hard clean samples provides an empirical study of our method in separating hard clean samples from noisy samples. We illustrate the methods for label correction in the section Label correction for noisy samples and present the proposed RLA in the section Relabeling-based label augmentation. Finally, in the section Implementation detail, we present the overall algorithm for SLRLNL and implementation details.

Fig. 1 Illustration of the difference between hard clean samples and noisy samples

Calculation for samples’ learning risk

This subsection demonstrates the calculation method for samples’ learning risks. To calculate the learning risk, we use Mean Square Error (MSE) as the evaluation metric function for DNN accuracy

$$\begin{aligned} E_{MSE}(y,g({\textbf {x}};{\varvec{\theta }})) = ||{\textbf {y}}-g({\textbf {x}};{\varvec{\theta }})||^2_2 {,} \end{aligned}$$
(2)

then DNN’s empirical risk in D is

$$\begin{aligned} \frac{1}{n}\sum _{({\textbf {x}}_d,y_d)\in D}{||{\textbf {y}}_d-g({\textbf {x}}_d;{\varvec{\theta }})||^2_2} {,} \end{aligned}$$
(3)

Then, for a DNN with parameters \({\varvec{\theta }}\), suppose MSE is adopted as the loss function and gradient descent as the optimizer. After updating with the gradient from \({\tilde{s}}_i = ({\textbf {x}}_i,{\tilde{y}}_i)\), the variation of the logits output of sample \(s_d = ({\textbf {x}}_d,y_d)\) is

$$\begin{aligned} \bigtriangleup {\textbf {u}}_{{\tilde{s}}_i\rightarrow s_d} = -2\alpha ({\textbf {z}}_i({\textbf {z}}_d+\bigtriangleup {\textbf {z}}_{{\tilde{s}}_i\rightarrow s_d})^T+1)({\textbf {u}}_i-\tilde{{\textbf {y}}}_i) {,} \end{aligned}$$
(4)

where \(\alpha \) is the learning rate, \({\textbf {z}}_i=\phi ({\textbf {x}}_i;{\varvec{\theta }})\in \mathbb {R}^{1\times N_l}\) and \({\textbf {z}}_d=\phi ({\textbf {x}}_d;{\varvec{\theta }})\in \mathbb {R}^{1\times N_l}\) are DNN’s penultimate layer output for sample \({\tilde{s}}_i\) and sample \(s_d\), respectively. \({\textbf {u}}_i = g({\textbf {x}}_i;{\varvec{\theta }})\in \mathbb {R}^{1\times K}\) and \(\tilde{{\textbf {y}}}_i\in \mathbb {R}^{1\times K}\) are the logits output and one-hot form observed label for sample \({\tilde{s}}_i\), respectively. \(\bigtriangleup {\textbf {z}}_{{\tilde{s}}_i\rightarrow s_d}\) is the variation for DNN’s penultimate layer output for sample \(s_d\) after updating gradient from sample \({\tilde{s}}_i\). Then, the learning risk \(\bigtriangleup E_{{\tilde{s}}_i\rightarrow D}\) for sample \({\tilde{s}}_i = ({\textbf {x}}_i,{\tilde{y}}_i)\) is

$$\begin{aligned}&\bigtriangleup E_{{\tilde{s}}_i\rightarrow D}\nonumber \\&\quad =\frac{1}{|D|}\sum _{({\textbf {x}}_d,y_d)\in D}{||{\textbf {y}}_d-{\textbf {u}}_d-\bigtriangleup {\textbf {u}}_{{\tilde{s}}_i\rightarrow s_d}||^2_2-||{\textbf {y}}_d-{\textbf {u}}_d||^2_2}\nonumber \\&\quad = \frac{4\alpha }{|D|}({\textbf {z}}_i({\textbf {Z}}_D)^T+1)({\textbf {U}}_D-{\textbf {Y}}_D)(\tilde{{\textbf {y}}}_i-{\textbf {u}}_i)^T\nonumber \\&\qquad + \frac{4\alpha ^2}{|D|}(1+{\textbf {u}}_i{\textbf {u}}_i^T-2{\textbf {u}}_i\tilde{{\textbf {y}}}_i^T)\sum _{s_d\in D}{({\textbf {z}}_i({\textbf {z}}_d)^T+1)^2} {.} \end{aligned}$$
(5)

The proof for Eq. (4) and Eq. (5) can be found in Appendix A. In Eq. (5), \({\textbf {u}}_d\in \mathbb {R}^{1\times K}\), \({\textbf {z}}_d\in \mathbb {R}^{1\times N_l}\), and \({\textbf {y}}_d\in \mathbb {R}^{1\times K}\) are sample \(s_d\)’s logits output, penultimate layer output, and one-hot form clean label, respectively. \({\textbf {Z}}_D\in \mathbb {R}^{n\times N_l}\), \({\textbf {U}}_D\in \mathbb {R}^{n\times K}\), and \({\textbf {Y}}_D\in \mathbb {R}^{n\times K}\) are matrices for clean dataset D’s penultimate layer output, logits output, and one-hot form clean label, respectively. Since the clean dataset D is generally unavailable in practice, in this paper, we adopt low learning difficulty samples as replacements, which are likely to be simple clean samples [35, 55]. In each epoch, we select a subset of the observed dataset as the nearly clean samples \(C^{(t)}\)

$$\begin{aligned}&C^{(t)}=\left\{ ({\textbf {x}}_i,{\tilde{y}}_i)| \frac{1}{t}\sum ^t_{m=1}{\bigg (1-f_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }}^{(m)}) \bigg )} \le \tau (t, n_C) \right\} {,} \end{aligned}$$
(6)

where t is the current epoch, \(\tau (t, n_C)\) is the threshold for selecting \(C^{(t)}\), defined as the \(n_C\%\) lowest average learning difficulty from the 1st to the tth epoch, and \(n_C\) is the hyper-parameter of our method for selecting nearly clean samples. The nearly clean samples \(C^{(t)}\) are then utilized to represent the clean dataset D. Note that the learning rate \(\alpha \) also appears in Eq. (5). In practice, the learning rate \(\alpha \) is set to a small value; thus, for the sake of simplicity, we let \(\alpha \rightarrow 0^+\) and ignore the second term in Eq. (5). Then, the learning risk for sample \({\tilde{s}}_i\) is

$$\begin{aligned} \bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}&= \frac{4\alpha }{|C^{(t)}|}({\textbf {z}}_i({\textbf {Z}}_{C^{(t)}})^T+1)\nonumber \\&\quad ({\textbf {U}}_{C^{(t)}}-\tilde{{\textbf {Y}}}_{C^{(t)}})(\tilde{{\textbf {y}}}_i-{\textbf {u}}_i)^T {,} \end{aligned}$$
(7)

where \({\textbf {Z}}_{C^{(t)}}\), \({\textbf {U}}_{C^{(t)}}\), and \(\tilde{{\textbf {Y}}}_{C^{(t)}}\) are the matrices of the nearly clean samples \(C^{(t)}\)'s penultimate layer output, logits output, and one-hot form observed labels, respectively. Moreover, since the cross-entropy loss is frequently utilized for classification tasks in practice, in this paper we use the cross-entropy loss function to train the DNN. To represent samples' learning risk when trained with cross-entropy, we replace the logits output terms in Eq. (7) with the DNN's prediction probability; thus, the learning risk for sample \({\tilde{s}}_i\) can be represented as

$$\begin{aligned} \bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}&= \frac{4\alpha }{|C^{(t)}|}({\textbf {z}}_i({\textbf {Z}}_{C^{(t)}})^T+1)({\textbf {P}}_{C^{(t)}}-\tilde{{\textbf {Y}}}_{C^{(t)}})\nonumber \\&\quad (\tilde{{\textbf {y}}}_i-{\textbf {p}}_i)^T {,} \end{aligned}$$
(8)

where \({\textbf {p}}_i = f({\textbf {x}}_i;{\varvec{\theta }})\) is the DNN's prediction probability for sample \({\tilde{s}}_i\), and \({\textbf {P}}_{C^{(t)}}\) is the matrix of the nearly clean samples \(C^{(t)}\)'s prediction probabilities. In this paper, we utilize Eq. (8) to calculate samples' learning risk. As shown in Eq. (8), the learning risk of the ith sample is mutually determined by the term \((\tilde{{\textbf {y}}}_i - {\textbf {p}}_i)\), which resembles the sample's learning difficulty, and its feature similarity to other samples, \({\textbf {z}}_i({\textbf {Z}}_{C^{(t)}})^T\). This equation indicates that the influence of sample \({\tilde{s}}_i\) on the DNN's empirical risk is jointly determined by its feature similarity to other samples and its learning difficulty, and samples with both higher feature similarity and higher learning difficulty will have greater learning risk.
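To make Eq. (8) concrete, the following is a minimal sketch of the learning risk computation for a single sample; all names are illustrative assumptions, and the inputs are detached DNN outputs:

```python
import torch

def learning_risk(z_i, p_i, y_i, Z_C, P_C, Y_C, alpha=1.0):
    """A sketch of Eq. (8); names are illustrative, not taken from the
    paper's released code.

    z_i: (N_l,)   penultimate-layer features of sample s~_i
    p_i: (K,)     prediction probabilities of s~_i
    y_i: (K,)     one-hot observed label of s~_i
    Z_C: (m, N_l) features of the nearly clean set C^(t)
    P_C: (m, K)   prediction probabilities of C^(t)
    Y_C: (m, K)   one-hot observed labels of C^(t)
    """
    m = Z_C.shape[0]
    sim = Z_C @ z_i + 1.0    # feature-similarity term z_i Z_C^T + 1, shape (m,)
    resid = P_C - Y_C        # residuals of the nearly clean set, shape (m, K)
    diff = y_i - p_i         # learning-difficulty-like term, shape (K,)
    # 4*alpha/|C| * (z_i Z_C^T + 1)(P_C - Y_C)(y~_i - p_i)^T
    return (4.0 * alpha / m) * torch.dot(sim, resid @ diff)
```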

Hard clean samples typically exhibit irregular feature patterns. Therefore, for a hard clean sample \(s_i\), its feature representation \({\textbf {z}}_i\) will be different from any other sample \(s_d\), resulting in a small feature similarity \({\textbf {z}}_i({\textbf {z}}_d)^{T}\). According to Eq. (8), learning these hard clean samples does not significantly increase the DNN’s empirical risk. In contrast, many noisy samples often have both high feature similarity and high learning difficulty. This means that learning such samples can distort the DNN’s decision boundary and increase its empirical risk. Therefore, by setting an appropriate threshold, the learning risk can accurately detect and correct noisy samples while avoiding mis-correcting hard clean samples. The difference between hard clean samples and noisy samples is illustrated in Fig. 1.
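The nearly clean set \(C^{(t)}\) of Eq. (6) that anchors these learning risk computations can be maintained with a running average of per-epoch learning difficulties; a minimal sketch, with illustrative names, follows:

```python
import torch

class NearlyCleanSelector:
    """A sketch of Eq. (6): track the running average learning difficulty
    of every training sample and select the n_C% easiest as C^(t).
    The class and method names are illustrative assumptions."""

    def __init__(self, n_samples):
        self.difficulty_sum = torch.zeros(n_samples)
        self.epochs_seen = 0

    def update(self, probs, labels):
        # accumulate 1 - f_{y~_i}(x_i; theta^(t)) over the whole dataset
        self.difficulty_sum += 1.0 - probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        self.epochs_seen += 1

    def select(self, n_c):
        avg = self.difficulty_sum / self.epochs_seen
        k = max(int(len(avg) * n_c / 100), 1)
        # tau(t, n_C): the n_C% lowest average learning difficulty
        threshold = torch.topk(avg, k, largest=False).values.max()
        return torch.nonzero(avg <= threshold).squeeze(1)   # indices of C^(t)
```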

In the next subsection, it will be demonstrated that existing LNL methods are ineffective in distinguishing noisy samples from hard clean samples, while from the perspective of samples’ learning risk, these samples can be effectively separated.

Fig. 2 Comparison between SLRLNL and existing LNL methods in separating noisy samples from hard clean samples

Empirical study of separating noisy samples from hard clean samples

We conduct numerical experiments on CIFAR-10 and CIFAR-100 with artificially generated label noise to compare our proposed SLRLNL with the existing LNL methods in separating noisy samples from hard clean samples. The selection criteria used by the existing LNL methods to identify noisy samples are listed below:

  • Co-teaching [22]: Han et al. [22] train two DNNs and utilize the loss value from the other DNN to identify noisy samples. The selection criterion of Co-teaching for identifying noisy samples is the sample's loss value under the other DNN

    $$\begin{aligned} -\log {f_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }}')} {,} \end{aligned}$$
    (9)

    where \({\varvec{\theta }}'\) is the parameters from another DNN.

  • Progressive Label Correction (PLC) [10]: Zhang et al. [10] regard samples whose labels are highly inconsistent with the DNN's prediction probability as noisy samples and progressively correct them. The selection criterion of PLC for identifying noisy samples is

    $$\begin{aligned} \max _{j\ne {\tilde{y}}_i}{\bigg (f_{j}({\textbf {x}}_i;{\varvec{\theta }}) - f_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }}) \bigg )} {.} \end{aligned}$$
    (10)
  • Area Under the Margin ranking (AUM) [19]: Pleiss et al. [19] utilize the average difference between the logit value of a sample's observed class and that of its highest other class to identify noisy samples. Their selection criterion for identifying noisy samples is

    $$\begin{aligned} \frac{1}{t}\sum ^t_{m=1}{\Bigg (\max _{j\ne {\tilde{y}}_i}{\bigg (g_{j}({\textbf {x}}_i;{\varvec{\theta }}^{(m)})\bigg )}-g_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }}^{(m)}) \Bigg )} {.} \end{aligned}$$
    (11)

Existing LNL methods consider samples with high values of the selection criteria listed above to be noisy. In the numerical experiments, hard clean samples are selected as the samples whose learning difficulty \(1 - f_{y_i}({\textbf {x}}_i;{\varvec{\theta }}^*)\) falls within the highest 10% of learning difficulties, where the DNN's parameters \({\varvec{\theta }}^*\) are obtained by training a DNN on the clean dataset. To simulate noisy labels in real-world datasets, we generate 80% uniform flip and 40% pair flip label noise for CIFAR-10 and CIFAR-100, respectively (the details of generating label noise are given in Section 5). Samples that are neither noisy nor hard clean are regarded as simple clean samples. The adopted DNN is a ResNet-18 trained from scratch. The batch size is 64, and we train the DNN with SGD. The learning rate is 2e-2, the momentum is set to 0.9, and the weight decay rate is 5e-4. After the warm-up process (a 10-epoch warm-up for CIFAR-10 and a 30-epoch warm-up for CIFAR-100), the experimental results are as shown in Fig. 2.
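For reference, a minimal sketch of the three baseline criteria (Eqs. (9)-(11)) is given below; the function names and the running-sum bookkeeping for AUM are illustrative assumptions:

```python
import torch

def coteaching_criterion(probs_peer, labels):
    # Eq. (9): cross-entropy loss under the peer network's probabilities
    return -torch.log(probs_peer.gather(1, labels.unsqueeze(1)).squeeze(1) + 1e-12)

def plc_criterion(probs, labels):
    # Eq. (10): margin between the best other class and the observed class
    masked = probs.clone()
    masked.scatter_(1, labels.unsqueeze(1), float('-inf'))
    return masked.max(dim=1).values - probs.gather(1, labels.unsqueeze(1)).squeeze(1)

def aum_update(logits, labels, margin_sum, epochs_seen):
    # One-epoch update of Eq. (11): running sum of the logit margin,
    # divided by epochs_seen when the criterion is read out.
    masked = logits.clone()
    masked.scatter_(1, labels.unsqueeze(1), float('-inf'))
    margin = masked.max(dim=1).values - logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    return margin_sum + margin, epochs_seen + 1
```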

Figure 2 shows the histogram of each selection criteria (Co-teaching, PLC, AUM, and learning risk) for CIFAR-10 and CIFAR-100 with different types of label noise. The horizontal axis represents the value of selection criteria. The vertical axis in Fig. 2 represents the density of the selection criteria, which shows the proportion of data points within each range. To effectively separate noisy samples from simple clean samples and hard clean samples, the selection criteria for noisy samples need to be higher than those for clean samples.

The results presented in Fig. 2 demonstrate that the proposed learning risk criterion can more effectively distinguish noisy samples from both simple clean samples and hard clean samples on CIFAR-10 and CIFAR-100 under different noise types. Moreover, as shown in Fig. 2b, d, when facing pair flip label noise, existing LNL methods can barely separate noisy samples from hard clean samples. This is because pair flip label noise flips ground truth labels into similar classes and thereby does not significantly increase the samples' learning difficulty. On the other hand, since pair flip label noise still damages the DNN's accuracy, samples with pair flip label noise can still be effectively separated from hard clean samples by the learning risk criterion, as shown in Fig. 2b, d. In this case, correcting samples with high learning risk mitigates the noisy labels without hindering the learning of hard clean samples and further improves the DNN's performance in practice.

After evaluating the effectiveness of learning risk in separating noisy samples, the label correction method utilized in this paper will be demonstrated in the next subsection.

Label correction for noisy samples

The label correction method utilized in this paper is illustrated in this subsection. As shown in Fig. 2, unlike hard clean samples and simple clean samples, noisy samples tend to have higher learning risk. Thus, we conduct label correction on samples with high learning risk to mitigate the negative impact from noisy samples, as shown in Eq. (12)

$$\begin{aligned} \breve{y}^{(t+1)}_i=\left\{ \begin{aligned}&\arg \max _{j\in {\mathcal {Y}}}{f_{j}({\textbf {x}}_i;{\varvec{\theta }}^{(t)})} ,&\bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}\ge \upsilon (t,n_V), \\&{\tilde{y}}_i ,&\bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}<\upsilon (t,n_V), \end{aligned} \right. \nonumber \\ \end{aligned}$$
(12)

where \(\bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}\) is calculated through Eq. (8), and \(\upsilon (t,n_V)\) is the selection threshold, i.e., the \(n_V\%\) highest learning risk in the tth epoch. \(n_V \in [0,100)\) is the correction proportion and a hyper-parameter of our method. In this paper, we increase \(n_V\) along the training process to achieve more effective label correction; the implementation details can be found in the hyper-parameters setting part of Section 5. A minimal sketch of this correction step is given below.
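The sketch implements Eq. (12) over the whole training set, assuming the per-sample learning risks of Eq. (8) have already been computed; the names are illustrative:

```python
import torch

def correct_labels(risks, probs, labels, n_v):
    """A sketch of Eq. (12): relabel the n_V% highest-risk samples to the
    DNN's most probable class. `risks` holds the per-sample learning
    risks of Eq. (8); names are illustrative assumptions."""
    corrected = labels.clone()
    k = int(len(risks) * n_v / 100)
    if k == 0:
        return corrected
    threshold = torch.topk(risks, k).values.min()   # upsilon(t, n_V)
    mask = risks >= threshold
    corrected[mask] = probs[mask].argmax(dim=1)     # arg max_j f_j(x_i; theta^(t))
    return corrected
```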

The label correction process based on samples’ learning risk can detect and correct noisy samples without mis-correcting the hard clean samples, thereby improving DNN’s performance in practice.

Relabeling-based label augmentation

Since samples with irregular feature patterns are important for the DNN's generalization, to better utilize the information contained in these hard samples, we propose the Relabeling-based Label Augmentation (RLA) process, which focuses on relabeling samples with high learning difficulty. Since samples with common features and noisy labels can easily be corrected by the learning risk-based label correction, the samples that retain high learning difficulty after the label correction process typically owe it to their irregular feature patterns, making them likely to be hard samples. Thus, selecting samples with high learning difficulty focuses the relabeling on hard samples.

Specifically, in each epoch, the RLA process selects different samples with high learning difficulty and temporarily relabels them to the most probable class other than their training label in the previous epoch, as shown in Eq. (13)

$$\begin{aligned}&\ell _{RLA}(\breve{y}^{(t)}_i,f({\textbf {x}}_i;{\varvec{\theta }}^{(t)}))\nonumber \\&=\left\{ \begin{array}{ll} \ell \big (\arg \max _{j \ne \breve{y}^{(t-1)}_i}{f_{j}({\textbf {x}}_i;{\varvec{\theta }}^{(t)})},f({\textbf {x}}_i;{\varvec{\theta }}^{(t)})\big ) ,&{}\quad i \in I^{(t)}_{R}, \\ \ell (\breve{y}^{(t)}_i,f({\textbf {x}}_i;{\varvec{\theta }}^{(t)})) , &{}\quad \text {else}, \end{array} \right. \end{aligned}$$
(13)

where \(I^{(t)}_{R}\) is the index set of selected samples for the RLA process in the tth epoch, and we set \(I^{(t)}_{R}\) to select high learning difficulty samples different from the previous epoch, as shown as follows:

$$ \begin{aligned} I^{(t)}_{R} = \left\{ i|1-f_{\breve{y}^{(t)}_i}({\textbf {x}}_i; {\varvec{\theta }}^{(t)})\ge \rho (t,n_{R})\ \& \ i \notin I^{(t-1)}_{R}\right\} {,} \end{aligned}$$
(14)

where \(\rho (t,n_{R})\) is the threshold for selecting \(I^{(t)}_{R}\), defined as the \(n_{R}\%\) highest learning difficulty in the tth epoch. \(n_{R} \in [0,100)\) is the relabeling proportion that determines the strength of RLA and is one of the hyper-parameters of our proposed method.

As shown in Eqs. (13) and (14), in each epoch, different samples with high learning difficulty are selected by the RLA process. Therefore, in each epoch, the RLA process temporarily relabels the selected samples, preventing the DNN from memorizing hard noisy samples. Moreover, as shown in Eq. (13), the selected samples are relabeled with a class to which the DNN assigns a certain degree of prediction probability. In a multi-class classification task, the relabeled class can retain a certain amount of semantic information, thus assisting the DNN in acquiring more generalized knowledge about the selected hard samples, which benefits the DNN's generalization performance. Additionally, according to Eq. (14), since the RLA process does not permanently modify the labels of hard clean samples, it will not bias their gradients in an incorrect direction; thus, overall, this process can enhance the learning of hard clean samples. A minimal sketch of the RLA step is given below.
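The sketch follows Eqs. (13) and (14); the names are illustrative, and for simplicity a single label vector stands in for the previous epoch's training labels \(\breve{y}^{(t-1)}_i\):

```python
import torch

def rla_relabel(probs, train_labels, prev_selected, n_r):
    """A sketch of Eqs. (13)-(14): select the n_R% hardest samples that
    were not selected in the previous epoch, and temporarily relabel them
    to the most probable class other than the training label. The stored
    labels are left untouched; only this epoch's targets change."""
    difficulty = 1.0 - probs.gather(1, train_labels.unsqueeze(1)).squeeze(1)
    k = max(int(len(train_labels) * n_r / 100), 1)
    threshold = torch.topk(difficulty, k).values.min()     # rho(t, n_R)
    selected = (difficulty >= threshold) & ~prev_selected  # I_R^(t)
    masked = probs.clone()
    masked.scatter_(1, train_labels.unsqueeze(1), float('-inf'))
    targets = train_labels.clone()
    targets[selected] = masked[selected].argmax(dim=1)     # temporary relabel
    return targets, selected
```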

As will be demonstrated in section 6.5, for the CIFAR-10 and CIFAR-100 datasets with different types of label noise, the proposed RLA process improves the DNN's test accuracy, which indicates that it effectively mitigates the negative impact of hard noisy samples and improves the DNN's generalization performance. Additionally, the experimental results in section 6.6 illustrate that the RLA process reduces the minimal training loss on the hard clean samples, which demonstrates its efficacy in enhancing the learning of hard clean samples.

The implementation detail of our method is provided in the next subsection.

Fig. 3 The flowchart of our method, where \(t_w\) is the warm-up epoch and \(t_m\) is the total number of training epochs

Implementation detail

Basic implementation detail

The hyper-parameters of SLRLNL mentioned above include: the nearly clean samples' selection proportion \(n_C\), the max correction proportion \(max\ n_V\), and the relabeling proportion \(n_{R}\) for RLA. For the label correction process, in each dataset, we increase the correction proportion \(n_V\) by 2 in each epoch until it reaches the maximum proportion \(max\ n_V\), as sketched below. In addition to the above parameters, we also set a warm-up epoch hyper-parameter \(t_w\) to attain a DNN with basic classification ability before conducting SLRLNL. The details of our hyper-parameter settings can be found in "Hyper-parameters setting" in section "Experimental setup". The flowchart of our SLRLNL method is shown in Fig. 3.
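The \(n_V\) schedule can be written as a one-line rule; a minimal sketch, with an illustrative function name, follows:

```python
def correction_proportion(epoch, t_w, max_n_v, step=2):
    """A sketch of the n_V schedule described above: no correction during
    warm-up, then the proportion grows by `step` per epoch until it
    reaches max n_V. The function name is an illustrative assumption."""
    if epoch < t_w:
        return 0
    return min((epoch - t_w) * step, max_n_v)
```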

The overall algorithm is presented in Algorithm 1. The extra computational complexity of our SLRLNL method is listed in Table 2. The extra computational time is primarily incurred when calculating the samples' learning risk, which is \(O(n^2n_C\%(N_{l} + K))\), where n is the size of the training dataset, \(n_C\%\) is the ratio for selecting the nearly clean samples, and \(N_{l}\) is the dimension of the DNN's penultimate layer output. In practice, we can adjust the proportion \(n_{C}\) to reduce the extra computational time of our proposed method. As will be shown in the experiment section, the extra computation time of our proposed SLRLNL is acceptable.

Class imbalance issue

Class imbalance is a common issue in practice. For instance, the data sizes of different classes can vary significantly in real-world scenarios (e.g., datasets such as Clothing1M [41] and Docred [2] utilized in this paper). This variation leads to differences in learning difficulty across classes and ultimately impacts the effectiveness of label correction methods. To improve the performance of our method in practice, in this paper, for all datasets, the selection thresholds (\(\tau (t,n_{C})\), \(\upsilon (t,n_{V})\), and \(\rho (t,n_{R})\)) and the ranking process are computed separately for each class. Let \(K_{j} = \{i|{\tilde{y}}_i = j\}\) be the set of indices of samples with observed label j. For class imbalanced datasets, the selection process in the tth epoch for the nearly clean samples \(C^{(t)}\) is as follows:

$$ \begin{aligned} C^{(t)} = \cup _{j=1}^{K}\left\{ ({\textbf {x}}_i,{\tilde{y}}_i)\,\Big |\, \frac{1}{t}\sum ^{t}_{m=1}{\big ( 1-f_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }}^{(m)})\big )}\le \tau _{j}(t,n_{C})\ \& \ i\in K_{j}\right\} {,} \end{aligned}$$
(15)

where \(\tau _{j}(t,n_{C})\) is the \(n_{C}\%\) lowest average learning difficulty among samples in \(K_{j}\) from the 1st to the tth epoch, and the label correction process for the class imbalanced dataset is

$$\begin{aligned}&\breve{y}^{(t+1)}_i=\left\{ \begin{aligned}&\arg \max _{j\in {\mathcal {Y}}}{f_{j}({\textbf {x}}_i;{\varvec{\theta }}^{(t)})} ,&\bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}\ge \upsilon _{{\tilde{y}}_i}(t,n_V), \\&{\tilde{y}}_i ,&\bigtriangleup E_{{\tilde{s}}_i\rightarrow C^{(t)}}<\upsilon _{{\tilde{y}}_i}(t,n_V), \end{aligned} \right. \end{aligned}$$
(16)

where \(\upsilon _{{\tilde{y}}_i}\) is the \(n_{V}\%\) highest learning risk among samples in \(K_{{\tilde{y}}_i}\) in the tth epoch. The selection for \(I^{(t)}_{R}\) for class imbalanced dataset is

$$ \begin{aligned} I^{(t)}_{R}\!=\!\cup ^{K}_{j=1}\{i| 1\!-\!f_{{\tilde{y}}_i}({\textbf {x}}_i;{\varvec{\theta }}^{(t)})\!\ge \! \rho _j(t,n_{R})\ \& \ i \notin I^{(t-1)}_R \ \& \ i\in K_{j}\} {,} \end{aligned}$$
(17)

where \(\rho _{j}(t,n_{R})\) is the \(n_{R}\%\) highest learning difficulty among samples in \(K_{j}\) in the tth epoch. A sketch of this class-wise selection is given below.
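The sketch shows the class-wise ranking shared by Eqs. (15)-(17); the names are illustrative assumptions:

```python
import torch

def per_class_select(scores, labels, pct, num_classes, highest=True):
    """A sketch of the class-wise selection in Eqs. (15)-(17): within each
    set K_j of samples observed as class j, select the pct% highest (or
    lowest) scores, e.g. learning risks or learning difficulties."""
    selected = torch.zeros(len(scores), dtype=torch.bool)
    for j in range(num_classes):
        idx = torch.nonzero(labels == j).squeeze(1)   # K_j
        if idx.numel() == 0:
            continue
        k = max(int(idx.numel() * pct / 100), 1)
        order = torch.argsort(scores[idx])            # ascending
        chosen = order[-k:] if highest else order[:k]
        selected[idx[chosen]] = True
    return selected
```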

Algorithm 1 Overall algorithm for SLRLNL

Table 2 Computational complexity of the major components and the overall complexity of our proposed method
Fig. 4 Best test accuracy of SLRLNL on CIFAR-10 with different sets of hyper-parameters

Table 3 Average test accuracy and their standard deviation over three trials for DNN trained on CIFAR-10 with standard method, other LNL methods, and our method
Table 4 Average test accuracy and their standard deviation over three trials for DNN trained on CIFAR-100 with standard method, other LNL methods, and our method

Experimental setup

Datasets

The experiments are conducted on four image classification datasets: CIFAR-10 and CIFAR-100 [40], Animal-10N [27], and Clothing1M [41], and one natural language processing dataset: Docred [2]. Label noise in the original CIFAR-10 and CIFAR-100 datasets is negligible, while the Animal-10N and Clothing1M datasets both contain real-world label noise. Docred includes both clean and noisily labeled subsets. To better evaluate our proposed method, two types of artificial label noise are generated for CIFAR-10 and CIFAR-100: uniform flip noise and pair flip noise. Following existing research [15, 18], artificial noise is generated by replacing the labels of randomly selected samples, where samples are selected with probability equal to the noise rate. For uniform flip noise, we replace original labels with other random labels, and for pair flip noise, we replace the sample's label with that of a similar category, as sketched below.
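The two noise types can be sketched as follows; the pair-flip mapping to "the next class index" is an illustrative stand-in for the dataset-specific similar-class mapping:

```python
import torch

def uniform_flip(labels, noise_rate, num_classes, seed=0):
    """A sketch of uniform-flip noise: each sample is selected with
    probability `noise_rate` and its label is replaced by a different
    class drawn uniformly at random."""
    g = torch.Generator().manual_seed(seed)
    noisy = labels.clone()
    flip = torch.rand(len(labels), generator=g) < noise_rate
    # draw an offset in [1, K-1] so the new label always differs
    offset = torch.randint(1, num_classes, (len(labels),), generator=g)
    noisy[flip] = (labels[flip] + offset[flip]) % num_classes
    return noisy

def pair_flip(labels, noise_rate, num_classes, seed=0):
    """A sketch of pair-flip noise: selected labels flip to a fixed
    'similar' class; the next class index stands in for the actual
    similar-class mapping."""
    g = torch.Generator().manual_seed(seed)
    noisy = labels.clone()
    flip = torch.rand(len(labels), generator=g) < noise_rate
    noisy[flip] = (labels[flip] + 1) % num_classes
    return noisy
```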

Of the datasets that contain real-world label noise, both Animal-10N and Clothing1M contain images crawled from online websites and have poor label quality. Animal-10N contains 50,000 training samples and 5000 test samples, and its categories comprise five pairs of animals with similar appearances. Clothing1M is a much larger dataset, containing one million clothing images collected from various online shopping websites, along with 14k correctly labeled images for validation and 10k for testing.

We also evaluate our method on the relation extraction task in the field of NLP. A popular approach for the relation extraction task is distantly supervised learning [57,58,59], which annotates entity pairs in plain text through an open-source knowledge base; labels in distantly supervised datasets are therefore generally erroneous [57,58,59]. The experiments are conducted on the recently proposed Docred dataset [2], which contains 101,873 distantly labeled documents, along with 3053, 1000, and 1000 documents strictly annotated by well-trained human annotators for training, validation, and testing, respectively.

Meanwhile, many of the datasets in practice are class imbalanced (e.g., Clothing1M [41] and Docred [2]), and thus, to improve the effectiveness of SLRLNL in practice, for all the datasets, the selection threshold and ranking process related to our method are performed separately for each class; the details can be found in the "Class imbalance issue" part of section 4.5.

Backbones

The backbone for CIFAR-10 and CIFAR-100 is ResNet-18 [1], and for Clothing1M, the backbone is ResNet-50 pretrained on Imagenet. For Animal-10N, we use VGG-19 with batch normalization [60] as our backbone. For Docred, we adopt the BiLSTM from Yao et al. [2] as backbone and GloVe [61] as word embedding.

Baselines

To evaluate the effectiveness of our proposed SLRLNL, we compared our method with several recent or classic loss reweighting methods and label correction methods. In CIFAR-10 and CIFAR-100, the compared loss reweighting methods include MentorNet [42], Generalized Cross Entropy (GCE) [9], Symmetric Loss (SL) [24], Co-teaching [22], Area Under the Margin ranking (AUM) [19], SELFIE [27], Co-teaching+ [25], Robust inference via Generative classifiers (RoG) [28], Probabilistic End-to-end Noise Correction In Labels (PENCIL) [29], TopoFilter [26], Momentum of Memorization (Me-Momentum) [12], and Soft version of Combats Noisy Labels by Concerning Uncertainty (CNLCU-S) [14]. The compared label correction methods include Likelihood Ratio Test (LRT) [18], Progressive Label Correction (PLC) [10], and Forward-Backward Cycle-Consistency Regularization (FBCCR) [15].

In Animal-10N, the compared baseline loss reweighting methods include ActiveBias [56] and Co-teaching [22]. The compared label correction methods include SELFIE [27] and PLC [10].

In Clothing1M, the compared baseline loss reweighting methods include GCE [9], SL [24], MentorNet [42], Co-teaching [22], AUM [19], and CNLCU-S [14]. The compared label correction methods include LRT [18], PLC [10], Universal Probabilistic Model (UPM) [20], and FBCCR [15].

In Docred, the compared baseline LNL methods include Generalized Cross Entropy (GCE) [9], Symmetric Loss (SL) [24], Noisy Label and Negative Sample Robust Loss function (NLNSRL) [11], AUM [19], LRT [18], and PLC [10].

Hyper-parameters’ setting

For the CIFAR-10 and CIFAR-100 datasets, we adopt random flip, brightness, contrast, and saturation data augmentation. We adopt SGD with an initial learning rate of 2e-2 as our optimizer, and the learning rate is divided by 10 at the 50th and 100th epochs. We set the momentum to 0.9 and the weight decay rate to 5e-4. We then train the DNN for 150 epochs with a batch size of 64. For the hyper-parameters of our method, we set \(n_C=10\), \(n_{R} = 2\), and \(t_w = 10\) for CIFAR-10, and \(n_C=10\), \(n_{R} = 10\), and \(t_w = 30\) for CIFAR-100. To avoid mis-corrections during the label correction process, we set \(max\ n_V\) to half the noise rate: for CIFAR-10 with 40% uniform flip label noise, \(max\ n_V\) is set to 20, and for CIFAR-100 with 30% pair flip noise, \(max\ n_V\) is set to 15. In each epoch, the correction proportion is increased by 2 until it reaches \(max\ n_V\).

For Animal-10N, we adopt random flip as data augmentation. We set SGD as the optimizer and train the DNN for 360 epochs with an initial learning rate of 1e-1, which is divided by 10 at the 150th and 250th epochs. The batch size is 64. For the hyper-parameters of our method, we set \(n_C=10\), \(max\ n_V=4\), \(n_{R} = 2\), and \(t_w = 50\). In each epoch, the correction proportion is increased by 2 until it reaches \(max\ n_V\).

For Clothing1M, we follow the experimental setting in Zhang et al. [10] and use a randomly sampled pseudo-balanced subset of about 260k images. The adopted data augmentation strategies include random crop and random flip. We train the DNN with a batch size of 32 using SGD with a learning rate of 1e-2 for 20 epochs, dividing the learning rate by 10 at the 3rd, 6th, and 9th epochs. The hyper-parameters of our method are set as \(n_C=10\), \(max\ n_V=10\), \(n_{R} = 5\), and \(t_w = 1\). In each epoch, the correction proportion is increased by 2 until it reaches \(max\ n_V\).

For Docred, Adam with a learning rate of 2e-4 is adopted as the optimizer. Each mini-batch contains 30 documents, and we train the DNN for 20 and 100 epochs on the distantly supervised dataset and the human-annotated dataset, respectively. For the hyper-parameters of our method, we set \(n_C=1\) and \(n_{R} = 1\) for both datasets, and the correction proportion \(n_V\) is directly set to 2. We set \(t_w = 0\) for the distantly supervised dataset and \(t_w=30\) for the human-annotated dataset.

Experimental results

This section presents the experimental results for SLRLNL and the other methods. The source code for SLRLNL can be found at https://github.com/yangbo1973/SLRLNL.

Experimental results for CIFAR-10 and CIFAR-100

This subsection presents the experimental results on the artificially noised CIFAR-10 and CIFAR-100. The performance on the test dataset is reported in Tables 3 and 4. Zheng et al. [18] reported the results of Standard, MentorNet, Co-teaching, and LRT in their study. Wu et al. [26] reported the results of Co-teaching+, RoG, PENCIL, and TopoFilter in their study. Song et al. [27] reported the results of SELFIE against pair flip noise and against uniform flip noise with rates of 20% and 40%. Wang et al. [24] reported the results of GCE and SL against uniform flip noise. Pleiss et al. [19] reported the results of AUM against uniform flip noise. Bai et al. [12] reported the results of Me-Momentum against uniform flip noise with rates of 20% and 40%. Cheng et al. [15] reported the results of FBCCR against pair flip noise with rates of 20% and 40% and against uniform flip noise with rates of 20%, 40%, and 60%. Xia et al. [14] reported the results of CNLCU-S against pair flip noise with rates of 20% and 40% and uniform flip noise with rates of 20% and 40%. We reproduced the remaining experimental results listed in Tables 3 and 4 that are not reported in the studies mentioned above.

The results in Tables 3 and 4 show that the accuracy of our method exceeds that of the other methods when faced with noisy samples, which indicates that the proposed SLRLNL can better learn hard clean samples while mitigating the negative impact of noisy samples. These experiments are conducted on an RTX 3080, and the training times in Table 5 show that the extra computation time of our method is acceptable.

Table 5 Average training time and their standard deviation over three trials for DNN trained on CIFAR-10 and CIFAR-100 during each epoch with standard method and our method
Table 6 Average test accuracy and its standard deviation over three trials for DNN trained on Animal-10N with standard method, other LNL methods, and our method
Table 7 Accuracy score for test dataset for DNN trained on Clothing1M with standard method, other LNL methods, and our method

Experimental results for Animal-10N and Clothing1M

To evaluate the effectiveness of our method when dealing with real-world label noise, we conduct experiments on Animal-10N [27] and Clothing1M [41].

Other than the standard method that only utilizes cross-entropy, we compare our proposed method with the existing LNL methods ActiveBias [56], Co-teaching [22], SELFIE [27], and PLC [10]. The experimental results are reported in Table 6, where the results of Standard, ActiveBias, Co-teaching, and SELFIE are taken from Song et al. [27], and the results of PLC [10] are taken from its original paper. The results presented in Table 6 demonstrate that our method surpasses the existing LNL methods.

The experimental results for Clothing1M are reported in Table 7, where the results of Standard, GCE, SL, LRT, and PLC are taken from Zhang et al. [10], the results of MentorNet, Co-teaching, and CNLCU-S are taken from Xia et al. [14], and the results of AUM [19], UPM [20], and FBCCR [15] are taken from their respective papers. Our method is compared against these existing popular LNL methods, and the results in Table 7 demonstrate that it is more effective than the existing baselines.

The experimental results in Animal-10N and Clothing1M indicate that improving DNN’s learning for hard clean samples while mitigating label noise is beneficial for DNN’s generalization in practice.

Table 8 Evaluation results of average F1 score and its standard deviation over three trials on Dev set for DNN trained on Human-annotated and Distantly supervised dataset of Docred with standard method, other LNL methods, and our method
Table 9 Average test accuracy and its standard deviation over three trials on CIFAR-10 and CIFAR-100 for SLRLNL with or without RLA

Experimental results for Docred

We conducted experiments on both the distantly supervised dataset and the human-annotated dataset of Docred and evaluated the results on the validation dataset. We reproduced the results of the existing LNL baselines and used the F1 score to evaluate the DNN's performance. Table 8 displays the experimental results. The results show that our method outperforms several baseline LNL methods on both the distantly supervised and human-annotated datasets, indicating that SLRLNL is also applicable to tasks in the NLP field.

Hyper-parameters’ analysis

Four hyper-parameters are involved in the proposed SLRLNL: the warm-up epoch \(t_w\), the nearly clean samples' selection proportion \(n_C\), the max correction proportion \(max\ n_V\), and the relabeling proportion \(n_{R}\) for RLA. Each of these hyper-parameters is determined by different characteristics of the training dataset. The hyper-parameter \(t_w\) is set to better utilize the DNN's memorization effect [34, 35], i.e., the tendency of the DNN to first learn the samples with clean labels, and it is determined by the learning efficiency of the DNN on the training dataset. The hyper-parameter \(n_C\) is set to ensure that the samples' learning risk can be calculated accurately, and it is determined by the severity of the label noise. The hyper-parameter \(n_V\) adjusts the strength of the label correction process and is also determined by the severity of the label noise. The hyper-parameter \(n_R\) regulates the strength of the RLA process, and its value is associated with the overall learning difficulty of the training dataset: a higher proportion of samples with high learning difficulty suggests a higher value. To evaluate the effect of these parameters and determine the best values for our method, we adjust each of them individually while keeping the other three fixed. First, we fix the hyper-parameters as \(t_w=10\), \(n_C=10\), \(max\ n_V=20\), and \(n_{R}=2\), and then test \(t_w\) in the range [0, 10, 20, 30], \(n_C\) in the range [5, 10, 20, 40], \(max\ n_V\) in the range [10, 20, 40, 80], and \(n_{R}\) in the range [2, 5, 10, 20]. The backbone DNN and optimization parameters are identical to the setting for CIFAR-10 in the "Hyper-parameters setting" part of Section 5. The experimental results on CIFAR-10 are reported in Fig. 4.

As shown in Fig. 4a, for the hyper-parameter \(t_w\), when the dataset is heavily noised, starting the correction process early can benefit the DNN's performance. If the dataset contains only a few noisy labels, reasonably extending the warm-up process can mitigate mis-correction by the DNN.

When the dataset contains only a little label noise, the influence of the nearly clean samples' selection proportion \(n_C\) on the DNN's performance is negligible, as shown in Fig. 4b. However, for heavily noised datasets, \(n_C\) needs to be tuned to improve the efficacy of label correction.

As for the max correction proportion \(max\ n_V\), it is suggested to set it to a lower value when the dataset contains only a little label noise, to guarantee the precision of correction. When the dataset is heavily noised, \(max\ n_V\) can be set to a higher value to eliminate the negative impact of the label noise. However, setting \(max\ n_V\) too high can result in a significant amount of mis-correction; this is evident from the results in Fig. 4c, where setting \(max\ n_V\) to 80 lowers the DNN's performance. Therefore, this hyper-parameter must be tuned carefully in practical applications.

Under different levels of label noise, the DNN's performance is insensitive to the setting of \(n_{R}\). However, to extract more information from the hard samples, it is recommended to set this parameter to a reasonably higher value when the training dataset contains a high proportion of samples with high learning difficulty.

Table 10 Average minimal training loss (cross-entropy) over three trials on the hard clean samples of CIFAR-10 and CIFAR-100 during training process
Table 11 Average minimal loss (cross-entropy) over three trials on the hard clean samples in the test set of CIFAR-10 and CIFAR-100

In general, the DNN's performance remains insensitive to different settings of SLRLNL's hyper-parameters when the label noise is not severe (20%, 40%, and 60% uniform flip label noise). However, when the training dataset is heavily noised (80% uniform flip label noise), the warm-up epoch \(t_w\), the nearly clean samples' selection proportion \(n_C\), and the correction proportion \(n_V\) influence the DNN's performance. This indicates that these three hyper-parameters need to be carefully tuned in practice.

Ablation study

An ablation study is conducted in this subsection to validate the effectiveness of RLA in improving the DNN's generalization performance. Specifically, we perform SLRLNL on CIFAR-10 and CIFAR-100 with artificial noise; the adopted DNN is ResNet-18, and the hyper-parameter settings are identical to the "Hyper-parameters setting" part in Section 5. We then compare the DNN's performance between SLRLNL with RLA (\(n_R = 2\) for CIFAR-10, \(n_R = 10\) for CIFAR-100) and SLRLNL without RLA (\(n_R = 0\) for both datasets), and the results are reported in Table 9.

As shown in Table 9, the proposed RLA process achieves greater improvement on CIFAR-100. This is because the learning difficulty of samples in CIFAR-100 is higher than that in CIFAR-10, so the RLA process, which aims to extract more information from the hard samples, brings about more significant improvements. This finding indicates that, after the risk-based label correction process, the RLA process can effectively prevent the DNN from memorizing hard noisy samples and enhance the learning of hard samples, thus improving the DNN's generalization performance.

Empirical study of influence of the RLA process on the hard clean samples

Hard clean samples are vital for the DNN's generalization performance. Since the RLA process focuses on relabeling hard samples, it is important to validate whether this process influences the DNN's learning of hard clean samples.

In this subsection, we conducted empirical experiments on CIFAR-10 and CIFAR-100 with uniform flip or pair flip label noise to test the influence of the RLA process on hard clean samples. The backbone DNN is ResNet-18, and the hyper-parameter settings are identical to the "Hyper-parameters setting" part in Section 5. We then compare the DNN's training loss on the hard clean samples. The hard clean samples are selected as the samples with the highest 10% learning difficulty from a DNN trained on the clean training dataset, and we keep these samples clean in the training dataset. The experimental results are listed in Table 10, which shows the minimal training loss on the hard clean samples during the training process. Moreover, we also evaluated the DNN's performance on the hard clean samples in the test set, selected as those with the highest 10% learning difficulty from the DNN trained on the clean training dataset. The corresponding results are listed in Table 11, which shows the minimal loss on the hard clean samples in the test set.

As shown in Tables 10 and 11, for the CIFAR-10 and CIFAR-100 datasets, the RLA process reduced the loss on hard clean samples in both the training and test datasets. This indicates that the RLA process encourages the DNN to extract useful information from the hard clean samples. In summary, although the RLA process may temporarily relabel the hard clean samples and slow down the DNN's learning on them, it still improves the learning of hard clean samples.

Discussion

In contrast to existing LNL methods that rely on samples' learning difficulty [10, 18, 19, 22], our proposed SLRLNL method can better distinguish noisy samples from hard clean samples. As a result, it effectively mitigates the adverse effects of label noise without compromising the learning of hard clean samples, ultimately leading to better performance than the existing baseline LNL methods. Moreover, to extract extra information from the hard samples, we proposed the RLA process to prevent the DNN from memorizing hard noisy samples and to further enhance the DNN's learning of hard clean samples.

In general, the experimental results in Subsection 6.1 reveal that our proposed method can effectively improve the DNN's performance when the training dataset contains artificially generated noisy labels. The experimental results in Subsections 6.2 and 6.3 reveal that our method achieves better performance than the baseline label correction and loss reweighting methods, which shows that our proposed method can detect noisy samples more effectively in practice.

Conclusion and future work

Conclusion

In conclusion, the primary purpose of the proposed SLRLNL is to detect and correct noisy samples without mis-correcting hard clean samples, and thus improve the DNN's performance in practice. Its benefits for DNN accuracy stem from two aspects. First, we utilize the learning risk of samples to correct noisy samples without mis-correcting hard clean samples. Since the latter are vital for DNN generalization, SLRLNL can further improve DNN performance in practice. Second, our proposed RLA can enhance DNN generalization by encouraging the learning of more generalized knowledge about the hard samples, resulting in improved generalization performance in practice. The empirical study in section "Empirical study of separating noisy samples from hard clean samples" shows that our method can more accurately separate noisy samples from hard clean samples, and the empirical study in section "Empirical study of influence of the RLA process on the hard clean samples" indicates that the RLA process can enhance the learning of hard clean samples. The experimental results in sections "Experimental results for CIFAR-10 and CIFAR-100", "Experimental results for Animal-10N and Clothing1M", and "Experimental results for Docred" demonstrate the effectiveness of our proposed SLRLNL in improving DNN accuracy when trained with artificial or real-world label noise, compared to existing popular LNL methods.

Future work

The proposed SLRLNL is effective in separating noisy samples from hard clean samples. In future work, we will incorporate our method with semi-supervised learning to further improve its performance in practice.

Limitations

This work still has a few limitations. First, although the proposed SLRLNL separates noisy samples from hard clean samples more effectively than existing learning difficulty-based LNL methods, separating hard noisy samples from hard clean samples remains challenging for SLRLNL. Effectively separating these two types of samples requires constructing a more efficient feature extractor, which in practice is frequently task-specific and is beyond the scope of this paper.

Second, although the experimental results in the ablation study show that the proposed RLA process can improve the DNN's performance, we lack sufficient theoretical evidence to prove that it can reduce the noise rate or contribute to better convergence toward the optimal classifier learned on the clean dataset.

Third, the proposed method in this paper aims to avoid mis-corrections on samples with clean labels but irregular feature patterns. However, in multimodal scenarios, a sample's high learning risk can also be caused by inconsistency between the different modalities of its input (e.g., in an image-text multimodal scenario, the image input is mismatched with the text). Under such circumstances, the proposed method can potentially mistake samples with inconsistent multimodal inputs for samples with noisy labels, thus degrading the performance of the proposed method.