1 Introduction

Deep Neural Networks (DNNs) generally rely on large-scale training data with accurate human-annotated labels to achieve satisfactory performance (Krizhevsky et al., 2012). However, because labeling data is costly and complex, labels are often contaminated by noise. Many works have therefore strived to develop alternative methods that are robust to label noise, a setting commonly called Learning with Noisy Labels (LNL) (Natarajan et al., 2013).

Recent studies for LNL have generally attempted to distinguish clean samples from the noisy dataset using handcrafted methods, e.g., Gaussian Mixture Models (GMMs), and then use these clean samples as labeled samples in a Semi-Supervised Learning (SSL) phase (Li et al., 2020; Nishi et al., 2021). However, the loss distribution often does not follow a Gaussian distribution (Arazo et al., 2019), and samples whose loss values are neither large nor small enough cannot be properly distinguished. Furthermore, the dominant approaches maintain multiple models to avoid the risk attributable to the ability of DNNs to fit arbitrary labels, but this often leads to complicated training procedures (Iscen et al., 2022). Moreover, in the aforementioned LNL methodologies that leverage SSL techniques, the weight of the unlabeled loss, one of the most substantial hyper-parameters, must be adjusted carefully depending on the noise ratio to prevent the model from overfitting. However, the noise ratio is difficult to estimate in a real-world environment, making this an unrealistic approach.

Fig. 1
figure 1

The concept of our alternating update framework with SplitNet. When the main network outputs the prediction history and label, SplitNet uses them to generate a clean training dataset and delivers it to the main network, where it is used as the labeled data for SSL. These two phases alternate to boost convergence and performance

To overcome these limitations, we present a novel framework incorporating a learnable network, called SplitNet, which splits the clean and noisy data in a data-driven manner. Contrary to conventional methods (Li et al., 2020; Nishi et al., 2021) that fit GMMs solely based on per-sample loss distribution to select clean samples, our SplitNet can additionally incorporate the prediction history as input, which allows us to better distinguish ambiguous samples that cannot be precisely distinguished by GMM. In addition, we use a split confidence, a score indicating how confidently SplitNet divides the samples, to determine whether to apply unsupervised loss, enabling more stable learning of SSL method in LNL settings.

More specifically, our overall framework begins with a warm-up and then iteratively learns the main network and SplitNet. As shown in Fig. 1, by formulating the main network and SplitNet in an iterative manner, the two learners are alternately updated, each using the data from the other network. For SplitNet training, the main network provides class prediction and loss distribution, while for the main network training, SplitNet provides split confidences to flexibly adjust the threshold for its SSL procedure.

In particular, taking into account the learning status of the main network and the estimated noise ratio of the dataset, the thresholds are automatically calculated to distinguish confidently clean and noisy samples. This process, which we dub risk hedging, results in a favorable learning environment for SplitNet and mitigates confirmation bias. As the number of confidently clean and noisy samples increases throughout the process, SplitNet enjoys the benefit of a natural curriculum with the aid of a gradually increasing number of hard samples.

The key contributions of this method are as follows:

  • Through a learnable network called SplitNet, our method distinguishes clean samples from noisy datasets more effectively than other methods.

  • We propose an SSL method favorable to LNL that utilizes the split confidence obtained through SplitNet, enabling the learning curriculum to adjust automatically depending on the noise ratio.

  • Our method significantly outperforms state-of-the-art results on numerous benchmarks with different types and levels of label noise.

Fig. 2
figure 2

The overall architecture of our method. After training the main model through a warm-up, we use the proposed risk hedging process to only select confident samples to train SplitNet. With SplitNet, we obtain clean probability and split confidence, and with this information, we train the main model through SSL. Loss distribution generated by the main model is used in risk hedging. The main model and SplitNet can be alternately improved through this iterative process

2 Related Work

2.1 Learning with Noisy Labels

Modern LNL methods can be largely classified into two categories. The first category uses loss correction. These methods are further classified into those that relabel noisy samples to correct losses and those that reweight the loss of each sample. On the one hand, in a study related to relabeling, Reed et al. (2014) proposed a bootstrap method that adjusts the loss using model predictions. Additionally, the D2L method proposed by Ma et al. (2018) provided further improvement by using the dimensionality of the feature space to determine the weights of the output and label. Furthermore, Tanaka et al. (2018) proposed a joint optimization method, which reassigns noisy labels depending on the output of the network and updates the network’s parameters and labels each epoch. On the other hand, regarding reweighting methods, Shen and Sanghavi (2019) conducted training by treating smaller-loss samples as clean.

The second category first discards noisy sample labels and then applies semi-supervised learning. Ding et al. (2018) and Kong et al. (2019) proved that the SSL method is effective for LNL, and Li et al. (2020) avoided confirmation bias (Tarvainen & Valpola, 2017) by having two classification networks filter data for each other. Nishi et al. (2021) examined augmentations that are effective for LNL and provided additional related contributions. The method proposed in this paper also utilizes SSL, but its novelty compared to existing studies resides in the fact that it only requires a single classification network to resolve confirmation bias in a data-driven manner using an auxiliary SplitNet. Moreover, in order to start from a more favorable point, the proposed method utilizes K-fold cross-filtering to distinguish between clean and noisy data and trains networks using the SSL method.

2.2 Semi-supervised Learning

SSL methods aim to utilize not only labeled data but also unlabeled data in order to enhance the performance of a model. SSL methodology is particularly effective when the amount of labeled data is limited and a large amount of unlabeled data is available. SSL has been applied in multiple ways in diverse fields of study and is considered a mature research field (Yang et al., 2021b). In general, SSL methodology can be divided into consistency regularization (Miyato et al., 2018; Laine & Aila, 2016; Tarvainen & Valpola, 2017), which forces differently augmented versions of the input data to yield the same prediction, and entropy minimization (Grandvalet & Bengio, 2004) together with pseudo labeling (Lee et al., 2013), which push the model toward more confident predictions on unlabeled data. In recent times, holistic approaches that combine all of the aforementioned methodologies have shown improved performance (Sohn et al., 2020; Berthelot et al., 2019b, a).

Recently, advancing beyond methods that modify the threshold class-wise according to the difficulty levels for different classes (Zhang et al., 2021; Xu et al., 2021; Wang et al., 2022), the most recent development is DISC (Li et al., 2023), which adjusts the threshold on an instance-wise basis. While DISC utilizes the confidence values from the main classification network, our approach leverages the split confidence obtained from SplitNet to adjust the threshold.

3 Methodology

3.1 Overview

Let us denote \(\mathcal {X}= \{(x_i,y_i)\}^N_{i=1}\) as a training dataset, where \(x_i\) is an image, \(y_i\) is a one-hot label over r classes, and N is the total number of training samples. In the noisy label setting, we assume that \(y_i\) may be corrupted; such labels are called noisy labels. We define noisy data as images with noisy labels. \(p_{\textrm{m}}(x;\theta )\) is the predicted class distribution produced by the main model \(p_{\textrm{m}}(\cdot ;\theta )\) with parameters \(\theta \) for input x. Our goal is to optimize the model parameters \(\theta \) so that \(p_\textrm{m}(x_i;\theta )\) approaches the ground-truth label.

Figure 2 shows our overall architecture. After training the main model through a warm-up, we use the proposed risk hedging process to only select confident samples to train SplitNet. With SplitNet we obtain clean probability and split confidence, and with this information we train the main model through SSL. Loss distribution generated by the main model is used in risk hedging as the whole process is repeated. Through this iterative process, the main model and SplitNet can be alternately improved.

3.2 SplitNet

Concretely, given the dataset, the proposed SplitNet is designed to output a probability prediction \(s\in \mathbb {R}^2\) over the two classes, clean and noisy. The network takes three inputs: the model prediction \(\{p_{\textrm{m}}(x_i;\theta )\}_{i}\), the difference between the model predictions of the current and previous iterations \(\{\Delta \,p_{\textrm{m}}(x_i;\theta )\}_{i}\), and the one-hot label \(\{y_i\}_{i}\) for \(i \in \{1,...,M\}\), where M is the number of samples selected by risk hedging out of the N training samples. Note that the samples selected by the risk hedging process change in each iteration. In the following, we explain training SplitNet with the proposed risk hedging and semi-supervised learning framework. In addition, Sect. 3.3 provides a detailed discussion of the reasons for selecting these three inputs.

Fig. 3
figure 3

Variations of SplitNet. As in (a), the number of layers can be adjusted. (b) is a model that does not consider the prediction difference. (c) is a model without batch normalization

SplitNet is trained to classify the clean and noisy data that have been labeled by the GMM; thus, it requires that the GMM correctly separates clean and noisy data, but there are many cases where the GMM misclassifies data in the overlap between the clean and noisy distributions. A naïve solution would be to use a fixed threshold to select only confident data. However, as the model evolves, the loss distribution changes continually, which a fixed threshold cannot accommodate. This leads to the model ignoring a considerable amount of unlabeled data at the earlier stages of training or using a considerable amount of incorrectly labeled data at the later stages of training (Xu et al., 2021; Zhang et al., 2021; Wang et al., 2022).

3.2.1 Risk Hedging

To solve this problem, we propose risk hedging, a process that enhances the training of SplitNet by dynamically adjusting the thresholds and selecting confident data. In the risk hedging process, the model’s current learning status and the noise ratio of the training dataset are estimated automatically.

In this context, we fit a GMM on the loss distribution of the entire training data to obtain the clean probability w. A large mean of the clean probability distribution implies that the dataset is mostly composed of clean data, and so more of the data can be treated as clean. A large standard deviation of the clean probability distribution implies that the ability to separate clean from noisy data has improved, and so the next value of the threshold is decreased.
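As a concrete illustration, the sketch below shows one way the clean probability w and its statistics could be computed by fitting a two-component GMM to the per-sample losses; the function name and GMM settings are illustrative assumptions rather than our released implementation.

```python
# Illustrative sketch (not the authors' released code): fit a two-component GMM on
# the per-sample losses from the main network and read off the clean probability w.
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(per_sample_loss):
    """per_sample_loss: (N,) array of cross-entropy losses over the training set."""
    losses = per_sample_loss.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    # The component with the smaller mean loss is treated as the "clean" mode.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    w = gmm.predict_proba(losses)[:, clean_component]  # clean probability per sample
    return w

# Statistics used by risk hedging (Sect. 3.2.2): w_bar = w.mean(), sigma^2 = w.var().
```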

Specifically, \(\tau _{\mu }\) and \(\tau _{\nu }\) should be determined, where \(\tau _{\mu }\) denotes the threshold that distinguishes clean data with clean label \(\mu \), and \(\tau _{\nu }\) denotes the threshold that distinguishes noisy data with noisy label \(\nu \). Note that \(\mu \) and \(\nu \) represent the binary classes of clean and noisy data, respectively, as a one-hot label. In practice, we used [1,0] for \(\mu \) and [0,1] for \(\nu \). SplitNet’s target label \(c \in \mathbb {R}^2\) can be either \(\mu \) (clean) or \(\nu \) (noisy), and is determined by comparing \(\tau _{\mu }\) and \(\tau _{\nu }\) with the clean probability w derived from the GMM. Formally, for dataset \(\mathcal {X}_w=\{x_i,y_i,w_i\}^N_{i=1}\), the training dataset for SplitNet is defined as:

$$\begin{aligned} \begin{aligned}&\{(x,y,\mu )\mid w\ge \tau _{\mu },\ (x,y,w)\in \mathcal {X}_w\}\\&\cup \{(x,y,\nu )\mid w\le \tau _{\nu },\ (x,y,w)\in \mathcal {X}_w\}. \end{aligned} \end{aligned}$$
(1)

In the following section, we explain the detailed derivation process of \(\tau _{\mu }\) and \(\tau _{\nu }\).

3.2.2 Derivation of \(\tau _\mu \) and \(\tau _\nu \)

We define \(\mathrm {\tau _{\mu }}\) and \(\mathrm {\tau _{\nu }}\) as follows:

$$\begin{aligned} \begin{aligned} \tau _{\mathrm {\mu }} := z-z^\textrm{F}\!\bar{w}^\textrm{F}\!\textrm{P}(\sigma ),\\ \tau _{\mathrm {\nu }} := z-\,z\,\,\bar{w}\,\,\textrm{P}(\sigma ), \end{aligned} \end{aligned}$$
(2)

where \(\textrm{P}(\sigma )\) is a function defined as \(1-4\,\sigma ^2\), and the pivot point z is a value between 0 and 1 which serves as a reference point for the clean and noisy thresholds. Each threshold value changes based on z. \(\bar{w} = \frac{1}{\mid \mathcal {X}\mid }\sum _{i=1}^N{w_i}\) and \(\sigma ^2 = \frac{1}{\mid \mathcal {X}\mid }\sum _{i=1}^N(w_i-\bar{w})^2\) are the mean and variance of the clean probability predicted with the GMM for the entire dataset, respectively, where \(w_i\) is the clean probability of the i-th sample predicted with the GMM. \(\textrm{F}\) is an operator that performs the following operation, where j is the imaginary unit:

$$\begin{aligned} z^\textrm{F} := (1-z)j. \end{aligned}$$
(3)

Lemma 1

Let x be a vector of n numbers in the range [0, r], where r is a positive number. Then, the maximum variance of these n numbers is \(r^2/4\).

Proof

Let \(\bar{x} = \frac{1}{n}\sum _{i=1}^nx_i\) and \(\textrm{var}(x)=\frac{1}{n}\sum _{i=1}^{n}(x_i-\bar{x})^2\). Since \(x_i \le r\),

$$\begin{aligned} \sum \limits _{i}x_i^2=\sum _ix_i\cdot x_i\le \sum _i r\cdot x_i=rn\frac{1}{n}\sum _ix_i =rn\bar{x}. \end{aligned}$$

Note that \(0\le \bar{x}\le r\). Then,

$$\begin{aligned} \begin{aligned} n\cdot \textrm{var}(x)&=\sum _i(x_i-\bar{x})^2\\&=\sum _i(x_i^2-2x_i\bar{x}+\bar{x}^2)\\&=\sum _ix^2_i-2\bar{x}\sum _ix_i+n\bar{x}^2\\&=\sum _ix^2_i-2\bar{x}n\frac{1}{n}\sum _ix_i+n\bar{x}^2\\&=\sum _ix^2_i-n\bar{x}^2\\&\le rn\bar{x}-n\bar{x}^2=n\bar{x}(r-\bar{x}). \end{aligned} \end{aligned}$$

And thus

$$\begin{aligned} \textrm{var}(x)\le \bar{x}(r-\bar{x}). \end{aligned}$$

Using AM-GM inequality, we get

$$\begin{aligned} \bar{x}(r-\bar{x}) \le \left( \frac{\bar{x}+(r-\bar{x})}{2}\right) ^2=\frac{r^2}{4}. \end{aligned}$$

This shows that,

$$\begin{aligned} \textrm{var}(x)\le \frac{r^2}{4}. \end{aligned}$$

\(\square \)

Equation (2) is derived as follows. According to Lemma 1, for a distribution of real numbers between 0 and 1, the minimum and maximum values of \(1-4\sigma ^2\) are 0 and 1, respectively. Thus we can set \(\tau _\mu \) and \(\tau _\nu \) to move dynamically between z and 1, and between 0 and z, respectively. In this paper, we set z to 0.5 for all experiments.

To summarize, \(\tau _\mu \) and \(\tau _\nu \) are updated to new values at the end of each epoch of the main network training, using the refreshed statistical values of w.
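The following sketch illustrates how Eqs. (1)–(3) could be evaluated in practice, reading j as the imaginary unit so that \(z^\textrm{F}\bar{w}^\textrm{F} = -(1-z)(1-\bar{w})\) and the thresholds stay real-valued; the helper names are hypothetical, and z defaults to 0.5 as in our experiments.

```python
# Illustrative sketch of Eqs. (1)-(3). We read j as the imaginary unit, so
# z^F * w_bar^F = (1 - z)(1 - w_bar) * j^2 = -(1 - z)(1 - w_bar), which keeps the
# thresholds real-valued; function names are hypothetical.
import numpy as np

def risk_hedging_thresholds(w, z=0.5):
    """w: (N,) clean probabilities from the GMM; z: pivot point in (0, 1)."""
    w_bar, sigma2 = w.mean(), w.var()
    p_sigma = 1.0 - 4.0 * sigma2                        # P(sigma), in [0, 1] by Lemma 1
    tau_mu = z + (1.0 - z) * (1.0 - w_bar) * p_sigma    # clean threshold, moves in [z, 1]
    tau_nu = z - z * w_bar * p_sigma                    # noisy threshold, moves in [0, z]
    return tau_mu, tau_nu

def splitnet_training_indices(w, tau_mu, tau_nu):
    """Eq. (1): keep only confidently clean/noisy samples; the rest are skipped."""
    clean_idx = np.where(w >= tau_mu)[0]   # labeled mu = [1, 0]
    noisy_idx = np.where(w <= tau_nu)[0]   # labeled nu = [0, 1]
    return clean_idx, noisy_idx
```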

3.3 Network Architecture

As shown in Fig. 3, SplitNet could be implemented in various ways. First of all, we evaluate several modifications of the network architecture to understand SplitNet further. Specifically, we measure the performance of SplitNet by:

  1. Changing the number of layers (Fig. 3a).

  2. Removing the prediction difference from the input (Fig. 3b).

  3. Removing the batch normalization (Ioffe & Szegedy, 2015) (Fig. 3c).

We experiment with Fig. 3a, b, and c for the following reasons.

Samples with noisy labels generate wrong supervised signals during the warm-up. As discussed in Sect. 3.5, these samples usually have large losses, so during main training, their labels are discarded and they are learned through the unsupervised loss. Therefore, the change in their logits per epoch is large, and this can be used as a cue to distinguish noisy data. To confirm this effect, we design the structure in Fig. 3b, which does not consider logit differences, and compare its performance with that of Fig. 3a.

In addition, in order to design SplitNet to have sufficient capacity while remaining lightweight, we measure the performance while changing the number of layers, as shown in Fig. 3a. The number of Linear - Batch Normalization - ReLU layers is increased from 2 to 4.

We evaluate the performance of the SplitNet with batch normalization removed, shown in Fig. 3c, to confirm the importance of batch normalization in the structure of the SplitNet. As a result, convergence fails when batch normalization is not used, verifying the importance of batch normalization.

As a result of the experiment, SplitNet shows the best performance when composed of 3 layers with batch normalization, as in Fig. 3a, and we adopt this as our structure. We provide a more detailed performance analysis in Sect. 5.4.
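For concreteness, a minimal PyTorch sketch of the adopted variant (three Linear-BatchNorm-ReLU blocks followed by a projection layer with softmax) is given below; the hidden width is an illustrative assumption, not a value reported in the paper.

```python
# A minimal PyTorch sketch of the adopted SplitNet variant (Fig. 3a): three
# Linear-BatchNorm-ReLU blocks and a projection layer with softmax. The hidden
# width is an illustrative assumption.
import torch
import torch.nn as nn

class SplitNet(nn.Module):
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        in_dim = 3 * num_classes  # prediction, prediction difference, one-hot label
        def block(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out),
                                 nn.BatchNorm1d(d_out),
                                 nn.ReLU(inplace=True))
        self.body = nn.Sequential(block(in_dim, hidden),
                                  block(hidden, hidden),
                                  block(hidden, hidden))
        self.proj = nn.Linear(hidden, 2)

    def forward(self, pred, pred_delta, one_hot_label):
        x = torch.cat([pred, pred_delta, one_hot_label], dim=1)
        s = torch.softmax(self.proj(self.body(x)), dim=1)
        return s  # s[:, 0]: clean probability, s[:, 1]: noisy probability
```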

Fig. 4
figure 4

Pseudo label accuracy by threshold. The higher the noise ratio, the better the performance at a weak threshold. Conversely, the lower the noise ratio, the better the performance at a strong threshold

3.4 Dynamic Thresholding in Semi-supervised Learner

To further train the main model, we define the labeled and unlabeled datasets required to train the semi-supervised learner as follows. Let \(s\in \{s_{\textrm{clean}},s_{\textrm{noisy}}\}\) denote the binary class prediction, where \(s_{\textrm{clean}}\) and \(s_{\textrm{noisy}}\) are the clean and noisy probabilities predicted by SplitNet. The dataset \(\mathcal {X}\) is forwarded to SplitNet to obtain s and form the dataset \(\mathcal {X}_{s}=\{(x_i,y_i,s_i)\}^N_{i=1}\). Note that, as shown in Fig. 3, the last layer of SplitNet includes a softmax function, ensuring that the sum of \(s_{\textrm{clean}}\) and \(s_{\textrm{noisy}}\) is always 1.

Using this dataset, we form a clean labeled dataset \(\mathcal {C}=\{(x,y) \mid s_\textrm{clean}\ge \tau _{\textrm{label}},\ (x,y,s)\in \mathcal {X}_s\}\), where the clean class probability \(s_\textrm{clean}\) exceeds the clean label threshold \(\tau _{\textrm{label}}\), and an unlabeled dataset \(\mathcal {U} = \{(x,s) \mid (x,y,s)\in \mathcal {X}_s\}\), which is used for consistency regularization (Rasmus et al., 2015; Sajjadi et al., 2016) based learning.

Based on these datasets, the semi-supervised loss function consists of two cross-entropy loss terms: supervised loss \(\mathcal {L_C}\) and unsupervised loss \(\mathcal {L_U}\). First of all, \(\mathcal {L_C}\) is the standard cross-entropy loss \(\mathcal {H}(\cdot )\) on dataset \(\mathcal {C}\) as follows:

$$\begin{aligned} \mathcal {L}_\mathcal {C} = \frac{1}{\mathcal {\mid C \mid }}\sum _{(x,y) \in \mathcal {C}}\mathcal {H}(y,p_{\textrm{m}}(x;\theta )). \end{aligned}$$
(4)

For the unsupervised loss, we exploit the consistency regularization loss used by Sohn et al. (2020), one of the most prevalent modern SSL frameworks. However, our methodology differs in that it maximizes the benefit for LNL by flexibly adjusting the threshold that determines a stable sample, using the split confidence, which indicates the distance from the decision boundary dividing the clean and noisy samples obtained through SplitNet.

As shown in Fig. 4, with a fixed threshold, a lower threshold achieves better performance in situations with a very high level of label noise, and vice versa. This tendency is why achieving superior performance on all noise ratio benchmarks with only one hyper-parameter setting is difficult. Motivated by these findings, we propose a dynamic threshold that is adjusted according to the split confidence of each sample.

A high split confidence indicates that the model is already well aware of the class to which the sample belongs. Consequently, our approach lowers the threshold for data with high split confidence, allowing more pseudo labels to be utilized more quickly. This, in turn, improves the model’s performance through the Flywheel Effect, enabling it to identify correct pseudo labels more efficiently in subsequent epochs. The results (see Sect. 5.5) demonstrate that the correctness of pseudo labeling is higher with a dynamic threshold, which aligns with our motivation and contention.

Specifically, we first generate an artificial pseudo-label \(q=\mathcal {E}(\mathrm {arg\,max}(p_{\textrm{m}}(\alpha (x);\theta )))\), where \(\alpha (\cdot )\) is a weak augmentation function that carries out simple transformations (for example, flip and shift) on an image, and \(\mathcal {E}\) is a function that one-hot-encodes an index value. Then we enforce consistency between the model outputs for strongly-augmented and weakly-augmented data.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\mathcal {U}}=\frac{1}{\mid \mathcal {U}\mid } \sum _{(x,s)\in \mathcal {U}}&\mathbb {1}(\textrm{max}(p_{\textrm{m}}(\alpha (x);\theta ))\\&\ge \tau _{\textrm{dyn}})\mathcal {H}(q,p_{\textrm{m}}(\mathcal {A}(x);\theta )), \end{aligned} \end{aligned}$$
(5)

where \(\mathcal {A}(\cdot )\) is the strong augmentation function, which carries out more complex transformations (e.g., RandAug Cubuk et al. (2020)) on an image, and \(\tau _{\textrm{dyn}}\) is the dynamically-changing threshold determined per sample based on the sample’s split confidence, \(\textrm{max}(s)\) (i.e., \(\textrm{max}(s_\text {clean}, s_\text {noise})\)). Formally, \(\tau _{\text {dyn}}\) is defined as follows:

$$\begin{aligned} \begin{aligned} \tau _{\textrm{dyn}} =(1-\textrm{max}(s_\text {clean}, s_\text {noise}))\beta _{1} \\ +\textrm{max}(s_\text {clean}, s_\text {noise})\beta _{2}, \end{aligned} \end{aligned}$$
(6)

where \(\beta _1\) and \(\beta _2\) refer to the upper bound and lower bound of \(\tau _{\textrm{dyn}}\) respectively.

Note that \(\tau _\text {dyn}\) assumes different values for each sample, reflecting the dynamic nature of the threshold based on individual sample’s split confidence. However, for the sake of notational simplicity, Eq. (6) omits the i index.

In this way, even without adjustments in hyper-parameters, robust performance is achieved in various noise ratios of the training dataset.
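A minimal sketch of the per-sample dynamic threshold of Eq. (6) and the masked consistency loss of Eq. (5) follows; weak_aug and strong_aug stand in for the augmentation functions \(\alpha (\cdot )\) and \(\mathcal {A}(\cdot )\), and hard pseudo-label indices are used in place of the one-hot q, which is equivalent under cross-entropy.

```python
# Sketch of the per-sample dynamic threshold (Eq. 6) and the masked consistency loss
# (Eq. 5); weak_aug/strong_aug are assumed augmentation helpers, and hard pseudo-label
# indices are used in place of the one-hot q, which is equivalent under cross-entropy.
import torch
import torch.nn.functional as F

def unsupervised_loss(p_m, x_unlabeled, s, beta1, beta2, weak_aug, strong_aug):
    """p_m: main model; s: (B, 2) SplitNet outputs; beta1/beta2: threshold bounds."""
    split_conf = s.max(dim=1).values                              # max(s_clean, s_noisy)
    tau_dyn = (1.0 - split_conf) * beta1 + split_conf * beta2     # per-sample threshold
    with torch.no_grad():
        probs_weak = torch.softmax(p_m(weak_aug(x_unlabeled)), dim=1)
        conf, q = probs_weak.max(dim=1)                           # pseudo-label and confidence
        mask = (conf >= tau_dyn).float()                          # indicator in Eq. (5)
    logits_strong = p_m(strong_aug(x_unlabeled))
    return (F.cross_entropy(logits_strong, q, reduction="none") * mask).mean()
```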

The semi-supervised loss used to train the model can be written as:

$$\begin{aligned} \mathcal {L}={\eta }\mathcal {L}_\mathcal {C}+(1-\eta )\mathcal {L}_\mathcal {U}, \end{aligned}$$
(7)

where \(\eta = \mid \mathcal {\,C\,}\mid / \mid \mathcal {\,X\,}\mid \) is a weight that is adjusted automatically, becoming smaller as the estimated noise ratio of the dataset grows larger. As a result, the noisier the dataset, the more the unsupervised loss contributes to the total loss.

As shown in Alg. 1, we outline our main training algorithm in Paszke et al. (2019) style. In the algorithm, \(\theta _s\) denotes the parameters of SplitNet.

Algorithm 1
figure a

Network Training with SplitNet

Fig. 5
figure 5

The warm-up process. The SSL learner warms up the model, using clean data selected by K-Fold cross-filtering as labeled data

3.4.1 Open-Set Noisy Labels

To construct large-scale datasets, web crawling is often utilized (Kaur et al., 2017), introducing not only noisy labels but also classes outside the predefined label space. While methods like Out-of-Distribution Detection and Open-Set Recognition focus on unseen classes in the test set, they overlook data outside the label space in the training set (Yang et al., 2021a). This is known as the open-set noisy labels problem (Wang et al., 2018) and is considered nontrivial. Open-set noisy labels specifically address the challenge of training set data that originates from classes beyond the label space and is inaccurately tagged with noisy labels.

Datasets like FOOD-101N (Lee et al., 2018) also encounter the open-set noisy labels problem, containing images of animals or people within the training set, not just food. To address this, we devised a simple yet effective trick for masking noisy open class data. Given that open-set noisy data typically exhibits low \(s_{\text {clean}}\) (Wei et al., 2021b; Xie et al., 2021), we set a dynamic threshold, \(\tau _{\text {dyn}}\), as

$$\begin{aligned} \tau _{\textrm{dyn}} =(1-s_{\text {clean}})\beta _{1} +s_{\text {clean}}\beta _{2}, \end{aligned}$$
(8)

where the upper bound threshold \(\beta _1\) is set to values of 1 or higher to effectively mask open-set noisy data with low \(s_{\text {clean}}\), thereby preventing their inclusion in the learning process.

In this paper, we standardized the use of \(\beta _1\) = 1, \(\beta _2\) = 0.7, and \(\tau _{\text {dyn}}\) as defined in Eq. (8) for the open-set noisy labels setting (i.e., WebVision Li et al. (2017) and Food-101N Lee et al. (2018)). Setting \(\beta _1\) to 1 (or a higher value) ensures that for data with low \(s_{\text {clean}}\), the threshold approaches 1 (or higher), effectively masking these data points and preventing their inclusion as pseudo labels. This helps avoid the negative impact that open-set data could have on the model.
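As a small illustration of this masking behavior, a sketch of the open-set variant of the threshold is given below, under the same assumptions as the earlier dynamic-threshold sketch; with the default \(\beta _1=1\), any sample with \(s_{\text {clean}}\) near 0 receives a threshold near 1 and is therefore never pseudo-labeled.

```python
# Open-set variant of the dynamic threshold (Eq. 8), under the same assumptions as
# above: with beta1 = 1, a sample with s_clean near 0 receives a threshold near 1
# and is therefore effectively excluded from pseudo-labeling.
def open_set_threshold(s_clean, beta1=1.0, beta2=0.7):
    return (1.0 - s_clean) * beta1 + s_clean * beta2
```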

3.5 Warm-Up Stage

In DNNs, correctly labeled data tend to converge more quickly than incorrectly labeled data (Arpit et al., 2017), which allows samples with lower loss and higher loss to be categorized as clean data and noisy data, respectively. In previous state-of-the-art methods (Li et al., 2020; Nishi et al., 2021), for the initial convergence of the algorithm, the model is trained for a few epochs on the training dataset using the standard cross-entropy loss. However, this training method does not function effectively in asymmetric noise settings and thus requires the addition of negative entropy loss terms and so forth (Li et al., 2020; Nishi et al., 2021; Chen et al., 2021a). Performance is also unstable in settings with a high noise ratio. To address this issue, we propose a novel warm-up method that does not require hyper-parameter changes or negative entropy loss because it works well even under a high noise ratio or asymmetric noise. Fig. 5 shows the diagram of our warm-up.

Our warm-up consists of K-fold cross-filtering and SSL training. K-fold cross-filtering first divides the data into K folds and then checks whether the labels of the held-out data match the out-of-fold predictions. Through K-fold cross-filtering, we can find noisy data, discard their labels, and warm up the main network using the SSL method presented in Sohn et al. (2020).

In DNNs, avoiding training on noisy samples is important for obtaining a distinguishable loss distribution, since the loss values of incorrectly labeled samples otherwise quickly decrease toward those of correctly labeled samples. The similarity in loss values between correctly and incorrectly labeled samples makes them difficult to distinguish when training with incorrect labels. Therefore, we select safer samples through K-fold cross-filtering to maintain high loss values for noisy samples. The higher the noise ratio, the greater the effect, because more noisy samples are removed.

Fig. 6
figure 6

Effect of the proposed warm-up. a, c, e, and g show the results of warm-up using only cross-entropy. b, d, f, and h show the results of our warm-up. With our warm-up, clean and noisy data can be better distinguished

Table 1 List of hyper-parameters

Formally, when the training dataset \(\mathcal {X}\) is divided into \(\mathcal {K}\) folds of equal size, let the k-th fold be \(\mathcal {X}^k\). \(\theta ^k_f\) refers to the filtering network parameters that are trained with the cross-entropy loss on \(\mathcal {O}^k=\mathcal {X}\setminus \mathcal {X}^k\), the set difference between \(\mathcal {X}\) and \(\mathcal {X}^k\). It follows that \(\theta ^k_f\) is trained with the following loss function:

$$\begin{aligned} \ell (\theta ^k_f) = -\frac{1}{\mid \mathcal {O}^k\mid }\sum _{(x,y)\in \mathcal {O}^k}y\, \textrm{log}(p_f(x;\theta ^k_f)), \end{aligned}$$
(9)

where \(p_f(x;\theta _f)\) is the filtering network’s predicted class distribution with parameters \(\theta _f\). In this case, we define \(\mathcal {T}^k\) as the set of data presumed to be clean within \(\mathcal {X}^k\) as:

$$\begin{aligned} \begin{aligned} \mathcal {T}^{\,k}=\{(x,y)\mid \ {}&\mathrm {arg\,max}(y)=\mathrm {arg\,max}(p_ f (x;\theta ^k_f)),\\&\textrm{max}(p_ f (x;\theta ^k_f))\ge \tau _{\textrm{label}},\ (x,y)\in \mathcal {X}^k\}. \end{aligned} \end{aligned}$$
(10)

Note that \(\tau _{\textrm{label}}\) here is the same value used in Sect. 3.4. Then, the clean dataset \(\mathcal {T}\) ultimately yielded by K-fold cross-filtering can be expressed as \(\mathcal {T}=\mathcal {T}^{\,1} \cup \mathcal {T}^{\,2}\cup \cdots \cup \mathcal {T}^{\,\mathcal {K}}\).

Since we configure a clean dataset through the filtering process, our ability to warm up the main model is enhanced compared to that of plain cross-entropy. Following the method presented in Zhang et al. (2017), we train the main model using the supervised loss on the clean dataset \(\mathcal {T}\). Additionally, by utilizing consistency regularization as presented in Sohn et al. (2020), we train the model on all data included in the training dataset \(\mathcal {X}\). As demonstrated in Fig. 6, the loss distribution is more evidently differentiated when our warm-up method is used.
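A sketch of the K-fold cross-filtering procedure (Eqs. 9–10) is shown below; train_network and predict_probs are placeholders for ordinary supervised training and inference routines, and storing labels as class indices is an assumption made for illustration.

```python
# Sketch of K-fold cross-filtering (Eqs. 9-10). train_network and predict_probs are
# placeholders for standard supervised training and inference; labels are assumed to
# be stored as class indices alongside each image.
import numpy as np

def k_fold_cross_filter(images, labels, K, tau_label, train_network, predict_probs):
    """Return the indices of samples presumed clean, i.e., the set T."""
    N = len(labels)
    folds = np.array_split(np.random.permutation(N), K)
    clean_indices = []
    for k in range(K):
        held_out = folds[k]
        train_idx = np.setdiff1d(np.arange(N), held_out)      # O^k = X \ X^k
        theta_k = train_network(images, labels, train_idx)    # cross-entropy training (Eq. 9)
        probs = predict_probs(theta_k, images, held_out)      # out-of-fold predictions
        agree = probs.argmax(axis=1) == labels[held_out]      # arg max(y) = arg max(p_f)
        confident = probs.max(axis=1) >= tau_label            # max(p_f) >= tau_label
        clean_indices.extend(held_out[agree & confident])     # T^k (Eq. 10)
    return np.array(clean_indices)                            # T = T^1 ∪ ... ∪ T^K
```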

Table 2 Performance comparison for our method and the state-of-the-art methods on CIFAR-10 and CIFAR-100
Table 3 Performance comparison for our method and the state-of-the-art methods on CIFAR-10IDN and CIFAR-100IDN
Table 4 Performance comparison for our method and the state-of-the-art methods on CIFAR-10N and CIFAR-100N

4 Experiments

In order to evaluate the effectiveness of our method, we conduct experiments on synthetic datasets designed to have a variety of noise ratios and a real-world dataset, all of which follow standard LNL evaluation protocols (Li et al., 2020; Nishi et al., 2021; Xia et al., 2019; Wei et al., 2021b). All datasets used in our experiments are currently available for download from the internet, and we have cited the links accordingly.

4.1 Experiment Settings

4.1.1 CIFAR-10 and CIFAR-100

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009; cif, 2017) each contain 50,000 32x32 color training images. While the CIFAR-10 dataset comprises 10 classes with 6,000 images each, the CIFAR-100 dataset consists of 100 classes with 600 images each. We test two different noise settings, symmetric noise and asymmetric noise (Tanaka et al., 2018; Li et al., 2019). In the symmetric case, noisy labels are generated by randomly flipping the labels of a set proportion of training samples to other classes’ labels. In the asymmetric case, noisy labels are generated by exchanging labels between two specific classes with similar characteristics, such as deer to horse.

An 18-layer PreAct ResNet (He et al., 2016b) is used as the main network and the filtering network, both trained using SGD with a momentum of 0.9. We design SplitNet to consist of three blocks of three layers each (FC layer, batch normalization (Ioffe & Szegedy, 2015), and ReLU (Agarap, 2018)) and one projection layer at the end. For SplitNet training, we use AdamW (Loshchilov & Hutter, 2018) with a weight decay of 0.0005. In order to ensure a fair and objective comparison, the experiments follow the hyper-parameters of the state-of-the-art technique (Nishi et al., 2021). For all CIFAR experiments, we use the same hyper-parameters \(\beta _{1}=0.5\), \(\beta _{2}=0.95\), a batch size of 128, and a weight decay of 0.0005 for the main network and filtering network.

A complete list of the utilized hyper-parameters can be found in Table 1. Unlike recent studies (Li et al., 2020; Nishi et al., 2021), our method does not need to set a \({\lambda _u}\) that adjusts the weight of the unlabeled loss when training the SSL learner. This is because the weight adjusts itself according to the noise ratio through \(\eta = \mid \mathcal {C}\mid / \mid \mathcal {X}\mid \).

4.1.2 CIFAR-IDN

CIFAR-10IDN and CIFAR-100IDN (Xia et al., 2019; Chen et al., 2021b; cif, 2021a) are datasets with part-dependent label noise synthetically injected into CIFAR-10 and CIFAR-100, respectively. The noise is derived from the fact that humans perceive instances by breaking them down into parts, and the IDN transition matrix of an instance is estimated as a combination of the transition matrices of its parts. Experiment settings, including hyper-parameters, are identical to those used for CIFAR-10 and CIFAR-100.

4.1.3 CIFAR-N

Wei et al. (2021b) present the CIFAR-N dataset consisting of CIFAR-10N and CIFAR-100N (cif, 2021b). CIFAR-N equips the training datasets of CIFAR-10 and CIFAR-100 with human-annotated real-world noisy labels collected from Amazon Mechanical Turk. Unlike existing real-world noisy datasets, CIFAR-N is a controllable, easy-to-use, and moderate-sized real-world noisy dataset with both ground-truth and noisy labels. Experiment settings, including hyper-parameters, are identical to those used for CIFAR-10 and CIFAR-100.

4.1.4 Food-101N

Food-101N (Lee et al., 2018; foo, 2018) is a large-scale dataset with real-world noisy labels consisting of 310k images from online websites allocated to 101 classes. Image classification is evaluated on the Food-101 (Kaur et al., 2017) test set. For a fair comparison, we follow the previous work (Lee et al., 2018) and use ResNet-50 with ImageNet (Deng et al., 2009) pre-trained weights. We observed that the training data includes images that should not be learned, i.e., images not belonging to any of the given classes in Food-101N. For this reason, we set \(\beta _1\) and \(\beta _2\) to 1.0 and 0.7, respectively, keeping \(\beta _1\) at a high value to mask the excluded data.

4.1.5 WebVision

The WebVision dataset (Li et al., 2017; web, 2017) consists of 2.4 million images spanning 1,000 classes, sourced from the internet, similar to ImageNet ILSVRC12 (Krizhevsky et al., 2012). In line with prior studies (Li et al., 2020), the performance of baseline methods is compared using only the first 50 classes from the Google image subset. The effectiveness of these methods is evaluated based on their top-1 and top-5 accuracy rates, using both the WebVision validation set and the ImageNet ILSVRC12 dataset for benchmarking.

4.2 Experiment Results

4.2.1 Results on CIFAR Benchmarks

As demonstrated in Table 2, we compare against state-of-the-art methods under various ratios of symmetric noise and 40% asymmetric noise. The asymmetric noise is set at 40% because a rate higher than 50% would make specific classes theoretically indistinguishable (Li et al., 2020). We report substantial improvements in performance across all evaluated benchmarks, with the gains becoming even more evident under the more challenging strong noise ratios. Note that compared to Li et al. (2020) and Nishi et al. (2021), where the well-performing hyper-parameters differ depending on the strength of the noise ratio, and specifically compared to AugDesc, which has separate well-performing models depending on the strength of the noise ratio (i.e., DM-AugDesc-WS-SAW and DM-AugDesc-WS-WAW), our method enhances performance using a single model. Table 3 shows that our method outperforms the previous methods in 5 out of 6 settings on CIFAR-IDN. Table 4 shows that our method achieves state-of-the-art performance on all criteria on CIFAR-N. It should be noted that before our method, the previous state of the art differed for each criterion on the CIFAR-N benchmark. Note that we use identical hyper-parameters in all CIFAR experiments.

4.2.2 Results on Large-Scale Datasets

We evaluated our method on datasets containing real-world label noise. For the WebVision experiments, we followed previous works (Chen et al., 2019; Li et al., 2020) and used Inception-ResNet v2 (Szegedy et al., 2017) as the backbone model. Because Food-101N and WebVision are large-scale crawled datasets, they contain open-set noisy labels. Consequently, we employed the setting described in Sect. 3.4.1. Tables 5 and 6 present the results on Food-101N and WebVision, respectively. Our method outperforms all existing methods, with particularly notable results on WebVision, showing a 4.02%p improvement over the previous state of the art. These results on real-world noise datasets underscore our method’s efficacy in complex scenarios.

Table 5 Comparison against previous state-of-the-arts in test accuracy(%) on Food-101N
Table 6 Comparison against previous state-of-the-arts in test accuracy(%) on WebVision
Fig. 7
figure 7

Comparison of F1 score and accuracy. a, b, c, and d are the F1 score when the noise ratios are 20%, 50%, 80%, and 90%, respectively. e, f, g, and h are the accuracy when the noise ratios are 20%, 50%, 80%, and 90%, respectively. For all noise ratios, the F1 score and accuracy of SplitNet are higher, which means that SplitNet selects more actually clean data

Fig. 8
figure 8

Confusion matrix of SplitNet and GMM. The horizontal axis represents the prediction, and the vertical axis represents the ground truth in each confusion matrix. The far-left column shows the results of SplitNet trained through risk hedging after warm-up, the middle column shows the results of SplitNet trained with data filtered with a fixed threshold, and the far-right column shows the results of GMM. The top row shows results at epoch 0, and the bottom row shows results at epoch 150

5 Analysis

5.1 Ablation Study

In order to obtain a better understanding of why our method achieves state-of-the-art results, we study the effect of removing certain components. Table 7 shows the results obtained when each component is removed. When SplitNet is removed, the method reduces to plain consistency regularization (Sohn et al., 2020). Also, the absence of the warm-up means that the warm-up is conducted with the existing cross-entropy based method. It can be confirmed that SplitNet boosts performance across all noise settings and that it is even more effective when used together with the proposed warm-up. When using the open-set noisy labels setting, the performance difference is negligible when the noise level is low, but performance deteriorates as the noise level increases. This implies that there is no need to mask data when there are no open-set labels.

We also experiment with adding several of our components to DivideMix to assess performance changes. Merely substituting DivideMix’s SSL with FixMatch results in a performance decline, whereas incorporating our warm-up leads to performance improvements. Our warm-up proves particularly beneficial for enhancing performance in settings with a high noise ratio. These results underscore our warm-up’s versatility across different architectural frameworks, showing marked effectiveness, especially in synergy with SplitNet.

Table 7 Ablation study results in terms of test accuracy (%). The dagger symbol (\(\dag \)) denotes the open-set noisy labels setting

5.2 Distinguishing Ability of SplitNet

In this section, we evaluate the F1 score and accuracy of the SplitNet against the conventional method, i.e., GMM. For a more detailed comparison, we also provide a confusion matrix of SplitNet and GMM in Sect. 5.3.

5.2.1 Accuracy and F1 Score

When selecting clean data, it is not enough for the selected data to actually be clean; a large share of the actually clean data in the entire dataset must also be selected for the selection to be considered good. Therefore, we measured accuracy, a metric that considers the size of the entire dataset. Also, since there is a large difference between the number of clean and noisy samples, we verified our method with the F1 score, a metric that takes this imbalance into account.

The F1 score and accuracy are defined as follows:

$$\begin{aligned} \mathrm {F1\;Score}&= 2\cdot \frac{\textrm{precision}\cdot \textrm{recall}}{\textrm{precision}+\textrm{recall}}, \\ \mathrm {Accuracy}&= \frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{FN}+\textrm{FP}+\textrm{TN}}, \end{aligned}$$

where TP is the number of samples predicted to be clean that were actually clean, TN is the number predicted to be noisy that were actually noisy, FN is the number predicted to be noisy but actually clean, and FP is the number predicted to be clean but actually noisy.
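For reference, a small sketch computing both metrics over a clean/noisy split, treating clean as the positive class, is given below; the boolean-array interface is an illustrative assumption.

```python
# Sketch computing the two metrics above, treating "clean" as the positive class;
# is_clean_true / is_clean_pred are assumed boolean arrays over the whole dataset.
import numpy as np

def selection_metrics(is_clean_true, is_clean_pred):
    tp = np.sum(is_clean_true & is_clean_pred)
    tn = np.sum(~is_clean_true & ~is_clean_pred)
    fp = np.sum(~is_clean_true & is_clean_pred)
    fn = np.sum(is_clean_true & ~is_clean_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return f1, accuracy
```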

5.2.2 Accuracy and F1 Score Results

Fig. 7 shows the accuracy and F1 score of the clean/noisy split produced by SplitNet and by the GMM at each epoch. The results show that, despite its simple structure, SplitNet selects clean data with higher accuracy and F1 score than the GMM regardless of the noise ratio.

5.3 Confusion Matrix Comparison

Figure 8 shows the performance of SplitNet through a confusion matrix. The experiment was conducted on CIFAR-100 with a noise ratio of 90%. With SplitNet, the number of False Positives, which are data predicted to be Clean but actually Noisy, decreases dramatically.

5.4 Accuracy According to Structure

Figure 9 compares accuracy according to the structure of SplitNet. SplitNet fails to converge when batch normalization is not used (see Fig. 3c) and does not perform well when the prediction difference is not taken into account (see Fig. 3b). As shown in Fig. 3a, among 2, 3, and 4 layers, the best performance is achieved with 3 layers. Future work may explore further techniques (e.g., residual connections He et al. (2016a), DenseNet Huang et al. (2017)) to improve accuracy.

Fig. 9
figure 9

Accuracy according to the structure of SplitNet. w/o delta shows the accuracy when prediction difference is not considered, and 2,3,4 layers show the accuracy when SplitNet is composed of 2,3,4 layers, respectively

5.5 Effect of Dynamic Thresholding by Split Confidence

When the main network is trained with SSL, pseudo labels are generated for data whose confidence value exceeds the threshold. As shown in Fig. 10, the correctness of pseudo labeling is higher with a dynamic threshold as described in Eq. (6), compared to when the threshold value is fixed at 0.5 or 0.95.

Fig. 10
figure 10

Correctness of pseudo labels by threshold on CIFAR-100 with 90% noise ratio. a and b show the number of correct pseudo labels and wrong pseudo labels, respectively. A dynamic threshold generates more correct pseudo labels and fewer wrong pseudo labels than a fixed threshold

Figure 11 shows the change in average dynamic threshold by epoch. It shows a sharp drop around epoch 0, followed by a gradual decline. This indicates that as training progresses, the split confidence generally increases, leading to a lower average threshold.

Fig. 11
figure 11

Average Dynamic Threshold on CIFAR-100 with a 90% Noise Ratio

Fig. 12
figure 12

K -fold cross filtering performance evaluation on CIFAR-100 with various noise ratio

Fig. 13
figure 13

Examples of SplitNet failures. Notated in the form of “{ground-truth}/{noisy label}”

Table 8 Training time comparison

5.6 K-Fold Cross-Filtering Performance According to K

K, the number of partitions of the dataset in K-fold cross-filtering, can be set as a hyper-parameter. Figure 12 shows the test accuracy according to K on CIFAR-100 with 80% and 90% noise ratios. Accuracy saturates after the value of K reaches 8.

5.7 Failure Case Study

We examined cases where SplitNet failed to accurately select clean data, seeking specific trends using the CIFAR-10 dataset. Similar to methods employing a GMM (Li et al., 2020), SplitNet failed primarily on asymmetric noise samples, yet as depicted in Fig. 8, such occurrences were fewer. Figure 13 shows instances of noise detection failures by SplitNet.

5.8 Training Time Analysis

As shown in the training time analysis in Table 8, our method is more efficient than conventional methods. We compare the training time on CIFAR-10 with a 20% noise ratio. For a fair comparison, the training times are obtained using a single NVIDIA GeForce RTX 3090 GPU and an AMD EPYC 7282 CPU. Note that Nishi et al. (2021), the latest methodology, has a higher computational cost and therefore requires more training time than Li et al. (2020).

6 Conclusion and Discussion

There has been rapid growth in LNL in recent years. Although progress has accelerated, the setting is becoming more complex due to issues such as having to set different hyper-parameters depending on the noise ratio. The relevance of our method compared to previous ones is that it achieves state-of-the-art performance on most benchmarks with only a single model. Our method enhances the existing warm-up through K-fold cross-filtering and SSL training. Additionally, it improves SSL so that it can be better applied to LNL through risk hedging and dynamic thresholding. Moreover, we conduct extensive ablation studies to identify why our method is successful and to validate the effect of each component. A natural next step, which we leave for future work, is to extend our method to other domains such as audio, text, and video.

Broader Impact. Previous state-of-the-art methods used different hyper-parameters depending on the noise ratio or created different settings appropriate for each case with different models. Differing from these previous methods, our method is designed more flexibly so that it can be adapted to various environments. When applying LNL in the real world, it is often not possible to know the noise ratio of the collected data. Thus, being robust to the noise ratio is very important for applying LNL to real environments. The distinguishing point of our method is that it can be effectively applied in a real-world environment and can especially be of great aid to organizations with low budgets that face difficulties in obtaining high-quality, refined datasets.