1 Introduction

Deep Neural Networks (DNNs) generally rely on large-scale training data with accurate human-annotated labels to achieve satisfactory performance (Krizhevsky et al., 2012). However, because labeling data is costly and complex, labels are often contaminated by noise. Many works have therefore strived to develop alternative methods that are robust to label noise, a setting commonly called Learning with Noisy Labels (LNL) (Natarajan et al., 2013).

Recent studies for LNL have generally attempted to distinguish clean samples from the noisy dataset using handcrafted methods, e.g., Gaussian Mixture Models (GMMs), and then use these clean samples as labeled samples in a Semi-Supervised Learning (SSL) phase (Li et al., 2020; Nishi et al., 2021). However, the loss distribution often does not follow a Gaussian distribution (Arazo et al., 2019), and samples whose loss values are neither large nor small enough cannot be properly distinguished. Furthermore, the dominant approaches maintain multiple models to avoid the risk attributable to the ability of DNNs to fit arbitrary labels, but this often leads to complicated training procedures (Iscen et al., 2022). Moreover, in the aforementioned LNL methodologies that leverage SSL techniques, the weight of the unlabeled loss, one of the most substantial hyper-parameters, must be adjusted carefully depending on the noise ratio to prevent the model from overfitting. However, the noise ratio is difficult to estimate in a real-world environment, making this an unrealistic approach.

Fig. 1
figure 1

The concept of our alternating update framework with SplitNet. When the main network outputs the prediction history and label, SplitNet uses them to generate a clean training dataset and delivers it to the main network, where it is used as the labeled data for SSL. These two phases alternate to boost convergence and performance

To overcome these limitations, we present a novel framework incorporating a learnable network, called SplitNet, which splits the clean and noisy data in a data-driven manner. Contrary to conventional methods (Li et al., 2020; Nishi et al., 2021) that fit GMMs solely based on per-sample loss distribution to select clean samples, our SplitNet can additionally incorporate the prediction history as input, which allows us to better distinguish ambiguous samples that cannot be precisely distinguished by GMM. In addition, we use a split confidence, a score indicating how confidently SplitNet divides the samples, to determine whether to apply unsupervised loss, enabling more stable learning of SSL method in LNL settings.

More specifically, our overall framework begins with a warm-up and then iteratively learns the main network and SplitNet. As shown in Fig. 1, by formulating the main network and SplitNet in an iterative manner, the two learners are alternately updated, each using the data from the other network. For SplitNet training, the main network provides class prediction and loss distribution, while for the main network training, SplitNet provides split confidences to flexibly adjust the threshold for its SSL procedure.

In particular, taking into account the learning status of the main network and the estimated noise ratio of the dataset, the thresholds are automatically calculated to distinguish confidently clean and noisy samples. This process, which we dub risk hedging, results in a favorable learning environment for SplitNet and mitigates confirmation bias. As the number of confidently clean and noisy samples increases throughout the process, SplitNet enjoys the benefit of a natural curriculum with the aid of a gradually increasing number of hard samples.

The key contributions of this method are as follows:

  • Through a learnable network called SplitNet, our method distinguishes clean samples from noisy datasets more effectively than other methods.

  • We propose an SSL method favorable to LNL that utilizes the split confidence obtained through SplitNet, enabling the learning curriculum to adjust automatically depending on the noise ratio.

  • Our method significantly outperforms state-of-the-art results on numerous benchmarks with different types and levels of label noise.

Fig. 2
figure 2

The overall architecture of our method. After training the main model through a warm-up, we use the proposed risk hedging process to only select confident samples to train SplitNet. With SplitNet, we obtain clean probability and split confidence, and with this information, we train the main model through SSL. Loss distribution generated by the main model is used in risk hedging. The main model and SplitNet can be alternately improved through this iterative process

2 Related Work

2.1 Learning with Noisy Labels

Modern LNL methods can be largely classified into two categories. The first category uses loss correction. These methods are further classified into those that relabel noisy samples to correct losses and those that reweight the loss of each sample. On the one hand, in a study related to relabeling, Reed et al. (2014) proposed a bootstrap method that adjusts the loss using model predictions. Additionally, the D2L method proposed by Ma et al. (2018) provided further improvement by using the dimensionality of the feature space to determine the weights of the output and label. Furthermore, Tanaka et al. (2018) proposed a joint optimization method, which reassigns noisy labels depending on the output of the network and updates the network’s parameters and labels each epoch. On the other hand, regarding reweighting methods, Shen and Sanghavi (2019) conducted training by treating smaller-loss samples as clean.

The second category first discards noisy sample labels and then applies semi-supervised learning. Ding et al. (2018) and Kong et al. (2019) proved that the SSL method is effective for LNL, and Li et al. (2020) avoided confirmation bias (Tarvainen & Valpola, 2017) by having two classification networks filter data for each other. Nishi et al. (2021) examined augmentations that are effective for LNL and provided additional related contributions. The method proposed in this paper also utilizes SSL, but its novelty compared to existing studies resides in the fact that it only requires a single classification network to resolve confirmation bias in a data-driven manner using an auxiliary SplitNet. Moreover, in order to start from a more favorable point, the proposed method utilizes K-fold cross-filtering to distinguish between clean and noisy data and trains networks using the SSL method.

2.2 Semi-supervised Learning

SSL methods aim to utilize not only labeled data but also unlabeled data in order to enhance the performance of a model. SSL methodology is particularly effective when the amount of labeled data is limited and a large amount of unlabeled data is available. SSL has been applied in multiple ways in diverse fields of study and is considered a mature research field (Yang et al., 2021b). In general, SSL methodology can be divided into consistency regularization (Miyato et al., 2018; Laine & Aila, 2016; Tarvainen & Valpola, 2017), which forces differently augmented versions of the input data to yield the same prediction, and entropy minimization (Grandvalet & Bengio, 2004) together with pseudo labeling (Lee et al., 2013), which push the model toward more confident predictions on unlabeled data. In recent times, holistic approaches that combine all of the aforementioned methodologies have shown improved performance (Sohn et al., 2020; Berthelot et al., 2019b, a).

Recently, advancing beyond methods that modify the threshold class-wise according to the difficulty levels for different classes (Zhang et al., 2021; Xu et al., 2021; Wang et al., 2022), the most recent development is DISC (Li et al., 2023), which adjusts the threshold on an instance-wise basis. While DISC utilizes the confidence values from the main classification network, our approach leverages the split confidence obtained from SplitNet to adjust the threshold.

3 Methodology

3.1 Overview

Let us denote \(\mathcal {X}= \{(x_i,y_i)\}^N_{i=1}\) as a training dataset, where \(x_i\) is an image, \(y_i\) is a one-hot label over r classes, and N is the total number of training samples. In the noisy label setting, we assume that \(y_i\) may be corrupted; such labels are called noisy labels. We define noisy data as images with noisy labels. \(p_{\textrm{m}}(x;\theta )\) is the predicted class distribution produced by the main model \(p_{\textrm{m}}(\cdot ;\theta )\) with parameters \(\theta \) for input x. Our goal is to optimize the model parameters \(\theta \) so that \(p_\textrm{m}(x_i;\theta )\) approaches the ground-truth label.

Figure 2 shows our overall architecture. After training the main model through a warm-up, we use the proposed risk hedging process to only select confident samples to train SplitNet. With SplitNet we obtain clean probability and split confidence, and with this information we train the main model through SSL. Loss distribution generated by the main model is used in risk hedging as the whole process is repeated. Through this iterative process, the main model and SplitNet can be alternately improved.

3.2 SplitNet

Concretely, given the dataset, the proposed SplitNet is designed to output a probability prediction \(s\in \mathbb {R}^2\) over the two classes, clean and noisy. The network takes three inputs: the model prediction \(\{p_{\textrm{m}}(x_i;\theta )\}_{i}\), the difference between the model predictions of the current and previous iterations \(\{\Delta \,p_{\textrm{m}}(x_i;\theta )\}_{i}\), and the one-hot label \(\{y_i\}_{i}\) for \(i \in \{1,...,M\}\), where M is the number of samples selected by risk hedging out of the N training samples. Note that the samples selected by the risk hedging process change in each iteration. In the following, we explain training SplitNet with the proposed risk hedging and semi-supervised learning framework. In addition, Sect. 3.3 provides a detailed discussion of the reasons for selecting these three inputs.

Fig. 3
figure 3

Variations of SplitNet. As in (a), the number of layers can be adjusted. (b) is a model that does not consider the prediction difference. (c) is a model without batch normalization

SplitNet is trained to classify the clean and noisy data that have been labeled by the GMM; thus, it requires that the GMM correctly separates clean and noisy data, but there are many cases where the GMM misclassifies data in the overlap between the clean and noisy distributions. A naïve solution would be to use a fixed threshold to select only confident data. However, as the model evolves, the loss distribution changes continually, which a fixed threshold cannot accommodate. This leads to the model ignoring a considerable amount of unlabeled data at the earlier stages of training or using a considerable amount of incorrectly labeled data at the later stages of training (Xu et al., 2021; Zhang et al., 2021; Wang et al., 2022).

3.2.1 Risk Hedging

To solve this problem, we propose risk hedging, a process that enhances the training of SplitNet by dynamically adjusting the thresholds and selecting confident data. In the risk hedging process, the model’s current learning status and the noise ratio of the training dataset are estimated automatically.

In this context, we fit a GMM on the loss distribution of the entire training data to obtain the clean probability w. A large mean of the clean probability distribution implies that the dataset is mostly composed of clean data, and so more of the data can be treated as clean. A large standard deviation of the clean probability distribution implies that the ability to separate clean from noisy data has improved, and so the next value of the threshold is decreased.
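As a concrete illustration, the sketch below shows one way the clean probability w and its statistics could be computed by fitting a two-component GMM to the per-sample losses; the function name and GMM settings are illustrative assumptions rather than our released implementation.

```python
# Illustrative sketch (not the authors' released code): fit a two-component GMM on
# the per-sample losses from the main network and read off the clean probability w.
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(per_sample_loss):
    """per_sample_loss: (N,) array of cross-entropy losses over the training set."""
    losses = per_sample_loss.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=5e-4)
    gmm.fit(losses)
    # The component with the smaller mean loss is treated as the "clean" mode.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    w = gmm.predict_proba(losses)[:, clean_component]  # clean probability per sample
    return w

# Statistics used by risk hedging (Sect. 3.2.2): w_bar = w.mean(), sigma^2 = w.var().
```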

Specifically, \(\tau _{\mu }\) and \(\tau _{\nu }\) should be determined, where \(\tau _{\mu }\) denotes the threshold that distinguishes clean data with clean label \(\mu \), and \(\tau _{\nu }\) denotes the threshold that distinguishes noisy data with noisy label \(\nu \). Note that \(\mu \) and \(\nu \) represent the binary classes of clean and noisy data, respectively, as a one-hot label. In practice, we used [1,0] for \(\mu \) and [0,1] for \(\nu \). SplitNet’s target label \(c \in \mathbb {R}^2\) can be either \(\mu \) (clean) or \(\nu \) (noisy), and is determined by comparing \(\tau _{\mu }\) and \(\tau _{\nu }\) with the clean probability w derived from the GMM. Formally, for dataset \(\mathcal {X}_w=\{x_i,y_i,w_i\}^N_{i=1}\), the training dataset for SplitNet is defined as:

$$\begin{aligned} \begin{aligned}&\{(x,y,\mu )\mid w\ge \tau _{\mu },\ (x,y,w)\in \mathcal {X}_w\}\\&\cup \{(x,y,\nu )\mid w\le \tau _{\nu },\ (x,y,w)\in \mathcal {X}_w\}. \end{aligned} \end{aligned}$$
(1)

In the following section, we explain the detailed derivation process of \(\tau _{\mu }\) and \(\tau _{\nu }\).

3.2.2 Derivation of \(\tau _\mu \) and \(\tau _\nu \)

We define \(\mathrm {\tau _{\mu }}\) and \(\mathrm {\tau _{\nu }}\) as follows:

$$\begin{aligned} \begin{aligned} \tau _{\mathrm {\mu }} := z-z^\textrm{F}\!\bar{w}^\textrm{F}\!\textrm{P}(\sigma ),\\ \tau _{\mathrm {\nu }} := z-\,z\,\,\bar{w}\,\,\textrm{P}(\sigma ), \end{aligned} \end{aligned}$$
(2)

where \(\textrm{P}(\sigma )\) is a function defined as \(1-4\,\sigma ^2\), and the pivot point z is a value between 0 and 1 which serves as a reference point for the clean and noisy thresholds. Each threshold value changes based on z. \(\bar{w} = \frac{1}{\mid \mathcal {X}\mid }\sum _{i=1}^N{w_i}\) and \(\sigma ^2 = \frac{1}{\mid \mathcal {X}\mid }\sum _{i=1}^N(w_i-\bar{w})^2\) are the mean and variance of the clean probability predicted with the GMM for the entire dataset, respectively, where \(w_i\) is the clean probability of the i-th sample predicted with the GMM. \(\textrm{F}\) is an operator that performs the following operation, where j is the imaginary unit:

$$\begin{aligned} z^\textrm{F} := (1-z)j. \end{aligned}$$
(3)

Lemma 1

Let x be a vector of n numbers in the range [0, r], where r is a positive number. Then, the maximum variance of these n numbers is \(r^2/4\).

Proof

Let \(\bar{x} = \frac{1}{n}\sum _{i=1}^nx_i\) and \(\textrm{var}(x)=\frac{1}{n}\sum _{i=1}^{n}(x_i-\bar{x})^2\). Since \(x_i \le r\),

$$\begin{aligned} \sum \limits _{i}x_i^2=\sum _ix_i\cdot x_i\le \sum _i r\cdot x_i=rn\frac{1}{n}\sum _ix_i =rn\bar{x}. \end{aligned}$$

Note that \(0\le \bar{x}\le r\). Then,

$$\begin{aligned} \begin{aligned} n\cdot \textrm{var}(x)&=\sum _i(x_i-\bar{x})^2\\&=\sum _i(x_i^2-2x_i\bar{x}+\bar{x}^2)\\&=\sum _ix^2_i-2\bar{x}\sum _ix_i+n\bar{x}^2\\&=\sum _ix^2_i-2\bar{x}n\frac{1}{n}\sum _ix_i+n\bar{x}^2\\&=\sum _ix^2_i-n\bar{x}^2\\&\le rn\bar{x}-n\bar{x}^2=n\bar{x}(r-\bar{x}). \end{aligned} \end{aligned}$$

And thus

$$\begin{aligned} \textrm{var}(x)\le \bar{x}(r-\bar{x}). \end{aligned}$$

Using AM-GM inequality, we get

$$\begin{aligned} \bar{x}(r-\bar{x}) \le \left( \frac{\bar{x}+(r-\bar{x})}{2}\right) ^2=\frac{r^2}{4}. \end{aligned}$$

This shows that,

$$\begin{aligned} \textrm{var}(x)\le \frac{r^2}{4}. \end{aligned}$$

\(\square \)

Equation (2) is derived as follows. According to Lemma 1, for a distribution of real numbers between 0 and 1, the minimum and maximum values of \(1-4\sigma ^2\) are 0 and 1, respectively. Thus we can set \(\tau _\mu \) and \(\tau _\nu \) to move dynamically between z and 1, and between 0 and z, respectively. In this paper, we set z to 0.5 for all experiments.

To summarize, \(\tau _\mu \) and \(\tau _\nu \) are updated to new values at the end of each epoch of the main network training, using the refreshed statistical values of w.
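The following sketch illustrates how Eqs. (1)–(3) could be evaluated in practice, reading j as the imaginary unit so that \(z^\textrm{F}\bar{w}^\textrm{F} = -(1-z)(1-\bar{w})\) and the thresholds stay real-valued; the helper names are hypothetical, and z defaults to 0.5 as in our experiments.

```python
# Illustrative sketch of Eqs. (1)-(3). We read j as the imaginary unit, so
# z^F * w_bar^F = (1 - z)(1 - w_bar) * j^2 = -(1 - z)(1 - w_bar), which keeps the
# thresholds real-valued; function names are hypothetical.
import numpy as np

def risk_hedging_thresholds(w, z=0.5):
    """w: (N,) clean probabilities from the GMM; z: pivot point in (0, 1)."""
    w_bar, sigma2 = w.mean(), w.var()
    p_sigma = 1.0 - 4.0 * sigma2                        # P(sigma), in [0, 1] by Lemma 1
    tau_mu = z + (1.0 - z) * (1.0 - w_bar) * p_sigma    # clean threshold, moves in [z, 1]
    tau_nu = z - z * w_bar * p_sigma                    # noisy threshold, moves in [0, z]
    return tau_mu, tau_nu

def splitnet_training_indices(w, tau_mu, tau_nu):
    """Eq. (1): keep only confidently clean/noisy samples; the rest are skipped."""
    clean_idx = np.where(w >= tau_mu)[0]   # labeled mu = [1, 0]
    noisy_idx = np.where(w <= tau_nu)[0]   # labeled nu = [0, 1]
    return clean_idx, noisy_idx
```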

3.3 Network Architecture

As shown in Fig. 3, SplitNet could be implemented in various ways. First of all, we evaluate several modifications of the network architecture to understand SplitNet further. Specifically, we measure the performance of SplitNet by:

  1. Changing the number of layers (Fig. 3a).

  2. Removing the prediction difference from the input (Fig. 3b).

  3. Removing the batch normalization (Ioffe & Szegedy, 2015) (Fig. 3c).

We experiment with Fig. 3a, b, and c for the following reasons.

Samples with noisy labels generate wrong supervised signals during the warm-up. As discussed in Sect. 3.5, these samples usually have large losses, so during main training, their labels are discarded and they are learned through the unsupervised loss. Therefore, the change in their logits per epoch is large, and this can be used as a cue to distinguish noisy data. To confirm this effect, we design the structure in Fig. 3b, which does not consider logit differences, and compare its performance with that of Fig. 3a.

In addition, in order to design SplitNet to have sufficient capacity while remaining lightweight, we measure the performance while changing the number of layers, as shown in Fig. 3a. The number of Linear - Batch Normalization - ReLU layers is increased from 2 to 4.

We evaluate the performance of the SplitNet with batch normalization removed, shown in Fig. 3c, to confirm the importance of batch normalization in the structure of the SplitNet. As a result, convergence fails when batch normalization is not used, verifying the importance of batch normalization.

As a result of the experiment, SplitNet shows the best performance when composed of 3 layers with batch normalization, as in Fig. 3a, and we adopt this as our structure. We provide a more detailed performance analysis in Sect. 5.4.
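For concreteness, a minimal PyTorch sketch of the adopted variant (three Linear-BatchNorm-ReLU blocks followed by a projection layer with softmax) is given below; the hidden width is an illustrative assumption, not a value reported in the paper.

```python
# A minimal PyTorch sketch of the adopted SplitNet variant (Fig. 3a): three
# Linear-BatchNorm-ReLU blocks and a projection layer with softmax. The hidden
# width is an illustrative assumption.
import torch
import torch.nn as nn

class SplitNet(nn.Module):
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        in_dim = 3 * num_classes  # prediction, prediction difference, one-hot label
        def block(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out),
                                 nn.BatchNorm1d(d_out),
                                 nn.ReLU(inplace=True))
        self.body = nn.Sequential(block(in_dim, hidden),
                                  block(hidden, hidden),
                                  block(hidden, hidden))
        self.proj = nn.Linear(hidden, 2)

    def forward(self, pred, pred_delta, one_hot_label):
        x = torch.cat([pred, pred_delta, one_hot_label], dim=1)
        s = torch.softmax(self.proj(self.body(x)), dim=1)
        return s  # s[:, 0]: clean probability, s[:, 1]: noisy probability
```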

Fig. 4
figure 4

Pseudo label accuracy by threshold. The higher the noise ratio, the better the performance at a weak threshold. Conversely, the lower the noise ratio, the better the performance at a strong threshold

3.4 Dynamic Thresholding in Semi-supervised Learner

To further train the main model, we define the labeled and unlabeled datasets required to train the semi-supervised learner as follows. Let \(s\in \{s_{\textrm{clean}},s_{\textrm{noisy}}\}\) denote the binary class prediction, where \(s_{\textrm{clean}}\) and \(s_{\textrm{noisy}}\) are the clean and noisy probabilities predicted by SplitNet. The dataset \(\mathcal {X}\) is forwarded to SplitNet to obtain s and form the dataset \(\mathcal {X}_{s}=\{(x_i,y_i,s_i)\}^N_{i=1}\). Note that, as shown in Fig. 3, the last layer of SplitNet includes a softmax function, ensuring that the sum of \(s_{\textrm{clean}}\) and \(s_{\textrm{noisy}}\) is always 1.

Using this dataset, we form a clean labeled dataset \(\mathcal {C}=\{(x,y) \mid s_\textrm{clean}\ge \tau _{\textrm{label}},\ (x,y,s)\in \mathcal {X}_s\}\), where the clean class probability \(s_\textrm{clean}\) exceeds the clean label threshold \(\tau _{\textrm{label}}\), and an unlabeled dataset \(\mathcal {U} = \{(x,s) \mid (x,y,s)\in \mathcal {X}_s\}\), which is used for consistency regularization (Rasmus et al., 2015; Sajjadi et al., 2016) based learning.

Based on these datasets, the semi-supervised loss function consists of two cross-entropy loss terms: supervised loss \(\mathcal {L_C}\) and unsupervised loss \(\mathcal {L_U}\). First of all, \(\mathcal {L_C}\) is the standard cross-entropy loss \(\mathcal {H}(\cdot )\) on dataset \(\mathcal {C}\) as follows:

$$\begin{aligned} \mathcal {L}_\mathcal {C} = \frac{1}{\mathcal {\mid C \mid }}\sum _{(x,y) \in \mathcal {C}}\mathcal {H}(y,p_{\textrm{m}}(x;\theta )). \end{aligned}$$
(4)

For the unsupervised loss, we exploit the consistency regularization loss used by Sohn et al. (2020), one of the most prevalent modern SSL frameworks. However, our methodology differs in that it maximizes the benefit for LNL by flexibly adjusting the threshold that determines a stable sample, using the split confidence, which indicates the distance from the decision boundary dividing the clean and noisy samples obtained through SplitNet.

As shown in Fig. 4, with a fixed threshold, a lower threshold achieves better performance in situations with a very high level of label noise, and vice versa. This tendency is why achieving superior performance on all noise ratio benchmarks with only one hyper-parameter setting is difficult. Motivated by these findings, we propose a dynamic threshold that is adjusted according to the split confidence of each sample.

A high split confidence indicates that the model is already well aware of the class to which the sample belongs. Consequently, our approach lowers the threshold for data with high split confidence, allowing more pseudo labels to be utilized more quickly. This, in turn, improves the model’s performance through the Flywheel Effect, enabling it to identify correct pseudo labels more efficiently in subsequent epochs. The results (see Sect. 5.5) demonstrate that the correctness of pseudo labeling is higher with a dynamic threshold, which aligns with our motivation and contention.

Specifically, we first generate an artificial pseudo-label \(q=\mathcal {E}(\mathrm {arg\,max}(p_{\textrm{m}}(\alpha (x);\theta )))\), where \(\alpha (\cdot )\) is a weak augmentation function that carries out simple transformations (for example, flip and shift) on an image, and \(\mathcal {E}\) is a function that one-hot-encodes an index value. Then we enforce consistency between the model outputs for strongly-augmented and weakly-augmented data.

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\mathcal {U}}=\frac{1}{\mid \mathcal {U}\mid } \sum _{(x,s)\in \mathcal {U}}&\mathbb {1}(\textrm{max}(p_{\textrm{m}}(\alpha (x);\theta ))\\&\ge \tau _{\textrm{dyn}})\mathcal {H}(q,p_{\textrm{m}}(\mathcal {A}(x);\theta )), \end{aligned} \end{aligned}$$
(5)

where \(\mathcal {A}(\cdot )\) is the strong augmentation function, which carries out more complex transformations (e.g., RandAug Cubuk et al. (2020)) on an image, and \(\tau _{\textrm{dyn}}\) is the dynamically-changing threshold determined per sample based on the sample’s split confidence, \(\textrm{max}(s)\) (i.e., \(\textrm{max}(s_\text {clean}, s_\text {noise})\)). Formally, \(\tau _{\text {dyn}}\) is defined as follows:

$$\begin{aligned} \begin{aligned} \tau _{\textrm{dyn}} =(1-\textrm{max}(s_\text {clean}, s_\text {noise}))\beta _{1} \\ +\textrm{max}(s_\text {clean}, s_\text {noise})\beta _{2}, \end{aligned} \end{aligned}$$
(6)

where \(\beta _1\) and \(\beta _2\) refer to the upper bound and lower bound of \(\tau _{\textrm{dyn}}\) respectively.

Note that \(\tau _\text {dyn}\) assumes different values for each sample, reflecting the dynamic nature of the threshold based on individual sample’s split confidence. However, for the sake of notational simplicity, Eq. (6) omits the i index.

In this way, even without adjustments in hyper-parameters, robust performance is achieved in various noise ratios of the training dataset.
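A minimal sketch of the per-sample dynamic threshold of Eq. (6) and the masked consistency loss of Eq. (5) follows; weak_aug and strong_aug stand in for the augmentation functions \(\alpha (\cdot )\) and \(\mathcal {A}(\cdot )\), and hard pseudo-label indices are used in place of the one-hot q, which is equivalent under cross-entropy.

```python
# Sketch of the per-sample dynamic threshold (Eq. 6) and the masked consistency loss
# (Eq. 5); weak_aug/strong_aug are assumed augmentation helpers, and hard pseudo-label
# indices are used in place of the one-hot q, which is equivalent under cross-entropy.
import torch
import torch.nn.functional as F

def unsupervised_loss(p_m, x_unlabeled, s, beta1, beta2, weak_aug, strong_aug):
    """p_m: main model; s: (B, 2) SplitNet outputs; beta1/beta2: threshold bounds."""
    split_conf = s.max(dim=1).values                              # max(s_clean, s_noisy)
    tau_dyn = (1.0 - split_conf) * beta1 + split_conf * beta2     # per-sample threshold
    with torch.no_grad():
        probs_weak = torch.softmax(p_m(weak_aug(x_unlabeled)), dim=1)
        conf, q = probs_weak.max(dim=1)                           # pseudo-label and confidence
        mask = (conf >= tau_dyn).float()                          # indicator in Eq. (5)
    logits_strong = p_m(strong_aug(x_unlabeled))
    return (F.cross_entropy(logits_strong, q, reduction="none") * mask).mean()
```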

The semi-supervised loss used to train the model can be written as:

$$\begin{aligned} \mathcal {L}={\eta }\mathcal {L}_\mathcal {C}+(1-\eta )\mathcal {L}_\mathcal {U}, \end{aligned}$$
(7)

where \(\eta = \mid \mathcal {\,C\,}\mid / \mid \mathcal {\,X\,}\mid \) is a weight that is adjusted automatically, becoming smaller as the estimated noise ratio of the dataset grows larger. As a result, the noisier the dataset, the more the unsupervised loss contributes to the total loss.

As shown in Alg. 1, we outline our main training algorithm in Paszke et al. (2019) style. In the algorithm, \(\theta _s\) denotes the parameters of SplitNet.

Algorithm 1
figure a

Network Training with SplitNet

Fig. 5
figure 5

The warm-up process. The SSL learner warms up the model, using clean data selected by K-Fold cross-filtering as labeled data

3.4.1 Open-Set Noisy Labels

To construct large-scale datasets, web crawling is often utilized (Kaur et al., 2017), introducing not only noisy labels but also classes outside the predefined label space. While methods like Out-of-Distribution Detection and Open-Set Recognition focus on unseen classes in the test set, they overlook data outside the label space in the training set (Yang et al., 2021a). This is known as the open-set noisy labels problem (Wang et al., 2018) and is considered nontrivial. Open-set noisy labels specifically address the challenge of training set data that originates from classes beyond the label space and is inaccurately tagged with noisy labels.

Datasets like FOOD-101N (Lee et al., 2018) also encounter the open-set noisy labels problem, containing images of animals or people within the training set, not just food. To address this, we devised a simple yet effective trick for masking noisy open class data. Given that open-set noisy data typically exhibits low \(s_{\text {clean}}\) (Wei et al., 2021b; Xie et al., 2021), we set a dynamic threshold, \(\tau _{\text {dyn}}\), as

$$\begin{aligned} \tau _{\textrm{dyn}} =(1-s_{\text {clean}})\beta _{1} +s_{\text {clean}}\beta _{2}, \end{aligned}$$
(8)

where the upper bound threshold \(\beta _1\) is set to values of 1 or higher to effectively mask open-set noisy data with low \(s_{\text {clean}}\), thereby preventing their inclusion in the learning process.

In this paper, we standardized the use of \(\beta _1\) = 1, \(\beta _2\) = 0.7, and \(\tau _{\text {dyn}}\) as defined in Eq. (8) for the open-set noisy labels setting (i.e., WebVision Li et al. (2017) and Food-101N Lee et al. (2018)). Setting \(\beta _1\) to 1 (or a higher value) ensures that for data with low \(s_{\text {clean}}\), the threshold approaches 1 (or higher), effectively masking these data points and preventing their inclusion as pseudo labels. This helps avoid the negative impact that open-set data could have on the model.
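As a small illustration of this masking behavior, a sketch of the open-set variant of the threshold is given below, under the same assumptions as the earlier dynamic-threshold sketch; with the default \(\beta _1=1\), any sample with \(s_{\text {clean}}\) near 0 receives a threshold near 1 and is therefore never pseudo-labeled.

```python
# Open-set variant of the dynamic threshold (Eq. 8), under the same assumptions as
# above: with beta1 = 1, a sample with s_clean near 0 receives a threshold near 1
# and is therefore effectively excluded from pseudo-labeling.
def open_set_threshold(s_clean, beta1=1.0, beta2=0.7):
    return (1.0 - s_clean) * beta1 + s_clean * beta2
```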

3.5 Warm-Up Stage

In DNNs, correctly labeled data tend to converge more quickly than incorrectly labeled data (Arpit et al., 2017), which allows samples with lower loss and higher loss to be categorized as clean data and noisy data, respectively. In previous state-of-the-art methods (Li et al., 2020; Nishi et al., 2021), for the initial convergence of the algorithm, the model is trained for a few epochs on the training dataset using the standard cross-entropy loss. However, this training method does not function effectively in asymmetric noise settings and thus requires the addition of negative entropy loss terms and so forth (Li et al., 2020; Nishi et al., 2021; Chen et al., 2021a). Performance is also unstable in settings with a high noise ratio. To address this issue, we propose a novel warm-up method that does not require hyper-parameter changes or negative entropy loss because it works well even under a high noise ratio or asymmetric noise. Fig. 5 shows the diagram of our warm-up.

Our warm-up consists of K-fold cross-filtering and SSL training. K-fold cross-filtering first divides the data into K folds and then checks whether the labels of the held-out data match the out-of-fold predictions. Through K-fold cross-filtering, we can find noisy data, discard their labels, and warm up the main network using the SSL method presented in Sohn et al. (2020).

In DNNs, avoiding training on noisy samples is important for obtaining a distinguishable loss distribution, since the loss values of incorrectly labeled samples otherwise quickly decrease toward those of correctly labeled samples. The similarity in loss values between correctly and incorrectly labeled samples makes them difficult to distinguish when training with incorrect labels. Therefore, we select safer samples through K-fold cross-filtering to maintain high loss values for noisy samples. The higher the noise ratio, the greater the effect, because more noisy samples are removed.

Fig. 6
figure 6

Effect of the proposed warm-up. a, c, e, and g show the results of warm-up using only cross-entropy. b, d, f, and h show the results of our warm-up. With our warm-up, clean and noisy data can be better distinguished

Table 1 List of hyper-parameters

Formally, when the training dataset \(\mathcal {X}\) is divided into \(\mathcal {K}\) folds of equal size, let the k-th fold be \(\mathcal {X}^k\). \(\theta ^k_f\) refers to the filtering network parameters that are trained with the cross-entropy loss on \(\mathcal {O}^k=\mathcal {X}\setminus \mathcal {X}^k\), the set difference between \(\mathcal {X}\) and \(\mathcal {X}^k\). It follows that \(\theta ^k_f\) is trained with the following loss function:

$$\begin{aligned} \ell (\theta ^k_f) = -\frac{1}{\mid \mathcal {O}^k\mid }\sum _{(x,y)\in \mathcal {O}^k}y\, \textrm{log}(p_f(x;\theta ^k_f)), \end{aligned}$$
(9)

where \(p_f(x;\theta _f)\) is the filtering network’s predicted class distribution with parameters \(\theta _f\). In this case, we define \(\mathcal {T}^k\) as the set of data presumed to be clean within \(\mathcal {X}^k\) as:

$$\begin{aligned} \begin{aligned} \mathcal {T}^{\,k}=\{(x,y)\mid \ {}&\mathrm {arg\,max}(y)=\mathrm {arg\,max}(p_ f (x;\theta ^k_f)),\\&\textrm{max}(p_ f (x;\theta ^k_f))\ge \tau _{\textrm{label}},\ (x,y)\in \mathcal {X}^k\}. \end{aligned} \end{aligned}$$
(10)

Note that \(\tau _{\textrm{label}}\) here is the same value used in Sect. 3.4. Then, the clean dataset \(\mathcal {T}\) ultimately yielded by K-fold cross-filtering can be expressed as \(\mathcal {T}=\mathcal {T}^{\,1} \cup \mathcal {T}^{\,2}\cup \cdots \cup \mathcal {T}^{\,\mathcal {K}}\).

Since we configure a clean dataset through the filtering process, our ability to warm up the main model is enhanced compared to that of plain cross-entropy. Following the method presented in Zhang et al. (2017), we train the main model using the supervised loss on the clean dataset \(\mathcal {T}\). Additionally, by utilizing consistency regularization as presented in Sohn et al. (2020), we train the model on all data included in the training dataset \(\mathcal {X}\). As demonstrated in Fig. 6, the loss distribution is more evidently differentiated when our warm-up method is used.
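A sketch of the K-fold cross-filtering procedure (Eqs. 9–10) is shown below; train_network and predict_probs are placeholders for ordinary supervised training and inference routines, and storing labels as class indices is an assumption made for illustration.

```python
# Sketch of K-fold cross-filtering (Eqs. 9-10). train_network and predict_probs are
# placeholders for standard supervised training and inference; labels are assumed to
# be stored as class indices alongside each image.
import numpy as np

def k_fold_cross_filter(images, labels, K, tau_label, train_network, predict_probs):
    """Return the indices of samples presumed clean, i.e., the set T."""
    N = len(labels)
    folds = np.array_split(np.random.permutation(N), K)
    clean_indices = []
    for k in range(K):
        held_out = folds[k]
        train_idx = np.setdiff1d(np.arange(N), held_out)      # O^k = X \ X^k
        theta_k = train_network(images, labels, train_idx)    # cross-entropy training (Eq. 9)
        probs = predict_probs(theta_k, images, held_out)      # out-of-fold predictions
        agree = probs.argmax(axis=1) == labels[held_out]      # arg max(y) = arg max(p_f)
        confident = probs.max(axis=1) >= tau_label            # max(p_f) >= tau_label
        clean_indices.extend(held_out[agree & confident])     # T^k (Eq. 10)
    return np.array(clean_indices)                            # T = T^1 ∪ ... ∪ T^K
```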

Table 2 Performance comparison for our method and the state-of-the-art methods on CIFAR-10 and CIFAR-100
Table 3 Performance comparison for our method and the state-of-the-art methods on CIFAR-10IDN and CIFAR-100IDN
Table 4 Performance comparison for our method and the state-of-the-art methods on CIFAR-10N and CIFAR-100N

4 Experiments

In order to evaluate the effectiveness of our method, we conduct experiments on synthetic datasets designed to have a variety of noise ratios and a real-world dataset, all of which follow standard LNL evaluation protocols (Li et al., 2020; Nishi et al., 2021; Xia et al., 2019; Wei et al., 2021b). All datasets used in our experiments are currently available for download from the internet, and we have cited the links accordingly.

4.1 Experiment Settings

4.1.1 CIFAR-10 and CIFAR-100

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009; cif, 2017) each contain 50,000 32x32 color training images. While the CIFAR-10 dataset comprises 10 classes with 6,000 images each, the CIFAR-100 dataset consists of 100 classes with 600 images each. We test two different noise settings, symmetric noise and asymmetric noise (Tanaka et al., 2018; Li et al., 2019). In the symmetric case, noisy labels are generated by randomly flipping the labels of a set proportion of training samples to other classes’ labels. In the asymmetric case, noisy labels are generated by exchanging labels between two specific classes with similar characteristics, such as deer to horse.

An 18-layer PreAct ResNet (He et al., 2016b) is used as the main network and the filtering network, both trained using SGD with a momentum of 0.9. We design SplitNet to consist of three blocks of three layers each (FC layer, batch normalization (Ioffe & Szegedy, 2015), and ReLU (Agarap, 2018)) and one projection layer at the end. For SplitNet training, we use AdamW (Loshchilov & Hutter, 2018) with a weight decay of 0.0005. In order to ensure a fair and objective comparison, the experiments follow the hyper-parameters of the state-of-the-art technique (Nishi et al., 2021). For all CIFAR experiments, we use the same hyper-parameters \(\beta _{1}=0.5\), \(\beta _{2}=0.95\), a batch size of 128, and a weight decay of 0.0005 for the main network and filtering network.

A complete list of the utilized hyper-parameters can be found in Table 1. Unlike recent studies (Li et al., 2020; Nishi et al., 2021), our method does not need to set a \({\lambda _u}\) that adjusts the weight of the unlabeled loss when training the SSL learner. This is because the weight adjusts itself according to the noise ratio through \(\eta = \mid \mathcal {C}\mid / \mid \mathcal {X}\mid \).

4.1.2 CIFAR-IDN

CIFAR-10IDN and CIFAR-100IDN (Xia et al., 2019; Chen et al., 2021b; cif, 2021a) are datasets with part-dependent label noise synthetically injected into CIFAR-10 and CIFAR-100, respectively. The noise is derived from the fact that humans perceive instances by breaking them down into parts, and the IDN transition matrix of an instance is estimated as a combination of the transition matrices of its parts. Experiment settings, including hyper-parameters, are identical to those used for CIFAR-10 and CIFAR-100.

4.1.3 CIFAR-N

Wei et al. (2021b) present the CIFAR-N dataset consisting of CIFAR-10N and CIFAR-100N (cif, 2021b). CIFAR-N equips the training datasets of CIFAR-10 and CIFAR-100 with human-annotated real-world noisy labels collected from Amazon Mechanical Turk. Unlike existing real-world noisy datasets, CIFAR-N is a controllable, easy-to-use, and moderate-sized real-world noisy dataset with both ground-truth and noisy labels. Experiment settings, including hyper-parameters, are identical to those used for CIFAR-10 and CIFAR-100.

4.1.4 Food-101N

Food-101N (Lee et al., 2018; foo, 2018) is a large-scale dataset with real-world noisy labels consisting of 310k images from online websites allocated to 101 classes. Image classification is evaluated on the Food-101 (Kaur et al., 2017) test set. For a fair comparison, we follow the previous work (Lee et al., 2018) and use ResNet-50 with ImageNet (Deng et al., 2009) pre-trained weights. We observed that the training data includes images that should not be learned, i.e., images not belonging to any of the given classes in Food-101N. For this reason, we set \(\beta _1\) and \(\beta _2\) to 1.0 and 0.7, respectively, keeping \(\beta _1\) at a high value to mask the excluded data.

4.1.5 WebVision

The WebVision dataset (Li et al., 2017; web, 2017) consists of 2.4 million images spanning 1,000 classes, sourced from the internet, similar to ImageNet ILSVRC12 (Krizhevsky et al., 2012). In line with prior studies (Li et al., 2020), the performance of baseline methods is compared using only the first 50 classes from the Google image subset. The effectiveness of these methods is evaluated based on their top-1 and top-5 accuracy rates, using both the WebVision validation set and the ImageNet ILSVRC12 dataset for benchmarking.

4.2 Experiment Results

4.2.1 Results on CIFAR Benchmarks

As demonstrated in Table 2, we compare against state-of-the-art methods under various ratios of symmetric noise and 40% asymmetric noise. The asymmetric noise is set at 40% because a rate higher than 50% would make specific classes theoretically indistinguishable (Li et al., 2020). We report substantial improvements in performance across all evaluated benchmarks, with the gains becoming even more evident under the more challenging strong noise ratios. Note that compared to Li et al. (2020) and Nishi et al. (2021), where the well-performing hyper-parameters differ depending on the strength of the noise ratio, and specifically compared to AugDesc, which has separate well-performing models depending on the strength of the noise ratio (i.e., DM-AugDesc-WS-SAW and DM-AugDesc-WS-WAW), our method enhances performance using a single model. Table 3 shows that our method outperforms the previous methods in 5 out of 6 settings on CIFAR-IDN. Table 4 shows that our method achieves state-of-the-art performance on all criteria on CIFAR-N. It should be noted that before our method, the previous state of the art differed for each criterion on the CIFAR-N benchmark. Note that we use identical hyper-parameters in all CIFAR experiments.

4.2.2 Results on Large-Scale Datasets

We evaluated our method on datasets containing real-world label noise. For the WebVision experiments, we followed previous works (Chen et al., 2019; Li et al., 2020) and used Inception-ResNet v2 (Szegedy et al., 2017) as the backbone model. Because Food-101N and WebVision are large-scale crawled datasets, they contain open-set noisy labels. Consequently, we employed the setting described in Sect. 3.4.1. Tables 5 and 6 present the results on Food-101N and WebVision, respectively. Our method outperforms all existing methods, with particularly notable results on WebVision, showing a 4.02%p improvement over the previous state of the art. These results on real-world noise datasets underscore our method’s efficacy in complex scenarios.

Table 5 Comparison against previous state-of-the-arts in test accuracy(%) on Food-101N
Table 6 Comparison against previous state-of-the-arts in test accuracy(%) on WebVision
Fig. 7
figure 7

Comparison of F1 score and accuracy. a, b, c, and d are the F1 score when the noise ratios are 20%, 50%, 80%, and 90%, respectively. e, f, g, and h are the accuracy when the noise ratios are 20%, 50%, 80%, and 90%, respectively. For all noise ratios, the F1 score and accuracy of SplitNet are higher, which means that SplitNet selects more actually clean data

Fig. 8
figure 8

Confusion matrix of SplitNet and GMM. The horizontal axis represents the prediction, and the vertical axis represents the ground truth in each confusion matrix. The far-left column shows the results of SplitNet trained through risk hedging after warm-up, the middle column shows the results of SplitNet trained with data filtered with a fixed threshold, and the far-right column shows the results of GMM. The top row shows results at epoch 0, and the bottom row shows results at epoch 150

5 Analysis

5.1 Ablation Study

In order to obtain a better understanding of why our method achieves state-of-the-art results, we study the effect of removing certain components. Table 7 shows the results obtained when each component is removed. When SplitNet is removed, the method reduces to plain consistency regularization (Sohn et al., 2020). Also, the absence of the warm-up means that the warm-up is conducted with the existing cross-entropy based method. It can be confirmed that SplitNet boosts performance across all noise settings and that it is even more effective when used together with the proposed warm-up. When using the open-set noisy labels setting, the performance difference is negligible when the noise level is low, but performance deteriorates as the noise level increases. This implies that there is no need to mask data when there are no open-set labels.

We also experiment with adding several of our components to DivideMix to assess performance changes. Merely substituting DivideMix’s SSL with FixMatch results in a performance decline, whereas incorporating our warm-up leads to performance improvements. Our warm-up proves particularly beneficial for enhancing performance in settings with a high noise ratio. These results underscore our warm-up’s versatility across different architectural frameworks, showing marked effectiveness, especially in synergy with SplitNet.

Table 7 Ablation study results in terms of test accuracy (%). The dagger symbol (\(\dag \)) denotes the open-set noisy labels setting

5.2 Distinguishing Ability of SplitNet

In this section, we evaluate the F1 score and accuracy of the SplitNet against the conventional method, i.e., GMM. For a more detailed comparison, we also provide a confusion matrix of SplitNet and GMM in Sect. 5.3.

5.2.1 Accuracy and F1 Score

When selecting clean data, it is not enough for the selected data to actually be clean; a large share of the actually clean data in the entire dataset must also be selected for the selection to be considered good. Therefore, we measured accuracy, a metric that considers the size of the entire dataset. Also, since there is a large difference between the number of clean and noisy samples, we verified our method with the F1 score, a metric that takes this imbalance into account.

The F1 score and accuracy are defined as follows:

$$\begin{aligned} \mathrm {F1\;Score}&= 2\cdot \frac{\textrm{precision}\cdot \textrm{recall}}{\textrm{precision}+\textrm{recall}}, \\ \mathrm {Accuracy}&= \frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{FN}+\textrm{FP}+\textrm{TN}}, \end{aligned}$$

where TP is the number of samples predicted to be clean that were actually clean, TN is the number predicted to be noisy that were actually noisy, FN is the number predicted to be noisy but actually clean, and FP is the number predicted to be clean but actually noisy.
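For reference, a small sketch computing both metrics over a clean/noisy split, treating clean as the positive class, is given below; the boolean-array interface is an illustrative assumption.

```python
# Sketch computing the two metrics above, treating "clean" as the positive class;
# is_clean_true / is_clean_pred are assumed boolean arrays over the whole dataset.
import numpy as np

def selection_metrics(is_clean_true, is_clean_pred):
    tp = np.sum(is_clean_true & is_clean_pred)
    tn = np.sum(~is_clean_true & ~is_clean_pred)
    fp = np.sum(~is_clean_true & is_clean_pred)
    fn = np.sum(is_clean_true & ~is_clean_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return f1, accuracy
```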

5.2.2 Accuracy and F1 Score Results

Fig. 7 shows the accuracy and F1 score of the clean/noisy split produced by SplitNet and by the GMM at each epoch. The results show that, despite its simple structure, SplitNet selects clean data with higher accuracy and F1 score than the GMM regardless of the noise ratio.

5.3 Confusion Matrix Comparison

Figure 8 shows the performance of SplitNet through a confusion matrix. The experiment was conducted on CIFAR-100 with a noise ratio of 90%. With SplitNet, the number of False Positives, which are data predicted to be Clean but actually Noisy, decreases dramatically.

5.4 Accuracy According to Structure

Figure 9 compares accuracy according to the structure of SplitNet. SplitNet fails to converge when batch normalization is not used (see Fig. 3c) and does not perform well when the prediction difference is not taken into account (see Fig. 3b). As shown in Fig. 3a, among 2, 3, and 4 layers, the best performance is achieved with 3 layers. Future work may explore further techniques (e.g., residual connections He et al. (2016a), DenseNet Huang et al. (2017)) to improve accuracy.

Fig. 9
figure 9

Accuracy according to the structure of SplitNet. w/o delta shows the accuracy when prediction difference is not considered, and 2,3,4 layers show the accuracy when SplitNet is composed of 2,3,4 layers, respectively

5.5 Effect of Dynamic Thresholding by Split Confidence

When the main network is trained with SSL, pseudo labels are generated for data whose confidence value exceeds the threshold. As shown in Fig. 10, the correctness of pseudo labeling is higher with a dynamic threshold as described in Eq. (6), compared to when the threshold value is fixed at 0.5 or 0.95.

Fig. 10
figure 10

Correctness of pseudo labels by threshold on CIFAR-100 with 90% noise ratio. a and b show the number of correct pseudo labels and wrong pseudo labels, respectively. A dynamic threshold generates more correct pseudo labels and fewer wrong pseudo labels than a fixed threshold

Figure 11 shows the change in average dynamic threshold by epoch. It shows a sharp drop around epoch 0, followed by a gradual decline. This indicates that as training progresses, the split confidence generally increases, leading to a lower average threshold.

Fig. 11
figure 11

Average Dynamic Threshold on CIFAR-100 with a 90% Noise Ratio

Fig. 12
figure 12

K -fold cross filtering performance evaluation on CIFAR-100 with various noise ratio

Fig. 13
figure 13

Examples of SplitNet failures. Notated in the form of “{ground-truth}/{noisy label}”

Table 8 Training time comparison

5.6 K-Fold Cross-Filtering Performance According to K

K, the number of partitions of the dataset in K-fold cross-filtering, can be set as a hyper-parameter. Figure 12 shows the test accuracy according to K on CIFAR-100 with 80% and 90% noise ratios. Accuracy saturates after the value of K reaches 8.

5.7 Failure Case Study

We examined cases where SplitNet failed to accurately select clean data, seeking specific trends using the CIFAR-10 dataset. Similar to methods employing a GMM (Li et al., 2020), SplitNet failed primarily on asymmetric noise samples, yet as depicted in Fig. 8, such occurrences were fewer. Figure 13 shows instances of noise detection failures by SplitNet.

5.8 Training Time Analysis

As shown in the training time analysis in Table 8, our method is more efficient than conventional methods. We compare the training time on CIFAR-10 with a 20% noise ratio. For a fair comparison, the training times are obtained using a single NVIDIA GeForce RTX 3090 GPU and an AMD EPYC 7282 CPU. Note that Nishi et al. (2021), the latest methodology, has a higher computational cost and therefore requires more training time than Li et al. (2020).

6 Conclusion and Discussion

There has been rapid growth in LNL in recent years. Although progress has accelerated, the setting is becoming more complex due to issues such as having to set different hyper-parameters depending on the noise ratio. The relevance of our method compared to previous ones is that it achieves state-of-the-art performance on most benchmarks with only a single model. Our method enhances the existing warm-up through K-fold cross-filtering and SSL training. Additionally, it improves SSL so that it can be better applied to LNL through risk hedging and dynamic thresholding. Moreover, we conduct extensive ablation studies to identify why our method is successful and to validate the effect of each component. A natural next step, which we leave for future work, is to extend our method to other domains such as audio, text, and video.

Broader Impact. Previous state-of-the-art methods used different hyper-parameters depending on the noise ratio or created different settings appropriate for each case with different models. Differing from these previous methods, our method is designed more flexibly so that it can be adapted to various environments. When applying LNL in the real world, it is often not possible to know the noise ratio of the collected data. Thus, being robust to the noise ratio is very important for applying LNL to real environments. The distinguishing point of our method is that it can be effectively applied in a real-world environment and can especially be of great aid to organizations with low budgets that face difficulties in obtaining high-quality, refined datasets.