GenKL: An Iterative Framework for Resolving Label Ambiguity and Label Non-conformity in Web Images Via a New Generalized KL Divergence

Web image datasets curated online inherently contain ambiguous in-distribution instances and out-of-distribution instances, which we collectively call non-conforming (NC) instances. In many recent approaches for mitigating the negative effects of NC instances, the core implicit assumption is that the NC instances can be found via entropy maximization. For “entropy” to be well-defined, we interpret the output prediction vector of an instance as the parameter vector of a multinomial random variable, with respect to some trained model with a softmax output layer. Hence, entropy maximization is based on the idealized assumption that NC instances have predictions that are “almost” uniformly distributed. However, in real-world web image datasets, there are numerous NC instances whose predictions are far from being uniformly distributed. To tackle the limitation of entropy maximization, we propose the (α, β)-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$, which can be used to identify significantly more NC instances.
Theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$ are proven, and we also show empirically that a simple use of $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$ outperforms all baselines on the NC instance identification task. Building upon the (α, β)-generalized KL divergence, we also introduce a new iterative training framework, GenKL, that identifies and relabels NC instances.
When evaluated on three web image datasets, Clothing1M, Food101/Food101N, and mini WebVision 1.0, we achieved new state-of-the-art classification accuracies: 81.34%, 85.73% and 78.99%/92.54% (top-1/top-5), respectively.


Introduction
Web data is an abundant source for curating image datasets (Bossard et al., 2014; Kaur et al., 2017; Lee et al., 2018; Liang et al., 2020; Shang et al., 2018; Xiao et al., 2015). Raw web images collected online are typically annotated with weak-supervision methods (Xiao et al., 2015; Varma & Ré, 2018; Tekumalla & Banda, 2021; Zhang et al., 2021; Helmstetter & Paulheim, 2021; Yang et al., 2022). Although much more efficient than manual annotation, such automated annotation methods inevitably introduce non-conforming (NC) instances, which comprise both ambiguous in-distribution (ID) instances and out-of-distribution (OOD) instances. For example, Clothing1M (Xiao et al., 2015), a large-scale web image dataset that is well-known for containing real-world label noise, also contains NC instances; see Fig. 1 for explicit examples. Whether ambiguous ID or OOD, these NC instances may lead to significant performance degradation during training, which is not surprising since neural networks are capable of achieving zero training error even in the extreme case of completely noisy data (Arpit et al., 2017; Yao et al., 2021; Zhang et al., 2021). How then do we deal with such NC instances in web image datasets?
Kai Fong Ernest Chong (ernest_chong@sutd.edu.sg), Xia Huang (xia_huang@mymail.sutd.edu.sg). Singapore University of Technology and Design, 8 Somapah Road, Singapore 487372, Singapore.
Numerous works (Goldberger & Ben-Reuven, 2016; Hendrycks et al., 2018; Ma et al., 2020; Patrini et al., 2017; Peng et al., 2020; Sharma et al., 2020; Xia et al., 2019; Yao et al., 2020) have tackled this problem by viewing NC instances through the lens of label noise. The underlying assumption is that NC instances hurt performance because they (may) have incorrect labels. From this viewpoint, the problem is then reduced to a simpler one: How do we alleviate the effect of label noise? However, such a simplification ignores the role of image content in these NC instances. Ambiguous ID instances, especially those that do not fit neatly into a single label class (e.g. due to vague or incomplete object presentation, occlusion, etc.), could still hurt performance even if they are correctly labeled. Although incorrectly labeled by definition, OOD instances could still have visual features similar to those present in images of certain label classes, which may distort the learned feature space (and so hurt performance), especially if they could be confused even by humans for some of the given label classes.
Fig. 1 (caption fragment) [...] instances. The images in the green section depict clean instances, i.e. images with correct labels. The 14 given labels in Clothing1M are: T-Shirt, Shirt, Knitwear, Chiffon, Sweater, Hoodie, Windbreaker, Jacket, Downcoat, Suit, Shawl, Dress, Vest, Underwear.
More recent methods avoid this over-simplification by incorporating various data sampling techniques based on image content (Albert et al., 2022; Guo et al., 2018; Han et al., 2019; Lee et al., 2018; Tu et al., 2020; Xu et al., 2021; Yao et al., 2021), many of which use a common underlying idea that NC instances can be identified via entropy maximization (Albert et al., 2022; Chan et al., 2021; Kirsch et al., 2021; Macêdo & Ludermir, 2021a; Macêdo et al., 2021b, c; Yao et al., 2021; Yu & Aizawa, 2019). Informally, the most easily identifiable NC instances are precisely those with near-maximum entropy. To make sense of this information-theoretic notion of (Shannon) entropy, we implicitly assume¹ that, with respect to a trained classifier, instances have prediction vectors that are valid as parameter vectors of multinomial distributions. Moreover, we interpret the prediction of an instance probabilistically to be a random variable with this multinomial distribution. The entropy of an instance is then simply the entropy of its prediction. Since the discrete uniform distribution has maximum entropy, it follows that the most easily identifiable NC instances are precisely those instances whose predictions are "almost" uniformly distributed. Thus, entropy maximization is the idea of categorizing an instance as an NC instance if its entropy exceeds a certain threshold.
However, there is a fundamental limitation of using entropy maximization: Not all NC instances have predictions that are "almost" uniformly distributed. Consider the hypothetical example of a web image dataset with 10 classes, consisting of 5 animal classes and 5 plant classes. Let x be an instance whose prediction score for each animal class is 0.20, and whose prediction scores for all plant classes are 0.00. Clearly, x could be interpreted as an NC instance that is predicted to be related to the 5 animal classes with equal uncertainty, but predicted to be unrelated to any of the plant classes. Next, consider another instance x′, whose prediction score for the first animal class is 0.55, and whose prediction scores for all remaining 9 classes are 0.05 each. In contrast, x′ is naturally not an NC instance, since it fits well into exactly one class. However, the normalized² entropy of x is $\frac{\log 5}{\log 10} \approx 0.699$, while the normalized entropy of x′ is ≈ 0.728. If the threshold is less than 0.728, then x′ would be misclassified as an NC instance. If the threshold is greater than 0.699, then x would be misclassified as a non-NC instance. Hence, every possible threshold on normalized entropy would miscategorize at least one of x, x′. For more concrete examples, see Table 1 for NC instances and non-NC instances (in Clothing1M) that cannot be distinguished using thresholds on their normalized entropies. As these examples reveal, entropy maximization is inherently inadequate for identifying NC instances.
Table 1 The (α, β)-generalized KL divergences, KL divergences, normalized entropies, and prediction vectors, for the images shown in Fig. 1. (We used $p = \left(\frac{1}{k}, \dots, \frac{1}{k}\right) \in \mathbb{R}^k$, α = 0.7 and β = 0.03 for our (α, β)-generalized KL divergence, where k = 14 is the number of classes in Clothing1M.) The table rows are arranged according to the KL divergences in ascending order. Normalized entropy refers to entropy divided by the maximum possible entropy (log k). For prediction vectors, the non-dominant entries with values less than $\frac{1}{k} - \beta$ are highlighted in red. Note that there is no possible threshold for either KL divergence or normalized entropy that would distinguish clean instances from NC instances across all ten images in Fig. 1, since image C-1 has low KL divergence and high normalized entropy, while image C-2 has high KL divergence and low normalized entropy. In contrast, it is possible to distinguish non-NC instances from NC instances by checking whether their (α, β)-generalized KL divergence is negative (see boldfaced values) or non-negative, respectively.
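The hypothetical 10-class example with x and x′ can be checked numerically. The following is a minimal sketch in plain Python; the function name `normalized_entropy` and the variable names are illustrative choices, not names used in the paper.

```python
import math

def normalized_entropy(q):
    """Shannon entropy of q divided by log k, with the convention 0 log 0 := 0."""
    k = len(q)
    h = -sum(qj * math.log(qj) for qj in q if qj > 0)
    return h / math.log(k)

# Prediction vectors from the hypothetical example (5 animal classes, 5 plant classes).
x = [0.20] * 5 + [0.00] * 5        # spread evenly over the animal classes
x_prime = [0.55] + [0.05] * 9      # fits well into exactly one class

h_x = normalized_entropy(x)
h_x_prime = normalized_entropy(x_prime)
print(round(h_x, 3), round(h_x_prime, 3))  # 0.699 0.728
```

Since the NC-like instance x has the lower normalized entropy of the two, no single entropy threshold can flag x as NC while leaving x′ unflagged.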
Beyond entropy maximization. Entropy maximization is equivalent to the minimization of the Kullback-Leibler (KL) divergence with respect to the uniform distribution. Building upon this observation, we introduce the (α, β)-generalized KL divergence, a new generalization of KL divergence that is well-suited for identifying NC instances. In essence, we are extending this KL divergence minimization idea to the case of minimizing the "divergence" from p to q, relative to those dominant entries in the parameter vector of q. Here, the precise meaning of "dominant" depends on our hyperparameters α and β, whose values we can adjust. (Intuitively, an entry is dominant if it is not small.) By using the (α, β)-generalized KL divergence, we are able to not only identify NC instances whose prediction vectors have entries that are all dominant (i.e. instances with "almost" uniformly distributed predictions), but also identify additional NC instances whose prediction vectors have fewer dominant entries. Thus, the (α, β)-generalized KL divergence directly addresses the fundamental limitation of using entropy maximization.
Fig. 2 An overview of our GenKL framework in iteration t. GenKL has two stages; we iterate between the two stages until the model converges. Stage one includes two critical components: (i) identification of NC instances using (α, β)-generalized KL divergence; and (ii) relabeling of NC and non-NC instances using soft labels. In component (i), given the i-th instance $(x^i_{\text{main}}, y^i_{\text{main}})$, we obtain its prediction vector $q^i$ from the model $F(x^i_{\text{main}} \mid \Theta_{t-1})$. Next, we compute the (α, β)-generalized KL divergence $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q^i)$, where p is a uniform-like vector (see Sect. 3.4 for its definition), then identify our i-th instance as an NC instance if $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q^i) \geq 0$, and as a non-NC instance otherwise. In component (ii), there are two scenarios: (ii-a) If $(x^i_{\text{main}}, y^i_{\text{main}})$ is an NC instance, then assign to it a uniform vector $\left(\frac{1}{k}, \dots, \frac{1}{k}\right)$ as its soft label; and (ii-b) If $(x^i_{\text{main}}, y^i_{\text{main}})$ is a non-NC instance, then assign to it a double-hot vector $\tilde{q}^i$ as its soft label. In stage two, an initialized model is trained on $X_{\text{non-NC}}$, $X_{\text{NC}}$ with their respective soft labels, and $X_{\text{clean}}$ with its given labels. Full details can be found in Sect. 3.5.
Robust training with generalized KL divergence. To improve the robustness of training classifiers on web image datasets with both NC instances and non-NC instances with label noise, we propose GenKL, a general training framework based on our (α, β)-generalized KL divergence. There are two stages in GenKL. For stage one, there are two key components: (i) identification of NC instances using (α, β)-generalized KL divergence; and (ii) relabeling of NC and non-NC instances using soft labels. For stage two, we perform weight initialization, then carry out the usual training on the relabeled instances and clean instances. By iteratively alternating between the two stages, GenKL is robust to both NC instances and label noise in the input data. See Fig. 2 for an overview of the GenKL framework. Our experiments on web image datasets show that GenKL is able to achieve state-of-the-art (SOTA) accuracies.
Our main contributions are given as follows:
• We propose a new generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$, which is well-suited for identifying NC instances.
• We prove theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$, whose proofs are provided in the Appendix.

Dealing with NC Instances
Existing approaches for dealing with NC instances can be broadly categorized into two types: (i) treating NC instances more generally as instances with label noise; and (ii) identifying and alleviating the effects of NC instances. Intuitively, although instances with label noise include unambiguous ID instances with incorrect labels, which are not NC instances, any method that tackles the problem of label noise would naturally tackle the sub-problem of OOD instances.
For the first type, existing works can be categorized based on their use of (i) estimators for noise transition matrices, (ii) robust loss functions, (iii) regularization techniques, and (iv) other specialized model architectures. An accurate estimate of the noise transition matrix is vital to many applications, because the matrix can not only model the label corruption process for the dataset, but can also be used to infer the clean class posterior probabilities of the dataset (Han et al., 2020). There are numerous methods that estimate the noise transition matrix, which is then used either to alleviate the effect of label noise (Sukhbaatar & Fergus, 2014; Xiao et al., 2015; Han et al., 2018), or to perform label correction (Hendrycks et al., 2018; Patrini et al., 2017; Xia et al., 2019, 2020). Various loss functions have been proposed (Ghosh et al., 2017; Lyu & Tsang, 2019; Ma et al., 2020; Song et al., 2019; Wang et al., 2019; Hendrycks et al., 2018; Patrini et al., 2017; Xia et al., 2019; Yao et al., 2020), which are specifically designed to be robust to noisy data. Also, regularization techniques (Ioffe & Szegedy, 2015; Jenni & Favaro, 2018; Krogh & Hertz, 1991; Pereyra et al., 2017; Shorten & Khoshgoftaar, 2019; Srivastava et al., 2014; Zhang et al., 2017) are used to increase the generalization capability of a trained model. Such techniques are frequently combined with other methods. Finally, some methods do not fit neatly into the first three sub-categories, and instead use newly proposed specialized model architectures (Wang et al., 2021; Peng et al., 2020; Yang et al., 2021; Tu et al., 2020; Yao et al., 2018). CAN (Yao et al., 2018) is a contrastive-additive noise network that has a contrastive layer to estimate a quality embedding space, and an additive layer for estimation aggregation. AFM (Peng et al., 2020) introduces a training block that suppresses mislabeled data via grouping and self-attention; see also Xu et al. (2022). SOMNet (Tu et al., 2020) groups images and their proposed regions-of-interest (ROIs) from the same category into bags; weights are then assigned to the bags based on their discriminative scores with the nearest clusters. These methods of the first type are designed to deal with label noise, and do not specifically target the challenge of NC instances. Hence, these methods have limited impact on web image datasets that have a non-negligible number of NC instances.
For the second type, numerous works identify NC instances using entropy-based methods. Many of them use entropy maximization to identify OOD instances (Chan et al., 2021; Kirsch et al., 2021; Macêdo & Ludermir, 2021a; Macêdo et al., 2021b, c). Other works use divergences closely related to KL divergence to identify NC instances. For example, Jo-SRC (Yao et al., 2021) uses Jensen-Shannon divergence to separate clean instances from OOD and noisy ID instances. It then compares the consistency of the predictions of each non-clean instance from multiple views (via data augmentation) to determine whether the instance is OOD or noisy ID. Another example is DSOS (Albert et al., 2022), which first separates non-clean instances from clean instances via collision entropy, then uses beta mixture models to identify noisy ID and OOD instances, and applies label correction to improve classification accuracy. It should be noted that in Albert et al. (2022), the authors hypothesized that DSOS might not perform as well if most of the non-clean instances are OOD rather than noisy ID, which may limit the effectiveness of DSOS on some web image datasets, e.g. Clothing1M.

Generalizations of KL Divergence and Other Divergences
KL divergence (also called relative entropy) is a non-negative, unbounded divergence between two stochastic vectors p and q, defined as follows:
$$\mathcal{D}_{\text{KL}}(p\|q) := \sum_{j=1}^{k} p_j \log \frac{p_j}{q_j}.$$
Note that KL divergence is asymmetric, i.e. in general, $\mathcal{D}_{\text{KL}}(p\|q) \neq \mathcal{D}_{\text{KL}}(q\|p)$. There are multiple variants of KL divergence and other divergences:
• Jeffreys divergence (Jeffreys, 1998) is defined by
$$\mathcal{D}_{\text{J}}(p\|q) := \mathcal{D}_{\text{KL}}(p\|q) + \mathcal{D}_{\text{KL}}(q\|p),$$
which is a symmetric analog of KL divergence.
• Jensen-Shannon (JS) divergence is defined by
$$\mathcal{D}_{\text{JS}}(p\|q) := \tfrac{1}{2}\mathcal{D}_{\text{KL}}(p\|m) + \tfrac{1}{2}\mathcal{D}_{\text{KL}}(q\|m), \quad m := \tfrac{1}{2}(p + q),$$
which is another symmetric analog of KL divergence. This divergence is bounded within [0, 1] if logarithms are taken over base 2 (Lin, 1991).
• Decision Cognizant (DC) KL divergence was first introduced in Ponti et al. (2017). Suppose arg max(p) = s, arg max(q) = t, and define the set $\Omega := \{1, \dots, k\} \setminus \{s, t\}$. Then this divergence is defined by
$$\mathcal{D}_{\text{DC}}(p\|q) := \sum_{j \in \{s,t\}} q_j \log \frac{q_j}{p_j} + \Big(\sum_{j \in \Omega} q_j\Big) \log \frac{\sum_{j \in \Omega} q_j}{\sum_{j \in \Omega} p_j}.$$
Similar to KL divergence, this DC KL divergence is non-negative, unbounded and asymmetric. The main difference for this divergence is that the contributions from minority classes are reduced, which is what Ponti et al. (2017) refer to as being "decision cognizant".
• Delta divergence (Kittler & Zor, 2018): Let arg max(p) = s, let arg max(q) = t, and define the index set $\Omega := \{1, \dots, k\} \setminus \{s, t\}$. Then Delta divergence is defined by [...] where $p_\Omega = \sum_{j \in \Omega} p_j$ and $q_\Omega = \sum_{j \in \Omega} q_j$. This divergence is non-negative, bounded, symmetric and effectively groups non-dominant entries into a single class.
Rényi entropy generalizes the usual notion of (Shannon) entropy, and it is defined by
$$H_\alpha(p) := \frac{1}{1 - \alpha} \log \sum_{j=1}^{k} p_j^{\alpha},$$
where α ≥ 0, α ≠ 1. Note that Rényi entropy becomes Shannon entropy in the limit as α → 1.
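To make the stated properties concrete (asymmetry of KL divergence, symmetry of the Jeffreys and JS divergences, the [0, 1] bound for base-2 JS divergence, and the α → 1 limit of Rényi entropy), the following is a small sketch using the standard textbook definitions. The function names and the example vectors are illustrative assumptions; the DC KL and Delta divergences are not covered here.

```python
import math

def kl(p, q, base=math.e):
    """KL divergence sum_j p_j log(p_j / q_j), with 0 log 0 := 0."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj > 0:
            if qj == 0:
                return math.inf
            total += pj * math.log(pj / qj, base)
    return total

def jeffreys(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q, base=2):
    """Average KL divergence of p and q to their midpoint m = (p + q) / 2."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * kl(p, m, base) + 0.5 * kl(q, m, base)

def shannon_entropy(p):
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha >= 0, alpha != 1)."""
    return math.log(sum(pj ** alpha for pj in p if pj > 0)) / (1 - alpha)

p = [0.7, 0.2, 0.1]   # illustrative stochastic vectors
q = [0.1, 0.3, 0.6]
print(kl(p, q), kl(q, p))                    # two different values
print(jensen_shannon(p, q))                  # lies in [0, 1] for base 2
print(renyi_entropy(p, 1.000001), shannon_entropy(p))  # nearly equal
```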

Proposed Method
This section introduces GenKL, an iterative training framework robust to both NC instances and instances with label noise. We begin in Sect. 3.1 with a rigorous definition of NC instances. Next, we cover the preliminaries in Sect. 3.2, followed by a formal introduction to our (α, β)-generalized KL divergence $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$ in Sect. 3.3. Section 3.4 describes the usage of (α, β)-generalized KL divergence. We then build upon this (α, β)-generalized KL divergence, and give full algorithmic details of our proposed GenKL framework in Sect. 3.5. Finally, in Sect. 3.6, we give further details on "double-hot vectors", an important ingredient of our GenKL framework.

What Exactly are NC Instances?
To rigorously define what NC instances are, it is helpful to think of the dataset annotation process. In particular, the annotation process of "assigning a single class label to an input image" can be interpreted in terms of solving an object detection task.
Imagine a two-step object detection process:
• Step 1: We locate the objects in the input image, which correspond to regions of the image (specified by bounding boxes) whose content is distinguishable from the background; this is commonly known as generic object detection (Liu et al., 2020); cf. Maaz et al. (2022). For convenience, let O denote the set of located objects.
• Step 2: Given a fixed set L of object class labels, we shall try to assign each object in O to a label in L. We shall assume that an object is assigned a label y ∈ L only if all salient features of the object class associated to y are present/detected in the bounding box of the object. It is possible that some objects in O cannot be assigned any label in L, in which case, we consider such objects to be out-of-distribution (OOD). Objects that are assigned labels in L are called in-distribution (ID). If an ID object in O could be assigned multiple labels in L, then we say it is an ambiguous ID object.
Based on the located objects in O, together with the labels assigned to these objects wherever possible, our goal is to assign a single label from L to represent the entire image. Although our goal is seemingly simple, there are subtle issues we have to address. Crucially, is there an obvious main object of interest? Are there multiple main objects of interest? How do we characterize objects in O to be main objects of interest?
Among the bounding boxes of all objects in O, let A denote the largest area among these bounding boxes. Given a threshold η (e.g., η = 0.5), we shall define an object obj ∈ O to be a main object of interest, if the bounding box for obj has an area of at least ηA. Let $O_{\text{interest}}$ denote the subset of O consisting of main objects of interest.⁴ For a single label y ∈ L to accurately represent a given input image, we require every object in $O_{\text{interest}}$ to be assigned a single label y. This brings us to our formal definition of NC instances:
Definition 1 Fix a set L of object class labels. For a given instance (x, y) (where x represents an image, and y is the corresponding given label in L, which may possibly be incorrect), define $O_{\text{interest}}$ for image x as above. Then we call (x, y) an NC instance, if there is no possible single label y′ ∈ L such that every object in $O_{\text{interest}}$ is assigned the same label y′.⁵
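The selection of main objects of interest by the area threshold η can be sketched as follows. This is only an illustration: the `(x_min, y_min, x_max, y_max)` box format, the function name, and the example boxes are assumptions introduced here, and the label assignment step of Definition 1 is not modeled.

```python
def main_objects_of_interest(boxes, eta=0.5):
    """Return indices of boxes whose area is at least eta * A,
    where A is the largest bounding-box area among all located objects.

    boxes: list of (x_min, y_min, x_max, y_max) tuples (hypothetical format).
    """
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    if not areas:
        return []
    largest = max(areas)
    return [i for i, area in enumerate(areas) if area >= eta * largest]

# Three located objects with areas 10000, 5600 and 400.
boxes = [(0, 0, 100, 100), (10, 10, 90, 80), (0, 0, 20, 20)]
print(main_objects_of_interest(boxes, eta=0.5))  # [0, 1]
```

With η = 0.5, the small third object is ignored, so only the labels of the first two objects matter when deciding whether a single label can represent the image.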

Preliminaries
Throughout, given any vector v, we shall let its j-th entry be denoted by $v_j$. (We shall always reserve subscripts on vectors to refer to their entries.) For any dataset D with N instances, we use the convention that the i-th instance is the pair $(x^i, y^i)$, where $x^i$ is the i-th feature vector/image. The corresponding label $y^i$ is an integer j ∈ {1, ..., k}. Let $e(y^i) = e(j)$ be the one-hot vector whose j-th entry is 1, and whose remaining entries are 0. For convenience, let $X_D := (x^1, \dots, x^N)$ (resp. $Y_D := (y^1, \dots, y^N)$) be the sequence of feature vectors/images (resp. sequence of labels) for D. A neural network model in iteration t is denoted by $F(\cdot \mid \Theta_t)$, where $\Theta_t$ represents the model weights in iteration t. Assume that for every input $x^i$, the output $q^i = F(x^i \mid \Theta_t)$ is a stochastic vector. For convenience, we define the entropy of a stochastic vector p to be $H(p) := -\sum_{j=1}^{k} p_j \log p_j$. Analogously, for stochastic vectors p and q, we define the cross-entropy from p to q to be $H(p, q) := -\sum_{j=1}^{k} p_j \log q_j$. In both H(p) and H(p, q), we use the usual convention that 0 log 0 := 0. Later, we shall abuse notation and use H(p, q) for the case when p is a non-stochastic vector. In this case, H(p, q) is defined using the same formula.
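As a minimal sketch of these two quantities, including the 0 log 0 := 0 convention and the "abuse of notation" where the first argument of H(p, q) is non-stochastic, one could write the following. The function names and example vectors are illustrative assumptions.

```python
import math

def entropy(p):
    """H(p) = -sum_j p_j log p_j, skipping zero entries (0 log 0 := 0)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_j p_j log q_j; entries with p_j = 0 contribute 0.
    The same formula is used when p is not a stochastic vector."""
    total = 0.0
    for pj, qj in zip(p, q):
        if pj == 0:
            continue
        if qj == 0:
            return math.inf  # positive weight on an entry of q that is zero
        total -= pj * math.log(qj)
    return total

q = [0.5, 0.25, 0.25]
print(entropy([0.25] * 4))               # log 4, the maximum for k = 4
print(cross_entropy(q, q) - entropy(q))  # 0.0, since H(p, p) = H(p)
print(cross_entropy([0.3, 0.0, 0.9], q)) # non-stochastic first argument
```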

(α, β)-generalized KL Divergence
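Given any two stochastic vectors p and q of length k, and given real values α, β satisfying α > 0 and $0 \leq \beta \leq \frac{1}{k}$, we shall define the (α, β)-generalized KL divergence from p to q as follows:
$$\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q) := \alpha \sum_{j=1}^{k} p_j \log p_j - \sum_{j=1}^{k} \mathbb{1}_{\beta}(q_j)\, p_j \log q_j. \tag{1}$$
Here, $\mathbb{1}_{\beta}(q_j)$ returns 1 if $q_j \geq \frac{1}{k} - \beta$, and returns 0 otherwise. Succinctly, we can write (1) as:
$$\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q) := -\alpha H(p) + H({}^{\beta}p, q), \tag{2}$$
where ${}^{\beta}p := \big(\mathbb{1}_{\beta}(q_1)\, p_1, \dots, \mathbb{1}_{\beta}(q_k)\, p_k\big)$. Note that we can easily compute ${}^{\beta}p$ by replacing with the value 0 those entries $p_j$ of p whose corresponding entries $q_j$ are less than $\frac{1}{k} - \beta$. In both (1) and (2), we use the usual convention that 0 log 0 := 0.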
There are two hyperparameters in $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$: α and β. Note that when α = 1 and $\beta = \frac{1}{k}$, our (α, β)-generalized KL divergence coincides exactly with the usual KL divergence. Informally, the first term −αH(p) is negative, where α > 0 is a hyperparameter that controls "how negative" this weighted "entropy" term is. The second hyperparameter β appears in the second term $H({}^{\beta}p, q)$, which is a positive "cross-entropy" term. Intuitively, β controls the threshold of what it means to be a non-dominant entry. Given a stochastic vector q of length k, we say that its i-th entry $q_i$ is dominant if $q_i \geq \frac{1}{k} - \beta$, and non-dominant otherwise. Thus, we are effectively computing the cross-entropy only for the dominant entries of q, where the contributions of the non-dominant entries of q are ignored.
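The description above (a weighted negative entropy term −αH(p), plus a cross-entropy term restricted to the dominant entries of q) can be sketched in a few lines. This sketch assumes that reading of the definition; the function name `generalized_kl` is illustrative, and the value α = 0.3 in the second part is a hypothetical choice made here for the 10-class toy example, not a setting reported in the paper.

```python
import math

def generalized_kl(p, q, alpha, beta):
    """Sketch of -alpha * H(p) + H(beta_p, q), where beta_p equals p with
    entry j zeroed whenever q_j < 1/k - beta (q_j is non-dominant)."""
    k = len(p)
    threshold = 1.0 / k - beta
    weighted_neg_entropy = alpha * sum(pj * math.log(pj) for pj in p if pj > 0)
    masked_cross_entropy = 0.0
    for pj, qj in zip(p, q):
        if qj < threshold or pj == 0:
            continue  # non-dominant entry of q, or 0 log 0 := 0
        if qj == 0:
            return math.inf
        masked_cross_entropy -= pj * math.log(qj)
    return weighted_neg_entropy + masked_cross_entropy

# With alpha = 1 and beta = 1/k, this reduces to the usual KL divergence.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
usual_kl = sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))
print(generalized_kl(p, q, alpha=1.0, beta=1.0 / 3), usual_kl)

# Toy 10-class example from the Introduction, with a uniform p.
u = [0.1] * 10
x = [0.20] * 5 + [0.00] * 5
x_prime = [0.55] + [0.05] * 9
d_x = generalized_kl(u, x, alpha=0.3, beta=0.03)
d_x_prime = generalized_kl(u, x_prime, alpha=0.3, beta=0.03)
print(d_x >= 0, d_x_prime >= 0)  # True False
```

Under these illustrative hyperparameters, the sign of the divergence separates x from x′, which no single threshold on normalized entropy can do.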

Theorem 2 Let p and q be two stochastic vectors of length k. Then [...] (3)
• If [...], then equality holds if and only if q = p and p is a one-hot vector.
• If [...] k and α > 1, then equality holds if and only if p = q and p is a uniform vector.
• If [...] k, then equality holds if and only if q is a one-hot vector and p is a uniform vector.
Informally, Theorem 2 gives the full range of values that $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$ can attain, over all possible pairs of values for the hyperparameters α and β. In the special case when α = 1 and $\beta = \frac{1}{k}$, $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$ is the usual KL divergence from p to q, and Theorem 2 becomes the well-known "information inequality" from Information Theory; see (Thomas & Joy 2006, Thm. 2.6.3).

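Theorem 3 For any [...], if we define $R^{\beta}_{u} := \{(p, q) : p \text{ and } q \text{ are stochastic vectors}, [...]\}$, then we have that the (α, β)-generalized KL divergence is piecewise convex: This means that for all 0/1-vectors u, all pairs $(p, q), (p', q') \in R^{\beta}_{u}$, and all λ ∈ [0, 1],
$$\mathcal{D}_{\text{KL}}^{\alpha,\beta}\big(\lambda p + (1-\lambda) p' \,\big\|\, \lambda q + (1-\lambda) q'\big) \leq \lambda\, \mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q) + (1-\lambda)\, \mathcal{D}_{\text{KL}}^{\alpha,\beta}(p'\|q').$$
It is well-known that KL divergence $\mathcal{D}_{\text{KL}}(p\|q)$ is convex in the pair (p, q); see (Thomas & Joy 2006, Thm. 2.7.2). Theorem 3 extends this result to the general case of (α, β)-generalized KL divergence, where instead of convexity, we have piecewise convexity for $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$.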

Identification of NC Instances
Later, when we introduce GenKL in Sect. 3.5, we shall be considering two disjoint datasets $D_{\text{main}}$ and $D_{\text{clean}}$ with the same set of label classes, where $D_{\text{main}}$ has both NC instances and instances with label noise, while $D_{\text{clean}}$ is assumed to contain only non-NC instances that are clean (i.e., all labels are correct). We do not assume that the feature vectors/images of $D_{\text{main}}$ and $D_{\text{clean}}$ are sampled from the same distribution. We shall use labels "main" and "clean" to indicate membership in the respective datasets; for example, $X_{\text{main}}$ refers to the sequence $X_{D_{\text{main}}}$, while $(x^i_{\text{clean}}, y^i_{\text{clean}})$ refers to the i-th instance of $D_{\text{clean}}$.
When using $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q)$ to identify NC instances, the vector q we use is a prediction vector from a trained neural network. To capture the variance of the predictions, we use multiple "uniform-like" vectors for p, which are defined as follows: First, we sample a vector $\tilde{p}$, where each entry $\tilde{p}_j$ is sampled from the normal distribution $N(\frac{1}{k}, \sigma^2)$. (When σ = 0, $\tilde{p}$ is a uniform vector.) If the entries of $\tilde{p}$ are all non-negative, then we normalize the vector $\tilde{p}$ to generate the "uniform-like" vector p, where $p_j = \tilde{p}_j \big/ \sum_{i=1}^{k} \tilde{p}_i$. Let P be the set of uniform-like vectors p obtained via this sampling process. Given an instance with prediction vector q, we say that this instance is an NC instance if $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p\|q) \geq 0$ for any p in P.
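The sampling procedure and the identification rule can be sketched as below. Several points are assumptions made for this illustration rather than details confirmed by the text: samples with a negative entry are simply redrawn, "for any p in P" is read as "for at least one p in P", the divergence is computed as −αH(p) plus the cross-entropy over dominant entries of q, and all function names, the seed, σ = 0.01 and the example prediction vectors are hypothetical.

```python
import math
import random

def sample_uniform_like(k, sigma, rng):
    """Draw each entry from N(1/k, sigma^2); redraw if any entry is negative
    (assumed handling), then normalize so that the entries sum to 1."""
    while True:
        raw = [rng.gauss(1.0 / k, sigma) for _ in range(k)]
        if all(value >= 0 for value in raw):
            total = sum(raw)
            return [value / total for value in raw]

def generalized_kl(p, q, alpha, beta):
    """-alpha * H(p) + cross-entropy restricted to entries with q_j >= 1/k - beta."""
    threshold = 1.0 / len(p) - beta
    value = alpha * sum(pj * math.log(pj) for pj in p if pj > 0)
    for pj, qj in zip(p, q):
        if qj >= threshold and pj > 0:
            value -= pj * math.log(qj) if qj > 0 else -math.inf
    return value

def is_nc(q, P, alpha, beta):
    """Assumed reading: NC if the divergence is non-negative for some p in P."""
    return any(generalized_kl(p, q, alpha, beta) >= 0 for p in P)

rng = random.Random(0)
k = 14
P = [sample_uniform_like(k, sigma=0.01, rng=rng) for _ in range(5)]

near_uniform_q = [1.0 / k] * k
peaked_q = [0.87] + [0.01] * 13
print(is_nc(near_uniform_q, P, alpha=0.7, beta=0.03))  # True
print(is_nc(peaked_q, P, alpha=0.7, beta=0.03))        # False
```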

GenKL Framework
Our framework GenKL comprises two stages. In stage one, we identify NC and non-NC instances in $D_{\text{main}}$, and generate respective soft labels for both types of instances. In stage two, we perform training on $D_{\text{main}}$ with its newly assigned labels and $D_{\text{clean}}$ with its given labels (usual training). We iterate between stage one and stage two, until the model converges.
Stage one. There are two critical components in stage one: (i) identification of NC instances using (α, β)-generalized KL divergence; and (ii) relabeling of NC and non-NC instances using soft labels.
For component (i), we obtain the set Q of prediction vectors for all instances in $D_{\text{main}}$ from our model $F(\cdot \mid \Theta_{t-1})$. Using our (α, β)-generalized KL divergence, we then partition $X_{\text{main}}$ into two sub-sequences: $X_{\text{NC}}$ and $X_{\text{non-NC}}$, comprising the images of NC instances and non-NC instances, respectively. In particular, our (α, β)-generalized KL divergence allows the identification of NC instances whose predictions are not "almost" uniformly distributed.
For component (ii), the uniform vector $u := [\frac{1}{k}, \dots, \frac{1}{k}]$ is assigned to every image in $X_{\text{NC}}$ as its soft label. This makes the model more likely to have uniform-like prediction vectors for NC instances. In contrast, we assign what we call a double-hot vector, $\tilde{q}^i$, as the soft label to an image in $X_{\text{non-NC}}$. The double-hot vector $\tilde{q}^i$ of a non-NC instance $(x^i_{\text{main}}, y^i_{\text{main}})$ is defined as follows: First, let the vector of normalized class ratios of $D_{\text{pre}}$⁶ be denoted by v, and let the size of class j in $D_{\text{pre}}$ be $k_j$. This means that the j-th entry of v is: [...] (This vector v of normalized class ratios was previously used in Li et al. (2019) to define a weighted cross-entropy loss for imbalanced datasets.) Let $\lambda' = v_{y^i_{\text{main}}}$, $\lambda = \max(q^i)$, and $\ell = \arg\max q^i$. In words, λ′ is the class ratio for the class label of $x^i_{\text{main}}$, while λ is the maximum value of the entries in the prediction vector $q^i$, corresponding to an entry with index ℓ. Then $\tilde{q}^i$ is defined by:
$$\tilde{q}^i := \lambda'\, e(y^i_{\text{main}}) + \lambda\, e(\ell). \tag{4}$$
Note that $\tilde{q}^i$ is a weighted sum of two one-hot vectors, where the two weights λ′ and λ do not necessarily sum to 1. Let $\tilde{Q}$ denote the sequence of double-hot vectors corresponding to $X_{\text{non-NC}}$. More details on double-hot vectors, including their motivation and interpretation, can be found later in Sect. 3.6.
Stage two. In iteration t, a model is initialized and trained on $X_{\text{clean}}$ with given labels $Y_{\text{clean}}$, $X_{\text{NC}}$ with the uniform vector u as the common soft label, and $X_{\text{non-NC}}$ with double-hot vectors $\tilde{Q}$ as soft labels, respectively. Their respective loss functions are defined by (5), (6), and (7): [...]
⁶ We define $D_{\text{pre}}$ to be the dataset that the initial model $F(\cdot \mid \Theta_0)$ in the first iteration is pretrained on. Note that $D_{\text{pre}}$ could be $D_{\text{clean}}$, or $D_{\text{main}} \cup D_{\text{clean}}$. See Sect. 5.3 for more implementation details.
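The construction of the double-hot vector in (4) can be sketched as follows. The vector of normalized class ratios is passed in as a precomputed argument `v`, and the values used for `v` and `q` below are hypothetical. Labels are 0-based in this sketch, unlike the 1-based labels in the text.

```python
def double_hot(q, y, v):
    """Weighted sum of two one-hot vectors: v[y] at the given label y,
    plus max(q) at the index of the largest prediction (0-based indices)."""
    largest = max(q)
    top_index = q.index(largest)   # first index attaining the maximum
    soft_label = [0.0] * len(q)
    soft_label[y] += v[y]          # weight from the normalized class ratio
    soft_label[top_index] += largest  # weight from the model's confidence
    return soft_label

q = [0.1, 0.6, 0.3]    # hypothetical prediction vector
v = [0.2, 0.5, 0.3]    # hypothetical normalized class ratios

print(double_hot(q, y=2, v=v))  # [0.0, 0.6, 0.3]: two non-zero entries
print(double_hot(q, y=1, v=v))  # one non-zero entry, since y equals the arg max
```

When the given label agrees with the predicted label, the two weights add up on a single entry, so the resulting soft label has only one non-zero entry.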
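The overall loss function $L_{\text{all}}$ is defined as the weighted sum of the loss functions (5), (6) and (7), with weights $\omega_1$, $\omega_2$ and $\omega_3$, respectively. Here, $\omega_1$, $\omega_2$ and $\omega_3$ are hyperparameters that represent the weightage of the contributions of $X_{\text{clean}}$, $X_{\text{NC}}$ and $X_{\text{non-NC}}$ to the overall loss.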
The model weights update process is given by $\text{SGD}\big(L_{\text{all}}((X_{\text{clean}}, Y_{\text{clean}}), (X_{\text{non-NC}}, \tilde{Q}), (X_{\text{NC}}, u), \Theta_t)\big)$ in iteration t. A trained model is returned at the end of stage two, which would produce the prediction vectors for $D_{\text{main}}$ in the next iteration. See Algo. 1 for full algorithmic details.
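As a rough sketch of how the three contributions could be combined, the code below assumes that each individual loss is a mean cross-entropy H(target, prediction) over its group of instances, with one-hot targets for clean instances, the uniform vector for NC instances, and double-hot vectors for non-NC instances. The mean reduction, the function names and the toy numbers are assumptions made here; the exact forms of the individual losses are not reproduced in this sketch.

```python
import math

def cross_entropy(target, pred):
    """H(target, pred) = -sum_j target_j log pred_j (target may be non-stochastic).
    Assumes pred_j > 0 wherever target_j != 0, as with softmax outputs."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t != 0)

def mean_cross_entropy(targets, preds):
    if not targets:
        return 0.0
    return sum(cross_entropy(t, p) for t, p in zip(targets, preds)) / len(targets)

def overall_loss(clean, nc, non_nc, w1, w2, w3):
    """Weighted sum of the clean, NC and non-NC losses.
    Each group is a (targets, predictions) pair of equal-length lists."""
    return (w1 * mean_cross_entropy(*clean)
            + w2 * mean_cross_entropy(*nc)
            + w3 * mean_cross_entropy(*non_nc))

# Toy 2-class batch with a single instance per group.
clean = ([[1.0, 0.0]], [[0.5, 0.5]])    # one-hot target
nc = ([[0.5, 0.5]], [[0.5, 0.5]])       # uniform soft label
non_nc = ([[0.3, 0.8]], [[0.5, 0.5]])   # double-hot soft label
print(overall_loss(clean, nc, non_nc, w1=1.0, w2=1.0, w3=1.0))
```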

More on Double-Hot Vectors
The role of double-hot vectors in loss optimization during training can naturally be viewed through the lens of information theory. The key idea we use is that information content can be assigned to events of random processes. Recall that for any event A with corresponding probability Pr(A) = p, the information content of A is defined to be −log p, which intuitively quantifies how "surprising" the occurrence of that event would be.
In our paper, the prediction of an instance is defined to be a random variable, where the set of possible outcomes is {1, ..., k}, i.e. the set of all possible labels. For convenience, an outcome of the prediction shall be called a predicted label. This means if q is the prediction vector of an instance, then each entry $q_j$ is the corresponding probability that the predicted label is j. Thus, the usual cross-entropy loss of an instance can be interpreted as the information content of the event that the predicted label is the given label. For a clean instance, this cross-entropy loss (see (5)) becomes the information content of the event that the predicted label is the correct label.
Algorithm 1 Pseudocode for GenKL.
Require: Pre-trained weights $\Theta_0$, num_iters, "uniform-like" vector set $P$, $k$, $\alpha$, $\beta$, $\omega_1$, $\omega_2$, $\omega_3$.
Ensure: Final trained model $F(\cdot \mid \Theta_{\text{num\_iters}})$
for $t$ from 1 to num_iters do
    Stage one: NC instances identification and relabeling.
    Initialize $X_{\text{NC}}$, $X_{\text{non-NC}}$, $\tilde{Q}$ as empty sequences.
    Generate prediction vectors for $D_{\text{main}}$.
    for $p \in P$ and $(x^i_{\text{main}}, y^i_{\text{main}}) \in D_{\text{main}}$ do
        for $j \in \{1, \dots, k\}$ do
            if $q^i_j < \frac{1}{k} - \beta$ then
                $p_j \leftarrow 0$    (Set $p_j$ to be 0 if $q^i_j$ is a non-dominant entry.)

For a given non-NC instance, in addition to the usual prediction vector $q$, we also have a double-hot vector $\tilde{q}$. Let $A$ be the event that the predicted label is the correct label. Naturally, we are interested in event $A$, but we also have to deal with the uncertainty that the given label may be incorrect. Informally, we can think of both vectors $q$ and $\tilde{q}$ as measurements of two independent random processes that each yield information about the correct label of the given non-NC instance, and we would like to quantify the "combined" information we get from the measurements.
For an instance $(x, y)$ with given prediction vector $q$, the double-hot vector $\tilde{q}$ associated to this instance has at most two non-zero entries, at indices $y$ and $\ell := \arg\max q$. (Note that $\tilde{q}$ has only one non-zero entry if $y = \ell$.) Recall that the value of $\lambda$ in (4) depends only on the class distribution of the entire dataset $D_{\text{pre}}$, where $\lambda$ is larger if the normalized class ratio for object class $y$ is larger (i.e. if class $y$ is rarer, in the case that $D_{\text{pre}}$ is an imbalanced dataset). Intuitively, a given label is "more surprising" if it corresponds to a rarer class, and in this "more surprising" case, we would like to assign a larger prior belief probability that the given label is correct. Hence, $\lambda$ can be interpreted as a measure of our prior belief that the given label is correct. The value $\hat{\lambda} := q_{\ell}$ in (4) is by definition the probability of the most probable predicted label $\ell$. Hence, $\hat{\lambda}$ can be interpreted as a measure of the model's belief that the most probable predicted label is correct. Consequently, the double-hot vector $\tilde{q}$ can be interpreted as the overall relative measure of our prior belief (based on normalized class ratios) in comparison to the model's belief (based on training on the given labels), for the correctness of the given label versus the most probable predicted label.
If $\tilde{q}$ is a stochastic vector, then we can define a random variable $V$ for the same set $\{1, \dots, k\}$ of possible outcomes, such that $\Pr(V = j) = \tilde{q}_j$. For convenience, an outcome of $V$ shall be called a belief label. Hence, each entry $\tilde{q}_j$ is the corresponding probability that the belief label is $j$. Thus, we can interpret the loss value in (6) as the information content of $A \cap B$, where $A$ is the event defined as above, and $B$ is the event that the belief label is the correct label. Under the assumption that $A$ and $B$ are independent, it then follows from the law of total probability that $\Pr(A \cap B) = \sum_{j=1}^{k} q_j \tilde{q}_j$, so the information content of event $A \cap B$ is $-\log \sum_{j=1}^{k} q_j \tilde{q}_j$. Intuitively, as we minimize the value of the loss function $L_{\text{non-NC}}$ (see (6)), we are maximizing the probability that the correct label for the given non-NC instance is either the given label or the most probable predicted label, i.e. one of the indices of the non-zero entries of the double-hot vector. This intuition still holds when we drop the requirement that $\tilde{q}$ must be a stochastic vector, and later in our ablation study (see Table 6), we show that allowing $\tilde{q}$ to be a non-stochastic vector yields better performance in our experiments.
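As a small numerical illustration of the quantity $-\log \sum_{j} q_j \tilde{q}_j$ discussed above, the following sketch evaluates it for one instance. The function name is hypothetical, and base-2 logarithms are an assumption made here, following the convention stated for the experiments.

```python
import math


def joint_information_content(q, q_tilde):
    """Return -log2(sum_j q_j * q_tilde_j) for one instance.

    q:       prediction vector (strictly positive softmax output assumed)
    q_tilde: double-hot vector (or any other soft label vector)
    """
    overlap = sum(qj * tj for qj, tj in zip(q, q_tilde))
    return -math.log2(overlap)


# Hypothetical 3-class example: prediction vector and its double-hot vector.
q = [0.1, 0.6, 0.3]
q_tilde = [0.0, 0.6, 0.5]
value = joint_information_content(q, q_tilde)
```

Here only two terms contribute to the sum (0.6 × 0.6 and 0.3 × 0.5), which reflects the fact that a double-hot vector has at most two non-zero entries. If $\tilde{q}$ is not stochastic, the sum may exceed 1 and the value may be negative, so it is then no longer an information content in the strict sense.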

Discussion
Not all NC instances have "almost" uniformly distributed predictions. This was the fundamental limitation of entropy maximization methods that we highlighted in Sect. 1. Through the use of $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$, we are able to overcome this limitation and identify more NC instances. (Later in Sect. 5.2, we report our experiment results on the NC instance identification task.) Intuitively, the additional NC instances we identified would have prediction vectors whose entries are not all dominant, such that the values of those dominant entries are near-uniform. Since the prediction vector represents the multinomial distribution of the prediction (treated as a random variable), it then follows from our definition of NC instances (in Sect. 3.1) that the object classes corresponding to those dominant entries would have salient features that are detected in the images of these NC instances.
Consequently, for our proposed $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ to be effective in identifying more NC instances, we require the implicit assumption that the prediction model can detect the salient features of all object classes that are present in any input image. Specifically, the $j$-th entry of the prediction vector should be a good measure of the presence of the salient features of the $j$-th object class, where the more confident the prediction model is in detecting the salient features, the larger this $j$-th entry should be.
For example, consider Image AID-2 in Fig. 1. This is an ambiguous ID image where the feature "wrinkle-resistant fabric" is present. Interestingly, this is a salient feature of several object classes: Shirt, Windbreaker, Suit, Shawl (e.g. satin shawl), and Underwear (e.g. wrinkle-resistant pyjamas), and we noticed that the corresponding prediction vector obtained for this image has high scores for these respective entries.
In our GenKL framework, after NC instances have been identified, we then relabel non-NC instances with double-hot vectors. As elaborated in Sect. 3.6, double-hot vectors represent the overall relative measure of the beliefs for the correctness of the given label versus the most probable predicted label. Note that the effectiveness of the double-hot vectors in capturing this overall relative measure would depend on the effectiveness of the prediction model in detecting salient features. Hence, our implicit assumption is not only important to the NC instance identification task, but also to the iterative training process in our GenKL framework.
To obtain a prediction model that is able to adequately detect the salient features of all object classes, there are two general approaches. The first approach is rather natural: Maximize the overall classification accuracy of the prediction model. This is based on the intuition that a well-trained model with high classification accuracy, especially on non-NC instances, would be able to detect the salient features of the object classes with high confidence. The second approach is to train a model on a dataset consisting of sufficiently many unambiguous ID instances. This would naturally be satisfied if NC instances are relatively rare, while for a dataset where NC instances are more common, it could be better to pretrain on a clean subset of the dataset. Here we are implicitly assuming that the model is able to detect salient features of object classes with higher confidence, when trained on more unambiguous ID instances.
Finally, note that our GenKL framework has multiple hyperparameters. For a comprehensive sensitivity analysis of the effects of different choices of hyperparameter values, see Sect. 5.5 in the next section.

Experiments
We first describe in Sect. 5.1 the datasets we used in our experiments. Next, we report our experiments for the identification of NC instances in Sect. 5.2, and our experiments for the classification of web images in Sect. 5.3. In Sect. 5.4, we analyze the effectiveness of each component of GenKL via an ablation study. Finally, in Sect. 5.5, we provide a sensitivity analysis of the hyperparameters of our GenKL framework.

Datasets
Clothing1M: Clothing1M (Xiao et al., 2015) has over 1 million images collected from online shopping websites. There are a total of 14 clothing classes. During data curation, labels are automatically assigned based on the keywords in the text surrounding the collected images, so the assigned labels may be incorrect. The authors also provide an additional clean training set with 50k images, a clean validation set with 14k images, and a clean test set with 10k images.
Food101/Food101N: Food101 (Bossard et al., 2014) contains 101k food images collected from foodspotting.com, while Food101N (Lee et al., 2018) contains 310k images collected from Google, Bing, Yelp and TripAdvisor. Both datasets use a common taxonomy of 101 food classes. For Food101N, there are 305k images in the training set, of which 53k images have verified labels. It also has a validation set containing 5k images with verified labels.
Mini WebVision 1.0: The mini WebVision 1.0 (Jiang et al., 2018) dataset is a subset of the WebVision 1.0 dataset (Li et al., 2017). The training set of mini WebVision 1.0 has 66k images collected from Flickr and Google, which collectively form the first 50 classes of the larger WebVision 1.0 dataset. The test set of mini WebVision 1.0 consists of 2.5k images with verified labels. The reported results are averaged over a 5-fold cross-validation.

Experiments on Identification of NC Instances
This section compares the effectiveness of several divergences and methods to identify NC instances.
Baselines. We used 7 baselines in total, where 5 of these (Jo-SRC (Yao et al., 2021), DSOS (Albert et al., 2022), Delta divergence (Kittler & Zor, 2018), DC KL divergence (Ponti et al., 2017), and KL divergence) were introduced in Sect. 2. Our remaining two baselines are normalized entropy, which was described in the introduction, and the classic mean squared error. Throughout, logarithms are taken to base 2, $q$ is the prediction vector of an instance $(x_i, y_i)$, and $p$ is a uniform vector, with the exception that in our method, $p$ is instead a uniform-like vector. Each method determines whether $(x_i, y_i)$ is an NC instance, as described below; see also Appendix C for more details.
• Jo-SRC (Yao et al., 2021) has two hyperparameters $\tau_{\text{clean}}$ and $\tau_{\text{OOD}}$. Let $q$ and $q'$ be two different prediction vectors of the same instance $(x_i, y_i)$ obtained under two data augmentations. If $1 - D_{\text{JS}}(q \Vert e(y_i)) > \tau_{\text{clean}}$, and if $\min\{1, |\arg\max q - \arg\max q'|\} > \tau_{\text{OOD}}$, then this instance is an NC instance.
• Delta divergence (Kittler & Zor, 2018) has a hyperparameter $\tau$. If $D_{\Delta}(p \Vert q) \leq \tau$, then $(x_i, y_i)$ is an NC instance.
• DSOS (Albert et al., 2022) has two hyperparameters $\gamma$ and $\delta$. Let the output value of a beta mixture model with two components using input $q$ be $z$. If collision entropy H 2 ( q+e(y i )
• DC KL divergence (Ponti et al., 2017) has a hyperparameter $\tau_{\text{DC}}$. If $D_{\text{DC}}(p \Vert q) \leq \tau_{\text{DC}}$, then $(x_i, y_i)$ is an NC instance.
• Mean Squared Error (MSE), which is defined by

Experimental set-up. For Clothing1M, we used the 50k clean set as our clean instances. We also manually verified 200 NC instances out of approximately 2500 instances randomly selected from the 1 million noisy dataset. We used ResNet-50 (He et al., 2016) pretrained on ImageNet for all experiments in this section. To keep the class ratios invariant, we first used stratified sampling to randomly select 10% of the 50k clean set as test data. For the remaining 90% of the instances, we used stratified 5-fold cross-validation to split the data. The validation set is randomly shuffled and then selected with the same size as the test set. We randomly split the 200 NC instances into 2 folds of equal sizes: One fold is used to augment the validation set, while the other fold is used to augment the test set. A model is trained to generate prediction vectors for the respective (augmented) validation set, which has 100 NC instances; see Appendix C for further experiment details.
We obtained the test accuracies for all methods using the hyperparameters tuned on the validation set. Recall that $P$ is our set of uniform-like vectors. For our experiments on the identification of NC instances, we used a set $P$ with two vectors, one of which is the uniform vector, and we used $\alpha = 1.0247$, $\beta = 0.0665$, and $\sigma = 0.06$.
Evaluation metrics. The main metrics used are F1 score and Cohen's kappa score (Feuerman & Miller, 2005). Let the number of true positives, true negatives, false positives, and false negatives be denoted by TP, TN, FP and FN, respectively. Note that the number of predicted positives (resp. predicted negatives) is given by TP+FP (resp. TN+FN).
• Precision is the ratio of true positives to predicted positives, i.e. given by $\frac{\text{TP}}{\text{TP}+\text{FP}}$.
• Recall, also known as sensitivity, is the true positive rate, i.e. given by $\frac{\text{TP}}{\text{TP}+\text{FN}}$.
• Specificity is the true negative rate, i.e. given by $\frac{\text{TN}}{\text{TN}+\text{FP}}$.
In general, there is a trade-off between precision and recall. By adjusting probability thresholds, we can increase precision at the cost of decreasing recall, and vice versa. F1 score is a popular metric used to balance this trade-off between precision and recall, given by $\frac{\text{TP}}{\text{TP}+\frac{1}{2}(\text{FP}+\text{FN})}$.
Similarly, there is a trade-off between sensitivity and specificity. Again, by adjusting probability thresholds, we can increase sensitivity at the cost of decreasing specificity, and vice versa. Cohen's kappa score (Feuerman & Miller, 2005) is a popular metric used to balance this trade-off between sensitivity and specificity. This score is given by the formula

$$\frac{2(\text{TP} \times \text{TN} - \text{FN} \times \text{FP})}{(\text{TP}+\text{FP}) \times (\text{FP}+\text{TN}) + (\text{TP}+\text{FN}) \times (\text{FN}+\text{TN})}.$$
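The five metrics above can be computed directly from the four confusion-matrix counts. The sketch below is only illustrative: the function name `nc_metrics` and the example counts are hypothetical and are not taken from Table 2.

```python
def nc_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics of this section from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = tp / (tp + 0.5 * (fp + fn))
    kappa = 2 * (tp * tn - fn * fp) / (
        (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for a test set with 100 NC instances (positives).
metrics = nc_metrics(tp=50, tn=4850, fp=60, fn=50)
```

The F1 expression used here is algebraically the harmonic mean of precision and recall, which can be checked on any set of counts.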
Experiment results. Our experiment results are reported in Table 2. Among all evaluated methods, we achieved the highest F1 score of 0.463, and the highest kappa score of 0.448. Note that our method achieves the highest recall/sensitivity of 0.508, which is 0.020 above the second highest value 0.488. For precision, our method has the value of 0.434, which is only marginally second to that of normalized entropy, 0.438. For specificity, all methods perform well, with specificity values at least 0.969. Our method has specificity 0.985, which is only marginally second to that of normalized entropy, 0.991.
Note that although normalized entropy (used in entropy maximization methods) has the highest precision and specificity (with our method coming a close second), its true positive rate (i.e. recall or sensitivity) is only 0.306, which is significantly lower than the 0.508 achieved by our method. This means that our method identifies an additional 20.2% of all NC instances (a difference of 20.2 percentage points in recall), i.e. roughly 1.66 times as many NC instances as normalized entropy.

Experiments on Web Image Classification
Baselines. We used 9 baselines in total, where 2 of the baselines, DSOS (Albert et al., 2022) and Jo-SRC (Yao et al., 2021), were already introduced in Sect. 2.1. The rest of the baselines are described as follows:
• AFM (Peng et al., 2020) introduces a training block that suppresses mislabeled data via grouping and self-attention.
• CleanNet (Lee et al., 2018) detects instances with label noise and assigns weights accordingly.
• DivideMix (Li et al., 2020) uses sample loss to partition the training data into a clean set and a noisy set. Then, two networks are trained jointly based on each network's data partition.
• Joint optimization (Tanaka et al., 2018) tackles label noise by alternately updating network parameters and labels during training.
• MetaCleaner (Zhang et al., 2019) uses a noisy weighting module to estimate weights for each instance and uses a clean hallucinating module to learn from weighted representations.
• MoPro (Li et al., 2021) identifies clean, noisy and OOD instances and assigns pseudo-labels accordingly. "Momentum prototypes" are then computed, after which both cross-entropy loss and contrastive loss are jointly used to train the model with the newly assigned pseudo-labels and momentum prototypes.
• SMP (Han et al., 2019) is an iterative self-training framework that measures data complexity and classifies data into several class prototypes. Models are trained on prototypes with the least complexity, which are assumed less likely to be noisy.
Experiment Set-up. Across all experiments on the Clothing1M, Food101/Food101N and mini WebVision 1.0 datasets, we used the same ResNet-50 architecture (He et al., 2016). For Clothing1M and Food101/Food101N, we initialize the ResNet-50 using weights pretrained on ImageNet. For mini WebVision 1.0, we follow the same set-up as in our baselines (Li et al., 2020, 2021; Albert et al., 2022), and used the default random weight initialization.
For the Clothing1M dataset, we first trained on the combined set of 1 million noisy instances and 50k clean instances, then fine-tuned on the 50k clean set.
For the Food101/Food101N datasets, we followed the popular experiment set-up, where the models are first trained on the combined set of 306k noisy instances and 53k clean instances of Food101N, then fine-tuned on the 53k clean instances of Food101N. Evaluation is then done on the Food101 test set.
For the mini WebVision 1.0 dataset, there is no clean set provided. To identify clean instances, we followed the common set-up in MoPro (Li et al., 2021) and DivideMix (Li et al., 2020), and let $X_{\text{clean}}$ vary over the epochs. At the beginning of each epoch, we initialized $X_{\text{clean}}$ as the empty sequence, then computed $X_{\text{clean}}$ as follows: Using the weights from the previous epoch, an image is inserted into $X_{\text{clean}}$ if and only if the prediction value corresponding to its given label exceeds 0.5. For the first epoch, since there is no previous epoch, we instead trained a model with cross-entropy loss over 10 epochs, and used this trained model to identify $X_{\text{clean}}$.
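The per-epoch selection rule for $X_{\text{clean}}$ described above amounts to a simple threshold test on the prediction value of the given label. A minimal sketch, with a hypothetical function name and 0-indexed labels:

```python
def select_clean(predictions, labels, threshold=0.5):
    """Return the indices of the images to be inserted into X_clean.

    predictions: list of prediction vectors from the previous epoch's weights
    labels:      list of given labels (0-based indices in this sketch)
    An image is selected if and only if the prediction value of its given
    label strictly exceeds the threshold.
    """
    return [
        i for i, (q, y) in enumerate(zip(predictions, labels))
        if q[y] > threshold
    ]


# Hypothetical 2-class example with three images.
clean_indices = select_clean(
    predictions=[[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]],
    labels=[0, 0, 1],
)
```

In this example only the first image is selected: the second image has a prediction value of 0.4 for its given label, and the third has exactly 0.5, which does not exceed the threshold.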
Note that full experimental set-up details (for all methods, across all datasets) are provided in Appendix C.2. In particular, we trained all models until convergence.
In stage two of our GenKL framework, across all datasets, the set $Q$ of prediction vectors is obtained by averaging over multiple models trained on $D_{\text{pre}}$. For subsequent iterations, the set $Q$ of prediction vectors is obtained by averaging from the models in previous iterations. Throughout, for the Clothing1M dataset and the Food101/Food101N datasets, we used SGD with initial learning rate 0.001 and Nesterov momentum 0.9, while for the mini WebVision 1.0 dataset, we used SGD with initial learning rate 0.01. For Clothing1M, we used $\omega_1 = 1$, $\omega_2 = 32$, $\omega_3 = 1$. For Food101/Food101N, we used $\omega_1 = 20$, $\omega_2 = 100$, $\omega_3 = 1$. For mini WebVision 1.0, we used $\omega_1 = 10$, $\omega_2 = 32$, $\omega_3 = 4$. For more training details (e.g. learning schedule), see Appendix C.
For Clothing1M, the next best replicable baseline (i.e. with publicly available code) after GenKL is DivideMix, which is on average 1.86% lower (79.48% versus our 81.34%). For Food101/Food101N, although Jo-SRC (Yao et al., 2021) and AFM (Peng et al., 2020) have test accuracies closest to ours, it should be noted that these two methods are the two lowest-performing baselines when evaluated on Clothing1M. As for mini WebVision 1.0, we outperformed the second best method (DivideMix) by a margin of 0.71% for top-1 accuracy (78.99% versus 78.28%), and a margin of 0.16% for top-5 accuracy (92.54% versus 92.38%).

Ablation Study
In Table 6, we evaluate the performance of GenKL on the Clothing1M dataset when individual components are removed.
Averaging prediction vectors. When the set $Q$ of prediction vectors was obtained from one single model (instead of being obtained by averaging from multiple models), we had a test accuracy of 81.31%, which is a minor drop of 0.03% from the complete GenKL framework.
Stratified sampling. Our test accuracy dropped by 0.09% when vanilla random sampling was used in place of stratified sampling. This shows that our framework is not too affected by the imbalance between clean and non-clean (noisy and NC) instances.

Table note: We re-implemented our baselines wherever possible, using a common experiment set-up. For the two methods (#7, #8) that do not have publicly available code, we report the accuracies (marked with *) as indicated in the respective papers.
Table note: We re-implemented our baselines wherever possible, using a common experiment set-up. For the two methods (#5, #6) that do not have publicly available code, we report the accuracies (marked with *) as indicated in the respective papers.
Table note: We re-implemented all baselines using a common experiment set-up.
Table 6 note: The averaged best test accuracies and standard deviations (over 5 trials) are reported for four experiments, each of which is conducted with one component removed from the complete GenKL framework. For the removal of the third component (marked with *), we are removing the non-stochasticity of double-hot vectors used in the relabeling process; effectively, for this particular experiment, we are not dropping the assumption that label vectors should be stochastic.
Double-hot vector normalization. Recall that we used double-hot vectors for our iterative relabeling process in GenKL. By definition, these double-hot vectors are in general not stochastic vectors. We chose not to normalize these double-hot vectors because we could achieve higher test accuracies; cf. Sect. 3.6. For this part of our ablation study, we evaluated the effect of normalizing the double-hot vectors $\tilde{q}^i$. By (4), the new $j$-th entry after normalization is $\frac{\tilde{q}^i_j}{\lambda + \hat{\lambda}}$. With this normalization, the resulting average test accuracy is 81.18%, which is a slight drop of 0.16% when compared to the original GenKL framework without normalization. In particular, this shows that the updated label vectors (via our relabeling process) need not be stochastic to achieve better performance.
Iterative training. Recall that GenKL is an iterative framework (see footnote 9). To understand the effect of iterations, we report in Table 6 the test accuracy when the training stops after the first iteration. Among all the components evaluated, the removal of this component has the most significant impact, yielding a 0.40% drop in test accuracy.

Sensitivity Analysis
Recall from Sect. 3.5 that our GenKL framework has seven hyperparameters $\alpha$, $\beta$, $|P|$, $\sigma$, $\omega_1$, $\omega_2$ and $\omega_3$. The first two hyperparameters ($\alpha$ and $\beta$) come from our $(\alpha, \beta)$-generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$. The next two hyperparameters ($|P|$ and $\sigma$) are used in stage one (identification of NC instances) of our GenKL framework to generate a set $P$ of uniform-like vectors with associated standard deviation $\sigma$. The last three hyperparameters ($\omega_1$, $\omega_2$ and $\omega_3$) are the weight factors for the respective three loss terms, (5), (6) and (7), in our loss function (8). In Tables 7, 8, 9, 10, 11 and 12, we analyze the sensitivity of the performance of GenKL on the Clothing1M dataset, with respect to the values of the hyperparameters $\alpha$, $\beta$, $|P|$, $\sigma$, $\omega_2$, as well as with respect to the choice of the loss function for training. In particular, for our GenKL experiments on Clothing1M, we fixed $\omega_1 = 1$ and $\omega_3 = 1$ for simplicity. Hence, for our sensitivity analysis, we focused on $\omega_2$, which is the weight factor of the loss term (6) computed on non-NC instances.

Footnote 9: Among our baselines, both Joint Optimization (Tanaka et al., 2018) and SMP (Han et al., 2019) are also iterative frameworks.
Sensitivity of hyperparameter $\alpha$. Recall that $\alpha$ is the weight for the negative entropy term $-\alpha H(p)$ in our $(\alpha, \beta)$-generalized KL divergence $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$; see (2). An instance is identified as an NC instance if $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q) \geq 0$. Hence, as the value of $\alpha$ increases (note that $\alpha > 0$), the number of identified NC instances would decrease. We used $\alpha = 1.05$ in our GenKL experiments on the Clothing1M dataset. To see the effect of the value of $\alpha$, we also report the performance of GenKL for $\alpha = 0.90$ and $\alpha = 1.20$, while keeping other hyperparameter values the same; see Table 7. Our analysis demonstrates that the performance of GenKL is sensitive to the value of $\alpha$ (and hence the number of identified NC instances), where a deviation of $\pm 0.15$ in the value of $\alpha$ (from the optimal value $\alpha = 1.05$) resulted in an approximate 0.15% drop in accuracy.
Sensitivity of hyperparameter $\beta$. Recall that $\beta$ appears in the positive term H ( β p, q) in our $(\alpha, \beta)$-generalized KL divergence $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$. The larger the value of $\beta$, the larger this positive term will be, and hence, more NC instances would be identified. Intuitively, $\beta$ controls the threshold for when an entry of the prediction vector is considered dominant, where if $\beta$ is too large, then almost all entries are considered dominant. We used $\beta = 0.03$ in our GenKL experiments on the Clothing1M dataset. Since Clothing1M has $k = 14$ classes, it means that our value $\beta = 0.03$ is slightly less than half of $\frac{1}{k} \approx 0.07143$. To see the effect of the value of $\beta$, we also report the performance of GenKL for $\beta = 0.02$ and $\beta = 0.04$, while keeping other hyperparameter values the same; see Table 8. Our analysis demonstrates that the performance of GenKL is sensitive to the value of $\beta$ (and hence the number of identified NC instances), where a deviation of $\pm 0.01$ in the value of $\beta$ (from the optimal value $\beta = 0.03$) resulted in an approximate 0.2% drop in accuracy.
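The role of $\beta$ as a dominance threshold can be illustrated with the masking step that appears in Algorithm 1, where an entry $p_j$ is set to 0 whenever $q_j < \frac{1}{k} - \beta$. The sketch below covers only this single step, uses a hypothetical function name, and makes no assumption about what the algorithm does with the masked vector afterwards.

```python
def mask_non_dominant(p, q, beta):
    """Zero out the entries of p whose prediction value is non-dominant.

    p:    uniform-like vector (list of k values)
    q:    prediction vector of the instance (list of k probabilities)
    beta: an entry q_j is treated as non-dominant if q_j < 1/k - beta
    """
    k = len(q)
    threshold = 1.0 / k - beta
    return [0.0 if qj < threshold else pj for pj, qj in zip(p, q)]


# Hypothetical 4-class example: the threshold is 1/4 - 0.05 = 0.2.
masked = mask_non_dominant(
    p=[0.25, 0.25, 0.25, 0.25],
    q=[0.45, 0.40, 0.10, 0.05],
    beta=0.05,
)
```

In this example the last two entries fall below the threshold and are masked. Increasing `beta` lowers the threshold, so more entries are kept as dominant, which is consistent with the remark above that almost all entries are considered dominant when $\beta$ is too large.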
Sensitivity of hyperparameter $|P|$. Recall that $P$ is the set of uniform-like vectors, such that each $p \in P$ is used to compute $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ for the identification of NC instances. As we vary $p$ over all vectors in $P$, we take the union of all identified NC instances. Hence, as the value of $|P|$ increases (note that $|P| \geq 1$), the number of identified NC instances is expected to increase. We used $|P| = 20$ in our GenKL experiments on the Clothing1M dataset. To see the effect of the value of $|P|$, we also report the performance of GenKL for $|P| = 1$ and $|P| = 10$, while keeping other hyperparameter values the same; see Table 10.
Sensitivity of hyperparameter $\sigma$. Recall that $\sigma$ is the standard deviation of the normal distribution $N(\frac{1}{k}, \sigma^2)$, used for sampling the value of each entry in a uniform-like vector $p \in P$ (before normalization). Hence, for a sufficiently large set $P$, as the value of $\sigma$ increases (note that $\sigma > 0$), the number of identified NC instances would tend to increase. We used $\sigma = 0.05$ in our GenKL experiments on the Clothing1M dataset. To see the effect of the value of $\sigma$, we also report the performance of GenKL for $\sigma = 0.01$ and $\sigma = 0.1$, while keeping other hyperparameter values the same; see Table 11. Our analysis demonstrates that the performance of GenKL is sensitive to the value of $\sigma$ (and hence the number of identified NC instances), where the value $\sigma = 0.1$ may be a little too large, resulting in a slight 0.07% drop in accuracy (as compared to $\sigma = 0.05$).
Sensitivity of hyperparameter $\omega_2$. Recall that $\omega_2$ is the weight for the loss term $L_{\text{non-NC}}$ in (8). As the value of $\omega_2$ increases, the contribution of the non-NC instances to the overall loss would increase. We used $\omega_2 = 32$ in our GenKL experiments on the Clothing1M dataset. To see the effect of the value of $\omega_2$, we also report the performance of GenKL for $\omega_2 = 1$ and $\omega_2 = 64$, while keeping other hyperparameter values the same; see Table 12. Intuitively, when $\omega_2$ is sufficiently large (e.g. in the range of $\omega_2 = 32$ to $\omega_2 = 64$), the contribution of $L_{\text{non-NC}}$ to the overall loss has a regularization effect, thereby improving overall accuracy.
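The generation of the set $P$ of uniform-like vectors, with entries sampled from $N(\frac{1}{k}, \sigma^2)$ and then normalized, can be sketched as follows. Several details are assumptions made only for this illustration: the function name, the fixed seed, the clipping of negative samples to 0 before normalization, and the choice to always include the exact uniform vector as the first element of $P$.

```python
import random


def sample_uniform_like(k, sigma, size, seed=0):
    """Sample `size` uniform-like vectors of length k.

    Each entry is drawn from N(1/k, sigma^2); negative draws are clipped
    to 0 (an assumption of this sketch), and every vector is then
    normalized so that its entries sum to 1.
    """
    rng = random.Random(seed)
    vectors = [[1.0 / k] * k]          # assumption: P contains the uniform vector
    while len(vectors) < size:
        raw = [max(rng.gauss(1.0 / k, sigma), 0.0) for _ in range(k)]
        total = sum(raw)
        if total == 0.0:               # degenerate draw, resample
            continue
        vectors.append([x / total for x in raw])
    return vectors


# Values used for Clothing1M in this section: k = 14, sigma = 0.05, |P| = 20.
P = sample_uniform_like(k=14, sigma=0.05, size=20)
```

With a larger `sigma`, the sampled vectors deviate more from the uniform vector, which matches the observation above that more instances tend to be identified as NC instances as $\sigma$ increases.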
Sensitivity of the choice of loss function. Recall that we used cross-entropy loss as our loss function; see (5), (6), (7) and (8); cf. Sect. 3.6. To see the effect of the choice of loss function, we also report the performance of GenKL when the loss function is replaced by MSE, mean absolute error (MAE) and KL loss, respectively, while keeping all hyperparameter values the same; see Table 9. Our analysis demonstrates that our choice of cross-entropy loss is crucial for GenKL to outperform the baselines.

Conclusion
We introduced the notion of non-conforming (NC) instances, which encompasses both ambiguous ID and OOD instances. Although there are numerous methods that tackle the problem of OOD instances, we are not aware of any method that explicitly tackles the problem of ambiguous ID instances, which are prevalent in web image datasets curated online. To tackle NC instances in a unified manner, we proposed a new generalized KL divergence, $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$, and an iterative training framework GenKL built upon this new generalized KL divergence. Moreover, we proved theoretical properties of $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$. The key advantage of using $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ is that we can effectively identify more NC instances, including those whose predictions are not "almost" uniformly distributed, which the usual approaches of entropy maximization and KL divergence minimization are unable to identify. We showed empirically that using $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ yields the best performance for NC instance identification. For our GenKL framework, we outperformed SOTA methods on real-world web image datasets: Clothing1M, Food101/Food101N and mini WebVision 1.0.
NC instances are unavoidable in web image datasets. Since the identification of NC instances is clearly a prerequisite step for tackling NC instances, we expect future work to further build upon the effectiveness of our new generalized KL divergence for NC instance identification.

is a vector. And note that the second term in (A4) becomes: Since $\log\frac{1}{q_j}$ is a strictly decreasing function in terms of $q_j$, the second term in (A4) reaches its maximum value $\log\frac{1}{\frac{1}{k}-\beta}$ when there exist indices $j_1, \dots, j_\ell$ ($\ell \geq 1$) such that $q_{j_t} = \frac{1}{k} - \beta$ for all $1 \leq t \leq \ell$ and $\sum_{t=1}^{\ell} p_{j_t} = 1$. Therefore, (A4) reaches its maximum value $\log\frac{1}{\frac{1}{k}-\beta}$ when $p$ is a one-hot vector, say with a non-zero $j$-th entry, and $q_j = \frac{1}{k} - \beta$.

Appendix B: Proof of Theorem 3
We treat $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ as a real-valued function on $\mathbb{R}^{2k}$, and consider its Hessian matrix $H$, which is a symmetric $2 \times 2$ block matrix, whose constituent blocks are $k \times k$ matrices, given as follows:

$$H = \begin{pmatrix} A & B \\ B^{T} & C \end{pmatrix}.$$

Note that $A$, $B$, and $C$ are diagonal matrices: Recall that the Schur complement of $A$ is defined to be $H/A := C - B^{T} A^{-1} B$. We check that: If $H$ is positive semi-definite, then $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ is convex. It follows from Smith (1992) that if $A$ is positive definite, then $H$ is positive semi-definite if and only if the Schur complement $H/A$ is positive semi-definite. Thus, to prove that $\mathcal{D}_{\text{KL}}^{\alpha,\beta}(p \Vert q)$ is convex on R β u , it suffices to show that $A$ is positive definite and show that the Schur complement $H/A$ is positive semi-definite.
If a symmetric matrix is strictly row diagonally dominant and has strictly positive diagonal entries, then it is positive definite. Given that $\alpha \geq 1$ and $A$ is symmetric, $A$ is positive definite. A symmetric diagonally dominant real matrix with non-negative diagonal entries is positive semi-definite. Given that $\alpha \geq 1$ and $p_i \geq 0$, $H/A$ is positive semi-definite. Thus, given $\alpha \geq 1$ and

Appendix C: Additional training details
This section contains more details (e.g. training hyperparameters) for the experiments in Sects. 5.2 and 5.3.
For each validation set, we trained the model on its respective training set using cross-entropy loss with SGD, initial learning rate 0.01, Nesterov momentum 0.9, weight decay 0.001, and a batch size of 32, over 50 epochs. The learning rate was reduced by a factor of 10 whenever the validation loss did not drop after 4 consecutive epochs. The model from the epoch with the best validation accuracy is used to generate prediction vectors.
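The learning rate rule described above (divide by 10 when the validation loss has not dropped for 4 consecutive epochs) can be simulated without any deep learning library. This sketch is one possible reading of that rule, with a hypothetical function name; in particular, the assumption that the counter resets after each reduction, and that "did not drop" is measured against the best loss seen so far, is ours.

```python
def plateau_schedule(val_losses, initial_lr=0.01, factor=0.1, patience=4):
    """Return the learning rate used in each epoch.

    val_losses: validation loss observed at the end of each epoch
    The learning rate is multiplied by `factor` once the validation loss
    has failed to improve on its best value for `patience` consecutive
    epochs; the counter is then reset (an assumption of this sketch).
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    used = []
    for loss in val_losses:
        used.append(lr)                # rate in effect during this epoch
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return used


# Hypothetical loss curve: four stale epochs after the second epoch.
rates = plateau_schedule([1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.8])
```

For this loss curve, the first six epochs use the initial rate and the seventh epoch uses the reduced rate. Library schedulers may count stale epochs slightly differently, so the exact epoch of the reduction should be checked against the scheduler actually used.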

C.2 Details for experiments in Sect. 5.3
For the Clothing1M dataset, we resized the images to 256×256, then randomly cropped them to size 224×224. Random horizontal flip was applied with probability 0.5.
Details of the hyperparameters used in the individual methods are given as follows:
• For our method, as part of the pre-training on the 50k clean set ($D_{\text{pre}}$), we trained the model using cross-entropy loss with SGD, initial learning rate 0.01, Nesterov momentum 0.9, weight decay 0.001, and a batch size of 32, over 30 epochs. The learning rate was reduced by a factor of 10 whenever the validation loss did not drop after 4 consecutive epochs. To identify NC instances, we used the hyperparameters $\alpha = 1.05$, $\beta = 0.03$, $\sigma = 0.05$, and $|P| = 20$. During the main training stage (i.e. after the identification of NC instances), we used stratified sampling. This means that for each minibatch, half of the instances are sampled from the 50k clean set, while the remaining half of the instances are sampled from the 1 million noisy set. We also used mixup (Zhang et al., 2018) with hyperparameter 0.5. Weight decay is set at 0.001, and the learning rate was reduced by a factor of 10 whenever the validation loss did not drop after 3 consecutive epochs. We trained the model iteratively until convergence, with 20 epochs in each iteration. We then fine-tuned all methods on the 50k clean set over 25 epochs, using an Adam optimizer, with learning rate $5 \times 10^{-7}$ and weight decay 0.001.
• For Cross-entropy, we trained the model using cross-entropy loss with SGD, initial learning rate 0.001, Nesterov momentum 0.9, weight decay 0.001, and a batch size of 32, over 20 epochs. The learning rate was reduced by a factor of 10 whenever the validation loss did not drop after 3 consecutive epochs.
• For all other baselines that we re-implemented, we used the same hyperparameters as those reported in their respective papers.
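The stratified sampling used in the main training stage, where each minibatch takes half of its instances from the clean set and half from the noisy set, can be sketched as follows. The function name is hypothetical, and sampling without replacement within a minibatch is an assumption of this sketch.

```python
import random


def stratified_minibatch(clean_indices, noisy_indices, batch_size, rng):
    """Draw one minibatch with half clean and half noisy instances.

    clean_indices, noisy_indices: sequences of dataset indices
    rng: a random.Random instance, passed in for reproducibility
    Each element of the result is a (source, index) pair.
    """
    half = batch_size // 2
    batch = [("clean", i) for i in rng.sample(clean_indices, half)]
    batch += [("noisy", i) for i in rng.sample(noisy_indices, batch_size - half)]
    rng.shuffle(batch)                 # mix the two sources within the batch
    return batch


# Sizes follow the Clothing1M description: 50k clean, 1 million noisy, batch size 32.
rng = random.Random(0)
batch = stratified_minibatch(range(50_000), range(1_000_000), 32, rng)
```

Because the clean set is much smaller than the noisy set, this scheme lets each clean instance appear in minibatches far more often than each noisy instance, compared with vanilla random sampling over the combined set.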
For the Food101/Food101N datasets, in addition to the same application of resizing, random cropping and random horizontal flip, as described above for the Clothing1M dataset, we also applied random rotation within the range of ±30 degrees on the images. For our method and all re-implemented baselines, unless otherwise mentioned, we always used the following hyperparameters in the main training stage: we trained the model with SGD, using initial learning rate 0.001, Nesterov momentum 0.9, weight decay 0.005, and a batch size of 128, over 40 epochs. The learning rate was adjusted using cosine annealing. We fine-tuned on the 53k clean set over 40 epochs, using regular random sampling, cross-entropy loss and an Adam optimizer, with learning rate 10^{-10} and weight decay 10^{-4}.
Exceptions to these hyperparameters are described as follows:
• For our method, during training on D_pre in the first iteration, we trained the model using cross-entropy loss with SGD, initial learning rate 0.001, Nesterov momentum 0.9, weight decay 0.005, and a batch size of 128, over 50 epochs. The learning rate was adjusted using cosine annealing. To identify NC instances, we used the hyperparameters α = 1.1, β = 0.008, σ = 0, and |P| = 1.
• For Joint Optimization (Tanaka et al., 2018), in the first step, we trained the model with SGD, using initial learning rate 0.003, Nesterov momentum 0.9, weight decay 0.0001, and a batch size of 128, over 30 epochs. For the hyperparameters α, β specific to Joint Optimization, we used α = 0.7 and β = 0.4. In the second step, we trained the model with SGD, using initial learning rate 0.001, Nesterov momentum 0.9, weight decay 0.0001, and a batch size of 128, over 40 epochs.
• For Jo-SRC (Yao et al., 2021) and AFM (Peng et al., 2020), we used the same hyperparameters as those reported in their respective papers or in the code the authors released.
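For reference, the cosine annealing schedule mentioned for the Food101/Food101N runs can be written in a few lines. The sketch below assumes the standard form without restarts, a per-epoch update, and a minimum learning rate of 0; none of these details are stated in the text, and the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=40, base_lr=0.001, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing from base_lr down to min_lr."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Schedule for the 40-epoch main training stage (epochs 0 to 40 inclusive).
schedule = [cosine_lr(t) for t in range(41)]
```

Under these assumptions the rate starts at the initial value 0.001, reaches half of it at the midpoint of training, and decays to the minimum at the final epoch.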
For the mini WebVision 1.0 dataset, we resized the images to 320×320, then randomly cropped them to size 299×299. Random horizontal flip was applied with probability 0.5.
Details of the hyperparameters used in the individual methods are given as follows:
• For our method, we used the training set as our D_pre dataset for pre-training. As part of the pre-training on the whole noisy training set (D_pre), we trained the model using cross-entropy loss with SGD, initial learning rate 0.01, momentum 0.9, weight decay 0.001, and a batch size of 32, over 100 epochs. The learning rate was reduced by a factor of 10 in epoch 50. To identify NC instances, we used the hyperparameters α = 0.9, β = 0.015, σ = 0.05, and |P| = 20. During the main training stage (i.e. after the identification of NC instances), we followed the set-up in MoPro (Li et al., 2021) and DivideMix (Li et al., 2020) to select clean instances: if the entry of the prediction vector corresponding to the given label is above 0.5, then the instance is considered clean, i.e. it is in D_clean. We also used mixup (Zhang et al., 2018) with hyperparameter 3.0. Weight decay was set at 0.0001. The learning rate was adjusted using cosine annealing. We trained the model for 300 epochs. Subsequently, we used X_clean from epoch 300 for fine-tuning over 50 epochs, using an Adam optimizer, with learning rate 5 × 10^{-7} and weight decay 0.001.
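The clean-instance selection rule described for mini WebVision can be illustrated with a short sketch. It assumes that each prediction vector is a list of class probabilities and that labels are integer class indices; the function name `select_clean` and the toy data are hypothetical.

```python
def select_clean(predictions, labels, threshold=0.5):
    """Return the indices of instances whose predicted probability
    for their given label is strictly above `threshold`."""
    return [
        i
        for i, (probs, label) in enumerate(zip(predictions, labels))
        if probs[label] > threshold
    ]


# Toy prediction vectors over three classes, with their given (possibly noisy) labels.
predictions = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]
labels = [0, 1, 2]
clean_indices = select_clean(predictions, labels)
```

In this toy example the second instance is excluded, because the model assigns only probability 0.4 to its given label.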

Fig. 1 A depiction of NC instances versus clean instances in the Clothing1M dataset. This figure is divided into three colored sections: orange, yellow and green. The images in the orange section depict OOD instances. The images in the yellow section depict ambiguous ID (AID)

Fig. 3 A Venn diagram that illustrates the relationship between NC instances and instances with label noise. For each sub-category in this Venn diagram, some example images are shown from the Clothing1M dataset

Fig. 4 A depiction of some NC instances found in the Clothing1M dataset. The images in the last two rows depict OOD instances. The images in the first rows depict ambiguous ID (AID) instances.

Fig. 6 A depiction of some non-NC instances that are wrongly identified as NC instances via our (α, β)-generalized KL divergence for classes Hoodie and Downcoat on the Clothing1M dataset

Table 2
Precision, recall/sensitivity, specificity, F1 score and kappa score of all methods for NC instance identification, on Clothing1M 50k clean data combined with 200 manually verified NC instances

Table 3
Averaged best test accuracies and standard deviations (over 5 trials) of different methods on the Clothing1M dataset

Table 4
Averaged best test accuracies and standard deviations (over 5 trials) of different methods on the Food101/Food101N datasets

Table 5
Averaged best test accuracies and standard deviations (over 5 trials) of different methods on the mini WebVision 1.0 dataset

Table 7
Averaged best test accuracies and standard deviations (over 5 trials) on the Clothing1M dataset for our framework GenKL, over different values for α: 0.90, 1.05 and 1.20

Table 9
Averaged best test accuracies and standard deviations (over 5 trials) on the Clothing1M dataset for our framework GenKL, over different choices of the loss function: cross-entropy, MSE, MAE and KL loss

Table 12
Averaged best test accuracies and standard deviations (over 5 trials) on the Clothing1M dataset for our framework GenKL, over different values for ω_2: 1, 32 and 64