Abstract
Supervised learning has seen numerous theoretical and practical advances over the last few decades. However, its basic assumption of identical train and test distributions often fails to hold in practice. One important example of this is when the training instances are subject to label noise: that is, where the observed labels do not accurately reflect the underlying ground truth. While the impact of simple noise models has been extensively studied, relatively less attention has been paid to the practically relevant setting of instance-dependent label noise. It is thus unclear whether one can learn, both in theory and in practice, good models from data subject to such noise, with no access to clean labels. We provide a theoretical analysis of this issue, with three contributions. First, we prove that for instance-dependent (but label-independent) noise, any algorithm that is consistent for classification on the noisy distribution is also consistent on the noise-free distribution. Second, we prove that consistency also holds for the area under the ROC curve, assuming the noise scales (in a precise sense) with the inherent difficulty of an instance. Third, we show that the Isotron algorithm can efficiently and provably learn from noisy samples when the noise-free distribution is a generalised linear model. We empirically confirm our theoretical findings, which we hope may stimulate further analysis of this important learning setting.
Learning with instance-dependent label noise
Recent advances in classification models such as deep neural networks have seen resounding successes (Krizhevsky et al. 2012; He et al. 2016; Xiao et al. 2015), in no small part due to the availability of large labelled training datasets. However, real-world labels are often corrupted by instance-dependent label noise, wherein the observed labels are not representative of the underlying ground truth, and noise levels vary across different instances. For example, in object recognition problems, poor quality images are more likely to be mislabelled (Reed et al. 2014; Xiao et al. 2015); furthermore, certain classes of images tend to be confused with others. A natural question thus arises: what can we say about the impact of such label noise on the accuracy of our trained models?
More precisely, the following questions are of fundamental interest:
Q1: does good classification performance on the noisy distribution translate to good classification performance on the noise-free (“clean”) distribution?
Q2: does the answer to Q1 also hold for more complex measures, e.g. for ranking?
Q3: are there simple algorithms which are provably noise robust?
In the case of instance-independent label noise, questions Q1–Q3 have been studied by several recent theoretical works (Stempfel et al. 2007; Stempfel and Ralaivola 2009; Natarajan et al. 2013; Scott et al. 2013; Liu and Tao 2015; Menon et al. 2015; van Rooyen et al. 2015; Patrini et al. 2016, 2017), whose analysis has resulted in a surprising conclusion: for powerful (high-capacity) models, one can achieve optimal classification and ranking error given enough noisy examples, without the need for any clean labels. Further, for modest (low-capacity) models, while even a tiny amount of noise may be harmful (Long and Servedio 2008), there are simple provably robust algorithms (Natarajan et al. 2013; van Rooyen et al. 2015).
In the case of instance-dependent label noise, while there is some theoretical precedent (Manwani and Sastry 2013; Ghosh et al. 2015; Awasthi et al. 2015), questions Q1–Q3 have to our knowledge remained unanswered. In this paper, we study these questions systematically. We answer Q1 and Q2 by showing that under (suitably constrained) instance-dependent noise, powerful models can optimally classify and rank given enough noisy samples; this is a non-trivial generalisation of existing results. We answer Q3 by showing how an existing algorithmic extension of generalised linear models can efficiently and provably learn from noisy samples; this is in contrast to existing algorithms even for instance-independent noise, which either require the noise rate to be known, or lack guarantees.
More precisely, our contributions are:
C1: we show that for a range of losses, any algorithm that minimises the expected loss (i.e., risk) on the noisy distribution also minimises the expected loss on the clean distribution (Theorem 1), i.e., noisy risk minimisation is consistent for classification;
C2: we show that area under the ROC curve (AUROC) maximisation on the noisy distribution is also consistent for the clean distribution (Theorem 2), under a new boundary-consistent noise model where “harder” instances are subject to more noise (Definition 4);
C3: we show that if the clean distribution is a generalised linear model, the Isotron algorithm (Kalai and Sastry 2009) is provably robust to boundary-consistent noise (Theorem 3).
While our contributions are primarily of a theoretical nature, we also provide experiments (Sect. 7) illustrating potential practical implications of our results.
Background and notation
We begin with some notation and background material. Table 1 provides a glossary.
Learning from binary labels
In standard problems of learning from binary labels, one observes a set of instances paired with binary labels, assumed to be an i.i.d. draw from an unobserved distribution. The goal is to find a model that can determine if future instances are more likely to be positive or negative. To state this more formally, we need some notation.
Distributions, scorers, and risks
Fix a measurable instance space \(\mathscr {X}\). We denote by \(D\) some distribution over \(\mathscr {X}\times \{ \pm 1 \}\), with random variables \((\mathsf {X}, \mathsf {Y}) \sim D\). Any \(D\) may be expressed via the marginal \(M = \mathbb {P}( \mathsf {X})\) and class-probability function \(\eta :x \mapsto \mathbb {P}( \mathsf {Y}= 1 \mid \mathsf {X}= x )\). A scorer is any measurable \(s :\mathscr {X}\rightarrow \mathbb {R}\); e.g., a linear scorer is of the form \(s( x ) = \langle w, x \rangle \). A loss is any measurable \(\ell :\{ \pm 1 \}\times \mathbb {R}\rightarrow \mathbb {R}_+\), measuring the disagreement between a label and score. A risk is any measurable \(R( \cdot ; D) :\mathbb {R}^{\mathscr {X}} \rightarrow \mathbb {R}_+\) which summarises a scorer’s performance on samples drawn from \(D\). Canonically, one works with the \(\ell \)-risk \(R( s; D, \ell ) = \mathbb {E}_{(\mathsf {X}, \mathsf {Y}) \sim D}\left[ \ell ( \mathsf {Y}, s( \mathsf {X}) ) \right] \), or the \(\ell \)-ranking risk \(R_{\mathrm {rank}}( s; D, \ell )\), which averages the loss over pairs of one positive and one negative instance.
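Given this, the standard problem of learning from binary labels may be stated as: given a training sample \(\mathsf {S}\sim D^m\), learn a scorer \(s\) with small risk \(R( s; D)\).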
Example
We will be interested in two canonical problems of learning from binary labels. In binary classification (Devroye et al. 1996), the goal is to approximately minimise the misclassification error \(R(s; D, \ell ^{01})\), where \(\ell ^{01}\) is the zero-one loss \(\ell ^{01}(y,v) =\llbracket yv < 0 \rrbracket + \frac{1}{2} \llbracket v = 0 \rrbracket \) for indicator function \(\llbracket \cdot \rrbracket \).
In bipartite ranking (Agarwal and Niyogi 2005), the goal is to approximately minimise the pairwise disagreement \(R_{\mathrm {rank}}( s; D, \ell ^{01})\), which is also known as one minus the area under the ROC curve (AUROC) of s (Clémençon et al. 2008). The latter is preferred over the misclassification error under class imbalance (Ling and Li 1998).
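As a small illustration of the two objectives, the following sketch computes the empirical misclassification error and the empirical pairwise disagreement (one minus the AUROC) on a toy sample. The function names and the toy data are illustrative assumptions, not part of the original text.

```python
def zero_one_loss(y, v):
    """Zero-one loss: 1 if label and score disagree in sign, 1/2 for a zero score."""
    if v == 0:
        return 0.5
    return 1.0 if y * v < 0 else 0.0


def empirical_risk(scores, labels, loss=zero_one_loss):
    """Average loss of the scores against labels in {-1, +1}."""
    return sum(loss(y, v) for y, v in zip(labels, scores)) / len(labels)


def empirical_ranking_risk(scores, labels):
    """Fraction of (positive, negative) pairs ranked wrongly; ties count 1/2."""
    pos = [v for v, y in zip(scores, labels) if y == +1]
    neg = [v for v, y in zip(scores, labels) if y == -1]
    total = sum(zero_one_loss(+1, p - n) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Hypothetical toy sample: four instances with scores and labels.
scores = [2.0, 0.5, -0.3, -1.0]
labels = [+1, -1, +1, -1]
print("misclassification error:", empirical_risk(scores, labels))
print("empirical AUROC:", 1 - empirical_ranking_risk(scores, labels))
```

The ranking risk only looks at score differences between a positive and a negative instance, which is why it is insensitive to the choice of classification threshold.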
Bayes-optimal scorers and regret
In studying the asymptotic behaviour of learning algorithms, two additional risk-related concepts are useful. A Bayes-optimal scorer is any theoretical risk-minimising scorer \( s^* \in {{\text {argmin }}\, }_{s \in \mathbb {R}^{\mathscr {X}}} \, R( s; D) \). The regret of a scorer \(s :\mathscr {X}\rightarrow \mathbb {R}\) is its excess risk over that of any Bayes-optimal scorer, \(\mathrm {reg}( s; D) = R( s; D) - R( s^*; D)\).
For example, the set of Bayes-optimal scorers for the misclassification error \(R( \cdot ; D, \ell ^{01})\) comprises all \(s^*\) satisfying
\[ ( \forall x \in \mathscr {X}) \;\; \eta ( x ) \ne \nicefrac{1}{2} \implies \mathrm {sign}( s^*( x ) ) = \mathrm {sign}( \eta ( x ) - \nicefrac{1}{2} ), \tag{1} \]
so that the sign of an instance’s score matches whether its label is on average positive. Further, the regret for the 0–1 loss is \( \mathrm {reg}( s; D, \ell ^{01}) = \mathbb {E}_{\mathsf {X}\sim M}\left[ | 2\eta ( \mathsf {X}) - 1 | \cdot \llbracket (2 \eta ( \mathsf {X}) - 1) \cdot s( \mathsf {X}) < 0 \rrbracket \right] \) (Devroye et al. 1996, Theorem 2.2), i.e., the concentration of \(\eta \) near \(\frac{1}{2}\) in the region of disagreement with any optimal scorer.
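The regret expression for the 0–1 loss can be checked on a finite instance space, where expectations become weighted sums. The sketch below assumes a hypothetical three-point instance space and scorers with non-zero scores; it compares the closed-form regret (weight on |2η − 1| wherever the scorer disagrees with the sign of 2η − 1) against the direct difference of risks.

```python
def zero_one_risk(marginal, eta, scores):
    """Expected 0-1 loss on a finite space (assumes all scores are non-zero)."""
    # A negative score errs on positives (mass eta); a positive score errs on negatives.
    return sum(m * (e if s < 0 else 1 - e) for m, e, s in zip(marginal, eta, scores))


def zero_one_regret(marginal, eta, scores):
    """Closed form: E_X[ |2 eta(X) - 1| * [ (2 eta(X) - 1) * s(X) < 0 ] ]."""
    return sum(
        m * abs(2 * e - 1)
        for m, e, s in zip(marginal, eta, scores)
        if (2 * e - 1) * s < 0
    )


# Hypothetical instance space with three points.
marginal = [0.5, 0.3, 0.2]          # P(X = x)
eta = [0.9, 0.6, 0.1]               # P(Y = +1 | X = x)
bayes = [1.0, 1.0, -1.0]            # sign matches eta - 1/2 everywhere
wrong_on_second = [1.0, -1.0, -1.0]

excess = zero_one_risk(marginal, eta, wrong_on_second) - zero_one_risk(marginal, eta, bayes)
print(excess, zero_one_regret(marginal, eta, wrong_on_second))
```

The mistake on the second point costs little because its class-probability is close to one half, which is exactly what the weighting by |2η − 1| expresses.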
Learning from corrupted binary labels
Fix some distribution \(D\). In the problem of learning from corrupted or noisy binary labels, we have a training sample \(\bar{\mathsf {S}}\sim \bar{D}^m\), for some \(\bar{D}\ne D\) whose \(\mathbb {P}(\mathsf {X})\) is unchanged, but \(\mathbb {P}( \bar{\mathsf {Y}}\mid \mathsf {X}= x ) \ne \mathbb {P}( \mathsf {Y}\mid \mathsf {X}= x )\). That is, we observe samples with the same marginal distribution over instances, but different conditional distribution over labels. Our goal remains to learn a scorer with small risk with respect to \(D\), despite \(D\) being unobserved. More precisely, the problem of learning from noisy binary labels may be stated as: given a noisy sample \(\bar{\mathsf {S}}\sim \bar{D}^m\), learn a scorer \(s\) with small clean risk \(R( s; D)\).
We refer to \(D\) as the “clean” and \(\bar{D}\) as the “corrupted” distribution. Note that we allow \(D\) to be non-separable, i.e., \(\eta ( x ) \in (0, 1)\) for some \(x \in \mathscr {X}\); thus, even under \(D\), there is not necessarily certainty as to every instance’s label. Our use of “noise” and “corruption” thus refers to an additional, exogenous uncertainty in the labelling process.
Instance-dependent noise models
We will focus on \(\bar{D}\) that arise from randomly flipping the labels in \(D\). Further, our interest is in instance-dependent noise, i.e., noise which depends compulsorily on the instance, and optionally on the label. To capture this, we first introduce the general label- and instance-dependent noise (LIN) model.
Definition 1
(LIN model) Given a clean distribution \(D\) and label flip functions \(\rho _{1}, \rho _{-1} :\mathscr {X}\rightarrow [ 0, 1 ]\), under the LIN model we observe samples \(( \mathsf {X}, \bar{\mathsf {Y}} ) \sim \bar{D}= \mathrm {LIN}( D, \rho _{1}, \rho _{-1} ) \), where first we draw \((\mathsf {X}, \mathsf {Y}) \sim D\) as usual, and then flip \(\mathsf {Y}\) with probability \(\rho _{\mathsf {Y}}( \mathsf {X})\) to produce \(\bar{\mathsf {Y}}\).
The label flip functions \(\rho _{\pm 1}\) allow one to model label noise with dependences on the instance and true label. We do not impose any parametric assumptions on these functions; the only restriction we place is that on average, the noisy and true labels must agree, i.e.,
\[ ( \forall x \in \mathscr {X}) \;\; \rho _{1}( x ) + \rho _{-1}( x ) < 1. \tag{2} \]
When \(\rho _{\pm 1}\) are constant, this is a standard assumption (Blum and Mitchell 1998; Scott et al. 2013). We will refer to \(\rho _{\pm 1}\) satisfying Eq. 2 as being admissible.
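The sampling process of the LIN model can be sketched directly: draw a clean pair, then flip its label with a probability that depends on both the instance and the true label. In the sketch below, the function names, the one-dimensional instances, and the particular flip function are hypothetical; the admissibility check assumes the condition that the two flip probabilities sum to less than one at every instance.

```python
import math
import random


def sample_lin(clean_sample, rho_pos, rho_neg, rng):
    """Flip each label y with probability rho_y(x), as in the LIN model."""
    noisy = []
    for x, y in clean_sample:
        flip_prob = rho_pos(x) if y == +1 else rho_neg(x)
        noisy.append((x, -y if rng.random() < flip_prob else y))
    return noisy


def is_admissible(rho_pos, rho_neg, xs):
    """Assumed admissibility check: rho_{+1}(x) + rho_{-1}(x) < 1 on the given instances."""
    return all(rho_pos(x) + rho_neg(x) < 1 for x in xs)


rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(1000)]
clean = [(x, +1 if x > 0 else -1) for x in xs]   # hypothetical separable clean labels


def near_boundary_noise(x):
    """Hypothetical flip function: more noise for instances close to x = 0."""
    return 0.4 * math.exp(-abs(x))


noisy = sample_lin(clean, near_boundary_noise, near_boundary_noise, rng)
flipped = sum(1 for (_, y), (_, y_bar) in zip(clean, noisy) if y != y_bar)
print("admissible:", is_admissible(near_boundary_noise, near_boundary_noise, xs))
print("fraction of flipped labels:", flipped / len(clean))
```

Passing the same function for both labels gives noise that depends on the instance only, while two different functions give the general label- and instance-dependent case.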
The LIN model may be specialised to the case where the noise depends on the instance, but not the label. We term this the purely instance-dependent noise (PIN) model.
Definition 2
(PIN model) Given a label flip function \(\rho :\mathscr {X}\rightarrow [0, \nicefrac{1}{2})\), under the PIN model we observe samples from \(\mathrm {PIN}( D, \rho ) = \mathrm {LIN}( D, \rho , \rho )\).
Both the LIN and PIN models consider noise which is instance-dependent; however, the LIN model is strictly more general. In particular, for non-separable \(D\), each \(x \in \mathscr {X}\) has non-zero probability of being paired with either of \(\{ \pm 1 \}\) as a label; thus, under the LIN model, the example \(( x, +1 )\) occurring in a sample \(\mathsf {S}\sim D^N\) could have its label flipped with different probability to \(( x, -1 )\) occurring in another \(\mathsf {S}' \sim D^N\).
Note that the image of \(\rho \) in Definition 2 is \([0, \nicefrac{1}{2})\) so as to enforce the condition in Eq. 2. When \(D\) is separable, this condition is equivalent to enforcing that the noisy class-probabilities are bounded away from \(\frac{1}{2}\), which is known as a Massart condition (Massart and Nédélec 2006) on the class-probability. Consequently, when \(D\) is separable, instance-dependent noise satisfying Eq. 2 is also known as a Massart or bounded noise model.
Relation to existing models
As a special case, the LIN model captures instance-independent but label-dependent noise. Here, all instances within the same class have the same label flip probability. This is known as the class-conditional noise (CCN) setting, and has received considerable attention (Blum and Mitchell 1998; Natarajan et al. 2013).
Definition 3
(CCN model) Given label flip probabilities \(\rho _{\pm 1} \in [0, 1]\), under the CCN model we observe samples from \(\mathrm {LIN}( D, \rho _{1}, \rho _{-1} )\), with the flip functions taken to be the constants \(\rho _{\pm 1}\).
Consistency of noisy risk minimisation?
Our primary theoretical interest in learning from LIN or PIN noise is the issue of statistical consistency of noisy risk minimisation. This aims to answer the question: if we can perform near-optimally with respect to some risk on the noisy distribution, will we also perform near-optimally on the clean distribution? More formally, we wish to know if, e.g.,
\[ \mathrm {reg}( s_n; \bar{D}, \ell ^{01}) \rightarrow 0 \implies \mathrm {reg}( s_n; D, \ell ^{01}) \rightarrow 0 \tag{3} \]
for any distribution \(D\), corrupted distribution \( \bar{D}\), and scorer sequence \(( s_n )_{n = 1}^\infty \). Establishing this would imply that one can perform near-optimally given sufficiently many noisy samples, and a sufficiently powerful class of scorers. The latter assumption is in keeping with standard consistency analysis for binary classification (Zhang 2004; Bartlett et al. 2006); however, its practical applicability is somewhat limited. To address this, we further study (Sect. 5) an algorithm to efficiently (and provably) learn under instance-dependent noise.
As noted in the Introduction, a number of recent works have established classification consistency of noisy risk minimisation (Scott et al. 2013; Natarajan et al. 2013; Menon et al. 2015) for the special case of class-conditional (and hence instance-independent) noise. A large strand of work has provided PAC-style guarantees under various instance-dependent noise models (Bylander 1997, 1998; Servedio 1999; Awasthi et al. 2015, 2016, 2017). However, these works impose assumptions on both \(D\) and the class of scorers. For a more detailed comparison and discussion, see Sect. 6.
Classification consistency under purely instance-dependent noise
We begin with our first contribution (C1), which shows that one can classify optimally given access only to samples corrupted with purely instance-dependent noise, assuming a suitably rich function class and sufficiently many samples; i.e., noisy risk minimisation is consistent.
Relating clean and corrupt Bayes-optimal scorers
Recall from Eq. 3 that establishing consistency of noisy risk minimisation requires showing that a scorer s that classifies well on the corrupted \(\bar{D}\) also classifies well on the clean \(D\), i.e., if the regret \(\mathrm {reg}( s; \bar{D}, \ell )\) is small for a suitable loss \(\ell \), then so is \(\mathrm {reg}( s; D, \ell )\).
Before proceeding, it is prudent to convince ourselves that such a result is possible in the first place. A necessary condition is that the clean and corrupted Bayes-optimal scorers coincide; without this, noisy risk minimisation will converge to the wrong object. For many losses, the Bayes-optimal scorers depend on the underlying class-probability function (cf. Eq. 1). Thus, to study these scorers on \(\bar{D}\) resulting from generic label- and instance-dependent noise, we examine its class-probability function \(\bar{\eta }\).
Lemma 1
Pick any distribution \(D\). Suppose \(\bar{D}= \mathrm {LIN}( D, \rho _{1}, \rho _{-1} )\) for admissible label flip functions \(\rho _{\pm 1} :\mathscr {X}\rightarrow [ 0, 1 ]\). Then, \(\bar{D}\) has corrupted class-probability function
\[ \bar{\eta }( x ) = ( 1 - \rho _{1}( x ) ) \cdot \eta ( x ) + \rho _{-1}( x ) \cdot ( 1 - \eta ( x ) ). \tag{4} \]
The form of Eq. 4 is intuitive: the corrupted positives can be seen as a mixture of positive and negative instances, with mixing weights determined by the flip probabilities. This also illustrates that the effect of noise is to compress the range of \(\eta \), thus increasing one’s uncertainty as to an instance’s label.
Lemma 1 implies that we cannot hope to establish consistency without further assumptions. For example, with the 0–1 loss, Eq. 1 established that any Bayes-optimal scorer \(s^*\) on \(D\) has \(\mathrm {sign}( s^*( x ) ) = \mathrm {sign}( \eta ( x ) - \nicefrac{1}{2})\). However, if \(\rho _{1}\) and \(\rho _{-1}\) vary arbitrarily, then it is easy to check from Eq. 4 that we may have \(\mathrm {sign}( \eta ( x ) - \nicefrac{1}{2}) \ne \mathrm {sign}( \bar{\eta }( x ) - \nicefrac{1}{2})\) for some \(x\). Consequently, the clean and corrupted optimal scorers will differ, and we will not have consistency in general.
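The effect described by Lemma 1 can be reproduced numerically. The sketch below uses the mixture form that follows from the LIN definition (unflipped positives plus flipped negatives); the specific values of η and of the flip probabilities are hypothetical, chosen so that label-dependent noise moves the corrupted class-probability to the other side of one half while label-independent noise does not.

```python
def corrupted_eta(eta, rho_pos, rho_neg):
    """P(noisy label = +1 | x): unflipped positives plus flipped negatives."""
    return (1 - rho_pos) * eta + rho_neg * (1 - eta)


def sign(z):
    """Sign of z as -1, 0 or +1."""
    return (z > 0) - (z < 0)


eta = 0.8  # hypothetical clean class-probability at some instance x

# Purely instance-dependent noise: both labels flip with probability 0.3.
eta_bar_pin = corrupted_eta(eta, rho_pos=0.3, rho_neg=0.3)

# Label-dependent noise: positives flip far more often than negatives.
# The flip probabilities sum to 0.7, so they are below one at this instance.
eta_bar_lin = corrupted_eta(eta, rho_pos=0.6, rho_neg=0.1)

print("PIN:", eta_bar_pin, "same side of 1/2:", sign(eta_bar_pin - 0.5) == sign(eta - 0.5))
print("LIN:", eta_bar_lin, "same side of 1/2:", sign(eta_bar_lin - 0.5) == sign(eta - 0.5))
```

In the second case a scorer that is optimal for the 0–1 loss on the corrupted distribution would predict the wrong label at this instance on the clean distribution.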
Fortunately, we can make progress under two further assumptions: that the noise is purely instance-dependent (per Definition 2), and following Ghosh et al. (2015), that the loss \(\ell \) satisfies
\[ ( \forall v \in \mathbb {R}) \;\; \ell ( +1, v ) + \ell ( -1, v ) = C \tag{5} \]
for some \(C \in \mathbb {R}\). Equation 5 holds for the zero-one, ramp, and “unhinged” loss (van Rooyen et al. 2015). Under these restrictions, the clean and corrupted optimal scorers agree.
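The loss condition can be probed numerically by checking whether the sum of the losses for the two labels is constant over a grid of scores. This is a sketch under stated assumptions: the condition is taken to be that ℓ(+1, v) + ℓ(−1, v) is the same constant for every v, the ramp loss is taken in its scaled form min(1, max(0, (1 − yv)/2)), and the helper name `symmetric_constant` is hypothetical.

```python
def zero_one(y, v):
    return 0.5 if v == 0 else float(y * v < 0)


def ramp(y, v):
    # Scaled ramp loss (an assumed variant): clipped to [0, 1], slope 1/2.
    return min(1.0, max(0.0, (1 - y * v) / 2))


def unhinged(y, v):
    return 1 - y * v


def hinge(y, v):
    return max(0.0, 1 - y * v)


def symmetric_constant(loss, grid):
    """Return C if loss(+1, v) + loss(-1, v) == C on the whole grid, else None."""
    sums = {round(loss(+1, v) + loss(-1, v), 12) for v in grid}
    return sums.pop() if len(sums) == 1 else None


grid = [-2, -1, -0.5, 0, 0.5, 1, 2]
for name, loss in [("zero-one", zero_one), ("ramp", ramp),
                   ("unhinged", unhinged), ("hinge", hinge)]:
    print(name, symmetric_constant(loss, grid))
```

A finite grid can only refute the condition, not prove it, so a constant returned here is evidence rather than a proof; the hinge loss already fails on this grid.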
Corollary 1
Pick any distribution \(D\), and loss \(\ell \) satisfying Eq. 5. Suppose that \(\bar{D}= \mathrm {PIN}( D, \rho )\) for admissible label flip function \(\rho :\mathscr {X}\rightarrow [0, \nicefrac{1}{2})\). Then,
\[ {{\text {argmin }}\, }_{s \in \mathbb {R}^{\mathscr {X}}} \, R( s; \bar{D}, \ell ) = {{\text {argmin }}\, }_{s \in \mathbb {R}^{\mathscr {X}}} \, R( s; D, \ell ). \]
For the case of 0–1 loss, Corollary 1 is intuitive: with purely instance-dependent noise satisfying the condition in Eq. 2, the corrupted label will agree on average with the true label; thus, the Bayes-optimal classifier, which simply looks at whether an instance is more likely on average to be positive or negative, will remain the same.
We emphasise that Corollary 1 does not require \(D\) to have a deterministic labelling function, i.e., it does not require separability of the distribution. Corollary 1 generalises Natarajan et al. (2013, Corollary 10), which was for instance-independent noise. Awasthi et al. (2015) and Ghosh et al. (2015, Theorem 1) made a similar observation, but only for 0–1 loss and under the additional assumption of \(D\) being separable, i.e., \(\eta ( x ) \in \{ 0, 1 \}\).
Relating clean and corrupt regrets
Having established the equivalence of the clean and corrupted optimal scorers, the next step in showing consistency is relating the clean and the corrupted regrets. We have the following, which relies on the same assumptions on the noise and loss as Corollary 1.
Theorem 1
Pick any distribution \(D\), and loss \(\ell \) satisfying Eq. 5. Suppose \(\bar{D}= \mathrm {PIN}( D, \rho )\) for admissible label flip function \(\rho :\mathscr {X}\rightarrow [0, \nicefrac{1}{2})\). Then, for any \(s :\mathscr {X}\rightarrow \mathbb {R}\),
\[ \mathrm {reg}( s; D, \ell ) \le \frac{1}{1 - 2 \cdot \rho _{\mathrm {max}}} \cdot \mathrm {reg}( s; \bar{D}, \ell ), \tag{6} \]
where \(\rho _{\mathrm {max}} = \sup _{x \in \mathscr {X}} \rho ( x )\). Further, if \(\sup _{y, v} | \ell ( y, v ) | = B < +\infty \), then for any \(\alpha \in [0, 1]\),
The proof of Theorem 1 relies on the observation that the clean risk can be written as a weighted corrupted risk. We thus simply bound these weights, and appeal to the fact that the clean and corrupted regrets both involve the same Bayes-optimal scorer (Corollary 1).
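The weighting argument can be checked numerically for the 0–1 loss on a finite instance space. The sketch below assumes the bound takes the form of the clean regret being at most the corrupted regret divided by 1 − 2ρ_max, with ρ_max the largest flip probability; the three-point instance space and the flip probabilities are hypothetical.

```python
from itertools import product


def zero_one_regret(marginal, eta, scores):
    """E_X[ |2 eta(X) - 1| * [ (2 eta(X) - 1) * s(X) < 0 ] ] on a finite space."""
    return sum(
        m * abs(2 * e - 1)
        for m, e, s in zip(marginal, eta, scores)
        if (2 * e - 1) * s < 0
    )


def pin_corrupt(eta, rho):
    """Corrupted class-probabilities under label-independent flip probabilities rho."""
    return [(1 - r) * e + r * (1 - e) for e, r in zip(eta, rho)]


marginal = [0.5, 0.3, 0.2]   # P(X = x)
eta = [0.9, 0.6, 0.1]        # clean P(Y = +1 | X = x)
rho = [0.0, 0.4, 0.1]        # instance-dependent flip probabilities, all below 1/2
eta_bar = pin_corrupt(eta, rho)
rho_max = max(rho)

# Enumerate every sign pattern a scorer can take on the three points.
results = []
for scores in product([-1.0, 1.0], repeat=3):
    clean = zero_one_regret(marginal, eta, scores)
    noisy = zero_one_regret(marginal, eta_bar, scores)
    results.append((scores, clean, noisy))
    print(scores, round(clean, 3), round(noisy, 3), round(noisy / (1 - 2 * rho_max), 3))
```

Pointwise, the corrupted regret shrinks each term of the clean regret by the factor 1 − 2ρ(x), so dividing by the smallest such factor recovers an upper bound on the clean regret.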
Implications
For the zero-one loss \(\ell ^{01}\), Theorem 1 implies that for a sequence of scorers \(( s_n )_{n = 1}^\infty \), if \(\mathrm {reg}( s_n; \bar{D}, \ell ^{01}) \rightarrow 0\), then \(\mathrm {reg}( s_n; D, \ell ^{01}) \rightarrow 0\) as well; i.e., consistent classification on the corrupted distribution implies consistency on the clean distribution as well. Thus, with powerful models and sufficient data, we can optimally classify even when learning solely from noisy labels. One can achieve \(\mathrm {reg}( s; \bar{D}, \ell ^{01}) \rightarrow 0\) by minimising any appropriate convex surrogate to \(\ell ^{01}\) on \(\bar{D}\) (e.g. hinge, logistic, exponential), owing to standard classification calibration results (Zhang 2004; Bartlett et al. 2006). Importantly, this surrogate does not have to satisfy Eq. 5.
In Eq. 7, \(\alpha \) may be chosen (in a distribution-dependent manner) to yield the tightest possible bound. When \(\alpha = 0\), the bound is identical to Eq. 6. However, when \(\alpha > 0\), the former explicates how the regret depends on the average noise rate of instances, while the latter pessimistically focusses on the maximal noise rate. In particular, Eq. 7 illustrates that when most instances have low noise (\(\rho ( x ) \sim 0\)), one is not overly harmed by a small fraction of instances with high noise: even if \(\rho _{\mathrm {max}} \sim \nicefrac{1}{2}\), the second term will dominate and the regret on the clean distribution will be small. At the other extreme, when \(\rho ( x ) \sim \nicefrac{1}{2}\) for most x, while we still have asymptotic consistency, there will be a large relative difference in the clean and corrupted regrets. This is also as expected, since the presence of noise intuitively must make the learning task more challenging.
Extensions
The regret bound in Theorem 1 may be combined with standard surrogate regret and generalisation bounds applied to the noisy risk minimisation problem. Specifically, per the bounds of Bartlett et al. (2006), Eq. 6 can be further bounded as
\[ \mathrm {reg}( s; D, \ell ^{01}) \le \frac{1}{1 - 2 \cdot \rho _{\mathrm {max}}} \cdot \Psi ^{-1}\left( \mathrm {reg}( s; \bar{D}, \ell ) \right) , \tag{8} \]
where \(\ell \) is a classification-calibrated loss, and \(\Psi \) the corresponding calibration function as per Bartlett et al. (2006, Definition 2). For example, \(\Psi ( z ) = z\) for the hinge loss \(\ell ^{\mathrm {hng}}\).
We may further specify how the regret on \(D\) decays given a scorer derived from a finite noisy sample with a suitable function class, by combining Eq. 8 with results on the behaviour of \(\mathrm {reg}( s; \bar{D}, \ell )\). Formally, given a noisy sample \(\bar{\mathsf {S}} \sim \bar{D}^n\), let \(\bar{s}_n\) denote the regularised empirical minimiser of the hinge loss \(\ell ^{\mathrm {hng}}\) over a kernelised scorer class \(\mathscr {S}= \{ x \mapsto \langle w, \Phi ( x ) \rangle _{\mathscr {H}} \}\), for feature mapping \(\Phi :\mathscr {X}\rightarrow \mathscr {H}\) and reproducing kernel Hilbert space \(\mathscr {H}\). Then, with probability at least \(1 - \delta \), (Steinwart and Scovel 2005, Theorem 1)
where \(\alpha \) is such that the strength of regularisation is \(\lambda _n = {n^{-\alpha }}\), and \(\beta \) controls the approximation error from using kernelised (rather than all measurable) scorers.
Related work
Theorem 1 generalises Natarajan et al. (2013, Theorem 11), which was for instance-independent noise. Ghosh et al. (2015, Theorem 1) provided a distinct bound between clean and corrupted risks, which does not establish consistency. Awasthi et al. (2015, 2016) established small corrupted 0–1 regret for specific algorithms under separable \(D\), while our bound relates clean and corrupted regrets for the output of any algorithm. See also Sect. 6.
Beyond misclassification error?
Theorem 1 implies consistency for the misclassification error. In practice, other measures such as the balanced error and F-score are also pervasive, especially under class imbalance. Can we show consistency for such measures as well?
Disappointingly, the answer is no. The reason is simple: for a range of such classification measures, any optimal scorer on \(D\) has \(\mathrm {sign}( s^*( x ) ) = \mathrm {sign}( \eta ( x ) - t(D) )\), where \(t(D)\) is some possibly distribution-dependent threshold (Narasimhan et al. 2014; Koyejo et al. 2014). However, Eq. 4 reveals that retaining such an optimal scorer on \(\bar{D}\) is not possible, as
\[ \eta ( x ) > t \iff \bar{\eta }( x ) > ( 1 - 2 \cdot \rho ( x ) ) \cdot t + \rho ( x ), \]
i.e., the thresholds of the clean and corrupted class-probability function do not coincide in general, so that no analogue of Corollary 1 can possibly hold. Specifically, for any \(t \ne \nicefrac{1}{2}\) (i.e. any threshold beyond that for 0–1 loss), optimal classification based on \(\bar{\eta }\) requires knowing the unknown flipping function \(\rho ( x )\).
The above implies that under purely instance-dependent noise, we cannot (at least naïvely) optimally classify with measures beyond the misclassification error. This is a point of departure from existing analysis for instance-independent noise; for example, Menon et al. (2015) established that the balanced error minimiser is unaffected under class-conditional noise.
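The threshold mismatch can be made concrete. Under purely instance-dependent noise with flip probability ρ(x), the corrupted class-probability is an affine function of the clean one, so thresholding η at t corresponds to thresholding the corrupted class-probability at (1 − 2ρ(x))·t + ρ(x). The sketch below, with a hypothetical helper name and hypothetical noise levels, shows that this corrupted threshold depends on ρ(x) unless t is one half.

```python
def corrupted_threshold(t, rho):
    """Threshold on the corrupted class-probability equivalent to eta > t under PIN noise."""
    return (1 - 2 * rho) * t + rho


for t in (0.5, 0.3):
    thresholds = [corrupted_threshold(t, rho) for rho in (0.0, 0.1, 0.4)]
    print("clean threshold", t, "-> corrupted thresholds", thresholds)
```

For t = 0.5 every noise level maps to the same corrupted threshold, which is why the 0–1 loss is unaffected; for t = 0.3 the corrupted threshold moves with the noise level of each instance.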
AUROC consistency under boundary-consistent noise
Having established classification consistency for purely instance-dependent noise, we turn to our second contribution (C2), concerning the distinct problem of bipartite ranking consistency. Recall from Sect. 2.1 that bipartite ranking (Agarwal and Niyogi 2005) considers the pairwise disagreement \(R_{\mathrm {rank}}( s; D, \ell ^{01})\), viz. one minus the area under the ROC curve (AUROC) of s (Clémençon et al. 2008).
Given the popularity of the AUROC as a performance measure under class imbalance (Ling and Li 1998), studying its consistency under label noise is of interest. However, compared to the misclassification error, even in the instance-independent case, this issue has received comparatively little attention, with a few exceptions (Menon et al. 2015). We now provide such an analysis for a structured form of label- and instance-dependent noise.
Relating clean and corrupt Bayes-optimal scorers
As in Sect. 3, before studying AUROC consistency, it is prudent to confirm that the clean and corrupted Bayes-optimal scorers of the AUROC coincide. The AUROC is maximised by any scorer \(s^*\) that is order preserving for \(\eta \) (Clémençon et al. 2008), i.e.
\[ ( \forall x, x' \in \mathscr {X}) \;\; \eta ( x ) < \eta ( x' ) \implies s^*( x ) < s^*( x' ). \]
Equally, on the corrupted \(\bar{D}\), the corrupted AUROC will be maximised by any scorer that is order preserving for \(\bar{\eta }\). Thus, for the Bayes-optimal scorers to coincide, we will have to ensure that \(\bar{\eta }\) is order preserving for \(\eta \), i.e., that
\[ ( \forall x, x' \in \mathscr {X}) \;\; \eta ( x ) < \eta ( x' ) \implies \bar{\eta }( x ) < \bar{\eta }( x' ). \tag{10} \]
But by Lemma 1, this cannot be true for general label- and instance-dependent noise, since there is no necessary relationship between the flip functions \(\rho _{\pm 1}\) and \(\eta \); see “Appendix C” for some concrete counterexamples.
To make progress, we thus need to restrict our noise model by injecting suitable dependence between \(\rho _{\pm 1}\) and \(\eta \). We next present one such noise model which suits our needs.
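A two-point example already shows how order preservation can fail, even for noise that does not depend on the label. The sketch below is illustrative only: the class-probabilities and flip probabilities are hypothetical and are not the counterexamples of the appendix.

```python
def corrupt(eta, rho):
    """Corrupted class-probabilities when both labels flip with probability rho[i]."""
    return [(1 - r) * e + r * (1 - e) for e, r in zip(eta, rho)]


def is_order_preserving(eta, eta_bar):
    """Check that eta[i] < eta[j] implies eta_bar[i] < eta_bar[j] for all pairs."""
    n = len(eta)
    return all(
        eta_bar[i] < eta_bar[j]
        for i in range(n) for j in range(n)
        if eta[i] < eta[j]
    )


eta = [0.6, 0.9]                       # the second instance is the "easier" positive

constant_noise = corrupt(eta, [0.2, 0.2])
varying_noise = corrupt(eta, [0.0, 0.4])   # heavy noise on the easier instance

print(constant_noise, is_order_preserving(eta, constant_noise))
print(varying_noise, is_order_preserving(eta, varying_noise))
```

With heavy noise on the easier instance, its corrupted class-probability drops below that of the harder instance, so a scorer that ranks optimally on the corrupted distribution ranks this pair wrongly on the clean one.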
The boundary consistent noise (BCN) model
We propose a noise model where, roughly, the higher the inherent uncertainty (i.e., \(\eta \approx \nicefrac{1}{2}\)), the higher the noise. We will shortly show such a model possesses order preservation.
Definition 4
(BCN model) Given a clean distribution \(D\), consider a label- and instance-dependent noise model \( \mathrm {LIN}( D, \rho _{1}, \rho _{-1} ) \) where \(\rho _y = f_y \circ s\) for some functions \(f_{\pm 1} :\mathbb {R}\rightarrow [0, 1]\) and \(s :\mathscr {X}\rightarrow \mathbb {R}\) such that:

(a) s is order preserving for \(\eta \), i.e., \( ( \forall x, x' \in \mathscr {X}) \, \eta ( x )< \eta ( x' ) \implies s( x ) < s( x' ). \)
(b) \(f_{\pm 1}\) are non-decreasing on \((-\infty , s_0]\) and non-increasing on \([s_0, \infty )\), where \(s_0\) is the score at which \(\eta \) crosses \(\nicefrac{1}{2}\).
(c) the difference \(z \mapsto f_{1}( z ) - f_{-1}( z )\) is non-increasing.
We term this the boundary consistent noise model (BCN model). We write the resulting corrupted distribution as \(\mathrm {BCN}( D, f_{1}, f_{-1}, s )\).
The \(\mathrm {BCN}\) noise model is, to our knowledge, novel. However, special cases of the model have been studied by Bylander (1997), Du and Cai (2015) and Bootkrajang (2016), wherein it is assumed that \(D\) is linearly separable, and the noise is purely instance-dependent. As one such special case, the BCN model captures a plausible model of human annotator noise, wherein “hard” instances (i.e. those close to some optimal separator) have the most noise.
Example 1
(Annotator noise) Suppose \(s( x ) = \langle w^*, x \rangle \) for some \(w^* \in \mathbb {R}^d\). Consider a linearly separable \(D\) with \(\eta ( x ) = \llbracket s( x ) > 0 \rrbracket \), and noise \(\mathrm {LIN}( D, \rho _{1}, \rho _{-1} )\) where \(\rho _{1} = \rho _{-1} = f \circ s\), and \(f( z ) = f_{\pm 1}( z ) = g( | z | )\) for some monotone decreasing g.
We now unpack the three conditions underpinning the general model:

(a) encodes that the scores underlying the noise order instances consistently with \(\eta \).
(b) encodes that “harder” instances (with \(\eta \approx \nicefrac{1}{2}\)) have the highest chance of a label flip.
(c) is more opaque; however, it is trivially satisfied when the flip functions are constant (i.e., the noise is class-conditional), or identical (i.e. the noise is purely instance-dependent). The latter covers the practically relevant Example 1; thus, all results for \(\mathrm {BCN}\) automatically hold for this important case. In more general settings, the condition is needed for technical reasons (see Sect. 4.3 and “Appendix C”).
Relating clean and corrupt regrets
We now show that under the \(\mathrm {BCN}\) model, order preservation of \(\eta \) is guaranteed as per Eq. 10. Thus, the clean and corrupt Bayes-optimal AUROC scorers coincide.
Proposition 1
Pick any distribution \(D\). Suppose \(\bar{D}= \mathrm {BCN}( D, f_{1}, f_{-1}, s )\). Then,
\[ ( \forall x, x' \in \mathscr {X}) \;\; \eta ( x ) < \eta ( x' ) \implies \bar{\eta }( x ) < \bar{\eta }( x' ). \]
While simple to state, the result requires a careful analysis of the relationship between \(\bar{\eta }( x ) - \bar{\eta }( x' )\) and \(\eta ( x ) - \eta ( x' )\). Further, it crucially requires Condition (c) of the \(\mathrm {BCN}\) model; see “Appendix C” for counterexamples, including one where \(f_1( z ) - f_{-1}( z )\) is non-decreasing rather than non-increasing.
Proposition 1 reassures us that under the \(\mathrm {BCN}\) model, corrupted ranking risk minimisation converges to the right object. A careful analysis of the behaviour of \((\bar{\eta }( x ) - \bar{\eta }( x' ))/(\eta ( x ) - \eta ( x' ))\) lets us go further and provide a ranking regret bound, analogous to Theorem 1.
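The order-preservation property can be observed on a simple instance of boundary-consistent noise. The sketch below assumes a sigmoid class-probability over a one-dimensional score, and a hypothetical flip function that is identical for both labels and peaks at the decision boundary (score zero), so that the condition on the difference of flip functions holds trivially.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def flip_prob(z, rho_at_boundary=0.4):
    """Hypothetical flip function: largest at the boundary z = 0, decaying with |z|."""
    return rho_at_boundary * math.exp(-abs(z))


scores = [i / 4 for i in range(-12, 13)]          # scores from -3 to 3
eta = [sigmoid(z) for z in scores]                # clean class-probabilities
rho = [flip_prob(z) for z in scores]              # same flip probability for both labels
eta_bar = [(1 - 2 * r) * e + r for e, r in zip(eta, rho)]

increasing = all(a < b for a, b in zip(eta_bar, eta_bar[1:]))
print("corrupted class-probability increasing in the score:", increasing)
print("range of eta:", (round(eta[0], 3), round(eta[-1], 3)))
print("range of eta_bar:", (round(eta_bar[0], 3), round(eta_bar[-1], 3)))
```

The corrupted class-probability is flattened around the boundary but remains increasing in the score, so any scorer that orders instances correctly for the corrupted distribution also does so for the clean one in this example.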
Theorem 2
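Pick any distribution \(D\). Let \(\bar{D}\) be a corrupted distribution such that \((\eta , \bar{\eta })\) satisfy Eq. 10, and there exists a constant C such that
\[ ( \forall x, x' \in \mathscr {X}) \;\; | \eta ( x ) - \eta ( x' ) | \le C \cdot | \bar{\eta }( x ) - \bar{\eta }( x' ) |. \tag{11} \]
Then, for any scorer \(s :\mathscr {X}\rightarrow \mathbb {R}\),
\[ \mathrm {reg}_{\mathrm {rank}}( s; D) \le C \cdot \frac{\bar{\pi }\cdot ( 1 - \bar{\pi })}{\pi \cdot ( 1 - \pi )} \cdot \mathrm {reg}_{\mathrm {rank}}( s; \bar{D}), \tag{12} \]
where \(\mathrm {reg}_{\mathrm {rank}}\) denotes the excess ranking risk of a scorer s, and \(\pi = \mathbb {P}( \mathsf {Y}= 1 )\), \(\bar{\pi }= \mathbb {P}( \bar{\mathsf {Y}}= 1)\).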
In particular, if \(\bar{D}= \mathrm {BCN}( D, f_{1}, f_{-1}, s )\) where \(( f_{1}, f_{-1}, s, \eta )\) are \(\mathrm {BCN}\)-admissible, and , then Eq. 12 holds with \(C = ( 1 - 2 \cdot \rho _{\mathrm {max}} )^{-1}\).
Intuitively, the condition in Eq. 11 ensures that if a pair of instances are easy to distinguish on the clean distribution (e.g., \(\eta ( x ) = 1\) while \(\eta ( x' ) = 0\)), they remain relatively so on the corrupted distribution. This rules out scenarios where the noise makes all instances, regardless of their original \(\eta \) value, have an \(\bar{\eta }\) value arbitrarily close to \(\nicefrac{1}{2}\).
Implications
Theorem 2 implies that, under BCN noise, we can optimally rank (in the sense of AUROC) even when learning solely from noisy labels. Note that we can make \(\mathrm {reg}_{\mathrm {rank}}( s; \bar{D}) \rightarrow 0\) by appropriate surrogate loss minimisation on \(\bar{D}\) (Agarwal 2014).
Note also that neither of the noise models in Theorems 1 and 2 is a special case of the other. In particular, Theorem 2 allows for the noise to depend on the label, while Theorem 1 does not. However, even under purely instance-dependent noise, Theorem 2 requires the flip function \(\rho \) to satisfy additional conditions so as to guarantee order preservation.
As a final remark, we note that the BCN model is only sufficient for establishing Theorem 2: as stated, the necessary conditions are that \(\bar{\eta }\) is order preserving for \(\eta \), and there is a bound on the ratio \(({\bar{\eta }( x ) - \bar{\eta }( x' )})/({\eta ( x ) - \eta ( x' )})\). We focus on BCN as it is a plausible model of real-world noise, and leave for future work the exploration of other admissible noise models.
Related work
Theorem 2 generalises Menon et al. (2015, Corollary 3), which assumed instance-independent noise. This generalisation is non-trivial, with the proof of Proposition 1 requiring a careful case-based analysis. We are not aware of any prior analysis of the consistency of AUROC maximisation under noise with any form of instance-dependence.
The Isotron: efficiently learning under boundary-consistent noise
Theorems 1 and 2 imply that by ensuring vanishing regret on the corrupted distribution, we also ensure vanishing regret on the clean distribution. We now turn to our third contribution (C3), concerning the algorithmic implications of our results, by specifying how precisely one can minimise the corrupted regret in practice.
A standard approach is to choose s from a rich function class, e.g., that of a universal kernel with appropriately tuned parameters. However, this is potentially unsatisfying in two ways. First, training a kernel machine without further approximation requires quadratic complexity (Schölkopf 2001, p. 288), which may be computationally infeasible. Second, suppose one has further knowledge about the clean \(D\), e.g., that it is well-modelled by a linear scorer in the native feature space. Employing a generic kernel machine here is intuitively overkill, and does not exploit our prior knowledge. As a practical consequence, we expect such an approach to generalise worse than one that directly uses a linear model.
We now show that, when we know the clean \(D\) can be modelled by a linear scorer (allowing but not requiring \(D\) to be linearly separable), the Isotron algorithm (Kalai and Sastry 2009) can provably and efficiently learn under certain boundary-consistent noise. To make this more precise, we need to introduce two additional concepts.
The SIM family of class-probability functions
Our assumption on \(D\) will be that its class-probability function is a member of the generalised linear model (GLM) family. More formally, for link function \(u :\mathbb {R}\rightarrow [0, 1]\) and separator \(w^* \in \mathbb {R}^d\), the GLM class-probability function is \(\mathrm {GLM}( u, w^* ) :x \mapsto u( \langle w^*, x \rangle )\). We assume \(D\) belongs to the single-index model (SIM) family of class-probability functions (Kalai and Sastry 2009), wherein the link is unknown, but is known to be Lipschitz. That is, the SIM family comprises all possible GLM models with Lipschitz link.
Definition 5
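(SIM family) For any \(L, W \in \mathbb {R}_+\), the single-index model (SIM) family is \( \mathrm {SIM}( L, W ) = \{ \mathrm {GLM}( u, w ) \mid u \in \mathscr {U}( L ), \Vert w \Vert _2 \le W \}, \)
where \(\mathscr {U}( L )\) is the set of all non-decreasing \(L\)-Lipschitz functions \(u :\mathbb {R}\rightarrow [0, 1]\).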
Intuitively, the SIM assumption on \(D\) encodes that a linear model equipped with a suitable nonlinearity can accurately predict the labels. Two simple examples are presented below.
Example 2
Suppose that \(D\) is linearly separable with margin \(\gamma > 0\), i.e., \( \eta ( x ) = \llbracket \langle w^*, x \rangle > 0 \rrbracket \) where \( \mathbb {P}( \{ (\mathsf {X}, \mathsf {Y}) \mid \mathsf {Y}\cdot \langle w^*, \mathsf {X}\rangle < \gamma \} ) = 0.\) Then, \( \eta \in \mathrm {SIM}( (2\gamma )^{-1}, \Vert w^* \Vert _2 ) \) (Kalai and Sastry 2009). This is because we can equally write \(\eta ( x ) = u_{\mathrm {mar}( \gamma )}( \langle w^*, x \rangle )\), where \( u_{\mathrm {mar}( \gamma )}( z ) = \min ( 1, \max ( 0, ( z + \gamma ) / ( 2\gamma ) ) ), \) i.e., the link is 0 for \(z \le -\gamma \), 1 for \(z \ge \gamma \), and linear in between.
The function \(u_{\mathrm {mar}( \gamma )}( \cdot )\) is clearly \((2\gamma )^{-1}\)-Lipschitz.
Example 3
Suppose that \(D\) has class-probability of the logistic regression form, i.e., \( \eta ( x ) = ({1 + e^{-\langle w^*, x \rangle }})^{-1}. \) Then, \( \eta \in \mathrm {SIM}( 1, \Vert w^* \Vert _2 ) \).
The SIN family of noise models
Our assumption on the noise will be that the distance from the optimal separator determines the level of noise. More formally, suppose our clean \(D\) has \(\eta = \mathrm {GLM}( u, w^* )\) for some (unknown) \(u, w^*\). We then consider a boundary-consistent model of the noise with \(s^*( x ) = \langle w^*, x \rangle \) determining the flip probability^{Footnote 1}; we shall call this the single-index noise (SIN) model.
Definition 6
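(SIN noise) Let \(f_1, f_{-1} :\mathbb {R}\rightarrow [ 0, 1 ]\). Given any distribution \(D\) with \(\eta = \mathrm {GLM}( u, w^* )\), define \(\mathrm {SIN}( D, f_{-1}, f_{1} )\) to be the boundary-consistent corruption of \(D\) in which an instance \(x\) with label \(y \in \{ \pm 1 \}\) has its label flipped with probability \(f_y( s^*( x ) )\), where \(s^* :x \mapsto \langle w^*, x \rangle \).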
We shall see concrete examples of this noise model shortly. Put simply, like the underlying boundary-consistent noise model, it posits that inherently “hard” instances experience the most noise. To see this, suppose \(D\) is linearly separable. Then, instances close to the separating hyperplane defined by \(w^*\) are “hard” in the sense that they are optimally classified with low confidence; intuitively, such instances are easily confusable with instances from the other class.
Corruption runs in the SIM family
Under the SIM assumption on \(D\) and SIN assumption on the noise, learning from the resulting corrupted distribution \(\bar{D}\) is non-trivial: even if we know the correct link function \(u( \cdot )\) for \(D\), we will not know the precise link under \(\bar{D}\), as this will be affected by the (unknown) noise. Thus, we cannot directly leverage a standard GLM to provably learn from \(\bar{D}\).
Fortunately, an appealing consequence of pairing the SIM and SIN assumptions is that the SIM family is closed under SIN corruption, i.e., the resulting corrupted distribution is also a member of the SIM family.
Proposition 2
Pick any distribution \(D\) with \(\eta \in \mathrm {SIM}( L, W )\). Suppose that \(\bar{D}= \mathrm {SIN}( D, f_{-1}, f_{1} )\) where \(( f_{-1}, f_{1}, \eta )\) are \(\mathrm {BCN}\)-admissible, and \(( f_{-1}, f_{1} )\) are \(( L_{-1}, L_{1} )\)-Lipschitz respectively. Then, \(\bar{\eta }\in \mathrm {SIM}( L + L_{-1} + L_{1}, W )\). In particular, \( \bar{\eta }( x ) = \bar{u}( \langle w^*, x \rangle ) \) where \( \bar{u}( z ) = ( 1 - f_{1}( z ) - f_{-1}( z ) ) \cdot u( z ) + f_{-1}( z ). \)
This result is intuitive in light of Proposition 1, as \(\bar{\eta }\) is order-preserving for \(\eta \) under \(\mathrm {BCN}\). To illustrate this further, we provide two examples of corrupting the SIM member \(\eta ( x ) = u( \langle w^*, x \rangle )\) by SIN noise.
Example 4
Consider the class-conditional noise regime, so that \(f_1 \equiv \rho _{+}, f_{-1} \equiv \rho _{-}\) for constants \(\rho _{\pm } \in [0, 1]\). Then, by Eq. 4, \(\bar{\eta }( x ) = \bar{u}( \langle w^*, x \rangle )\) for \( \bar{u}( z ) = ( 1 - \rho _{+} - \rho _{-} ) \cdot u( z ) + \rho _{-}. \)
Example 5
Suppose \(f_1 \equiv f_{-1} \equiv f\) and \(f( z ) = g( | z | )\) for some arbitrary monotone decreasing function g. Then, by Eq. 4, \(\bar{\eta }( x ) = \bar{u}( \langle w^*, x \rangle )\) for \( \bar{u}( z ) = ( 1 - 2 \cdot f( z) ) \cdot u( z ) + f( z ). \) If we further assume \( u( z ) = \llbracket z > 0 \rrbracket \), so that \(D\) is separable, we have \( \bar{u}( z ) = 1 - g( | z | ) \) if \(z > 0\), and \( \bar{u}( z ) = g( | z | ) \) otherwise.
Observe that if g satisfies \(g( -z ) = 1 - g( z )\), then this is \( \bar{u}( z ) = g( -z ). \) That is, a structured form of monotonic noise on a linearly separable distribution yields a distribution scorable by some generalised linear model. When \(g( z ) = {1}/({1 + e^{z}})\) for example, we end up with a logistic regression model. This observation has been made previously (Du and Cai 2015).
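As a numerical sanity check of Example 5, the following sketch evaluates the corrupted link on a grid and compares it with the logistic link. It assumes the illustrative choices \(g( z ) = 1/(1 + e^{z})\), flip probability \(f( z ) = g( | z | )\), and a separable clean link; the function names are hypothetical and not from the original work.

```python
import numpy as np

def g(z):
    # Assumed monotone decreasing function with g(-z) = 1 - g(z).
    return 1.0 / (1.0 + np.exp(z))

def corrupted_link(z):
    # Clean link of a separable distribution: u(z) = [[z > 0]].
    u = (z > 0).astype(float)
    # Label-independent flip probability, largest at the boundary z = 0.
    f = g(np.abs(z))
    # Corrupted link for label-independent noise: (1 - 2 f) u + f.
    return (1.0 - 2.0 * f) * u + f

z = np.linspace(-6.0, 6.0, 1201)
u_bar = corrupted_link(z)
sigmoid = 1.0 / (1.0 + np.exp(-z))

# Largest pointwise deviation between the corrupted link and the logistic link.
max_gap = float(np.max(np.abs(u_bar - sigmoid)))
print(max_gap)
```

Under these assumptions the deviation is at the level of floating-point error, which is consistent with the claim that the corrupted distribution is a logistic regression model.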
We are not aware of prior results akin to Proposition 2 on the behaviour of SIMs under structured noise. However, when \(D\) is separable, Du and Cai (2015) observed that a certain special case of our \(\mathrm {BCN}\) noise results in an \(\bar{\eta }\) that belongs to the GLM family.
Proposition 2 implies that any algorithm for learning a generic SIM \(D\) may be used to learn \(\bar{\eta }\) under SIN noise. Fortunately, as we now see, efficient algorithms for learning SIMs exist.
Efficiently learning noisy SIMs via the Isotron
SIMs for instances in the unit ball \(\mathbb {B}^d\) can be provably learned with the Isotron (Kalai and Sastry 2009), and its Lipschitz variant, the SLIsotron (Kakade et al. 2011). The elegant Isotron algorithm (Algorithm 1) alternately updates the separator w, and the link function u. The latter is estimated nonparametrically using the pav algorithm (Ayer et al. 1955), which solves the isotonic regression problem: \( ( \hat{{u}}_1, \ldots , \hat{{u}}_m ) = {{\text {argmin }}\, }_{{u}_1 \le {u}_2 \le \cdots \le {u}_m}{\sum _{i = 1}^m ( y_i - {u}_i )^2}, \) where we presort the scores so that \(s_1 \le s_2 \le \cdots \le s_m\), i.e., we wish for the u’s to respect the ordering of the s’s. The SLIsotron algorithm is identical, except that one calls lpav, a variant of pav that obeys a Lipschitz constraint.
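The alternating scheme can be sketched as follows. This is a minimal illustration of the plain Isotron with a pool-adjacent-violators fit, not the authors' implementation and not the Lipschitz-constrained SLIsotron; it assumes labels coded in \(\{0, 1\}\), the standard additive update \(w \leftarrow w + \frac{1}{m} \sum _i ( y_i - \hat{u}_i ) x_i\), and a synthetic SIN-corrupted logistic model whose constants (link scale 6, flip function \(0.3 e^{-z^2}\)) are chosen purely for illustration.

```python
import numpy as np

def pav(y):
    """Least-squares nondecreasing fit to y, where y is already in score order."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in y:
        blocks.append([float(v), 1])
        # Pool adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def isotron(X, y, n_iters=50):
    """Alternate between fitting the link by PAV and updating the separator."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        scores = X @ w
        # Sort by score; within ties, sort labels in decreasing order so that
        # PAV pools tied scores and assigns them a common fitted value.
        order = np.lexsort((-y, scores))
        u_hat = np.empty(m)
        u_hat[order] = pav(y[order])
        w = w + (y - u_hat) @ X / m
    return w

# Synthetic illustration (all constants below are assumptions).
rng = np.random.default_rng(0)
m, d = 2000, 5
G = rng.normal(size=(m, d))
X = G / np.linalg.norm(G, axis=1, keepdims=True) * rng.uniform(size=(m, 1)) ** (1.0 / d)
w_star = np.zeros(d)
w_star[0] = 6.0
z = X @ w_star
eta = 1.0 / (1.0 + np.exp(-z))                      # clean class-probability
y_clean = (rng.uniform(size=m) < eta).astype(float)
flip = rng.uniform(size=m) < 0.3 * np.exp(-z ** 2)  # noise peaks at the boundary
y_noisy = np.where(flip, 1.0 - y_clean, y_clean)

w_hat = isotron(X, y_noisy)
cosine = float(w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star)))
print(cosine)
```

The cosine between the learned and true separators gives a rough indication of how well the direction is recovered from noisy labels alone; since only the direction matters for ranking, the norm of the learned separator is not compared.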
In light of Proposition 2, we thus propose to simply run the SLIsotron on corrupted samples. One can guarantee ranking consistency of this procedure; further, if the noise does not depend on the label, then we also have classification consistency.
Theorem 3
Let \(\mathscr {X}\subseteq \mathbb {B}^d\). Pick any distribution \(D\) with \(\eta \in \mathrm {SIM}(L, W)\), and \(\bar{D}= \mathrm {SIN}( D, f_{-1}, f_{1} )\) for Lipschitz \(( f_{-1}, f_{1} )\). Given a corrupted sample \(\bar{\mathsf {S}} \sim \bar{D}^n\), we can construct a corrupted class-probability estimator \(\hat{\bar{\eta }}_{\bar{\mathsf {S}}} :\mathscr {X}\rightarrow [0, 1]\) using the SLIsotron, with \( \mathrm {reg}_{\mathrm {rank}}( \hat{\bar{\eta }}_{\bar{\mathsf {S}}}; D) {\mathop {\rightarrow }\limits ^{\mathbb {P}}} 0. \) Further, if \(f_{-1} = f_{1}\), we can construct a classifier \(c_{\bar{\mathsf {S}}} :x \mapsto \mathrm {sign}( 2\hat{\bar{\eta }}_{\bar{\mathsf {S}}}( x ) - 1 )\) with \( \mathrm {reg}( c_{\bar{\mathsf {S}}}; D, \ell ^{01}) {\mathop {\rightarrow }\limits ^{\mathbb {P}}} 0. \)
Intuitively, Theorem 3 relies on the existing SLIsotron consistency guarantee for its class-probability estimate (see “Appendix B.5” for a review). Since the SLIsotron is applied on corrupted samples, this implies a suitable corrupted regret asymptotically vanishes. Combined with our classification and ranking regret bounds (Theorems 1 and 2), this implies the clean regret for this estimator also asymptotically vanishes.
Implications
We make some additional remarks on the use of the SLIsotron under label noise. First, the SLIsotron does not require one to know the precise form of either \(\eta \) or the label flipping functions. Even if one just knows that there exists some u such that \(\eta = \mathrm {GLM}(u,w^*)\), and that the labels are subject to (Lipschitz) monotonic noise, one can estimate \(\bar{\eta }\).
Second, by estimating \(\bar{\eta }\), one can potentially estimate the flipping functions themselves. For example, in the class-conditional setting, we can estimate the label flip probabilities via the range of \(\bar{\eta }\), under a mild assumption on \(D\) (Scott et al. 2013; Liu and Tao 2015; Menon et al. 2015). For SIN noise, estimation is possible if one knows the precise form of \(u( \cdot )\), and if the noise does not depend on the labels. For example, one may know that \(D\) is separable with a certain margin. Then, since \(\bar{u}( z ) = ( 1 - 2 \cdot f( z ) ) \cdot u( z ) + f( z )\) for label-independent noise, we can infer the label flipping function as \( f( z ) = ( \bar{u}( z ) - u( z ) ) / ( 1 - 2 \cdot u( z ) ) \) wherever \(u( z ) \ne 1/2\).
The estimation error in this term depends wholly on the error in estimating \(\bar{u}\).
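To make the inversion concrete, the sketch below recovers a label-independent flip function from a corrupted link when the clean link is known. It assumes the relation \(\bar{u} = ( 1 - 2 f ) \cdot u + f\) for label-independent noise, a margin link with \(\gamma = 0.5\), and a hypothetical flip function \(0.4 e^{-z^2}\); in practice \(\bar{u}\) would be replaced by its estimate, and the helper name `infer_flip` is illustrative only.

```python
import numpy as np

def infer_flip(u_bar, u, eps=1e-6):
    """Solve u_bar = (1 - 2 f) u + f for f; undefined (NaN) where u is 1/2."""
    u_bar = np.asarray(u_bar, dtype=float)
    u = np.asarray(u, dtype=float)
    denom = 1.0 - 2.0 * u
    f = np.full_like(u, np.nan)
    ok = np.abs(denom) > eps
    f[ok] = (u_bar[ok] - u[ok]) / denom[ok]
    return f

gamma = 0.5
z = np.linspace(-2.0, 2.0, 401)
u = np.clip((z + gamma) / (2.0 * gamma), 0.0, 1.0)  # assumed known margin link
f_true = 0.4 * np.exp(-z ** 2)                      # hypothetical flip function
u_bar = (1.0 - 2.0 * f_true) * u + f_true           # exact corrupted link

f_hat = infer_flip(u_bar, u)
max_err = float(np.nanmax(np.abs(f_hat - f_true)))
print(max_err)
```

Because the inversion divides by \(1 - 2 u( z )\), any error in the estimate of \(\bar{u}\) is amplified for scores where the clean link is close to one half.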
Third, while Theorem 3 is a statement about asymptotic consistency, one can establish rates of convergence as well. For example, the SLIsotron guarantee is that the regret of the corrupted class-probability estimates decays like \(\mathscr {O}\left( ( {d}/{n} )^{1/3} \right) \) (see “Appendix B.5” for a review). This can be contrasted to the regret decay for kernelised scorers (Eq. 9), which can be significantly larger in the regime of low regularisation (which is to be expected for low-dimensional problems). This makes concrete our motivating intuition for the potential limitation of using a black-box kernel machine to tackle problems with additional structure.
Related work
Existing analysis of the Isotron has focussed on the setting of standard learning from binary labels (Kalai and Sastry 2009; Kakade et al. 2011); to our knowledge, there is no existing analysis of its behaviour under label noise.
Recently, Awasthi et al. (2015, 2016) proposed efficient algorithms to learn under purely instance-dependent noise (PIN), assuming that \(D\) is linearly separable with log-concave isotropic marginal over instances. Our use of the Isotron operates with a more structured form of noise (SIN), which is a subset of PIN; however, we do not require an assumption on the marginals, and merely require \(D\) to be linearly scorable by belonging in the GLM family. Further, we show ranking as well as classification consistency.
To learn under class-conditional noise with linear models, Natarajan et al. (2013) proposed a loss-correction requiring knowledge of the noise rates, and Menon et al. (2015) proposed a neural network. The Isotron is distinct from the former by not requiring the noise to be known; from the latter by having a correctness guarantee; and from both by working for noise that can depend on the instances.
Related work
Recall that our three contributions C1–C3 are in showing the classification and ranking consistency of risk minimisation under suitably constrained instance-dependent noise, and a practical algorithm that can learn from such data. We now detail how these contributions are distinct from a number of existing works in label noise. Table 2 provides a summary.
Three strands of label noise research
While there is too large a body of work on label noise to summarise here (see e.g. Frénay and Kabán 2014; Frénay and Verleysen 2014 for recent surveys), broadly, there have been three strands of theoretical analysis S1–S3 that are relevant to our work.

(S1) PAC guarantees. The first strand has focussed on PAC-style guarantees for learning under symmetric and class-conditional noise (e.g. Bylander 1994; Blum et al. 1996; Blum and Mitchell 1998), noise consistent with the distance to the margin (e.g. Angluin and Laird 1988; Bylander 1997, 1998; Servedio 1999), noise constant on partitions of the input space (e.g. Decatur 1997; Ralaivola et al. 2006), noise with bounded error rate (e.g. Kalai et al. 2005; Awasthi et al. 2014), and arbitrary bounded instance-dependent or Massart noise (e.g. Awasthi et al. 2015). These works often assume the true distribution \(D\) is linearly separable with some margin, the marginal over instances has some structure (e.g. uniform over the unit sphere, or log-concave isotropic), and that one employs linear scorers for learning.

(S2) Surrogate losses. The second strand has focussed on the design of surrogate losses robust to label noise. Stempfel and Ralaivola (2009) proposed a non-convex variant of the hinge loss robust to asymmetric noise; however, it requires knowledge of the noise rate. For class-conditional noise, Natarajan et al. (2013) provided a simple “noise-corrected” version of any loss, which again requires knowledge of the noise rate. Ghosh et al. (2015) showed that losses whose components sum to a constant are robust to symmetric label noise. van Rooyen et al. (2015) showed that the linear or unhinged loss is robust to symmetric label noise. Patrini et al. (2016) showed that a range of “linear-odd” losses are approximately robust to asymmetric noise.

(S3) Consistency. The third strand, which is closest to our work, has focussed on showing consistency of appropriate risk minimisation in the regime where one has a suitably powerful function class (Scott et al. 2013; Natarajan et al. 2013; Menon et al. 2015). For example, Natarajan et al. (2013) showed that minimisation of appropriately weighted convex surrogates on the corrupted distribution \(\bar{D}\) is consistent for the purposes of classification on \(D\). This work has been restricted to the case of symmetric and class-conditional noise.
The difference of the present paper to these works may be summarised as:

(a) we work with instance-dependent noise models (unlike S2 and S3); this is more practically relevant than the standard instance-independent noise assumption.

(b) we do not make assumptions on \(D\) for our theoretical analysis in Sects. 3 and 4 (unlike S1); this is in keeping with standard consistency results for binary classification (Zhang 2004; Bartlett et al. 2006).

(c) we do not assume the scorer class is linear, but rather that it is sufficiently powerful to contain the Bayes-optimal scorer (unlike S1 and S2); this is again in keeping with consistency results for binary classification (Zhang 2004; Bartlett et al. 2006).

(d) we study consistency with respect to the AUROC, unlike all works (to our knowledge) with the exception of Menon et al. (2015); this is of interest since the AUROC is a canonical performance measure under class imbalance (Ling and Li 1998).

(e) we explicitly provide a practical algorithm for learning in the common scenario where the clean distribution belongs to the GLM family; this is in contrast to algorithmic proposals such as that of Natarajan et al. (2013), which require knowledge of the noise rates. While Patrini et al. (2017) proposed an algorithm to combine this with an estimate of the noise rate, guarantees as to the quality of the resulting solution are lacking.
We remark that a related strand of research is on learning from positive and unlabelled data (Elkan and Noto 2008; Plessis et al. 2015; Jain et al. 2016), which may be seen as a special case of learning with class-conditional (and hence instance-independent) noise (Scott et al. 2013; Menon et al. 2015). Finally, we note that several works have focussed on designing algorithms for coping with noise (Bootkrajang and Kabán 2014; Reed et al. 2014; Du and Cai 2015) (see Frénay and Verleysen 2014 for additional references); usually, however, these approaches lack theoretical guarantees. Formalising practical insights from these works in conjunction with our framework would be of interest for future work.
Comparison to specific works
We provide more details comparing our work to a few particularly related works.
Comparison to Ghosh et al. (2015)
Ghosh et al. (2015) provide a bound on the risk of the optimal solution on the corrupted distribution. By contrast, we provide explicit bounds on the regrets for the clean and corrupted distributions, rather than the risks. More precisely, they established the following.
Theorem 4
(Ghosh et al. 2015, Theorem 2) Pick any distribution \(D\) and loss \(\ell \) satisfying Eq. 5. Let \(\bar{D}= \mathrm {PIN}( D, \rho )\) for some admissible \(\rho :\mathscr {X}\rightarrow [0, \nicefrac []{1}{2})\). Then, for any function class \(\mathscr {S}\subseteq \mathbb {R}^{\mathscr {X}}\),
Theorem 4 implies that for purely instance-dependent noise, the \(\ell \)-risk minimiser (for suitable \(\ell \)) will not differ considerably on the clean and the corrupted samples. But a limitation of the result is that one cannot guarantee consistency with respect to, e.g., 0–1 loss, of using the result of \(\ell \)-risk minimisation on the corrupted samples. This is because the above only holds for the risk with respect to the clean distribution \(D\), which does not let us bound the clean regret in terms of the corrupted regret.
Comparison to Patrini et al. (2016)
Compared to Patrini et al. (2016), the primary difference of the present work is as per the above: the latter work does not provide a bound relating the clean and noisy regret for an arbitrary scorer. More precisely, they establish the following.
Theorem 5
(Patrini et al. 2016, Theorem 10) Pick any distribution \(D\) and loss \(\ell \) satisfying
Let \(\bar{D}\) be the result of \(D\) passed through class-conditional noise for some admissible \(\rho _+, \rho _- \in [0, 1]\). Suppose \(\mathscr {S}= \{ x \mapsto \langle w, x \rangle \mid \Vert w \Vert _2 \le W \}\). Then,
Thus, as per Ghosh et al. (2015), their Theorem 10 bounds the corrupted risk, rather than clean regret, and does not establish consistency. Indeed, as the bound is in terms of the corrupted rather than clean distribution, it does not specify how well a solution obtained from the noisy distribution will perform on a test set comprising clean labels.
Comparison to Awasthi et al. (2015, 2017)
Awasthi et al. (2015, 2017) show that for separable \(D\) with marginals possessing certain structure, one can guarantee small corrupted 0–1 regret for a specific algorithm. By contrast, the present work relates the clean and corrupted regret for the output of any algorithm, under no assumptions on the marginal distribution of \(D\). Finally, these works provide no analysis of ranking consistency.
These works also provided an algorithm to provably learn under the settings of their theorems; however, to our knowledge, there has been no practical assessment of the performance of these methods. On the other hand, Awasthi et al. (2017) also provide analysis for settings beyond our label flipping noise model. It is an interesting topic for future work as to whether one can extend our analysis to such models.
Comparison to regression approaches
Our LIN noise model is the natural discrete variant of heteroscedastic noise in regression problems (Le et al. 2005). Typically, such noise is handled by inferring the reliability of each instance, and then suitably weighting them (Shalizi 2017, Chapter 7). A distinct line of work has focussed on arbitrary (i.e., not necessarily probabilistically generated) regression noise (Wright and Ma 2010; Nguyen and Tran 2013; Bhatia et al. 2015). This is less immediately related to our probabilistic label-flipping noise setting.
Experimental illustration of theoretical results
We present experiments that validate our theoretical results. While our primary contributions are in providing formal theoretical statements of the behaviour of learning algorithms under noise, we wish to illustrate that there are potential practical implications from our findings.
Illustration of classification and ranking consistency
We first validate Theorems 1 and 2: we show that given access only to samples subject to instance-dependent noise, a rich model can asymptotically classify optimally; and if the noise is further boundary-consistent, then it can rank optimally as well.
We fix a non-separable discrete distribution \(D\) concentrated on notional instances \(\mathscr {X}= \{ x_1, x_2, \ldots , x_{16} \}\). We assume a uniform marginal M, and set \(\eta ( x_i ) = i/16\). We pick label flip function \(\rho ( x_i ) = \rho _{\mathrm {max}}\) for \(i = 8\) and \(\rho _{\mathrm {avg}}\) otherwise, for parameters \(\rho _{\mathrm {max}}, \rho _{\mathrm {avg}}\) to be specified. We then draw \(\bar{\mathsf {S}} \sim \bar{D}^m\) from the induced corrupted distribution, compute the minimiser of the empirical logistic risk (since \(\mathscr {X}\) is discrete, we can explicitly optimise over \(s \in \mathbb {R}^{16}\)), and compute the clean 0–1 regret of this solution. We repeat this for 100 random draws of \(\bar{\mathsf {S}}\).
We fix \(\rho _{\mathrm {max}} = 0.49\), and vary \(\rho _{\mathrm {avg}} \in \{ 0.1, 0.2, 0.3, 0.4 \}\). Figure 1 plots the average 0–1 regret as the number of samples m is varied. As predicted by Theorem 1, all the regrets eventually tend to zero; thus, asymptotically, we can classify optimally despite only having access to noisy samples. Further, as predicted by Eq. 7, small values of \(\rho _{\mathrm {avg}}\) lead to significantly smaller 0–1 regret. This is despite the fact that all the induced noisy distributions \(\bar{D}\) have the same maximal noise rate. Note now that \(\rho \) is boundary consistent, since the noise is highest when \(\eta ( x ) = \nicefrac []{1}{2}\). Figure 1 plots the average AUROC regret versus m, and confirms that this also tends to zero, as predicted by Theorem 2.
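The population-level behaviour behind this experiment can be checked directly, without sampling. The sketch below computes the corrupted class-probabilities for the 16-point distribution and verifies that they preserve both the Bayes-optimal decisions and the ordering of the clean class-probabilities. It assumes the label-independent flip relation \(\bar{\eta }( x ) = ( 1 - \rho ( x ) ) \cdot \eta ( x ) + \rho ( x ) \cdot ( 1 - \eta ( x ) )\); it does not reproduce the finite-sample regret curves.

```python
import numpy as np

eta = np.arange(1, 17) / 16.0          # clean class-probabilities eta(x_i) = i / 16
rho_max = 0.49

results = {}
for rho_avg in (0.1, 0.2, 0.3, 0.4):
    rho = np.full(16, rho_avg)
    rho[7] = rho_max                   # x_8, where eta = 1/2, gets the highest noise
    # Corrupted class-probability under label-independent flips.
    eta_bar = (1.0 - rho) * eta + rho * (1.0 - eta)
    # Bayes-optimal decisions agree wherever the clean decision is not a tie.
    decided = eta != 0.5
    same_sign = np.array_equal(np.sign(eta_bar[decided] - 0.5),
                               np.sign(eta[decided] - 0.5))
    # Order preservation: eta_bar is strictly increasing in i, like eta.
    same_order = bool(np.all(np.diff(eta_bar) > 0))
    results[rho_avg] = (same_sign, same_order)

print(results)
```

Both checks hold for every value of \(\rho _{\mathrm {avg}}\) considered, so a scorer that recovers \(\bar{\eta }\) classifies and ranks optimally on the clean distribution; what varies with \(\rho _{\mathrm {avg}}\) is how strongly \(\bar{\eta }\) is compressed towards one half, and hence how many samples are needed.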
Illustration of noise robustness of the Isotron
We next illustrate Theorem 3, showing that the Isotron can effectively learn GLMs under suitable boundary-consistent (SIN) noise.
To start, we fix a non-separable \(D\) such that M is a mixture of Gaussians with means \((1, 1)\) and \((-1, -1)\) and identity covariance. We picked \(\eta ( x ) = \sigma ( s^*( x ) )\) for sigmoid \(\sigma \) and \(s^*( x ) = 10 \cdot x_1 + 10 \cdot x_2\). For flip functions \(f_{\pm 1}( z ) = (1/2) e^{-z^2/4}\), we drew a sample \(\bar{\mathsf {S}}\) of 5000 elements from the boundary-consistent corruption of \(D\). We then estimated \(\bar{\eta }\) from \(\bar{\mathsf {S}}\) using 1000 iterations of Isotron. Figure 1 shows this estimate closely matches the actual \(\bar{\eta }\) computed explicitly via Eq. 4.
Next, we ran experiments on the USPS and MNIST datasets, for the tasks of distinguishing digits 0 and 9 for the former, and 6 and 7 for the latter. For an 80–20 train-test split, we inject boundary-consistent noise by flipping the training labels with probability \(f( x ) = \alpha \cdot \sigma ( \langle w^*, x \rangle ^2 )\) for parameter \(\alpha \in [0, \nicefrac []{1}{2})\), where \(w^*\) is the optimal separator found by ordinary least squares. This mimics a scenario where the labels are from a human annotator liable to make errors for the easily confusable digits. We then trained regularised least squares and logistic regression models (using regularisation strength \(\lambda = 10^{-8}\)), and the Isotron (using 100 iterations) on the corrupted training sample. We measured the models’ classification accuracy on the test set with clean labels.
For \(\alpha \in \{ 0.0, 0.1, \ldots , 0.5 \}\), Table 3 reports the mean and standard error of the accuracies over \(T = 25\) independent corruptions for both datasets. We find that for higher \(\alpha \) (i.e., more noise), the Isotron offers a significant improvement over standard learners.
Further experiments with the Isotron
We now present results showing that the Isotron learns good decision boundaries on non-separable real-world datasets, and that it can estimate noise rates in class-conditional settings. This indicates that our results are not purely theoretical, and have potential practical viability; it also motivates further study of algorithms to learn SIMs, as they may lead to principled means of coping with instance-dependent noise.
UCI experiments
We first show that the boundary-consistent noise (BCN) model captures the real-world labeling process to some extent, in that the Isotron can classify such data well. To this end, we run the Isotron algorithm on several UCI benchmark datasets (preprocessed and made available by Gunnar Rätsch^{Footnote 2}), using the given labels as is, without injecting any artificial noise. We compare the Isotron to two linear baseline methods, viz. ridge and logistic regression.
The results are presented in Table 4. We observe that in almost all the datasets, assuming boundary-consistent noise and using the Isotron helps learn a better linear decision boundary. This is so even when a linear model does not capture the underlying Bayes-optimal scorer, such as the highly nonlinear banana dataset. Overall, this confirms the usefulness and conformance of the noise model.
Noise rate estimation
We additionally assessed the feasibility of using the Isotron to estimate noise rates for a class-conditional noise model, a possibility hinted at in Sect. 5.4. For the USPS and MNIST datasets as used above, we artificially injected class-conditional noise with rate \(\rho _+ = 0.2\) on instances from the positive class, and \(\rho _- = 0.4\) from the negative class. We then used the quantile-based noise rate estimator of Menon et al. (2015, Section 6.3) on the estimates of the corrupted probability \(\bar{\eta }\) produced by the Isotron. Violin plots in Fig. 2 show that on both datasets, the estimates of the noise rates are unbiased on average, with modest variance.
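To illustrate the idea of reading noise rates off the range of \(\bar{\eta }\), the sketch below uses a simple quantile rule. It is not necessarily the estimator of Menon et al. (2015); it assumes the class-conditional relation \(\bar{\eta } = ( 1 - \rho _+ - \rho _- ) \cdot \eta + \rho _-\) and that the clean \(\eta \) attains both 0 and 1 on a non-negligible fraction of instances. The function name and the quantile level `q` are hypothetical.

```python
import numpy as np

def estimate_noise_rates(eta_bar_hat, q=0.01):
    """Estimate (rho_plus, rho_minus) from the extremes of corrupted probabilities.

    Under the stated assumptions, min(eta_bar) = rho_minus and
    max(eta_bar) = 1 - rho_plus; quantiles are used in place of the
    raw extremes for robustness to estimation error.
    """
    lo = np.quantile(eta_bar_hat, q)
    hi = np.quantile(eta_bar_hat, 1.0 - q)
    return 1.0 - hi, lo

# Noise-free illustration: a margin-type clean eta that reaches 0 and 1.
z = np.linspace(-3.0, 3.0, 1001)
eta = np.clip((z + 1.0) / 2.0, 0.0, 1.0)
rho_plus, rho_minus = 0.2, 0.4
eta_bar = (1.0 - rho_plus - rho_minus) * eta + rho_minus

rho_plus_hat, rho_minus_hat = estimate_noise_rates(eta_bar)
print(rho_plus_hat, rho_minus_hat)
```

With an estimated \(\bar{\eta }\), for example from the Isotron, the accuracy of this rule depends on how well the estimator captures the extremes of the corrupted class-probability.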
Conclusion and future work
We have theoretically analysed the problem of learning with instance-dependent label noise, with three main conclusions:

(a) for purely instance-dependent noise, minimising the classification risk on the noisy distribution is consistent for classification on the clean distribution;

(b) for a broad class of “boundary-consistent” label- and instance-dependent noise, a similar consistency result holds for the area under the ROC curve; and

(c) one can learn generalised linear models subject to the same “boundary-consistent” noise using the Isotron algorithm (Kalai and Sastry 2009).
For future work, determining sufficient conditions for order preservation of \(\eta \), and studying simplified versions of the Isotron under more specific noise models (e.g., classconditional) would be of interest.
Notes
 1.
It is crucial to use \(w^*\) here, rather than any arbitrary w. With the latter, there would be no necessary connection between the level of noise and the underlying class-probability. As a result, the corrupted class-probabilities would not by themselves provide information about their clean counterparts.
 2.
 3.
This result is implicit in the proof of Ghosh et al. (2015, Theorem 1).
 4.
By contrapositive of Condition (a) of \(\mathrm {BCN}\)-admissibility, if \(s( x ) \le s( x' )\) then \(\eta ( x ) \le \eta ( x' )\).
 5.
If \(\eta \in \mathrm {SIM}( L, W )\), then trivially \(\eta \in \mathrm {SIM}( 1, L \cdot W )\), because \(\eta ( x ) = u( \langle w^*, x \rangle ) = u( (1/L) \cdot \langle (L \cdot w^*), x \rangle ) = \tilde{u}( \langle \tilde{w}^*, x \rangle )\), where \(\tilde{u}\) is a 1-Lipschitz function, and \(\Vert \tilde{w}^* \Vert _2 = L \cdot W\).
References
Agarwal, S. (2014). Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research, 15, 1653–1674.
Agarwal, S., & Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. In Conference on learning theory (COLT), Springer (pp. 32–47).
Angluin, D., & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4), 343–370.
Awasthi, P., Balcan, M. F., & Long, P. M. (2014). The power of localization for efficiently learning linear separators with noise. In Symposium on the theory of computing (STOC) (pp. 449–458).
Awasthi, P., Balcan, M. F., Haghtalab, N., & Urner, R. (2015). Efficient learning of linear separators under bounded noise. Conference on Learning Theory (COLT), 40, 167–190.
Awasthi, P., Balcan, M., Haghtalab, N., & Zhang, H. (2016). Learning and 1-bit compressed sensing under asymmetric noise. In Conference on learning theory (COLT) (pp. 152–192).
Awasthi, P., Balcan, M., & Long, P. M. (2017). The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6), 50.
Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4), 641–647.
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
Bhatia, K., Jain, P., & Kar, P. (2015). Robust regression via hard thresholding. In Advances in neural information processing systems (NIPS) (pp. 721–729).
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Conference on learning theory (COLT) (pp. 92–100).
Blum, A., Frieze, A., Kannan, R., & Vempala, S. (1996). A polynomial-time algorithm for learning noisy linear threshold functions. In Foundations of computer science (FOCS) (pp. 330–338).
Bootkrajang, J. (2016). A generalised label noise model for classification in the presence of annotation errors. Neurocomputing, 192, 61–71.
Bootkrajang, J., & Kabán, A. (2014). Learning kernel logistic regression in the presence of class label noise. Pattern Recognition, 47(11), 3641–3655.
Bylander, T. (1994). Learning linear threshold functions in the presence of classification noise. In Conference on learning theory (COLT) (pp. 340–347).
Bylander, T. (1997). Learning probabilistically consistent linear threshold functions. In Conference on learning theory (COLT) (pp. 62–71).
Bylander, T. (1998). Learning noisy linear threshold functions (unpublished manuscript). http://www.cs.utsa.edu/~bylander/pubs/learningnoisyltfs.ps.gz.
Clémençon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2), 844–874.
Decatur, S. E. (1997). PAC learning with constant-partition classification noise and applications to decision tree induction. In International conference on machine learning (ICML) (pp. 83–91).
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer.
Du, J., & Cai, Z. (2015). Modelling class noise with symmetric and asymmetric distributions. In Conference on artificial intelligence (AAAI) (pp. 2589–2595).
Elkan, C., & Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In International conference on knowledge discovery and data mining (KDD) (pp. 213–220).
Frénay, B., & Kabán, A. (2014). A comprehensive introduction to label noise. In European symposium on artificial neural networks (ESANN) (pp. 667–676).
Frénay, B., & Verleysen, M. (2014). Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5), 845–869.
Ghosh, A., Manwani, N., & Sastry, P. S. (2015). Making risk minimization tolerant to label noise. Neurocomputing, 160, 93–107.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
Jain, S., White, M., & Radivojac, P. (2016). Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in neural information processing systems (NIPS) (pp. 2685–2693).
Kakade, S., Kanade, V., Shamir, O., & Kalai, A. (2011). Efficient learning of generalized linear and single index models with isotonic regression. In Advances in neural information processing systems (NIPS) (pp. 927–935).
Kalai, A., & Sastry, R. (2009). The Isotron algorithm: High-dimensional isotonic regression. In Conference on learning theory (COLT).
Kalai, A., Klivans, A., Mansour, Y., & Servedio, R. (2005). Agnostically learning halfspaces. In Foundations of computer science (FOCS) (pp. 11–20).
Koyejo, O. O., Natarajan, N., Ravikumar, P. K., & Dhillon, I. S. (2014). Consistent binary classification with generalized performance metrics. In Advances in neural information processing systems (NIPS) (pp. 2744–2752).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS) (pp. 1106–1114).
Le, Q. V., Smola, A. J., & Canu, S. (2005). Heteroscedastic Gaussian process regression. In International conference on machine learning (ICML) (pp. 489–496).
Ling, C. X., & Li, C. (1998). Data mining for direct marketing: Problems and solutions. In Knowledge discovery and data mining (KDD) (pp. 73–79).
Liu, T., & Tao, D. (2015). Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3), 447–461.
Long, P., & Servedio, R. (2008). Random classification noise defeats all convex potential boosters. In International conference on machine learning (ICML) (pp. 608–615).
Manwani, N., & Sastry, P. S. (2013). Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3), 1146–1151.
Massart, P., & Nédélec, E. (2006). Risk bounds for statistical learning. The Annals of Statistics, 34(5), 2326–2366.
Menon, A. K., van Rooyen, B., Ong, C. S., & Williamson, B. (2015). Learning from corrupted binary labels via class-probability estimation. In International conference on machine learning (ICML) (pp. 125–134).
Narasimhan, H., Vaish, R., & Agarwal, S. (2014). On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in neural information processing systems (NIPS) (pp. 1493–1501).
Natarajan, N., Dhillon, I. S., Ravikumar, P. D., & Tewari, A. (2013). Learning with noisy labels. In Advances in neural information processing systems (NIPS) (pp. 1196–1204).
Nguyen, N. H., & Tran, T. D. (2013). Exact recoverability from dense corrupted observations via \(\ell _1\)-minimization. IEEE Transactions on Information Theory, 59(4), 2017–2035.
Patrini, G., Nielsen, F., Nock, R., & Carioni, M. (2016). Loss factorization, weakly supervised learning and label noise robustness. In International conference on machine learning (ICML) (pp. 708–717).
Patrini, G., Rozza, A., Menon, A., Nock, R., & Qu, L. (2017). Making deep neural networks robust to label noise: A loss correction approach. In Computer vision and pattern recognition (CVPR) (pp. 2233–2241).
Plessis, M. C., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In International conference on machine learning (ICML) (pp. 1386–1394).
Ralaivola, L., Denis, F., & Magnan, C. N. (2006). CN = CPCN. In International conference on machine learning (ICML) (pp. 721–728).
Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., & Rabinovich, A. (2014). Training deep neural networks on noisy labels with bootstrapping. CoRR abs/1412.6596.
Reid, M. D., & Williamson, R. C. (2009). Surrogate regret bounds for proper losses. In International conference on machine learning (ICML) (pp. 897–904).
van Rooyen, B., Menon, A. K., & Williamson, R. C. (2015). Learning with symmetric label noise: the importance of being unhinged. In Advances in neural information processing systems (NIPS) (pp. 10–18).
Schölkopf, B., & Smola, A. J. (2001). Learning with kernels. Cambridge: MIT Press.
Scott, C., Blanchard, G., & Handy, G. (2013). Classification with asymmetric label noise: Consistency and maximal denoising. In Conference on learning theory (COLT) (pp. 489–511).
Servedio, R. (1999). On PAC learning using winnow, perceptron, and a perceptron-like algorithm. In Conference on learning theory (COLT) (pp. 296–307).
Shalizi, C. R. (2017). Advanced data analysis from an elementary point of view (unpublished book draft). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf.
Steinwart, I., & Scovel, C. (2005). Fast rates for support vector machines. In Conference on learning theory (COLT) (pp. 279–294).
Stempfel, G., & Ralaivola, L. (2007). Learning kernel perceptrons on noisy data using random projections. In Algorithmic learning theory (ALT) (pp. 328–342).
Stempfel, G., & Ralaivola, L. (2009). Learning SVMs from sloppily labeled data. In International conference on artificial neural networks (ICANN) (pp. 884–893).
Wright, J., & Ma, Y. (2010). Dense error correction via \(\ell _1\)-minimization. IEEE Transactions on Information Theory, 56(7), 3540–3560.
Xiao, T., Xia, T., Yang, Y., Huang, C., & Wang, X. (2015). Learning from massive noisy labeled data for image classification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2691–2699).
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 56–85.
Additional information
Editors: Jesse Davis, Elisa Fromont, Derek Greene, and Björn Bringmann.
Appendices
Appendix
A: Proofs of results in main body
Proof of Lemma 1
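By definition of how the corrupted labels \(\bar{\mathsf {Y}}\) are generated,
\[ \bar{\eta }( x ) = (1 - \rho _{1}( x )) \cdot \eta ( x ) + \rho _{-1}( x ) \cdot (1 - \eta ( x )). \]
The second identity follows by rearranging. \(\square \)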
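The two forms of the noisy class-probability used in this proof can be checked numerically. The following is a minimal sketch, assuming that \(\rho _{1}( x )\) is the probability of flipping a clean positive label and \(\rho _{-1}( x )\) that of flipping a clean negative label; the function names are illustrative and do not come from the paper.

```python
def noisy_posterior(eta, rho_pos, rho_neg):
    """P(noisy label = +1 | x), obtained by conditioning on the clean label.

    eta     : clean class-probability P(y = +1 | x)
    rho_pos : assumed flip probability of a clean positive, rho_{1}(x)
    rho_neg : assumed flip probability of a clean negative, rho_{-1}(x)
    """
    return (1.0 - rho_pos) * eta + rho_neg * (1.0 - eta)


def noisy_posterior_rearranged(eta, rho_pos, rho_neg):
    """The same quantity written as an affine function of eta."""
    return (1.0 - rho_pos - rho_neg) * eta + rho_neg


if __name__ == "__main__":
    for eta in (0.0, 0.25, 0.8, 1.0):
        first = noisy_posterior(eta, 0.1, 0.2)
        second = noisy_posterior_rearranged(eta, 0.1, 0.2)
        print(eta, round(first, 6), round(second, 6))
```

Writing the noisy class-probability as an affine function of the clean one makes the role of the factor \(1 - \rho _{1}( x ) - \rho _{-1}( x )\) explicit, which is the quantity the later results invert.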
Proof of Corollary 1
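This is a simple consequence of the fact that weighting a risk does not affect its Bayes-optimal scorers. Formally, for any \(w :\mathscr {X}\rightarrow \mathbb {R}_+\), let the weighted \(\ell \)-risk be
\[ R^{\mathrm {wt}( w )}( s; D, \ell ) = \mathbb {E}_{\mathsf {X}}\left[ w( \mathsf {X}) \cdot L( \eta ( \mathsf {X}), s( \mathsf {X}) ) \right], \qquad (15) \]
where \(L( \eta , v ) = \eta \cdot \ell _{1}( v ) + (1 - \eta ) \cdot \ell _{-1}( v )\) is the conditional risk. When \(w \equiv 1\), this is the standard \(\ell \)-risk. By Proposition 4,
\[ \underset{s}{{\text {argmin }}\, } R( s; D, \ell ) = \underset{s}{{\text {argmin }}\, } R^{\mathrm {wt}( w )}( s; \bar{D}, \ell ) = \underset{s}{{\text {argmin }}\, } R( s; \bar{D}, \ell ). \]
The second equality holds because weighting does not affect the Bayes-optimal scorers for a risk: for any \(w > 0\) and every \(x \in \mathscr {X}\),
\[ \underset{v}{{\text {argmin }}\, } w( x ) \cdot L( \bar{\eta }( x ), v ) = \underset{v}{{\text {argmin }}\, } L( \bar{\eta }( x ), v ). \]
Note finally that by definition, the weighting factor \(w( x ) = (1 - 2 \cdot \rho ( x ))^{-1} \ge 1\), and so no term is suppressed after weighting. (If \(w( x ) = 0\) for some \(x \in \mathscr {X}\), then any prediction would be optimal for that instance; thus, we would get containment rather than equality of minimisers.)\(\square \)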
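To illustrate why a positive instance weight leaves pointwise minimisers unchanged, the sketch below grid-searches the conditional risk of the squared loss \((1 - yv)^2\) with and without a weight. The grid, the weight 2.5 and the helper names are assumptions made purely for illustration.

```python
def conditional_sq_risk(eta, v):
    """Conditional risk of the squared loss (1 - y*v)^2 at class-probability eta."""
    return eta * (1.0 - v) ** 2 + (1.0 - eta) * (1.0 + v) ** 2


def best_prediction(eta, weight, grid):
    """Grid minimiser of weight * conditional risk; weight is assumed to be > 0."""
    return min(grid, key=lambda v: weight * conditional_sq_risk(eta, v))


grid = [i / 100.0 - 1.0 for i in range(201)]   # candidate predictions in [-1, 1]
unweighted = best_prediction(0.8, 1.0, grid)
weighted = best_prediction(0.8, 2.5, grid)

# Both searches return the grid point closest to 2 * 0.8 - 1.
print(unweighted, weighted)
```

A weight of exactly zero would make every prediction a minimiser at that instance, which is the containment caveat noted in the proof.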
Proof of Theorem 1
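Let \(s^{*} \in \underset{s}{{\text {argmin }}\, } R( s; D, \ell )\). Let \(w( x ) = (1 - 2 \cdot \rho ( x ))^{-1}\), and recall \(R^{\mathrm {wt}( w )}\) is the corresponding weighted \(\ell \)-risk (Eq. 15). By definition of the regret and Proposition 4,
\[ \begin{aligned} \mathrm {reg}( s; D, \ell ) &= R( s; D, \ell ) - R( s^{*}; D, \ell ) \\ &= R^{\mathrm {wt}( w )}( s; \bar{D}, \ell ) - R^{\mathrm {wt}( w )}( s^{*}; \bar{D}, \ell ) \\ &= \mathbb {E}_{\mathsf {X}}\left[ w( \mathsf {X}) \cdot \left( L( \bar{\eta }( \mathsf {X}), s( \mathsf {X}) ) - L( \bar{\eta }( \mathsf {X}), s^{*}( \mathsf {X}) ) \right) \right] \\ &\le (1 - 2 \cdot \rho _{\mathrm {max}})^{-1} \cdot \mathbb {E}_{\mathsf {X}}\left[ L( \bar{\eta }( \mathsf {X}), s( \mathsf {X}) ) - L( \bar{\eta }( \mathsf {X}), s^{*}( \mathsf {X}) ) \right] \\ &= (1 - 2 \cdot \rho _{\mathrm {max}})^{-1} \cdot \mathrm {reg}( s; \bar{D}, \ell ), \end{aligned} \]
where the last line is since by Corollary 1, we know that \(s^* \in \underset{s}{{\text {argmin }}\, } R( s; \bar{D}, \ell )\) also. This fact also implies that for the inequality step above, we can guarantee \(L( \bar{\eta }( x ), s( x ) ) - L( \bar{\eta }( x ), s^{*}( x ) ) \ge 0\) for every \(x \in \mathscr {X}\), and so we do not have to worry about the direction of the inequality.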
To get the second bound, suppose \(r( x ) = L( \bar{\eta }( x ), s( x ) ) - L( \bar{\eta }( x ), s^{*}( x ) )\) is the conditional regret. Trivially,
Now define the (nonnegative) random variables \(\mathsf {W}= w( \mathsf {X})\), \(\mathsf {R}= r( \mathsf {X})\). The regret of Eq. 16 can be rewritten
Note that \(W = (1 - 2 \cdot \rho _{\mathrm {max}})^{-1}\) by definition. The case \(\alpha = 0\) gives the original bound of Eq. 2. \(\square \)
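For the 0-1 loss, the first bound can be checked directly on a small finite instance space. This is a toy sketch under assumed values of the marginal, \(\eta \) and \(\rho \) (none of which come from the paper), using the label-independent relation \(\bar{\eta }( x ) = (1 - 2 \cdot \rho ( x )) \cdot \eta ( x ) + \rho ( x )\).

```python
def zero_one_regret(marginal, eta, predictions):
    """Excess 0-1 risk over the Bayes rule on a finite instance space."""
    regret = 0.0
    for p_x, eta_x, y_hat in zip(marginal, eta, predictions):
        bayes = 1 if eta_x > 0.5 else -1
        if y_hat != bayes:
            regret += p_x * abs(2.0 * eta_x - 1.0)
    return regret


marginal = [0.25, 0.25, 0.25, 0.25]   # assumed P(x)
eta = [0.1, 0.4, 0.7, 0.9]            # assumed clean P(y = +1 | x)
rho = [0.1, 0.3, 0.2, 0.0]            # assumed flip probabilities, all below 1/2
eta_bar = [(1 - 2 * r) * e + r for e, r in zip(eta, rho)]

predictions = [1, 1, -1, 1]           # an arbitrary, mostly wrong classifier
clean = zero_one_regret(marginal, eta, predictions)
noisy = zero_one_regret(marginal, eta_bar, predictions)
W = 1.0 / (1.0 - 2.0 * max(rho))      # (1 - 2 * rho_max)^{-1}

print(clean, noisy, W * noisy)
```

In this example the noisy regret is smaller than the clean regret, and rescaling it by \((1 - 2 \cdot \rho _{\mathrm {max}})^{-1}\) recovers an upper bound on the clean regret.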
Proof of Proposition 1
Pick some \(x, x'\) such that \(\eta ( x ) < \eta ( x ' )\). Certainly \(s( x ) < s( x' )\) since s is order preserving for \(\eta \) by BCN-admissibility Condition (a). Thus, by Lemma 5,
By the total noise assumption (Assumption 2), \(1 - \rho _{1}( x ) - \rho _{-1}( x ) > 0\) for every x, and so the \(\max ( \cdot )\) term above is \(> 0\). Since \(\eta ( x ) - \eta ( x' ) < 0\) by assumption, we conclude that \(\bar{\eta }( x ) - \bar{\eta }( x' ) < 0\). \(\square \)
Proof of Theorem 2
From Clémençon et al. (2008) and Agarwal (2014, Theorem 11),
where
By the order-preservation assumption,
Thus, in this case, \(\mathrm {sign}( \Delta \eta ) = \mathrm {sign}( \Delta \bar{\eta })\), and so \(\mathbb {I}( \Delta \eta , \Delta s ) = \mathbb {I}( \Delta \bar{\eta }, \Delta s )\). When \(\eta ( x ) = \eta ( x' )\), however, there is no guarantee on the relative values of \(\bar{\eta }( x )\) and \(\bar{\eta }( x' )\). But if \(\Delta \eta = 0\), then the first term in \(\mathbb {I}\) above is necessarily zero, while that for \(\Delta \bar{\eta }\) can only be \(\ge 0\). Thus, when \(\eta ( x ) \ne \eta ( x' )\) we have
and so, further applying the assumption on the difference between \(\eta \) values,
In the special case of the \(\mathrm {BCN}\) model, order-preservation holds by Proposition 1. What remains then is the \(| \eta ( x ) - \eta ( x' ) |\) term. Now, by Lemma 5,
Consequently, when \(\eta ( x ) < \eta ( x' )\), we have
By swapping x and \(x'\), an identical result holds if \(\eta ( x ) > \eta ( x' )\). If \(\eta ( x ) = \eta ( x' )\), we trivially have \(0 = | \eta ( x ) - \eta ( x' ) | \le | \bar{\eta }( x ) - \bar{\eta }( x' ) | \cdot (1 - 2 \cdot \rho _{\mathrm {max}})^{-1}\). Thus, the regret bound holds with . \(\square \)
Proof of Proposition 2
By Lemma 2, the mandatory Condition (a) of the model \(\mathrm {BCN}( D, f_{1}, f_{-1}, s )\) implies that \(\eta = u \circ s\) for some nondecreasing u. Thus, by Lemma 1,
where
By Corollary 3,
so that \(\bar{u}\) is a nondecreasing function, and thus a valid GLM link.
Next, applying the triangle inequality to Lemma 4, and using \(z = s( x ), z' = s( x' )\),
using the fact that \(| 1 - f_{1}( z' ) - f_{-1}( z' ) | < 1\) by the total noise assumption (Assumption 2), \(1 - u(z) \le 1\) and \(u(z) \le 1\) since \(\mathrm {Im}( u ) = [ 0, 1 ]\), and the Lipschitz assumptions on \(u, f_{\pm 1}\). It follows that \(\bar{u}\) is \(( L + L_{1} + L_{-1} )\)-Lipschitz. \(\square \)
Proof of Theorem 3
By Proposition 2, \(\bar{\eta }\in \mathrm {SIM}( L + L_{1} + L_{-1}, W )\). Thus, as a member of the SIM family, it is suitable for estimation using SLIsotron.
Proposition 6 implies that one can always choose an iteration of SLIsotron with low regret. Let \(\hat{\bar{\eta }}_{\mathsf {S}, t}\) denote the estimate produced by SLIsotron at iteration t. If in an abuse of notation we let \(\hat{\bar{\eta }}_{\mathsf {S}}\) denote the estimate \(\hat{\bar{\eta }}_{\mathsf {S}, t^*}\), where \(t^*\) is an appropriately determined iteration, then we have that \(\mathrm {reg}( \hat{\bar{\eta }}_{\mathsf {S}}; D, \ell ^{\mathrm {sq}}) {\mathop {\rightarrow }\limits ^{\mathbb {P}}} 0\).
For AUROC consistency, standard surrogate regret bounds (Agarwal 2014) imply that for any estimator \(\hat{\bar{\eta }}\),
for \(\ell ^{\mathrm {sq}}\) being the squared loss \(\ell ^{\mathrm {sq}}(y,v) = (1 - yv)^2\). By Theorem 2, we conclude that
The Isotron guarantee implies the RHS tends to 0 with sufficiently many samples. Thus, \(\mathrm {reg}_{\mathrm {rank}}( \hat{\bar{\eta }}_{\bar{\mathsf {S}}}; D) \rightarrow 0\).
For classification consistency, standard surrogate regret bounds (Zhang 2004; Bartlett et al. 2006; Reid and Williamson 2009) imply that we can bound the 0–1 regret in terms of the square loss regret:
By Theorem 1, for symmetric (label-independent) noise, thresholding our estimate of \(\bar{\eta }\) around \(\nicefrac {1}{2}\) yields:
The Isotron guarantee implies the RHS tends to 0 with sufficiently many samples. Thus, in the case of symmetric \(\mathrm {BCN}\) noise, thresholding \(\bar{\eta }\) around \(\nicefrac {1}{2}\) will be consistent with respect to the clean distribution. \(\square \)
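The estimator used in this proof alternates an isotonic fit of the link with a perceptron-like update of the weight vector. The sketch below is a plain Isotron in the style of Kalai and Sastry (2009), not the Lipschitz-constrained variant analysed above; it assumes labels recoded from \(\{ \pm 1 \}\) to \(\{ 0, 1 \}\), and all function names are illustrative.

```python
def pav_blocks(blocks):
    """Weighted pool-adjacent-violators.

    blocks: (sum_of_targets, count) pairs, ordered by increasing score.
    Returns one fitted value per input block; the values are nondecreasing.
    """
    merged = []  # each entry: [sum, count, number of input blocks pooled]
    for total, count in blocks:
        merged.append([total, count, 1])
        # Pool while the previous mean exceeds the newest mean.
        while (len(merged) > 1
               and merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]):
            total2, count2, pooled2 = merged.pop()
            merged[-1][0] += total2
            merged[-1][1] += count2
            merged[-1][2] += pooled2
    fitted = []
    for total, count, pooled in merged:
        fitted.extend([total / count] * pooled)
    return fitted


def isotron(xs, ys, iterations=50):
    """Plain Isotron sketch; xs are feature tuples, ys are targets in [0, 1]."""
    m, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(iterations):
        scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]
        # Group instances with identical scores so the link is a function of the score.
        groups = []
        for i in sorted(range(m), key=lambda i: scores[i]):
            if groups and scores[groups[-1][0]] == scores[i]:
                groups[-1].append(i)
            else:
                groups.append([i])
        means = pav_blocks([(sum(ys[i] for i in g), len(g)) for g in groups])
        u = [0.0] * m
        for g, value in zip(groups, means):
            for i in g:
                u[i] = value
        # Perceptron-like step using residuals against the fitted link.
        w = [wj + sum((ys[i] - u[i]) * xs[i][j] for i in range(m)) / m
             for j, wj in enumerate(w)]
    return w


xs = [(1.0, 0.0), (-1.0, 0.0), (0.5, 0.0), (-0.5, 0.0)]
ys = [1, 0, 1, 0]   # labels recoded from {-1, +1} to {0, 1}
print(isotron(xs, ys, iterations=5))
```

On this separable toy sample the isotonic fit reproduces the labels after one update, so the residuals vanish and the weights stop changing. Choosing the iteration \(t^*\) referred to in the proof would additionally require a held-out sample, which the sketch omits.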
B: Additional helper results
B.1: Order preservation
We will make use of the following simple fact about order preservation, stated without proof.
Lemma 2
Suppose \(f, g :\mathbb {R}\rightarrow \mathbb {R}\) are such that
\[ f( x ) < f( y ) \implies g( x ) < g( y ). \]
Then, \(f = u \circ g\) for some nondecreasing u.
Taking the contrapositive gives us an alternate useful statement.
Corollary 2
Suppose \(f, g :\mathbb {R}\rightarrow \mathbb {R}\) are such that
\[ g( x ) \le g( y ) \implies f( x ) \le f( y ). \]
Then, \(f = u \circ g\) for some nondecreasing u.
Finally, we can make a more precise statement about behaviour when \(g( x ) = g( y )\) under the above conditions.
Lemma 3
Suppose \(f, g :\mathbb {R}\rightarrow \mathbb {R}\) are such that
\[ f( x ) < f( y ) \implies g( x ) < g( y ). \]
Then,
\[ g( x ) < g( y ) \implies f( x ) \le f( y ), \qquad g( x ) = g( y ) \implies f( x ) = f( y ). \]
Proof
By the contrapositive in Corollary 2,
\[ g( x ) \le g( y ) \implies f( x ) \le f( y ). \]
If \(g( x ) < g( y )\) then trivially \(g( x ) \le g( y )\) and the result follows. Suppose that \(g( x ) = g( y )\). Then \(g( x ) \le g( y )\) and \(g( y ) \le g( x )\). Thus \(f( x ) \le f( y )\) and \(f( y ) \le f( x )\), i.e., \(f( x ) = f( y )\). The result is also evident from the fact that \(f = u \circ g\) by Corollary 2. \(\square \)
Note that if we only know that \(g( x ) < g( y ) \implies f( x ) \le f( y )\), we cannot conclude that \(f = u \circ g\), nor that \(g = u \circ f\); we must be able to conclude something about the behaviour of f when \(g( x ) = g( y )\).
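On a finite sample, an order-preservation premise and the resulting link u can be checked mechanically. The sketch below assumes the premise is read as \(f( x ) < f( y ) \implies g( x ) < g( y )\); the helper names are hypothetical, and the second example mirrors the caveat above about ties in g.

```python
def satisfies_premise(f_vals, g_vals):
    """Check f(x) < f(y)  =>  g(x) < g(y) over all pairs of sample points."""
    n = len(f_vals)
    return all(g_vals[i] < g_vals[j]
               for i in range(n) for j in range(n)
               if f_vals[i] < f_vals[j])


def recover_link(f_vals, g_vals):
    """Return u as a dict with f = u(g) and u nondecreasing, or None if impossible."""
    u = {}
    for f, g in zip(f_vals, g_vals):
        if g in u and u[g] != f:
            return None            # f is not a function of g
        u[g] = f
    keys = sorted(u)
    if any(u[a] > u[b] for a, b in zip(keys, keys[1:])):
        return None                # u would have to decrease somewhere
    return u


# Premise holds: a nondecreasing link exists (ties in f are allowed).
print(recover_link([0.1, 0.5, 0.5, 0.5], [1, 2, 2, 3]))

# Only "g(x) < g(y) => f(x) <= f(y)" holds (vacuously): no link exists.
print(satisfies_premise([0.2, 0.7], [1, 1]), recover_link([0.2, 0.7], [1, 1]))
```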
B.2: Relating clean and corrupt risks
We have the following general relationship between the risk on the clean and corrupted distributions, which is a generalisation of Natarajan et al. (2013, Lemma 1). In the following, we use the shorthand \(\ell _y( s ) = \ell ( y, s )\).
Proposition 3
Pick any distribution \(D\), and any loss \(\ell \). Suppose that \(\bar{D}= \mathrm {LIN}( D, \rho _1, \rho _{-1} )\) for admissible \(\rho _{\pm 1} :\mathscr {X}\rightarrow [0, 1]\). Then, for any scorer \(s :\mathscr {X}\rightarrow \mathbb {R}\),
where \(w( x ) = (1 - \rho _1( x ) - \rho _{-1}( x ))^{-1}\).
Proof of Proposition 3
Re-expressing Lemma 1, for \(\mathrm {LIN}( D, \rho _1, \rho _{-1} )\),
and
where \(w( x ) = (1 - \rho _1( x ) - \rho _{-1}( x ))^{-1} > 0\). Thus, the \(\ell \)-risk of an arbitrary scorer is
\(\square \)
The instantiation of Proposition 3 for the case of PIN noise and losses satisfying Eq. 5 will be useful in proving Corollary 1: in this case, we can show the clean risk is an instance-weighted version of the corrupted risk. Recall that for \(w :\mathscr {X}\rightarrow \mathbb {R}_+\), \(R^{\mathrm {wt}(w)}\) is the weighted \(\ell \)-risk, per Eq. 15. Then, we have the following.^{Footnote 3}
Proposition 4
Pick any distribution \(D\), and loss \(\ell \) satisfying Eq. 5. Suppose that \(\bar{D}= \mathrm {PIN}( D, \rho )\) for admissible \(\rho :\mathscr {X}\rightarrow [0, \nicefrac {1}{2})\). Then, for any scorer \(s :\mathscr {X}\rightarrow \mathbb {R}\),
\[ R( s; D, \ell ) = R^{\mathrm {wt}( w )}( s; \bar{D}, \ell ) + A( D, \rho ), \]
where \( w( x ) = ( 1 - 2 \cdot \rho ( x ) )^{-1} \), and \(A( D, \rho )\) is some term independent of s.
Proof
(Proof of Proposition 4) By Proposition 3, for \(\mathrm {LIN}( D, \rho _1, \rho _{-1} )\),
If \(\rho _1 \equiv \rho _{-1} \equiv \rho \), \(w( x ) = (1 - 2 \cdot \rho ( x ))^{-1}\) and
Thus, if per assumption the sum of the partial losses is a constant C,
Noting that the second term above does not depend on the scorer s, the result follows. \(\square \)
For the symmetric label noise model, Proposition 4 reduces to Natarajan et al. (2013, Theorem 9).
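The relation between the clean risk and the weighted corrupted risk can be verified numerically for the 0-1 loss, whose partial losses sum to the constant \(C = 1\) for nonzero predictions. In this sketch the offset is taken to be \(-C \cdot \mathbb {E}[ \rho ( \mathsf {X}) / (1 - 2 \cdot \rho ( \mathsf {X})) ]\), which is a derivation made for this example from \(\bar{\eta } = (1 - 2 \rho ) \eta + \rho \) rather than a formula quoted from the paper; the toy distribution is likewise assumed.

```python
from itertools import product


def cond_risk_01(eta, y_hat):
    """Conditional 0-1 risk of predicting y_hat in {-1, +1} at class-probability eta."""
    return 1.0 - eta if y_hat == 1 else eta


marginal = [0.2, 0.3, 0.5]     # assumed P(x)
eta = [0.15, 0.6, 0.9]         # assumed clean P(y = +1 | x)
rho = [0.05, 0.25, 0.4]        # assumed flip probabilities, all below 1/2
eta_bar = [(1 - 2 * r) * e + r for e, r in zip(eta, rho)]
weight = [1.0 / (1.0 - 2.0 * r) for r in rho]

# Offset term, independent of the predictions (derived here with C = 1).
offset = -sum(p * r * w for p, r, w in zip(marginal, rho, weight))


def clean_risk(pred):
    return sum(p * cond_risk_01(e, y) for p, e, y in zip(marginal, eta, pred))


def weighted_noisy_risk(pred):
    return sum(p * w * cond_risk_01(eb, y)
               for p, w, eb, y in zip(marginal, weight, eta_bar, pred))


# Compare both sides for every possible labelling of the three instances.
gaps = [abs(clean_risk(pred) - (weighted_noisy_risk(pred) + offset))
        for pred in product((-1, 1), repeat=3)]
print(max(gaps))
```

Because the offset does not depend on the predictions, minimising the weighted corrupted risk and minimising the clean risk select the same classifier in this example.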
B.3: Relating clean and corrupt thresholds
For a general LIN model, we have the following relation between the thresholds of \(\bar{\eta }\) values and the corresponding thresholds for \(\eta \).
Proposition 5
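Pick any distribution \(D\). Suppose that \(\bar{D}= \mathrm {LIN}( D, \rho _{1}, \rho _{-1} )\) for admissible \(\rho _{\pm 1} :\mathscr {X}\rightarrow [0, 1].\) Then, for any \(t \in [0, 1]\),
\[ \bar{\eta }( x ) > t \iff \eta ( x ) > \frac{t - \rho _{-1}( x )}{1 - \rho _{1}( x ) - \rho _{-1}( x )}. \]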
Proof
(Proof of Proposition 5) By Lemma 1,
By Assumption 2, \(\rho _1( x ) + \rho _{-1}( x ) < 1\) for every \(x \in \mathscr {X}\), and so \(1 - \rho _1( x ) - \rho _{-1}( x ) > 0\). We thus have
\(\square \)
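As a small worked example of the threshold relation, the sketch below assumes it is read as \(\bar{\eta }( x ) > t \iff \eta ( x ) > (t - \rho _{-1}( x )) / (1 - \rho _{1}( x ) - \rho _{-1}( x ))\), which follows from Lemma 1 when the total noise is below one. The function name and the numeric values are illustrative only.

```python
def clean_threshold(t, rho_pos, rho_neg):
    """Threshold on eta equivalent to thresholding eta_bar at t.

    Requires rho_pos + rho_neg < 1 so that the denominator is positive.
    """
    return (t - rho_neg) / (1.0 - rho_pos - rho_neg)


rho_pos, rho_neg, t = 0.1, 0.2, 0.5            # assumed flip rates at one instance
t_clean = clean_threshold(t, rho_pos, rho_neg)  # equals 3/7 here

etas = [i / 10.0 for i in range(11)]
agree = all(
    (((1.0 - rho_pos - rho_neg) * e + rho_neg) > t) == (e > t_clean)
    for e in etas
)
print(t_clean, agree)
```

Since the flip rates vary with the instance, the equivalent clean threshold is in general a function of x rather than a single constant, which is why thresholding \(\bar{\eta }\) at a fixed value need not correspond to a fixed threshold on \(\eta \).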
B.4: Difference in \(\bar{\eta }\) values
For the general LIN model, we have the following relation between the difference in \(\bar{\eta }\) values and the corresponding \(\eta \) values, which will be useful in demonstrating order preservation of \(\bar{\eta }\) and \(\eta \).
Lemma 4
Pick any distribution \(D\). Suppose \(\bar{D}= \mathrm {LIN}( D, \rho _{1}, \rho _{-1} )\). Then,
where
Proof of Lemma 4
By Lemma 1,
Thus,
where
Alternately, we have
where
\(\square \)
Some examples illustrate the above result.
Example 6
For the case of class-conditional noise where \(\rho _1 \equiv \alpha , \rho _{-1} \equiv \beta \), \(\Delta _1 \equiv \Delta _2 \equiv 0\) and so we have the simpler expression
\[ \bar{\eta }( x ) - \bar{\eta }( x' ) = (1 - \alpha - \beta ) \cdot ( \eta ( x ) - \eta ( x' ) ), \]
from which order preservation is immediate.
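The class-conditional case is easy to confirm numerically: with constant flip rates, differences in \(\bar{\eta }\) are differences in \(\eta \) scaled by \(1 - \alpha - \beta \), so rankings coincide. The values of \(\alpha \), \(\beta \) and \(\eta \) below are assumptions chosen for illustration.

```python
alpha, beta = 0.3, 0.2                 # assumed constant flip rates, alpha + beta < 1
eta = [0.9, 0.1, 0.55, 0.4, 0.75]      # assumed clean class-probabilities
eta_bar = [(1.0 - alpha - beta) * e + beta for e in eta]

# Rankings of the instances under the clean and noisy class-probabilities.
rank_clean = sorted(range(len(eta)), key=lambda i: eta[i])
rank_noisy = sorted(range(len(eta)), key=lambda i: eta_bar[i])

# Largest deviation from the scaled-difference identity over all pairs.
max_gap = max(
    abs((eta_bar[i] - eta_bar[j]) - (1.0 - alpha - beta) * (eta[i] - eta[j]))
    for i in range(len(eta)) for j in range(len(eta))
)
print(rank_clean == rank_noisy, max_gap)
```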
Example 7
For the case of purely instance-dependent noise \(\mathrm {PIN}( D, \rho )\),
Thus,
Order preservation here will depend on the structure of \(\rho \).
For the BCN model, Lemma 4 can be converted to show that \(\bar{\eta }\) is a monotone transform of s, the underlying score used in the noise model; furthermore, we have a simple bound on the differences in \(\bar{\eta }\) values in terms of the corresponding difference in \(\eta \) values.
Lemma 5
Pick any distribution \(D\). Suppose \(\bar{D}= \mathrm {BCN}( D, f_{1}, f_{-1}, s )\) where \(( f_{1}, f_{-1}, s, \eta )\) are \(\mathrm {BCN}\)-admissible. Then,
where \(\rho _{\pm 1}( x ) = f_{\pm 1} \circ s\). Further, if \(s( x ) = s( x' )\), then \(\bar{\eta }( x ) = \bar{\eta }( x' )\).
Proof
(Proof of Lemma 5) Note that by Condition (a) of BCN-admissibility and Lemma 2, \(\eta = u \circ s\) for some nondecreasing u. For the BCN model, Lemma 4 is
where \(z = s( x ), z' = s( x' )\), and
Suppose that \(s( x ) = s( x' )\). Then clearly \(\rho _y( x ) = \rho _y( x' )\), implying that \(\Delta _1 \equiv \Delta _2 \equiv 0\), and also \(u( z ) = u( z' )\) by Corollary 2, so \(\bar{\eta }( x ) = \bar{\eta }( x' )\).
Suppose that \(s( x ) < s( x' )\) so that^{Footnote 4} \(\eta ( x ) \le \eta ( x' )\); or equivalently, \(z < z'\) so that \(u( z ) \le u( z' )\). Our goal is to show that \(\min ( \Delta _1( z, z' ), \Delta _2( z, z' ) ) \le 0\); this will imply the desired bound, since we can just use the tighter of the implied bounds on Eqs. 20 and 21. By Condition (c) of \(\mathrm {BCN}\)-admissibility, for any \(z < z'\),
or equivalently
Thus, since \(u( z ) \ge 0\), we have
and similarly,
We now argue why the minimum of these terms must be \(\le 0\). Consider the following three cases:

(a) Suppose \(f_{1}( z ) = f_{1}( z' )\). Then trivially both terms are \(\le 0\).
(b) Suppose \(f_{1}( z ) < f_{1}( z' )\). Then either \(u( z ) \le \frac{1}{2}\) or \(u( z' ) \le \frac{1}{2}\); if both u values are larger than \(\frac{1}{2}\), then by BCN-admissibility Condition (b) it must be true that \(f_{1}( z ) \ge f_{1}( z' )\), a contradiction. Thus either \(1 - 2 \cdot u( z ) \ge 0\) or \(1 - 2 \cdot u( z' ) \ge 0\), and so one of the terms must be \(\le 0\).
(c) Suppose \(f_{1}( z ) > f_{1}( z' )\). Then either \(u( z ) \ge \frac{1}{2}\) or \(u( z' ) \ge \frac{1}{2}\); if both u values are smaller than \(\frac{1}{2}\), then by BCN-admissibility Condition (b) it must be true that \(f_{1}( z ) \le f_{1}( z' )\), a contradiction. Thus either \(1 - 2 \cdot u( z ) \le 0\) or \(1 - 2 \cdot u( z' ) \le 0\), and so one of the terms must be \(\le 0\).
Thus, we conclude \(\min ( \Delta _1( z, z' ), \Delta _2( z, z' ) ) \le 0\), and so either
or
must be true; since \(\eta ( x ) - \eta ( x' ) \le 0\) and \(\max ( 1 - \rho _{1}( x ) - \rho _{-1}( x ), 1 - \rho _{1}( x' ) - \rho _{-1}( x' ) ) > 0\), this implies
Since \(\eta ( x ) - \eta ( x' ) \le 0\) and \(\max ( 1 - \rho _{1}( x ) - \rho _{-1}( x ), 1 - \rho _{1}( x' ) - \rho _{-1}( x' ) ) > 0\), we may bound the entire expression by 0, thus concluding that \(\bar{\eta }( x ) \le \bar{\eta }( x' )\).\(\square \)
An immediate consequence of Lemma 5 is that \(\bar{\eta }\) is order-preserving for the underlying scores.
Corollary 3
Suppose \(\bar{D}= \mathrm {BCN}( D, f_{1}, f_{-1}, s )\) where \(( f_{1}, f_{-1}, s, \eta )\) are \(\mathrm {BCN}\)-admissible. Then,
and so \(\bar{\eta }= \bar{u} \circ s\) for some nondecreasing \(\bar{u}\).
Proof
By Lemma 5, if \(s( x ) = s( x' )\) then \(\bar{\eta }( x ) = \bar{\eta }( x' )\). If \(s( x ) < s( x' )\) then \(\eta ( x ) \le \eta ( x' )\) by BCN-admissibility Condition (a). Further, \(1 - \rho _1( x ) - \rho _{-1}( x ) > 0\) by Assumption 2. Thus, \(\bar{\eta }( x ) - \bar{\eta }( x' ) \le 0\). The fact that \(\bar{\eta }= \bar{u} \circ s\) follows from Corollary 2.\(\square \)
Remark 1
By definition of BCN admissibility, \(\eta = u \circ s\) for some monotone u; and by Lemma 5, \(\bar{\eta }= \bar{u} \circ s\), for some monotone \(\bar{u}\). If we could establish that \(\bar{u}\) were strictly monotone, then we would immediately conclude \(\eta = u \circ \bar{u}^{-1} \circ \bar{\eta }\), which would establish Proposition 1. But this is not true in general; fortunately, \(\bar{u}\) is only constant when u is (owing to the explicit bound in Lemma 5), and so we are still able to write \(\eta = \phi \circ \bar{\eta }\) for some monotone \(\phi \).
Remark 2
Order preservation by itself does not let us establish an AUROC regret bound. We need the precise bound on the difference in \(\bar{\eta }\) values provided in Lemma 5 to quantify how much distortion is introduced relative to the difference in \(\eta \) values.
B.5: Class-probability estimation guarantees with the Isotron
We recall that the basic SLIsotron guarantee is as follows.
Proposition 6
(Kakade et al. 2011, Theorem 2) Pick any distribution \(D\) over \(\mathbb {B}^d \times \{ \pm 1 \}\) with^{Footnote 5} \(\eta \in \mathrm {SIM}( 1, W )\) for some \(W \in \mathbb {R}_+\). Let \(\{ \hat{\eta }_{\mathsf {S}, t} \}_{t = 1}^\infty \) denote the estimates of \(\eta \) produced at each iteration of SLIsotron, when applied to a training sample \(\mathsf {S} \sim D^m\). Then, for any \(\delta \in ( 0, 1 )\),
where
C: Failure of order preservation under \(\bar{\eta }\)
We illustrate that for noise models other than \(\mathrm {BCN}\), order preservation under \(\bar{\eta }\) is not guaranteed.
C.1: Failure of order preservation without Condition (c)
Order preservation is not guaranteed without Condition (c) of the \(\mathrm {BCN}\) model.
Example 8
Suppose \(f_{1}( z ) \equiv 0\), \(f_{-1}( z ) = a \cdot \llbracket z \le 0 \rrbracket \) for some \(a < 1\), and s is such that \(\eta ( x ) = \frac{1}{1 + e^{-s( x )}}\). Certainly \(( f_{1}, f_{-1}, s )\) satisfy the requisite Conditions (a), (b) of the \(\mathrm {BCN}\) model. However, \(f_1( z ) - f_{-1}( z )\) is nondecreasing, and so Condition (c) is not satisfied. It is easy to check that
\[ \bar{\eta }( x ) = (1 - a \cdot \llbracket s( x ) \le 0 \rrbracket ) \cdot \eta ( x ) + a \cdot \llbracket s( x ) \le 0 \rrbracket , \]
which is easily checked to not be monotone in \(z = s( x )\).
The difference \(\Delta ( z ) = f_1( z ) - f_{-1}( z )\) above is nondecreasing. Swapping the flip functions thus makes the function nonincreasing, satisfying Condition (c) of the \(\mathrm {BCN}\) model. We can confirm that in this case, \(\bar{\eta }\) will indeed be order-preserving for \(\eta \).
Example 9
Suppose \(f_{-1}( z ) \equiv 0\), \(f_{1}( z ) = a \cdot \llbracket z \le 0 \rrbracket \) for some \(a < 1\), and s is such that \(\eta ( x ) = \frac{1}{1 + e^{-s( x )}}\). Certainly \(( f_{1}, f_{-1}, s )\) satisfy the requisite Conditions (a), (b) of the \(\mathrm {BCN}\) model. It is easy to check that
\[ \bar{\eta }( x ) = (1 - a \cdot \llbracket s( x ) \le 0 \rrbracket ) \cdot \eta ( x ), \]
which is easily checked to be monotone in \(z = s( x )\).
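Both examples can be checked on a grid of scores. The sketch below assumes the reading in which Example 8 has \(f_{1} \equiv 0\) and \(f_{-1}( z ) = a \cdot \llbracket z \le 0 \rrbracket \), Example 9 swaps the two flip functions, and the link is the logistic function \(1 / (1 + e^{-z})\); the value \(a = 0.5\), the grid and the helper names are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def eta_bar(z, f_pos, f_neg):
    """Noisy class-probability as a function of the score z.

    f_pos(z) is the flip rate of clean positives, f_neg(z) that of clean negatives.
    """
    return (1.0 - f_pos(z) - f_neg(z)) * sigmoid(z) + f_neg(z)


def is_nondecreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


A = 0.5  # assumed noise level, a < 1


def step(z):
    return A if z <= 0 else 0.0


def zero(z):
    return 0.0


grid = [k / 4.0 for k in range(-12, 13)]               # scores from -3 to 3
example_8 = [eta_bar(z, zero, step) for z in grid]     # f_1 = 0, f_{-1} = step
example_9 = [eta_bar(z, step, zero) for z in grid]     # flip functions swapped

print(is_nondecreasing(example_8), is_nondecreasing(example_9))  # False True
```

In the first configuration the noisy class-probability jumps down just after \(z = 0\), because negatives with nonpositive scores are pushed towards the positive class; swapping the flip functions removes that jump.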
Condition (c) implies an asymmetry in the treatment of the positive and negative labels. When \(f_1 - f_{-1}\) is nondecreasing rather than nonincreasing, one may think to resolve this by simply swapping the roles of the positive and negative labels. Why is there an asymmetry, and why will this approach not work?
The reason is that the underlying score \(s^*\) is such that \(\eta = u \circ s^*\) for some nondecreasing \(u( \cdot )\), so that higher scores correspond to equal or higher probability of an example being positive. This already imposes some restriction on how the scores relate to the labels, and so the flip functions must respect this.
In particular, suppose \(f_1 - f_{-1}\) is nondecreasing, but we just relabel the positives as negatives and vice versa. Certainly then our new \(\tilde{f}_1 - \tilde{f}_{-1}\) on the relabelled positive and negative classes will be nonincreasing. However, we also have new class-probability \(\tilde{\eta } = \tilde{u} \circ s\), where now the link is nonincreasing. This means that s actually is reverse order preserving, and so we cannot conclude that the resulting \(\tilde{\bar{\eta }}\) will be order preserving for \(\eta \).
C.2: Failure of order preservation for the PIN model
For the PIN model, order preservation will not be guaranteed in general. This means that it does not suffice to merely remove dependence of the noise on the labels.
Example 10
Consider a model \(\mathrm {PIN}(D, \rho )\) where \(\rho ( x ) = \frac{1}{2} \cdot \eta ( x )\). This means that there is more noise for positive instances. Then, we have
\[ \bar{\eta }( x ) = (1 - \eta ( x )) \cdot \eta ( x ) + \frac{1}{2} \cdot \eta ( x ) = \eta ( x ) \cdot \left( \frac{3}{2} - \eta ( x ) \right) . \]
This will not be order preserving for \(\eta \), since \(\varphi ( z ) = z \cdot \left( \frac{3}{2} - z \right) \) is not monotone on [0, 1].
Example 10 violates Condition (b), illustrating why this is important for guaranteeing order preservation.
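A quick numerical check of this example: the map \(\varphi ( z ) = z \cdot ( \frac{3}{2} - z )\) increases up to \(z = \frac{3}{4}\) and decreases afterwards, so two instances with clean class-probabilities 0.75 and 1.0 are ranked in the opposite order by \(\bar{\eta }\). The particular values below are chosen for illustration.

```python
def phi(z):
    """Noisy class-probability as a function of the clean one when rho = eta / 2."""
    return z * (1.5 - z)


eta = [0.25, 0.75, 1.0]
eta_bar = [phi(e) for e in eta]
print(eta_bar)  # [0.3125, 0.5625, 0.5]

# The clean ranking puts the last instance on top; the noisy ranking does not.
clean_top = max(range(len(eta)), key=lambda i: eta[i])
noisy_top = max(range(len(eta)), key=lambda i: eta_bar[i])
print(clean_top, noisy_top)
```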
Cite this article
Menon, A.K., van Rooyen, B. & Natarajan, N. Learning from binary labels with instance-dependent noise. Mach Learn 107, 1561–1595 (2018). https://doi.org/10.1007/s10994-018-5715-3
Keywords
 Label noise
 Instance-dependent noise
 Consistency