Ensemble with estimation: seeking for optimization in class noisy data

Class noise, as know as the mislabeled data in training set, can lead to poor accuracy in classification no matter what machine learning methods are used. A reasonable estimation of class noise has a significant impact on the performance of learning methods. However, the error in existing estimation is inevitable theoretically and infer the performance of optimal classifier trained on noisy data. Instead of seeking a single optimal classifier on noisy data, in this work, we use a set of weak classifiers, which are caused by negative impacts of noisy data, to learn an ensemble strong classifier which is based on the training error and estimation of class noise. By this strategy, the proposed ensemble with estimation method overcomes the gap between the estimation and true distribution of class noise. Our proposed method does not require any a priori knowledge about class noises. We prove that the optimal ensemble classifier on the noisy distribution can approximate the optimal classifier on the clean distribution when the training set grows. Comparisons with existing algorithms show that our methods outperform state-of-the-art approaches on a large number of benchmark datasets in different domains. Both the theoretical analysis and the experimental result reveal that our method can improve the performance, works well on clean data and is robust on the algorithm parameter.


Introduction
Typical machine learning method uses a classifier learned from a labeled dataset (i.e., the training data) to predict the class labels of new samples (i.e., the testing data). In most of classification applications, labels of the training data are assumed correct. However, real-world datasets often contain noise which may occur either in the features of the data, defined as the attribute noise, or in the labels of the data, defined as the class noise.
Many studies have focused on handling attribute noise since it is quite common in machine learning and data mining tasks. However, researchers, such as [1,2], have indicated that class noise can be potentially more detrimental than attribute noise. The study on class noise problem has an essential impact on classification performance improvement [1]. We must point out that class noise is unavoidable in many real world applications such as disease prediction in medical applications [3], food labeling for the food industry [4], and manual data labeling in some natural language processing applications [5][6][7].
Generally speaking, there are two types of strategies to deal with class noise. The first entails learning with noisy data and the second is based on noise elimination.
Learning with noise assumes that each training sample is assigned to a weight based on an estimated probability of class noise, that is, the class noise rate. The learning algorithm will consider the class noise rate while learning from the original noisy training data [8]. Unfortunately, this method requires a priori knowledge of the class noise rate in the training data.
Noise elimination strategy attempts to detect and remove erroneous data from the training set [9][10][11]. This method can reduce the rate of class noise in the training data, yet it often leads to an overall reduction of training samples.
Even though the reduced training data may be less noisy, reduced training data may result in reduced performance of a learning algorithm compared to the result from using the original noisy data. Therefore, good noise estimation methods are important to avoid overrating class noises which can lead to large data reduction. One successful approach for class noise elimination is called kernel density estimation including methods such as k-nearest neighbors (kNN) or Parzen Window to detect and remove noisy data [10,12,13]. However, there is a theoretical flaw in these methods. Their work reposes on kernel density estimation, which requires a small radius of neighborhood , to satisfy both the manifold assumption and the central-limit theorem which requires a large to estimate the parameter for Guassian distribution. The contradicting assumptions will limit the performance of their methods in applications.
In this paper, we propose a novel method to estimate the noise rate for each training sample. In order to avoid the overrating of class noise rate and the contradict assumptions, we introduce the sum of Rademacher distribution [14] instead of the central-limit theorem to estimate class noise. We choose the kNN graph for the kernel density estimation because it is more sensitive to class noise.
Based on this noise estimation method, we then modify a given surrogate loss function based on the estimated class noise rate. The modified surrogate loss function is the optimization objective of the classifier. According to the theoretical analysis, this loss function is sensitive to the parameter in the estimation algorithm. So, we propose a sampling based algorithms to obtain a strong classifier through a series of weak classifiers as an ensemble to overcome the sensibility of the parameter,which is adopted to optimize the loss function on the training data. Traditionally, the ensemble method is not suitable to handle class noise because it will also enhance the noise in the learning process [15,16]. However, in our method, we take into consideration of the noise rate and make the algorithm to adapt the noisy distribution. Both the theoretical analysis and the experimental result reveal that our method can improve the performance, works well on clean data and is robust on the algorithm parameter.
The main contributions of this paper are: • A class noise estimation approach is proposed based and a new weighted loss function is given; • The proposed loss function based on the estimated class noise experimentally demonstrates better class noise estimation performance than the existing popular algorithms; • In comparison with the existing methods of noise estimation, our approach requires no a priori knowledge about the class noise and it makes up for the contradiction of the currently used theory. Performance evaluation also show that we can indeed achieve state-of-the art results.
The remainder of this paper is organized as follows. Section 2 provides a review of related works on class noise estimation and the two types of strategies to deal with class noise. Section 3 presents the problem setup and background. Sections 4 and 5 present our proposed class noise estimation method and our method to incorporate the estimation into class noise elimination, respectively. Section 6 shows the performance evaluation on real-world public datasets in different domains. Section 7 lists our conclusion and future work.

Related work
Identifying class noise is an important issue in machine learning. Previous publications cover both the theory [17] and the application aspects [12,13] of this topic. Especially, in recent years, with the development of deep learning, how to train a rubost neural network becomes a new hot topic in the designing of learning architecture [18,[18][19][20][21]. In this section, we briefly introduce related works from three perspectives: the source of class noise, the handling of class noise and the application of class noise. Class noises can exist for different reasons. When used for disease prediction in medical applications [3], training data contains a probability of false positive or negative because data comes from medical experiments. In other words, class noises naturally exist and cannot be avoided. Food labeling for the food industry [4] also faces class noise problem. As shown by [4], beef has a higher price than mutton in some countries in South Africa. Miss-labeling is thus an aforethought to achieve more benefit. The manual labeling of data in some natural language processing applications [5][6][7] also contains class noise because there is always a possibility of inter-annotator inconsistency.
Given x, being the set of features of a sample, let ỹ be the observed label of x; y the true label of the sample x and p is the probability to flip the true label into a noise label (thus p is the class noise rate, or noise rate for short). For any training algorithm, only x and ỹ can be observed. In general, class noises can be simply categorized into three different models based the dependence of noise to y and p [22]. The first model in is called the noise completely at random model [17,23,24]. The basic idea is that the class noise rate of a sample is completely random and independent of the labels and the feature set. Thus, an observed label is only determined by the true label and the class noise rate. The second model shown is called the Noise at Random Model [25][26][27]. In this model, the class noise rate is dependent on the true label of a sample and is independent of the feature set of the sample. In other words, different labels have different probabilities to flip to the wrong label. The third model shown is the noise not at random model [28,29]. This model assumes the class noise rate should be affected by both the label and the feature set of the sample. Informally, this model can be described as: a sample will be miss-labeled to the most similar category.
The theoretical research on class noise and learning was first proposed by Angluin and Laird in 1988 [17]. In their work, all instances of the labeled data for binary classification have a uniform invert probability p ∈ [0, 1∕2) . This is referred to as the random classification noise.
Class noises are typically assumed to be stochastic in algorithms that can handle class noises. The work by Ref. [8] assumed a learned noise rate from a priori knowledge and every sample was given the same probability, a simple assumption that may not be reasonable in all scenarios. The Cut Edge Weight Statistic method [10] also required prior assumed noise rate. This method used the prior probability as a hypothesis to test if the training sample satisfies the null hypothesis. The method also required the neighbors of a training sample to follow the central limit theorem, which is not reasonable because the set of neighbors are too small. Other works simply did not consider noise rate [5,9,11,30,31].
There are two basic categories of strategies to deal with class noise in training data: either learning with class noise [5,8] or class noise elimination [9][10][11]. The learning with noise strategy approximates a distribution of noiseless training data using the distribution of the original training data with class noise and must have a priori knowledge about the class noise [8]. The problem, however, is that a priori knowledge of the class noise is not typically available, limiting the applicability of this method. Li [24] uses the Kernel Fisher method to estimate the class noise. The estimation is then used to use a robust algorithm that can tolerate noise.
The class noise elimination strategy attempts to detect the samples with high noise probabilities and remove them from the training set. There are different methods to detect class noise and they can be categorized as classification-based methods and graph-based methods. The classification-based method was first proposed by Brodley in 1999 [9]. He used k-fold cross-validation to split the training set into k subsets, and used any k − 1 sets as the training data to classify the remaining data. If the classification result for a sample is different from the original label, that sample is considered class noise and is removed. Zhu et al. proposed a more efficient algorithm suitable for large datasets [30]. Zhu also proposed a cost-sensitive approach based on k-fold crossvalidation [1]. Sluba employed a 10-fold cross-validation to detect class noise [11]. The graph-based method, known as the Cut Edge Weight Statistic method, was proposed by Zighed [10]. The principal idea is based on the manifold assumption and Bernoulli distribution assumption. A similar approach was proposed by Jiang and Zhou, who used a kNN graph to detect class noise without considering noise rate [32]. There are three major problems with the elimination approach. First, some correctly labeled training data can be eliminated because of potentially inaccurate class noise identification. Second, the number of samples in the training data will be reduced, potentially leading to an adverse effect on the learning algorithm performance. Third, the manifold assumption requires small k and the sum of Bernoulli distribution converges to Gaussian distribution if and only if k is larger than 25. The paradox leads to a limited performance gain in the class noise estimation. In elimination-based methods, reliable noise estimation is crucial and inaccurate estimations can potentially degrade performance compared to no noise elimination.
Some works also use some revision to the AdaBoost learning algorithm to handle noise data [33][34][35]. In principle, AdaBoost [36] is not a suitable learning algorithm for noisy data. The reason is that during the learning process AdaBoost will enhance the misclassified training samples in the next iteration. For a noisy sample, which is actually classified correctly, may be regarded as a misclassified sample due to the noisy label to lead to worsened performance [15,16]. However, some strategies can be used to smooth out weight updating in the learning steps of AdaBoost to avoid over-fitting on noisy data [33,34]. Another method simply allows boosting to miss-classify some of the training samples to obtain a more robust boosting [35]. There are also adaptive methods by using a confidence score and removing a sample if its noise estimation confidence is higher than a threshold [37][38][39].

Problem setup and background
Before presenting our proposed algorithm, we need to explain the background and the problem setup first. The fundamental problem is that each sample in the training data has the probability to be a miss-labeled sample, which means that the samples have class noise. Hence, we need to estimate the probability for the miss-labeling of each training sample, referred to as the class noise rate in the rest of this paper. After that, we can train a classifier on the noisy training data based on the estimated probability.
The second problem is how to measure the learning ability of the classifier with the obtained noise rate? Is it possible that our classifier, trained on noisy distribution, can achieve similar performance to the classifier trained on the clean distribution? If the answer to this question is yes, we can then use the differential between the two classifiers to measure the learning ability of the classifier trained on noisy distribution. If the classifier trained on noisy distribution is the optimal classifier on the clean distribution, this differential should be a small number theoretically.
In this section, we first give a strict formalized definition of the class noise rate, followed by the learning ability measure of the classifiers.
Let D denote the clean distribution without class noise, and (x 1 , y 1 ), (x 2 , y 2 ), … , (x n , y n ) denote n training samples from D with true binary label y i (y i = ±1, i = 1, 2, … , n) When there are class noises, D denotes the observed distribution which contains class noise, and the training sample from a noisy distribution D are (x 1 ,ỹ 1 ), (x 2 ,ỹ 2 ), … , (x n ,ỹ n ) , where the label ỹ i may be different from the true label y i . In this paper, we want to estimate noise rate at the level of the individual samples. Here, we give the definition of class noise rate as follows: ∈D be a sample from the noisy distribution.The class noise rate is the probability of the observed label different from the true label of x i , denoted by P(ỹ i ≠ y i |x i ).
According this definition, class noise rate defines on individual data samples. In other words, we allow different samples to have different noise rate.Thus, the first issue is how to estimate the class noise rate for each training sample. The main challenge, however, is that a learner can only observe the noisy data (x i ,ỹ i ) ∈D and there is no a priori knowledge about the class noise rate and the clean distribution D.
For the time being, let us assume that we have a reasonable estimation method. Then the second issue is how to measure the performance of a classifier. Generally speaking, the objective of a classifier is to minimize a given loss function on a given set of training data. However, in this paper, the observed training data contains class noise. The minimized loss function on noisy training data may not be the minimized loss function on the clean data. So the issue is whether we can use the estimation of the class noise rate for each individual sample to modify the loss function on the noisy distribution so that the modified loss function can also minimize the loss on the clean distribution.
In order to address the problem in a formally, we give some definitions below.
The risk of f for each sample on the clean distribution is 0-1 loss given by Let l(f(x), y) denote a loss function with a real-value prediction, for the clean distribution where y = ±1 is the true label of x. We can then use the estimation of the class noise rate for each individual sample to modify the loss function on the noisy distribution with an observed label, denoted as l (f (x),ỹ) . The modified loss function is marked with a hat ⋅ . Because the loss function is defined under the noisy distribution D . Then, we can define three related risks as follows: The empirical l -risk on the training data: (Here, the hat ⋅ means that the marked object is a estimated result. The hat ⋅ means that the noisy label will influence the marked object.) Here the empirical l -risk R̃l (f ) is is the expected error of the trained classifier on noisy distribution, and the expectation of l-risk R l,D (f ) is the expected error of a classifier training on clean distribution. The learning ability of a training classifier is the difference between the two risks: |R̃l(f ) − R l,D (f )| It indicates the distance between our trained classifier on noisy data and the optimal classifier on the clean distribution. The objective of our algorithm is to make |R̃l(f ) − R l,D (f )| approaching 0 when the size of noisy training set grows.

Class noise estimation
In this section, we deploy a class noise estimation method proposed in our previous work [40]. Due to the length limitation, we only introduce the basic idea and theorem. The details of mathematical proof can be found in paper [40]. The idea is based on the kernel density estimation method. In this method, the label of an individual sample should be similar to the most similar neighbors even if there is class noise in the data. Therefore, we first present a class noise model based on the kernel density estimation method. We mainly introduce the kNN graph as kernel density estimation method, and we will also introduce the formula based on Parzen Window (known as e-graph).Since the kernel density estimation method is sensitive to class noise and it can overrate the class noise. To avoid overrating of noises, we introduce a loose distribution called the Sum of Random Noise in this section. As will be seen in the evaluation, our method is more reasonable theoretically and shows a higher performance in the experiments when compare to the existing estimation methods [10,12,13].
For any (x i ,ỹ i ) from the training set with true label y i , we can define a sign function on the individual sample (x i ,ỹ i ) and its k nearest samples (x j ,ỹ j ): (1) I ij indicates the difference between an individual sample and its nearest labels. Even ỹ j can also contain noise with some probability, the noise level of ỹ i should still be similar to that of ỹ j . Furthermore, the noise estimation method should also consider similarity. Thus, we define a statistic, S i , referred to as the sum of noisy similarity, which can be used to measure the total noise level. (1) and a normalized similarity w ij , the sum of noisy similarity is:

Definition 4 For any individual sample
Now, we can define the class noise rate for (x i ,ỹ i ) as the probability of S i being opposite from SRN. Because I ij = 1 indicates that the sample x i has different label to its nearest neighbor and the similarity metric is between 0 to 1, the larger S i is, the higher the probability it should be that x i has a noisy label. So, we should only consider the upper quantile of SRN here. Now, we can give the class noise rate under Definition 5.

Definition 5
For any individual sample (x i ,ỹ i ) ∈D and the sum of noisy similarity S i , the probability of SRN denoted as P SRN , the class noise rate of ( The formal definition above would be easier to understand using the following explanation. If the principle of entropy maximum reveals a best guessed label, the upper quantile of SRN is the probability of the individual sample having a "worse" label than a guessed one. So Definition 5 can be presented as a descriptive definition as: the class noise rate of a sample is the probability of the observed label being worse than a guessed label. Definition 5 defines the class noise rate of the sample (x i ,ỹ i ) as the probability of S i not following SRN. We can explain the definition in a different way. If the label of a training samples totally random under PEM, the label of the training sample can be considered as a label from guessing. Then we can get the distribution of S i from a "guessing result". If the individual sample (x i ,ỹ i ) contains class noise, the corresponding S i should be larger than the "guessing result". Thus, we can say the probability of S i does not follow SRN. The formal theorem is given below: Here ∥ ⋅ ∥ 1 and ∥ ⋅ ∥ 2 are the L 1 and L 2 norms; and is well known as the K-method of real interpolation for Banach Space [41]. The proof of Lemma 1. was given by [14] in details. According to Lemma 1, we can easily get the Lemma 2 by a sub-optimal solution of K 1,2 (w i , t).
The proof of Lemma 2 is given in our previouse work [40].
According to Definition 4, the estimated class noise rate For noise detection, the lower boundary is more significant since it pertains to the minimum noise rate of a labeled training sample. In practice, we can then use as the estimation for each training sample. Here, can be either be incorporated in the learning algorithms to weight the importance of a training sample or used to identify for elimination. It should be noticed that r(x i ) is symmetrical on S i . It means that S i is less than 0, which indicates the label of the individual is similar to the nearest label, the upper quantile of S i needs to be solved by the lower quantile of SRN which is introduced by opposite sign function.

Learning in noise data
In this section, we introduce our learning algorithm based on the estimated class noise rate. We propose to modify a given surrogate loss function based on the estimated class noise rate. The modified surrogate loss function is the optimization objective of the classifier. In this section, two training algorithms are used to optimize the loss function on the training data. One is a perceptron based method, which is based on the learning with noise strategy and aims to reduce the impact of noisy data. The other one is a sampling based Adaboost method, which is based on the class noise elimination strategy and focuses on selecting high quality training data rather than to identify low quality data.

Class noise estimation based on loss function
Based on the class noise estimation given in Formula (5), we propose a surrogate loss function based on the estimated class noise rate. The key is to ensure that the surrogate loss function can adequately approximate the loss function for the clean training data. Then the loss function on the observed distribution is defined as Without loss of generality, we do not specify any particular loss function on the observed distribution. Theoretically, any loss function can be used in Formula (6) to modified a surrogate loss function. In Formula (6), the numerator is formed by two parts. The first part is the original loss function with the observed labels weighted by their label correctness probabilities. The second is a penalty for the loss function with an inverted label (i.e., the probability of the observed label is incorrect) weighted by the class noise rate. The denominator is based on the average class noise rate to ensure that the expectation of loss on noisy training data approximates the expectation of loss on the clean data. This is an updated version of the original loss function on noisy data. Our question is whether the minimum risk of the proposed surrogate loss function on the noisy training data can approximate the minimum risk of the original loss function on the clean data. In general, the main question is whether we can train a classifier for the cleaning distribution by the noisy data only without any prior knowledge? Let us assume that the training data is clean. Then, the risk of the loss function on the clean distribution and noisy distribution are defined under Definitions 3.1, 3.2 and 3.3. Based on these definitions, the main question can be split into two questions: • Will R̃l (f ) converge to R̃l ,D (f ) under the noisy distribution when n grows? If the answer is "Yes", it means that we could train a stable classifier on the noisy training data. • Will R̃l ,D (f ) converge to R l,D (f ) under the clean distribution? If the answer is "Yes", it means that the trained classifier by noisy training data is also the optimal classifier on the clean distribution.
For the frist question, we can use the Chebyshev law of large numbers to prove. Given n independent (x i ,ỹ i ) , P c (x i ) denotes the class noise rate estimated by Formula (5)  For the second question, the proof is given in detail by Ref. [8], which will not repeated here, Based on the proof in Ref. [8], the following inequality holds: with probability of at least 1 − .
Here, f * is the minimizer of R l,D (f ) on the clean distribution, f p is the minimizer of R l,D (f p ) on the clean distribution and where i is an independent and identically distributed Rademancher random variable. The right side of the inequality in Formula (7) is bound by the richness of the function family F and the sample size of n.
Thus the risk of our estimation method R l,D (f ) satisfies: Because the second item is bound, the risk of our estimation is determined by the first item which is related to the estimation result. This can be interpreted as that the performance of the classifier trained on the noisy data is determined by the estimation result when the size of the training set grows.

Learning in class noise
According to Sect. 5.1, the basic idea is to use the surrogate loss function defined in Formula (6) to train a classifier on the noisy training data. In a previous work [40], a simple Perceptron based on-line learning method is used with noisy class data. However, the theoretical analysis has revealed that the result of the surrogate loss function would be affected by the estimated class noise rate. It is shown in Formula (8).
In Formula (8), |R l,D (f p ) − R l,D (f * )| is dependent on the size of the training dataset. Theoretically speaking, when the size of the training set grows, this absolute value will approximate to 0. It means that the risk of the surrogate loss function is decided by |R l,D (f ) − R l,D (f p )| , which is related to the estimation result. If we have a good estimation, this item should be small. Otherwise, we will face the problem with a risky surrogate loss function.
According to theoretical analysis, direct use of a perceptron or any other linear optimizer as the basic classifier (proposed in our previous work [40]) can face one problem: when the estimation is unreliable, the performance of the surrogate loss function may be limited because the first item in |R l,D (f ) − R l,D (f p )| may not be sufficiently small. Since the estimation method proposed in Section 4 is based on kNN graph, estimation is related to the parameter k and the similarity measure. The incorrect parameter of the similarity measure may misguide the estimation result, and the error in the estimation will affect the learning algorithm according to our analysis.
Hence we propose an Adaboost based method based on the class noise elimination strategy to whittle the impact of estimation. The reason is that Adaboost makes use of a bag of weak classifiers to achieve a better classification by enhancing the "misclassified" sample in the learning process. However, for a noisy data, "misclassified" by a base classifier may actually indicate a good performance because the observed label is incorrect. These samples should be given a lower weight since they have been trained well. If we give these samples a higher weight as traditional AdaBoost, it may lead to worsened performance because the algorithm enhances the noisy data with noisy label causing overfitting on these samples.
In our method, we solve the problem by adjusting the weight. The basic idea is to use the surrogate loss function given in Formula (6) as the objective function. We then use the AdaBoost.M1 [36] method to achieve optimal result on the noisy training data. In the learning step, we use the surrogate loss function to avoid overfitting on the noisy data without the need to eliminate any training sample. In the optimization, the samples with high class noise probability is given a low weight when "misclassified" by the base classifier so that even if the noisy sample is "misclassified" in the training, it will not obtain a higher weight in the next iteration to enhance learning. By using this strategy, our algorithm can adaptively handle class noise in the training instead of overfitting on noisy data. The pseudo code is shown in Algorithm 1. The weight of each sample is initialized to the probability of the correct label. Then, this weight is used to sample the training data. That is, we use the samples seemed "correct" as the training data to train the base classifier. Based on the error of the whole dataset, we will update the weight to make sure that the sample with high correct label probability but misclassified will have a high weight so that it will be added into training in the next round.
Next, we need to provide the training error of Algorithm 1. Since the observed labels are noisy, we can only get the training error on noisy data. That is, 1 We need to prove that the error on the clean data is also optimized in our algorithm. Reference [42] has given the boundary of training error for Adaboost.M1 on the clean training data. Take the training error in each iteration j as j , take t = 1∕2 − j . If ∀ j > 0 , ∃ j > . Then the training error is: However, in our task, we cannot observe the correct label of training data and we do not know the training error in each iteration neither. So, we cannot obtain the training error on the cleaning data directly. Fortunately, Formula (8) reveals that the gap between the training error on the clean data and the noisy data is decided by the estimation result. In each iteration, if we define the training error on the clean data as: j = h j (x i ) ≠ y i , and the training error on the noisy data as: ̃ j = h j (x i ) ≠ỹ i . There is a positive real number g to indicate the gap between the two errors: | j −̃ j | ≤ g . So, we can obtain the inequality: According to the Hoeffding boundary from Angluin and Laird [17], we have 1 2 − j ≥ 0 here. It means that if we take t = 1∕2 − j , then ∀ j > 0 , ∃ j > such that the training error is: This means that the classifier can be optimized on the noisy distribution and achieves an optimal result on the clean distribution.

Experimental setup
Our experiments evaluate the performance of the noise estimation method and show its usefulness in improving the learning performance. The evaluations are based on experiments with varying class-noise rates and training-set sizes compared to other state-of-the-art systems. We use seven public datasets for binary classification with different classnoise rates: (1) the LEU [43] set of cancer data, we reduce the dimensions into 20 by PCA; (2) the Splice for DNA sequence splice-junction classification; (3) the UCI Adult dataset collection containing seven subsets (referred to as UCI.a1a to UCI.a7a in this paper) of independent training and testing data [44]; (4) the DBWorld e-mails DataSet in English (in short, DB), which consists of 64 e-mails manually collected from DBWorld mailing list and are classified into two classes: "announces of conferences" and "everything else"; (5) the Farm Ads DataSet in English (in short, FADS), which is collected from text ads found on twelve websites that deal with various farm animal related topics with binary labels based on whether the content owner approves of the ad; (6) the Twitter Dataset for Arabic Sentiment Analysis Dataset (in short, TDA), the class labels are opinion polarity; (7) the Product Reviews from Amazon in three categories, Book, DVD and Music (in short, PRA) with class labels being the opinion polarities. (8) Banknote is the banknote authentication Data Set. Data were extracted from images that were taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. (9) Haberman contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer [45]. (10) ILPD contains 416 liver patient records and 167 non liver patient records [46]. (11) QSAR contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes (ready and not ready biodegradable) [47]. (12) SPEC cardiac Single Proton Emission Computed Tomography (SPECT) images. Each

3
patient classified into two categories: normal and abnormal [48].
The datasets (1) and (2) can be downloaded from the website of LibSVM 1 . The datasets (3) to (6) and (8) to (12) are from UCI 2 . Dataset (7) is from NLPCC 2013 cross lingual opinion analysis evaluation task. Datasets (4)-(7) are text data. The size, class ratio, types and dimensions of the datasets are listed Table 1.
To introduce class noise into the training set, we stochastically invert the binary labels of training samples with probability of 10%, 20%, and 30%. For datasets (1)-(3), we train the binary classification algorithm on the inverted noisy data. The algorithm is tuned on a development set before being used on the testing set. We use SVM light3 as the basic classifier for this experiment. We choose two kind of similarity measures, the cosine similarity for text data, and the Euclidean Distance based similarity for other data. The parameter k in the kNN graph is 5, set experimentally. For datasets (4)- (11), the experiment result is from a fivefolds cross-validation as they do not have separate testing data.
We compare the performance of ouralgorithm with other state-of-the-art learning algorithms for noisy data: 1. LiC [40]: a Perceptron based on-line learning method from our previous work ; 2. NHERD [49]: a widely used open source robust method; 3. log [8] : a log loss method using the same loss function as LiC; 4. IMTD [9]: a method using [9]; 5. CEWS [10]: a method using cut edge weight statistics.
Furthermore, in order to see the necessity to use an ensemble or not, we also compare our methods with: 6. the original SVMs; 7. an ensemble SVMs without class noise handling; 8. an e-graph based method (two samples will be linked if the similarity between them is larger than e, including three different parameters, e = 0.2, 0.1 and 0.05); 9. Laplace distribution based CEWS; Notice that log , CEWS, and Laplace distribution based CEWS require prior knowledge of class-noise rate in training data. Thus, they are more appropriate for use as benchmarks rather than direct comparison to other algorithms, and should be discussed separately. Table 2 shows the performance of our proposed algorithms compared to the other state-of-the-art methods using the 12 datasets. Results show that our proposed algorithms outperform other algorithms in most of the cases. Both the macro average and the weighted micro average over the size of the different class labels clearly show that our algorithm outperform all other methods. Note that log , CEWS, and Laplace are provided with the class-noise rate. So the comparison to our method is not completely fair. Even with provided class-noise rate to log , CEWS, and Laplace, they outperform LiC and AdaBC only for the relatively low class noise levels (10% and 20%) in two to three datasets only. This is because these methods require all samples to have the same class-noise rate for the probability weighting. For the tiny training sets, such as LEU which has only 37 training samples and DB which has only 64 samples for training, the size effect is very prominent. The noisy data elimination based method does not work well in these data because the removal of noisy data also removed useful training data. This is particularly true when the class noise rate increases to 20% and 30%. Obviously, the high percentage of noise has a big effect on the training data. Most methods perform well on 10% class-noise rate but have large degradation in the 20% and the 30% class-noise rates. Different from these methods, LiC performs well and shows a significant advantage in these two sets of data at all three levels of classnoise rate. This is because when the training data is small, the size of the training data is more important than the quality, LiC does not remove any training data. Thus, it can make full use of the data for training. Due to its good class noise estimation, LiC also shows a better performance than log even though log does not remove any training data either.

Comparison with other methods
When the training set grows, such as Splice which has 1000 samples or TDA which has 2000 samples, the quality of training samples becomes much more important. In these relatively large datasets, AdaBC shows better performance than the other methods. When the training set size becomes larger, the advantages of our methods become even more obviously. For example, the UCI.Adult at most 16,100 samples, or the review text from Amazon with 4000 samples. When the noisy level is at the 30% level, AdaBC can achieve a 5% higher accuracy than other methods which can be shown in the micro average of accuracy with 30% class noise.
We also compare the performance of SVMs and Boosting of SVMs on noisy data with other methods.When class noise rate is low, the class noise handing approach does not show significant performance gain compared to SVMs. But, as class noise increases, our proposed method and other class noise handling approach start to work and have marked improvements over SVMs [22]. have claimed that IMTD is cheap and easy toimplement.However,it is also likely to remove a substantialamount of data. CEWS has similar problem. When the class noise is 30%, which is at a high level, our proposed method is better than other methods.
Note that the e-graph based method and the Laplace based CEWS are also compared to our proposed methods.
Since an e-graph does not need the ranking processing, it is more convenient than a knn graph. However, experimental results shows that the value of e should be different for different tasks. How to choose a reasonable e is an important issue. This is even more serious in our experiments because of the diversity of the datasets. By tuning e for each task, we did not get any better performance compared to LiC nor to AdaBC. This, again, shows that knn graph is a better choice because k is much easier to find. For the Laplace based CEWS, it uses the Laplace distribution and the Bernoulli distribution both are similar to Gaussian distribution. Since CEWS is a discretized result of these distributions, the performance of this method is quite similar to the original CEWS.
Generally speaking, AdaBC shows a more stable performance in most cases as shown in Table.IV. For the cases which AdaBC cannot achieve the best performance, they are at least as competitive as the other methods. Considering the performance across all datasets, our proposed AdaBC method has achieved the state-of-the-art performance.

Further discussion and experimental analysis
We further answer remaining questions: in this section. (1) "Can we use kNN directly?"; (2)"If the SRN distribution seems to be similar to the Gaussian distribution, what is the difference between them?"; (3) "Since the performance of AdaBC looks similar to LiC, is it necessary to use AdaBC?"; and (4)"Can we use the estimation method on the clean data?".

kNN, SRN and Gaussian
In the estimation method proposed in section 4, kNN graph is to comparean individual sample with its most similar neighbors. kNN graph is used because it is sensitive to class noise [50,51]. The Gaussian distribution based estimation method also used kNN graph [10,12]. To see the differences, we conduct the following experiment to estimate the class noise rate for each individual sample by different methods on the UCI.Adult dataset. The parameter k is 5 in this experiment because a small k always leads to better performance. After the estimation, the mean value of the class noise rate for each disagreement in kNN is shown in Fig. 1. For the kNN method, we take the class noise rate equal to 0 when the number of different labels is less than 3, and the class noise rate is 1 otherwise. In Fig. 1a, the Gaussian based method is sensitive to class noise. When the number of different labels is 2, the probability of class noise estimation is near 95%. If this result is used as a probability, it will lead to an overrated class noise rate. Note that in Refs. [10,12], the author takes the Gaussian distribution based method as a hypothesis testing method, and the threshold is indeed 0.95. If the individual sample achieves a higher p value, 1 3 Fig. 1 Distribution of three estimation results (a 10% class noise, b 20% class noise, c 30% class noise) and Different k in the kNN graph for the two methods (d 10% class noise, e 20% class noise, f 30% class noise) the individual sample will be considered as a noisy sample. We should note that, the threshold of an estimation method is the same as the experimental threshold for kNN. Due to the sensibility of kNN to class noise, this threshold is an overrated result for class noise rate. It will lead to misclassified noisy samples, which have the correct label. When the class noise increasing, the threshold of kNN and Gaussian distribution based method do not change. However, SRN is quite different. First, SRN will not over-rate class noise. It is obvious that even if there is no class noise, the number of disagreement labels can still be 3 when k = 5 in kNN graph. So SRN gives quite a reasonable estimation according to common sense. Another benefit is that, when the class-noise rate increases, SRN also gives a higher estimation of classnoise as indicated in Fig. 1a-c, with increasing noise levels.
From the experiment above, we can see that the estimation method proposed in this paper is more flexible. Only when most of neighbors have different labels, the individual sample is considered a noisy sample with high probability. If the training data has high probability of noise label, the number of disagreements should be high. In addition, comparing to kNN and Gaussian distribution, SRN gives a more reasonable estimation based on our analysis above. That is why the performance of our method is better than CEWS (Gaussian distribution based method) in the experiment.We do not compare SRN with the original kNN based method because the performance of kNN is similar to CEWS, and CEWS is a more sound method theoretically speaking.

Lic v.s AdaBC on different parameters
In the experiments of Sect. 4. C, LiC achieves better performance in most datasets than AdaBC. Then, why we need the AdaBoost based method, which seems to be more complicated? Figure 1 shows the set of performance evaluations of CUI. Adult using different k of the kNN graph under 10%, 20% and 30% class noise, respectively. Figure 1 shows very clearly that when k increases, the performance of LiC shows a sharp degradation. That is the gap between the noisy training data and clean training data revealed by the theoretical analysis. The performance of LiC is dependent on the estimation result. In fact, it is a well-known conclusion that the best k in kNN should be no larger than 5 (in most textbook of A.I or machine learning such as [52]), or the precision of this method will be limited and proven in Fig. 1 here again. When k is no larger than 5, LiC achieves the best performance among all k values.
In the Fig. 1d-f the gap between the two methods becomes larger when k increases, it also provides proof of our analysis. AdaBC, seems to be quite robust to k value.This is because even if the estimation is wrong, it still reveals some truth of the clean distribution, and the adaptive sampling method ensemble a series of weak classifiers into a strong classifier.The error boundary in Formula (9) also is affected by the class noise rate. The low class noise rate does less harm to the performance of the base classifier obviously. That is why AdaBC is robust to the k value.

Stability of LiC and AdaBC
Both LiC and AdaBC are iteration based learning algorithms. With the growth of iterations, the risk of overfitting on the training data becomes higher. With noisy training data, overfitting will certainly lead to misclassification directly. In the following set of experiments, we compare LiC and AdaBC with iterations as the variable to identify the stability of the two algorithms. Figure 2a-c show the performance of LiC at noise level of 10%, 20%, and 30% on the UCI.Adult data, respectively. Since LiC is a perceptron based method, we also care about the effect of different learning rates. We take three different learning rate of 0.02, 0.01 and 0.005 in this experiment. Different learning rate can achieve similar top performance. It is also obvious that a smaller learning rate picks up performance slower, but it will outperform the higher learning rate after iteration 5. The performance gain with smaller learning rate is much more obvious when the noise level increases. In fact, for the 30% class noise case, there is a 4% gap between different learning rates. Figure 2d-f show the boxplots of the respective noise levels for AdaBC. Since AdaBC is an AdaBoost based method, the experiment about AdaBC focuses on the mean value and variance of accuracy. In Fig. 2, the top performances of the two methods are similar. But AdaBC gives a more stable performance because the mean value of accuracy is in a similar level when iteration number increases. The variance is also small in the figure. However, the performance of LiC will peak at certain iteration number and then degrade because the accumulated noise will take its tolls on performance.
Now we can answer the question proposed at the beginning of this section. Even though the performances of the two methods are similar, we still have reason to use AdaBC because LiC needs to choose the optimal iteration number and learning rate to achieve top performance. The stability of LiC is also not as good as AdaBC. If the parameter is not suitable for a dataset, the performance degradation is obvious. Comparing to LiC, AdaBC shows a stable and robust result and that is the reason why we propose AdaBC although AdaBC may not be suited for small training datasets.

Estimating the noise on clean data
In practice, we do not know if the data contains noise or not. So an interesting question is that if we estimate the noise rate on the clean data, what will happen. In this experiment, we run both LiC and AdaBC on clean data. As a comparison, a linear kernel SVMs is used as the baseline method. The result is show below in Table 3.
Note that LiC performs worse than the original SVMs on the clean dataset. The main reason is that, in the loss function of LiC given by Formula (6), the samples with high class noise rate will have a penalty. When the estimated class noise rate is higher than 50%, the weight of the penalty item will be larger than the original loss function item. It is actually an operation of label inversing on the training sample. Since LiC can introduce class noise into clean data this way, it performs worse than the original SVMs is reasonable. Different from LiC, AdaBC is an ensemble method. The basic classifier is still SVMs. The loss function is to estimate the error rate of the basic classifier and calculate the weight based on this error rate. Usually, the error rate is less than 0.5. It means that each basic classifier will have a positive weight. So, the AdaBC algorithm becomes a bagging method of SVMs. Each basic classifier in the bagging has a weight, but the weight is meaningless since the training data is clean. Since each classifier trains on only part of the training set, the performance is no better than the original SVMs, but it is still comparable.
In conclusion, the estimation of class noise is used to weigh the samples in LiC but to weigh classifiers in AdaBC. In clean data, the incorrect weight on samples is much more harmful than the incorrect weight on classifiers. This is because the former leads to a miss-labeled sample, yet the later only introduces an incorrect weight. Fortunately, the weight is still with correct polarity and the training data of classifier is clean. That is the reason why AdaBC also works well on clean data.

Summary
Generally speaking, AdaBC achieves the best performance in the evaluation. However, by examining their performance in details in the experiments, we can see that LiC is sensitive to algorithm parameters including the k value in the kNN graph and the learning rate of the perceptron as well as the termination point of the algorithm, it cannot work on the clean data neither. AdaBC is better than LiC in this perspective. AdaBC is robust to all theabove parameters. Another advantage of AdaBC is the performance on clean data. Even the data does not contain noise, the AdaBC still perform well. Therefore, the Adaboost based improvement is necessary.

Conclusion and future work
In this paper, we present a novel class noise estimation method. We apply our estimation result into an Adaboost based algorithm to handle class noise. The algorithm is competitive compared to the state-of-the-art techniques and show superior performance on real datasets. We analyze the algorithm performance on different training dataset sizes and class-noise rates. Results confirm to the learning theorem provided in Eq. (8). In future works, we will investigate noise handling in semi-supervised tasks such as semisupervised classification, transductive transfer learning, and also look into the domain adaptation problem. We will also consider different noise rate for different classes since label noise rates are often class label dependent in practice.