Learning with mitigating random consistency from the accuracy measure

Human beings may make random guesses in decision-making. Occasionally, these guesses happen to be consistent with the real situation; this kind of consistency is termed random consistency. In machine learning, randomness is unavoidable and ubiquitous in learning algorithms. However, the accuracy (A), a fundamental performance measure in machine learning, does not recognize random consistency, so classifiers learnt by optimizing A retain it. The random consistency may cause unreliable evaluation and harm generalization performance. To solve this problem, the pure accuracy (PA) is defined to eliminate the random consistency from A. In this paper, we mainly study the necessity, learning consistency and learning method of the PA. We show that the PA is insensitive to the class distribution of the classifier and is fairer to the majority and the minority classes than A. Subsequently, some novel generalization bounds on the PA and A are given. Furthermore, we show that the PA is Bayes-risk consistent in finite and infinite hypothesis spaces. We design a plug-in rule that maximizes the PA, and experiments on twenty benchmark data sets demonstrate that the proposed method performs statistically better than kernel logistic regression in terms of PA and comparably in terms of A. Compared with other plug-in rules, the proposed method obtains much better performance.


Introduction
In the process of decision-making, human beings may make random guesses without logical reasoning when they lack sufficient evidence or detailed knowledge. For instance, intern doctors are likely to diagnose patients with colds during flu season, and students are likely to choose a lucky option when faced with a difficult multiple-choice question. Sometimes, these random guesses may generate consistency with the real situation. We term this consistency the random consistency.
In the area of machine learning, randomness is unavoidable and ubiquitous in constructing classifiers, such as collecting and labeling data, selecting the structures or parameters of models and even in setting random operations (Ghahramani 2015). The prediction results of the learning models may also contain the random consistency. The random consistency produces dishonest feedback, misleads the decision direction and harms the improvement of the generalization ability, especially when the tendency of random guesses coincides with the class distribution of the real situation.
Eliminating the random consistency from evaluation measures has been well-studied in the field of educational psychology, where researchers advocate that the expected score for the accurate answer with no insight would be zero rather than one. This elimination has proven helpful in achieving higher reliability and validity of assessment and increasing the performance of examinees (Sabers and Feldt 1968; Diamond and Evans 1973; Wu et al. 2017; Budescu and Bar-Hillel 1993; Espinosa and Gardeazabal 2010). In the field of clustering evaluation, eliminating the random consistency has been an increasingly employed method to improve the quality of clustering evaluation (Hubert and Arabie 1985; Albatineh et al. 2006; Vinh et al. 2009, 2010; Albatineh and Niewiadomska-Bugaj 2011; Qian et al. 2016; Li et al. 2018, 2019). In the area of classification, the accuracy (A) is a vital performance measure in model evaluation and learning theory. The original learning theories focus on searching for generalization bounds on the error probability (one minus accuracy) (Valiant 1984; Bartlett and Mendelson 2003). The traditional algorithms, including logistic regression, support vector machine and Adaboost, are designed to optimize convex surrogate loss functions of the error probability (Zhang 2003; Bartlett et al. 2006). In ensemble learning, accuracy has been used as the preferential measure to evaluate the performance of integration (Zhou et al. 2002; Martinezmunoz and Suarez 2006). Although it is a fundamental performance measure, the accuracy does not recognize the random consistency, which may limit the performance of the algorithms based on it. In this paper, we aim to define a performance measure that eliminates the random consistency from the accuracy and to study the learning performance of the measure theoretically and experimentally.

Related work
The measure that eliminates the random consistency from the accuracy is referred to as the pure accuracy (PA). The PA is a kind of non-decomposable measure, i.e., a measure that cannot be decomposed over individual instances (Waegeman et al. 2014; Kotlowski and Dembczynski 2017; Sanyal et al. 2018). Similar measures include the F-measure, AUC, and balanced error rate (Zhao et al. 2013). For the non-decomposable measures, many learning theories and algorithms have been developed.
From the aspect of learning theory, Waegeman et al. (2014) investigated the generalization bound in terms of the F-measure when optimizing the Hamming loss and subset zero-one loss in a multi-label learning setting, and concluded that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Bayes-risk consistency guarantees that by increasing the amount of data, a rule can eventually learn the optimal decision with high probability. Agarwal et al. (2005b) showed the Bayes-risk consistency of the AUC based on a newly proposed combinatorial parameter. The key step of their proof is the symmetrization by a ghost sample, the same as that for the classification error rate (Devroye et al. 1996). In this paper, to clarify the surrogate relation of PA and A, we show the upper bound of the PA value for the A-optimal rule and the upper bound of the A value for the PA-optimal rule. In addition, we give a Bayes-risk consistency analysis for the pure accuracy based on the Rademacher complexity in a finite hypothesis space and based on the VC dimension in an infinite space.
In optimizing the non-decomposable measures, Musicant et al. (2003) extended the support vector machine to optimize the F-measure by setting appropriate parameters in the standard SVM. Joachims (2005) proposed a large margin machine for maximizing a convex lower bound of non-decomposable measures. Hazan et al. (2010) and Song et al. (2016) trained deep neural networks by inferring the gradients of the non-decomposable measures. Narasimhan and Agarwal (2013) proposed an SVM model for optimizing the AUC via a tight convex upper bound. Waegeman et al. (2014) proposed an exact algorithm for optimizing the F-measure in the context of multi-label learning. Gao et al. (2016) proposed a one-pass AUC optimization algorithm that needs to read the training data only once. These methods directly optimize the non-decomposable measures. In addition to these direct methods, the plug-in rule is an effective method that learns a posterior probability function by logistic regression or some other mature method, and then searches for a threshold that optimizes the objective measure. For optimizing non-decomposable measures, Narasimhan et al. (2015) simply used the bisection method to determine a threshold. The bisection method requires the monotonicity of the function being solved. Thus, there is still much room for improving the effectiveness of the plug-in method. Here, we give an interval search method to determine the threshold of the plug-in rule for optimizing the PA.

Contributions
We aim to verify the learning ability and Bayes-risk consistency of the PA in this paper. First, with regard to the cost-sensitive loss function, we give a non-closed formulation of the optimal rule w.r.t. the PA. Based on this formulation, we illustrate that the PA is insensitive to the class distribution of classifiers and attains a low bias between minority accuracy and majority accuracy compared with A. Second, we give a novel lower bound and a novel upper bound for the optimal rules w.r.t. the A and PA, respectively. These bounds help us clarify the surrogate relation between the PA and A. Furthermore, the generalization upper bounds of the PA in the worst case are given to analyze the consistency. The proofs of these bounds employ the same symmetrization technique that was applied to prove the generalization upper bounds of the accuracy (Devroye et al. 1996) and AUC (Agarwal et al. 2005a). However, the difference is that the PA has a fractional formulation. Thus, the consistency analysis of the PA needs to handle this fractional form. Last, we design a plug-in rule that maximizes the PA and experimentally validate its performance.
Briefly, the major contributions of this paper are summarized as follows:
• Some bounds for the optimal rules w.r.t. the PA and A are given. These bounds theoretically show that the PA-optimal rule is capable of approaching a satisfactory A value for all distributions.
• We develop an inequality to handle the probability of large deviations of variables in fractional form. The generalization bounds for the PA are shown in finite and infinite hypothesis spaces. These bounds verify the Bayes-risk consistency of learning by PA.
• We propose a plug-in rule based on the interval search method for optimizing the PA. Through it, we experimentally verify the fairness and performance of PA in learning.
The organization of this paper is presented as follows: We give the definition of the PA in Sect. 2. In Sect. 3, two examples are given to show the necessity of evaluating classifiers by the PA. In Sect. 4, a surrogate analysis between the PA and the A is conducted. In Sect. 5, the generalization upper bounds of the PA are developed. We propose a plug-in rule for optimizing the PA and experimentally validate its performance in Sect. 6. We form a conclusion and propose future work in Sect. 7.
In this paper, definitions and theorems which are tagged with a literature reference are taken from the literature, while the original ones come without such a tag. All the proofs are presented in the "Appendix".

Preliminaries
We consider the task of binary classification. Let X ⊂ ℝ^d and Y = {+1, −1} be the feature space and the label space, respectively. The underlying distribution over X × Y is usually unknown, and we only have a collection of empirical data S_N = {(x_1, y_1), ..., (x_N, y_N)} drawn independently from this distribution. The goal of classification is to learn a classifier h(x) mapping from X to Y via S_N. Let H be the hypothesis space from which the classifier h(x) is learnt. To evaluate the performance of classifiers, the confusion matrix is usually employed. Let TP, FP, FN and TN denote the true positive ℙ(h(X) = +1, Y = +1), false positive ℙ(h(X) = +1, Y = −1), false negative ℙ(h(X) = −1, Y = +1) and true negative ℙ(h(X) = −1, Y = −1), respectively. Let p and q(h) denote ℙ(Y = +1) and ℙ(h(X) = +1), respectively. The confusion matrix is shown in Table 1.
Based on the confusion matrix, the accuracy (A) and the error probability (L) are defined as:

A(h) = TP + TN,  L(h) = FP + FN = 1 − A(h).

The definition of PA
To define the pure accuracy (PA), we begin with the definition of the random accuracy (RA), which aims to measure the random consistency in accuracy. For the classifier h(x) to be evaluated, let H_{q(h)} be the set of all possible binary partitions with the same class distribution as h, i.e., H_{q(h)} = {h' : ℙ(h'(X) = +1) = q(h)}. Considering that the output preference of the classifier (the tendency to predict certain instances as positive) is unknown in advance, we suppose the partitions in H_{q(h)} are uniformly distributed. Because the partitions in H_{q(h)} have the same output randomness as the classifier to be evaluated, we define the RA as the expected accuracy over the partitions in H_{q(h)}, and the PA as the accuracy with the RA eliminated:

PA(h) = (A(h) − RA(h)) / (1 − RA(h)).

Definition 3 The pure loss (PL) is defined as:

PL(h) = 1 − PA(h) = L(h) / (1 − RA(h)).

Lemma 1 When the partitions in H_{q(h)} are uniformly distributed, RA(h) = p q(h) + (1 − p)(1 − q(h)).
The denominator of PA guarantees that its maximum value is 1. Note that the formulation of the PA coincides with the definition of Cohen's statistic (Cohen 1960; Scott 1955; Goodman and Kruskal 1963). The difference between them lies in how the random consistency is defined. In the definition of Cohen's statistic, the random consistency is called the chance agreement: the degree of agreement obtained when the two raters give their ratings independently. The chance agreement between the classifier h(X) and the label Y is ℙ(h(X) = +1)ℙ(Y = +1) + ℙ(h(X) = −1)ℙ(Y = −1) = q(h)p + (1 − q(h))(1 − p). The way we define the RA gives a general framework for measuring the random consistency in performance measures and is helpful for proposing new ones.
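As a concrete illustration, the PA of a binary classifier can be computed directly from its predictions. The sketch below assumes the closed form RA = pq + (1 − p)(1 − q), so the resulting value coincides with Cohen's kappa; the function name `pure_accuracy` is ours, not from the paper.

```python
def pure_accuracy(y_true, y_pred):
    """Return (A, RA, PA) for binary labels in {+1, -1}.

    RA is taken to be p*q + (1-p)*(1-q), the closed form of the
    random accuracy, so PA coincides with Cohen's kappa.
    """
    n = len(y_true)
    a = sum(t == h for t, h in zip(y_true, y_pred)) / n   # accuracy A
    p = sum(t == 1 for t in y_true) / n                   # P(Y = +1)
    q = sum(h == 1 for h in y_pred) / n                   # P(h(X) = +1)
    ra = p * q + (1 - p) * (1 - q)                        # random accuracy RA
    pa = (a - ra) / (1 - ra) if ra < 1 else 0.0           # pure accuracy PA
    return a, ra, pa
```

A fully random, class-balanced prediction gets A = 0.5 but PA = 0, while a perfect prediction gets PA = 1.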
Cohen's statistic has been successfully used in the areas of psychology (Cameron et al. 2003) and medicine (Blair and Stanley 2008). The advantage of correcting for the expected agreement by chance has made Cohen's statistic commonly used as a reliable performance measure in the area of machine learning (Ferri et al. 2009). In ensemble learning, Kappa-error diagrams have been used to gain insights about the effectiveness of classifier ensembles (Kuncheva 2013) and to prune classifiers (Margineantu and Dietterich 1997). In addition, Cohen's statistic has been used for feature selection (Vieira et al. 2010).

On the advantages of pure accuracy measure
A learning algorithm sensitive to the class distribution may get a decision boundary that deviates from the optimal one. Thus, the learning objective should be insensitive to the output class distribution. The extensively applied accuracy does not satisfy this property. We employ Example 1 to show that the PA is satisfactory in this respect.
Example 1 (Class distribution insensitivity) In this example, we compare the evaluation results of the A and PA on prediction results with different class distributions. Under the settings of N = 100 and p = 0.3, we randomly generate a binary vector as the true label vector. A partition with a fixed class distribution q can be generated by Algorithm 1. The class distribution q is varied from 0 to 1 with a step of 0.05. Under each q, we run Algorithm 1 1000 times to generate 1000 partitions and use A and PA to evaluate them. The distributions of the A and PA values are shown in Fig. 1. From Fig. 1, it is easy to observe that the value of A decreases as q increases, while the value of PA is always near zero. This finding reflects that the A is sensitive to the class distribution of classifiers, while the PA is not.

Lemma 2 (Devroye et al. 1996) Let η(x) = ℙ(Y = +1|X = x) be the conditional class probability given X = x. The classifier that maximizes the A or minimizes the L is:

h*_A(x) = sign(η(x) − 1/2).

Correspondingly, the minimal error probability is L* = E[min(η(X), 1 − η(X))].

Theorem 1 The classifier that maximizes the PA is:

h*_PA(x) = sign(η(x) − α*), where α* = (1/2 − p)PA* + p and PA* is the optimal PA value.

For the cost-sensitive loss L_α = αFP + (1 − α)FN, it is known that when α is smaller, more attention is paid to the minority class to get a smaller L_α. According to the proof of Theorem 1, PA is equivalent to L_α with α = (1/2 − p)PA* + p. Because PA* ≤ 1, a smaller p value generates a smaller (1/2 − p)PA* + p value. In this case, h*_PA pays more attention to the minority class. Thus, h*_PA may be insensitive to the class distribution. In learning classifiers, the minority class is often overwhelmed by the majority class to guarantee a higher overall accuracy (He and Garcia 2009), so classifiers learnt by optimizing the accuracy or error probability are usually biased toward the majority class. This phenomenon is particularly important to avoid because the minority class is precious and inadequately represented.
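Example 1 can be reproduced in a few lines. The sketch below is ours: it uses N = 1000 rather than 100 to reduce variance, 200 runs per q, and a simple shuffle in place of Algorithm 1 (whose exact details are in the paper).

```python
import random

random.seed(0)
N, p = 1000, 0.3
# True label vector with positive rate p.
y = [1] * int(N * p) + [-1] * (N - int(N * p))

def eval_random_partition(q):
    """Average A and PA of random predictions with positive rate q."""
    trials = 200
    a_sum = pa_sum = 0.0
    for _ in range(trials):
        pred = [1] * int(N * q) + [-1] * (N - int(N * q))
        random.shuffle(pred)                      # stand-in for Algorithm 1
        a = sum(t == h for t, h in zip(y, pred)) / N
        ra = p * q + (1 - p) * (1 - q)            # random accuracy
        a_sum += a
        pa_sum += (a - ra) / (1 - ra)
    return a_sum / trials, pa_sum / trials
```

With p = 0.3, A is high when q is small (predict mostly negative) and low when q is large, while PA stays near zero at every q, mirroring Fig. 1.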
We employ Example 2 to show that the pure accuracy can mitigate the classification bias.
Example 2 (Fairness) To measure the bias of the classifier h(X), we use the absolute difference of the two class accuracies:

Bias(h) = |ℙ(h(X) = +1|Y = +1) − ℙ(h(X) = −1|Y = −1)|.

Assume that the two class data are generated from Gaussian distributions N(μ_1, Σ) and N(μ_2, Σ). The label of the minority class is corrupted by instance-independent noise at the level s_1: ℙ(Ỹ = −1|Y = +1) = s_1.
For this learning task, the bias of h*_A has a closed-form expression in terms of Φ(·), the cumulative distribution function of the standard normal distribution, Δ = (μ_1 − μ_2)′Σ^{−1}(μ_1 − μ_2) and d_0 = ln((1 − p)/(p(1 − 2s_1))). Because the formulation of h*_PA is non-closed, its bias is simulated through a large number of instances. First, a sample obeying the distribution of this task is generated with a size of 10^4. Then, the threshold that optimizes the PA is searched in the range [0, 1] with a step of 10^{−4}, and the bias of h*_PA is calculated on the sample.
Let μ_1 = −1 and Σ = 1, and let μ_2 vary from 0 to 2, p vary from 0.05 to 0.35 and the one-sided noise level s_1 vary from 0 to 0.5. The bias curve of h*_A (the dashed line) and that of h*_PA (the solid line) are shown in Fig. 2. As Fig. 2 shows, the solid line is consistently lower than the dashed line in each case, which demonstrates that learning by PA is fairer than learning by A under different imbalance degrees, overlap degrees and noise levels.

Surrogate analysis of the optimal rules
The task of classification is to predict the labels of future observations. The optimal classifier is usually obtained by minimizing a loss function. From the same hypothesis space, different loss functions usually yield different optimal classifiers. In this section, we focus on giving some novel bounds for h*_PA(x) and h*_A(x) to clarify the substitution relationship between them in learning classifiers. Theorem 2 (derived from Lemma 3) and Theorem 3 (derived from Lemma 4) are the major results of this section.

Lemma 3 For all distributions, the plug-in rule with α as the decision threshold satisfies an upper bound on its error probability; when α = 1/2, the equality holds.

Lemma 3 gives an upper bound on the error probability of the plug-in rule; Theorem 2 follows from it. Correspondingly, we have:

Theorem 3 For all distributions, suppose p ≤ 1/2; then the pure loss of h*_A is upper bounded accordingly.

From Theorem 3, we can conclude that the pure loss of the optimal classifier learnt by A satisfies PL(h*_A) → PL(h*_PA) as L* → 0 only when p = 1/2. Based on Theorems 2 and 3, we can infer that learning by PA can obtain a satisfactory A for all distributions, while learning by A can obtain a satisfactory PA only when the class distribution is balanced. We also employ Example 3 to reflect this phenomenon.

Bayes-risk consistency analysis of learning by the pure accuracy measure
The underlying distribution of X × Y is usually unknown, and we only have a collection of the empirical data S N = {(x 1 , y 1 ), ..., (x N , y N )} that is drawn independently from the distribution. In machine learning, the classifier is generally obtained by the principle of empirical risk minimization (ERM). The feasibility of the ERM is guaranteed by the property of Bayes-risk consistency. The corresponding loss function of PA is PL. Therefore, in this section, we validate the learnability of PA by analyzing the Bayes-risk consistency of PL.
For a risk function R, let R_N(h) be the empirical risk calculated on S_N. To guarantee the feasibility of the ERM, the property of Bayes-risk consistency is defined as follows.

Definition 4 (Devroye et al. 1996) The rule h*_{R_N} is Bayes-risk consistent if, for any small enough ε > 0, ℙ(R(h*_{R_N}) − inf_h R(h) > ε) → 0 as N → ∞.

The Bayes-risk consistency requires that the empirically optimal hypothesis h*_{R_N} converges, with high probability, to the universally optimal hypothesis as the number of empirical data tends to infinity.
To analyse the Bayes-risk consistency, the gap between R(h*_{R_N}) and inf_h R(h), known as the estimation error, is usually upper bounded by (Devroye et al. 1996):

R(h*_{R_N}) − inf_h R(h) ≤ 2 sup_{h∈H} |R_N(h) − R(h)|.

The estimation error measures the performance gap between the empirical data and the underlying distribution. The convergence of the estimation error ensures that a rule learnt from finite samples can be generalized to infinite samples. The bound on the estimation error, the so-called generalization bound, is the key factor in studying the property of Bayes-risk consistency. The Rademacher complexity (Bartlett and Mendelson 2003) and the VC dimension (Vapnik and Chervonenkis 1971) are two complexity measures of the hypothesis space; they play a crucial role in bounding the estimation error in the sense of accuracy. Here, we use the generalization bounds based on them to analyse the Bayes-risk consistency of learning by PA. To save space, we omit the definitions of the Rademacher complexity, the VC dimension and the corresponding generalization bounds.

The Bayes-risk consistency of the pure loss measure in a finite hypothesis space
The fractional form of the pure loss means that its empirical value is not an unbiased estimate of the expected value. Therefore, the techniques used in deriving the generalization bounds of the error probability (Theorem 8 in Bartlett and Mendelson (2003) and Theorem 2 in Vapnik and Chervonenkis (1971)) cannot be directly applied. Here, we establish a bridge between the estimation error of the pure loss and that of the error probability, and then obtain the Bayes-risk consistency of the pure loss in finite and infinite hypothesis spaces based on Theorem 8 in Bartlett and Mendelson (2003) and Theorem 2 in Vapnik and Chervonenkis (1971), respectively.

First, we give the formulations of the empirical error probability L_N(h) and the empirical random accuracy RA_N(h) used in the analysis. In practice, according to Lemma 1, the empirical random accuracy is computed as RA_N(h) = p_N q_N(h) + (1 − p_N)(1 − q_N(h)), where p_N and q_N(h) are the empirical estimates of p and q(h) on S_N.

Lemma 5 links the probability of a large estimation error of a fractional variable Z_1/Z_2, with Z_1, Z_2 ∈ [0, 1], to that of its numerator and denominator. Based on Lemma 5, we obtain Theorems 4 and 5. Theorem 4 provides the probability of a large estimation error in terms of the number of empirical data in a finite hypothesis space. From Theorem 4, we can conclude that learning by the PA is Bayes-risk consistent in a finite hypothesis space.

The Bayes-risk consistency of the pure loss measure in an infinite hypothesis space
In this section, we consider the Bayes-risk consistency in an infinite hypothesis space, where the union bound cannot be utilized. We utilize the symmetrization technique to bound the estimation error of the pure loss. Next, we divide the hypothesis space into N + 1 subspaces according to the class probability of the hypothesis functions, ensuring that each hypothesis subspace has the same degree of random accuracy. Then, we employ the VC bound of the error probability to bound the estimation error of the pure loss. Theorem 5 provides the probability of a large estimation error in terms of the number of empirical data in an infinite hypothesis space. From Theorem 5, we can conclude that learning by the PA is Bayes-risk consistent in an infinite hypothesis space.

Performance validation of learning by the pure accuracy measure
By the Bayes-risk consistency, we have shown that the PA can be utilized to learn classifiers through minimizing PL. However, due to its fractional form, optimizing PL is a challenging task. To handle this challenge, we introduce the plug-in rule and propose an interval search method. The plug-in rule refers to a rule of the form h*(x) = sign(η̂(x) − δ*), where η̂(x) is an estimator of the posterior probability η(x) = ℙ(Y = +1|X = x) and δ* is a threshold (Koyejo et al. 2014). The plug-in method mainly contains the following steps: first, randomly split the training data S_N into S_1 and S_2; second, learn η̂(x) by minimizing a loss function on S_1; third, determine δ* by maximizing the learning objective on S_2.
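The third step, choosing δ* on S_2, can be sketched as follows. This is an illustrative S_2-style candidate search, not the authors' algorithm; `pa_at_threshold` and `plug_in_threshold` are names of our own, and the scores stand in for posterior estimates η̂(x) on S_2.

```python
def pa_at_threshold(scores, labels, thr):
    """Empirical PA of the plug-in rule sign(eta_hat(x) - thr) on S2."""
    n = len(labels)
    pred = [1 if s >= thr else -1 for s in scores]
    a = sum(t == h for t, h in zip(labels, pred)) / n
    p = sum(t == 1 for t in labels) / n
    q = sum(h == 1 for h in pred) / n
    ra = p * q + (1 - p) * (1 - q)
    return (a - ra) / (1 - ra) if ra < 1 else 0.0

def plug_in_threshold(scores, labels, candidates):
    """Return the candidate threshold that maximizes the empirical PA."""
    return max(candidates, key=lambda t: pa_at_threshold(scores, labels, t))
```

Using the posterior estimates themselves as candidates corresponds to the S_2-search; the ρ-interval search proposed below replaces this exhaustive scan with a nested-interval search.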
In Narasimhan et al. (2014), it has been proved that assigning an empirical threshold to a suitable posterior probability estimate can optimize performance measures expressed as functions of TP, TN and p. That is, the plug-in method can optimize a complex performance measure by searching for a decision threshold that optimizes the measure for the posterior probability estimate. The major focus of this section is developing a method to search for the threshold that optimizes PL rather than learning the posterior probability η̂(x).
In this section, first, we introduce the method used to learn the posterior probability. Then, we discuss some methods of determining the threshold that optimizes PL and propose an interval search method. Finally, we experimentally validate the performance of the interval search method and of the classifier learnt by the PA.

Learning η̂(x)
Many methods can be employed to learn η̂(x). Here, we introduce the kernel logistic regression model, which is proven to be a suitable posterior probability estimator (Ingo 2005; Narasimhan et al. 2014; Menon et al. 2013). The kernel logistic regression model is η̂(x) = 1/(1 + exp(−Σ_i β_i K(x_i, x))), where the β_i are the variables to be solved and K(·, ·) is a kernel function. Once the optimal β*_i are obtained by the gradient descent method, the posterior probability estimate η̂(x) follows.
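A minimal sketch of this model on toy one-dimensional data follows, assuming plain full-batch gradient descent on a ridge-regularized logistic loss (the paper does not specify the exact solver, and the names `fit_klr`, `rbf` and the hyperparameters are ours):

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF kernel on scalars: exp(-gamma * (a - b)^2)."""
    return math.exp(-gamma * (a - b) ** 2)

def fit_klr(xs, ys, lr=0.1, epochs=500, lam=1e-3):
    """Kernel logistic regression: eta_hat(x) = sigmoid(sum_i beta_i K(x_i, x)).

    Full-batch gradient descent on mean log(1 + exp(-y f)) + (lam/2)||beta||^2.
    """
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    beta = [0.0] * n
    for _ in range(epochs):
        f = [sum(beta[j] * K[i][j] for j in range(n)) for i in range(n)]
        g = [0.0] * n
        for i in range(n):
            # d/df_i of log(1 + exp(-y_i f_i)) is -y_i / (1 + exp(y_i f_i))
            coef = -ys[i] / (1.0 + math.exp(ys[i] * f[i])) / n
            for j in range(n):
                g[j] += coef * K[i][j]
        beta = [b - lr * (gj + lam * b) for b, gj in zip(beta, g)]

    def eta_hat(x):
        s = sum(b * rbf(xi, x) for b, xi in zip(beta, xs))
        return 1.0 / (1.0 + math.exp(-s))
    return eta_hat
```

On well-separated data the learned η̂ assigns posterior probabilities above 1/2 to the positive region and below 1/2 to the negative region.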

The interval search method
As for determining δ*, different threshold settings correspond to optimizing different learning objective functions. To optimize the accuracy, the threshold δ* of the plug-in rule is 0.5; this is the so-called kernel logistic regression (KLR) method. To optimize the balanced accuracy (BA), the threshold δ* of the plug-in rule is p (Menon et al. 2013).
For measures in fractional form, search strategies are effective and simple. An intuitive approach to determining the optimal threshold is the point-wise search method, namely, evaluating the fractional measure at each candidate threshold and outputting the best-performing one. There is no doubt that exhausting all possible thresholds is impossible. The grid search handles this by dividing the range of the threshold into multiple equal intervals and taking the end points as candidate thresholds. Alternatively, the posterior probabilities on S_2 can be set as the candidate thresholds; we term this strategy the S_2-search. The grid search and the S_2-search explore the threshold over a limited set of candidates. In addition to the point-wise search methods, the bisection method transforms the fractional measure into a one-dimensional function and obtains the optimal threshold by solving for the zero root of that function by binary search (Narasimhan et al. 2015). The bisection method requires the objective function to be monotone on the interval, while fractional performance measures are usually non-monotonic with respect to the threshold.
In this subsection, we develop a method for searching for the optimal threshold via the interval search method and use it to minimize PL. The interval search method is an effective way to find the local minimum of a unimodal function (Chong and Żak 2011). For a unimodal one-dimensional function f(r) defined on [a, b], to obtain the minimizer r*, the interval search method produces a series of nested intervals [a_k, b_k] with lim_{k→∞} a_k = lim_{k→∞} b_k = r*. Specifically, in each iteration the method inserts two interior points; if the function value at the left point is smaller, the right end point b_{k+1} is moved to the right interior point; otherwise, the left end point a_{k+1} is moved to the left interior point. When the interval length is reduced by the ratio 1 − (√5 − 1)/2 in each iteration, the interval search method is the so-called golden section method.
For any plug-in rule h_δ(x) = sign(η̂(x) − δ), we briefly discuss whether the PL is a unimodal function of the threshold δ. According to the proof of Theorem 1, the PL is consistent with the cost-sensitive loss L_α(δ) whose cost weight α is the optimal threshold. The unimodality of PL requires that for δ_1 < δ_2 < δ*, L_α(δ_1) > L_α(δ_2), and that for δ* < δ_1 < δ_2, L_α(δ_1) < L_α(δ_2). Thus, when δ* < δ_1 < δ_2, the unimodality of L_α(δ) requires that the posterior probability satisfy condition (41), which signifies that there exist only a small number of positive objects among the objects with small posterior probabilities. When δ_1 < δ_2 < δ*, the unimodality of L_α(δ) requires the contrary case of condition (41), which signifies that there exist a large number of positive objects among the objects with large posterior probabilities.
According to the above discussion, if the posterior probability estimate is sufficiently good, PL is a unimodal function of δ, and the interval search method can be applied to obtain δ*. Following Theorem 1, we express the plug-in rule as h_r(x) = sign(η̂(x) − r) and apply the interval search method to find the optimal r that minimizes PL_{|S_2|}(h_r(x)).
A fixed reduction ratio of the interval is employed: in each iteration, the interval length is reduced by a ratio ρ ∈ (0, 0.5). The method is thus called the ρ-interval search method, and the ratio ρ is a parameter to be tuned. The interval search method for minimizing the PL is shown as Algorithm 2. The time complexity of the ρ-interval search method contains two parts: learning η̂(x) and searching for δ*. The time complexity of learning η̂(x) is that of the gradient descent method, and the time complexity of searching for δ* is O(N log_{1−ρ} ε), where N is the number of training data, ρ is the reduction ratio of the interval and ε is the threshold of the stopping condition. Learning η̂(x) is the main time-consuming part; when handling a large number of samples, it is suggested to utilize an efficient gradient descent method.
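The search step of Algorithm 2 can be sketched as follows. This is a generic ρ-interval search over a unimodal objective; the function name `interval_search` and the defaults are ours, and in the plug-in rule `f` would be the empirical PL of h_r on S_2 as a function of the threshold r.

```python
def interval_search(f, lo=0.0, hi=1.0, rho=0.2, eps=1e-4):
    """rho-interval search for the minimizer of a unimodal f on [lo, hi].

    Each iteration evaluates two interior points and shrinks the interval
    length by the ratio rho (rho in (0, 0.5)); rho = 1 - (sqrt(5)-1)/2
    recovers the golden section method.
    """
    while hi - lo > eps:
        m1 = lo + rho * (hi - lo)   # left interior point
        m2 = hi - rho * (hi - lo)   # right interior point
        if f(m1) < f(m2):
            hi = m2                 # minimizer lies in [lo, m2]
        else:
            lo = m1                 # minimizer lies in [m1, hi]
    return (lo + hi) / 2
```

For instance, `interval_search(lambda r: pl_on_s2(r))` would return the threshold minimizing the empirical PL, provided the objective is unimodal as discussed above.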

Experiments
We validate the performance of the ρ-interval search on a variety of benchmark data sets. Using these data sets, we show that learning by PA is fairer in majority accuracy and minority accuracy than learning by A, and we compare the ρ-interval search method with some other plug-in rules to show its effectiveness. The benchmark data sets are downloaded from the KEEL Data Set Repository (Alcalafdez et al. 2008) and the UCI Machine Learning Repository (Dua and Graff 2017). These data sets are briefly described in Table 2, including data ID, name, size, number of attributes and the imbalance ratio (IR). The posterior probability is generated by kernel logistic regression with the RBF kernel K(x, x') = exp(−γ‖x − x'‖²). Each data set is randomly divided into a training set, a validation set and a test set at a ratio of 3:1:1. The methods are compared on the same division. We randomly divide each data set 30 times to obtain an average performance. The kernel parameter γ is chosen from {2^{−4}, 2^{−2}, 2^0, 2^2, 2^4, 2^6} and ρ is chosen from {0.1, 0.2, 0.3, 0.4} via the validation set. Each attribute is linearly scaled to the range [0, 1] using the maximum and minimum values in the training data. For each data set, we also add 3% and 5% random uniform label noise to increase the complexity of the data.
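The preprocessing just described (per-attribute min-max scaling with training statistics, plus uniform label noise) can be sketched as follows; the function names are ours:

```python
import random

def minmax_scale(train_col, test_col):
    """Linearly scale one attribute to [0, 1] using the training min/max.

    Test values are scaled with the same statistics, so they may fall
    outside [0, 1].
    """
    lo, hi = min(train_col), max(train_col)
    span = hi - lo if hi > lo else 1.0   # guard against constant columns
    return ([(v - lo) / span for v in train_col],
            [(v - lo) / span for v in test_col])

def flip_labels(ys, rate, rng):
    """Uniform label noise: each label in {+1, -1} is flipped with prob rate."""
    return [-y if rng.random() < rate else y for y in ys]
```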
First, to show that learning by PA is fairer than learning by A, we compare the bias [refer to Eq. (13)] of KLR and the ρ-interval method. Figure 4 shows the comparison result; each bar is the difference in the mean bias over the 30 runs between KLR and the ρ-interval method on each benchmark data set. As shown in Fig. 4, 16/20, 17/20 and 16/20 bars are greater than zero under 0%, 3% and 5% noise, respectively. That is, the bias of KLR is larger than that of the ρ-interval method, which reflects that the classifiers learnt by PA are fairer than those learnt by A.

Second, to validate the performance of the proposed method, the A and PA are employed as evaluation measures. The benchmark methods are KLR, p-cut (with the proportion of the minority class in S_2 as the threshold), grid-search, S_2-search and the bisection method. KLR aims to optimize the A, and p-cut aims to optimize the balanced accuracy. The grid-search and S_2-search aim to optimize the PA. The bisection method is used to optimize the F_1-measure and the PA, denoted Bisection-F_1 and Bisection-PA, respectively. Tables 3, 5 and 7 show the mean and standard deviation of A over the 30 comparisons with 0%, 3% and 5% label noise, respectively; Tables 4, 6 and 8 show the same for PA. In each row of the tables, the method with the maximal evaluation value is underlined and printed in bold type, and a dot indicates that the ρ-interval search is significantly better with regard to the pairwise Student's t test at a level of 0.1. As shown in Tables 3, 4, 5, 6, 7 and 8, the evaluation score obtained by the ρ-interval search is highlighted in bold and underlined in most of the comparisons, and in many comparisons the ρ-interval search is statistically better than the other methods.
To further analyze the statistical performance of each method, we calculate, for each method, the gap between the number of significant wins and the number of significant losses. An algorithm a significantly wins over b if its mean and standard deviation satisfy the criterion in Li et al. (2016), where t is the number of comparison times; otherwise, a significantly loses to b (please refer to Li et al. (2016) for more details). Figure 5 shows the results of the statistical comparison; each bar represents the gap between the number of significant wins and the number of significant losses. As shown in Fig. 5, the bar of the -interval search method is the highest w.r.t. PA under every noise level. With respect to A, the bar of the -interval search is the highest when the noise level is 3% and 5%, and the second highest when the labels are not polluted by noise. In general, we conclude that the -interval search method optimizes the PA value better and also obtains a satisfactory A value.
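The win/loss gap statistic in Fig. 5 can be sketched as below. The exact significance criterion is elided in the text; the rule used here (non-overlapping mean ± std/√t intervals over t = 30 runs) is an assumption for illustration only:

```python
import math

def significantly_wins(mean_a, std_a, mean_b, std_b, t):
    """Assumed criterion: a's interval lies strictly above b's."""
    return mean_a - std_a / math.sqrt(t) > mean_b + std_b / math.sqrt(t)

def win_loss_gap(scores, t=30):
    """scores: dict method -> (mean, std); returns dict method -> wins - losses
    against every other method over t comparison runs."""
    gap = {m: 0 for m in scores}
    for a, (ma, sa) in scores.items():
        for b, (mb, sb) in scores.items():
            if a == b:
                continue
            if significantly_wins(ma, sa, mb, sb, t):
                gap[a] += 1
            elif significantly_wins(mb, sb, ma, sa, t):
                gap[a] -= 1
    return gap
```

A positive gap means a method significantly beats more competitors than it loses to, which is what the bar heights in Fig. 5 summarize.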

Conclusion
With the increase in data complexity, eliminating random consistency from learning algorithms has great potential to improve generalization ability. In this paper, we have first shown, through two vivid examples, that PA is insensitive to the class distribution of classifiers in evaluation and is fairer than A in learning classifiers. Second, we have given some novel bounds showing that learning by PA can approach the optimal A, and have shown that the empirical risk minimization process of PA is Bayes-risk consistent. Based on these theoretical guarantees, we have proposed a plug-in rule model that optimizes PA. The experimental results have demonstrated the fairness and effectiveness of learning by PA. An interesting direction for future work is to establish other strategies for defining the random consistency; an analysis of the random consistency for each instance may be a promising direction.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix: Proofs
Lemma 1

Proof Without loss of generality, we assume that q(h) < p. Assuming that the size of the data is N, we have the following, where j = 0, ..., Nq(h) and C_n^m is the number of combinations of n items taken m at a time. From (46), we know that N · TP(h′) follows the hypergeometric distribution in which the size of the population is N, Np elements of the population belong to one group and N − Np belong to the other, and the number of samples drawn from the population is Nq(h). Thus, the stated result follows. ◻

Example 2 Assume that two-class data are generated from two Gaussian distributions with distinct means μ1, μ2 but a common covariance Σ, and that the probability of the positive class is p = ℙ(Y = +1). The label of the minority class is corrupted by instance-independent noise at the level s1: ℙ(Ỹ = −1 | Y = +1) = s1. For this learning task, the bias of h*_A is expressed in terms of Φ(·), the cumulative distribution function of the standard normal distribution, Δ = (μ1 − μ2)ᵀ Σ⁻¹ (μ1 − μ2), and d0 = ln((1 − p)/p · 1/(1 − 2s1)).

Proof According to Lemma 2, the corrupted conditional class probability ℙ(Ỹ = +1 | X = x) is needed. By Bayes' theorem:

Proof
The formulation of the pure accuracy measure is fractional, which hinders obtaining the optimal classifier directly. Here, we resort to the cost-sensitive loss to obtain a non-closed-form solution. We begin this proof with two existing definitions and two lemmas.

Definition 5 (Kotlowski and Dembczynski 2017) We refer to a measure Ψ as a linear-fractional performance measure if it is non-increasing with FP, FN and formalized as Ψ = (a0 + a1·FP + a2·FN)/(b0 + b1·FP + b2·FN), where a0, a1, a2, b0, b1, b2 ∈ ℝ and b0 + b1·FP + b2·FN ≥ C1 > 0. Here, Ψ* = max_h Ψ(h), L* = min_h L(h) and C2 = (1/C1)(Ψ*(b1 + b2) − (a1 + a2)).

Lemma 8 (Elkan 2001)

Combining (117), (118) and (119), we obtain the final result. ◻
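Definition 5 can be made concrete with a small helper that builds a measure from its six coefficients. The instantiation of accuracy below (a0 = b0 = n, a1 = a2 = −1, b1 = b2 = 0, so that A = (n − FP − FN)/n) is an illustrative assumption, not taken from the paper:

```python
def linear_fractional(a0, a1, a2, b0, b1, b2):
    """Build psi(FP, FN) = (a0 + a1*FP + a2*FN) / (b0 + b1*FP + b2*FN),
    the linear-fractional form of Definition 5."""
    def psi(fp, fn):
        denom = b0 + b1 * fp + b2 * fn
        # Definition 5 requires the denominator to stay bounded away from zero.
        assert denom > 0, "denominator must satisfy b0 + b1*FP + b2*FN >= C1 > 0"
        return (a0 + a1 * fp + a2 * fn) / denom
    return psi

# Accuracy on n samples as a linear-fractional measure:
n = 100
accuracy = linear_fractional(n, -1.0, -1.0, n, 0.0, 0.0)
```

Measures of this family are non-increasing in FP and FN, which is what lets the proof reduce their maximization to a cost-sensitive loss.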