Anti-noise twin-hyperspheres with density fuzzy for binary classification of imbalanced data with noise

This paper presents anti-noise twin-hyperspheres for binary classification of imbalanced data containing noise. First, a decision rule is employed to evaluate the contribution each point makes to the training of the hyperspheres; then, a label density estimator is introduced into the fuzzy membership to quantize these contributions; finally, unknown points are assigned to the corresponding classes. Under this decision rule, the interference created by noise hidden in the data is suppressed. Experimental results show that even when the noise ratio reaches 90%, the classification accuracies of the model are 0.802 and 0.611 on synthetic datasets and UCI datasets containing Gaussian noise, respectively. The classification results of the model outperform those of the competitors, and the boundaries it learns to separate noise from the majority and minority classes are superior to those learned by the competitors. Moreover, the gains obtained by the proposed density fuzzy are effective for noise resistance, and the density fuzzy does not rely on specific classifiers or specific scenarios.


Introduction
The so-called imbalanced data refer to an extreme difference in the number of samples between the classes in a dataset [1][2][3]. Both the noise hidden in the data and the imbalance ratio between the classes have seriously negative effects on classification methods and classifiers [4]. From the data-level view, highlighting class attributes is likely to run into unpredicted traps, because noise can blur class attributes (so-called noise interference) [5]. Additionally, from the algorithm-level view, noise can induce classifiers or classification methods to treat minority classes as noise [6]. Consequently, the complex nature of this issue poses challenges for the classification of imbalanced data containing noise.
Recently, some efforts have been made to address this issue. For instance: (i) Sampling-based techniques for the classification of imbalanced data. The distribution between classes is balanced by applying oversampling techniques to the minority classes or under-sampling techniques to the majority classes. Unfortunately, the data distribution is likely to be damaged during sampling, causing incorrect classification results. In particular, when the density of the noise is close to that of the minority classes, sampling techniques struggle to resist noise. (ii) Label noise-based methods, which apply the spatial distribution of sample points to filter label noise [4]. (iii) Fuzzy theory-based methods. Fuzzy approaches are used in classification, decision-making, and security tasks, such as the fuzzy set method proposed by [7], the fuzzy set method in [8], the orthopair fuzzy sets method [9], and the method proposed in [10]. Additionally, to address noise interference, [11] designed optimal ARX models and [12] a robust Kalman filtering method.

Corresponding author: Jian Zheng (zhengjian.002@163.com; zhengjian002@yeah.net), College of Artificial Intelligence, Chongqing Technology and Business University, Chongqing 400067, China.

Motivation
The goal of this work is binary classification of imbalanced data containing noise. We aim to demonstrate that the method can separate noise from the majority and minority classes, so as to provide valuable insights for noise resistance and a reference for classifier development. Ultimately, the motivation is to learn boundaries that separate the noise from the majority and minority classes, which can improve classification precision on highly imbalanced data containing noise. To this end, this paper proposes twin-hyperspheres with density fuzzy, namely DHS-DF. To suppress the interference created by noise, from the data-level perspective, the density fuzzy is used to judge whether an unknown point is noise. From the algorithm-level perspective, the hypersphere itself has a natural advantage in classifying imbalanced data.

Contributions
The specific contributions of this work are summarized, as follows.
(i) Twin-hyperspheres utilizing density fuzzy are proposed to classify imbalanced data containing noise. The fuzzy membership importing the label density estimator, i.e., the so-called density fuzzy, quantizes the contributions provided by instance points, thereby suppressing noise interference. (ii) The gains obtained by the density fuzzy are effective for identifying noise; moreover, they do not rely on specific classifiers or specific scenarios.
The rest of this paper is organized as follows. Related works are summarized in "Related works". Section "Methodology" describes the problem formalization, the theory, and the implementation of the model. The details of the experiment settings and design are illustrated in "Experiment settings". "Results" displays the experimental results, which are then discussed in "Discussion". Finally, a conclusion is drawn in "Conclusion".

Sampling-based approaches
Oversampling and under-sampling techniques are used to handle the classification of imbalanced data. For example, the oversampling-based AB-Smote method [13] obtains good classification results by paying more attention to boundary points. The under-sampling-based neighborhood approach [14] obtains advanced results by finding the nearest-neighbor points; similar is the under-sampling approach implemented in [15]. In general, sampling approaches need to generate data points for the minority classes or remove data points from the majority classes. For binary classification, the number of points after sampling is twice the number of points in the minority class (under-sampling) or in the majority class (oversampling) [4]. Consequently, sampling methods tend to achieve high classification accuracy but poor efficiency on large-scale classification.
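As a minimal illustration of the two sampling directions described above (the generic resampling logic only, not AB-Smote or the neighborhood method; the function names and data here are hypothetical), a sketch might look like:

```python
import random

def random_oversample(minority, target_size, rng):
    # Duplicate randomly chosen minority points until the class reaches target_size.
    out = list(minority)
    while len(out) < target_size:
        out.append(rng.choice(minority))
    return out

def random_undersample(majority, target_size, rng):
    # Keep only a random subset of the majority class of size target_size.
    return rng.sample(majority, target_size)

rng = random.Random(0)
majority = list(range(100))  # 100 hypothetical majority-class points
minority = list(range(10))   # 10 hypothetical minority-class points

up = random_oversample(minority, len(majority), rng)
down = random_undersample(majority, len(minority), rng)
print(len(up), len(down))  # 100 10
```

Note that, as stated above, the balanced dataset then totals twice the minority size (under-sampling) or twice the majority size (oversampling).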

Label noise-based approaches
[16] defines label noise as points with incorrect labels. For instance, MadaBoost [17], AveBoost [18] and AveBoost2 [18] were proposed to reduce the sensitivity of boosting to label noise, improving classification results in the presence of label noise; similar is A-Boost (average boosting) [19]. Additionally, the semi-supervised learning-based method in [20] performs semi-supervised learning in the presence of label noise. However, label noise-based approaches are sensitive to the density of labels, so the number of labels directly impacts their classification capabilities.

Deep learning-based approaches
Deep learning-based classification models have moved from theory to practice and are widely used in data classification; for example, CNNs (convolutional neural networks) [21] are used for classification tasks and achieve good precision. Deep architectures not only extract deeper representations of the input data through deep nonlinear network structures, but also have strong capabilities to learn the essential features of the input data. Therefore, many excellent deep learning-based classification models have been proposed for different application backgrounds, such as the CNN using 3D convolution kernels [22] and the deep models proposed in [23]. However, deep learning-based classification models may involve complex feature decomposition in the classification process [24,25], which requires additional effort; similar classification models include those in [26][27][28].

Fuzzy-based approaches
The fuzzy membership of a point can be used to judge whether the point contributes to the construction of classes [36]. If the point cannot provide a contribution, it may be regarded as noise. For instance, Richhariya et al. [37] proposed a robust fuzzy least squares twin support vector machine (RFLSTSVM); by utilizing a fuzzy membership function, RFLSTSVM achieves good noise immunity. However, it incurs a high computational cost to solve a pair of systems of linear equations. Similar is the fuzzy-based model implemented in [38].

Problem formalization
For an unknown point x_k in Fig. 1, it may be assigned to the majority class (Fig. 1a), classified into the minority class (Fig. 1b), or treated as noise (Fig. 1c). These classification outcomes raise two concerns: (i) how to avoid noise interference during classification, and (ii) whether point x_k is noise.
Several definitions are given below, and Table 1 gives the details of the symbols. The class centers of M_{+1} and N_{−1} are denoted C_M^{+1} and C_N^{−1}, respectively. The hypersphere learning M_{+1} and the hypersphere learning N_{−1} are defined as S_M^{+1} and S_N^{−1}, respectively, obtained by solving

min over (a_{±1}, R_{±1}, ξ_{±1}):  R_{±1}^2 + δ_{±1} Σ_i ξ_{±1,i}
s.t.  ‖φ(x_i) − a_{±1}‖^2 ≤ R_{±1}^2 + ξ_{±1,i},  ξ_{±1,i} ≥ 0,

where δ_{±1} > 0 are penalty factors, a_{±1} and R_{±1} are the centers and radii of S_M^{+1} and S_N^{−1}, ξ_{±1} ≥ 0 are slack variables, and φ(·) is a nonlinear mapping.
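The optimization above is usually solved in its dual form; as a rough, linear-kernel sketch of what a learned hypersphere looks like (not the paper's solver: here the class mean stands in for the center a, and a distance quantile for the radius R, with the quantile playing the role of the penalty δ):

```python
import numpy as np

def fit_hypersphere(points, cover=0.95):
    # Center: class mean (linear-kernel stand-in for a).
    # Radius: the `cover`-quantile of distances to the center, so roughly
    # a fraction (1 - cover) of points stays outside on slack variables.
    a = points.mean(axis=0)
    dists = np.linalg.norm(points - a, axis=1)
    R = float(np.quantile(dists, cover))
    return a, R

rng = np.random.default_rng(0)
majority = rng.normal(size=(200, 2))  # hypothetical majority-class sample
a, R = fit_hypersphere(majority, cover=0.95)
inside = float((np.linalg.norm(majority - a, axis=1) <= R).mean())
print(round(inside, 2))
```

A smaller `cover` mimics a smaller penalty δ: more points are left outside the sphere on slack.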

Definition 3 Binary classification tasks are performed on S_M^{+1} and S_N^{−1}.

Table 1 Illustrations of symbols (arranged in the order of occurrence in this work):
R^h — h-dimensional Euclidean space
M_{+1}, N_{−1} — the majority class, the minority class
|M_{+1}|, |N_{−1}| — the number of majority-class points, the number of minority-class points
DHS-DF — the proposed model
+1, −1 — majority class label, minority class label
d_X^M, d_X^N — average distance between X and C_M^{+1}, C_N^{−1}

Theory
The noise is suppressed by evaluating the contribution provided by each point to the training of S_M^{+1} and S_N^{−1}. The contributions are quantized by calculating the fuzzy membership of each point in D_im. Using the contributions, a decision is made on whether a point should be regarded as noise or assigned to the corresponding class. Therefore, this addresses concerns (i) and (ii) in "Problem formalization" from the data-level perspective. The details, illustrated in Fig. 2, are as follows.
Given a sample X = {x_1, x_2, x_3, ...}, we consider three types of scenarios regarding the contributions provided by a point.

Scenario (1). Point x_1 provides a strong contribution to the training of S_M^{+1}, so it can be classified into M_{+1} in Fig. 2a and gains a majority class label.

Scenario (2). Point x_2 provides a strong contribution to the training of S_N^{−1}, so it can be classified into N_{−1} in Fig. 2b and gains a minority class label.

Scenario (3). Point x_3 provides only a weak contribution to the training of either S_M^{+1} or S_N^{−1}; unfortunately, point x_3 is treated as noise, as shown in Fig. 2c.
We still need to specify how to evaluate the so-called far and near distances mentioned in Fig. 2. Let us define the distance between a point x_i ∈ X and the class centers as follows.

Calculation of density fuzzy
The distance d_i^M between point x_i and C_M^{+1} is calculated by Eq. (3); similarly, the distance d_i^N between x_i and C_N^{−1} is given in Eq. (4), where κ is a kernel function.
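The feature-space distance behind Eqs. (3) and (4) can be expanded through the kernel trick, ‖φ(x) − φ(c)‖ = sqrt(κ(x,x) − 2κ(x,c) + κ(c,c)); a sketch with the RBF kernel used later in the experiments (treating the class center as a single point c is a simplifying assumption here):

```python
import math

def rbf(x, y, gamma=0.5):
    # RBF kernel: kappa(x, y) = exp(-gamma * ||x - y||^2)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def feature_space_distance(x, c, gamma=0.5):
    # ||phi(x) - phi(c)|| = sqrt(k(x,x) - 2 k(x,c) + k(c,c));
    # for the RBF kernel, k(x,x) = k(c,c) = 1.
    return math.sqrt(max(0.0, 2.0 - 2.0 * rbf(x, c, gamma)))

print(round(feature_space_distance((0.0, 0.0), (1.0, 1.0)), 3))  # 1.124
```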
The average distances d_X^M and d_X^N between the sample X and C_M^{+1}, C_N^{−1} are calculated, respectively, as in Eqs. (5) and (6). The proposed fuzzy membership f(·) is calculated as

f(x_i) = 1 − 1/(d_i^M + o_f),  if d_i^M < d_i^N and d_i^M < d_X^M,
f(x_i) = 1 − 1/(d_i^N + o_f),  if d_i^N < d_i^M and d_i^N < d_X^N,
f(x_i) = −∞,  otherwise,

where o_f ≥ 1 is a constant term that prevents the denominator from being zero, and −∞ denotes a very small value specified by the user, e.g., −∞ = 1e−7.
To determine C_M^{+1} in Eq. (3) and C_N^{−1} in Eq. (4), the class label density estimation method used in [39] is selected, which estimates the density of class labels from a probabilistic view. Since the density differs between the majority and minority classes, this method is suitable for determining the class centers, as given in Eqs. (8) and (9), where I is a density estimator used to estimate the label density of the majority class and the minority class, respectively. For the calculation of I, please refer to Proposition 1 in [39]. Through the above derivation, the density estimator is introduced into the fuzzy membership, namely the proposed density fuzzy.
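Putting the pieces together, the membership and the class decision can be sketched as follows (Eq. (7) as reconstructed in this section; the value d_XM = 10 in the example is hypothetical, the other numbers follow the paper's worked example):

```python
def density_fuzzy(d_iM, d_iN, d_XM, d_XN, o_f=1.0, eps=1e-7):
    # A point contributes to the hypersphere whose center is nearer than
    # both the other center and that class's average distance; otherwise
    # it is treated as noise (f = a very small value, standing in for -inf).
    if d_iM < d_iN and d_iM < d_XM:    # Scenario (1): majority class
        return +1, 1.0 - 1.0 / (d_iM + o_f)
    if d_iN < d_iM and d_iN < d_XN:    # Scenario (2): minority class
        return -1, 1.0 - 1.0 / (d_iN + o_f)
    return 0, eps                      # Scenario (3): noise

label, f = density_fuzzy(d_iM=14, d_iN=9, d_XM=10, d_XN=12)
print(label, f)  # -1 0.9
noise_label, noise_f = density_fuzzy(d_iM=14, d_iN=9, d_XM=10, d_XN=6)
print(noise_label, noise_f)  # 0 1e-07
```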

Evaluation of contributions
The f(·) is used to quantize the contributions. For Scenarios (1), (2) and (3) above, the contributions created by point x_i are illustrated, respectively, as follows.

Manner (I). f(x_i) = 1 − 1/(d_i^M + o_f) evaluates the contribution provided by point x_i to the training of S_M^{+1}. Here, an example is given to interpret the details. Assuming that o_f = 1, and that d_i^M < d_i^N and d_i^M < d_X^M hold, x_i can provide a contribution to the training of S_M^{+1}; the contribution in Fig. 2a is f(x_i) = 1 − 1/(d_i^M + 1). Note that Manners (I), (II) and (III) involve a relative comparison: for an unknown point x_i, we care about which of the two hyperspheres it can contribute more to, so the relative comparison is considered.

Illustrations of datasets
Five imbalanced datasets, denoted S1–S5, were synthesized using random distributions, and Gaussian noise with different ratios was added to them, as shown in Table 2. No specific distribution was used to synthesize the datasets, since data distributions in applications are usually complex and unknown; using random distributions allows an objective analysis of the classification performance of the proposed model.
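A sketch of how such a dataset could be generated (the cluster locations and sizes, and the interpretation of the noise ratio as relative to the minority-class size, are hypothetical choices, not the paper's exact protocol):

```python
import numpy as np

def make_noisy_imbalanced(n_major=500, n_minor=25, noise_ratio=0.9, seed=0):
    rng = np.random.default_rng(seed)
    major = rng.uniform(-1.0, 1.0, size=(n_major, 2))  # majority cloud
    minor = rng.uniform(2.0, 3.0, size=(n_minor, 2))   # minority cloud
    n_noise = int(noise_ratio * n_minor)               # Gaussian noise points
    noise = rng.normal(loc=1.0, scale=2.0, size=(n_noise, 2))
    X = np.vstack([major, minor, noise])
    y = np.concatenate([np.full(n_major, 1), np.full(n_minor, -1), np.zeros(n_noise)])
    return X, y

X, y = make_noisy_imbalanced()
print(X.shape, int((y == 1).sum()), int((y == -1).sum()), int((y == 0).sum()))
```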
Five UCI datasets with different imbalance ratios, denoted U1–U5, were also used, as illustrated in Table 3. To verify the ability of the model to resist noise, Gaussian noise with different noise ratios was added to the five UCI datasets without changing their attributes; the noise ratio increases as the imbalance ratio (IR) increases. The UCI datasets with added Gaussian noise are named UG6–UG10. The details are displayed in Table 4.

Assessment metrics and comparison models
The evaluation metrics are accuracy and F1-score, where TP and TN are the numbers of correctly predicted minority-class and majority-class points, respectively, FP is the number of majority-class points predicted as the minority class, and FN is the number of minority-class points predicted as the majority class.
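With the minority class taken as the positive class, the two metrics reduce to (the confusion counts in the example are hypothetical):

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all points predicted correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall on the minority (positive) class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 40 minority hits, 900 majority hits, 10 false alarms, 10 misses.
print(round(accuracy(40, 900, 10, 10), 3), round(f1_score(40, 10, 10), 3))
```

On imbalanced data, the F1-score is the more informative of the two, since accuracy can stay high even when the minority class is mostly misclassified.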
The competitors are the sampling-based AB-Smote model [13], the label noise-based MadaBoost model [17], the deep learning-based CNNs model [21], the deep long-tailed learning-based RDE model [31], and the fuzzy-based RFLSTSVM model [37]. Additionally, a benchmark model, namely DHS-B, was designed with reference to our DHS-DF; DHS-B has the same structure and parameters as DHS-DF but does not apply the proposed density fuzzy, which allows analyzing the effect of the proposed density fuzzy on noise resistance. Regarding parameter selection, the RBF (radial basis function) kernel is used for DHS-B and DHS-DF, and the kernel parameter was tuned in the range {0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 3, 5}. For the parameters of the competitors, the values reported in the corresponding literature were applied. We implemented the algorithms of these models in Python 3.8 with the TensorFlow framework on a Linux system.

Experiment description
Experiment (i): to observe the results of separating noise from the minority and majority classes, the models were run on the five synthetic datasets S1–S5, and the learned classification boundaries were visualized. Experiment (ii): to test the ability to resist noise, the models were run on the five UCI datasets UG6–UG10, and the results were analyzed using the accuracy and F1-score metrics.
Experiment (iii): to test the classification ability on imbalanced datasets, the models were run on the five UCI datasets U1–U5, and the classification results were assessed using the accuracy and F1-score metrics.
Ablation experiment: to demonstrate that the proposed density fuzzy can resist noise, ablation experiments were also implemented.

Classification boundaries
The results in Fig. 3 show that the proposed DHS-DF outperforms the competitors and the benchmark model. For all seven models (DHS-DF, DHS-B, and the five competitors), the ability to resist noise decreases quickly as the noise ratio increases, but DHS-DF degrades more slowly than the competitors and the benchmark model DHS-B.
To observe the results of separating noise from the majority and minority classes, Fig. 4 visualizes the classification boundaries learned by the models on the synthetic dataset S5 at a noise ratio of 90%. Even in this case, DHS-DF still learns the desired boundaries, separates noise from the minority and majority classes well, and correspondingly achieves good classification precision. By contrast, the competitors and the benchmark model learn poor classification boundaries, although the fuzzy-based competitor RFLSTSVM outperforms the four competitors without fuzzy and the benchmark model DHS-B.
The results of the ablation experiments in Fig. 5 show that DHS-DF and RFLSTSVM, which use fuzzy, are significantly superior in classification performance to the models without fuzzy (DHS-B, AB-Smote, MadaBoost, CNNs, RDE), and DHS-DF has further advantages over RFLSTSVM. However, comparing the benchmark model DHS-B with the competitors without fuzzy, some of the latter beat DHS-B on most datasets. This indicates that the benefit in noise resistance comes more from the fuzzy than from the model structure itself: models with fuzzy clearly gain more in noise resistance than those without. Note that the training and testing datasets have the same imbalance ratio (IR) and noise ratio.

Ability of noise resistance
The results in Fig. 6 show that DHS-DF remains the winner on the five datasets UG6–UG10 in terms of noise resistance. Even on the dataset UG10, with a highly imbalanced ratio (IR = 87.8:1) and a high noise ratio (|ς_0| = 90%), DHS-DF obtains advanced classification results, i.e., accuracy = 0.611 and F1-score = 0.623, whereas the competitors obtain poor classification results, with accuracy and F1-score both below 0.5. Hence, on UG6–UG10, DHS-DF obtains classification results similar to those on the five synthetic datasets.
The results of the ablation experiments in Fig. 7 show that the models with fuzzy are superior to those without fuzzy in suppressing noise, consistent with the comparison results in Fig. 5. Together, the results in Figs. 3, 4, 5, 6 and 7 demonstrate that fuzzy indeed helps these models resist noise.

Classification ability
Figure 8 displays the classification results of the models on the five UCI datasets U1–U5 without noise. DHS-DF outperforms the six models on most datasets, e.g., U1, U3, U4 and U5. In particular, on the highly imbalanced dataset U5 (IR = 87.8:1), DHS-DF has an outstanding advantage over the six models. Additionally, the fuzzy-based competitor RFLSTSVM beats the four competitors without fuzzy (AB-Smote, MadaBoost, CNNs and RDE; note that DHS-B is the benchmark model) on the three datasets U1, U2 and U5. By contrast, in Fig. 7, RFLSTSVM outperforms the four competitors without fuzzy and the benchmark model DHS-B on all five noisy datasets UG6–UG10. Together, these results indicate that fuzzy can suppress noise.

Insights
Compared with the six models, the proposed model shows an advantage because an unknown point is assigned to the majority or minority class, or treated as noise, depending on the contribution the point provides to the hypersphere training. The contributions provided by the point for the hypersphere training can be assessed by Eq. (7). Then, the decision on which points are noise can be made based on the contributions provided by the points. Therefore, from the data-level point of view, this can address the issue of noise interference.

Model limitations
Certainly, the proposed model also has disadvantages. During training, the number of iterations depends on the data dimensionality I_d and the data volume I_v, i.e., it scales as c1·I_d + c2·I_v, where c1 and c2 are constants. The model depends on the fuzzy membership function in Eq. (7); when large-scale data are used as the training set, the number of convergence epochs may increase. This does not imply that our model cannot converge, but the training becomes long. In addition, Eqs. (8) and (9) invoke the density estimator I of [39]; since I has high computational complexity, the overall computational complexity of the proposed model is increased, and therefore so is its time complexity.

Conclusion
The noise hidden in data has negative effects on the classification capabilities of classifiers. To address the issue of noise interference, this paper proposed a twin-hyperspheres model for binary classification on imbalanced datasets containing noise. Utilizing the proposed density fuzzy, noise is effectively suppressed during classification. Results on synthetic datasets and UCI datasets containing Gaussian noise show that the proposed model outperforms the competitors in noise resistance and classification accuracy; moreover, the classification boundaries learned by the proposed model are better than those learned by the competitors. The gains obtained by the density fuzzy are not only effective in suppressing noise but also independent of specific classifiers or scenarios. In future work, we will address multi-class classification on datasets containing noise; since multi-class tasks are more complex than binary classification tasks, the classification capabilities of classifiers will be further challenged.

Definition 2

The imbalanced dataset D_im consists of the majority class M_{+1}, the minority class N_{−1} and the noise ς_0, where R^h is the h-dimensional Euclidean space, and |M_{+1}|, |N_{−1}| are the numbers of majority-class and minority-class points, respectively. |ς_0| indicates the noise ratio: if |ς_0| = 0, there is no noise in D_im; by contrast, if |ς_0| > 0, there is noise in D_im. The class centers of M_{+1} and N_{−1} are denoted C_M^{+1} and C_N^{−1}, respectively.

Fig. 1 Classification illustration of an unknown point. a, b The unknown point x_k is assigned to the majority class or the minority class. c The unknown point x_k is treated as noise.

Fig. 2 Taking Fig. 2a as an example, d_1^M must be less than d_1^N and less than d_X^M at the same time, which is the so-called near distance between point x_1 and C_M^{+1}. By contrast, d_1^N is greater than d_X^N, which is the so-called far distance between point x_1 and C_N^{−1}. Similar considerations apply to point x_2 in Fig. 2b and point x_3 in Fig. 2c.
Manner (II). f(x_i) = 1 − 1/(d_i^N + o_f) evaluates the contribution provided by point x_i to the training of S_N^{−1}. For example, assume d_i^M = 14, d_i^N = 9 and d_X^N = 12. Clearly, d_i^N < d_i^M and d_i^N < d_X^N hold, so x_i can provide a contribution to the training of S_N^{−1}; the contribution is 1 − 1/(9 + 1) = 0.9, i.e., f(x_i) = 0.9.

Manner (III). Otherwise, i.e., Scenario (3) in Fig. 2c, point x_i is treated as noise, because this scenario provides little or no contribution to the training of S_M^{+1} or S_N^{−1}. For example, for d_i^M = 14, d_i^N = 9 and d_X^N = 6, although d_i^N < d_i^M holds, d_i^N < d_X^N does not hold; therefore, point x_i is treated as noise, and f(·) = −∞ assesses its contribution, i.e., f(x_i) = 1e−7. In this scenario, point x_i is closer to the minority class than to the majority class, i.e., d_i^N < d_i^M, yet it is still far from the class center of the minority class, i.e., d_i^N > d_X^N, so it creates little contribution to the training of S_N^{−1}. A symmetric scenario exists for the majority class.

Fig. 3 Classification results on the five synthetic datasets. a The accuracy metric. b The F1-score metric. Both metrics drop as the noise ratio increases.

Fig. 4 Visualization of classification results on the synthetic dataset S5. The minority classes and majority classes are marked as red circles and blue circles, respectively. The noise is marked as yellow squares. The black curves are the learned boundaries.

Table 1 Illustrations of symbols (R^h denotes the h-dimensional Euclidean space, M_{+1} the majority class; the remaining symbols are defined in "Problem formalization").

The implementation of (S_M^{+1}, S_N^{−1}) is given in Algorithms 1 and 2. Algorithm 1 displays the training of (S_M^{+1}, S_N^{−1}); the training sample X = {x_1, ..., x_i, ...} is its input. First, the parameters are initialized in Step 1, where a large initialization value is configured for the distances. Scenario (1) is handled up to Step 16 in Algorithm 2. Similarly, the procedure between Step 17 and Step 31 in Algorithm 2 covers Scenario (2): if point x_i is assigned to the minority class, the corresponding contribution is used for the training of hypersphere S_N^{−1}. Otherwise, point x_i is treated as noise, i.e., Scenario (3), and the fuzzy membership is calculated as f(·) = −∞, as shown in Steps 32–35 of Algorithm 2. According to the results returned in Step 6 of Algorithm 1, the learned class labels are obtained in Step 7 of Algorithm 1. The training of the model terminates once each point in the sample has been judged, as illustrated in Steps 8–11 of Algorithm 1. Finally, the learned class labels are output in Step 12 of Algorithm 1.

Table 3
The five UCI datasets

Table 4
The five UCI datasets with noise.

Overall, the proposed model achieves good classification accuracy and shows better noise resistance on imbalanced datasets with noise.

Cost-sensitive learning aims to rebalance classes by adjusting the loss values of different classes during training [40], including class-level re-weighting and class-level re-margining. For class-level re-weighting, most methods directly utilize the label frequencies of the training samples for loss re-weighting, namely the weighted softmax loss [41]. [42] proposed using label frequencies to adjust model predictions during training, so as to alleviate the bias of class imbalance through this prior knowledge, called balanced softmax. To disentangle the learned model from the long-tailed training distribution, Hong et al. [29] applied a label distribution to disentangle the loss, showing that models can adapt to arbitrary test class distributions if the test label frequencies are available. Unlike [29], Cui et al. [32] introduced the concept of effective number to approximate the expected sample number of different classes, rather than using label frequencies; the so-called effective number is an exponential function of the training sample number. To address class imbalance, [32] enforces a class-balanced weighting term that is inversely proportional to the effective number of each class; related losses include the equalization loss [43], seesaw loss [44], and adaptive class suppression loss [45].
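As an illustration of the effective-number idea from [32] (a sketch; beta and the class counts are hypothetical, and the normalization choice is ours):

```python
def class_balanced_weights(counts, beta=0.999):
    # Effective number per class: E_n = (1 - beta**n) / (1 - beta).
    # The class weight is inversely proportional to E_n, then normalized
    # so that the weights sum to the number of classes.
    eff = [(1 - beta ** n) / (1 - beta) for n in counts]
    raw = [1.0 / e for e in eff]
    scale = len(counts) / sum(raw)
    return [r * scale for r in raw]

w_major, w_minor = class_balanced_weights([1000, 20])  # imbalanced binary counts
print(w_major < w_minor)  # True: the rarer class receives the larger loss weight
```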