1 Introduction

One of the main elements of successful learning and knowledge discovery in data mining is data quality. Data cleansing can be done manually, but this is difficult, time consuming, and prone to errors, so effective automatic tools are necessary in the data cleansing process. Noise refers to inaccuracies and inconsistencies in data, which reduce the quality of the real data. Moreover, noise can affect the quality of the information extracted from the data, of the models created from the data, and of the decisions made based on the data [1]. Identifying noisy instances and then eliminating or correcting them are useful techniques in data mining research [2]. Eliminating noisy samples from training sets may improve data reliability and quality [3]. Noise detection is a critical part of data understanding and cleaning, as well as of semi-supervised outlier detection [4]. Noise filtering is used to eliminate incorrect instances from real-life data. Noise reduction is a difficult and important process in machine learning for achieving precise, high-performance models; if noisy data are not removed, they may yield wrong decisions [5]. An effective noise detection process decreases the risk of poor decision making based on erroneous data [6]. Noise in data is categorized into attribute noise, class noise, or a combination of both: attribute noise concerns erroneous or unusual attribute values, whereas class noise concerns wrong class labels. Several experimental studies show that class noise has negative effects on the performance of machine learning classifiers [7]. Class noise is known as a major challenge in data mining research, with negative effects on model performance, and enhancing the classification accuracy of induced models is the main goal of noise detection techniques [8]. It is also clear that classification accuracy depends strongly on the quality of the training set [1].

A review of the existing studies shows that many researchers have proposed methods to handle noise in data sets using machine learning algorithms [3, 8,9,10]. Sluban et al. [8] developed new class noise detection algorithms, including the high agreement random forest filter, on two UCI data sets. Xiong et al. [11] explored four approaches to improve data analysis through noise removal using unsupervised techniques. Lowongtrakool and Hiransakolwong [5] developed an unsupervised clustering intelligence method to reduce the quantity of spam; the outcomes of noise filtering were beneficial for data processing, making it more precise. Zeidat et al. [12] compared several popular data set editing techniques, namely Wilson editing, citation editing, and multi-edit, and also introduced supervised clustering editing. Smith et al. [13] identified the reasons that cause instances to be misclassified. Thongkam et al. [14] applied an SVM to the training set to detect and eliminate all samples misclassified by the SVM, and Jeatrakul et al. [15] applied the same approach using neural networks; the proposed cleaning method enhances the confidence of cleaning noisy training instances. Likewise, it is important to have good classifiers in classification filtering, since the existence of class noise produces poor classifiers [16]. Because an SVM removes the instances whose prediction is not reliable, a local support vector machine (LSVM) noise reduction technique was proposed by Segata et al. [17], and in Segata et al. [18] a new strategy was proposed to reduce the number of local SVMs in the noise reduction process. A neural network based automatic noise reduction (ANR) method was presented by Martinez et al. [19] to clean noisy instances in data sets. Sánchez et al. [20] applied the k-nearest neighbor (KNN) classifier to predict data sets and then removed the misclassified instances. Sabzevari et al. [21] applied randomized ensembles such as bagging and random forest to detect and handle class noisy instances in the training set; their results showed that removing is better than relabeling at low and medium noise levels, while relabeling is more precise at high noise levels. There are also studies that identified the most important factors deteriorating the performance of the k-means algorithm. Fränti and Sieranoja [22] found that if the clusters overlap, the choice of initialization technique does not matter much, and repeated k-means is usually good enough for the application; however, if the data has well-separated clusters, the result of k-means depends mainly on the initialization algorithm. Since the performance of evolutionary k-means is often degraded by noisy data, a clustering stability-based EKM algorithm (CSEKM), which evolves partitions and the aggregated matrices simultaneously, was proposed by He and Yu [23]; the experimental results show that CSEKM is more robust to noise.

Based on these studies, two main issues are investigated. First, there is a lack of attention to misclassified instances, which have a great impact on clustering efficiency. Second, removing class noise may affect classification performance, which highlights the need for good, reasonable classification filtering in noise detection. This paper extends our previous model, the k-means support vector machine [24], and proposes a CLCF model that uses the k-means clustering algorithm and five different classification filtering algorithms on four real data sets to recognize class noisy instances. The proposed model increases clustering efficiency and overall performance. The model is constructed in three phases. The first phase is noise detection, which is based on a clustering technique to identify the misclassified instances in each cluster. The second phase is noise filtering, which applies five classification filtering algorithms to obtain the real noisy instances. The third phase is noise classification, which employs two different techniques, namely removing and relabeling, to handle the noisy instances. Experiments were conducted to measure the performance of the model using evaluation criteria. Figure 1 presents the general view of the proposed model.

Fig. 1 General view of the CLCF model

In brief, the main contributions of the present study are as follows:

  1. Investigating the misclassified-instance issue in the k-means clustering algorithm.

  2. Proposing a new CLCF model that combines the k-means clustering algorithm with classification filtering algorithms for class noise detection in binary data sets.

The paper is organized as follows. Section 2 presents the preliminary knowledge. Section 3 describes the proposed model. The data sets and performance measurement are described in Sect. 4. The results are discussed in Sect. 5. Finally, Sect. 6 concludes the paper with a brief summary and suggestions for future work.

2 Preliminary knowledge

In this section, the methods required for noise detection and noise filtering are introduced. The selection of classifiers applied for the filtering is explained as well.

2.1 K-means clustering algorithm

In this research, a crisp clustering technique, namely k-means, is applied to recognize misclassified instances, which are then assumed to be noisy instances. The k-means algorithm is widely applied because it is theoretically simple, memory efficient, and computationally fast [25]. The flowchart of the k-means clustering algorithm is illustrated in Fig. 2, and a brief code sketch follows it.

Fig. 2 The flowchart of k-means algorithm
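To make the clustering step concrete, the following minimal sketch runs k-means in Python. The use of scikit-learn's KMeans and the toy data are illustrative assumptions; the paper does not specify an implementation.

```python
# Minimal k-means sketch (scikit-learn is an assumed choice, not the paper's).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # toy feature matrix: 100 instances, 4 features

km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)    # cluster index assigned to each instance
centers = km.cluster_centers_      # cluster centers (the mu_i of Eq. 1 in Sect. 3.1)
```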

2.2 Classification filtering algorithms

In this study, five classifiers with different learning paradigms, chosen from among the most popular supervised learning techniques for identifying noisy instances [26, 27], were used as classification filtering algorithms.

2.2.1 Support vector machine (SVM)

The support vector machine (SVM) [28] learns a linear hyperplane from a training set that separates positive examples from negative ones, so it can be considered a binary classifier. SVM has been used for noise detection in numerous existing works, such as [4, 8, 18, 29,30,31], and it is a popular classification filtering method broadly used to detect class noise [2].

2.2.2 Naïve Bayes (NB)

The Bayesian classifier is a statistical classifier known as a simple probabilistic classifier. Using this technique, the probability that an instance belongs to a particular class is predicted [32]. Many existing works have applied NB for noise detection, such as [4, 7, 8, 26].

2.2.3 Random forest (RF500)

The random forest (RF) learner with 500 decision trees was considered in this study because of its strong performance compared to other well-known classifiers [33]. To construct randomized decision trees, the RF classifier uses bagging and the ‘random subspace method’ [34]. The outputs of the ensemble of these randomized, unpruned decision trees are combined to produce the final prediction. Many existing works have applied RF for noise detection, such as [4, 8, 26, 31].

2.2.4 K-nearest neighbor (KNN)

The k-nearest neighbor classifier induces hyperspheres in the space of instances by allocating the majority class of the k nearest instances based on a defined metric [35]. It is a simple and effective classification algorithm [36] and has been widely used in the domain of noise detection [3, 4, 18, 31, 37,38,39,40,41].

2.2.5 Neural network (NN)

The multilayer perceptron (MLP) is a feed-forward neural network trained with the standard back-propagation procedure. As a supervised network, it needs labeled training data to produce favorable results. MLPs are used in the majority of neural network applications [42]. Neural networks have been widely used in the domain of noise detection [3, 5, 19, 26, 38, 43], and the MLP is a well-known classification filtering method for class noise detection [2].

3 Proposed CLCF model for class noise detection and classification

The proposed CLCF model is described for the detection of noisy instances. This model consists of three main phases: noise detection, noise filtering, and noise classification. Figure 3 illustrates the overall architecture of the proposed model. The clustering technique and classification filtering are integrated to detect and filter noisy data. The model phases are explained in detail next.

Fig. 3 Overall architecture of the CLCF model

3.1 Phase 1: noise detection (clustering-based)

In this phase, the k-means (KM) clustering technique [44] is applied to four real data sets to recognize misclassified instances. The k-means technique distributes input vectors into separate clusters by means of similarity and distance measurement [45]. All input vectors are grouped around distinct centers by minimizing the objective function in Eq. (1).

$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left( x_j - \mu_i \right)^2$$
(1)

where \(k\) is the number of clusters \(S_i\), \(i = 1, 2, \ldots, k\), and \(\mu_i\) denotes the centers of the clusters. First, the intensity distribution is computed, and then the initial centroids are created from \(k\) random intensities. The following equation shows the assignment step of the iterative clustering algorithm:

$$c^{(i)} = \mathop{\arg\min}_{j} \left\| x^{(i)} - \mu_j \right\|^2$$
(2)

The misclassified instances are those belonging to the class label with the lowest count in each cluster; these are then treated as class noise.

Let the noisy data set \(X\) consist of \(n\) instances \((x_1, y_1), \ldots, (x_n, y_n)\), where \(x_i \in \mathbb{R}^{m}\) (\(m\) being the number of features) and \(y_i \in \{+1, -1\}\). Define \(Y = \{ y_a = x_i \mid L(x_i) = +1 \}\) with \(a = 1, \ldots, A\), where \(A\) is the number of samples whose label is \(+1\); \(P = \{ p_t = x_i \mid L(x_i) = -1 \}\) with \(t = 1, \ldots, U\), where \(U\) is the number of samples whose label is \(-1\); and \(n = U + A\). Here \(L(x_i) \in \{+1, -1\}\) denotes the class label of sample \(x_i\).

Definition 1

Suppose \(M\) is a cluster containing instances with class label \(+1\), \(M = \{ L(x_a) \mid a = 1, \ldots, A \}\), and \(H\) is a cluster containing instances with class label \(-1\), \(H = \{ L(x_t) \mid t = 1, \ldots, U \}\). Assume \(M\) is partitioned into two classes \(M_1\) and \(M_2 = M - M_1\), where \(|M| = b\), \(|M_1| = b_1\), and \(|M_2| = |M - M_1| = b_2\). Then the following statements are used to detect noisy instances:

$$(b_1 < b_2) \Rightarrow (\forall x_i \in M_1,\; x_i \text{ is noise}) \wedge (\forall x_i \in M_2,\; x_i \text{ is noise free})$$
(3)
$$(b_2 < b_1) \Rightarrow (\forall x_i \in M_2,\; x_i \text{ is noise}) \wedge (\forall x_i \in M_1,\; x_i \text{ is noise free})$$
(4)
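A small sketch of this detection rule is given below: within each cluster, the instances carrying the minority class label are flagged as noise, per Eqs. (3) and (4). The function name and the NumPy representation are illustrative assumptions.

```python
# Sketch of Phase 1 (Definition 1): flag minority-label instances per cluster.
import numpy as np

def detect_noise_by_cluster(cluster_ids, labels):
    """Return a boolean mask that is True where an instance is flagged as noise."""
    noise = np.zeros(len(labels), dtype=bool)
    for c in np.unique(cluster_ids):
        in_c = cluster_ids == c
        pos = in_c & (labels == 1)      # instances labeled +1 in this cluster
        neg = in_c & (labels == -1)     # instances labeled -1 in this cluster
        if pos.sum() < neg.sum():       # b1 < b2: the +1 instances are noise (Eq. 3)
            noise[pos] = True
        elif neg.sum() < pos.sum():     # b2 < b1: the -1 instances are noise (Eq. 4)
            noise[neg] = True
        # ties (b1 == b2) are not covered by Definition 1 and are left unflagged
    return noise
```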

3.2 Phase 2: noise filtering (classification filtering)

In this phase, five classification filtering algorithms are applied to detect the real noisy instances. These classifiers are the support vector machine (SVM), random forest 500 (RF500), Naïve Bayes (NB), neural network (NN), and k-nearest neighbor (KNN, k = 10). Based on the first phase, candidate noisy and noise-free sets are identified. The noise set is then treated as the testing set (\(T\)) and the noise-free set as the training set (\(Tr\)). Each classifier is trained separately on the training set to create a model, and the testing set is then predicted with the created model. If the predicted label of a testing instance does not equal its original label, the instance is marked as “real noise”; otherwise it is “noise free”. The classification filtering problem is presented as follows:

Definition 2

Suppose \(\varphi\) is a classification algorithm and \(B = \varphi(T, Tr) = \{b_i\}_{i=1}^{n}\), where \(b_i\) is the predicted label of instance \(x_i\) from the test set \(T\) and \(|T| = n\). Equation (5) shows how the real noisy instances are identified. To illustrate Definition 2, the procedure of classification filtering is presented in Fig. 4, which shows how classification filtering detects noisy instances in data sets; a code sketch follows the figure.

$$b_i = \begin{cases} L(x_i), & x_i \text{ is noise free} \\ -L(x_i), & x_i \text{ is noise} \end{cases}$$
(5)
Fig. 4 The procedure of classification filtering
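The sketch below illustrates this filtering step under the same assumptions as before: the filter is trained on the noise-free set \(Tr\) and applied to the suspect set \(T\), and only disagreements with the original labels are kept as real noise (Eq. 5). The helper name is hypothetical, and scikit-learn's SVC stands in for any of the five filter classifiers.

```python
# Sketch of Phase 2: classification filtering of the suspects found in Phase 1.
from sklearn.svm import SVC

def classification_filter(clf, X, y, noise_mask):
    X_train, y_train = X[~noise_mask], y[~noise_mask]   # noise-free set Tr
    X_test, y_test = X[noise_mask], y[noise_mask]       # suspect (testing) set T
    b = clf.fit(X_train, y_train).predict(X_test)       # predicted labels b_i
    real_noise = noise_mask.copy()
    real_noise[noise_mask] = b != y_test                # disagreement => real noise
    return real_noise

# e.g.: real_noise = classification_filter(SVC(kernel="rbf"), X, y, noise)
```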

3.3 Phase 3: noise classification

Two approaches are used to deal with noisy samples: the “removing” and “relabeling” techniques. The removing approach omits all detected noisy samples after the noise filtering procedure and produces a new, reduced data set. The relabeling approach switches the label of every detected noisy object after the noise filtering procedure, keeping the original size of the data set. The proposed CLCF algorithm is illustrated in Fig. 5, followed by a short sketch of the two techniques.

Fig. 5 Proposed CLCF algorithm
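A minimal sketch of the two techniques, again assuming NumPy arrays and binary \(\pm 1\) labels:

```python
# Sketch of Phase 3: the two noise classification options.
def remove_noise(X, y, real_noise):
    # drop the flagged instances, producing a reduced data set
    return X[~real_noise], y[~real_noise]

def relabel_noise(y, real_noise):
    # flip the label of the flagged instances, keeping the data set size
    y_clean = y.copy()
    y_clean[real_noise] = -y_clean[real_noise]
    return y_clean
```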

4 Experimental setup

The experimental data sets and the performance evaluation criteria used in this study are discussed here. The accuracy of the CLCF algorithms with the removing and relabeling techniques on the Pima, Heart (statlog), Wisconsin, and Ionosphere data sets [6] is presented as well. Each experiment was run 10 times per data set to obtain average evaluation criteria. The average performance was calculated in terms of accuracy using the SVM-RBF kernel algorithm and tenfold cross validation.

4.1 Data sets

To test and evaluate the CLCF model, four real experimental data sets from the UCI repository [46] were used: three medical data sets, namely Pima, Wisconsin, and Heart (statlog), along with one non-medical data set, namely Ionosphere. The non-medical data set was included to evaluate the proposed model in a different area. All the data sets are binary classification problems. Table 1 lists the data sets used in this research with the number of classes (#Class), number of features (#Feature), and number of examples (#Ex).

Table 1 Distribution of data sets [6]

4.2 Performance measures

The accuracy formula is applied to calculate the classification performance of the proposed technique [24, 48] using the confusion matrix. In the following formula, True Negative (TN) refers to correctly rejected samples, True Positive (TP) to correctly identified samples, False Positive (FP) to incorrectly identified samples, and False Negative (FN) to incorrectly rejected samples:

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(6)
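For illustration, Eq. (6) can be computed from a confusion matrix as in the sketch below; the use of scikit-learn is an assumed choice.

```python
# Accuracy per Eq. (6), derived from the confusion matrix of binary +/-1 labels.
from sklearn.metrics import confusion_matrix

def accuracy(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    return (tp + tn) / (tp + tn + fp + fn)
```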

5 Results and discussion

The accuracy of the CLCF model on each data set, using k-means with the five classification filtering algorithms, is analyzed separately and illustrated in Figs. 6, 7, 8 and 9. The best K = 3 was determined experimentally over the range K = [2, 10]. The results for the four data sets, namely Pima, Wisconsin, Heart, and Ionosphere, are illustrated and explained as follows.

Fig. 6 Comparing accuracy of the CLCF model using k-means with five classification filtering algorithms on the Pima data set

Fig. 7 Comparing accuracy of the CLCF model using k-means with five classification filtering algorithms on the Wisconsin data set

Fig. 8 Comparing accuracy of the CLCF model using k-means with five classification filtering algorithms on the Heart data set

Fig. 9 Comparing accuracy of the CLCF model using k-means with five classification filtering algorithms on the Ionosphere data set

The accuracy achieved by the five algorithms, namely KM-SVM, KM-KNN, KM-NB, KM-RF500 and KM-NN, with the two noise classification techniques, removing and relabeling, on the Pima data set is illustrated in Fig. 6. Although KM-KNN, with 90.843% accuracy, was the best of the five CLCF algorithms using the removing technique, Fig. 6 shows that the relabeling technique outperformed the removing technique for all five CLCF algorithms. The best accuracy on the Pima data set, highlighted in Table 2, is 94.817%, achieved by KM-RF500 using the relabeling technique.

The accuracy achieved by the five algorithms with the removing and relabeling techniques on the Wisconsin data set is illustrated in Fig. 7. Although KM-SVM, with 96.704% accuracy, was the best of the five CLCF algorithms using the removing technique, KM-SVM using relabeling outperformed it. The best accuracy on the Wisconsin data set, highlighted in Table 3, is 96.877%, achieved by KM-SVM using the relabeling technique.

The accuracy achieved by the five algorithms with the removing and relabeling techniques on the Heart data set is illustrated in Fig. 8. KM-NN, with 85.715% accuracy, was the best of the five CLCF algorithms using the removing technique, and KM-NN was also the best using relabeling, with 85.066%. The best accuracy on the Heart data set, highlighted in Table 4, is 85.715%, achieved by KM-NN using the removing technique.

The accuracy achieved by the five algorithms with the removing and relabeling techniques on the Ionosphere data set is illustrated in Fig. 9. Although the removing technique outperformed the relabeling technique for four of the CLCF algorithms, KM-SVM using relabeling, with 85.439%, outperformed the removing technique. The best accuracy is highlighted in Table 5.

To guide the interpretation of Figs. 6, 7, 8 and 9, the results of the CLCF model using k-means with the five classification filtering algorithms are compared in Tables 2, 3, 4 and 5 for each data set (with the best results highlighted in bold together with their standard deviations). Based on the empirical results, the best CLCF algorithm for the Pima data set is KM-RF500 with the relabeling technique. Although the removing technique generally achieved better results on the Wisconsin data set, KM-SVM with relabeling outperformed the other techniques. For the Heart data set, the best CLCF algorithm is KM-NN with the removing technique. Similarly, although the removing technique generally achieved better results on the Ionosphere data set, KM-SVM with relabeling outperformed the other techniques.

Table 2 Accuracy of the removing and relabeling techniques of the CLCF model using k-means with five classification filtering algorithms on the Pima data set
Table 3 Accuracy of the removing and relabeling techniques of the CLCF model using k-means with five classification filtering algorithms on the Wisconsin data set
Table 4 Accuracy of the removing and relabeling techniques of the CLCF model using k-means with five classification filtering algorithms on the Heart data set
Table 5 Accuracy of the removing and relabeling techniques of the CLCF model using k-means with five classification filtering algorithms on the Ionosphere data set

We analyzed and compared the best CLCF algorithm for each data set (after noise reduction) with the results for each data set before noise reduction. All four data sets were evaluated before noise reduction using the SVM-RBF kernel algorithm and tenfold cross validation. Table 6 compares the accuracy of the data sets before and after noise detection. The results in this table indicate that the Pima and Heart data sets contain more noisy instances than the Wisconsin and Ionosphere data sets. The table also shows that the proposed CLCF model outperformed the baseline and increased the overall performance. This research used three medical data sets, namely Pima, Wisconsin, and Heart (statlog); the consistency between their results provides strong support for the validity of the proposed algorithms for classifying class noisy instances in medical areas. The results on the Ionosphere data set likewise show its efficiency for decision-making systems in other domains.

Table 6 Comparative analysis of the best result of the CLCF model and the results obtained before noise reduction on four data sets

5.1 Comparison of the results

As Table 7 shows, a comparison between the proposed technique and existing techniques indicates that the proposed technique leads to more accurate classification. The results show an improvement when compared with the existing approaches and the preceding comparative analysis. A close analysis of Tables 2, 3, 4, 5, 6 and 7 suggests that the proposed model is able to accurately detect class noisy instances and enhance the overall performance.

Table 7 Accuracy comparisons with existing techniques

5.2 Significance test on the CLCF results

In this section, statistical tests are used to examine the significance of the differences in mean performance before and after noise detection. A paired t test was performed on the accuracy results to determine whether the differences between the two methods are statistically significant at the level \(\alpha = 0.05\). The p values of the paired t test comparing the CLCF model with the results before noise detection for each data set are shown in Table 8; all the differences are statistically significant.

Table 8 p values of the paired t test comparing the CLCF model with the model before noise detection
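For illustration, a paired t test of this kind can be run as in the sketch below; the accuracy arrays are hypothetical placeholders, not the paper's results, and SciPy is an assumed tool choice.

```python
# Sketch of the paired t test at alpha = 0.05 on per-run accuracy results.
from scipy.stats import ttest_rel

acc_before = [0.771, 0.768, 0.773, 0.770, 0.769]   # placeholder per-run accuracies
acc_after = [0.945, 0.949, 0.947, 0.944, 0.948]    # placeholder per-run accuracies

t_stat, p_value = ttest_rel(acc_after, acc_before)
significant = p_value < 0.05                       # True if the difference is significant
```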

6 Conclusions

This paper investigated the misclassified-instance issue and proposed the CLCF model for class noise detection and classification using the k-means algorithm and five different classification algorithms. It has been confirmed that combining a clustering technique with classification filtering can lead to more reliable and accurate results for class noise detection and classification. The proposed model was applied to four binary, low-dimensional data sets, and its performance was evaluated in terms of accuracy. It is assumed that the data sets have no missing values, and the proposed model is not robust against outliers. The proposed CLCF model was shown to be successful in identifying class noisy samples in comparison with the results obtained before noise detection. In addition, to improve data quality, two different techniques for noise classification, namely removing and relabeling, were applied. The main limitation of this model is the crisp nature of the k-means algorithm in allocating cluster membership to data instances: any data item, especially highly structured data, is assigned to one of the existing clusters based on the minimum distance, a scenario that does not generally hold for real-world data. There are three directions for future work. First, it would be useful to apply other clustering techniques to overcome the limitation of k-means. Second, a new method for noise recognition could be developed. Finally, because the proposed method works with two-class data sets, another direction for this research is to propose a method for classifying multiclass data sets.