Abstract
Oversampling is a promising preprocessing technique for imbalanced datasets that generates new minority instances to balance the dataset. However, improperly generated minority instances, i.e., noise instances, may interfere with the learning of the classifier and impact it negatively. Given this, in this paper, we propose a simple and effective oversampling approach known as ASN-SMOTE based on the k-nearest neighbors and the synthetic minority oversampling technique (SMOTE). ASN-SMOTE first filters noise in the minority class by determining whether the nearest neighbor of each minority instance belongs to the minority or the majority class. After that, ASN-SMOTE uses the nearest majority instance of each minority instance to effectively perceive the decision boundary, inside which qualified minority instances are selected adaptively for each minority instance by the proposed adaptive neighbor selection scheme to synthesize new minority instances. To substantiate its effectiveness, ASN-SMOTE has been applied to three different classifiers and comprehensive experiments have been conducted on 24 imbalanced benchmark datasets. ASN-SMOTE is also extensively compared with nine notable oversampling algorithms. The results show that ASN-SMOTE achieves the best results on the majority of datasets. The ASN-SMOTE implementation is available at: https://www.github.com/yixinkai123/ASN-SMOTE/.
Introduction
The problem of class-imbalanced data occurs frequently in machine learning; it means that the class distribution of the data in binary or multi-class classification problems is significantly skewed. For instance, in a binary classification problem, the majority class contains a large number of instances, while the minority class contains only a few [42]. Such problems often appear in practical applications such as bank fraudulent transaction detection [36], credit risk assessment [34], text classification [41], biomedical diagnosis [2, 59] and firewall intrusion detection [5].
Learning from class-imbalanced data poses a tough challenge to conventional classifiers in the field of supervised learning and data mining. Specifically, when dealing with imbalanced datasets, traditional classifiers usually favor the majority class, since they were originally designed to process balanced datasets [53]. As a result, the prediction ability of the classifier is greatly affected by the imbalanced distribution of the data, and minority instances cannot be correctly classified. Especially in cases of extreme training data imbalance, the classifier is highly influenced by the majority class and can achieve high classification accuracy while the minority class goes entirely undetected [14]. For example, in firewall intrusion detection, if malicious attack information accounts for only 0.1% of the training data, a classifier that recognizes all samples as normal still achieves an accuracy of 99.9%. This means that even such a seemingly well-trained classifier cannot be applied in practice.
Due to the universal existence of imbalanced datasets in practical applications and the difficulty for traditional classifiers to deal with them, learning from class-imbalanced data has attracted the attention of many prominent researchers over the last 20 years [31]. Many preprocessing methods have been put forward to deal with the class imbalance of datasets. These methods can be basically divided into three wide-ranging categories: algorithm-level methods, cost-sensitive methods and data-level methods [23]. Algorithm-level approaches modify or develop classification algorithms to enhance the learning of minority classes [11, 27, 60]. The essence of cost-sensitive methods is to minimize the overall cost of incorrect classifications [24, 37]. Apart from the above, data-level methods preprocess imbalanced datasets by resampling the data to balance the class distribution [13, 28, 30, 43, 52]. Compared with algorithm-level and cost-sensitive methods, data-level methods are more conventional ways to tackle imbalanced datasets, for they can be applied without modifying the classifiers [45].
Among the large number of oversampling methods, the synthetic minority oversampling technique (SMOTE) is one of the most well-known; it generates artificial minority instances by linear interpolation [13]. However, SMOTE and some other existing methods mainly focus on dealing with the imbalance between classes, but fail to effectively solve the issues of within-class imbalance and small disjuncts [3, 13, 30]. This may result in some improper synthesized minority instances, such as noise and overlapping instances, which lead to performance degradation instead of improvement. Some methods have been proposed that use minority instances around the decision boundary to synthesize instances and can largely address the above issues [16, 21, 28, 43, 44, 52]. However, this makes the decision boundary even more indistinct and blurred, and may cause newly synthesized samples to fall into the majority class region, which amplifies the noise and degrades the performance of classifiers [56].
In view of this, this paper proposes a simple yet effective variant of the synthetic minority oversampling technique, known as ASN-SMOTE, which adaptively nominates qualified minority neighbors to synthesize minority instances. More specifically, for each instance of the minority class, the Euclidean distances between it and all other instances are first calculated. A minority instance whose nearest neighbor is a majority instance is considered to be noise and is filtered. On the other hand, if the minority instance’s nearest neighbor is also a minority one, the qualified synthesizers among its k-nearest neighbors (KNN) are determined adaptively according to the Euclidean distances to them. Finally, these qualified neighbors are used to generate minority instances by the SMOTE algorithm [13]. The synthesized minority instances are combined with the imbalanced dataset and fed into a traditional supervised classifier for training.
ASN-SMOTE can be applied to real-world datasets and improve the accuracy of the supervised classifiers without modifications.
The main advantages of ASN-SMOTE are threefold: (1) It can filter noise and improve the effectiveness of oversampling. (2) It makes full use of the information of the imbalanced dataset to adaptively select qualified synthesizers, which makes the synthesized minority instances more accurate to the practical data distribution. (3) It can be combined with any classification algorithm and there is no need to modify the classification algorithms.
To evaluate the performance of ASN-SMOTE, extensive comparative experiments were conducted. The proposed ASN-SMOTE was applied on 24 public benchmark datasets with imbalance ratios ranging from 1.82 to 41.4. Three classifiers, namely the k-nearest neighbor (KNN) [18], SVM [17] and random forest (RF) [8], were used for classification. ASN-SMOTE was then compared with nine other notable oversampling techniques, namely (1) random oversampling, (2) SMOTE [13], (3) ADASYN [30], (4) Borderline1-SMOTE [28], (5) Borderline2-SMOTE [28], (6) k-means SMOTE [21], (7) SVM-SMOTE [44], (8) SWIM-RBF [7], and (9) LoRAS [6]. The results show that ASN-SMOTE achieves the best results on the majority of datasets in terms of three different evaluation indicators.
The remainder of the paper is organized as follows. In “Related work”, prior research alongside other existing sampling methods is reviewed, especially in regard to oversampling methods, in addition to problems that exist in practical applications. In “Oversampling algorithm for adaptively selecting neighbor synthesis instances”, the working principle of the proposed oversampling method is introduced in detail. In “Experiments and results”, the experimental design, results, and discussion are described. Finally, the conclusion is presented in “Conclusion”.
Related work
In this section, data-level preprocessing methods are reviewed and summarized. Generally, these preprocessing methods can be categorized into three types, namely the oversampling, undersampling and hybrid sampling methods [20]. Oversampling refers to the use of sampling methods to generate new minority instances to balance the dataset [13, 28, 30, 43, 52]. In contrast, undersampling is employed to tackle the problem of imbalanced data sets by deleting some majority instances [29, 38, 39, 54, 55, 57]. Table 1 summarizes the existing oversampling, undersampling and hybrid sampling methods that have been successfully applied to deal with imbalanced data problems.
Undersampling
Random undersampling is a common method that balances the class distribution of imbalanced datasets by randomly removing some majority instances. However, this method may cause the loss of some valuable classification information during the training process of a classifier. To overcome this shortcoming, researchers have proposed distinctive undersampling approaches. The condensed nearest neighbor rule (CNN) [29] and Tomek link [54] are two examples. In the CNN method, some of the majority instances around the decision boundary are removed and the remaining instances are used for training. When the Tomek link method is utilized, the Tomek link pairs are calculated, where a Tomek link pair consists of two instances of different classes that are each other's nearest neighbors, and the majority instance of each Tomek link pair is deleted [4]. In general, when the dataset is small or extremely imbalanced, these undersampling methods cannot overcome the class-imbalance problem: they remove a large number of majority instances and the classifier cannot achieve an effective out-of-sample performance [21, 53].
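As an illustration of the Tomek link definition above, such pairs can be found with a brute-force nearest-neighbor search. The following is a minimal sketch (the function name and data layout are ours, not taken from the cited implementations):

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) forming Tomek links: two instances of
    different classes that are each other's nearest neighbor."""
    # Pairwise Euclidean distance matrix, with the diagonal masked out
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argmin(d, axis=1)  # nearest neighbor of each instance
    links = []
    for i in range(len(X)):
        j = nn[i]
        # Mutual nearest neighbors of different classes form a Tomek link
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links

X = np.array([[0.0, 0.0], [0.4, 0.0], [5.0, 5.0], [10.0, 10.0]])
y = np.array([0, 1, 0, 1])
pairs = tomek_links(X, y)
```

Undersampling with this method would then delete the majority-class member of each returned pair.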
Oversampling
Random oversampling is the simplest oversampling method. It randomly selects minority instances and copies them until the class distribution of the imbalanced dataset becomes balanced. Nevertheless, this method generates samples identical to the original data, resulting in over-fitting of the classification algorithm [4, 14].
To avoid the problem of over-fitting caused by random oversampling, Chawla et al. proposed the SMOTE algorithm [13] in 2002. Contrary to merely copying the existing instances, this algorithm generates artificial instances by linear interpolation. As shown in Fig. 1, the SMOTE algorithm generates minority instances by interpolating between a minority instance and its k nearest minority class neighbors. Specifically, the SMOTE algorithm generates minority instances in three steps. First, it chooses a minority instance \(\overrightarrow{x} \), then randomly selects one instance \(\overrightarrow{y} \) among the k nearest minority neighbors of \(\overrightarrow{x} \). Finally, a new instance \(\overrightarrow{s} \) is synthesized as the random interpolation between these two instances, calculated by the formula \( \overrightarrow{s}=\overrightarrow{x}+\lambda \times (\overrightarrow{y}-\overrightarrow{x}) \), where \(\lambda \) is a random weight in [0,1]. However, SMOTE treats all minority instances alike and ignores the locations of the minority instances selected for synthesizing new instances, which may result in synthesized minority instances located in the majority class region, causing noise and deteriorating the classification performance, as shown in Fig. 2.
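The interpolation step can be sketched in a few lines. This is only an illustration of the synthesis formula \( \overrightarrow{s}=\overrightarrow{x}+\lambda \times (\overrightarrow{y}-\overrightarrow{x}) \), not the full SMOTE algorithm; the function and variable names are ours:

```python
import numpy as np

def smote_interpolate(x, neighbors, rng=None):
    """Synthesize one instance between x and a randomly chosen minority neighbor.

    x         -- 1-D feature vector of a minority instance
    neighbors -- 2-D array holding x's k nearest minority neighbors
    """
    rng = np.random.default_rng() if rng is None else rng
    y = neighbors[rng.integers(len(neighbors))]  # pick one neighbor at random
    lam = rng.random()                           # random weight in [0, 1)
    return x + lam * (y - x)                     # s = x + lambda * (y - x)

x = np.array([1.0, 1.0])
neighbors = np.array([[2.0, 2.0], [0.0, 3.0]])
s = smote_interpolate(x, neighbors, rng=np.random.default_rng(0))
```

The synthesized point always lies on the line segment between \(\overrightarrow{x}\) and the chosen neighbor, which is exactly why, as noted above, a badly chosen neighbor can place it inside the majority class region.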
Given the above problems, a variety of improved oversampling methods have been proposed. Adaptive synthetic sampling (ADASYN) [30] and its variant KernelADASYN [52] adaptively adjust the sampling weights according to the proportion of majority instances among the k-nearest neighbors. In this way, minority instances that are close to majority instances are more likely to be sampled. Similar to ADASYN, Borderline-SMOTE [28] has two sampling strategies. Borderline1-SMOTE focuses attention on the decision boundary: it first identifies the minority instances around the decision boundary, and then synthesizes instances from these minority instances, so as to balance the class distribution of the dataset. Borderline2-SMOTE extends this method by also allowing a minority instance to interpolate with a majority class neighbor, with the interpolation weight set to less than 0.5. The aim is to improve the similarity between the minority instance and the synthetic instance, so as to enhance the ability of the classifier to distinguish instances around the decision boundary.
Additionally, another kind of sampling method is the clustering-based oversampling algorithm. Cluster-SMOTE initially uses the k-means clustering algorithm to divide the minority instances into several clusters and applies SMOTE in each cluster [16]. In Ref. [43], an adaptive semi-unsupervised weighted oversampling (A-SUWO) approach was presented. A-SUWO first utilizes a semi-unsupervised hierarchical clustering algorithm to cluster minority instances and adaptively determines the oversampling size for each sub-cluster. Recently, a heuristic oversampling method based on k-means and SMOTE was proposed by Douzas et al. [21]. This method clusters the entire sample set and calculates the imbalance ratio (IR) of each cluster; according to the size of the IR, SMOTE is applied within each cluster. The main idea of SVM-SMOTE (support vector machine SMOTE) [44] is to use an SVM classifier to find the support vectors, and then synthesize new instances based on them. Similar to Borderline-SMOTE, SVM-SMOTE also determines the type of each instance (safe, danger, noise) according to the attributes of its k-nearest neighbors, and then performs oversampling on the danger-type instances [44].
The above methods suggest that the minority instances around the decision boundary can be used to synthesize instances and can largely address the issues of within-class imbalance and small disjuncts. However, this blurs the decision boundary and may generate minority instances in the majority class region (see Fig. 3), which amplifies the noise and degrades the performance of classifiers [56].
Unlike SMOTE and its variants that rely on the positions and distances between minority instances, SWIM-RBF [7] uses the density of minority instances relative to the distribution of majority class to determine where to generate synthetic instances. LoRAS [6] provides a method to choose neighborhoods of a minority class instance point by performing prior manifold learning over the minority class using t-Stochastic Neighborhood Embedding.
To address more effectively the issue of determining the proper minority instances involved in synthesis, so as to avoid the generated minority instances falling into the majority class region, in this paper we present ASN-SMOTE, which is also an improved variant of SMOTE. Similar to Borderline-SMOTE [28] and some other variants of SMOTE [21, 44], ASN-SMOTE performs noise filtering before synthesizing minority instances. However, in ASN-SMOTE, a minority instance whose nearest neighbor is a majority instance is considered to be noise and is filtered. This differs from other SMOTE-based techniques, which determine whether an instance is noise according to the proportion of majority instances among its k-nearest neighbors. Another difference between ASN-SMOTE and other oversampling techniques is that ASN-SMOTE only selects minority neighbors that are closer than the nearest majority instance for synthetic interpolation. These two mechanisms help ASN-SMOTE select proper minority instances so that the synthesized minority instances will not appear in majority class regions.
Oversampling algorithm for adaptively selecting neighbor synthesis instances
In the previous section, we reviewed related works and presented the existing issues. Motivated by this, we present the novel synthetic minority oversampling method with adaptive qualified synthesizer selection, known as ASN-SMOTE, in this section. The fundamental idea behind the ASN-SMOTE algorithm is to make full use of the information of the k nearest neighbor (KNN) instances to synthesize appropriate minority instances. ASN-SMOTE involves the following three steps: (1) noise filtering, (2) adaptively selecting neighbor instances, and (3) synthesizing instances.
Noise filtering
Filtering noise is an essential process in the training stage of machine learning because noise is a kind of interference for sampling algorithms and classifiers [12]. It not only invalidates the sampling algorithm, but also reduces the overall performance of the classifier. In this work, a KNN-based noise filtering algorithm is proposed to reduce the generation of synthetic minority instances located in the majority class regions. The proposed algorithm can accurately identify noise and the minority instances surrounding the decision boundary by judging the nearest instance of each minority instance, and filter them.
Specifically, denote the majority class set, the minority class set and the complete dataset as \({\mathbf {N}}\), \({\mathbf {P}}\) and \({\mathbf {T}}\), respectively. \(N_p\) and \(N_n\) are the numbers of minority instances and majority instances, respectively. For each instance \( p_{i}\in {\mathbf {P}}, (i=1,2, \ldots , N_p) \), calculate the Euclidean distance to every other instance in \( {\mathbf {T}} \), and find the nearest instance \( D(p_i) \):

$$\begin{aligned} D(p_{i})=\mathop {\arg \min }\limits _{t\in {\mathbf {T}}\setminus \{p_{i}\}} \Vert p_{i}-t\Vert _2, \end{aligned}$$
where \(\Vert x-y\Vert _2\) returns the Euclidean distance between x and y.
If \(D\left( p_{i}\right) \in {\mathbf {N}}\), \( p_i \) is regarded as noise or a minority instance around the decision boundary and is recorded as an unqualified instance, \({\mathbf {M}}_u = {\mathbf {M}}_u \cup \{p_i\}\), where \({\mathbf {M}}_u\) is the set of unqualified minority instances. Otherwise, \({\mathbf {M}}_q = {\mathbf {M}}_q \cup \{p_i\}\), where \({\mathbf {M}}_q\) is the set of qualified minority instances (see Fig. 4).
The pseudo-code of the noise filtering method is presented in Algorithm 1.
![figure a](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs40747-021-00638-w/MediaObjects/40747_2021_638_Figa_HTML.png)
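The noise-filtering step can be sketched as follows; this is a minimal brute-force illustration in our own notation (the function name and data layout are ours), not the paper's implementation:

```python
import numpy as np

def filter_noise(P, N):
    """Split minority set P into qualified (M_q) and unqualified (M_u) instances.

    A minority instance whose nearest neighbor in the complete dataset is a
    majority instance is treated as noise/unqualified and put into M_u.
    """
    T = np.vstack([P, N])                            # complete dataset T
    labels = np.array([1] * len(P) + [0] * len(N))   # 1 = minority, 0 = majority
    M_q, M_u = [], []
    for i, p in enumerate(P):
        d = np.linalg.norm(T - p, axis=1)  # Euclidean distances to all instances
        d[i] = np.inf                      # exclude p itself
        nearest = np.argmin(d)             # D(p_i)
        (M_q if labels[nearest] == 1 else M_u).append(p)
    return np.array(M_q), np.array(M_u)

P = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
N = np.array([[5.0, 5.1], [6.0, 6.0]])
M_q, M_u = filter_noise(P, N)
```

In this toy example the isolated minority point at (5, 5), whose nearest neighbor is a majority instance, is filtered out, while the two clustered minority points are kept as qualified.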
Adaptive selection of neighbor instances
Neighbor instance selection greatly affects the quality of the generated instances, where the challenge is to select the appropriate nearest neighbors that qualify for a minority class instance. In the original SMOTE algorithm, minority class instances are randomly selected to synthesize new instances. This may result in synthesized instances located in the majority class region [37]. Using these synthetic instances as training data reduces the performance of the classifier. Given this, an adaptive neighbor selection strategy is proposed in this paper.
Specifically,

1. For each instance in \({\mathbf {M}}_{q}\), denoted as \(q_{j}, j=1,2, \ldots , N_{q}\), find the k nearest neighbor instances of \(q_{j}\) in \({\mathbf {T}}\) by Euclidean distance calculation, denoted as \(q_{j}^1, q_{j}^2, \ldots , q_{j}^k\).

2. If all the k nearest neighbor instances, i.e., \(q_{j}^1, q_{j}^2, \ldots , q_{j}^k\), are minority instances, all of them are added to the set \({\mathbf {Q}}_{j}\), where \({\mathbf {Q}}_{j}\) represents the set of qualified neighbors of the qualified minority instance \(q_{j}\). Otherwise, go to step 3.

3. If there is at least one majority instance among the k nearest neighbor instances, find the nearest majority instance, denoted as \(q_{j}^\mathrm {near}\), among the k nearest neighbor instances of \(q_{j}\).

4. Each neighbor in the minority class whose distance to \(q_{j}\) is smaller than \(\Vert q_{j} - q_{j}^\mathrm {near}\Vert _2\) is added to the set \({\mathbf {Q}}_{j}\).
The pseudo-code of the adaptive neighbor instances selection method is presented in Algorithm 2.
![figure b](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs40747-021-00638-w/MediaObjects/40747_2021_638_Figb_HTML.png)
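The four steps above can be sketched as follows; this is a minimal illustration of the adaptive neighbor selection for a single qualified minority instance (function and variable names are ours):

```python
import numpy as np

def qualified_neighbors(q, T, labels, k):
    """Return the set of qualified neighbors Q_j of a qualified minority instance q.

    T      -- 2-D array of all other instances in the dataset
    labels -- 1 = minority, 0 = majority, aligned with the rows of T
    k      -- neighborhood size
    """
    d = np.linalg.norm(T - q, axis=1)
    idx = np.argsort(d)[:k]                  # k nearest neighbors of q (step 1)
    maj = [i for i in idx if labels[i] == 0]
    if not maj:                              # all k neighbors are minority:
        return T[idx]                        # accept them all (step 2)
    d_near = d[maj[0]]                       # nearest majority neighbor (step 3)
    # Keep only minority neighbors closer than the nearest majority one (step 4)
    keep = [i for i in idx if labels[i] == 1 and d[i] < d_near]
    return T[keep]

T = np.array([[1.0, 0.0], [0.0, 2.0], [1.5, 0.0], [3.0, 0.0]])
labels = np.array([1, 1, 0, 1])
Q = qualified_neighbors(np.array([0.0, 0.0]), T, labels, k=4)
```

Here only the minority neighbor at (1, 0) is qualified, since the other minority neighbors are farther from q than the nearest majority instance at (1.5, 0).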
Synthesizing instances
Synthesizing the minority class instances is the last step of the oversampling technique for imbalanced data. The instance synthesis method in this work follows the original linear interpolation for it is easy to calculate and does not cause the synthesized instance to be located in the majority class regions due to improper weight allocations.
Specifically, for the qualified minority instance \(q_i\), we first determine the number of synthesis instances (\(N_s\)) for each qualified minority instance as follows:
After that, randomly select a qualified neighbor from \({\mathbf {Q}}_{i}\), denoted as \(q_i'\), and perform linear interpolation to synthesize an instance (s) as follows:

$$\begin{aligned} s = q_{i} + w \times (q_{i}' - q_{i}), \end{aligned}$$
where \( w\in (0,1) \) is a random weight. This random selection and linear interpolation is executed \(N_s\) times for each qualified minority instance.
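The synthesis step can be sketched as follows, assuming the number of instances to synthesize \(N_s\) has already been computed (names are ours; for simplicity the random weight is drawn from [0, 1) rather than the open interval (0, 1)):

```python
import numpy as np

def synthesize(q, Q, n_s, rng=None):
    """Generate n_s synthetic instances for a qualified minority instance q
    by linear interpolation with randomly chosen qualified neighbors Q."""
    rng = np.random.default_rng() if rng is None else rng
    out = []
    for _ in range(n_s):
        q_prime = Q[rng.integers(len(Q))]  # random qualified neighbor from Q_i
        w = rng.random()                   # random weight in [0, 1)
        out.append(q + w * (q_prime - q))  # s = q + w * (q' - q)
    return np.array(out)

q = np.array([0.0, 0.0])
Q = np.array([[2.0, 0.0]])                 # qualified neighbors of q
S = synthesize(q, Q, 5, rng=np.random.default_rng(1))
```

Because every \(q_i'\) is a qualified neighbor, each synthetic instance lies on a segment between two minority instances that is closer to \(q_i\) than any majority instance.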
As shown in Fig. 5, the noise filtered minority class instances adaptively select qualified minority class neighbor instances to synthesize new instances. The synthesized minority instances are generated from the original minority instances and do not reside in the region of the majority class.
The procedure of the proposed ASN-SMOTE is summarized as follows.
![figure c](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs40747-021-00638-w/MediaObjects/40747_2021_638_Figc_HTML.png)
Experiments and results
This section focuses on the validation and evaluation of the proposed oversampling method. ASN-SMOTE is evaluated with different classifiers and datasets in terms of various performance measures.
Experimental datasets
To evaluate the performance of ASN-SMOTE, experiments were conducted on 24 public datasets that were selected from the KEEL repository [1] and the UCI Machine Learning Repository [22]. Considering that the datasets of real-world applications are quite complex, when a dataset contains more than two categories, one of the smaller categories is selected as the minority class and the rest of the data is used as the majority class. Additionally, to evaluate the adaptability of ASN-SMOTE under extreme conditions, two large datasets with high imbalance ratios were selected as well. The detailed information of the experimental datasets is shown in Table 2.
Evaluation measures
In this work, the geometric mean (G-mean) of accuracy, F-measure and area under the curve (AUC) were selected as the evaluation measures, which are, respectively, defined as follows.
The geometric mean of the classification accuracies of the two classes:

$$\begin{aligned} \text {G-mean}=\sqrt{\text {Sensitivity}\times \text {Specificity}}. \end{aligned}$$
The harmonic mean of Sensitivity and Precision:

$$\begin{aligned} \text {F-measure}=\frac{2\times \text {Sensitivity}\times \text {Precision}}{\text {Sensitivity}+\text {Precision}}. \end{aligned}$$
AUC is the area under the receiver operating characteristic (ROC) curve. The ROC curve is a two-dimensional coordinate graph in which the X-axis represents the false positive rate (FPR) and the Y-axis represents the true positive rate (TPR). It is a commonly used performance measure for evaluating imbalance problems and can be calculated as:

$$\begin{aligned} \text {AUC}=\frac{1+\text {TPR}-\text {FPR}}{2}, \end{aligned}$$
where \(\text {TPR}=\frac{\text {TP}}{N_{p}}\) and \(\text {FPR}=\frac{\text {FP}}{N_{n}}\), TP is the number of positive (minority) class instances correctly classified, and FP is the number of negative (majority) class instances incorrectly classified as positive.
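For illustration, the three measures can be computed from a binary confusion matrix as follows. This is a sketch in our own notation; the AUC here is the single-threshold value \((1+\text {TPR}-\text {FPR})/2\) based on the rates defined above, not the area under a full ROC curve:

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """Compute G-mean, F-measure and single-point AUC from binary labels
    (1 = minority/positive, 0 = majority/negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    sensitivity = tp / (tp + fn)  # TPR = TP / N_p
    specificity = tn / (tn + fp)  # TNR = 1 - FPR
    precision = tp / (tp + fp)
    g_mean = np.sqrt(sensitivity * specificity)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)
    auc = (1 + sensitivity - (1 - specificity)) / 2  # (1 + TPR - FPR) / 2
    return g_mean, f_measure, auc

g, f, a = imbalance_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                            [1, 1, 1, 0, 0, 0, 0, 1])
```

Unlike plain accuracy, all three measures penalize a classifier that ignores the minority class, which is why they are preferred for imbalanced data.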
Experimental configurations
To conserve the between-class ratios of the original datasets, fivefold stratified cross validation was used to ensure that the original class distribution is preserved in each fold. Furthermore, each experiment was conducted three times and the average metric values were taken so as to avoid the impact of randomness. To evaluate the performance of the proposed method, it was compared with nine other oversampling techniques: (1) random oversampling, (2) SMOTE, (3) ADASYN, (4) Borderline1-SMOTE, (5) Borderline2-SMOTE, (6) k-means SMOTE, (7) SVM-SMOTE, (8) SWIM-RBF, and (9) LoRAS. Three classifiers, namely KNN, SVM and random forest (RF), were used for classification. These algorithms were implemented in Python on a 64-bit operating system with 8 GB of available RAM and a 2.30 GHz i5-8300 CPU.
For SMOTE and its variants, namely Borderline1-SMOTE, Borderline2-SMOTE, k-means SMOTE, and ASN-SMOTE, the number of nearest neighbors was selected from the values (3, 5, 7, 9), and the number of clusters in k-means SMOTE was selected from (3, 5, 7, 9, 10, 15, 20), depending on the dataset. For SWIM-RBF, epsilon was selected from (0.25, 0.5, 1, 2, 3, 5). For LoRAS, the number of shadow samples was selected from 2 up to the number of features and the number of nearest neighbors was selected from the values (3, 5, 7, 9). SVM-SMOTE and ADASYN were implemented with the Python library scikit-learn [46] using default parameters.
For the KNN classifier [18], the number of nearest neighbors was selected from the values (3, 5, 7, 9). For the SVM classifier [17], the cost parameter was selected from the values (0.001, 0.01, 0.1, 1, 10) and the kernel width \(\gamma \) was selected from the values (0.1, 0.3, 0.5, 0.7, 1, 2, 5). For the random forest (RF) classifier [8], the parameter n_estimators, i.e., the number of decision trees, was selected from the values (20, 30, 50, 80, 100). The parameter max_depth, which is the maximum depth of each decision tree, was selected from the values (5, 10, 13, 15, 20), and the Gini index was selected as the node partition criterion. The parameter max_features, i.e., the number of features to consider when looking for the best split, was set to the square root of the number of features in the dataset.
The parameters of the three classifiers and the oversampling methods were optimized using fivefold stratified cross-validation on the 24 public datasets based on G-mean, F-measure and AUC.
Results and discussion
The average results and the standard deviations of the G-mean, F-measure and AUC obtained by the three classifiers are shown in Tables 3, 4 and 5, respectively. The best experimental results on each dataset are highlighted in bold. In addition, Fig. 6 summarizes the scores and the comprehensive ranking of these oversampling techniques on the 24 datasets. The three performance measures obtained from fivefold cross validation are used as the ranking basis. For each dataset, the first rank is allocated to the best oversampling technique and the tenth rank is assigned to the worst performing technique. Consequently, the average ranking of the three performance measures of each oversampling technique over all 24 datasets is calculated. The mean ranking of each method therefore lies in [1.0, 10.0].
From the results, it can be seen that ASN-SMOTE obtains the best performance in at least one performance measure on 18 out of the 24 public datasets when SVM, KNN, and RF were used. Figure 6 also indicates that ASN-SMOTE ranks first overall no matter which classifier and performance indicator is adopted. This is because practical datasets almost always contain noisy minority instances and exhibit the issues of within-class imbalance and small disjuncts, and the noise filtering and adaptive neighbor selection proposed in ASN-SMOTE are promising mechanisms to deal with them by avoiding synthesizing new minority instances in the majority class region. We can also observe that ASN-SMOTE performs well on some datasets that have high imbalance ratios and sparse minority instances. This is possibly because such datasets have more severe within-class issues and hence noise instances account for a larger proportion of the minority instances. As a result, a larger proportion of synthetic minority instances would otherwise be located in the majority class region, which allows the advantages of ASN-SMOTE to be fully exploited.
On the other hand, we find that the random oversampling method performs the worst on all three measures when KNN was used as the classifier. When SVM and RF were used as the classifier, the Borderline2-SMOTE method performs the worst on all three measures. Compared with ADASYN, k-means SMOTE, SVM-SMOTE and Borderline2-SMOTE, SMOTE performs better in terms of AUC, and Borderline1-SMOTE and SMOTE perform better in terms of G-mean and F-measure. ADASYN, k-means SMOTE and SVM-SMOTE perform moderately according to all three classifiers and three measures. The two recently proposed algorithms, i.e., SWIM-RBF and LoRAS, perform well on all three measures when KNN and SVM were used as the classifier.
To gain deeper insight into the effects of the noise filtering and adaptive neighbor selection mechanisms, we applied SMOTE, SMOTE/NF (SMOTE with noise filtering) and ASN-SMOTE to three two-dimensional datasets, namely the toy dataset, moon dataset and circle dataset. The distributions of the synthetic instances are shown in Fig. 7. It can be observed that SMOTE, without noise filtering and adaptive neighbor selection, generates the largest number of minority instances in the majority class region. By contrast, SMOTE/NF reduces the number of minority instances located in the majority class region, since the minority instances that are closer to the majority instances are excluded from generating instances with other minority instances. However, there are still some synthesized minority instances in the majority class region. The reasons are twofold. One is that in SMOTE/NF, there may be a majority class region between two minority instances used to synthesize an instance. The other is that noisy minority instances may still be selected as neighbors to generate minority instances. Both situations may cause the generated minority instance to be located in the majority class region. The adaptive neighbor selection mechanism leveraged in ASN-SMOTE can avoid such situations. As we can see from the figures, all instances generated by ASN-SMOTE are located in the minority class region, owing to the noise filtering and adaptive neighbor selection mechanisms.
Besides the performance comparison, Friedman's test [32] followed by Holm's test [33] was performed to verify the statistical significance of the proposed method compared with the other oversampling methods. Friedman's test is a non-parametric statistical test. Similar to the repeated-measures ANOVA (analysis of variance), it is utilized to detect differences in treatments across multiple test attempts. The null hypothesis in Friedman's test is that all methods perform similarly in mean rankings without significant difference. From the results of Friedman's test in Table 6, it can be found that there is enough evidence at \(\alpha \) = 0.05 to reject the null hypothesis for all classifiers and measures, which demonstrates significant differences between ASN-SMOTE and the other oversampling methods in a statistical sense.
Since the null hypothesis is rejected for all classifiers and performance measures, a post hoc test is applied. Holm's test is used in this study, where the proposed method is regarded as the control method. Holm's test is the non-parametric equivalent of multiple t-tests that adjusts \( \alpha \) to compensate for multiple comparisons in a step-down procedure. The largest \( \alpha \) is equal to 0.1 in our experiments. The null hypothesis is that ASN-SMOTE, as the control algorithm, does not perform better than the other methods. The adjusted \( \alpha \) and the corresponding p value for each method are shown in Table 7. It can be read from the results that ASN-SMOTE outperforms all other oversampling approaches except LoRAS in terms of all metrics in a statistical sense when using SVM and KNN as the classifier. When RF was used as the classifier, ASN-SMOTE performs statistically better than its counterparts in 26 of the 27 tests.
Next, we evaluate the runtimes of ASN-SMOTE and all comparative techniques over the 24 benchmark datasets. The average results on ten independent tests are shown in Table 8. To make the comparisons more intuitive, the ranking of average running time of these algorithms is provided in Fig. 8.
From the results, it can be found that the random method is the fastest of all oversampling techniques due to its simplicity, and SMOTE ranks second. Compared with the original SMOTE, its variants such as ADASYN, Borderline-SMOTE, K-means SMOTE and SVMSMOTE require more runtime. This makes sense, since they introduce additional mechanisms to better overcome the imbalance, which in turn makes them more complex. Similarly, ASN-SMOTE involves an adaptive sampling strategy that must calculate the k nearest neighbors of each qualified minority instance in the dataset, which increases the computational complexity of the proposed algorithm. Nevertheless, as can be read from Table 8, its runtime of much less than one second is perfectly acceptable.
As for LoRAS, it applies manifold-learning-based dimensionality reduction to the minority instances before selecting their k nearest neighbors, which costs additional time and places it second to last. Not surprisingly, SWIM-RBF ranks last and spends much more time, since it uses a radial basis function (RBF) with a Gaussian kernel to estimate the local density of the minority samples relative to the majority class. The computation of the matrix square root and the whitening operation have \(O(d^{3})\) and \(O(bd^{2})\) time complexity, respectively, where d is the dimension of the data and b is the number of instances in the minority class.
Parameter analysis
Since ASN-SMOTE uses the minority instances among the k nearest qualified neighbors to synthesize new minority instances, the essential parameter k significantly affects its performance. To investigate the effect of the value of k on the performance of ASN-SMOTE, comparative evaluations were conducted on six benchmark datasets randomly selected from the above 24 (see Table 2). Considering the small number of parameters required, we chose SVM as the classifier. The three performance measures were evaluated by fivefold cross-validation. Each experiment was repeated three times and the average metric values were reported to avoid contingency.
Figure 9 shows the results, where k is selected from \((1, 2, \ldots, 10)\). As can be seen, the performance of ASN-SMOTE varies with the value of k. Taking the Haberman dataset as an example, the performance measures first degrade as k increases, reaching their worst at \(k=3\); as k continues to increase, they improve again until k reaches 7, where the best performance is achieved. This demonstrates that k should be carefully tuned when ASN-SMOTE is used to balance a dataset.
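The G-mean used throughout these sweeps is the standard geometric mean of sensitivity and specificity. A minimal sketch of the metric (our own helper for illustration, assuming binary labels with 1 as the minority/positive class):

```python
import numpy as np

def gmean(y_true, y_pred):
    """Geometric mean of sensitivity (minority recall) and specificity
    (majority recall) for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    sens = tp / max(np.sum(y_true == 1), 1)   # recall on the minority class
    spec = tn / max(np.sum(y_true == 0), 1)   # recall on the majority class
    return np.sqrt(sens * spec)
```

Because it multiplies the two per-class recalls, the G-mean collapses to zero whenever a classifier ignores the minority class entirely, which is why it is preferred over plain accuracy on imbalanced data.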
Conclusion
In this paper, we propose a SMOTE-based oversampling method, ASN-SMOTE. Compared to existing oversampling methods, the key advantage of ASN-SMOTE is that it performs novel noise filtering and adaptive neighbor selection mechanisms before synthesizing minority instances. The former filters out any minority instance whose nearest neighbor is a majority instance, and the latter selects only the minority neighbors that are closer than the nearest majority instance for synthetic oversampling. By employing these two mechanisms, ASN-SMOTE effectively avoids generating synthetic minority instances in the majority-class region, which results in a more general decision boundary.
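The two mechanisms summarized above can be sketched compactly. The following is a minimal NumPy illustration of our reading of the method, not the authors' reference implementation (available at the repository linked above); the function name, parameters, and toy geometry are ours.

```python
import numpy as np

def asn_smote(X_min, X_maj, k=5, n_new=10, seed=0):
    """Sketch: noise filtering + adaptive neighbor selection + SMOTE-style
    interpolation, following the two mechanisms described in the paper."""
    rng = np.random.default_rng(seed)
    dist = lambda A, B: np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    d_mm = dist(X_min, X_min)
    np.fill_diagonal(d_mm, np.inf)            # a point is not its own neighbor
    d_mj = dist(X_min, X_maj)
    # Step 1: noise filtering -- keep a minority instance only if its nearest
    # neighbor over the whole dataset is another minority instance.
    keep = np.where(d_mm.min(axis=1) < d_mj.min(axis=1))[0]
    synth = []
    for _ in range(n_new):
        i = rng.choice(keep)
        # Step 2: adaptive neighbor selection -- only minority neighbors closer
        # than the nearest majority instance qualify; use at most k of them.
        q = np.where(d_mm[i] < d_mj[i].min())[0]
        q = q[np.argsort(d_mm[i][q])][:k]
        j = rng.choice(q)                     # non-empty for every kept instance
        u = rng.random()                      # interpolate toward the neighbor
        synth.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Note that every instance surviving step 1 has at least one qualified neighbor by construction, so the interpolation in step 2 never degenerates.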
ASN-SMOTE is evaluated against nine state-of-the-art oversampling methods on 24 public datasets whose imbalance ratios range from 1.82 to 41.4, with three well-known classifiers, namely KNN, SVM and RF. The results, based on the G-mean, F-measure and AUC metrics, show that ASN-SMOTE ranks first and is more effective than all the other comparative oversampling techniques. Although the runtime of ASN-SMOTE is not competitive, at much less than one second it is perfectly acceptable.
In future work, we intend to extend ASN-SMOTE to multi-class imbalanced datasets, since the problem of imbalanced data appears not only in binary classification but also in multi-class classification in real-world applications.
Availability of data and material
The datasets extracted and analyzed during the current study are available in the KEEL repository and the UCI Machine Learning Repository.
References
Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Logic Soft Comput 17:255–287
Bach M, Werner A, żywiec J, Pluskiewicz W (2017) The study of under- and over-sampling methods’ utility in analysis of highly imbalanced data on osteoporosis. Inf Sci 384:174–190. https://doi.org/10.1016/j.ins.2016.09.038
Barua S, Islam MM, Yao X, Murase K (2014) MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425. https://doi.org/10.1109/TKDE.2012.232
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29. https://doi.org/10.1145/1007730.1007735
Bedi P, Gupta N, Jindal V (2020) I-SiamIDS: an improved Siam-IDS for handling class imbalance in network-based intrusion detection systems. Appl Intell 51:1133–1151. https://doi.org/10.1007/s10489-020-01886-y
Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O (2021) Loras: an oversampling approach for imbalanced datasets. Mach Learn 110(2):279–301
Bellinger C, Sharma S, Japkowicz N, Zaïane OR (2020) Framework for extreme imbalance classification: swim-sampling with the majority class. Knowl Inf Syst 62(3):841–866
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: PAKDD
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2011) DBSMOTE: density-based synthetic minority over-sampling technique. Appl Intell 36:664–684
Castro CL, Braga AP (2013) Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE Trans Neural Netw Learn Syst 24(6):888–899. https://doi.org/10.1109/TNNLS.2013.2246188
Chambolle A, De Vore R, Lee NY, Lucier B (1998) Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans Image Process 7(3):319–335. https://doi.org/10.1109/83.661182
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chawla NV, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor Newsl 6(1):1–6. https://doi.org/10.1145/1007730.1007733
Chen XS, Kang Q, Zhou MC, Wei Z (2016) A novel under-sampling algorithm based on iterative-partitioning filters for imbalanced classification. In: IEEE international conference on automation science and engineering
Cieslak D, Chawla N (2006) Combating imbalance in network intrusion datasets. In: 2006 IEEE international conference on granular computing, pp 732–737. https://doi.org/10.1109/GRC.2006.1635905
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27. https://doi.org/10.1109/TIT.1967.1053964
Devi D, Biswas SK, Purkayastha B (2019) Learning in presence of class imbalance and class overlapping by using one-class SVM and undersampling technique. Connect Sci 31:105–142
Douzas G, Bacao F (2019) Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Inf Sci 501:118–135. https://doi.org/10.1016/j.ins.2019.06.007
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 465:1–20. https://doi.org/10.1016/j.ins.2018.06.056
Dua D, Graff C (2019) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine. http://archive.ics.uci.edu/ml
Fernández A, López V, Galar M, del Jesus MJ, Herrera F (2013) Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches. Knowl Based Syst 42:97–110. https://doi.org/10.1016/j.knosys.2013.01.018
Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(4):463–484. https://doi.org/10.1109/TSMCC.2011.2161285
Gao M, Hong X, Chen S, Harris CJ (2011) A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing 74(17):3456–3466. https://doi.org/10.1016/j.neucom.2011.06.010
García S, Herrera F (2009) Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput 17(3):275–306. https://doi.org/10.1162/evco.2009.17.3.275
Ghazikhani A, Monsefi R, Yazdi H (2014) Online neural network model for non-stationary and imbalanced data stream classification. Int J Mach Learn Cybern 5:51–62
Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the 2005 international conference on advances in intelligent computing—volume part I, ICIC’05. Springer, Berlin, Heidelberg, pp 878–887. https://doi.org/10.1007/11538059_91
Hart P (1968) The condensed nearest neighbor rule (corresp.). IEEE Trans Inf Theory 14(3):515–516. https://doi.org/10.1109/TIT.1968.1054155
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World congress on computational intelligence), pp 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284. https://doi.org/10.1109/TKDE.2008.239
Hollander M, Wolfe DA, Chicken E (2013) Nonparametric statistical methods, vol 751. Wiley, New York
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70
Hou WH, Wang XK, Zhang HY, Wang JQ, Li L (2020) A novel dynamic ensemble selection classifier for an imbalanced data set: an application for credit risk assessment. Knowl Based Syst 208:106462. https://doi.org/10.1016/j.knosys.2020.106462
Hu S, Liang Y, Ma L, He Y (2009) MSMOTE: improving classification performance when training data is imbalanced. In: 2009 second international workshop on computer science and engineering, vol 2, pp 13–17. https://doi.org/10.1109/WCSE.2009.756
Jensen D (1997) Prospective assessment of AI technologies for fraud detection: a case study. In: AAAI workshop on AI approaches to fraud detection and risk management. Citeseer, pp 34–38
Kotsiantis S, Kanellopoulos D, Pintelas P et al (2006) Handling imbalanced datasets: a review. GESTS Int Trans Comput Sci Eng 30(1):25–36
Kubát M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML
Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Proceedings of the 8th conference on AI in medicine in Europe: artificial intelligence medicine, AIME ’01. Springer, Berlin, Heidelberg, pp 63–66
Lee H, Kim J, Kim S (2017) Gaussian-based SMOTE algorithm for solving skewed class distributions. Int J Fuzzy Log Intell Syst 17:229–234
Li Y, Guo H, Zhang Q, Gu M, Yang J (2018) Imbalanced text sentiment classification using universal and domain-specific knowledge. Knowl Based Syst 160:1–15. https://doi.org/10.1016/j.knosys.2018.06.019
Lin WC, Tsai CF, Hu YH, Jhang JS (2017) Clustering-based undersampling in class-imbalanced data. Inf Sci 409–410:17–26. https://doi.org/10.1016/j.ins.2017.05.008
Nekooeimehr I, Lai-Yuen SK (2016) Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Syst Appl 46:405–416. https://doi.org/10.1016/j.eswa.2015.10.031
Nguyen HM, Cooper E, Kamei K (2011) Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradigms 3:4–21
Orriols-Puig A, Bernado-Mansilla E, Goldberg DE, Sastry K, Lanzi PL (2009) Facetwise analysis of xcs for problems with class imbalances. IEEE Trans Evol Comput 13(5):1093–1119
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12(null):2825–2830
Popel MH, Hasib KM, Ahsan Habib S, Muhammad Shah F (2018) A hybrid under-sampling method (HUSBoost) to classify imbalanced data. In: 2018 21st international conference of computer and information technology (ICCIT), pp 1–7. https://doi.org/10.1109/ICCITECHN.2018.8631915
Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTe and rough sets theory. Knowl Inf Syst 33(2):245–265. https://doi.org/10.1007/s10115-011-0465-6
Smith MR, Martinez T, Giraud-Carrier C (2014) An instance level analysis of data complexity. Mach Learn 95(2):225–256. https://doi.org/10.1007/s10994-013-5422-z
Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203. https://doi.org/10.1016/j.ins.2014.08.051
Tahir MA, Kittler J, Yan F (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognit 45(10):3738–3750. https://doi.org/10.1016/j.patcog.2012.03.014
Tang B, He H (2015) KernelADASYN: kernel based adaptive synthetic data generation for imbalanced learning. In: 2015 IEEE congress on evolutionary computation (CEC), pp. 664–671. https://doi.org/10.1109/CEC.2015.7256954
Tao X, Li Q, Guo W, Ren C, He Q, Liu R, Zou J (2020) Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering. Inf Sci 519:43–73. https://doi.org/10.1016/j.ins.2020.01.032
Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern SMC–6(11):769–772. https://doi.org/10.1109/TSMC.1976.4309452
Vo MT, Nguyen T, Vo HA, Le T (2021) Noise-adaptive synthetic oversampling technique. Appl Intell 51:7827–7836. https://doi.org/10.1007/s10489-021-02341-2
Weiss GM (1995) Learning with rare cases and small disjuncts. In: Prieditis A, Russell S (eds) Machine learning Proceedings 1995. Morgan Kaufmann, San Francisco (CA), pp 558–565. https://doi.org/10.1016/B978-1-55860-377-6.50075-X
Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern SMC–2(3):408–421. https://doi.org/10.1109/TSMC.1972.4309137
Yu H, Ni J, Zhao J (2013) ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101:309–318. https://doi.org/10.1016/j.neucom.2012.08.018
Zhou M, Lin F, Hu Q, Tang Z, Jin C (2020) AI-enabled diagnosis of spontaneous rupture of ovarian endometriomas: a PSO enhanced random forest approach. IEEE Access 8:132253–132264. https://doi.org/10.1109/ACCESS.2020.3008473
Zhou ZH, Liu XY (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77. https://doi.org/10.1109/TKDE.2006.17
Acknowledgements
This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grants LZ20F010008 and in part by the National Undergraduate Innovation and Entrepreneurship Training Program under Grants 202110351060.
Author information
Contributions
ZT, XY and YX contributed to the conception of the study; XY and YX performed the experiment; XY, YX, and QH contributed significantly to data analysis and manuscript preparation; XY, SK, YX, QH and WL performed the data analyses and wrote the manuscript; ZT helped perform the analysis with constructive discussions.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Code availability
The ASN-SMOTE implementation is available at https://github.com/yixinkai123/ASN-SMOTE/.
Ethics approval
The authors declare that there is no ethics involved in this experiment.
Consent to participate
Written informed consent for publication was obtained from all participants.
Consent for publication
The study received approval from all participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Xinkai Yi and Yingying Xu contributed equally to this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yi, X., Xu, Y., Hu, Q. et al. ASN-SMOTE: a synthetic minority oversampling method with adaptive qualified synthesizer selection. Complex Intell. Syst. 8, 2247–2272 (2022). https://doi.org/10.1007/s40747-021-00638-w