Introduction

Class-imbalanced data are common in machine learning: the class distribution in binary or multi-class classification problems is significantly skewed. For instance, in a binary classification problem, the majority class contains a large number of instances while the minority class contains only a few [42]. Such problems often appear in practical applications such as bank fraud detection [36], credit risk assessment [34], text classification [41], biomedical diagnosis [2, 59] and firewall intrusion detection [5].

Learning from class-imbalanced data poses a tough challenge to conventional classifiers in supervised learning and data mining. When dealing with imbalanced datasets, traditional classifiers usually favor the majority class, since they were originally designed for balanced datasets [53]. As a result, the predictive ability of the classifier is greatly affected by the imbalanced data distribution, and minority instances cannot be correctly classified. Especially in cases of extreme training data imbalance, the classifier is dominated by the majority class and can still achieve high classification accuracy, while the minority class may go entirely undetected [14]. For example, in firewall intrusion detection, if malicious attacks account for only 0.1% of the training data, a classifier that labels every sample as normal still achieves 99.9% accuracy. This means that even a seemingly well-trained classifier may be useless in practice.

Due to the prevalence of imbalanced datasets in practical applications and the difficulty traditional classifiers have in dealing with them, learning from class-imbalanced data has attracted the attention of many prominent researchers over the last 20 years [31]. Many preprocessing methods have been put forward to deal with the class imbalance of datasets. These methods can be broadly divided into three categories: algorithm-level methods, cost-sensitive methods and data-level methods [23]. Algorithm-level approaches modify or develop classification algorithms to enhance the learning of minority classes [11, 27, 60]. The essence of cost-sensitive methods is to minimize the overall cost of incorrect classifications [24, 37]. Data-level methods preprocess imbalanced datasets by resampling the data to balance the class distribution [13, 28, 30, 43, 52]. Compared with algorithm-level and cost-sensitive methods, data-level methods are more conventional ways to tackle imbalanced datasets because they can be applied to classifiers without modifying them [45].

Among the large number of oversampling methods, the synthetic minority oversampling technique (SMOTE) is one of the best known; it generates artificial minority instances by linear interpolation [13]. However, SMOTE and some other existing methods mainly focus on the imbalance between classes and fail to effectively address within-class imbalance and small disjuncts [3, 13, 30]. This may result in improperly synthesized minority instances, such as noise and overlapping instances, which degrade performance instead of improving it. Some methods use minority instances around the decision boundary to synthesize instances and can largely address these issues [16, 21, 28, 43, 44, 52]. However, this makes the decision boundary even more indistinct and blurred, and may cause newly synthesized samples to fall in the majority class region, which amplifies the noise and degrades the performance of classifiers [56].

In view of this, this paper proposes a simple yet effective variant of the synthetic minority oversampling technique, called ASN-SMOTE, which adaptively nominates qualified minority neighbors for synthesizing minority instances. More specifically, for each minority instance, the Euclidean distances between it and all other instances are first calculated. A minority instance whose nearest neighbor is a majority instance is considered to be noise and is filtered out. If the minority instance’s nearest neighbor is also a minority instance, ASN-SMOTE adaptively determines the qualified synthesizers among its k-nearest neighbors (KNN) according to the Euclidean distances to them. Finally, these qualified neighbors are used to generate minority instances by the SMOTE algorithm [13]. The synthesized minority instances are combined with the imbalanced dataset and fed into a traditional supervised classifier for training.

ASN-SMOTE can be applied to real-world datasets and improve the accuracy of the supervised classifiers without modifications.

The main advantages of ASN-SMOTE are threefold: (1) It can filter noise and improve the effectiveness of oversampling. (2) It makes full use of the information in the imbalanced dataset to adaptively select qualified synthesizers, which makes the synthesized minority instances more consistent with the actual data distribution. (3) It can be combined with any classification algorithm without modifying the classification algorithm itself.

To evaluate the performance of ASN-SMOTE, extensive comparative experiments were conducted. The proposed ASN-SMOTE was applied on 24 public benchmark datasets with imbalance ratios ranging from 1.82 to 41.4. Three classifiers, namely the k-nearest neighbor (KNN) [18], SVM [17] and random forest (RF) [8], were used for classification. ASN-SMOTE was then compared with nine other notable oversampling techniques, namely (1) random oversampling, (2) SMOTE [13], (3) ADASYN [30], (4) Borderline1-SMOTE [28], (5) Borderline2-SMOTE [28], (6) k-means SMOTE [21], (7) SVM-SMOTE [44], (8) SWIM-RBF [7], and (9) LoRAS [6]. The results show that ASN-SMOTE achieves the best results on the majority of datasets in terms of three different evaluation indicators.

The remainder of the paper is organized as follows. In “Related work”, prior research alongside other existing sampling methods is reviewed, especially in regard to oversampling methods and the problems that exist in practical applications. In “Oversampling algorithm for adaptively selecting neighbor synthesis instances”, the working principle of the proposed oversampling method is introduced in detail. In “Experiments and results”, the experimental design, results, and discussion are described. Finally, the conclusion is presented in “Conclusion”.

Related work

In this section, data-level preprocessing methods are reviewed and summarized. Generally, these preprocessing methods can be categorized into three types, namely the oversampling, undersampling and hybrid sampling methods [20]. Oversampling refers to the use of sampling methods to generate new minority instances to balance the dataset [13, 28, 30, 43, 52]. In contrast, undersampling is employed to tackle the problem of imbalanced data sets by deleting some majority instances [29, 38, 39, 54, 55, 57]. Table 1 summarizes the existing oversampling, undersampling and hybrid sampling methods that have been successfully applied to deal with imbalanced data problems.

Table 1 Summary of existing oversampling (O), undersampling (U) and hybrid sampling (H) techniques

Undersampling

Random undersampling is a common method which balances the class distribution of imbalanced datasets by randomly removing some majority instances. However, this method may cause the loss of valuable classification information during the training of a classifier. To overcome this shortcoming, researchers have proposed distinctive undersampling approaches, such as the condensed nearest neighbor rule (CNN) [29] and Tomek link [54]. In the CNN method, some of the majority instances around the decision boundary are removed and the remaining instances are used for training. In the Tomek link method, Tomek link pairs, i.e., pairs of instances from different classes that are each other’s nearest neighbors, are identified, and the majority instance of each pair is deleted [4]. In general, when the dataset is small or extremely imbalanced, these undersampling methods cannot overcome the class-imbalance problem: removing a large number of majority instances prevents the classifier from achieving effective out-of-sample performance [21, 53].

Oversampling

Random oversampling is the simplest oversampling method. It randomly selects minority instances and copies them until the class distribution of the imbalanced dataset becomes balanced. Nevertheless, this method merely duplicates existing instances, which can lead to over-fitting of the classification algorithm [4, 14].

To avoid the over-fitting caused by random oversampling, Chawla et al. proposed the SMOTE algorithm [13] in 2002. Instead of merely copying existing instances, this algorithm generates artificial instances by linear interpolation. As shown in Fig. 1, the SMOTE algorithm generates minority instances by interpolating between a minority instance and its k nearest minority class neighbors. Specifically, the SMOTE algorithm generates a minority instance in three steps. First, it chooses a minority instance \(\overrightarrow{x} \), then randomly selects one instance \(\overrightarrow{y} \) among the k nearest minority neighbors of \(\overrightarrow{x} \). Finally, a new instance \(\overrightarrow{s} \) is synthesized as a random interpolation between these two instances, calculated by \( \overrightarrow{s}=\overrightarrow{x}+\lambda \times (\overrightarrow{y}-\overrightarrow{x}) \), where \(\lambda \) is a random weight in [0,1]. However, SMOTE treats all minority instances alike and ignores the locations of the minority instances selected for synthesizing new instances, which may result in synthesized minority instances being located in the majority class region, introducing noise and deteriorating the classification performance, as shown in Fig. 2.
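As a minimal illustration of this interpolation step (a sketch assuming NumPy arrays; the function name and arguments are ours, not the reference SMOTE implementation):

```python
import numpy as np

def smote_interpolate(x, minority_neighbors, rng=None):
    """Synthesize one instance between x and a randomly chosen minority neighbor.

    Implements s = x + lambda * (y - x), with lambda drawn uniformly from [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    y = minority_neighbors[rng.integers(len(minority_neighbors))]  # random neighbor
    lam = rng.uniform(0.0, 1.0)                                    # random weight
    return x + lam * (y - x)
```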

Fig. 1
figure 1

The selected minority instance and its \( k =5\) nearest neighbors are used to generate five new samples by SMOTE linear interpolation

Fig. 2
figure 2

Due to the noise in the imbalanced dataset, the new samples synthesized by SMOTE are likely to be located in the majority class region

Given the above problems, a variety of improved oversampling methods have been proposed. Adaptive synthetic sampling (ADASYN) [30] and its variant KernelADASYN [52] adaptively adjust the sampling weights according to the proportion of majority instances among the k-nearest neighbors. In this way, minority instances that are closer to the majority instances are more likely to be sampled. Similarly, Borderline-SMOTE [28] has two sampling strategies. Borderline1-SMOTE focuses on the decision boundary: it first identifies the minority instances around the decision boundary and then synthesizes instances from them, so as to balance the class distribution of the dataset. Borderline2-SMOTE extends this method by allowing minority instances to interpolate with a majority class neighbor, with the interpolation weight set to less than 0.5. The aim is to improve the similarity between the minority instance and the synthetic instance, so as to enhance the ability of the classifier to distinguish instances around the decision boundary.

Additionally, another kind of sampling method is clustering-based oversampling. Cluster-SMOTE first uses the k-means clustering algorithm to divide the minority instances into several clusters and then applies SMOTE within each cluster [16]. In Ref. [43], an adaptive semi-unsupervised weighted oversampling (A-SUWO) approach was presented. A-SUWO first utilizes a semi-unsupervised hierarchical clustering algorithm to cluster minority instances and then adaptively determines the oversampling size for each sub-cluster. Recently, a heuristic oversampling method based on k-means and SMOTE was proposed by Douzas et al. [21]. This method clusters the entire sample set, calculates the imbalance ratio (IR) of each cluster, and applies SMOTE within each cluster according to its IR. The main idea of SVM-SMOTE (support vector machine SMOTE) [44] is to use an SVM classifier to find the support vectors and then synthesize new instances based on them. Similar to Borderline-SMOTE, SVM-SMOTE also determines the type of each instance (safe, danger, noise) according to the attributes of its k-nearest neighbors, and then trains the SVM according to the danger type [44].

The above methods suggest that the minority instances around the decision boundary can be used to synthesize instances and can largely address the issues of within-class imbalance and small disjuncts. However, this blurs the decision boundary and may generate minority instances in the majority instance region (see Fig. 3), which amplifies the noise and degrades the performance of classifiers [56].

Unlike SMOTE and its variants that rely on the positions and distances between minority instances, SWIM-RBF [7] uses the density of minority instances relative to the distribution of majority class to determine where to generate synthetic instances. LoRAS [6] provides a method to choose neighborhoods of a minority class instance point by performing prior manifold learning over the minority class using t-Stochastic Neighborhood Embedding.

Fig. 3
figure 3

The new instances generated by clustering are located in the majority class region

To more effectively determine the proper minority instances involved in synthesis and avoid generated minority instances falling into the majority class region, this paper presents ASN-SMOTE, which is also an improved variant of SMOTE. Similar to Borderline-SMOTE [28] and some other variants of SMOTE [21, 44], ASN-SMOTE performs noise filtering before synthesizing minority instances. However, in ASN-SMOTE, a minority instance whose nearest neighbor is a majority instance is considered to be noise and is filtered out, which differs from other SMOTE-based techniques that determine whether an instance is noise according to the proportion of majority instances among its k-nearest neighbors. Another difference between ASN-SMOTE and other oversampling techniques is that ASN-SMOTE only selects minority neighbors that are closer than the nearest majority instance for synthetic interpolation. These two mechanisms help ASN-SMOTE select proper minority instances so that the synthesized minority instances do not appear in majority class regions.

Oversampling algorithm for adaptively selecting neighbor synthesis instances

In the previous section, we reviewed related work and presented the existing issues. Motivated by this, we present a novel synthetic minority oversampling method with adaptive qualified synthesizer selection, known as ASN-SMOTE, in this section. The fundamental idea behind ASN-SMOTE is to make full use of the information of the k nearest neighbor (KNN) instances to synthesize appropriate minority instances. ASN-SMOTE involves three steps: (1) noise filtering, (2) adaptive selection of neighbor instances, and (3) instance synthesis.

Noise filtering

Filtering noise is an essential process in the training stage of machine learning because noise interferes with both sampling algorithms and classifiers [12]. It not only invalidates the sampling algorithm, but also reduces the overall performance of the classifier. In this work, a KNN-based noise filtering algorithm is proposed to reduce the generation of synthetic minority instances in the majority class regions. The proposed algorithm identifies noise and the minority instances around the decision boundary by examining the nearest instance of each minority instance, and filters them out.

Specifically, denote the majority class set, the minority class set and the complete dataset as \({\mathbf {N}}\), \({\mathbf {P}}\) and \({\mathbf {T}}\), respectively, and let \(N_p\) and \(N_n\) be the numbers of minority and majority instances, respectively. For each instance \( p_{i}\in {\mathbf {P}}, (i=1,2, \ldots , N_p) \), calculate the Euclidean distance to each instance in \( {\mathbf {T}} \) and find the nearest instance \( D(p_i) \):

$$\begin{aligned} D\left( p_{i}\right) =\arg \min _{t\in {\mathbf {T}}, t\ne p_i}{\parallel p_i-t\parallel _2}, \end{aligned}$$
(1)

where \(\Vert x-y\Vert _2\) returns the Euclidean distance between x and y.

If \(D\left( p_{i}\right) \in {\mathbf {N}}\), \( p_i \) is regarded as noise or a minority instance around the decision boundary and is recorded as an unqualified instance, \({\mathbf {M}}_u = {\mathbf {M}}_u \cup p_i\), where \({\mathbf {M}}_u\) is the set of unqualified minority instances. Otherwise, \({\mathbf {M}}_q = {\mathbf {M}}_q \cup p_i\), where \({\mathbf {M}}_q\) is the set of qualified minority instances (see Fig. 4).

The pseudo-code of the noise filtering method is presented in Algorithm 1.

Fig. 4
figure 4

Identify the noise and the minority instances around decision boundary by judging the nearest instance of each minority instance in the whole dataset

figure a
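A minimal Python sketch of this filtering rule is given below (assuming a NumPy feature matrix X, a label vector y and a designated minority label; it illustrates Eq. (1) and is not the authors' implementation):

```python
import numpy as np

def filter_noise(X, y, minority_label):
    """Split minority instance indices into qualified (M_q) and unqualified (M_u) sets.

    A minority instance is qualified only if its nearest neighbor in the whole
    dataset, excluding itself, is also a minority instance (Eq. 1).
    """
    qualified, unqualified = [], []
    for i in np.where(y == minority_label)[0]:
        dists = np.linalg.norm(X - X[i], axis=1)   # Euclidean distances to all instances
        dists[i] = np.inf                          # exclude the instance itself
        nearest = int(np.argmin(dists))            # D(p_i)
        (qualified if y[nearest] == minority_label else unqualified).append(i)
    return np.array(qualified), np.array(unqualified)
```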

Adaptive selection of neighbor instances

Neighbor instance selection greatly affects the quality of the generated instances; the challenge is to select the appropriate nearest neighbors that qualify for a given minority class instance. In the original SMOTE algorithm, minority class instances are randomly selected to synthesize new instances, which may result in synthesized instances being located in the majority class region [37]. Using these synthetic instances as training data reduces the performance of the classifier. Given this, an adaptive neighbor selection strategy is proposed in this paper.

Specifically,

1. For each instance in \({\mathbf {M}}_{q}\), denoted as \(q_{j}, j=1,2, \ldots , N_{q}\), find the k nearest neighbor instances of \(q_{j}\) in \({\mathbf {T}}\) by Euclidean distance calculation, denoted as \(q_{j}^1, q_{j}^2, \ldots , q_{j}^k\).

2. If all the k nearest neighbor instances, i.e., \(q_{j}^1, q_{j}^2, \ldots , q_{j}^k\), are minority instances, all of them are added to the set \({\mathbf {Q}}_{j}\), where \({\mathbf {Q}}_{j}\) represents the set of qualified neighbors of the qualified minority instance \(q_{j}\). Otherwise, go to step 3.

3. If there is at least one majority instance in the k nearest neighbor instances, find the nearest majority instance, denoted as \(q_{j}^\mathrm {near}\), among the k nearest neighbor instances of \(q_{j}\).

4. Each neighbor in the minority class whose distance to \(q_{j}\) is smaller than \(\Vert q_{j} - q_{j}^\mathrm {near}\Vert _2\) is added to the set \({\mathbf {Q}}_{j}\).

The pseudo-code of the adaptive neighbor instances selection method is presented in Algorithm 2.

figure b
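The following is a minimal Python sketch of steps (1)–(4) for a single qualified minority instance, under the same assumptions as the previous sketch (the helper name and signature are ours):

```python
import numpy as np

def select_qualified_neighbors(X, y, q_idx, minority_label, k):
    """Return indices of the qualified neighbors Q_j of a qualified minority instance q_j."""
    dists = np.linalg.norm(X - X[q_idx], axis=1)
    dists[q_idx] = np.inf                           # exclude q_j itself
    knn = np.argsort(dists)[:k]                     # k nearest neighbors in T (step 1)
    majority_knn = knn[y[knn] != minority_label]
    if majority_knn.size == 0:                      # all k neighbors are minority (step 2)
        return knn
    cutoff = dists[majority_knn].min()              # distance to q_j^near (step 3)
    return knn[(y[knn] == minority_label) & (dists[knn] < cutoff)]  # step 4
```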

Synthesizing instances

Synthesizing the minority class instances is the last step of the oversampling technique for imbalanced data. The instance synthesis in this work follows the original linear interpolation because it is easy to calculate and does not cause the synthesized instance to be located in the majority class regions through improper weight allocation.

Specifically, for the qualified minority instance \(q_i\), we first determine the number of synthesis instances (\(N_s\)) for each qualified minority instance as follows:

$$\begin{aligned} N_s=\text {round}\left[ \frac{N_n-N_p}{N_{q}}\right] . \end{aligned}$$
(2)

After that, randomly select a qualified neighbor from \({\mathbf {Q}}_{i}\), denoted as \(q_i'\), and perform linear interpolation to synthesize an instance (s) as follows:

$$\begin{aligned} s=q_{i}+w \times \left( q_{i}'-q_{i}\right) , \end{aligned}$$
(3)

where \( w\in (0,1) \) is a random weight. This random selection and linear interpolation are executed \(N_s\) times for each qualified minority instance.
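A hedged sketch of this synthesis step, reusing the notation above (the helper name and argument layout are assumptions; w is drawn uniformly from [0, 1)):

```python
import numpy as np

def synthesize(X, qualified_idx, neighbor_sets, n_majority, n_minority, rng=None):
    """Generate synthetic minority instances following Eqs. (2) and (3)."""
    rng = np.random.default_rng() if rng is None else rng
    n_per_instance = int(round((n_majority - n_minority) / len(qualified_idx)))  # N_s, Eq. (2)
    synthetic = []
    for qi, neighbors in zip(qualified_idx, neighbor_sets):
        if len(neighbors) == 0:                      # no qualified neighbor to interpolate with
            continue
        for _ in range(n_per_instance):
            nb = neighbors[rng.integers(len(neighbors))]   # random qualified neighbor q_i'
            w = rng.uniform(0.0, 1.0)                      # random interpolation weight
            synthetic.append(X[qi] + w * (X[nb] - X[qi]))  # Eq. (3)
    return np.array(synthetic)
```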

Fig. 5
figure 5

ASN-SMOTE adaptively selects the qualified neighbors through noise filtering to synthesize instances (\( k = 4\))

As shown in Fig. 5, the noise-filtered minority class instances adaptively select qualified minority class neighbor instances to synthesize new instances. The synthesized minority instances are generated from the original minority instances and do not reside in the region of the majority class.

The procedure of the proposed ASN-SMOTE is summarized as follows.

figure c
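Combining the three sketches given earlier, an illustrative end-to-end pipeline could look like the following; the helper names filter_noise, select_qualified_neighbors and synthesize refer to the assumed sketches above, not to the authors' code, and the oversampled data can then be fed to any unmodified classifier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def asn_smote(X, y, minority_label=1, k=5):
    """Sketch of the full ASN-SMOTE pipeline built from the helpers above."""
    qualified, _ = filter_noise(X, y, minority_label)                  # step 1: noise filtering
    neighbor_sets = [select_qualified_neighbors(X, y, q, minority_label, k)
                     for q in qualified]                               # step 2: neighbor selection
    n_min = int(np.sum(y == minority_label))
    n_maj = len(y) - n_min
    synthetic = synthesize(X, qualified, neighbor_sets, n_maj, n_min)  # step 3: synthesis
    if len(synthetic):
        X = np.vstack([X, synthetic])
        y = np.concatenate([y, np.full(len(synthetic), minority_label)])
    return X, y

# X_bal, y_bal = asn_smote(X_train, y_train, k=5)
# clf = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
```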

Experiments and results

This section focuses on the validation and evaluation of the proposed oversampling method. ASN-SMOTE is evaluated with different classifiers and datasets in terms of various performance measures.

Experimental datasets

To evaluate the performance of ASN-SMOTE, experiments were conducted on 24 public datasets selected from the KEEL repository [1] and the UCI Machine Learning Repository [22]. Considering that datasets from real-world applications are quite complex, in the case that a dataset contains more than two classes, one of the smaller classes is selected as the minority class and the rest of the data are used as the majority class. Additionally, to evaluate the adaptability of ASN-SMOTE under extreme conditions, two large datasets with high imbalance ratios were selected as well. The detailed information of the experimental datasets is shown in Table 2.

Table 2 Information description of the experimental data set

Evaluation measures

In this work, the geometric mean (G-mean), the F-measure and the area under the curve (AUC) were selected as the evaluation measures, which are defined as follows.

The geometric mean of sensitivity and specificity:

$$\begin{aligned} G\text{-mean }=\sqrt{ \text{ Sensitivity } \times \text{ Specificity } }. \end{aligned}$$

The harmonic mean of Sensitivity and Precision:

$$\begin{aligned} F\text{-measure }=\frac{2 \times \text{ Sensitivity } \times \text{ Precision } }{ \text{ Sensitivity } + \text{ Precision } }. \end{aligned}$$

AUC is the area under the receiver operating characteristic (ROC) curve. The ROC curve is a two-dimensional graph in which the X-axis represents the false positive rate (FPR) and the Y-axis represents the true positive rate (TPR). It is a commonly used performance measure for imbalance problems and can be calculated as:

$$\begin{aligned} \text {AUC}=\frac{1+\text {TPR}-\text {FPR}}{2}, \end{aligned}$$

where \(\text {TPR}=\frac{\text {TP}}{N_{p}}\) and \(\text {FPR}=\frac{\text {FP}}{N_{n}}\), TP denotes the number of positive (minority) class instances correctly classified, and FP denotes the number of negative (majority) class instances incorrectly classified as positive.
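As an illustration of how these three measures can be computed from a binary confusion matrix (a sketch with our own function name; degenerate cases such as division by zero are ignored):

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute G-mean, F-measure and AUC (the (1 + TPR - FPR) / 2 form used above)."""
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    sensitivity = tp / (tp + fn)                     # TPR = TP / N_p
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)                             # FPR = FP / N_n
    g_mean = np.sqrt(sensitivity * specificity)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)
    auc = (1 + sensitivity - fpr) / 2
    return g_mean, f_measure, auc
```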

Experimental configurations

To preserve the between-class ratios of the original datasets, fivefold stratified cross validation was used so that the original class distribution is maintained in each fold. Furthermore, each experiment was conducted three times and the average metric values were recorded to avoid the impact of randomness. To evaluate the performance of the proposed method, it was compared with nine other oversampling techniques: (1) random oversampling, (2) SMOTE, (3) ADASYN, (4) Borderline1-SMOTE, (5) Borderline2-SMOTE, (6) k-means SMOTE, (7) SVM-SMOTE, (8) SWIM-RBF, and (9) LoRAS. Three classifiers, namely KNN, SVM and random forest (RF), were used for classification. These algorithms were implemented in Python on a 64-bit operating system with 8 GB of RAM and a 2.30 GHz i5-8300 CPU.

For SMOTE and its variants, namely Borderline1-SMOTE, Borderline2-SMOTE, k-means SMOTE, and ASN-SMOTE, the number of nearest neighbors was selected among the values (3, 5, 7, 9), and the number of clusters in k-means SMOTE was selected from (3, 5, 7, 9, 10, 15, 20), depending on the dataset. For SWIM-RBF, epsilon was selected from (0.25, 0.5, 1, 2, 3, 5). For LoRAS, the number of shadows was selected from 2 to the number of features, and the number of nearest neighbors was selected among the values (3, 5, 7, 9). SVM-SMOTE and ADASYN were implemented with the Python library scikit-learn [46] using default parameters.

For the KNN classifier [18], the number of nearest neighbors was selected from the values (3, 5, 7, 9). For the SVM classifier [17], the cost parameter was selected from the values (0.001, 0.01, 0.1, 1, 10) and the kernel width \(\gamma \) was selected from the values (0.1, 0.3, 0.5, 0.7, 1, 2, 5). For the random forest (RF) classifier [8], the parameter n_estimators, i.e., the number of decision trees, was selected from the values (20, 30, 50, 80, 100); the parameter max_depth, i.e., the maximum depth of each decision tree, was selected from the values (5, 10, 13, 15, 20); the Gini index was used as the node partition criterion; and the parameter max_features, i.e., the number of features to consider when looking for the best split, was set to the square root of the number of features in the dataset.

The parameters of the three classifiers and eight over-sampling methods were optimized using fivefold stratified cross-validation on the 24 public datasets based on the G-mean, F-measure and AUC.
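For illustration only, the stratified search described above could be expressed with scikit-learn roughly as follows (shown for the KNN classifier with the neighbor grid listed above; this is a sketch, not the authors' exact tuning code):

```python
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # fivefold stratified CV
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},                    # values listed above
    scoring="roc_auc",                                           # AUC as one tuning criterion
    cv=cv,
)
# search.fit(X_oversampled, y_oversampled); search.best_params_
```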

Table 3 Results obtained by KNN on datasets oversampled by different techniques
Table 4 Results obtained by SVM on datasets oversampled by different techniques
Table 5 Results obtained by RF on datasets oversampled by different techniques

Results and discussion

The average results and the standard deviations of the G-mean, F-measure and AUC obtained by the three classifiers are shown in Tables 3, 4 and 5, respectively. The best experimental results on each dataset are highlighted in bold. In addition, Fig. 6 summarizes the scores and the comprehensive ranking of these oversampling techniques on the 24 datasets. The three performance measures obtained from fivefold cross validation are used as the ranking basis. For each dataset, the first rank is allocated to the best performing oversampling technique and the last rank to the worst performing one. The average ranking of the three performance measures of each oversampling technique over all 24 datasets is then calculated. The mean ranking of each method ranges between [1.0, 10.0].

The results show that ASN-SMOTE obtains the best performance in at least one performance measure on 18 out of the 24 public datasets when SVM, KNN, and RF were used. Figure 6 also indicates that ASN-SMOTE ranks first overall no matter which classifier and performance indicator is adopted. This is because practical datasets always contain noisy minority instances and suffer from within-class imbalance and small disjuncts, and the noise filtering and adaptive neighbor selection proposed in ASN-SMOTE are promising mechanisms to deal with them by avoiding synthesizing new minority instances in the majority class region. We can also observe that ASN-SMOTE performs well on some datasets that have high imbalance ratios and sparse minority instances. This is possibly because such datasets have more severe within-class issues, so noise instances account for a larger proportion of minority instances. As a result, a larger proportion of synthetic minority instances are likely to be located in the majority class region, which allows the advantages of ASN-SMOTE to be fully exploited.

On the other hand, we find that the random oversampling method performs the worst on all three measures when KNN is used as the classifier, while Borderline2-SMOTE performs the worst on all three measures when SVM or RF is used. Compared to ADASYN, k-means SMOTE, SVM-SMOTE and Borderline2-SMOTE, SMOTE performs better in terms of AUC, and Borderline1-SMOTE and SMOTE perform better in terms of G-mean and F-measure. ADASYN, k-means SMOTE and SVM-SMOTE perform moderately for all three classifiers and three measures. The two recently proposed algorithms, SWIM-RBF and LoRAS, perform well on all three measures when KNN and SVM were used as classifiers.

Fig. 6
figure 6

The comprehensive ranking of oversampling techniques in 24 data sets. a Achieved by KNN classifier, b achieved by SVM classifier, c achieved by RF classifier

Fig. 7
figure 7

Distributions of minority instances generated by SMOTE, SMOTE/NF and ASN-SMOTE on various two dimensional datasets

To gain deeper insight into the effects of the noise filtering and adaptive neighbor selection mechanisms, we applied SMOTE, SMOTE/NF (SMOTE with noise filtering) and ASN-SMOTE to three two-dimensional datasets, namely the toy, moon and circle datasets. The distributions of synthetic instances are shown in Fig. 7. It can be observed that SMOTE, without noise filtering or adaptive neighbor selection, generates the largest number of minority instances in the majority class region. By contrast, SMOTE/NF reduces the number of minority instances located in the majority class region, since minority instances that are too close to majority instances are prevented from being used to generate instances with other minority instances. However, some synthetic minority instances still fall in the majority class region, for two reasons. One is that in SMOTE/NF, there may be a majority class region between the two minority instances used to synthesize an instance. The other is that noisy minority instances may still be selected to generate minority instances. Both situations may cause the generated minority instance to be located in the majority class region. The adaptive neighbor selection mechanism leveraged in ASN-SMOTE avoids such situations. As can be seen from the figures, all instances generated by ASN-SMOTE are located in the minority class region, owing to the noise filtering and adaptive neighbor selection mechanisms.

Besides the performance comparison, Friedman’s test [32] followed by Holm’s test [33] was performed to verify the statistical significance of the proposed method compared to the other oversampling methods. Friedman’s test is a non-parametric statistical test; similar to repeated measures ANOVA (analysis of variance), it is used to detect differences in treatments across multiple test attempts. The null hypothesis of the Friedman test is that all methods perform similarly in mean rankings with no significant difference. From the results of Friedman’s test in Table 6, it can be found that there is enough evidence at \(\alpha = 0.05\) to reject the null hypothesis for all classifiers and measures, which demonstrates significant differences between ASN-SMOTE and the other oversampling methods in a statistical sense.

Table 6 Friedman’ test results
Table 7 Results of Holm’s test
Table 8 Runtimes (in seconds) of different over-sampling techniques over the 24 tested datasets

Since the null hypothesis is rejected for all classifiers and performance measures, a post hoc test is applied. Holm’s test is used in this study, with the proposed method regarded as the control method. Holm’s test is the non-parametric equivalent of multiple t tests that adjusts \( \alpha \) to compensate for multiple comparisons in a step-down procedure. The largest \( \alpha \) is equal to 0.1 in our experiments. The null hypothesis is that ASN-SMOTE, as the control algorithm, does not perform better than the other methods. The adjusted \( \alpha \) and the corresponding p value for each method are shown in Table 7. The results show that ASN-SMOTE statistically outperforms all other oversampling approaches except LoRAS in terms of all metrics when SVM or KNN is used as the classifier. When RF is used as the classifier, ASN-SMOTE performs statistically better than its counterparts in 26 of the 27 tests.
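As an illustration of this statistical protocol (using placeholder scores and p values, not the paper's results), the Friedman test and the Holm correction can be run with SciPy and statsmodels:

```python
import numpy as np
from scipy.stats import friedmanchisquare
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# one score per dataset for each oversampling method (placeholder values)
scores = {"ASN-SMOTE": rng.random(24), "SMOTE": rng.random(24), "ADASYN": rng.random(24)}

stat, p = friedmanchisquare(*scores.values())        # omnibus Friedman test
print(f"Friedman statistic = {stat:.3f}, p = {p:.4f}")

# Holm's step-down correction of pairwise p values against the control method
pairwise_p = [0.01, 0.04, 0.20]                      # placeholder pairwise p values
reject, p_adjusted, _, _ = multipletests(pairwise_p, alpha=0.1, method="holm")
```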

Next, we evaluate the runtimes of ASN-SMOTE and all comparative techniques over the 24 benchmark datasets. The average results on ten independent tests are shown in Table 8. To make the comparisons more intuitive, the ranking of average running time of these algorithms is provided in Fig. 8.

From the results, it can be found that the random method is the fastest of all oversampling techniques due to its simplicity, and SMOTE ranks second. Compared with the original SMOTE, the variants of SMOTE such as ADASYN, Borderline-SMOTE, k-means SMOTE and SVM-SMOTE require more runtime. This makes sense since they introduce additional mechanisms to better overcome the imbalance, which also makes them more complex. Similarly, ASN-SMOTE involves an adaptive sampling strategy, which requires calculating the k nearest neighbors of each qualified minority instance in the dataset and hence increases the computational complexity of the proposed algorithm. Nevertheless, as can be read from Table 8, its runtime of much less than one second is perfectly acceptable.

As for LoRAS, it uses manifold learning-based dimension reduction to reduce the dimension of minority instances before selecting the k-nearest neighbors of instances, which makes it more time-consuming and places it second to last. In addition, not surprisingly, SWIM-RBF ranks last and takes much more time since it utilizes a radial basis function (RBF) with a Gaussian kernel to estimate the local density of minority samples relative to the majority class. The computation of the square root and the whitening operation have \(O(d^{3})\) and \(O(bd^{2})\) time complexity, respectively, where d is the dimension of the data and b is the number of instances in the minority class.

Fig. 8
figure 8

Ranking of averaged runtimes of all oversampling techniques in 24 tested datasets

Parameter analysis

Since ASN-SMOTE uses the minority instances among the k nearest qualified neighbors to synthesize new minority instances, the parameter k significantly affects its performance. To better investigate the effect of the value of k upon the performance of ASN-SMOTE, comparative evaluations were conducted on six benchmark datasets randomly selected from the above 24 (see Table 2). Considering the least number of parameters required, we chose SVM as the classifier. The three performance measures were evaluated by fivefold cross validation. Each experiment was repeated three times and the average metric values were taken to avoid the impact of randomness.

Figure 9 shows the results, where k is selected among the values \((1, 2, \ldots, 10)\). As can be seen, the performance of ASN-SMOTE varies with the value of k. Taking the Haberman dataset as an example, the performance measures first degrade as k increases, reaching their worst at \(k=3\); after that, as k continues to increase, the performance measures begin to improve until k reaches 7, where the best performance is achieved. This demonstrates that k should be carefully tuned when ASN-SMOTE is used to balance a dataset.

Fig. 9
figure 9

Performances and rankings of ASN-SMOTE under various k values. a G-means, b F-measures, c AUC

Conclusion

In this paper, we propose a SMOTE-based oversampling method, ASN-SMOTE. Compared with existing oversampling methods, the key advantage of ASN-SMOTE is that it performs novel noise filtering and adaptive neighbor selection before synthesizing minority instances. The former filters out any minority instance whose nearest neighbor is a majority instance, and the latter only selects minority neighbors that are closer than the nearest majority instance for synthetic oversampling. By employing these two mechanisms, ASN-SMOTE effectively avoids generating synthetic minority instances in the majority class region, which results in a more general decision boundary.

ASN-SMOTE was evaluated against nine state-of-the-art oversampling methods on 24 public datasets whose imbalance ratios range from 1.82 to 41.4, with three well-known classifiers, namely KNN, SVM and RF. The results, based on the G-mean, F-measure and AUC metrics, show that ASN-SMOTE ranks first and is more effective than all the other comparative oversampling techniques. Although the runtime of ASN-SMOTE is not the most competitive, it remains well under one second and is perfectly acceptable.

In future work, we intend to extend ASN-SMOTE to deal with multi-class imbalanced datasets, since the problem of imbalanced data arises not only in binary classification but also in multi-class classification in real-world applications.