1 Introduction

Class imbalance constitutes a challenging problem that has recently received much attention in a wide variety of research fields such as Data Mining, Machine Learning and Pattern Recognition [16, 31, 43, 47]. The class imbalance problem occurs when one or several classes (the majority classes) vastly outnumber the other classes (the minority classes), which are usually the most important ones and often those with the highest misclassification costs. It has been observed that class imbalance may affect the performance of most standard classification systems, which assume a relatively well-balanced class distribution and equal misclassification costs [28]. This issue is particularly relevant in a variety of real-life applications, such as the diagnosis of infrequent diseases [10, 35], credit risk assessment [4, 27], detection of software defects [30], fraud detection in telecommunications [15, 24], prediction of customer insolvency [11], text categorization [9, 44], and detection of oil spills in radar images [32]. In two-class imbalanced problems, the examples of the minority class are typically referred to as positive, whereas the instances of the majority class are referred to as negative.

The issue of class imbalance has been addressed by numerous approaches at both data and algorithmic levels [23]. The methods at the algorithmic level modify existing learning algorithms to bias the discrimination process towards the minority class [2, 38, 39], whereas the data-level solutions consist of artificially resampling the original data set, either by over-sampling the minority class [2, 33, 49], under-sampling the majority class [3, 7, 20, 21], or both, until the classes are approximately equally represented.

In general, the resampling strategies have been the most investigated because they are independent of the underlying classifier and can easily be implemented for any problem [18]. However, the methods at the data level also present some drawbacks because they artificially alter the original class distribution. For example, under-sampling may throw out potentially useful data, while over-sampling artificially increases the size of the data set and consequently increases the computational burden of the learning algorithm. Although conclusions about the most suitable resampling strategy for the class imbalance problem are divergent, several studies have reported that over-sampling usually performs better than under-sampling [3, 21, 46].

The present paper concentrates on the over-sampling strategy and, more specifically, extends the well-known Synthetic Minority Over-sampling Technique (SMOTE) algorithm [7] by exploiting an alternative neighborhood formulation, namely the surrounding neighborhood [40]. A key feature of this type of neighborhood is that the neighbors of a sample are considered in terms of both proximity and spatial distribution with respect to that sample, which offers some practical advantages over the conventional neighborhood based only on the minimum distance. Using the surrounding neighborhood to over-sample the minority class generates new synthetic examples that are homogeneously distributed around the original positive instances, thus helping to spread the region of influence of the minority class. A thorough experimental study demonstrates significant performance gains of our approach when compared to other state-of-the-art algorithms.

The rest of this paper is organized as follows. Section 2 reviews the SMOTE algorithm and some of its most relevant extensions. In Sect. 3, the general concept of surrounding neighborhood and two of its implementations are presented. The modification of SMOTE based on the different surrounding neighborhood formulations is introduced in Sect. 4. Section 5 describes the experimental framework, including the data sets, the over-sampling algorithms, the classifiers, the performance evaluation metrics and the statistical tests used in the present analysis. In Sect. 6, the experiments are carried out and their results are discussed. Finally, Sect. 7 summarizes the main conclusions and outlines possible directions for future research.

2 The SMOTE algorithm and some variants

The simplest strategy to expand the minority class is random over-sampling (ROS), which corresponds to a non-heuristic method that balances the class distribution through a random replication of positive examples [3, 36]. Although effective, this method may increase the likelihood of overfitting since it makes exact copies of the minority class instances [7].

In order to avoid overfitting, Chawla et al. [7] proposed the SMOTE algorithm to enlarge the minority class. Instead of merely replicating positive instances, this method generates artificial examples of the minority class by interpolating existing instances that lie close together. It first finds the \(k\) positive nearest neighbors for each minority class example and then generates the synthetic examples in the direction of some or all of those nearest neighbors.

SMOTE allows the classifier to build larger decision regions that contain nearby instances of the minority class. Depending upon the amount of over-sampling required, a certain number of instances from the \(k\) nearest neighbors are randomly chosen. In the experiments reported in the original paper, \(k\) is set to five. The generation procedure for each minority class example can be summarized as follows: (i) take the difference between the feature vector (instance) under consideration and one of its \(k\) minority class nearest neighbors; (ii) multiply this difference by a random number between \(0\) and \(1\); and (iii) add the result to the feature vector under consideration, thus obtaining the new synthetic example of the minority class.
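For illustration, a minimal Python sketch of this interpolation step is shown below (this is not the authors' implementation; the function name, the NumPy-based formulation and the toy data are our own choices):

```python
import numpy as np

def smote_for_instance(x, minority_neighbors, n_new, seed=None):
    """Create n_new synthetic examples for one minority instance x by
    interpolating towards randomly chosen minority-class neighbors."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    nbrs = np.asarray(minority_neighbors, dtype=float)
    synthetic = []
    for _ in range(n_new):
        nb = nbrs[rng.integers(len(nbrs))]    # one of the k nearest minority neighbors
        gap = rng.random()                    # random number in [0, 1)
        synthetic.append(x + gap * (nb - x))  # steps (i)-(iii)
    return np.array(synthetic)

# toy example: one positive instance and its 5 positive nearest neighbors
x = [1.0, 2.0]
neighbors = [[1.2, 2.1], [0.9, 1.8], [1.1, 2.3], [0.8, 2.0], [1.3, 1.9]]
print(smote_for_instance(x, neighbors, n_new=3, seed=0))
```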

Although SMOTE has proved to be an effective tool for handling the class imbalance problem, it may overgeneralize the minority class because it does not take into account the distribution of the majority class, especially when the minority class is very sparse with respect to the majority class. As a result, the generation of synthetic examples may increase the overlapping between classes [37]. Several modifications of the original SMOTE algorithm have been proposed in the literature, most of them aiming to determine the region in which the positive examples should be generated. Among them, one of the most widely known generalizations is the Borderline SMOTE (B-SMOTE) algorithm [20], which uses only positive examples close to the decision boundary, since these are more likely to be misclassified.

The Safe-Level SMOTE (SL-SMOTE) algorithm [5] calculates a “safe level” coefficient (\(sl\)) for each minority class example, defined as the number of other minority class instances among its \(k\) neighbors. If the coefficient \(sl\) is equal to or close to 0, the example is considered noise; if \(sl\) is close to \(k\), the example may be located in a safe region of the minority class. The idea is to direct the generation of new synthetic examples close to safe regions.

Other less known extensions of SMOTE are the FSMOTE algorithm proposed by Zhang et al. [50], which utilizes fractal interpolation theory to generate the synthetic positive examples, and the LLE-based SMOTE [48], which implements the locally linear embedding algorithm to map the high-dimensional data onto a low-dimensional space where the synthetic instances of the minority class are generated and then mapped back to the original input space. The MSMOTE algorithm [25] divides the instances of the minority class into three groups: safe, border and latent noise instances. When MSMOTE generates new examples, the strategy to select the nearest neighbors depends on the group to which the instance belongs: for safe instances, the algorithm randomly selects a data point from the \(k\) neighbors; for border instances, it only selects the nearest neighbor; for latent noise instances, it does nothing. Maciejewski and Stefanowski [37] introduced LN-SMOTE, which exploits information about the local neighborhood of the considered examples more precisely. Finally, SMOTEBoost [8] combines SMOTE with the standard boosting procedure: it uses SMOTE to improve the prediction of the minority class and boosting so as not to degrade accuracy over the entire data set.

3 Surrounding neighborhood

Intuitively, the concept of neighborhood should be such that the neighbors are as close to an instance as possible, but also lie as homogeneously around it as possible. The second condition is a consequence of the first in the asymptotic case, but in some practical situations the geometrical location may become much more important than the actual distances for appropriately characterizing an instance by means of its neighborhood [40]. As the traditional neighborhood takes only the first property into account, the nearest neighbors may not be placed symmetrically around the instance if the data are not spatially homogeneous in its vicinity. In fact, it has been shown that the use of local distance measures can significantly improve classifier behavior in the finite sample size case [41].

Alternative neighborhood definitions have been proposed as a way to overcome the problem just pointed out. These consider both proximity and symmetry, so as to define the general concept of surrounding neighborhood [40]: they search for neighbors of an instance that are close enough (in the basic distance sense), but also suitably distributed in space with respect to that instance. The nearest centroid neighborhood and the graph neighborhood are two representative examples of the surrounding neighborhood, which have been shown to behave better than the conventional nearest neighborhood in a number of pattern classification problems [40, 51].

3.1 Nearest centroid neighborhood

The first definition of surrounding neighborhood comes from the nearest centroid neighborhood (NCN) concept [6]. Let \(p\) be a sample whose \(k\) neighbors should be found from a set of \(n\) points \(X=\{x_{1},\ldots ,x_{n}\}\). These \(k\) neighbors are such that (a) they are as near to \(p\) as possible and (b) their centroid is also as close to \(p\) as possible. Both conditions can be satisfied through the iterative procedure outlined in Algorithm 1: the first neighbor of \(p\) is its nearest neighbor, and the \(i\)-th neighbor (\(i \ge 2\)) is chosen so that the centroid of the \(i\) neighbors selected so far is the closest to \(p\).

This definition leads to a type of neighborhood in which both closeness and the spatial distribution of the neighbors are taken into account because of the centroid criterion. Besides, the proximity of the nearest centroid neighbors to the sample is guaranteed by the incremental nature of the procedure, which starts from the first nearest neighbor. However, note that the iterative procedure outlined in Algorithm 1 clearly does not minimize the distance to the centroid because it gives precedence to the individual distances instead. On the other hand, the region of influence of the NCN is bigger than that of the traditional nearest neighborhood (NN); as can be seen in Fig. 1, the 4-NCN (\(a, b, c, d\)) of a given point \(p\) enclose a considerably bigger region than the one determined by the 4-NN (\(a, e, f, g\)).

Fig. 1 An example of NCN compared to the traditional NN

Algorithm 1 Iterative search of the nearest centroid neighbors
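As the pseudo-code of Algorithm 1 is not reproduced here, the following Python sketch illustrates the NCN search just described (Euclidean distance and NumPy arrays are assumed; the function name is ours):

```python
import numpy as np

def ncn_search(p, X, k):
    """Indices of the k nearest centroid neighbors of p in X: the first
    neighbor is the nearest neighbor of p; each subsequent neighbor is the
    point whose inclusion keeps the centroid of the selected set closest to p."""
    p, X = np.asarray(p, dtype=float), np.asarray(X, dtype=float)
    selected, remaining = [], list(range(len(X)))
    while len(selected) < k and remaining:
        best, best_dist = None, np.inf
        for i in remaining:
            centroid = X[selected + [i]].mean(axis=0)
            dist = np.linalg.norm(centroid - p)
            if dist < best_dist:
                best, best_dist = i, dist
        selected.append(best)
        remaining.remove(best)
    return selected
```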

3.2 Graph neighborhood

A proximity graph defined on a set \(X=\{x_{1},\ldots ,x_{n}\}\) is an undirected graph \(G=(V,E)\), comprising a set of nodes \(V=X\) and a set of edges \(E \subseteq V \times V\), such that \((x_{i},x_{j}) \in E\) if and only if the points \(x_{i}\) and \(x_{j}\) fulfill some mutual neighborhood criterion; \(x_{i}\) is then said to be a neighbor of \(x_{j}\) and vice versa. The set of graph neighbors of a given point constitutes its graph neighborhood [40]. The graph neighborhood of a subset \(S \subseteq V\) consists of the union of the graph neighbors of every node in \(S\).

Two well-known examples of proximity graphs are the Gabriel graph (GG) and the relative neighborhood graph (RNG) [29], which are subgraphs of the Delaunay triangulation (DT): RNG \(\subseteq \) GG \(\subseteq \) DT. As definitions for and properties of the GG and RNG are widely available in the literature, only essential concepts needed in this paper are reproduced here.

3.2.1 Gabriel graph

Let \(d(\cdot ,\cdot )\) be the Euclidean distance between two points in \(\mathbb{R}^{d}\). The set of edges in a GG consists of the pairs of points that satisfy the following relation:

$$\begin{aligned} (x_{i},x_{j}) \in E \Longleftrightarrow d^{2}(x_{i},x_{j})&\le d^{2}(x_{i},x_{k})+d^{2}(x_{j},x_{k}) \nonumber \\&\forall x_{k} \in X,~k \ne i,j \nonumber \end{aligned}$$

Geometrically, two points \(x_{i}\) and \(x_{j}\) are said to be Gabriel neighbors if and only if there is no other point of \(X\) lying in the hypersphere of influence \(\Gamma (x_{i},x_{j})\), which is centered at their midpoint and whose diameter is the distance between \(x_{i}\) and \(x_{j}\). In Fig. 2, for example, both \(p\) and \(q\) are Gabriel neighbors of the point \(a\), but \(r\) is not because \(q\) lies inside the sphere of influence determined by the points \(a\) and \(r\).

Fig. 2 An example of Gabriel neighborhood
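A brute-force check of this definition can be written directly from the relation above (a sketch for small point sets, not an efficient implementation):

```python
import numpy as np

def gabriel_graph_edges(X):
    """All pairs (i, j) that are Gabriel neighbors in the point set X."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # (i, j) is an edge iff no third point lies inside their sphere of influence
            if all(d2[i, j] <= d2[i, k] + d2[j, k]
                   for k in range(n) if k not in (i, j)):
                edges.append((i, j))
    return edges
```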

3.2.2 Relative neighborhood graph

In a similar fashion, the set of edges that belong to an RNG comprises the pairs of points that fulfill the following neighborhood property:

$$\begin{aligned} (x_{i},x_{j}) \in E \Longleftrightarrow d(x_{i},x_{j})&\le \max [d(x_{i},x_{k}),d(x_{j},x_{k})] \nonumber \\&\forall x_{k} \in X,~k \ne i,j \nonumber \end{aligned}$$

In this case, the corresponding geometric interpretation is based on the concept of lune \(\Lambda _{x_{i},x_{j}}\), defined as the intersection of the two hyperspheres centered at \(x_{i}\) and \(x_{j}\) whose radii are equal to the distance between them. Two points \(x_{i}\) and \(x_{j}\) are said to be relative neighbors if and only if their lune does not contain any other point of the set \(X\). In Fig. 3, the points \(q\) and \(a\) are not relative neighbors because \(p\) lies inside their lune; conversely, \(p\) and \(a\) are relative neighbors because their lune is empty.

Fig. 3 An example of relative neighborhood
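The relative neighborhood criterion can be tested in the same brute-force fashion, simply replacing the Gabriel condition with the lune condition (again only a sketch):

```python
import numpy as np

def rng_edges(X):
    """All pairs (i, j) that are relative neighbors in the point set X."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            # the lune of (i, j) must not contain any other point of X
            if all(d[i, j] <= max(d[i, k], d[j, k])
                   for k in range(n) if k not in (i, j))]
```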

4 Surrounding SMOTE

As already mentioned, the surrounding neighborhood methods have been successfully applied to a number of pattern classification and data mining problems. These approaches are especially helpful in the finite sample size case, in which the training instances do not fully represent the underlying statistics and/or the distance measure used (irrelevant in the asymptotic case) exhibits some undesirable properties. In fact, the ultimate goal of the surrounding neighborhood is to overcome some shortcomings of the conventional NN-based techniques.

Hence, based upon the analysis just stated, we here propose to employ the three surrounding neighborhood realizations (NCN, GG, and RNG) for over-sampling the minority class by means of a modification of the standard SMOTE algorithm.

SMOTE finds the \(k\) positive nearest neighbors of each minority class example in the training set and then generates artificial samples in the direction of some (or all) of those nearest neighbors. Instead of nearest neighbors, we propose to select surrounding positive neighbors for each instance of the minority class. The rationale behind this modification of the original SMOTE algorithm is that these surrounding neighbors extend the region in which new synthetic samples are generated, so the resulting over-sampled set is expected to describe the decision boundaries better.

The Surrounding SMOTE algorithm (using the NCN concept) can be written as follows:

Algorithm 2 Surrounding SMOTE based on the nearest centroid neighborhood

The size of the set of synthetic instances will be \(N \times \text{min}\). Note that in the case of GG and RNG, the number of neighbors (\(k\)) does not have to be provided, since each positive training instance may have a different number of graph neighbors. Apart from this difference, the rest of the procedure for Surrounding SMOTE using proximity graphs is exactly the same as that reported in Algorithm 2.
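Since Algorithm 2 is not reproduced above, the sketch below illustrates our reading of the overall procedure for the NCN variant; it reuses the ncn_search function sketched in Sect. 3.1, and the interpretation of \(N\) as the number of synthetic examples generated per minority instance (and \(\text{min}\) as the number of minority instances) is an assumption:

```python
import numpy as np
# assumes ncn_search (from the Sect. 3.1 sketch) is available in the same module

def surrounding_smote_ncn(X_min, N, k=5, seed=None):
    """Generate N synthetic examples per minority instance by interpolating
    towards its k nearest centroid neighbors within the minority class."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for idx, x in enumerate(X_min):
        others = np.delete(X_min, idx, axis=0)   # candidate neighbors: other positives
        ncn_idx = ncn_search(x, others, k)       # surrounding (NCN) neighbors
        for _ in range(N):
            nb = others[ncn_idx[rng.integers(len(ncn_idx))]]
            synthetic.append(x + rng.random() * (nb - x))
    return np.vstack(synthetic)                  # size = N x min
```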

From a practical point of view, one disadvantage of the graph neighborhood compared to the nearest centroid neighborhood is its higher computational cost. The graph neighbors of a set of points can be computed exhaustively by the brute-force method (i.e., by testing all pairs of samples in the set \(X\)), with a complexity of O\((n^3)\). Nevertheless, in the case of the GG and RNG, there exist heuristic methods that considerably reduce the number of pairs to be tested for graph neighborhood, with a computational cost close to O\((n^2)\) [29]. In addition, it is worth noting that Surrounding SMOTE only computes the graph neighbors that belong to the minority class (usually a very small number of instances), which results in a low-complexity algorithm.

5 Experimental set-up

An empirical comparison between the three Surrounding SMOTE methods proposed here and other over-sampling algorithms has been performed over a total of 39 data sets taken from the KEEL Data Set Repository (http://www.keel.es/dataset.php). Note that all the original multi-class databases have first been transformed into two-class problems. Table 1 summarizes the main characteristics of the data sets, including the imbalance ratio (IR), that is, the number of negative examples divided by the number of positive examples. The fifth and sixth columns in Table 1 indicate the original classes that have been used to shape the positive and negative classes, respectively. For example, in the Glass567 database, classes 5, 6 and 7 have been combined to form a single minority class, whereas the original classes 1, 2 and 3 have been joined to represent the majority class.

Table 1 Data sets used in the experimental analysis

All the experiments have been carried out using the Weka learning environment [19] with the 1-NN decision rule, the C4.5 decision tree and the multi-layer perceptron (MLP) neural network; the parameter values used in the experiments are given in Table 2. We have adopted a fivefold cross-validation method to estimate the AUC measure: each data set has been divided into five stratified blocks of size \(n/5\) (where \(n\) denotes the total number of examples in the data set), using four folds for training the classifiers and the remaining block as an independent test set. The results reported correspond to the average over the five runs.

Table 2 Parameters used in the classifiers

Each classifier has been applied to the original (imbalanced) training sets and also to sets preprocessed by the three implementations of Surrounding SMOTE (NCN-SMOTE, GG-SMOTE, RNG-SMOTE) and seven state-of-the-art over-sampling approaches taken from the KEEL data mining software tool [1]. Apart from the original SMOTE and two of its variants (B-SMOTE and SL-SMOTE), four other over-sampling algorithms have been included in this study: ROS, agglomerative hierarchical clustering (AHC), adjusting the direction of the synthetic minority class examples (ADOMS), and adaptive synthetic sampling (ADASYN). The Euclidean distance has been used with all the algorithms tested. The number of neighbors has been set to 5 for NCN-SMOTE, SMOTE, B-SMOTE, SL-SMOTE, ADOMS, and ADASYN, and these neighbors have been searched among the minority class instances. The data sets have been balanced to a 50 % class distribution.

The AHC over-sampling method [10] involves three major steps: (1) using single- and complete-linkage clustering to form a dendrogram, (2) gathering clusters from all levels of the dendrogram and computing the cluster centroids as synthetic examples, and (3) concatenating the centroids with the original minority class instances. The ADOMS algorithm [45] generates synthetic positive examples along the first principal component axis of the local data distribution (made up of the positive instance being analyzed and its \(k\) neighbors). Finally, the ADASYN approach [22] uses a density distribution as a criterion to adaptively determine the number of synthetic examples to be generated for each minority class instance according to its level of difficulty in learning; in practice, the algorithm yields a balanced data set more focused on those positive instances that are harder to learn.

5.1 Performance evaluation metrics

Many measures have been developed for performance evaluation on imbalanced classification problems. Most of them are based on the \(2 \times 2\) confusion matrix as illustrated in Table 3.

Table 3 Confusion matrix for a two-class problem

The most commonly used metric for measuring the performance of learning systems is the overall accuracy (and its counterpart, the error rate), which can be easily computed as follows:

$$\begin{aligned} \mathrm{Acc}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{TN}+\mathrm{FP}} \end{aligned}$$
(1)

Nevertheless, researchers have demonstrated that, when the prior class probabilities are very different, the overall accuracy is not appropriate because it does not consider misclassification costs, is strongly biased to favor the majority class, and is very sensitive to class skews [11, 14, 26]. Thus, in domains with imbalanced data, alternative metrics that measure the classification performance on positive and negative classes independently are required.

Two straightforward metrics that evaluate the classification performance on the minority and majority classes independently are the true positive rate (or sensitivity or recall) and the true negative rate (or specificity), that is, the percentage of positive and negative instances, respectively, that are correctly classified. In general, in imbalanced problems more attention should be given to sensitivity than to specificity [34]:

$$\begin{aligned} \mathrm{sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{aligned}$$
(2)
$$\begin{aligned} \mathrm{specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} \end{aligned}$$
(3)

Alternative performance evaluation criteria include the area under the ROC curve (AUC), the geometric mean of accuracies, the precision, the \(F\)-measure and the area under the precision-recall curve, among others. In general, these are good indicators of classification performance on imbalanced data because they are independent of the distribution of examples between classes.

The AUC, which constitutes one of the most commonly used metrics in the context of skewed class distributions, will be the method employed in the present paper to evaluate the performance of a variety of over-sampling techniques. For a binary problem, the AUC measure defined by a single point on the ROC curve is also referred to as balanced accuracy or macro-average, which can be computed as follows [42]:

$$\begin{aligned} \text{ AUC} = \frac{\mathrm{sensitivity}+\mathrm{specificity}}{2} \end{aligned}$$
(4)
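As a worked example of Eqs. 1–4, the following snippet computes these measures from the confusion-matrix counts of Table 3 (the counts themselves are made up):

```python
def imbalance_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and single-point AUC (Eqs. 1-4)."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auc = (sensitivity + specificity) / 2.0
    return acc, sensitivity, specificity, auc

# a classifier that misses most positives can still look accurate overall:
# acc = 0.95, but sensitivity = 0.20 and AUC is only about 0.59
print(imbalance_metrics(tp=10, fn=40, tn=940, fp=10))
```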

5.2 Statistical tests

The AUC results have further been tested for statistically significant differences by means of non-parametric tests, which are generally preferred over the parametric methods because the usual assumptions of independence, normality and homogeneity of variance are often violated due to the non-parametric nature of the problems [12, 17].

Both pairwise and multiple comparisons have been used in this paper. First, the Iman–Davenport’s statistic has been applied to determine whether there exist significant differences among the over-sampling strategies. The process starts by computing the Friedman’s ranking of the algorithms for each data set independently according to the AUC results: as there are eleven competing strategies, the ranks for each data set go from 1 (best) to 11 (worst); in case of ties, average ranks are assigned. Then the average rank of each algorithm across all data sets is computed. Under the null-hypothesis, which states that all the algorithms are equivalent, the Friedman’s statistic can be computed as follows:

$$\begin{aligned} \chi _{F}^2=\frac{12N}{K(K+1)}\left[\sum _j R_j^2 - \frac{K(K+1)^2}{4} \right] \end{aligned}$$
(5)

where \(N\) denotes the number of data sets, \(K\) is the total number of algorithms, and \(R_j\) is the average rank of the algorithm \(j\).

The statistic \(\chi _{F}^2\) is distributed according to the chi-square distribution with \(K-1\) degrees of freedom when \(N\) and \(K\) are large enough. However, as the Friedman statistic produces an undesirably conservative effect [13], the Iman–Davenport's statistic constitutes a better alternative. It is distributed according to the \(F\)-distribution with \(K-1\) and \((K-1)(N-1)\) degrees of freedom:

$$\begin{aligned} F_{F}=\frac{(N-1)\chi _{F}^2}{N(K-1)-\chi _{F}^2} \end{aligned}$$
(6)
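The ranking and the two statistics can be computed with a few lines of Python (a sketch using SciPy; the AUC matrix is assumed to have one row per data set and one column per algorithm):

```python
import numpy as np
from scipy.stats import rankdata, f

def iman_davenport_test(auc):
    """Friedman average ranks and the Iman-Davenport statistic (Eqs. 5-6).
    auc: array of shape (N data sets, K algorithms) with AUC values."""
    N, K = auc.shape
    # rank algorithms per data set: rank 1 = highest AUC, ties get average ranks
    ranks = np.vstack([rankdata(-row) for row in auc])
    R = ranks.mean(axis=0)
    chi2_f = 12 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4)  # Eq. 5
    f_f = (N - 1) * chi2_f / (N * (K - 1) - chi2_f)                            # Eq. 6
    p_value = f.sf(f_f, K - 1, (K - 1) * (N - 1))
    return R, f_f, p_value
```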

If the null-hypothesis of equivalence is rejected, we can then proceed with a post hoc test. In this work, the Holm’s post hoc test has been employed to ascertain whether the best (control) algorithm performs significantly better than the remaining techniques [17].

Afterwards, the Wilcoxon’s paired signed-rank test has been used to find out statistically significant differences between each pair of over-sampling algorithms. This statistic ranks the differences in performances of two algorithms for each data set, ignoring the signs, and compares the ranks for the positive and the negative differences. Let \(d_i\) be the difference between the performance scores of the two algorithms on \(i\)-th out of \(N\) data sets. The differences are ranked according to their absolute values. Let \(R^+\) be the sum of ranks for the data sets on which the first algorithm outperforms the second, and \(R^-\) the sum of ranks for the opposite. Ranks of \(d_i = 0\) are split evenly among the sums; if there is an odd number of them, one is ignored:

$$\begin{aligned} \begin{aligned}&R^+=\sum _{d_i>0}{\text{ rank}(d_i)}+\frac{1}{2}\sum _{d_i=0}{\text{ rank}(d_i)}\\&R^-=\sum _{d_i<0}{\text{ rank}(d_i)}+\frac{1}{2}\sum _{d_i=0}{\text{ rank}(d_i)} \end{aligned} \end{aligned}$$
(7)

Let \(Z\) be the smaller of the two sums, \(Z = \min (R^+, R^-)\). If \(Z\) is less than or equal to the critical value of the Wilcoxon distribution for \(N\) degrees of freedom, the null-hypothesis that both algorithms perform equally well can be rejected.
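In practice this test can be run with SciPy, whose zero_method='zsplit' option splits the ranks of zero differences between \(R^+\) and \(R^-\) as in Eq. 7 (the AUC vectors below are purely illustrative):

```python
import numpy as np
from scipy.stats import wilcoxon

# AUC of two over-sampling methods on the same ten data sets (made-up values)
auc_a = np.array([0.91, 0.88, 0.95, 0.70, 0.83, 0.77, 0.90, 0.86, 0.92, 0.81])
auc_b = np.array([0.89, 0.88, 0.93, 0.68, 0.84, 0.75, 0.87, 0.85, 0.90, 0.80])

stat, p_value = wilcoxon(auc_a, auc_b, zero_method="zsplit")
print(f"W = {stat:.1f}, p = {p_value:.3f}")  # reject equal performance if p < alpha
```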

6 Experimental results and discussion

The aim of the present study is threefold. First, we want to establish whether the surrounding neighborhood is able to properly handle the class imbalance problem and to what extent its application can be robust across different classifiers. Second, we are also interested in investigating whether or not the surrounding versions of SMOTE outperform other classical over-sampling algorithms. Finally, we try to find out which of the three surrounding implementations yields the best performance in terms of the AUC metric.

In order to make the experimental results more comprehensible, Table 4 shows the average AUC values across all databases obtained with the 1-NN, C4.5 and MLP classifiers using the different over-sampling methods. The detailed results for each problem are given in the Appendix. As expected, classification with the imbalanced data sets produces the poorest performance, irrespective of the classifier used. However, the most important observation is that the results of the Surrounding SMOTE algorithms are among those of the best performing methods, especially in the case of the NCN-SMOTE and GG-SMOTE realizations. Therefore, it appears that using the surrounding neighborhood to over-sample the minority class leads to balanced data sets with a better representation of the underlying class distribution, which contributes to better classification results according to the AUC metric.

Table 4 Average AUC values

The Friedman’s average ranks for the three classification models have been plotted in Fig. 4, which can be taken as a further confirmation of the findings with the AUC values. For the 1-NN classifier, GG-SMOTE and NCN-SMOTE clearly arise as the over-sampling algorithms with the lowest rankings, that is, the highest performance in average; when using the C4.5 decision tree, GG-SMOTE and ADASYN are the techniques with the best rankings, followed by NCN-SMOTE and RNG-SMOTE; for the MLP neural network, the NCN-SMOTE, ADOMS and GG-SMOTE methods yield the lowest rankings. Despite the use of imbalanced data sets produces the highest (worst) average ranks with all classifiers, it is worth noting that the rankings of SL-SMOTE and ROS are not too far in the case of the 1-NN rule.

Fig. 4 Friedman's average ranks

With the aim of checking whether our first conclusions can be supported by non-parametric statistical tests, the Iman–Davenport’s statistic has been computed using Eq. 6 to discover whether or not the AUC results are significantly different. This computation yielded \(F_F=17.41\) for 1-NN, \(F_F=6.15\) for C4.5, and \(F_F=5.54\) for MLP. As the critical values for the \(F\)-distribution with \(K-1= 11-1 =10\) and \((K-1)(N-1) = (11-1)(39-1) = 380\) degrees of freedom at confidence levels of 90 and 95 % are \(F(10,380)_{0.90}=1.62\) and \(F(10,380)_{0.95}=1.86\), the null-hypothesis that all strategies here explored perform equally well can be rejected. Consequently, we can now carry on with a Holm’s post hoc test, using the best over-sampling method for each classifier as the respective control algorithm.

Table 5 reports the \(z\) values, the \(p\)-values and the adjusted \(\alpha \)’s calculated using the Holm’s procedure, where the symbol “**” indicates that the null-hypothesis of equivalence with the control algorithm is rejected at a significance level of \(\alpha = 0.05\). For each classifier, the algorithms have been ordered from the smallest to largest \(p\)-values.

Table 5 Results obtained with the Holm’s test for \(\alpha = 0.05\)

The results of the Holm’s test given in Table 5 reveal the superiority of the surrounding strategies with the 1-NN classifier: the GG-SMOTE approach performs significantly better than all the other algorithms, except SMOTE, NCN-SMOTE and RNG-SMOTE. Focusing on the results of the C4.5 decision tree, one can observe that GG-SMOTE appears significantly better than ROS, SL-SMOTE and B-SMOTE, but it is statistically equivalent to AHC, ADOMS, SMOTE, RNG-SMOTE, NCN-SMOTE and ADASYN. With the MLP neural network, the control algorithm NCN-SMOTE significantly outperforms the non-preprocessed imbalanced data set, but behaves equally well as all the other over-sampling techniques.

As several algorithms exhibit similar behaviors, especially with the C4.5 and MLP classifiers, we have run a Wilcoxon’s test between each pair of techniques for each classification model. The upper diagonal half of Tables 6, 7, and 8 summarizes this statistic for a significance level of \(\alpha =0.10\) (10 % or less chance), whereas the lower diagonal half corresponds to a significance level of \(\alpha =0.05\). The symbol “\(\bullet \)” indicates that the method in the row significantly outperforms the method in the column, and the symbol “\(\circ \)” means that the method in the column performs significantly better than the method in the row.

Table 6 Summary of the Wilcoxon’s statistic for the over-sampling methods with the 1-NN classifier
Table 7 Summary of the Wilcoxon’s statistic for the over-sampling methods with the C4.5 classifier
Table 8 Summary of the Wilcoxon’s statistic for the over-sampling methods with the MLP classifier

With the 1-NN classifier, the original SMOTE algorithm performs significantly better than SL-SMOTE, ROS and AHC at both significance levels, whereas there are no significant differences between SMOTE and B-SMOTE at a significance level of \(\alpha =0.05\). The most remarkable observation from Table 6 is that NCN-SMOTE and GG-SMOTE are significantly better than the remaining methods at both significance levels, which demonstrates the suitability of these over-sampling algorithms to consistently produce well-balanced training sets for further classification with the 1-NN model.

When using the C4.5 decision tree, Table 7 shows fewer statistically significant differences than in the previous case of the 1-NN rule. Nonetheless, the GG-SMOTE algorithm performs significantly better than B-SMOTE, SL-SMOTE, ROS, AHC and ADOMS at both significance levels. NCN-SMOTE is also significantly better than those methods (except ADOMS) at both significance levels. The original SMOTE algorithm and the ADASYN technique perform as well as the methods based on the surrounding neighborhood introduced here.

In the case of the MLP neural network, ADOMS and NCN-SMOTE appear to be the algorithms with the most significant differences, being significantly better than SMOTE, B-SMOTE, ROS and AHC for \(\alpha =0.05\); besides, at a significance level of \(\alpha =0.10\), ADOMS also performs significantly better than SL-SMOTE and ADASYN, whereas NCN-SMOTE is also significantly better than the RNG-SMOTE approach. Finally, for \(\alpha =0.10\), the GG-SMOTE method is significantly superior to SMOTE, B-SMOTE, SL-SMOTE, RNG-SMOTE, ROS, AHC and ADASYN.

As a summary of the Wilcoxon’s tests for an easier analysis, the three values in the cells of Table 9 show how many times each method has been significantly-better/same/significantly-worse than the rest of over-sampling strategies at significance levels of \(\alpha =0.10\) and \(\alpha =0.05\) for each classifier. The results here reported corroborate the discussion of previous tables, proving the practical relevance of over-sampling the minority class irrespective of the classification model (using the imbalanced set is significantly worse than employing a training set that has been preprocessed by some over-sampling algorithm). This summary also allows to clearly state the superiority of the NCN-SMOTE and GG-SMOTE algorithms over the remaining methods, especially with the 1-NN classifier.

Table 9 Summary of how many times the over-sampling techniques have been significantly-better/same/significantly-worse

7 Final conclusions and future work

This paper has focused on the problem of expanding the minority class so as to balance the class distribution of the training set. Three modifications of the original SMOTE algorithm have been proposed, all of them based upon the concept of surrounding neighborhood. In particular, we have used the NCN, the GG and the RNG in the step of selecting neighbors for the subsequent generation of artificial positive examples. The aim of these alternatives is to take both the proximity and the spatial distribution of the neighbors into account in order to extend the regions of the minority class.

Experimental results over 39 databases using three different classifiers (1-NN, C4.5 and MLP) have demonstrated that the Surrounding SMOTE methods achieve significant improvements in terms of the AUC measure with respect to the original SMOTE algorithm and other existing over-sampling procedures. Of the three surrounding alternatives, NCN-SMOTE and GG-SMOTE appear to be the strategies with the highest performance according to the average AUC value and the average Friedman's rank. A further analysis with the Wilcoxon's statistic has shown that both NCN-SMOTE and GG-SMOTE perform significantly better than most of the remaining methods, especially when using the 1-NN classification rule. Although the differences are less significant in the case of the C4.5 decision tree and the MLP neural network, these two surrounding approaches to SMOTE are still the best over-sampling algorithms.

Finally, future research will mainly address the incorporation of a filtering phase into the general structure of the Surrounding SMOTE algorithms in order to remove any example (either positive or negative) that could be considered noisy or atypical. Another avenue for further investigation concerns the study of alternative methods to be exploited in the phase of generating synthetic positive examples. Also, we are interested in analyzing the behavior of both the surrounding-based approaches and other SMOTE-like algorithms as a function of the imbalance ratio of the data sets.