1 Introduction

Class imbalance problems are one of the areas of machine learning on which much effort has been focused in recent years, and they are considered one of the emerging challenges in the field [19, 22, 28, 42]. In class imbalance problems, the number of examples of one class (the minority class) is much smaller than the number of examples of the other classes, the minority class being the class of greatest interest and the one with the biggest error cost from the point of view of learning.

It is easy to see why this problem is a challenge for classifiers: if 99 % of the examples in a data set belong to the same class, a classifier that assigns every new case to the majority class will achieve 99 % accuracy. However, this classifier will have learned nothing about the problem we wanted to solve. Since most classification algorithms are designed to minimize the error rate, they tend to build very simple and useless classifiers from this kind of data set [9, 29].

As a consequence, many approaches have been proposed to deal with class imbalance problems, and they can be divided into two main groups [11, 19, 22]: algorithmic approaches and data approaches. The former propose modifications to the algorithms, such as the improved boosting proposed by Joshi et al. [31], the modification of SVM proposed by Wu et al. [40], an alternative based on one-class learning [33], and some other options that can be seen in [15, 16, 34, 38, 43]. The data approaches, on the other hand, usually consist of resampling (subsampling or oversampling) the data in order to balance the classes before building the classifier. The latter approaches are much more versatile since they are independent of the learning algorithm used, and most of the research has been done in this direction [4, 6, 8, 17, 30]. One of the most popular techniques is SMOTE [10], an intelligent oversampling technique that synthetically generates more minority class examples. A broad analysis and comparison of some variants can be found in [5, 22, 36].

As a third option, some authors have proposed solving this problem by combining the algorithmic and data approaches and have obtained good results [3, 37].

Although our work is focused on the data approaches, our aim is not to propose a new resampling method, but to describe a methodology for obtaining better results from these methods, irrespective of which resampling method is used.

However, Weiss and Provost [38] showed that there is usually a class distribution, different from the one appearing in the data set, with which better results are obtained. To draw these conclusions, Weiss and Provost performed an experiment to find the optimal class distribution for 26 real-world databases (20 from the UCI [20] and 6 from their own environment). They worked with the C4.5 algorithm and randomly undersampled the training data to perform the experiment with 13 different class distributions in each domain, and then analyzed the results using AUC and error estimates.

Based on Weiss and Provost's work, Albisua et al. [2] confirmed that changes in the class distribution of the training samples improve the performance of the classifiers. However, in contrast to what Weiss and Provost pointed out in their work, they found that the optimal class distribution depends on the learning algorithm used (even between decision tree learners that use the same split criterion, such as C4.5 and CTC) and also on whether or not the trees are pruned.

This made us suspect that the results could differ depending on the class distribution used, independently of the resampling technique applied before training the classifier. In most of the above-cited techniques for solving the class imbalance problem, the class distribution of the generated sample is usually 50 %; i.e. researchers tend to balance the classes. However, this need not be so, since most methods allow a class distribution other than 50 % to be used and, furthermore, some authors suggest that this option could be better [5, 12, 27]. Moreover, some methods do not guarantee that the final class distribution is the balanced one, for instance SMOTE-ENN [5], EUSTSS [22], SMOTE-RSB* [36], and others.

In this work, we propose an approach for enhancing the effectiveness of the learning process that combines the use of resampling methods with the optimal class distribution (instead of balancing the classes). It should be noted that the use of this approach is not restricted to imbalanced data sets but can be applied to any data set (imbalanced or not) to improve the results of the learning process. The aim of Chawla et al. [12] was similar, but they proposed a very different procedure that tackles only class imbalance problems: a heuristic procedure that first applies random subsampling to undersample the majority class as long as the results improve, and then applies SMOTE to oversample the minority class until no further improvement is found.

The proposed approach is the result of some research questions we tried to answer. As previously mentioned, resampling methods are used to balance the class distribution. However, is the 50 % class distribution always the best one? Weiss and Provost [38] have already answered this question for the case of random subsampling, but what about the resampling methods we mentioned above? Also, is the optimal class distribution independent of the resampling method and algorithm used? And, finally, is it worth resampling even for data sets with a balanced class distribution?

The experiments described in this work confirm that an optimal class distribution exists, but that it depends not only on the data’s characteristics but also on the algorithm and on the resampling method used.

Obtaining the truly optimal class distribution would require analyzing the results for an infinite (or at least sufficiently broad) range of class distributions. In practice, our aim is to define a methodology that helps us obtain a class distribution that achieves better results (with statistically significant differences) than the balanced distribution, whichever resampling method is used.

To corroborate the effectiveness of this approach, we performed experiments with 29 real problems (balanced and imbalanced ones) extracted from the UCI Repository benchmark [20] using eight different resampling methods, C4.5 and PART algorithms and the AUC performance measure. For estimating performance we used a 10-fold cross-validation methodology executed five times (\(5{\times }\) 10CV). Finally, we used the non-parametric statistical tests proposed by Demšar [14] and García et al. [23, 24] to evaluate the statistical significance of the results.

Section 2 provides a brief description of the resampling methods, algorithms and performance metric to be used. In Sect. 3, we describe the experimental methodology used to corroborate the previously mentioned hypothesis, and in Sect. 4 we present an analysis of the experimental results. Finally, in Sect. 5 we summarize the conclusions and suggest further work.

2 Background

In this section, we briefly describe some of the most popular and interesting resampling methods found in the literature for tackling the class imbalance problem. We also briefly describe the two algorithms used (C4.5 and PART) and the performance metric used to evaluate the classifiers obtained by applying these methods (AUC).

2.1 Resampling methods

As stated in the introduction, one of the approaches used to solve class imbalance problems is to use a resampling method to balance the class distribution in the training data. Random subsampling and random oversampling can be considered as baseline methods because they are the simplest methods and they do not use any kind of knowledge about the data set. In contrast, the rest of the methods can be considered as intelligent methods.

2.1.1 Random subsampling (RANSUB)

This consists of erasing randomly selected examples from the majority class in order to reduce it to the same number of examples the minority class has. Note that we could stop the process earlier to obtain a class distribution other than 50 %. We could even undersample the minority class to obtain a class distribution lower than the original one.
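
As an illustration, the following is a minimal sketch (in Python with NumPy) of random subsampling to an arbitrary target minority class distribution; the function name and signature are ours and do not correspond to any of the cited implementations.

```python
import numpy as np

def random_subsample(X, y, minority_label, target_dist, rng=None):
    """Randomly drop majority-class examples until the minority class
    makes up `target_dist` (a fraction in (0, 1)) of the sample."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # number of majority examples to keep so that
    # len(min_idx) / (len(min_idx) + n_maj) == target_dist
    n_maj = int(round(len(min_idx) * (1 - target_dist) / target_dist))
    n_maj = min(n_maj, len(maj_idx))          # cannot keep more than are available
    keep = np.concatenate([min_idx, rng.choice(maj_idx, size=n_maj, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```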

2.1.2 Random oversampling (RANOVER)

This is another non-heuristic resampling method, whose idea is to replicate randomly selected examples of the minority or the majority class, depending on whether the proportion of the minority class has to be increased or decreased. It is normally used to replicate cases of the minority class to balance the training data, and obtains competitive results [5].

2.1.3 SMOTE

SMOTE (synthetic minority oversampling technique) [10] is an oversampling algorithm that generates new synthetic minority class examples. The basic idea is to generate new examples located between each of the minority class examples and one of its k nearest neighbours. The synthetic examples are generated with the following procedure: calculate the difference between the feature vector of the current example (a minority class example) and the feature vector of an example selected randomly from among its nearest neighbours; multiply the difference vector by a random value between 0 and 1; and, finally, add this vector to the feature vector of the current example. The resulting vector is the synthetic example. The number of times a neighbour has to be selected to generate a new example depends on the number of new examples that must be generated; for example, if we need to duplicate the number of examples in the minority class, it is sufficient to use one neighbour for each of the minority class examples.
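
The generation procedure can be summarised with a short sketch. For brevity it uses plain Euclidean distance over numeric features, whereas the implementation used in this work relies on HVDM (see Sect. 3.1); the helper name and signature are ours.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate `n_new` synthetic examples from the minority-class matrix
    X_min (numeric features only): interpolate between each example and a
    randomly chosen one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # an example is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours of each example
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = i % len(X_min)                       # cycle over the minority examples
        neigh = X_min[rng.choice(nn[j])]         # one of the k neighbours, at random
        gap = rng.random()                       # random value in [0, 1)
        synthetic[i] = X_min[j] + gap * (neigh - X_min[j])
    return synthetic
```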

2.1.4 Borderline-SMOTE

The authors proposed two different approaches [25] and named them Borderline-SMOTE1 and Borderline-SMOTE2. The main difference between these methods and SMOTE is that only the borderline minority examples are oversampled. A minority class example is considered to be on the borderline if more than half of its m nearest neighbours belong to the majority class; i.e. these are the examples on the borderline between the majority and minority classes. The authors consider these examples to be in danger, i.e. they could be confused with majority class examples. The Borderline-SMOTE1 option (B_SMOTE1) uses just the minority class neighbours of the borderline examples to generate the synthetic examples, whereas the Borderline-SMOTE2 option (B_SMOTE2) uses all the neighbours (minority and majority class). If the selected neighbour belongs to the majority class, the random value generated to multiply the difference vector will be in the range between 0 and 0.5 (in order to create the new example closer to the minority class).

2.1.5 ENN

Wilson's Edited Nearest Neighbour Rule (ENN) [39] is a cleaning algorithm that removes an example if it would be misclassified by its three nearest neighbours. The method is applied to both the majority and minority classes in the training set; i.e. the ENN algorithm erases those examples having at least two of their three nearest neighbours belonging to the other class. It is worth noting that this is a deterministic process in which the class distribution of the final sample cannot be chosen.
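
A minimal sketch of this cleaning rule, assuming numeric features and Euclidean distance (the implementation used in this work relies on HVDM, Sect. 3.1), could be:

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Wilson's ENN: remove every example that would be misclassified by the
    majority vote of its k nearest neighbours (k = 3 as described above)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # an example is not its own neighbour
    keep = []
    for i in range(len(X)):
        nn = np.argsort(d[i])[:k]
        if np.sum(y[nn] == y[i]) > k / 2:        # most neighbours share its class: keep it
            keep.append(i)
    keep = np.asarray(keep, dtype=int)
    return X[keep], y[keep]
```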

2.1.6 SMOTE-ENN

This hybrid method is a combination of an oversampling method (SMOTE) and a cleaning algorithm (ENN) applied to the oversampled training set. The motivation behind this method is to reduce the risk of overfitting in the classifier caused by the introduction of artificial minority class examples too deep into the majority class space. It was one of the methods proposed in [5], where it achieved very good results, in particular for data sets with few minority examples, and it is considered a reference method by some authors [22, 32, 36].

2.1.7 ENN-SMOTE

This is a variant of SMOTE-ENN we propose in this work. In this case, the cleaning process is done before applying SMOTE. It seems to make sense to clean the data before oversampling with SMOTE. Moreover, this method has a lower computational cost because it reduces the size of the sample and there is no need to calculate the distance of the new synthetic cases from the rest.
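
The difference between the two hybrids is only the order in which the two steps are composed. A sketch using the hypothetical smote and enn_filter helpers from the previous subsections:

```python
import numpy as np

# SMOTE-ENN: oversample first, then clean the enlarged sample.
def smote_enn(X, y, minority_label, n_new, rng=None):
    X_syn = smote(X[y == minority_label], n_new, rng=rng)
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([y, np.full(len(X_syn), minority_label)])
    return enn_filter(X_all, y_all)          # distances over original + synthetic cases

# ENN-SMOTE (the variant proposed here): clean first, then oversample.
def enn_smote(X, y, minority_label, n_new, rng=None):
    X_clean, y_clean = enn_filter(X, y)      # smaller sample; synthetic cases are never cleaned
    X_syn = smote(X_clean[y_clean == minority_label], n_new, rng=rng)
    X_all = np.vstack([X_clean, X_syn])
    y_all = np.concatenate([y_clean, np.full(len(X_syn), minority_label)])
    return X_all, y_all
```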

2.2 Learning algorithms

As mentioned in the introduction, the proposed approach has been tested with two algorithms: C4.5 and PART.

2.2.1 C4.5

This is a supervised learning algorithm that builds decision trees using a divide-and-conquer strategy [35]. The algorithm uses the gain ratio as split criterion to divide the training sample based on the most discriminant variable with respect to the class. Once a decision tree has been built, a pruning procedure is carried out to avoid overfitting. C4.5 achieves high-quality results as a single classifier with an added value: it provides an explanation of the proposed classification. It was identified as one of the top 10 algorithms in data mining at the IEEE International Conference on Data Mining held in 2006 [41]. It is one of the most widely used learning algorithms, especially as a base classifier in multiple classifier systems (MCS) and as the classifier to be combined with intelligent resampling methods to solve class imbalance problems.

2.2.2 PART

PART [21] is a supervised learning algorithm that builds a rule set. It was designed by the authors of the WEKA platform [7] with the aim of combining the capacities of two algorithms: C4.5 [35] and RIPPER [13]. C4.5 is used to build a partial decision tree from which one rule is extracted (the branch with the greatest weight in the tree); the examples not covered by this rule are used to generate a new sample. This new sample is used to build a new partial C4.5 tree, and the process is repeated until every example in the remaining sample is assigned to the same leaf node of a C4.5 tree, i.e. they are all in the root node. This algorithm has been used together with C4.5 as a representative of learning algorithms with explanatory capabilities by other authors [22].

2.3 Performance metric

Accuracy (Acc), the ratio of correctly classified examples to the total number of examples to classify, is a traditional performance metric of a classifier in the machine learning context. Its complement, the error rate (\(\mathrm{Err} = 1-\mathrm{Acc}\)), is also widely used. However, these metrics are strongly biased in favour of the majority class when the prior class probabilities are very different, as in class imbalance problems. In these cases, another kind of metric is required, in which the performance of the classifier is measured based on the confusion matrix. Table 1 shows a confusion matrix for a two-class problem having positive (minority) and negative (majority) class values.

Table 1 Confusion matrix of a 2-class problem

From the confusion matrix we can derive performance metrics that directly measure the classification performance for positive and negative classes independently:

  • True positive rate:

    $$\text{TP rate} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
    (1)
  • True negative rate:

    $$\text{TN rate} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$
    (2)
  • False negative rate:

    $$\text{FN rate} = \frac{\text{FN}}{\text{TP} + \text{FN}}$$
    (3)
  • False positive rate:

    $$\text{FP rate} = \frac{\text{FP}}{\text{TN} + \text{FP}}$$
    (4)

For instance, the True Positive rate (also called Recall or Sensitivity) measures the percentage of correctly classified positive examples.

  • Precision: Another interesting metric is Precision which measures the percentage of correctly classified positive examples with respect to the examples predicted to be positive.

    $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
    (5)

    Unfortunately, for most real-world applications it is impossible to maximize all the metrics at once, and a trade-off must be found between pairs of them: for example, between TP rate and TN rate, between FN rate and FP rate, or between Precision and Recall. Hence, some metrics have been designed to analyze the relationship between these two kinds of components. AUC is one of these metrics and, as mentioned before, the metric we selected for this study.

  • AUC (area under the ROC curve): The ROC (receiver operating characteristic) curve is a graphical representation that compares the TP rate and FP rate as the decision threshold used to assign an example to a class is varied. The AUC metric measures the area under this curve; therefore, AUC evaluates the classifier in multiple contexts of the classification space [5]. This is one of the most widely used metrics in the literature for class imbalance problems, and a good description of it can be found in [18]. A short computational sketch of these metrics follows this list.
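
The following sketch computes the rates of Eqs. (1)-(5) from hard predictions; AUC itself needs class scores rather than hard predictions, and one common way to obtain it is scikit-learn's roc_auc_score (the experiments in this paper do not depend on this particular library).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rates(y_true, y_pred, positive=1):
    """Confusion-matrix rates of Eqs. (1)-(5) for a two-class problem."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    return {"TP rate": tp / (tp + fn), "TN rate": tn / (tn + fp),
            "FN rate": fn / (tp + fn), "FP rate": fp / (tn + fp),
            "Precision": tp / (tp + fp)}

def auc(y_true, p_positive):
    """AUC needs a score for the positive class (e.g. an estimated
    probability), not a hard class prediction."""
    return roc_auc_score(y_true, p_positive)
```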

 

3 Experimental setup

Before we use any resampling method, there is a question that needs to be answered, but which is often ignored: what is the final class distribution we want to obtain?

As we mentioned in Sect. 1, most of the published works use resampling methods to obtain balanced class distributions. Although this is in general a good value, it does not take into account the singularity of the specific database (problem) and it ignores the fact that a better value could exist.

In this work, we will divide the process of finding the optimal class distribution into two stages. First, we will use the simplest and computationally most efficient resampling method (random subsampling) to find the best class distribution. We will then use this class distribution to efficiently find a distribution that is better than the balanced distribution for the selected resampling method.

In order to obtain the optimal class distribution for each data set, we used the partial results for C4.5 published in the technical report [1] based on Weiss and Provost’s work [38]. Although 30 databases were used in the technical report, in this work we decided not to use the Fraud database (a fraud detection problem in car insurance companies) as it was too big (with oversampling techniques we would generate samples of 200,000 cases) and it would not give us any additional information.

Weiss and Provost stated that the best value for the class distribution is data dependent (context or problem dependent) and they used random subsampling to determine the best of 13 different class distributions: 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 95 %, and the original (Albisua et al. added the 98 % value in order to make the scanning symmetric and we did the same for this work). We carried out the same experiments with the PART algorithm.

We first performed the experiments with 29 two-class real problems, all belonging to the UCI Repository benchmark [20]. In Table 2, we present a summary of the characteristics of the databases used in the experiment ordered according to their original class distribution from more imbalanced to more balanced. Just as Weiss and Provost did in their work, we transformed all databases with more than two classes to two-class problems (in the table we show the number of classes before the transformation). We used a 10-fold cross-validation methodology five times (\(5{\times }\) 10CV) to estimate the generalization capacity of the classifiers based on the AUC performance metric. As a consequence, we obtained 50 pairs of training and test samples for each database. From each training sample, in order to reduce the effect of chance, we then generated 100 subsamples for each of 14 class distributions (from 2 to 98 %). Figure 1 represents this sample generation process for a database. Thus, in this experiment, we used \(5\times 10\times 14\times 100\) (70,000) samples per database. All the samples generated in one database were of identical size: the number of examples of the minority class. We repeated the same subsampling process as Weiss and Provost did in their work [38]. For the 29 databases, the average size of the generated samples is 27.64 % of the training samples (the same as the mean of the original class distributions).
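
The sample-generation scheme of Fig. 1 can be sketched as follows; the identifiers are ours and the sketch only illustrates how, within each training fold, 100 fixed-size subsamples (as many examples as the minority class has) are drawn for each of the 14 class distributions.

```python
import numpy as np

DISTRIBUTIONS = [2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98]  # minority percentages
N_SUBSAMPLES = 100          # subsamples per distribution, to reduce the effect of chance

def step1_samples(y_train, minority_label, original_dist, rng=None):
    """Yield (distribution, index array) pairs following the Fig. 1 scheme:
    every sample contains as many examples as the minority class has."""
    rng = np.random.default_rng(rng)
    min_idx = np.flatnonzero(y_train == minority_label)
    maj_idx = np.flatnonzero(y_train != minority_label)
    size = len(min_idx)                               # fixed sample size
    for dist in DISTRIBUTIONS + [original_dist]:      # 13 fixed values + the original one
        n_min = int(round(size * dist / 100.0))
        n_maj = size - n_min
        for _ in range(N_SUBSAMPLES):
            idx = np.concatenate([rng.choice(min_idx, n_min, replace=False),
                                  rng.choice(maj_idx, n_maj, replace=False)])
            yield dist, idx
```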

Fig. 1

Schema repeated in each database for sample generation in order to discover the optimal class distribution (Step 1)

Table 2 Domains used in the experiment and their characteristics

We used these samples to build C4.5 and PART classifiers. In each fold we built a classifier with each one of the 1,400 samples generated (14 class distributions \(\times \) 100 samples). Thus, for the 29 databases we built 2,030,000 C4.5 trees and the same number of PART rule sets.

As Weiss and Provost mentioned, when the class distribution is changed in the sample used to induce a classifier, a corrector (oversampling ratio) has to be applied in the test process so that the induced model is adapted to the distribution expected in reality. This ratio has been used in our experiment.
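
One standard form of this kind of correction re-weights the class counts used for the probability estimates (e.g. at the tree leaves) by the ratio between the original and the training class proportions. The exact corrector used by Weiss and Provost is described in [38]; the sketch below should therefore be read only as an illustration of the idea.

```python
def corrected_positive_probability(n_pos, n_neg, p_train, p_orig):
    """Re-weight leaf counts from a model trained with minority fraction
    `p_train` so that the probability estimate matches the original
    fraction `p_orig` (a generic class-prior correction, not necessarily
    the exact corrector of [38])."""
    w_pos = p_orig / p_train            # each training positive stands for w_pos real positives
    w_neg = (1 - p_orig) / (1 - p_train)
    return (n_pos * w_pos) / (n_pos * w_pos + n_neg * w_neg)
```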

Once we know the optimal class distribution for each database, we want to use this value with different resampling methods and to determine whether the results are better than those obtained with balanced samples. However, since we suspect that this optimal value may not be the optimal one for every different resampling technique, we will also use two more values that are close to the optimal one.

We repeated almost the same experiment (10-fold cross-validation five times), but using different resampling methods and limiting the scope of the scanned class distributions. We only used the optimal class distribution obtained in the first step (ocd), the next value (\(ocd+10\) %) and the previous value (\(ocd-10\) %) (see Fig. 2). Moreover, we wanted to compare the results obtained with these class distributions with those obtained with balanced samples. For this reason, if the 50 % value was not among those scanned for a data set, it was also added. For example, if the optimal class distribution for a data set is 30 % (\(ocd=30\) %), we would use the 20, 30, 40 and 50 % values.
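
In other words, the set of class distributions scanned in step 2 for a given data set could be generated as follows (a trivial sketch; the example reproduces the 30 % case mentioned above, and dropping values outside (0, 100) is our own assumption):

```python
def step2_candidates(ocd):
    """Class distributions scanned in step 2: the step-1 optimum and its two
    neighbours (ocd - 10 and ocd + 10), plus the balanced value when it is
    not already included."""
    candidates = {c for c in (ocd - 10, ocd, ocd + 10) if 0 < c < 100}
    candidates.add(50)                    # always compare against balanced samples
    return sorted(candidates)

# step2_candidates(30) -> [20, 30, 40, 50]
```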

Fig. 2

Schema repeated in each database and in each fold of the \(5{\times }\) 10CV for sample generation related to every resampling method we used (Step 2)

In addition, we evaluated the proposal with eight different resampling methods: the six well-known methods described in Sect. 2.1 (RANSUB, RANOVER, SMOTE, B_SMOTE1, B_SMOTE2 and SMOTE-ENN), a second version of RANSUB, and a variant of SMOTE-ENN (ENN-SMOTE).

One version of the RANSUB method used is the one explained previously, i.e. the one used to determine the optimal class distribution (ocd). We will refer to this version as RANSUB\(_{ocd}\). Since the size of these samples is very small (following Weiss and Provost's methodology), we also used a random subsampling method with no size limitations, i.e. randomly erasing selected examples from one of the classes only, until the desired class distribution is achieved. We will refer to this version as RANSUB.

Since we are conscious that randomness plays an important role in every resampling method, for each training sample in the \(5 {\times }\) 10CV we generated 50 samples for each method and class distribution, as shown in Fig. 2. The SMOTE-ENN method is an exception; in this case we generated a single sample instead of 50, because applying the method 50 times would have required recalculating a new distance matrix for each sample generated with SMOTE before applying ENN. This caused disk-space problems and would have extended the experimental time by months. Addressing this limitation is left as further work.

Figure 2 summarizes the resampling process carried out for each training sample of the \(5 {\times }\) 10CV. The methods used have been grouped according to the size of the samples to be generated. We try to reflect this size proportionately in the figure, as well as its difference depending on the distribution value applied for each resampling method. In fact, as stated above, using the mean of the original class distribution for the 29 data sets, the average size of the samples generated using RANSUB\(_{ocd}\) is 27.64 % of the training samples; the same for each sample in each data set regardless of the class distribution used, just as for Weiss and Provost.

If we suppose the optimal class distribution (ocd) is 50 %, for an original class distribution of 27.64 % RANSUB would generate double-sized subsamples (55.28 %). In addition, 69.10 and 46.07 % would be the sizes for \(ocd-10\) % and \(ocd+10\) %, respectively. However, for the other four oversampling methods (from RANOVER to B_SMOTE2), the sizes would be 120.60, 144.72 and 180.90 % (for \(ocd-10\) %, ocd and \(ocd+10\) %, respectively).
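
These percentages follow directly from keeping one of the classes intact. A small check of the figures above, assuming the mean original minority percentage of 27.64 %:

```python
orig = 27.64                                  # mean original minority percentage

def subsample_size(target):                   # RANSUB: all minority examples are kept
    return orig / (target / 100.0)            # resulting size, as % of the training sample

def oversample_size(target):                  # RANOVER/SMOTE family: all majority examples are kept
    return (100.0 - orig) / (1 - target / 100.0)

print([round(subsample_size(t), 2) for t in (40, 50, 60)])    # [69.1, 55.28, 46.07]
print([round(oversample_size(t), 2) for t in (40, 50, 60)])   # [120.6, 144.72, 180.9]
```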

Finally, we cannot know a priori the size of the samples generated by the combination of the SMOTE and ENN methods because we do not know how many examples will be erased with the ENN cleaning method before or after applying SMOTE. However, we can guarantee that their size will be smaller than the size of the samples of the previous group.

As we previously indicated, AUC was the performance metric selected for our experiment to determine the best value for the class distributions. Moreover, we used the non-parametric tests proposed by Demšar [14] and García et al. [23, 24] to evaluate the statistical significance of the results.

3.1 Implementation issues

In some of these methods, a distance metric is required to find the k nearest neighbours (k-NN algorithm) of a minority class example. The distance between examples is usually calculated with the Euclidean distance; however, the Euclidean distance is not adequate for qualitative (rather than quantitative) features. We have implemented SMOTE using the HVDM (Heterogeneous Value Difference Metric) distance [39], which uses the Euclidean distance for quantitative attributes and the VDM distance for the qualitative ones. The VDM metric takes into account the similarities between the possible values of each qualitative attribute.
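
A sketch of one common formulation of HVDM is given below. The structure of the per-attribute statistics (standard deviations and class-conditional value probabilities) is an assumption of ours, and details such as normalization constants may differ from the WEKA implementation we actually reused.

```python
import numpy as np

def hvdm(x, y, numeric_mask, stds, vdm_tables):
    """Per-attribute distances combined with a Euclidean norm: numeric
    attributes are normalized by 4 standard deviations, nominal attributes
    use the VDM difference of class-conditional value probabilities.
    `vdm_tables[a][v]` is assumed to hold the vector P(class | attribute a = v)."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if numeric_mask[a]:
            d = abs(xa - ya) / (4.0 * stds[a]) if stds[a] > 0 else 0.0
        else:
            p_x, p_y = vdm_tables[a][xa], vdm_tables[a][ya]
            d = np.sqrt(np.sum((p_x - p_y) ** 2))
        total += d * d
    return np.sqrt(total)
```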

A further problem is that the examples can have missing values. These are usually replaced by the average of that attribute over the rest of the examples of the same class in the case of quantitative attributes, and by the mode in the case of qualitative attributes.
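
A possible class-wise imputation sketch (using pandas for convenience; the implementation used in this work is not tied to this library, and the helper name is ours):

```python
import numpy as np
import pandas as pd

def impute_by_class(df, y, numeric_cols):
    """Replace missing values class-wise: the class mean for quantitative
    attributes and the class mode for qualitative ones."""
    out = df.copy()
    for cls in np.unique(y):
        rows = (y == cls)
        for col in out.columns:
            if col in numeric_cols:
                fill = out.loc[rows, col].mean()
            else:
                fill = out.loc[rows, col].mode().iloc[0]
            out.loc[rows, col] = out.loc[rows, col].fillna(fill)
    return out
```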

It should be noted that we used the same distance implementations used in the well-known WEKA machine learning workbench [7] for our implementation of the SMOTE method and its variants.

In most of the works using this kind of method, where a k-NN algorithm is involved, the value of k is not specified. Chawla et al. [10] concluded that 5 was a good value for k, and this is the value assumed in the reviewed works.

On the other hand, for the Borderline-SMOTE1 and Borderline-SMOTE2 methods, the authors explain that half of the minority class should be on the borderline before applying SMOTE, but the way the m value should be selected and the process to determine the cases in danger (those on the borderline) are left as further work in their paper [25]. Therefore, let us explain how we implemented the danger case detection.

We first instantiate m with a small value (5). Then we check which examples would be in danger when taking into account their m nearest neighbours. If at least half of the minority class examples are in danger, the search is finished; if not, we double the value of m and search again for cases in danger. The idea is to repeat the procedure until we have sufficient danger cases, but there is a small problem: if m becomes greater than double the minority class size, all the minority examples will be in danger and, therefore, all the cases will be selected; as a result, the Borderline-SMOTE methods would become standard SMOTE. In order to avoid this, our search finishes before m becomes too big (the algorithm is shown in Appendix 3).
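
A sketch of this search follows; the exact algorithm is the one given in Appendix 3, and the stopping test for m becoming "too big" below is our paraphrase of it.

```python
import numpy as np

def find_danger_cases(X, y, minority_label, m0=5):
    """Start with m = 5 neighbours and double m until at least half of the
    minority class is on the borderline, stopping before m grows so large
    that every minority example would qualify."""
    min_idx = np.flatnonzero(y == minority_label)
    d = np.linalg.norm(X[min_idx][:, None, :] - X[None, :, :], axis=-1)
    d[np.arange(len(min_idx)), min_idx] = np.inf     # an example is not its own neighbour
    order = np.argsort(d, axis=1)                    # neighbours over the whole training set
    m = m0
    while True:
        nn = order[:, :m]
        in_danger = np.sum(y[nn] != minority_label, axis=1) > m / 2
        if 2 * in_danger.sum() >= len(min_idx) or 2 * m > 2 * len(min_idx):
            return min_idx[in_danger]
        m *= 2
```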

Furthermore, in order to reduce the disk space required, and knowing that in step 1 alone we already had 70,000 subsamples of the original sample for each database, we generated the samples by simply saving to a file the positions of the selected examples in the original database. The training and test samples were saved in this way. This way of storing samples also reduces the computational cost of the resampling processes using k-NN, since the distance matrix of the complete data set can be generated only once and then used in every intelligent resampling method to generate the set of samples in each fold.

4 Experimental results

This section is devoted to showing the results obtained in the two phases described: the calculation of the optimal class distribution (ocd) for each database (based on the random subsampling method), and a second phase in which we try to determine whether the ocd value helps us to achieve better results than the balanced class distribution for a set of different and well-known resampling methods. Both phases require an estimation of the best class distribution within a range of values. As we wanted these estimations to be as realistic as possible, we divided the whole sample belonging to each data set into training, validation and test data [26] and we estimated the optimal class distribution based only on the validation data.

In the first step of our experiment, we calculated the optimal class distribution for each database. For the C4.5 algorithm, a more detailed report of the results for each data set can be found in Appendix 1 of the technical report [1]. Table 3 shows all the results obtained for the C4.5 and PART algorithms, the 14 class distributions and the AUC performance metric. Results for the C4.5 algorithm are shown in the upper part of the table, whereas results for the PART algorithm are shown in the lower part. The columns correspond to the different class distributions and the rows to the different data sets. For each database, the best result is marked in bold, while all the results improving on those for the original distribution have a grey background. Moreover, the relative position of the original class distribution within the range of evaluated class distributions is denoted using a vertical double bar between columns (just as Weiss and Provost did). For example, for the Hypo data set the vertical bar indicates that the original distribution falls between the 2 and 5 % distributions (4.77 % from Table 2). It should be noted that each value in the table represents the average performance of 5,000 classifiers \((5\times 10\times 100)\), whereas in Weiss and Provost's work [38] this estimate was done with only 30 classifiers.

Table 3 Average AUC values, \(5{\times }\) 10CV, for C4.5 and PART classifiers

Observing the shape of the grey background in Table 3, we can conclude that, for both algorithms, the more imbalanced the data set, the greater the number of class distribution values that improve on the performance obtained with the original class distribution (orig column). The best results for each data set appear for class distributions around 50 %, and in only two data sets for C4.5 (Car and Credit-a) and three for PART (Car, Voting and Kr-vs-kp) out of 29 is the best AUC achieved with the original class distribution. Furthermore, if we analyze the average results for C4.5 and PART (Mean and Median rows in Table 3), the average AUC achieved for the 29 databases was greater than that achieved with the original class distribution for a wide range of class distributions (between 30 and 70 % for both algorithms), and the best average results were obtained for the 50 % distribution.

Based on these results, we determined the corresponding class-distribution values for the second step. We selected the class-distribution with the best AUC value (those marked in bold in Table 3) and two more values obtained by subtracting and adding 10 % to this value. For the cases where the balanced distribution is not one of these three values (marked with * in the ocd column in Table 4), we also considered the balanced distribution to compare the results. Thus, Table 4 shows the class-distribution values applied (Applied CDs) for each algorithm and database in the second step of the experiment and the original (original CD) class distribution.

Table 4 Original, optimal and selected class distributions for every data set

In the ocd column of Table 4, the cell has been shaded if this value is other than 50 %. As can be observed, this occurs in 16 data sets out of 29 for the C4.5 algorithm and in 12 data sets for the PART algorithm. The optimal values of the table range from 30 % up to as much as 95 %.

Once we had the class distribution values, we were able to carry out the second part of the experiment, a five times 10-fold cross-validation with 50 samples per fold, for each of the proposed class distributions and resampling methods. We then built C4.5 and PART classifiers for each sample and tested the results using AUC performance metric.

The results are shown in Table 5. The ORIGINAL row shows the average values of the AUC obtained using the whole training sample to build the classifiers; i.e. without a resampling method. This will be a reference value to evaluate whether or not it is worth using any resampling method.

Table 5 Mean values of AUC for each of the evaluated algorithms, resampling methods and class distributions used

In the rows below we can see the mean of these values obtained for each resampling method. The bal column shows the results for the balanced distribution, the ocd column shows the results for the optimal class distribution and the orm column shows the results for the optimal value obtained in step 2 [note that 3 (or 4) class distributions have been used].

By way of example, to better understand where these results come from, we have added (in Appendix 1) a table with the results for the 29 data sets, but only for the SMOTE cases and PART algorithm. This same analysis was done for all the methods and both algorithms, but for the sake of readability we summarize it in Table 5.

The results improving those obtained with no-resampling method (ORIGINAL) have a grey background, while those that show an improvement on the results obtained with balanced samples (bal column) are marked in bold.

The values in Table 5 show that, for both algorithms, most of the results obtained improve those obtained with the whole sample and shown in the ORIGINAL row (grey background). According to these results, for both algorithms, the optimal class distribution found in step 2 (orm) performs better than the balanced version for almost every resampling method; i.e. for every data set and every resampling method there is a class distribution better than 50 %.

With regard to the results obtained with ocd class distribution (ocd column), it can be observed that they are better (marked in bold) than the balanced ones (bal column) for more than half of the resampling methods evaluated.

We have analyzed the results based on the average values for the 29 data sets, but are the differences found between these results statistically significant?

We used the non-parametric tests proposed by Demšar [14] and García et al. [23, 24] to evaluate whether there is any statistically significant difference between the results obtained with the class distributions we propose and those obtained with the balanced class distribution. Demšar proposed the use of the non-parametric Friedman test [14] to compare more than two algorithms (or options) and discover whether or not significant differences appear between their behaviours. The Friedman test ranks the algorithms for each data set separately and compares the obtained average ranks using the FF statistic. Table 6 shows the average ranks and the result of the Friedman test for each resampling method.
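
For reference, the average ranks and the Friedman statistic can be computed along the following lines (a sketch using SciPy; note that SciPy reports the chi-square form of the statistic rather than the FF variant mentioned above, and we do not claim this is the exact implementation used):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_report(auc, names):
    """auc[i, j] holds the AUC of option j (e.g. ORIGINAL, bal, ocd, orm)
    on data set i. Returns the average rank of each option (rank 1 = best)
    together with the Friedman test statistic and p value."""
    ranks = np.vstack([rankdata(-row) for row in auc])   # rank options within each data set
    avg_ranks = dict(zip(names, ranks.mean(axis=0)))
    stat, p = friedmanchisquare(*auc.T)                  # one score vector per option
    return avg_ranks, stat, p
```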

Table 6 Average ranks and Friedman's test values related to the AUC evaluation metric

As can be observed in Table 6, orm, the optimal class distribution found for each resampling method, has the best average rank for most of the methods. It can also be observed that, for the C4.5 algorithm, statistically significant differences were found for four resampling methods at a significance level over 95 % (\(\alpha =0.05\), marked in light grey) and for one more at a significance level over 90 % (\(\alpha =0.1\), marked in dark grey). For the PART algorithm, SMOTE-ENN was the only method for which no significant differences were found (the rest showed differences at practically 100 % confidence).

Since significant differences were found in most cases, the next step was to perform post-hoc tests to discover between which pairs of options the differences appeared. We performed three different post-hoc tests. On the one hand, we performed the classic Nemenyi and Bonferroni-Dunn tests and obtained the CD (critical difference) diagrams (a graphical representation, proposed by Demšar, of the average ranks and significant differences when multiple algorithms are compared). On the other hand, we performed the Holm test (focused on the comparison between a control method and a set of options) because, being a more powerful test, it can detect more significant differences.

For the sake of clarity, the figures with the CD diagrams for C4.5 and PART have been moved to Appendix 2; we include a CD diagram for each resampling method analyzed.

Results regarding Holm's procedure are shown in Fig. 3. We applied Holm's procedure to compare the results achieved with the orm class distribution (the control method) with the results achieved with the other possible options (ORIGINAL, marked with squares; bal, marked with x; and ocd, marked with triangles). The graphic on the left of the figure shows the results for the C4.5 algorithm, whereas the graphic on the right shows them for the PART algorithm. The graphics include one axis per resampling method evaluated, on which the adjusted p values are shown. In addition to these p values, two octagons are drawn: the external one (dashed line) marks the p value \(\alpha =0.1\) and the internal one the p value \(\alpha =0.05.\) Thus, the figures can be interpreted in the following way: for each resampling method, the options that fall inside the internal octagon differ significantly from the orm option at a 95 % significance level, and the options that appear between the two octagons differ at a 90 % significance level.

Fig. 3

Star plot diagrams related to Holm's adjusted p values for C4.5 and PART and the eight resampling methods. The values on the axes are shown in logarithmic scale

The results in Fig. 3 show that for most of the resampling methods the orm option performs better than the rest, and for many of them with statistically significant differences. More specifically, in the case of the PART algorithm, significant differences were found between the orm and ORIGINAL options at a 95 % confidence level for every method except SMOTE-ENN, where the differences were found at a 90 % significance level. For the C4.5 algorithm, significant differences between the orm and ORIGINAL options at a 95 % significance level were found for the RANSUB, SMOTE and B_SMOTE1 resampling methods, and at a 90 % significance level for RANOVER and B_SMOTE2. If we compare the performance of the orm and bal options, even though the former obtained better ranks for every resampling method and algorithm, there are fewer significant differences. Specifically, for the C4.5 algorithm, significant differences at a 95 % significance level were found for SMOTE, B_SMOTE1 and B_SMOTE2, and at a 90 % significance level for ENN-SMOTE. With regard to the PART algorithm, differences at a 95 % significance level were found for SMOTE and B_SMOTE2. If we finally compare the results obtained with orm and ocd, very similar conclusions are obtained.

As a secondary result, we have also analyzed whether there are statistically significant differences between the two combinations of SMOTE and ENN: SMOTE-ENN and the ENN-SMOTE variant we propose. The test proposed by Demšar for pairwise comparisons is the Wilcoxon signed-ranks test. Table 7 shows the average relative improvement values (Rel.Impr. row) and the Wilcoxon test values (Wilcoxon row) for the three class distribution values used: bal, ocd and orm.

Table 7 Relative improvement and Wilcoxon's test values comparing SMOTE-ENN variants

Based on Table 7, we can conclude that there are no significant differences between applying the ENN cleaning method before or after the SMOTE method. However, as stated above, ENN-SMOTE has a lower computational cost and it also allows us to tune the final class distribution of the sample.

Finally, we wanted to know how many of the optimal class distribution values were other than 50 % and, when they were 50 %, how many times this value lay outside our selected range (ocd \({\pm }\,10\)). Table 8 shows the optimal class distribution value for each database and resampling method. Column 1 shows the database name, while column 2 shows its original class distribution. After that, for each algorithm, the optimal value obtained in step 1 is shown first and then the final class distribution value obtained for each resampling method. Values other than 50 % have a grey background. As a summary, the lower part of the table includes the mean of the optimal class distributions (Mean row) and the standard deviation (Std Dev row), as well as the number of times the optimal class distribution is 50 % (row #\(orm=50\)) and the number of times it is the same as the optimal value obtained in step 1 (row #\(orm=ocd\)). As a conclusion, we can say that although the optimal class distribution is usually a nearly balanced one (50 %), as indicated by the mean values, this is the best value in only a few cases. Some results need further explanation in order to be better understood. Although the optimal value in the two columns Optimal CD and RANSUB\(_{ocd}\) might be expected to be the same, since the same resampling method is used, this does not always happen (see row #\(orm=ocd\)). This is because chance plays a part in the method and, besides, a different number of samples was used for each option (100 subsamples for Optimal CD (step 1) and only 50 for all the resampling methods used in step 2, including RANSUB\(_{ocd}\)).

Table 8 Optimal class distribution values for all data sets and resampling methods using AUC metric

Having analyzed the results of the wide range of experiments, we can claim to have proposed a method that is able to find a pseudo-optimal class distribution to be used for a set of intelligent resampling methods.

5 Conclusions and further work

The aim of this work was to offer an approach (or methodology) that would help us to obtain better results in the context of machine learning by resampling data to obtain an optimum class distribution.

Most of the resampling methods are used to balance the classes to solve class imbalance problems. However, in this paper we have proposed using a class distribution value other than 50 % to improve the results, whatever the original class distribution may be. Our approach can be explained in a simple algorithmic way:

  1. Determine the best class distribution for a data set using the simplest and fastest resampling method (random subsampling) for 14 different class distribution values, from 2 to 98 %.

  2. Use the selected resampling method with this optimal class distribution and, also, with two values around it (\({\pm }10\)) to select the best value within this range.

The results obtained show that our methodology finds a class distribution which gives better results than the balanced one, with statistically significant differences in many cases, for eight resampling methods and two learning algorithms. Based on these results, we can conclude that our hypothesis is satisfied independently of the resampling method and learning algorithm used.

Since the use of this approach is not restricted to imbalanced data sets, we can conclude that when we want to address a learning problem, it is always worth changing the class distribution of the sample to be used for training, using any resampling method for this. Although beyond the context of this work, our experiment showed that a simple and non-intelligent resampling method, random subsampling without size limitations (RANSUB), can achieve very competitive results.

As a complementary result, we have also shown the results of ENN-SMOTE, a variant of the known SMOTE-ENN method in which the ENN cleaning process is applied before SMOTE. Its performance is similar to SMOTE-ENN's (in the context of the data sets used in this paper), but its computational cost is lower, since ENN reduces the size of the sample before applying SMOTE. In addition, the same distance matrix can be used after applying ENN. Therefore, a way to address this drawback of SMOTE-ENN could be investigated in the future, as previously stated.

In addition to proposing the methodology, we answered some previously mentioned questions. It has been shown that each database has its own optimal class distribution, which may or may not be other than 50 %. This is not an important factor for our methodology, since this value will be found even if it is 50 %. However, this optimal value also depends on the resampling method used, which is taken into account in the second step of the methodology. We are conscious that this value could be further improved, since we search for it within a small range of values. As related further work, we intend to explore the trade-off between broadening the range in which the optimal value is searched and the time needed to perform the search. Moreover, the use of optimization techniques such as genetic algorithms, simulated annealing, etc. would be of interest for this search. These kinds of techniques could also be applied to the search in Step 1, or this step could even be left out.

Another important question has been answered. We wanted to know whether it is worth resampling a non-imbalanced data set. As the improvement obtained for non-imbalanced data sets has been as significant as that obtained for imbalanced ones, we can be sure that the methodology guides us to a better result whatever the original class distribution.