Before using any resampling method, a question that is often ignored needs to be answered: what is the final class distribution we want to obtain?
As we mentioned in Sect. 1, most published works use resampling methods to obtain balanced class distributions. Although balancing is in general a reasonable choice, it does not take into account the particular characteristics of each database (problem) and it ignores the fact that a better class distribution could exist.
In this work, we will divide the process of finding the optimal class distribution into two stages. First, we will use the simplest and computationally cheapest resampling method (random subsampling) to find the best class distribution. Then, we will use this class distribution (and values close to it) with each of the selected resampling methods to determine whether they yield better results than the balanced distribution.
In order to obtain the optimal class distribution for each data set, we used the partial results for C4.5 published in the technical report [1] based on Weiss and Provost’s work [38]. Although 30 databases were used in the technical report, in this work we decided not to use the Fraud database (a fraud detection problem in car insurance companies) as it was too big (with oversampling techniques we would generate samples of 200,000 cases) and it would not give us any additional information.
Weiss and Provost stated that the best value for the class distribution is data dependent (context or problem dependent), and they used random subsampling to determine the best of 13 different class distributions: 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 95 %, plus the original distribution (Albisua et al. added the 98 % value in order to make the scanning symmetric, and we did the same in this work). We carried out the same experiments with the PART algorithm.
We first performed the experiments with 29 two-class real problems, all belonging to the UCI Repository benchmark [20]. Table 2 summarizes the characteristics of the databases used in the experiment, ordered by their original class distribution from most imbalanced to most balanced. Just as Weiss and Provost did, we transformed every database with more than two classes into a two-class problem (the table shows the number of classes before the transformation). We used a 10-fold cross-validation methodology five times (\(5\times\)10CV) to estimate the generalization capacity of the classifiers, based on the AUC performance metric; as a consequence, we obtained 50 pairs of training and test samples for each database. From each training sample, in order to reduce the effect of chance, we then generated 100 subsamples for each of 14 class distributions (from 2 to 98 %), following the same subsampling process Weiss and Provost used in their work [38]. Figure 1 represents this sample generation process for a database. Thus, in this experiment we used \(5\times 10\times 14\times 100\) (70,000) samples per database. All the samples generated from one database were of identical size: the number of examples of the minority class. For the 29 databases, the average size of the generated samples is 27.64 % of the training samples (the same as the mean of the original class distributions).
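To make the procedure concrete, the following sketch shows how one such fixed-size subsample could be drawn for a given class distribution. The code is illustrative only: the function and variable names are ours, and the minority class is assumed to be labelled 1.

```python
import numpy as np

def draw_subsample(y, target_dist, rng):
    """Draw one subsample whose size equals the number of minority examples
    and whose minority proportion is (approximately) target_dist."""
    minority = np.flatnonzero(y == 1)      # assume label 1 marks the minority class
    majority = np.flatnonzero(y == 0)
    size = len(minority)                   # fixed sample size: minority class count
    n_min = int(round(size * target_dist))
    idx_min = rng.choice(minority, n_min, replace=False)
    idx_maj = rng.choice(majority, size - n_min, replace=False)
    return np.concatenate([idx_min, idx_maj])

rng = np.random.default_rng(0)
# 13 fixed distributions, plus each data set's original distribution (14 in total)
dists = [0.02, 0.05] + [d / 100 for d in range(10, 100, 10)] + [0.95, 0.98]
# e.g. 100 subsamples per class distribution for one training fold:
# subsamples = [draw_subsample(y_train, d, rng) for d in dists for _ in range(100)]
```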
Table 2 Domains used in the experiment and their characteristics
We used these samples to build C4.5 and PART classifiers. In each fold we built a classifier with each of the 1,400 samples generated (14 class distributions \(\times\) 100 samples). Thus, for the 29 databases we built 2,030,000 C4.5 trees and the same number of PART rule sets.
As Weiss and Provost pointed out, when the class distribution of the sample used to induce a classifier is changed, a correction factor (oversampling ratio) has to be applied at test time so that the induced model is adapted to the class distribution expected in reality. We applied this correction in our experiments.
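One standard form of this kind of prior correction, shown here only as an illustrative sketch (the notation and function name are ours), reweights the estimated posterior by the ratio between the real and the training class priors:

```python
def correct_posterior(p_hat, train_prior, real_prior):
    """Adjust a model's estimated minority-class probability p_hat, obtained on a
    resampled training set with minority prior train_prior, to the minority prior
    expected at test time (real_prior)."""
    w_min = real_prior / train_prior                  # reweight minority evidence
    w_maj = (1.0 - real_prior) / (1.0 - train_prior)  # reweight majority evidence
    num = p_hat * w_min
    return num / (num + (1.0 - p_hat) * w_maj)

# Example: a leaf estimates P(minority) = 0.6 on a balanced sample (50 %), while
# the distribution expected in reality is 10 % minority:
# correct_posterior(0.6, 0.5, 0.1) ~= 0.14
```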
Once we know the optimal class distribution for each database, we want to use this value with different resampling methods and determine whether the results are better than those obtained with balanced samples. However, since we suspect that this value may not be optimal for every resampling technique, we will also use two additional values close to it.
We repeated almost the same experiment (10-fold cross-validation five times), but using different resampling methods and limiting the range of scanned class distributions. We only used the optimal class distribution obtained in the first step (ocd), the next value (\(ocd+10\) %) and the previous value (\(ocd-10\) %) (see Fig. 2). Moreover, we wanted to compare the results obtained with these class distributions with those obtained with balanced samples; for this reason, if the 50 % value was not among those scanned for a data set, it was added as well. For example, if the optimal class distribution for a data set is 30 % (\(ocd=30\) %), we would use the 20, 30, 40 and 50 % values.
In addition, we evaluated the proposal with eight different resampling methods: the six well-known methods described in Sect. 2.1 (RANSUB, RANOVER, SMOTE, B_SMOTE1, B_SMOTE2 and SMOTE-ENN), using two versions of RANSUB, plus a variant of SMOTE-ENN (ENN-SMOTE).
The first version of RANSUB is the one explained previously, i.e. the one used to determine the optimal class distribution (ocd); we will refer to it as RANSUB\(_{ocd}\). Since the size of these samples is very small (following Weiss and Provost's methodology), we also used a random subsampling method with no size limitation, i.e. randomly removing examples from one of the classes only, until the desired class distribution is achieved. We will refer to this version as RANSUB.
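This second version can be sketched as follows (illustrative code, assuming the target distribution is higher than the original minority proportion, so that only majority class examples are removed; identifiers are ours):

```python
import numpy as np

def ransub(y, target_dist, rng):
    """Random subsampling with no size limit: keep every minority example and
    randomly drop majority examples until the minority proportion is target_dist.
    Assumes target_dist is larger than the original minority proportion."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    # majority size needed so that the minority proportion equals target_dist
    n_maj = int(round(len(minority) * (1.0 - target_dist) / target_dist))
    idx_maj = rng.choice(majority, n_maj, replace=False)
    return np.concatenate([minority, idx_maj])
```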
Since randomness plays an important role in every resampling method, for each training sample of the \(5\times\)10CV we generated 50 samples for each method and class distribution, as shown in Fig. 2. The SMOTE-ENN method is an exception: in this case we generated a single sample instead of 50. To apply this method 50 times, we had to recalculate a new distance matrix for each sample generated with SMOTE and then apply ENN; this caused disk-space problems and extended the experimental time by months. Addressing this problem remains future work.
Figure 2 summarizes the resampling process carried out for each training sample of the \(5\times\)10CV. The methods used have been grouped according to the size of the samples they generate, and the figure tries to reflect these sizes proportionally, as well as how they vary with the class distribution applied for each resampling method. In fact, as stated above, taking the mean of the original class distributions of the 29 data sets, the average size of the samples generated with RANSUB\(_{ocd}\) is 27.64 % of the training sample; within each data set this size is the same regardless of the class distribution used, just as in Weiss and Provost's work.
If we suppose the optimal class distribution (ocd) is 50 %, for an original class distribution of 27.64 % RANSUB would generate subsamples twice that size (55.28 % of the training sample); the sizes for \(ocd-10\) % and \(ocd+10\) % would be 69.10 and 46.07 %, respectively. However, for the other four oversampling methods (from RANOVER to B_SMOTE2), the sizes would be 120.60, 144.72 and 180.90 % (for \(ocd-10\) %, ocd and \(ocd+10\) %, respectively).
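In general, if \(p\) denotes the original minority proportion of the training sample and \(d\) the class distribution to be obtained (the notation is ours), these relative sizes follow directly from
\[
\mathrm{size}_{\text{RANSUB}}=\frac{p}{d},\qquad \mathrm{size}_{\text{oversampling}}=\frac{1-p}{1-d},
\]
so that, for \(p=27.64\,\%\) and \(d=50\,\%\), we obtain the 55.28 and 144.72 % values mentioned above.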
Finally, we cannot know a priori the size of the samples generated by the combination of the SMOTE and ENN methods because we do not know how many examples will be erased with the ENN cleaning method before or after applying SMOTE. However, we can guarantee that their size will be smaller than the size of the samples of the previous group.
As we previously indicated, AUC was the performance metric selected in our experiments to determine the best class distribution value. Moreover, we used the non-parametric tests proposed by Demšar [14] and García et al. [23, 24] to evaluate the statistical significance of the results.
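As an illustration, the omnibus Friedman test over the per-data-set AUC values can be computed with SciPy; the scores below are invented and serve only to show the call, and the post-hoc procedures of [23, 24] are applied afterwards on the resulting ranks.

```python
from scipy.stats import friedmanchisquare

# AUC of each method on each data set (one list per method; values are invented,
# for illustration only).
auc_method_a = [0.91, 0.85, 0.78, 0.88, 0.90]
auc_method_b = [0.89, 0.86, 0.75, 0.87, 0.88]
auc_method_c = [0.93, 0.88, 0.80, 0.90, 0.91]

stat, p_value = friedmanchisquare(auc_method_a, auc_method_b, auc_method_c)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
```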
Implementation issues
Some of these methods require a distance metric to find the k nearest neighbours (k-NN algorithm) of a minority class example. The distance between examples is usually based on the Euclidean distance; however, the Euclidean distance is not adequate for qualitative (rather than quantitative) features. We implemented SMOTE using the HVDM (Heterogeneous Value Difference Metric) distance [39], which uses the Euclidean distance for quantitative attributes and the VDM distance for qualitative ones. The VDM metric takes into account the similarities between the possible values of each qualitative attribute.
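A simplified sketch of this distance, assuming that the per-attribute statistics (standard deviations for quantitative attributes and conditional class probabilities for qualitative ones) have been precomputed from the training data, could be the following (identifiers are ours):

```python
import numpy as np

def hvdm(x, y, stats):
    """Heterogeneous Value Difference Metric between two examples.
    `stats` holds one entry per attribute: {'sigma': s} for a quantitative
    attribute, or {'pcv': {value: vector of P(class | value)}} for a
    qualitative one (the vectors are NumPy arrays estimated from training data)."""
    total = 0.0
    for a, st in enumerate(stats):
        if x[a] is None or y[a] is None:          # missing value: maximal distance
            d = 1.0
        elif 'sigma' in st:                        # quantitative: normalised difference
            d = abs(x[a] - y[a]) / (4.0 * st['sigma'])
        else:                                      # qualitative: normalised VDM
            px, py = st['pcv'][x[a]], st['pcv'][y[a]]
            d = np.sqrt(np.sum((px - py) ** 2))
        total += d * d
    return np.sqrt(total)
```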
A further problem is that examples can have missing values. These are usually replaced by the mean of that attribute over the rest of the examples of the same class, in the case of quantitative attributes, and by the mode in the case of qualitative attributes.
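An illustrative sketch of this class-conditional imputation (using pandas; identifiers are ours):

```python
import pandas as pd

def impute_by_class(df, class_col, numeric_cols, nominal_cols):
    """Replace missing values with the class-conditional mean (quantitative
    attributes) or mode (qualitative attributes)."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = out.groupby(class_col)[col].transform(lambda s: s.fillna(s.mean()))
    for col in nominal_cols:
        out[col] = out.groupby(class_col)[col].transform(lambda s: s.fillna(s.mode().iloc[0]))
    return out
```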
It should be noted that we used the same distance implementations used in the well-known WEKA machine learning workbench [7] for our implementation of the SMOTE method and its variants.
In most of the works using this kind of method, where a k-NN algorithm is used, the value of k is not specified. Chawla et al. [10] concluded that 5 was a good value for k, and this is the value assumed in the works reviewed.
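For reference, the core interpolation step of SMOTE with k = 5 can be sketched as follows; for brevity the sketch uses a plain Euclidean neighbour search and assumes more than k minority examples, whereas, as explained above, our implementation uses the HVDM distance (identifiers are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority examples by interpolating each selected
    minority example with one of its k nearest minority neighbours (k = 5,
    following Chawla et al. [10])."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)        # +1: the point itself
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))               # a random minority example
        j = neigh[i, rng.integers(k)]              # one of its k nearest neighbours
        gap = rng.random()                          # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```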
On the other hand, for the Borderline-SMOTE1 and Borderline-SMOTE2 methods, the authors explain that half of the minority class should be on the borderline before applying SMOTE, but the way the m value should be selected and the procedure for determining the cases in danger (those on the borderline) are left as future work in their paper [25]. Therefore, we explain here how we implemented the danger case detection.
We first instantiate m with a small value (5). Then we check which minority examples would be in danger taking into account their m nearest neighbours. If at least half of the minority class examples are in danger, the search finishes; if not, we double the value of m and search again for cases in danger. The idea is to repeat the procedure until we have enough danger cases, but there is a small problem: if m becomes greater than twice the minority class size, all the minority examples will be in danger and, therefore, all the cases will be selected; as a result, the Borderline-SMOTE methods would become standard SMOTE. To avoid this, our search finishes before m becomes too big (the algorithm is shown in Appendix 3).
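The following sketch reflects this search; the exact algorithm is the one given in Appendix 3, and the callback and identifiers used here are merely illustrative:

```python
import numpy as np

def find_danger_cases(neighbour_labels, n_minority, m_init=5):
    """Search for the m value (starting at 5 and doubling it) for which at least
    half of the minority examples are 'in danger': more than half, but not all,
    of their m nearest neighbours belong to the majority class.
    neighbour_labels(i, m) must return a NumPy array with the labels
    (0 = majority, 1 = minority) of the m nearest neighbours of minority
    example i, computed on the whole training sample."""
    m, danger = m_init, []
    while m <= 2 * n_minority:             # stop before m becomes too big
        danger = []
        for i in range(n_minority):
            n_maj = int(np.sum(neighbour_labels(i, m) == 0))
            if m / 2 <= n_maj < m:         # borderline case, but not pure noise
                danger.append(i)
        if len(danger) >= n_minority / 2:  # enough danger cases found
            break
        m *= 2                             # otherwise enlarge the neighbourhood
    return m, danger
```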
Furthermore, in order to reduce the disk space required (bearing in mind that in step 1 alone we already had 70,000 subsamples of the original sample for each database), we generated the samples by simply saving to a file the positions of the selected examples in the original database. Both training and test samples were saved in this way. This way of storing samples also reduces the computational cost of the resampling processes based on k-NN, since the distance matrix of the complete data set can be computed only once and then reused by every intelligent resampling method to generate the set of samples in each fold.
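An illustrative sketch of this storage scheme (file names, data and variables are invented):

```python
import numpy as np

X = np.random.rand(1000, 8)                 # original data set (illustrative)
y = np.random.randint(0, 2, 1000)
dist_matrix = np.zeros((1000, 1000))        # distance matrix computed once per data set
sample_indices = np.arange(100)             # positions of one generated sample

# Save the sample as positions in the original data set, not as copied examples.
np.savetxt("sample_fold03_dist40_017.idx", sample_indices, fmt="%d")

# Rebuild the sample later and reuse the single precomputed distance matrix
# for every k-NN based resampling method applied in that fold.
idx = np.loadtxt("sample_fold03_dist40_017.idx", dtype=int)
X_sample, y_sample = X[idx], y[idx]
dist_sample = dist_matrix[np.ix_(idx, idx)]
```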