Boosting for class-imbalanced datasets using genetically evolved supervised nonlinear projections
Abstract
It has repeatedly been shown that most classification methods suffer from an imbalanced distribution of training instances among classes. Most learning algorithms expect an approximately even distribution of instances among the different classes and suffer, to different degrees, when that is not the case. Dealing with the class-imbalance problem is a difficult but relevant task, as many of the most interesting and challenging real-world problems have a very uneven class distribution. In this paper we present a new approach for dealing with class-imbalanced datasets based on a new boosting method for the construction of ensembles of classifiers. The approach uses the distribution of the weights produced by a given boosting algorithm to obtain a supervised projection. The supervised projection is then used to train the next classifier using a uniform distribution of the training instances. We tested our method using 35 class-imbalanced datasets and two different base classifiers: a decision tree and a support vector machine. The proposed methodology proved its usefulness, achieving better performance than other methods both in terms of the geometric mean of specificity and sensitivity and the area under the ROC curve.
Keywords
Data mining · Class-imbalanced problems · Boosting · Real-coded genetic algorithms
1 Introduction
A classification problem of \(K\) classes and \(n\) training observations consists of a set of instances whose class membership is known. Let \(S = \{(\mathbf x _1, y_1), (\mathbf x _2, y_2), \ldots , (\mathbf x _n, y_n)\}\) be a set of \(n\) training samples where each instance \(\mathbf x _i\) belongs to a domain \(X\). Each label is an integer from the set \(Y = \{1, \ldots , K\}\). A multiclass classifier is a function \(f: X \rightarrow Y\) that maps an instance \(\mathbf x \in X \subset R^D\) into an element of \(Y\).
The task is to find a definition for the unknown function, \(f(\mathbf x )\), given the set of training instances. In a classifier ensemble framework we have a set of classifiers \(F = \{f_1, f_2, \ldots , f_T\}\), each classifier performing a mapping of an instance vector \(\mathbf x \in R^D\) into the set of labels \(Y = \{1, \ldots , K\}\). The design of classifier ensembles must face two main tasks: constructing the individual classifiers, \(f_k\), and developing a combination rule that finds a class label for \(\mathbf x \) based on the outputs of the classifiers \(\{f_1(\mathbf x ), f_2(\mathbf x ), \ldots , f_T(\mathbf x )\}\).
One of the distinctive features of many common problems in current data-mining applications is the uneven distribution of the instances of the different classes. In extremely active research areas, such as artificial intelligence in medicine, bioinformatics or intrusion detection, two classes are usually involved: a class of interest, or positive class, and a negative class that is overrepresented in the datasets. This is usually referred to as the class-imbalance problem [28]. In highly imbalanced problems, the ratio between the positive and negative classes can be as high as 1:1,000 or 1:10,000. In recent years much attention has also been given to class-imbalanced multiclass problems [13].
Methods for dealing with the class-imbalance problem can be grouped into three categories:

Internal approaches acting on the algorithm. These approaches modify the learning algorithm to deal with the imbalance problem. They can adapt the decision threshold to create a bias toward the minority class or introduce costs in the learning process to compensate for the underrepresentation of the minority class.

External approaches acting on the data. These algorithms act on the data instead of the learning method. They have the advantage of being independent from the classifier used. There are two basic approaches: oversampling the minority class and undersampling the majority class.

Combined approaches that are based on ensembles of classifiers, most commonly boosting, and that account for the imbalance in the training set. These methods modify the basic boosting method to account for minority class underrepresentation in the dataset.
Data-driven algorithms can be broadly classified into two groups: those that undersample the majority class and those that oversample the minority class. There are also algorithms that combine both processes. Both undersampling and oversampling can be achieved randomly or through a more complicated process of searching for the least or most useful instances. Previous works have shown that undersampling the majority class usually leads to better results than oversampling the minority class [5] when oversampling is performed using sampling with replacement from the minority class. Furthermore, combining undersampling of the majority class with oversampling of the minority class has not yielded better results than undersampling of the majority class alone [41]. One of the possible sources of the problematic performance of oversampling is the fact that no new information is introduced in the training set, as oversampling must rely on adding new copies of minority class instances already in the dataset. Sampling has proven a very efficient method of dealing with class-imbalanced datasets [11, 56].
Removing instances only from the majority class, usually referred to as one-sided selection [34], has two major problems. Firstly, reduction is limited by the number of instances of the minority class. Secondly, instances from the minority class are never removed, even when their contribution to the model's performance is harmful.
The third approach has received less attention. However, we believe that the approaches based on boosting are more promising because these methods have proven their ability to improve the results of many base learners in balanced datasets. Furthermore, when the imbalance ratio is high it is likely that the only way of obtaining good performance is by means of combining many classifiers.
An ensemble of classifiers consists of a combination of different classifiers, homogeneous or heterogeneous, to perform a classification task jointly. Ensemble construction is one of the fields of machine learning that is receiving more research attention, mainly due to the significant performance improvements over single classifiers that have been reported with ensemble methods [1, 3, 25, 32, 55].
Techniques using multiple models usually consist of two independent phases: model generation and model combination [44]. Most techniques are focused on obtaining a group of classifiers which are as accurate as possible but which disagree as much as possible. These two objectives are somewhat conflicting, since if the classifiers are more accurate, it is obvious that they must agree more frequently. Many methods have been developed to enforce diversity on the classifiers that form the ensemble [8].
In this paper we propose a new ensemble for class-imbalanced datasets based on boosting. Boosting trains a classifier on a biased distribution of the training set to optimize weighted training error. However, for some problems, optimizing this weighted error may harm the overall performance of the ensemble, because too much relevance is given to incorrectly labelled instances and outliers. Our approach is based on optimizing the weighted error in a less aggressive way. We use the biased distribution of instances given by the boosting algorithm to obtain a supervised projection of the original data into a new space where the weighted error is improved. Then, the classifier is trained using this projection and with a uniform distribution of the training instances. In this way, the view of the data the classifier sees is biased towards difficult instances, as the supervised projection is obtained using the biased distribution of instances given by the boosting algorithm, but without putting too much pressure on correctly classifying misclassified instances. Informally, a supervised projection is a projection constructed using both the inputs and the labels of the patterns, with the aim of improving the classification accuracy of any given learner.
The projections are constructed using a real-coded genetic algorithm (RCGA). The use of such a method allows a more flexible approach than other boosting methods. Standard boosting methods focus on optimizing weighted accuracy. However, as is well known, accuracy is not an appropriate measure for class-imbalanced datasets. In our approach, the RCGA uses as fitness value a measure that is specific to class-imbalanced problems, the geometric mean of specificity and sensitivity (the so-called \(G\)-mean). The presented method is named the Genetically Evolved Supervised Projection Boosting (GESuperPBoost) algorithm.
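The fitness measure named above can be illustrated with a short sketch. It assumes binary labels and takes the \(G\)-mean to be the square root of sensitivity times specificity; the function name and the convention of returning 0 when a class is absent are choices made here for illustration, not details taken from the paper.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Assumed convention: a rate is 0 when its class has no instances.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# A classifier that always predicts the majority class reaches 80 % accuracy
# on this toy sample but a G-mean of 0, because its sensitivity is 0.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
print(g_mean(labels, always_negative))
```

The toy example shows why accuracy is misleading here: ignoring the minority class entirely is penalized by the \(G\)-mean but barely by accuracy.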
A final feature is introduced to account for the class-imbalanced nature of the datasets. Before obtaining the supervised projection by means of the RCGA, we randomly undersample the majority class. With this process we obtain three relevant benefits. Firstly, we make the algorithm faster; secondly, we obtain better performance due to the balanced datasets used for the RCGA; thirdly, we introduce diversity in the ensemble from the use of the random undersampling.
Unsupervised [49] and supervised projections [22, 23] have been used before for constructing ensembles of classifiers. However, the use of genetically evolved supervised projections is new, and most of the previous approaches were devoted to indirect methods to obtain what we could call semi-supervised projections.
This paper is organized as follows: Sect. 2 describes in detail the proposed methodology and its rationale; Sect. 3 explains the experimental setup; Sect. 4 shows the results of the experiments performed; and finally Sect. 5 states the conclusions of our work.
2 Supervised projection approach for boosting classifiers
In order to avoid the stated harmful effect of maximizing the margin of noisy instances, we do not use the adaptive weighting scheme of boosting methods to train the classifiers, but to obtain a supervised projection. The supervised projection is aimed at optimizing the weighted error given by the boosting algorithm. However, instead of optimizing accuracy, we use the \(G\)-mean measure common in class-imbalanced datasets. Other mechanisms, described below, were also used to account for the class-imbalanced nature of the problems.
The classifier is trained using the supervised projection obtained with the RCGA and with a uniform distribution of the instances. Thus, we obtain a method that benefits from the adaptive instance weighting of boosting but that is also able to mitigate its drawbacks. We are extending the philosophy of a previous paper [23] where we showed that the performance of ensembles of classifiers can be improved using nonlinear projections constructed with only misclassified instances.
The popularity of boosting methods is mainly due to the success of AdaBoost. However, AdaBoost tends to perform very well for some problems but can also perform very poorly for others. One of the sources of the bad behavior of AdaBoost is that although it is always able to construct diverse ensembles, in some problems the individual classifiers tend to have large training errors [9]. Moreover, AdaBoost usually performs poorly on noisy problems [1, 9]. Schapire and Singer [51] identified two scenarios where AdaBoost is likely to fail: (i) when there is insufficient training data relative to the “complexity” of the base classifiers; and (ii) when the training errors of the base classifiers become too large too quickly. Unfortunately, these two scenarios are likely to occur in real-world problems. Several papers have attributed the failure of boosting methods, especially in the presence of noise, to the fact that the skewed data distribution produced by the algorithm tends to assign too much weight to hard instances [9]. In class-imbalanced datasets this problem may be even harder because the distribution of the patterns in the classes is uneven.
As we have stated, in a general classification problem, we have a training set \(S = \{(\mathbf x _1, y_1), (\mathbf x _2, y_2), \ldots , (\mathbf x _n, y_n)\}\) of \(n\) training samples where each instance \(\mathbf x _i\) belongs to a domain \(X\). Let us assume that we also have a vector \(\mathbf w \) that assigns a weight \(w_i\) to each instance \(\mathbf x _i\). In a general sense, a supervised projection is a projection constructed using both the inputs and the labels of the patterns. More specifically, in our framework, a supervised projection, \(\Phi \), is a projection into a new space where the weighted error \(\epsilon = \sum _{i=1}^n w_i \left[\!\left[ {f(\Phi (\mathbf x _i)) \ne y_i} \right]\!\right]\) for a certain classifier \(f(\mathbf x )\) trained on the projected space is improved with respect to the error using the original variables, where \(\left[\!\left[ \pi \right]\!\right]\) is 1 if \(\pi \) holds and 0 otherwise. Thus, the idea is to search for a projection of the original variables that is able to improve the weighted error even when the learner is trained using a uniform distribution of the patterns. This supervised projection, which is calculated using inputs and pattern labels, leads to an informed or biased feature space, which will be more relevant to the particular supervised learning problem [58].
To illustrate the method, let us explain how step \(k\) differs from that of a standard boosting algorithm. In a boosting algorithm, after adding \(k-1\) classifiers, we obtain the instance weight vector for training the \(k\)th classifier, \(\mathbf w ^k\). Then, the \(k\)th classifier is trained associating to each instance \(i\) a weight \(w_i^k\), which is used directly by the classifier, if it admits instance weights, or used to obtain a biased sample from the training set if not. The aim is to obtain a classifier that minimizes the weighted error \(\epsilon _k = \sum _{i=1}^n w_i^k\left[\!\left[ {f_k(\mathbf x _i) \ne y_i} \right]\!\right]\). Once trained, the \(k\)th classifier is added to the ensemble, and the process is repeated for adding the \((k+1)\)th classifier. In our method, after adding \(k-1\) classifiers, we obtain the instance weight vector for training the \(k\)th classifier, \(\mathbf w ^k\), using the weighting scheme of a certain boosting algorithm of our choice. However, this weight vector is not used to train the new classifier. Instead, a supervised projection, \(P_k\), of the inputs is constructed with the objective of optimizing a certain accuracy criterion that takes into account that the problem is class-imbalanced. Then, the original instances \(\mathbf x _i\) are projected using this supervised projection to obtain a new training set \(Z_k = \{\mathbf{z}_i :\mathbf{z}_i = P_k(\mathbf{x}_i)\}\). Classifier \(k\) is trained on \(Z_k\) using a uniform distribution. Once trained, it is added to the ensemble and the process is repeated.
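The round structure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: labels are coded as -1/+1, the reweighting and the classifier coefficient follow AdaBoost (the text leaves the boosting scheme open), and `fit_projection` and `fit_classifier` are hypothetical callables supplied by the caller. In the paper the projection is evolved with the RCGA; here it is simply whatever `fit_projection` returns.

```python
import math

def boost_with_projections(X, y, rounds, fit_projection, fit_classifier):
    """Boosting loop in which the weights guide the projection, not the classifier.

    X: list of feature lists; y: labels in {-1, +1}.
    fit_projection(X, y, w) -> function mapping one instance to its projection.
    fit_classifier(Z, y)    -> function mapping a projected instance to -1 or +1.
    Both callables are hypothetical placeholders.
    """
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # entries are (alpha, projection, classifier)
    for _ in range(rounds):
        project = fit_projection(X, y, w)      # the only place the weights are used
        Z = [project(x) for x in X]
        clf = fit_classifier(Z, y)             # uniform distribution: no weights passed
        preds = [clf(z) for z in Z]
        eps = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, project, clf))
        # AdaBoost-style reweighting (an assumption of this sketch).
        w = [wi * math.exp(-alpha * t * p) for wi, p, t in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Additive combination of the projected classifiers."""
    score = sum(alpha * clf(project(x)) for alpha, project, clf in ensemble)
    return 1 if score >= 0 else -1
```

The design point the sketch tries to make visible is that `fit_classifier` never receives the weight vector: the bias toward hard instances reaches the classifier only through the projected inputs.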
The proposed method shares the philosophy of our previous method based on nonlinear projections [23]. In that method new classifiers were added to the ensemble considering only the instances misclassified by the previous classifier. To avoid the large bias of using only those instances to learn the classifier, they were not used for training the classifier but for constructing a projection where their separation by the classifier was easier. In the present method, we do not use misclassified instances, but the distribution of instances given by a boosting method. This method has several advantages over the previous one. We can use the theoretical result about convergence of the training error of boosting to assure convergence to perfect classification under conditions similar to those for AdaBoost. The weights assigned by boosting and used for constructing the nonlinear projection summarize the difficulty in classifying each instance as the ensemble grows, instead of relying on just the last classifier. The method in [23] does not use instance weights, and it is inspired more by the random subspace method (RSM) [30], using nonlinear projections instead of subspace projections. The difference in philosophy is that, while RSM uses random projections, the method in [23] uses nonlinear projections constructed using only misclassified instances. Furthermore, part of the theory developed for boosting is applicable to the new approach but not to the previous one. In addition, the proposed model constructs an additive model as boosting does, whereas the previous one used a simple voting scheme to obtain the final output of the ensemble.
2.1 Constructing supervised projections using a RCGA
We have stated that our method is based on using a supervised projection to train the classifier at round \(k\). But what exactly do we understand by a supervised projection? The intuitive meaning of a supervised projection using a weight vector \(\mathbf w ^k\) is a projection into a space where the weighted error achieved by any classifier trained using the projection is minimized. In a previous work [21] we defined the concept of supervised projection as follows:
Definition 1
Supervised projection. Let \(S = \{(\mathbf x _1, y_1), (\mathbf x _2, y_2), \ldots , (\mathbf x _n, y_n)\}\) be a set of \(n\) training samples where each instance \(\mathbf x _i\) belongs to a domain \(X\), and \(\mathbf w \) a vector that assigns a weight \(w_i\) to each instance \(\mathbf x _i\). A supervised projection \(\Phi \) is a projection into a new space where the weighted error \(\epsilon = \sum _{i=1}^n w_i \left[\!\left[ { f(\Phi (\mathbf x _i)) \ne y_i} \right]\!\right]\) for a certain classifier \(f(\mathbf x )\) trained on the projected space is improved with respect to the error using the original variables \(\epsilon _o = \sum _{i=1}^n w_i \left[\!\left[ { f_o(\mathbf x _i) \ne y_i} \right]\!\right] \) for training a classifier \(f_o\).
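As a small worked instance of Definition 1, the sketch below computes the weighted error as the sum of the weights of the misclassified instances and checks whether a projection qualifies. The function names and the toy numbers are illustrative assumptions; the predictions are taken as given rather than produced by a real classifier.

```python
def weighted_error(weights, predictions, labels):
    """Sum of the weights of the instances whose prediction differs from the label."""
    return sum(w for w, p, t in zip(weights, predictions, labels) if p != t)

def improves_weighted_error(weights, labels, preds_original, preds_projected):
    """True when the classifier on the projected space has the lower weighted error."""
    return (weighted_error(weights, preds_projected, labels)
            < weighted_error(weights, preds_original, labels))

weights = [0.1, 0.1, 0.4, 0.4]   # hypothetical boosting weights, summing to 1
labels = [1, 1, 0, 0]
preds_original = [1, 1, 1, 0]    # one mistake, on a heavily weighted instance
preds_projected = [0, 1, 0, 0]   # one mistake, on a lightly weighted instance

print(weighted_error(weights, preds_original, labels))
print(weighted_error(weights, preds_projected, labels))
print(improves_weighted_error(weights, labels, preds_original, preds_projected))
```

Both classifiers make exactly one mistake, so their unweighted accuracy is identical; only the weighted error distinguishes them, which is the sense in which the projection is "supervised" by the weight vector.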
Most methods for projecting data focus on the features of the input space and do not take into account the labels of the instances, as they are designed for non-labelled data and aimed at data analysis. Nevertheless, methods for supervised projection, in the sense that class labels are taken into account, do exist in the literature. For instance, S-Isomap [27] is a supervised version of the Isomap [53] projection algorithm which uses class labels to guide the manifold learning. Projection pursuit, a technique initially developed for exploratory data analysis [33], has also been applied to supervised projections [46], where projections are chosen to minimize estimates of the expected overall loss in each projection pursuit stage. Lee et al. [36] introduced new indexes derived from linear discriminant analysis that can be used for exploratory supervised classification. However, none of these methods contemplates the possibility of assigning a weight to each instance measuring its relevance.
As we have said, in developing a method for obtaining a supervised projection we face a problem similar to feature selection/extraction. There are two basic approaches for feature selection/extraction. The filter approach separates feature selection/extraction from the learning of the classifier, and uses measures that do not depend on the learning method. The wrapper approach considers that the selection of inputs (or extraction of features) and the learning algorithm cannot be separated. Both approaches are usually developed as an optimization process where a certain objective function must be optimized.
Filter methods are based on performance evaluation metrics calculated directly from the data, without direct reference to the results of any induction algorithm. Such algorithms are usually computationally cheaper than those that use a wrapper approach. However, one of the disadvantages of filter approaches is that they may not be appropriate for a given problem. In previous works we have used indirect methods [23] or supervised linear projections [22]. However, none of these methods is able to obtain a supervised projection aimed at optimizing the weighted error given by the boosting resampling scheme.
Thus, in this work we propose the use of a real-coded genetic algorithm to directly optimize the weights of the nonlinear projection. The process of obtaining the supervised projection is shown in Algorithm 1. Our solution is therefore based on a wrapper approach by means of an evolutionary computation method.
To avoid overfitting we introduce additional diversity into the ensemble using random subspaces. Before obtaining the supervised projection using the RCGA the input space is divided into subspaces of the same size and each subspace is projected separately. In our experiments a subspace size of five features is used. We evolve the whole projection with all the subspaces using the same RCGA. From the practical point of view the only difference when using subspaces is that many of the elements of matrix \(\mathbf C \) are fixed to zero during the evolution.
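One way to picture the remark about fixed zeros is as a boolean mask over the coefficient matrix. The sketch below rests on assumptions that this section does not state: \(\mathbf C \) is treated as a square matrix with one row and one column per input, features are assigned to subspaces by a random permutation, and the last subspace is allowed to be smaller when the number of inputs is not a multiple of the subspace size.

```python
import random

def subspace_mask(n_features, subspace_size=5, seed=None):
    """Partition the features into subspaces and mark which coefficients may evolve.

    Returns (groups, mask), where mask[i][j] is True only if features i and j
    belong to the same subspace; every other coefficient stays fixed at zero.
    """
    rng = random.Random(seed)
    order = list(range(n_features))
    rng.shuffle(order)  # assumed: random assignment of features to subspaces
    groups = [order[i:i + subspace_size]
              for i in range(0, n_features, subspace_size)]
    mask = [[False] * n_features for _ in range(n_features)]
    for group in groups:
        for i in group:
            for j in group:
                mask[i][j] = True
    return groups, mask

groups, mask = subspace_mask(12, subspace_size=5, seed=0)
free = sum(row.count(True) for row in mask)
print(len(groups), free, 12 * 12 - free)  # subspaces, free coefficients, fixed zeros
```

Under these assumptions the genetic algorithm would only mutate and recombine the entries where the mask is True, which also shrinks the search space considerably for datasets with many inputs.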
The main problem of the method described above is scalability [24]. When we deal with a large dataset, the cost of the RCGA is high. To partially avoid this problem we combine our method with undersampling. Thus, before the construction of each supervised projection is started, we undersample the majority class to obtain a balanced distribution of both classes. Then we construct the ensemble as described above. The final version of the method is shown in Algorithm 2.
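The undersampling step can be sketched in a few lines. This is an illustrative version that assumes sampling without replacement and keeps every minority instance; the function name and the `seed` parameter are hypothetical.

```python
import random

def undersample_majority(X, y, seed=None):
    """Randomly drop instances of the larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        if len(indices) == n_min:
            keep.extend(indices)                     # smallest class kept whole
        else:
            keep.extend(rng.sample(indices, n_min))  # sample without replacement
    keep.sort()  # preserve the original instance order
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[float(i)] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = undersample_majority(X, y, seed=1)
print(y_bal.count(1), y_bal.count(0))
```

Because a fresh sample is drawn before each projection is built, calling the function with a different seed in every round is what would provide the additional ensemble diversity mentioned in the text.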
The RCGA used is a standard generational genetic algorithm. Initially, each individual is randomly generated. Each individual represents the coefficient matrix, \(\mathbf C \), of the nonlinear projection plus the constants, \(\mathbf B \). The evolution is then carried out for a number of generations in which new individuals are obtained by non-uniform mutation [45] and standard BLX-\(\alpha \) [29] crossover with \(\alpha = 0.5\).
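The two genetic operators named above can be sketched in their usual textbook form. The gene bounds `lower` and `upper`, the shape parameter `b = 5`, and the choice of mutating a single randomly selected gene are assumptions of this sketch; the paper's own settings are not given in this section.

```python
import random

def blx_alpha(parent_a, parent_b, alpha=0.5, rng=random):
    """BLX-alpha crossover: sample each gene from the parents' interval widened by alpha."""
    child = []
    for a, b in zip(parent_a, parent_b):
        lo, hi = min(a, b), max(a, b)
        span = hi - lo
        child.append(rng.uniform(lo - alpha * span, hi + alpha * span))
    return child

def nonuniform_mutation(individual, generation, max_generations,
                        lower, upper, b=5.0, rng=random):
    """Non-uniform mutation: the step size shrinks as the generation count grows."""
    progress = generation / max_generations
    mutant = list(individual)
    k = rng.randrange(len(mutant))               # assumed: mutate one random gene
    r = rng.random()
    shrink = 1.0 - r ** ((1.0 - progress) ** b)  # tends to 0 in late generations
    if rng.random() < 0.5:
        mutant[k] += (upper - mutant[k]) * shrink  # move toward the upper bound
    else:
        mutant[k] -= (mutant[k] - lower) * shrink  # move toward the lower bound
    return mutant

rng = random.Random(0)
child = blx_alpha([0.0, 1.0], [1.0, 1.0], alpha=0.5, rng=rng)
mutant = nonuniform_mutation(child, generation=10, max_generations=100,
                             lower=-2.0, upper=2.0, rng=rng)
print(child, mutant)
```

With \(\alpha = 0.5\) the offspring can fall outside the interval spanned by its parents by up to half its width on each side, which keeps exploring early on, while the mutation step decays so that late generations only fine-tune the coefficients.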
The source code used for all methods is in C and is licensed under the GNU General Public License. The code, the partitions of the datasets and the detailed numerical results of all the experiments are available from the authors upon request and from http://cib.uco.es.
3 Experimental setup
Table 1 Summary of datasets
No.  Dataset  Cases  Attributes C  Attributes B  Attributes N  Inputs  IR
1  abalone19  4,177  7  –  1  10  1:130 
2  breastcancer  286  –  3  6  15  1:3 
3  cancer  699  9  –  –  9  1:2 
4  carG  1,728  –  –  6  16  1:25 
5  ecoliCPIM  220  7  –  –  7  1:2 
6  ecoliM  336  7  –  –  7  1:4 
7  ecoliMU  336  7  –  –  7  1:9 
8  ecoliOM  336  7  –  –  7  1:16 
9  euthyroid  3,163  7  18  –  44  1:10 
10  german  1,000  6  3  11  61  1:3 
11  glassBWFP  214  9  –  –  9  1:3 
12  glassBWNFP  214  9  –  –  9  1:3 
13  glassContainer  214  9  –  –  9  1:16 
14  glassNW  214  9  –  –  9  1:4 
16  glassTableware  214  9  –  –  9  1:23 
17  haberman  306  3  –  –  3  1:3 
18  hepatitis  155  6  13  –  19  1:4 
19  hypothyroidT  3,772  7  20  2  29  1:12 
20  ionosphere  351  33  1  –  34  1:2 
21  newthyroidT  215  5  –  –  5  1:6 
22  ozone1hr  2,536  72  –  –  72  1:34 
23  ozone8hr  2,534  72  –  –  72  1:15 
24  pima  768  8  –  –  8  1:2 
25  segmentO  2,310  19  –  –  19  1:7 
26  sick  3,772  7  20  2  33  1:16 
27  spliceEI  3,175  –  –  60  120  1:4 
28  spliceIE  3,175  –  –  60  120  1:4 
29  tictactoe  958  –  –  9  9  1:2 
30  vehicleVAN  846  18  –  –  18  1:3 
31  vowelZ  990  10  –  –  10  1:11 
32  yeastCYTPOX  483  8  –  –  8  1:24 
33  yeastEXC  1,484  8  –  –  8  1:42 
34  yeastME1  1,484  8  –  –  8  1:33 
35  yeastME2  1,484  8  –  –  8  1:29 
As base learners for the ensembles we used two classifiers: a decision tree using the C4.5 learning algorithm [47] and a support vector machine (SVM) [6] using a Gaussian kernel. The SVM learning algorithm was programmed using functions from the libsvm library [4]. We used these two methods because they are arguably the two most popular classifiers in the literature.
For the RCGA we used a population of 100 individuals evolved for 100 generations. To avoid extremely long running times, an arbitrary maximum running time of 100 s was imposed on the RCGA. Once this time limit was reached, the best individual found so far was used as the nonlinear projection.
3.1 Statistical tests
3.2 Evaluation measures
As we have stated, accuracy is not a useful measure for imbalanced data. Thus, as a general accuracy measure we will use the \(G\)-mean defined above.
However, many classifiers are subject to some kind of threshold that can be varied to achieve different values of the above measures. For such classifiers, receiver operating characteristic (ROC) curves can be constructed. A ROC curve is a graphical plot of the \(\text{ TP}_\mathrm{rate}\) (sensitivity) against the \(\text{ FP}_\mathrm{rate}\) (\(1 - \) specificity, or \(\text{ FP}_\mathrm{rate} = \frac{\text{ FP}}{\text{ TN} + \text{ FP}}\)) for a binary classifier system as its discrimination threshold is varied. The perfect model would achieve a true positive rate of 1 and a false positive rate of 0. A random guess is represented by a line connecting the points \((0, 0)\) and \((1, 1)\). ROC curves are a good measure of the performance of the classifiers. Furthermore, from this curve a new measure, the area under the curve (AUC), can be obtained, which is a very good overall measure for comparing algorithms. AUC is a useful metric for classifier performance as it is independent of the decision criterion selected and of the prior probabilities.
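For illustration, the AUC can be computed without drawing the curve by using its rank interpretation: it equals the fraction of positive/negative pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below uses that pairwise form for clarity; it is quadratic in the number of instances and is not taken from the authors' code.

```python
def auc(scores, labels, positive=1):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, label in zip(scores, labels) if label == positive]
    neg = [s for s, label in zip(scores, labels) if label != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

Since only the ordering of the scores matters, the value does not change when the decision threshold moves, which is the threshold independence noted above.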
In our experiments we report both the \(G\)-mean, which gives a snapshot measure of the performance of each method, and the AUC, which gives a more general view of its behavior.
3.3 Methods for the comparison
The first method used for comparison is undersampling the majority class until both classes have the same number of instances. We have not used oversampling methods because most previous works agree that, as a common rule, undersampling performs better than oversampling [40]. However, a few works have found the opposite [11]. Furthermore, methods that add new synthetic patterns, such as SMOTE [5], have also shown good behavior. Undersampling was used because it is very simple and offers very good performance. Thus, it is the baseline method for any comparison. Any algorithm not improving over undersampling is of very limited interest.
Random undersampling consists of randomly removing instances from the majority class until a certain criterion is reached. In most works, instances are removed until both classes have the same number of instances. Several studies comparing sophisticated undersampling methods with random undersampling [31] have failed to establish a clear advantage of the former. Thus, in this work we consider random undersampling first. However, the problem with random undersampling is that many potentially useful samples from the majority class are ignored. As a result, when the majority/minority class ratio is large, the performance of random undersampling degrades. Furthermore, when the number of minority class samples is very small, we will also have problems due to small training sets.
To avoid these problems, several ensemble methods have been proposed [18]. Rodríguez et al. [48] proposed the use of ensembles of decision trees. AdaCoost [52] has been proposed as an alternative to AdaBoost for imbalanced datasets.
As an additional method we also used standard AdaBoost in our experiments. Thus, our algorithm was compared against four state-of-the-art methods for class-imbalanced problems: undersampling, EasyEnsemble, BalanceCascade and AdaBoost.
4 Experimental results
In this section we present the comparison of GESuperPBoost with the standard methods described above. We also present a control experiment to rule out the possibility that the good performance of GESuperPBoost is due to the introduction of a nonlinear projection regardless of how the projection is obtained.
4.1 Comparison with standard methods
The aim of these experiments was to test whether our approach based on nonlinear projections was competitive when compared with the standard methods. Thus, we carried out experiments using the two base learners and the methods described in Sect. 3.3. The results, in terms of \(G\)-mean and AUC, for C4.5 are shown in Table 2. The results for SVM are shown in Table 3. The Iman–Davenport test obtained a \(p\) value of 0.0000 for the four comparisons, meaning that there were significant differences among the methods.
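The Iman–Davenport statistic mentioned above is not spelled out in this section. The sketch below uses its usual textbook form, a correction of the Friedman statistic computed from the average rank of each method over the datasets; the function names and the toy scores are illustrative and are not values from the tables.

```python
def average_ranks(results):
    """results[i][j] is the score of method j on dataset i (higher is better).

    Returns the mean rank of each method, rank 1 being the best; tied scores
    share the average of the ranks they span.
    """
    k = len(results[0])
    totals = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2.0 + 1.0
            for m in range(i, j + 1):
                totals[order[m]] += shared
            i = j + 1
    return [t / len(results) for t in totals]

def iman_davenport(results):
    """Return (F statistic, (df1, df2)) for n datasets and k methods."""
    n, k = len(results), len(results[0])
    ranks = average_ranks(results)
    chi2 = (12.0 * n / (k * (k + 1))
            * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0))
    denominator = n * (k - 1) - chi2
    f_stat = float("inf") if denominator == 0 else (n - 1) * chi2 / denominator
    return f_stat, (k - 1, (k - 1) * (n - 1))

toy_scores = [            # hypothetical scores: 4 datasets, 3 methods
    [0.9, 0.8, 0.7],
    [0.9, 0.7, 0.8],
    [0.8, 0.9, 0.7],
    [0.9, 0.8, 0.7],
]
print(average_ranks(toy_scores))
print(iman_davenport(toy_scores))
```

The statistic is compared against an F distribution with the returned degrees of freedom to obtain the \(p\) value; that last step is left out here because it needs a statistics library.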
Table 2 Accuracy results, measured using the \(G\)-mean and AUC, for all the methods and C4.5 as base learner
Dataset  Undersampling AUC  Undersampling \(G\)-mean  AdaBoost AUC  AdaBoost \(G\)-mean  EasyEnsemble AUC  EasyEnsemble \(G\)-mean  BalanceCascade AUC  BalanceCascade \(G\)-mean  GESuperPBoost AUC  GESuperPBoost \(G\)-mean
abalone9–18  0.7139  0.6963  0.8234  0.2391  0.7845  0.7111  0.7640  0.6743  0.8477  0.7788 
breastcancer  0.6052  0.5729  0.6222  0.4543  0.6149  0.5824  0.6362  0.5778  0.6362  0.6047 
cancer  0.9598  0.9496  0.9860  0.9686  0.9856  0.9580  0.9816  0.9709  0.9876  0.9750 
carG  0.9397  0.9447  0.9965  0.9006  0.9768  0.9596  0.9959  0.9705  0.9962  0.9833 
ecoliCPIM  0.9743  0.9738  0.9940  0.9809  0.9816  0.9571  0.9816  0.9704  0.9964  0.9741 
ecoliM  0.8955  0.8692  0.9457  0.8266  0.9313  0.9088  0.9524  0.8866  0.9455  0.8737 
ecoliMU  0.8673  0.8452  0.9052  0.6884  0.9026  0.8618  0.9262  0.8942  0.9107  0.8440 
ecoliOM  0.8772  0.8660  0.9412  0.8074  0.9403  0.8813  0.9611  0.9084  0.9873  0.8946 
euthyroid  0.9534  0.9445  0.9812  0.9294  0.9751  0.9479  0.9821  0.9443  0.9847  0.9444 
german  0.6371  0.6443  0.7589  0.5941  0.7340  0.6725  0.7462  0.6741  0.7702  0.6875 
glassBWFP  0.7741  0.7820  0.8959  0.8248  0.8791  0.8191  0.8990  0.8054  0.9182  0.8525 
glassBWNFP  0.7905  0.7468  0.9090  0.8070  0.8197  0.7316  0.8630  0.7665  0.8874  0.7783 
glassContainers  0.8501  0.7531  0.9775  0.6365  0.9508  0.8428  0.9214  0.8205  0.9551  0.8612 
glassNW  0.9227  0.8816  0.9734  0.9041  0.9413  0.8846  0.9688  0.9272  0.9753  0.8990 
glassTableware  0.9583  0.9565  0.9950  0.7000  0.9777  0.9625  0.9850  0.9562  0.9902  0.8618 
glassVWFP  0.5477  0.4485  0.6516  0.1422  0.6075  0.5742  0.6915  0.5013  0.7434  0.6878 
haberman  0.6412  0.5879  0.6931  0.3922  0.6496  0.5754  0.6511  0.6198  0.7007  0.5284 
hepatitis  0.7197  0.6697  0.8664  0.5605  0.8110  0.6245  0.7972  0.6814  0.8505  0.7726 
hypothyroidT  0.9893  0.9879  0.9942  0.9919  0.9930  0.9872  0.9938  0.9902  0.9982  0.9905 
ionosphere  0.8844  0.8812  0.9723  0.9076  0.9649  0.9221  0.9720  0.9050  0.9850  0.9394 
newthyroidT  0.9338  0.9302  0.9985  0.9124  0.9812  0.9306  0.9751  0.9617  1.0000  0.9729 
ozone1hr  0.7709  0.7384  0.8644  0.0000  0.8520  0.7532  0.8745  0.7841  0.8972  0.8272 
ozone8hr  0.7564  0.7406  0.8839  0.2997  0.8772  0.7986  0.8704  0.7743  0.9175  0.8465 
pima  0.7242  0.7008  0.8103  0.7060  0.7869  0.7296  0.7982  0.7205  0.8101  0.7322 
segmentO  0.9906  0.9903  0.9994  0.9929  0.9978  0.9938  0.9977  0.9889  0.9999  0.9919 
sick  0.9711  0.9680  0.9928  0.9157  0.9864  0.9658  0.9945  0.9757  0.9949  0.9710 
spliceEI  0.9641  0.9626  0.9914  0.9528  0.9817  0.9616  0.9866  0.9616  0.9899  0.9646 
spliceIE  0.9244  0.9242  0.9776  0.9051  0.9721  0.9346  0.9767  0.9385  0.9824  0.9404 
tictactoe  0.9490  0.9128  0.9997  0.9873  0.9937  0.9735  0.9964  0.9723  0.9997  0.9906 
vehicleVAN  0.9404  0.9419  0.9955  0.9554  0.9834  0.9531  0.9881  0.9381  0.9933  0.9651 
vowelZ  0.9113  0.9102  0.9901  0.8269  0.9580  0.9153  0.9802  0.9349  0.9978  0.9678 
yeastCYTPOX  0.7220  0.6029  0.8378  0.5818  0.8089  0.5744  0.8005  0.6290  0.8502  0.6453 
yeastEXC  0.8213  0.7999  0.8893  0.7131  0.8812  0.8309  0.9121  0.8575  0.9210  0.8238 
yeastME1  0.9371  0.9365  0.9808  0.7924  0.9657  0.9222  0.9699  0.9455  0.9897  0.9423 
yeastME2  0.8592  0.8229  0.8535  0.2085  0.9143  0.8389  0.9345  0.8114  0.9327  0.8641 
Average  0.8479  0.8253  0.9137  0.7145  0.8961  0.8412  0.9064  0.8468  0.9241  0.8622 
Table 3 Accuracy results, measured using the \(G\)-mean and AUC, for all the methods and a SVM as base learner
Dataset  Undersampling AUC  Undersampling \(G\)-mean  AdaBoost AUC  AdaBoost \(G\)-mean  EasyEnsemble AUC  EasyEnsemble \(G\)-mean  BalanceCascade AUC  BalanceCascade \(G\)-mean  GESuperPBoost AUC  GESuperPBoost \(G\)-mean
abalone9–18  0.7297  0.6751  0.8128  0.4443  0.8161  0.7677  0.8708  0.7384  0.8177  0.7502 
breastcancer  0.5558  0.5500  0.5351  0.3536  0.6469  0.6022  0.6626  0.6343  0.6531  0.5733 
cancer  0.9868  0.9672  0.9841  0.9551  0.9750  0.9676  0.8097  0.9683  0.9909  0.9736 
carG  0.9968  0.9789  0.9675  0.8775  0.9713  0.9523  0.9419  0.9495  0.9941  0.9866 
ecoliCPIM  0.9901  0.9809  0.9864  0.9608  0.9832  0.9826  0.5000  0.9730  0.9985  0.9756 
ecoliM  0.9335  0.8855  0.9245  0.7723  0.9089  0.8927  0.8985  0.8564  0.9564  0.8561 
ecoliMU  0.9326  0.8861  0.8963  0.5473  0.8740  0.8630  0.9099  0.8733  0.9148  0.8738 
ecoliOM  0.9811  0.9127  0.9176  0.8746  0.9819  0.9029  0.8500  0.9015  0.9843  0.9591 
euthyroid  0.9175  0.8343  0.8976  0.7517  0.9016  0.8845  0.9463  0.8925  0.8976  0.8292 
german  0.5235  0.1470  0.5230  0.0554  0.7440  0.7094  0.7726  0.7102  0.7634  0.3930 
glassBWFP  0.8781  0.8002  0.8400  0.2904  0.7599  0.6803  0.8395  0.7238  0.8624  0.7993 
glassBWNFP  0.8147  0.6925  0.8160  0.6726  0.6411  0.5197  0.6825  0.5717  0.8299  0.7628 
glassContainers  0.9700  0.9094  0.9112  0.6605  0.8905  0.8352  0.8890  0.8309  0.9390  0.7421 
glassNW  0.9798  0.9430  0.9706  0.9025  0.9406  0.9175  0.9190  0.9267  0.9675  0.9285 
glassTableware  0.9667  0.8108  0.7832  0.5926  0.9121  0.8881  0.8675  0.7955  0.9783  0.9377 
glassVWFP  0.6387  0.5590  0.7970  0.3406  0.5388  0.5111  0.6003  0.4353  0.6561  0.6007 
haberman  0.7090  0.6003  0.6761  0.4330  0.6480  0.5463  0.7096  0.6395  0.6703  0.5292 
hepatitis  0.6172  0.4788  0.4958  0.0000  0.7870  0.7324  0.8061  0.7266  0.8431  0.5829 
hypothyroidT  0.7728  0.6985  0.7858  0.6894  0.8154  0.7883  0.8909  0.7863  0.8329  0.7454 
ionosphere  0.9789  0.9078  0.9764  0.9382  0.8854  0.8643  0.8716  0.8600  0.9795  0.9450 
newthyroidT  1.0000  0.9858  0.9969  0.9499  0.9916  0.9752  0.7700  0.9752  1.0000  0.9754 
ozone1hr  0.7802  0.6929  0.6397  0.3537  0.8450  0.8237  0.8949  0.8448  0.8505  0.7265 
ozone8hr  0.8322  0.7280  0.7798  0.4646  0.8444  0.8060  0.8954  0.8096  0.8634  0.7664 
pima  0.8191  0.7283  0.7379  0.5857  0.7541  0.7380  0.8320  0.7411  0.8253  0.7325 
segmentO  0.9997  0.9902  0.9981  0.9936  0.9917  0.9909  0.5000  0.9912  0.9999  0.9968 
sick  0.9125  0.8165  0.9012  0.7248  0.8878  0.8816  0.9416  0.8828  0.9274  0.8626 
spliceEI  0.5997  0.4194  0.5891  0.4153  0.9669  0.9488  0.5000  0.9524  0.9646  0.4891 
spliceIE  0.6082  0.4404  0.6061  0.4404  0.9401  0.9121  0.5000  0.9199  0.9287  0.5459 
tictactoe  0.9999  0.9749  0.9754  0.9749  0.9754  0.9749  0.9678  0.9749  0.9986  0.9797 
vehicleVAN  0.9889  0.9571  0.9962  0.9672  0.9705  0.9649  0.5906  0.9606  0.9896  0.9650 
vowelZ  0.9952  0.9685  0.9172  0.8832  0.8949  0.8772  0.9072  0.8715  0.9881  0.9656 
yeastCYTPOX  0.7610  0.6204  0.7847  0.5062  0.7675  0.6073  0.7822  0.6070  0.8027  0.6483 
yeastEXC  0.9284  0.8888  0.9109  0.6013  0.8798  0.8701  0.9221  0.8913  0.9413  0.8643 
yeastME1  0.9843  0.9637  0.9842  0.7957  0.9634  0.9560  0.9866  0.9576  0.9842  0.9516 
yeastME2  0.8917  0.8163  0.8596  0.3465  0.8563  0.8415  0.8836  0.8174  0.9032  0.8134 
Average  0.8564  0.7774  0.8335  0.6319  0.8615  0.8279  0.8032  0.8283  0.8999  0.8008 
The results using an SVM as base learner differed noticeably. In terms of AUC, GESuperPBoost was better than the remaining methods. However, in terms of \(G\)-mean, all methods, with the exception of AdaBoost, showed similar behavior. AdaBoost had the worst combined behavior of the five methods. This is not an unexpected result, as AdaBoost is a stable method with respect to sampling. This feature makes AdaBoost less efficient when using an SVM as base learner.
If we inspect the results, we see that most of the arrows point up and to the right. This behavior is especially common when using C4.5 as base learner. For SVM, there is a clear advantage in terms of AUC, as most arrows point right, but the differences in terms of \(G\)-mean are more homogeneously distributed.
For C4.5, the test shows that GESuperPBoost was better than all the other methods for both AUC and \(G\)-mean. The differences are significant at a confidence level of 95 %. These results corroborate the differences shown by the ranks in Fig. 1. The case for an SVM as base learner is somewhat different. With AUC as the accuracy measure, the differences are significant in favor of our method against all the standard methods. However, the Holm test fails to find significant differences between our method and undersampling, EasyEnsemble and BalanceCascade for the \(G\)-mean measure. As explained above, though, AUC is a more reliable measure than \(G\)-mean.
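The step-down logic of the Holm procedure mentioned here can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the function name `holm` and the p-values in the usage example are hypothetical and are not taken from the paper.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Holm's step-down procedure: sort the m p-values in ascending order and
    compare the i-th smallest (i = 0, 1, ...) with alpha / (m - i). Stop at
    the first p-value that fails; it and all larger ones are retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    rejected = [False] * m
    for i, k in enumerate(order):
        if p_values[k] <= alpha / (m - i):
            rejected[k] = True
        else:
            break  # every remaining (larger) p-value is retained as well
    return rejected


# Hypothetical p-values for one control method compared against four others.
example_p = [0.001, 0.030, 0.012, 0.200]
decisions = holm(example_p, alpha=0.05)
print(decisions)
```

With these made-up values, the second comparison (p = 0.030) is not rejected even though it is below 0.05, because by the third step the threshold has only relaxed to 0.05/2 = 0.025.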
It is noticeable that AdaBoost obtained very poor results for \(G\)-mean with both base learners. There is an explanation for these results. As a rule, AdaBoost improved specificity while worsening sensitivity. Because AdaBoost focuses on overall error, it invested its effort in the negative class, which is more numerous and thus has a greater impact on the overall error. This feature, useful for balanced datasets, is a serious drawback for class-imbalanced datasets.
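The effect described above can be illustrated with a small sketch that computes the \(G\)-mean as the geometric mean of sensitivity and specificity. The confusion-matrix counts below are hypothetical and chosen only for illustration. They show how a classifier that favors the majority (negative) class can reach a higher overall accuracy and still obtain a much lower \(G\)-mean.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn)  # true positive rate (minority class)
    specificity = tn / (tn + fp)  # true negative rate (majority class)
    return math.sqrt(sensitivity * specificity)


def accuracy(tp, fn, tn, fp):
    """Fraction of all instances that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical test set: 10 positive (minority) and 90 negative instances.
# Classifier A favors the majority class; classifier B is more balanced.
a = dict(tp=1, fn=9, tn=90, fp=0)
b = dict(tp=8, fn=2, tn=72, fp=18)

print(f"A: accuracy={accuracy(**a):.2f}  G-mean={g_mean(**a):.4f}")
print(f"B: accuracy={accuracy(**b):.2f}  G-mean={g_mean(**b):.4f}")
```

In this made-up example, classifier A has the higher accuracy (0.91 against 0.80), but its \(G\)-mean is far lower because its sensitivity is only 0.1.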
4.2 Control experiment
Pairwise comparison of GESuperPBoost accuracy measured using \(G\)-mean and AUC and ensembles constructed using a random projection
GESuperPBoost vs. random projection  C4.5 AUC  C4.5 \(G\)-mean  SVM AUC  SVM \(G\)-mean
w/d/l  29/1/5  24/0/11  29/0/6  28/1/6
\(p\) value  0.0001  0.0084  0.0001  0.0156
The table shows that the good performance of GESuperPBoost was not due to the projection of the inputs alone, as the random projection method was worse than GESuperPBoost for both base learners and according to both measures.
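As a sketch of how a win/draw/loss (w/d/l) record like the one above could be tallied from per-dataset scores, consider the following. The per-dataset results of the random projection ensembles are not listed here, so purely to have concrete numbers the example reuses six AUC values from the SVM results table (GESuperPBoost against undersampling). The function name `win_draw_loss` and the tolerance `tol` are assumptions of this sketch.

```python
def win_draw_loss(ours, theirs, tol=1e-9):
    """Count datasets where `ours` is better, equal, or worse than `theirs`."""
    wins = draws = losses = 0
    for a, b in zip(ours, theirs):
        if abs(a - b) <= tol:
            draws += 1
        elif a > b:
            wins += 1
        else:
            losses += 1
    return wins, draws, losses


# AUC with an SVM base learner for six datasets, copied from the table above:
# abalone9-18, breastcancer, cancer, carG, ecoliCPIM, newthyroidT.
gesuperpboost = [0.8177, 0.6531, 0.9909, 0.9941, 0.9985, 1.0000]
undersampling = [0.7297, 0.5558, 0.9868, 0.9968, 0.9901, 1.0000]

wins, draws, losses = win_draw_loss(gesuperpboost, undersampling)
print(f"{wins}/{draws}/{losses}")
```

The three counts always sum to the number of datasets compared, which is why each w/d/l entry in the table adds up to 35.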
5 Conclusions and future work
In this paper we have presented a new method for constructing ensembles based on combining the principles of boosting and the construction of supervised projections by means of a real-coded genetic algorithm. The idea of using a supervised projection, instead of the standard resampling or reweighting of boosting, seems appropriate for class-imbalanced datasets. We combine this method with undersampling to make it more scalable and to obtain better results.
Our experiments have shown that the proposed method achieved better results than undersampling and three different boosting methods. Two of these methods are specifically designed for class-imbalanced datasets and have shown good performance in previous papers [26].
The main drawback of our method is the scalability of the approach. Although this problem is ameliorated by introducing undersampling in a previous step, it may still be a serious handicap if we deal with large datasets. Accordingly, our current research is focused on improving the scalability of the method by means of the paradigm of the democratization [19] of learning algorithms.
Acknowledgments
This work was supported in part by the Grant TIN200803151 of the Spanish “Comisión Interministerial de Ciencia y Tecnología” and the Grant P09TIC4623 of the Regional Government of Andalucía.
References
1. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach. Learn. 36(1/2), 105–142 (1999)
2. Breiman, L.: Bias, variance, and arcing classifiers. Tech. Rep. 460, Department of Statistics, University of California, Berkeley (1996)
3. Breiman, L.: Stacked regressions. Mach. Learn. 24(1), 49–64 (1996)
4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
5. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
6. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)
7. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
8. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) Proceedings of the First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 1857, pp. 1–15. Springer (2000)
9. Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn. 40, 139–157 (2000)
10. Dzeroski, S., Zenko, B.: Is combining classifiers with stacking better than selecting the best one? Mach. Learn. 54, 255–273 (2004)
11. Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
12. Fern, A., Givan, R.: Online ensemble learning: an empirical study. Mach. Learn. 53, 71–109 (2003)
13. Fernández, A., Jesús, M.J.D., Herrera, F.: Multiclass imbalanced datasets with linguistic fuzzy rule based classification systems based on pairwise learning. In: Proceedings of the Computational intelligence for knowledge-based systems design, and 13th international conference on Information processing and management of uncertainty, IPMU’10, pp. 89–98. Springer, Berlin (2010)
14. Frank, A., Asuncion, A.: UCI machine learning repository (2010). http://archive.ics.uci.edu/ml
15. Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148–156. Bari (1996)
16. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139 (1997)
17. Friedman, J.H., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. Stat. 28(2), 337–407 (2000)
18. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging, boosting, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 42, 463–484 (2012)
19. García-Osorio, C., de Haro-García, A., García-Pedrajas, N.: Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts. Artif. Intell. 174, 410–441 (2010)
20. García-Pedrajas, N.: Constructing ensembles of classifiers by means of weighted instance selection. IEEE Trans. Neural Netw. 20(2), 258–277 (2008)
21. García-Pedrajas, N.: Supervised projection approach for boosting classifiers. Pattern Recognit. 42, 1741–1760 (2009)
22. García-Pedrajas, N., García-Osorio, C.: Constructing ensembles of classifiers using supervised projection methods based on misclassified instances. Expert Syst. Appl. 38(1), 343–359 (2010)
23. García-Pedrajas, N., García-Osorio, C., Fyfe, C.: Nonlinear boosting projections for ensemble construction. J. Mach. Learn. Res. 8, 1–33 (2007)
24. García-Pedrajas, N., de Haro-García, A.: Scaling up data mining algorithms: review and taxonomy. Progr. Artif. Intell. 1, 71–87 (2012)
25. García-Pedrajas, N., Hervás-Martínez, C., Ortiz-Boyer, D.: Cooperative coevolution of artificial neural network ensembles for pattern classification. IEEE Trans. Evol. Comput. 9(3), 271–302 (2005)
26. García-Pedrajas, N., Pérez-Rodríguez, J., García-Pedrajas, M.D., Ortiz-Boyer, D., Fyfe, C.: Class imbalance methods for translation initiation site recognition in DNA sequences. Knowl. Based Syst. 25, 22–34 (2012)
27. Geng, X., Zhan, D.C., Zhou, Z.H.: Supervised nonlinear dimensionality reduction for visualization and classification. IEEE Trans. Syst. Man Cybern. B Cybern. 35(6), 1098–1107 (2005)
28. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284 (2009)
29. Herrera, F., Lozano, M., Verdegay, J.L.: Tackling real-coded genetic algorithms: operators and tools for behavioural analysis. Artif. Intell. Rev. 12, 265–319 (1998)
30. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)
31. Japkowicz, N.: The class imbalance problem: significance and strategies. In: Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI’2000): Special Track on Inductive Learning, vol. 1, pp. 111–117. Las Vegas (2000)
32. Kohavi, R., Kunz, C.: Option decision trees with majority voting. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 161–169. Morgan Kaufmann, San Francisco (1997)
33. Kruskal, J.B.: Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new ‘index of condensation’. In: Milton, R.C., Nelder, J.A. (eds.) Statistical Computing, pp. 427–440. Academic Press, London (1969)
34. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann (1997)
35. Kuncheva, L., Whitaker, C.J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach. Learn. 51(2), 181–207 (2003)
36. Lee, E., Cook, D., Klinke, S., Lumley, T.: Projection pursuit for exploratory supervised classification. J. Comput. Graph. Stat. 14(4), 831–846 (2005)
37. Lee, Y., Ahn, D., Moon, K.: Margin preserving projections. Electron. Lett. 42(21), 1249–1250 (2006)
38. Lerner, B., Guterman, H., Aladjem, M., Dinstein, I., Romem, Y.: On pattern classification with Sammon’s nonlinear mapping—an experimental study. Pattern Recognit. 31(4), 371–381 (1998)
39. Li, C.J., Jansuwan, C.: Dynamic projection network for supervised pattern classification. Int. J. Approx. Reason. 40, 243–261 (2005)
40. Li, X., Yan, Y., Peng, Y.: The method of text categorization on imbalanced datasets. In: Proceedings of the 2009 International Conference on Communication Software and Networks, pp. 650–653 (2009)
41. Ling, C., Li, G.: Data mining for direct marketing problems and solutions. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD98), pp. 73–79. AAAI Press, New York (1998)
42. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. B Cybern. 39(2), 539–550 (2009)
43. Maudes-Raedo, J., Rodríguez-Díez, J.J., García-Osorio, C.: Disturbing neighbors diversity for decision forest. In: Valentini, G., Okun, O. (eds.) Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (SUEMA 2008), pp. 67–71. Patras, Greece (2008)
44. Merz, C.J.: Using correspondence analysis to combine classifiers. Mach. Learn. 36(1), 33–58 (1999)
45. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer, New York (1994)
46. Polzehl, J.: Projection pursuit discriminant analysis. Comput. Stat. Data Anal. 20(2), 141–157 (1995)
47. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
48. Rodríguez, J.J., Díez-Pastor, J.F., García-Osorio, C.: Ensembles of decision trees for imbalanced data. Lect. Notes Comput. Sci. 6713, 76–85 (2011)
49. Rodríguez, J.J., Kuncheva, L.I., Alonso, C.J.: Rotation forest: a new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1619–1630 (2006)
50. Schapire, R.E., Freund, Y., Bartlett, P.L., Lee, W.S.: Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26(5), 1651–1686 (1998)
51. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37, 297–336 (1999)
52. Sun, Y., Kamel, M.S., Wong, A.K., Wang, Y.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 40, 3358–3378 (2007)
53. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
54. Ensemble diversity for class imbalance learning. Ph.D. thesis, University of Birmingham (2011)
55. Webb, G.I.: Multiboosting: a technique for combining boosting and wagging. Mach. Learn. 40(2), 159–196 (2000)
56. Weiss, G.M., Provost, F.: The effect of class distribution on classifier learning: an empirical study. Rutgers University, Tech. Rep. TR43, Department of Computer Science (2001)
57. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics 1, 80–83 (1945)
58. Yu, S., Yu, K., Tresp, V., Kriegel, H.P.: Multi-output regularized feature projection. IEEE Trans. Knowl. Data Eng. 18(12), 1600–1613 (2006)
59. Zhao, H., Sun, S., Jing, Z., Yang, J.: Local structure based supervised feature extraction. Pattern Recognit. 39, 1546–1550 (2006)