1 Introduction

An ensemble of classifiers [16] combines the decisions of individual classifiers in some way in order to classify new instances. Various methods have been proposed for creating such a set of classifiers [4]. The three most common and widely used ways to build such a combination of classifiers are the following:

  (a) Using a single learning method and different subsets of the training data.

  (b) Using a single training algorithm and various training parameters.

  (c) Using different learning methods.
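The three strategies above can be sketched in Python with scikit-learn (an illustrative sketch only; the toy dataset and the specific estimators are arbitrary choices, not those used in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=200, random_state=0)

# (a) one learning method, different subsets of the training data
subset_models = []
for seed in range(3):
    Xs, ys = resample(X, y, random_state=seed)  # bootstrap subset
    subset_models.append(DecisionTreeClassifier(random_state=0).fit(Xs, ys))

# (b) one training algorithm, various training parameters
param_models = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X, y)
                for d in (1, 3, None)]

# (c) different learning methods
mixed_models = [DecisionTreeClassifier(random_state=0).fit(X, y),
                LogisticRegression(max_iter=1000).fit(X, y),
                GaussianNB().fit(X, y)]

print(len(subset_models), len(param_models), len(mixed_models))
```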

One may reasonably ask why an ensemble of classifiers can provide better results than a single classifier. The reasons are representational, statistical, and computational, and are presented in [4]. For example, in some cases the training data are too small relative to the size of the hypothesis space to provide enough information for selecting a single best classifier.

Although ensemble methods have attracted considerable interest in recent years, further study is required, since it is not clear which method is the best. In the present work, we propose a hybrid scheme that combines the Random Forest [2], ExtraTrees [10], and Gradient Boosting [9] algorithms using a stacking-variant methodology. The proposed hybrid decision support system is compared with several ensemble methods on a variety of benchmark datasets. The results obtained show that the proposed method achieves better accuracy on most of the studied problems.

The rest of the paper is organized as follows. In the next section, we present well-known methods for creating ensembles of classifiers, while in Sect. 3 we consider the proposed ensemble method. The conducted experiments and the comparisons with other ensemble methods are presented in Sect. 4. Finally, we conclude in Sect. 5 with a summary and further research remarks.

2 Well-Known Methods for Creating Ensembles

A plethora of methods use a single classifier to solve a prediction problem. However, the need for models that make more accurate predictions is intense. As mentioned in the introduction, this need is served by a combination of classifiers. Different methodologies exist for creating a good ensemble of classifiers, each of which can yield a different rate of correctly classified instances. Thus, the overall goal of methods belonging to decision support systems is to achieve reliable and highly accurate predictions. This section presents widely used methods for building ensembles.

The Bagging technique, or Bootstrap aggregating [1], is one of the best-known and most widely used techniques for creating ensembles of classifiers. A key feature of this method is that each learner is trained on a different training set. Specifically, let N be a set of training data of size n. The bagging method selects \(n'\) random examples from this set, creating a collection of training subsets \(N_i\). The selection of instances follows a distribution, for example the uniform distribution, and the examples are drawn with replacement. Because the subsets are built in this way, some examples may be repeated in \(N_i\), and \(N_i\) may contain duplicates and omissions with respect to the original training set N. This procedure produces a set of classifiers. A voting process then takes place over the predictions of the classifiers, and the result of the vote gives the final decision.
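A minimal hand-rolled sketch of this procedure (the dataset, the number of subsets, and the base learner are illustrative assumptions): each subset is drawn uniformly with replacement, and the final decision is a majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)

# Draw n examples uniformly with replacement for each training subset N_i;
# duplicates and omissions relative to the original set N are expected.
models = []
for _ in range(10):
    idx = rng.integers(0, n, size=n)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# The final decision is a majority vote over the individual predictions.
votes = np.array([m.predict(X[:5]) for m in models])
final = np.array([np.bincount(col).argmax() for col in votes.T])
print(final)
```

In practice the same procedure is available off the shelf as `sklearn.ensemble.BaggingClassifier`.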

The Boosting approach [8] is another well-known meta-algorithm. One could say that this method resembles the bagging approach, since several sub-samples are used to train a single learner. The algorithm was motivated by the question “Can a set of weak learners create a single strong learner?” [14]. Boosting is directly related to this question, as it focuses on the performance of a weak classifier in order to enhance it, especially on instances that have not been learned effectively. Thus, rather than selecting instances randomly according to a distribution, the algorithm concentrates on the training instances that have not been classified correctly. After several steps a strong classifier is created, and the prediction is made through a weighted voting procedure: each classifier's prediction receives a weight according to the classification accuracy it achieved.
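A short sketch of the boosting idea using AdaBoost from scikit-learn (the dataset and hyperparameters are illustrative assumptions, not the paper's setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost re-weights the training instances after each round so that
# misclassified examples receive more attention, and combines the weak
# learners (shallow decision trees by default) by weighted voting.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(round(boost.score(X, y), 2))
```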

Besides the aforementioned techniques, which mostly train classifiers on subsets of the training data, there are methods that build a model by applying a set of learning algorithms to the entire training dataset. Then, by choosing an appropriate voting pattern, they combine the responses to obtain the final prediction. Through this process, different learning algorithms are used to increase the diversity of the models' prediction errors, because the algorithms differ in both the search process and the representation. Consequently, the resulting models are more likely to make mistakes at different points, as different learning biases are used. Finally, by taking the majority of the forecasts we easily arrive at the final decision, as no further training is required.

A well-known method that uses the above procedure is the stacked generalization approach, or Stacking [26]. The main goal of this method is the production of a strong, high-level learner with high generalization performance. To obtain this, a set of different classifiers is combined, and a learning algorithm is used to find an appropriate combination of the base predictions. To perform the final forecast, the model follows these steps: (a) Level-zero data: all the base learners run on the original dataset; (b) Level-one data: the predictions made by the classifiers are treated as new data; and (c) Final prediction: another learning process takes place using the level-one data as new inputs, and its output is the final prediction.
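The three steps can be sketched by hand (a minimal illustration of the idea, not the exact procedure of [26]; the base learners and meta-learner are arbitrary choices). Out-of-fold predictions are used so that the level-one data do not leak the training labels:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base = [DecisionTreeClassifier(random_state=0), GaussianNB()]

# (a) Level zero: each base learner runs on the original dataset.
level_one = np.column_stack(
    [cross_val_predict(clf, X, y, cv=5) for clf in base])

# (b) Level one: the base predictions become the new input data.
meta = LogisticRegression(max_iter=1000).fit(level_one, y)

# (c) Final prediction: fresh base predictions pass through the meta-learner.
fitted = [clf.fit(X, y) for clf in base]
new_features = np.column_stack([clf.predict(X) for clf in fitted])
print(round(meta.score(new_features, y), 3))
```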

An alternative idea was used in [25]. The proposed method is an extension of the well-known stacking approach [26] and is based on class probability distributions rather than on the predictions themselves. Thus, each learner does not output a final prediction for the class to which an example belongs; instead, it gives an estimated probability for every class. To achieve meta-level learning, the authors used Multi-response Linear Regression (MLR).

Several approaches have been proposed to improve the performance of [25]. In particular, in [7] the authors adopted the same strategy except for the meta-learning procedure: they used model tree induction instead of MLR, and the results obtained showed that better performance was achieved. Furthermore, another modification is proposed in [21]. In this method, instead of the overall class probability distributions, only the class probabilities concerning the true class are taken into account. The experimental results showed that the accuracy of this variant is improved.

In [15] another stacking algorithm, called Troika, is presented to tackle multi-class classification tasks. Specifically, instead of one meta-learner the proposed method uses three layers of combining classifiers: (a) the zero layer consists of base learners; (b) the first layer contains specialist classifiers; (c) the second layer is composed of meta-learners; and (d) the last layer holds the super-classifier, which takes the predictions of the previous layer to make the final prediction of the model. The main comparison included the Troika method, the basic stacking scheme, and an improvement of standard Stacking, the StackingC method. The experiments conducted showed that the presented method outperforms the other two in terms of accuracy, irrespective of the base-classifier binarization method that was selected.

The stacking ensemble methodology can be seen as a combinatorial optimization problem, because both the base classifiers and the meta-classifier have to be determined in an optimal way. In [3] an adaptive metaheuristic search method was adopted to create a new stacking approach, called the Ant Colony Optimization (ACO) Stacking algorithm. Given a set B of n base learners, the ACO algorithm [5] determines a subset of B consisting of fewer learners, which is expected to have better classification accuracy than the original set. The authors compared the ACO-Stacking approach with well-known ensembles such as Bagging, AdaBoost, StackingC, and Random Forest. The experimental results showed that the performance of the ACO-Stacking algorithm is promising.

A similar approach, addressing the problem of selecting the appropriate configuration of base learners and meta-classifier, is proposed in [23]. The main difference with respect to [3] concerns the swarm intelligence search algorithm used for the optimization task. In particular, the authors used the well-known Artificial Bee Colony (ABC) algorithm [13] to build two different types of stacking. First, the optimal configuration of base learners is determined while the meta-learner is kept fixed to a single learning algorithm. Then, during the meta-phase, the ABC algorithm optimizes the base learners and the meta-learner simultaneously. The ABC-Stacking approach was tested on several well-known benchmark datasets and compared with different predictive learners and basic ensemble techniques, such as GA-Stacking [17] and ACO-Stacking. The experiments conducted by the authors showed that the proposed approach effectively addresses both the configuration of base classifiers and the selection of the meta-learner. Moreover, the performance of ABC-Stacking is comparable to the best results reported so far.

In [19] a strategy similar to the Stacking methodology is adopted in order to apply a sequential learning algorithm to multi-class problems. For a more extensive study of the Stacking methodology, Stacking methods, and Stacking-related approaches developed over the last 20 years, the reader is referred to [22].

3 Proposed Hybrid Scheme

In general, for any classifier the addition of new input features can improve classification accuracy when the new features contain new information about the class. This improvement is not guaranteed, because classifiers are imperfect and may not be able to exploit the information. If the new features share information with existing features, they may or may not assist the process. In all cases, the curse of dimensionality must be considered: more features can hurt many classifiers and increase the opportunities for overfitting. The situation is similar when adding new base classifiers to a stacking setup, because the base classifiers' outputs are features for the final classifier, so all of the above arguments hold here as well. In this case, these 'second-level' features are likely correlated, because all base classifiers are trying to predict the same case, albeit suboptimally. The hope is that they behave in different ways, so that the final classifier can combine the noisy predictions into a better final prediction. Loosely, then, adding new base classifiers has the best chance of helping when they perform well and behave differently from the existing base classifiers, but this is not guaranteed. In all cases, the training time must also be considered. In our case, we chose to use three strong ensemble methods as base learners: Random Forest, ExtraTrees and Gradient Boosting.

According to Breiman's definition in [2], a Random Forest is a classifier that consists of a set of tree-structured classifiers. A noteworthy point is that in each tree the decisions when splitting a node are made in a random way, without taking all the features into account. In this process, each tree is trained on a sub-sample, in a similar way to bagging, and votes for the most popular class for a given input. Averaging over such decision trees can improve the prediction accuracy and helps to control overfitting. As in Bootstrap aggregating, the sub-samples are drawn with replacement, and the size of each sub-sample is equal to that of the original sample.

A method similar to Random Forests is Extremely randomized Trees, or ExtraTrees [10]. This method has been applied successfully and produces good results in terms of accuracy while effectively controlling overfitting. It differs from Random Forests in two points: (a) each tree is trained on the whole training set rather than on a bootstrap sample, and (b) the cut-point for splitting a node is selected in a fully random way.

Gradient Boosting [9] is a widely used approach for handling regression and classification tasks, mainly based on decision tree classifiers. Its main strategy is similar to other boosting methods (such as AdaBoost or LogitBoost). In a first phase it builds a set of weak learners to make an initial prediction; in a second phase it optimizes the prediction accuracy of the model by minimizing a cost function. Specifically, gradient boosting iteratively determines a function that points in the negative gradient direction.
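The three base ensembles are all available in scikit-learn and can be compared side by side (the dataset and hyperparameters below are illustrative assumptions, not the paper's experimental settings):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Random Forest: bootstrap samples, random feature subsets at each split.
# ExtraTrees: whole training set, fully random cut-points.
# Gradient Boosting: trees added stage-wise along the negative gradient.
scores = {}
for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
            ExtraTreesClassifier(n_estimators=100, random_state=0),
            GradientBoostingClassifier(random_state=0)):
    scores[type(clf).__name__] = cross_val_score(clf, X, y, cv=5).mean()
print({name: round(s, 3) for name, s in scores.items()})
```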

In our stacking algorithm, the set of meta-data consists of the predictions given by the base ensembles and the true class for each training case; the classifier trained on this dataset is the so-called meta-learner. Our approach uses the class probabilities associated with the true class, rather than all class probabilities as in standard Stacking.
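The construction of such meta-data can be sketched as follows (a simplified illustration of the idea, assuming out-of-fold probability estimates; the paper's exact procedure may differ in its details):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
bases = [RandomForestClassifier(n_estimators=50, random_state=0),
         ExtraTreesClassifier(n_estimators=50, random_state=0),
         GradientBoostingClassifier(random_state=0)]

# For each base ensemble, keep only the out-of-fold probability assigned
# to the true class of each training case: one meta-feature per learner,
# instead of one per (learner, class) pair as in standard Stacking.
cols = []
for clf in bases:
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    cols.append(proba[np.arange(len(y)), y])
meta_data = np.column_stack(cols)
print(meta_data.shape)  # (150, 3)
```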

Another point worth mentioning is the learning process. The goal is to build a fast learning procedure, which is achieved by reducing the dimensionality of the meta-data set by a factor associated with the number of classes. Despite this reduction, the accuracy for two-class problems remains unaffected. Moreover, the learning algorithm that we use at the meta-level is the well-known Logistic Model Tree classifier [24], because these models yield piecewise linear approximations. Fig. 1 illustrates the process adopted to create a strong classifier, while the proposed method is presented in Algorithm 1.
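The overall pipeline can be approximated with scikit-learn's `StackingClassifier` (a sketch under stated assumptions: scikit-learn has no Logistic Model Tree, so `LogisticRegression` stands in for the meta-learner, and `stack_method="predict_proba"` passes all class probabilities rather than only those of the true class):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Three strong ensembles as base learners, combined by a meta-learner
# trained on their cross-validated class-probability outputs.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba", cv=5)
score = cross_val_score(stack, X, y, cv=5).mean()
print(round(score, 3))
```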

Fig. 1. Stacking strong ensembles process

Algorithm 1

4 Comparisons and Experimental Results

For our experiments, well-known datasets from the UCI Machine Learning Repository [6] were used. Specifically, 38 problems from different fields were selected, which differ in both the number of attributes and the number of classes. Table 1 gives a brief description of these datasets: the number of features, the number of instances, and the number of output classes.

Table 1. Collection of 38 multi-class datasets from the UCI Machine Learning Repository. The number of instances, the number of features as well as the number of classes, are exhibited.
Table 2. Classification accuracy and standard deviation of the compared methods, where “ET” denotes ExtraTrees, “GB” represents Gradient Boosting, while “RF” denotes Random Forest.

The accuracy of the classifiers was evaluated according to the following procedure: the training dataset was divided into ten equally sized subsets, and for each of them the classifier was trained on the remaining subsets. Then, for each method, this cross-validation procedure was run five times and the average value over the five runs was measured. The experiments were conducted in Python using the available implementations from scikit-learn [18] and the MLxtend library [20]. Table 2 exhibits the results obtained for our hybrid scheme and for the methods participating in the comparison; the best-performing scheme for each dataset is shown in boldface. It can easily be seen that the proposed stacking ensemble method outperforms the other three well-known ensembles in most of the cases, and thus we can say that the proposed method is more robust.
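This evaluation protocol corresponds to repeated stratified cross-validation, which can be reproduced as follows (a sketch assuming a ten-fold split repeated five times; the dataset and classifier here are placeholders, not the paper's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Ten folds, repeated five times: 50 accuracy values are averaged.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(len(scores), round(scores.mean(), 3))
```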

Furthermore, statistical tests have been performed to check the significance of the results. In particular, due to the small number of compared methods, the non-parametric Friedman test [11] was conducted; the Friedman rankings are presented in Table 3. The p-value in all comparisons indicates that the null hypothesis should be rejected; thus, there are methods whose performance differences are statistically significant. Table 4 exhibits the results of the post-hoc Holm test [12].
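The Friedman test compares the methods' ranks across datasets; it is available in SciPy and can be sketched as follows (the accuracy matrix below is entirely hypothetical, for illustration only, and not taken from Table 2):

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical accuracies for four methods (columns) on six datasets (rows).
acc = np.array([[0.91, 0.89, 0.90, 0.93],
                [0.84, 0.82, 0.83, 0.86],
                [0.77, 0.75, 0.76, 0.80],
                [0.95, 0.94, 0.93, 0.96],
                [0.88, 0.85, 0.86, 0.90],
                [0.70, 0.69, 0.68, 0.74]])

# The test ranks the methods within each dataset; a small p-value means
# at least one method's performance differs significantly from the rest.
stat, p = friedmanchisquare(*acc.T)
print(round(stat, 2), round(p, 4))
```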

Table 3. Rankings of the algorithms using the Friedman test
Table 4. Post-hoc Holm test using Stacking as control method

5 Conclusions

Hybrid decision schemes seem to be a reliable and effective approach for tackling decision-making tasks in several scientific fields. Compared to the accuracy that a single classifier can achieve, an ensemble method seems able to reach better classification results. However, designing such a method is a difficult problem, as a plethora of base classifiers is available. The Stacking methodology is an appropriate and accurate procedure for creating a strong combination using the best overall stacked classifier. Although a variety of ensemble methods have been designed, creating the best combination for solving specific classification problems remains a hot and ongoing topic. This paper proposes a strong Stacking ensemble methodology using Logistic Model Trees and three well-known ensembles. The experiments conducted over well-known datasets showed that the proposed scheme gives promising results on most of the benchmark problems. Nevertheless, several issues need further study, such as the rules for building a suitable set of classifiers with the highest possible accuracy. Moreover, further comparisons of our method with well-known variations of Stacking, such as [3, 17], and its application to regression problems could be interesting tasks for future research.