1 Introduction

Ensemble methods are a well-known and quickly developing area of research. They owe their success to the fact that their application allows for dealing with a variety of learning problems, such as learning from distributed data sources [23], improving overall classification accuracy [28], learning from data streams [18], hyperspectral image analysis [17] and imbalanced data classification [19]. While in the classic approach only one learner is trained for a given problem, ensemble methods construct many classifiers based on the available training data and combine them to obtain a final decision. The base learners making up the classifier ensemble are trained in a way that achieves suitable diversity among the classifiers [29]. An ensemble may consist of either heterogeneous or homogeneous models [3]. Heterogeneous classifiers derive, e.g., from applying various learning algorithms to the same training data, while homogeneous classifiers employ different executions of the same learning algorithm (e.g., by differentiating parameters or using different learning set partitions).

Usually, the high classification performance achieved by an ensemble comes at the cost of increased overall computational complexity, because rather than determining the best single classifier, we look for the best-performing set of classifiers and the best combination rule for obtaining the final decision. It is worth mentioning that [13] enumerated two main approaches to designing a classifier ensemble, i.e., coverage optimization, where the combination rule is given and the main effort is to form an appropriate line-up of individual predictors, and decision optimization, which aims to find an optimal combination rule while the ensemble line-up is fixed.

This work addresses the topic of classifier ensemble pruning, especially clustering-based ensemble pruning methods, in which our goal is to decrease the total number of ensemble members. In this way, we can improve predictive performance and considerably reduce the computational overhead.

In a nutshell, the main contributions of this work are as follows:

  • The proposition of a novel mutual diversity measure based on the non-pairwise and averaged pairwise diversity, which allows evaluating the impact of a particular predictor on the diversity of a given classifier ensemble. Thus, it can be used as the criterion for ensemble pruning.

  • The formalization of an algorithm that uses the proposed measure for ensemble pruning and multistage organization of majority voting.

  • An extensive experimental analysis on a large number of benchmark datasets comparing the performance of the proposed methods with state-of-the-art ensemble methods, backed up by statistical tests.

2 Related works

Let us first present the ensemble pruning taxonomy proposed in [32]:

  • Ranking-based pruning chooses a fixed number of the best-ranked individual classifiers according to a given metric (such as the kappa statistic) [24].

  • Optimization-based pruning solves the problem of choosing individual classifiers as an optimization task. Because the number of base models is typically high, heuristic methods [27], evolutionary algorithms [33] or cross-validation-based techniques [5] are usually employed.

  • Clustering-based pruning looks for groups of base classifiers, where individuals in the same group behave similarly while different groups have large diversity. Then, from each cluster, the representative is selected, which is placed in the final ensemble.

Because this work focuses on employing clustering-based classifier ensemble pruning methods to improve the predictive performance of combined classifiers, let us briefly present the main works related to the problem under consideration. Basically, clustering-based pruning consists of two steps. The first one groups base models into several clusters based on a criterion, which should take into consideration their impact on the ensemble performance. For this purpose, various clustering methods have been used, such as hierarchical agglomerative clustering [10], deterministic annealing [2], k-means clustering [9, 22] and spectral clustering [31]. Most of those methods employ a kind of diversity-based criterion. Giacinto et al. [10] estimated the probability that classifiers do not make coincident errors on a separate validation set, while Lazarevic and Obradovic [22] used the Euclidean distance on the training set. Kuncheva proposed employing a matrix of pairwise diversity for hierarchical and spectral methods [20].

In the second step, a prototype base learner is selected from each cluster. In [2], a new model was trained for each cluster based on the cluster centroids. Giacinto et al. [10] chose the classifier most distant from the remaining clusters. In [22], models were iteratively removed from the least to the most accurate. The model with the best classification accuracy was chosen in [9].

The last issue is the choice of the number of clusters. This can be determined based on the performance of the method on a validation set [9]. In the case of fuzzy clustering methods, indexes based on membership values and the dataset, or statistical indexes, can be used to automatically select the number of clusters [16].

The alternative proposal is a multiple-stage organization, which was briefly mentioned in [14] and described in detail by Ruta and Gabrys [26], where the authors refer to such systems as a multistage organization with majority voting (MOMV), since the decision at each level is given by majority voting. Initially, all outputs are allocated to different groups by permutation and majority voting is applied for each group, producing single binary outputs that form the next layer. In the following layers, exactly the same way of grouping and combining is applied, with the only difference being that the number of outputs in each layer is reduced to the number of groups formed previously. This repetitive process is continued until the final single decision is obtained. In this research, we employ this approach, but to form groups of voting classifiers we use clustering methods.

2.1 Ensemble diversity

As mentioned before, diversity is one of the key factors for generating a valuable classifier ensemble, but the main problem is how to measure it. In this work, we decided to use a diversity-based criterion for base classifier clustering. Basically, known diversity measures may be divided into two groups: pairwise and non-pairwise diversity measures. Pairwise diversity measures determine the diversity between a pair of base models; an ensemble consisting of L classifiers will have \(L(L-1)/2\) values of pairwise diversity. To get the value for the entire ensemble, we calculate the average. Non-pairwise measures take into consideration all base classifiers and give one diversity value for the entire ensemble. Let \(\varPsi _i\) denote the ith base classifier and \(\varPi =\{\varPsi _1, \varPsi _2,\ldots ,\varPsi _L\}\) be the ensemble of base models. In this work, three non-pairwise (i.e., the entropy measure E, the Kohavi–Wolpert variance and the measurement of interrater agreement K) and two averaged pairwise (i.e., the averaged Q statistics and the averaged disagreement measure) ensemble diversity measures have been used. Let us present the selected diversity measures.

The entropy measure E [4] is defined as

$$\begin{aligned} E(\varPi ) = \dfrac{1}{N} \sum _{j=1}^{N}\left( \dfrac{1}{L-\lceil L/2\rceil }\right) \min \{l(z_{j}), L-l(z_{j})\}, \end{aligned}$$

where N is the number of instances, L stands for the number of base models in the ensemble, and \(l(z_{j})\) denotes the number of classifiers that correctly recognize \(z_{j}\). E varies between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity.
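For illustration, E can be computed from a binary correctness matrix, a representation assumed here for the sketch (not prescribed by the definition above): one row per instance, one column per classifier, with entry 1 iff the classifier recognizes that instance correctly.

```python
import math

def entropy_measure(Y):
    """Entropy diversity E for a binary correctness matrix Y:
    N rows (one per instance), L columns (one per classifier),
    Y[j][i] = 1 iff classifier i recognizes instance z_j correctly."""
    N, L = len(Y), len(Y[0])
    denom = L - math.ceil(L / 2)  # equals floor(L/2)
    total = 0.0
    for row in Y:
        l = sum(row)  # l(z_j): number of classifiers correct on z_j
        total += min(l, L - l) / denom
    return total / N
```

With three classifiers that always agree, E is 0; when every instance splits the pool as evenly as possible, E reaches 1.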

Kohavi–Wolpert variance [15] is defined as

$$\begin{aligned} {\text{KW}}(\varPi ) = \dfrac{1}{NL^2} \sum _{j=1}^{N}l(z_{j})(L-l(z_{j})). \end{aligned}$$

The higher the value of KW, the more diverse the classifiers in the ensemble. Also, KW differs from the averaged disagreement measure \({\text{Dis}}_{\mathrm{av}}\) by a coefficient, i.e.,

$$\begin{aligned} {\text{KW}}(\varPi ) = \dfrac{L-1}{2L}{\text{Dis}}_{\mathrm{av}}(\varPi ). \end{aligned}$$
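The coefficient relationship between KW and \({\text{Dis}}_{\mathrm{av}}\) can be checked numerically; a minimal Python sketch, assuming a binary correctness matrix Y (rows = instances, columns = classifiers; this representation is an assumption of the illustration):

```python
def kw_variance(Y):
    """Kohavi-Wolpert variance for a binary correctness matrix Y."""
    N, L = len(Y), len(Y[0])
    return sum(l * (L - l) for l in map(sum, Y)) / (N * L * L)

def dis_av(Y):
    """Averaged pairwise disagreement over all L(L-1)/2 classifier pairs."""
    N, L = len(Y), len(Y[0])
    total = sum(
        sum(Y[j][i] != Y[j][k] for j in range(N)) / N
        for i in range(L - 1) for k in range(i + 1, L)
    )
    return 2.0 * total / (L * (L - 1))
```

On any such matrix, `kw_variance(Y)` equals `(L - 1) / (2 * L) * dis_av(Y)` up to floating-point error, matching the identity above.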

The measurement of interrater agreement K [6, 8] is defined as

$$\begin{aligned} K(\varPi ) = 1-\dfrac{\dfrac{1}{L} \sum _{j=1}^{N}l(z_{j})(L-l(z_{j}))}{N(L-1)\bar{p}(1-\bar{p})}, \end{aligned}$$

where \(\bar{p}\) is the average individual classification accuracy

$$\begin{aligned} \bar{p} = \dfrac{1}{NL}\sum _{j=1}^{N}\sum _{i=1}^{L}y_{j,i}, \end{aligned}$$

where \(y_{j,i}\) is an element of an N-dimensional binary vector \(y_{i}=[y_{1,i}, \ldots , y_{N,i}]^{{\rm T}}\) representing the output of classifier \(\varPsi _{i}\), such that \(y_{j,i}=1\) if \(\varPsi _{i}\) recognizes \(z_{j}\) correctly, and 0 otherwise. K equals 1 for complete agreement and decreases as diversity grows, with low values indicating the highest possible diversity.
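A direct transcription of the two formulas above, again assuming the binary correctness-matrix representation (rows = instances, columns = classifiers) used for this illustration:

```python
def interrater_kappa(Y):
    """Interrater agreement K for a binary correctness matrix Y;
    the average accuracy p_bar must lie strictly between 0 and 1."""
    N, L = len(Y), len(Y[0])
    p_bar = sum(map(sum, Y)) / (N * L)                # average individual accuracy
    num = sum(l * (L - l) for l in map(sum, Y)) / L   # (1/L) * sum l(z_j)(L - l(z_j))
    return 1.0 - num / (N * (L - 1) * p_bar * (1 - p_bar))
```

When all classifiers produce identical outputs, the numerator vanishes and K is exactly 1.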

Table 1 The relationship between a pair of classifiers

The averaged Q statistics [30] over all pairs of classifiers is given as

$$\begin{aligned} Q_{\mathrm{av}}(\varPi )=\dfrac{2}{L(L-1)}\sum _{i=1}^{L-1}\sum _{k=i+1}^{L}Q(\varPsi _i, \varPsi _k), \end{aligned}$$


where

$$\begin{aligned} Q(\varPsi _i, \varPsi _k)=\dfrac{N^{11}N^{00}-N^{01}N^{10}}{N^{11}N^{00}+N^{01}N^{10}}, \end{aligned}$$

and \(N^{ab}\) is the number of elements \(z_{j}\) for which \(y_{j,i}=a\) and \(y_{j,k}=b\). The relationship between a pair of classifiers is denoted according to Table 1. Q varies between \(-1\) and 1. Classifiers that recognize the same objects correctly will have positive values of Q, and those which commit errors on different objects will render Q negative.

The averaged disagreement measure [11] over all pairs of classifiers is given as

$$\begin{aligned} {\text{Dis}}_{\mathrm{av}}(\varPi )=\dfrac{2}{L(L-1)}\sum _{i=1}^{L-1}\sum _{k=i+1}^{L}{\text{Dis}}(\varPsi _i, \varPsi _k), \end{aligned}$$


where

$$\begin{aligned} {\text{Dis}}(\varPsi _i, \varPsi _k)=\dfrac{N^{01}+N^{10}}{N^{11}+N^{10}+N^{01}+N^{00}}. \end{aligned}$$

The disagreement measure is the ratio of the number of observations on which one classifier is correct and the other incorrect to the total number of observations. Dis varies between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity.
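Both pairwise measures share the \(N^{ab}\) contingency counts of Table 1, so they can be sketched together. The sketch below assumes binary correctness vectors (one per classifier, 1 = correct on that instance), a representation chosen for the illustration:

```python
def contingency(yi, yk):
    """N^{ab}: number of instances with y_i = a and y_k = b (Table 1)."""
    n = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, b in zip(yi, yk):
        n[(a, b)] += 1
    return n

def q_statistic(yi, yk):
    n = contingency(yi, yk)
    return ((n[1, 1] * n[0, 0] - n[0, 1] * n[1, 0])
            / (n[1, 1] * n[0, 0] + n[0, 1] * n[1, 0]))

def disagreement(yi, yk):
    n = contingency(yi, yk)
    return (n[0, 1] + n[1, 0]) / len(yi)

def averaged(pair_measure, Y):
    """Average a pairwise measure over all L(L-1)/2 column pairs of the
    correctness matrix Y (columns are per-classifier correctness vectors)."""
    L = len(Y[0])
    cols = [[row[i] for row in Y] for i in range(L)]
    vals = [pair_measure(cols[i], cols[k])
            for i in range(L - 1) for k in range(i + 1, L)]
    return sum(vals) / len(vals)
```

Two identical classifiers give Q = 1 and Dis = 0; two classifiers that always disagree give Q = -1 and Dis = 1, matching the ranges stated above.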

3 Proposed methods

In this section, we propose three methods for increasing the ensemble’s accuracy using clustering and diversity-based criterion.

3.1 Clustering criterion

Firstly, let us propose the measure which may be used for clustering-based pruning. As the non-pairwise and averaged pairwise diversity measures consider all the base models together and calculate one value for the entire ensemble, they cannot be used directly for pruning, because they do not capture the impact of a particular base classifier on the ensemble diversity. Therefore, we propose a novel measure M as the clustering criterion, defined as the difference between the value of the diversity measure for the whole ensemble \(\varPi \) and the value of diversity for the ensemble without a given classifier \(\varPsi _i\):

$$\begin{aligned} M(\varPsi _i) = {\text{Div}}(\varPi )-{\text{Div}}(\varPi \setminus \{\varPsi _i\}). \end{aligned}$$

Thanks to this proposition, the impact of each base learner on the ensemble diversity can be presented in a one-dimensional space, as shown in Fig. 1. Each marker represents one of the one hundred base classifiers, placed in the space according to its value of the M measure.

Fig. 1
figure 1

Visualization of the proposed clustering space for Glass dataset [7], where the clustering criterion (i.e., M measure) is calculated based on the entropy measure E
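The M measure itself reduces to a leave-one-out difference of any ensemble-level diversity function. A minimal sketch, assuming a binary correctness matrix and using the Kohavi-Wolpert variance as the Div function (any of the measures of Sect. 2.1 could be substituted):

```python
def kw_variance(Y):
    """Kohavi-Wolpert variance, used here as the ensemble-level Div."""
    N, L = len(Y), len(Y[0])
    return sum(l * (L - l) for l in map(sum, Y)) / (N * L * L)

def m_measure(div, Y, i):
    """M(Psi_i) = Div(ensemble) - Div(ensemble without classifier i):
    drop column i of the correctness matrix and take the difference."""
    Y_without = [[v for c, v in enumerate(row) if c != i] for row in Y]
    return div(Y) - div(Y_without)

def clustering_space(div, Y):
    """One-dimensional clustering space: one M value per base classifier."""
    return [m_measure(div, Y, i) for i in range(len(Y[0]))]
```

The resulting list of L scalar values is exactly the one-dimensional space that the clustering algorithm of Sect. 3.2 operates on.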

3.2 Diversity-based one-dimensional clustering space and cluster pruning

In this proposition, the chosen clustering algorithm is applied to the obtained clustering space. The pruned ensemble consists of the base models with the best classification accuracy in each cluster (one for each cluster).

In this work, the k-means clustering algorithm, in the Scikit-learn [25] implementation, has been employed to find a given number of clusters (from 2 up to 10) in the clustering space constructed using the proposed M measure. From each group, a representative classifier with the highest predictive performance has been chosen. We aim to construct an ensemble containing strong yet diverse base models, as these two characteristics are distinguishing features of a well-performing classifier ensemble.
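The pruning step can be sketched as follows. The paper uses Scikit-learn's KMeans; to keep the sketch self-contained, a hand-rolled 1-D k-means stands in for it, and the M values and accuracies in the usage example are hypothetical:

```python
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Minimal 1-D k-means (a stand-in for Scikit-learn's KMeans)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def prune(m_values, accuracies, k):
    """Cluster classifiers in the 1-D M-measure space and keep, from
    each cluster, the index of its most accurate member."""
    labels = kmeans_1d(m_values, k)
    best = {}
    for idx, (lab, acc) in enumerate(zip(labels, accuracies)):
        if lab not in best or acc > accuracies[best[lab]]:
            best[lab] = idx
    return sorted(best.values())
```

For example, with M values `[0.0, 0.01, 0.02, 0.9, 0.95]` and accuracies `[0.7, 0.8, 0.75, 0.6, 0.9]`, two clusters form around the low and high M values, and the most accurate member of each (indices 1 and 4) survives the pruning.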

3.3 Two-step majority voting organization

Fig. 2
figure 2

Example of a two-step majority voting organization with 9 classifiers divided into 3 clusters. Layer 2 is the result of majority voting of each cluster, and the final decision is made by the second majority voting

The second proposed method is a modification of the MOMV structure described in [26]. Instead of allocating outputs to different groups by permutation, we treat base models in each cluster as a separate ensemble combined by majority voting. Then we collect predictions from each cluster and apply the majority voting rule for the second time, to make a final decision (Fig. 2).
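The two-step scheme can be sketched directly; the tie-breaking rule (smallest label wins) is an assumption for the illustration, consistent with the "first label in the order" behavior discussed in Sect. 4.4:

```python
from collections import Counter

def majority(votes):
    """Majority vote; ties broken by the smallest label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(lab for lab, c in counts.items() if c == top)

def two_step_vote(predictions, clusters):
    """predictions[i]: label predicted by classifier i;
    clusters[i]: cluster id of classifier i. Each cluster votes
    first (layer 2), then the cluster decisions are combined."""
    layer2 = [majority([p for p, c in zip(predictions, clusters) if c == cid])
              for cid in sorted(set(clusters))]
    return majority(layer2)
```

Note that the two-step result can differ from a flat majority vote over the whole pool: with predictions `[1, 1, 0, 0, 0, 0, 1, 0, 1]` in clusters `[0, 0, 0, 1, 1, 1, 2, 2, 2]`, the cluster decisions are 1, 0, 1 and the final decision is 1, while a flat vote over all nine outputs yields 0.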

Fig. 3
figure 3

Example of two-step majority voting organization with 9 classifiers divided into 3 clusters, using sampling with replacement. The number of groups and classifiers in each group in the first layer is equal to the number of clusters found. Layer 2 and the final decision are also made according to the majority voting

Additionally, we propose the third method, based on the assumption that classifiers belonging to the same cluster make similar decisions, so we do not have to use them all in the classification process. In this method, we construct the first layer of voting by creating the number of groups equal to the number of clusters found, each group containing one classifier sampled with replacement from each of the clusters (Fig. 3).
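A sketch of this sampled variant, under the same assumptions as above (binary labels, smallest-label tie-breaking); the seed parameter is an addition for reproducibility, not part of the method's description:

```python
import random
from collections import Counter

def majority(votes):
    """Majority vote; ties broken by the smallest label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(lab for lab, c in counts.items() if c == top)

def two_step_vote_sampled(predictions, clusters, seed=0):
    """First layer: as many groups as clusters, each holding one
    classifier sampled with replacement from every cluster; the
    group votes form layer 2, combined by a final majority vote."""
    rng = random.Random(seed)
    ids = sorted(set(clusters))
    members = {cid: [i for i, c in enumerate(clusters) if c == cid]
               for cid in ids}
    layer2 = []
    for _ in ids:                       # one group per discovered cluster
        group = [rng.choice(members[cid]) for cid in ids]
        layer2.append(majority([predictions[i] for i in group]))
    return majority(layer2)
```

If classifiers within each cluster really do make similar decisions, the sampled scheme reproduces the full two-step result while querying far fewer base models.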

4 Experimental study

Table 2 Datasets characteristics

In this section, we present the experimental study performed to evaluate the effectiveness of the proposed clustering-based ensemble pruning and multistage organization methods. As reference, two state-of-the-art methods were used: majority voting and the aggregation of probabilities. Experiments were designed to answer the following research questions:

  • Which set of parameters (approach, diversity measure, base learner type, number of clusters) yields the best results for the given dataset?

  • How does the number of clusters affect the performance of the methods?

  • Do the proposed ensemble pruning and multistage organization methods lead to improvements in accuracy over state-of-the-art methods?

4.1 Datasets

We have used 30 datasets from the KEEL [1] and UCI [7] repositories to evaluate the performance of the proposed methods. We have selected a diverse set of benchmarks with varying characteristics, including different numbers of instances and features, as shown in Table 2. Additionally, we take into consideration both binary and multiclass classification problems.

4.2 Setup

As base learners, we used four popular types of classifiers: multilayer perceptron (MLP), classification and regression trees (CART), Gaussian naïve Bayes (NB) and the k-nearest neighbors classifier (KNN). In each case, learners from the Scikit-learn machine learning library [25] with the default parameters were used. The classifier pool always consists of 100 base models. Diversity between learners is based on the random subspace method [12], where classifiers are trained on pseudorandomly selected subsets of components of the feature vector. The percentage of features used for training a single model has been selected depending on the number of features in the dataset. For the majority of datasets it is 50%, the only exceptions being: Libras dataset—20%, MuskV1 dataset—10%, Sonar dataset—25%, Spambase dataset—25% and Spectfheart dataset—35%. We adjust the percentage of features so that, regardless of their total number in a given dataset, at most a dozen or so features are used to train each base model, which ensures high diversity.
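The subspace-drawing step of this setup can be sketched with the standard library alone (the actual experiments train Scikit-learn models on each subset; the function name and seed parameter are illustrative):

```python
import random

def random_subspaces(n_features, n_models, fraction, seed=0):
    """Draw one pseudorandom feature subset per base model (random
    subspace method [12]): each model is trained only on `fraction`
    of the feature-vector components."""
    rng = random.Random(seed)
    size = max(1, int(n_features * fraction))
    return [sorted(rng.sample(range(n_features), size))
            for _ in range(n_models)]
```

For a 10-feature dataset at the default 50% setting, `random_subspaces(10, 100, 0.5)` yields 100 subsets of 5 feature indices each, one per base model in the pool.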

Based on 3 parameters (approach, diversity measure and base learner type), we distinguish 60 different methods for improving classification score of the ensemble (20 for pruning, 20 for multistage organization and 20 for MO using sampling with replacement). Experiments were carried out for the number of clusters in the range from 2 to 10. For the sake of simplicity, for each method, we take into account only the number of clusters that obtained the best classification accuracy. The name of each method is based on abbreviations of parameter values (ApproachClassifierDiversityMeasure format) including two state-of-the-art methods (majority voting and aggregation of probabilities) for each base learner, which gives us 68 methods overall. The following abbreviations have been used:

  • Approach MV—majority voting, Aggr—aggregation of probabilities, Mo—multistage voting organization, MoR—multistage voting using sampling with replacement and Pr—clustering-based pruning,

  • Classifier Mlp—Multilayer perceptron, Cart—classification and regression trees, Nb—Gaussian naïve Bayes and Knn—k-nearest neighbors classifier,

  • DiversityMeasure E—the entropy measure, KW—Kohavi–Wolpert variance, K—measurement of interrater agreement, Q—the averaged Q statistics and Dis—the averaged disagreement measure.

Experiments were implemented in the Python programming language and may be repeated according to the source code published on GitHub (Footnote 1).

Fig. 4
figure 4

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for Pr methods. \({\mathrm{CD}}=5.41\)

Fig. 5
figure 5

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for Mo methods. \({\mathrm{CD}}=5.41\)

Fig. 6
figure 6

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for MoR methods. \({\mathrm{CD}}=5.41\)

Fig. 7
figure 7

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for CART methods. \({\mathrm{CD}}=4.51\)

Fig. 8
figure 8

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for KNN methods. \({\mathrm{CD}}=4.51\)

Fig. 9
figure 9

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for MLP methods. \({\mathrm{CD}}=4.51\)

Fig. 10
figure 10

Diagram of critical difference (CD) for Nemenyi post hoc test at \(\alpha = 0.05\) for NB methods. \({\mathrm{CD}}=4.51\)

4.3 Statistical evaluation

First, the proposed methods were divided into 3 groups of 20, based on the used approach. For each group, Nemenyi post hoc test, based on the average ranks according to classification score, was performed (Figs. 4, 5 and 6). In each case, methods employing classification and regression trees as base models achieved the highest average ranks, while methods using Gaussian naïve Bayes classifiers performed the worst.

Figures 7, 8, 9 and 10 show CD diagrams for the proposed methods depending on the type of base models used. In Fig. 7, we can see that, among CART methods, pruning approaches achieved the highest average ranks and are statistically significantly better than the state-of-the-art and most multistage organization methods. It is worth mentioning that the best-ranked method for every tested approach used the averaged Q statistics as the diversity measure for constructing the clustering space. The same is true for KNN methods (Fig. 8).

In the case of MLP (Fig. 9) and NB (Fig. 10), we can see that methods employing pruning are statistically significantly better than the rest of the proposed and state-of-the-art approaches. Also, the averaged Q statistics again has been the best diversity method for constructing the clustering space, when used for multistage organization methods.

Table 3 The classification accuracy of the best-performing method for each dataset, depending on the number of clusters
Fig. 11
figure 11

The classification accuracy of the best-performing methods for different numbers of clusters

Fig. 12
figure 12

The classification accuracy of the best-performing methods for different numbers of clusters

Fig. 13
figure 13

The classification accuracy of the best-performing methods for different numbers of clusters

Table 3 presents the impact of the number of clusters on the best-performing methods, according to the classification score, for each tested dataset. As it was not possible to find ten clusters for each method and dataset, we present the maximum number of clusters found in each of the k-folds during evaluation. In the case where several methods achieved the same classification accuracy, the first one was chosen according to the order: Aggr-, MV-, Pr(E/Kw/K/Dis/Q), Mo(E/Kw/K/Dis/Q), MoR(E/Kw/K/Dis/Q). For every dataset, the proposed pruning methods achieved the best classification accuracy. Figures 11, 12 and 13 present how the performance of methods varies depending on the number of clusters.

4.4 Lessons learned

Increasing the number of clusters positively impacts classification performance for the majority of tested classifiers; however, we sometimes observe a decrease in classification accuracy after exceeding a certain number of clusters, which may be caused by overfitting due to a non-optimal number of clusters for the given problem.

The low classification score in the case of two clusters may be caused by using only two classifiers for majority voting. In that case, when there is no agreement between classifiers, the first label in the order is chosen. In the case of several datasets (e.g., Australian, Contraceptive, Sonar or Yeast), we can observe the reduction in the quality of classification when there is an even number of base models in the ensemble. These reductions may be the result of voting ties and also selecting the first available label as the final decision.

In the case of some datasets (i.e., Dermatology, Wine, and ZOO), we may see that increasing the number of clusters for the proposed pruning method (and thus creating larger ensembles) resulted in achieving a 100% classification accuracy. We may conclude that the proposed clustering-based method really allows for choosing suitably diverse base models, which can create a well-complementing and strong classifier ensemble.

We can also notice a trend for the proposed algorithms to perform better on higher-dimensional datasets (e.g., Ionosphere, Libras, MuskV1 or Spambase) when a higher number of clusters is discovered. A similar observation occurs in the case of datasets with more possible class labels (e.g., Libras, Vowel or WineRed).

Although the conducted statistical tests indicate that the most suitable diversity measure for the problems considered during experimentation may be the averaged Q statistics, we cannot definitively consider it the best. As stated in [21], after studying various diversity measures, there is no definitive connection between the measures and the improvement of accuracy, and \(Q_{\mathrm{av}}\) was recommended only on the basis of its ease of interpretation and calculation.

5 Conclusions

The main aim of this work was to propose a novel, effective classifier pruning method based on clustering. We proposed the one-dimensional clustering space based on ensemble diversity measures, which is later used in order to prune the existing classifier pool or to perform a multistage majority voting. The computer experiments confirmed the usefulness of the proposed pruning method and based on a statistical analysis we may conclude that it is statistically significantly better than state-of-the-art ensemble methods. It is also worth noting that the pruning approach performed the best among the three methods proposed in this paper. The proposed multistage organization voting scheme (both using the whole classifier pool and sampling with replacement) did not achieve statistically better results than state-of-the-art methods.

The results presented in this paper are quite promising; therefore, they encourage us to continue our work on employing clustering-based methods for ensemble pruning. Future research directions may include exploring the different ways of calculating the proposed M measure (including both deterministic and non-deterministic variants) and, in the case of multistage organization methods, employing different types of voting (e.g., weighted majority voting). It would be useful to also consider ways of dealing with ties during the voting process and, possibly, investigate the effects of data dimensionality on the performance of the proposed algorithms.