1 Introduction

Contemporary machine learning has to deal with the escalating complexity of problems that accompanies the increasing prevalence of data. Standard classification methods are often unable to capture complex patterns or cannot maintain their generalization capabilities, leading to either under- or overfitting. Therefore, models that can capture multi-dimensional data properties while avoiding these pitfalls are highly desirable. The classifier ensemble, also known as a multiple classifier system or classifier committee [37], is an example of such an approach. By combining the predictions of a number of simpler models, classifier ensembles can produce more efficient and flexible recognition systems, at the same time benefiting from the high generality of the base classifiers [17].

To ensure satisfactory performance, several requirements have to be met by an ensemble. Perhaps the most important one is the need to supply a pool of diverse classifiers [34]. This can be achieved in various ways, e.g., by training every learner on the basis of a different feature subset. The motivation behind this approach is twofold: it simplifies the training procedure by reducing the number of features used by each learner, and at the same time it allows the learners to explore different properties of the supplied subspaces [11]. The most notable examples of techniques relying on partitioning features into subspaces are the Random Subspace (RS) [15] and Random Forest (RF) [5] methods. While simple and computationally efficient, such approaches have a major drawback. Due to the random nature of the feature subset construction process, they are inherently unstable and prone to producing models of poor quality in worst-case scenarios. Additionally, each run of these algorithms yields different feature subsets and thus different base classifiers. It should be underlined that in real-life applications stability is often a necessary requirement, which significantly limits the use of the mentioned methods, particularly the Random Subspace approach.

To overcome this limitation, we consider how feature subspaces can be constructed in a guided way, in order to make the resulting classifier ensemble more stable and less prone to producing under-performing base learners. As a result of our investigation, we present a novel approach that forms feature subspaces in a fully deterministic manner. It is based on the same idea as the previously mentioned methods, namely creating an ensemble of simpler, diverse classifiers, each trained on a feature subset. However, instead of a stochastic subspace generation procedure, the proposed approach employs a guided search strategy. It assigns new features greedily in a round-robin fashion, based on defined feature quality and subspace diversity measures. The final decision of the ensemble is made using the majority voting rule. The proposed method is as flexible as the Random Subspace approach and can work with any type of base learner. However, the created ensemble is more stable and uses information about the predictive power of individual features.

The main contributions of the work are as follows:

  • Proposition of a new classifier ensemble learning algorithm, Deterministic Subspace (DS), which forms a set of diverse classifiers trained on selected features, where the attributes used by the base classifiers are chosen by a guided search strategy.

  • Improvement over random subspace selection, leading to a stable ensemble-forming procedure.

  • Extensive experimental results on a set of benchmark datasets, which evaluate the dependency among the quality of the DS algorithm, its parameters and the chosen model of base classifiers.

  • Demonstration of the high usefulness of the proposed DS approach for specific classifiers that take advantage of less correlated subspaces.

The rest of this paper is organized as follows. Section 2 presents related works on the subject of ensemble classification. In Sect. 3, the Deterministic Subspace method is discussed. Section 4 describes the conducted experimental study and the obtained results. Finally, Sect. 5 summarizes our findings.

2 Related work

Ensemble classification methods have several properties that make them one of the most prevalent techniques in the supervised learning domain. Perhaps most importantly, models produced using this paradigm tend to be capable of approximating complex decision boundaries while remaining resilient to overfitting. A performance gain is possible by exploiting the local competencies of base learners. At the same time, ensembles incorporate mechanisms that prevent choosing the worst learner from the pool [20]. Additionally, this approach is highly flexible: the parameters of learners are often easily adjustable, and the training procedure can be parallelized without much effort.

Even though ensembles solve some of the issues related to classification problems, they also introduce several new challenges related to their design. The necessity of providing a diverse pool of learners while preserving their individual accuracies [36] is the one on which we focus in this paper. This issue is especially severe due to the ambiguity of the term diversity itself. Proposing an agreed-upon definition of diversity remains an important open question [6], particularly in the context of the classification task, and existing diversity measures are therefore only approximations. The exact extent to which diversity influences ensemble performance remains unclear [12]. At the same time, it is hard to deny that learners should display some differing characteristics, since adding identical models would not contribute to the efficacy of the formed ensemble.

Diversity may be introduced to the ensemble on many different levels, often as a combination of factors. The main approaches include:

  • Varying learning models by using either completely different classification algorithms or the same algorithms with modified hyperparameters.

  • Varying outputs of learners by decomposing classification task, for instance into binary problems.

  • Varying inputs of learners by supplying different partitions of the dataset or different feature subspaces during training.

In the case of heterogeneous ensembles, one assumes that using different learners will be sufficient to ensure diversity. Indeed, in many situations modifying the learning paradigm may lead to significantly different decision boundaries. A careful selection process is crucial when applying this approach, to reduce the chance of different models producing similar outputs. The entire family of dynamic classifier selection methods is worth mentioning here, as these methods offer a flexible ensemble line-up for each incoming sample [33].

Alternatively, instead of using entirely different models, one may train the same model using different sets of hyperparameters or initial conditions. This approach is based on the assumption that the search space explored during training is complex, so training may reach different local extrema. The most notable examples of this method include ensembles of neural networks with an early stopping condition [35] or Support Vector Machines with varying kernels [32].

Another approach is based on manipulating the classifier outputs. Common techniques rely on a multi-class decomposition, which yields specialized learners trained to recognize a reduced number of classes. A dedicated combination method, such as Error-Correcting Output Codes [25], is then used to reconstruct the original multi-class task. The most notable examples of this type of technique include binarization [13], hierarchical decomposition [27] and classifier chains [22].

Diversity may also be induced by input manipulation in either the data or the feature space. The former approach relies on the assumption that variance introduced into the training instances enables learners to capture properties of different subsets of objects [29]. Bagging [30] and Boosting [2] are the most significant realizations of this approach, but ensembles may also be trained on the basis of clusters to preserve spatial relations among instances [10].

Finally, one may increase diversity by manipulating the feature space. This can be done in either a randomized [21] or a guided manner, using feature selection [8] or global optimization methods [7, 24]. The most notable techniques based on this paradigm are the previously mentioned Random Subspace and Random Forest methods. Both of them rely on randomly created feature subspaces to increase the diversity of the ensemble.
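For illustration, a minimal sketch of the Random Subspace scheme is given below. It assumes integer-encoded class labels and a scikit-learn-style base classifier; all names are illustrative and not taken from any of the cited implementations:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)  # fixing the seed is the only way to reproduce a run

def fit_random_subspace(X, y, base_clf, k=10, n=None):
    """Train k copies of base_clf, each on n randomly drawn features."""
    n = n or X.shape[1] // 2
    models = []
    for _ in range(k):
        feats = rng.choice(X.shape[1], size=n, replace=False)
        models.append((feats, clone(base_clf).fit(X[:, feats], y)))
    return models

def predict_majority(models, X):
    """Combine base-learner predictions by majority voting."""
    votes = np.stack([clf.predict(X[:, feats]) for feats, clf in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# usage sketch: models = fit_random_subspace(X_train, y_train, DecisionTreeClassifier())
```

Note that without fixing the seed, every run draws different feature subsets and hence produces a different ensemble, which is precisely the instability discussed below.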

Despite its simplicity, the Random Subspace method has gained popularity in the machine learning community. Its main advantage over Random Forest lies in its flexibility, as it can be used with any type of base learner [31]. Several interesting variations of this approach have appeared in recent years. Polikar et al. [28] proposed \(\hbox {Learn}^{++}\).MF, a modification of the popular \(\hbox {Learn}^{++}\) algorithm that utilizes Random Subspaces in order to handle missing values in classified instances: in the case of incomplete information, only classifiers trained on the available features are used in the classification phase. Li et al. [18] used Random Subspaces together with distance-based lazy learners and combined them using Dempster’s rule. Mert et al. [23] developed a weighted combination of Random Subspaces, where the weight assigned to each of them was based on its ability to provide good class separability. Random Subspaces have also been successfully used for semi-supervised learning: Yaslan and Cataltepe [39] used randomized sets of features for co-training an ensemble of classifiers, while Yu et al. [40] used them for graph-based semi-supervised dimensionality reduction. Recent work by Carbonneau et al. [9] showed that this method offers very good performance in multi-instance learning. Another reason behind the popularity of the Random Subspace approach lies in many successful applications of this technique to real-life problems: Plumpton et al. [26] used it for real-time classification of fMRI data, Xia et al. [38] for hyperspectral image analysis, while Zhu and Xue [41] combined it with tensor analysis for face recognition.

However, a significant drawback of the Random Subspace approach (as well as of Random Forest) lies in its purely random nature. It lacks any deterministic element that would maintain stability or ensure that the same model is trained for given data in every repetition. This is especially crucial for real-life applications in which a single final model is required, e.g., to be embedded in a hardware unit. Some researchers have tried to improve the stability of this method [19]; however, no fully deterministic solution has been proposed so far.

3 Deterministic subspace method

Due to the random subspace creation procedure, RS and RF are conceptually simple and computationally efficient methods. However, there is a risk that, due to their stochastic components, the produced subspaces may lack the discriminative power necessary for a proper separation of classes. Additionally, even if individually strong subspaces are created, a lack of diversity among them may deteriorate the performance of the ensemble. In that sense, these methods may be viewed as unstable. Furthermore, random techniques can be somewhat unsatisfying: even if they produce highly accurate models, it might be unclear what makes them good.

To overcome these limitations, we propose an alternative to the RS method, the Deterministic Subspace (DS) approach. Our main goal is to offer a stable and deterministic substitute for the RS method. In this section, a detailed description of the algorithm is given, along with a discussion of several feature quality and subspace diversity measures that may be used as its components.

3.1 Algorithm

The proposed algorithm is based on the idea of creating subspaces incrementally, in a manner guided by both the quality of individual subspaces and the diversity of the whole ensemble. The preference toward either quality or diversity can be adjusted by modifying the algorithm's hyperparameter \(\alpha \). For the approach to be computationally feasible, we had to make several simplifications. Firstly, we create the subspaces in a greedy manner based on a round-robin strategy, which may produce a non-optimal solution. Secondly, we make the strong assumption that a subspace consisting of individually strong features is itself of high quality. This assumption does not have to hold in practice; in fact, it can easily be shown that two weak features can together have high discriminative power [14]. However, the assumption was necessary to make training on highly dimensional data feasible. The proposed algorithm has three parameters:

  • the number of subspaces to be created k,

  • the number of features selected for every subspace n,

  • the weight coefficient \(\alpha \), indicating preference toward either the feature quality or the diversity.

Lower values of \(\alpha \) lead to the creation of more diverse subspaces, with features allocated close to evenly among them. On the other hand, choosing a higher value forces the algorithm to pick individually strong features more often. Setting \(\alpha \) to 0 makes the algorithm disregard feature quality completely, whereas setting it to 1 results in, effectively, a single subspace consisting of the individually strongest features, since all k subspaces become identical.

A smaller number of features per subspace n should, in principle, result in weaker base learners. Additionally, the subspaces created in that case are more diverse, since there is less overlap between their features. A larger number of subspaces k leads to the creation of a bigger ensemble, at the same time decreasing the diversity of the subspaces.

3.2 Diversity measure

It is intuitive that increasing classifier ensemble diversity should lead to better accuracy, but there is no formal proof of this dependency [4]. Several approaches to measuring the diversity of a classifier ensemble have been proposed in the existing literature. However, most of them rely on predictions made by the classifiers [3] and as a result are computationally expensive. We propose a naive, yet fast approach based on measuring the evenness of the feature spread among the subspaces.

Let S denote the set of existing subspaces and \(S_{j}\) stand for the jth subspace. Let \(\mathcal {X}=\{x^{(1)}, x^{(2)}, \ldots , x^{(d)}\}\) be the set of available features. Consider inserting an additional feature \(x^{(c)}\) into the currently considered subspace \(S_{j}\). We define the diversity metric \(div\_m(S, S_j, x^{(c)})\) as the average of two components: the proportion of existing subspaces not yet containing the considered feature, \(div\_m_x(S, x^{(c)})\), and the distance to the most similar subspace, \(div\_m_s(S, S_j)\):

$$\begin{aligned} div\_m\left( S, S_j, x^{(c)}\right) = \frac{div\_m_x\left( S, x^{(c)}\right) + div\_m_s(S, S_j)}{2}, \end{aligned}$$
(1)

where:

$$\begin{aligned} div\_m_x\left( S, x^{(c)}\right) = 1 - \frac{\left| \left\{ S_j : x^{(c)} \in S_j \right\} \right| }{|S|}, \end{aligned}$$
(2)

and:

$$\begin{aligned} div\_m_s(S, S_j) = 1 - \max _{j \ne l}{\frac{\left| S_j \cap S_l\right| }{|S_j|}}. \end{aligned}$$
(3)

By maximizing the proposed metric during subspace construction, we ensure that the features are spread evenly among the subspaces, which should contribute toward the creation of a diverse set of learners. We make the underlying assumption that there are no large groups of highly correlated features; in such a case the proposed dissimilarity measure would be too simplistic. In practice, the situations that would lead to a complete failure of the proposed metric are very rare.
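A direct transcription of Eqs. (1)–(3) into Python might look as follows. This is a sketch only; it adopts the convention that the candidate feature is counted as part of \(S_j\) when evaluating Eq. (3), which avoids empty-subspace corner cases during incremental construction:

```python
def div_m(S, j, c):
    """Diversity of inserting feature c into subspace S[j] (Eqs. (1)-(3)).
    S is a list of sets of feature indices."""
    k = len(S)
    trial = S[j] | {c}  # candidate counted as part of S_j (one possible convention)
    # Eq. (2): one minus the fraction of subspaces already containing c
    div_x = 1.0 - sum(c in s for s in S) / k
    # Eq. (3): distance to the most similar other subspace
    overlaps = [len(trial & S[l]) / len(trial) for l in range(k) if l != j]
    div_s = 1.0 - max(overlaps, default=0.0)
    return (div_x + div_s) / 2.0  # Eq. (1)

# e.g., with S = [{0, 1}, {2, 3}, set()], div_m(S, 2, 1) < div_m(S, 2, 4),
# since feature 1 already occurs in subspace 0 while feature 4 is unused
```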

3.3 Quality measures

As mentioned before, the estimation of subspace quality is based on the strength of the individual predictors. Using only individually strong features will not necessarily improve the discriminative power of the subspace, or even more so of the whole ensemble. However, we claim that reducing the frequency of occurrence of weak predictors will, on average, result in increased performance.

Let us denote the ith class label, encoded as an integer, as \(i \in \mathcal {M} = \{1, 2, \ldots , M\}\). Furthermore, let \(\mathcal {LS} = \{(x_1, i_1), (x_2, i_2), \ldots , (x_n, i_n)\}\) be the learning set consisting of n observations, \(\overline{x}^{(c)} = [x^{(c)}_1, x^{(c)}_2, \ldots , x^{(c)}_n]\) be the vector of observations of feature \(x^{(c)}\), and \(\overline{i} = [i_1, i_2, \ldots , i_n]\) be the vector of class labels associated with the observations. We define the classification accuracy on the kth fold, obtained by using a single feature \(x^{(c)}\), as \(Acc(x^{(c)}, k)\). We denote the marginal probabilities of \(\overline{x}^{(c)}\) and \(\mathcal {M}\) by \(p(x^{(c)}_j)\) and p(i), respectively, and their joint probability by \(p(x^{(c)}_j, i)\). Finally, we denote the covariance of \(\overline{x}^{(c)}\) and \(\overline{i}\) by \({\text {cov}} (\overline{x}^{(c)}, \overline{i})\), and the corresponding standard deviations by \(\sigma _{\overline{x}^{(c)}}\) and \(\sigma _{\overline{i}}\). We propose three different measures. First and foremost, a twofold cross-validation accuracy \(qual\_m_{acc}(x^{(c)})\), obtained on the training data while training on the individual feature:

$$\begin{aligned} qual\_m_{acc}\left( x^{(c)}\right) = \frac{1}{2} \sum _{k = 1}^{2} Acc\left( x^{(c)}, k\right) . \end{aligned}$$
(4)

It provides a conceptually simple metric with the important property of being adaptable to the type of chosen learner. However, depending on the dimensionality of the data, it might require training a large number of classifiers. Because of that, we propose two alternative measures.

The first one is the mutual information between the feature and the target, \(qual\_m_{mi}(x^{(c)})\):

$$\begin{aligned} qual\_m_{mi}\left( x^{(c)}\right) = \sum _{i = 1}^{M}\sum _{j = 1}^{n}p\left( x^{(c)}_j, i\right) \log {\left( {\frac{p\left( x^{(c)}_j, i\right) }{p\left( x^{(c)}_j\right) \, p(i)}}\right) }, \end{aligned}$$
(5)

while the second one is the population Pearson correlation between the cth feature and the labels:

$$\begin{aligned} qual\_m_{corr}\left( x^{(c)}\right) = {\frac{{\text {cov}} \left( \overline{x}^{(c)}, \overline{i}\right) }{\sigma _{\overline{x}^{(c)}}\, \sigma _{\overline{i}}}}. \end{aligned}$$
(6)

Because the probability characteristics of classification tasks are usually unknown, we use appropriate estimators in practice, e.g., the sample correlation coefficient to estimate the population Pearson correlation.

The idea behind using the proposed measures is to accelerate the learning process without a significant loss of accuracy.
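Under the notation above, the three measures can be sketched in Python as follows, using scikit-learn's mutual_info_classif as a practical stand-in estimator for Eq. (5); the function name and interface are illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

def feature_qualities(X, y, base_clf=None, measure="acc"):
    """Per-feature quality scores corresponding to Eqs. (4)-(6)."""
    d = X.shape[1]
    if measure == "acc":   # Eq. (4): twofold CV accuracy on a single feature
        return np.array([cross_val_score(clone(base_clf), X[:, [c]], y, cv=2).mean()
                         for c in range(d)])
    if measure == "mi":    # Eq. (5): mutual information, here a kNN-based estimate
        return mutual_info_classif(X, y)
    if measure == "corr":  # Eq. (6): absolute sample Pearson correlation with labels
        return np.array([abs(np.corrcoef(X[:, c], y)[0, 1]) for c in range(d)])
    raise ValueError(measure)
```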

The pseudocode of the Deterministic Subspace algorithm is presented in Algorithm 1.
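As a complement to the pseudocode, a minimal Python sketch of the construction loop is given below. It reuses the div_m helper from the sketch in Sect. 3.2 and assumes per-feature quality scores scaled to [0, 1] (e.g., from feature_qualities above); it is an illustration, not the reference implementation:

```python
import numpy as np

def build_subspaces(quality, k, n, alpha):
    """Greedy round-robin construction of k deterministic subspaces of n
    features each; quality holds per-feature scores in [0, 1]."""
    d = len(quality)
    S = [set() for _ in range(k)]
    for _ in range(n):        # each pass assigns one more feature per subspace
        for j in range(k):    # round-robin over the subspaces
            candidates = [c for c in range(d) if c not in S[j]]
            scores = [alpha * quality[c] + (1 - alpha) * div_m(S, j, c)
                      for c in candidates]
            S[j].add(candidates[int(np.argmax(scores))])
    return [sorted(s) for s in S]
```

Each base classifier is then trained on the columns selected for its subspace, and predictions are combined by majority voting, as in the RS sketch of Sect. 2. Since ties in the argmax are broken deterministically (by feature index), repeated runs on the same data yield identical subspaces.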

4 Experimental study

In this section, we present a detailed description of the conducted experimental study and analyze the obtained results. The main goal was to evaluate whether the proposed deterministic approach is capable of achieving at least as high accuracy as the RS method. Secondly, we tried to establish whether, and if so under what conditions, DS can actually outperform the RS method. Finally, we compared both approaches with another popular method relying on the creation of random feature subspaces, Random Forest.

4.1 Set-up

All experiments were implemented in the Python programming language; code sufficient to repeat them was made publicly available online.Footnote 1 Whenever possible, existing implementations of the classification algorithms from the scikit-learn machine learning libraryFootnote 2 were used to limit the possibility of programming errors.

The performance of the considered algorithms was evaluated on the basis of 21 benchmark datasets with varying numbers of objects and features. All of them were taken from the UCIFootnote 3 and KEELFootnote 4 repositories and, as such, are publicly available. The detailed parameters of the datasets are presented in Table 1. For every dataset, 5 \(\times \) 2-fold partitions were randomly created and used during the experiments. These partitions are available together with the code.

Table 1 Details of datasets used throughout the experiment

Seven classifiers were evaluated during the experiments, namely CART, k-nearest neighbors (kNN), linear support vector machine (SVM), Naïve Bayes, Parzen window kernel density estimation (ParzenKDE), nearest neighbor kernel density estimation (NNKDE) and Gaussian mixture model (GMM). The first four were tested in the initial stage of the experiment, with partial results published in [16], and were chosen to cover different types of algorithms. After obtaining the results, the remaining three classifiers were evaluated to establish whether the trends observable for Naïve Bayes extend to other types of nonparametric classifiers.

The hyperparameters specific to the particular classification methods were kept constant throughout the experiments. Whenever possible, their default values provided in the corresponding scikit-learn modules were used. The most significant parameters are presented in Table 2. Different numbers of subspaces \(k \in \{5, 10, \ldots , 50\}\) were evaluated in all cases. The number of features per subspace used by the RS and DS methods was fixed at half the total number of features. Additionally, the quality coefficients \(\alpha \in \{0.0, 0.1, \ldots , 1.0\}\) and three different quality measures, namely twofold cross-validation accuracy on the training set, mutual information between the features and the labels, and the absolute value of correlation between the two, were tested for the DS approach. Finally, a varying number of trees \(\in \{5, 10, \ldots , 50\}\) was used with the Random Forest method.
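For reference, the evaluated configuration space can be summarized by the following sketch (a plain dictionary; the names are illustrative, while the values are taken from the description above):

```python
evaluated_grid = {
    "k": list(range(5, 55, 5)),                       # number of subspaces / trees
    "alpha": [round(0.1 * i, 1) for i in range(11)],  # DS only
    "quality_measure": ["acc", "mi", "corr"],         # DS only
    # number of features per subspace: fixed at half the total feature count
}
```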

Table 2 Values of base classifiers hyperparameters

4.2 Results

The average accuracy of the proposed method for different classification algorithms, quality measures and quality coefficients \(\alpha \) is presented in Fig. 1. Only a selection of \(\alpha \) values is shown to improve the clarity of the presentation; the most significant changes in the algorithm's behavior were observed for the selected values. The classification accuracy of the RS method is used as a baseline. The average scores for specific base learners and datasets can be found in Fig. 2, where accuracy was used as the feature quality metric. Results of the combined \(5 \times 2\) cross-validation F test [1] are depicted in Fig. 3. It presents, for all k and \(\alpha \) parameters, the difference between the number of datasets on which the proposed method achieved statistically significantly better and worse results than its random counterpart. Finally, the average rankings obtained from the Friedman N \(\times \) N test are given in Table 3. Once again, only a subset of the considered \(\alpha \) parameters and a single quality measure, twofold cross-validation accuracy, is presented for clarity. In addition to the deterministic and random feature subspace methods, the performance of the Random Forest classifier was also reported in this step. The complete results of the experimental study are available online.Footnote 5
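For reproducibility, a sketch of the combined \(5 \times 2\) cross-validation F test of [1] is given below; it is a generic implementation of the published statistic, not code taken from the study. The argument holds the per-fold accuracy differences between the two compared classifiers:

```python
import numpy as np
from scipy.stats import f as f_dist

def combined_5x2cv_f_test(diffs):
    """Combined 5x2cv F test; diffs is a 5x2 array of per-fold
    accuracy differences between two classifiers."""
    diffs = np.asarray(diffs, dtype=float)
    means = diffs.mean(axis=1, keepdims=True)
    s2 = ((diffs - means) ** 2).sum(axis=1)   # variance estimate per replication
    f_stat = (diffs ** 2).sum() / (2.0 * s2.sum())
    p_value = f_dist.sf(f_stat, 10, 5)        # F distribution with (10, 5) dof
    return f_stat, p_value
```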

4.3 Discussion

The presented results indicate that DS can not only achieve performance similar to that of the random methods but also, depending on the type of classifier used, significantly outperform them. The observed accuracy gain was especially high with certain types of nonparametric classifiers, namely Naïve Bayes, Gaussian mixture models and Parzen kernel density estimation. In all of these cases, large values of \(\alpha \) returned the best performance, which corresponds to favoring individually strong features over higher diversity. This may be explained by the nature of the considered classifiers, which assume low or no correlation among features and thus directly benefit from the way the DS algorithm creates feature subspaces for its base learners.

In the second group of classifiers, consisting of CART, k-nearest neighbors, SVM and nearest neighbor kernel density estimation, choosing large values of \(\alpha \) actually led to a performance drop. The behavior of the proposed method was more stable with small values of \(\alpha \): in that setting, for all tested classifiers, the average rank was slightly higher than with the RS method. Most notably, when combined with a decision tree, the proposed method achieved the highest average rank for a particular choice of \(\alpha \), higher than the Random Forest classifier.

Despite their lower computational complexity, the alternative measures of feature quality, namely mutual information and correlation, resulted on average in a degradation of performance.

Overall, the proposed method operated in two distinguishable modes. The first one, in which small values of the \(\alpha \) parameter are applied, presents a comparable alternative to the RS approach, offering slightly higher performance and the sought-after stability. The second one, with \(\alpha \) set to higher values, requires greater care but can lead to significantly better results when combined with particular types of classifiers.

Fig. 1 Correct classification rates (CCR) averaged over all datasets and examined numbers of subspaces for the Random Subspace and Deterministic Subspace algorithms. Three different feature quality measures were considered in combination with the Deterministic Subspace method, namely accuracy, mutual information and correlation

Fig. 2 Correct classification rates for specific datasets and base learners. The accuracy was used as a feature quality measure

Fig. 3 Differences between the number of datasets on which the Deterministic Subspace algorithm achieved statistically significantly better (positive values) and worse (negative values) results than the Random Subspace algorithm

Table 3 Average rankings of the algorithms

5 Conclusions and future work

A novel classification method based on feature subspaces, Deterministic Subspace, was proposed and evaluated throughout this study. During the experiments, we established that it presents a stable alternative to the Random Subspace approach, also capable of outperforming the Random Forest method with a proper choice of classifiers. In contrast to the random methods, the Deterministic Subspace algorithm always returns the same model for a given learning dataset. Additionally, we observed that the proposed method can significantly outperform the random approach when used in combination with some types of classifiers.

The main limitation of the proposed algorithm lies in its high computational complexity. The subspace creation procedure can take a significantly greater amount of time compared to random approaches, especially when the number of features is large. In the future, we plan to improve the computational efficiency of our method in order to speed up the training process. Two less computationally expensive feature quality metrics were already proposed in the course of this paper to remedy that issue, but at the expense of classification accuracy; finding better estimators of feature quality could significantly improve the practical usefulness of the proposed method. Furthermore, the possibility of parallelizing the algorithm could be investigated to make it more suitable for larger datasets.

Additionally, the exact conditions under which the Deterministic Subspace method is capable of achieving significantly higher performance remain unknown. Determining which types of classifiers benefit more from individual feature quality than from the diversity of the produced subspaces was partially done in this paper; a more extensive evaluation would, however, be necessary. This remains for further study.