The deterministic subspace method for constructing classifier ensembles
Abstract
Ensemble classification remains one of the most popular techniques in contemporary machine learning, being characterized by both high efficiency and stability. An ideal ensemble comprises mutually complementary individual classifiers characterized by high diversity and accuracy. This may be achieved, e.g., by training individual classification models on feature subspaces. Random Subspace is the most well-known method based on this principle. Its main limitation lies in its stochastic nature, which prevents it from being considered a stable classifier suitable for real-life applications. In this paper, we propose an alternative approach, the Deterministic Subspace method, capable of creating subspaces in a guided and repeatable manner. Thus, our method always converges to the same final ensemble for a given dataset. We describe the general algorithm and three dedicated measures used in the feature selection process. Finally, we present the results of an experimental study, which demonstrate the usefulness of the proposed method.
Keywords
Machine learning, Classifier ensemble, Feature subspaces, Classifier diversity
1 Introduction
Contemporary machine learning has to deal with the escalating complexity of problems that appear with the increasing prevalence of data. Standard classification methods are often unable to capture complex patterns or cannot maintain their generalization capabilities, leading to either under- or overfitting. Therefore, models that can capture multidimensional data properties while avoiding the mentioned pitfalls are very desirable. A classifier ensemble, also known as a multiple classifier system or classifier committee [37], is an example of such an approach. By combining the predictions of a number of simpler models, classifier ensembles can produce more efficient and flexible recognition systems, while at the same time benefiting from the high generality of the base classifiers [17].
To ensure satisfactory performance, an ensemble has to meet several requirements. Perhaps the most important one is the need to supply a pool of diverse classifiers [34]. This can be achieved in various ways, e.g., by training every learner on different features. The motivation behind this approach is twofold: firstly, to simplify the training procedure by reducing the number of features used by each learner, and secondly, to allow each learner to explore different properties of the supplied subspaces [11]. The most notable examples of techniques relying on partitioning features into subspaces include the Random Subspace (RS) [15] and Random Forest (RF) [5] methods. While simple and computationally efficient, such approaches have a major drawback. Due to the random nature of the feature subset construction process, they are inherently unstable and prone to producing models of poor quality in worst-case scenarios. Additionally, each run of these algorithms yields different feature subsets and thus different base classifiers. It should be underlined that in real-life applications stability is often a necessary property, which significantly limits the use of the mentioned methods, particularly the Random Subspace approach.
To overcome this limitation, we consider how the feature subspaces can be constructed in a guided way, in order to make the resulting classifier ensemble more stable and less prone to producing underperforming base learners. As a result of our investigation, we present a novel approach that forms feature subspaces in a fully deterministic manner. It is based on the same idea as the previously mentioned methods, namely creating an ensemble of simpler, diverse classifiers, each trained on a feature subset. However, instead of a stochastic subspace generation procedure, the proposed approach employs a guided search strategy. It assigns new features greedily in a round-robin fashion, based on the defined feature quality and subspace diversity measures. The final decision of the ensemble is made using the majority voting rule. The proposed method is as flexible as the Random Subspace approach and can work with any type of base learner. However, the created ensemble is more stable and uses information about the predictive power of individual features.
The main contributions of this work are as follows:

Proposition of a new classifier ensemble learning algorithm, Deterministic Subspace (DS), which forms a set of diverse classifiers trained on selected features, where the attributes used by the base classifiers are chosen by a guided search strategy.

Improvement over random subspace selection, leading to a stable ensemble-forming procedure.

Extensive experimental results on a set of benchmark datasets, which evaluate the relationship between the quality of the DS algorithm, its parameters and the chosen model of base classifiers.

Demonstration of the high usefulness of the proposed DS approach for specific classifiers that take advantage of less correlated subspaces.
2 Related work
Ensemble classification methods have several properties that make them one of the most prevalent techniques in the supervised learning domain. Perhaps most importantly, models produced using this paradigm tend to be capable of approximating complex decision boundaries while remaining resilient to overfitting. A performance gain is possible by exploiting the local competencies of base learners. At the same time, ensembles incorporate mechanisms that prevent the selection of the worst learner from the pool [20]. Additionally, this approach is highly flexible: the parameters of the learners are often easily adjustable, and the training procedure can be parallelized without much effort.
Even though ensembles solve some of the issues related to classification problems, they also introduce several new challenges related to their design. The necessity of providing a diverse pool of learners while preserving their individual accuracies [36] is the one on which we focus in this paper. This issue is especially severe because the term diversity itself is ambiguous. Proposing an agreed-upon definition of diversity remains an important open question [6], particularly in the context of the classification task. Existing diversity measures are therefore only approximations, and the exact extent to which diversity influences ensemble performance remains unclear [12]. At the same time, it is hard to dispute that learners should display some differing characteristics, since adding identical models would not contribute to the efficacy of the formed ensemble. Diversity may be introduced in several ways:

Varying learning models by using either completely different classification algorithms or the same algorithms with modified hyperparameters.

Varying outputs of learners by decomposing classification task, for instance into binary problems.

Varying inputs of learners by supplying different partitions of dataset or different feature subspaces during training.
Alternatively, instead of using entirely different models, one may train the same model using different sets of hyperparameters or initial conditions. This approach is based on the assumption that the search space explored during training is complex, so that different runs may reach different local extrema. The most notable examples of this method include ensembles of neural networks with an early stopping condition [35] and Support Vector Machines with varying kernels [32].
Another approach is based on manipulating the classifier outputs. Common techniques rely on a multiclass decomposition, which yields specialized learners trained to recognize a reduced number of classes. A dedicated combination method, such as Error-Correcting Output Codes [25], is then used to reconstruct the original multiclass task. The most notable examples of this type of technique include binarization [13], hierarchical decomposition [27] and classifier chains [22].
Diversity may also be induced by input manipulation in either the data or the feature space. The former approach relies on the assumption that introducing variance into the training instances enables learners to capture the properties of different subsets of objects [29]. Bagging [30] and Boosting [2] are the most significant realizations of this approach, but ensembles may also be trained on the basis of clusters to preserve spatial relations among instances [10].
Finally, one may increase the diversity by manipulating the feature space. This can be done in either a randomized [21] or a guided manner, using feature selection [8] or global optimization methods [7, 24]. The most notable techniques based on this paradigm are the previously mentioned Random Subspace and Random Forest methods. Both of them rely on randomly created feature subspaces to increase the diversity of the ensemble.
Despite its simplicity, the Random Subspace method has gained popularity in the machine learning community. Its main advantage over Random Forest lies in its flexibility, as it can be used with any type of base learner [31]. Several interesting variations of this approach have appeared in recent years. Polikar et al. [28] proposed \(\hbox {Learn}^{++}\).MF, a modification of the popular \(\hbox {Learn}^{++}\) algorithm that utilizes Random Subspaces to handle missing values in classified instances. In the case of incomplete information, only classifiers trained on the available features are used in the classification phase. Li et al. [18] used Random Subspaces together with distance-based lazy learners and combined them using Dempster’s rule. Mert et al. [23] developed a weighted combination of Random Subspaces, where the weight assigned to each subspace was based on its ability to provide good class separability. Random Subspaces have also been successfully used for semi-supervised learning. Yaslan and Cataltepe [39] used randomized sets of features for co-training an ensemble of classifiers, while Yu et al. [40] used them for semi-supervised dimensionality reduction using graphs. Recent work by Carbonneau et al. [9] showed that this method offers very good performance in multi-instance learning. Another reason behind the popularity of the Random Subspace approach lies in the many successful applications of this technique to real-life problems. Plumpton et al. [26] used it for real-time classification of fMRI data, Xia et al. [38] for hyperspectral image analysis, and Zhu and Xue [41] combined it with tensor analysis for face recognition.
However, a significant drawback of the Random Subspace approach (as well as of Random Forest) lies in its purely random nature. It lacks any deterministic element that would maintain stability or ensure that the same model is trained for given data in every repetition. This is especially crucial for real-life applications, where a final model is required for some purpose, e.g., being embedded in a hardware unit. Some researchers have tried to improve the stability of this method [19]; however, no fully deterministic solution has been proposed so far.
3 Deterministic subspace method
Due to the random subspace creation procedure, RS and RF are conceptually simple and computationally efficient methods. However, because of their stochastic components, there is a risk that the produced subspaces lack the discriminative power necessary for a proper separation of classes. Additionally, even if individually strong subspaces are created, the lack of diversity among them may deteriorate the performance of the ensemble. In that sense, these methods may be viewed as unstable. Furthermore, random techniques can be somewhat unsatisfying: even if they produce highly accurate models, it might be unclear what makes them good.
To overcome these limitations, we propose an alternative to the RS method, a Deterministic Subspace (DS) approach. Our main goal here is to offer a stable and deterministic substitute for the RS method. In this section, a detailed description of the algorithm will be given, along with a discussion of several feature quality and subspace diversity measures that may be used as its components.
3.1 Algorithm

The DS algorithm is controlled by three parameters:

the number of subspaces to be created k,

the number of features selected for every subspace n,

the weight coefficient \(\alpha \), indicating preference toward either the feature quality or the diversity.
A smaller number of features per subspace n should, in principle, result in weaker base learners. At the same time, the subspaces created in that case are more diverse, since there is less overlap between their features. A larger number of subspaces k leads to a bigger ensemble, but decreases the diversity of the subspaces.
3.2 Diversity measure
It is intuitive that increasing the diversity of a classifier ensemble should lead to better accuracy; however, there is no formal proof of this dependency [4]. Several different approaches to measuring the diversity of a classifier ensemble have been proposed in the existing literature. However, most of them rely on predictions made by the classifiers [3] and as a result are computationally expensive. We propose a naive, yet fast approach based on measuring the evenness of the feature spread among the subspaces.
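One possible way to quantify the evenness of the feature spread can be sketched as follows. The exact formula is not given in this text, so the normalized Shannon entropy of feature usage used below is an assumption chosen for illustration, and the function name `feature_spread_evenness` is hypothetical. The value is 1 when every feature is used equally often and 0 when all subspaces reuse a single feature.

```python
import math
from collections import Counter


def feature_spread_evenness(subspaces, n_features):
    """Hypothetical diversity measure: normalized entropy of feature usage.

    subspaces  -- list of lists of feature indices
    n_features -- total number of features in the dataset
    Returns a value in [0, 1]; higher means features are spread more evenly.
    """
    # Count how many times each feature appears across all subspaces.
    counts = Counter(f for subspace in subspaces for f in subspace)
    total = sum(counts.values())
    if total == 0 or n_features < 2:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)
    # Divide by the maximum possible entropy (uniform usage of all features).
    return entropy / math.log(n_features)
```

Because the measure depends only on feature indices, it requires no classifier predictions, which matches the motivation of keeping the diversity estimate cheap.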
3.3 Quality measures
As mentioned before, the estimation of subspace quality is based on the strength of the individual predictors. Using only individually strong features will not necessarily improve the discriminative power of the subspace, let alone of the whole ensemble. However, we claim that reducing the frequency of occurrence of weak predictors will, on average, result in increased performance.
The idea behind using the proposed measures is to accelerate the learning process without a significant loss of accuracy.
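The three feature quality measures discussed in this paper (twofold cross-validation accuracy, mutual information and correlation) could be estimated per feature roughly as in the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a decision tree is used as a placeholder base classifier, and the correlation is taken between a feature and the numeric class label, which is only meaningful for binary or ordinal labels.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def accuracy_quality(X, y, estimator=None):
    """Twofold cross-validation accuracy of a classifier trained on one feature."""
    if estimator is None:
        # Placeholder base learner (assumption); any scikit-learn classifier works.
        estimator = DecisionTreeClassifier(random_state=0)
    return np.array([
        cross_val_score(estimator, X[:, [j]], y, cv=2, scoring="accuracy").mean()
        for j in range(X.shape[1])
    ])


def mutual_information_quality(X, y):
    """Estimated mutual information between each feature and the class label."""
    return mutual_info_classif(X, y, random_state=0)


def correlation_quality(X, y):
    """Absolute Pearson correlation between each feature and the numeric label."""
    y = np.asarray(y, dtype=float)
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    # Constant features produce NaN correlations; treat them as uninformative.
    return np.abs(np.nan_to_num(corr))
```

The accuracy-based measure trains a model per feature and is therefore the most expensive of the three; the other two avoid training the base learner altogether.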
The pseudocode of the Deterministic Subspace algorithm is presented in Algorithm 1.
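A minimal sketch of the greedy round-robin construction described in this section is given below. It is not a reproduction of Algorithm 1: the scoring rule (a convex combination of the mean feature quality and a diversity term weighted by \(\alpha \)), the entropy-based `evenness` helper and all function names are assumptions made for illustration. Ties are broken by the lowest feature index, so the output is fully determined by the inputs.

```python
import math


def evenness(subspaces, n_features):
    """Hypothetical diversity term: normalized entropy of feature usage."""
    counts = [0] * n_features
    for subspace in subspaces:
        for f in subspace:
            counts[f] += 1
    total = sum(counts)
    if total == 0 or n_features < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(n_features)


def deterministic_subspaces(quality, k, n, alpha):
    """Build k subspaces of n features each, one feature per subspace per round.

    quality -- per-feature quality scores (higher is better)
    alpha   -- weight in [0, 1]; larger values favor quality over diversity
    """
    d = len(quality)
    if k < 1 or not 1 <= n <= d:
        raise ValueError("need k >= 1 and 1 <= n <= number of features")
    subspaces = [[] for _ in range(k)]
    for _ in range(n):                      # n rounds
        for i in range(k):                  # round-robin over subspaces
            best_feature, best_score = None, -math.inf
            for f in range(d):
                if f in subspaces[i]:
                    continue                # no duplicates within a subspace
                candidate = subspaces[i] + [f]
                trial = subspaces[:i] + [candidate] + subspaces[i + 1:]
                mean_quality = sum(quality[j] for j in candidate) / len(candidate)
                score = alpha * mean_quality + (1 - alpha) * evenness(trial, d)
                if score > best_score:      # strict '>' keeps the lowest index on ties
                    best_feature, best_score = f, score
            subspaces[i].append(best_feature)
    return subspaces
```

For example, `deterministic_subspaces([0.9, 0.5, 0.8, 0.1], k=2, n=2, alpha=1.0)` ignores diversity and places the two strongest features in both subspaces, whereas `alpha=0.0` spreads all four features across the two subspaces. Each base learner would then be trained on its own subspace and the predictions combined by majority voting.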
4 Experimental study
In this section, we present a detailed description of the conducted experimental study and analyze the obtained results. The main goal was to evaluate whether the proposed deterministic approach is capable of achieving at least as high an accuracy as the RS method. Secondly, we tried to establish whether, and if so under what conditions, DS can actually outperform the RS method. Finally, we compared both approaches with another popular method relying on the creation of random feature subspaces, Random Forest.
4.1 Setup
All experiments were implemented in the Python programming language; code sufficient to repeat them was made publicly available at.^{1} Whenever possible, the existing implementations of the classification algorithms from the scikit-learn machine learning library^{2} were used to limit the possibility of programming errors.
Table 1 Details of datasets used throughout the experiment
No.  Name  Features  Objects  Classes 

1  winequality  11  6497  11 
2  vowel  13  990  11 
3  vehicle  18  846  4 
4  segment  19  2310  7 
5  ring  20  7400  2 
6  thyroid  21  7200  3 
7  mushroom  22  5644  2 
8  chronic kidney  24  157  2 
9  wdbc  30  569  2 
10  ionosphere  33  351  2 
11  dermatology  34  358  6 
12  texture  40  5500  11 
13  biodegradation  41  1055  2 
14  spectfheart  44  267  2 
15  spambase  57  4597  2 
16  sonar  60  208  2 
17  splice  60  3190  3 
18  optdigits  64  5620  10 
19  mice protein  80  552  8 
20  coil2000  85  9822  2 
21  movement libras  90  360  15 
Seven classifiers were evaluated during the experiments, namely CART, k-nearest neighbors (kNN), linear support vector machine (SVM), Naïve Bayes, Parzen window kernel density estimation (ParzenKDE), nearest neighbor kernel density estimation (NNKDE) and Gaussian mixture model (GMM). The first four were tested in the initial stage of the experiment, with partial results published in [16], and were chosen to cover different types of algorithms. After obtaining the results, the remaining three classifiers were evaluated to establish whether the trends observable for Naïve Bayes extend to other types of nonparametric classifiers.
Table 2 Values of the base classifiers' hyperparameters
Classifier  Parameters 

kNN  k = 5 
SVM  kernel = linear, C = 1.0 
ParzenKDE  window size = 1.0 
NNKDE  kernel = Gaussian, bandwidth = 1.0 
GMM  covariance = diagonal, iterations = 100 
4.2 Results
The average accuracy of the proposed method for different classification algorithms, quality measures and quality coefficients \(\alpha \) is presented in Fig. 1. Only a selection of \(\alpha \) values is shown to improve the clarity of the presentation; the most significant changes in the algorithm's behavior were observed for the selected values. The classification accuracy of the RS method is used as a baseline. The average scores for specific base learners and datasets are given in Fig. 2, where accuracy was used as the feature quality metric. The results of the combined \(5 \times 2\) cross-validation F test [1] are depicted in Fig. 3. It presents, for all k and \(\alpha \) parameters, the difference between the number of datasets on which the proposed method achieved statistically significantly better and worse results than its random counterpart. Finally, the average rankings obtained from the Friedman N \(\times \) N test are given in Table 3. Once again, only a subset of the considered \(\alpha \) parameters and a single quality measure, twofold cross-validation accuracy, are presented for clarity. In addition to the deterministic and random feature subspace methods, the performance of the Random Forest classifier is also reported in this step. The complete results of the experimental study can be found at.^{5}
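For reference, the statistic of the combined \(5 \times 2\) cross-validation F test can be computed from the ten per-fold score differences between two methods, as sketched below. With \(p_i^{(j)}\) denoting the difference on fold j of replication i and \(s_i^2\) the variance estimate of replication i, the statistic is \(f = \sum _i \sum _j (p_i^{(j)})^2 / (2 \sum _i s_i^2)\), which is compared against an F distribution with 10 and 5 degrees of freedom. The function name is hypothetical, and the 0.05 significance level is an assumption, since the level used in the study is not stated in this text.

```python
# Upper 5% critical value of the F distribution with (10, 5) degrees of freedom.
F_CRITICAL_10_5 = 4.74


def combined_5x2cv_f(differences):
    """Combined 5x2 cv F statistic.

    differences -- five pairs (one per replication of twofold cross-validation),
                   each holding the score difference between the two compared
                   methods on the two folds.
    """
    if len(differences) != 5 or any(len(pair) != 2 for pair in differences):
        raise ValueError("expected 5 replications with 2 folds each")
    # Numerator: sum of all ten squared differences.
    numerator = sum(p ** 2 for pair in differences for p in pair)
    # Denominator: twice the sum of the per-replication variance estimates.
    variance_sum = 0.0
    for p1, p2 in differences:
        mean = (p1 + p2) / 2
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2
    if variance_sum == 0:
        return math_inf if numerator > 0 else 0.0
    return numerator / (2 * variance_sum)


math_inf = float("inf")
```

A difference between two methods on a dataset would be reported as significant when the returned value exceeds `F_CRITICAL_10_5`; the sign of the mean difference then tells which method was better.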
4.3 Discussion
The presented results indicate that DS can not only achieve performance similar to that of the random methods but also, depending on the type of classifier used, significantly outperform them. The observed accuracy gain was especially high with certain types of nonparametric classifiers, namely Naïve Bayes, Gaussian mixture models and Parzen kernel density estimation. In all of these cases, large values of \(\alpha \) returned the best performance, which corresponds to favoring individually strong features over higher diversity. This may be caused by the nature of the considered classifiers, which assume low or no correlation among features and thus directly benefit from the way the DS algorithm creates feature subspaces for its base learners.
In the second group of classifiers, consisting of CART, k-nearest neighbors, SVM and nearest neighbor kernel density estimation, choosing large values of \(\alpha \) actually led to a performance drop. The behavior of the proposed method was more stable with small values of \(\alpha \): for all tested classifiers, the average rank was slightly better (numerically lower) than that of the RS method in that setting. Most notably, when combined with a decision tree, the proposed method achieved the best average rank for a particular choice of \(\alpha \), ahead of the Random Forest classifier.
Despite their lower computational complexity, the alternative measures of feature quality, namely mutual information and correlation, resulted on average in degraded performance.
Table 3 Average rankings of the algorithms
Position  Algorithm  Ranking 

1  DS(CART, 0.3)  5.7619 
2  RandomForest  6.4286 
3  RS(CART)  7.4762 
4  DS(CART, 0.5)  8 
5  DS(SVM, 0.3)  9.5238 
6  DS(SVM, 0.5)  9.6429 
7  DS(kNN, 0.5)  9.8571 
8  DS(kNN, 0.3)  9.9048 
9  RS(SVM)  10.7619 
10  RS(kNN)  10.9524 
11  DS(CART, 0.7)  11.7143 
12  DS(SVM, 0.7)  12.9762 
13  DS(kNN, 0.7)  13.1905 
14  DS(NNKDE, 0.5)  14.1429 
15  DS(NNKDE, 0.3)  14.381 
16  RS(NNKDE)  15.7143 
17  DS(NNKDE, 0.7)  15.8571 
18  DS(NaiveBayes, 0.7)  16.8571 
19  DS(NaiveBayes, 0.5)  17.5714 
20  DS(NaiveBayes, 0.3)  19.3333 
21  DS(ParzenKDE, 0.7)  20.2619 
22  RS(NaiveBayes)  20.4762 
23  DS(ParzenKDE, 0.5)  20.5952 
24  DS(GMM, 0.5)  21.4286 
25  DS(ParzenKDE, 0.3)  21.7857 
26  DS(GMM, 0.7)  21.8095 
27  RS(ParzenKDE)  22.0714 
28  DS(GMM, 0.3)  22.4762 
29  RS(GMM)  24.0476 
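Average rankings of the kind reported in the table above are obtained by ranking the algorithms on each dataset (rank 1 for the highest accuracy, tied algorithms sharing the mean of their positions) and averaging these ranks over all datasets. The sketch below illustrates this computation; the function name and the input layout are assumptions made for illustration.

```python
def average_ranks(scores):
    """Average Friedman-style ranks.

    scores -- dict mapping an algorithm name to a list of accuracies,
              one per dataset (all lists of equal length)
    Returns a dict mapping each algorithm name to its average rank
    (lower is better).
    """
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        # Best accuracy first.
        ordered = sorted(names, key=lambda name: -scores[name][d])
        pos = 0
        while pos < len(ordered):
            # Extend the group while the following algorithms are tied.
            end = pos
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]][d] == scores[ordered[pos]][d]):
                end += 1
            # Mean of the 1-based positions pos + 1 .. end + 1.
            shared_rank = (pos + end) / 2 + 1
            for name in ordered[pos:end + 1]:
                totals[name] += shared_rank
            pos = end + 1
    return {name: totals[name] / n_datasets for name in names}
```

With 29 compared configurations and 21 datasets, every average rank is a multiple of 1/42, which is consistent with the values listed in the table.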
5 Conclusions and future work
A novel classification method based on feature subspaces, Deterministic Subspace, was proposed and evaluated in this study. During the experiments, we established that it offers a stable alternative to the Random Subspace approach and is also capable of outperforming the Random Forest method with a proper choice of classifiers. In contrast to the random methods, the Deterministic Subspace algorithm always returns the same model for a given learning dataset. Additionally, we observed that the proposed method can significantly outperform the random approach when used in combination with some types of classifiers.
The main limitation of the proposed algorithm lies in its high computational complexity. The subspace creation procedure can take a significantly greater amount of time than the random approaches, especially when the number of features is large. In the future, we plan to improve the computational efficiency of our method in order to speed up the training process. Two less computationally expensive feature quality metrics were proposed in this paper to remedy that issue, but at the expense of classification accuracy. Finding better estimators of feature quality could significantly improve the practical usefulness of the proposed method. Furthermore, the possibility of parallelizing the algorithm could be investigated to make it more suitable for larger datasets.
Additionally, the exact conditions under which the Deterministic Subspace method is capable of achieving significantly higher performance remain unknown. This paper only partially determined what types of classifiers benefit more from individual feature quality than from the diversity of the produced subspaces; a more extensive evaluation would be necessary. This is left for further study.
Acknowledgements
This work was supported by the Polish National Science Center under the grant no. UMO-2015/19/B/ST6/01597 as well as the PL-Grid Infrastructure.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.