1 Introduction

In supervised classification tasks, the aim is to construct a predictor that assigns a class label to new observations. To do so, training data are used in which a class label is associated with each pattern, and each observation is described by a feature vector. However, in many real-life classification problems one often encounters imprecise data, including non-informative features, which dramatically increase the classification error of the algorithms (Nettleton et al. 2010).

To overcome this problem, feature selection methods are usually recommended before classification to mitigate the effect of such non-informative features (Liu et al. 2014; Mahmoud et al. 2014). These methods search for the most discriminative subset of the original features, i.e. the subset that increases the classification performance of a classifier. However, different feature selection methods yield different feature subsets for the same data set, so the assessed relevance of a feature varies with the method. This encourages combining the results of several of the best feature subsets.

Combining multiple classifiers, known as ensemble techniques, has emerged as a promising approach to improving the classification performance of weak learners and has gained a lot of interest in the last two decades (Barandela et al. 2013; Bauer and Kohavi 1999; Maclin and Opitz 2011; Melville et al. 2004). These techniques lead to a substantial reduction in classification error in many real-life applications and, in general, are more resilient to non-informative features in the data than an individual model (Khoshgoftaar et al. 2011; Melville et al. 2004). One of the simplest ensemble techniques is bootstrap aggregation (bagging), which combines the outputs of classifiers constructed on randomly generated bootstrap training sets (Breiman 1996a). In bagging, B bootstrap samples are randomly drawn from the learning set, and a base learner is developed on each of these samples. A new observation is then classified by majority voting of these individual classifiers. Bagging has been used with numerous variations in the literature (Bauer and Kohavi 1999; Hothorn and Lausen 2003a, b). It has been demonstrated that bagging can improve the prediction accuracy of weak classifiers such as decision trees (Breiman 1996a; Hothorn et al. 2004; Hothorn and Lausen 2005).

One of the simplest and oldest methods for classification is the k nearest neighbours (kNN) classifier. It assigns an unknown observation to the majority class among its k nearest neighbours in the training data, as measured by a distance metric (Cover and Hart 1967; Guvenir and Akkus 1997). Despite its simplicity, kNN gives competitive results and in some cases even outperforms more complex learning algorithms. However, kNN is affected by non-informative features in the data, which is often the case with high dimensional data. Attempts have been made to improve the performance of the nearest neighbours classifier by ensemble techniques. Some related work on ensembles of kNN classifiers can be found in Grabowski (2002), Domeniconi and Yan (2004), Zhou and Yu (2005), Hall and Samworth (2005) and Samworth (2012).
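The voting rule described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming Euclidean distance and breaking ties in favour of the smallest label; the function name `knn_predict` and the toy data are hypothetical and not taken from any package.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Assumes Euclidean distance; other metrics could be substituted.
    """
    # Pair each training point's distance to the query with its label.
    dists = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    # Keep the k closest points.
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    votes = Counter(label for _, label in nearest)
    # Majority class; ties go to the smallest label (an arbitrary choice).
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


# Toy data: two well-separated groups in two dimensions.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [1, 1, 1, 2, 2, 2]
print(knn_predict(train_x, train_y, (0.05, 0.05), k=3))
```

Because every coordinate enters the distance with equal weight, a feature that carries no class information perturbs the neighbour ranking just as much as an informative one, which is the sensitivity discussed above.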

Bay (1998) proposes an ensemble of nearest neighbour classifiers in which each member classifier has access to a random feature subset only, and the outcomes of these multiple nearest neighbour classifiers are combined for the final decision. A similar approach based on random feature subsets, random kNN, which builds on the idea of random forest, is proposed for the classification of high dimensional data sets (Li et al. 2011). Li et al. (2011) rank the features according to their importance and obtain a final set of features for the final model.

In this manuscript we suggest an ensemble of subset of kNN classifiers (ESkNN), particularly to deal with the issue of non-informative features in a data set. We apply ESkNN to benchmark and simulated classification problems and compare the results with those of simple kNN, bagged kNN (BkNN), random kNN (RkNN), the ensemble based on the multiple feature subset method (MFS), random forest (RF) and support vector machines (SVM). Experiments are carried out on the data sets with their original feature sets and with some added non-informative features.

2 Ensemble of subset of kNN classifiers

Let \({\mathcal {L}}=\{(\mathbf {x}_i,y_i), i=1, \ldots , n\}\) be a training set consisting of n independent observations, where \(\mathbf {x}_i= (x_{i1}, x_{i2}, \ldots , x_{id})\) is a d-dimensional feature vector and y is the vector of class labels with \(y_i \in \{1, \ldots , J\}\), J being the total number of classes. Here we consider the two class problem, thus \(y_i \in \{1,2\}\). Based on this available data set \({\mathcal {L}}\), a classifier predicts the class label for a new/test observation with feature vector \(\mathbf {x}^\prime \). Divide the training data \({\mathcal {L}}\) into two parts, \({\mathcal {L}_{T}}\) and \({\mathcal {L}_{\textit{V}}}\), the first for construction of the classifiers and the second for validation. For simplicity we denote the set used for construction of the models, \({\mathcal {L}_{T}}\), by \({\mathcal {L} ^ *}\). Let us denote the d input features in \({\mathcal {L} ^ *}\) by \({\mathbf {P}} = (p_1,p_2,p_3, \ldots , p_d)\). For a given subset size, say l, where \(l < d\), a random subset of features, \({\mathbf {P}}^{l}\), is drawn from \(\mathbf {P}\). Based on the randomly selected features a bootstrap sample is drawn from \({\mathcal {L} ^ *}\). The new bootstrap learning set, \({\mathcal {L}^*}^{(l)}\), consists of l-dimensional feature vectors. This process is repeated until we get m training sets, \({\mathcal {L}^ *}^{(1l)}, \ldots , {\mathcal {L}^ *}^{(ml)}\), each of dimension \(n\times (l+1)\), i.e. l features plus the class label. The base kNN classifier is constructed on each of these bootstrap training sets, so that a set of m classifiers is generated.
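The construction of a single learning set \({\mathcal {L}^*}^{(l)}\) can be pictured with the following sketch. It is an illustration under stated assumptions rather than the package implementation: the function name `draw_training_set`, the seed and the toy data are hypothetical, and the bootstrap is taken to be n draws with replacement.

```python
import random


def draw_training_set(data, labels, l, rng):
    """Return one bootstrap learning set restricted to l random features.

    data   : list of n feature vectors, each of length d
    labels : list of n class labels
    l      : feature subset size, l < d
    """
    n, d = len(data), len(data[0])
    # Random feature subset P^l, drawn without replacement.
    features = sorted(rng.sample(range(d), l))
    # Bootstrap sample of n row indices, drawn with replacement.
    rows = [rng.randrange(n) for _ in range(n)]
    # Rows never drawn form the out-of-bag set.
    oob = sorted(set(range(n)) - set(rows))
    sub_x = [[data[i][j] for j in features] for i in rows]
    sub_y = [labels[i] for i in rows]
    return features, rows, oob, sub_x, sub_y


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
data = [[float(i + j) for j in range(6)] for i in range(10)]  # n = 10, d = 6
labels = [1] * 5 + [2] * 5
features, rows, oob, sub_x, sub_y = draw_training_set(data, labels, l=2, rng=rng)
```

Each call yields n rows with l feature columns plus the label vector, matching the \(n\times (l+1)\) layout of a learning set.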

When a bootstrap sample of the same size n is drawn with replacement from the training set, approximately \(\frac{1}{3}\) of the observations are left out of that sample. These observations are called out-of-bag (OOB) observations and can be utilized for estimation of the classification error (Breiman 1996b). In our framework we use the OOB sample for the assessment of each classifier. The m classifiers are ranked according to their individual classification accuracy on the OOB sample, and the first h of the m classifiers are selected. The selected classifiers are then assessed for their collective contribution as an ensemble on the validation set \({\mathcal {L}_{\textit{V}}}\). This is done by starting from the best of the h classifiers and then adding the remaining classifiers to the ensemble one by one.
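The size of the left-out fraction follows from a short calculation, added here as a supplementary step under the assumption that the n draws are independent and made with replacement. The probability that a given observation is never drawn is

$$\begin{aligned} \left( 1-\frac{1}{n}\right) ^{n} \longrightarrow e^{-1} \approx 0.368 \quad \text {as } n \rightarrow \infty , \end{aligned}$$

so that roughly one third of the training observations are expected to be out-of-bag for each classifier.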

The formation of the ensemble of subset of kNN classifiers can be summarized as:

  1. Draw a random sample of size \(l < d\), without replacement, of features from the feature vector \(\mathbf {P}\) of \({\mathcal {L}^*}\), and denote the resulting feature vector by \(\mathbf {P}^{l}\).

  2. Based on the selected random feature subset \(\mathbf {P}^{l}\), draw a bootstrap sample of size n, \({\mathcal {L}^*}^{(l)}\), from \({\mathcal {L}^* }\).

  3. Construct the kNN classifier on \({\mathcal {L}^*}^{(l)}\).

  4. Calculate the accuracy of the classifier on the OOB sample using the same feature set as used for its construction.

  5. Iterate steps (1) to (4) m times and rank the m classifiers according to their accuracies.

  6. Select the first h classifiers with the highest accuracies.
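Steps (1) to (6) can be sketched as follows. This is a simplified illustration, not the code of the ESkNN package: it assumes Euclidean distance and a bootstrap drawn with replacement, and the names `knn_predict` and `stage_one`, the seed and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def stage_one(data, labels, l, m, h, k, rng):
    """Steps 1 to 6: build m kNN models and keep the h best by OOB accuracy."""
    n, d = len(data), len(data[0])
    models = []
    for _ in range(m):
        feats = rng.sample(range(d), l)              # step 1: feature subset
        rows = [rng.randrange(n) for _ in range(n)]  # step 2: bootstrap rows
        in_bag = set(rows)
        oob = [i for i in range(n) if i not in in_bag]
        # Step 3: a kNN "model" is simply its stored training sample.
        tx = [[data[i][j] for j in feats] for i in rows]
        ty = [labels[i] for i in rows]
        # Step 4: accuracy on the OOB rows, using the same features.
        hits = sum(
            knn_predict(tx, ty, [data[i][j] for j in feats], k) == labels[i]
            for i in oob
        )
        acc = hits / len(oob) if oob else 0.0
        models.append({"features": feats, "rows": rows, "oob_accuracy": acc})
    # Steps 5 and 6: rank by accuracy and keep the first h.
    models.sort(key=lambda mod: mod["oob_accuracy"], reverse=True)
    return models[:h]


rng = random.Random(1)  # fixed seed, chosen only for reproducibility
n = 60
labels = [1 if i < n // 2 else 2 for i in range(n)]
# Feature 0 separates the classes; features 1 to 5 are pure noise.
data = [
    [(0.0 if y == 1 else 5.0) + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(5)]
    for y in labels
]
best = stage_one(data, labels, l=2, m=25, h=5, k=3, rng=rng)
```

In a toy setting of this kind, models whose subset contains the informative feature would be expected to rank highest, which is how the ranking implicitly favours informative feature subsets.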

These selected classifiers are further assessed as follows:

  • The ensemble is started by combining the second best classifier with the best classifier, and the classification performance is evaluated on the validation set \({\mathcal {L}_{\textit{V}}}\). The ensemble is then grown by adding the third best classifier and measuring the performance again; this process is carried out for all the h classifiers.

    • Let \({{\mathcal {BS}}}^{\langle r-1 \rangle }\) be the Brier score of the ensemble of selected best kNN models without the rth model and \({{\mathcal {BS}}}^{\langle r \rangle }\) be the Brier score of the ensemble of the best models after including the rth model; then the rth model is selected if

      $$\begin{aligned} {{\mathcal {BS}}}^{\langle r \rangle } < {{\mathcal {BS}}}^{\langle r-1 \rangle }. \end{aligned}$$
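The selection rule above amounts to a greedy forward pass over the ranked models, which can be sketched as follows. The sketch assumes that each model supplies an estimated probability of class 1 for every validation observation, that labels are coded 0/1, and that the ensemble estimate is the plain average of the member estimates; the names `brier`, `mean_probs` and `stage_two` and the toy numbers are hypothetical.

```python
def brier(y, p):
    """Mean squared difference between 0/1 outcomes and estimated probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def mean_probs(prob_lists):
    """Average the member probabilities observation by observation."""
    return [sum(col) / len(col) for col in zip(*prob_lists)]


def stage_two(ranked_probs, y_val):
    """Start from the best model; keep the r-th model only if it lowers the Brier score."""
    selected = [0]
    current = brier(y_val, ranked_probs[0])
    for r in range(1, len(ranked_probs)):
        trial = [ranked_probs[i] for i in selected] + [ranked_probs[r]]
        score = brier(y_val, mean_probs(trial))
        if score < current:  # BS<r> < BS<r-1>
            selected.append(r)
            current = score
    return selected, current


# Validation labels and class-1 probabilities of three ranked models (toy values).
y_val = [1, 1, 0, 0]
ranked_probs = [
    [0.6, 0.7, 0.4, 0.3],  # best model by OOB accuracy
    [0.8, 0.9, 0.2, 0.1],  # sharper estimates, lowers the Brier score
    [0.2, 0.3, 0.8, 0.9],  # misleading estimates, raises the Brier score
]
selected, score = stage_two(ranked_probs, y_val)
print(selected, round(score, 3))  # [0, 1] 0.065
```

Here the second model is kept because it reduces the score from 0.125 to 0.065, while the third is rejected because including it would increase the score.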

The ensemble is formed in a two-stage procedure that assesses the models using two different performance measures: misclassification rate and Brier score.

In the first stage the classification models are evaluated using the misclassification rate (MR) as the performance measure. A classification model should have a lower misclassification rate than the alternatives considered for a classification task, and thus the models with a low misclassification rate are selected.

In the second stage of the algorithm the selected models are further evaluated using the Brier score as a performance measure. The Brier score measures the difference between the observed outcomes of the test instances and the estimated probabilities, which are in turn used to classify new observations by means of a threshold. Alongside the traditional misclassification rate and other metrics, the Brier score can therefore also be used to evaluate the predictive performance of a classifier. When the output of a classifier is used as a basis for decision making, a more detailed evaluation is required, in which not only the prediction accuracy of the classifier but also the quality of the probability estimate is considered. This can be done through a score such as the Brier score, which, in principle, measures the predictive ability/quality of a classifier in classifying new data (Hernández-Orallo et al. 2012; Steyerberg et al. 2010; Kruppa et al. 2014).

Let the class labels of the test instances from the two classes, “positive” and “negative”, be represented by 0 or 1, i.e. \(y \in \{0,1\}\). The Brier score for the predicted probabilities of class 1, \(y=1\), is:

$$\begin{aligned} \mathcal {BS} = E\left[ \left( y_i-p(y_i=1)\right) ^2\right] . \end{aligned}$$

An estimator for the above score is:

$$\begin{aligned} \hat{\mathcal {BS}} = \frac{\sum _{i=1}^{n_{t}}\left( y_i-\hat{p}(y_i| \mathbf{{x}})\right) ^2}{n_{t}}, \end{aligned}$$

where \(n_t\) is the total number of test points and the state of the outcome is \(y \in \{0,1\}\). A low Brier score indicates better performance of the predictor. Thus the models minimizing the Brier score of the ensemble are selected.
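The estimator can be computed directly from the test labels and the estimated probabilities. The sketch below also computes the misclassification rate at a cut-off of 0.5 so that the two measures can be compared on the same toy data; the function names and the numbers are illustrative assumptions only.

```python
def brier_score(y, p_hat):
    """Mean squared difference between 0/1 outcomes and estimated P(y = 1 | x)."""
    if len(y) != len(p_hat):
        raise ValueError("y and p_hat must have the same length")
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p_hat)) / len(y)


def misclassification_rate(y, p_hat, cutoff=0.5):
    """Share of observations whose thresholded prediction differs from y."""
    preds = [1 if pi > cutoff else 0 for pi in p_hat]
    return sum(pr != yi for pr, yi in zip(preds, y)) / len(y)


y = [1, 0, 1, 0]
p_hat = [0.9, 0.2, 0.6, 0.4]
print(round(brier_score(y, p_hat), 4))   # 0.0925
print(misclassification_rate(y, p_hat))  # 0.0
```

All four toy observations are classified correctly, yet the Brier score is not zero: it still penalizes the two estimates that lie close to the cut-off.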

One technical reason for using the Brier score to assess the collective contribution of the models selected individually in the first stage is that this score is better able than the misclassification rate to determine the contribution of a model that is to be included in the ensemble. To illustrate this, let the probability estimated by a classifier c1 that a test observation belongs to class 1, given that class 1 is its true class, be:

$$\begin{aligned} \hat{f}_{c1}= 0.56. \end{aligned}$$

Suppose that the cut-off for assigning this observation to class 1 is

$$\begin{aligned} \hat{f}(.) > 0.5, \end{aligned}$$

which implies that the given observation is assigned to class 1 and the classification error is 0 (correct classification). The Brier score in this case is \((1-0.56)^2 = 0.1936\).

Now consider that the second classifier gives the estimated probability for that observation as 0.68. The combined probability estimate of the two classifiers for the same observation, denoted by \(\hat{f}_{{c1,c2}}\), is given as:

$$\begin{aligned} \hat{f}_{{c1,c2}} = 0.62 \ . \end{aligned}$$

Consequently, the Brier score decreases to 0.1444. For the given cut-off the classification error remains 0, as for the single classifier.

Suppose that a third classifier gives an estimated probability of 0.88. The resulting combined probability, rounded to two decimals, is:

$$\begin{aligned} \hat{f}_{{c1,c2,c3}}= 0.71 . \end{aligned}$$

Here the Brier score decreases to 0.0841 while the classification error remains the same (0) as the previous ensemble of two classifiers for the given cut-off.

It follows that if classification errors were used as the criterion for adding classifiers to the ensemble, classifiers c2 and c3 would not become part of the ensemble, as the error remains the same, whereas the Brier score decreases with the addition of classifiers c2 and c3, thus leading to an ensemble of size 3.
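The arithmetic of this illustration can be reproduced with a few lines of Python. The sketch assumes that the combined estimate is the plain average of the member estimates and that it is rounded to two decimals before the Brier score is computed, as in the values quoted above; the variable names are hypothetical.

```python
probs = [0.56, 0.68, 0.88]  # estimates from c1, c2 and c3 for one class-1 observation
y_true = 1
cutoff = 0.5

results = []
for size in range(1, len(probs) + 1):
    # Average of the first `size` estimates, rounded as in the text.
    combined = round(sum(probs[:size]) / size, 2)
    error = 0 if combined > cutoff else 1
    brier = round((y_true - combined) ** 2, 4)
    results.append((size, combined, error, brier))
    print(size, combined, error, brier)
# 1 0.56 0 0.1936
# 2 0.62 0 0.1444
# 3 0.71 0 0.0841
```

The error column stays at 0 for every ensemble size while the Brier score falls at each step, which is why the score is the more informative criterion for adding models.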

The general pseudo code of ESkNN is given in Algorithm 1.

Algorithm 1 Pseudo code of ESkNN

3 Simulation study

In addition to benchmark data sets, we assessed ESkNN in simulation setups. We use two simulation models to assess the performance of ESkNN. The models proposed in our simulation study involve several variations in order to give an understanding of the behaviour of the classifiers under different situations. The details of the two models are given below.

3.1 Simulation model 1

In this model, the performance of the classifiers is investigated in different setups. Firstly, the predictors of the two classes are generated with correlated and uncorrelated structures, respectively. The variables for class 1 are correlated and generated with a varying variance covariance structure, while the features determining class 2 are independent. A total of 500 independent binary class data sets are generated, each with 20 features. The variables for class 1 are generated from \({\mathcal {N}(2,w\Psi )}\), while those of class 2 are generated from \({\mathcal {N}({1},{1})}\). The values considered for w in class 1 are 3, 5, 10, 15 and 20. The predictive performance of the algorithms is investigated by adding 50, 100, 200 and 500 non-informative features, generated from a normal distribution, to the data. The variance covariance matrix \(\Psi \), which is a \(d\times {d}\) matrix, is:

$$\begin{aligned} \Psi = \left( \begin{array}{cccc} \sigma _{1,1} & \varrho _{1,2} & \ldots & \varrho _{1,d} \\ \varrho _{2,1} & \sigma _{2,2} & \ldots & \varrho _{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ \varrho _{d,1} & \varrho _{d,2} & \ldots & \sigma _{d,d} \end{array} \right) , \end{aligned}$$

where \(\varrho _{ij}\) are the covariances given by

$$\begin{aligned} \varrho _{ij} =(1/2)^{|{i-j}|}, i,j=1,\ldots ,d. \end{aligned}$$

and \(\sigma _{ii}\), on the diagonal of \({\Psi }\), is the variance, with \(\sigma _{ii} = 1\) when w is 1. Changing the value of w changes the magnitude of the variances and covariances of the variables. The data are generated in such a manner that the variables within class 1 are correlated with each other and exhibit negligible/no correlation with the features from class 2.
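The covariance structure of model 1 can be generated as in the sketch below. It rests on one reading of the description: class 1 is drawn from a multivariate normal with every mean equal to 2 and covariance \(w\Psi \), and class 2 has independent features with mean 1 and unit variance. The Cholesky-based sampler and all function names are illustrative choices, not the simulation code used for the reported results.

```python
import math
import random


def psi_matrix(d, w=1.0):
    """w * Psi with entries (1/2)^{|i-j|}; the diagonal of Psi equals 1."""
    return [[w * 0.5 ** abs(i - j) for j in range(d)] for i in range(d)]


def cholesky(a):
    """Lower-triangular factor `low` with low * low^T = a (a symmetric positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_class1(d, w, rng, mean=2.0):
    """One draw from N(mean, w * Psi), obtained through the Cholesky factor."""
    low = cholesky(psi_matrix(d, w))
    z = [rng.gauss(0, 1) for _ in range(d)]
    return [mean + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]


def sample_class2(d, rng, mean=1.0, sd=1.0):
    """One draw with d independent N(mean, sd^2) features."""
    return [rng.gauss(mean, sd) for _ in range(d)]


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
x1 = sample_class1(20, w=5, rng=rng)
x2 = sample_class2(20, rng=rng)
```

Multiplying \(\Psi \) by w scales every variance and covariance by the same factor, so the correlation between two features stays at \((1/2)^{|i-j|}\) while the spread of class 1 grows with w.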

3.2 Simulation model 2

The second simulation model, model 2, is a four-dimensional model derived from the model proposed in Mease et al. (2007). A set of 500 independent binary class data sets is generated, each consisting of 1000 observations and 4 features. The feature vector \(\mathbf {x}\) is a four-dimensional random vector uniformly distributed on [0, 100], and the response variable y takes one of two values, 0 or 1. The class is determined by r, the distance of the feature vector \(\mathbf {x}\) from the central point. The class probabilities given the features are:

$$\begin{aligned} p(y=1 \mid \mathbf {x})=\left\{ \begin{array}{ll} 1, & \quad \hbox {if } r< 110,\\ \frac{150-r}{140}, & \quad \hbox {if } 110\le r\le 140,\\ 0, & \quad \hbox {otherwise.} \end{array} \right. \end{aligned}$$

The response values are generated from the above distribution using a Bernoulli random number generator. We extend the dimensions of this model by adding 50, 100, 200 and 500 non-informative features generated from a uniform distribution. The data complexity increases with the number of added non-informative features.
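Extending a data set with non-informative features can be sketched as follows. The range of the uniform distribution is not stated above, so the default of [0, 100] used here is an assumption, as are the function name `add_noise_features` and the toy data.

```python
import random


def add_noise_features(data, n_noise, rng, low=0.0, high=100.0):
    """Append n_noise independent uniform features to every observation.

    The range [low, high] is an assumption; it is not specified in the text.
    """
    return [list(row) + [rng.uniform(low, high) for _ in range(n_noise)] for row in data]


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
# Five toy observations with four informative features each.
data = [[rng.uniform(0, 100) for _ in range(4)] for _ in range(5)]
extended = add_noise_features(data, n_noise=50, rng=rng)
print(len(extended[0]))  # 54
```

Since the added columns are generated independently of the class label, any distance computed on the extended vectors is increasingly dominated by noise as `n_noise` grows.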

4 Simulation results and discussion

The average misclassification rates from model 1 and model 2 are presented in Tables 1, 2 and 3.

Table 1 Misclassification rate of the methods on the data sets with added non-informative features from model 1

The results from model 1 in Table 1 indicate that the classification accuracy of ESkNN is higher than that of all the other methods in most cases, except for the data with the original 20 features, where SVM gives the minimum misclassification rate of all the methods. Unsurprisingly, kNN shows a high error rate compared to the other methods, and the performance of the kNN based methods declines with an increasing number of non-informative features in the data, whereas ESkNN still performs better.

Table 2 Misclassification rate of the classifiers on the data sets from model 1 for different values of w (listed in column 1), on 70 features (\(20+50\) non-informative)
Table 3 Misclassification rate of the methods on the data sets with added non-informative features from model 2

Table 2 shows an increase in the misclassification rate of all the classifiers except random forest. It can be observed that the prediction performance of the kNN based classification methods and SVM decreases with higher variance and covariance in the data, i.e., for increasing values of w. Random forest, however, gives better classification accuracy in this case. Although the performance of the kNN based methods declines, ESkNN consistently performs better than the other methods, apart from random forest, in such situations.

Fig. 1 Misclassification rate on simulated data from model 2 with added non-informative features. a 50 added non-informative features; b 100 added non-informative features; c 200 added non-informative features; d 500 added non-informative features

The results of model 2 in Table 3 reveal that ESkNN consistently outperforms the other methods in the presence of non-informative features in the data. In the case of the data with original features only, however, SVM gives the best result, and in the case of 100 features ESkNN gives better results than the other methods and results comparable to SVM. Bagged kNN provides the same results as usual kNN on the data with 4 features and achieves a slight accuracy gain over usual kNN on the data with added non-informative features (Fig. 1).

5 Experiments on benchmark data sets

The performance of the proposed method, in terms of misclassification rate, is evaluated on a total of 23 benchmark data sets. The data sets chosen cover a wide range of domains, namely microarray gene expression data sets and data sets from life science, finance and physical science. The “Diabetes” and “Sonar” data sets are from the R-package “mlbench” (Leisch and Dimitriadou 2010); “dystrophy” and “Glaucoma” are from “ipred” (Peters and Hothorn 2012). All the other data sets are from UCI (Bache and Lichman 2013). A summary of the data sets is given in Table 4.

Table 4 Summary of the data sets
Table 5 Misclassification rate of kNN, RkNN, BkNN, MFS, RF, SVM and ESkNN

5.1 Experimental setup

The performance of ESkNN is evaluated on a total of 23 data sets in two scenarios: on the benchmark data sets with their original features, and after adding non-informative features to the data sets. The performance of ESkNN in terms of misclassification rate is compared with that of usual kNN, bagged kNN, random kNN, MFS, random forest and SVM. Each data set is divided into training and test sets: 90 % of the data is used for training and 10 % for testing. The same training and test sets are used for all the methods, and the results are averaged over a total of 1000 such splits. All the experiments are carried out using R (R Core Team 2013). For the kNN based methods, the value of k is selected from \(k=1,\ldots ,10\) by tenfold cross validation using the R-Package “e1071” (Meyer et al. 2012). Random forest is tuned by using the R-function “tune.randomForest” available within the same package. For SVM we used the “kernlab” R-Package (Karatzoglou et al. 2004), with sigma tuned by the automatic selection available in that package. The other parameters are fixed at default values. A total of 1001 kNN models are generated on bootstrap samples, and 40 % of them are then selected for the second stage. The number of models generated is taken to be odd in order to break ties when the classifiers vote on the class of a test point. The feature subset size is set to one-third of the input features; however, in low dimensions, i.e., when the data with original features have \(d<6\), the feature subset size is taken as 2.
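The two size settings of this setup can be written as small helper functions. The rounding direction is not specified above, so rounding down is an assumption of this sketch, and the function names are hypothetical.

```python
def feature_subset_size(d):
    """Subset size l: one third of d, or 2 when d < 6 (rounding down is assumed)."""
    return 2 if d < 6 else d // 3


def second_stage_count(m=1001, share=0.40):
    """Number of models h passed to the second stage (rounding down is assumed)."""
    return int(m * share)


print(feature_subset_size(4))    # 2
print(feature_subset_size(60))   # 20
print(second_stage_count())      # 400
```

With 1001 generated models, a 40 % share corresponds to about 400 models entering the Brier score based second stage.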

6 Results and discussion

The results on the data sets with their original features and with 500 added randomly generated non-informative features are reported in Tables 5 and 6, respectively. The results in Table 5 show that ESkNN outperforms or gives results comparable to the other methods considered here. It is interesting to note that on the data sets with their original features, ESkNN outperforms the kNN based methods on most of the data sets and gives results comparable to random forest. ESkNN gives the best overall results on 8 data sets, random forest is better than all the other methods on 9 data sets, SVM gives the minimum classification error on 5 data sets, and RkNN outperforms the rest of the methods on one data set.

In the case of non-informative features in the data (Table 6), ESkNN gives the minimum classification error on 11 data sets, RF gives the best classification performance on 9 data sets, SVM gives better results on one data set, and on two data sets there is no clear winner between random forest and ESkNN, although ESkNN gives better performance than the kNN based methods and SVM. Here again, it is observed that ESkNN results in a smaller classification error than the kNN based methods on most of the data sets.

Table 6 Misclassification rate of kNN, RkNN, BkNN, MFS, RF, SVM and ESkNN with added non-informative features to the data sets

7 Conclusion and outlook

Building on the idea of ensemble techniques, we have proposed an ensemble of subset of kNN classifiers (ESkNN) for classification tasks, particularly to deal with the issue of non-informative features in the data sets. Our approach consists of forming an ensemble of the best kNN models, thus implicitly identifying the informative feature subsets and discarding the non-informative ones. ESkNN is assessed for its classification performance on simulated and benchmark data sets. Our results on simulated and benchmark data sets show that ESkNN gives results comparable to RF and outperforms kNN and kNN based ensembles. The results from the simulations (Table 2) reveal that in the case of high variance in the classes RF performs better than the others. Random projection ensemble classification (Cannings and Samworth 2015) may allow further improvements. Moreover, it would be of interest to investigate whether recent proposals such as predictive hubs (Lausser et al. 2014) and representative prototypes (Müssel et al. 2015) can be exploited to develop ESkNN further. ESkNN is implemented and available as the R-Package “ESkNN” on CRAN (Gul et al. 2015).