Let \({\mathcal {L}} = \{(\mathbf {x}_i, y_i),\ i = 1, \ldots , n\}\) be a training set consisting of \(n\) independent observations, where \(\mathbf {x}_i = (x_{i1}, x_{i2}, \ldots , x_{id})\) is a \(d\)-dimensional feature vector and \(\mathbf {y}\) is the vector of class labels with \(y_i \in \{1, \ldots , J\}\), \(J\) being the total number of classes; here we consider the two-class problem, thus \(y_i \in \{1, 2\}\). Based on this available data set \({\mathcal {L}}\), a classifier predicts the class label for a new/test observation with feature vector \(\mathbf {x}^\prime\). The training data \({\mathcal {L}}\) is divided into two parts, \({\mathcal {L}_{T}}\) and \({\mathcal {L}_{\textit{V}}}\), the first for the construction of the classifiers and the other for validation. For simplicity we denote the set used for the construction of the models, \({\mathcal {L}_{T}}\), by \({\mathcal {L}^*}\). Let us denote the \(d\) input features in \({\mathcal {L}^*}\) by \({\mathbf {P}} = (p_1, p_2, p_3, \ldots , p_d)\). For a given subset size \(l\), where \(l < d\), a random subset of features \({\mathbf {P}}^{l}\) is drawn from \(\mathbf {P}\). Restricted to the randomly selected features, a bootstrap sample is drawn from \({\mathcal {L}^*}\); the new bootstrap learning set \({\mathcal {L}^*}^{(l)}\) consists of \(n\) observations on the \(l\)-dimensional feature vector. This process is repeated until we get \(m\) training sets, \({\mathcal {L}^*}^{(1l)}, \ldots , {\mathcal {L}^*}^{(ml)}\), each of dimension \(n \times (l+1)\) (the \(l\) selected features plus the class label). A base kNN classifier is constructed on each of these bootstrap training sets, and a set of \(m\) classifiers is generated.
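To make this construction concrete, the following is a minimal Python sketch, assuming a NumPy feature matrix `X` of shape \(n \times d\) and a label vector `y`; the function name `build_training_sets` and the returned tuple layout are illustrative choices, not part of the original method.

```python
# Hypothetical sketch: draw m random feature subsets of size l and, for each,
# a bootstrap sample of n observations restricted to those features.
import numpy as np

def build_training_sets(X, y, l, m, seed=None):
    """Return m tuples (X_boot, y_boot, feature_idx, oob_idx)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    training_sets = []
    for _ in range(m):
        features = rng.choice(d, size=l, replace=False)  # random feature subset P^l
        boot = rng.choice(n, size=n, replace=True)       # bootstrap sample of size n
        oob = np.setdiff1d(np.arange(n), boot)           # out-of-bag observations
        training_sets.append((X[np.ix_(boot, features)], y[boot], features, oob))
    return training_sets
```

The out-of-bag indices are kept alongside each training set because they are needed for the assessment step described next.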
When a bootstrap sample of size \(n\) is drawn with replacement from the training set, approximately \(\frac{1}{3}\) of the observations are left out of that sample, since the probability that a given observation is never drawn is \((1 - 1/n)^n \approx e^{-1} \approx 0.368\) for large \(n\). These observations are called out-of-bag (OOB) observations and can be utilized for the estimation of the classification error (Breiman 1996b). In our framework we use the OOB sample for the assessment of the classifiers. The \(m\) classifiers are then ranked according to their individual classification accuracy on the OOB sample, and the first \(h\) of the \(m\) classifiers are selected. The selected classifiers are then assessed for their collective contribution as an ensemble on the validation set \({\mathcal {L}_{\textit{V}}}\). This is done by starting from the best of the \(h\) classifiers and then adding the rest of the classifiers to the ensemble one by one.
The formation of the ensemble of a subset of kNN classifiers can be summarized as follows:
1. Draw a random sample of size \(l < d\), without replacement, of features from the feature vector \(\mathbf {P}\) of \({\mathcal {L}^*}\); denote the resulting feature vector by \(\mathbf {P}^{l}\).
2. Based on the selected random feature subset \(\mathbf {P}^{l}\), draw a bootstrap sample of size \(n\), \({\mathcal {L}^*}^{(l)}\), from \({\mathcal {L}^*}\).
3. Construct a kNN classifier on \({\mathcal {L}^*}^{(l)}\).
4. Calculate the accuracy of the classifier on the OOB sample, using the same feature subset as used for its construction.
5. Iterate steps (1) to (4) \(m\) times and rank the \(m\) classifiers according to their accuracies.
6. Select the first \(h\) classifiers with the highest accuracies (a sketch of steps (3) to (6) follows this list).
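Steps (3) to (6) might look as follows in Python, building on the sketch above; `KNeighborsClassifier` from scikit-learn stands in for the base kNN learner, and `k`, `h`, and `rank_by_oob` are illustrative names rather than the paper's notation.

```python
# Hypothetical sketch of steps (3)-(6): fit one kNN model per bootstrap set,
# score it on its own out-of-bag observations, rank, and keep the best h.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def rank_by_oob(X, y, training_sets, h, k=5):
    scored = []
    for X_boot, y_boot, features, oob in training_sets:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_boot, y_boot)
        oob_acc = model.score(X[np.ix_(oob, features)], y[oob])  # step (4)
        scored.append((oob_acc, model, features))
    scored.sort(key=lambda t: t[0], reverse=True)  # step (5): best first
    return scored[:h]                              # step (6)
```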
These selected classifiers are further assessed as follows:
The ensemble is started by combining the second-best classifier with the best one, and the classification performance is evaluated on the validation set \({\mathcal {L}_{\textit{V}}}\). The ensemble is then grown by adding the third-best classifier, and the performance is measured again; this process is carried out for all the \(h\) classifiers, as sketched below.
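A hedged sketch of this stepwise assessment, assuming the ranked output of `rank_by_oob` above, class labels coded as 0/1, and member outputs combined by averaging class-1 probabilities with a 0.5 cut-off; the combination rule and the name `grow_ensemble` are assumptions for illustration.

```python
# Grow the ensemble best-first and record the validation misclassification
# rate after each addition.
import numpy as np

def grow_ensemble(X_val, y_val, ranked):
    """ranked: list of (oob_accuracy, model, features), best first."""
    members, probs, error_path = [], [], []
    for _, model, features in ranked:
        probs.append(model.predict_proba(X_val[:, features])[:, 1])
        f_hat = np.mean(probs, axis=0)             # combined probability estimate
        pred = (f_hat > 0.5).astype(int)           # cut-off of 0.5
        error_path.append(np.mean(pred != y_val))  # misclassification rate
        members.append((model, features))
    return members, error_path
```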
The ensemble is formed in a two-stage procedure by assessing the models using two different performance measures: the misclassification rate and the Brier score.
In the first stage the classification models are evaluated using the misclassification rate (MR) as the performance measure. A classification model is preferred over the others used for a classification task if it has a lower misclassification rate, and thus the classification models with a low misclassification rate are selected.
In the second stage of the algorithm the selected models are further evaluated using the Brier score as the performance measure. The Brier score measures the discrepancy between the observed outcomes of the test instances and the estimated probabilities, which are in turn used to classify new observations using some threshold. Besides the traditional misclassification rate and other metrics, the Brier score can also be used to evaluate the predictive performance of a classifier. When the output of a classifier is used as a basis for decision making, a more detailed evaluation is required, in which not only the prediction accuracy of the classifier is considered but the quality of the probability estimates also receives ample consideration. This can be done through a score such as the Brier score, which, in principle, measures the predictive ability/quality of a classifier in classifying new data (Hernández-Orallo et al. 2012; Steyerberg et al. 2010; Kruppa et al. 2014).
Let the class labels of the test instances from the two classes, “positive” and “negative”, be represented by 1 and 0, i.e., \(y \in \{0, 1\}\). The Brier score for the probabilities of the predicted class 1, \(y = 1\), is:
$$\begin{aligned} \mathcal {BS} = E\left( y_i - p(y_i = 1)\right) ^2. \end{aligned}$$
An estimator for the above score is:
$$\begin{aligned} \hat{\mathcal {BS}} = \frac{\sum _{i=1}^{n_{t}}\left( y_i - \hat{p}(y_i = 1 \mid \mathbf {x}_i)\right) ^2}{n_{t}}, \end{aligned}$$
where \(n_t\) is the total number of test points and the state of the outcome is \(y \in \{0, 1\}\). A low Brier score indicates a better performance of the predictor. Thus the models minimizing the Brier score of the ensemble are selected.
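A direct transcription of this estimator, assuming `y_true` holds the observed 0/1 outcomes and `p_hat` the estimated class-1 probabilities (names are illustrative); scikit-learn's `brier_score_loss` computes the same quantity.

```python
import numpy as np

def brier_score(y_true, p_hat):
    """Mean squared difference between 0/1 outcomes and class-1 probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return np.mean((y_true - p_hat) ** 2)  # lower is better
```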
One technical reason for assessing the models selected individually in the first stage for their collective contribution using the Brier score is that this score is more capable than the misclassification rate of determining the contribution of a model to be included in the ensemble. To illustrate this, let the estimated probability that a test observation belongs to class 1, given that class 1 is the true class, produced by a classifier \(c_1\) be:
$$\begin{aligned} \hat{f}_{c_1} = 0.56. \end{aligned}$$
Suppose that the cut-off for assigning this observation to class 1 is
$$\begin{aligned} \hat{f}(\cdot ) > 0.5, \end{aligned}$$
which implies that the given observation is assigned to class 1 and the classification error will be 0 (correct classification). The Brier score in this case is \((1 - 0.56)^2 = 0.1936\).
Now consider that a second classifier \(c_2\) gives an estimated probability of 0.68 for that observation. The combined (averaged) probability estimate of the two classifiers for the same observation, denoted by \(\hat{f}_{c_1,c_2}\), is given as:
$$\begin{aligned} \hat{f}_{c_1,c_2} = 0.62. \end{aligned}$$
Consequently, the Brier score decreases to \((1 - 0.62)^2 = 0.1444\). The classification error in both cases is 0, the same as that of the single classifier for the given cut-off.
A third classifier \(c_3\) has an estimated probability of 0.88; the resulting combined probability is:
$$\begin{aligned} \hat{f}_{c_1,c_2,c_3} = 0.71. \end{aligned}$$
Here the Brier score decreases to \((1 - 0.71)^2 = 0.0841\), while the classification error remains the same (0) as for the previous ensemble of two classifiers at the given cut-off.
It follows that if classification errors were considered for classifier addition into the ensemble, classifiers \(c_2\) and \(c_3\) would not be part of the ensemble, as the error remains the same, whereas the Brier score is reduced by the addition of classifiers \(c_2\) and \(c_3\), thus leading to an ensemble of size 3.
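The arithmetic of this example can be checked in a few lines; member probabilities are averaged as above, and the small gap at size 3 comes from the text rounding \(2.12/3 \approx 0.7067\) to 0.71 before squaring.

```python
# True class is 1; average the members' class-1 probabilities and watch the
# Brier score shrink while the 0/1 error stays at 0.
probs = [0.56, 0.68, 0.88]
for j in range(1, len(probs) + 1):
    f = sum(probs[:j]) / j            # combined estimate of c1..cj
    err = int(not f > 0.5)            # misclassification at cut-off 0.5
    print(f"size {j}: f={f:.4f}, Brier={(1 - f) ** 2:.4f}, error={err}")
# size 1: f=0.5600, Brier=0.1936, error=0
# size 2: f=0.6200, Brier=0.1444, error=0
# size 3: f=0.7067, Brier=0.0860, error=0  (0.0841 in the text, with f rounded to 0.71)
```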
The general pseudo code of ESkNN is given in Algorithm 1.