Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets

Distance-based algorithms are widely used for data classification problems. The k-nearest neighbour classification (k-NN) is one of the most popular distance-based algorithms. This classification is based on measuring the distances between the test sample and the training samples to determine the final classification output. The traditional k-NN classifier works naturally with numerical data. The main objective of this paper is to investigate the performance of k-NN on heterogeneous datasets, where data can be described as a mixture of numerical and categorical features. For the sake of simplicity, this work considers only one type of categorical data, which is binary data. In this paper, several similarity measures are defined by combining well-known distances for numerical and binary data, and are used to investigate the performance of k-NN for classifying such heterogeneous data sets. The experiments used six heterogeneous datasets from different domains and two categories of measures. Experimental results showed that the proposed measures performed better for heterogeneous data than Euclidean distance, and that the challenges raised by the nature of heterogeneous data need personalised similarity measures adapted to the data characteristics.


Introduction
Classification is a supervised machine learning process that maps input data into predefined groups or classes [1]. The main condition for applying a classification technique is that all data objects should be assigned to classes, and that each of the data objects should be assigned to only one class [2].
Distance-based classification algorithms are techniques used for classifying data objects by computing the distance between the test sample and all training samples using a distance function. Distance-based algorithms were originally proposed to deal with one type of data, using distance-based measurements to determine the similarity between data objects. These algorithms were subsequently developed to handle heterogeneous data, as real-world data sets are often diverse in type, format, content and quality, particularly when they are gathered from different sources.
In general, when classifying heterogeneous data using distance-based algorithms, there are two categories of methods. The first category converts values from one data type to another (e.g. binning data, interpolating or projecting data) and then, distance-based algorithms can be used with an appropriate measurement to classify the data.
However, this method is not effective as the similarity measure of the transformed data does not necessarily represent the similarity of the original heterogeneous data consistently, especially when the transformation is not fully reversible. Moreover, the data conversion could also fundamentally alter values to make them more equidistant, meaning there are no guarantees that data will be interpreted correctly, which introduces the risk of losing or altering vital information in the decision process that the classification task is designed to support.
The second category extends distance-based algorithms to match the heterogeneous data. This can be done using distance measures that can handle heterogeneous data.
One common classification technique based on the use of distance measures is k-nearest neighbours (k-NN) [3]. The traditional k-NN classification algorithm finds the k-nearest neighbour(s) and classifies numerical data records by calculating the distance between the test sample and all training samples using the Euclidean distance [4].
The primary focus of the k-NN classifier has been on data sets with pure numerical features [5]. However, k-NN can also be applied to other types of data, including categorical data [6]. Several investigations have been carried out to find proper categorical measures for such data, such as the works presented in [7-12].
Moreover, it can also be applied to classify data described by numerical and categorical features, as in the studies reported in [7,13].
This paper aims to investigate the performance of k-NN classification on heterogeneous data sets using two types of measures: the well-known (Euclidean and Manhattan) distances and the combination of similarity measures that are formed by fusing existing numerical distances with binary data distances. It also aims to provide first guidance on the best combination of similarity functions that can be used with k-NN for heterogeneous data classification (of numerical and binary features). The rest of this paper is organised as follows. Section 2 provides the concepts, background and literature review relevant for the research topic. Section 3 briefly describes the six well-known distance functions that are used in this study and explains the proposed technique for classifying heterogeneous data. Section 4 presents the experimental work and results. Finally, Sect. 5 presents the conclusion and future work.

Distance and similarity measures
The concept of similarity between data objects is widely used across many domains to solve a variety of pattern recognition problems such as categorisation, classification, clustering and forecasting [14]. Various measures have been proposed in the literature for comparing data objects [15]. In this section the concepts of distance measure and similarity measure are introduced, followed by a review of the k-NN algorithm and its performance evaluation.
Definition 1 A distance measure d : X × X → R is a function called a metric if it satisfies the following requirements [16] ∀x, y, z ∈ X:
1. d(x, y) ≥ 0 (Non-negative);
2. d(x, y) = 0 if and only if x = y (Identity);
3. d(x, y) = d(y, x) (Symmetry);
4. d(x, z) ≤ d(x, y) + d(y, z) (Triangle inequality).
However, similarity measurement is more debated, as it provides some flexibility in identifying how close two data objects could be. A similarity measure is generally perceived as complementary to a distance measure.
Definition 2 A similarity measure S : X × X → R is a function that satisfies the following requirements ∀x, y ∈ X:
1. 0 ≤ S(x, y) (Non-negative);
2. S(x, y) = 1 if and only if x = y (Identity);
3. S(x, y) = S(y, x) (Symmetry).

K-nearest neighbour classifier (k-NN)
In this section, we look at classification that uses the concept of distance for classifying data objects. The k-NN classifier is one of the simplest and most widely used of such classification algorithms. k-NN was proposed in 1951 by Fix and Hodges [17] and modified by Cover and Hart [3]. The technique can be used for both classification and regression [18].
The main concept of k-NN depends on calculating the distances between the tested sample and the training data samples in order to identify its nearest neighbours. The tested sample is then simply assigned to the class of its nearest neighbour [19].
In k-NN, the k value represents the number of nearest neighbours. This value is the core deciding factor for this classifier, since it decides how many neighbours influence the classification. When k = 1, the new data object is simply assigned to the class of its nearest neighbour. The neighbours are taken from a set of training data objects for which the correct classification is already known. k-NN works naturally with numerical data. Various numerical measures have been used, such as the Euclidean, Manhattan (also known as City-block), Minkowski, and Chebyshev distances. Amongst these, the Euclidean is the most widely used distance function with k-NN [20]. The main steps of the k-NN algorithm are shown in Fig. 1. According to [21], the k-NN classifier can be used to classify new data objects using only their distance to labelled samples. However, some works consider any metric or non-metric measure for use with this classifier: several studies have been conducted to evaluate the k-NN classifier using different metric and non-metric measures, such as the studies presented in [7,10,22-26].
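The description above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function names `euclidean` and `knn_predict`, the toy data, and the tie-breaking rule (ties go to the label seen first among the nearest neighbours) are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_X, train_y, query, k=1, distance=euclidean):
    """Predict the class of `query` by majority vote among its k nearest training samples."""
    # Pair every training label with the distance from its sample to the query.
    scored = [(distance(x, query), label) for x, label in zip(train_X, train_y)]
    # Sort by distance only; the sort is stable, so equal distances keep training order.
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    # Vote ties are resolved in favour of the label encountered first (assumed rule).
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numerical features, two classes.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, [0.15, 0.15], k=3))
```

Passing the distance as a parameter keeps the voting logic independent of the measure, which is the property the rest of the paper relies on when other measures are substituted for the Euclidean distance.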

Performance metrics for classification
The most widely used technique for summarizing the performance of a classification algorithm is the confusion matrix. Figure 2 shows the confusion matrix for the case of binary classification with the following elements:
1. True Positives (TP): the total number of accurate outputs when the actual class of the data object was True and the predicted value was also True.
2. True Negatives (TN): the total number of accurate outputs when the actual class of the data object was False and the predicted value was also False.
3. False Positives (FP): the total number of outputs when the actual class of the data object was False and the predicted value was True.
4. False Negatives (FN): the total number of outputs when the actual class of the data object was True and the predicted value was False.

Metrics computed from a confusion matrix
A confusion matrix gives useful information about how well the model performs. Moreover, its elements can be used to calculate many performance metrics that give even more information. Among the most popular are (see also Tables 1, 2):
1. Accuracy: the most intuitive performance measure, defined as the ratio of the number of correctly classified objects to the total number of objects evaluated.
2. Precision: the ratio of correctly predicted positive data objects to the total predicted positive data objects.
3. Recall: the number of correct positive results divided by the total number of relevant samples (all samples that should have been identified as positive).
4. F-score: a weighted average (harmonic mean) of the precision and recall. The F-score reaches its best value at 1, while the model is a total failure when it reaches 0.
Tables 1 and 2 show the evaluation measures for binary and multi-class data sets respectively.
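For the binary case, the four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard textbook formulas and assumes the balanced F-score (F1); the function name `binary_metrics` and the example counts are hypothetical and are not results from this study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and balanced F-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Hypothetical counts for a test set of 100 objects.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```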

Related work
As mentioned earlier, many studies have investigated, analysed, and evaluated the performance of k-NN on pure numerical and pure categorical data sets. For applying k-NN to heterogeneous data described by numerical and categorical features, the most widely used method is to treat the data before feeding it to the classifier. This can be done by converting non-numerical features into numerical features using different techniques, after which the traditional k-NN can be applied with any numerical distance. A study presented by Hu et al. [7] evaluated the performance of k-NN on three types of medical data sets (pure numerical, pure categorical, and mixed data) using different numerical measures. They treated non-numerical features by encoding them as binary. A similar technique has also been applied in other studies such as [8,13,27].
On the other hand, some studies have used the combination approach for classifying heterogeneous data using k-NN. Pereira et al. [28] proposed a new measure for computing the distance between heterogeneous data objects and used this measure with k-NN. This distance is called the Heterogeneous Centered Distance Measure (HCDM). It is based on a combination of two techniques: the Nearest Neighbour Classifier (CNND) distance for numerical features and the Value Difference Metric (VDM) for categorical features. Combination measures have also been used with k-NN for classifying heterogeneous data sets described by two different feature types, numerical and categorical. These combination measures include: the Heterogeneous Euclidean-Overlap Metric (HEOM), which uses the overlap metric for categorical features and the normalized Euclidean distance for numerical features; the Heterogeneous Manhattan-Overlap Metric (HMOM), which uses the overlap metric for categorical features and the Manhattan distance for numerical features; and the Heterogeneous Value Difference Metric (HVDM), which uses the Value Difference Metric (VDM) for categorical features and the normalized Euclidean distance for numerical features.
In [29], Deekshatulu et al. proposed a new classification algorithm, which combines k-NN and a genetic algorithm, to predict heart disease of patients for the Andhra Pradesh population. The authors also applied the model to medical and non-medical data sets such as the Hypothyroid, liver disorder, primary tumour, and Weather data sets. In this model the features are ranked based on their value. The least ranked features are removed, and the classification algorithm is built on the evaluated features.
Generally, the most commonly used approaches for classifying heterogeneous data, described as a mixture of numerical and categorical features, with the k-NN classifier include:
1. Conversion approach: a method of converting the data set into a single data type, and then applying appropriate distance measures to the transformed data.
2. Unified approach: a method that integrates two or more different measures to infer the overall value.

Measures for comparing data objects
As we mentioned in the previous section, a combination approach is one of the most widely used methods for comparing data objects described by a mixture of data types. The simple idea of applying this technique for calculating the similarity between two data objects described by a mixture of features is to split these features into subsets based on their data type and then to identify the similarity between the subsets of same type. The next step is to combine these measures to obtain a single value representing the similarity between two data objects. In this study, we have used the combination approach to generate a number of similarity measures based on the existing measures to handle heterogeneous data when the representation of the data includes a mixture of numerical and binary features. The data is first divided into pure numerical and pure binary features, specific distances are then applied to the numerical and binary features, and the result of the two distances is assembled into one single distance using a weighted average to form the combined distance value.

Measures for numerical data
In [30] Cha categorized the numerical distances into eight distance families. The study presented by Prasath et al. [23], classified the distance measures following a similar classification done by Cha. Their study also evaluated the performance (measured by accuracy, precision and recall) of the k-NN with the classified distance families for classifying numerical data.
In this study, we will investigate the performance of k-NN for classifying heterogeneous data by using measures from three different families. We have chosen the most representative measures from these families, as they have been applied with k-NN in different studies for classifying data and represent good references for critical comparisons of the results reported here. In the following, x = (x_1, …, x_n) and y = (y_1, …, y_n) are two data objects described by n numerical features. The five chosen measures belong to the following families:
1. L_p Minkowski family: also known as the p-norm distance. The chosen measures from this family include:
(i) Manhattan distance, defined by:
$$ d_{Man}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{1} $$
(ii) Euclidean distance, defined by:
$$ d_{Euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{2} $$
2. Inner product family: distance measures belonging to this family are calculated from products of pairwise values from both vectors. Two measures have been selected from this family:
(i) Cosine similarity measure, defined by:
$$ S_{Cos}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{3} $$
(ii) Jaccard distance, defined by:
$$ d_{Jac}(x, y) = \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i} \tag{4} $$
3. L_1 distance family: the distances in this family are calculated based on the absolute difference. Only one measure has been chosen from this family:
(i) Canberra distance, defined by:
$$ d_{Can}(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|} \tag{5} $$
As mentioned in this section, the chosen measures have been widely applied with k-NN for classifying the data sets in the selected case studies presented in [7,22,26,31-33]. Most of these measures are confirmed metrics: Euclidean, Manhattan and Canberra according to [34,35], and Jaccard according to [36], satisfy the conditions in Definition 1. The Cosine measure is not a metric: it does not satisfy condition 4 in Definition 1.
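The five numerical measures named above can be sketched in plain Python as follows. The formulas are the commonly used textbook forms (the Jaccard distance follows the inner-product form found in Cha's survey [30]); the exact normalisation used in the experiments is not restated here, so the function names and the zero-denominator handling are assumptions for illustration.

```python
import math


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Inner product divided by the product of the vector norms (0 if a norm is 0)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def jaccard_distance(x, y):
    """Inner-product form of the Jaccard distance for numerical vectors."""
    num = sum((a - b) ** 2 for a, b in zip(x, y))
    den = (sum(a * a for a in x) + sum(b * b for b in y)
           - sum(a * b for a, b in zip(x, y)))
    return num / den if den else 0.0


def canberra(x, y):
    """Sum of |a - b| / (|a| + |b|); terms with a zero denominator are skipped (assumed)."""
    total = 0.0
    for a, b in zip(x, y):
        den = abs(a) + abs(b)
        if den:
            total += abs(a - b) / den
    return total


# Hypothetical pair of objects with three normalised numerical features.
x, y = [0.2, 0.5, 1.0], [0.4, 0.5, 0.0]
for f in (manhattan, euclidean, cosine_similarity, jaccard_distance, canberra):
    print(f.__name__, round(f(x, y), 4))
```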

Measures for categorical data
Generally, categorical data is classified as a type of qualitative data [37]. Such data corresponds to a possible representation for nominal, binary, ordinal, and interval instances. For the sake of simplicity, in this work, we will focus on only one type of categorical data, which is binary data. The set of measures developed for dealing with binary data is known as matching coefficients [38]. They calculate the distance between two data objects x and y defined as x = {x_1, x_2, …, x_p} and y = {y_1, y_2, …, y_p}, where p represents the number of binary features in each data object.
The strategy behind these methods is that the two data objects are viewed as similar to the degree that they share a common pattern of feature values among the binary variables. The matching coefficient values range between 0 for not similar at all and 1 for completely similar [39]. Figure 3 shows the main four quantities of binary features.
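Any binary feature takes only one of two values: 0 means that the feature is absent and 1 means that the feature is present; these are called symmetric binary features [39]. For two data objects x and y, the four quantities are listed below:
a: the number of features that equal 1 in both x and y (positive matches);
b: the number of features that equal 1 in x and 0 in y;
c: the number of features that equal 0 in x and 1 in y;
d: the number of features that equal 0 in both x and y (negative matches).
Each feature in the data objects must belong to one of these four categories a, b, c, and d, and a + b + c + d = p, where p is the total number of binary features. There are various similarity measures for binary data proposed in the literature.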
In [40], Choi et al. compared 76 binary similarity measures and classified them hierarchically to observe close relationships among them. The overlap similarity measure is widely used in data mining tasks such as clustering, classification, and regression for handling binary data. It is also known as the simple matching similarity measure. The overlap similarity measure is determined by the number of corresponding features that have identical values. The measure is defined by:
$$ S_{Overlap}(x, y) = \frac{a + d}{a + b + c + d} \tag{6} $$
Researchers in different studies have also applied the overlap measure with k-NN for both classification and regression tasks, using it to compare categorical (nominal/binary) data, as in the studies presented in [32,41,42].
However, the main limitation of this measure is that it only determines whether the features match one another (a and d), and does not make full use of the rest of the classification information. Therefore, in this study, the Jaccard coefficient similarity measure is adopted to deal with binary data and is defined as:
$$ S_{Jaccard}(x, y) = \frac{a}{a + b + c} \tag{7} $$
It should be noted that the Jaccard coefficient similarity measure excludes d, which represents joint absences in both data objects, from consideration. According to [43], the d value in Fig. 3 does not necessarily represent resemblance between data objects, since a large proportion of the binary dimensions in two data objects are more likely to have negative matches.
On the other hand, the study presented by Faith et al. [44] considered the d value when comparing binary data. However, their study treated positive matches as more significant, and therefore gave negative matches less weight than positive matches.
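The two matching coefficients discussed in this section can be illustrated with a short Python sketch. It assumes the usual convention for the four quantities (a: both 1, b: 1 in x only, c: 1 in y only, d: both 0); the function names, the example vectors, and the choice to return 1 for the Jaccard coefficient when both objects contain only zeros are assumptions of the example.

```python
def match_counts(x, y):
    """Return (a, b, c, d) for two equal-length binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # positive matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # present in x only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # present in y only
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # negative matches
    return a, b, c, d


def overlap_similarity(x, y):
    """Simple matching: share of features with identical values, (a + d) / p."""
    a, b, c, d = match_counts(x, y)
    p = a + b + c + d
    return (a + d) / p if p else 0.0


def jaccard_similarity(x, y):
    """Jaccard coefficient a / (a + b + c); joint absences (d) are ignored."""
    a, b, c, _ = match_counts(x, y)
    den = a + b + c
    # Assumed convention: two all-zero objects are treated as identical.
    return a / den if den else 1.0


# Hypothetical objects with six binary features.
x = [1, 0, 0, 1, 0, 0]
y = [1, 0, 0, 0, 0, 1]
print(match_counts(x, y), overlap_similarity(x, y), jaccard_similarity(x, y))
```

In this example the three joint absences raise the overlap similarity well above the Jaccard coefficient, which is the effect the text attributes to negative matches.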

Similarity measures for objects described by heterogeneous features
Many aggregation operators have been used to aggregate the values obtained through multiple similarity measures for data mining applications such as clustering and classification. Many studies have introduced such aggregated similarity measures [45][46][47]. These include measures for different types of data such as classical data (numerical and categorical), fuzzy data, and intuitionistic fuzzy data, or even combinations of them. For example, Bashan et al. [48] introduced a classical similarity measure called the weighted average similarity measure, which is based on the combination of numerical and categorical similarities. They also introduced a weighted average similarity measure based on the combination of classical and fuzzy similarities for comparing heterogeneous data sets. Another study [46] proposed a weighted average similarity measure between intervals of linguistic 2-tuples for solving fuzzy group decision making problems. The studies presented in [49,50] also proposed weighted average similarity measures for intuitionistic fuzzy data. The proposed measures are applied to various pattern recognition problems.
Actually, this approach already existed in other machine learning algorithms: for example in random forest [51] when trained on the subsets, the weights are calculated according to the global outputs.
In this work, we used the weighted average method to assign weights to the numerical and binary similarities that are used with k-NN for classifying heterogeneous data.
The weighted average of a set of values x_1, x_2, …, x_n with corresponding weights w_1, w_2, …, w_n is computed from the following formula:
$$ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{8} $$
where w_1, w_2, …, w_n > 0. It should be noted that if w_1 + w_2 + ⋯ + w_n = 1 then:
$$ \bar{x} = \sum_{i=1}^{n} w_i x_i \tag{9} $$
If w_1 + w_2 + ⋯ + w_n > 1 then Eq. 8 can be used.

Definition 3
The similarity between two data records R_1 and R_2 is defined by:
$$ S(R_1, R_2) = \frac{w_1\, S_{Num}(R_1, R_2) + w_2\, S_{Bin}(R_1, R_2)}{w_1 + w_2} \tag{10} $$
where S_Num is the numerical similarity value, S_Bin is the categorical (binary) similarity value, and w_1 and w_2 are non-negative values which can be used for giving weights to the numerical and binary features respectively. We have introduced a list of similarity measures based on Definition 3. Table 3 shows the combination of similarity measures that have been generated based on Eq. 8 from well-known distances. These measures will be used in the next section for the experimental work.

Table 3 The combination of similarity measures based on a weighted average
The measure | The formula
M_jj | $S_3(R_1, R_2) = \frac{w_1 S_{Jaccard\,Num}(R_1, R_2) + w_2 S_{Jaccard\,Bin}(R_1, R_2)}{w_1 + w_2}$
M_caj | $S_4(R_1, R_2) = \frac{w_1 S_{Canberra}(R_1, R_2) + w_2 S_{Jaccard\,Bin}(R_1, R_2)}{w_1 + w_2}$
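As an illustration of Definition 3, the sketch below computes a combined similarity in the spirit of M_caj: a Canberra-based similarity on the numerical features and the Jaccard coefficient on the binary features, merged by a weighted average. Turning the Canberra distance into a similarity by dividing by the number of numerical features and subtracting from 1 is an assumption made here to keep the value in [0, 1], as are the function names and the index lists `num_idx` and `bin_idx`.

```python
def canberra_similarity(x, y):
    """1 minus the Canberra distance averaged over the features (assumed normalisation)."""
    if not x:
        return 1.0
    total = 0.0
    for a, b in zip(x, y):
        den = abs(a) + abs(b)
        if den:
            total += abs(a - b) / den
    return 1.0 - total / len(x)


def jaccard_binary_similarity(x, y):
    """Jaccard coefficient a / (a + b + c) for binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    mismatches = sum(1 for xi, yi in zip(x, y) if xi != yi)  # b + c
    den = a + mismatches
    return a / den if den else 1.0  # assumed: all-zero pairs count as identical


def combined_similarity(r1, r2, num_idx, bin_idx, w1=0.5, w2=0.5):
    """Weighted average of the numerical and binary similarities of two records."""
    if w1 < 0 or w2 < 0 or w1 + w2 == 0:
        raise ValueError("weights must be non-negative and not both zero")
    s_num = canberra_similarity([r1[i] for i in num_idx], [r2[i] for i in num_idx])
    s_bin = jaccard_binary_similarity([r1[i] for i in bin_idx], [r2[i] for i in bin_idx])
    return (w1 * s_num + w2 * s_bin) / (w1 + w2)


# Hypothetical records: two normalised numerical features followed by three binary ones.
r1 = [0.2, 0.6, 1, 0, 1]
r2 = [0.2, 0.6, 1, 1, 0]
print(combined_similarity(r1, r2, num_idx=[0, 1], bin_idx=[2, 3, 4], w1=0.8, w2=0.6))
```

Swapping `canberra_similarity` for another numerical similarity would give the other combinations in the same way, since only the weighted-average step is shared.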

Experimental analysis
This section evaluates the effectiveness of both traditional k-NN and k-NN with the combination of similarity measures over six heterogeneous data sets from different domains. The data sets are described by mixtures of numerical and binary features only. The characteristics of the data sets are shown in Table 4. Two data sets, Hypothyroid and Hepatitis, are taken from the UCI Machine Learning Repository [52], and four data sets, Treatment, Labour training evaluation, Catsup and Azpro, are taken from R packages. More description of the data sets is available in [53]. The UCI data sets were selected after an in-depth review of existing UCI benchmark data sets to satisfy the following conditions:
1. The data set should contain numerical and binary features only.
2. The data should not contain more than 3% of missing values.
3. The number of features for each type of data should be enough for calculating the similarity (not less than 2).
4. The number of classes should be small.
Both types of data sets (benchmark and real) were chosen to cover small to medium-sized data sets.

Data pre-processing
Before running the experiments, all datasets were preprocessed by removing irrelevant features (ID), and data objects with missing values. Numerical features were normalised to fall between 0 and 1. Each data set was split randomly into 80% for training and 20% for the testing sets.
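The pre-processing steps can be sketched as below. This is an illustrative outline only: it assumes that missing values are represented as `None`, that min-max scaling is the normalisation used to map numerical features to [0, 1], and that a fixed random seed is acceptable; the function names and the toy records are hypothetical.

```python
import random


def drop_missing(rows, labels):
    """Keep only the records (and their labels) that contain no missing (None) value."""
    kept = [(r, l) for r, l in zip(rows, labels) if all(v is not None for v in r)]
    return [r for r, _ in kept], [l for _, l in kept]


def min_max_normalise(rows, numeric_idx):
    """Rescale the columns listed in numeric_idx to [0, 1]; other columns are untouched."""
    out = [list(r) for r in rows]
    for j in numeric_idx:
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        for r in out:
            # A constant column carries no information and is mapped to 0.
            r[j] = (r[j] - lo) / (hi - lo) if hi > lo else 0.0
    return out


def split_80_20(rows, labels, test_fraction=0.2, seed=0):
    """Random split into training and test parts (80% / 20% by default)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


# Hypothetical records: one numerical feature and one binary feature.
rows = [[10, 1], [20, 0], [None, 1], [30, 1], [40, 0], [50, 1]]
labels = ["A", "B", "A", "A", "B", "A"]
rows, labels = drop_missing(rows, labels)
rows = min_max_normalise(rows, numeric_idx=[0])
train_X, train_y, test_X, test_y = split_80_20(rows, labels)
print(len(train_X), len(test_X))
```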
Five k values were evaluated: 1, 3, 5, 7 and 9 neighbours. We investigated the implementation of k-NN with two different categories of measures: the first category includes the Euclidean and Manhattan measures, while the second category includes the four combinations of similarity measures, which are described in Table 3.
It should be noted that we applied normalised Euclidean and normalised Manhattan distances to numerical data. Therefore, all the obtained results fall between 0 and 1. Because the similarity is the complement of the distance, in this study the similarity is computed as:
$$ S(x, y) = 1 - d(x, y) \tag{11} $$
All the measures are used with the k-NN classifier individually with three different weight settings, and these measures are applied with k-NN to the same training and test samples each time. For evaluating the performance of k-NN we have used both the accuracy (A) and F-score (F) metrics. It should be noted that:
1. The values of w_1 and w_2 are set by default as follows: (i) When the numerical features are more important than the binary features, we set w_1 = 0.8 and w_2 = 0.6. (ii) When the binary features are more important than the numerical features, we set w_1 = 0.6 and w_2 = 0.8. (iii) When the numerical and binary features have the same degree of importance, we set w_1 = 0.5 and w_2 = 0.5.
2. The values w_1 = 0 and w_2 = 1, or w_1 = 1 and w_2 = 0, are not suggested for heterogeneous data because this leads to using a single measure, negating the advantages of a combined measure.
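Because the combined measure divides by w_1 + w_2, only the ratio of the two weights affects the result; for instance, w_1 = 0.8 and w_2 = 0.6 act as effective weights of 4/7 ≈ 0.571 and 3/7 ≈ 0.429. A short worked example makes the effect of the three settings concrete. The similarity values S_Num = 0.9 and S_Bin = 0.5 are hypothetical and are not taken from the experiments.

```latex
\begin{align*}
(w_1, w_2) = (0.8, 0.6):&\quad S = \frac{0.8 \times 0.9 + 0.6 \times 0.5}{0.8 + 0.6} = \frac{1.02}{1.4} \approx 0.729,\\
(w_1, w_2) = (0.6, 0.8):&\quad S = \frac{0.6 \times 0.9 + 0.8 \times 0.5}{0.6 + 0.8} = \frac{0.94}{1.4} \approx 0.671,\\
(w_1, w_2) = (0.5, 0.5):&\quad S = \frac{0.5 \times 0.9 + 0.5 \times 0.5}{0.5 + 0.5} = 0.700.
\end{align*}
```

Under these assumed values the three settings shift the combined similarity by only about 0.03 in either direction from the equal-weight case.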
The implementation of classifying heterogeneous data can be summarised in the following steps:
1. For each data set, set the values of k, w_1 and w_2.
2. Split the data randomly into 80% for training and 20% for the test sample.
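A possible shape for the overall procedure is sketched below. Only the first two steps are listed above, so everything after the split (ranking training records by similarity, majority voting, and scoring accuracy for every combination of k and weights) is an assumed continuation. The placeholder similarities inside `make_similarity`, the function names, and the toy records are illustrative choices and not the measures of Table 3.

```python
from collections import Counter


def make_similarity(w1, w2, n_num):
    """Build a weighted-average similarity for records whose first n_num features are numerical."""
    def similarity(r1, r2):
        num1, num2 = r1[:n_num], r2[:n_num]
        bin1, bin2 = r1[n_num:], r2[n_num:]
        # Placeholder numerical similarity: 1 - mean absolute difference (features in [0, 1]).
        s_num = 1 - sum(abs(a - b) for a, b in zip(num1, num2)) / len(num1)
        # Placeholder binary similarity: share of binary features with identical values.
        s_bin = sum(a == b for a, b in zip(bin1, bin2)) / len(bin1)
        return (w1 * s_num + w2 * s_bin) / (w1 + w2)
    return similarity


def knn_by_similarity(train, labels, query, k, similarity):
    """Majority vote among the k training records most similar to the query."""
    ranked = sorted(zip(train, labels),
                    key=lambda pair: similarity(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, train_labels, test, test_labels, k, similarity):
    """Fraction of test records whose predicted class equals the true class."""
    hits = sum(knn_by_similarity(train, train_labels, q, k, similarity) == t
               for q, t in zip(test, test_labels))
    return hits / len(test)


# Hypothetical, already pre-processed records: two numerical then two binary features.
train = [[0.10, 0.20, 1, 0], [0.20, 0.10, 1, 0], [0.15, 0.25, 1, 1],
         [0.05, 0.15, 1, 0], [0.25, 0.20, 1, 1],
         [0.90, 0.80, 0, 1], [0.80, 0.90, 0, 1], [0.85, 0.75, 0, 0],
         [0.95, 0.85, 0, 1], [0.75, 0.80, 0, 0]]
train_labels = ["A"] * 5 + ["B"] * 5
test = [[0.12, 0.18, 1, 0], [0.88, 0.82, 0, 1]]
test_labels = ["A", "B"]

results = {}
for w1, w2 in [(0.8, 0.6), (0.6, 0.8), (0.5, 0.5)]:
    sim = make_similarity(w1, w2, n_num=2)
    for k in (1, 3, 5, 7, 9):
        results[(w1, w2, k)] = accuracy(train, train_labels, test, test_labels, k, sim)
print(results)
```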

Experimental results
The experimental work was carried out in three stages. For each stage, the implementation steps are applied with different weight values as mentioned above. In the first stage of the experimental work, we assume that the numerical features are more important than the binary features. Tables 5, 6, 7, 8 and 9 show the results obtained by applying k-NN to six heterogeneous data sets with k = 1, 3, 5, 7, and 9, and w_1 = 0.8 and w_2 = 0.6.
As it can be seen from the experiments, for traditional k-NN, the results showed that k-NN with Manhattan distance produces better results compared to the classifier with Euclidean distance for all data sets and all k values.
The experiments showed that k-NN with the combination of similarity measures performs well for classifying the six heterogeneous data sets, and outperforms k-NN with Euclidean distance. The four combinations of similarity measures are efficient in handling both numerical and binary features together. However, among them, M_caj performed the lowest in most cases. Moreover, Manhattan distance and the combination of similarity measures produce very close results.

Table 5 The results obtained by k-NN with all measures and k = 1, w_1 = 0.8 and w_2 = 0.6 (columns: Dataset; Traditional k-NN; k-NN with combination similarity measures; each measure reported as A (%) and F (%))
The results also showed that the optimal number of k is 1 for the Hypothyroid, Hepatitis, Treatment, and Labour training evaluation data sets, and k = 3 for the Catsup and Azpro data sets. Our results showed that some measures outperform the others. Table 10 shows the best measures used with k-NN for each given k value when w_1 = 0.8 and w_2 = 0.6. Based on Table 10, it is clear that k-NN with the combination of similarity measures outperforms traditional k-NN.
In the second stage of the experimental work, we assume that the binary features are more important than the numerical features. Tables 11, 12, 13, 14 and 15 show the results obtained by applying k-NN to six heterogeneous data sets with k = 1, 3, 5, 7, and 9, and w_1 = 0.6 and w_2 = 0.8.
According to the results k-NN with Manhattan distance outperforms k-NN with Euclidean distance.
The obtained results showed that the optimal number is k = 1 for the Hypothyroid, Hepatitis, Treatment, and Labour training evaluation data sets, and k = 3 for the Catsup and Azpro data sets. Table 16 shows the best measures used with k-NN for each given k value when w_1 = 0.6 and w_2 = 0.8.

Table 8 The results obtained by k-NN with all measures and k = 7, w_1 = 0.8 and w_2 = 0.6
Table 9 The results obtained by k-NN with all measures and k = 9, w_1 = 0.8 and w_2 = 0.6
Table 10 The best measures used with k-NN for each given k value when w_1 = 0.8 and w_2 = 0.6

In the third stage of the experimental work, our presumption is that both types of features are equally important. Therefore, we assign the same weight value to both of them, w_1 = w_2 = 0.5. Tables 17, 18, 19, 20 and 21 show the results obtained by applying k-NN to six heterogeneous data sets with k = 1, 3, 5, 7, and 9, and w_1 = w_2 = 0.5.
Again, k-NN with Manhattan distance outperforms k-NN with Euclidean distance, and the combination of similarity measures performs well with the k-NN classifier. k = 1 is the optimal number for the Hypothyroid, Hepatitis, Treatment, and Labour training evaluation data sets, and k = 5 is the optimal number for the Catsup and Azpro data sets. Table 22 shows the best measures used with k-NN for each given k value when w_1 = 0.5 and w_2 = 0.5.

Table 16 The best measures used with k-NN for each given k value when w_1 = 0.6 and w_2 = 0.8

As can be seen from all the results obtained by the experiments, there are significant differences between the performance of k-NN with Manhattan distance and k-NN with Euclidean distance. k-NN with Manhattan distance performs reasonably well over all heterogeneous data sets compared to k-NN with Euclidean distance.
Therefore, the results suggest that k-NN with Euclidean distance is not fit for handling naturally heterogeneous data sets. This result supports the results of previous research in [7], which investigated the performance of k-NN with different single measures for classifying heterogeneous data.

Conclusions and future work
Since the k-NN classification is based on measuring the distance between the test sample and each of the training samples, the chosen distance function plays a vital role in determining the final classification output. The major objective of this study was to investigate the performance of k-NN using several measures, including single measures (Euclidean and Manhattan) and a number of combinations of similarity measures, for computing the similarity between data objects described by numerical and binary features. Experiments were carried out on six heterogeneous data sets from different domains.
The overall results of our experiments showed that Euclidean distance is not an appropriate measure that can be used with k-NN for classifying a heterogeneous data set of numerical and binary features.
Furthermore, our results showed that combining the results of numerical and binary similarity measures is a promising method to get better results than just using one single measure.
Moreover, we have observed that there are no significant differences among the results obtained with the three weight settings used with k-NN, which may suggest some robustness of the algorithm to the impact of compact heterogeneous features on the classification performance.
Generally, the study has applied the combination of similarity measures with k-NN in global terms. This approach does not consider data pre-processing before the analysis.
The study results suggest a need for future work: some weights and measures do not necessarily perform well because of the distribution or the quality of the data. Therefore, in future work we will address optimisation of the weight selection based on this characteristic of the data representing the ability and quality of the training and testing sets.
Finally, it is important to outline that this work is restricted to limited data types and number of measures, and therefore we aim to investigate the performance and applicability of k-NN for heterogeneous data sets described by more than two types of data, such as numerical, binary, nominal, ordinal, and apply a wider range of measures.