Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets

Ali, Najat; Neagu, Daniel; Trundle, Paul

doi:10.1007/s42452-019-1356-9

Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets

Research Article
Open access
Published: 06 November 2019

Volume 1, article number 1559, (2019)
Cite this article

Download PDF

You have full access to this open access article

SN Applied Sciences Aims and scope Submit manuscript

Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets

Download PDF

Najat Ali¹,
Daniel Neagu¹ &
Paul Trundle¹

34k Accesses
143 Citations
9 Altmetric
1 Mention
Explore all metrics

Abstract

Distance-based algorithms are widely used for data classification problems. The k-nearest neighbour classification (k-NN) is one of the most popular distance-based algorithms. This classification is based on measuring the distances between the test sample and the training samples to determine the final classification output. The traditional k-NN classifier works naturally with numerical data. The main objective of this paper is to investigate the performance of k-NN on heterogeneous datasets, where data can be described as a mixture of numerical and categorical features. For the sake of simplicity, this work considers only one type of categorical data, which is binary data. In this paper, several similarity measures have been defined based on a combination between well-known distances for both numerical and binary data, and to investigate k-NN performances for classifying such heterogeneous data sets. The experiments used six heterogeneous datasets from different domains and two categories of measures. Experimental results showed that the proposed measures performed better for heterogeneous data than Euclidean distance, and that the challenges raised by the nature of heterogeneous data need personalised similarity measures adapted to the data characteristics.

Integrated Effect of Nearest Neighbors and Distance Measures in k-NN Algorithm

The distance function effect on k-nearest neighbor classification for medical datasets

Article Open access 09 August 2016

Multi-functional nearest-neighbour classification

Article Open access 06 March 2017

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Classification is a supervised machine learning process that maps input data into predefined groups or classes [1]. The main condition for applying a classification technique is that all data objects should be assigned to classes, and that each of the data objects should be assigned to only one class [2].

Distance-based classification algorithms are techniques used for classifying data objects by computing the distance between the test sample and all training samples using a distance function. Distance-based algorithms though were originally proposed to deal with one type of data using distance-based measurements to determine the similarity between data objects. These algorithms were subsequently developed to enable handling of heterogeneous data as real-world data sets are often diverse in types, format, content and quality, particularly when they are gathered from different sources.

In general, when classifying heterogeneous data using distance-based algorithms, there are two categories of methods. The first category converts values from one data type to another (e.g. binning data, interpolating or projecting data) and then, distance-based algorithms can be used with an appropriate measurement to classify the data.

However, this method is not effective as the similarity measure of the transformed data does not necessarily represent consistently the similarity of the original heterogeneous data, especially when the transformation is not fully reversible. Moreover, the data conversion could also fundamentally alter values to make them more equidistant, meaning there are no guarantees that data will be interpreted correctly, which introduces the risk of losing or altering vital information in the process of decision the classification task is designed to support.

The second category extends distance-based algorithms to match the heterogeneous data. This can be done using a distance measures that can handle heterogeneous data.

One common classification technique based on the use of distance measures is k-nearest neighbours (k-NN) [3]. The traditional k-NN classification algorithm finds the k-nearest neighbour(s) and classifies numerical data records by calculating the distance between the test sample and all training samples using the Euclidian distance [4].

The primary focus of the k-NN classifier has been on data sets with pure numerical features [5]. However, k-NN can also be applied to other type of data includes categorical data [6]. Several investigations have been done to find a proper categorical measures for such data, such as the works presented in [7,8,9,10,11,12].

Moreover, it also can be applied to classify data described by numerical and categorical features such as studies reported in [7, 13].

This paper aims to investigate the performance of k-NN classification on heterogeneous data sets using two types of measures:the well-known (Euclidean and Manhattan) distances and the combination of similarity measures that are formed by fusing existing numerical distances with binary data distances. It also aims to provide a first attempt of guidance as to the best combination of similarity function that can be used with k-NN for heterogeneous data classification (of numerical and binary features). The rest of this paper is organised as follows. Section 2 provides the concepts, background and literature review relevant for the research topic. Section 3 briefly describes the six well-known distance functions that are used in this study and explains the proposed technique for classifying heterogeneous data. Section 4 presents the experimental work and results. Finally, Sect. 5 presents the conclusion and future work.

2 Background

2.1 Distance and similarity measures

The concept of similarity between data objects is widely used across many domains to solve a variety of pattern recognition problems such as categorisation, classification, clustering and forecast [14]. Various measures have been proposed in the literature for comparing data objects [15]. In this section the concepts of distance measure, similarity measure are introduced, followed by a review of the k-NN algorithm and its performance evaluation.

Definition 1

A distance measure $\ d:X\ \times \ X\ \rightarrow \ R$ is a function called metric if it satisfies the following requirements [16] $\forall x,y,z \in X$ :

1.
$0 \le d\left( x,y\right)$ $(Non\text {-}negative$);
2.
$d(x,\ y)\ =\ 1$ ,if and only if $x=y$ (Identity);
3.
$d(x,\ y)\ =\ d(y,\ x)$ (Symmetry);
4.
$d\ (x,\ z)\ \le \ d\ (x,\ y)\ +\ d(y,\ z)$ (Triangle inequality).

However, similarity measurement shows more debates, as it provides some flexibility in the identification of how close two data objects could be. Similarity measure is generally perceived as complementary to a distance measure.

Definition 2

similarity measure $\ S:X\ \times \ X\ \rightarrow \ R$ is a function that satisfies the following requirements $\forall x,y \in X$ :

1.
$0 \le S\left( x,y\right)$ $(Non\text {-}negative$);
2.
$S(x,\ y)\ =\ 1$ ,if and only if $x=y$ (Identity);
3.
$S(x,\ y)\ =\ S(y,\ x)$ (Symmetry).

2.2 K-nearest neighbour classifier (k-NN)

In this section, we look at the classification that uses the concept of distance for classifying data objects. The k-NN classifier is one of the simplest and most widely used in such classification algorithms. k-NN was proposed in 1951 by Fix and Hodges [17] and modified by Cover and Hart [3]. The technique can be used for both classification and regression [18].

The main concept for k-NN depends on calculating the distances between the tested, and the training data samples in order to identify its nearest neighbours. The tested sample is then simply assigned to the class of its nearest neighbour [19].

In k-NN, the k value represents the number of nearest neighbours. This value is the core deciding factor for this classifier due to the k-value deciding how many neighbours influence the classification. When $\text {k}=1$ then the new data object is simply assigned to the class of its nearest neighbour. The neighbours are taken from a set of training data objects for where the correct classification is already known. k-NN works naturally with numerical data. Various numerical measures have been used such as Euclidean, Manhattan, Minkowsky, City-block, and Chebyshev distances. Amongst these, the Euclidean is the most widely used distance function with k-NN [20]. The main steps of k-NN algorithm in Fig. 1 are:

1.
Determine the number of nearest neighbours (K values).
2.
Compute the distance between test sample and all the training samples.
3.
Sort the distance and determine nearest neighbours based on the K-th minimum distance.
4.
Assemble the categories of the nearest neighbours.
5.
Utilise simple majority of the category of nearest neighbours as the prediction value of the new data object.

According to [21], the k-NN classifier can be used to classify new data objects using only their distance to labelled samples. However, some works consider any metric or non-metric measures used with this classifier: several studies have been conducted to evaluate the k-NN classifier using different metric and non-metric measures such as the studies presented in [7, 10, 22,23,24,25,26].

2.3 Performance metrics for classification

The most widely used technique for summarizing the performance of a classification algorithm is the Confusion Matrix. Figure 2 shows the confusion matrix for the case of binary classification with the following elements:

1.
True Positives (TP) is defined by the total number of accurate outputs when the actual class of the data object was True and the prediction was also the True value.
2.
True Negatives (TN) is defined by the total number of accurate outputs when the actual class of the data object was False and the predicted is also the False value.
3.
False Positives (FP) when the actual class of the data object was False and the output value was the True value
4.
False Negatives (FN) when the actual class of the data object was True and the output value was the False value.

2.3.1 Metrics computed from a confusion matrix

A confusion matrix gives a useful information about how well the model does. However, its elements can be used to calculate many performance metrics to get even more information. Among the most popular are (see also Tables 1, 2):

1
Accuracy is the most intuitive performance measure, and defined as the ratio of the number of correctly classified objects to the total number of objects evaluated.
2
Precision it is simply a ratio of correctly predicted positive data objects to the total predicted positive data objects.
3
Recall it is defined by the number of correct positive results divided by the total number of relevant samples (all samples that should have been identified as positive).
4
F-score it can be defined as a weighted average of the precision and recall. An F-score is considered perfect when reaches its best value at 1, while the model is a total failure when it reaches the 0 value.

Tables 1 and 2 show the evaluation measures for binary and multi-class dataset respectively.

Table 1 Evaluation measures for binary class data set

Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets

Abstract

Similar content being viewed by others

Integrated Effect of Nearest Neighbors and Distance Measures in k-NN Algorithm

The distance function effect on k-nearest neighbor classification for medical datasets

Multi-functional nearest-neighbour classification

1 Introduction

2 Background

2.1 Distance and similarity measures

Definition 1

Definition 2

2.2 K-nearest neighbour classifier (k-NN)

2.3 Performance metrics for classification

2.3.1 Metrics computed from a confusion matrix

2.4 Related work

3 Measures for comparing data objects

3.1 Measures for numerical data

3.2 Measures for categorical data

3.3 Similarity measures for objects described by heterogeneous features

Definition 3

4 Experimental analysis

4.1 Data pre-processing

4.2 Experimental results

5 Conclusions and future work

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation