IoT-cloud based healthcare model for COVID-19 detection: an enhanced k-Nearest Neighbour classifier based approach

COVID-19 has severely affected the entire world, and the pandemic has caused many casualties in a very short span. In this situation, an IoT-cloud-based healthcare model is urgently required to support better decisions during the COVID-19 pandemic. In this paper, an attempt has been made to perform predictive analytics regarding the disease using a machine learning classifier. This research proposes an enhanced KNN (k-Nearest Neighbor) algorithm, eKNN, which does not choose the value of k randomly. Instead, it determines the k value with a mathematical function of the dataset's sample size. The eKNN algorithm was evaluated on 7 benchmark COVID-19 datasets of different sizes, gathered from standard data clouds of different countries (Brazil, Mexico, etc.). The enhanced KNN classifier performed significantly better than ordinary KNN. The second research question augmented the enhanced KNN algorithm with feature selection using ACO (Ant Colony Optimization). Results indicated that the enhanced KNN classifier with the feature selection mechanism performed better than enhanced KNN without feature selection. This paper thus proposes an improved KNN that attempts to find an optimal value of k, and studies IoT-cloud-based COVID-19 detection.


Introduction
IoT-cloud-based healthcare predictive models play an important role in the detection of several diseases and support better decisions for their users. Such an IoT-cloud-based healthcare model is needed for the detection of COVID-19. COVID-19 is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). It was declared a pandemic in March 2020, and according to the WHO (World Health Organization), there were 21,409,133 active cases worldwide as of December 2020. The virus mainly spreads through respiratory droplets released while coughing, sneezing, etc., and is transmitted from person to person. While fever and cough are the primary symptoms of the disease, certain pre-existing medical conditions (heart disease, diabetes, COPD, cancer) aggravate the disease's outcome. In an effort to curb this disease and build a pandemic-prepared healthcare system, this paper applies predictive data mining techniques to COVID-19 data to understand the disease better.
Several researchers [5,6,8,33,35,42,43] have applied machine learning as well as deep learning algorithms to COVID-19 data and derived fruitful insights. While some researchers used neural networks [5,6,7] or deep learning methods, others used regression [12] or classification algorithms [21,24] to predict the prognosis of the disease. In this paper, we explore the supervised learning algorithm KNN (k-Nearest Neighbor) [14,16,17,18] on seven COVID-19 cloud datasets gathered across the world (Asia, Brazil, Mexico, etc.). The dataset sample sizes range from 5000 to 1.5 million. The KNN algorithm was chosen because of its simplicity and ease of use. The experimentation demonstrates an enhanced KNN algorithm that uses a radical mathematical function (square root, cube root, fourth root, etc.) of the dataset's sample size as the k value. The k value is represented as $k = \sqrt[n]{N}$, where the radicand N is the sample size of the dataset (the number of records present in the dataset) and n = 2, 3, 4, etc. Thus, the k value is chosen dynamically at runtime depending on the dataset size. In traditional KNN, the value of k is chosen arbitrarily or randomly [20,25]; the enhanced KNN (eKNN) overcomes this shortcoming. Several performance indicators (accuracy, precision, F1 score, error rate, etc.) [26,27,31,32] were calculated to show the effectiveness of the proposed eKNN. The performance of the eKNN algorithm improved further when it was used with an ACO (Ant Colony Optimization) based feature selection (FS) mechanism [30]. For a fair comparison, a C4.5-based FS mechanism was also investigated. The parameters were recalculated and compared with those of enhanced KNN without feature selection. Several graphical representations are shown for better performance visualization.

Contribution
The paper addresses two research questions:

RQ1 - How can the KNN classifier choose the optimal value of k?
The conducted literature survey did not reveal any experimentation addressing this shortcoming. To bridge this gap, this research proposes eKNN (Enhanced KNN), which overcomes this limitation of traditional KNN. The proposed eKNN was applied to seven benchmark datasets gathered from the COVID cloud repositories of different countries (Brazil, Mexico, etc.). This data analysis can act as the back-end of a healthcare model in which IoT is used to develop the front-end interface.

RQ2 - What is the effect of using the KNN classifier with a feature selection mechanism?
The proposed eKNN algorithm is combined with a feature selection mechanism to make it more robust. Apart from the ACO-based feature selection mechanism, a C4.5-based FS mechanism is also explored for a fair comparison. All the datasets were retested, and graphical illustrations are shown to visualize the effect.

Organization
The paper is organized as follows: Section 2 outlines the related work, while Sect. 3 discusses the proposed methodology and the working principle of the proposed eKNN algorithm. Section 4 describes the datasets used. Section 5 presents the experimental setup and compares traditional KNN, eKNN with feature selection, and eKNN without feature selection. Finally, a conclusion with future directions is provided in Sect. 6.

Related work
Many researchers [22,23,36,37,39,47,48,49] have been exploring machine learning as well as deep learning techniques for health monitoring and for tracking COVID-19 over cloud and IoT platforms. These studies can help in effective prediction and decision-making regarding the deadly disease. They can also support early intervention and thus eventually reduce the mortality rate.
Khanday et al. used traditional and ensemble machine learning classifiers for classifying textual clinical reports [14]. The authors found that logistic regression and Naïve Bayes performed better than other ML algorithms, reaching 96.2% testing accuracy. Tiwari et al. [28] proposed an unsupervised terminformer model to mine terms from the biomedical literature for COVID-19. Flesia et al. studied 2053 individuals with 18 socio-psychological variables and identified participants with elevated stress levels using a predictive machine learning approach [13]. Randhawa et al. focused on the COVID-19 virus genome signature [15]. The authors combined machine learning and digital signal processing (MLDSP) to classify COVID-19 virus genomic sequences. Souza et al. applied several supervised ML algorithms (linear regression, decision tree, SVM, Gradient Boosting, etc.) to COVID-positive patients and compared the outcome of each method [12]. Yan et al. studied 2799 patients from Wuhan and designed a prediction model using XGBoost to predict mortality [11]. The SIR (Susceptible, Infected, and Recovered) model and machine learning were used for COVID-19 pandemic forecasting by Ndiaye et al. [10]. Castelnuovo et al. studied Random Forest and indicated that decreased renal function is a potential cause of death in COVID patients [5]. Lalmuanawmaa et al. showed how AI could reduce human interference in medical diagnosis [34]. Somasekar et al. applied neural networks to the image classification of chest X-rays [7,45,46]. Amar et al. applied regression to COVID data collected from Egypt and predicted the number of infected people [8]. The average discharge time of COVID patients from the hospital was analyzed by Nemati et al., and this study [9] helped hospital professionals to stay better prepared for the disease.
The application of KNN [19,38,41] and of feature selection mechanisms [1,2] to medical diagnosis is also not new. In 2016, Li et al. conducted a study on depression using EEG graphs and found that optimal performance is achieved using a combination of CFS (Correlation Features Selection) and KNN [3]. Remeseiro et al. applied feature selection techniques to medical applications and conducted a case study on two medical applications with real-life patient data [4].
This background work motivated the present research to pursue machine learning approaches for COVID-19 detection.

Proposed methodology
In the first setup, the eKNN algorithm was applied without a feature selection mechanism to several datasets obtained from the COVID cloud. This back-end data analysis can be used to build an IoT-based front-end COVID screening system.

Enhanced KNN (eKNN)
In the next step, the eKNN classifier was applied to all seven datasets. Classification is a supervised ML method, and the role of a classifier is to map the input data into classes; each object is classified into one group only. The KNN classifier was developed by Fix and Hodges in 1951 [19] and is an example of a simple classification algorithm. It can be used for both classification and regression. In KNN, distances (Manhattan, Euclidean, Minkowski, Chebyshev, etc.) are calculated between the test sample and the training samples, and thus the nearest neighbors are obtained. The neighbors are chosen from a set of training objects whose classes are already known. The test sample is assigned to the majority class among its k nearest neighbors.

KNN representation
The KNN join of two sets P and Q is represented by {(p, kNN(p, Q)), ∀ p ∈ P}, where p ∈ P, q ∈ Q, and kNN(p, Q) denotes the k points of Q that are closest to p in a d-dimensional space.
The KNN algorithm suffers from one limitation: its dependency on a proper value of the parameter k (the number of nearest neighbors). The performance of the KNN algorithm depends strongly on this factor, because the k value sets the number of neighbors that determine the class. In most cases the k value is chosen randomly, and this paper attempts to overcome this drawback.
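The KNN join can be illustrated with a minimal Python sketch. It assumes Euclidean distance and small in-memory point lists; the function names `knn` and `knn_join` and the sample points are hypothetical and not taken from the paper.

```python
import math

def knn(p, Q, k):
    """Return the k points of Q that are closest to p (Euclidean distance)."""
    return sorted(Q, key=lambda q: math.dist(p, q))[:k]

def knn_join(P, Q, k):
    """Pair every point p in P with its k nearest neighbours in Q."""
    return [(p, knn(p, Q, k)) for p in P]

# Hypothetical 2-dimensional points (d = 2).
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (9.0, 9.0)]

joined = knn_join(P, Q, k=2)
for p, neighbours in joined:
    print(p, "->", neighbours)
```

Sorting all of Q for every p is the simplest formulation; it is meant to show the definition, not to scale to datasets with millions of records.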
In this research, a radical mathematical function (square root, cube root, fourth root, etc.) is used to determine the k value dynamically at computation time depending on the size of the dataset. The k value is represented as $k = \sqrt[n]{N}$, where the radicand N is the sample size of the dataset (the number of records present in the dataset) and n = 2, 3, 4, etc. For example:
- When the dataset contains around 5000 records (Dataset 1 in Table 1), the k value is set to $\sqrt{5000} \approx 71$. A small k value does not produce the expected output because of noise/outliers (presented in Table 3), and for that reason the k value was not set to $\sqrt[3]{5000} \approx 17$.
- When the dataset contains around 100,000 records (Dataset 3), setting the k value to $\sqrt{100000} \approx 317$ causes a large increase in the running time of the KNN algorithm (almost 2 times, as presented in Table 3) and thus leads to poor performance. To avoid this performance degradation, the k value is set to $\sqrt[3]{100000} \approx 47$, and the computation time of KNN becomes manageable.
- Again, when the dataset becomes bigger, with around 1 million (Dataset 6) or 1.5 million (Dataset 7) records, the k value is set to $\sqrt[4]{N}$. For example, for Dataset 6 the k value is set to $\sqrt[4]{1000000} \approx 31$.
An unlabelled test sample t is given.
1. Determine the value of the parameter k by $\sqrt[n]{N}$, where n = 2, 3, 4 depending on the size of the dataset.
2. Choose an odd value of k, to avoid equal voting.
3. Calculate the distance between the test sample t and all training samples.
4. The distance between two points (a point x in the training data and a point y in the testing data) is computed, for example, with the Euclidean distance $d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$.
5. Sort the distances and identify the nearest neighbors of t.
6. Assemble the classes/categories of the nearest neighbors.
7. The majority class among the nearest neighbors is the predicted class of t.
Output: The class label of test sample t is c(t) = c.
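The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The size cut-offs `small` and `large` are hypothetical, since the text only gives examples (square root near 5000 records, cube root near 100,000, fourth root from 1 million upward). Rounding the root to the nearest odd integer is an assumption that happens to reproduce the k values quoted in the text. Euclidean distance is also assumed.

```python
import math
from collections import Counter

def choose_k(n_samples, small=10_000, large=1_000_000):
    """Pick k as the nearest odd integer to the n-th root of the sample size.

    `small` and `large` are hypothetical cut-offs for switching between
    the square root, cube root and fourth root.
    """
    if n_samples < small:
        n = 2
    elif n_samples < large:
        n = 3
    else:
        n = 4
    root = n_samples ** (1.0 / n)
    return 2 * math.floor(root / 2) + 1  # odd k avoids tied votes

def eknn_predict(train_X, train_y, t, k=None):
    """Predict the class of test sample t by majority vote of its k neighbours."""
    if k is None:
        k = choose_k(len(train_X))
    # Rank training samples by Euclidean distance to t (steps 3 to 5).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], t))
    # Collect the neighbours' classes and take the majority (steps 6 and 7).
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# k values for the sample sizes quoted in the text.
print(choose_k(5000), choose_k(100_000), choose_k(1_000_000))

# Hypothetical toy data: two features, two classes.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(eknn_predict(train_X, train_y, (2, 2)))
print(eknn_predict(train_X, train_y, (7, 9)))
```

The brute-force neighbour search costs one full pass and sort per test sample. On datasets of the size discussed here, an indexed search structure would normally replace the `sorted` call.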
The proposed eKNN classifier was applied to all seven datasets under experimentation, and a confusion matrix was obtained for each. The matrix contains information about the actual and predicted values of the classification. For each eKNN run, the accuracy, precision, recall, specificity, error rate, and F1 score were calculated from the confusion matrix. The resultant values are tabulated in Sect. 5.

eKNN with ACO-based feature selection
In this second phase of implementation, the eKNN algorithm was used with an ACO-based feature selection mechanism, since irrelevant or redundant features would bias the performance of the proposed eKNN. If the original feature space has size S, then the goal of the feature selection process is to select the optimal subset of features of size s (s ≤ S).
Ant colony optimization replicates the food-searching behavior of ants. As the ants move from one node to another, a chemical substance (pheromone) is deposited along the path. The pheromone trail helps other ants to find the food source by following the shortest path. The pheromone evaporates at a certain rate, resulting in the decay of less traversed paths. ACO is a probabilistic technique that promotes convergence and rapid solution-finding. Because of these traits, ACO was given preference over other techniques.
- The edges between the nodes guide the choice of the next feature.
- The amount of pheromone is indicated by τ.
- The features that belong to the route with a high level of pheromone are treated as the selected features.
- The selected feature subset is governed by the standard ACO transition probability
$$P^{n}_{ij}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{u \in j^{n}} [\tau_{iu}(t)]^{\alpha}\,[\eta_{iu}]^{\beta}} \quad \text{if } j \in j^{n}, \qquad P^{n}_{ij}(t) = 0 \text{ otherwise,}$$
  where $P^{n}_{ij}(t)$ is the probability of an ant at feature i moving to feature j at time instant t, n is the number of ants / number of features, $j^{n}$ is the set of potential features that can be present in the temporal solution, $\tau_{ij}$ indicates the amount of pheromone on edge (i, j), and $\eta_{ij}$ indicates the heuristic value associated with edge (i, j).
- All features have the same value of τ and η initially.
- α > 0 and β > 0 are determined experimentally and taken as α = 2, β = 0.5 in this study.
- The pheromone evaporation rate is 5% in this study.
- The stopping criterion is the maximum number of iterations.
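The ACO parameters above can be sketched in Python. This is a minimal illustration that assumes the standard ACO transition rule, in which an edge is weighted by pheromone to the power α times heuristic value to the power β. The values α = 2, β = 0.5 and the 5% evaporation rate come from the text. The function names and the 3-feature toy matrices are hypothetical, and the pheromone deposit rule is not specified in the text, so it is left out.

```python
import random

ALPHA, BETA = 2.0, 0.5   # exponents reported in the text
EVAPORATION = 0.05       # 5% pheromone evaporation per iteration

def transition_probabilities(i, candidates, tau, eta):
    """Probability of an ant at feature i moving to each candidate feature j."""
    weights = {j: (tau[i][j] ** ALPHA) * (eta[i][j] ** BETA) for j in candidates}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def choose_next_feature(i, candidates, tau, eta):
    """Sample the next feature according to the transition probabilities."""
    probs = transition_probabilities(i, candidates, tau, eta)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

def evaporate(tau):
    """Decay every pheromone value by the evaporation rate (in place)."""
    for row in tau:
        for j in range(len(row)):
            row[j] *= 1.0 - EVAPORATION

# Hypothetical toy problem with 3 features; tau and eta start out equal.
tau = [[1.0] * 3 for _ in range(3)]
eta = [[1.0] * 3 for _ in range(3)]

uniform = transition_probabilities(0, [1, 2], tau, eta)   # equal pheromone
tau[0][1] = 2.0                                           # edge (0, 1) reinforced
biased = transition_probabilities(0, [1, 2], tau, eta)
print(uniform, biased)

evaporate(tau)
print(tau[0])
print(choose_next_feature(0, [1, 2], tau, eta))
```

With α = 2, doubling the pheromone on an edge quadruples its weight, so reinforced routes quickly dominate the choice of the next feature. Evaporation scales all edges by the same factor and therefore only changes the ranking once deposits are added.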
The basic operation of this ACO-based feature selection mechanism is depicted in Fig. 4. After applying the ACO-based FS technique, among 8 features of Dataset 1, 5 features were selected (detailed in Table 2). Dataset 2, Dataset 6, and Dataset 7 have the same features, and 10 features were selected out of 15. Dataset 3 and Dataset 4 have the same features, and 11 out of those 18 features got selected. Dataset 5 initially had 19 features, and after the FS mechanism, 11 got selected. The selected features are tabulated in Table 2. The eKNN algorithm was applied on the reduced datasets, and accuracy, precision, recall, specificity, error rate, F1-score parameters were recalculated for each dataset. The calculated values are summarized in Sect. 5. After applying the eKNN classifier with ACO-based FS mechanism, C4.5 based FS mechanism was also explored to evaluate which FS mechanism is a better performer. The result comparison is tabulated in the next section.

Dataset description
The experimentation involved seven standard COVID datasets of different sizes and origins. Table 1 lists the datasets in ascending order of their size. For convenience, the datasets are named Dataset 1, Dataset 2, Dataset 3, etc. The first dataset was collected from Kaggle, and it has 5000 records (both COVID-positive and COVID-negative cases). The second dataset is a cross-country dataset (Our World in Data COVID-19 cloud dataset [29]) focusing on COVID-19 testing data obtained from https://ourworldindata.org/, covering the period from Jan 22, 2020 to March 2020. Dataset 6 and Dataset 7 were also gathered from the 'Our World in Data COVID-19 dataset' [29], but they represent different time frames. Dataset 6 covers April 1, 2020 to April 7, 2020, and it contains 1 million records, indicating rapid, massive outbreaks across the world. Dataset 7 covers August 1, 2020 to August 7, 2020, and it contains 1.5 million records. Dataset 3 was obtained from the Brazilian government's cloud (https://coronavirus.es.gov.br/painel-covid-19-es), and it contains 0.1 million records from 6/1/2020 to 10/8/2020. Dataset 4 is also from Brazil, and it contains a total of 0.25 million cases reported from 6/1/2020 to 24/12/2020. Dataset 5 contains 0.5 million records and was released by the Mexican government (https://www.gob.mx/ salud/documentos/datos-abiertos-152127).
Table 2 (fragment, selected features): Gender, tobacco, pneumonia, hypertension, COPD, diabetic, cough, diarrhea, pregnancy, sore throat, renal comorbidity
All the datasets contain several features like age, gender, country, heart disease, COPD, diabetes, pregnancy, smoking habits, etc. Three random datasets (Dataset 2, Dataset 3, and Dataset 4) were chosen to measure the presence of comorbidity, and Fig. 1 represents the count of patients with Comorbidity and Non-Comorbidity. One random dataset (Dataset 4) was chosen to judge the age and gender distribution.

Data preprocessing
The transformation of raw data into a meaningful format is known as data preprocessing. Data quality plays a vital role in determining the experimental results. All seven subject datasets were pre-processed and cleaned. As part of the data cleaning activity, duplicate records were deleted. Several records had certain fields set to NULL (age in Dataset 2, Serology Result in Dataset 4, Blood Pressure in Dataset 5, etc.). The NULL values were replaced with the average value of that particular field. Only very few records had inconsistent/junk characters in many fields, so such records were discarded.
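The cleaning steps described above can be sketched in plain Python. This is a minimal illustration under assumptions: records are dictionaries, NULL is represented by `None`, junk is any non-numeric value in a numeric field, and junk records are discarded before the averages are computed. The field names `age` and `bp` and the sample records are hypothetical.

```python
def preprocess(records, numeric_fields):
    """Clean a list of record dicts: de-duplicate, drop junk rows, impute NULLs."""
    # 1. Delete duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Discard records whose numeric fields hold junk (non-numeric) values.
    def is_clean(value):
        return value is None or isinstance(value, (int, float))

    clean = [rec for rec in unique if all(is_clean(rec[f]) for f in numeric_fields)]

    # 3. Replace NULL (None) values with the average of that field.
    for f in numeric_fields:
        present = [rec[f] for rec in clean if rec[f] is not None]
        if not present:
            continue
        mean = sum(present) / len(present)
        for rec in clean:
            if rec[f] is None:
                rec[f] = mean
    return clean

# Hypothetical raw records.
raw = [
    {"age": 30, "bp": 120},
    {"age": 30, "bp": 120},     # duplicate
    {"age": None, "bp": 110},   # NULL age
    {"age": 50, "bp": "#?"},    # junk characters
    {"age": 40, "bp": None},    # NULL blood pressure
]
cleaned = preprocess(raw, ["age", "bp"])
print(cleaned)
```

Mean imputation is only meaningful for numeric fields. A categorical field such as a serology result would need a different fill rule, which the text does not specify.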
For all the 7 datasets, the KNN algorithm was initially implemented using random values of k. The choice of the k parameter is made arbitrarily, and the confusion matrix is obtained for each case.

Evaluation measures
The confusion matrix helps to measure the performance of a classifier. It is a matrix of two dimensions (Actual and Predicted), and its cells are TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative), as presented in Fig. 5. True positives are the cases where the predicted value is yes (having the disease), and the patients really have the disease. True negatives are the cases where the predicted value is no (not having the disease), and the patients really do not have the disease. False positives are the cases where the predicted value is yes (having the disease), but the patients actually do not have the disease. False negatives are the cases where the predicted value is no (not having the disease), but the patients have the disease. Several performance indicators of a classifier are derived from this matrix. The existing KNN algorithm was run (with a random k value) on all the datasets, and the resultant confusion matrix yielded several performance parameters. The calculated performance metrics are accuracy, precision, specificity, recall, F1 score, and error rate. In each case, the computation time needed to validate the algorithm was also noted. The results are listed in Table 3 for the respective k values of the corresponding datasets. After this initial experiment, the proposed eKNN algorithm was validated, and Table 3 contains all the recalculated performance parameters after the eKNN implementation on each dataset. Figure 6 contains a graphical depiction of these parameters for different k values.
Table 3 Performance parameters with different neighbors (k value) for KNN and eKNN
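The performance indicators named above follow from the four confusion-matrix counts by their standard definitions. The Python sketch below shows them. The function name and the example counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the standard performance indicators from confusion-matrix counts."""
    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted positives).
        return num / den if den else 0.0

    total = tp + tn + fp + fn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, total),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1_score": ratio(2 * precision * recall, precision + recall),
        "error_rate": ratio(fp + fn, total),
    }

# Hypothetical counts for 100 test samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, which gives a quick consistency check when reading tabulated results.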

Experimental analysis and findings
From the obtained experimental results (Table 3), the following can be summarized:
- In terms of performance indicators such as accuracy, F1 score, error rate, and computation time, the eKNN algorithm performed significantly better than the ordinary KNN algorithm.
- For Dataset 1, when k was chosen randomly (k = 37), the obtained accuracy was only 75.6%. With eKNN, when k was chosen as $\sqrt{N}$ (in this case N = 5000, so k = 71), the obtained accuracy increased to [...] depending on the dataset's size. While k is presented as $\sqrt[n]{N}$, the radical mathematical function (square root, cube root, fourth root, etc.) varies according to the sample size (N) of the dataset, and thus n = 2, 3, 4, etc. This became very useful when the dataset size increased compared to Dataset 1 (only 5000 records). Dataset 2 and Dataset 3 have 50,000 and 0.1 million records, respectively. For both of these datasets, k was chosen as $\sqrt[3]{N}$, resulting in high accuracy of 89.9% and 91.5%, respectively (Table 3).
- For both Dataset 2 and Dataset 3, if the k value was taken as $\sqrt{N}$ instead of $\sqrt[3]{N}$, the k value became very high, resulting in approximately double the computation time while showing only a very slight increase in accuracy and a minor decrease in error rate. For example, for Dataset 3, when k = $\sqrt{N}$ = 317 was chosen, the accuracy was 92.5%, the error rate was 6.2%, and the computation time was 25.4 seconds. For the same dataset, when k = $\sqrt[3]{N}$ = 47 was chosen, the accuracy was 91.5% (a negligible decrease), the error rate was 7.5% (a minor increase), and the computation time was 11.5 seconds (less than half). High accuracy is therefore preserved with a very reasonable computation time, and these results support avoiding too high a k value.
- For Dataset 4 (0.25 million records) and Dataset 5 (0.5 million records), the k value of $\sqrt[3]{N}$ gave high accuracy of 86.9% and 87.8%, respectively. Even though the precision varied only slightly, the recall and F1 score also improved for both datasets (Table 3).
- Results also indicated that if the k value is taken too small, then accuracy decreases drastically. For example, for Dataset 5 (sample size 0.5 million), if the k value is [...] were observed if the cube root of the radicand was used to determine the k value. Thus, the necessity of using k = $\sqrt[4]{N}$ instead of k = $\sqrt[3]{N}$ is reinforced for larger datasets.
Thus, the eKNN algorithm takes a judicious approach to choosing the k value instead of choosing it randomly. The k value was dynamically determined from the sample size of the dataset, using the sample size as the radicand, and this forms a logical mathematical construct instead of an arbitrarily chosen k value.

Results on eKNN with feature selection
In the second phase of experimentation, the eKNN algorithm was augmented with the ACO-based feature selection mechanism, so the number of features used for each subject dataset was reduced. As described in Sect. 3, the features that belong to the route with a high level of pheromone were treated as the selected features. According to Table 2, after applying ACO-based FS for the maximum number of iterations, 5 of the 8 features are selected for Dataset 1, 10 of 15 for Dataset 2, Dataset 6, and Dataset 7, 11 of 18 for Dataset 3 and Dataset 4, and 11 of 19 for Dataset 5. Table 2 lists the names of all the selected features per dataset. For a fair comparison, the eKNN was also tested with C4.5-based FS, and the performance parameters were computed.
The eKNN algorithm was applied to all the reduced datasets, and a significant improvement in accuracy was observed. There was a negligible increase in computation time because of the feature selection step. However, the benefit in terms of accuracy was significant. Table 4 presents the findings. From the results detailed in Table 4, it is evident that:
- In all the datasets, eKNN with FS showed improved accuracy over eKNN without FS. The improvement is in the range of 5% to 7% (see Table 4).
- Among the 7 datasets, in 5 cases eKNN with the ACO-based FS mechanism showed a significant improvement in accuracy over eKNN without the FS mechanism.
- In the case of two datasets (Dataset 4 and Dataset 5), eKNN with the C4.5-based FS mechanism showed the highest accuracy values.
- The computation time of eKNN with the FS mechanism was negligibly higher than that of eKNN without FS. The increase was minor (in the range of 0.3 seconds to 6 seconds only) (Table 4). Only in one exceptional case, for Dataset 6, was the computation time of KNN higher than that of eKNN (possibly because of the high k value chosen randomly).
- In two datasets (Dataset 2 and Dataset 7), eKNN with the ACO-based FS mechanism showed a higher computation time than eKNN with C4.5-based FS. In the rest of the datasets, eKNN with C4.5-based FS took longer to compute.
Thus, the results show that eKNN with ACO-based FS performed better than the other techniques on most of the datasets (5 of 7). The performance parameter comparison is depicted in Figs. 7 and 8 for better visualization.
Comparison with previous studies is also promising. While the mean accuracy of eKNN with the ACO-based FS mechanism was 95.75%, the study conducted by De Souza et al. produced an accuracy of 85% [12] after applying the KNN classifier to the same COVID datasets. It is also noteworthy that CPDS (COVID-19 Patients Detection Strategy, developed in October 2020), which is based on a hybrid feature selection mechanism (HFSM), gave a promising result (93% accuracy), whereas eKNN with the ACO-based feature selection mechanism produced a higher accuracy (95.75%). These results suggest that the proposed eKNN algorithm can be applied to COVID-19 data analysis with high performance.

Conclusion
COVID-19 is a highly contagious disease caused by the newly found coronavirus, and it has jolted our healthcare system. Even though most of the people infected by this disease recover without hospitalization or special medical attention, it has been fatal in many cases. While some people recovered from the disease quickly, some people with an underlying health condition (diabetes, heart condition) passed away. In an effort to navigate this enigma, this paper attempts to build an IoT-cloud-based healthcare model for COVID-19 detection using several datasets.
In this paper, we proposed an IoT-cloud-based healthcare predictive model to detect COVID-19 using eKNN. A novel enhanced KNN classifier (eKNN) is proposed, which chooses the k value using a radical mathematical function instead of choosing it randomly. The eKNN algorithm was evaluated on seven COVID-19 benchmark cloud datasets from different countries (Brazil, Mexico, etc.). This classifier can act as a back-end to an IoT-based front-end COVID screening system, and it can support the processing of large datasets in reasonable time with high accuracy. The experiment showed that the eKNN algorithm with an ACO-based FS mechanism generated the best performance. The proposed eKNN can be beneficial for predicting the outcome of the disease.
In the future, the work can be extended using weighted KNN or feature selection mechanisms other than the ACO-based and C4.5-based ones. The work can also be extended to larger COVID datasets (gathered cumulatively over a wide time frame) in the big data domain using a Hadoop / MapReduce approach. The proposed classifier can support disease detection, helping to fight the disease and forecast possible outcomes.